Online Pandora's Box for Contextual LLM Cascading

arXiv cs.AI 06/08/26, 04:00 AM Papers
Summary
This paper introduces an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs, proposing a learning approach that combines GMM estimation with UCB-style confidence bounds and proving dimension-dependent regret bounds.
arXiv:2606.07392v1 Announce Type: new Abstract: Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.
# 1 Introduction
Source: [https://arxiv.org/html/2606.07392](https://arxiv.org/html/2606.07392)
Alexandre Belloni, The Fuqua School of Business, Duke University, abn5@duke\.edu

Yan Chen, The Fuqua School of Business, Duke University, yc555@duke\.edu

Yehua Wei, The Fuqua School of Business, Duke University, yehua\.wei@duke\.edu

Abstract

Motivated by Large Language Model \(LLM\) cascading, we propose an online contextual Pandora’s Box model for adaptively querying and selecting LLM APIs\. In each period, a decision\-maker observes a request context and faces a two\-phase decision problem\. In the query phase, the decision\-maker sequentially queries APIs, where each query reveals a generated output and the decision\-maker incurs an \(output\-dependent\) cost\. In the selection phase, the decision\-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output\. This output\-mediated feedback structure differs from classical online contextual Pandora’s Box models, in which opening a box directly reveals its reward\.

Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase\. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman’s policy\. Our policy combines generalized method of moments \(GMM\) type estimation of these reservation indices with UCB\-style confidence bounds for both these indices and the shared output\-level reward evaluator\. Under regularity conditions, we prove that the resulting policy achieves dimension\-dependent $\widetilde{O}(\sqrt{T})$ cumulative regret over a horizon of $T$ periods\.

The proliferation of Large Language Models \(LLMs\) has transformed the economics of content generation and decision support\. Firms increasingly use generative AI systems to generate advertisements, produce code, serve customer requests, and conduct complex analytical tasks\. In these applications, the operational challenge is often not whether content can be generated, but how to generate high\-quality content reliably and cost\-effectively at scale\. A firm that produces a high volume of advertisements per day, for example, may have access to a portfolio of LLM APIs that differ in cost and output quality across task types\. More capable proprietary models may, in general, deliver higher\-quality outputs but incur substantially higher costs, while smaller or specialized models may be cheaper but less reliable\. Because model performance varies across request contexts and it is difficult to predict the quality of the outputs a priori, committing to a single API is often suboptimal\. Similarly, querying all of the available APIs is typically prohibitively costly\.

We study this problem from the perspective of an organization that uses external or internal LLM APIs to serve a stream of business requests\. In such applications, requests arrive as distinct decision units without a strict latency requirement, allowing the decision\-maker to query more than one API before selecting a final output\. This request\-level decision problem differs fundamentally from the model hosting problem faced by large LLM platforms, where a key operational challenge is to reduce latency through batching, scheduling, and load management \(e\.g\., Yu et al\. [2022](https://arxiv.org/html/2606.07392#bib.bib119), Kwon et al\. [2023](https://arxiv.org/html/2606.07392#bib.bib120), Agrawal et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib121), Ao et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib117), Jaillet et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib116)\)\. Our focus is instead on the API consumer’s decision problem of dynamically querying a portfolio of APIs to secure high\-quality outputs in a cost\-effective manner\.

A prominent approach for managing this cost\-quality trade\-off is *LLM cascading* \(see, e\.g\., Chen et al\. [2025a](https://arxiv.org/html/2606.07392#bib.bib68), Yue et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib77), Gupta et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib115)\)\. Rather than sending every request directly to a single expensive model, a cascade queries LLM APIs sequentially, typically beginning with lower\-cost models and escalating only when earlier outputs appear insufficiently reliable\. The pioneering work of Chen et al\. \([2025a](https://arxiv.org/html/2606.07392#bib.bib68)\), for example, illustrates this design by routing a query through a sequence of LLM APIs and stopping once a generated response receives a sufficiently high reliability score from an evaluator\. Cascading allows the easy requests to be handled by cheaper APIs, with only the difficult requests escalated to the expensive ones\. Empirical evidence from Chen et al\. \([2025a](https://arxiv.org/html/2606.07392#bib.bib68)\) suggests that such cascades can substantially reduce inference costs while preserving, and in some cases improving, task performance\.

LLM cascading raises a fundamental sequential decision problem\. For each incoming request, the decision\-maker must manage both a query phase and a selection phase\. In the query phase, the decision\-maker must decide which API to query next and whether the outputs generated so far justify continuing the search\. In the selection phase, the decision\-maker must decide which of the generated outputs to deploy\. These decisions are tightly coupled\. Querying another API is costly, but it may reveal a better output; stopping the query phase early saves cost, but risks forgoing a better output; and selecting among queried outputs requires evaluating their downstream value\. Thus, LLM cascading is not simply a model\-routing problem, but a sequential search problem with costly information acquisition\.

A natural starting point for principled analysis of this search problem is the classical Pandora’s Box problem\. Consider an idealized benchmark in which, for each request, the decision\-maker knows the context\-dependent distribution of the output\-cost pair generated by each API and can evaluate the downstream value of any generated output once it is observed\. In this benchmark, querying an API corresponds to opening a box, the inference cost acts as the inspection cost, and the downstream value of the generated output serves as the realized reward\. For this formulation, the seminal work of Weitzman \([1979](https://arxiv.org/html/2606.07392#bib.bib57)\) characterizes the structure of the optimal policy\. Specifically, given the context of a request, the policy assigns each API a reservation index, then queries APIs in decreasing order of these indices, and terminates the search as soon as the reward of an output exceeds the highest remaining index\.
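To make the index policy concrete, the following minimal Python sketch computes a reservation index and runs the query-then-stop rule described above. It assumes the textbook Weitzman setting with a discrete reward distribution and a fixed inspection cost $c$, where the index $\sigma$ solves $\mathbb{E}[(V-\sigma)^{+}]=c$. The paper's own index equation, which handles output-dependent costs, is not reproduced here. The names `reservation_index` and `weitzman_policy` and the toy inputs are illustrative assumptions.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def reservation_index(values: Sequence[float], probs: Sequence[float],
                      cost: float, tol: float = 1e-9) -> float:
    """Solve E[(V - sigma)^+] = cost for sigma by bisection.

    Assumes a discrete reward distribution and a fixed, strictly positive
    inspection cost (classical Weitzman form; an assumption of this sketch).
    """
    def expected_surplus(sigma: float) -> float:
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    # expected_surplus is non-increasing in sigma; it is >= cost at lo and 0 at hi.
    lo, hi = min(values) - cost, max(values)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expected_surplus(mid) > cost:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def weitzman_policy(indices: Sequence[float],
                    open_box: Sequence[Callable[[], float]]
                    ) -> Tuple[List[int], Optional[int], float]:
    """Query boxes in decreasing index order and stop once the best
    observed reward is at least the highest remaining index."""
    order = sorted(range(len(indices)), key=lambda a: indices[a], reverse=True)
    best_reward, best_box = float("-inf"), None
    queried: List[int] = []
    for a in order:
        if best_reward >= indices[a]:
            break  # no remaining box is worth its inspection cost
        reward = open_box[a]()  # reveals the realized reward of box a
        queried.append(a)
        if reward > best_reward:
            best_reward, best_box = reward, a
    return queried, best_box, best_reward


if __name__ == "__main__":
    sigma = reservation_index([0.0, 1.0], [0.5, 0.5], cost=0.1)
    print("reservation index:", round(sigma, 6))
    print(weitzman_policy([sigma, 0.5], [lambda: 0.6, lambda: 0.9]))
```

In the usage example, the first box's realized reward already exceeds the second box's index, so the search stops after a single query. The sketch treats rewards as directly observable on opening a box, which is exactly the feature the paper's output-mediated model relaxes.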

Real applications of LLM cascading, however, depart from this oracle benchmark in ways that make learning central\. The value of querying an API is inherently contextual, and the primitives required by the oracle Weitzman’s policy are rarely known in advance\. In some settings, firms may have substantial historical data on prior human\- or AI\-generated outputs and their downstream rewards, allowing them to construct an accurate offline model of the reward function\. In many others, however, such output\-reward pairs are limited or unavailable\. Moreover, even when rewards can be modeled from offline data, a firm deploying LLM APIs typically does not know the conditional distribution of outputs and costs generated by each API under a given request context\. Since requests arrive sequentially over time, the decision\-maker must learn context\-dependent reservation indices while continuing to make cost\-sensitive query and selection decisions\. This naturally leads to an online contextual Pandora’s Box formulation of LLM cascading\.

Recent work on LLM cascading has developed increasingly sophisticated methods for balancing cost and performance\. These include budget\-constrained cascade policies, uncertainty\- or threshold\-based deferral rules, and hybrid methods that combine cascading with model routing \(Chen et al\. [2025a](https://arxiv.org/html/2606.07392#bib.bib68), Yue et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib77), Gupta et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib115), Zhang et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib74)\)\. While these approaches provide important insights into efficient multi\-LLM deployment, they rely primarily on heuristic designs\. To the best of our knowledge, existing work has not yet formulated LLM cascading as an online contextual Pandora’s Box problem with joint query and selection, nor developed formal regret guarantees\. Conversely, the existing Pandora’s Box literature is not directly tailored to this LLM setting\. A prominent line of work takes distribution\-learning approaches \(Liu et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib54)\), in which opening a box reveals a scalar reward and the learner estimates how the reward distribution varies with context, under parametric structure on the full distribution\. Closer to our work, Atsidakou et al\. \([2024](https://arxiv.org/html/2606.07392#bib.bib20)\) parameterize contextual reservation indices directly, but their model remains a classical Pandora’s Box setting in which opened boxes reveal rewards directly\. In LLM cascading, by contrast, an API call reveals a stochastic output\-cost pair\. In our model, the downstream value depends on the request and the generated output, while API\-specific heterogeneity enters through the distribution of outputs and costs\.

### 1\.1 Our Contributions

We make three main contributions\. First, we introduce an online contextual Pandora’s Box model motivated by LLM cascading\. Unlike existing online Pandora’s Box models, our model features a natural two\-phase decision structure: querying an API reveals a stochastic output and incurs a cost, while the downstream reward is observed only after one generated output is selected and deployed\. This formulation captures a key practical feature of LLM cascading: API\-specific heterogeneity enters through the distributions of generated outputs and costs, while the downstream value of a generated output is evaluated through a shared reward model on the request\-output pair\. That is, an API can influence the downstream value only through the generated output\.

Second, we propose a direct index modeling and learning approach for the query phase\. In the full\-information benchmark, Weitzman’s policy is governed by contextual reservation indices, which are induced by each API’s conditional output and cost distributions together with the reward evaluator\. Rather than estimating these distributions directly, we impose a parametric structure on the reservation index functions themselves as a generalized linear function of observable context features\. This modeling assumption leads to a practical optimism\-based learning approach in the LLM cascading setting, as the decision\-maker can focus on constructing optimistic box\-specific reservation indices from queried output\-cost samples through the moment equations instead of the full conditional output distributions\. The policy also learns the shared output\-level reward evaluator from deployed\-output rewards, and the two learned objects jointly determine the cascade’s querying and selection decisions\.

Third, we provide regret guarantees for the resulting online policy\. The policy combines *generalized method of moments* \(GMM\) estimation for reservation indices with *upper confidence bound* \(UCB\)\-style approaches for both the indices and rewards\. Our policy is motivated by a simple but crucial regret decomposition under optimism: cumulative regret separates into errors from estimating reservation indices for only the *queried APIs* and errors from evaluating generated outputs for only the *selected APIs*\. This decomposition allows us to apply the UCB\-style analysis to our moment estimation problem, yielding a dimension\-dependent $\widetilde{O}(\sqrt{T})$ cumulative regret bound\. The analysis covers the practically important known\-evaluator regime, where offline data provide an accurate output\-level reward model, as well as the full regime where the evaluator and reservation indices must be learned jointly online\.

### 1\.2 Related Literature

Our work is related to the literature on cost\-aware LLM systems, sequential test\-time inference, and the theoretical frameworks of sequential search and online learning\.

First, our paper is closely related to the emerging literature on LLM cascading and cost\-aware multi\-LLM deployment\. This literature studies how to allocate inference requests across multiple LLM APIs when models differ in cost, latency, and output quality\. Representative approaches include FrugalGPT \(Chen et al\. [2025a](https://arxiv.org/html/2606.07392#bib.bib68)\), budget\-constrained cascade policies, uncertainty\- or reliability\-based deferral rules, and hybrid methods that combine cascading with model routing \(Nie et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib71), Chen et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib73), Zhang et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib74), Yue et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib77), Gupta et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib115)\)\. Our work contributes to this literature by formulating LLM cascading as an online contextual Pandora’s Box problem, where the decision\-maker sequentially decides which APIs to query and when to stop under unknown, context\-dependent output and cost distributions\. Our work is also related to LLM routing and ensembling methods, although they address decision problems different from our LLM cascading motivation\. Routing methods usually choose a single LLM for each request before observing any generated output, either through non\-predictive rules or through predicted quality, reward, or utility scores \(Hu et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib69), Shnitzer et al\. [2023](https://arxiv.org/html/2606.07392#bib.bib81), Hari and Thomson [2023](https://arxiv.org/html/2606.07392#bib.bib82), Lu et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib83), Šakota et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib84), Mei et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib72)\)\. Such methods capture one\-shot model choice but not adaptive within\-request search\.
Ensemble methods aggregate or synthesize outputs from multiple LLMs, such as by ranking and fusing candidate responses \(Jiang et al\. [2023](https://arxiv.org/html/2606.07392#bib.bib70), Fang et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib75), Hu et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib76)\); they focus on output aggregation rather than sequential stopping under query costs\.

Second, our paper is related to recent work on sequential testing and sequential stopping for LLM inference\. Huang et al\. \([2026](https://arxiv.org/html/2606.07392#bib.bib123)\) study adaptive self\-consistency for a single LLM, where the system sequentially samples reasoning paths and stops once the most\-likely answer can be identified with high posterior confidence\. Li et al\. \([2026](https://arxiv.org/html/2606.07392#bib.bib124)\) study Bayesian sequential testing with heterogeneous LLMs, where the decision\-maker adaptively chooses which LLM to query and stops once the posterior confidence for a hypothesis crosses a target threshold\. These papers formulate LLM inference as posterior\-driven sequential testing, whereas our LLM cascading model focuses on contextual sequential search over realized output\-cost pairs, with stopping governed by reservation indices and learning focused on both the index functions and the output\-level reward model\.

Third, the query phase component of our model extends the contextual Pandora’s Box problem\. The classic Pandora’s Box problem without contexts originates from the seminal work of Weitzman \([1979](https://arxiv.org/html/2606.07392#bib.bib57)\) and has since been studied under a variety of extensions \(e\.g\., Doval [2018](https://arxiv.org/html/2606.07392#bib.bib85), Chawla et al\. [2020](https://arxiv.org/html/2606.07392#bib.bib86), Fu and Lin [2020](https://arxiv.org/html/2606.07392#bib.bib92), Boodaghians et al\. [2020](https://arxiv.org/html/2606.07392#bib.bib89), Agarwal et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib55), Ezra et al\. [2026](https://arxiv.org/html/2606.07392#bib.bib90)\), including online variants \(Gergatsouli and Tzamos [2022](https://arxiv.org/html/2606.07392#bib.bib91), Atsidakou et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib20)\)\. In particular, Gergatsouli and Tzamos \([2022](https://arxiv.org/html/2606.07392#bib.bib91)\) study an adversarial online Pandora’s Box setting in which box rewards are chosen by an adversary\. Gatmiry et al\. \([2024](https://arxiv.org/html/2606.07392#bib.bib59)\) subsequently show that, in this setting, no algorithm can achieve sublinear regret against the optimal policy, even with full\-information feedback\. For the online contextual Pandora’s Box problem, Liu et al\. \([2025](https://arxiv.org/html/2606.07392#bib.bib54)\) obtain a regret bound of $\widetilde{O}(\sqrt{T})$ under the assumption that the reward distribution of each box is a location shift of a context\-invariant noise distribution, but their results do not apply to output distributions that are heterogeneous across contexts within each box, a key feature for LLM cascading applications\. Atsidakou et al\. \([2024](https://arxiv.org/html/2606.07392#bib.bib20)\) study a contextual Pandora’s Box model under adversarially arriving contexts, where the optimal reservation index is parametrized as a linear function of the observed context\.
In this general setting, they obtain an $\widetilde{O}(T^{5/6})$ regret bound by reducing the learning problem to online linear regression\. Our approach builds on a model similar to that of Atsidakou et al\. \([2024](https://arxiv.org/html/2606.07392#bib.bib20)\) by imposing a parametric reservation index structure, but adopts a different learning approach through a combination of GMM and UCB analysis\. This allows us to propose an optimism\-based learning algorithm with an $\widetilde{O}(\sqrt{T})$ regret bound under suitable regularity assumptions\.

Further, the selection phase of our model draws on the literature on generalized linear contextual bandits\. This line of work originates from the seminal paper of Filippi et al\. \([2010](https://arxiv.org/html/2606.07392#bib.bib15)\), which introduces an optimism\-based algorithm for generalized linear bandits, and has since been extended in various directions \(e\.g\., Jun et al\. [2017](https://arxiv.org/html/2606.07392#bib.bib19), Kveton et al\. [2020](https://arxiv.org/html/2606.07392#bib.bib93), Ding et al\. [2021](https://arxiv.org/html/2606.07392#bib.bib94), Kim et al\. [2023](https://arxiv.org/html/2606.07392#bib.bib95)\)\.

Finally, our learning algorithm combines GMM with a UCB\-type algorithm\. GMM originates from the seminal work of Hansen \([1982](https://arxiv.org/html/2606.07392#bib.bib96)\) and has been widely developed and applied in econometrics \(e\.g\., Newey and McFadden [1994](https://arxiv.org/html/2606.07392#bib.bib11), Arellano and Bond [1991](https://arxiv.org/html/2606.07392#bib.bib97), Chamberlain [1987](https://arxiv.org/html/2606.07392#bib.bib98), Lin and Lee [2010](https://arxiv.org/html/2606.07392#bib.bib100), Cheng et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib99), Andrews and Mikusheva [2022](https://arxiv.org/html/2606.07392#bib.bib101), Hansen and Lee [2021](https://arxiv.org/html/2606.07392#bib.bib102)\)\. UCB\-type algorithms trace back to the seminal paper of Lai and Robbins \([1985](https://arxiv.org/html/2606.07392#bib.bib103)\) and have since been extensively studied in the bandit literature \(e\.g\., Auer et al\. [2002](https://arxiv.org/html/2606.07392#bib.bib104), Rusmevichientong and Tsitsiklis [2010](https://arxiv.org/html/2606.07392#bib.bib17), Filippi et al\. [2010](https://arxiv.org/html/2606.07392#bib.bib15), Fan and Glynn [2025](https://arxiv.org/html/2606.07392#bib.bib106), Garivier and Cappé [2011](https://arxiv.org/html/2606.07392#bib.bib107), Audibert et al\. [2009](https://arxiv.org/html/2606.07392#bib.bib108)\), as well as in broader online learning settings and applications in operations management \(e\.g\., Rakhlin and Sridharan [2013](https://arxiv.org/html/2606.07392#bib.bib109), Gao et al\. [2022](https://arxiv.org/html/2606.07392#bib.bib111), Cheung et al\. [2022](https://arxiv.org/html/2606.07392#bib.bib112), Mao et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib110)\)\.

In addition, our paper belongs to the growing literature on LLM\-assisted decision making across a range of fields, including operations management \(e\.g\., Chen and Chan [2024](https://arxiv.org/html/2606.07392#bib.bib64), Chen et al\. [2025b](https://arxiv.org/html/2606.07392#bib.bib63), Simchi\-Levi et al\. [2026](https://arxiv.org/html/2606.07392#bib.bib65)\), advertising \(e\.g\., Yang et al\. [2023](https://arxiv.org/html/2606.07392#bib.bib66), Reisenbichler et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib67)\), and healthcare \(e\.g\., Thirunavukarasu et al\. [2023](https://arxiv.org/html/2606.07392#bib.bib62), Hager et al\. [2024](https://arxiv.org/html/2606.07392#bib.bib60), Hao et al\. [2025](https://arxiv.org/html/2606.07392#bib.bib61)\), among others\.

### 1\.3 Notations

Given any integer $k\geq 1$ and vector $v\in\mathbb{R}^{k}$, we use $\|v\|_{2}$ to denote the Euclidean norm of $v$, and $\|v\|_{\infty}$ to denote $\max_{i=1,\ldots,k}|v_{i}|$\. For any positive semidefinite matrix $\Phi\in\mathbb{R}^{k\times k}$, we use the notation $\|v\|_{\Phi}=\sqrt{v^{\top}\Phi v}$ to denote the $\Phi$\-weighted Euclidean norm of $v$\. For matrices $A,B\in\mathbb{R}^{d\times d}$, $A\succeq B$ \(resp\., $A\preceq B$\) means that $A-B$ \(resp\., $B-A$\) is positive semidefinite\. For $x,y\in\mathbb{R}$, we write $x\vee y=\max\{x,y\}$\. For any positive semidefinite matrix $M\in\mathbb{R}^{d\times d}$, $\lambda_{\min}(M)$ denotes its minimum eigenvalue\. For any integer $k\geq 1$, let $[k]=\{1,2,\ldots,k\}$\. Given any set $S$, we use $S^{c}$ to denote the complement of $S$\. We write $\omega\sim p(\cdot)$ to indicate that the random variable $\omega$ follows distribution $p(\cdot)$\. We use $N(\theta,\sigma^{2})$ to denote the Gaussian distribution with mean $\theta$ and variance $\sigma^{2}$\. Given any $\mu\in\mathbb{R}^{d}$ and positive definite matrix $\Sigma\in\mathbb{R}^{d\times d}$, we use $\mathcal{N}(\mu,\Sigma)$ to denote the $d$\-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma$\. For any two random variables $X,Y$, $X\perp\!\!\!\!\perp Y$ means $X$ and $Y$ are independent\. For any random event $\mathcal{E}$, we use $\mathbb{I}\{\mathcal{E}\}$ to denote the indicator of $\mathcal{E}$\. The notation “a\.s\.” stands for “almost surely\.” For any vector $x$ or matrix $A$, $x^{\top}$ and $A^{\top}$ denote their transposes\. Finally, for any twice\-differentiable function $g$, $g^{\prime}$ and $g^{\prime\prime}$ denote its first\- and second\-order derivatives\.
We use $\widetilde{O}(\cdot)$ to hide logarithmic factors\. Specifically, for functions $f,g:\mathbb{N}\to\mathbb{R}_{+}$, we write $f(T)=\widetilde{O}(g(T))$ if there exist a constant $C>0$ and a poly\-logarithmic function $\mathrm{polylog}(T)$ such that $f(T)\leq C\,g(T)\,\mathrm{polylog}(T)$\. For random variables $Y_{1}$ and $Y_{2}$, $Y_{1}\lesssim Y_{2}$ denotes that $Y_{1}\leq cY_{2}$ with high probability for some constant $c>0$\. For two positive sequences $f(T)$ and $g(T)$, we write $f(T)\asymp g(T)$ if both $f(T)\lesssim g(T)$ and $g(T)\lesssim f(T)$ hold up to universal constant factors\.

## 2 Model

In this section, we formally introduce our online contextual Pandora’s Box model motivated by LLM cascading\. In our model, there are $A$ different boxes, each of which can be thought of as an API in LLM cascading\. There are $T$ decision periods\. Let $\mathcal{F}_{t-1}$ denote the history available before period $t$, including all past observations and actions\. At the beginning of period $t$, the decision\-maker \(DM\) observes a request context vector $x_{t}\in\mathcal{X}\subset\mathbb{R}^{d_{x}}$, where the contexts $x_{t}$ are independent across $t\in[T]$\. For each box $a\in[A]$, the potential output vector $\omega_{at}$ is random, and the potential outputs are assumed to be independent across boxes and independent of the past history\. Formally, for each $a\in[A]$, $\omega_{at}$ is drawn from a context\-dependent distribution $p_{a}(\cdot\mid x_{t})$, and for any measurable sets $B_{1},\ldots,B_{A}\subseteq\Omega$ in the output space $\Omega$, we have

$$\mathbb{P}\left(\omega_{1t}\in B_{1},\ldots,\omega_{At}\in B_{A}\mid\mathcal{F}_{t-1},x_{t}\right)=\prod_{a=1}^{A}p_{a}(B_{a}\mid x_{t})\qquad\text{a.s.}\tag{1}$$

The output $\omega_{at}$ is revealed only if box $a$ is queried\. In addition, if $a$ is queried, a cost $c_{a}(x_{t},\omega_{at})$ is incurred and observed, where $c_{a}:\mathcal{X}\times\Omega\to(0,1)$ is the cost function associated with box $a$\. Finally, at the end of each period, the DM selects one queried box $a_{t}$ whose generated output $\omega_{a_{t}t}$ will be deployed\. The DM then receives and observes the realized reward associated with the deployed output, $r_{t}=\mu^{*}(x_{t},\omega_{a_{t}t})+\zeta_{t}$, where $\mu^{*}(\cdot,\cdot):\mathcal{X}\times\Omega\rightarrow[0,1]$ is the \(unknown\) reward function and $\zeta_{t}$ is the post\-deployment noise satisfying $\mathbb{E}[\zeta_{t}\mid\mathcal{F}_{t-1},x_{t},\omega_{a_{t}t}]=0$\.

A key feature of our model is that the reward function is shared across boxes\. Conditional on the same request\-output pair $(x,\omega)$, the expected downstream reward is the same regardless of which box generated the output\. Thus, box\-specific heterogeneity only enters through the distributions of outputs and costs, while realized rewards depend solely on the common reward function $\mu^{*}(\cdot,\cdot)$ and noise $\zeta_{t}$\. Consequently, outputs $\omega_{at}$ act as intermediaries linking boxes to rewards, which permits cross\-learning of $\mu^{*}$ using reward observations from all selected boxes\. This reflects our motivating LLM example, where the reward received by the DM depends on the generated output and the request it serves, not directly on the identity of the API that generated it\.
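The feedback structure of a single period can be sketched in Python as follows. This is a toy illustration under stated assumptions rather than the paper's implementation: outputs are scalars, the noise $\zeta_t$ is Gaussian (the model only requires conditional mean zero), and the class name `Period` and its method names are hypothetical.

```python
import math
import random
from typing import Callable, Dict, List, Tuple


class Period:
    """One decision period: outputs stay hidden until queried, and a
    reward is observed only for the output that is deployed."""

    def __init__(self, x: float,
                 samplers: List[Callable[[float, random.Random], float]],
                 cost_fns: List[Callable[[float, float], float]],
                 mu_star: Callable[[float, float], float],
                 noise_sd: float, rng: random.Random) -> None:
        self.x = x
        # Potential outputs are drawn independently across boxes given x.
        self._outputs = [sample(x, rng) for sample in samplers]
        self._cost_fns = cost_fns
        self._mu_star = mu_star  # shared reward function, same for every box
        self._noise_sd = noise_sd
        self._rng = rng
        self.queried: Dict[int, Tuple[float, float]] = {}

    def query(self, a: int) -> Tuple[float, float]:
        """Reveal box a's output and incur its output-dependent cost."""
        if a in self.queried:
            raise ValueError("each box can be queried at most once per period")
        omega = self._outputs[a]
        cost = self._cost_fns[a](self.x, omega)
        self.queried[a] = (omega, cost)
        return omega, cost

    def deploy(self, a: int) -> float:
        """Deploy a queried output and observe its noisy reward."""
        if a not in self.queried:
            raise ValueError("only a queried box can be selected")
        omega, _ = self.queried[a]
        # Gaussian noise is an assumption of this sketch.
        return self._mu_star(self.x, omega) + self._rng.gauss(0.0, self._noise_sd)


if __name__ == "__main__":
    rng = random.Random(0)
    period = Period(
        x=1.0,
        samplers=[lambda x, r: r.random(), lambda x, r: r.random() ** 0.5],
        cost_fns=[lambda x, w: 0.05, lambda x, w: 0.05 + 0.2 * w],
        mu_star=lambda x, w: 1.0 / (1.0 + math.exp(-x * w)),
        noise_sd=0.1,
        rng=rng,
    )
    print(period.query(0))
    print(period.deploy(0))
```

Because `mu_star` takes only the request and the output, the identity of the box never enters the reward directly, which mirrors the cross-learning property described above.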

We preview the parametric structures used for learning, with the formal assumptions and estimators introduced in Sections [4](https://arxiv.org/html/2606.07392#S4) and [5](https://arxiv.org/html/2606.07392#S5)\. Suppose, as an oracle benchmark, the DM knows the conditional output distributions $\{p_{a}(\cdot\mid x)\}_{a\in[A],x\in\mathcal{X}}$, the cost functions $\{c_{a}\}_{a\in[A]}$, and the reward function $\mu^{*}$\. Then, after the context $x_{t}$ is observed, the decision problem in period $t$ reduces to a contextual Pandora’s Box problem of the type studied by Weitzman \([1979](https://arxiv.org/html/2606.07392#bib.bib57)\)\. As we formally show in Section [3](https://arxiv.org/html/2606.07392#S3), the optimal full\-information policy is determined by two objects: the reward function $\mu^{*}$, and the reservation indices, denoted as $\{\sigma_{a}^{*}(x_{t})\}_{a\in[A]}$\. These indices are, in turn, determined by the oracle objects $p_{a}(\cdot\mid x_{t})$, $\mu^{*}$, and $c_{a}$ through the reservation index equation \([2](https://arxiv.org/html/2606.07392#S3.E2)\) in Proposition [3\.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1)\. In the online learning problem, these oracle objects are not known\.
Rather than estimating the full conditional output distributions $p_{a}(\cdot\mid x)$, we impose generalized linear parametric structure directly on the reward function $\mu^{*}$ and on the contextual reservation index functions $\{\sigma_{a}^{*}(\cdot)\}_{a\in[A]}$, namely $\mu^{*}(x,\omega)=G(\theta_{*}^{\top}\phi(x,\omega))$ and $\sigma_{a}^{*}(x)=\Lambda(\rho_{a}^{\top}\psi(x))$, where the coefficients $\theta_{*}$ and $\rho_{a}$ are unknown, $G$ and $\Lambda$ are known \(monotonic\) link functions, and $\phi$ and $\psi$ are known feature maps\. This allows the DM to learn the shared reward and the box\-specific contextual reservation indices through parametric estimation\.
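As a small illustration of this parametrization, the Python sketch below evaluates $G(\theta^{\top}\phi)$ and $\Lambda(\rho_a^{\top}\psi)$ for given coefficient and feature vectors. The logistic link is an assumption made here for concreteness (the text only requires known monotonic links), and the function names and feature values are hypothetical.

```python
import math
from typing import Callable, Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic link (an assumed choice of G and Lambda)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    if len(u) != len(v):
        raise ValueError("dimension mismatch")
    return sum(ui * vi for ui, vi in zip(u, v))


def reward_model(theta: Sequence[float], phi_x_omega: Sequence[float],
                 link: Callable[[float], float] = sigmoid) -> float:
    """mu(x, omega) = G(theta^T phi(x, omega)); shared across all boxes."""
    return link(dot(theta, phi_x_omega))


def reservation_index_model(rho_a: Sequence[float], psi_x: Sequence[float],
                            link: Callable[[float], float] = sigmoid) -> float:
    """sigma_a(x) = Lambda(rho_a^T psi(x)); one coefficient vector per box."""
    return link(dot(rho_a, psi_x))


if __name__ == "__main__":
    psi_x = [1.0, 0.3]                      # hypothetical context features
    rho = {0: [0.2, 1.0], 1: [-0.5, 2.0]}   # hypothetical per-box coefficients
    indices = {a: reservation_index_model(r, psi_x) for a, r in rho.items()}
    print(indices)
    print(reward_model([0.4, -0.1], [1.0, 2.0]))
```

The design point is that each box needs only one coefficient vector `rho_a`, while a single `theta` serves every box, so the number of parameters to learn grows with the feature dimensions rather than with the complexity of the output distributions.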

### 2.1 DM’s Problem and Regret

Next, we describe the decision-making pipeline in each period $t\in[T]$ in more detail. Each period consists of two phases, which we call the *query phase* and the *selection phase*.

During the *query phase*, the DM observes $x_t$ and sequentially queries boxes. A policy is non-anticipating: before each query, the DM may use $\mathcal{F}_{t-1}$, the current context $x_t$, and the output-cost pairs of the boxes already observed in period $t$, but not the output or cost of any unqueried box. Let $\mathcal{A}_t\subseteq[A]$ denote the random set of boxes queried in period $t$. We assume each box can be queried at most once within a period. After each query, the DM observes the generated output and cost, and then decides whether to continue querying or stop. This non-anticipativity condition implies a useful sampling property. For any $a\in[A]$, conditional on $(\mathcal{F}_{t-1},x_t)$, the event $\{a\in\mathcal{A}_t\}$ is determined before observing box $a$’s own potential output-cost pair. Hence, whenever $\mathbb{P}(a\in\mathcal{A}_t\mid\mathcal{F}_{t-1},x_t)>0$,

$$(\omega_{at},c_{at})\,\big|\,\mathcal{F}_{t-1},x_t,\{a\in\mathcal{A}_t\}\;\overset{d}{=}\;(\omega_{at},c_{at})\,\big|\,\mathcal{F}_{t-1},x_t.$$

This property will be used later to justify learning reservation indices from adaptively queried samples. After the query phase concludes, the DM enters the *selection phase*, selecting one queried box $a_t\in\mathcal{A}_t$ whose generated output $\omega_{a_t t}$ will be deployed.

Let $\Pi_t$ denote the class of admissible non-anticipating policies in period $t$. For a policy $\pi_t\in\Pi_t$, let $\mathcal{A}_t(\pi_t)$ be the set of queried boxes and $a_t(\pi_t)\in\mathcal{A}_t(\pi_t)$ be the selected box. The conditional expected utility of policy $\pi_t$ in period $t$ is

$$U_t(\pi_t;\mathcal{F}_{t-1},x_t):=\mathbb{E}\!\left[\mu^*(x_t,\omega_{a_t(\pi_t)t})-\sum_{a\in\mathcal{A}_t(\pi_t)}c_{at}\,\middle|\,\mathcal{F}_{t-1},x_t\right].$$

We note that the post-deployment noise $\zeta_t$ does not enter the utility because it is conditionally mean-zero. It matters only as feedback for learning the reward function $\mu^*(\cdot,\cdot)$.

We compare the DM against the full-information benchmark that knows the distribution $p_a(\cdot\mid x_t)$ and the functions $\mu^*(x_t,\cdot)$ and $c_a(x_t,\cdot)$ for every box $a$ and every context $x_t$. At each period $t$, after observing $x_t$, the optimal policy $\pi_t^*\in\arg\max_{\pi_t\in\Pi_t}U_t(\pi_t;\mathcal{F}_{t-1},x_t)$ reduces to the solution of the Pandora’s box problem given context $x_t$, which we formally characterize in Section [3](https://arxiv.org/html/2606.07392#S3).

Define the conditional expected utility of $\pi_t^*$ as

$$U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)=\mathbb{E}\bigg[\max_{a\in\mathcal{A}_t^*}\mu^*(x_t,\omega_{at})-\sum_{a\in\mathcal{A}_t^*}c_{at}\,\bigg|\,\mathcal{F}_{t-1},x_t\bigg],$$

where $\mathcal{A}_t^*$ denotes the set of boxes queried by $\pi_t^*$. Then, for any admissible policy $\bm{\pi}=\{\pi_t\}_{t\in[T]}$, the cumulative regret is defined as

$$R_T(\bm{\pi}):=\sum_{t=1}^{T}\mathbb{E}\!\left[U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)-U_t(\pi_t;\mathcal{F}_{t-1},x_t)\right].$$

Readers familiar with the bandit literature might ask whether the problem can be modeled as a contextual bandit by treating each API as an arm. Such a formulation would lead to a one-shot API-as-arm policy, in which the decision-maker selects a single API for each request before observing any generated output. It does not, however, capture the adaptive information-acquisition structure of LLM cascading. The following example illustrates that such a restriction can create a constant per-period gap relative to the sequential-search benchmark, even when all primitives are known.

Consider an example with two boxes in which the context $x_t$ is fixed throughout. Querying box 1 costs $c_1=0.5$ and always generates output $\omega_{1t}=1$. Querying box 2 costs $c_2=0.01$ and generates output $\omega_{2t}=1$ with probability $0.1$, and $\omega_{2t}=0$ otherwise. The downstream value is $\mu^*(x_t,\omega)=\omega$. A one-shot API-as-arm policy obtains expected utility $1-0.5=0.5$ from box 1 and $0.1-0.01=0.09$ from box 2, so the best one-shot policy selects box 1. Now consider a cascade that first queries box 2. If $\omega_{2t}=1$, it stops and selects box 2; if $\omega_{2t}=0$, it queries box 1 and selects box 1. Its expected utility is $-0.01+0.1+0.9(0.5)=0.54>0.5$. Thus, relative to the cascade benchmark, any one-shot API-as-arm formulation suffers a constant per-period optimality gap of at least $0.04$ in this instance, and hence a loss that grows linearly with $T$.
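The arithmetic in this example can be checked with a few lines of Python. This is only a sketch of the two-box instance above; the names `COST`, `P_BOX2_GOOD`, `one_shot_utility`, and `cascade_utility` are illustrative and do not come from the paper.

```python
# Two-box instance from the example; the value function is mu*(x, omega) = omega.
COST = {1: 0.5, 2: 0.01}   # c_1 and c_2
P_BOX2_GOOD = 0.1          # P(omega_2 = 1); box 1 always outputs 1


def one_shot_utility(box: int) -> float:
    """Expected utility of querying and deploying a single box."""
    expected_reward = 1.0 if box == 1 else P_BOX2_GOOD
    return expected_reward - COST[box]


def cascade_utility() -> float:
    """Query box 2 first; fall back to box 1 only if box 2 outputs 0."""
    stop_branch = P_BOX2_GOOD * 1.0                           # deploy box 2's output
    fallback_branch = (1.0 - P_BOX2_GOOD) * (1.0 - COST[1])   # pay c_1, deploy box 1
    return -COST[2] + stop_branch + fallback_branch


best_one_shot = max(one_shot_utility(1), one_shot_utility(2))
gap = cascade_utility() - best_one_shot
print(round(best_one_shot, 4), round(cascade_utility(), 4), round(gap, 4))
```

Since the cascade is itself an admissible sequential policy, the computed gap is a lower bound on the per-period loss of any one-shot policy relative to the oracle.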

## 3 Policy and Algorithm

This section develops the online policy and the algorithm. We begin with the full-information benchmark. Fix a period $t$ and condition on the arriving context $x_t$. Suppose the DM knows the reward function $\mu^*(\cdot,\cdot)$ and the output distributions $\{p_a(\cdot\mid x)\}_{a\in[A],x\in\mathcal{X}}$. The following proposition characterizes the optimal oracle policy during period $t$.

###### Proposition 3.1 (Oracle reservation index)

For each box $a\in[A]$ and any fixed $t\in[T]$, let $\sigma_a^*(\cdot)$ be the index function such that

$$\mathbb{E}\left[\left\{\mu^*(x_t,\omega_{at})-\sigma_a^*(x_t)\right\}^{+}\,\middle|\,x_t\right]=\mathbb{E}[c_{at}\mid x_t],\quad\forall x_t.\tag{2}$$

Then there exists an optimal policy $\pi_t^*$ with the following structure: at any period $t$ with context $x_t$, it orders the boxes $\{(1),(2),\ldots,(A)\}$ such that $\sigma^*_{(1)}(x_t)\geq\cdots\geq\sigma^*_{(A)}(x_t)$, sets $\sigma^*_{(A+1)}(x_t):=-\infty$, queries boxes in this order, stops at the $k$-th query if $\max_{1\leq j\leq k}\mu^*(x_t,\omega_{(j)t})\geq\sigma^*_{(k+1)}(x_t)$, and selects $a_t^*\in\arg\max_{a\in\mathcal{A}_t^*}\mu^*(x_t,\omega_{at})$, where $\mathcal{A}_t^*$ is the set of boxes queried by the oracle. Moreover, the oracle value satisfies

$$U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)=\mathbb{E}\left[\max_{a\in[A]}\min\left\{\mu^*(x_t,\omega_{at}),\sigma_a^*(x_t)\right\}\,\middle|\,x_t,\mathcal{F}_{t-1}\right].\tag{3}$$

Define $\sigma_{at}^*:=\sigma_a^*(x_t)$. When the costs $\{c_{at}\}$ are known constants (i.e., $c_{at}\equiv c_a$ for each $a\in[A]$), the indices $\{\sigma_{at}^*\}_{a\in[A]}$ are optimal by the seminal work of Weitzman ([1979](https://arxiv.org/html/2606.07392#bib.bib57)). When costs are stochastic (as in our motivating setting), the result above follows from a minor adaptation of the proof in (Kleinberg et al. [2016](https://arxiv.org/html/2606.07392#bib.bib1)) (see Appendix [7](https://arxiv.org/html/2606.07392#S7)).
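To make equation (2) and the value formula (3) concrete, the following sketch solves for the reservation index of a box with a finite-support value distribution by bisection, and then evaluates the oracle value by enumeration. It assumes constant costs, independent boxes, and indices in $[-1,1]$; the helper names `reservation_index` and `oracle_value` are hypothetical and not taken from the paper. The data reuse the two-box example of Section 2.1.

```python
from itertools import product


def reservation_index(values, probs, cost, lo=-1.0, hi=1.0, iters=60):
    """Solve E[(mu - sigma)^+] = cost for sigma by bisection on [lo, hi]."""
    def excess(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs)) - cost

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # excess is non-increasing in sigma, so keep the sign change inside [lo, hi]
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# box id -> (support of mu*, probabilities, constant cost)
boxes = {
    1: ([1.0], [1.0], 0.5),
    2: ([1.0, 0.0], [0.1, 0.9], 0.01),
}
sigma = {a: reservation_index(*spec) for a, spec in boxes.items()}


def oracle_value(boxes, sigma):
    """Evaluate E[max_a min(mu_a, sigma_a)] by enumerating joint outcomes."""
    ids = list(boxes)
    supports = [list(zip(boxes[a][0], boxes[a][1])) for a in ids]
    total = 0.0
    for outcome in product(*supports):
        prob, best = 1.0, float("-inf")
        for a, (value, p) in zip(ids, outcome):
            prob *= p                                # boxes assumed independent
            best = max(best, min(value, sigma[a]))
        total += prob * best
    return total


print(sigma, oracle_value(boxes, sigma))
```

In this instance box 2 receives the larger index and is therefore queried first, so the oracle policy coincides with the cascade described in the earlier example.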

Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1) formally identifies the reward evaluator $\mu^*(x,\omega)$ and the reservation index function $\sigma_a^*(x)$ as the two objects required to make the optimal sequential decisions. We next describe the online policy as Algorithm [1](https://arxiv.org/html/2606.07392#alg1). At a high level, the policy follows the oracle structure in Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1), but replaces the unknown reward evaluator and reservation indices with optimistic estimates. We call the resulting policy COst-aware Sequential MOdel query and Selection, or COSMOS.

### 3.1 Optimism and Regret Decomposition

Our COSMOS policy constructs optimistic estimates $\widetilde{\mu}_t(x,\omega)$ and $\widetilde{\sigma}_{at}$ for all $a\in[A]$ and $t\in[T]$. The estimate $\widetilde{\mu}_t(x,\omega)$ is an optimistic estimate of $\mu^*(x,\omega)$ for a realized output. The estimate $\widetilde{\sigma}_{at}$ is an optimistic estimate of the oracle reservation index $\sigma_a^*(x_t)$ for box $a$ at the current context. The formal construction of these estimates is postponed to Sections [4](https://arxiv.org/html/2606.07392#S4) and [5](https://arxiv.org/html/2606.07392#S5), and here we focus only on their roles in the policy.

To understand why optimism is crucial in our COSMOS policy, we next present a theorem showing that, under optimism, the one-period regret decomposes into the reward estimation error for the selected output and the index estimation errors for the queried boxes.

###### Theorem 3.3 (Regret decomposition under optimism)

Fix any period $t$, and let

$$\Delta_t(\widetilde{\pi}):=U_t(\pi_t^*;x_t)-U_t(\widetilde{\pi}_t;x_t)$$

denote the period-$t$ conditional regret of the policy $\widetilde{\pi}$ induced by Algorithm [1](https://arxiv.org/html/2606.07392#alg1). Suppose that, for all $a\in[A]$, $\widetilde{\mu}_t(x_t,\omega_{at})\geq\mu^*(x_t,\omega_{at})$ and $\widetilde{\sigma}_{at}\geq\sigma^*_{at}$. Let $a_t\in\arg\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$ be the output selected by Algorithm [1](https://arxiv.org/html/2606.07392#alg1). Then,

$$\Delta_t(\widetilde{\pi})\leq\mathbb{E}\left[\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu^*(x_t,\omega_{a_t t})\mid\mathcal{F}_{t-1},x_t\right]+\mathbb{E}\left[\sum_{a\in\mathcal{A}_t}\left(\widetilde{\sigma}_{at}-\sigma^*_{at}\right)\mid\mathcal{F}_{t-1},x_t\right].$$

We provide intuition on why Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) holds. Consider an auxiliary “optimistic” problem in which both the reward estimate $\widetilde{\mu}_t$ and the index $\widetilde{\sigma}_{at}$ upper-bound the truth. Then COSMOS is the optimal policy for the “optimistic” problem, and its corresponding optimistic expected value is thus at least as large as the optimal expected value of the true problem. As a result, the loss can be bounded by the discrepancy between the optimistic quantities used by COSMOS and the true quantities realized along the same path. This yields exactly two errors: a reward estimation error for the selected output, and an index estimation error for every queried box.

The decomposition in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) allows us to leverage the principle of optimism to control our regret. More specifically, if a box is queried frequently, its query dataset grows and its index uncertainty decreases. If a box is queried rarely, then it contributes only rarely to the second term in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). This mirrors the classic self-correcting principle behind UCB analysis, adapted here to govern reservation indices and reward rather than the direct feedback in bandits.

Consequently, Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) motivates us to construct high-probability confidence bounds for both the reward and the reservation indices, and to use them to define the optimistic estimators $\widetilde{\mu}_t(x_t,\omega_{at})$ and $\widetilde{\sigma}_{at}$. The goal is to ensure that, with high probability, these estimators upper bound their population counterparts while remaining sufficiently accurate. Together, optimism and error control provide the key ingredients for deriving the final regret guarantee of the COSMOS policy, whose high-level structure is summarized as Algorithm [1](https://arxiv.org/html/2606.07392#alg1). The full implementation, including the computation of optimistic reservation indices and reward estimates, is deferred to Algorithm [2](https://arxiv.org/html/2606.07392#alg2) in the Appendix.

Initialization. Observe an initial context $x_0$. Query each box $a\in[A]$ once and record the observed output-cost pairs $\{(\omega_{a0},c_{a0})\}_{a\in[A]}$.

for $t=1,2,\ldots,T$ do

- Observe context $x_t$. Initialize $\mathcal{A}_t=\emptyset$ and $M_t=-\infty$;
- Construct optimistic reservation index estimates $\{\widetilde{\sigma}_{at}\}_{a\in[A]}$;
- Order the boxes so that $\widetilde{\sigma}_{(1)t}\geq\widetilde{\sigma}_{(2)t}\geq\cdots\geq\widetilde{\sigma}_{(A)t}$. Set $\widetilde{\sigma}_{(A+1)t}:=-\infty$;
- for $k=1,2,\ldots,A$ do
  - Query box $(k)$ and observe its output and cost $(\omega_{(k)t},c_{(k)t})$;
  - Add $(k)$ to the queried set: $\mathcal{A}_t\leftarrow\mathcal{A}_t\cup\{(k)\}$;
  - Compute the optimistic reward estimate $\widetilde{\mu}_t(x_t,\omega_{(k)t})$;
  - Update $M_t\leftarrow\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$;
  - if $M_t\geq\widetilde{\sigma}_{(k+1)t}$, stop querying and break;
- end for
- Select and deploy $a_t\in\arg\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$, and observe reward $r_t$;

end for

ALGORITHM 1 COSMOS Policy Framework
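The inner loop of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration of one period, not the authors' implementation: the optimistic indices are passed in as a dictionary, and the callables `query` (returning an output-cost pair) and `mu_tilde` (the optimistic reward estimate at the current context) are hypothetical stand-ins for the estimators constructed in Sections 4 and 5.

```python
from typing import Callable, Dict, Hashable, List, Tuple


def cosmos_period(
    sigma_tilde: Dict[Hashable, float],
    query: Callable[[Hashable], Tuple[object, float]],
    mu_tilde: Callable[[object], float],
) -> Tuple[Hashable, List[Hashable], float]:
    """Run the query and selection phases of one period.

    Returns the selected box, the queried boxes in query order, and the total cost.
    """
    # Order boxes by decreasing optimistic index; the (A+1)-th index is -infinity.
    order = sorted(sigma_tilde, key=sigma_tilde.get, reverse=True)
    next_indices = [sigma_tilde[a] for a in order[1:]] + [float("-inf")]

    scores: Dict[Hashable, float] = {}   # optimistic reward of each queried output
    total_cost = 0.0
    for box, next_index in zip(order, next_indices):
        output, cost = query(box)
        total_cost += cost
        scores[box] = mu_tilde(output)
        # Stop once the best optimistic reward so far beats the next box's index.
        if max(scores.values()) >= next_index:
            break

    selected = max(scores, key=scores.get)
    return selected, list(scores), total_cost
```

Because the last threshold is $-\infty$, the loop always stops after at most $A$ queries, mirroring the convention $\widetilde{\sigma}_{(A+1)t}:=-\infty$ in the algorithm.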

## 4 Results under Known Reward Function

To build intuition, we first study the case in which the reward function $\mu^*(x,\omega)$ is known. This setting is practically relevant when the reward model can be estimated offline using a substantially larger historical dataset. If the offline data are sufficiently rich, the resulting estimate of the reward function $\mu^*(\cdot,\cdot)$ may be accurate enough, relative to the $\sqrt{T}$ scale of the online regret analysis, to be treated as fixed in the subsequent online decision problem. Importantly, even with a known reward function, the conditional distribution of the output $\omega_{at}$ given $x_t$ typically remains unknown.

This case therefore captures regimes in which the main online learning challenge is not reward estimation, but learning the reservation indices and managing exploration across boxes. Indeed, once box $a$ is opened in period $t$, the decision-maker observes its output $\omega_{at}$ and cost $c_{at}$, and can therefore compute the conditional expected reward $\mu^*(x_t,\omega_{at})$. Thus, in Algorithm [1](https://arxiv.org/html/2606.07392#alg1), we would set $\widetilde{\mu}_t(x_t,\omega_{at})=\mu^*(x_t,\omega_{at})$. The selection rule then chooses the opened box with the largest true conditional expected reward, and the regret component arising from reward estimation disappears. The remaining learning problem is then focused solely on estimating $\sigma_a^*(x_t)$ for each $a$.

This is analogous to the bandit-feedback setting in Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)), in that feedback is obtained only from opened boxes. (Footnote 1: The authors also study a full-information setting where the DM observes the rewards of all boxes at the end of each round.) Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) develop a black-box reduction from contextual Pandora’s Box to online regression, using a specially designed linear-quadratic loss to learn all the reservation indices. In contrast, our approach follows the principle of optimism from the UCB literature. As we illustrate, under suitable regularity conditions, it is not necessary to learn all boxes’ index functions uniformly well. Instead, it is enough to maintain optimistic index estimates and refine the estimates of boxes that remain relevant to the query process.

We next formalize this intuition. Recall from Section [2](https://arxiv.org/html/2606.07392#S2) that we impose the parametric structure on the indices

$$\sigma_a^*(x)=\Lambda(\rho_a^\top\psi(x)),\qquad\forall x\in\mathcal{X},\tag{5}$$

where $\Lambda$ is a known monotonic link function, and $\psi(\cdot)$ is a known feature map. We impose the following regularity condition on the parametric reservation index function ([5](https://arxiv.org/html/2606.07392#S4.E5)):

###### Assumption (Regularity conditions for reservation indices)

The link function $\Lambda:\mathbb{R}\to[-1,1]$ is continuously differentiable, strictly increasing, and $L$-Lipschitz, with $\lim_{s\to-\infty}\Lambda(s)=-1$ and $\lim_{s\to+\infty}\Lambda(s)=1$. Moreover, $\inf_{\rho\in\mathcal{B},\;x\in\mathcal{X}}\Lambda'(\rho^\top\psi(x))\geq\mu_1>0$ and $\|\psi(x)\|_2\leq\bar{C}_\psi$ for all $x\in\mathcal{X}$. The parameter space $\mathcal{B}\subset\mathbb{R}^m$ is convex and compact, with diameter $d_{\mathcal{B}}$: $\sup_{\rho,\rho'\in\mathcal{B}}\|\rho-\rho'\|_2\leq d_{\mathcal{B}}$.

The range restriction $\Lambda:\mathbb{R}\to[-1,1]$ is consistent with the fact that $c_{at}\in(0,1)$ and $\mu^*\in[0,1]$, which imply $\sigma_{at}^*\in[-1,1]$. In addition, Assumption [5](https://arxiv.org/html/2606.07392#S4.E5) imposes only mild regularity conditions. It accommodates the linear reservation index specification considered by Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)), which corresponds to taking $\Lambda$ as the identity map on the relevant, strictly bounded domain of $\rho^\top\psi(x)$, while allowing it to smoothly flatten to $\pm 1$ as $s\to\pm\infty$.

Assumption [5](https://arxiv.org/html/2606.07392#S4.E5) yields the point identification of $\rho_a^\top\psi(x_t)$:

###### Proposition 4.1

Under Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), $\rho_a^\top\psi(x_t)$ is point identified through ([4](https://arxiv.org/html/2606.07392#S3.E4)) for each $a\in[A]$.

At the beginning of period $t$, Algorithm [1](https://arxiv.org/html/2606.07392#alg1) estimates $\rho_a$ by $\hat{\rho}_{at}$ and constructs an upper confidence bound for the linear index $\rho_a^\top\psi(x_t)$. Let $\mathcal{R}_{at}(x_t)$ denote the corresponding confidence radius, and define $\widetilde{\sigma}_{at}=\Lambda\!\left(\hat{\rho}_{at}^\top\psi(x_t)+\mathcal{R}_{at}(x_t)\right)$. Let $\mathcal{E}_\rho$ denote the high-probability event on which

$$\left|\hat{\rho}_{at}^\top\psi(x_t)-\rho_a^\top\psi(x_t)\right|\leq\mathcal{R}_{at}(x_t),\qquad\forall a\in[A],\;t\in[T].$$

Then on $\mathcal{E}_\rho$, by the monotonicity of $\Lambda$,

$$\widetilde{\sigma}_{at}\geq\Lambda(\rho_a^\top\psi(x_t))=\sigma_{at}^*,\qquad\forall a\in[A],\;t\in[T],$$

allowing us to invoke the regret decomposition of Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). Recall that $\widetilde{\mu}_t=\mu^*$ in the known-$\mu^*$ case, implying the reward estimation term in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) is zero. Therefore, under the event $\mathcal{E}_\rho$, Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) implies that the period-$t$ regret satisfies

$$\Delta_t(\widetilde{\pi})\leq\mathbb{E}\left[\sum_{a\in\mathcal{A}_t}\left(\widetilde{\sigma}_{at}-\sigma_{at}^*\right)\,\bigg|\,x_t,\mathcal{F}_{t-1}\right],\tag{6}$$

where $\mathcal{A}_t$ is the random set of boxes opened by Algorithm [1](https://arxiv.org/html/2606.07392#alg1) in period $t$. Inequality ([6](https://arxiv.org/html/2606.07392#S4.E6)) illustrates that the regret depends only on the index estimation errors of boxes that are actually opened. Thus, frequently opened boxes yield larger sample sizes that shrink their confidence radii, while rarely opened boxes contribute negligibly to the overall regret. Once we obtain a valid high-probability confidence radius for the linear index $\rho_a^\top\psi(x_t)$, the cumulative regret can be controlled by the standard elliptical-potential argument used for contextual bandits (e.g., Abbasi-Yadkori et al. [2011](https://arxiv.org/html/2606.07392#bib.bib56)). However, unlike in standard contextual bandits, an opened box does not produce a noisy observation of its reservation index $\sigma_{at}^*$; it instead produces a tuple $(x_t,\omega_{at},c_{at})$. Therefore, the index parameter must be learned indirectly from the defining equation ([4](https://arxiv.org/html/2606.07392#S3.E4)) of the reservation index.
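The optimism step itself is short enough to sketch. The snippet below assumes, purely for illustration, that the link is $\Lambda=\tanh$ (one function satisfying the range and monotonicity requirements above) and that the confidence radius is supplied as an input; the radius used by the paper comes from its later confidence-bound results and is not reproduced here. The function name `optimistic_index` is hypothetical.

```python
import math


def optimistic_index(rho_hat, psi_x, radius, link=math.tanh):
    """Compute sigma_tilde = Lambda(rho_hat^T psi(x) + R) for one box.

    `link` plays the role of Lambda; tanh is an illustrative choice only.
    """
    linear_index = sum(r * f for r, f in zip(rho_hat, psi_x))
    return link(linear_index + radius)


# Illustrative numbers: the estimate misses the true linear index by 0.15 <= radius.
rho_true, rho_hat, psi_x, radius = [0.3, -0.2], [0.2, -0.3], [1.0, 0.5], 0.2
sigma_star = math.tanh(sum(r * f for r, f in zip(rho_true, psi_x)))
sigma_tilde = optimistic_index(rho_hat, psi_x, radius)
print(sigma_tilde >= sigma_star)
```

Whenever the estimation error of the linear index is within the radius, monotonicity of the link guarantees that the optimistic index dominates the true one, which is the condition required by the regret decomposition.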

### 4.1 Index Estimation and Loss Function

We next describe how to estimate the box-specific reservation index function $\sigma_a^*(\cdot)$. Given ([4](https://arxiv.org/html/2606.07392#S3.E4)) and ([5](https://arxiv.org/html/2606.07392#S4.E5)), our estimation problem is moment-based and is thus inspired by the generalized method of moments (GMM). For this purpose, we define the known-$\mu^*$ moment function, for any $x\in\mathcal{X}$, $\omega\in\Omega$, $c\in[0,1]$, and $\rho\in\mathcal{B}$, as

$$m^*(x,\omega,c;\rho):=c-\left(\mu^*(x,\omega)-\Lambda(\rho^\top\psi(x))\right)^{+}.\tag{7}$$

By Equation ([4](https://arxiv.org/html/2606.07392#S3.E4)), we have $\mathbb{E}\left[m^*(x_t,\omega_{at},c_{at};\rho_a)\,\big|\,\mathcal{F}_{t-1},x_t,\ a\in\mathcal{A}_t\right]=0$. Next, let

$$\mathcal{S}_{at}:=\{s<t:\ a\in\mathcal{A}_s\},\qquad n_{at}:=|\mathcal{S}_{at}|$$

denote the set and number of past periods in which box $a$ was queried before period $t$, and let the elements of $\mathcal{S}_{at}$ be ordered as $\tau_{a,1}<\tau_{a,2}<\cdots<\tau_{a,n_{at}}$.

As we discuss below, our estimation problem is a special case of the GMM framework with conditional moment conditions. By setting the moment conditions based on the features $\psi(x_s)$, we can formulate it as an $M$-estimation problem that is more suitable for our online learning setting. A standard GMM approach would instead form the empirical vector moment

$$\bar{g}_{at}(\rho):=\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\psi(x_s)\,m^*(x_s,\omega_{as},c_{as};\rho)$$

and minimize a quadratic criterion such as $\bar{g}_{at}(\rho)^\top W_{at}\bar{g}_{at}(\rho)$ for some positive definite weighting matrix $W_{at}$ at each period $t$. However, this approach focuses on estimating the coefficients $\rho_a$, which requires a full-rank condition on the local GMM curvature. Such a requirement can be overly stringent in our setting, as it requires that the observations in which box $a$ is queried contain sufficiently many “active” samples, namely those satisfying $\mu^*(x_t,\omega_{at})>\Lambda(\rho_a^\top\psi(x_t))$, and that the corresponding feature vectors $\psi(x_t)$ span the parameter space in a well-conditioned way conditional on $a\in\mathcal{A}_t$.

Our objective, however, is not to recover the entire vector $\rho_a$. For the UCB analysis, it is enough to construct a confidence interval for the scalar quantity $\rho_a^\top\psi(x_t)$ at the current context. We therefore adopt an $M$-estimation formulation, whose first-order condition recovers the sample moment.

For a queried sample of box $a$ in period $s$, define

$$\ell_{as}(\rho):=\int_{0}^{\rho^\top\psi(x_s)}\left[c_{as}-\left(\mu^*(x_s,\omega_{as})-\Lambda(u)\right)^{+}\right]du.\tag{8}$$

Note that $\ell_{as}(\rho)$ is convex in $\rho$: because $\Lambda(\cdot)$ is strictly increasing, the integrand is nondecreasing in $u$. Also, the derivative of $\ell_{as}$ satisfies

$$\nabla_\rho\ell_{as}(\rho)=\psi(x_s)\left[c_{as}-\left(\mu^*(x_s,\omega_{as})-\Lambda(\rho^\top\psi(x_s))\right)^{+}\right]=\psi(x_s)\,m^*(x_s,\omega_{as},c_{as};\rho).\tag{9}$$

Therefore, the first-order condition of the empirical loss minimization is equivalent to the sample analogue of the vector moment equation. The known-$\mu^*$ estimator of $\rho_a$ for period $t$ is

$$\hat{\rho}_{at}\in\arg\min_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho).\tag{10}$$

We next state the regularity condition that yields curvature of the population loss. The condition requires that $\rho^\top\psi(x)$ remains in a bounded region and that, for every point in this region, there is some probability mass on outputs whose reward exceeds the corresponding reservation index.

###### Assumption (Local mass around reservation thresholds)

There exist constants $\bar{\iota}>0$ and $\kappa>0$ such that, for all $x\in\mathcal{X}$ and $\rho\in\mathcal{B}$, $|\psi(x)^\top\rho|\leq\bar{\iota}$, and for all $a\in[A]$, $x\in\mathcal{X}$, $|u|\leq\bar{\iota}$, we have $\mathbb{P}_{\omega\sim p_a(\cdot\mid x)}\left(\mu^*(x,\omega)>\Lambda(u)\right)\geq\kappa$.

Assumption [4.1](https://arxiv.org/html/2606.07392#S4.SS1) rules out the cases in which the reservation index lies in a region with no probability mass above it. Together with the lower derivative bound $\Lambda'(\cdot)\geq\mu_1$ from Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), it ensures that the reservation index moment crosses zero with slope bounded away from zero. This is the one-dimensional source of curvature for the loss function.

###### Lemma 4.2 (Population curvature of the loss function)

Under Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1), for any $t\in[T]$, $a\in[A]$ and $\rho\in\mathcal{B}$,

$$\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid x_t,\mathcal{F}_{t-1}]\geq\frac{\kappa\mu_1}{2}\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^{2}.\tag{11}$$

The curvature bound in Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) illustrates the advantage of using the $M$-estimation formulation. It shows that the population loss increases quadratically in the prediction error $\psi(x_s)^\top(\rho-\rho_a)$. As a preview of the subsequent analysis, we will compare this quadratic curvature against the empirical fluctuation of the additive loss $\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho)$. By controlling this fluctuation uniformly over $\rho\in\mathcal{B}$ via martingale concentration, we can absorb it into the curvature term. This procedure yields a high-probability bound on $|(\hat{\rho}_{at}-\rho_a)^\top\psi(x_t)|$, providing the confidence radius needed to construct the optimistic index $\widetilde{\sigma}_{at}$ in ([6](https://arxiv.org/html/2606.07392#S4.E6)).
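The relationship between the loss (8), its gradient (9), and the moment function (7) can be checked numerically. The sketch below assumes $\Lambda=\tanh$ as an illustrative link and integrates (8) with a hand-written trapezoid rule; the function names `moment`, `loss`, and `grad` are hypothetical, and the sample values are made up for the check.

```python
import numpy as np


def moment(mu, c, z, link=np.tanh):
    """Moment function (7) evaluated at the linear index z = rho^T psi(x)."""
    return c - max(mu - link(z), 0.0)


def loss(rho, psi, mu, c, link=np.tanh, grid=2001):
    """Per-sample loss (8): integral of the moment from 0 to rho^T psi(x)."""
    z = float(rho @ psi)
    u = np.linspace(0.0, z, grid)                 # works for negative z as well
    integrand = c - np.maximum(mu - link(u), 0.0)
    # Trapezoid rule written out explicitly to avoid NumPy version differences.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(u)))


def grad(rho, psi, mu, c, link=np.tanh):
    """Gradient (9): psi(x_s) times the moment function."""
    return psi * moment(mu, c, float(rho @ psi), link)


rho = np.array([0.4, -0.1])
psi = np.array([1.0, 0.5])
mu, c, eps = 0.8, 0.1, 1e-5

# Central finite differences of the loss should match the closed-form gradient.
numeric = np.array([
    (loss(rho + eps * e, psi, mu, c) - loss(rho - eps * e, psi, mu, c)) / (2 * eps)
    for e in np.eye(2)
])
print(np.allclose(numeric, grad(rho, psi, mu, c), atol=1e-4))
```

Minimizing the sum of such losses over $\mathcal{B}$, as in (10), is then a convex problem whose stationarity condition is the sample moment equation.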

### 4.2 Confidence Bound for a Fixed Box

Next, we construct confidence radii for the reservation indices. For each fixed box $a\in[A]$, because $\Lambda(\cdot)$ is monotone and $\sigma_{at}^*=\Lambda(\rho_a^\top\psi(x_t))$, we focus on constructing a confidence interval for the scalar index $\rho_a^\top\psi(x_t)$ at the realized context $x_t$. We define $H_{at}(\rho)$, the empirical fluctuation of the loss at $\rho$, as

$$H_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}\!\left[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\right\}.\tag{12}$$

By optimality of $\hat{\rho}_{at}$, the empirical excess loss $\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\right\}$ is non-positive. This sum can be decomposed into its conditional expectation and its empirical fluctuation as

$$\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\right\}=\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\,\middle|\,\mathcal{F}_{s-1},x_s\right]+H_{at}(\hat{\rho}_{at}).$$

Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) implies that the conditional expectation is bounded below by the quadratic form $\frac{\kappa\mu_1}{2}\sum_{s\in\mathcal{S}_{at}}\left\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\right\}^{2}$. Consequently, to establish a confidence interval for $\rho_a^\top\psi(x_t)$, it suffices to bound the empirical fluctuation $H_{at}(\hat{\rho}_{at})$ using a comparable quadratic term and a logarithmic complexity term, as established in the following lemma.

###### Lemma 4\.3

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. Fix $a\in[A]$ and any constant $c_{0}>0$. With probability at least $1-\delta/3$, uniformly over all $t\in[T]$, for any $\rho_{t}$ adapted to $\mathcal{F}_{t-1}$ we have

$$-H_{at}(\rho_{t})\leq\frac{c_{0}}{8}\sum_{s\in\mathcal{S}_{at}}\left\{(\rho_{t}-\rho_{a})^{\top}\psi(x_{s})\right\}^{2}+\left(\frac{144}{c_{0}}+C_{0}\right)\left(m\log T+\log(T/\delta)\right),$$

where $C_{0}>0$ is a constant depending on $d_{\mathcal{B}},\bar{C}_{\psi},\overline{\iota}$.

Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3) is derived through standard concentration arguments. For a fixed $\rho$, the summands of $H_{at}(\rho)$ form a martingale difference sequence whose conditional variance is controlled by the quadratic term that appears in the population curvature lemma. A peeling argument over the size of this quadratic term, combined with an $\epsilon$-net argument over $\mathcal{B}$, yields a bound that is uniform over $\rho\in\mathcal{B}$ and therefore also holds for adaptive $\rho_{t}$.

We now build the confidence interval for the scalar index $\rho_{a}^{\top}\psi(x_{t})$ using Lemmas [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) and [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3). Define the regularized empirical design matrix

$$V_{at}(\eta_{1}):=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top},\tag{13}$$

where $\eta_{1}>0$ is a regularization parameter and $\mathbf{I}_{m}$ is the $m\times m$ identity matrix. Intuitively, $V_{at}(\eta_{1})$ summarizes the contexts in which box $a$ has been queried before period $t$. The next proposition establishes a high-probability bound on $|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|$.

###### Proposition 4\.4

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold and $\mu^{*}$ is known. Fix any $a\in[A]$ and any $\delta>0$. Then, with probability at least $1-\delta$, uniformly over all $t\in[T]$, we have

$$|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq B_{at}^{*}\,\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},\tag{14}$$

where $B_{at}^{*}=\sqrt{C_{1}[m\log(T)+\log(TA/\delta)]+\eta_{1}d_{\mathcal{B}}^{2}}$, and $C_{1}$ is a constant depending on the constants $\kappa,\mu_{1},\bar{\iota},\bar{C}_{\psi},d_{\mathcal{B}}$ defined in Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1).

Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) has the usual online-learning interpretation (see, e.g., Abbasi-Yadkori et al. [2011](https://arxiv.org/html/2606.07392#bib.bib56), Lattimore and Szepesvári [2020](https://arxiv.org/html/2606.07392#bib.bib50)) as the confidence bound used in bandit UCB algorithms. The confidence width is large when the current context points in a direction that has not been well explored for box $a$, and small when the historical opened-box contexts provide enough information in that direction.
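To see where the form of $B_{at}^{*}$ comes from, the following sketch shows how Lemmas 4.2 and 4.3 might be combined. It is a reconstruction under two assumptions that the text does not state explicitly: that $d_{\mathcal{B}}$ upper-bounds $\|\hat{\rho}_{at}-\rho_{a}\|_{2}$, and that the free constant may be set to $c_{0}=2\kappa\mu_{1}$. The shorthand $Q_{at}$ and $R$ is introduced here only for illustration, and the constants need not match those of the formal proof.

$$
\begin{aligned}
Q_{at}&:=\sum_{s\in\mathcal{S}_{at}}\left\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\right\}^{2},\qquad R:=\left(\frac{144}{c_{0}}+C_{0}\right)\left(m\log T+\log(T/\delta)\right),\\
\frac{\kappa\mu_{1}}{2}\,Q_{at}&\leq -H_{at}(\hat{\rho}_{at})\leq\frac{c_{0}}{8}\,Q_{at}+R\quad\Longrightarrow\quad Q_{at}\leq\frac{4R}{\kappa\mu_{1}}\ \text{ when } c_{0}=2\kappa\mu_{1},\\
\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}^{2}&=\eta_{1}\|\hat{\rho}_{at}-\rho_{a}\|_{2}^{2}+Q_{at}\leq\eta_{1}d_{\mathcal{B}}^{2}+\frac{4R}{\kappa\mu_{1}},\\
|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|&\leq\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}\,\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}.
\end{aligned}
$$

The last line is the Cauchy-Schwarz inequality in the geometry induced by $V_{at}(\eta_{1})$. Combined with the third line, it gives a radius of the form $\sqrt{C[m\log T+\log(T/\delta)]+\eta_{1}d_{\mathcal{B}}^{2}}$, which is the shape of $B_{at}^{*}$ in (14).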

Motivated by Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4), we define the optimistic index in the known-$\mu^{*}$ case as

$$\widetilde{\sigma}_{at}=\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}^{*}\,\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\right).\tag{15}$$

Since $\Lambda$ is increasing, Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) implies $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ uniformly over $a\in[A]$, $t\in[T]$ with probability at least $1-\delta$. This, in turn, is used to establish the uniform optimism event required for the regret guarantee. In implementation, one may calibrate this radius using a bootstrap approximation to the distribution of $(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})$, which may yield less conservative confidence bounds. Such a bootstrap procedure is not covered by the present theory, and proving its validity would require a uniform bootstrap approximation under adaptive sampling.
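The optimistic index in (15) can be computed directly from the design matrix in (13). The sketch below is illustrative only: the logistic choice for $\Lambda$, the function name `optimistic_index`, and the fixed `radius` that stands in for $B_{at}^{*}$ are assumptions made here, not details given in the text.

```python
import numpy as np

def sigmoid(z):
    """Hypothetical choice for the increasing link Lambda."""
    return 1.0 / (1.0 + np.exp(-z))

def optimistic_index(rho_hat, psi_t, past_psis, radius, eta1=1.0, link=sigmoid):
    """Return (sigma_tilde, bonus) for one box at the current context.

    rho_hat   : (m,) current index-parameter estimate
    psi_t     : (m,) feature vector psi(x_t) of the current context
    past_psis : (n, m) features of contexts in which this box was queried
    radius    : scalar standing in for B_at^* (treated as given here)
    """
    m = psi_t.shape[0]
    # Regularized design matrix: eta1 * I + sum_s psi_s psi_s^T, as in (13).
    V = eta1 * np.eye(m) + past_psis.T @ past_psis
    # Weighted norm ||psi_t||_{V^{-1}} = sqrt(psi_t^T V^{-1} psi_t).
    bonus = float(np.sqrt(psi_t @ np.linalg.solve(V, psi_t)))
    sigma_tilde = float(link(rho_hat @ psi_t + radius * bonus))
    return sigma_tilde, bonus

rng = np.random.default_rng(0)
m = 3
rho_hat = np.array([0.5, -0.2, 0.1])
psi_t = np.array([1.0, 0.0, 0.0])
few = rng.normal(size=(5, m))
# Same history plus 200 extra queries along the direction of psi_t.
many = np.vstack([few, np.tile(psi_t, (200, 1))])

s_few, b_few = optimistic_index(rho_hat, psi_t, few, radius=2.0)
s_many, b_many = optimistic_index(rho_hat, psi_t, many, radius=2.0)
print(b_few, b_many)
```

The exploration bonus shrinks once the direction of `psi_t` has been queried often, so the optimistic index moves toward the plug-in value `sigmoid(rho_hat @ psi_t)`.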

### 4.3 Regret under Known Reward Function

Now we bound the cumulative regret when $\mu^{*}$ is known, so that $\widetilde{\mu}_{t}=\mu^{*}$. When the indices are optimistic, i.e., $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ for all $a\in[A]$ and $t\in[T]$, Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) gives $\Delta_{t}(\tilde{\pi})\leq\mathbb{E}\left[\sum_{a\in\mathcal{A}_{t}}\left(\widetilde{\sigma}_{at}-\sigma_{at}^{*}\right)\,\middle|\,x_{t},\mathcal{F}_{t-1}\right]$. Using the Lipschitz continuity of $\Lambda$ and the confidence bound in Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4), we obtain $\widetilde{\sigma}_{at}-\sigma_{at}^{*}\leq 2LB_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}$ on the high-probability event, where $B_{at}^{*}\leq\mathrm{O}(\sqrt{m\log(T)+\log(TA)})$. Thus the cumulative regret is controlled by $\sum_{t=1}^{T}\sum_{a\in\mathcal{A}_{t}}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}$. Every time box $a$ is queried, the corresponding feature vector $\psi(x_{t})$ is added to its design matrix, so the uncertainty in that direction decreases over time. The standard elliptical-potential argument implies that for each box $a\in[A]$,

$$\sum_{t=1}^{T}B_{at}^{*}\,\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\,\mathbb{I}\{a\in\mathcal{A}_{t}\}\leq\widetilde{O}(m\sqrt{T}).\tag{16}$$

Summing this bound over the $A$ boxes yields the following regret guarantee:

###### Proposition 4\.5

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. When the reward function $\mu^{*}$ is known,

$$\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\widetilde{O}\left(Am\sqrt{T}\right).$$

Compared with the $\widetilde{O}(T^{5/6})$ bound of Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) for a more general contextual Pandora's Box setting, our $\widetilde{O}(\sqrt{T})$ rate relies on the additional local-curvature structure in Assumption [4.1](https://arxiv.org/html/2606.07392#S4.SS1), which enables UCB-style confidence control for the reservation indices.
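The elliptical-potential step behind (16) can be checked numerically for a single box. The sketch below assumes unit-norm features, $\eta_{1}=1$, and that the box is queried in every period. The helper names are hypothetical, and the comparison bound $\sqrt{2Tm\log(1+T/m)}$ is the standard Cauchy-Schwarz plus log-determinant bound of Abbasi-Yadkori et al. (2011) under those assumptions, not a quantity stated in this paper.

```python
import numpy as np

def cumulative_bonus(psis, eta1=1.0):
    """Sum of ||psi_t||_{V_t^{-1}}, where V_t uses only features seen before t."""
    m = psis.shape[1]
    V = eta1 * np.eye(m)
    total = 0.0
    for psi in psis:
        total += float(np.sqrt(psi @ np.linalg.solve(V, psi)))
        V += np.outer(psi, psi)  # the queried feature enters the design matrix
    return total

rng = np.random.default_rng(1)
m = 4

def unit_features(T):
    """Random directions on the unit sphere (an illustrative context model)."""
    x = rng.normal(size=(T, m))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

for T in (500, 2000, 8000):
    total = cumulative_bonus(unit_features(T))
    bound = np.sqrt(2 * T * m * np.log(1 + T / m))
    print(T, round(total, 2), round(bound, 2))
```

The summed widths stay below a quantity of order $\sqrt{mT\log T}$. Multiplying by a radius of order $\sqrt{m\log T}$ is what produces the $\widetilde{O}(m\sqrt{T})$ per-box rate.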

## 5 Results under Unknown Reward Function

We now drop the assumption that $\mu^{*}$ is known and study the problem where $\mu^{*}$ must be learned online. The known-$\mu^{*}$ result of the previous section guides the analysis. Relative to that section, two additional challenges arise:

- (i) The reward estimation term in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) is no longer zero.
- (ii) The reservation index estimation can no longer use the true conditional expected reward $\mu^{*}(x_{s},\omega_{as})$ when defining the loss.

Recall from Section [2](https://arxiv.org/html/2606.07392#S2) that we impose the generalized linear parametric structure on the expected reward function:

$$\mu^{*}(x,\omega)=G(\theta_{*}^{\top}\phi(x,\omega)),\qquad\forall(x,\omega)\in\mathcal{X}\times\Omega,\tag{17}$$

where $G:\mathbb{R}\rightarrow[0,1]$ is strictly increasing. In Section [5.1](https://arxiv.org/html/2606.07392#S5.SS1) we address (i) using a standard penalized maximum likelihood estimator (MLE) construction based on ([17](https://arxiv.org/html/2606.07392#S5.E17)). In Section [5.2](https://arxiv.org/html/2606.07392#S5.SS2) we address (ii) by showing that the main intuitions from Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) continue to work. In particular, Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2) extends Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) to this setting by accounting for the plug-in error in the loss. As in the known-reward case, the index estimator is controlled by population curvature and empirical concentration, and now also by an additional perturbation term from reward estimation. Once the reward parameter is learned at the standard online rate, this perturbation is small enough to preserve the overall $\widetilde{O}(\sqrt{T})$ regret rate.

However, challenge \(ii\) with unknown reward is more delicate because reward learning uses selected outputs while index estimation is based on queried outputs\. Hence controlling the plug\-in loss requires an additional assumption ensuring that the reward\-design matrix is sufficiently informative for the queried features that enter the index estimator\.

We begin by imposing the following regularity condition on the expected reward function:

###### Assumption (Regularity conditions for expected reward function)

(i) $G(\cdot)$ is continuously differentiable and strictly increasing, with Lipschitz constant $L$. Furthermore, $\underline{\mu}:=\inf_{\theta\in\Theta,x\in\mathcal{X},\omega\in\Omega}G^{\prime}(\theta^{\top}\phi(x,\omega))>0$, and for some absolute constant $L_{1}>0$, $|G^{\prime\prime}(\theta^{\top}\phi(x,\omega))|\leq L_{1}$ for all $x\in\mathcal{X},\omega\in\Omega,\theta\in\Theta$. (ii) For absolute constants $\bar{C}_{\phi},\bar{\alpha}>0$, $\|\phi(x,\omega)\|_{2}\leq\bar{C}_{\phi}$ for all $\omega\in\Omega,x\in\mathcal{X}$, and $\|\theta_{*}\|_{2}\leq\bar{\alpha}$.

Assumption [5](https://arxiv.org/html/2606.07392#S5) is standard in the generalized linear contextual bandit literature (e.g., Filippi et al. [2010](https://arxiv.org/html/2606.07392#bib.bib15), Li et al. [2017](https://arxiv.org/html/2606.07392#bib.bib3), Lee et al. [2024](https://arxiv.org/html/2606.07392#bib.bib16)). The Lipschitz constant $L$ controls how errors in $\hat{\theta}_{t-1}$ propagate into the reward prediction, the lower bound $\underline{\mu}$ on $G^{\prime}$ ensures uniform strong convexity of the objective, and the boundedness of $\phi$ and $\theta_{*}$ keeps the analysis in the standard regime.

We next assume that the random reward observed by the DM is drawn from a canonical exponential family:

###### Assumption

Given any $x\in\mathcal{X}$ and $\omega\in\Omega$, conditional on the pair of input context and output of the selected API $(x_{t},\omega_{a_{t}t})=(x,\omega)$ during period $t$, the density of the random reward $r_{t}=G(\theta_{*}^{\top}\phi(x,\omega))+\zeta_{t}$ with respect to a reference measure is given by

$$f(r\mid x,\omega)=\exp\left(r\,\theta_{*}^{\top}\phi(x,\omega)-b(\theta_{*}^{\top}\phi(x,\omega))+\xi(r)\right),\tag{18}$$

where $0\leq r_{t}\leq\gamma_{0}$ a.s. for some constant $\gamma_{0}>1$, $\xi(\cdot)$ is a real function, $b(\cdot)$ is twice continuously differentiable, and $\mathbb{E}[\zeta_{t}\mid\omega_{a_{t}t},x_{t},\mathcal{F}_{t-1}]=0$ for any $t\in[T]$ almost surely.

Assumption [5](https://arxiv.org/html/2606.07392#S5) is standard in the generalized linear contextual bandit literature (Filippi et al. [2010](https://arxiv.org/html/2606.07392#bib.bib15), Li et al. [2017](https://arxiv.org/html/2606.07392#bib.bib3), Zhou et al. [2019](https://arxiv.org/html/2606.07392#bib.bib4), Lee et al. [2024](https://arxiv.org/html/2606.07392#bib.bib16), etc.). It encompasses Bernoulli, logistic, and binomial reward distributions, and more generally exponential families (e.g., Gaussian, Poisson) with support truncated to $[0,\gamma_{0}]$. Differentiating $b$ gives $b^{\prime}=G$, the inverse link function. Common examples of $G$ include any known cumulative distribution function (CDF), such as the logistic and probit CDFs.
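As a concrete instance of (18), consider a Bernoulli reward with success probability $G(z)$, where $z=\theta_{*}^{\top}\phi(x,\omega)$ and $G$ is the logistic function. This is a standard textbook example, added here for illustration and not taken from the paper. It confirms that $b^{\prime}=G$ in this case:

$$
\begin{aligned}
f(r\mid x,\omega)&=G(z)^{r}\{1-G(z)\}^{1-r}=\exp\left(rz-\log(1+e^{z})\right),\qquad r\in\{0,1\},\\
b(z)&=\log(1+e^{z}),\qquad\xi(r)=0,\\
b^{\prime}(z)&=\frac{e^{z}}{1+e^{z}}=G(z),\qquad b^{\prime\prime}(z)=G(z)\{1-G(z)\}=\operatorname{Var}(r\mid x,\omega).
\end{aligned}
$$

The first line uses $\log\{G(z)/(1-G(z))\}=z$ and $\log\{1-G(z)\}=-\log(1+e^{z})$.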

### 5.1 Estimating the Reward Parameter

The reward estimation essentially follows a generalized linear contextual bandit framework. At each period $t$ the deployment $(x_{t},\omega_{a_{t}t})$ yields a reward observation $r_{t}$, and the dataset $\{(x_{k},\omega_{a_{k}k},r_{k})\}_{k=1}^{t-1}$ accumulates over time. We estimate $\theta_{*}$ by a penalized (projected) maximum likelihood estimator

$$\hat{\theta}_{t-1}:=\operatorname*{arg\,min}_{\theta\in\Theta}\left\|\sum_{k=1}^{t-1}\left\{r_{k}-G\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\right\}\phi(x_{k},\omega_{a_{k}k})-\eta_{0}\theta\right\|_{\Phi_{t-1}^{-1}},\tag{19}$$

where

$$\Phi_{t-1}=\eta_{0}\mathbf{I}_{d}+\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\phi(x_{k},\omega_{a_{k}k})^{\top}\tag{20}$$

is the regularized design matrix of reward features collected through period $t-1$, $\eta_{0}>0$ is a regularization parameter, and $\mathbf{I}_{d}$ is the $d\times d$ identity matrix. The term inside the weighted norm $\|\cdot\|_{\Phi_{t-1}^{-1}}$ in ([19](https://arxiv.org/html/2606.07392#S5.E19)) is the gradient of the unconstrained penalized log-likelihood objective $\max_{\theta}\sum_{k=1}^{t-1}\log f(r_{k}\mid x_{k},\omega_{a_{k}k})-\tfrac{\eta_{0}}{2}\|\theta\|_{2}^{2}$, and $\hat{\theta}_{t-1}$ can be viewed as a "projection" onto $\Theta$ if the unconstrained solution falls outside $\Theta$. This construction follows Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)).
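When the unconstrained maximizer already lies in $\Theta$, the estimator in (19) is the root of the penalized score. The sketch below finds that root with damped Newton steps for a logistic $G$ and Bernoulli rewards. The logistic link, the synthetic data, the function names, and the omission of the projection step are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik(theta, X, r, eta0):
    """Penalized Bernoulli log-likelihood with b(z) = log(1 + e^z)."""
    z = X @ theta
    return float(r @ z - np.logaddexp(0.0, z).sum() - 0.5 * eta0 * theta @ theta)

def score(theta, X, r, eta0):
    """Gradient of log_lik: the vector inside the weighted norm in (19)."""
    return X.T @ (r - sigmoid(X @ theta)) - eta0 * theta

def penalized_mle(X, r, eta0=1.0, iters=100, tol=1e-10):
    """Damped Newton ascent; projection onto Theta is omitted (assumption)."""
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        # Negative Hessian: sum_k G'(z_k) phi_k phi_k^T + eta0 * I.
        H = (X * (p * (1.0 - p))[:, None]).T @ X + eta0 * np.eye(d)
        step = np.linalg.solve(H, score(theta, X, r, eta0))
        if np.linalg.norm(step) < tol:
            break
        t, base = 1.0, log_lik(theta, X, r, eta0)
        # Halve the step until the objective does not decrease (small slack).
        while log_lik(theta + t * step, X, r, eta0) < base - 1e-9 and t > 1e-8:
            t *= 0.5
        theta = theta + t * step
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))              # rows play the role of phi(x_k, omega_k)
theta_star = np.array([1.0, -0.5, 0.25])   # synthetic ground truth
r = (rng.random(400) < sigmoid(X @ theta_star)).astype(float)
theta_hat = penalized_mle(X, r)
print(theta_hat, np.linalg.norm(score(theta_hat, X, r, 1.0)))
```

At the returned `theta_hat` the score is numerically zero, so the weighted norm in (19) is minimized whenever the iterate is feasible.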

The following standard result provides a high-probability bound on the prediction error induced by using $\hat{\theta}_{t-1}$ to evaluate the reward of a candidate context-output pair $(x_{t},\omega_{at})$.

###### Lemma 5.1

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S5) and [5](https://arxiv.org/html/2606.07392#S5) hold and fix any $\delta\in(0,\min\{1,2d/e\})$. With probability at least $1-\delta/2$, uniformly over all $a\in[A]$ and $t\geq 2$,

$$\left|(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{t},\omega_{at})\right|\leq\beta_{t}\,\left\|\phi(x_{t},\omega_{at})\right\|_{\Phi_{t-1}^{-1}},$$

where $\beta_{t}=\mathrm{O}\left(\sqrt{d\log(t)+\log(1/\delta)}\right)$ is defined in ([61](https://arxiv.org/html/2606.07392#S9.E61)).

This is the ellipsoidal confidence bound for generalized linear bandits, which has a structure similar to that of Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) for the index parameter estimation error. The matrix $\Phi_{t-1}$ plays the role that $V_{at}(\eta_{1})$ played for the indices, and $\beta_{t}$ grows only polylogarithmically in $t$. The proof of Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) adapts Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)) via a self-normalized martingale inequality. Equipped with this bound, we construct the optimistic reward as

$$\widetilde{\mu}_{t}(x_{t},\omega_{at}):=G\left(\hat{\theta}_{t-1}^{\top}\phi(x_{t},\omega_{at})+\beta_{t}\,\left\|\phi(x_{t},\omega_{at})\right\|_{\Phi_{t-1}^{-1}}\right).\tag{21}$$

Because $G$ is monotonically increasing by Assumption [5](https://arxiv.org/html/2606.07392#S5), Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) implies that with probability at least $1-\delta/2$, uniformly over all $a\in[A]$ and $t\geq 2$, $\widetilde{\mu}_{t}(x_{t},\omega_{at})\geq\mu^{*}(x_{t},\omega_{at})$.

### 5.2 Estimating Indices

Next, we estimate the reservation indices and construct their associated confidence intervals for the regime where the reward function is unknown. As an unknown-reward analogue to Section [4.2](https://arxiv.org/html/2606.07392#S4.SS2), we replace the true reward $\mu^{*}(x_{s},\omega_{as})$ by its plug-in estimate $G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))$ for each $s\leq t$. The main goal is to show that this substitution preserves the same confidence-bound structure as Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4), up to an additional perturbation term controlled by the reward estimation error bound. Fix a period $t$ and a box $a\in[A]$. For any $\rho\in\mathcal{B}$ and any past queried sample $s\in\mathcal{S}_{at}$, define the plug-in loss

$$\hat{\ell}_{as,t}(\rho):=\int_{0}^{\rho^{\top}\psi(x_{s})}\left[c_{as}-\left\{G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u)\right\}^{+}\right]du.\tag{22}$$

This is the same loss as ([8](https://arxiv.org/html/2606.07392#S4.E8)), with $\mu^{*}$ replaced by the current reward estimate. The plug-in index estimator is

$$\hat{\rho}_{at}\in\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\rho).\tag{23}$$

The first-order condition of ([23](https://arxiv.org/html/2606.07392#S5.E23)) is the empirical analogue of the Weitzman moment equation with the reward function evaluated at $\hat{\theta}_{t-1}$. The new challenge compared to Section [4.2](https://arxiv.org/html/2606.07392#S4.SS2) is that the empirical loss in ([23](https://arxiv.org/html/2606.07392#S5.E23)) is not the oracle loss. To isolate this difference, define the plug-in perturbation

$$J_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\left[\hat{\ell}_{as,t}(\rho)-\ell_{as}(\rho)\right]-\left[\hat{\ell}_{as,t}(\rho_{a})-\ell_{as}(\rho_{a})\right]\right\},\tag{24}$$

where $\ell_{as}(\rho)$ is the known-$\mu^{*}$ loss in ([8](https://arxiv.org/html/2606.07392#S4.E8)). The term $J_{at}(\rho)$ measures how the estimation error of $\hat{\theta}_{t-1}$ propagates into the index-estimation loss.
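The estimator in (23) can be computed without numerical integration, because the gradient of the plug-in loss (22) with respect to $\rho$ is $\psi(x_{s})\,[c_{as}-\{\hat{\mu}_{s}-\Lambda(\rho^{\top}\psi(x_{s}))\}^{+}]$, where $\hat{\mu}_{s}$ abbreviates the plug-in reward. The sketch below runs projected gradient descent on the averaged loss. The logistic $\Lambda$, the Euclidean ball used for $\mathcal{B}$, the step size, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Hypothetical choice for the increasing link Lambda."""
    return 1.0 / (1.0 + np.exp(-z))

def plug_in_index(psis, costs, mu_hat, radius=5.0, step=2.0, iters=2000, link=sigmoid):
    """Projected gradient descent for the plug-in index estimator.

    psis   : (n, m) features psi(x_s) of past queries of this box
    costs  : (n,)   observed query costs c_as
    mu_hat : (n,)   plug-in rewards G(theta_hat^T phi(x_s, omega_as))
    radius : radius of the ball standing in for the feasible set B
    """
    rho = np.zeros(psis.shape[1])
    for _ in range(iters):
        # Per-sample residual c - (mu_hat - Lambda(rho^T psi))^+ .
        resid = costs - np.maximum(mu_hat - link(psis @ rho), 0.0)
        rho = rho - step * (psis.T @ resid) / len(costs)
        norm = np.linalg.norm(rho)
        if norm > radius:
            rho *= radius / norm  # project back onto the ball
    return rho

n = 1000
mu_hat = (np.arange(n) + 0.5) / n   # plug-in rewards spread evenly over [0, 1]
psis = np.ones((n, 1))              # constant feature: a context-free index
costs = np.full(n, 0.08)
rho_hat = plug_in_index(psis, costs, mu_hat)
sigma_hat = float(sigmoid(rho_hat[0]))
print(sigma_hat)
```

In this context-free example the stationarity condition reduces to the empirical Weitzman equation: the average of $(\hat{\mu}_{s}-\sigma)^{+}$ equals the cost $0.08$. For rewards uniform on $[0,1]$ the solution is $\sigma=1-\sqrt{2\cdot 0.08}=0.6$, which the iteration recovers.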

We impose the following additional assumption for the unknown-reward case. It ensures that the reward-design matrix contains enough information in all directions needed to control the plug-in perturbation. We use $\mathrm{polylog}(T)$ to denote a quantity bounded by a polynomial in $\log T$ (i.e., $\mathrm{polylog}(T)=\mathrm{O}((\log T)^{c})$ for some constant $c>0$).

###### Assumption (Anti-concentration of reward features)

There exists a sequence $M_{T}\geq 1$ with $M_{T}=\sqrt{\mathrm{polylog}(T)}/A$ such that, for any $t\in[T]$, $a\in[A]$, $v\in\mathbb{S}^{d-1}$, and $0<\epsilon\leq(2AM_{T})^{-1}$,

$$\mathbb{P}\left(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\right)\leq M_{T}\,\epsilon\qquad\text{a.s.}$$

Recall that the contexts $x_{t}$ are independent across time, implying that the distribution of the reward feature $\phi(x_{t},\omega_{at})$ is independent of $\mathcal{F}_{t-1}$. Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) leverages this independence to impose an anti-concentration condition that rules out degenerate reward features. Informally, every one-dimensional projection of $\phi(x_{t},\omega_{at})$ must have nontrivial variation near zero. This condition is imposed because $\theta_{*}$ is learned only from selected outputs, whereas the index loss for box $a$ uses queried outputs. Thus, the reward features must be sufficiently informative for evaluating the queried features that enter the index estimator. For example, it is satisfied when the embedding distribution is a truncated Gaussian, a truncated multivariate Student's $t$, or a uniform distribution on a hyperrectangle; see Lemmas [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3)–[10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7) in Appendix [10](https://arxiv.org/html/2606.07392#S10).

The following lemma bounds the additional perturbation caused by using $\hat{\theta}_{t-1}$ in the index loss.

###### Lemma 5.3 (Plug-in loss perturbation)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [5](https://arxiv.org/html/2606.07392#S5) hold. Then, for any $c_{0}>0$,

$$-J_{at}(\hat{\rho}_{at})\leq\frac{c_{0}}{8}\sum_{s\in\mathcal{S}_{at}}\left\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\right\}^{2}+\frac{2L^{2}}{c_{0}}\sum_{s\in\mathcal{S}_{at}}\left\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\right\}^{2}.$$

Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) shows why Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) is needed. The first term on the right-hand side has the same quadratic form as the population curvature in Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) and can therefore be absorbed into the curvature argument. The second term is the reward estimation error evaluated on queried outputs. By Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1), this term is controlled by the inverse-design norms $\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}$, where $\Phi_{t-1}$ is constructed from previously selected outputs, while $\phi(x_{s},\omega_{as})$ corresponds to queried outputs used for estimating the index of box $a$. Thus, there is a potential mismatch between the selected outputs that inform reward learning and the queried outputs that enter index estimation. Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) rules out severe mismatches of this form by ensuring that $\Phi_{t-1}$ is sufficiently well conditioned, so these inverse-design norms do not remain large.
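The two terms in Lemma 5.3 have the shape produced by a Lipschitz bound followed by Young's inequality. The sketch below is one way to arrive at that shape. It assumes that $G$ is $L$-Lipschitz and that the oracle loss in (8) has the same form as (22) with $\mu^{*}$ in place of the plug-in reward. The abbreviations $\hat{\mu}_{s}$, $\mu_{s}^{*}$, $a_{s}$, $b_{s}$ are introduced here for illustration, and the formal proof may proceed differently.

$$
\begin{aligned}
\left[\hat{\ell}_{as,t}(\rho)-\ell_{as}(\rho)\right]-\left[\hat{\ell}_{as,t}(\rho_{a})-\ell_{as}(\rho_{a})\right]&=\int_{\rho_{a}^{\top}\psi(x_{s})}^{\rho^{\top}\psi(x_{s})}\left[\{\mu_{s}^{*}-\Lambda(u)\}^{+}-\{\hat{\mu}_{s}-\Lambda(u)\}^{+}\right]du,\\
\left|\{\mu_{s}^{*}-\Lambda(u)\}^{+}-\{\hat{\mu}_{s}-\Lambda(u)\}^{+}\right|&\leq|\mu_{s}^{*}-\hat{\mu}_{s}|\leq L\left|(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\right|=:b_{s},\\
-J_{at}(\rho)&\leq\sum_{s\in\mathcal{S}_{at}}a_{s}b_{s},\qquad a_{s}:=\left|\psi(x_{s})^{\top}(\rho-\rho_{a})\right|,\\
a_{s}b_{s}&\leq\frac{c_{0}}{8}a_{s}^{2}+\frac{2}{c_{0}}b_{s}^{2}.
\end{aligned}
$$

The second line uses that $z\mapsto z^{+}$ is 1-Lipschitz, and the last line is Young's inequality, which holds because $2\sqrt{(c_{0}/8)(2/c_{0})}=1$. Summing over $s$ with $\rho=\hat{\rho}_{at}$ reproduces the right-hand side of the lemma.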

We now state the confidence bound for the index estimator:

###### Proposition 5.4 (Index confidence bound with unknown reward function)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5)–[5.2](https://arxiv.org/html/2606.07392#S5.SS2) hold. Fix any $\delta>0$. Then, with probability at least $1-2\delta/3$, uniformly over all $a\in[A]$ and all $t\in[T]$,

$$\left|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})\right|\leq B_{at}\left\|\psi(x_{t})\right\|_{V_{at}(\eta_{1})^{-1}},\tag{26}$$

where $V_{at}(\eta_{1})$ is defined in ([13](https://arxiv.org/html/2606.07392#S4.E13)),

$$B_{at}=C_{2}\sqrt{m\log(T)+\log(AT/\delta)+\left[d\log(t)+\log(1/\delta)\right]\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2}},$$

and $C_{2}$ is a constant depending on $\eta_{0},\eta_{1},\kappa,\mu_{1},\bar{\iota},d_{\mathcal{B}},\bar{C}_{\psi},L,\bar{C}_{\phi},\gamma_{0},\underline{\mu}$, and $\bar{\alpha}$.

The proof follows the same localized empirical-process argument as Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4). By the optimality of $\hat{\rho}_{at}$ in ([23](https://arxiv.org/html/2606.07392#S5.E23)), $\sum_{s\in\mathcal{S}_{at}}\left\{\hat{\ell}_{as,t}(\hat{\rho}_{at})-\hat{\ell}_{as,t}(\rho_{a})\right\}\leq 0$. Decomposing this into the oracle population curvature, the empirical fluctuation, and the plug-in perturbation gives

$$\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_{a})\,\middle|\,\mathcal{F}_{s-1},x_{s}\right]+H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at})\leq 0.$$

The first term is lower bounded by the curvature result in Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2), the second is controlled by the empirical-process bound in Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3), and the third is controlled by Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) together with the reward confidence bound in Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1). Combining these bounds yields ([26](https://arxiv.org/html/2606.07392#S5.E26)). The detailed proof is provided in Appendix [11](https://arxiv.org/html/2606.07392#S11). Motivated by Proposition [5.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4), define the optimistic reservation index as

$$\widetilde{\sigma}_{at}:=\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}\left\|\psi(x_{t})\right\|_{V_{at}(\eta_{1})^{-1}}\right).\tag{27}$$

Since $\Lambda$ is increasing, Proposition [5.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) implies that, with probability at least $1-\delta$, $\widetilde{\sigma}_{at}\geq\Lambda(\rho_{a}^{\top}\psi(x_{t}))=\sigma_{at}^{*}$ uniformly over all $a\in[A]$ and $t\in[T]$.

### 5.3 Regret Analysis under Unknown Reward Function

We now combine the reward and index confidence bounds to control the cumulative regret of COSMOS when the reward function is unknown. The preceding subsections established two high-probability optimism events: Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) gives optimism for the reward estimate $\widetilde{\mu}_{t}$, and Proposition [5.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) gives optimism for the reservation index estimate $\widetilde{\sigma}_{at}$. Therefore, when both events hold, the regret decomposition in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) applies and separates the period-$t$ regret into a reward estimation term and an index-estimation term for all $a\in[A]$ and $t\in[T]$. This structure parallels the known-reward analysis in Section [4](https://arxiv.org/html/2606.07392#S4). When the reward function is unknown, an additional generalized linear bandit term appears from estimating the reward parameter.

As shown in Appendix [10](https://arxiv.org/html/2606.07392#S10), Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) implies that, with high probability, $\lambda_{\min}(\Phi_{t-1})\geq\frac{t-1}{16\,\mathrm{polylog}(T)}$ uniformly for all $t\geq\tau:=C\,\mathrm{polylog}(T)\log(Ad)$, for some absolute constant $C$. This eigenvalue growth allows the reward estimation error in $J_{at}(\rho)$ to be controlled uniformly over $t\geq\tau$, which in turn controls the cumulative regret over those periods. For the initial periods before $\lceil\tau\rceil$, the per-period regret is bounded by $2A+1$. Since $\tau=\mathrm{O}(\log(Ad)\,\mathrm{polylog}(T))$, the cumulative regret up to period $\lceil\tau\rceil$ is at most $\mathrm{O}(A\log(Ad)\,\mathrm{polylog}(T))$ and is therefore dominated by the regret accumulated over periods $t\geq\tau$.

###### Theorem 5\.5

Suppose Assumptions[5](https://arxiv.org/html/2606.07392#S4.E5)–[5\.2](https://arxiv.org/html/2606.07392#S5.SS2)hold\. Then, ignoring logarithmic factors,

𝔼\[∑t=1TΔt\(π~\)\]≤O~\(\[d\+A\(m\+md\)\]T\)\.\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\Delta\_\{t\}\(\\tilde\{\\pi\}\)\\right\]\\leq\\widetilde\{O\}\\left\(\[d\+A\(m\+\\sqrt\{md\}\)\]\\sqrt\{T\}\\right\)\.In particular, the regret bound is asymptotically minimized by choosing

η1≍max⁡\{m1/2,\(dm\)1/4,d1/3\},η0≍min⁡\{dAm,\(dAm\)2/3\}\.\\eta\_\{1\}\\asymp\\max\\left\\\{m^\{1/2\},\(dm\)^\{1/4\},d^\{1/3\}\\right\\\},\\qquad\\eta\_\{0\}\\asymp\\min\\left\\\{\\frac\{d\}\{A\\sqrt\{m\}\},\\left\(\\frac\{d\}\{A\\sqrt\{m\}\}\\right\)^\{2/3\}\\right\\\}\.

The terms in Theorem[5\.5](https://arxiv.org/html/2606.07392#S5.Thmtheorem5)correspond directly to the two components in Theorem[3\.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3)\. The termO~\(dT\)\\widetilde\{O\}\(d\\sqrt\{T\}\)is the reward estimation regret, matching the usual dimension dependence for generalized linear contextual bandits with reward feature dimensiondd\. The termO~\(A\(m\+dm\)T\)\\widetilde\{O\}\(A\(m\+\\sqrt\{dm\}\)\\sqrt\{T\}\)is the index\-estimation regret, whereAAis the number of boxes andmmis the dimension of the index featureψ\\psi\. Relative to the known\-reward case in Proposition[4\.5](https://arxiv.org/html/2606.07392#S4.Thmtheorem5), the additional termO~\(AdmT\)\\widetilde\{O\}\(A\\sqrt\{dmT\}\)reflects the effect of reward estimation error on index estimation\. On the high\-probability optimism event, Theorem[3\.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3)gives𝔼\[∑t=1TΔt\(π~\)\]≤𝔼\[∑t=1T\{μ~t\(xt,ωatt\)−μ∗\(xt,ωatt\)\}\]\+𝔼\[∑t=1T∑a∈𝒜t\(σ~at−σat∗\)\]\.\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\Delta\_\{t\}\(\\tilde\{\\pi\}\)\\right\]\\leq\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\left\\\{\\widetilde\{\\mu\}\_\{t\}\(x\_\{t\},\\omega\_\{a\_\{t\}t\}\)\-\\mu^\{\*\}\(x\_\{t\},\\omega\_\{a\_\{t\}t\}\)\\right\\\}\\right\]\+\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}\\sum\_\{a\\in\\mathcal\{A\}\_\{t\}\}\\left\(\\widetilde\{\\sigma\}\_\{at\}\-\\sigma\_\{at\}^\{\*\}\\right\)\\right\]\.The first sum is controlled by the reward confidence radius from Lemma[5\.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1)and an elliptical\-potential argument\(Abbasi\-Yadkoriet al\.[2011](https://arxiv.org/html/2606.07392#bib.bib56)\)for the reward\-design matrixΦt−1\\Phi\_\{t\-1\}\. The second sum is controlled by the index confidence radius from Proposition[5\.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4)and a separate elliptical\-potential argument for each box\-specific design matrixVat\(η1\)V\_\{at\}\(\\eta\_\{1\}\)\. 
Thus the unknown\-reward analysis adds a standard reward\-learning term while preserving the same index\-learning rate as in the known\-reward benchmark\. The detailed proof is provided in Appendix[12](https://arxiv.org/html/2606.07392#S12)\.
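To make the relative size of the two terms concrete, the following Python sketch evaluates the leading-order quantities $d\sqrt{T}$ and $A(m+\sqrt{dm})\sqrt{T}$ with all constants and logarithmic factors dropped. The function name `regret_terms` and the sample values of $d$, $A$, $m$, $T$ are illustrative assumptions, not taken from the paper.

```python
import math


def regret_terms(d: int, A: int, m: int, T: int) -> dict:
    """Leading-order terms of the regret bound (constants and log factors dropped)."""
    reward_term = d * math.sqrt(T)                          # reward-estimation regret
    index_term = A * (m + math.sqrt(d * m)) * math.sqrt(T)  # index-estimation regret
    return {
        "reward": reward_term,
        "index": index_term,
        "total": reward_term + index_term,
    }


# Hypothetical problem sizes, chosen only for illustration.
terms = regret_terms(d=16, A=4, m=9, T=10_000)
print(terms)
```

Because both terms scale with $\sqrt{T}$, quadrupling the horizon doubles each of them in this sketch; which term dominates depends on how $A(m+\sqrt{dm})$ compares with $d$.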

## 6 Conclusion

Motivated by LLM cascading, we introduce an online contextual Pandora's Box model in which a decision-maker sequentially queries APIs to generate outputs at a cost and deploys a single output, observing only its downstream reward. This output-mediated feedback departs from the classical setting, where opening a box reveals its reward directly, and captures a key feature of LLM cascading systems: API-specific heterogeneity enters only through the distributions of generated outputs and costs, while downstream value is governed by a shared reward evaluator on the context-output pair. Rather than estimating the full conditional output and cost distributions, we impose a generalized linear structure directly on the reservation indices and the shared reward function. Our COSMOS policy combines GMM estimation of the reservation indices with UCB-style confidence bounds for both the indices and the reward evaluator. Through a regret decomposition under optimism, which separates cumulative regret into reward-estimation errors for deployed outputs and index-estimation errors for queried APIs, we derive an $\widetilde{O}\left([d+A(m+\sqrt{dm})]\sqrt{T}\right)$ regret bound.

Our work opens several avenues for future research. First, because our theoretical guarantees rely on generalized linear specifications, extending the moment-based UCB framework to richer nonparametric or neural representations is a natural next step. Achieving this while preserving $\widetilde{O}(\sqrt{T})$ regret would further broaden the model's applicability. Second, our current formulation assumes a standard LLM cascading structure where APIs are queried sequentially and a single output is selected and then deployed. Expanding this framework to allow for batched querying under latency budgets, or to support output ensembling and synthesis, would significantly change the query-selection dynamics. Addressing these richer action spaces raises compelling new questions about the structure of the learning policy.

## AI Disclosure

The authors used generative artificial intelligence tools, including GPT, Gemini, and Claude, to edit and polish earlier versions of the drafts\. All AI\-generated text and suggestions were checked, revised, and approved by the authors\. The authors take full responsibility for the accuracy, integrity, and originality of the submitted work\.

## Acknowledgment

We thank Ali Makhdoumi for helpful discussions on the general theory and intuition of Pandora's Box, and Shreyas Sekar for an early conversation on AI-assisted coding and model selection for reducing LLM API costs that eventually led us to the FrugalGPT paper by Chen et al. ([2025a](https://arxiv.org/html/2606.07392#bib.bib68)). We also thank the participants of the 2026 Marketplace Innovation Workshop for comments and discussion.


## 7 Proofs for Section [3](https://arxiv.org/html/2606.07392#S3)

In this section we prove three results in Section [3](https://arxiv.org/html/2606.07392#S3): the optimality of the oracle reservation index policy under full information (Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1)), point identification of $\rho_a^\top\psi(x_t)$ from the queried samples for box $a$ before period $t$ (Proposition [4.1](https://arxiv.org/html/2606.07392#S4.Thmtheorem1)), and a regret decomposition under optimistic reward and index estimators (Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3)).

The following lemma establishes that, in each period $t$, the query decision is independent of the stochastic output conditional on the historical data and the current context. Although the proof is immediate, the result is used repeatedly in subsequent arguments, so we state it explicitly here.

###### Lemma 7\.1

Given any $t\in[T]$, $x_t\in\mathcal{X}$ and $a\in[A]$, we have $\mathbb{I}\{a\in\mathcal{A}_t\}\perp\!\!\!\perp(\omega_{at},c_{at})\mid x_t,\mathcal{F}_{t-1}$.

###### Proof 7\.2

Proof of Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1). For any $x_t\in\mathcal{X}$ and $a\in[A]$, under both the oracle reservation index policy and the COSMOS algorithm, the event $\{a\in\mathcal{A}_t\}$ is determined by the sequential construction, which depends on the history $\mathcal{F}_{t-1}$, the current context $x_t$, and the outputs of boxes queried before $a$. Under the conditional independence of $\{\omega_{a't}:a'\in[A]\}$ given $x_t$ according to ([1](https://arxiv.org/html/2606.07392#S2.E1)), these preceding outputs are independent of $\omega_{at}$. Further, $c_{at}=c_a(x_t,\omega_{at})$. Hence, $\mathbb{I}\{a\in\mathcal{A}_t\}\perp\!\!\!\perp(\omega_{at},c_{at})\mid x_t,\mathcal{F}_{t-1}$ as claimed. $\blacksquare$

###### Proof 7\.3

Proof of Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1). For $t\in[T]$, let $Q_a^t=\mathbb{I}\{a\in\mathcal{A}_t\}$ and $S_a^t=\mathbb{I}\{a_t=a\}$. Since $\omega_{at}\sim p_a(\cdot\mid x_t)$, ([2](https://arxiv.org/html/2606.07392#S3.E2)) implies

$$\mathbb{E}\left[\left\{\mu^*(x_t,\omega_{at})-\sigma_a^*(x_t)\right\}^+\,\middle|\,x_t,\mathcal{F}_{t-1}\right]=\mathbb{E}[c_{at}\mid x_t,\mathcal{F}_{t-1}],\quad\forall x_t.$$

Thus

$$\begin{aligned}
&\mathbb{E}\left[\sum_{a\in[A]}S_a^t\mu^*(x_t,\omega_{at})-\sum_{a\in[A]}Q_a^t c_{at}\,\middle|\,x_t,\mathcal{F}_{t-1}\right]\\
&\overset{(i)}{=}\mathbb{E}\left[\sum_{a\in[A]}S_a^t\mu^*(x_t,\omega_{at})-\sum_{a\in[A]}Q_a^t\{\mu^*(x_t,\omega_{at})-\sigma_{at}^*\}^+\,\middle|\,x_t,\mathcal{F}_{t-1}\right]\\
&\overset{(ii)}{=}\mathbb{E}\left[\sum_{a\in[A]}S_a^t\min\{\sigma_{at}^*,\mu^*(x_t,\omega_{at})\}+\sum_{a\in[A]}(S_a^t-Q_a^t)\{\mu^*(x_t,\omega_{at})-\sigma_{at}^*\}^+\,\middle|\,x_t,\mathcal{F}_{t-1}\right]\\
&\overset{(iii)}{\leq}\mathbb{E}\left[\sum_{a\in[A]}S_a^t\min\{\sigma_{at}^*,\mu^*(x_t,\omega_{at})\}\,\middle|\,x_t,\mathcal{F}_{t-1}\right]\\
&\overset{(iv)}{\leq}\mathbb{E}\left[\max_{a\in[A]}\min\{\sigma_{at}^*,\mu^*(x_t,\omega_{at})\}\,\middle|\,x_t,\mathcal{F}_{t-1}\right],
\end{aligned}$$

where (i) holds because the independence $Q_a^t\perp\!\!\!\perp(\omega_{at},c_{at})\mid x_t,\mathcal{F}_{t-1}$ from Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1) and the definition of $\sigma_{at}^*$ in ([2](https://arxiv.org/html/2606.07392#S3.E2)) imply

$$\begin{aligned}
\mathbb{E}[Q_a^t c_{at}\mid x_t,\mathcal{F}_{t-1}]&=\mathbb{E}[Q_a^t\mid x_t,\mathcal{F}_{t-1}]\,\mathbb{E}[c_{at}\mid x_t,\mathcal{F}_{t-1}]\\
&=\mathbb{E}[Q_a^t\mid x_t,\mathcal{F}_{t-1}]\,\mathbb{E}[\{\mu^*(x_t,\omega_{at})-\sigma_{at}^*\}^+\mid x_t,\mathcal{F}_{t-1}]\\
&=\mathbb{E}[Q_a^t\{\mu^*(x_t,\omega_{at})-\sigma_{at}^*\}^+\mid x_t,\mathcal{F}_{t-1}],
\end{aligned}$$

(ii) holds by the identity $b=\min\{a,b\}+\{b-a\}^+$, (iii) holds since $S_a^t\leq Q_a^t$, and (iv) holds since $\sum_{a\in[A]}S_a^t=1$ and $S_a^t\geq 0$.

The inequalities (iii) and (iv) hold as equalities under Weitzman's policy with the indices defined by ([2](https://arxiv.org/html/2606.07392#S3.E2)). Indeed, the policy selects $a$ whenever it is queried and $\mu^*(x_t,\omega_{at})>\sigma_{at}^*$, and the selected box attains $\max_{a\in[A]}\min\{\mu^*(x_t,\omega_{at}),\sigma_{at}^*\}$. So the utility $U_t$ of the DM satisfies

$$U_t=\mathbb{E}\left[\sum_{a\in[A]}S_a^t\mu^*(x_t,\omega_{at})-\sum_{a\in[A]}Q_a^t c_{at}\,\middle|\,x_t,\mathcal{F}_{t-1}\right]=\mathbb{E}\left[\max_{a\in[A]}\min\{\sigma_{at}^*,\mu^*(x_t,\omega_{at})\}\,\middle|\,x_t,\mathcal{F}_{t-1}\right],\tag{28}$$

which concludes the proof. $\blacksquare$
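The identity (28) can be checked numerically on a toy instance. The Python sketch below is an illustration under simplifying assumptions, not the paper's setting: there is no context, each box has a small discrete reward distribution, and query costs are constants rather than output-dependent. All names (`reservation_index`, `weitzman_utility`, `BOXES`) are hypothetical. It enumerates every joint outcome exactly and compares the expected utility of the index policy with $\mathbb{E}[\max_a\min\{\sigma_a,\mu_a\}]$.

```python
import math
from itertools import product


def reservation_index(values, probs, cost, lo=-10.0, hi=10.0, iters=200):
    """Solve E[(v - sigma)^+] = cost by bisection; the left side decreases in sigma."""
    def excess(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > cost:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def weitzman_utility(realized, sigmas, costs):
    """Query boxes in decreasing index order; stop once the best value beats the next index."""
    order = sorted(range(len(sigmas)), key=lambda a: -sigmas[a])
    best, paid = float("-inf"), 0.0
    for a in order:
        if best >= sigmas[a]:
            break
        paid += costs[a]
        best = max(best, realized[a])
    return best - paid


def expected_values(boxes):
    """Return (expected utility of the index policy, E[max_a min{sigma_a, v_a}])."""
    sigmas = [reservation_index(b["values"], b["probs"], b["cost"]) for b in boxes]
    costs = [b["cost"] for b in boxes]
    outcomes = [list(zip(b["values"], b["probs"])) for b in boxes]
    policy_value, index_value = 0.0, 0.0
    for joint in product(*outcomes):  # boxes are independent, so probabilities multiply
        prob = math.prod(p for _, p in joint)
        realized = [v for v, _ in joint]
        policy_value += prob * weitzman_utility(realized, sigmas, costs)
        index_value += prob * max(min(s, v) for s, v in zip(sigmas, realized))
    return policy_value, index_value


# Hypothetical toy instance with constant costs.
BOXES = [
    {"values": [0.2, 0.9], "probs": [0.5, 0.5], "cost": 0.10},
    {"values": [0.4, 0.6], "probs": [0.5, 0.5], "cost": 0.05},
    {"values": [0.1, 1.0], "probs": [0.8, 0.2], "cost": 0.20},
]
policy_value, index_value = expected_values(BOXES)
print(policy_value, index_value)
```

The two quantities agree in expectation only; on an individual outcome the realized utility and $\max_a\min\{\sigma_a,\mu_a\}$ generally differ.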

###### Proof 7\.4

Proof of Proposition [4.1](https://arxiv.org/html/2606.07392#S4.Thmtheorem1). For any $x_t\in\mathcal{X}$ and $a\in[A]$, we have $\mathbb{I}\{a\in\mathcal{A}_t\}\perp\!\!\!\perp\omega_{at}\mid x_t,\mathcal{F}_{t-1}$ by Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1). Since $c_{at}=c_a(x_t,\omega_{at})$, we have

$$\begin{aligned}
&\mathbb{E}\left[\{\mu^*(x_t,\omega_{at})-\Lambda(\rho_a^\top\psi(x_t))\}^+-c_{at}\mid x_t,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right]\\
&=\mathbb{E}\left[\{\mu^*(x_t,\omega_{at})-\Lambda(\rho_a^\top\psi(x_t))\}^+-c_{at}\mid x_t,\mathcal{F}_{t-1}\right].
\end{aligned}$$

For any given $a\in[A]$ and $x$, define

$$F_a(x,s):=\mathbb{E}\left[\{\mu^*(x,\omega_{at})-\Lambda(s)\}^+-c_{at}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right].$$

So we have $F_a\left(x,\rho_a^\top\psi(x)\right)=0$ for all $x\in\mathcal{X}$. Note that $\Lambda(\cdot)$ is strictly increasing and takes values in $[-1,1]$, and $c_{at}\in(0,1)$, so for every $x\in\mathcal{X}$, $F_a(x,s)$ is decreasing and continuous in $s$. Note that

$$\begin{aligned}
\lim_{s\to-\infty}F_a(x,s)&=\lim_{s\to-\infty}\mathbb{E}\left[\{\mu^*(x,\omega_{at})-\Lambda(s)\}^+-c_{at}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right]\\
&\overset{(i)}{=}\mathbb{E}\left[\{\mu^*(x,\omega_{at})+1\}^+-c_{at}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right]\\
&\geq\mathbb{E}\left[1-c_{at}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right]\overset{(ii)}{=}\mathbb{E}\left[1-c_{at}\mid x_t=x,\mathcal{F}_{t-1}\right]\overset{(iii)}{>}0,
\end{aligned}$$

where (i) follows from Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), (ii) follows because $c_{at}=c_a(x_t,\omega_{at})$ and $\omega_{at}\perp\!\!\!\perp\mathbb{I}\{a\in\mathcal{A}_t\}\mid x_t,\mathcal{F}_{t-1}$ by Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1), and (iii) follows because $c_{at}\in(0,1)$. Thus $\lim_{s\to-\infty}F_a(x,s)>0$. Similarly, Assumption [5](https://arxiv.org/html/2606.07392#S4.E5) implies $\lim_{s\to+\infty}F_a(x,s)<0$. Thus, by the intermediate value theorem, there exists a $\xi_a(x)\in\mathbb{R}$ such that $F_a(x,\xi_a(x))=0$. Now suppose that there exist $s_1<s_2$ such that $F_a(x,s_1)=F_a(x,s_2)=0$. Then we have

$$0=F_a(x,s_1)-F_a(x,s_2)=\mathbb{E}\left[(\mu^*(x,\omega_{at})-\Lambda(s_1))^+-(\mu^*(x,\omega_{at})-\Lambda(s_2))^+\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right].\tag{29}$$

This implies that

$$\begin{aligned}
\mathbb{E}[(\mu^*(x,\omega_{at})-\Lambda(s_1))^+\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t]&=\mathbb{E}[(\mu^*(x,\omega_{at})-\Lambda(s_2))^+\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t]\\
&=\mathbb{E}[c_{at}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t]>0.
\end{aligned}$$

Since $s_1<s_2$ and $\Lambda$ is strictly increasing, there must exist an event $\mathcal{E}$ on which $\mu^*(x,\omega_{at})>\Lambda(s_2)>\Lambda(s_1)$ and $\mathbb{P}(\mathcal{E}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t)>0$, and

$$\begin{aligned}
F_a(x,s_1)-F_a(x,s_2)&=\mathbb{E}\left[(\mu^*(x,\omega_{at})-\Lambda(s_1))^+-(\mu^*(x,\omega_{at})-\Lambda(s_2))^+\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right]\\
&\geq\mathbb{E}\left[(\Lambda(s_2)-\Lambda(s_1))\mathbb{I}\{\mathcal{E}\}\mid x_t=x,\mathcal{F}_{t-1},a\in\mathcal{A}_t\right]>0,
\end{aligned}$$

which contradicts ([29](https://arxiv.org/html/2606.07392#S7.E29)). So the conditional moment restriction pins down the optimal index $\sigma_a^*(x)=\Lambda(\xi_a(x))$ uniquely, and $\xi_a(x)=\rho_a^\top\psi(x)$ for any $x\in\mathcal{X}$. $\blacksquare$
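The existence and uniqueness argument above can be illustrated with a short Python sketch. It is a toy example under stated assumptions: the link is taken to be $\Lambda=\tanh$ (one strictly increasing map into $[-1,1]$; the paper's link may differ), the reward $\mu^*$ has a hypothetical two-point distribution, and the expected cost is a constant. The function names are illustrative. Bisection locates the sign change of $s\mapsto F(s)$, and a grid check confirms the map never increases.

```python
import math


def moment(s, mu_values, probs, mean_cost, link=math.tanh):
    """F(s) = E[(mu - link(s))^+] - E[c] for a discrete reward distribution."""
    excess = sum(p * max(v - link(s), 0.0) for v, p in zip(mu_values, probs))
    return excess - mean_cost


def solve_index(mu_values, probs, mean_cost, lo=-20.0, hi=20.0, iters=200):
    """Bisection for the root of s -> F(s), which is positive at lo and negative at hi."""
    assert moment(lo, mu_values, probs, mean_cost) > 0 > moment(hi, mu_values, probs, mean_cost)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment(mid, mu_values, probs, mean_cost) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical instance: mu is 0.2 or 0.8 with equal probability, expected cost 0.15.
MU_VALUES, PROBS, MEAN_COST = [0.2, 0.8], [0.5, 0.5], 0.15
s_star = solve_index(MU_VALUES, PROBS, MEAN_COST)
sigma_star = math.tanh(s_star)  # implied reservation index Lambda(s_star)

# F is nonincreasing on a grid; it is flat only where tanh(s) exceeds every
# reward value, and there F equals -E[c] < 0, so the root stays unique.
grid = [-3.0 + 0.01 * k for k in range(601)]
values = [moment(s, MU_VALUES, PROBS, MEAN_COST) for s in grid]
is_nonincreasing = all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
print(s_star, sigma_star, is_nonincreasing)
```

In this instance the root satisfies $0.5\,(0.8-\Lambda(s))=0.15$, so the implied index is $\Lambda(s^\star)=0.5$.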

###### Proof 7\.5

Proof of Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). Define $U_t(\tilde{\pi}):=\mu(\theta_*;x_t,\omega_{a_t t})-\sum_{a\in\mathcal{A}_t}c_{at}$, where $\tilde{\pi}$ is the COSMOS policy. For any $a\in[A]$, $t\in[T]$, let $\tilde{c}_{at}:=(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})^+$ be an auxiliary pseudo-cost. Let $\tilde{U}_t(\tilde{\pi})$ denote the realized utility of $\tilde{\pi}$ under cost $\tilde{c}_{at}$, i.e.

$$\tilde{U}_t(\tilde{\pi}):=\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})-\sum_{a\in\mathcal{A}_t}\tilde{c}_{at}.$$

By ([3](https://arxiv.org/html/2606.07392#S3.E3)) in Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1), we have

$$\begin{aligned}
\mathbb{E}[U_t(\pi^*)\mid x_t,\mathcal{F}_{t-1}]&=\mathbb{E}\big[\max_{a\in[A]}\min\{\mu^*(x_t,\omega_{at}),\sigma_{at}^*\}\mid x_t,\mathcal{F}_{t-1}\big],\\
\mathbb{E}\big[\tilde{U}_t(\tilde{\pi})\mid x_t,\mathcal{F}_{t-1}\big]&=\mathbb{E}\big[\max_{a\in[A]}\min\{\widetilde{\mu}_t(x_t,\omega_{at}),\widetilde{\sigma}_{at}\}\mid x_t,\mathcal{F}_{t-1}\big].
\end{aligned}$$

Note that, pathwise for the same execution of $\tilde{\pi}$,

$$U_t(\tilde{\pi})=\tilde{U}_t(\tilde{\pi})-\big(\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu(\theta_*;x_t,\omega_{a_t t})\big)-\sum_{a\in\mathcal{A}_t}(c_{at}-\tilde{c}_{at}).$$

Therefore,

$$\begin{aligned}
\Delta_t(\tilde{\pi})&=\mathbb{E}[U_t(\pi^*)\mid x_t,\mathcal{F}_{t-1}]-\mathbb{E}[U_t(\tilde{\pi})\mid x_t,\mathcal{F}_{t-1}]\\
&=\mathbb{E}\big[\max_{a\in[A]}\min\{\mu^*(x_t,\omega_{at}),\sigma_{at}^*\}\mid x_t,\mathcal{F}_{t-1}\big]-\mathbb{E}\big[\max_{a\in[A]}\min\{\widetilde{\mu}_t(x_t,\omega_{at}),\widetilde{\sigma}_{at}\}\mid x_t,\mathcal{F}_{t-1}\big]\\
&\quad+\mathbb{E}[\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu(\theta_*;x_t,\omega_{a_t t})\mid x_t,\mathcal{F}_{t-1}]+\mathbb{E}\bigg[\sum_{a\in\mathcal{A}_t}(c_{at}-\tilde{c}_{at})\mid x_t,\mathcal{F}_{t-1}\bigg].
\end{aligned}\tag{30}$$

Since $\widetilde{\mu}_t(x_t,\omega_{at})\geq\mu^*(x_t,\omega_{at})$ and $\widetilde{\sigma}_{at}\geq\sigma_{at}^*$ a.s., we have $\min\{\widetilde{\mu}_t(x_t,\omega_{at}),\widetilde{\sigma}_{at}\}\geq\min\{\mu^*(x_t,\omega_{at}),\sigma_{at}^*\}$ a.s. for every $a\in[A]$. So

$$\mathbb{E}\big[\max_{a\in[A]}\min\{\mu^*(x_t,\omega_{at}),\sigma_{at}^*\}\mid x_t,\mathcal{F}_{t-1}\big]-\mathbb{E}\big[\max_{a\in[A]}\min\{\widetilde{\mu}_t(x_t,\omega_{at}),\widetilde{\sigma}_{at}\}\mid x_t,\mathcal{F}_{t-1}\big]\leq 0.\tag{31}$$

Further, for each $a\in[A]$,

$$\begin{aligned}
\mu^*(x_t,\omega_{at})-\sigma_{at}^*&=(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})+(\mu^*(x_t,\omega_{at})-\widetilde{\mu}_t(x_t,\omega_{at}))+(\widetilde{\sigma}_{at}-\sigma_{at}^*)\\
&\leq(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})+(\widetilde{\sigma}_{at}-\sigma_{at}^*).
\end{aligned}$$

Taking positive parts and using the monotonicity and subadditivity of $x\mapsto x^+$ gives

$$(\mu^*(x_t,\omega_{at})-\sigma_{at}^*)^+\leq(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})^++(\widetilde{\sigma}_{at}-\sigma_{at}^*)^+=(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})^++(\widetilde{\sigma}_{at}-\sigma_{at}^*).$$

Hence

$$\begin{aligned}
\mathbb{E}[c_{at}-\tilde{c}_{at}\mid x_t,\mathcal{F}_{t-1}]&=\mathbb{E}[(\mu^*(x_t,\omega_{at})-\sigma_{at}^*)^+\mid x_t,\mathcal{F}_{t-1}]-\mathbb{E}[(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})^+\mid x_t,\mathcal{F}_{t-1}]\\
&\leq\widetilde{\sigma}_{at}-\sigma_{at}^*.
\end{aligned}\tag{32}$$

Thus

$$\begin{aligned}
\mathbb{E}\left[\sum_{a\in\mathcal{A}_t}(c_{at}-\tilde{c}_{at})\,\middle|\,x_t,\mathcal{F}_{t-1}\right]&=\sum_{a\in[A]}\mathbb{E}[(c_{at}-\tilde{c}_{at})\mathbb{I}\{a\in\mathcal{A}_t\}\mid x_t,\mathcal{F}_{t-1}]\\
&\overset{(i)}{=}\sum_{a\in[A]}\mathbb{E}[c_{at}-\tilde{c}_{at}\mid x_t,\mathcal{F}_{t-1}]\,\mathbb{E}[\mathbb{I}\{a\in\mathcal{A}_t\}\mid x_t,\mathcal{F}_{t-1}]\\
&\overset{(ii)}{\leq}\sum_{a\in[A]}(\widetilde{\sigma}_{at}-\sigma_{at}^*)\,\mathbb{E}[\mathbb{I}\{a\in\mathcal{A}_t\}\mid x_t,\mathcal{F}_{t-1}]\\
&\leq\mathbb{E}\left[\sum_{a\in\mathcal{A}_t}(\widetilde{\sigma}_{at}-\sigma_{at}^*)\,\middle|\,x_t,\mathcal{F}_{t-1}\right],
\end{aligned}\tag{33}$$

where (i) of ([33](https://arxiv.org/html/2606.07392#S7.E33)) follows since $c_{at}=c_a(x_t,\omega_{at})$, $\tilde{c}_{at}=(\widetilde{\mu}_t(x_t,\omega_{at})-\widetilde{\sigma}_{at})^+$ and $\mathbb{I}\{a\in\mathcal{A}_t\}\perp\!\!\!\perp\omega_{at}\mid x_t,\mathcal{F}_{t-1}$ by Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1), and (ii) follows from ([32](https://arxiv.org/html/2606.07392#S7.E32)).
The result then follows from ([30](https://arxiv.org/html/2606.07392#S7.E30)), ([31](https://arxiv.org/html/2606.07392#S7.E31)), and ([33](https://arxiv.org/html/2606.07392#S7.E33)). $\blacksquare$

## 8 Regret under Known Reward Function

In this section, we provide the regret analysis when the reward parameter $\theta_*$ is known. We begin with the technical lemmas used in the analysis.

### 8.1 Technical Lemmas

For any $a\in[A]$, $t\in[T]$, let

$$m_{at}(u):=\mathbb{E}\bigl[(\mu^*(x_t,\omega_{at})-\Lambda(u))^+-c_{at}\mid\mathcal{F}_{t-1},x_t\bigr].\tag{34}$$

Then $m_{at}(\rho_a^\top\psi(x_t))=0$.

###### Lemma 8\.1

Assume that, for every $t$ and for some constant $\eta>0$,

$$\mathbb{P}\bigl(\mu^*(x_t,\omega_{at})>\Lambda(u)\mid\mathcal{F}_{t-1},x_t\bigr)\,\Lambda'(u)\geq\eta,\quad\forall\,|u|\leq\overline{\iota}.\tag{35}$$

Then, for every $u\in[-\overline{\iota},\overline{\iota}]$,

$$-(u-\rho_a^\top\psi(x_t))\,m_{at}(u)\geq\eta\,(u-\rho_a^\top\psi(x_t))^2\qquad\text{a.s.}\tag{36}$$

Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1) says that the reservation-value moment crosses zero with a slope bounded away from zero. If the candidate threshold $u$ is above the true threshold $\rho_a^\top\psi(x_t)$, then the expected excess value is too small relative to cost, so the moment $c-(\mu^*-\Lambda(u))^+$ is positive in expectation. If $u$ is below the true threshold, then the expected excess value is too large relative to cost, so the moment is negative in expectation. This monotonicity is what ultimately makes the loss locally curved around $\rho_a$.

###### Proof 8\.2

Proof of Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1). Fix $t$ and $u\in I=[-\overline{\iota},\overline{\iota}]$. For any real number $a$ and any continuously differentiable function $g$, the map $v\mapsto(a-g(v))^+$ is absolutely continuous and has a.e. derivative $\frac{d}{dv}(a-g(v))^+=-\mathbb{I}\{a>g(v)\}g'(v)$. Applying the fundamental theorem of calculus with $g=\Lambda$, we obtain for any $u,v\in\mathbb{R}$,

$$(\mu^*(x_t,\omega_{at})-\Lambda(u))^+-(\mu^*(x_t,\omega_{at})-\Lambda(v))^+=-\int_v^u\mathbb{I}\{\mu^*(x_t,\omega_{at})>\Lambda(r)\}\Lambda'(r)\,dr.$$

Taking conditional expectations given $\mathcal{F}_{t-1}$ and $x_t$ yields

$$m_{at}(u)-m_{at}(v)=-\int_v^u\mathbb{P}\bigl(\mu^*(x_t,\omega_{at})>\Lambda(r)\mid\mathcal{F}_{t-1},x_t\bigr)\Lambda'(r)\,dr.$$

Since $m_{at}(\rho_a^\top\psi(x_t))=0$, setting $v=\rho_a^\top\psi(x_t)$ gives

$$m_{at}(u)=-\int_{\rho_a^\top\psi(x_t)}^{u}\mathbb{P}\bigl(\mu^*(x_t,\omega_{at})>\Lambda(r)\mid\mathcal{F}_{t-1},x_t\bigr)\Lambda'(r)\,dr.$$

If $u\geq\rho_a^\top\psi(x_t)$, then ([35](https://arxiv.org/html/2606.07392#S8.E35)) implies $m_{at}(u)\leq-\int_{\rho_a^\top\psi(x_t)}^{u}\eta\,dr=-\eta\left(u-\rho_a^\top\psi(x_t)\right)$, which proves ([36](https://arxiv.org/html/2606.07392#S8.E36)). If $u\leq\rho_a^\top\psi(x_t)$, then

$$m_{at}(u)=\int_{u}^{\rho_a^\top\psi(x_t)}\mathbb{P}\bigl(\mu^*(x_t,\omega_{at})>\Lambda(r)\mid\mathcal{F}_{t-1},x_t\bigr)\Lambda'(r)\,dr\overset{(i)}{\geq}\int_{u}^{\rho_a^\top\psi(x_t)}\eta\,dr=\eta\left(\rho_a^\top\psi(x_t)-u\right),$$

where (i) follows from ([35](https://arxiv.org/html/2606.07392#S8.E35)), and ([36](https://arxiv.org/html/2606.07392#S8.E36)) follows again. $\blacksquare$
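As a sanity check of the one-point curvature inequality (36), the following Python sketch works through a fully analytic toy case. Everything specific here is an assumption made for illustration: the reward is $\mu\sim\mathrm{Uniform}[0,1]$, the link is $\Lambda=\tanh$, the cost is the constant $0.125$, and $\overline{\iota}=1.5$. Under these choices $\mathbb{E}[(\mu-\lambda)^+]$ has a closed form, the true threshold is $u^\star=\operatorname{atanh}(0.5)$, and the constant $\eta$ in (35) is attained at the right endpoint $u=\overline{\iota}$.

```python
import math

COST = 0.125  # constant query cost (assumption)
IOTA = 1.5    # bound on |u| (assumption)


def excess(lam):
    """E[(mu - lam)^+] for mu ~ Uniform[0, 1]."""
    if lam <= 0.0:
        return 0.5 - lam
    if lam >= 1.0:
        return 0.0
    return 0.5 * (1.0 - lam) ** 2


def moment(u):
    """m(u) = E[(mu - tanh(u))^+] - cost, the analogue of (34)."""
    return excess(math.tanh(u)) - COST


def slope_lower_bound(u):
    """P(mu > tanh(u)) * tanh'(u), the quantity bounded below in (35)."""
    lam = math.tanh(u)
    tail = 1.0 if lam <= 0.0 else 1.0 - lam
    return tail * (1.0 - lam ** 2)


u_star = math.atanh(0.5)  # root of m: (1 - 0.5)^2 / 2 = 0.125
grid = [-IOTA + 2.0 * IOTA * k / 600 for k in range(601)]  # includes both endpoints

# In this example the product in (35) is smallest at u = IOTA, which is on the grid.
eta = min(slope_lower_bound(u) for u in grid)

# Check (36): -(u - u*) m(u) >= eta (u - u*)^2 on the whole grid.
holds = all(
    -(u - u_star) * moment(u) >= eta * (u - u_star) ** 2 - 1e-12 for u in grid
)
print(u_star, eta, holds)
```

The inequality is far from tight near $u^\star$, where the local slope is about $0.375$ while $\eta\approx 0.017$; the lemma only needs a uniform lower bound over the whole interval.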

We next prove Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) using Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1):

Lemma (Restatement of Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2)). Under Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1), for any $t\in[T]$, $a\in[A]$ and $\rho\in\mathcal{B}$, $\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid x_t,\mathcal{F}_{t-1}]\geq\frac{1}{2}\kappa\mu_1\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^2$.

###### Proof 8\.3

Proof of Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2). The scalar map $z\mapsto\int_0^z\bigl(c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(u))^+\bigr)\,du$ is differentiable, with derivative $c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(z))^+$. Therefore

$$\nabla\ell_{at}(\rho)=\psi(x_t)\Bigl(c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(\rho^\top\psi(x_t)))^+\Bigr).\tag{37}$$

The derivative of $z\mapsto c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(z))^+$ is $\mathbb{I}\{\mu^*(x_t,\omega_{at})>\Lambda(z)\}\Lambda'(z)\geq 0$ a.e., so the derivative of the scalar map is nondecreasing. The scalar map is therefore convex, and thus $\ell_{at}(\rho)$ is convex in $\rho^\top\psi(x_t)$, hence convex in $\rho$. Taking conditional expectations in ([37](https://arxiv.org/html/2606.07392#S8.E37)) gives $\nabla\mathbb{E}[\ell_{at}(\rho)\mid x_t,\mathcal{F}_{t-1}]=-\psi(x_t)\,m_{at}(\psi(x_t)^\top\rho)$. By convexity of $\mathcal{B}$, the entire segment $\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)=\psi(x_t)^\top(\rho_a+s(\rho-\rho_a))$, $s\in[0,1]$, lies in $I=[-\bar{\iota},\bar{\iota}]$ by Assumption [4.1](https://arxiv.org/html/2606.07392#S4.SS1). Using the fundamental theorem of calculus along the line segment from $\rho_a$ to $\rho$,

$$\begin{aligned}
\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid\mathcal{F}_{t-1},x_t]&=\int_0^1(\rho-\rho_a)^\top\nabla\mathbb{E}[\ell_{at}(\rho_a+s(\rho-\rho_a))\mid\mathcal{F}_{t-1},x_t]\,ds\\
&=-\int_0^1\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)\,m_{at}\bigl(\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)\bigr)\,ds.
\end{aligned}$$

Applying Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1) with $u=\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)$ yields

$$-\bigl(s\,\psi(x_t)^{\top}(\rho-\rho_a)\bigr)\,m_{at}\bigl(\psi(x_t)^{\top}\rho_a+s\,\psi(x_t)^{\top}(\rho-\rho_a)\bigr)\geq\kappa\mu_1 s^{2}\bigl(\psi(x_t)^{\top}(\rho-\rho_a)\bigr)^{2},$$

where the inequality follows because $\mathbb{P}\bigl(\mu^{*}(x_t,\omega_{at})>\Lambda(u)\mid\mathcal{F}_{t-1},x_t\bigr)\,\Lambda'(u)\geq\kappa\mu_1$ according to Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1). For $s>0$, divide by $s$ to obtain

$$-\bigl(\psi(x_t)^{\top}(\rho-\rho_a)\bigr)\,m_{at}\bigl(\psi(x_t)^{\top}\rho_a+s\,\psi(x_t)^{\top}(\rho-\rho_a)\bigr)\geq\kappa\mu_1 s\bigl(\psi(x_t)^{\top}(\rho-\rho_a)\bigr)^{2}.$$

Integrating over $s\in[0,1]$ yields

$$\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid\mathcal{F}_{t-1},x_t]\geq\int_0^1\kappa\mu_1 s\bigl(\psi(x_t)^{\top}(\rho-\rho_a)\bigr)^{2}\,ds=\frac{\kappa\mu_1}{2}\bigl(\psi(x_t)^{\top}(\rho-\rho_a)\bigr)^{2},$$

as claimed. $\blacksquare$
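The curvature bound just proved can be sanity-checked on a toy instance. The sketch below is illustrative only and every modeling choice in it is an assumption that is not taken from the paper: a single scalar index $z$ stands in for $\rho^{\top}\psi(x_t)$, the link is $\Lambda(u)=u$, the reward is $\mu^{*}\sim\mathrm{Uniform}[0,1]$, and the cost is a constant $c$. Under these choices the expected primitive loss has the closed form $F(z)=cz+(1-z)^{3}/6$, and on $[0,0.8]$ the product $\mathbb{P}(\mu^{*}>z)\Lambda'(z)=1-z$ is at least $0.2$, which plays the role of $\kappa\mu_1$.

```python
def excess_loss(z: float, z_star: float, c: float) -> float:
    """Expected excess primitive loss F(z) - F(z_star) in the toy model.

    Toy assumptions (not from the paper): Lambda(u) = u and mu ~ Uniform[0, 1],
    so F'(z) = c - E[(mu - z)^+] = c - (1 - z)**2 / 2 on [0, 1],
    hence F(z) = c * z + (1 - z)**3 / 6 up to an additive constant.
    """
    def F(v: float) -> float:
        return c * v + (1.0 - v) ** 3 / 6.0
    return F(z) - F(z_star)


c = 0.08                           # hypothetical constant query cost
z_star = 1.0 - (2.0 * c) ** 0.5    # solves E[(mu - z)^+] = c, i.e. (1 - z)^2 / 2 = c
curvature = 0.2                    # lower bound on P(mu > z) * Lambda'(z) over [0, 0.8]

# Gap between the excess loss and the quadratic lower bound on a grid of indices.
grid = [0.8 * i / 200 for i in range(201)]
gaps = [excess_loss(z, z_star, c) - 0.5 * curvature * (z - z_star) ** 2 for z in grid]
min_gap = min(gaps)
```

In this toy model `min_gap` is nonnegative up to floating-point error, matching the inequality $\mathbb{E}[\ell(\rho)-\ell(\rho_a)]\geq\frac{\kappa\mu_1}{2}(\cdot)^2$ with $\kappa\mu_1$ replaced by `curvature`.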

###### Lemma 8\.4

Fix any $a\in[A]$ and $t\in[T]$. For any $\rho_t$ adapted to $\mathcal{F}_{t-1}$, we have

$$\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\geq\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^{\top}(\rho_t-\rho_a)]^{2}-\frac{1}{n_{at}}.$$

###### Proof 8\.5

Proof of Lemma [8.4](https://arxiv.org/html/2606.07392#S8.Thmtheorem4). Set

$$\epsilon=\frac{1}{2n_{at}\bar{C}_{\psi}(\kappa\mu_1\bar{\iota}+1)}\tag{38}$$

and take a Euclidean $\epsilon$-net $\mathcal{N}_{\epsilon}$ of $\mathcal{B}$. According to Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1), for any $s\in[T]$ we have $\mathbb{P}\bigl(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))>\Lambda(u)\mid x_s,\mathcal{F}_{s-1}\bigr)\Lambda'(u)\geq\kappa\mu_1>0$ for all $|u|\leq\bar{\iota}$. Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) implies that, given any $s\in[T]$ and $\bar{\rho}\in\mathcal{N}_{\epsilon}$, it always holds that $\mathbb{E}\left[\ell_{as}(\bar{\rho})-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\geq\frac{\kappa\mu_1}{2}[\psi(x_s)^{\top}(\bar{\rho}-\rho_a)]^{2}$. Since $\mathcal{B}$ has diameter $d_{\mathcal{B}}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), $\mathcal{N}_{\epsilon}$ is finite. So after intersecting finitely many probability-one events, we have with probability one, simultaneously for all $\rho\in\mathcal{N}_{\epsilon}$ and $s\in[T]$,

$$\mathbb{E}\left[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\geq\frac{\kappa\mu_1}{2}[\psi(x_s)^{\top}(\rho-\rho_a)]^{2}.\tag{39}$$

Choose $\rho_{\epsilon}\in\mathcal{N}_{\epsilon}$ such that $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$. Note that

$$\begin{array}{rl}&\quad\displaystyle\left|[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)]-[\ell_{as}(\rho_{\epsilon})-\ell_{as}(\rho_a)]\right|\\ &\displaystyle=|\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})|\\ &\displaystyle=_{(i)}\left|\int_{\rho_{\epsilon}^{\top}\psi(x_s)}^{\rho_t^{\top}\psi(x_s)}\bigl[c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}\bigr]\,du\right|\\ &\displaystyle\leq_{(ii)}2|(\rho_t-\rho_{\epsilon})^{\top}\psi(x_s)|\leq_{(iii)}2\bar{C}_{\psi}\epsilon,\end{array}\tag{40}$$

where (i) follows by definition of $\ell_{as}(\rho)$; (ii) follows since, by definition, $c_{as}\in[0,1]$, $G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))\in[0,1]$, and $\Lambda(u)\in[-1,1]$, so $|c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}|\leq 2$; and (iii) follows by the Cauchy-Schwarz inequality, the fact that $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$, and $\|\psi(x_s)\|_2\leq\bar{C}_{\psi}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5). Hence

$$\begin{array}{rl}
&\displaystyle\quad\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\\
&\displaystyle=\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})+\ell_{as}(\rho_{\epsilon})-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\\
&\displaystyle\geq_{(1)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^{\top}(\rho_{\epsilon}-\rho_a)]^{2}-2\bar{C}_{\psi}\epsilon\\
&\displaystyle=\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^{\top}(\rho_{\epsilon}-\rho_t)+\psi(x_s)^{\top}(\rho_t-\rho_a)]^{2}-2\bar{C}_{\psi}\epsilon\\
&\displaystyle\geq_{(2)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\left\{[\psi(x_s)^{\top}(\rho_t-\rho_a)]^{2}-2\epsilon\bar{C}_{\psi}|\psi(x_s)^{\top}(\rho_t-\rho_a)|\right\}-2\bar{C}_{\psi}\epsilon\\
&\displaystyle\geq_{(3)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\left\{[\psi(x_s)^{\top}(\rho_t-\rho_a)]^{2}-4\epsilon\bar{C}_{\psi}\bar{\iota}\right\}-2\bar{C}_{\psi}\epsilon\\
&\displaystyle=\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^{\top}(\rho_t-\rho_a)]^{2}-2(\kappa\mu_1\bar{\iota}+1)\bar{C}_{\psi}\epsilon\\
&\displaystyle=_{(4)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^{\top}(\rho_t-\rho_a)]^{2}-\frac{1}{n_{at}},
\end{array}$$

where (1) follows from ([39](https://arxiv.org/html/2606.07392#S8.E39)) and ([40](https://arxiv.org/html/2606.07392#S8.E40)); (2) follows from applying the Cauchy-Schwarz inequality to $\psi(x_s)^{\top}(\rho_{\epsilon}-\rho_t)$ together with $\|\rho_{\epsilon}-\rho_t\|_2\leq\epsilon$ and $\|\psi(x_s)\|_2\leq\bar{C}_{\psi}$; (3) follows since $\psi(x_s)^{\top}\rho_t\in[-\bar{\iota},\bar{\iota}]$ and $\psi(x_s)^{\top}\rho_a\in[-\bar{\iota},\bar{\iota}]$; and (4) follows from ([38](https://arxiv.org/html/2606.07392#S8.E38)). $\blacksquare$
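The role of the net radius in (38) is purely arithmetic: it is chosen so that the two discretization errors in the chain above add up to exactly $1/n_{at}$. The following minimal sketch checks that bookkeeping numerically. The function names and the sample constants are hypothetical; only the formulas come from the proof.

```python
def net_radius(n_at: int, C_psi: float, kappa_mu1: float, iota_bar: float) -> float:
    """Net radius epsilon as chosen in (38)."""
    return 1.0 / (2.0 * n_at * C_psi * (kappa_mu1 * iota_bar + 1.0))


def discretization_slack(eps: float, C_psi: float, kappa_mu1: float, iota_bar: float) -> float:
    """Total slack lost when replacing rho_t by its nearest net point.

    quadratic_part: (kappa*mu_1 / 2) * 4 * eps * C_psi * iota_bar, from steps (2)-(3)
    lipschitz_part: 2 * C_psi * eps, from (40)
    """
    quadratic_part = 0.5 * kappa_mu1 * 4.0 * eps * C_psi * iota_bar
    lipschitz_part = 2.0 * C_psi * eps
    return quadratic_part + lipschitz_part


# Hypothetical constants, chosen only to exercise the identity.
n_at, C_psi, kappa_mu1, iota_bar = 250, 1.5, 0.3, 2.0
eps = net_radius(n_at, C_psi, kappa_mu1, iota_bar)
slack = discretization_slack(eps, C_psi, kappa_mu1, iota_bar)
```

Here `slack` equals `1 / n_at` up to rounding, which is the additive term in Lemma 8.4.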

###### Lemma 8.6 (Freedman's Inequality (Freedman [1975](https://arxiv.org/html/2606.07392#bib.bib2)))

Consider a real-valued martingale $Y_k=\sum_{j=1}^{k}X_j$ with $Y_0=0$ and difference sequence $\{X_k:k=1,2,3,\ldots\}$. Assume that $X_k\leq R$ almost surely for $k\geq 1$, where $R$ is a constant. Let $W_k:=\sum_{j=1}^{k}\mathbb{E}[X_j^{2}\mid\mathcal{F}_{j-1}]$ for $k\geq 1$. Then for all $t\geq 1$ and $\sigma^{2}>0$,

$$\mathbb{P}\left(\exists k\geq 0:\ Y_k\geq t\ \text{and}\ W_k\leq\sigma^{2}\right)\leq\exp\left\{-\frac{t^{2}/2}{\sigma^{2}+Rt/3}\right\}.$$
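To make the shape of Freedman's bound concrete, the sketch below evaluates its right-hand side and compares it with a simulated exceedance frequency. The simulation setup is an assumption made for illustration: the martingale is a simple random walk with Rademacher steps, so $R=1$ and $W_k=k$, and restricting to $k\leq n$ corresponds to $\sigma^{2}=n$. The function names are hypothetical.

```python
import math
import random


def freedman_bound(t: float, sigma2: float, R: float) -> float:
    """Right-hand side of Freedman's inequality: exp(-(t^2/2) / (sigma2 + R*t/3))."""
    return math.exp(-(t * t / 2.0) / (sigma2 + R * t / 3.0))


def simulate_exceedance(n_steps: int, t: float, trials: int, seed: int = 0) -> float:
    """Fraction of Rademacher random walks that reach level t within n_steps.

    For +/-1 steps, X_k <= 1 and W_k = k, so the event {exists k: Y_k >= t, W_k <= n_steps}
    is exactly {the walk reaches t within n_steps}.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = 0.0
        for _ in range(n_steps):
            y += rng.choice((-1.0, 1.0))
            if y >= t:
                hits += 1
                break
    return hits / trials


n_steps, level = 100, 30.0
bound = freedman_bound(level, sigma2=float(n_steps), R=1.0)
freq = simulate_exceedance(n_steps, level, trials=2000)
```

In this setup the empirical frequency `freq` should sit below `bound`, which is what the inequality guarantees for the true probability; the bound is not tight here, and the comparison only illustrates the direction of the inequality.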

Lemma (Restatement of Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3)). Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. Then given any constant $c_0>0$ and any $a\in[A]$, with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, for any $\rho_t$ adapted to $\mathcal{F}_{t-1}$, we have

$$\begin{array}{rl}&\quad\displaystyle-\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho_t)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s]\right\}\\ &\displaystyle\leq\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+\left(\frac{144}{c_0}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{c_0}{8},\end{array}$$

where $\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_2\left(1+n_{at}\min\{9\bar{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\right)\right\rceil+1\right)+\log\left(6T/\delta\right)$.

###### Proof 8\.7

Proof of Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3). For any $s\in\mathcal{S}_{at}$, define $\mathcal{G}_{s-1}:=\sigma(\mathcal{F}_{s-1},x_s)$. Let

$$H_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\right\}.$$

Note that $-2\leq c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}\leq 1$, so for any $\rho\in\mathcal{B}$,

$$\begin{array}{rl}&\quad\displaystyle\left|\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\right|\\ &\displaystyle=\bigg|\int_{\rho_a^{\top}\psi(x_s)}^{\rho^{\top}\psi(x_s)}\bigl[c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}\bigr]\,du\\ &\displaystyle\quad\quad-\mathbb{E}\bigg[\int_{\rho_a^{\top}\psi(x_s)}^{\rho^{\top}\psi(x_s)}\bigl[c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}\bigr]\,du\ \Big|\ \mathcal{G}_{s-1}\bigg]\bigg|\\ &\displaystyle\leq 3|(\rho-\rho_a)^{\top}\psi(x_s)|\leq\min\{6\bar{\iota},3\bar{C}_{\psi}d_{\mathcal{B}}\}.\end{array}$$

Let $W_t(\rho):=\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\}^{2}\mid\mathcal{G}_{s-1}\right]$. So for any $\rho\in\mathcal{B}$,

$$W_t(\rho)\leq 9\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2},\tag{41}$$

where $(\rho-\rho_a)^{\top}\psi(x_s)$ is $\mathcal{G}_{s-1}$-measurable.

Note that for any $\rho\in\mathcal{B}$, $\rho^{\top}\psi(x_s)\in[-\bar{\iota},\bar{\iota}]$, so $|(\rho-\rho_a)^{\top}\psi(x_s)|\leq 2\bar{\iota}\leq 3\bar{\iota}$, and $|(\rho-\rho_a)^{\top}\psi(x_s)|\leq\|\rho-\rho_a\|_2\|\psi(x_s)\|_2\leq d_{\mathcal{B}}\bar{C}_{\psi}$; thus $\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\leq n_{at}\min\{9\bar{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}$. Let

$$Q_{at}:=\left\lceil\log_2\left(1+n_{at}\min\{9\bar{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\right)\right\rceil.\tag{42}$$

Consider the events $\{\mathcal{E}_q\}$ for $q\in\{0\}\cup[Q_{at}]$, where

$$\mathcal{E}_q:=\begin{cases}\displaystyle\Big\{2^{q-1}<\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\leq 2^{q}\Big\}&\text{if }q\geq 1,\\ \displaystyle\Big\{0\leq\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\leq 1\Big\}&\text{if }q=0.\end{cases}\tag{43}$$

Fix any $x>0$. On the event $\mathcal{E}_q$, $W_t(\rho)\leq 9\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\leq 9\times 2^{q}$. So applying Freedman's inequality (Lemma [8.6](https://arxiv.org/html/2606.07392#S8.Thmtheorem6)) with $\sigma^{2}=9\times 2^{q}$, we have

$$\mathbb{P}\left(-H_{at}(\rho)\geq 3\sqrt{2}\sqrt{2^{q}x}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x,\ \sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\leq 2^{q}\right)\leq e^{-x}.$$

Additionally, on the event $\mathcal{E}_q$, $6\sqrt{x\Bigl(1+\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\Bigr)}\geq 6\sqrt{2^{q-1}x}=3\sqrt{2}\sqrt{2^{q}x}$. Thus on the event $\mathcal{E}_q$,

$$\begin{array}{rl}&\quad\displaystyle-H_{at}(\rho)\geq 6\sqrt{x\bigg(1+\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}\bigg)}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x\\ &\displaystyle\Rightarrow-H_{at}(\rho)\geq 3\sqrt{2}\sqrt{2^{q}x}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x.\end{array}$$

Summing the probability bound above over $q\in\{0\}\cup[Q_{at}]$, we have

$$\mathbb{P}\left(-H_{at}(\rho)\geq 6\sqrt{x+x\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x\right)\leq 2(Q_{at}+1)e^{-x}.$$

Setting $x=\log\left(2(Q_{at}+1)/\delta\right)$ above, for any fixed $\rho\in\mathcal{B}$, with probability at least $1-\delta$ we have

$$-H_{at}(\rho)<6\sqrt{\log\left(\frac{2(Q_{at}+1)}{\delta}\right)\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\log\left(\frac{2(Q_{at}+1)}{\delta}\right).\tag{44}$$

Set

$$\epsilon=\frac{1}{T\bar{C}_{\psi}}\tag{45}$$

and take a Euclidean $\epsilon$-net $\mathcal{N}_{\epsilon}$ of $\mathcal{B}$. Since $\mathcal{B}$ has diameter $d_{\mathcal{B}}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5),

$$|\mathcal{N}_{\epsilon}|\leq\left(1+\frac{2d_{\mathcal{B}}}{\epsilon}\right)^{m}=(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)^{m}.$$

Note that ([44](https://arxiv.org/html/2606.07392#S8.E44)) further implies that with probability at least $1-\delta/3$, uniformly over all $\rho\in\mathcal{N}_{\epsilon}$ and $t\in[T]$,

$$\begin{array}{rl}-H_{at}(\rho)&\displaystyle<6\sqrt{\log\left(\frac{6T|\mathcal{N}_{\epsilon}|(Q_{at}+1)}{\delta}\right)\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}}\\ &\quad\displaystyle+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\log\left(\frac{6T|\mathcal{N}_{\epsilon}|(Q_{at}+1)}{\delta}\right)\\ &\displaystyle\leq 6\sqrt{\Gamma_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^{\top}\psi(x_s)\}^{2}}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta),\end{array}\tag{46}$$

where

$$\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(Q_{at}+1)+\log\left(6T/\delta\right).\tag{47}$$

Let $\rho_{\epsilon}\in\mathcal{N}_{\epsilon}$ satisfy $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$. Note that

$$\begin{array}{rl}&\quad\displaystyle\left|[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)]-[\ell_{as}(\rho_{\epsilon})-\ell_{as}(\rho_a)]\right|\\ &\displaystyle=|\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})|=_{(i)}\left|\int_{\rho_{\epsilon}^{\top}\psi(x_s)}^{\rho_t^{\top}\psi(x_s)}\bigl[c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}\bigr]\,du\right|\\ &\displaystyle\leq_{(ii)}2|(\rho_t-\rho_{\epsilon})^{\top}\psi(x_s)|\leq_{(iii)}2\bar{C}_{\psi}\epsilon\leq_{(iv)}\frac{2}{n_{at}},\end{array}$$

where (i) follows by definition of $\ell_{as}(\rho)$ in ([8](https://arxiv.org/html/2606.07392#S4.E8)); (ii) follows since, by definition,

$$|c_{as}-(G(\theta_{*}^{\top}\phi(x_s,\omega_{as}))-\Lambda(u))^{+}|\leq 2;$$

(iii) follows by the Cauchy-Schwarz inequality, the fact that $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$, and $\|\psi(x_s)\|_2\leq\bar{C}_{\psi}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5); and (iv) follows by ([45](https://arxiv.org/html/2606.07392#S8.E45)) together with $n_{at}\leq T$. The above inequality implies that

$$|H_{at}(\rho_t)-H_{at}(\rho_{\epsilon})|=\left|\sum_{s\in\mathcal{S}_{at}}\Bigl\{[\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})]-\mathbb{E}[\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})\mid\mathcal{G}_{s-1}]\Bigr\}\right|\leq\sum_{s\in\mathcal{S}_{at}}\frac{4}{n_{at}}\leq 4.$$

Therefore,

$$|H_{at}(\rho_t)-H_{at}(\rho_{\epsilon})|\leq 4.\tag{48}$$

Note that $\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}}=\bigl\|((\rho_t-\rho_a)^{\top}\psi(x_s))_{s\in\mathcal{S}_{at}}\bigr\|_2$. Thus by the triangle inequality,

$$\begin{array}{rl}\displaystyle\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}}-\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_{\epsilon}-\rho_a)^{\top}\psi(x_s)\}^{2}}&\displaystyle\leq\left\|((\rho_t-\rho_{\epsilon})^{\top}\psi(x_s))_{s\in\mathcal{S}_{at}}\right\|_2\\ &\displaystyle\leq\sqrt{n_{at}(\bar{C}_{\psi}\epsilon)^{2}}=\frac{\sqrt{n_{at}}}{T}\leq\frac{1}{\sqrt{n_{at}}},\end{array}\tag{49}$$

where the last step uses ([45](https://arxiv.org/html/2606.07392#S8.E45)) and $n_{at}\leq T$. On the event that ([46](https://arxiv.org/html/2606.07392#S8.E46)) holds uniformly over all $\rho\in\mathcal{N}_{\epsilon}$ and $t\in[T]$, we have

$$\begin{array}{rl}-H_{at}(\rho_t)&\displaystyle\leq_{(i)}-H_{at}(\rho_{\epsilon})+4\\ &\displaystyle\leq_{(ii)}6\sqrt{\Gamma_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\{(\rho_{\epsilon}-\rho_a)^{\top}\psi(x_s)\}^{2}}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta)+4\\ &\displaystyle\leq_{(iii)}6\sqrt{\Gamma_{at}(\delta)}\left(\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}}+\frac{1}{\sqrt{n_{at}}}\right)+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta)+4\\ &\displaystyle\leq_{(iv)}6\sqrt{2\Gamma_{at}(\delta)}\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+1}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta)+4\\ &\displaystyle\leq_{(v)}\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+\left(\frac{144}{c_0}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{c_0}{8},\end{array}$$

where (i) follows from ([48](https://arxiv.org/html/2606.07392#S8.E48)); (ii) follows from ([46](https://arxiv.org/html/2606.07392#S8.E46)); (iii) follows from ([49](https://arxiv.org/html/2606.07392#S8.E49)); (iv) follows from the fact that $\sqrt{S}+\frac{1}{\sqrt{n_{at}}}\leq\sqrt{2(S+1/n_{at})}$, where $S=\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}$, and $1/n_{at}\leq 1$; and (v) follows by applying $2ab\leq\epsilon a^{2}+\epsilon^{-1}b^{2}$ to $a=\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+1}$, $b=6\sqrt{2\Gamma_{at}(\delta)}$, and $\epsilon=\frac{c_0}{4}$ (a free constant here, unrelated to the net radius in ([45](https://arxiv.org/html/2606.07392#S8.E45))), so that

$$6\sqrt{2\Gamma_{at}(\delta)}\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+1}\leq\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+\frac{144}{c_0}\Gamma_{at}(\delta)+\frac{c_0}{8}.$$

Hence with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ we have

$$-H_{at}(\rho_t)\leq\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^{\top}\psi(x_s)\}^{2}+\left(\frac{144}{c_0}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{c_0}{8},$$

where $\Gamma_{at}(\delta)$ is defined as in ([47](https://arxiv.org/html/2606.07392#S8.E47)); thus the result follows. $\blacksquare$
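Step (v) of the proof is an instance of the weighted AM-GM inequality $2ab\leq\epsilon a^{2}+\epsilon^{-1}b^{2}$ with $\epsilon=c_0/4$. The sketch below checks numerically that the resulting constants $c_0/8$ and $144/c_0$ are right, over a small grid of hypothetical values for the quadratic sum (called `S` here), $\Gamma_{at}(\delta)$, and $c_0$. The function names are illustrative, not from the paper.

```python
import math


def cross_term(S: float, Gamma: float) -> float:
    """Left-hand side of step (v): 6 * sqrt(2 * Gamma) * sqrt(S + 1)."""
    return 6.0 * math.sqrt(2.0 * Gamma) * math.sqrt(S + 1.0)


def absorbed_bound(S: float, Gamma: float, c0: float) -> float:
    """Right-hand side of step (v): (c0/8) * S + (144/c0) * Gamma + c0/8."""
    return c0 / 8.0 * S + 144.0 / c0 * Gamma + c0 / 8.0


# Smallest gap (right-hand side minus left-hand side) over a hypothetical grid.
min_gap = min(
    absorbed_bound(S, Gamma, c0) - cross_term(S, Gamma)
    for S in (0.0, 1.0, 10.0, 1e3, 1e6)
    for Gamma in (1.0, 10.0, 70.0)
    for c0 in (0.1, 0.5, 2.0)
)

# Equality holds when (c0/8) * (S + 1) = (144/c0) * Gamma, e.g. S = 1, Gamma = 1, c0 = 24.
equality_gap = absorbed_bound(1.0, 1.0, 24.0) - cross_term(1.0, 1.0)
```

The gap is nonnegative everywhere on the grid and vanishes at the equality case, so the constants cannot be improved without changing $\epsilon$.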

### 8.2 Index Estimation under Known Reward Function

Before introducing the formal proof, we first provide the proof's intuition. The argument follows a standard localized empirical-process approach for M-estimation, adapted to our martingale setting. The estimator's empirical optimality is combined with a population curvature lower bound, while the stochastic deviation is controlled uniformly over the parameter space using Freedman's martingale inequality, a peeling argument, and an $\epsilon$-net. Readers familiar with localized empirical-process and martingale concentration arguments may skip the following intuition and proceed directly to the formal proof.

*(i) The optimality inequality.* Because $\hat{\rho}_{at}$ minimizes the empirical primitive loss, the empirical excess loss at $\hat{\rho}_{at}$ cannot be positive. Plug $\rho=\hat{\rho}_{at}$ into

$$\sum_{s\in\mathcal{S}_{at}}\big\{\ell_{as}(\rho)-\ell_{as}(\rho_a)\big\}=\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\big[\ell_{as}(\rho)-\ell_{as}(\rho_a)\,\big|\,\mathcal{F}_{s-1},x_s\big]+H_{at}(\rho)$$

and use the population curvature lower bound from Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) (with a small Lipschitz adjustment so that the curvature holds uniformly over $\rho$, which costs only an additive $1$). After rearranging, one obtains

$$\frac{\kappa\mu_1}{2}\sum_{s\in\mathcal{S}_{at}}\big\{\psi(x_s)^{\top}(\hat{\rho}_{at}-\rho_a)\big\}^{2}\leq-H_{at}(\hat{\rho}_{at})+1.$$

The left side is the local quadratic "signal" we want to bound, and the right side is essentially the "noise".

*(ii) Bounding the noise at one fixed $\rho$.* For each fixed $\rho$, every term inside $H_{at}(\rho)$ is uniformly bounded and has a conditional variance controlled by the same quadratic quantity $\sum_{s\in\mathcal{S}_{at}}\{\psi(x_s)^{\top}(\rho-\rho_a)\}^{2}$. Freedman's (Freedman [1975](https://arxiv.org/html/2606.07392#bib.bib2)) martingale inequality (Lemma [8.6](https://arxiv.org/html/2606.07392#S8.Thmtheorem6)) therefore gives, for each fixed $\rho$,

$$|H_{at}(\rho)|\lesssim\sqrt{\Big(\textstyle\sum_{s}\{\psi(x_s)^{\top}(\rho-\rho_a)\}^{2}\Big)\cdot\log(T/\delta)}+\log(T/\delta).$$

The key point is that the stochastic error enters through the same quadratic form that governs the population curvature, allowing the error term to be absorbed into the curvature in the final bound.

*(iii) Making the bound uniform in $\rho$.* Since $\hat{\rho}_{at}$ is itself random, we need step (ii) to hold for all $\rho$ simultaneously. We do this in two passes. First, a peeling argument splits the parameter space into dyadic shells based on the size of $\sum_{s}\{\psi(x_s)^{\top}(\rho-\rho_a)\}^{2}$; on each shell the variance proxy is replaced by a deterministic ceiling and Freedman's bound applies. Summing over the shells costs only a $\log\log$ factor. Second, an $\epsilon$-net argument extends the bound from a finite grid of $\rho$'s to all of $\mathcal{B}$ via the Lipschitz continuity of the loss primitive. The end product is a uniform bound on $|H_{at}(\rho)|$ that still scales with the same quadratic quantity.

*(iv) Closing the loop.* Plug $\rho=\hat{\rho}_{at}$ into the uniform bound from step (iii) and combine with step (i). The resulting inequality has the schematic form $\text{quadratic}\lesssim\sqrt{\text{quadratic}}\cdot\sqrt{\log T}+\log T$. Applying $2ab\leq\epsilon a^{2}+\epsilon^{-1}b^{2}$ with a suitable $\epsilon$ absorbs the square-root term into the quadratic, leaving $\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\big\{\psi(x_s)^{\top}(\hat{\rho}_{at}-\rho_a)\big\}^{2}\lesssim\log T/n_{at}$. Folding in the regularization $\eta_1\mathbf{I}_m$ to ensure invertibility and applying Cauchy–Schwarz then yields the stated confidence radius.
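The dyadic peeling in step (iii) can be sketched in a few lines. The code below assigns a value of the quadratic quantity (called `S` here) to its shell index $q$, following the event definition in (43), and checks that the largest admissible value falls in a shell no higher than $Q_{at}$ from (42). The function names and sample constants are hypothetical.

```python
import math


def num_shells(n_at: int, iota_bar: float, d_B: float, C_psi: float) -> int:
    """Q_at from (42): the index of the last dyadic shell."""
    cap = min(9.0 * iota_bar ** 2, (d_B * C_psi) ** 2)
    return math.ceil(math.log2(1.0 + n_at * cap))


def shell_index(S: float) -> int:
    """Shell index q for the quadratic quantity S >= 0, as in (43).

    q = 0 if 0 <= S <= 1; otherwise the unique q >= 1 with 2^(q-1) < S <= 2^q.
    A loop is used instead of log2 to avoid rounding issues at powers of two.
    """
    q = 0
    while 2 ** q < S:
        q += 1
    return q


# Hypothetical constants: the largest possible S is n_at * min{9 iota^2, d_B^2 C_psi^2}.
n_at, iota_bar, d_B, C_psi = 1000, 1.0, 2.0, 1.0
S_max = n_at * min(9.0 * iota_bar ** 2, (d_B * C_psi) ** 2)
Q_at = num_shells(n_at, iota_bar, d_B, C_psi)
```

Because only `Q_at + 1` shells are needed and `Q_at` grows like $\log_2 n_{at}$, the union bound over shells contributes the $\log(Q_{at}+1)$ term, which is the $\log\log$ cost mentioned in step (iii).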

###### Proposition 8.8 (Restatement of Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4))

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. When $\mu^{*}$ is known, given any $\delta>0$, with probability at least $1-\delta$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},\tag{50}$$

where

$$B_{at}^{*}:=\sqrt{\frac{8}{3\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+\frac{40}{3\kappa\mu_{1}}+\frac{1}{3}+\eta_{1}d_{\mathcal{B}}^{2}},\tag{51}$$

$$\Gamma_{at}^{*}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_{2}\big(1+n_{at}\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\big)\right\rceil+1\right)+\log(6AT/\delta),$$

and $V_{at}(\eta_{1})=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top}$.
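To get a feel for how the radius multiplier in (51) depends on the sample count, the two formulas can be transcribed directly. The sketch below is illustrative only: the function names, argument names, and all numeric constants are assumptions made for this example and do not come from the paper.

```python
import math

def gamma_star(m, d_B, C_psi, iota_bar, n_at, A, T, delta):
    """Log factor Gamma*_at(delta), transcribed from the display after (51)."""
    shell_count = math.ceil(
        math.log2(1 + n_at * min(9 * iota_bar**2, d_B**2 * C_psi**2))
    )
    return (
        m * math.log(1 + 2 * d_B * C_psi * T)
        + math.log(shell_count + 1)
        + math.log(6 * A * T / delta)
    )

def b_star(kappa, mu_1, eta_1, m, d_B, C_psi, iota_bar, n_at, A, T, delta):
    """Confidence-radius multiplier B*_at, transcribed from (51)."""
    gam = gamma_star(m, d_B, C_psi, iota_bar, n_at, A, T, delta)
    lead = 144 / (kappa * mu_1) + 2 * min(2 * iota_bar, C_psi * d_B) + 6
    inside = (
        8 / (3 * kappa * mu_1) * lead * gam
        + 40 / (3 * kappa * mu_1)
        + 1 / 3
        + eta_1 * d_B**2
    )
    return math.sqrt(inside)

# Hypothetical constants, chosen only to exercise the formula.
params = dict(kappa=0.5, mu_1=1.0, eta_1=1.0, m=4, d_B=2.0,
              C_psi=1.0, iota_bar=1.0, A=3, T=10_000, delta=0.01)
radius_small_n = b_star(n_at=10, **params)
radius_large_n = b_star(n_at=5_000, **params)
```

The sample count $n_{at}$ enters only through the $\log\log$-type shell term, so the multiplier barely moves between the two calls; the shrinkage of the confidence width comes from $\|\psi(x_t)\|_{V_{at}(\eta_1)^{-1}}$, not from $B^{*}_{at}$.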

###### Proof 8\.9

Proof of Proposition [8.8](https://arxiv.org/html/2606.07392#S8.Thmtheorem8). For any $s\in[T]$, define $\mathcal{G}_{s-1}:=\sigma(\mathcal{F}_{s-1},x_{s})$. Note that

$$\begin{array}{rl}0&\geq_{(1)}\displaystyle\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\hat{\rho}_{at})-\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho_{a})\\ &\displaystyle=_{(2)}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_{a})\mid\mathcal{G}_{s-1}\right]+H_{at}(\hat{\rho}_{at})\\ &\displaystyle\geq_{(3)}\frac{\kappa\mu_{1}}{2}\sum_{s\in\mathcal{S}_{at}}[\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})]^{2}-1+H_{at}(\hat{\rho}_{at}),\end{array}\tag{52}$$

where (1) holds because $\hat{\rho}_{at}=\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho)$, (2) holds with

$$H_{at}(\rho)=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_{a})-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_{a})\mid\mathcal{G}_{s-1}]\right\}$$

as defined in ([12](https://arxiv.org/html/2606.07392#S4.E12)), and (3) follows from Lemma [8.4](https://arxiv.org/html/2606.07392#S8.Thmtheorem4). So ([52](https://arxiv.org/html/2606.07392#S8.E52)) implies that

$$\frac{\kappa\mu_{1}}{2}\sum_{s\in\mathcal{S}_{at}}[\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})]^{2}\leq-H_{at}(\hat{\rho}_{at})+1.\tag{53}$$

Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3) implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$-H_{at}(\hat{\rho}_{at})\leq\frac{\kappa\mu_{1}}{8}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}+\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+4+\frac{\kappa\mu_{1}}{8},\tag{54}$$

where $\Gamma_{at}^{*}(\delta)=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(Q_{at}+1)+\log\left(6AT/\delta\right)$. So ([53](https://arxiv.org/html/2606.07392#S8.E53)) further implies that

$$\frac{3\kappa\mu_{1}}{8}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}\leq\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+5+\frac{\kappa\mu_{1}}{8}.\tag{55}$$

Combining this with the fact that $\eta_{1}\|\hat{\rho}_{at}-\rho_{a}\|^{2}\leq\eta_{1}d_{\mathcal{B}}^{2}$, and recalling that

$$V_{at}(\eta_{1})=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top},$$

([55](https://arxiv.org/html/2606.07392#S8.E55)) then implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}^{2}\leq\frac{8}{3\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+\frac{40}{3\kappa\mu_{1}}+\frac{1}{3}+\eta_{1}d_{\mathcal{B}}^{2}.\tag{56}$$

By the Cauchy-Schwarz inequality, $|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}$, so the result follows. $\blacksquare$

### 8.3 Regret Analysis under Known Reward Function

###### Proof 8\.10

Proof of Proposition [4.5](https://arxiv.org/html/2606.07392#S4.Thmtheorem5). Let $\mathcal{E}$ denote the event that uniformly over all $a\in[A]$ and $t\in[T]$,

$$|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},\tag{57}$$

where $V_{at}(\eta_{1})$ is defined as in Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) and $B_{at}^{*}$ is defined as in ([51](https://arxiv.org/html/2606.07392#S8.E51)). On event $\mathcal{E}$, $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ for all $a\in[A],t\in[T]$, where $\widetilde{\sigma}_{at}$ is defined as in ([15](https://arxiv.org/html/2606.07392#S4.E15)). Then Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) implies that on $\mathcal{E}$,

$$\begin{array}{rl}\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\,\Big|\,\mathcal{E}\right]&\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{a=1}^{A}(\widetilde{\sigma}_{at}-\sigma_{at}^{*})\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right]\\ &\displaystyle\leq\mathbb{E}\left[\sum_{a=1}^{A}\sum_{t=1}^{T}\left(\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\right)-\Lambda(\rho_{a}^{\top}\psi(x_{t}))\right)\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right]\\ &\displaystyle\leq_{(i)}\mathbb{E}\left[\sum_{a=1}^{A}\sum_{t=1}^{T}L\left|\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}-\rho_{a}^{\top}\psi(x_{t})\right|\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right]\\ &\displaystyle\leq_{(ii)}2LB_{T}^{*}\,\mathbb{E}\left[\sum_{a=1}^{A}\sum_{t=1}^{T}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right],\end{array}$$

where (i) follows from the Lipschitz property of $\Lambda$ according to Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), (ii) follows from ([57](https://arxiv.org/html/2606.07392#S8.E57)) and the triangle inequality, and

$$B_{T}^{*}=\sup_{a\in[A],t\leq T}B_{at}^{*}=\sqrt{\frac{8}{3\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{T}(\delta)+\frac{40}{3\kappa\mu_{1}}+\frac{1}{3}+\eta_{1}d_{\mathcal{B}}^{2}},$$

$$\Gamma_{T}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_{2}\big(1+T\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\big)\right\rceil+1\right)+\log(6AT/\delta).$$

Then, following steps similar to those in the proof of Theorem [5.5](https://arxiv.org/html/2606.07392#S5.Thmtheorem5), we have that on $\mathcal{E}$,

$$\begin{array}{rl}\displaystyle\sum_{a=1}^{A}\sum_{t=1}^{T}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}&\displaystyle=\sum_{a=1}^{A}\sum_{s\in\mathcal{S}_{a,T+1}}\|\psi(x_{s})\|_{V_{as}(\eta_{1})^{-1}}\\ &\displaystyle\leq A\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})\,m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.\end{array}$$

Hence on $\mathcal{E}$,

$$\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\,\bigg|\,\mathcal{E}\right]\leq 2LB_{T}^{*}A\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})\,m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.\tag{58}$$

Note that it always holds that $\widetilde{\sigma}_{at}-\sigma_{at}^{*}\leq 2$, and recall from Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) that $\mathbb{P}(\mathcal{E})\geq 1-\delta$. On $\mathcal{E}^{c}$, the per-period regret is at most $1+2A$. Taking $\delta=1/T$, the failure event therefore contributes at most $T\cdot\delta\cdot(1+2A)=1+2A=\mathrm{O}(A)$ to the expected regret. Thus, ignoring logarithmic factors, we have $\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\widetilde{O}\left(Am\sqrt{T}\right)$. $\blacksquare$
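The $\sqrt{T}$ rate above rests on the elliptical-potential bound for the summed confidence widths of a single arm. The following is a minimal numerical sketch of that bound, assuming unit-norm synthetic features and hypothetical values for $m$, $T$, $\eta_1$ and $\bar{C}_\psi$; the inverse of $V$ is maintained with the Sherman-Morrison formula, which is an implementation choice made here and not something the paper specifies.

```python
import math
import random

def elliptical_width_sum(features, eta):
    """Sum of ||x_s||_{V_s^{-1}} where V_s = eta*I + sum of x x^T over earlier rounds."""
    m = len(features[0])
    # Inverse of V, initialised to (eta * I)^{-1}.
    v_inv = [[(1.0 / eta if i == j else 0.0) for j in range(m)] for i in range(m)]
    total = 0.0
    for x in features:
        vx = [sum(v_inv[i][j] * x[j] for j in range(m)) for i in range(m)]
        q = sum(x[i] * vx[i] for i in range(m))  # squared width before x is added
        total += math.sqrt(q)
        # Sherman-Morrison rank-one update of the inverse after adding x x^T.
        for i in range(m):
            for j in range(m):
                v_inv[i][j] -= vx[i] * vx[j] / (1.0 + q)
    return total

def potential_bound(T, m, c_psi, eta):
    """Right-hand side of the per-arm bound used in the regret proof."""
    return math.sqrt(
        T * (1 + c_psi**2 / eta) * m * math.log(1 + T * c_psi**2 / (eta * m))
    )

random.seed(0)
m, T, eta, c_psi = 3, 500, 1.0, 1.0  # hypothetical sizes
feats = []
for _ in range(T):
    g = [random.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(v * v for v in g))
    feats.append([c_psi * v / norm for v in g])  # features with norm exactly c_psi

width_sum = elliptical_width_sum(feats, eta)
bound = potential_bound(T, m, c_psi, eta)
```

Because the bound holds for any feature sequence with $\|\psi(x_s)\|_2\leq\bar{C}_\psi$, `width_sum` stays below `bound` regardless of the random seed; summing over the $A$ arms gives the factor $A$ in the display.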

## 9 Reward Estimation

Lemma [9.1](https://arxiv.org/html/2606.07392#S9.Thmtheorem1) below is Theorem 1 from Abbasi-Yadkori et al. ([2011](https://arxiv.org/html/2606.07392#bib.bib56)):

###### Lemma 9\.1

Let $\{v_{s}:s\geq 0\}$ be an $\mathbb{R}^{d}$-valued stochastic process adapted to filtration $\{\mathcal{H}_{s}:s\geq 0\}$, and let $\{\epsilon_{s}:s\geq 1\}$ be a real-valued stochastic process adapted to $\{\mathcal{H}_{s}\}$. Assume that $\epsilon_{s}$ is conditionally sub-Gaussian, that is, there exists some $\gamma>0$ such that for any $u\in\mathbb{R}$, $s\geq 1$,

$$\mathbb{E}\left[\exp\left(u\epsilon_{s}\right)\mid\mathcal{H}_{s-1}\right]\leq\exp\left(\frac{u^{2}\gamma^{2}}{2}\right)\quad\text{a.s.}$$

Assume that $V$ is a $d\times d$ positive definite matrix. For any $t\geq 0$, define $\bar{V}_{t}=V+\sum_{s=1}^{t}v_{s}v_{s}^{\top}$. Then for any $\delta>0$, with probability at least $1-\delta$, for all $t\geq 0$,

$$\left\|\sum_{s=1}^{t}\epsilon_{s}v_{s}\right\|_{\bar{V}_{t}^{-1}}^{2}\leq 2\gamma^{2}\log\left(\frac{\det(\bar{V}_{t})^{1/2}\det(V)^{-1/2}}{\delta}\right).$$

Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) below provides a stronger result, which implies Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) directly:

###### Lemma 9\.2

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S5), [5](https://arxiv.org/html/2606.07392#S5) hold. Fix any $\delta>0$. Then with probability at least $1-\delta$, uniformly over all $x\in\mathcal{X}$, $\omega\in\Omega$, and $t\geq 2$, the following holds:

$$\left|\theta_{*}^{\top}\phi(x,\omega)-\hat{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\leq\frac{2}{\min\{1,\underline{\mu}\}}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(1/\delta)}+\sqrt{\eta_{0}}\,\bar{\alpha}\bigg).\tag{59}$$

In particular, with probability at least $1-\delta/2$, uniformly over all $a\in[A]$ and $t\geq 2$,

$$\big|(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{t},\omega_{at})\big|\;\leq\;\beta_{t}\;\big\|\phi(x_{t},\omega_{at})\big\|_{\Phi_{t-1}^{-1}},\tag{60}$$

where

$$\beta_{t}:=\frac{2}{\min\{1,\underline{\mu}\}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(2/\delta)}+\sqrt{\eta_{0}}\,\bar{\alpha}\bigg).\tag{61}$$
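As a rough illustration of how (60) and (61) would be used, the sketch below transcribes $\beta_t$ and builds the resulting interval for the linear predictor $\theta_*^{\top}\phi(x_t,\omega_{at})$. The function and argument names and the numeric constants are hypothetical; `width` stands for a precomputed value of $\|\phi(x_t,\omega_{at})\|_{\Phi_{t-1}^{-1}}$.

```python
import math

def beta_t(t, d, gamma_0, mu_lower, C_phi, eta_0, alpha_bar, delta):
    """Radius multiplier beta_t, transcribed from (61)."""
    log_term = d * math.log(1 + t * C_phi**2 / eta_0) + 2 * math.log(2 / delta)
    return (2 / min(1.0, mu_lower)) * (
        gamma_0 * math.sqrt(log_term) + math.sqrt(eta_0) * alpha_bar
    )

def predictor_interval(point_estimate, width, t, **consts):
    """Interval for theta_*^T phi implied by (60), given the estimate theta_hat^T phi."""
    radius = beta_t(t, **consts) * width
    return point_estimate - radius, point_estimate + radius

# Hypothetical constants, chosen only to exercise the formula.
consts = dict(d=5, gamma_0=1.5, mu_lower=0.25, C_phi=1.0,
              eta_0=1.0, alpha_bar=2.0, delta=0.05)
beta_early = beta_t(10, **consts)
beta_late = beta_t(1000, **consts)
low, high = predictor_interval(0.3, 0.02, 1000, **consts)
```

The multiplier grows only like $\sqrt{\log t}$, so the interval tightens over time as long as the width term shrinks faster.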

The proof of Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) follows closely that of Proposition 1 of Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)).

###### Proof 9\.3

Proof of Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2). Let $g_{t}(\theta)=\sum_{k=1}^{t-1}G(\theta^{\top}\phi(x_{k},\omega_{a_{k}k}))\phi(x_{k},\omega_{a_{k}k})+\eta_{0}\theta$ be the invertible function such that $g_{t}(\tilde{\theta}_{t-1})=\sum_{k=1}^{t-1}r_{k}\phi(x_{k},\omega_{a_{k}k})$, where $\tilde{\theta}_{t-1}$ is the unique solution to ([62](https://arxiv.org/html/2606.07392#S9.E62)):

$$\nabla_{\theta}\left[\mathcal{L}_{t}(\theta)-\frac{\eta_{0}}{2}\|\theta\|_{2}^{2}\right]=-\eta_{0}\theta+\sum_{k=1}^{t-1}\left\{r_{k}-G\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\right\}\phi(x_{k},\omega_{a_{k}k})=0.\tag{62}$$

By Assumption [5](https://arxiv.org/html/2606.07392#S5), $\nabla g_{t}$ is continuous, so by the Fundamental Theorem of Calculus,

$$g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})=\Gamma_{t}(\theta_{*}-\tilde{\theta}_{t-1}),\tag{63}$$

where $\Gamma_{t}=\int_{0}^{1}\nabla g_{t}(s\theta_{*}+(1-s)\tilde{\theta}_{t-1})\,ds$, and

$$\begin{array}{rl}\nabla g_{t}(\theta)&\displaystyle=\eta_{0}\mathbf{I}_{d}+\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\phi(x_{k},\omega_{a_{k}k})^{\top}G^{\prime}\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\\ &\displaystyle\succeq_{(i)}\eta_{0}\mathbf{I}_{d}+\underline{\mu}\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\phi(x_{k},\omega_{a_{k}k})^{\top},\end{array}$$

where (i) follows because $G^{\prime}\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\geq\underline{\mu}$ according to Assumption [5](https://arxiv.org/html/2606.07392#S5). Hence for any $t\in[T]$,

$$\Gamma_{t}\succeq\min\{1,\underline{\mu}\}\Phi_{t-1}\succeq\min\{1,\underline{\mu}\}\eta_{0}\mathbf{I}_{d}\succ 0,\tag{64}$$

so $\Gamma_{t}$ is positive definite and non-singular for any $t\in[T]$. Therefore,

$$\begin{array}{rl}\left|\theta_{*}^{\top}\phi(x,\omega)-\tilde{\theta}_{t-1}^{\top}\phi(x,\omega)\right|&\displaystyle=_{(1)}\left|\phi(x,\omega)^{\top}\Gamma_{t}^{-1}\{g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\}\right|\\ &\displaystyle\leq_{(2)}\|\phi(x,\omega)\|_{\Gamma_{t}^{-1}}\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Gamma_{t}^{-1}},\end{array}\tag{65}$$

where (1) of ([65](https://arxiv.org/html/2606.07392#S9.E65)) follows from ([63](https://arxiv.org/html/2606.07392#S9.E63)), and (2) follows from the Cauchy-Schwarz inequality and the fact that $\Gamma_{t}^{-1}$ is positive definite. ([64](https://arxiv.org/html/2606.07392#S9.E64)) implies that $\Gamma_{t}\succeq\min\{1,\underline{\mu}\}\Phi_{t-1}$, which further implies $\max\{1,1/\underline{\mu}\}\Phi_{t-1}^{-1}\succeq\Gamma_{t}^{-1}$, so $\|v\|_{\Gamma_{t}^{-1}}\leq\max\left\{1/\sqrt{\underline{\mu}},1\right\}\|v\|_{\Phi_{t-1}^{-1}}$ for all $v\in\mathbb{R}^{d}$. Applying this to both norms in ([65](https://arxiv.org/html/2606.07392#S9.E65)) gives

$$\left|\theta_{*}^{\top}\phi(x,\omega)-\tilde{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\leq\max\bigg\{\frac{1}{\underline{\mu}},1\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}.\tag{66}$$

Further,

$$\begin{array}{rl}\|g_{t}(\theta_{*})-g_{t}(\hat{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}&\displaystyle\leq\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}+\|g_{t}(\tilde{\theta}_{t-1})-g_{t}(\hat{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}\\ &\displaystyle\leq 2\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}},\end{array}\tag{67}$$

where the first inequality of ([67](https://arxiv.org/html/2606.07392#S9.E67)) follows from the triangle inequality, and the second inequality of ([67](https://arxiv.org/html/2606.07392#S9.E67)) follows from the fact that $\theta_{*}\in\Theta$ and the optimality of $\hat{\theta}_{t-1}$ in $\Theta$ by definition. Recall that

$$g_{t}(\tilde{\theta}_{t-1})-g_{t}(\theta_{*})=\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}-\eta_{0}\theta_{*},$$

so ([66](https://arxiv.org/html/2606.07392#S9.E66)) and ([67](https://arxiv.org/html/2606.07392#S9.E67)) imply that

$$\begin{array}{rl}&\displaystyle\quad\left|\theta_{*}^{\top}\phi(x,\omega)-\hat{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\\ &\displaystyle\leq\max\bigg\{\frac{2}{\underline{\mu}},2\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}-\eta_{0}\theta_{*}\right\|_{\Phi_{t-1}^{-1}}\\ &\displaystyle\leq_{(i)}\max\bigg\{\frac{2}{\underline{\mu}},2\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}+\eta_{0}\left\|\theta_{*}\right\|_{\Phi_{t-1}^{-1}}\bigg)\\ &\displaystyle\leq_{(ii)}\max\bigg\{\frac{2}{\underline{\mu}},2\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}+\sqrt{\eta_{0}}\|\theta_{*}\|_{2}\bigg)\end{array}$$

holds for all $x,\omega$, where inequality (i) follows from the triangle inequality, and inequality (ii) holds since $\Phi_{t-1}\succeq\eta_{0}\mathbf{I}_{d}$, so that $\|\theta_{*}\|_{\Phi_{t-1}^{-1}}\leq\|\theta_{*}\|_{2}/\sqrt{\eta_{0}}$. We now apply Lemma [9.1](https://arxiv.org/html/2606.07392#S9.Thmtheorem1) to bound

$$\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}.$$

Set $v_{k}=\phi(x_{k},\omega_{a_{k}k})$, $\epsilon_{k}=\zeta_{k}$, $\mathcal{H}_{k}=\sigma(v_{s},\epsilon_{s};s\leq k)$, $\xi_{t}=\sum_{k=1}^{t-1}v_{k}\epsilon_{k}$ and $V_{t}=\sum_{k=1}^{t}v_{k-1}v_{k-1}^{\top}$. Note that $|\epsilon_{k}|\leq\gamma_{0}$, so $\epsilon_{k}$ is $\gamma_{0}$-sub-Gaussian, meaning that for any $u\in\mathbb{R}$, $t\geq 1$,

$$\mathbb{E}\left[\exp\left(u\zeta_{t}\right)\mid\mathcal{H}_{t-1}\right]\leq\exp\left(\frac{u^{2}\gamma_{0}^{2}}{2}\right)\quad\text{a.s.}$$

Further, by Assumption [5](https://arxiv.org/html/2606.07392#S5), $\|\phi(x_{k},\omega_{a_{k}k})\|_{2}\leq\bar{C}_{\phi}$, implying that $\det(\bar{V}_{t})\leq\left(\eta_{0}+(t-1)\bar{C}_{\phi}^{2}\right)^{d}$. So by Lemma [9.1](https://arxiv.org/html/2606.07392#S9.Thmtheorem1), given any $\delta>0$, with probability at least $1-\delta$, for all $t\geq 2$ we have

$$\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}\leq\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(1/\delta)}.\tag{68}$$

Note that $\gamma_{0}>1$ according to Assumption [5](https://arxiv.org/html/2606.07392#S5), hence with probability at least $1-\delta$, uniformly over all $t\geq 2$ and all $x\in\mathcal{X}$ and $\omega\in\Omega$,

$$\left|\theta_{*}^{\top}\phi(x,\omega)-\hat{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\leq\frac{2}{\min\{1,\underline{\mu}\}}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(1/\delta)}+\sqrt{\eta_{0}}\|\theta_{*}\|_{2}\bigg).$$

Hence ([59](https://arxiv.org/html/2606.07392#S9.E59)) follows from (iv) of Assumption [5](https://arxiv.org/html/2606.07392#S5), which gives $\|\theta_{*}\|_{2}\leq\bar{\alpha}$. This immediately implies the high-probability bound ([60](https://arxiv.org/html/2606.07392#S9.E60)) taken uniformly over all $a\in[A],t\geq 2$. $\blacksquare$
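The proof relies on $g_t$ being invertible, so that the estimating equation (62) has a unique root $\tilde{\theta}_{t-1}$. A one-dimensional sketch makes this concrete. It assumes a logistic link $G$ and made-up data, neither of which is taken from the paper, and it uses bisection (valid here because $g_t$ is strictly increasing when $G'\geq 0$ and $\eta_0>0$) in place of whatever solver an implementation would actually use. The projection step that produces $\hat{\theta}_{t-1}$ is not shown.

```python
import math

def sigmoid(z):
    """Assumed link function G (logistic); the paper only requires regularity of G."""
    return 1.0 / (1.0 + math.exp(-z))

def g_t(theta, phis, eta_0):
    """One-dimensional g_t(theta) = sum_k G(theta * phi_k) * phi_k + eta_0 * theta."""
    return sum(sigmoid(theta * p) * p for p in phis) + eta_0 * theta

def solve_theta_tilde(phis, rewards, eta_0, tol=1e-10):
    """Root of (62) by bisection: find theta with g_t(theta) = sum_k r_k * phi_k."""
    target = sum(r * p for r, p in zip(rewards, phis))
    # g_t has slope at least eta_0, so the root lies within this radius of zero.
    radius = abs(target - g_t(0.0, phis, eta_0)) / eta_0 + 1.0
    lo, hi = -radius, radius
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g_t(mid, phis, eta_0) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Made-up scalar features and binary rewards, for illustration only.
phis = [0.5, -1.0, 0.8, 0.3, -0.6]
rewards = [1.0, 0.0, 1.0, 1.0, 0.0]
eta_0 = 1.0
theta_tilde = solve_theta_tilde(phis, rewards, eta_0)
# Left-hand side of (62) evaluated at the root; it should vanish up to tolerance.
score = sum((r - sigmoid(theta_tilde * p)) * p
            for r, p in zip(rewards, phis)) - eta_0 * theta_tilde
```

In higher dimensions the same monotonicity appears as $\nabla g_t\succeq\eta_0\mathbf{I}_d$, which is what makes $\Gamma_t$ in (63) invertible.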

## 10 Minimum Eigenvalue of $\Phi_{t-1}$

In this section, we first verify the examples satisfying Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) (Lemma [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3), Lemma [10.5](https://arxiv.org/html/2606.07392#S10.Thmtheorem5), Lemma [10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7)). Then we show that Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) implies a lower bound on the minimum eigenvalue of $\Phi_{t-1}$ (Proposition [10.10](https://arxiv.org/html/2606.07392#S10.Thmtheorem10)).

We first present Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1), which is useful for proving Lemmas [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3)–[10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7).

###### Lemma 10\.1

Let $Z^{0}_{at}\in\mathbb{R}^{d}$ be a possibly unbounded random vector. Suppose $\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)\leq C_{T}\epsilon$ for all $v\in\mathbb{S}^{d-1}$ and $\epsilon>0$, and suppose that $\mathbb{P}\left(\|Z^{0}_{at}\|_{2}\leq\bar{C}_{\phi}\,\middle|\,\mathcal{F}_{t-1}\right)\geq q_{T}$ a.s. for some deterministic $q_{T}\in(0,1]$. Define the bounded vector $Z_{at}$ by the conditional law $Z_{at}\sim\mathcal{L}\left(Z^{0}_{at}\,\middle|\,\|Z^{0}_{at}\|_{2}\leq\bar{C}_{\phi},\mathcal{F}_{t-1}\right)$. If $\frac{C_{T}}{q_{T}}\leq M_{T}$, then $Z_{at}$ satisfies Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2). Moreover, $\|Z_{at}\|_{2}\leq\bar{C}_{\phi}$ almost surely.

###### Proof 10\.2

Proof of Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1). By construction, $\|Z_{at}\|_{2}\leq\bar{C}_{\phi}$ a.s. Fix $v\in\mathbb{S}^{d-1}$ and $\epsilon>0$. Then

$$\begin{array}{rl}\mathbb{P}\left(|v^{\top}Z_{at}|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)&\displaystyle=\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon\,\middle|\,\|Z^{0}_{at}\|\leq\bar{C}_{\phi},\mathcal{F}_{t-1}\right)=\frac{\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon,\,\|Z^{0}_{at}\|\leq\bar{C}_{\phi}\,\middle|\,\mathcal{F}_{t-1}\right)}{\mathbb{P}\left(\|Z^{0}_{at}\|\leq\bar{C}_{\phi}\,\middle|\,\mathcal{F}_{t-1}\right)}\\ &\displaystyle\leq\frac{\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)}{q_{T}}\leq\frac{C_{T}}{q_{T}}\epsilon\leq M_{T}\epsilon.\end{array}$$

In particular, the above inequality holds for all $0<\epsilon\leq(2AM_{T})^{-1}$, so Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds. $\blacksquare$
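The truncation argument can be checked exactly in one dimension, where every probability has a closed form through the normal CDF. The sketch below assumes a scalar Gaussian $Z^0$ with hypothetical mean, standard deviation, and truncation level; it uses $C_T=\sqrt{2/\pi}/\text{std}$, which matches the Gaussian density bound used for the truncated Gaussian example when $\sigma_T=1/\text{std}$.

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_abs_le(eps, mean, std):
    """P(|N(mean, std^2)| <= eps), computed exactly from the normal CDF."""
    return normal_cdf((eps - mean) / std) - normal_cdf((-eps - mean) / std)

mean, std, c_bar = 0.4, 0.7, 1.5       # hypothetical scalar example
c_T = math.sqrt(2.0 / math.pi) / std   # anti-concentration constant: P(|Z0| <= eps) <= c_T * eps
q_T = prob_abs_le(c_bar, mean, std)    # probability mass kept by truncating to |Z0| <= c_bar

def truncated_prob(eps):
    """P(|Z| <= eps) for Z distributed as Z0 conditioned on |Z0| <= c_bar."""
    return prob_abs_le(min(eps, c_bar), mean, std) / q_T

# Each triple is (eps, exact truncated probability, bound (C_T / q_T) * eps).
checks = [(eps, truncated_prob(eps), c_T / q_T * eps)
          for eps in (0.01, 0.1, 0.5, 1.0)]
```

Each truncated probability sits below its bound because conditioning can inflate a probability by at most the factor $1/q_T$, which is exactly the step used in the proof.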

###### Lemma 10.3 (Truncated Gaussian)

Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds under the following conditions:

- (i) Conditional on $\mathcal{F}_{t-1}$, $Z_{at}\sim\mathcal{N}(\mu_{at},\Sigma_{at})$ and $\phi(x_{t},\omega_{at})\sim\mathcal{L}(Z_{at}\mid\|Z_{at}\|_{2}\leq\bar{C}_{\phi},\mathcal{F}_{t-1})$.
- (ii) There exist deterministic sequences $\sigma_{T}>0$, $\bar{\sigma}_{T}>0$ and $\bar{\mu}_{T}\in[0,\bar{C}_{\phi})$ such that uniformly over $a\in[A],t\in[T]$, $\lambda_{\min}(\Sigma_{at})\geq\frac{1}{\sigma_{T}^{2}}$, $\lambda_{\max}(\Sigma_{at})\leq\bar{\sigma}_{T}^{2}$, and $\|\mu_{at}\|_{2}\leq\bar{\mu}_{T}$ a.s.
- (iii) Let $F_{\chi_{d}^{2}}(\cdot)$ denote the cumulative distribution function of a chi-square random variable with $d$ degrees of freedom. Suppose $\sqrt{2/\pi}\,\sigma_{T}\leq M_{T}F_{\chi_{d}^{2}}\left(\frac{(\bar{C}_{\phi}-\bar{\mu}_{T})^{2}}{\bar{\sigma}_{T}^{2}}\right)$ where $M_{T}=\sqrt{\mathrm{polylog}(T)}/A$.

###### Proof 10\.4

Proof of Lemma [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3). Fix $a\in[A]$, $t\in[T]$ and $v\in\mathbb{S}^{d-1}$. Conditional on $\mathcal{F}_{t-1}$,

$$v^\top Z_{at}\sim\mathcal{N}(v^\top\mu_{at},\,v^\top\Sigma_{at}v).$$

Since $v\in\mathbb{S}^{d-1}$, $v^\top\Sigma_{at}v\geq 1/\sigma_T^2$. Let $f_{atv}$ be the density of $v^\top Z_{at}$; it is bounded by $\frac{1}{\sqrt{2\pi}\sqrt{v^\top\Sigma_{at}v}}\leq\frac{\sigma_T}{\sqrt{2\pi}}$. So for every $\epsilon>0$,

$$\mathbb{P}(|v^\top Z_{at}|\leq\epsilon\mid\mathcal{F}_{t-1})=\int_{-\epsilon}^{\epsilon}f_{atv}(u)\,du\leq\frac{2\epsilon\sigma_T}{\sqrt{2\pi}}=\sqrt{2/\pi}\,\sigma_T\epsilon.$$

Next, note that $Z_{at}=\mu_{at}+\Sigma_{at}^{1/2}Z$ where $Z\sim\mathcal{N}(0,\mathbf{I}_d)$. Note that

$$\|Z_{at}\|_2\leq_{(1)}\|\mu_{at}\|_2+\|\Sigma_{at}^{1/2}Z\|_2\leq_{(2)}\|\mu_{at}\|_2+\sqrt{\lambda_{\max}(\Sigma_{at})}\,\|Z\|_2\leq_{(3)}\bar{\mu}_T+\bar{\sigma}_T\|Z\|_2,$$

where (1) follows from the triangle inequality, (2) follows because $\|\Sigma_{at}^{1/2}Z\|_2\leq\sqrt{\lambda_{\max}(\Sigma_{at})}\,\|Z\|_2$, and (3) follows from condition (ii) in the lemma. Therefore, the event $\left\{\|Z\|_2\leq\frac{\bar{C}_\phi-\bar{\mu}_T}{\bar{\sigma}_T}\right\}$ implies

$$\|Z_{at}\|_2\leq\bar{\mu}_T+\bar{\sigma}_T\frac{\bar{C}_\phi-\bar{\mu}_T}{\bar{\sigma}_T}=\bar{C}_\phi.$$

Hence

$$\mathbb{P}(\|Z_{at}\|_2\leq\bar{C}_\phi\mid\mathcal{F}_{t-1})\geq\mathbb{P}\left(\|Z\|_2\leq\frac{\bar{C}_\phi-\bar{\mu}_T}{\bar{\sigma}_T}\,\middle|\,\mathcal{F}_{t-1}\right)=F_{\chi_d^2}\left(\frac{(\bar{C}_\phi-\bar{\mu}_T)^2}{\bar{\sigma}_T^2}\right)=q_T.$$

Applying Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1) with $C_T=\sqrt{2/\pi}\,\sigma_T$ and $q_T=F_{\chi_d^2}\left(\frac{(\bar{C}_\phi-\bar{\mu}_T)^2}{\bar{\sigma}_T^2}\right)$ gives

$$\mathbb{P}(|v^\top\phi(x_t,\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1})\leq\frac{\sqrt{2/\pi}\,\sigma_T}{q_T}\epsilon\leq M_T\epsilon.$$

So Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) is satisfied. $\blacksquare$
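The bound in Lemma 10.3 can be sanity-checked numerically. The sketch below is only an illustration under assumed values that do not come from the paper: $d=2$, $\Sigma_{at}=\mathbf{I}_2$, $\mu_{at}=0$ (so $\sigma_T=\bar{\sigma}_T=1$, $\bar{\mu}_T=0$), $\bar{C}_\phi=1.5$ and $\epsilon=0.1$. For $d=2$ the chi-square CDF has the closed form $1-e^{-x/2}$, which keeps the example free of external libraries. The function names are hypothetical.

```python
import math
import random


def truncated_gaussian_bound(sigma_T, sigma_bar_T, mu_bar_T, C_phi, eps):
    """Bound sqrt(2/pi) * sigma_T / q_T * eps for d = 2, where the chi-square
    CDF with 2 degrees of freedom is 1 - exp(-x/2)."""
    x = (C_phi - mu_bar_T) ** 2 / sigma_bar_T ** 2
    q_T = 1.0 - math.exp(-x / 2.0)
    return math.sqrt(2.0 / math.pi) * sigma_T / q_T * eps


def empirical_small_ball(v, C_phi, eps, n_draws, seed=0):
    """Monte Carlo estimate of P(|v^T phi| <= eps) when phi is a standard
    bivariate Gaussian conditioned on ||phi||_2 <= C_phi (rejection sampling)."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n_draws):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        if math.hypot(z[0], z[1]) > C_phi:
            continue  # outside the truncation ball, discard the draw
        kept += 1
        if abs(v[0] * z[0] + v[1] * z[1]) <= eps:
            hits += 1
    return hits / kept


# Assumed illustrative values (not taken from the paper).
eps, C_phi = 0.1, 1.5
bound = truncated_gaussian_bound(1.0, 1.0, 0.0, C_phi, eps)
estimate = empirical_small_ball((0.6, 0.8), C_phi, eps, n_draws=200_000)
print(f"empirical small-ball probability {estimate:.4f}, bound {bound:.4f}")
```

The estimate should sit below the bound; the gap reflects the slack in bounding the Gaussian density by its peak and in dividing by the lower bound $q_T$ rather than the exact truncation probability.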

###### Lemma 10\.5\(Truncated Student’s t distribution\)

Assumption[5\.2](https://arxiv.org/html/2606.07392#S5.SS2)holds under conditions below:

- (i) $\phi(x_t,\omega_{at})\sim\mathcal{L}(Z_{at}\mid\|Z_{at}\|_2\leq\bar{C}_\phi,\mathcal{F}_{t-1})$, where $Z_{at}\sim t_\xi(\mu_{at},\Sigma_{at})$ conditional on $\mathcal{F}_{t-1}$, and $t_\xi(\mu_{at},\Sigma_{at})$ is a $d$-dimensional Student's $t$ distribution with $\xi>0$ degrees of freedom, location $\mu_{at}$, and scale matrix $\Sigma_{at}$.
- (ii) There exist deterministic sequences $\sigma_T>0$, $\bar{\sigma}_T>0$ and $\bar{\mu}_T\in[0,\bar{C}_\phi)$ such that, uniformly over all $a\in[A]$, $t\in[T]$, $\lambda_{\min}(\Sigma_{at})\geq 1/\sigma_T^2$, $\lambda_{\max}(\Sigma_{at})\leq\bar{\sigma}_T^2$, $\|\mu_{at}\|_2\leq\bar{\mu}_T$, a.s.
- (iii) $2c_\xi\sigma_T/q_T\leq M_T=\sqrt{\mathrm{polylog}(T)}/A$, where $c_\xi:=\frac{\Gamma((\xi+1)/2)}{\sqrt{\xi\pi}\,\Gamma(\xi/2)}$, $q_T=F_{d,\xi}\left((\bar{C}_\phi-\bar{\mu}_T)^2/(d\bar{\sigma}_T^2)\right)$, and $F_{d,\xi}(\cdot)$ denotes the cumulative distribution function of the $F$-distribution with $d$ and $\xi$ degrees of freedom.

###### Proof 10\.6

Proof of Lemma [10.5](https://arxiv.org/html/2606.07392#S10.Thmtheorem5). Fix $a\in[A]$, $t\in[T]$, $v\in\mathbb{S}^{d-1}$. The multivariate Student's $t$ random vector $Z_{at}$ can be written as $Z_{at}=\mu_{at}+\frac{\Sigma_{at}^{1/2}Z}{\sqrt{S/\xi}}$, where $Z\sim\mathcal{N}(0,\mathbf{I}_d)$ and $S\sim\chi_\xi^2$ (the chi-square distribution with $\xi$ degrees of freedom) are independent. So

$$v^\top Z_{at}=v^\top\mu_{at}+\frac{v^\top\Sigma_{at}^{1/2}Z}{\sqrt{S/\xi}}.$$

Since $v^\top\Sigma_{at}^{1/2}Z\sim\mathcal{N}(0,v^\top\Sigma_{at}v)$, it holds that $v^\top Z_{at}\sim t_\xi(v^\top\mu_{at},v^\top\Sigma_{at}v)$, where by the eigenvalue lower bound we have $v^\top\Sigma_{at}v\geq 1/\sigma_T^2$. The density $f_{atv}$ of the univariate Student's $t$ random variable $v^\top Z_{at}$ is

$$f_{atv}(y)=\frac{c_\xi}{\sqrt{v^\top\Sigma_{at}v}}\left(1+\frac{(y-v^\top\mu_{at})^2}{\xi[v^\top\Sigma_{at}v]}\right)^{-(\xi+1)/2}.$$

Hence $\sup_y f_{atv}(y)\leq c_\xi/\sqrt{v^\top\Sigma_{at}v}\leq c_\xi\sigma_T$. So for every $\epsilon>0$,

$$\mathbb{P}\left(|v^\top Z_{at}|\leq\epsilon\mid\mathcal{F}_{t-1}\right)=\int_{-\epsilon}^{\epsilon}f_{atv}(u)\,du\leq 2c_\xi\sigma_T\epsilon.$$

Next, we lower bound the probability of the truncation event $\{\|Z_{at}\|_2\leq\bar{C}_\phi\}$. Note that

$$\|Z_{at}\|_2\leq\|\mu_{at}\|_2+\|\Sigma_{at}^{1/2}Z\|_2/\sqrt{S/\xi}\leq\bar{\mu}_T+\bar{\sigma}_T\|Z\|_2/\sqrt{S/\xi}.$$

Therefore, $\|Z\|_2/\sqrt{S/\xi}\leq(\bar{C}_\phi-\bar{\mu}_T)/\bar{\sigma}_T$ implies $\|Z_{at}\|_2\leq\bar{\mu}_T+\bar{\sigma}_T\frac{\bar{C}_\phi-\bar{\mu}_T}{\bar{\sigma}_T}=\bar{C}_\phi$. So

$$\begin{aligned}\mathbb{P}(\|Z_{at}\|_2\leq\bar{C}_\phi\mid\mathcal{F}_{t-1})&\geq\mathbb{P}\left(\frac{\|Z\|_2}{\sqrt{S/\xi}}\leq\frac{\bar{C}_\phi-\bar{\mu}_T}{\bar{\sigma}_T}\,\middle|\,\mathcal{F}_{t-1}\right)\\&=\mathbb{P}\left(\frac{\|Z\|_2^2/d}{S/\xi}\leq\frac{(\bar{C}_\phi-\bar{\mu}_T)^2}{d\bar{\sigma}_T^2}\,\middle|\,\mathcal{F}_{t-1}\right)=F_{d,\xi}\left(\frac{(\bar{C}_\phi-\bar{\mu}_T)^2}{d\bar{\sigma}_T^2}\right)=q_T.\end{aligned}$$

Hence, applying Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1) with $C_T=2c_\xi\sigma_T$ and this $q_T$, and then using condition (iii), we have $\mathbb{P}(|v^\top\phi(x_t,\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1})\leq\frac{2c_\xi\sigma_T}{q_T}\epsilon\leq M_T\epsilon$, so Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds. $\blacksquare$
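As a quick check on the constant $c_\xi$ and on the density bound $\sup_y f_{atv}(y)\leq c_\xi\sigma_T$ used in the proof of Lemma 10.5, the following sketch evaluates both with the standard library. The function names and the numerical values ($\xi=3$, squared scale $0.25$, hence $\sigma_T=2$) are illustrative assumptions, not quantities from the paper.

```python
import math


def c_xi(xi):
    """Normalising constant Gamma((xi+1)/2) / (sqrt(xi*pi) * Gamma(xi/2)),
    computed through log-gamma so that it stays finite for large xi."""
    log_ratio = math.lgamma((xi + 1.0) / 2.0) - math.lgamma(xi / 2.0)
    return math.exp(log_ratio) / math.sqrt(xi * math.pi)


def student_t_density(y, xi, loc, scale_sq):
    """Univariate Student's t density with location loc and squared scale scale_sq."""
    z2 = (y - loc) ** 2 / (xi * scale_sq)
    return c_xi(xi) / math.sqrt(scale_sq) * (1.0 + z2) ** (-(xi + 1.0) / 2.0)


def small_ball_bound(xi, sigma_T, eps):
    """Upper bound 2 * c_xi * sigma_T * eps on P(|v^T Z| <= eps)."""
    return 2.0 * c_xi(xi) * sigma_T * eps


# Assumed illustrative values: scale_sq = 0.25 corresponds to sigma_T = 2.
xi, scale_sq, sigma_T = 3.0, 0.25, 2.0
peak = student_t_density(0.0, xi, loc=0.0, scale_sq=scale_sq)
off_peak = student_t_density(0.7, xi, loc=0.0, scale_sq=scale_sq)
print(peak, c_xi(xi) * sigma_T, small_ball_bound(xi, sigma_T, eps=0.1))
```

Two known special cases give a consistency check on `c_xi`: $\xi=1$ is the Cauchy distribution with $c_1=1/\pi$, and as $\xi\to\infty$ the constant approaches the Gaussian value $1/\sqrt{2\pi}$ that appears in Lemma 10.3.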

###### Lemma 10\.7\(Uniform distribution on a hyperrectangle\)

Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds under (i)–(ii):

- (i) Conditional on $\mathcal{F}_{t-1}$, $\phi(x_t,\omega_{at})=\bar{\phi}_{at}+U_{at}$, where $\bar{\phi}_{at}$ is a fixed constant vector, $U_{at,j}\sim\mathrm{Unif}[-R_{at,j},R_{at,j}]$ for $j\in[d]$, and the coordinates $U_{at,j}$ are independent across $j\in[d]$.
- (ii) $R_{at,j}\geq\sqrt{d}/M_T$, where $M_T=\sqrt{\mathrm{polylog}(T)}/A$, and $\|\bar{\phi}_{at}\|_2+\left(\sum_{j=1}^{d}R_{at,j}^2\right)^{1/2}\leq\bar{C}_\phi$, a.s.

###### Proof 10\.8

Proof of Lemma [10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7). Fix $a\in[A]$, $t\in[T]$, $v\in\mathbb{S}^{d-1}$. Since $\|v\|_2=1$, there exists a coordinate $j^*\in[d]$ such that $|v_{j^*}|\geq 1/\sqrt{d}$. Note that $v^\top\phi(x_t,\omega_{at})=v^\top\bar{\phi}_{at}+v^\top U_{at}=v^\top\bar{\phi}_{at}+\sum_{j=1}^{d}v_jU_{at,j}$. Conditional on all $\{U_{at,j}:j\neq j^*\}$, $v^\top\phi(x_t,\omega_{at})=C+v_{j^*}U_{at,j^*}$, where $C$ is a fixed constant. Conditional on $\mathcal{F}_{t-1}$, the $U_{at,j}$ are independent and uniform on $[-R_{at,j},R_{at,j}]$. Since $U_{at,j^*}\sim\mathrm{Unif}[-R_{at,j^*},R_{at,j^*}]$, $C+v_{j^*}U_{at,j^*}$ is uniform on an interval of length $2R_{at,j^*}|v_{j^*}|$. Therefore, for any interval $I\subset\mathbb{R}$ of length $|I|$, we have

$$\mathbb{P}(C+v_{j^*}U_{at,j^*}\in I\mid\{U_{at,j}:j\neq j^*\},\mathcal{F}_{t-1})\leq\frac{|I|}{2R_{at,j^*}|v_{j^*}|}.$$

Take $I=[-\epsilon,\epsilon]$; then

$$\mathbb{P}\left(|v^\top\phi(x_t,\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1},\{U_{at,j}:j\neq j^*\}\right)\leq\frac{2\epsilon}{2R_{at,j^*}|v_{j^*}|}\leq\frac{\sqrt{d}}{R_{at,j^*}}\epsilon.$$

Taking expectation over the conditioned coordinates of $U_{at}$ in the above inequality, and using $R_{at,j^*}\geq\sqrt{d}/M_T$ from condition (ii), gives

$$\mathbb{P}(|v^\top\phi(x_t,\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1})\leq M_T\epsilon$$

with $M_T=\sqrt{\mathrm{polylog}(T)}/A$. $\blacksquare$
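The two steps of the argument for Lemma 10.7 (pick a dominant coordinate of $v$, then use the width of the uniform interval along it) can be illustrated with a small simulation. Everything concrete here is an assumption for illustration only: $d=3$, $v=(1,1,1)/\sqrt{3}$, $\bar{\phi}_{at}=0$, $R_{at,j}=1$ and $\epsilon=0.1$; the helper names are hypothetical.

```python
import math
import random


def dominant_coordinate(v):
    """Index j* maximising |v_j|; for a unit vector, |v_{j*}| >= 1/sqrt(d)."""
    return max(range(len(v)), key=lambda j: abs(v[j]))


def uniform_small_ball(v, center, radii, eps, n_draws, seed=0):
    """Monte Carlo estimate of P(|v^T (center + U)| <= eps) with independent
    coordinates U_j ~ Unif[-radii[j], radii[j]]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        proj = sum(vj * (cj + rng.uniform(-rj, rj))
                   for vj, cj, rj in zip(v, center, radii))
        if abs(proj) <= eps:
            hits += 1
    return hits / n_draws


# Assumed illustrative values (not taken from the paper).
d = 3
v = [1.0 / math.sqrt(d)] * d
center, radii, eps = [0.0] * d, [1.0] * d, 0.1
j_star = dominant_coordinate(v)
bound = math.sqrt(d) / radii[j_star] * eps  # sqrt(d) / R_{j*} * eps
estimate = uniform_small_ball(v, center, radii, eps, n_draws=100_000)
print(f"empirical {estimate:.4f}, bound {bound:.4f}")
```

The chosen $v$ is the hardest case for the pigeonhole step, since every coordinate equals $1/\sqrt{d}$ exactly; the bound still holds with room to spare because the other coordinates spread the projection further.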

###### Lemma 10\.9 \(Theorem 3\.1 of Tropp \([2011](https://arxiv.org/html/2606.07392#bib.bib5)\)\)

Let $\mathcal{F}_1\subset\mathcal{F}_2\subset\cdots$ be a filtration and consider a finite sequence $\{X_k\}$ of positive semi-definite matrices with dimension $d$ adapted to this filtration. Suppose that $\lambda_{\max}(X_k)\leq R$ almost surely. Define the series $Y\equiv\sum_k X_k$ and $W\equiv\sum_k\mathbb{E}[X_k\mid\mathcal{F}_{k-1}]$. Then for all $\mu\geq 0$, $\gamma\in[0,1)$,

$$\mathbb{P}\left(\lambda_{\min}(Y)\leq(1-\gamma)\mu,\ \lambda_{\min}(W)\geq\mu\right)\leq d\left(\frac{e^{-\gamma}}{(1-\gamma)^{1-\gamma}}\right)^{\mu/R}.$$

###### Proposition 10\.10\(Minimum eigenvalue\)

Suppose Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds. Then

$$\lambda_{\min}\!\left(\mathbb{E}\!\left[\phi(x_t,\omega_{a_tt})\phi(x_t,\omega_{a_tt})^\top\,\middle|\,\mathcal{F}_{t-1}\right]\right)\geq\frac{1/8}{\mathrm{polylog}(T)}.\tag{69}$$

Further, for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\lambda_{\min}(\Phi_{t-1})\geq\frac{t-1}{16\,\mathrm{polylog}(T)}\tag{70}$$

holds uniformly for all $t\geq 100\bar{C}_\phi^2\,\mathrm{polylog}(T)\log(Td/\delta)$.

###### Proof 10\.11

Proof of Proposition [10.10](https://arxiv.org/html/2606.07392#S10.Thmtheorem10). Fix $v\in\mathbb{S}^{d-1}$. Since $a_t\in[A]$, for any $\epsilon>0$,

$$\{|v^\top\phi(x_t,\omega_{a_tt})|\leq\epsilon\}\subseteq\bigcup_{a=1}^{A}\{|v^\top\phi(x_t,\omega_{at})|\leq\epsilon\}.$$

Therefore, by applying the union bound and Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2),

$$\mathbb{P}\!\left(|v^\top\phi(x_t,\omega_{a_tt})|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)\leq\sum_{a=1}^{A}\mathbb{P}\!\left(|v^\top\phi(x_t,\omega_{at})|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)\leq AM_T\epsilon.$$

Take $\epsilon=(2AM_T)^{-1}$. Then the above inequality implies that

$$\mathbb{P}\!\left(|v^\top\phi(x_t,\omega_{a_tt})|>(2AM_T)^{-1}\,\middle|\,\mathcal{F}_{t-1}\right)\geq\frac{1}{2}.$$

Hence

$$\begin{aligned}v^\top\mathbb{E}[\phi(x_t,\omega_{a_tt})\phi(x_t,\omega_{a_tt})^\top\mid\mathcal{F}_{t-1}]v&=\mathbb{E}[(v^\top\phi(x_t,\omega_{a_tt}))^2\mid\mathcal{F}_{t-1}]\\&\geq(2AM_T)^{-2}\,\mathbb{P}\!\left(|v^\top\phi(x_t,\omega_{a_tt})|>(2AM_T)^{-1}\,\middle|\,\mathcal{F}_{t-1}\right)\\&\geq\frac{1/8}{(AM_T)^2}=\frac{1/8}{\mathrm{polylog}(T)}.\end{aligned}$$

Since this holds for any $v\in\mathbb{S}^{d-1}$, ([69](https://arxiv.org/html/2606.07392#S10.E69)) follows. Setting $\gamma=1/2$, $R=\bar{C}_\phi^2$ and $\mu=(t-1)/[8\,\mathrm{polylog}(T)]$ in Lemma [10.9](https://arxiv.org/html/2606.07392#S10.Thmtheorem9), where this $\mu$ lower-bounds the minimum eigenvalue of the summed conditional second moments by ([69](https://arxiv.org/html/2606.07392#S10.E69)), with probability at least $1-\exp\{\log(d)-0.01(t-1)/[\bar{C}_\phi^2\,\mathrm{polylog}(T)]\}$, $\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]$. In particular, for any $\delta\in(0,1)$, with probability at least $1-\delta$, $\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]$ holds uniformly for all $t\geq 100\bar{C}_\phi^2\,\mathrm{polylog}(T)\log(Td/\delta)$. $\blacksquare$
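The constants $0.01$ and $100$ in the proof of Proposition 10.10 follow from the matrix Chernoff rate at $\gamma=1/2$. The sketch below reproduces that arithmetic; it is a check under assumptions, not part of the paper. In particular, the values of $T$, $d$, $\delta$, $\bar{C}_\phi$ and the choice $\mathrm{polylog}(T)=\log^2 T$ are hypothetical, and the function names are invented for this example.

```python
import math


def chernoff_rate(gamma):
    """Value of -log( e^{-gamma} / (1-gamma)^{1-gamma} ), the decay rate
    per unit of mu/R in the matrix Chernoff bound."""
    return gamma + (1.0 - gamma) * math.log(1.0 - gamma)


def failure_probability(t, d, C_phi, polylog_T, gamma=0.5):
    """d * exp(-rate * mu / R) with mu = (t-1)/(8*polylog_T) and R = C_phi^2."""
    mu = (t - 1) / (8.0 * polylog_T)
    return d * math.exp(-chernoff_rate(gamma) * mu / C_phi ** 2)


# Assumed illustrative values (not taken from the paper).
T, d, delta, C_phi = 1000, 5, 0.05, 1.0
polylog_T = math.log(T) ** 2

# Smallest t whose (t - 1) meets the threshold 100 * C_phi^2 * polylog * log(Td/delta).
t_min = math.ceil(100.0 * C_phi ** 2 * polylog_T * math.log(T * d / delta)) + 1
per_round = failure_probability(t_min, d, C_phi, polylog_T)
print(chernoff_rate(0.5) / 8.0, per_round, delta / T)
```

With $\gamma=1/2$ the rate divided by 8 is roughly $0.019$, which is why the proof can use the looser constant $0.01$; at the threshold, each round fails with probability at most $\delta/T$, so a union bound over at most $T$ rounds gives the stated $1-\delta$.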

## 11 Index Estimation under Unknown Reward Function

The argument for Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) extends to Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2), which gives a high-probability bound on the index estimation error when the reward parameter $\theta_*$ is unknown and learned online. The main difference is that the primitive loss is evaluated at the plug-in estimate $\hat{\theta}_{t-1}$ rather than at the true parameter $\theta_*$. This introduces an additional error term in the optimality inequality:

$$\text{population curvature}\;\leq\;\text{empirical fluctuation}\;+\;\text{plug-in error from }\hat{\theta}_{t-1}.$$

Consequently, Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2) yields the same type of confidence radius for $\rho_a^\top\psi(x_t)$, with an additional additive term that accounts for uncertainty in the reward parameter.

Recall from Section [5.2](https://arxiv.org/html/2606.07392#S5.SS2) that $\hat{\rho}_{at}\in\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\rho)$, where $\hat{\ell}_{as,t}$ is the loss defined in ([22](https://arxiv.org/html/2606.07392#S5.E22)) with $\hat{\theta}_{t-1}$ plugged in for the unknown $\theta_*$. Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) characterizes how the estimation error of $\theta_*$ propagates into the estimation error of $\rho_a^\top\psi(x_t)$.

###### Proof 11\.1

Proof of Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3). Recall from ([24](https://arxiv.org/html/2606.07392#S5.E24)) that

$$J_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{[\hat{\ell}_{as,t}(\rho)-\ell_{as}(\rho)]-[\hat{\ell}_{as,t}(\rho_a)-\ell_{as}(\rho_a)]\right\}.$$

Note that ([8](https://arxiv.org/html/2606.07392#S4.E8)) and ([22](https://arxiv.org/html/2606.07392#S5.E22)) imply that

$$\begin{aligned}&[\hat{\ell}_{as,t}(\rho_a)-\ell_{as}(\rho_a)]-[\hat{\ell}_{as,t}(\hat{\rho}_{at})-\ell_{as}(\hat{\rho}_{at})]\\&=\int_0^{\rho_a^\top\psi(x_s)}\left[(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+-(G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+\right]du\\&\quad-\int_0^{\hat{\rho}_{at}^\top\psi(x_s)}\left[(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+-(G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+\right]du\\&=\int_{\rho_a^\top\psi(x_s)}^{\hat{\rho}_{at}^\top\psi(x_s)}\left[(G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+\right]du\\&\leq_{(i)}\left|G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-G(\theta_*^\top\phi(x_s,\omega_{as}))\right|\cdot|\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)|,\end{aligned}$$

where (i) uses the 1-Lipschitz property of $z\mapsto z^+$. Summing over $s\in\mathcal{S}_{at}$, we obtain

$$\begin{aligned}-J_{at}(\hat{\rho}_{at})&\leq_{(i)}\sqrt{\sum_{s\in\mathcal{S}_{at}}\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\}^2}\sqrt{\sum_{s\in\mathcal{S}_{at}}\left[G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-G(\theta_*^\top\phi(x_s,\omega_{as}))\right]^2}\\&\leq_{(ii)}\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\}^2+\frac{2}{c_0}\sum_{s\in\mathcal{S}_{at}}\left[G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-G(\theta_*^\top\phi(x_s,\omega_{as}))\right]^2\\&\leq_{(iii)}\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\}^2+\frac{2}{c_0}\sum_{s\in\mathcal{S}_{at}}L^2\{(\hat{\theta}_{t-1}-\theta_*)^\top\phi(x_s,\omega_{as})\}^2,\end{aligned}$$

where (i) holds by the Cauchy-Schwarz inequality, (ii) holds by applying $2ab\leq\epsilon a^2+\epsilon^{-1}b^2$ with $\epsilon=\frac{c_0}{4}$, $a=\sqrt{\sum_{s\in\mathcal{S}_{at}}\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\}^2}$ and $b=\sqrt{\sum_{s\in\mathcal{S}_{at}}\left[G(\hat{\theta}_{t-1}^\top\phi(x_s,\omega_{as}))-G(\theta_*^\top\phi(x_s,\omega_{as}))\right]^2}$, and (iii) holds by the Lipschitz property of $G$ according to Assumption [5](https://arxiv.org/html/2606.07392#S5). Hence the result follows. $\blacksquare$
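Two elementary facts carry the proof of Lemma 5.3: the positive part is 1-Lipschitz, so the integrated difference is at most the gap in $G$ values times the length of the integration interval, and $ab\leq\frac{c_0}{8}a^2+\frac{2}{c_0}b^2$. The sketch below checks both numerically. The threshold function $\Lambda(u)=u^2$, the two $G$ values and the integration limits are arbitrary stand-ins chosen for illustration; the paper's $\Lambda$ and $G$ are not specified in this section.

```python
def positive_part(z):
    """z^+ = max(z, 0)."""
    return max(z, 0.0)


def threshold(u):
    """Hypothetical increasing threshold function standing in for Lambda."""
    return u * u


def plug_in_gap(g_hat, g_true, lo, hi, n_grid=10_000):
    """Midpoint-rule value of the integral over [lo, hi] of
    (g_hat - Lambda(u))^+ - (g_true - Lambda(u))^+."""
    step = (hi - lo) / n_grid
    total = 0.0
    for k in range(n_grid):
        u = lo + (k + 0.5) * step
        total += positive_part(g_hat - threshold(u)) - positive_part(g_true - threshold(u))
    return total * step


def young_upper(a, b, c0):
    """Right-hand side (c0/8) * a^2 + (2/c0) * b^2 of step (ii)."""
    return c0 / 8.0 * a * a + 2.0 / c0 * b * b


# Assumed illustrative values: G evaluated at the estimate and at the truth,
# and the two index values that bound the integration interval.
g_hat, g_true, lo, hi = 0.9, 0.6, 0.2, 1.1
gap = plug_in_gap(g_hat, g_true, lo, hi)
lipschitz_cap = abs(g_hat - g_true) * abs(hi - lo)
print(gap, lipschitz_cap, young_upper(2.0, 3.0, c0=0.5))
```

The split in `young_upper` is what lets the first term be absorbed into the curvature term on the left-hand side of the optimality inequality, leaving only the plug-in error on the right.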

###### Proposition 11\.2\(Restatement of Proposition[5\.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4)\)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1), [5](https://arxiv.org/html/2606.07392#S5), [5](https://arxiv.org/html/2606.07392#S5), [5.2](https://arxiv.org/html/2606.07392#S5.SS2) hold. Let $V_{at}(\eta_1)$ be defined as in ([13](https://arxiv.org/html/2606.07392#S4.E13)). Given any $\delta>0$, with probability at least $1-2\delta/3$, uniformly over all $a\in[A]$ and $t\in[T]$,

$$\begin{aligned}|(\hat{\rho}_{at}-\rho_a)^\top\psi(x_t)|\leq\|\psi(x_t)\|_{V_{at}(\eta_1)^{-1}}\Bigg[&\sqrt{\frac{4\Gamma_{at}(\delta)}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)+\frac{20}{\kappa\mu_1}+\frac{1}{2}+\eta_1d_{\mathcal{B}}^2}\\&+\frac{2\sqrt{2}L}{\kappa\mu_1}\sqrt{\bar{C}_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_s,\omega_{as})\|_{\Phi_{t-1}^{-1}}^2}\Bigg],\end{aligned}$$

where

$$\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_\psi T)+\log\left(\left\lceil\log_2\left(1+n_{at}\min\{9\overline{\iota}^2,d_{\mathcal{B}}^2\bar{C}_\psi^2\}\right)\right\rceil+1\right)+\log(6AT/\delta),$$

$$\bar{C}_{at}(\delta):=\frac{4}{\min\{1,\underline{\mu}\}^2}\left(\gamma_0\sqrt{d\log(1+t\bar{C}_\phi^2/\eta_0)+2\log(3/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\right)^2.$$

###### Proof 11\.3

Proof of Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2). For any $s\in[T]$, define $\mathcal{G}_{s-1}:=\sigma(\mathcal{F}_{s-1},x_s)$. Note that

$$\begin{aligned}0&\geq_{(1)}\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\hat{\rho}_{at})-\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\rho_a)\\&=_{(2)}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}\right]+H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at})\\&\geq_{(3)}\frac{\kappa\mu_1}{2}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)]^2-1+H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at}),\end{aligned}\tag{71}$$

where (1) holds because $\hat{\rho}_{at}=\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\rho)$, (2) holds with

$$H_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\right\},$$

$$J_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{[\hat{\ell}_{as,t}(\rho)-\ell_{as}(\rho)]-[\hat{\ell}_{as,t}(\rho_a)-\ell_{as}(\rho_a)]\right\},$$

and (3) follows from Lemma [8.4](https://arxiv.org/html/2606.07392#S8.Thmtheorem4). So ([71](https://arxiv.org/html/2606.07392#S11.E71)) implies that

$$\frac{\kappa\mu_1}{2}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)]^2\leq-(H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at}))+1.\tag{72}$$

Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3) implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$, we have

$$-H_{at}(\hat{\rho}_{at})\leq\frac{\kappa\mu_1}{8}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_a)^\top\psi(x_s)\}^2+\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{\kappa\mu_1}{8},\tag{73}$$

where

$$\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_\psi T)+\log(Q_{at}+1)+\log\left(\frac{6AT}{\delta}\right).$$

Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) implies

$$-J_{at}(\hat{\rho}_{at})\leq\frac{\kappa\mu_1}{8}\sum_{s\in\mathcal{S}_{at}}\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\}^2+\frac{2}{\kappa\mu_1}\sum_{s\in\mathcal{S}_{at}}L^2\{(\hat{\theta}_{t-1}-\theta_*)^\top\phi(x_s,\omega_{as})\}^2.$$

Hence, combining this inequality and ([73](https://arxiv.org/html/2606.07392#S11.E73)), with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, $a\in[A]$, we have

$$\begin{aligned}-(H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at}))&\leq\frac{\kappa\mu_1}{4}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_a)^\top\psi(x_s)\}^2+\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{\kappa\mu_1}{8}\\&\quad+\frac{2L^2}{\kappa\mu_1}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_*)^\top\phi(x_s,\omega_{as})\}^2.\end{aligned}\tag{74}$$

Thus ([72](https://arxiv.org/html/2606.07392#S11.E72)) further implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, we have

$$\begin{aligned}\frac{\kappa\mu_1}{4}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_a)^\top\psi(x_s)\}^2&\leq\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+5+\frac{\kappa\mu_1}{8}\\&\quad+\frac{2L^2}{\kappa\mu_1}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_*)^\top\phi(x_s,\omega_{as})\}^2.\end{aligned}\tag{75}$$

Combining this with the fact that $\eta_1\|\hat{\rho}_{at}-\rho_a\|^2\leq\eta_1d_{\mathcal{B}}^2$, and recalling that

$$V_{at}(\eta_1)=\eta_1\mathbf{I}_m+\sum_{s\in\mathcal{S}_{at}}\psi(x_s)\psi(x_s)^\top,$$

the above inequality implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$, we have

$$\begin{aligned}\|\hat{\rho}_{at}-\rho_a\|_{V_{at}(\eta_1)}^2&\leq\frac{4}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_1}+\frac{1}{2}+\eta_1d_{\mathcal{B}}^2\\&\quad+\frac{8L^2}{(\kappa\mu_1)^2}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_*)^\top\phi(x_s,\omega_{as})\}^2.\end{aligned}\tag{76}$$

Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$,

8L2\(κμ1\)2∑s∈𝒮at\{\(θ^t−1−θ∗\)⊤ϕ\(xs,ωas\)\}2≤8L2\(κμ1\)2C¯at\(δ\)∑s∈𝒮at‖ϕ\(xs,ωas\)‖Φt−1−12,\\frac\{8L^\{2\}\}\{\(\\kappa\\mu\_\{1\}\)^\{2\}\}\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\\{\(\\hat\{\\theta\}\_\{t\-1\}\-\\theta\_\{\*\}\)^\{\\top\}\\phi\(x\_\{s\},\\omega\_\{as\}\)\\\}^\{2\}\\leq\\frac\{8L^\{2\}\}\{\(\\kappa\\mu\_\{1\}\)^\{2\}\}\\bar\{C\}\_\{at\}\(\\delta\)\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\\|\\phi\(x\_\{s\},\\omega\_\{as\}\)\\\|\_\{\\Phi\_\{t\-1\}^\{\-1\}\}^\{2\},\(77\)where

$$\bar{C}_{at}(\delta):=\frac{4}{\min\{1,\underline{\mu}\}^2}\bigg(\gamma_0\sqrt{d\log(1+t\bar{C}_\phi^2/\eta_0)+2\log(3/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\bigg)^2.$$
Thus, with probability at least $1-2\delta/3$, we have

$$\begin{aligned}
\|\hat{\rho}_{at}-\rho_a\|_{V_{at}(\eta_1)} &\leq \sqrt{\frac{4}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_1}+\frac{1}{2}+\eta_1 d_{\mathcal{B}}^2}\\
&\quad+\frac{2\sqrt{2}L}{\kappa\mu_1}\sqrt{\bar{C}_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_s,\omega_{as})\|_{\Phi_{t-1}^{-1}}^2}.
\end{aligned}\tag{78}$$
By the Cauchy-Schwarz inequality, $|(\hat{\rho}_{at}-\rho_a)^\top\psi(x_t)|\leq\|\psi(x_t)\|_{V_{at}(\eta_1)^{-1}}\|\hat{\rho}_{at}-\rho_a\|_{V_{at}(\eta_1)}$, so combining this with ([78](https://arxiv.org/html/2606.07392#S11.E78)), the result follows. $\blacksquare$

## 12 Regret under Unknown Reward Function

###### Proof 12\.1

Proof of Theorem [5.5](https://arxiv.org/html/2606.07392#S5.Thmtheorem5). Let $\mathcal{E}$ denote the event on which the following conditions (i)-(ii) hold:

- (i) Uniformly over all $a\in[A]$, $t\in[T]$,
  $$|(\theta_*-\hat{\theta}_{t-1})^\top\phi(x_t,\omega_{at})|\leq\frac{2}{\min\{\underline{\mu},1\}}\|\phi(x_t,\omega_{at})\|_{\Phi_{t-1}^{-1}}\bigg(\gamma_0\sqrt{d\log(1+t\bar{C}_\phi^2/\eta_0)+2\log(2/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\bigg);$$
- (ii) Uniformly over all $a\in[A]$, $t\in[T]$,
  $$\begin{aligned}
  |(\hat{\rho}_{at}-\rho_a)^\top\psi(x_t)| &\leq \bigg[\sqrt{\frac{4}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_1}+\frac{1}{2}+\eta_1 d_{\mathcal{B}}^2}\\
  &\quad+\frac{2\sqrt{2}L}{\kappa\mu_1}\sqrt{C_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_s,\omega_{as})\|_{\Phi_{t-1}^{-1}}^2}\bigg]\|\psi(x_t)\|_{V_{at}(\eta_1)^{-1}},
  \end{aligned}$$
  where $V_{at}(\eta_1)=\eta_1\mathbf{I}_m+\sum_{s\in\mathcal{S}_{at}}\psi(x_s)\psi(x_s)^\top$, $Q_{at}:=\left\lceil\log_2\left(1+n_{at}\min\{9\overline{\iota}^2,d_{\mathcal{B}}^2\bar{C}_\psi^2\}\right)\right\rceil$,
  $$\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_\psi T)+\log(2Q_{at}+2)+\log(6AT/\delta),$$
  $$C_{at}(\delta):=\frac{4}{\min\{\underline{\mu},1\}^2}\bigg(\gamma_0\sqrt{d\log(1+t\bar{C}_\phi^2/\eta_0)+2\log(6/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\bigg)^2.$$

Thus, on $\mathcal{E}$, both $\widetilde{\mu}_t(x_t,\omega_{at})\geq\mu^*(x_t,\omega_{at})$ and $\widetilde{\sigma}_{at}\geq\sigma_{at}$ hold uniformly over all $a\in[A]$, $t\in[T]$. Moreover, ([21](https://arxiv.org/html/2606.07392#S5.E21)), ([27](https://arxiv.org/html/2606.07392#S5.E27)), Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) and Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2) imply that $\mathcal{E}$ holds with probability at least $1-\delta$. Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) implies

$$\mathbb{E}\left[\sum_{t=1}^T\Delta_t(\tilde{\pi})\,\Big|\,\mathcal{E}\right]\leq\mathbb{E}\left[\sum_{t=1}^T\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu(\theta_*;x_t,\omega_{a_t t})\,\Big|\,\mathcal{E}\right]+\mathbb{E}\left[\sum_{t=1}^T\sum_{a\in\mathcal{A}_t}(\widetilde{\sigma}_{at}-\sigma_{at})\,\Big|\,\mathcal{E}\right].\tag{79}$$
On $\mathcal{E}$, using the fact that

$$\begin{aligned}
&\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu(\theta_*;x_t,\omega_{a_t t})\\
&\quad\leq\left|\widetilde{\mu}_t(x_t,\omega_{a_t t})-G(\hat{\theta}_{t-1}^\top\phi(x_t,\omega_{a_t t}))\right|+\left|G(\hat{\theta}_{t-1}^\top\phi(x_t,\omega_{a_t t}))-\mu(\theta_*;x_t,\omega_{a_t t})\right|,
\end{aligned}$$
and the Lipschitz property of $G$ by Assumption [5](https://arxiv.org/html/2606.07392#S5), we have

$$\begin{aligned}
&\sum_{t=1}^T\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu(\theta_*;x_t,\omega_{a_t t})\\
&\quad\leq\frac{4L}{\min\{\underline{\mu},1\}}\bigg(\gamma_0\sqrt{d\log(1+T\bar{C}_\phi^2/\eta_0)+2\log(2/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\bigg)\sum_{t=1}^T\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}},
\end{aligned}$$
where the confidence radius has been bounded by its value at $t=T$, since it is nondecreasing in $t$. Since $\mathrm{det}(\Phi_t)=\mathrm{det}(\Phi_{t-1})(1+\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2)$ and $\Phi_0=\eta_0\mathbf{I}_d$, we have

$$\begin{aligned}
\log\frac{\mathrm{det}(\Phi_T)}{\mathrm{det}(\Phi_0)} &=\sum_{t=1}^T\log\left(1+\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2\right)\\
&\geq_{(i)}\sum_{t=1}^T\frac{\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2}{1+\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2}\geq_{(ii)}\sum_{t=1}^T\frac{\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2}{1+\bar{C}_\phi^2/\eta_0},
\end{aligned}$$
where (i) uses the fact that $\log(1+a)\geq\frac{a}{1+a}$ for $a\geq 0$ and (ii) uses $\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2\leq\bar{C}_\phi^2/\eta_0$. Hence the above inequality implies that

$$\sum_{t=1}^T\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2\leq(1+\bar{C}_\phi^2/\eta_0)\log\frac{\mathrm{det}(\Phi_T)}{\mathrm{det}(\Phi_0)}\leq_{(1)}(1+\bar{C}_\phi^2/\eta_0)\,d\log\left(1+T\bar{C}_\phi^2/(\eta_0 d)\right),$$
where (1) follows because $\mathrm{det}(\Phi_T)\leq\left(\frac{\mathrm{tr}(\Phi_T)}{d}\right)^d\leq\left(\eta_0+T\bar{C}_\phi^2/d\right)^d$ and $\mathrm{det}(\Phi_0)=\eta_0^d$, so that $\log\frac{\mathrm{det}(\Phi_T)}{\mathrm{det}(\Phi_0)}\leq d\log\left(1+T\bar{C}_\phi^2/(\eta_0 d)\right)$. Thus, by the Cauchy-Schwarz inequality,

$$\sum_{t=1}^T\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}\leq\sqrt{T\sum_{t=1}^T\|\phi(x_t,\omega_{a_t t})\|_{\Phi_{t-1}^{-1}}^2}\leq\sqrt{T(1+\bar{C}_\phi^2/\eta_0)\,d\log\left(1+T\bar{C}_\phi^2/(\eta_0 d)\right)},$$
implying that on $\mathcal{E}$, we have

$$\sum_{t=1}^T\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu(\theta_*;x_t,\omega_{a_t t})\leq(C_\mu\sqrt{d}+4L\gamma_0\sqrt{\eta_0}\,\bar{\alpha})\sqrt{T},\tag{80}$$
where

$$C_\mu=\frac{4L\left[\gamma_0\sqrt{d\log(1+T\bar{C}_\phi^2/\eta_0)+2\log(2/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\right]}{\min\{\underline{\mu},1\}}\sqrt{\left(1+\bar{C}_\phi^2/\eta_0\right)\log\left(1+T\bar{C}_\phi^2/(\eta_0 d)\right)}.\tag{81}$$
Additionally, Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) and Proposition [10.10](https://arxiv.org/html/2606.07392#S10.Thmtheorem10) imply that, with probability at least $1-\delta/3$,

$$\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]$$
holds uniformly for all $t\geq\tilde{\tau}_0:=100\bar{C}_\phi^2\,\mathrm{polylog}(T)\log(6TAd/\delta)$ and $a\in[A]$, under which

$$\sum_{s\in\mathcal{S}_{at}}\|\phi(x_s,\omega_{as})\|_{\Phi_{t-1}^{-1}}^2\leq\frac{16\,\mathrm{polylog}(T)\bar{C}_\phi^2\,n_{at}}{t-1}\leq 16\,\mathrm{polylog}(T)\bar{C}_\phi^2,$$
where the last inequality uses the fact that $n_{at}\leq t$. Hence, combining ([76](https://arxiv.org/html/2606.07392#S11.E76)) and ([77](https://arxiv.org/html/2606.07392#S11.E77)), with probability at least $1-\delta$, uniformly over all $t\geq 100\bar{C}_\phi^2\,\mathrm{polylog}(T)\log(6TAd/\delta)$, we have

$$\begin{aligned}
\|\hat{\rho}_{at}-\rho_a\|_{V_{at}(\eta_1)} &\leq \sqrt{\frac{4}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_1}+\frac{1}{2}+\eta_1 d_{\mathcal{B}}^2}\\
&\quad+\frac{8\sqrt{2}L}{\kappa\mu_1}\bar{C}_\phi\sqrt{\bar{C}_{at}(\delta)\,\mathrm{polylog}(T)}.
\end{aligned}$$
Furthermore, by ([27](https://arxiv.org/html/2606.07392#S5.E27)) and the Lipschitz property of $\Lambda$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), conditional on event $\mathcal{E}$, with probability at least $1-\delta$, uniformly over all $a\in[A]$, $t\geq\tilde{\tau}_0=100\bar{C}_\phi^2\,\mathrm{polylog}(T)\log(6TAd/\delta)$,

$$\widetilde{\sigma}_{at}-\sigma_{at}\leq C_\sigma\|\psi(x_t)\|_{V_{at}(\eta_1)^{-1}},\tag{82}$$
where

$$\begin{aligned}
C_\sigma=2L\bigg[&\sqrt{\frac{4}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}+2\min\{2\overline{\iota},\bar{C}_\psi d_{\mathcal{B}}\}+6\right)\overline{\Gamma}_a(\delta)+\frac{20}{\kappa\mu_1}+\frac{1}{2}+\eta_1 d_{\mathcal{B}}^2}\\
&\quad+\frac{8\sqrt{2}L}{\kappa\mu_1}\bar{C}_\phi\sqrt{\bar{C}_a(\delta)\,\mathrm{polylog}(T)}\bigg],
\end{aligned}\tag{83}$$
and $\overline{\Gamma}_a(\delta)$, $\overline{Q}_a$, $\bar{C}_a(\delta)$ are defined as

$$\overline{\Gamma}_a(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_\psi T)+\log(2\overline{Q}_a+2)+\log\left(\frac{6AT}{\delta}\right),\quad\overline{Q}_a:=\left\lceil\log_2\left(1+T\min\{9\overline{\iota}^2,d_{\mathcal{B}}^2\bar{C}_\psi^2\}\right)\right\rceil,$$
$$\bar{C}_a(\delta):=\frac{4}{\min\{\underline{\mu},1\}^2}\bigg(\gamma_0\sqrt{d\log(1+T\bar{C}_\phi^2/\eta_0)+2\log(6/\delta)}+\sqrt{\eta_0}\,\bar{\alpha}\bigg)^2.$$
Define $\tilde{V}_{a,n_{at}}:=\eta_1\mathbf{I}_m+\sum_{s\in\mathcal{S}_{at}}\psi(x_s)\psi(x_s)^\top=V_{at}(\eta_1)$ and let $\tilde{V}_{a,0}:=\eta_1\mathbf{I}_m$. Note that

$$\mathrm{det}(\tilde{V}_{a,n_{at}})=\mathrm{det}\left(\tilde{V}_{a,n_{at}-1}\right)\left(1+\|\psi(x_{t_a(n_{at})})\|_{\tilde{V}_{a,n_{at}-1}^{-1}}^2\right),$$
where $t_a(n_{at})$ is the period in which $a$ is queried for the $n_{at}$-th time. Thus

$$\begin{aligned}
\log\frac{\mathrm{det}(\tilde{V}_{a,n_{aT}})}{\mathrm{det}(\tilde{V}_{a,0})} &=\sum_{s\in\mathcal{S}_{aT}}\log\left(1+\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2\right)\\
&\geq_{(i)}\sum_{s\in\mathcal{S}_{aT}}\frac{\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2}{1+\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2}\geq_{(ii)}\sum_{s\in\mathcal{S}_{aT}}\frac{\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2}{1+\bar{C}_\psi^2/\eta_1},
\end{aligned}$$
where inequality (i) uses the fact that $\log(1+a)\geq\frac{a}{1+a}$ for $a\geq 0$ and (ii) uses the fact that $\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2\leq\bar{C}_\psi^2/\eta_1$. Hence the above inequality implies that

$$\sum_{s\in\mathcal{S}_{aT}}\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2\leq(1+\bar{C}_\psi^2/\eta_1)\log\frac{\mathrm{det}(\tilde{V}_{a,n_{aT}})}{\mathrm{det}(\tilde{V}_{a,0})}\leq_{(1)}(1+\bar{C}_\psi^2/\eta_1)\,m\log\left(1+n_{aT}\bar{C}_\psi^2/(\eta_1 m)\right),$$
where (1) follows since $\mathrm{det}(\tilde{V}_{a,0})=\eta_1^m$ and $\mathrm{det}(\tilde{V}_{a,n_{aT}})\leq\left(\frac{\mathrm{tr}(\tilde{V}_{a,n_{aT}})}{m}\right)^m\leq\left(\eta_1+\frac{n_{aT}\bar{C}_\psi^2}{m}\right)^m$, so that $\log\frac{\mathrm{det}(\tilde{V}_{a,n_{aT}})}{\mathrm{det}(\tilde{V}_{a,0})}\leq m\log\left(1+\frac{n_{aT}\bar{C}_\psi^2}{\eta_1 m}\right)$. Further, for any $a\in[A]$,

$$\begin{aligned}
\sum_{t=\tilde{\tau}_0}^T\|\psi(x_t)\|_{\tilde{V}_{a,n_{at}}^{-1}}\mathbb{I}\{a\in\mathcal{A}_t\} &\leq\sum_{t=1}^T\|\psi(x_t)\|_{\tilde{V}_{a,n_{at}}^{-1}}\mathbb{I}\{a\in\mathcal{A}_t\}=\sum_{s\in\mathcal{S}_{aT}}\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}\\
&\leq_{(i)}\sqrt{n_{aT}\sum_{s\in\mathcal{S}_{aT}}\|\psi(x_s)\|_{\tilde{V}_{a,n_{as}}^{-1}}^2},
\end{aligned}$$
where inequality (i) follows from the Cauchy-Schwarz inequality. Note that $n_{aT}\leq T$; thus, for any $a\in[A]$,

$$\sum_{t=\tilde{\tau}_0}^T\|\psi(x_t)\|_{\tilde{V}_{a,n_{at}}^{-1}}\mathbb{I}\{a\in\mathcal{A}_t\}\leq\sqrt{T(1+\bar{C}_\psi^2/\eta_1)\,m\log\left(1+T\bar{C}_\psi^2/(\eta_1 m)\right)}.$$
Combining with ([82](https://arxiv.org/html/2606.07392#S12.E82)), conditional on event $\mathcal{E}$, with probability at least $1-\delta$, uniformly over all $a\in[A]$, we have

$$\begin{aligned}
\sum_{t=1}^T(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_t\} &\leq\sum_{t=1}^{\tilde{\tau}_0-1}(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_t\}+\sum_{t=\tilde{\tau}_0}^T(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_t\}\\
&\leq 2(\tilde{\tau}_0-1)+C_\sigma\sqrt{T(1+\bar{C}_\psi^2/\eta_1)\,m\log\left(1+T\bar{C}_\psi^2/(\eta_1 m)\right)},
\end{aligned}$$
where the second inequality follows because, for $t\in[1,\tilde{\tau}_0]$, $\widetilde{\sigma}_{at}-\sigma_{at}\leq 2$ by definition of the indices according to Assumption [5](https://arxiv.org/html/2606.07392#S4.E5). Hence, conditional on event $\mathcal{E}$, with probability at least $1-\delta$,

$$\begin{aligned}
\sum_{t=1}^T\sum_{a\in\mathcal{A}_t}(\widetilde{\sigma}_{at}-\sigma_{at}) &=\sum_{a\in[A]}\sum_{t=1}^T(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_t\}\\
&\leq 2A\tilde{\tau}_0+AC_\sigma\sqrt{T(1+\bar{C}_\psi^2/\eta_1)\,m\log\left(1+T\bar{C}_\psi^2/(\eta_1 m)\right)}.
\end{aligned}$$
Note that, by definition, $\Delta_t(\tilde{\pi})\leq 1+2A$ for any $t\in[T]$. So ([79](https://arxiv.org/html/2606.07392#S12.E79)) and ([80](https://arxiv.org/html/2606.07392#S12.E80)) together further imply that

$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^T\Delta_t(\tilde{\pi})\,\Big|\,\mathcal{E}\right] &\leq(C_\mu\sqrt{d}+4L\gamma_0\sqrt{\eta_0}\,\bar{\alpha})\sqrt{T}+2A\tilde{\tau}_0+T(1+2A)\delta\\
&\quad+AC_\sigma\sqrt{T(1+\bar{C}_\psi^2/\eta_1)\,m\log\left(1+T\bar{C}_\psi^2/(\eta_1 m)\right)}.
\end{aligned}\tag{84}$$
Recall that $\mathcal{E}$ holds with probability at least $1-\delta$. Thus

$$\mathbb{E}\left[\sum_{t=1}^T\Delta_t(\tilde{\pi})\,\mathbb{I}\{\mathcal{E}^c\}\right]\leq T(2A+1)\delta.\tag{85}$$
Thus, setting $\delta=1/\sqrt{T}$, ([84](https://arxiv.org/html/2606.07392#S12.E84)) and ([85](https://arxiv.org/html/2606.07392#S12.E85)) imply that

$$\begin{aligned}
\mathbb{E}\bigg[\sum_{t=1}^T\Delta_t(\tilde{\pi})\bigg] &\leq(C_\mu\sqrt{d}+4L\gamma_0\sqrt{\eta_0}\,\bar{\alpha}+4A+2)\sqrt{T}+2A\tilde{\tau}_0\\
&\quad+AC_\sigma\sqrt{T(1+\bar{C}_\psi^2/\eta_1)\,m\log\left(1+T\bar{C}_\psi^2/(\eta_1 m)\right)}.
\end{aligned}\tag{86}$$
Recall that $C_\mu$ and $C_\sigma$ are defined in ([81](https://arxiv.org/html/2606.07392#S12.E81)) and ([83](https://arxiv.org/html/2606.07392#S12.E83)). So, ignoring logarithmic factors, we have

$$\mathbb{E}\left[\sum_{t=1}^T\Delta_t(\tilde{\pi})\right]\leq\widetilde{O}\left((d+Am+A\sqrt{dm})\sqrt{T}\right).$$
Next, we compute $\eta_0,\eta_1$ that minimize the order of the right-hand side of ([86](https://arxiv.org/html/2606.07392#S12.E86)) (ignoring logarithmic factor dependence). In the following, for functions $f,g:\mathbb{N}\to\mathbb{R}_+$, we write $f(T)=\widetilde{\Theta}(g(T))$ if both $f(T)=\widetilde{O}(g(T))$ and $g(T)=\widetilde{O}(f(T))$ hold.
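The determinant (elliptical potential) argument used earlier in this proof can be checked numerically. The following is a minimal sketch, not part of the paper: it draws hypothetical random feature vectors with norm at most `c_phi`, maintains $\Phi_t^{-1}$ with the Sherman-Morrison rank-one update, and compares $\sum_t\|\phi_t\|_{\Phi_{t-1}^{-1}}^2$ against $(1+\bar{C}_\phi^2/\eta_0)\,d\log(1+T\bar{C}_\phi^2/(\eta_0 d))$. All function names and the sampling scheme are assumptions made for illustration.

```python
import math
import random


def random_features(T, d, c_phi, seed=0):
    """Hypothetical feature vectors in R^d with Euclidean norm at most c_phi."""
    rng = random.Random(seed)
    feats = []
    for _ in range(T):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        scale = c_phi * rng.random() / norm
        feats.append([x * scale for x in v])
    return feats


def elliptical_potential(features, eta0):
    """Return sum_t ||phi_t||^2 in the Phi_{t-1}^{-1} norm, with Phi_0 = eta0 * I."""
    d = len(features[0])
    # Inverse of Phi_0 = eta0 * I.
    inv = [[(1.0 / eta0 if i == j else 0.0) for j in range(d)] for i in range(d)]
    total = 0.0
    for v in features:
        w = [sum(inv[i][j] * v[j] for j in range(d)) for i in range(d)]  # Phi^{-1} v
        q = sum(v[i] * w[i] for i in range(d))  # v^T Phi^{-1} v
        total += q
        # Sherman-Morrison: (Phi + v v^T)^{-1} = Phi^{-1} - w w^T / (1 + q).
        for i in range(d):
            for j in range(d):
                inv[i][j] -= w[i] * w[j] / (1.0 + q)
    return total


def potential_bound(T, d, c_phi, eta0):
    """Right-hand side (1 + c^2/eta0) * d * log(1 + T c^2 / (eta0 d))."""
    return (1.0 + c_phi ** 2 / eta0) * d * math.log(1.0 + T * c_phi ** 2 / (eta0 * d))


if __name__ == "__main__":
    feats = random_features(T=400, d=3, c_phi=1.0, seed=1)
    print(elliptical_potential(feats, eta0=0.5), potential_bound(400, 3, 1.0, 0.5))
```

The same computation applies per API to the matrices $\tilde{V}_{a,n}$, with $\psi$, $\eta_1$, and $m$ in place of $\phi$, $\eta_0$, and $d$.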

Suppressing logarithmic factors, we have $C_\mu=\widetilde{\Theta}\left(\sqrt{1+\frac{1}{\eta_0}}\right)$. Moreover, $\overline{\Gamma}_a(\delta)=\widetilde{\Theta}(m)$ and $\sqrt{\bar{C}_a(\delta)}=\widetilde{\Theta}\left(\sqrt{d}+\sqrt{\eta_0}\right)$. Therefore, $C_\sigma=\widetilde{\Theta}\left(\sqrt{m+\eta_1}+\sqrt{d}+\sqrt{\eta_0}\right)$. Substituting these estimates into the regret bound and dropping additive terms independent of $\eta_0,\eta_1$, the relevant log-free objective is

$$R(\eta_0,\eta_1)\asymp\sqrt{T}\left[d\sqrt{1+\frac{1}{\eta_0}}+\sqrt{\eta_0}+A\sqrt{m}\sqrt{1+\frac{1}{\eta_1}}\left(\sqrt{m+\eta_1}+\sqrt{d}+\sqrt{\eta_0}\right)\right].$$
Since the factor $\sqrt{T}$ is common to all $\eta_0,\eta_1$-dependent terms, it does not affect the minimizer. Define
$$g(\eta_1):=\sqrt{1+\frac{1}{\eta_1}},\qquad b(\eta_1):=\sqrt{m+\eta_1}+\sqrt{d}.$$
Then the relevant objective can be written as
$$r(\eta_0,\eta_1)=d\sqrt{1+\frac{1}{\eta_0}}+\left[1+A\sqrt{m}\,g(\eta_1)\right]\sqrt{\eta_0}+A\sqrt{m}\,g(\eta_1)b(\eta_1).$$
We first optimize $\eta_0$ for fixed $\eta_1$. Let $K(\eta_1):=1+A\sqrt{m}\,g(\eta_1)$. The $\eta_0$-dependent part is $f(\eta_0;\eta_1)=d\sqrt{1+1/\eta_0}+K(\eta_1)\sqrt{\eta_0}$. The first-order condition is
$$-\frac{d}{2\eta_0^2\sqrt{1+1/\eta_0}}+\frac{K(\eta_1)}{2\sqrt{\eta_0}}=0,$$
or equivalently, $K(\eta_1)=\frac{d}{\eta_0\sqrt{1+\eta_0}}$. Let $\rho(\eta_1):=\frac{d}{K(\eta_1)}$. Then the positive solution satisfies

$$\eta_0^\star(\eta_1)\asymp\begin{cases}\rho(\eta_1),&\rho(\eta_1)\lesssim 1,\\ \rho(\eta_1)^{2/3},&\rho(\eta_1)\gtrsim 1.\end{cases}$$
It remains to optimize $\eta_1$. By the envelope theorem,

$$\frac{d}{d\eta_1}r(\eta_0^\star(\eta_1),\eta_1)=A\sqrt{m}\left[g'(\eta_1)\left(b(\eta_1)+\sqrt{\eta_0^\star(\eta_1)}\right)+g(\eta_1)b'(\eta_1)\right].$$
Since $\frac{g'(\eta_1)}{g(\eta_1)}=-\frac{1}{2\eta_1(\eta_1+1)}$ and $b'(\eta_1)=\frac{1}{2\sqrt{m+\eta_1}}$, the first-order condition is equivalent to

$$\eta_1(\eta_1+1)=\sqrt{m+\eta_1}\left[\sqrt{m+\eta_1}+\sqrt{d}+\sqrt{\eta_0^\star(\eta_1)}\right].$$
Thus $\eta_1^2-m=\left[\sqrt{d}+\sqrt{\eta_0^\star(\eta_1)}\right]\sqrt{m+\eta_1}$. We next show that the term $\sqrt{\eta_0^\star(\eta_1)}$ never changes the polynomial order of the $\eta_1$ minimizer. Since $K(\eta_1)\geq 1$, we have $\rho(\eta_1)\leq d$. If $\rho(\eta_1)\lesssim 1$, then $\sqrt{\eta_0^\star(\eta_1)}\asymp\rho(\eta_1)^{1/2}\lesssim 1\lesssim\sqrt{d}$. If $\rho(\eta_1)\gtrsim 1$, then $\sqrt{\eta_0^\star(\eta_1)}\asymp\rho(\eta_1)^{1/3}\lesssim d^{1/3}\lesssim\sqrt{d}$. Therefore, $\sqrt{d}+\sqrt{\eta_0^\star(\eta_1)}\asymp\sqrt{d}$, and the $\eta_1$ balancing equation reduces to $\eta_1^2\asymp m+\sqrt{d}\,\sqrt{m+\eta_1}$. We now solve this equation by considering different regimes.

- If $d\lesssim m$, then at $\eta_1\asymp\sqrt{m}$, $\sqrt{d}\,\sqrt{m+\eta_1}\lesssim m$. Hence $\eta_1^\star\asymp m^{1/2}$.
- If $m\lesssim d\lesssim m^3$, then the solution satisfies $\eta_1\lesssim m$, so $m+\eta_1\asymp m$. Hence $\eta_1^2\asymp\sqrt{dm}$, and therefore $\eta_1^\star\asymp(dm)^{1/4}$.
- If $d\gtrsim m^3$, then the solution satisfies $\eta_1\gtrsim m$, so $m+\eta_1\asymp\eta_1$. Hence $\eta_1^2\asymp\sqrt{d\eta_1}$, and therefore $\eta_1^\star\asymp d^{1/3}$.

Combining the three regimes gives $\eta_1^\star\asymp\max\left\{m^{1/2},(dm)^{1/4},d^{1/3}\right\}$.
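As a sanity check on the three regimes, the balancing equation $\eta_1^2=m+\sqrt{d}\,\sqrt{m+\eta_1}$ (taken here with equality, which is an assumption, since the text only states it up to constants) can be solved by bisection and compared with $\max\{m^{1/2},(dm)^{1/4},d^{1/3}\}$. The sketch below uses hypothetical helper names; it illustrates the order-of-magnitude claim and is not a tuning procedure from the paper.

```python
import math


def eta1_balance_root(d, m, tol=1e-10):
    """Positive root of eta^2 = m + sqrt(d) * sqrt(m + eta), found by bisection."""

    def h(eta):
        return eta * eta - m - math.sqrt(d) * math.sqrt(m + eta)

    lo, hi = 0.0, 1.0
    while h(hi) < 0:  # h(0) < 0 and h grows like eta^2, so a sign change exists
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eta1_order(d, m):
    """Claimed polynomial order max{m^(1/2), (d m)^(1/4), d^(1/3)}."""
    return max(m ** 0.5, (d * m) ** 0.25, d ** (1.0 / 3.0))


if __name__ == "__main__":
    # One (d, m) pair per regime: d <= m, m <= d <= m^3, d >= m^3.
    for d, m in [(4, 100), (50, 10), (10 ** 6, 2)]:
        root = eta1_balance_root(d, m)
        print(d, m, round(root, 3), round(root / eta1_order(d, m), 3))
```

In each regime the ratio of the numerical root to the claimed order stays bounded by a small constant, which is what $\asymp$ asserts.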

At this optimizer, $g(\eta_1^\star)=\Theta(1)$. Therefore, $K(\eta_1^\star)=1+A\sqrt{m}\,g(\eta_1^\star)\asymp A\sqrt{m}$. Substituting this into the expression for $\eta_0^\star$ gives $\rho(\eta_1^\star)=\frac{d}{K(\eta_1^\star)}\asymp\frac{d}{A\sqrt{m}}$. Thus

$$\eta_0^\star\asymp\begin{cases}\dfrac{d}{A\sqrt{m}},&d\lesssim A\sqrt{m},\\[4pt] \left(\dfrac{d}{A\sqrt{m}}\right)^{2/3},&d\gtrsim A\sqrt{m}.\end{cases}$$
Equivalently, $\eta_0^\star\asymp\min\left\{\frac{d}{A\sqrt{m}},\left(\frac{d}{A\sqrt{m}}\right)^{2/3}\right\}$. Consequently, suppressing logarithmic factors, the optimal polynomial choices are
$$\eta_1^\star\asymp\max\left\{m^{1/2},(dm)^{1/4},d^{1/3}\right\},\qquad\eta_0^\star\asymp\min\left\{\frac{d}{A\sqrt{m}},\left(\frac{d}{A\sqrt{m}}\right)^{2/3}\right\}.$$
$\blacksquare$
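The case split for $\eta_0^\star$ comes from the first-order condition $\eta_0\sqrt{1+\eta_0}=\rho$. The sketch below, with hypothetical function names, solves that condition numerically and compares the root with $\min\{\rho,\rho^{2/3}\}$; it also evaluates the closed-form order with $\rho=d/(A\sqrt{m})$, taking the $\asymp$ relation as an equality purely for illustration.

```python
import math


def eta0_foc_root(rho, tol=1e-12):
    """Solve eta * sqrt(1 + eta) = rho for eta > 0 by bisection."""
    lo, hi = 0.0, max(1.0, rho)  # the left-hand side at hi is at least rho
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if mid * math.sqrt(1.0 + mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eta0_order(d, A, m):
    """Claimed order min{rho, rho^(2/3)} with rho = d / (A * sqrt(m))."""
    rho = d / (A * math.sqrt(m))
    return min(rho, rho ** (2.0 / 3.0))


if __name__ == "__main__":
    for rho in [0.01, 1.0, 50.0, 1e4]:
        root = eta0_foc_root(rho)
        print(rho, root, min(rho, rho ** (2.0 / 3.0)) / root)
```

For every $\rho>0$ the ratio printed in the last column lies between $1$ and $\sqrt{2}$, so the two-case formula matches the exact root up to a constant factor.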

ALGORITHM 2: COSMOS Policy Algorithm

Initialization. Observe an initial context $x_0$. Query each box $a\in[A]$ once and record the observed output-cost pairs $\{(\omega_{a0},c_{a0})\}_{a\in[A]}$.

for $t=1,2,\ldots,T$ do

- Observe context $x_t$. Initialize $\mathcal{A}_t=\emptyset$ and $M_t=-\infty$.
- For each $a\in[A]$, compute $\hat{\rho}_{at}$ by ([23](https://arxiv.org/html/2606.07392#S5.E23)) and set $\widetilde{\sigma}_{at}=\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_t)+B_{at}\|\psi(x_t)\|_{V_{at}(\eta_1)^{-1}}\right)$, where $B_{at}$ is defined as in Proposition [5.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) and $V_{at}(\eta_1)$ is defined as in ([13](https://arxiv.org/html/2606.07392#S4.E13)).
- Order the boxes so that $\widetilde{\sigma}_{(1)t}\geq\widetilde{\sigma}_{(2)t}\geq\cdots\geq\widetilde{\sigma}_{(A)t}$. Set $\widetilde{\sigma}_{(A+1)t}:=-\infty$.
- for $k=1,2,\ldots,A$ do
  - Query box $(k)$ and observe its output and cost $(\omega_{(k)t},c_{(k)t})$.
  - Add $(k)$ to the queried set: $\mathcal{A}_t\leftarrow\mathcal{A}_t\cup\{(k)\}$.
  - Compute the optimistic reward estimate $\widetilde{\mu}_t(x_t,\omega_{(k)t})=G\left(\hat{\theta}_{t-1}^{\top}\phi(x_t,\omega_{(k)t})+\beta_t\,\|\phi(x_t,\omega_{(k)t})\|_{\Phi_{t-1}^{-1}}\right)$, where $\Phi_{t-1}$ is defined as in ([20](https://arxiv.org/html/2606.07392#S5.E20)) and $\beta_t$ is as given in Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1).
  - Update $M_t\leftarrow\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$.
  - If $M_t\geq\widetilde{\sigma}_{(k+1)t}$, stop querying and break.
- end for
- Select and deploy $a_t\in\operatorname*{arg\,max}_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$, and observe reward $r_t$.

end for
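The inner loop of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the estimators $\hat{\rho}_{at}$ and $\hat{\theta}_{t-1}$, the widths $B_{at}$ and $\beta_t$, and the link functions are taken as given inputs, and the names `optimistic_index`, `query_and_select`, `query_box`, and `reward_ucb` are hypothetical. The parameter updates after observing $r_t$ are omitted.

```python
import math
import numpy as np

def optimistic_index(rho_hat, psi, B, V_inv, link):
    """Link applied to the point estimate plus the width B * ||psi||_{V^{-1}}."""
    width = B * math.sqrt(float(psi @ V_inv @ psi))
    return link(float(rho_hat @ psi) + width)

def query_and_select(indices, query_box, reward_ucb):
    """One period of the query and selection phases.

    indices    : optimistic reservation indices, one per box
    query_box  : hypothetical callable, box id -> (output, cost)
    reward_ucb : hypothetical callable, output -> optimistic reward estimate
    """
    # Visit boxes in decreasing order of their optimistic indices.
    order = [int(a) for a in np.argsort(-np.asarray(indices, dtype=float))]
    queried, total_cost = [], 0.0
    best_box, best_value = None, -math.inf      # M_t starts at -infinity
    for k, a in enumerate(order):
        output, cost = query_box(a)             # pay the cost, see the output
        queried.append(a)
        total_cost += cost
        value = reward_ucb(output)
        if value > best_value:
            best_box, best_value = a, value
        # Index of the next unopened box; -infinity once every box is open.
        next_index = indices[order[k + 1]] if k + 1 < len(order) else -math.inf
        if best_value >= next_index:
            break                               # stopping rule of the query phase
    return best_box, queried, total_cost

if __name__ == "__main__":
    toy_reward = {0: 0.6, 1: 0.95, 2: 0.8}      # toy optimistic rewards per output
    print(query_and_select([0.9, 0.5, 0.7], lambda a: (a, 0.1), toy_reward.get))
```

In the toy run, box 0 is opened first, its optimistic reward 0.6 falls below the next index 0.7, so box 2 is opened; its value 0.8 exceeds the remaining index 0.5 and querying stops with box 2 deployed. Box 1 is never opened even though its reward estimate would have been highest, which is the cost-saving behavior the index ordering is designed to trade off.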

## References

- Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári (2011) Improved algorithms for linear stochastic bandits. Advances in neural information processing systems 24. Cited by: [§4.2](https://arxiv.org/html/2606.07392#S4.SS2.p4.1), [§4](https://arxiv.org/html/2606.07392#S4.p6.18), [§5.3](https://arxiv.org/html/2606.07392#S5.SS3.p3.10), [§9](https://arxiv.org/html/2606.07392#S9.p1.1).
- A. Agarwal, R. Ghuge, and V. Nagarajan (2024) Semi-bandit learning for monotone stochastic optimization. In 2024 IEEE 65th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1260–1274. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- A. Agrawal, N. Kedia, A. Panwar, J. Mohan, N. Kwatra, B. S. Gulavani, A. Tumanov, and R. Ramjee (2024) Taming throughput-latency tradeoff in llm inference with sarathi-serve. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation. Cited by: [§1](https://arxiv.org/html/2606.07392#S1.p2.1).
- I. Andrews and A. Mikusheva (2022) Optimal decision rules for weak gmm. Econometrica 90(2), pp. 715–748. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- R. Ao, G. Luo, D. Simchi-Levi, and X. Wang (2025) Optimizing llm inference: fluid-guided online scheduling with memory constraints. arXiv preprint arXiv:2504.11320. Cited by: [§1](https://arxiv.org/html/2606.07392#S1.p2.1).
- M. Arellano and S. Bond (1991) Some tests of specification for panel data: monte carlo evidence and an application to employment equations. The review of economic studies 58(2), pp. 277–297. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- A. Atsidakou, C. Caramanis, E. Gergatsouli, O. Papadigenopoulos, and C. Tzamos (2024) Contextual pandora’s box. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 10944–10952. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3), [§1](https://arxiv.org/html/2606.07392#S1.p7.1), [§4.3](https://arxiv.org/html/2606.07392#S4.SS3.p2.2), [§4](https://arxiv.org/html/2606.07392#S4.p3.1), [§4](https://arxiv.org/html/2606.07392#S4.p4.18).
- J. Audibert, R. Munos, and C. Szepesvári (2009) Exploration–exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science 410(19), pp. 1876–1902. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2), pp. 235–256. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- S. Boodaghians, F. Fusco, P. Lazos, and S. Leonardi (2020) Pandora’s box problem with order constraints. In Proceedings of the 21st ACM Conference on Economics and Computation, pp. 439–458. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- G. Chamberlain (1987) Asymptotic efficiency in estimation with conditional moment restrictions. Journal of econometrics 34(3), pp. 305–334. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- S. Chawla, E. Gergatsouli, Y. Teng, C. Tzamos, and R. Zhang (2020) Pandora’s box with correlations: learning and approximation. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 1214–1225. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- L. Chen, M. Zaharia, and J. Zou (2025a) FrugalGPT: how to use large language models while reducing cost and improving performance. Transactions on Machine Learning Research. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1), [§1](https://arxiv.org/html/2606.07392#S1.p3.1), [§1](https://arxiv.org/html/2606.07392#S1.p7.1), [Remark 2.1](https://arxiv.org/html/2606.07392#S2.Thmtheorem1.p1.2.2), [Acknowledgment](https://arxiv.org/html/2606.07392#Sx2.p1.1).
- Y. Chen, S. N. Kirshner, A. Ovchinnikov, M. Andiappan, and T. Jenkin (2025b) A manager and an ai walk into a bar: does chatgpt make biased decisions like we do? Manufacturing & Service Operations Management 27(2), pp. 354–368. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- Z. Chen and J. Chan (2024) Large language model in creative work: the role of collaboration modality and user expertise. Management Science 70(12), pp. 9101–9117. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- Z. Chen, X. Yang, J. Lin, C. Sun, K. C. Chang, and J. Huang (2024) Cascade speculative drafting for even faster llm inference. Advances in Neural Information Processing Systems 37, pp. 86226–86242. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- T. Cheng, C. Dong, J. Gao, and O. Linton (2024) GMM estimation for high-dimensional panel data models. Journal of Econometrics 244(1), pp. 105853. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- W. C. Cheung, W. Ma, D. Simchi-Levi, and X. Wang (2022) Inventory balancing with online learning. Management Science 68(3), pp. 1776–1807. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- Q. Ding, C. Hsieh, and J. Sharpnack (2021) An efficient algorithm for generalized linear bandit: online stochastic gradient descent and thompson sampling. In International Conference on Artificial Intelligence and Statistics, pp. 1585–1593. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p5.1).
- L. Doval (2018) Whether or not to open pandora’s box. Journal of Economic Theory 175, pp. 127–158. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- T. Ezra, M. Feldman, and M. Schlesinger (2026) Contract design for sequential actions. In Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 6537–6570. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- L. Fan and P. W. Glynn (2025) The fragility of optimized bandit algorithms. Operations Research 73(6), pp. 3173–3198. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- C. Fang, X. Li, Z. Fan, J. Xu, K. Nag, E. Korpeoglu, S. Kumar, and K. Achan (2024) Llm-ensemble: optimal large language model ensemble method for e-commerce product attribute value extraction. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2910–2914. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- S. Filippi, O. Cappe, A. Garivier, and C. Szepesvári (2010) Parametric bandits: the generalized linear case. Advances in neural information processing systems 23. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p5.1), [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1), [§5.1](https://arxiv.org/html/2606.07392#S5.SS1.p1.14), [§5.1](https://arxiv.org/html/2606.07392#S5.SS1.p3.4), [§5](https://arxiv.org/html/2606.07392#S5.p3.16), [§5](https://arxiv.org/html/2606.07392#S5.p4.15), [§9](https://arxiv.org/html/2606.07392#S9.p3.1).
- D. A. Freedman (1975) On tail probabilities for martingales. the Annals of Probability, pp. 100–118. Cited by: [§8.2](https://arxiv.org/html/2606.07392#S8.SS2.p3.5), [Lemma 8.6](https://arxiv.org/html/2606.07392#S8.Thmtheorem6).
- H. Fu and T. Lin (2020) Learning utilities and equilibria in non-truthful auctions. Advances in Neural Information Processing Systems 33, pp. 14231–14242. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- X. Gao, S. Jasin, S. Najafi, and H. Zhang (2022) Joint learning and optimization for multi-product pricing (and ranking) under a general cascade click model. Management Science 68(10), pp. 7362–7382. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- A. Garivier and O. Cappé (2011) The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th annual conference on learning theory, pp. 359–376. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- K. Gatmiry, T. Kesselheim, S. Singla, and Y. Wang (2024) Bandit algorithms for prophet inequality and pandora’s box. In Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 462–500. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- E. Gergatsouli and C. Tzamos (2022) Online learning for min sum set cover and pandora’s box. In International Conference on Machine Learning, pp. 7382–7403. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3).
- N. Gupta, H. Narasimhan, W. Jitkrittum, A. S. Rawat, A. K. Menon, and S. Kumar (2024) Language model cascades: token-level uncertainty and beyond. arXiv preprint arXiv:2404.10136. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1), [§1](https://arxiv.org/html/2606.07392#S1.p3.1), [§1](https://arxiv.org/html/2606.07392#S1.p7.1).
- P. Hager, F. Jungmann, R. Holland, K. Bhagat, I. Hubrecht, M. Knauer, J. Vielhauer, M. Makowski, R. Braren, G. Kaissis, et al. (2024) Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nature medicine 30(9), pp. 2613–2622. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- B. E. Hansen and S. Lee (2021) Inference for iterated gmm under misspecification. Econometrica 89(3), pp. 1419–1447. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- L. P. Hansen (1982) Large sample properties of generalized method of moments estimators. Econometrica: Journal of the econometric society, pp. 1029–1054. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- Y. Hao, Z. Qiu, J. Holmes, C. E. Löckenhoff, W. Liu, M. Ghassemi, and S. Kalantari (2025) Large language model integrations in cancer decision-making: a systematic review and meta-analysis. NPJ Digital Medicine 8(1), pp. 450. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- S. N. Hari and M. Thomson (2023) Tryage: real-time, intelligent routing of user prompts to large language models. arXiv preprint arXiv:2308.11601. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- J. Hu, Y. Wang, S. Zhang, K. Zhou, G. Chen, Y. Hu, B. Xiao, and M. Tan (2025) Efficient dynamic ensembling for multiple llm experts. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI, pp. 16–22. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- Q. J. Hu, J. Bieker, X. Li, N. Jiang, B. Keigwin, G. Ranganath, K. Keutzer, and S. K. Upadhyay (2024) Routerbench: a benchmark for multi-llm routing system. arXiv preprint arXiv:2403.12031. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- J. Huang, W. Ma, and Z. Zhou (2026) Optimal bayesian stopping for efficient inference of consistent llm answers. arXiv preprint arXiv:2602.05395. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p3.1).
- P. Jaillet, J. Jiang, K. Mellou, M. Molinaro, C. Podimata, and Z. Zhou (2025) Online scheduling for llm inference with kv cache constraints. arXiv preprint arXiv:2502.07115. Cited by: [§1](https://arxiv.org/html/2606.07392#S1.p2.1).
- D. Jiang, X. Ren, and B. Y. Lin (2023) Llm-blender: ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- K. Jun, A. Bhargava, R. Nowak, and R. Willett (2017) Scalable generalized linear bandits: online computation and hashing. Advances in Neural Information Processing Systems 30. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p5.1).
- W. Kim, K. Lee, and M. C. Paik (2023) Double doubly robust thompson sampling for generalized linear contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, pp. 8300–8307. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p5.1).
- R. Kleinberg, B. Waggoner, and E. G. Weyl (2016) Descending price coordinates approximately efficient search. In Extended abstract in the Proceedings of the 17th ACM Conference on Electronic Commerce (EC’16). Cited by: [§3](https://arxiv.org/html/2606.07392#S3.p2.5).
- B. Kveton, M. Zaheer, C. Szepesvari, L. Li, M. Ghavamzadeh, and C. Boutilier (2020) Randomized exploration in generalized linear bandits. In International Conference on Artificial Intelligence and Statistics, pp. 2066–2076. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p5.1).
- W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles. Cited by: [§1](https://arxiv.org/html/2606.07392#S1.p2.1).
- T. L. Lai and H. Robbins (1985) Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1), pp. 4–22. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: [§4.2](https://arxiv.org/html/2606.07392#S4.SS2.p4.1).
- J. Lee, S. Yun, and K. Jun (2024) A unified confidence sequence for generalized linear models, with applications to bandits. Advances in Neural Information Processing Systems 37, pp. 124640–124685. Cited by: [§5](https://arxiv.org/html/2606.07392#S5.p3.16), [§5](https://arxiv.org/html/2606.07392#S5.p4.15).
- G. Li, J. Liang, M. Liu, Y. Lei, S. Jasin, F. Yang, and P. Baxi (2026) Asymptotically optimal sequential testing with heterogeneous llms. arXiv preprint arXiv:2604.01086. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p3.1).
- L. Li, Y. Lu, and D. Zhou (2017) Provably optimal algorithms for generalized linear contextual bandits. In International Conference on Machine Learning, pp. 2071–2080. Cited by: [§5](https://arxiv.org/html/2606.07392#S5.p3.16), [§5](https://arxiv.org/html/2606.07392#S5.p4.15).
- X. Lin and L. Lee (2010) GMM estimation of spatial autoregressive models with unknown heteroskedasticity. Journal of Econometrics 157(1), pp. 34–52. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- J. Liu, Z. Chen, K. Wang, H. Luo, and L. J. Ratliff (2025) Improved regret and contextual linear extension for pandora’s box and prophet inequality. arXiv preprint arXiv:2505.18828. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3), [§1](https://arxiv.org/html/2606.07392#S1.p7.1).
- K. Lu, H. Yuan, R. Lin, J. Lin, Z. Yuan, C. Zhou, and J. Zhou (2024) Routing to the expert: efficient reward-guided ensemble of large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1964–1974. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- W. Mao, K. Zhang, R. Zhu, D. Simchi-Levi, and T. Başar (2025) Model-free nonstationary reinforcement learning: near-optimal regret and applications in multiagent reinforcement learning and inventory control. Management Science 71(2), pp. 1564–1580. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- K. Mei, W. Xu, M. Guo, S. Lin, and Y. Zhang (2025) Omnirouter: budget and performance controllable multi-llm routing. ACM SIGKDD Explorations Newsletter 27(2), pp. 107–116. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- W. K. Newey and D. McFadden (1994) Large sample estimation and hypothesis testing. Handbook of econometrics 4, pp. 2111–2245. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- L. Nie, Z. Ding, E. Hu, C. Jermaine, and S. Chaudhuri (2024) Online cascade learning for efficient inference over streams. arXiv preprint arXiv:2402.04513. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- S. Rakhlin and K. Sridharan (2013) Optimization, learning, and games with predictable sequences. Advances in Neural Information Processing Systems 26. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- M. Reisenbichler, T. Reutterer, and D. A. Schweidel (2025) Applying large language models to sponsored search advertising. Marketing Science. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- P. Rusmevichientong and J. N. Tsitsiklis (2010) Linearly parameterized bandits. Mathematics of Operations Research 35(2), pp. 395–411. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p6.1).
- M. Šakota, M. Peyrard, and R. West (2024) Fly-swat or cannon? cost-effective language model choice via meta-modeling. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 606–615. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- T. Shnitzer, A. Ou, M. Silva, K. Soule, Y. Sun, J. Solomon, N. Thompson, and M. Yurochkin (2023) Large language model routing with benchmark datasets. In First Conference on Language Modeling. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1).
- D. Simchi-Levi, K. Mellou, I. Menache, and J. Pathuri (2026) Large language models for supply chain decisions. In AI in Supply Chains: Perspectives from Global Thought Leaders, pp. 93–104. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- A. J. Thirunavukarasu, D. S. J. Ting, K. Elangovan, L. Gutierrez, T. F. Tan, and D. S. W. Ting (2023) Large language models in medicine. Nature medicine 29(8), pp. 1930–1940. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- J. A. Tropp (2011) User-friendly tail bounds for matrix martingales. Technical report. Cited by: [Lemma 10.9](https://arxiv.org/html/2606.07392#S10.Thmtheorem9).
- M. L. Weitzman (1979) Optimal search for the best alternative. Econometrica 47(3). Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p4.3), [§1](https://arxiv.org/html/2606.07392#S1.p5.1), [§2](https://arxiv.org/html/2606.07392#S2.p3.21), [§3](https://arxiv.org/html/2606.07392#S3.p2.5).
- Q. Yang, M. Ongpin, S. Nikolenko, A. Huang, and A. Farseev (2023) Against opacity: explainable ai and large language models for effective digital advertising. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 9299–9305. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p7.1).
- G. Yu, J. S. Jeong, G. Kim, S. Kim, and B. Chun (2022) Orca: a distributed serving system for transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, pp. 521–538. Cited by: [§1](https://arxiv.org/html/2606.07392#S1.p2.1).
- M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao (2024) Large language model cascades with mixture of thought representations for cost-efficient reasoning. In The Twelfth International Conference on Learning Representations. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1), [§1](https://arxiv.org/html/2606.07392#S1.p3.1), [§1](https://arxiv.org/html/2606.07392#S1.p7.1).
- X. Zhang, Z. Huang, E. O. Taga, C. Joe-Wong, S. Oymak, and J. Chen (2024) Efficient contextual llm cascades through budget-constrained policy learning. Advances in Neural Information Processing Systems 37, pp. 91691–91722. Cited by: [§1.2](https://arxiv.org/html/2606.07392#S1.SS2.p2.1), [§1](https://arxiv.org/html/2606.07392#S1.p7.1).
- Z. Zhou, R. Xu, and J. Blanchet (2019) Learning in generalized linear contextual bandits with stochastic delays. Advances in Neural Information Processing Systems 32. Cited by: [§5](https://arxiv.org/html/2606.07392#S5.p4.15).