Online Pandora's Box for Contextual LLM Cascading

arXiv cs.AI Papers

Summary

This paper introduces an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs, proposing a learning approach that combines GMM estimation with UCB-style confidence bounds and proving dimension-dependent regret bounds.

arXiv:2606.07392v1 Announce Type: new Abstract: Motivated by Large Language Model (LLM) cascading, we propose an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs. In each period, a decision-maker observes a request context and faces a two-phase decision problem. In the query phase, the decision-maker sequentially queries APIs, where each query reveals a generated output and the decision-maker incurs an (output-dependent) cost. In the selection phase, the decision-maker selects one of the generated outputs to deploy and observes only the downstream reward of the deployed output. This output-mediated feedback structure differs from classical online contextual Pandora's Box models, in which opening a box directly reveals its reward. Rather than estimating the full conditional output and cost distributions of each API, we directly model the reservation index and develop a learning approach for the query phase. Specifically, we impose a parametric structure on the contextual reservation index functions induced by the classical Weitzman's policy. Our policy combines generalized method of moments (GMM) type estimation of these reservation indices with UCB-style confidence bounds for both these indices and the shared output-level reward evaluator. Under regularity conditions, we prove that the resulting policy achieves dimension-dependent $\widetilde O(\sqrt T)$ cumulative regret over a horizon of $T$ periods.

# 1 Introduction
Source: [https://arxiv.org/html/2606.07392](https://arxiv.org/html/2606.07392)
Alexandre Belloni, The Fuqua School of Business, Duke University, abn5@duke.edu

Yan Chen, The Fuqua School of Business, Duke University, yc555@duke.edu

Yehua Wei, The Fuqua School of Business, Duke University, yehua.wei@duke.edu


The proliferation of Large Language Models \(LLMs\) has transformed the economics of content generation and decision support\. Firms increasingly use generative AI systems to generate advertisements, produce code, serve customer requests, and conduct complex analytical tasks\. In these applications, the operational challenge is often not whether content can be generated, but how to generate high\-quality content reliably and cost\-effectively at scale\. A firm that produces a high volume of advertisements per day, for example, may have access to a portfolio of LLM APIs that differ in cost and output quality across task types\. More capable proprietary models may, in general, deliver higher\-quality outputs but incur substantially higher costs, while smaller or specialized models may be cheaper but less reliable\. Because model performance varies across request contexts and it is difficult to predict the quality of the outputs a priori, committing to a single API is often suboptimal\. Similarly, querying all of the available APIs is typically prohibitively costly\.

We study this problem from the perspective of an organization that uses external or internal LLM APIs to serve a stream of business requests. In such applications, requests arrive as distinct decision units without a strict latency requirement, allowing the decision-maker to query more than one API before selecting a final output. This request-level decision problem differs fundamentally from the model hosting problem faced by large LLM platforms, where a key operational challenge is to reduce latency through batching, scheduling, and load management (e.g., Yu et al. [2022](https://arxiv.org/html/2606.07392#bib.bib119), Kwon et al. [2023](https://arxiv.org/html/2606.07392#bib.bib120), Agrawal et al. [2024](https://arxiv.org/html/2606.07392#bib.bib121), Ao et al. [2025](https://arxiv.org/html/2606.07392#bib.bib117), Jaillet et al. [2025](https://arxiv.org/html/2606.07392#bib.bib116)). Our focus is instead on the API consumer’s decision problem of dynamically querying a portfolio of APIs to secure high-quality outputs in a cost-effective manner.

A prominent approach for managing this cost-quality trade-off is *LLM cascading* (see, e.g., Chen et al. [2025a](https://arxiv.org/html/2606.07392#bib.bib68), Yue et al. [2024](https://arxiv.org/html/2606.07392#bib.bib77), Gupta et al. [2024](https://arxiv.org/html/2606.07392#bib.bib115)). Rather than sending every request directly to a single expensive model, a cascade queries LLM APIs sequentially, typically beginning with lower-cost models and escalating only when earlier outputs appear insufficiently reliable. The pioneering work of Chen et al. ([2025a](https://arxiv.org/html/2606.07392#bib.bib68)), for example, illustrates this design by routing a query through a sequence of LLM APIs and stopping once a generated response receives a sufficiently high reliability score from an evaluator. Cascading allows easy requests to be handled by cheaper APIs, with only difficult requests escalated to expensive ones. Empirical evidence from Chen et al. ([2025a](https://arxiv.org/html/2606.07392#bib.bib68)) suggests that such cascades can substantially reduce inference costs while preserving, and in some cases improving, task performance.
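The threshold-based cascade described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the implementation of Chen et al.; the `Api` class, the `threshold_cascade` function, the toy generators, and the scoring rule are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Api:
    """A hypothetical LLM API: a name, a per-call cost, and a generator."""
    name: str
    cost: float
    generate: Callable[[str], str]


def threshold_cascade(
    request: str,
    apis: List[Api],
    score: Callable[[str, str], float],
    threshold: float,
) -> Tuple[str, str, float]:
    """Query APIs in the given order and stop at the first output whose
    evaluator score reaches the threshold. If no output does, fall back to
    the best-scoring output seen. Returns (api name, output, total cost)."""
    if not apis:
        raise ValueError("apis must be non-empty")
    total_cost = 0.0
    best_score, best_name, best_output = float("-inf"), "", ""
    for api in apis:
        output = api.generate(request)
        total_cost += api.cost
        s = score(request, output)
        if s > best_score:
            best_score, best_name, best_output = s, api.name, output
        if s >= threshold:
            break  # reliable enough: do not escalate further
    return best_name, best_output, total_cost


# Toy stand-ins (assumptions): the cheap model only handles short requests.
cheap = Api("cheap", 0.01, lambda req: "ok" if len(req) < 20 else "unsure")
strong = Api("strong", 0.20, lambda req: "ok")


def toy_score(request: str, output: str) -> float:
    return 1.0 if output == "ok" else 0.2


easy = threshold_cascade("short request", [cheap, strong], toy_score, 0.9)
hard = threshold_cascade(
    "a much longer and harder request", [cheap, strong], toy_score, 0.9
)
```

In this toy run the easy request stops at the cheap API, while the hard request pays for both calls and deploys the stronger model's output.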

The LLM cascading system raises a fundamental sequential decision problem. For each incoming request, the decision-maker must manage both a query phase and a selection phase. In the query phase, the decision-maker must decide which API to query next and whether the generated outputs justify continuing the search. In the selection phase, the decision-maker must decide which of the generated outputs to deploy. These decisions are tightly coupled. Querying another API is costly, but it may reveal a better output; stopping the query phase early saves cost, but risks forgoing a better output; and selecting among queried outputs requires evaluating their downstream value. Thus, LLM cascading is not simply a model-routing problem, but a sequential search problem with costly information acquisition.

A natural starting point for principled analysis of this search problem is the classical Pandora’s Box problem. Consider an idealized benchmark in which, for each request, the decision-maker knows the context-dependent distribution of the output-cost pair generated by each API and can evaluate the downstream value of any generated output once it is observed. In this benchmark, querying an API corresponds to opening a box, the inference cost acts as the inspection cost, and the downstream value of the generated output serves as the realized reward. For this formulation, the seminal work of Weitzman ([1979](https://arxiv.org/html/2606.07392#bib.bib57)) characterizes the structure of the optimal policy. Specifically, given the context of a request, the policy assigns each API a reservation index, queries APIs in decreasing order of these indices, and terminates the search as soon as the best reward among the outputs observed so far exceeds the highest remaining index.
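As a rough illustration of this index policy, the sketch below uses the textbook form of Weitzman's reservation index, in which the index $s$ of a box with random reward $V$ and fixed inspection cost $c$ solves $\mathbb{E}[\max(V - s, 0)] = c$. The discrete reward distributions, the fixed (output-independent) costs, and the function names `reservation_index` and `weitzman_search` are simplifying assumptions made here. The paper's own setting has output-dependent costs and is not reproduced.

```python
from typing import Callable, Dict, List, Optional, Tuple

Dist = List[Tuple[float, float]]  # [(reward value, probability), ...]


def reservation_index(dist: Dist, cost: float, iters: int = 60) -> float:
    """Solve E[max(V - s, 0)] = cost for s by bisection."""
    def expected_gain(s: float) -> float:
        return sum(p * max(v - s, 0.0) for v, p in dist)

    lo = min(v for v, _ in dist) - cost  # expected_gain(lo) >= cost
    hi = max(v for v, _ in dist)         # expected_gain(hi) = 0 < cost
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if expected_gain(mid) > cost:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def weitzman_search(
    indices: Dict[str, float], open_box: Callable[[str], float]
) -> Tuple[List[str], Optional[str], float]:
    """Open boxes in decreasing index order; stop once the best reward seen
    is at least the largest index among the unopened boxes."""
    order = sorted(indices, key=lambda b: indices[b], reverse=True)
    opened: List[str] = []
    best_box: Optional[str] = None
    best_reward = float("-inf")  # forces at least one box to be opened
    for box in order:
        if best_reward >= indices[box]:
            break
        reward = open_box(box)
        opened.append(box)
        if reward > best_reward:
            best_box, best_reward = box, reward
    return opened, best_box, best_reward


# Hypothetical boxes: (reward distribution, inspection cost).
boxes = {
    "small": ([(0.2, 0.5), (0.8, 0.5)], 0.05),
    "large": ([(0.9, 1.0)], 0.30),
}
indices = {name: reservation_index(d, c) for name, (d, c) in boxes.items()}

good_draw = {"small": 0.8, "large": 0.9}
bad_draw = {"small": 0.2, "large": 0.9}
lucky = weitzman_search(indices, lambda b: good_draw[b])
unlucky = weitzman_search(indices, lambda b: bad_draw[b])
```

Here the cheap box has the higher index and is opened first. A good draw from it ends the search immediately, while a poor draw triggers escalation to the expensive box.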

Real applications of LLM cascading, however, depart from this oracle benchmark in ways that make learning central\. The value of querying an API is inherently contextual, and the primitives required by the oracle Weitzman’s policy are rarely known in advance\. In some settings, firms may have substantial historical data on prior human\- or AI\-generated outputs and their downstream rewards, allowing them to construct an accurate offline model of the reward function\. In many others, however, such output\-reward pairs are limited or unavailable\. Moreover, even when rewards can be modeled from offline data, a firm deploying LLM APIs typically does not know the conditional distribution of outputs and costs generated by each API under a given request context\. Since requests arrive sequentially over time, the decision\-maker must learn context\-dependent reservation indices while continuing to make cost\-sensitive query and selection decisions\. This naturally leads to an online contextual Pandora’s Box formulation of LLM cascading\.

Recent work on LLM cascading has developed increasingly sophisticated methods for balancing cost and performance. These include budget-constrained cascade policies, uncertainty- or threshold-based deferral rules, and hybrid methods that combine cascading with model routing (Chen et al. [2025a](https://arxiv.org/html/2606.07392#bib.bib68), Yue et al. [2024](https://arxiv.org/html/2606.07392#bib.bib77), Gupta et al. [2024](https://arxiv.org/html/2606.07392#bib.bib115), Zhang et al. [2024](https://arxiv.org/html/2606.07392#bib.bib74)). While these approaches provide important insights into efficient multi-LLM deployment, they rely primarily on heuristic designs. To the best of our knowledge, existing work has not yet formulated LLM cascading as an online contextual Pandora’s Box problem with joint query and selection, nor developed formal regret guarantees. Conversely, the existing Pandora’s Box literature is not directly tailored to this LLM setting. A prominent line of work takes distribution-learning approaches (Liu et al. [2025](https://arxiv.org/html/2606.07392#bib.bib54)), in which opening a box reveals a scalar reward and the learner estimates how the reward distribution varies with context, under parametric structure on the full distribution. Closer to our work, Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) parameterize contextual reservation indices directly, but their model remains a classical Pandora’s Box setting in which opened boxes reveal rewards directly. In LLM cascading, by contrast, an API call reveals a stochastic output-cost pair. In our model, the downstream value depends on the request and the generated output, while API-specific heterogeneity enters through the distribution of outputs and costs.

### 1.1 Our Contributions

We make three main contributions\. First, we introduce an online contextual Pandora’s Box model motivated by LLM cascading\. Unlike existing online Pandora’s Box models, our model features a natural two\-phase decision structure: querying an API reveals a stochastic output and incurs a cost, while the downstream reward is observed only after one generated output is selected and deployed\. This formulation captures a key practical feature of LLM cascading: API\-specific heterogeneity enters through the distributions of generated outputs and costs, while the downstream value of a generated output is evaluated through a shared reward model on the request\-output pair\. That is, an API can influence the downstream value only through the generated output\.

Second, we propose a direct index modeling and learning approach for the query phase\. In the full\-information benchmark, Weitzman’s policy is governed by contextual reservation indices, which are induced by each API’s conditional output and cost distributions together with the reward evaluator\. Rather than estimating these distributions directly, we impose a parametric structure on the reservation index functions themselves as a generalized linear function of observable context features\. This modeling assumption leads to a practical optimism\-based learning approach in the LLM cascading setting, as the decision\-maker can focus on constructing optimistic box\-specific reservation indices from queried output\-cost samples through the moment equations instead of the full conditional output distributions\. The policy also learns the shared output\-level reward evaluator from deployed\-output rewards, and the two learned objects jointly determine the cascade’s querying and selection decisions\.

Third, we provide regret guarantees for the resulting online policy. The policy combines *generalized method of moments* (GMM) estimation for reservation indices with *upper confidence bound* (UCB)-style approaches for both the indices and rewards. Our policy is motivated by a simple but crucial regret decomposition under optimism: cumulative regret separates into errors from estimating reservation indices for only the *queried APIs* and errors from evaluating generated outputs for only the *selected APIs*. This decomposition allows us to apply the UCB-style analysis to our moment estimation problem, yielding a dimension-dependent $\widetilde{O}(\sqrt{T})$ cumulative regret bound. The analysis covers the practically important known-evaluator regime, where offline data provide an accurate output-level reward model, as well as the full regime where the evaluator and reservation indices must be learned jointly online.

### 1.2 Related Literature

Our work is related to the literature on cost\-aware LLM systems, sequential test\-time inference, and the theoretical frameworks of sequential search and online learning\.

First, our paper is closely related to the emerging literature on LLM cascading and cost-aware multi-LLM deployment. This literature studies how to allocate inference requests across multiple LLM APIs when models differ in cost, latency, and output quality. Representative approaches include FrugalGPT (Chen et al. [2025a](https://arxiv.org/html/2606.07392#bib.bib68)), budget-constrained cascade policies, uncertainty- or reliability-based deferral rules, and hybrid methods that combine cascading with model routing (Nie et al. [2024](https://arxiv.org/html/2606.07392#bib.bib71), Chen et al. [2024](https://arxiv.org/html/2606.07392#bib.bib73), Zhang et al. [2024](https://arxiv.org/html/2606.07392#bib.bib74), Yue et al. [2024](https://arxiv.org/html/2606.07392#bib.bib77), Gupta et al. [2024](https://arxiv.org/html/2606.07392#bib.bib115)). Our work contributes to this literature by formulating LLM cascading as an online contextual Pandora’s Box problem, where the decision-maker sequentially decides which APIs to query and when to stop under unknown, context-dependent output and cost distributions. Our work is also related to LLM routing and ensembling methods, although these methods address decision problems different from the one that motivates our LLM cascading model. Routing methods usually choose a single LLM for each request before observing any generated output, either through non-predictive rules or through predicted quality, reward, or utility scores (Hu et al. [2024](https://arxiv.org/html/2606.07392#bib.bib69), Shnitzer et al. [2023](https://arxiv.org/html/2606.07392#bib.bib81), Hari and Thomson [2023](https://arxiv.org/html/2606.07392#bib.bib82), Lu et al. [2024](https://arxiv.org/html/2606.07392#bib.bib83), Šakota et al. [2024](https://arxiv.org/html/2606.07392#bib.bib84), Mei et al. [2025](https://arxiv.org/html/2606.07392#bib.bib72)). Such methods capture one-shot model choice but not adaptive within-request search. Ensemble methods aggregate or synthesize outputs from multiple LLMs, such as by ranking and fusing candidate responses (Jiang et al. [2023](https://arxiv.org/html/2606.07392#bib.bib70), Fang et al. [2024](https://arxiv.org/html/2606.07392#bib.bib75), Hu et al. [2025](https://arxiv.org/html/2606.07392#bib.bib76)); they focus on output aggregation rather than sequential stopping under query costs.

Second, our paper is related to recent work on sequential testing and sequential stopping for LLM inference. Huang et al. ([2026](https://arxiv.org/html/2606.07392#bib.bib123)) study adaptive self-consistency for a single LLM, where the system sequentially samples reasoning paths and stops once the most-likely answer can be identified with high posterior confidence. Li et al. ([2026](https://arxiv.org/html/2606.07392#bib.bib124)) study Bayesian sequential testing with heterogeneous LLMs, where the decision-maker adaptively chooses which LLM to query and stops once the posterior confidence for a hypothesis crosses a target threshold. These papers formulate LLM inference as posterior-driven sequential testing, whereas our LLM cascading model focuses on contextual sequential search over realized output-cost pairs, with stopping governed by reservation indices and learning focused on both the index functions and the output-level reward model.

Third, the query-phase component of our model extends the contextual Pandora’s Box problem. The classic Pandora’s Box problem without contexts originates from the seminal work of Weitzman ([1979](https://arxiv.org/html/2606.07392#bib.bib57)) and has since been studied under a variety of extensions (e.g., Doval [2018](https://arxiv.org/html/2606.07392#bib.bib85), Chawla et al. [2020](https://arxiv.org/html/2606.07392#bib.bib86), Fu and Lin [2020](https://arxiv.org/html/2606.07392#bib.bib92), Boodaghians et al. [2020](https://arxiv.org/html/2606.07392#bib.bib89), Agarwal et al. [2024](https://arxiv.org/html/2606.07392#bib.bib55), Ezra et al. [2026](https://arxiv.org/html/2606.07392#bib.bib90)), including online variants (Gergatsouli and Tzamos [2022](https://arxiv.org/html/2606.07392#bib.bib91), Atsidakou et al. [2024](https://arxiv.org/html/2606.07392#bib.bib20)). In particular, Gergatsouli and Tzamos ([2022](https://arxiv.org/html/2606.07392#bib.bib91)) study an adversarial online Pandora’s Box setting in which box rewards are chosen by an adversary. Gatmiry et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib59)) subsequently show that, in this setting, no algorithm can achieve sublinear regret against the optimal policy, even with full-information feedback. For the online contextual Pandora’s Box problem, Liu et al. ([2025](https://arxiv.org/html/2606.07392#bib.bib54)) obtain a regret bound of $\widetilde{O}(\sqrt{T})$ under the assumption that the reward distribution of each box is a location shift of a context-invariant noise distribution, but their results do not apply to output distributions that are heterogeneous across contexts within each box, a key feature of LLM cascading applications. Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) study a contextual Pandora’s Box model under adversarially arriving contexts, where the optimal reservation index is parametrized as a linear function of the observed context. In this general setting, they obtain an $\widetilde{O}(T^{5/6})$ regret bound by reducing the learning problem to online linear regression. Our approach builds on a model similar to that of Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) by imposing a parametric reservation index structure, but adopts a different learning approach through a combination of GMM and UCB analysis. This allows us to propose an optimism-based learning algorithm with an $\widetilde{O}(\sqrt{T})$ regret bound under suitable regularity assumptions.
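To give a rough sense of the gap between the two rates mentioned above, the following back-of-the-envelope calculation ignores constants, logarithmic factors, and dimension dependence. The horizon $T = 10^{6}$ is an arbitrary illustrative choice, and the two bounds hold under different assumptions, so this is a comparison of orders of growth only.

```latex
T = 10^{6}:\qquad
\sqrt{T} = 10^{3}, \qquad T^{5/6} = 10^{5},
\qquad\text{so}\qquad
\frac{\sqrt{T}}{T} = 10^{-3}
\quad\text{versus}\quad
\frac{T^{5/6}}{T} = 10^{-1}.
```

That is, at this horizon the average per-period regret implied by a $\sqrt{T}$ rate is about two orders of magnitude smaller than that implied by a $T^{5/6}$ rate.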

Further, the selection phase of our model draws on the literature on generalized linear contextual bandits. This line of work originates from the seminal paper of Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)), which introduces an optimism-based algorithm for generalized linear bandits, and has since been extended in various directions (e.g., Jun et al. [2017](https://arxiv.org/html/2606.07392#bib.bib19), Kveton et al. [2020](https://arxiv.org/html/2606.07392#bib.bib93), Ding et al. [2021](https://arxiv.org/html/2606.07392#bib.bib94), Kim et al. [2023](https://arxiv.org/html/2606.07392#bib.bib95)).

Finally, our learning algorithm combines GMM with a UCB-type algorithm. GMM originates from the seminal work of Hansen ([1982](https://arxiv.org/html/2606.07392#bib.bib96)) and has been widely developed and applied in econometrics (e.g., Newey and McFadden [1994](https://arxiv.org/html/2606.07392#bib.bib11), Arellano and Bond [1991](https://arxiv.org/html/2606.07392#bib.bib97), Chamberlain [1987](https://arxiv.org/html/2606.07392#bib.bib98), Lin and Lee [2010](https://arxiv.org/html/2606.07392#bib.bib100), Cheng et al. [2024](https://arxiv.org/html/2606.07392#bib.bib99), Andrews and Mikusheva [2022](https://arxiv.org/html/2606.07392#bib.bib101), Hansen and Lee [2021](https://arxiv.org/html/2606.07392#bib.bib102)). UCB-type algorithms trace back to the seminal paper of Lai and Robbins ([1985](https://arxiv.org/html/2606.07392#bib.bib103)) and have since been extensively studied in the bandit literature (e.g., Auer et al. [2002](https://arxiv.org/html/2606.07392#bib.bib104), Rusmevichientong and Tsitsiklis [2010](https://arxiv.org/html/2606.07392#bib.bib17), Filippi et al. [2010](https://arxiv.org/html/2606.07392#bib.bib15), Fan and Glynn [2025](https://arxiv.org/html/2606.07392#bib.bib106), Garivier and Cappé [2011](https://arxiv.org/html/2606.07392#bib.bib107), Audibert et al. [2009](https://arxiv.org/html/2606.07392#bib.bib108)), as well as in broader online learning settings and applications in operations management (e.g., Rakhlin and Sridharan [2013](https://arxiv.org/html/2606.07392#bib.bib109), Gao et al. [2022](https://arxiv.org/html/2606.07392#bib.bib111), Cheung et al. [2022](https://arxiv.org/html/2606.07392#bib.bib112), Mao et al. [2025](https://arxiv.org/html/2606.07392#bib.bib110)).

In addition, our paper belongs to the growing literature on LLM-assisted decision making across a range of fields, including operations management (e.g., Chen and Chan [2024](https://arxiv.org/html/2606.07392#bib.bib64), Chen et al. [2025b](https://arxiv.org/html/2606.07392#bib.bib63), Simchi-Levi et al. [2026](https://arxiv.org/html/2606.07392#bib.bib65)), advertising (e.g., Yang et al. [2023](https://arxiv.org/html/2606.07392#bib.bib66), Reisenbichler et al. [2025](https://arxiv.org/html/2606.07392#bib.bib67)), and healthcare (e.g., Thirunavukarasu et al. [2023](https://arxiv.org/html/2606.07392#bib.bib62), Hager et al. [2024](https://arxiv.org/html/2606.07392#bib.bib60), Hao et al. [2025](https://arxiv.org/html/2606.07392#bib.bib61)), among others.

### 1.3 Notations

Given any integer $k\geq 1$ and vector $v\in\mathbb{R}^{k}$, we use $\|v\|_{2}$ to denote the Euclidean norm of $v$, and $\|v\|_{\infty}$ to denote $\max_{i=1,\ldots,k}|v_{i}|$. For any positive semidefinite matrix $\Phi\in\mathbb{R}^{k\times k}$, we use the notation $\|v\|_{\Phi}=\sqrt{v^{\top}\Phi v}$ to denote the $\Phi$-weighted Euclidean norm of $v$. For matrices $A,B\in\mathbb{R}^{d\times d}$, $A\succeq B$ (resp., $A\preceq B$) means that $A-B$ (resp., $B-A$) is positive semidefinite. For $x,y\in\mathbb{R}$, we write $x\vee y=\max\{x,y\}$. For any positive semidefinite matrix $M\in\mathbb{R}^{d\times d}$, $\lambda_{\min}(M)$ denotes its minimum eigenvalue. For any integer $k\geq 1$, let $[k]=\{1,2,\ldots,k\}$. Given any set $S$, we use $S^{c}$ to denote the complement of $S$. We write $\omega\sim p(\cdot)$ to indicate that the random variable $\omega$ follows distribution $p(\cdot)$. We use $N(\theta,\sigma^{2})$ to denote the Gaussian distribution with mean $\theta$ and variance $\sigma^{2}$. Given any $\mu\in\mathbb{R}^{d}$ and positive definite matrix $\Sigma\in\mathbb{R}^{d\times d}$, we use $\mathcal{N}(\mu,\Sigma)$ to denote the $d$-dimensional Gaussian distribution with mean $\mu$ and covariance $\Sigma$. For any two random variables $X,Y$, $X\perp\!\!\!\!\perp Y$ means $X$ and $Y$ are independent. For any random event $\mathcal{E}$, we use $\mathbb{I}\{\mathcal{E}\}$ to denote the indicator of $\mathcal{E}$. The notation “a.s.” stands for “almost surely.” For any vector $x$ or matrix $A$, $x^{\top}$ and $A^{\top}$ denote their transposes. Finally, for any twice-differentiable function $g$, $g^{\prime}$ and $g^{\prime\prime}$ denote its first- and second-order derivatives. We use $\widetilde{O}(\cdot)$ to hide logarithmic factors. Specifically, for functions $f,g:\mathbb{N}\to\mathbb{R}_{+}$, we write $f(T)=\widetilde{O}(g(T))$ if there exists a constant $C>0$ and a poly-logarithmic function $\mathrm{polylog}(T)$ such that $f(T)\leq C\,g(T)\,\mathrm{polylog}(T)$. For random variables $Y_{1}$ and $Y_{2}$, $Y_{1}\lesssim Y_{2}$ denotes that $Y_{1}\leq cY_{2}$ with high probability for some constant $c>0$. For two positive sequences $f(T)$ and $g(T)$, we write $f(T)\asymp g(T)$ if both $f(T)\lesssim g(T)$ and $g(T)\lesssim f(T)$ hold up to universal constant factors.
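Two small worked examples may help fix the notation. The specific vector, matrix, and functions below are arbitrary choices made here for illustration and do not appear in the paper.

```latex
% Weighted norm: v = (1, 2)^\top and \Phi = \mathrm{diag}(4, 1)
\|v\|_{\Phi} = \sqrt{v^{\top}\Phi v} = \sqrt{4\cdot 1^{2} + 1\cdot 2^{2}} = \sqrt{8} = 2\sqrt{2},
\qquad
\|v\|_{2} = \sqrt{5},
\qquad
\|v\|_{\infty} = 2.

% Soft-O notation: take C = 5 and \mathrm{polylog}(T) = \log^{2} T
f(T) = 5\sqrt{T}\,\log^{2} T
\;\Longrightarrow\;
f(T) = \widetilde{O}(\sqrt{T}),
\qquad\text{whereas}\qquad
T^{0.6} \neq \widetilde{O}(\sqrt{T}),
```

The last claim holds because $T^{0.6}/\sqrt{T} = T^{0.1}$ eventually exceeds $C\,\mathrm{polylog}(T)$ for any fixed constant $C$ and any poly-logarithmic function.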

## 2 Model

In this section, we formally introduce our online contextual Pandora’s box model motivated by LLM cascading. In our model, there are $A$ different boxes, each of which can be thought of as an API in LLM cascading. There are $T$ decision periods. Let $\mathcal{F}_{t-1}$ denote the history available before period $t$, including all past observations and actions. At the beginning of period $t$, the decision-maker (DM) observes a request context vector $x_{t}\in\mathcal{X}\subset\mathbb{R}^{d_{x}}$, where the $x_{t}$ are independent across $t\in[T]$. For each box $a\in[A]$, $\omega_{at}$, defined as the potential output vector of $a$, is random, and its distribution is assumed to be independent across boxes and independent of the past history. Formally, for each $a\in[A]$, $\omega_{at}$ is drawn from a context-dependent distribution $p_{a}(\cdot\mid x_{t})$, and for any measurable sets $B_{1},\ldots,B_{A}\subseteq\Omega$, we have

$$\mathbb{P}\left(\omega_{1t}\in B_{1},\ldots,\omega_{At}\in B_{A}\mid\mathcal{F}_{t-1},x_{t}\right)=\prod_{a=1}^{A}p_{a}(B_{a}\mid x_{t})\qquad\text{a.s.}\tag{1}$$

The output $\omega_{at}$ is revealed only if box $a$ is queried. In addition, if $a$ is queried, a cost $c_{a}(x_{t},\omega_{at})$ is incurred and observed, where $c_{a}:\mathcal{X}\times\Omega\to(0,1)$ is the cost function associated with box $a$. Finally, at the end of each period, the DM selects one queried box $a_{t}$ whose generated output $\omega_{a_{t}t}$ will be deployed. The DM then receives and observes the realized reward associated with the deployed output, $r_{t}=\mu^{*}(x_{t},\omega_{a_{t}t})+\zeta_{t}$, where $\mu^{*}(\cdot,\cdot):\mathcal{X}\times\Omega\rightarrow[0,1]$ is the (unknown) reward function and $\zeta_{t}$ is the post-deployment noise satisfying $\mathbb{E}[\zeta_{t}\mid\mathcal{F}_{t-1},x_{t},\omega_{a_{t},t}]=0$.

A key feature of our model is that the reward function is shared across boxes. Conditional on the same request-output pair $(x,\omega)$, the expected downstream reward is the same regardless of which box generated the output. Thus, box-specific heterogeneity only enters through the distributions of outputs and costs, while realized rewards depend solely on the common reward function $\mu^{*}(\cdot,\cdot)$ and noise $\zeta_{t}$. Consequently, outputs $\omega_{at}$ act as intermediaries linking boxes to rewards, which permits cross-learning of $\mu^{*}$ using reward observations from all selected boxes. This reflects our motivating LLM example, where the reward received by the DM depends on the generated output and the request it serves, not directly on the identity of the API that generated it.
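The feedback structure described above can be mimicked with a toy simulator. The sketch below is only illustrative: scalar contexts and outputs, the particular output distributions, cost functions, the reward function `shared_reward`, and the Gaussian noise are all assumptions made here, not objects from the paper. The point is that querying reveals an output-cost pair, while a noisy reward is observed only for the deployed output and is computed by one function shared across boxes.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


def shared_reward(x: float, omega: float) -> float:
    """Hypothetical reward function in [0, 1]; it never sees the box identity."""
    return min(1.0, max(0.0, 0.3 * x + 0.7 * omega))


@dataclass
class Box:
    sample_output: Callable[[float, random.Random], float]  # omega ~ p_a(. | x)
    cost: Callable[[float, float], float]                    # c_a(x, omega) in (0, 1)


@dataclass
class Period:
    x: float
    boxes: Dict[str, Box]
    rng: random.Random
    observed: Dict[str, Tuple[float, float]] = field(default_factory=dict)

    def query(self, name: str) -> Tuple[float, float]:
        """Reveal the (output, cost) pair of one box; no reward is observed."""
        if name in self.observed:
            raise ValueError("each box can be queried at most once per period")
        box = self.boxes[name]
        omega = box.sample_output(self.x, self.rng)
        self.observed[name] = (omega, box.cost(self.x, omega))
        return self.observed[name]

    def deploy(self, name: str, noise_sd: float = 0.05) -> float:
        """Deploy a queried box's output and observe its noisy reward."""
        omega, _ = self.observed[name]  # KeyError if the box was never queried
        return shared_reward(self.x, omega) + self.rng.gauss(0.0, noise_sd)


# Two hypothetical APIs that differ only in output and cost distributions.
boxes = {
    "small": Box(lambda x, rng: rng.uniform(0.0, 0.6), lambda x, w: 0.02 + 0.01 * w),
    "large": Box(lambda x, rng: rng.uniform(0.4, 1.0), lambda x, w: 0.20 + 0.10 * w),
}
period = Period(x=0.5, boxes=boxes, rng=random.Random(0))
omega, cost = period.query("small")
reward = period.deploy("small")
```

Because `shared_reward` depends only on the request-output pair, every deployed reward is informative about the same function, whichever box produced the output.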

We preview the parametric structures used for learning, with the formal assumptions and estimators introduced in Sections [4](https://arxiv.org/html/2606.07392#S4) and [5](https://arxiv.org/html/2606.07392#S5). Suppose, as an oracle benchmark, the DM knows the conditional output distributions $\{p_{a}(\cdot\mid x)\}_{a\in[A],x\in\mathcal{X}}$, the cost functions $\{c_{a}\}_{a\in[A]}$, and the reward function $\mu^{*}$. Then, after the context $x_{t}$ is observed, the decision problem in period $t$ reduces to a contextual Pandora’s box problem of the type studied by Weitzman ([1979](https://arxiv.org/html/2606.07392#bib.bib57)). As we formally illustrate in Section [3](https://arxiv.org/html/2606.07392#S3), the optimal full-information policy is determined by two objects: the reward function $\mu^{*}$, and the reservation indices, denoted as $\{\sigma_{a}^{*}(x_{t})\}_{a\in[A]}$. These indices are, in turn, determined by the oracle objects $p_{a}(\cdot\mid x_{t})$, $\mu^{*}$, and $c_{a}$ through the reservation index equation ([2](https://arxiv.org/html/2606.07392#S3.E2)) in Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1). In the online learning problem, these oracle objects are not known. Rather than estimating the full conditional output distributions $p_{a}(\cdot\mid x)$, we impose generalized linear parametric structure directly on the reward function $\mu^{*}$ and on the contextual reservation index functions $\{\sigma_{a}^{*}(\cdot)\}_{a\in[A]}$, namely $\mu^{*}(x,\omega)=G(\theta_{*}^{\top}\phi(x,\omega))$ and $\sigma_{a}^{*}(x)=\Lambda(\rho_{a}^{\top}\psi(x))$, where the coefficients $\theta_{*}$ and $\rho_{a}$ are unknown, $G$ and $\Lambda$ are known (monotonic) link functions, and $\phi$ and $\psi$ are known feature maps. This allows the DM to learn the shared reward and the box-specific contextual reservation indices through parametric estimation.
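To make the two generalized linear forms concrete, the following sketch evaluates them for scalar contexts and outputs. Everything specific here is an assumption for illustration: the logistic link is used for both $G$ and $\Lambda$, the feature maps `phi` and `psi` are invented, and the coefficient values are arbitrary. The source only states that the links are known and monotonic and that the feature maps are known.

```python
import math
from typing import Dict, List, Sequence


def logistic(z: float) -> float:
    """A monotone link, used here for both G and Lambda (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def phi(x: float, omega: float) -> List[float]:
    """Hypothetical feature map of a request-output pair."""
    return [1.0, x, omega, x * omega]


def psi(x: float) -> List[float]:
    """Hypothetical feature map of the request context alone."""
    return [1.0, x]


def reward_model(theta: Sequence[float], x: float, omega: float) -> float:
    """mu(x, omega) = G(theta^T phi(x, omega)); a single theta shared by all boxes."""
    return logistic(dot(theta, phi(x, omega)))


def index_model(rho_a: Sequence[float], x: float) -> float:
    """sigma_a(x) = Lambda(rho_a^T psi(x)); one coefficient vector per box."""
    return logistic(dot(rho_a, psi(x)))


theta = [-1.0, 0.5, 2.0, 0.0]                                     # illustrative values
rho: Dict[str, List[float]] = {"small": [0.0, 0.0], "large": [-0.5, 1.5]}

x = 0.4
indices = {a: index_model(r, x) for a, r in rho.items()}
rewards = [reward_model(theta, x, omega) for omega in (0.1, 0.5, 0.9)]
```

Under this structure the learner needs only a coefficient vector per box for the indices and one shared vector for rewards, rather than a model of each box's full output distribution.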

### 2.1 DM’s Problem and Regret

Next, we describe the decision-making pipeline in each period $t\in[T]$ in more detail. Each period $t\in[T]$ consists of two phases, which we call the *query phase* and the *selection phase*.

During the *query phase*, the DM observes $x_t$ and sequentially queries boxes. A policy is non-anticipating: before each query, the DM may use $\mathcal{F}_{t-1}$, the current context $x_t$, and the output-cost pairs of the boxes already observed in period $t$, but not the output or cost of any unqueried box. Let $\mathcal{A}_t\subseteq[A]$ denote the random set of boxes queried in period $t$. We assume each box can be queried at most once within a period. After each query, the DM observes the generated output and cost, and then decides whether to continue querying or stop. This non-anticipativity condition implies a useful sampling property. For any $a\in[A]$, conditional on $(\mathcal{F}_{t-1},x_t)$, the event $\{a\in\mathcal{A}_t\}$ is determined before observing box $a$’s own potential output-cost pair. Hence, whenever $\mathbb{P}(a\in\mathcal{A}_t\mid\mathcal{F}_{t-1},x_t)>0$,

$$(\omega_{at},c_{at})\,\big|\,\mathcal{F}_{t-1},x_t,\{a\in\mathcal{A}_t\}\;\overset{d}{=}\;(\omega_{at},c_{at})\,\big|\,\mathcal{F}_{t-1},x_t.$$

This property will be used later to justify learning reservation indices from adaptively queried samples. After the query phase concludes, the DM enters the *selection phase*, selecting one queried box $a_t\in\mathcal{A}_t$ whose generated output $\omega_{a_t t}$ will be deployed.

Let $\Pi_t$ denote the class of admissible non-anticipating policies in period $t$. For a policy $\pi_t\in\Pi_t$, let $\mathcal{A}_t(\pi_t)$ be the set of queried boxes and $a_t(\pi_t)\in\mathcal{A}_t(\pi_t)$ be the selected box. The conditional expected utility of policy $\pi_t$ in period $t$ is

$$U_t(\pi_t;\mathcal{F}_{t-1},x_t):=\mathbb{E}\!\left[\mu^*(x_t,\omega_{a_t(\pi_t)t})-\sum_{a\in\mathcal{A}_t(\pi_t)}c_{at}\,\middle|\,\mathcal{F}_{t-1},x_t\right].$$

We note that the post-deployment noise $\zeta_t$ does not enter the utility because it is conditionally mean-zero. It matters only as feedback for learning the reward function $\mu^*(\cdot,\cdot)$.

We compare the DM against the full-information benchmark that knows the distribution $p_a(\cdot\mid x_t)$ and the functions $\mu^*(x_t,\cdot)$ and $c_a(x_t,\cdot)$ for every box $a$ and every context $x_t$. At each period $t$, after observing $x_t$, the optimal policy $\pi_t^*\in\arg\max_{\pi_t\in\Pi_t}U_t(\pi_t;\mathcal{F}_{t-1},x_t)$ reduces to the solution of the Pandora’s box problem given context $x_t$, which we formally characterize in Section [3](https://arxiv.org/html/2606.07392#S3).

Define the conditional expected utility of $\pi_t^*$ as

$$U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)=\mathbb{E}\bigg[\max_{a\in\mathcal{A}_t^*}\mu^*(x_t,\omega_{at})-\sum_{a\in\mathcal{A}_t^*}c_{at}\,\bigg|\,\mathcal{F}_{t-1},x_t\bigg],$$

where $\mathcal{A}_t^*$ denotes the set of boxes queried by $\pi_t^*$. Then, for any admissible policy $\boldsymbol{\pi}=\{\pi_t\}_{t\in[T]}$, the cumulative regret is defined as

$$R_T(\boldsymbol{\pi}):=\sum_{t=1}^{T}\mathbb{E}\!\left[U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)-U_t(\pi_t;\mathcal{F}_{t-1},x_t)\right].$$

Readers familiar with the bandit literature might ask whether the problem can be modeled as a contextual bandit by treating each API as an arm. Such a formulation would lead to a one-shot API-as-arm policy, in which the decision-maker selects a single API for each request before observing any generated output. It does not, however, capture the adaptive information-acquisition structure of LLM cascading. The following example illustrates that such a restriction can create a constant per-period gap relative to the sequential-search benchmark, even when all primitives are known.

Consider an example with two boxes in which the context $x_t$ is fixed throughout. Querying box 1 costs $c_1=0.5$ and always generates output $\omega_{1t}=1$. Querying box 2 costs $c_2=0.01$ and generates output $\omega_{2t}=1$ with probability $0.1$, and $\omega_{2t}=0$ otherwise. The downstream value is $\mu^*(x_t,\omega)=\omega$. A one-shot API-as-arm policy obtains expected utility $1-0.5=0.5$ from box 1 and $0.1-0.01=0.09$ from box 2, so the best one-shot policy selects box 1. Now consider a cascade that first queries box 2. If $\omega_{2t}=1$, it stops and selects box 2; if $\omega_{2t}=0$, it queries box 1 and selects box 1. Its expected utility is $-0.01+0.1+0.9(0.5)=0.54>0.5$. Thus, relative to the cascade benchmark, any one-shot API-as-arm formulation suffers a constant per-period optimality gap of at least $0.04$ in this instance, and hence a loss that grows linearly with $T$.
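The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the dictionaries `COST` and `OUTPUTS` and the function names are illustrative choices, not notation from the paper, and only the numbers come from the example above.

```python
# Primitives of the two-box example (numbers taken from the text above).
COST = {1: 0.5, 2: 0.01}
# Output distribution of each box as (output value, probability) pairs.
OUTPUTS = {1: [(1.0, 1.0)], 2: [(1.0, 0.1), (0.0, 0.9)]}


def one_shot_utility(box):
    """Expected utility of querying a single box and deploying its output."""
    expected_reward = sum(value * prob for value, prob in OUTPUTS[box])
    return expected_reward - COST[box]


def cascade_utility():
    """Query box 2 first; stop if its output is 1, otherwise fall back to box 1."""
    total = -COST[2]
    for value, prob in OUTPUTS[2]:
        if value == 1.0:
            total += prob * value                # stop and deploy box 2's output
        else:
            total += prob * one_shot_utility(1)  # pay for box 1 and deploy its output
    return total


best_one_shot = max(one_shot_utility(1), one_shot_utility(2))
gap = cascade_utility() - best_one_shot
print(round(best_one_shot, 4), round(cascade_utility(), 4), round(gap, 4))
```

The printed gap is the per-period loss of the best one-shot policy relative to the cascade, which is what drives the linear-in-$T$ loss.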

## 3 Policy and Algorithm

This section develops the online policy and the algorithm. We begin with the full-information benchmark. Fix a period $t$ and condition on the arriving context $x_t$. Suppose the DM knows the reward function $\mu^*(\cdot,\cdot)$ and the output distributions $\{p_a(\cdot\mid x)\}_{a\in[A],x\in\mathcal{X}}$. The following proposition characterizes the optimal oracle policy during period $t$.

###### Proposition 3.1 (Oracle reservation index)

For each box $a\in[A]$ and any fixed $t\in[T]$, let $\sigma_a^*(\cdot)$ be the index function such that

$$\mathbb{E}\left[\left\{\mu^*(x_t,\omega_{at})-\sigma_a^*(x_t)\right\}^{+}\,\middle|\,x_t\right]=\mathbb{E}[c_{at}\mid x_t],\quad\forall x_t. \tag{2}$$

Then there exists an optimal policy $\pi_t^*$ with the following structure: at any period $t$ with context $x_t$, it orders the boxes $\{(1),(2),\ldots,(A)\}$ such that $\sigma^*_{(1)}(x_t)\geq\cdots\geq\sigma^*_{(A)}(x_t)$, sets $\sigma^*_{(A+1)}(x_t):=-\infty$, queries boxes in this order, stops at the first query $k$ for which $\max_{1\leq j\leq k}\mu^*(x_t,\omega_{(j)t})\geq\sigma^*_{(k+1)}(x_t)$, and selects $a_t^*\in\arg\max_{a\in\mathcal{A}_t^*}\mu^*(x_t,\omega_{at})$, where $\mathcal{A}_t^*$ is the set of boxes queried by the oracle. Moreover, the oracle value satisfies

$$U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)=\mathbb{E}\left[\max_{a\in[A]}\min\left\{\mu^*(x_t,\omega_{at}),\sigma_a^*(x_t)\right\}\,\middle|\,x_t,\mathcal{F}_{t-1}\right]. \tag{3}$$

Define $\sigma_{at}^*:=\sigma_a^*(x_t)$. When the costs $c_{at}$ are known constants (i.e., $c_{at}\equiv c_a$ for each $a\in[A]$), the indices $\{\sigma_{at}^*\}_{a\in[A]}$ are optimal by the seminal work of Weitzman ([1979](https://arxiv.org/html/2606.07392#bib.bib57)). When costs are stochastic (as in our motivating setting), the result above follows from a minor adaptation of the proof in Kleinberg et al. ([2016](https://arxiv.org/html/2606.07392#bib.bib1)) (see Appendix [7](https://arxiv.org/html/2606.07392#S7)).
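To make Proposition 3.1 concrete, the sketch below solves the reservation index equation (2) by bisection for finitely supported outputs and then evaluates the right-hand side of (3) by enumeration. It is only an illustration under simplifying assumptions that are not stated in the text: outputs have finite support, boxes are independent given the context, the reward is identified with the output value, and costs are deterministic. The function names and the input format are hypothetical. The instance is the two-box example from Section 2.1.

```python
from itertools import product


def reservation_index(outputs, expected_cost, lo=-1.0, hi=1.0, iters=60):
    """Solve E[(mu - sigma)^+] = E[c] for sigma by bisection.

    `outputs` is a list of (reward value, probability) pairs for one box.
    The left-hand side is nonincreasing in sigma, so bisection applies.
    """
    def surplus(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in outputs)

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if surplus(mid) > expected_cost:
            lo = mid   # surplus too large: the index lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def oracle_value(boxes):
    """Evaluate E[max_a min(mu_a, sigma_a)] by enumerating joint outcomes.

    Assumes box outputs are independent given the context.
    """
    sigmas = [reservation_index(out, cost) for out, cost in boxes]
    value = 0.0
    for combo in product(*[out for out, _ in boxes]):
        prob = 1.0
        capped = []
        for (v, p), sigma in zip(combo, sigmas):
            prob *= p
            capped.append(min(v, sigma))
        value += prob * max(capped)
    return sigmas, value


# Two-box instance: (output distribution, cost) per box.
boxes = [([(1.0, 1.0)], 0.5), ([(1.0, 0.1), (0.0, 0.9)], 0.01)]
sigmas, value = oracle_value(boxes)
print([round(s, 4) for s in sigmas], round(value, 4))
```

In this instance box 2 has the larger index, so the index ordering queries it first. The value computed from (3) matches the cascade's expected utility of $0.54$ in the earlier example.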

Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1) formally identifies the reward evaluator $\mu^*(x,\omega)$ and the reservation index function $\sigma_a^*(x)$ as the two objects required to make the optimal sequential decisions. We next describe the online policy as Algorithm [1](https://arxiv.org/html/2606.07392#alg1). At a high level, the policy follows the oracle structure in Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1), but replaces the unknown reward evaluator and reservation indices with optimistic estimates. We call the resulting policy COst-aware Sequential MOdel query and Selection, or COSMOS.

### 3.1 Optimism and Regret Decomposition

Our COSMOS policy constructs optimistic estimates $\widetilde{\mu}_t(x,\omega)$ and $\widetilde{\sigma}_{at}$ for all $a\in[A]$ and $t\in[T]$. The estimate $\widetilde{\mu}_t(x,\omega)$ is an optimistic estimate of $\mu^*(x,\omega)$ for a realized output. The estimate $\widetilde{\sigma}_{at}$ is an optimistic estimate of the oracle reservation index $\sigma_a^*(x_t)$ for box $a$ at the current context. The formal construction of these estimates is postponed to Sections [4](https://arxiv.org/html/2606.07392#S4) and [5](https://arxiv.org/html/2606.07392#S5), and here we focus only on their roles in the policy.

To understand why optimism is crucial in our COSMOS policy, we next present a theorem showing that, under optimism, the one\-period regret decomposes into the reward estimation error for the selected output and the index estimation errors for the queried boxes\.

###### Theorem 3.3 (Regret decomposition under optimism)

Fix any period $t$, and let

$$\Delta_t(\widetilde{\pi}):=U_t(\pi_t^*;\mathcal{F}_{t-1},x_t)-U_t(\widetilde{\pi}_t;\mathcal{F}_{t-1},x_t)$$

denote the period-$t$ conditional regret of the policy $\widetilde{\pi}$ induced by Algorithm [1](https://arxiv.org/html/2606.07392#alg1). Suppose that, for all $a\in[A]$, $\widetilde{\mu}_t(x_t,\omega_{at})\geq\mu^*(x_t,\omega_{at})$ and $\widetilde{\sigma}_{at}\geq\sigma^*_{at}$. Let $a_t\in\arg\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$ be the box whose output is selected by Algorithm [1](https://arxiv.org/html/2606.07392#alg1). Then,

$$\Delta_t(\widetilde{\pi})\leq\mathbb{E}\left[\widetilde{\mu}_t(x_t,\omega_{a_t t})-\mu^*(x_t,\omega_{a_t t})\,\middle|\,\mathcal{F}_{t-1},x_t\right]+\mathbb{E}\left[\sum_{a\in\mathcal{A}_t}\left(\widetilde{\sigma}_{at}-\sigma^*_{at}\right)\,\middle|\,\mathcal{F}_{t-1},x_t\right].$$

We provide intuition on why Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) holds. Consider an auxiliary “optimistic” problem in which both the reward estimate $\widetilde{\mu}_t$ and the index $\widetilde{\sigma}_{at}$ upper-bound the truth. Then COSMOS is the optimal policy for the “optimistic” problem, and its corresponding optimistic expected value is thus at least as large as the optimal expected value of the true problem. As a result, the loss can be bounded by the discrepancy between the optimistic quantities used by COSMOS and the true quantities realized along the same path. This yields exactly two errors: a reward estimation error for the selected output, and an index-estimation error for every queried box.

The decomposition in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) allows us to leverage the principle of optimism to control our regret. More specifically, if a box is queried frequently, its query dataset grows and its index uncertainty decreases. If a box is queried rarely, then it contributes only rarely to the second term in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). This mirrors the classic self-correcting principle behind UCB analysis, adapted here to govern reservation indices and reward rather than the direct feedback in bandits.

Consequently, Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) motivates us to construct high-probability confidence bounds for both the reward and the reservation indices, and to use them to define the optimistic estimators $\widetilde{\mu}_t(x_t,\omega_{at})$ and $\widetilde{\sigma}_{at}$. The goal is to ensure that, with high probability, these estimators upper bound their population counterparts while remaining sufficiently accurate. Together, optimism and error control provide the key ingredients for deriving the final regret guarantee of the COSMOS policy, whose high-level structure is summarized as Algorithm [1](https://arxiv.org/html/2606.07392#alg1). The full implementation, including the computation of optimistic reservation indices and reward estimates, is deferred to Algorithm [2](https://arxiv.org/html/2606.07392#alg2) in the Appendix.

**Initialization.** Observe an initial context $x_0$. Query each box $a\in[A]$ once and record the observed output-cost pairs $\{(\omega_{a0},c_{a0})\}_{a\in[A]}$.

**for** $t=1,2,\ldots,T$ **do**

- Observe context $x_t$. Initialize $\mathcal{A}_t=\emptyset$ and $M_t=-\infty$;
- Construct optimistic reservation index estimates $\{\widetilde{\sigma}_{at}\}_{a\in[A]}$;
- Order the boxes so that $\widetilde{\sigma}_{(1)t}\geq\widetilde{\sigma}_{(2)t}\geq\cdots\geq\widetilde{\sigma}_{(A)t}$. Set $\widetilde{\sigma}_{(A+1)t}:=-\infty$;
- **for** $k=1,2,\ldots,A$ **do**
    - Query box $(k)$ and observe its output and cost $(\omega_{(k)t},c_{(k)t})$;
    - Add $(k)$ to the queried set: $\mathcal{A}_t\leftarrow\mathcal{A}_t\cup\{(k)\}$;
    - Compute the optimistic reward estimate $\widetilde{\mu}_t(x_t,\omega_{(k)t})$;
    - Update $M_t\leftarrow\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$;
    - **if** $M_t\geq\widetilde{\sigma}_{(k+1)t}$, stop querying and **break**;
- **end for**
- Select and deploy $a_t\in\arg\max_{a\in\mathcal{A}_t}\widetilde{\mu}_t(x_t,\omega_{at})$, and observe reward $r_t$;

**end for**

ALGORITHM 1 COSMOS Policy Framework
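The inner loop of Algorithm 1 for a single period can be sketched in Python as follows. This is not the authors' implementation (which is deferred to Algorithm 2 in the Appendix). The function `cosmos_period` and its three arguments are hypothetical placeholders: the caller is assumed to supply the optimistic indices, a way to query a box, and an optimistic reward evaluator.

```python
import math


def cosmos_period(context, optimistic_index, query_box, optimistic_reward):
    """One period of the query-then-select loop in Algorithm 1.

    optimistic_index: dict mapping box -> optimistic reservation index
    query_box(box, context): returns (output, cost) for the queried box
    optimistic_reward(context, output): optimistic reward estimate of an output
    All three are hypothetical placeholders supplied by the caller.
    """
    # Order boxes by decreasing optimistic index; pad with -inf as index A+1.
    order = sorted(optimistic_index, key=optimistic_index.get, reverse=True)
    thresholds = [optimistic_index[b] for b in order[1:]] + [-math.inf]

    queried = {}              # box -> (output, cost, optimistic reward)
    best_value = -math.inf    # running maximum M_t
    for box, next_index in zip(order, thresholds):
        output, cost = query_box(box, context)
        value = optimistic_reward(context, output)
        queried[box] = (output, cost, value)
        best_value = max(best_value, value)
        if best_value >= next_index:   # stopping rule M_t >= sigma_(k+1)t
            break

    selected = max(queried, key=lambda b: queried[b][2])
    total_cost = sum(cost for _, cost, _ in queried.values())
    return selected, list(queried), total_cost


# Illustrative run: box 2 has the larger index but returns a poor output.
indices = {1: 0.5, 2: 0.9}
realized = {1: (1.0, 0.5), 2: (0.0, 0.01)}
print(cosmos_period(None, indices, lambda b, x: realized[b], lambda x, w: w))
```

In the illustrative run, box 2 is queried first, its output falls below the next index, so box 1 is queried as well and its output is deployed.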

## 4 Results under Known Reward Function

To build intuition, we first study the case in which the reward function $\mu^*(x,\omega)$ is known. This setting is practically relevant when the reward model can be estimated offline using a substantially larger historical dataset. If the offline data are sufficiently rich, the resulting estimate of the reward function $\mu^*(\cdot,\cdot)$ may be accurate enough, relative to the $\sqrt{T}$ scale of the online regret analysis, to be treated as fixed in the subsequent online decision problem. Importantly, even with a known reward function, the conditional distribution of the output $\omega_{at}$ given $x_t$ typically remains unknown.

This case therefore captures regimes in which the main online learning challenge is not reward estimation, but learning the reservation indices and managing exploration across boxes. Indeed, once box $a$ is opened in period $t$, the decision-maker observes its output $\omega_{at}$ and cost $c_{at}$, and can therefore compute the conditional expected reward $\mu^*(x_t,\omega_{at})$. Thus, in Algorithm [1](https://arxiv.org/html/2606.07392#alg1), we would set $\widetilde{\mu}_t(x_t,\omega_{at})=\mu^*(x_t,\omega_{at})$. The selection rule then chooses the opened box with the largest true conditional expected reward, and the regret component arising from reward estimation disappears. The remaining learning problem is then focused solely on estimating $\sigma_a^*(x_t)$ for each $a$.

This is analogous to the bandit-feedback setting in Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)), in that feedback is obtained only from opened boxes. (The authors also study a full-information setting where the DM observes the rewards of all boxes at the end of each round.) Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) develop a black-box reduction from contextual Pandora’s Box to online regression, using a specially designed linear-quadratic loss to learn all the reservation indices. In contrast, our approach follows the principle of optimism from the UCB literature. As we illustrate, under suitable regularity conditions, it is not necessary to learn all boxes’ index functions uniformly well. Instead, it is enough to maintain optimistic index estimates and refine the estimates of boxes that remain relevant to the query process.

We next formalize this intuition. Recall from Section [2](https://arxiv.org/html/2606.07392#S2) that we impose the parametric structure on the indices

$$\sigma_a^*(x)=\Lambda(\rho_a^\top\psi(x)),\qquad\forall x\in\mathcal{X}, \tag{5}$$

where $\Lambda$ is a known monotonic link function, and $\psi(\cdot)$ is a known feature map. We impose the following regularity condition on the parametric reservation index function ([5](https://arxiv.org/html/2606.07392#S4.E5)):

###### Assumption (Regularity conditions for reservation indices)

The link function $\Lambda:\mathbb{R}\to[-1,1]$ is continuously differentiable, strictly increasing, and $L$-Lipschitz, with $\lim_{s\to-\infty}\Lambda(s)=-1$ and $\lim_{s\to+\infty}\Lambda(s)=1$. Moreover, $\inf_{\rho\in\mathcal{B},\,x\in\mathcal{X}}\Lambda'(\rho^\top\psi(x))\geq\mu_1>0$ and $\|\psi(x)\|_2\leq\bar{C}_\psi$ for all $x\in\mathcal{X}$. The parameter space $\mathcal{B}\subset\mathbb{R}^m$ is convex and compact, with diameter $d_{\mathcal{B}}$: $\sup_{\rho,\rho'\in\mathcal{B}}\|\rho-\rho'\|_2\leq d_{\mathcal{B}}$.

The range restriction $\Lambda:\mathbb{R}\to[-1,1]$ is consistent with the fact that $c_{at}\in(0,1)$ and $\mu^*\in[0,1]$, which imply $\sigma_{at}^*\in[-1,1]$. In addition, Assumption [5](https://arxiv.org/html/2606.07392#S4.E5) imposes only mild regularity conditions. It accommodates the linear reservation index specification considered by Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)), which corresponds to taking $\Lambda$ as the identity map on the relevant, strictly bounded domain of $\rho^\top\psi(x)$, while allowing it to smoothly flatten to $\pm 1$ as $s\to\pm\infty$.

Assumption [5](https://arxiv.org/html/2606.07392#S4.E5) yields the point identification of $\rho_a^\top\psi(x_t)$:

###### Proposition 4.1

Under Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), $\rho_a^\top\psi(x_t)$ is point identified through ([4](https://arxiv.org/html/2606.07392#S3.E4)) for each $a\in[A]$.

At the beginning of period $t$, Algorithm [1](https://arxiv.org/html/2606.07392#alg1) estimates $\rho_a$ by $\hat{\rho}_{at}$ and constructs an upper confidence bound for the linear index $\rho_a^\top\psi(x_t)$. Let $\mathcal{R}_{at}(x_t)$ denote the corresponding confidence radius, and define $\widetilde{\sigma}_{at}=\Lambda\!\left(\hat{\rho}_{at}^\top\psi(x_t)+\mathcal{R}_{at}(x_t)\right)$. Let $\mathcal{E}_\rho$ denote the high-probability event on which

$$\left|\hat{\rho}_{at}^\top\psi(x_t)-\rho_a^\top\psi(x_t)\right|\leq\mathcal{R}_{at}(x_t),\qquad\forall a\in[A],\;t\in[T].$$

Then on $\mathcal{E}_\rho$, by the monotonicity of $\Lambda$,

$$\widetilde{\sigma}_{at}\geq\Lambda(\rho_a^\top\psi(x_t))=\sigma_{at}^*,\qquad\forall a\in[A],\;t\in[T],$$

allowing us to invoke the regret decomposition of Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). Recall that $\widetilde{\mu}_t=\mu^*$ in the known-$\mu^*$ case, implying the reward estimation term in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) is zero. Therefore, under the event $\mathcal{E}_\rho$, Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) implies that the period-$t$ regret satisfies

$$\Delta_t(\widetilde{\pi})\leq\mathbb{E}\left[\sum_{a\in\mathcal{A}_t}\left(\widetilde{\sigma}_{at}-\sigma_{at}^*\right)\,\middle|\,x_t,\mathcal{F}_{t-1}\right], \tag{6}$$

where $\mathcal{A}_t$ is the random set of boxes opened by Algorithm [1](https://arxiv.org/html/2606.07392#alg1) in period $t$. Inequality ([6](https://arxiv.org/html/2606.07392#S4.E6)) illustrates that the regret depends only on the index estimation errors of boxes that are actually opened. Thus, frequently opened boxes yield larger sample sizes that shrink their confidence radii, while rarely opened boxes contribute negligibly to the overall regret. Once we obtain a valid high-probability confidence radius for the linear index $\rho_a^\top\psi(x_t)$, the cumulative regret can be controlled by the standard elliptical-potential argument used for contextual bandits (e.g., Abbasi-Yadkori et al. [2011](https://arxiv.org/html/2606.07392#bib.bib56)). However, unlike in standard contextual bandits, an opened box does not produce a noisy observation of its reservation index $\sigma_{at}^*$; instead, it produces a tuple $(x_t,\omega_{at},c_{at})$. Therefore, the index parameter must be learned indirectly from the defining equation ([4](https://arxiv.org/html/2606.07392#S3.E4)) of the reservation index.
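The optimism step above relies only on monotonicity of the link. The following sketch illustrates it with $\Lambda=\tanh$, which is one increasing link with limits $\pm 1$ chosen here purely for illustration. The coefficient values, the feature vector, and the way the radius is set are all made-up assumptions, and the function `optimistic_index` is a hypothetical helper rather than anything defined in the paper.

```python
import math


def optimistic_index(rho_hat, psi, radius, link=math.tanh):
    """sigma_tilde = Lambda(rho_hat^T psi + radius) for a monotone link Lambda."""
    linear_index = sum(r * f for r, f in zip(rho_hat, psi))
    return link(linear_index + radius)


rho_true = [0.4, -0.2]      # illustrative "true" coefficients rho_a
rho_hat = [0.25, -0.05]     # illustrative estimate of rho_a
psi = [1.0, 0.5]            # illustrative feature vector psi(x_t)

true_linear = sum(r * f for r, f in zip(rho_true, psi))
est_linear = sum(r * f for r, f in zip(rho_hat, psi))
# Any radius that covers the estimation error of the linear index is valid.
radius = abs(est_linear - true_linear) + 0.05

sigma_star = math.tanh(true_linear)
sigma_tilde = optimistic_index(rho_hat, psi, radius)
print(sigma_tilde >= sigma_star, round(sigma_tilde - sigma_star, 4))
```

Whenever the radius covers the error in the linear index, the optimistic index upper-bounds the true one, and the printed difference is the per-box term that enters inequality (6).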

### 4.1 Index Estimation and Loss Function

We next describe how to estimate the box-specific reservation index function $\sigma_a^*(\cdot)$. Given ([4](https://arxiv.org/html/2606.07392#S3.E4)) and ([5](https://arxiv.org/html/2606.07392#S4.E5)), our estimation problem is moment-based and is thus inspired by the generalized method of moments (GMM). For this purpose, we define the known-$\mu^*$ moment function, for any $x\in\mathcal{X}$, $\omega\in\Omega$, $c\in[0,1]$, and $\rho\in\mathcal{B}$, as

$$m^*(x,\omega,c;\rho):=c-\left(\mu^*(x,\omega)-\Lambda(\rho^\top\psi(x))\right)^{+}. \tag{7}$$

By Equation ([4](https://arxiv.org/html/2606.07392#S3.E4)), we have $\mathbb{E}\left[m^*(x_t,\omega_{at},c_{at};\rho_a)\,\middle|\,\mathcal{F}_{t-1},x_t,\ a\in\mathcal{A}_t\right]=0$. Next, let

$$\mathcal{S}_{at}:=\{s<t:\ a\in\mathcal{A}_s\},\qquad n_{at}:=|\mathcal{S}_{at}|$$

denote the set and number of past periods in which box $a$ was queried before period $t$, and let the elements of $\mathcal{S}_{at}$ be ordered as $\tau_{a,1}<\tau_{a,2}<\cdots<\tau_{a,n_{at}}$.

As we discuss below, our estimation problem is a special case of the GMM framework with conditional moment conditions. By setting the moment conditions based on the features $\psi(x_s)$, we can formulate it as an $M$-estimation problem that is more suitable for our online learning setting. A standard GMM approach would instead form the empirical vector moment

$$\bar{g}_{at}(\rho):=\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\psi(x_s)\,m^*(x_s,\omega_{as},c_{as};\rho)$$

and minimize a quadratic criterion such as $\bar{g}_{at}(\rho)^\top W_{at}\bar{g}_{at}(\rho)$ for some positive definite weighting matrix $W_{at}$ at each period $t$. However, this approach focuses on estimating the coefficients $\rho_a$, which requires a full-rank condition on the local GMM curvature. Such a requirement can be overly stringent in our setting, as it requires that the observations in which box $a$ is queried contain sufficiently many “active” samples, namely those satisfying $\mu^*(x_t,\omega_{at})>\Lambda(\rho_a^\top\psi(x_t))$, and that the corresponding feature vectors $\psi(x_t)$ span the parameter space in a well-conditioned way conditional on $a\in\mathcal{A}_t$.

Our objective, however, is not to recover the entire vector $\rho_a$. For the UCB analysis, it is enough to construct a confidence interval for the scalar quantity $\rho_a^\top\psi(x_t)$ at the current context. We therefore adopt an $M$-estimation formulation, whose first-order condition recovers the sample moment.

For a queried sample of box $a$ in period $s$, define

$$\ell_{as}(\rho):=\int_{0}^{\rho^\top\psi(x_s)}\left[c_{as}-\left(\mu^*(x_s,\omega_{as})-\Lambda(u)\right)^{+}\right]du. \tag{8}$$

Note that $\ell_{as}(\rho)$ is convex in $\rho$: because $\Lambda(\cdot)$ is strictly increasing, the integrand is nondecreasing in $u$, and the upper limit $\rho^\top\psi(x_s)$ is linear in $\rho$. Also, the derivative of $\ell_{as}$ satisfies

$$\nabla_\rho\ell_{as}(\rho)=\psi(x_s)\left[c_{as}-\left(\mu^*(x_s,\omega_{as})-\Lambda(\rho^\top\psi(x_s))\right)^{+}\right]=\psi(x_s)\,m^*(x_s,\omega_{as},c_{as};\rho). \tag{9}$$

Therefore, the first-order condition of the empirical loss minimization is equivalent to the sample analogue of the vector moment equation. The known-$\mu^*$ estimator of $\rho_a$ for period $t$ is

$$\hat{\rho}_{at}\in\arg\min_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho). \tag{10}$$

We next state the regularity condition that yields curvature of the population loss. The condition requires that $\rho^\top\psi(x)$ remains in a bounded region, and that for every point in this region, there is some probability mass that the reward of the output is above its corresponding reservation index.

###### Assumption (Local mass around reservation thresholds)

There exist constants $\bar{\iota}>0$ and $\kappa>0$ such that, for all $x\in\mathcal{X}$ and $\rho\in\mathcal{B}$, $|\psi(x)^\top\rho|\leq\bar{\iota}$, and for all $a\in[A]$, $x\in\mathcal{X}$, $|u|\leq\bar{\iota}$, we have $\mathbb{P}_{\omega\sim p_a(\cdot\mid x)}\left(\mu^*(x,\omega)>\Lambda(u)\right)\geq\kappa$.

Assumption [4.1](https://arxiv.org/html/2606.07392#S4.SS1) rules out the cases in which the reservation index lies in a region with no probability mass above it. Together with the lower derivative bound $\Lambda'(\cdot)\geq\mu_1$ from Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), it ensures that the reservation index moment crosses zero with slope bounded away from zero. This is the one-dimensional source of curvature for the loss function.

###### Lemma 4.2 (Population curvature of the loss function)

Under Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1), for any $t\in[T]$, $a\in[A]$, and $\rho\in\mathcal{B}$,

$$\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid x_t,\mathcal{F}_{t-1}]\geq\frac{\kappa\mu_1}{2}\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^2. \tag{11}$$
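A short calculation indicates where the bound (11) comes from. This is a sketch under the two assumptions above, not the paper's proof, and it additionally takes for granted that expectation and integration can be interchanged. The shorthand $z:=\rho^\top\psi(x_t)$, $z_a:=\rho_a^\top\psi(x_t)$, and $h(u):=\mathbb{E}\bigl[c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(u))^{+}\mid x_t,\mathcal{F}_{t-1}\bigr]$ is introduced here for convenience.

$$
\begin{aligned}
\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid x_t,\mathcal{F}_{t-1}]
  &= \int_{z_a}^{z} h(u)\,du, \qquad h(z_a)=0 \ \text{by the reservation index equation},\\
h'(u) &= \Lambda'(u)\,\mathbb{P}\bigl(\mu^*(x_t,\omega_{at})>\Lambda(u)\mid x_t\bigr)\;\geq\;\mu_1\kappa
  \quad\text{for almost every } u \text{ between } z_a \text{ and } z,\\
\int_{z_a}^{z} h(u)\,du &\geq \int_{z_a}^{z}\kappa\mu_1\,(u-z_a)\,du
  \;=\; \frac{\kappa\mu_1}{2}\,(z-z_a)^2 .
\end{aligned}
$$

The last inequality holds for both $z\geq z_a$ and $z<z_a$, since $h(u)$ has the same sign as $u-z_a$. Convexity of $\mathcal{B}$ keeps every $u$ between $z_a$ and $z$ inside the region where both lower bounds apply.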

The curvature bound in Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) illustrates the advantage of using the $M$-estimation formulation. It shows that the population loss increases quadratically in the prediction error $\psi(x_s)^\top(\rho-\rho_a)$. As a preview of the subsequent analysis, we will compare this quadratic curvature against the empirical fluctuation of the additive loss $\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho)$. By controlling this fluctuation uniformly over $\rho\in\mathcal{B}$ via martingale concentration, we can absorb it into the curvature term. This procedure yields a high-probability bound on $|(\hat{\rho}_{at}-\rho_a)^\top\psi(x_t)|$, providing the confidence radius needed to construct the optimistic index $\widetilde{\sigma}_{at}$ in ([6](https://arxiv.org/html/2606.07392#S4.E6)).
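The loss (8), its gradient (9), and the estimator (10) can be sketched numerically. The code below is an illustration under assumptions that are not in the paper: the link is taken to be $\tanh$, the parameter set $\mathcal{B}$ is a Euclidean ball of radius 2, the integral is approximated by the trapezoid rule, and the minimizer is found by projected gradient descent. The function names `moment`, `loss`, `grad`, and `fit` are hypothetical.

```python
import numpy as np


def moment(mu, c, z, link=np.tanh):
    """m*(x, omega, c; rho) with z = rho^T psi(x): c - (mu - Lambda(z))^+."""
    return c - np.maximum(mu - link(z), 0.0)


def loss(rho, psi, mu, c, grid=2001):
    """ell(rho): integral of the moment from 0 to rho^T psi (trapezoid rule)."""
    z = float(rho @ psi)
    u = np.linspace(0.0, z, grid)
    y = moment(mu, c, u)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(u)) / 2.0)


def grad(rho, psi, mu, c):
    """Gradient from equation (9): psi * m*(.; rho)."""
    return psi * moment(mu, c, float(rho @ psi))


def fit(samples, dim, steps=2000, lr=1.0, radius=2.0):
    """Projected gradient descent on the average loss over a Euclidean ball."""
    rho = np.zeros(dim)
    for _ in range(steps):
        g = sum(grad(rho, psi, mu, c) for psi, mu, c in samples) / len(samples)
        rho = rho - lr * g
        norm = np.linalg.norm(rho)
        if norm > radius:
            rho *= radius / norm    # project back onto the parameter ball
    return rho


# Ten samples mimicking box 2 of the Section 2.1 example: reward 1 once,
# reward 0 nine times, cost 0.01 each, and a constant feature psi(x) = [1].
samples = [(np.array([1.0]), 1.0, 0.01)] + [(np.array([1.0]), 0.0, 0.01)] * 9
rho_hat = fit(samples, dim=1)
print(np.tanh(rho_hat))   # fitted reservation index for this context
```

On this toy dataset the first-order condition reduces to $10\times 0.01=1-\tanh(\hat{\rho})$, so the fitted index should be close to $0.9$, the reservation index of box 2 in the earlier example.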

### 4.2 Confidence Bound for a Fixed Box

Next, we construct confidence radii for the reservation indices. For each fixed box $a\in[A]$, because $\Lambda(\cdot)$ is monotone and $\sigma_{at}^*=\Lambda(\rho_a^\top\psi(x_t))$, we focus on constructing a confidence interval for the scalar index $\rho_a^\top\psi(x_t)$ at the realized context $x_t$. We define $H_{at}(\rho)$, the empirical fluctuation of the loss at $\rho$, as

$$H_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}\!\left[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\right\}. \tag{12}$$

By optimality of $\hat{\rho}_{at}$, the empirical excess loss $\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\right\}$ is non-positive. This sum can be decomposed into its conditional expectation and empirical fluctuation as

$$\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\right\}=\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_a)\,\middle|\,\mathcal{F}_{s-1},x_s\right]+H_{at}(\hat{\rho}_{at}).$$

Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) implies that the conditional expectation is bounded below by the quadratic form $\frac{\kappa\mu_1}{2}\sum_{s\in\mathcal{S}_{at}}\left\{\psi(x_s)^\top(\hat{\rho}_{at}-\rho_a)\right\}^2$. Consequently, to establish a confidence interval for $\rho_a^\top\psi(x_t)$, it suffices to bound the empirical fluctuation $H_{at}(\hat{\rho}_{at})$ using a comparable quadratic term and a logarithmic complexity term, as established in the following lemma.

###### Lemma 4\.3

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4\.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. Fix $a\in[A]$ and any constant $c_{0}>0$. Then, with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, for any $\rho_{t}$ adapted to $\mathcal{F}_{t-1}$ we have

$$-H_{at}(\rho_{t})\leq\frac{c_{0}}{8}\sum_{s\in\mathcal{S}_{at}}\left\{(\rho_{t}-\rho_{a})^{\top}\psi(x_{s})\right\}^{2}+\left(\frac{144}{c_{0}}+C_{0}\right)\left(m\log T+\log(T/\delta)\right),$$

where $C_{0}>0$ is a constant depending on $d_{\mathcal{B}},\bar{C}_{\psi},\overline{\iota}$.

Lemma [4\.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3) is derived through standard concentration arguments. For a fixed $\rho$, the summands of $H_{at}(\rho)$ form a martingale difference sequence whose conditional variance is controlled by the quadratic term that appears in the population curvature lemma. A peeling argument over the size of this quadratic term, combined with an $\epsilon$-net argument over $\mathcal{B}$, yields the uniform bound, which therefore also holds for adaptive $\rho_{t}$.

We now build the confidence interval for the scalar index $\rho_{a}^{\top}\psi(x_{t})$ using Lemmas [4\.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) and [4\.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3). Define the regularized empirical design matrix

$$V_{at}(\eta_{1}):=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top},\tag{13}$$

where $\eta_{1}>0$ is a regularization parameter and $\mathbf{I}_{m}$ is the $m\times m$ identity matrix. Intuitively, $V_{at}(\eta_{1})$ summarizes the contexts in which box $a$ has been queried before period $t$. The next proposition establishes a high-probability bound on $|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|$.

###### Proposition 4\.4

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4\.1](https://arxiv.org/html/2606.07392#S4.SS1) hold and $\mu^{*}$ is known. Fix any $a\in[A]$ and any $\delta>0$. Then, with probability at least $1-\delta$, uniformly over all $t\in[T]$, we have

$$|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},\tag{14}$$

where $B_{at}^{*}=\sqrt{C_{1}[m\log(T)+\log(TA/\delta)]+\eta_{1}d_{\mathcal{B}}^{2}}$, and $C_{1}$ is a constant depending on the constants $\kappa,\mu_{1},\bar{\iota},\bar{C}_{\psi},d_{\mathcal{B}}$ defined in Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4\.1](https://arxiv.org/html/2606.07392#S4.SS1).

The bound in Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) has the usual online-learning interpretation as a confidence width of the kind used in bandit UCB algorithms (see, e.g., Abbasi-Yadkori et al. [2011](https://arxiv.org/html/2606.07392#bib.bib56), Lattimore and Szepesvári [2020](https://arxiv.org/html/2606.07392#bib.bib50)). The width is large when the current context points in a direction that has not been well explored for box $a$, and small when the historical opened-box contexts provide enough information in that direction.

Motivated by Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4), we define the optimistic index in the known-$\mu^{*}$ case as

$$\widetilde{\sigma}_{at}=\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\right).\tag{15}$$

Since $\Lambda$ is increasing, Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) implies $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ uniformly over $a\in[A],t\in[T]$ with probability at least $1-\delta$. This, in turn, is used to establish the uniform optimism event required for the regret guarantee. In implementation, one may calibrate this radius using a bootstrap approximation to the distribution of $(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})$, which may yield less conservative confidence bounds. Such a bootstrap procedure is not covered by the present theory, and proving its validity would require a uniform bootstrap approximation under adaptive sampling.
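The construction in (13)–(15) can be sketched numerically. The following is a minimal illustration, not the paper's implementation: the logistic function standing in for $\Lambda$, the radius value `2.0` standing in for $B_{at}^{*}$, and all function names are assumptions made for the example.

```python
import numpy as np

def design_matrix(psi_hist, eta1):
    """V_at(eta1) = eta1 * I_m + sum_s psi(x_s) psi(x_s)^T, as in (13)."""
    psi_hist = np.atleast_2d(psi_hist)
    m = psi_hist.shape[1]
    return eta1 * np.eye(m) + psi_hist.T @ psi_hist

def weighted_norm(psi, V):
    """||psi||_{V^{-1}} = sqrt(psi^T V^{-1} psi), without forming V^{-1}."""
    return float(np.sqrt(psi @ np.linalg.solve(V, psi)))

def optimistic_index(rho_hat, psi, V, radius, link):
    """link(rho_hat^T psi + radius * ||psi||_{V^{-1}}), as in (15)."""
    return link(rho_hat @ psi + radius * weighted_norm(psi, V))

def logistic(u):
    # Assumption: a logistic map stands in for the monotone function Lambda.
    return 1.0 / (1.0 + np.exp(-u))

# Hypothetical history: box a was queried 50 times along the first coordinate.
psi_hist = np.tile(np.array([1.0, 0.0]), (50, 1))
V = design_matrix(psi_hist, eta1=1.0)

explored = weighted_norm(np.array([1.0, 0.0]), V)    # well-explored direction
unexplored = weighted_norm(np.array([0.0, 1.0]), V)  # never-explored direction

rho_hat = np.array([0.3, -0.2])   # hypothetical estimate of rho_a
psi_t = np.array([0.0, 1.0])
plain = logistic(rho_hat @ psi_t)
optimistic = optimistic_index(rho_hat, psi_t, V, radius=2.0, link=logistic)
```

In this toy history the width along the explored coordinate shrinks to $1/\sqrt{51}$, while the width along the unexplored coordinate stays at $1/\sqrt{\eta_1}=1$, matching the interpretation given after Proposition 4.4.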

### 4\.3 Regret under Known Reward Function

Now we bound the cumulative regret when $\mu^{*}$ is known so that $\widetilde{\mu}_{t}=\mu^{*}$. When the indices are optimistic, i.e., $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ for all $a\in[A]$ and $t\in[T]$, Theorem [3\.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) gives $\Delta_{t}(\tilde{\pi})\leq\mathbb{E}\left[\sum_{a\in\mathcal{A}_{t}}\left(\widetilde{\sigma}_{at}-\sigma_{at}^{*}\right)\,\middle|\,x_{t},\mathcal{F}_{t-1}\right]$. Using the Lipschitz continuity of $\Lambda$ and the confidence bound in Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4), we obtain $\widetilde{\sigma}_{at}-\sigma_{at}^{*}\leq 2LB_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}$ on the high-probability event, where $B_{at}^{*}\leq\mathrm{O}(\sqrt{m\log(T)+\log(TA)})$. Thus the cumulative regret is controlled by $\sum_{t=1}^{T}\sum_{a\in\mathcal{A}_{t}}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}$. Every time box $a$ is queried, the corresponding feature vector $\psi(x_{t})$ is added to its design matrix. Therefore, the uncertainty in the same direction decreases over time. The standard elliptical-potential argument implies that for each box $a\in[A]$,

$$\sum_{t=1}^{T}B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}\leq\widetilde{O}(m\sqrt{T}).\tag{16}$$

Summing this bound over the $A$ boxes yields the following regret guarantee:

###### Proposition 4\.5

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4\.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. When the reward function $\mu^{*}$ is known,

$$\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\widetilde{O}\left(Am\sqrt{T}\right).$$

Compared with the $\widetilde{O}(T^{5/6})$ bound of Atsidakou et al. ([2024](https://arxiv.org/html/2606.07392#bib.bib20)) for a more general contextual Pandora’s Box setting, our $\widetilde{O}(\sqrt{T})$ rate relies on the additional local-curvature structure in Assumption [4\.1](https://arxiv.org/html/2606.07392#S4.SS1), which enables UCB-style confidence control for the reservation indices.
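The elliptical-potential step behind (16) can be checked numerically for a single box. The sketch below is illustrative only: the unit-norm Gaussian features, the values of `m`, `T`, and `eta1`, and the function name are assumptions. The comparison bound combines Cauchy-Schwarz with the standard log-determinant lemma, which applies here because $\|\psi_t\|_2\leq 1$ and $\eta_1\geq 1$.

```python
import numpy as np

def potential_sum(features, eta1):
    """Return sum_t ||psi_t||_{V_t^{-1}}, where V_t uses only features before t."""
    m = features.shape[1]
    V = eta1 * np.eye(m)
    total = 0.0
    for psi in features:
        total += np.sqrt(psi @ np.linalg.solve(V, psi))
        V += np.outer(psi, psi)  # the queried feature joins the design matrix
    return total

rng = np.random.default_rng(0)
m, T, eta1 = 5, 2000, 1.0
features = rng.normal(size=(T, m))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit-norm rows

lhs = potential_sum(features, eta1)
# sum_t ||psi_t||^2_{V_t^{-1}} <= 2 m log(1 + T / (m eta1)), then Cauchy-Schwarz.
rhs = np.sqrt(T * 2.0 * m * np.log(1.0 + T / (m * eta1)))
```

The sum of widths grows like $\sqrt{mT\log T}$; multiplying by a radius of order $\sqrt{m\log T}$ gives the $\widetilde{O}(m\sqrt{T})$ scaling in (16).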

## 5 Results under Unknown Reward Function

We now drop the assumption that $\mu^{*}$ is known and study the problem where $\mu^{*}$ must be learned online. The known-$\mu^{*}$ result of the previous section guides the analysis. Relative to that section, two additional challenges arise:

- (i) The reward estimation term in Theorem [3\.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) is no longer zero.
- (ii) The reservation index estimation can no longer use the true conditional expected reward $\mu^{*}(x_{s},\omega_{as})$ when defining the loss.

Recall from Section [2](https://arxiv.org/html/2606.07392#S2) that we impose the generalized linear parametric structure on the expected reward function:

$$\mu^{*}(x,\omega)=G(\theta_{*}^{\top}\phi(x,\omega)),\qquad\forall(x,\omega)\in\mathcal{X}\times\Omega,\tag{17}$$

where $G:\mathbb{R}\rightarrow[0,1]$ is strictly increasing. In Section [5\.1](https://arxiv.org/html/2606.07392#S5.SS1) we address (i) using a standard penalized maximum likelihood estimator (MLE) construction based on ([17](https://arxiv.org/html/2606.07392#S5.E17)). In Section [5\.2](https://arxiv.org/html/2606.07392#S5.SS2) we address (ii) by showing that the main intuitions from Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) continue to hold. In particular, Proposition [11\.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2) extends Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) to this setting by accounting for the plug-in error in the loss. As in the known-reward case, the index estimator is controlled by population curvature and empirical concentration, and now also by an additional perturbation term from reward estimation. Once the reward parameter is learned at the standard online rate, this perturbation is small enough to preserve the overall $\widetilde{O}(\sqrt{T})$ regret rate.

However, challenge \(ii\) with unknown reward is more delicate because reward learning uses selected outputs while index estimation is based on queried outputs\. Hence controlling the plug\-in loss requires an additional assumption ensuring that the reward\-design matrix is sufficiently informative for the queried features that enter the index estimator\.

We begin by imposing the following regularity condition on the expected reward function.

###### Assumption (Regularity conditions for expected reward function)

(i) $G(\cdot)$ is continuously differentiable and strictly increasing, with Lipschitz constant $L$. Furthermore, $\underline{\mu}:=\inf_{\theta\in\Theta,x\in\mathcal{X},\omega\in\Omega}G^{\prime}(\theta^{\top}\phi(x,\omega))>0$, and for some absolute constant $L_{1}>0$, $|G^{\prime\prime}(\theta^{\top}\phi(x,\omega))|\leq L_{1}$ for all $x\in\mathcal{X},\omega\in\Omega,\theta\in\Theta$. (ii) For absolute constants $\bar{C}_{\phi},\bar{\alpha}>0$, $\|\phi(x,\omega)\|_{2}\leq\bar{C}_{\phi}$ for all $\omega\in\Omega,x\in\mathcal{X}$, and $\|\theta_{*}\|_{2}\leq\bar{\alpha}$.

Assumption [5](https://arxiv.org/html/2606.07392#S5) is standard in the generalized linear contextual bandit literature (e.g., Filippi et al. [2010](https://arxiv.org/html/2606.07392#bib.bib15), Li et al. [2017](https://arxiv.org/html/2606.07392#bib.bib3), Lee et al. [2024](https://arxiv.org/html/2606.07392#bib.bib16)). The Lipschitz constant $L$ controls how errors in $\hat{\theta}_{t-1}$ propagate into the reward prediction, the lower bound $\underline{\mu}$ on $G^{\prime}$ ensures uniform strong convexity of the objective, and the boundedness of $\phi$ and $\theta_{*}$ keeps the analysis in the standard regime.

We next assume that the random reward observed by the DM is drawn from a canonical exponential family.

###### Assumption

Given any $x\in\mathcal{X}$ and $\omega\in\Omega$, conditioning on the pair of input context and output of the selected API $(x_{t},\omega_{a_{t}t})=(x,\omega)$ during period $t$, the density of the random reward $r_{t}=G(\theta_{*}^{\top}\phi(x,\omega))+\zeta_{t}$ with respect to a reference measure is given by

$$f(r|x,\omega)=\exp\left(r\theta_{*}^{\top}\phi(x,\omega)-b(\theta_{*}^{\top}\phi(x,\omega))+\xi(r)\right),\tag{18}$$

where $0\leq r_{t}\leq\gamma_{0}$ a.s. for some constant $\gamma_{0}>1$, $\xi(\cdot)$ is a real function, $b(\cdot)$ is twice continuously differentiable, and $\mathbb{E}[\zeta_{t}|\omega_{a_{t}t},x_{t},\mathcal{F}_{t-1}]=0$ for any $t\in[T]$ almost surely.

Assumption [5](https://arxiv.org/html/2606.07392#S5) is standard in the generalized linear contextual bandit literature (Filippi et al. [2010](https://arxiv.org/html/2606.07392#bib.bib15), Li et al. [2017](https://arxiv.org/html/2606.07392#bib.bib3), Zhou et al. [2019](https://arxiv.org/html/2606.07392#bib.bib4), Lee et al. [2024](https://arxiv.org/html/2606.07392#bib.bib16), etc.). It encompasses Bernoulli, logistic, and binomial reward distributions, and more generally the exponential family (e.g., Gaussian, Poisson, etc.) with support truncated to $[0,\gamma_{0}]$. By definition, differentiating $b$ gives $b^{\prime}=G$, the inverse link function. Common examples of $G$ include any known cumulative distribution function (CDF) such as the logistic and probit CDFs.

### 5\.1 Estimating the Reward Parameter

The reward estimation essentially follows a generalized linear contextual bandit framework. At each period $t$ the deployment $(x_{t},\omega_{a_{t}t})$ yields a reward observation $r_{t}$, and the dataset $\{(x_{k},\omega_{a_{k}k},r_{k})\}_{k=1}^{t-1}$ accumulates over time. We estimate $\theta_{*}$ by a penalized (projected) maximum likelihood estimator

$$\hat{\theta}_{t-1}:=\operatorname*{arg\,min}_{\theta\in\Theta}\left\|\sum_{k=1}^{t-1}\big\{r_{k}-G\big(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\big)\big\}\phi(x_{k},\omega_{a_{k}k})-\eta_{0}\theta\right\|_{\Phi_{t-1}^{-1}},\tag{19}$$

where

$$\Phi_{t-1}=\eta_{0}\mathbf{I}_{d}+\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\phi(x_{k},\omega_{a_{k}k})^{\top}\tag{20}$$

is the regularized design matrix of reward features collected through period $t-1$, $\eta_{0}>0$ is a regularization parameter, and $\mathbf{I}_{d}$ is the $d\times d$ identity matrix. The term within the weighted norm $\|\,\cdot\,\|_{\Phi_{t-1}^{-1}}$ in ([19](https://arxiv.org/html/2606.07392#S5.E19)) is the gradient of the unconstrained penalized log-likelihood objective $\max_{\theta}\sum_{k=1}^{t-1}\log f(r_{k}\mid x_{k},\omega_{a_{k}k})-\tfrac{\eta_{0}}{2}\|\theta\|_{2}^{2}$, and $\hat{\theta}_{t-1}$ can be viewed as a “projection” onto $\Theta$ if the unconstrained solution falls outside $\Theta$. This construction follows Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)).
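As a rough illustration of (19), the sketch below finds the root of the penalized score by Newton iterations. It is a simplified stand-in under stated assumptions: the logistic function plays the role of $G$, rewards are Bernoulli, the data are synthetic, and the projection onto $\Theta$ is omitted (when the unconstrained root lies in $\Theta$, the weighted norm in (19) is zero there, so the root is a minimizer). All names are hypothetical.

```python
import numpy as np

def sigmoid(u):
    # Assumption: logistic inverse link standing in for G.
    return 1.0 / (1.0 + np.exp(-u))

def score(theta, feats, rewards, eta0):
    """Vector inside the weighted norm of (19):
    sum_k (r_k - G(theta^T phi_k)) phi_k - eta0 * theta."""
    resid = rewards - sigmoid(feats @ theta)
    return feats.T @ resid - eta0 * theta

def penalized_mle(feats, rewards, eta0, iters=50, tol=1e-10):
    """Newton iterations for the root of the score (no projection step)."""
    d = feats.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        g = score(theta, feats, rewards, eta0)
        if np.linalg.norm(g) < tol:
            break
        p = sigmoid(feats @ theta)
        w = p * (1.0 - p)  # G'(theta^T phi_k) for the logistic link
        hess = (feats * w[:, None]).T @ feats + eta0 * np.eye(d)
        theta = theta + np.linalg.solve(hess, g)
    return theta

# Synthetic data with a hypothetical true parameter.
rng = np.random.default_rng(1)
theta_star = np.array([1.0, -0.5, 0.25])
feats = rng.uniform(-1.0, 1.0, size=(2000, 3))
rewards = rng.binomial(1, sigmoid(feats @ theta_star)).astype(float)

theta_hat = penalized_mle(feats, rewards, eta0=1.0)
residual = np.linalg.norm(score(theta_hat, feats, rewards, eta0=1.0))
```

Because the penalized log-likelihood is strictly concave for $\eta_0>0$, the score has a unique root, which is what the Newton iteration targets here.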

The following standard result provides a high-probability bound on the prediction error induced by using $\hat{\theta}_{t-1}$ to evaluate the reward of a candidate context-output pair $(x_{t},\omega_{at})$.

###### Lemma 5\.1

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S5) and [5](https://arxiv.org/html/2606.07392#S5) hold and fix any $\delta\in(0,\min\{1,2d/e\})$. With probability at least $1-\delta/2$, uniformly over all $a\in[A]$ and $t\geq 2$,

$$\big|(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{t},\omega_{at})\big|\;\leq\;\beta_{t}\;\big\|\phi(x_{t},\omega_{at})\big\|_{\Phi_{t-1}^{-1}},$$

where $\beta_{t}=\mathrm{O}\left(\sqrt{d\log(t)+\log(1/\delta)}\right)$ is defined in ([61](https://arxiv.org/html/2606.07392#S9.E61)).

This is the ellipsoidal confidence bound for generalized linear bandits, which has a structure similar to that of Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) for the index parameter estimation error. The matrix $\Phi_{t-1}$ plays the role that $V_{at}(\eta_{1})$ played for the indices, and $\beta_{t}$ grows logarithmically in $t$. The proof of Lemma [5\.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) adapts Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)) via a self-normalized martingale inequality. Equipped with this bound, we construct the optimistic reward as

$$\widetilde{\mu}_{t}(x_{t},\omega_{at})\;:=\;G\!\left(\hat{\theta}_{t-1}^{\top}\phi(x_{t},\omega_{at})+\beta_{t}\,\big\|\phi(x_{t},\omega_{at})\big\|_{\Phi_{t-1}^{-1}}\right).\tag{21}$$

Because $G$ is monotonically increasing by Assumption [5](https://arxiv.org/html/2606.07392#S5), Lemma [5\.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) implies that with probability at least $1-\delta/2$, uniformly over all $a\in[A]$ and $t\geq 2$, $\widetilde{\mu}_{t}(x_{t},\omega_{at})\;\geq\;\mu^{*}(x_{t},\omega_{at})$.
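The optimism step from the confidence bound to (21) can be illustrated with a small deterministic check. In this sketch the logistic function stands in for $G$, and the radius `beta` is set to $\|\hat{\theta}-\theta_{*}\|_{\Phi}$, a choice made only for the example (it uses the unknown $\theta_{*}$ and is not the paper's $\beta_t$). With that choice, Cauchy-Schwarz guarantees the bound of Lemma 5.1 for every feature vector, so optimism must follow from monotonicity of $G$.

```python
import numpy as np

def sigmoid(u):
    # Assumption: logistic inverse link standing in for G.
    return 1.0 / (1.0 + np.exp(-u))

def optimistic_reward(theta_hat, phi, Phi_mat, beta, link):
    """link(theta_hat^T phi + beta * ||phi||_{Phi^{-1}}), as in (21)."""
    width = beta * np.sqrt(phi @ np.linalg.solve(Phi_mat, phi))
    return link(theta_hat @ phi + width)

# Hypothetical parameter, estimate, and regularized design matrix.
theta_star = np.array([0.5, -0.3])
theta_hat = np.array([0.4, -0.1])
Phi_mat = np.array([[3.0, 0.5],
                    [0.5, 2.0]])

# Illustration-only radius: ||theta_hat - theta_star||_Phi.
diff = theta_hat - theta_star
beta = np.sqrt(diff @ Phi_mat @ diff)

rng = np.random.default_rng(0)
candidates = rng.uniform(-1.0, 1.0, size=(100, 2))  # hypothetical phi(x_t, omega_at)
optimistic = np.array([optimistic_reward(theta_hat, p, Phi_mat, beta, sigmoid)
                       for p in candidates])
truth = sigmoid(candidates @ theta_star)
```

Every entry of `optimistic` dominates the corresponding entry of `truth`, which is the deterministic core of the high-probability statement following (21).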

### 5\.2 Estimating Indices

Next, we estimate the reservation indices and construct their associated confidence intervals for the regime where the reward function is unknown. As an unknown-reward analogue to Section [4\.2](https://arxiv.org/html/2606.07392#S4.SS2), we replace the true reward $\mu^{*}(x_{s},\omega_{as})$ by its plug-in estimate $G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))$ for each $s\leq t$. The main goal is to show that this substitution preserves the same confidence-bound structure as Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4), up to an additional perturbation term controlled by the reward estimation error bound. Fix a period $t$ and a box $a\in[A]$. For any $\rho\in\mathcal{B}$ and any past queried sample $s\in\mathcal{S}_{at}$, define the plug-in loss

$$\hat{\ell}_{as,t}(\rho):=\int_{0}^{\rho^{\top}\psi(x_{s})}\left[c_{as}-\left\{G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u)\right\}^{+}\right]du.\tag{22}$$

This is the same loss as ([8](https://arxiv.org/html/2606.07392#S4.E8)), with $\mu^{*}$ replaced by the current reward estimate. The plug-in index estimator is

$$\hat{\rho}_{at}\in\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\rho).\tag{23}$$

The first-order condition of ([23](https://arxiv.org/html/2606.07392#S5.E23)) is the empirical analogue of the Weitzman moment equation with the reward function evaluated at $\hat{\theta}_{t-1}$. The new challenge compared to Section [4\.2](https://arxiv.org/html/2606.07392#S4.SS2) is that the empirical loss in ([23](https://arxiv.org/html/2606.07392#S5.E23)) is not the oracle loss. To isolate this difference, define the plug-in perturbation

$$J_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\left[\hat{\ell}_{as,t}(\rho)-\ell_{as}(\rho)\right]-\left[\hat{\ell}_{as,t}(\rho_{a})-\ell_{as}(\rho_{a})\right]\right\},\tag{24}$$

where $\ell_{as}(\rho)$ is the known-$\mu^{*}$ loss in ([8](https://arxiv.org/html/2606.07392#S4.E8)). The term $J_{at}(\rho)$ measures how the estimation error of $\hat{\theta}_{t-1}$ propagates into the index-estimation loss.
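The plug-in estimator in (22)–(23) can be sketched with projected gradient descent, using the fact that the gradient of (22) with respect to $\rho$ is $\big[c_{as}-\{\hat{\mu}_{s}-\Lambda(\rho^{\top}\psi(x_{s}))\}^{+}\big]\psi(x_{s})$, where $\hat{\mu}_{s}$ denotes the plug-in reward. This is a toy illustration under several assumptions not made in the paper: $\Lambda$ is the identity, $\mathcal{B}$ is a Euclidean ball, the step size is fixed, and the data are a deterministic grid. In the scalar example, plug-in rewards are spread uniformly on $[0,1]$ with constant cost $0.08$, so the Weitzman moment equation $c=\mathbb{E}[(\mu-\sigma)^{+}]=(1-\sigma)^{2}/2$ has solution $\sigma=0.6$.

```python
import numpy as np

def plug_in_gradient(rho, psi, cost, mu_hat, link):
    """Gradient of the averaged plug-in loss:
    mean_s [c_s - (mu_hat_s - link(rho^T psi_s))^+] psi_s."""
    z = psi @ rho
    slack = cost - np.maximum(mu_hat - link(z), 0.0)
    return psi.T @ slack / len(cost)

def fit_index(psi, cost, mu_hat, link, radius, step=1.0, iters=500):
    """Projected gradient descent over a Euclidean ball (assumed shape of B)."""
    rho = np.zeros(psi.shape[1])
    for _ in range(iters):
        rho = rho - step * plug_in_gradient(rho, psi, cost, mu_hat, link)
        norm = np.linalg.norm(rho)
        if norm > radius:
            rho *= radius / norm  # project back onto the ball
    return rho

def identity(u):
    # Assumption: Lambda(u) = u, chosen only to make the example transparent.
    return u

n = 1000
mu_hat = (np.arange(n) + 0.5) / n   # plug-in rewards on a uniform grid in [0, 1]
cost = np.full(n, 0.08)             # constant query cost
psi = np.ones((n, 1))               # scalar index feature psi(x) = 1

rho_hat = fit_index(psi, cost, mu_hat, identity, radius=5.0)
```

The fitted value `rho_hat[0]` is close to $0.6$: the stationarity condition of the averaged loss is exactly the empirical moment equation $\frac{1}{n}\sum_{s}(\hat{\mu}_{s}-\sigma)^{+}=c$.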

We impose the following additional assumption for the unknown-reward case. It ensures that the reward-design matrix contains enough information in all directions needed to control the plug-in perturbation. We use $\mathrm{polylog}(T)$ to denote a quantity bounded by a polynomial in $\log T$ (i.e., $\mathrm{polylog}(T)=\mathrm{O}((\log T)^{c})$ for some constant $c>0$).

###### Assumption (Anti-concentration of reward features)

There exists a sequence $M_{T}\geq 1$ with $M_{T}=\sqrt{\mathrm{polylog}(T)}/A$ such that, for any $t\in[T]$, $a\in[A]$, $v\in\mathbb{S}^{d-1}$, and $0<\epsilon\leq(2AM_{T})^{-1}$,

$$\mathbb{P}\!\left(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\right)\leq M_{T}\epsilon\qquad\text{a.s.}$$

Recall that the contexts $x_{t}$ are independent across time, implying that the distribution of the reward feature $\phi(x_{t},\omega_{at})$ is independent of $\mathcal{F}_{t-1}$. Assumption [5\.2](https://arxiv.org/html/2606.07392#S5.SS2) leverages this independence to impose an anti-concentration condition that rules out degenerate reward features. Informally, every one-dimensional projection of $\phi(x_{t},\omega_{at})$ must have nontrivial variation near zero. This condition is imposed because $\theta_{*}$ is learned only from selected outputs, whereas the index loss for box $a$ uses queried outputs. Thus, the reward features must be sufficiently informative for evaluating the queried features that enter the index estimator. For example, the assumption is satisfied when the embedding distribution is a truncated Gaussian, a truncated multivariate Student’s $t$, or a uniform distribution on a hyperrectangle; see Lemmas [10\.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3)–[10\.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7) in Appendix [10](https://arxiv.org/html/2606.07392#S10).
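The anti-concentration condition can be probed by Monte Carlo for one of the examples mentioned above, a uniform distribution on a hyperrectangle. The sketch below is only an empirical illustration: the square $[-1,1]^{2}$, the sample size, the value of `eps`, and the constant `M = 2.0` are assumptions for the example and are not the paper's $M_T$ or its admissible range of $\epsilon$.

```python
import numpy as np

def small_ball_probability(samples, v, eps):
    """Monte Carlo estimate of P(|v^T phi| <= eps) for a unit direction v."""
    v = v / np.linalg.norm(v)
    return float(np.mean(np.abs(samples @ v) <= eps))

rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=(200_000, 2))  # hypothetical reward features
eps = 0.05
M = 2.0  # hypothetical anti-concentration constant for this example

# Axis direction: the projection is uniform on [-1, 1], so the probability is eps.
p_axis = small_ball_probability(samples, np.array([1.0, 0.0]), eps)
# Diagonal direction: the projection has a triangular density peaking at zero.
p_diag = small_ball_probability(samples, np.array([1.0, 1.0]), eps)
```

Both estimates stay below `M * eps`, so neither projection concentrates near zero; a degenerate feature (for instance, one coordinate identically zero) would instead give probability one along that coordinate.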

The following lemma bounds the additional perturbation caused by using $\hat{\theta}_{t-1}$ in the index loss.

###### Lemma 5\.3 (Plug-in loss perturbation)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [5](https://arxiv.org/html/2606.07392#S5) hold. Then, for any $c_{0}>0$,

$$-J_{at}(\hat{\rho}_{at})\leq\frac{c_{0}}{8}\sum_{s\in\mathcal{S}_{at}}\left\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\right\}^{2}+\frac{2L^{2}}{c_{0}}\sum_{s\in\mathcal{S}_{at}}\left\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\right\}^{2}.$$

Lemma [5\.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) shows why Assumption [5\.2](https://arxiv.org/html/2606.07392#S5.SS2) is needed. The first term on the right-hand side has the same quadratic form as the population curvature in Lemma [4\.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) and can therefore be absorbed into the curvature argument. The second term is the reward estimation error evaluated on queried outputs. By Lemma [5\.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1), this term is controlled by the inverse-design norms $\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}$, where $\Phi_{t-1}$ is constructed from previously selected outputs, while $\phi(x_{s},\omega_{as})$ corresponds to queried outputs used for estimating the index of box $a$. Thus, there is a potential mismatch between the selected outputs that inform reward learning and the queried outputs that enter index estimation. Assumption [5\.2](https://arxiv.org/html/2606.07392#S5.SS2) rules out severe mismatches of this form by ensuring that $\Phi_{t-1}$ is sufficiently well conditioned, so these inverse-design norms do not remain large.

We now state the confidence bound for the index estimator:

###### Proposition 5\.4 (Index confidence bound with unknown reward function)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5)–[5\.2](https://arxiv.org/html/2606.07392#S5.SS2) hold. Fix any $\delta>0$. Then, with probability at least $1-2\delta/3$, uniformly over all $a\in[A]$ and all $t\in[T]$,

$$\left|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})\right|\leq B_{at}\left\|\psi(x_{t})\right\|_{V_{at}(\eta_{1})^{-1}},\tag{26}$$

where $V_{at}(\eta_{1})$ is defined in ([13](https://arxiv.org/html/2606.07392#S4.E13)),

$$B_{at}=C_{2}\sqrt{m\log(T)+\log(AT/\delta)+[d\log(t)+\log(1/\delta)]\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2}},$$

and $C_{2}$ is a constant depending on $\eta_{0},\eta_{1},\kappa,\mu_{1},\bar{\iota},d_{\mathcal{B}},\bar{C}_{\psi},L,\bar{C}_{\phi},\gamma_{0},\underline{\mu}$, and $\bar{\alpha}$.

The proof follows the same localized empirical-process argument as Proposition [4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4). By the optimality of $\hat{\rho}_{at}$ in ([23](https://arxiv.org/html/2606.07392#S5.E23)), $\sum_{s\in\mathcal{S}_{at}}\left\{\hat{\ell}_{as,t}(\hat{\rho}_{at})-\hat{\ell}_{as,t}(\rho_{a})\right\}\leq 0$. Decomposing this into the oracle population curvature, the empirical fluctuation, and the plug-in perturbation gives

$$\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\!\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_{a})\,\middle|\,\mathcal{F}_{s-1},x_{s}\right]+H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at})\leq 0.$$

The first term is lower bounded by the curvature result in Lemma [4\.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2), the second is controlled by the empirical-process bound in Lemma [4\.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3), and the third is controlled by Lemma [5\.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) together with the reward confidence bound in Lemma [5\.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1). Combining these bounds yields ([26](https://arxiv.org/html/2606.07392#S5.E26)). The detailed proof is provided in Appendix [11](https://arxiv.org/html/2606.07392#S11). Motivated by Proposition [5\.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4), define the optimistic reservation index as

$$\widetilde{\sigma}_{at}:=\Lambda\!\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}\left\|\psi(x_{t})\right\|_{V_{at}(\eta_{1})^{-1}}\right).\tag{27}$$

Since $\Lambda$ is increasing, Proposition [5\.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) implies that, with probability at least $1-\delta$, $\widetilde{\sigma}_{at}\geq\Lambda(\rho_{a}^{\top}\psi(x_{t}))=\sigma_{at}^{*}$ uniformly over all $a\in[A]$ and $t\in[T]$.

### 5\.3 Regret Analysis under Unknown Reward Function

We now combine the reward and index confidence bounds to control the cumulative regret of COSMOS when the reward function is unknown. The preceding subsections established two high-probability optimism events: Lemma [5\.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) gives optimism for the reward estimate $\widetilde{\mu}_{t}$, and Proposition [5\.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) gives optimism for the reservation index estimate $\widetilde{\sigma}_{at}$. Therefore, when both events hold, the regret decomposition in Theorem [3\.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) applies and separates the period-$t$ regret into a reward estimation term and an index-estimation term for all $a\in[A]$ and $t\in[T]$. This structure parallels the known-reward analysis in Section [4](https://arxiv.org/html/2606.07392#S4). When the reward function is unknown, an additional generalized linear bandit term appears from estimating the reward parameter.

As shown in Appendix [10](https://arxiv.org/html/2606.07392#S10), Assumption [5\.2](https://arxiv.org/html/2606.07392#S5.SS2) implies that, with high probability, $\lambda_{\min}(\Phi_{t-1})\geq\frac{t-1}{16\,\mathrm{polylog}(T)}$ uniformly for all $t\geq\tau:=C\,\mathrm{polylog}(T)\log(Ad)$ for some absolute constant $C$. This eigenvalue growth allows the reward estimation error in $J_{at}(\rho)$ to be controlled uniformly over $t\geq\tau$, which in turn controls the cumulative regret over periods $t\geq\tau$. For the initial periods before $\lceil\tau\rceil$, the per-period regret is bounded by $(2A+1)$. Since $\tau=\mathrm{O}(\log(Ad)\,\mathrm{polylog}(T))$, the cumulative regret up to period $\lceil\tau\rceil$ is at most $\mathrm{O}(A\log(Ad)\,\mathrm{polylog}(T))$ and is therefore dominated by the regret accumulated over periods $t\geq\tau$.

###### Theorem 5\.5

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5)–[5\.2](https://arxiv.org/html/2606.07392#S5.SS2) hold. Then, ignoring logarithmic factors,

$$\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\widetilde{O}\left([d+A(m+\sqrt{md})]\sqrt{T}\right).$$

In particular, the regret bound is asymptotically minimized by choosing

$$\eta_{1}\asymp\max\left\{m^{1/2},(dm)^{1/4},d^{1/3}\right\},\qquad\eta_{0}\asymp\min\left\{\frac{d}{A\sqrt{m}},\left(\frac{d}{A\sqrt{m}}\right)^{2/3}\right\}.$$

The terms in Theorem [5.5](https://arxiv.org/html/2606.07392#S5.Thmtheorem5) correspond directly to the two components in Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). The term $\widetilde{O}(d\sqrt{T})$ is the reward-estimation regret, matching the usual dimension dependence for generalized linear contextual bandits with reward feature dimension $d$. The term $\widetilde{O}(A(m+\sqrt{dm})\sqrt{T})$ is the index-estimation regret, where $A$ is the number of boxes and $m$ is the dimension of the index feature $\psi$. Relative to the known-reward case in Proposition [4.5](https://arxiv.org/html/2606.07392#S4.Thmtheorem5), the additional term $\widetilde{O}(A\sqrt{dmT})$ reflects the effect of reward estimation error on index estimation. On the high-probability optimism event, Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) gives

$$\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\left\{\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu^{*}(x_{t},\omega_{a_{t}t})\right\}\right]+\mathbb{E}\left[\sum_{t=1}^{T}\sum_{a\in\mathcal{A}_{t}}\left(\widetilde{\sigma}_{at}-\sigma_{at}^{*}\right)\right].$$

The first sum is controlled by the reward confidence radius from Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1) and an elliptical-potential argument (Abbasi-Yadkori et al. [2011](https://arxiv.org/html/2606.07392#bib.bib56)) for the reward-design matrix $\Phi_{t-1}$. The second sum is controlled by the index confidence radius from Proposition [5.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) and a separate elliptical-potential argument for each box-specific design matrix $V_{at}(\eta_{1})$. Thus the unknown-reward analysis adds a standard reward-learning term while preserving the same $\sqrt{T}$ index-learning rate as in the known-reward benchmark. The detailed proof is provided in Appendix [12](https://arxiv.org/html/2606.07392#S12).
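As a rough illustration of how the three pieces of the bound in Theorem 5.5 scale, the following sketch evaluates $d\sqrt{T}$, $Am\sqrt{T}$, and $A\sqrt{dmT}$ with all constants and logarithmic factors dropped. The function name `regret_terms` and the sample values of `d`, `m`, `A`, and `T` are illustrative assumptions rather than quantities from the paper, so the numbers indicate relative magnitudes only.

```python
import math


def regret_terms(d, m, A, T):
    """Leading-order terms of the Theorem 5.5 bound, constants and logs dropped.

    d: reward feature dimension, m: index feature dimension,
    A: number of boxes (APIs), T: horizon.
    """
    root_t = math.sqrt(T)
    return {
        "reward": d * root_t,                 # reward-estimation regret, d * sqrt(T)
        "index": A * m * root_t,              # index-estimation regret, A * m * sqrt(T)
        "cross": A * math.sqrt(d * m * T),    # reward error feeding into index estimation
    }


# Hypothetical problem sizes, chosen only for illustration.
terms = regret_terms(d=16, m=4, A=3, T=10_000)
total = sum(terms.values())
for name, value in terms.items():
    print(f"{name:>6}: {value:10.1f}  ({value / total:.0%} of total)")
```

In this hypothetical configuration the cross term is the largest of the three, which is consistent with the remark that reward estimation error affects index estimation whenever $d$ is large relative to $m$.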

## 6 Conclusion

Motivated by LLM cascading, we introduce an online contextual Pandora’s Box model in which a decision-maker sequentially queries APIs to generate outputs at a cost and deploys a single output, observing only its downstream reward. This output-mediated feedback departs from the classical setting, where opening a box reveals its reward directly, and captures a key feature of LLM cascading systems: API-specific heterogeneity enters only through the distributions of generated outputs and costs, while downstream value is governed by a shared reward evaluator on the context-output pair. Rather than estimating the full conditional output and cost distributions, we impose a generalized linear structure directly on the reservation indices and the shared reward function. Our COSMOS policy combines GMM estimation of the reservation indices with UCB-style confidence bounds for both the indices and the reward evaluator. Through a regret decomposition under optimism, which separates cumulative regret into reward-estimation errors for deployed outputs and index-estimation errors for queried APIs, we derive an $\widetilde{\mathrm{O}}\left([d+A(m+\sqrt{dm})]\sqrt{T}\right)$ regret bound.

Our work opens several avenues for future research. First, because our theoretical guarantees rely on generalized linear specifications, extending the moment-based UCB framework to richer nonparametric or neural representations is a natural next step. Achieving this while preserving $\widetilde{O}(\sqrt{T})$ regret would further broaden the model’s applicability. Second, our current formulation assumes a standard LLM cascading structure where APIs are queried sequentially and a single output is selected and then deployed. Expanding this framework to allow for batched querying under latency budgets, or to support output ensembling and synthesis, would significantly change the query-selection dynamics. Addressing these richer action spaces raises compelling new questions about the structure of the learning policy.

## AI Disclosure

The authors used generative artificial intelligence tools, including GPT, Gemini, and Claude, to edit and polish earlier versions of the drafts\. All AI\-generated text and suggestions were checked, revised, and approved by the authors\. The authors take full responsibility for the accuracy, integrity, and originality of the submitted work\.

## Acknowledgment

We thank Ali Makhdoumi for helpful discussions on the general theory and intuition of Pandora’s Box, and Shreyas Sekar for an early conversation on AI-assisted coding and model selection for reducing LLM API costs that eventually led us to the FrugalGPT paper by Chen et al. ([2025a](https://arxiv.org/html/2606.07392#bib.bib68)). We also thank the participants of the 2026 Marketplace Innovation Workshop for comments and discussion.


## 7 Proofs for Section [3](https://arxiv.org/html/2606.07392#S3)

In this section we prove three results from Section [3](https://arxiv.org/html/2606.07392#S3): the optimality of the oracle reservation index policy under full information (Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1)), point identification of $\rho_{a}^{\top}\psi(x_{t})$ from the queried samples for box $a$ before period $t$ (Proposition [4.1](https://arxiv.org/html/2606.07392#S4.Thmtheorem1)), and a regret decomposition under optimistic reward and index estimators (Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3)).

The following lemma establishes that, at each period $t$, the query decision is independent of the stochastic output conditional on the historical data and the current context. Although the proof is immediate, the result is used repeatedly in subsequent arguments, so we state it explicitly here.

###### Lemma 7\.1

Given any $t\in[T]$, $x_{t}\in\mathcal{X}$ and $a\in[A]$, we have $\mathbb{I}\{a\in\mathcal{A}_{t}\}\perp\!\!\!\!\perp(\omega_{at},c_{at})\mid x_{t},\mathcal{F}_{t-1}$.

###### Proof 7\.2

Proof of Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1). For any $x_{t}\in\mathcal{X}$ and $a\in[A]$, under both the oracle reservation index policy and the COSMOS algorithm, the event $\{a\in\mathcal{A}_{t}\}$ is determined by the sequential construction, which depends on the history $\mathcal{F}_{t-1}$, the current context $x_{t}$, and the outputs of boxes queried before $a$. Under the conditional independence of $\{\omega_{a^{\prime}t}:a^{\prime}\in[A]\}$ given $x_{t}$ according to ([1](https://arxiv.org/html/2606.07392#S2.E1)), these preceding outputs are independent of $\omega_{at}$. Further, $c_{at}=c_{a}(x_{t},\omega_{at})$. Hence, $\mathbb{I}\{a\in\mathcal{A}_{t}\}\perp\!\!\!\!\perp(\omega_{at},c_{at})\mid x_{t},\mathcal{F}_{t-1}$ as claimed. $\blacksquare$

###### Proof 7\.3

Proof of Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1). For $t\in[T]$, let $Q_{a}^{t}=\mathbb{I}\{a\in\mathcal{A}_{t}\}$ and $S_{a}^{t}=\mathbb{I}\{a_{t}=a\}$. Since $\omega_{at}\sim p_{a}(\cdot|x_{t})$, ([2](https://arxiv.org/html/2606.07392#S3.E2)) implies

$$\mathbb{E}\left[\left\{\mu^{*}(x_{t},\omega_{at})-\sigma^{*}_{a}(x_{t})\right\}^{+}\,\middle|\,x_{t},\mathcal{F}_{t-1}\right]=\mathbb{E}[c_{at}\mid x_{t},\mathcal{F}_{t-1}],\quad\forall x_{t}.$$

Thus

$$\begin{aligned}
&\mathbb{E}\left[\sum_{a\in[A]}S_{a}^{t}\mu^{*}(x_{t},\omega_{at})-\sum_{a\in[A]}Q_{a}^{t}c_{at}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right]\\
&\quad=_{(i)}\mathbb{E}\left[\sum_{a\in[A]}S_{a}^{t}\mu^{*}(x_{t},\omega_{at})-\sum_{a\in[A]}Q_{a}^{t}\{\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*}\}^{+}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right]\\
&\quad=_{(ii)}\mathbb{E}\left[\sum_{a\in[A]}S_{a}^{t}\min\{\sigma_{at}^{*},\mu^{*}(x_{t},\omega_{at})\}+\sum_{a\in[A]}(S_{a}^{t}-Q_{a}^{t})\{\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*}\}^{+}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right]\\
&\quad\leq_{(iii)}\mathbb{E}\left[\sum_{a\in[A]}S_{a}^{t}\min\{\sigma_{at}^{*},\mu^{*}(x_{t},\omega_{at})\}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right]\\
&\quad\leq_{(iv)}\mathbb{E}\left[\max_{a\in[A]}\min\{\sigma_{at}^{*},\mu^{*}(x_{t},\omega_{at})\}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right],
\end{aligned}$$

where (i) holds because the independence $Q_{a}^{t}\perp\!\!\!\!\perp(\omega_{at},c_{at})\mid x_{t},\mathcal{F}_{t-1}$ from Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1) and the definition of $\sigma_{at}^{*}$ in ([2](https://arxiv.org/html/2606.07392#S3.E2)) together imply

$$\begin{aligned}
\mathbb{E}[Q_{a}^{t}c_{at}\mid x_{t},\mathcal{F}_{t-1}]&=\mathbb{E}[Q_{a}^{t}\mid x_{t},\mathcal{F}_{t-1}]\,\mathbb{E}[c_{at}\mid x_{t},\mathcal{F}_{t-1}]\\
&=\mathbb{E}[Q_{a}^{t}\mid x_{t},\mathcal{F}_{t-1}]\,\mathbb{E}[\{\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*}\}^{+}\mid x_{t},\mathcal{F}_{t-1}]\\
&=\mathbb{E}[Q_{a}^{t}\{\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*}\}^{+}\mid x_{t},\mathcal{F}_{t-1}],
\end{aligned}$$

(ii) holds by the identity $b=\min\{a,b\}+\{b-a\}^{+}$, (iii) holds since $S_{a}^{t}\leq Q_{a}^{t}$, and (iv) holds since $\sum_{a\in[A]}S_{a}^{t}=1$ and $S_{a}^{t}\geq 0$.

The inequalities (iii) and (iv) hold as equalities when using Weitzman’s principle via indices defined by ([2](https://arxiv.org/html/2606.07392#S3.E2)). Indeed, the policy selects a queried box $a$ whenever $\mu^{*}(x_{t},\omega_{at})>\sigma_{at}^{*}$, and the selected box attains $\max_{a\in[A]}\min\{\mu^{*}(x_{t},\omega_{at}),\sigma_{at}^{*}\}$. So the utility of the DM, $U_{t}$, satisfies

$$U_{t}=\mathbb{E}\left[\sum_{a\in[A]}S_{a}^{t}\mu^{*}(x_{t},\omega_{at})-\sum_{a\in[A]}Q_{a}^{t}c_{at}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right]=\mathbb{E}\left[\max_{a\in[A]}\min\{\sigma_{at}^{*},\mu^{*}(x_{t},\omega_{at})\}\,\bigg|\,x_{t},\mathcal{F}_{t-1}\right],\tag{28}$$

which concludes the proof. $\blacksquare$
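The identity (28) can be checked numerically on a toy instance. The sketch below is a minimal illustration under assumptions that are not part of the paper: a single fixed context, three boxes with two-point reward distributions, output-independent costs, and hypothetical helper names (`reservation_index`, `weitzman_utility`, `compare`). It solves $\mathbb{E}[(\mu-\sigma)^{+}]=c$ by bisection, runs the index policy (query in decreasing index order, stop once the best observed reward reaches the next index, deploy the best observed output), and compares its exact expected utility with $\mathbb{E}[\max_{a}\min\{\sigma_{a},\mu_{a}\}]$ by enumerating all outcomes.

```python
from itertools import product


def reservation_index(values, probs, cost, iters=200):
    """Solve E[(mu - sigma)^+] = cost for sigma by bisection."""
    def excess(sigma):
        return sum(p * max(v - sigma, 0.0) for v, p in zip(values, probs))

    lo, hi = min(values) - cost - 1.0, max(values)  # excess(lo) > cost, excess(hi) = 0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if excess(mid) > cost:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def weitzman_utility(rewards, sigmas, costs):
    """Realized utility of the index policy on one realization of the rewards."""
    order = sorted(range(len(sigmas)), key=lambda a: -sigmas[a])
    best, paid = float("-inf"), 0.0
    for a in order:
        if best >= sigmas[a]:  # best observed reward beats every remaining index
            break
        paid += costs[a]       # query box a and observe its reward
        best = max(best, rewards[a])
    return best - paid         # deploy the best observed output


def compare(boxes):
    """Exact expected policy utility vs. E[max_a min(sigma_a, mu_a)]."""
    sigmas = [reservation_index(v, p, c) for v, p, c in boxes]
    costs = [c for _, _, c in boxes]
    policy_value = capped_value = 0.0
    for combo in product(*[list(zip(v, p)) for v, p, _ in boxes]):
        rewards = [v for v, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p
        policy_value += prob * weitzman_utility(rewards, sigmas, costs)
        capped_value += prob * max(min(s, r) for s, r in zip(sigmas, rewards))
    return policy_value, capped_value


# Hypothetical boxes: (reward values, probabilities, query cost).
boxes = [
    ([0.1, 0.9], [0.5, 0.5], 0.10),
    ([0.3, 0.6], [0.5, 0.5], 0.05),
    ([0.0, 1.0], [0.8, 0.2], 0.15),
]
policy_value, capped_value = compare(boxes)
print(policy_value, capped_value)
```

On this instance the two quantities agree up to floating-point error, as (28) predicts for the oracle index policy.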

###### Proof 7\.4

Proof of Proposition [4.1](https://arxiv.org/html/2606.07392#S4.Thmtheorem1). For any $x_{t}\in\mathcal{X}$ and $a\in[A]$, we have $\mathbb{I}\{a\in\mathcal{A}_{t}\}\perp\!\!\!\!\perp\omega_{at}\mid x_{t},\mathcal{F}_{t-1}$ by Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1). Since $c_{at}=c_{a}(x_{t},\omega_{at})$, we have

$$\begin{aligned}
&\mathbb{E}\left[\{\mu^{*}(x_{t},\omega_{at})-\Lambda(\rho_{a}^{\top}\psi(x_{t}))\}^{+}-c_{at}\mid x_{t},\mathcal{F}_{t-1},a\in\mathcal{A}_{t}\right]\\
&\quad=\mathbb{E}\left[\{\mu^{*}(x_{t},\omega_{at})-\Lambda(\rho_{a}^{\top}\psi(x_{t}))\}^{+}-c_{at}\mid x_{t},\mathcal{F}_{t-1}\right].
\end{aligned}$$

For any given $a\in[A]$ and $x$, define

$$F_{a}(x,s):=\mathbb{E}\left[\{\mu^{*}(x,\omega_{at})-\Lambda(s)\}^{+}-c_{at}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}\right].$$

So we have $F_{a}\left(x,\rho_{a}^{\top}\psi(x)\right)=0$ for all $x\in\mathcal{X}$. Note that $\Lambda(\cdot)$ is strictly increasing and takes values in $[-1,1]$, and $c_{at}\in(0,1)$, so for every $x\in\mathcal{X}$, $F_{a}(x,s)$ is decreasing and continuous in $s$. Note that

$$\begin{aligned}
\lim_{s\rightarrow-\infty}F_{a}(x,s)&=\lim_{s\rightarrow-\infty}\mathbb{E}\left[\{\mu^{*}(x,\omega_{at})-\Lambda(s)\}^{+}-c_{at}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}\right]\\
&=_{(i)}\mathbb{E}\left[\{\mu^{*}(x,\omega_{at})+1\}^{+}-c_{at}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}\right]\\
&\geq\mathbb{E}\left[1-c_{at}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}\right]=_{(ii)}\mathbb{E}\left[1-c_{at}\mid x_{t}=x,\mathcal{F}_{t-1}\right]>_{(iii)}0,
\end{aligned}$$

where (i) follows from Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), (ii) follows because $c_{at}=c_{a}(x_{t},\omega_{at})$ and $\omega_{at}\perp\!\!\!\!\perp\mathbb{I}\{a\in\mathcal{A}_{t}\}\mid x_{t},\mathcal{F}_{t-1}$ by Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1), and (iii) follows because $c_{at}\in(0,1)$. Thus $\lim_{s\rightarrow-\infty}F_{a}(x,s)>0$. Similarly, Assumption [5](https://arxiv.org/html/2606.07392#S4.E5) implies $\lim_{s\rightarrow+\infty}F_{a}(x,s)<0$. Thus, by the intermediate value theorem, there exists a $\xi_{a}(x)\in\mathbb{R}$ such that $F_{a}(x,\xi_{a}(x))=0$. Now suppose that there exist $s_{1}<s_{2}$ such that $F_{a}(x,s_{1})=F_{a}(x,s_{2})=0$. Then we have

$$0=F_{a}(x,s_{1})-F_{a}(x,s_{2})=\mathbb{E}[(\mu^{*}(x,\omega_{at})-\Lambda(s_{1}))^{+}-(\mu^{*}(x,\omega_{at})-\Lambda(s_{2}))^{+}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}].\tag{29}$$

This implies that

$$\begin{aligned}
\mathbb{E}[(\mu^{*}(x,\omega_{at})-\Lambda(s_{1}))^{+}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}]&=\mathbb{E}[(\mu^{*}(x,\omega_{at})-\Lambda(s_{2}))^{+}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}]\\
&=\mathbb{E}[c_{at}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}]>0.
\end{aligned}$$

Since $s_{1}<s_{2}$, $\Lambda$ is strictly increasing, and the conditional expectation of $(\mu^{*}(x,\omega_{at})-\Lambda(s_{2}))^{+}$ above is strictly positive, there must exist an event $\mathcal{E}$ on which $\mu^{*}(x,\omega_{at})>\Lambda(s_{2})>\Lambda(s_{1})$ with $\mathbb{P}(\mathcal{E}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t})>0$, and

$$\begin{aligned}
F_{a}(x,s_{1})-F_{a}(x,s_{2})&=\mathbb{E}[(\mu^{*}(x,\omega_{at})-\Lambda(s_{1}))^{+}-(\mu^{*}(x,\omega_{at})-\Lambda(s_{2}))^{+}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}]\\
&\geq\mathbb{E}[(\Lambda(s_{2})-\Lambda(s_{1}))\mathbb{I}\{\mathcal{E}\}\mid x_{t}=x,\mathcal{F}_{t-1},a\in\mathcal{A}_{t}]>0,
\end{aligned}$$

which contradicts ([29](https://arxiv.org/html/2606.07392#S7.E29)). So the conditional moment restriction pins down the optimal index $\sigma_{a}^{*}(x)=\Lambda(\xi_{a}(x))$ uniquely, and $\xi_{a}(x)=\rho_{a}^{\top}\psi(x)$ for any $x\in\mathcal{X}$. $\blacksquare$

###### Proof 7\.5

Proof of Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3). Define $U_{t}(\tilde{\pi}):=\mu(\theta_{*};x_{t},\omega_{a_{t}t})-\sum_{a\in\mathcal{A}_{t}}c_{at}$, where $\tilde{\pi}$ is the COSMOS policy. For any $a\in[A]$, $t\in[T]$, define the auxiliary pseudo-cost $\tilde{c}_{at}:=(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})^{+}$. Let $\tilde{U}_{t}(\tilde{\pi})$ denote the realized utility of $\tilde{\pi}$ under cost $\tilde{c}_{at}$, i.e.,

$$\tilde{U}_{t}(\tilde{\pi}):=\max_{a\in\mathcal{A}_{t}}\widetilde{\mu}_{t}(x_{t},\omega_{at})-\sum_{a\in\mathcal{A}_{t}}\tilde{c}_{at}.$$

By ([3](https://arxiv.org/html/2606.07392#S3.E3)) in Proposition [3.1](https://arxiv.org/html/2606.07392#S3.Thmtheorem1), we have

$$\begin{aligned}
\mathbb{E}[U_{t}(\pi^{*})\mid x_{t},\mathcal{F}_{t-1}]&=\mathbb{E}\big[\max_{a\in[A]}\min\{\mu^{*}(x_{t},\omega_{at}),\sigma_{at}^{*}\}\mid x_{t},\mathcal{F}_{t-1}\big],\\
\mathbb{E}\left[\tilde{U}_{t}(\tilde{\pi})\mid x_{t},\mathcal{F}_{t-1}\right]&=\mathbb{E}\big[\max_{a\in[A]}\min\{\widetilde{\mu}_{t}(x_{t},\omega_{at}),\widetilde{\sigma}_{at}\}\mid x_{t},\mathcal{F}_{t-1}\big].
\end{aligned}$$

Note that for the same execution of $\tilde{\pi}$ pathwise,

$$U_{t}(\tilde{\pi})=\tilde{U}_{t}(\tilde{\pi})-(\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu(\theta_{*};x_{t},\omega_{a_{t}t}))-\sum_{a\in\mathcal{A}_{t}}(c_{at}-\tilde{c}_{at}).$$

Therefore,

$$\begin{aligned}
\Delta_{t}(\tilde{\pi})&=\mathbb{E}[U_{t}(\pi^{*})\mid x_{t},\mathcal{F}_{t-1}]-\mathbb{E}[U_{t}(\tilde{\pi})\mid x_{t},\mathcal{F}_{t-1}]\\
&=\mathbb{E}\big[\max_{a\in[A]}\min\{\mu^{*}(x_{t},\omega_{at}),\sigma_{at}^{*}\}\mid x_{t},\mathcal{F}_{t-1}\big]-\mathbb{E}\big[\max_{a\in[A]}\min\{\widetilde{\mu}_{t}(x_{t},\omega_{at}),\widetilde{\sigma}_{at}\}\mid x_{t},\mathcal{F}_{t-1}\big]\\
&\quad+\mathbb{E}[\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu(\theta_{*};x_{t},\omega_{a_{t}t})\mid x_{t},\mathcal{F}_{t-1}]+\mathbb{E}\bigg[\sum_{a\in\mathcal{A}_{t}}(c_{at}-\tilde{c}_{at})\mid x_{t},\mathcal{F}_{t-1}\bigg].
\end{aligned}\tag{30}$$

Since $\widetilde{\mu}_{t}(x_{t},\omega_{at})\geq\mu^{*}(x_{t},\omega_{at})$ and $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ a.s., $\min\{\widetilde{\mu}_{t}(x_{t},\omega_{at}),\widetilde{\sigma}_{at}\}\geq\min\{\mu^{*}(x_{t},\omega_{at}),\sigma_{at}^{*}\}$ a.s. for every $a\in[A]$. So

$$\mathbb{E}\big[\max_{a\in[A]}\min\{\mu^{*}(x_{t},\omega_{at}),\sigma_{at}^{*}\}\mid x_{t},\mathcal{F}_{t-1}\big]-\mathbb{E}\big[\max_{a\in[A]}\min\{\widetilde{\mu}_{t}(x_{t},\omega_{at}),\widetilde{\sigma}_{at}\}\mid x_{t},\mathcal{F}_{t-1}\big]\leq 0.\tag{31}$$

Further, for each $a\in[A]$,

$$\begin{aligned}
\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*}&=(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})+(\mu^{*}(x_{t},\omega_{at})-\widetilde{\mu}_{t}(x_{t},\omega_{at}))+(\widetilde{\sigma}_{at}-\sigma_{at}^{*})\\
&\leq(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})+(\widetilde{\sigma}_{at}-\sigma_{at}^{*}).
\end{aligned}$$

Taking positive parts and using subadditivity of $x\mapsto x^{+}$ gives

$$(\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*})^{+}\leq(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})^{+}+(\widetilde{\sigma}_{at}-\sigma_{at}^{*})^{+}=(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})^{+}+(\widetilde{\sigma}_{at}-\sigma_{at}^{*}).$$

Hence

$$\begin{aligned}
\mathbb{E}[c_{at}-\tilde{c}_{at}\mid x_{t},\mathcal{F}_{t-1}]&=\mathbb{E}[(\mu^{*}(x_{t},\omega_{at})-\sigma_{at}^{*})^{+}\mid x_{t},\mathcal{F}_{t-1}]-\mathbb{E}[(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})^{+}\mid x_{t},\mathcal{F}_{t-1}]\\
&\leq\widetilde{\sigma}_{at}-\sigma_{at}^{*}.
\end{aligned}\tag{32}$$

Thus

$$\begin{aligned}
\mathbb{E}\left[\sum_{a\in\mathcal{A}_{t}}(c_{at}-\tilde{c}_{at})\ \bigg|\ x_{t},\mathcal{F}_{t-1}\right]&=\sum_{a\in[A]}\mathbb{E}[(c_{at}-\tilde{c}_{at})\mathbb{I}\{a\in\mathcal{A}_{t}\}\mid x_{t},\mathcal{F}_{t-1}]\\
&=_{(i)}\sum_{a\in[A]}\mathbb{E}[c_{at}-\tilde{c}_{at}\mid x_{t},\mathcal{F}_{t-1}]\,\mathbb{E}[\mathbb{I}\{a\in\mathcal{A}_{t}\}\mid x_{t},\mathcal{F}_{t-1}]\\
&\leq_{(ii)}\sum_{a\in[A]}(\widetilde{\sigma}_{at}-\sigma_{at}^{*})\,\mathbb{E}[\mathbb{I}\{a\in\mathcal{A}_{t}\}\mid x_{t},\mathcal{F}_{t-1}]\\
&\leq\mathbb{E}\left[\sum_{a\in\mathcal{A}_{t}}(\widetilde{\sigma}_{at}-\sigma_{at}^{*})\ \bigg|\ x_{t},\mathcal{F}_{t-1}\right],
\end{aligned}\tag{33}$$

where (i) of ([33](https://arxiv.org/html/2606.07392#S7.E33)) follows since $c_{at}=c_{a}(x_{t},\omega_{at})$, $\tilde{c}_{at}=(\widetilde{\mu}_{t}(x_{t},\omega_{at})-\widetilde{\sigma}_{at})^{+}$ and $\mathbb{I}\{a\in\mathcal{A}_{t}\}\perp\!\!\!\!\perp\omega_{at}\mid x_{t},\mathcal{F}_{t-1}$ by Lemma [7.1](https://arxiv.org/html/2606.07392#S7.Thmtheorem1), and (ii) follows from ([32](https://arxiv.org/html/2606.07392#S7.E32)). The result then follows from ([30](https://arxiv.org/html/2606.07392#S7.E30)), ([31](https://arxiv.org/html/2606.07392#S7.E31)), and ([33](https://arxiv.org/html/2606.07392#S7.E33)). $\blacksquare$
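The pathwise inequality behind (32), namely $(\mu^{*}-\sigma^{*})^{+}-(\widetilde{\mu}-\widetilde{\sigma})^{+}\leq\widetilde{\sigma}-\sigma^{*}$ whenever $\widetilde{\mu}\geq\mu^{*}$ and $\widetilde{\sigma}\geq\sigma^{*}$, is easy to sanity-check numerically. The sketch below draws random optimistic pairs and records the largest violation. The sampling ranges, the seed, and the function name `max_violation` are arbitrary illustrative choices, not part of the paper.

```python
import random


def pos(x):
    """Positive part x^+."""
    return max(x, 0.0)


def max_violation(n_trials=10_000, seed=0):
    """Largest value of LHS - RHS of the pseudo-cost inequality over random draws."""
    rng = random.Random(seed)
    worst = float("-inf")
    for _ in range(n_trials):
        mu_star = rng.uniform(-1.0, 1.0)
        sigma_star = rng.uniform(-1.0, 1.0)
        mu_tilde = mu_star + rng.uniform(0.0, 0.5)        # optimistic reward estimate
        sigma_tilde = sigma_star + rng.uniform(0.0, 0.5)  # optimistic index estimate
        lhs = pos(mu_star - sigma_star) - pos(mu_tilde - sigma_tilde)
        rhs = sigma_tilde - sigma_star
        worst = max(worst, lhs - rhs)
    return worst


# A non-positive result (up to rounding) means no draw violated the inequality.
print(max_violation())
```

The check does not replace the proof. It only confirms that optimism in both estimators is what makes the bound hold draw by draw.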

## 8 Regret under Known Reward Function

In this section, we provide the regret analysis when the reward parameter $\theta_{*}$ is known. We begin with the technical lemmas used in the analysis.

### 8.1 Technical Lemmas

For any $a\in[A]$, $t\in[T]$, let

$$m_{at}(u):=\mathbb{E}\bigl[(\mu^{*}(x_{t},\omega_{at})-\Lambda(u))^{+}-c_{at}\mid\mathcal{F}_{t-1},x_{t}\bigr],\tag{34}$$

so that $m_{at}(\rho_{a}^{\top}\psi(x_{t}))=0$.

###### Lemma 8\.1

Assume that, for some constant $\eta>0$ and every $t$,

$$\mathbb{P}\bigl(\mu^{*}(x_{t},\omega_{at})>\Lambda(u)\mid\mathcal{F}_{t-1},x_{t}\bigr)\,\Lambda^{\prime}(u)\geq\eta,\quad\forall\,|u|\leq\overline{\iota}.\tag{35}$$

Then, for every $u\in[-\overline{\iota},\overline{\iota}]$,

$$-(u-\rho_{a}^{\top}\psi(x_{t}))\,m_{at}(u)\geq\eta(u-\rho_{a}^{\top}\psi(x_{t}))^{2}\qquad\text{a.s.}\tag{36}$$

Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1) says that the reservation-value moment crosses zero with a slope bounded away from zero. If the candidate threshold $u$ is above the true threshold $\rho_{a}^{\top}\psi(x_{t})$, then the expected excess value is too small relative to cost, so the moment $c-(\mu^{*}-\Lambda(u))^{+}$ is positive in expectation. If $u$ is below the true threshold, then the expected excess value is too large relative to cost, so the moment is negative in expectation. This monotonicity is what ultimately makes the loss locally curved around $\rho_{a}$.

###### Proof 8\.2

Proof of Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1). Fix $t$ and $u\in I:=[-\overline{\iota},\overline{\iota}]$. For any real number $b$ and any continuously differentiable function $g$, the map $v\mapsto(b-g(v))^{+}$ is absolutely continuous and has a.e. derivative $\frac{d}{dv}(b-g(v))^{+}=-\mathbb{I}\{b>g(v)\}g^{\prime}(v)$. Applying the fundamental theorem of calculus with $g=\Lambda$, we obtain for any $u,v\in\mathbb{R}$, $(\mu^{*}(x_{t},\omega_{at})-\Lambda(u))^{+}-(\mu^{*}(x_{t},\omega_{at})-\Lambda(v))^{+}=-\int_{v}^{u}\mathbb{I}\{\mu^{*}(x_{t},\omega_{at})>\Lambda(r)\}\Lambda^{\prime}(r)\,dr$. Taking conditional expectations given $\mathcal{F}_{t-1}$ and $x_{t}$ yields

$$m_{at}(u)-m_{at}(v)=-\int_{v}^{u}\mathbb{P}\bigl(\mu^{*}(x_{t},\omega_{at})>\Lambda(r)\mid\mathcal{F}_{t-1},x_{t}\bigr)\Lambda^{\prime}(r)\,dr.$$

Since $m_{at}(\rho_{a}^{\top}\psi(x_{t}))=0$, setting $v=\rho_{a}^{\top}\psi(x_{t})$ gives

$$m_{at}(u)=-\int_{\rho_{a}^{\top}\psi(x_{t})}^{u}\mathbb{P}\bigl(\mu^{*}(x_{t},\omega_{at})>\Lambda(r)\mid\mathcal{F}_{t-1},x_{t}\bigr)\Lambda^{\prime}(r)\,dr.$$

If $u\geq\rho_{a}^{\top}\psi(x_{t})$, then ([35](https://arxiv.org/html/2606.07392#S8.E35)) implies $m_{at}(u)\leq-\int_{\rho_{a}^{\top}\psi(x_{t})}^{u}\eta\,dr=-\eta\left(u-\rho_{a}^{\top}\psi(x_{t})\right)$; multiplying both sides by $-(u-\rho_{a}^{\top}\psi(x_{t}))\leq 0$ proves ([36](https://arxiv.org/html/2606.07392#S8.E36)). If $u\leq\rho_{a}^{\top}\psi(x_{t})$, then

$$m_{at}(u)=\int_{u}^{\rho_{a}^{\top}\psi(x_{t})}\mathbb{P}\bigl(\mu^{*}(x_{t},\omega_{at})>\Lambda(r)\mid\mathcal{F}_{t-1},x_{t}\bigr)\Lambda^{\prime}(r)\,dr\geq_{(i)}\int_{u}^{\rho_{a}^{\top}\psi(x_{t})}\eta\,dr=\eta\left(\rho_{a}^{\top}\psi(x_{t})-u\right),$$

where (i) follows from ([35](https://arxiv.org/html/2606.07392#S8.E35)), and ([36](https://arxiv.org/html/2606.07392#S8.E36)) follows again after multiplying both sides by $-(u-\rho_{a}^{\top}\psi(x_{t}))\geq 0$. $\blacksquare$
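To make the curvature condition (36) concrete, the following sketch evaluates it in closed form on a toy model. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the reward $\mu$ is uniform on $[-1,1]$, the link is $\Lambda=\tanh$, the cost is the constant $c=0.25$, and $\overline{\iota}=1$. Under these choices $\mathbb{E}[(\mu-k)^{+}]=(1-k)^{2}/4$ for $k\in[-1,1]$, so the moment $m(u)$ and the slope in (35) are available exactly, and $\eta$ is taken as the minimum of the slope over a grid that includes both endpoints.

```python
import math

COST = 0.25   # constant query cost (illustrative assumption)
IOTA = 1.0    # bound on |u| (illustrative assumption)


def moment(u):
    """m(u) = E[(mu - tanh(u))^+] - c for mu ~ Uniform[-1, 1]."""
    k = math.tanh(u)
    return (1.0 - k) ** 2 / 4.0 - COST


def slope(u):
    """P(mu > tanh(u)) * tanh'(u), the left-hand side of condition (35)."""
    k = math.tanh(u)
    return (1.0 - k) / 2.0 * (1.0 - k * k)


# Root of the moment: (1 - tanh(u*))^2 / 4 = c.
u_star = math.atanh(1.0 - 2.0 * math.sqrt(COST))

# Grid over [-IOTA, IOTA], endpoints included.
grid = [-IOTA + 2.0 * IOTA * i / 2000 for i in range(2001)]
eta = min(slope(u) for u in grid)

# Smallest value of LHS - RHS in (36) over the grid; it should be non-negative.
worst_gap = min(
    -(u - u_star) * moment(u) - eta * (u - u_star) ** 2 for u in grid
)
print(f"u* = {u_star:.4f}, eta = {eta:.4f}, worst gap = {worst_gap:.3e}")
```

The gap is zero at $u=u^{*}$ and positive elsewhere on the grid, matching the picture of a moment that crosses zero with slope bounded away from zero.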

We next prove Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) using Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1):

Lemma (Restatement of Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2)). Under Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1), for any $t\in[T]$, $a\in[A]$ and $\rho\in\mathcal{B}$, $\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_{a})\mid x_{t},\mathcal{F}_{t-1}]\geq\frac{1}{2}\kappa\mu_{1}\bigl(\psi(x_{t})^{\top}(\rho-\rho_{a})\bigr)^{2}$.

###### Proof 8\.3

Proof of Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2). The scalar map $z\mapsto\int_0^z\bigl(c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(u))^+\bigr)\,du$ is differentiable, with derivative $c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(z))^+$. Therefore

$$\nabla\ell_{at}(\rho)=\psi(x_t)\Bigl(c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(\rho^\top\psi(x_t)))^+\Bigr).\tag{37}$$

The derivative of $z\mapsto c_{at}-(\mu^*(x_t,\omega_{at})-\Lambda(z))^+$ is $\mathbb{I}\{\mu^*(x_t,\omega_{at})>\Lambda(z)\}\Lambda'(z)\geq 0$ a.e., so this map is nondecreasing. Its primitive is therefore convex, and thus $\ell_{at}(\rho)$ is convex in $\rho^\top\psi(x_t)$, hence convex in $\rho$. Taking conditional expectations in ([37](https://arxiv.org/html/2606.07392#S8.E37)) gives $\nabla\mathbb{E}[\ell_{at}(\rho)\mid x_t,\mathcal{F}_{t-1}]=-\psi(x_t)\,m_{at}(\psi(x_t)^\top\rho)$. By convexity of $\mathcal{B}$, the entire segment $\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)=\psi(x_t)^\top(\rho_a+s(\rho-\rho_a))$, $s\in[0,1]$, lies in $I=[-\bar{\iota},\bar{\iota}]$ by Assumption [4.1](https://arxiv.org/html/2606.07392#S4.SS1). Using the fundamental theorem of calculus along the line segment from $\rho_a$ to $\rho$,

$$\begin{aligned}
\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid\mathcal{F}_{t-1},x_t]
&=\int_0^1(\rho-\rho_a)^\top\nabla\mathbb{E}[\ell_{at}(\rho_a+s(\rho-\rho_a))\mid\mathcal{F}_{t-1},x_t]\,ds\\
&=-\int_0^1\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)\,m_{at}\bigl(\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)\bigr)\,ds.
\end{aligned}$$

Applying Lemma [8.1](https://arxiv.org/html/2606.07392#S8.Thmtheorem1) in the Appendix with $u=\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)$ yields

$$-\bigl(s\psi(x_t)^\top(\rho-\rho_a)\bigr)\,m_{at}\bigl(\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)\bigr)\geq\kappa\mu_1 s^2\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^2,$$

where the last inequality follows because $\mathbb{P}\bigl(\mu^*(x_t,\omega_{at})>\Lambda(u)\mid\mathcal{F}_{t-1},x_t\bigr)\,\Lambda'(u)\geq\kappa\mu_1$ according to Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1). For $s>0$, divide by $s$ to obtain

$$-\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)\,m_{at}\bigl(\psi(x_t)^\top\rho_a+s\psi(x_t)^\top(\rho-\rho_a)\bigr)\geq\kappa\mu_1 s\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^2.$$

Integrating over $s\in[0,1]$ yields

$$\mathbb{E}[\ell_{at}(\rho)-\ell_{at}(\rho_a)\mid\mathcal{F}_{t-1},x_t]\geq\int_0^1\kappa\mu_1 s\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^2\,ds=\frac{\kappa\mu_1}{2}\bigl(\psi(x_t)^\top(\rho-\rho_a)\bigr)^2$$

as claimed. $\blacksquare$
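As a numerical sanity check of this curvature bound, the sketch below evaluates the expected excess loss in closed form for a toy instance. Everything specific here is an assumption made for illustration and is not taken from the paper: a scalar feature $\psi(x_t)=1$, the identity link $\Lambda(u)=u$ (so $\mu_1=1$), a reward $\mu^*$ uniform on $[0,1]$, and indices restricted to $[0,0.8]$ so that $\mathbb{P}(\mu^*>u)\geq\kappa=0.2$.

```python
import numpy as np

KAPPA = 0.2   # assumed lower bound on P(mu* > u) for u in [0, 0.8]
MU_1 = 1.0    # assumed lower bound on Lambda'(u); Lambda is the identity here


def expected_loss(z, z_a):
    """E[l(z)] for mu* ~ Uniform[0, 1], Lambda(u) = u, psi = 1 (toy assumptions).

    The expected cost c is chosen so that z_a is the reservation index:
    c = E[(mu* - z_a)^+] = (1 - z_a)^2 / 2.
    """
    c = (1.0 - z_a) ** 2 / 2.0
    # Closed form of integral_0^z (c - (1 - u)^2 / 2) du
    return c * z - (1.0 - (1.0 - z) ** 3) / 6.0


def curvature_gap(z, z_a):
    """Expected excess loss minus the quadratic lower bound (kappa mu_1 / 2)(z - z_a)^2."""
    excess = expected_loss(z, z_a) - expected_loss(z_a, z_a)
    return excess - 0.5 * KAPPA * MU_1 * (z - z_a) ** 2


grid = np.linspace(0.0, 0.8, 81)
gaps = np.array([[curvature_gap(z, z_a) for z in grid] for z_a in grid])
print(gaps.min())  # should be nonnegative up to floating-point rounding
```

In this toy case the expected loss has second derivative $1-z\geq 0.2$ on the grid, which is why the quadratic lower bound holds with the assumed $\kappa\mu_1=0.2$.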

###### Lemma 8\.4

Fix any $a\in[A]$ and $t\in[T]$. For any $\rho_t$ adapted to $\mathcal{F}_{t-1}$, we have

$$\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\geq\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\rho_t-\rho_a)]^2-\frac{1}{n_{at}}.$$

###### Proof 8\.5

Proof of Lemma [8.4](https://arxiv.org/html/2606.07392#S8.Thmtheorem4). Set

$$\epsilon=\frac{1}{2n_{at}\bar{C}_{\psi}(\kappa\mu_1\bar{\iota}+1)}\tag{38}$$

and take a Euclidean $\epsilon$-net $\mathcal{N}_{\epsilon}$ of $\mathcal{B}$. According to Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5) and [4.1](https://arxiv.org/html/2606.07392#S4.SS1), for any $s\in[T]$ we have $\mathbb{P}\bigl(G(\theta_*^\top\phi(x_s,\omega_{as}))>\Lambda(u)\mid x_s,\mathcal{F}_{s-1}\bigr)\Lambda'(u)\geq\kappa\mu_1>0$ for all $|u|\leq\bar{\iota}$. Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) implies that, given any $s\in[T]$ and $\bar{\rho}\in\mathcal{N}_{\epsilon}$, it always holds that $\mathbb{E}\left[\ell_{as}(\bar{\rho})-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\geq\frac{\kappa\mu_1}{2}[\psi(x_s)^\top(\bar{\rho}-\rho_a)]^2$. Since $\mathcal{B}$ has diameter $d_{\mathcal{B}}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), $\mathcal{N}_{\epsilon}$ is finite. So after intersecting finitely many probability-one events, we have with probability one, simultaneously for all $\rho\in\mathcal{N}_{\epsilon}$ and $s\in[T]$,

$$\mathbb{E}\left[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\geq\frac{\kappa\mu_1}{2}[\psi(x_s)^\top(\rho-\rho_a)]^2.\tag{39}$$

Choose $\rho_{\epsilon}\in\mathcal{N}_{\epsilon}$ such that $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$. Note that

$$\begin{aligned}
&\left|[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)]-[\ell_{as}(\rho_{\epsilon})-\ell_{as}(\rho_a)]\right|\\
&=|\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})|\\
&=_{(i)}\left|\int_{\rho_{\epsilon}^\top\psi(x_s)}^{\rho_t^\top\psi(x_s)}[c_{as}-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+]\,du\right|\\
&\leq_{(ii)}2|(\rho_t-\rho_{\epsilon})^\top\psi(x_s)|\leq_{(iii)}2\bar{C}_{\psi}\epsilon,
\end{aligned}\tag{40}$$

where (i) follows by definition of $\ell_{as}(\rho)$; (ii) follows since, by definition, $c_{as}\in[0,1]$, $G(\theta_*^\top\phi(x_s,\omega_{as}))\in[0,1]$ and $\Lambda(u)\in[-1,1]$, so $|c_{as}-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+|\leq 2$; and (iii) follows by the Cauchy-Schwarz inequality, the fact that $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$, and $\|\psi(x_s)\|_2\leq\bar{C}_{\psi}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5). Hence

$$\begin{aligned}
&\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\\
&=\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})+\ell_{as}(\rho_{\epsilon})-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s\right]\\
&\geq_{(1)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\rho_{\epsilon}-\rho_a)]^2-2\bar{C}_{\psi}\epsilon\\
&=\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\rho_{\epsilon}-\rho_t)+\psi(x_s)^\top(\rho_t-\rho_a)]^2-2\bar{C}_{\psi}\epsilon\\
&\geq_{(2)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\left\{[\psi(x_s)^\top(\rho_t-\rho_a)]^2-2\epsilon\bar{C}_{\psi}|\psi(x_s)^\top(\rho_t-\rho_a)|\right\}-2\bar{C}_{\psi}\epsilon\\
&\geq_{(3)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\left\{[\psi(x_s)^\top(\rho_t-\rho_a)]^2-4\epsilon\bar{C}_{\psi}\bar{\iota}\right\}-2\bar{C}_{\psi}\epsilon\\
&=\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\rho_t-\rho_a)]^2-2(\kappa\mu_1\bar{\iota}+1)\bar{C}_{\psi}\epsilon\\
&=_{(4)}\frac{\kappa\mu_1}{2}\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}[\psi(x_s)^\top(\rho_t-\rho_a)]^2-\frac{1}{n_{at}},
\end{aligned}$$

where (1) follows from ([39](https://arxiv.org/html/2606.07392#S8.E39)) and ([40](https://arxiv.org/html/2606.07392#S8.E40)), (2) follows from applying the Cauchy-Schwarz inequality to $\psi(x_s)^\top(\rho_{\epsilon}-\rho_t)$ and the fact that $\|\rho_{\epsilon}-\rho_t\|_2\leq\epsilon$, $\|\psi(x_s)\|_2\leq\bar{C}_{\psi}$, (3) follows since $\psi(x_s)^\top\rho_t\in[-\bar{\iota},\bar{\iota}]$ and $\psi(x_s)^\top\rho_a\in[-\bar{\iota},\bar{\iota}]$, and (4) follows from ([38](https://arxiv.org/html/2606.07392#S8.E38)). $\blacksquare$

###### Lemma 8.6 (Freedman's Inequality (Freedman [1975](https://arxiv.org/html/2606.07392#bib.bib2)))

Consider a real-valued martingale $Y_k=\sum_{j=1}^{k}X_j$ with $Y_0=0$ and difference sequence $\{X_k:k=1,2,3,\ldots\}$. Assume that $X_k\leq R$ almost surely for $k\geq 1$, where $R$ is a constant. Let $W_k:=\sum_{j=1}^{k}\mathbb{E}[X_j^2\mid\mathcal{F}_{j-1}]$ for $k\geq 1$. Then for all $t\geq 1$ and $\sigma^2>0$,

$$\mathbb{P}\left(\exists k\geq 0:\ Y_k\geq t\ \text{and}\ W_k\leq\sigma^2\right)\leq\exp\left\{-\frac{t^2/2}{\sigma^2+Rt/3}\right\}.$$
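The proof of Lemma 4.3 below uses this inequality in inverted form: instead of fixing the deviation $t$, it fixes a confidence level $e^{-x}$ and picks a deviation large enough to reach it. The following sketch checks numerically that $t=\sqrt{2\sigma^2x}+\tfrac{2R}{3}x$ is one such choice. The function names and the example values are illustrative assumptions, not part of the paper.

```python
import math


def freedman_tail(t, sigma2, R):
    """Right-hand side of Freedman's inequality: exp(-(t^2 / 2) / (sigma2 + R t / 3))."""
    return math.exp(-(t ** 2 / 2.0) / (sigma2 + R * t / 3.0))


def freedman_threshold(x, sigma2, R):
    """A deviation level t whose Freedman tail bound is at most exp(-x)."""
    return math.sqrt(2.0 * sigma2 * x) + (2.0 * R / 3.0) * x


# Example: variance proxy sigma^2 = 9 * 2^q with q = 3, and a hypothetical
# bound R = 3 on the martingale increments.
sigma2, R, x = 9 * 2 ** 3, 3.0, 2.0
t = freedman_threshold(x, sigma2, R)
print(t, freedman_tail(t, sigma2, R), math.exp(-x))
```

With $\sigma^2=9\times 2^q$ the first term of the threshold equals $3\sqrt{2}\sqrt{2^q x}$, and with $R=\min\{6\bar{\iota},3\bar{C}_{\psi}d_{\mathcal{B}}\}$ the second term equals $2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x$, which are the two terms appearing in the peeling step of that proof.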

Lemma (Restatement of Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3)). Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. Then given any constant $c_0>0$ and any $a\in[A]$, with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, for any $\rho_t$ adapted to $\mathcal{F}_{t-1}$, we have

$$\begin{aligned}
&-\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho_t)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)\mid\mathcal{F}_{s-1},x_s]\right\}\\
&\leq\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+\left(\frac{144}{c_0}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{c_0}{8},
\end{aligned}$$

where $\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_2\left(1+n_{at}\min\{9\bar{\iota}^2,d_{\mathcal{B}}^2\bar{C}_{\psi}^2\}\right)\right\rceil+1\right)+\log\left(6T/\delta\right)$.
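To get a feel for the size of $\Gamma_{at}(\delta)$, the sketch below evaluates the definition above directly. The function names and all numerical inputs are hypothetical choices for illustration, and `peeling_levels` is simply a name used here for the ceiling term inside the second logarithm.

```python
import math


def peeling_levels(n_at, iota_bar, d_B, C_psi):
    """ceil(log2(1 + n_at * min{9 iota_bar^2, d_B^2 C_psi^2}))."""
    return math.ceil(math.log2(1.0 + n_at * min(9.0 * iota_bar ** 2, (d_B * C_psi) ** 2)))


def gamma_at(delta, T, n_at, m, iota_bar, d_B, C_psi):
    """Gamma_at(delta) = m log(1 + 2 d_B C_psi T) + log(levels + 1) + log(6 T / delta)."""
    levels = peeling_levels(n_at, iota_bar, d_B, C_psi)
    return (m * math.log(1.0 + 2.0 * d_B * C_psi * T)
            + math.log(levels + 1.0)
            + math.log(6.0 * T / delta))


# Hypothetical parameter values: the term grows only logarithmically in T.
for T in (10 ** 2, 10 ** 4, 10 ** 6):
    value = gamma_at(delta=0.05, T=T, n_at=T, m=8, iota_bar=1.0, d_B=2.0, C_psi=1.0)
    print(T, round(value, 2))
```

The dominant contribution is $m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)$, so $\Gamma_{at}(\delta)$ scales like the parameter dimension $m$ times $\log T$.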

###### Proof 8\.7

Proof of Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3). For any $s\in\mathcal{S}_{at}$, define $\mathcal{G}_{s-1}:=\sigma(\mathcal{F}_{s-1},x_s)$. Let

$$H_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\right\}.$$

Note that $-2\leq c_{at}-(G(\theta_*^\top\phi(x_t,\omega_{at}))-\Lambda(u))^+\leq 1$, so for any $\rho\in\mathcal{B}$,

$$\begin{aligned}
&\left|\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\right|\\
&=\bigg|\int_{\rho_a^\top\psi(x_s)}^{\rho^\top\psi(x_s)}[c_{as}-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+]\,du\\
&\qquad-\mathbb{E}\bigg[\int_{\rho_a^\top\psi(x_s)}^{\rho^\top\psi(x_s)}[c_{as}-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+]\,du\mid\mathcal{G}_{s-1}\bigg]\bigg|\\
&\leq 3|(\rho-\rho_a)^\top\psi(x_s)|\leq\min\{6\bar{\iota},3\bar{C}_{\psi}d_{\mathcal{B}}\}.
\end{aligned}$$

Let $W_t(\rho):=\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\{\ell_{as}(\rho)-\ell_{as}(\rho_a)-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_a)\mid\mathcal{G}_{s-1}]\}^2\mid\mathcal{G}_{s-1}\right]$. So for any $\rho\in\mathcal{B}$,

$$W_t(\rho)\leq 9\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2,\tag{41}$$

where $(\rho-\rho_a)^\top\psi(x_s)$ is $\mathcal{G}_{s-1}$-measurable.

Note that for any $\rho\in\mathcal{B}$, $\rho^\top\psi(x_s)\in[-2\bar{\iota},\bar{\iota}]$ and $|(\rho-\rho_a)^\top\psi(x)|\leq\|\rho-\rho_a\|_2\|\psi(x)\|_2\leq d_{\mathcal{B}}\bar{C}_{\psi}$, thus $\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\leq n_{at}\min\{9\bar{\iota}^2,d_{\mathcal{B}}^2\bar{C}_{\psi}^2\}$. Let

$$Q_{at}:=\left\lceil\log_2\left(1+n_{at}\min\{9\bar{\iota}^2,d_{\mathcal{B}}^2\bar{C}_{\psi}^2\}\right)\right\rceil.\tag{42}$$

Consider the events $\{\mathcal{E}_q\}$ for $q\in\{0\}\cup[Q_{at}]$, where

$$\mathcal{E}_q:=\begin{cases}\big\{2^{q-1}<\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\leq 2^{q}\big\}&\text{if }q\geq 1,\\[4pt]\big\{0\leq\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\leq 1\big\}&\text{if }q=0.\end{cases}\tag{43}$$

Fix any $x>0$. On the event $\mathcal{E}_q$, $W_t(\rho)\leq 9\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\leq 9\times 2^{q}$. So applying Freedman's inequality (Lemma [8.6](https://arxiv.org/html/2606.07392#S8.Thmtheorem6)) with $\sigma^2=9\times 2^{q}$, we have

$$\mathbb{P}\left(-H_{at}(\rho)\geq 3\sqrt{2}\sqrt{2^{q}x}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x,\ \sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\leq 2^{q}\right)\leq e^{-x}.$$

Additionally, on event $\mathcal{E}_q$, $6\sqrt{x\bigl(1+\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\bigr)}\geq 6\sqrt{2^{q-1}x}=3\sqrt{2}\sqrt{2^{q}x}$. Thus on event $\mathcal{E}_q$,

$$\begin{aligned}
&-H_{at}(\rho)\geq 6\sqrt{x\bigg(1+\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2\bigg)}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x\\
&\Rightarrow\ -H_{at}(\rho)\geq 3\sqrt{2}\sqrt{2^{q}x}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x.
\end{aligned}$$

Summing the probability bound above over $q\in\{0\}\cup[Q_{at}]$, we have

$$\mathbb{P}\left(-H_{at}(\rho)\geq 6\sqrt{x+x\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}x\right)\leq 2(Q_{at}+1)e^{-x}.$$

Setting $x=\log\left(2(Q_{at}+1)/\delta\right)$ above, then for any fixed $\rho\in\mathcal{B}$, with probability at least $1-\delta$ we have

$$-H_{at}(\rho)<6\sqrt{\log\left(\frac{2(Q_{at}+1)}{\delta}\right)\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\log\left(\frac{2(Q_{at}+1)}{\delta}\right).\tag{44}$$

Set

$$\epsilon=\frac{1}{T\bar{C}_{\psi}}\tag{45}$$

and take a Euclidean $\epsilon$-net $\mathcal{N}_{\epsilon}$ of $\mathcal{B}$. Since $\mathcal{B}$ has diameter $d_{\mathcal{B}}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5),

$$|\mathcal{N}_{\epsilon}|\leq\left(1+\frac{2d_{\mathcal{B}}}{\epsilon}\right)^{m}=(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)^{m}.$$

Note that ([44](https://arxiv.org/html/2606.07392#S8.E44)) further implies that with probability at least $1-\delta/3$, uniformly over all $\rho\in\mathcal{N}_{\epsilon}$ and $t\in[T]$,

$$\begin{aligned}
-H_{at}(\rho)&<6\sqrt{\log\left(\frac{6T|\mathcal{N}_{\epsilon}|(Q_{at}+1)}{\delta}\right)\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2}\\
&\quad+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\log\left(\frac{6T|\mathcal{N}_{\epsilon}|(Q_{at}+1)}{\delta}\right)\\
&\leq 6\sqrt{\Gamma_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\{(\rho-\rho_a)^\top\psi(x_s)\}^2}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta),
\end{aligned}\tag{46}$$

where

$$\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(Q_{at}+1)+\log\left(6T/\delta\right).\tag{47}$$

Let $\rho_{\epsilon}\in\mathcal{N}_{\epsilon}$ satisfy $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$. Note that

$$\begin{aligned}
&\left|[\ell_{as}(\rho_t)-\ell_{as}(\rho_a)]-[\ell_{as}(\rho_{\epsilon})-\ell_{as}(\rho_a)]\right|\\
&=|\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})|=_{(i)}\left|\int_{\rho_{\epsilon}^\top\psi(x_s)}^{\rho_t^\top\psi(x_s)}[c_{as}-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+]\,du\right|\\
&\leq_{(ii)}2|(\rho_t-\rho_{\epsilon})^\top\psi(x_s)|\leq_{(iii)}2\bar{C}_{\psi}\epsilon\leq_{(iv)}\frac{2}{n_{at}},
\end{aligned}$$

where (i) follows by definition of $\ell_{as}(\rho)$ in ([8](https://arxiv.org/html/2606.07392#S4.E8)), (ii) follows since by definition,

$$|c_{as}-(G(\theta_*^\top\phi(x_s,\omega_{as}))-\Lambda(u))^+|\leq 2,$$

(iii) follows by the Cauchy-Schwarz inequality, the fact that $\|\rho_t-\rho_{\epsilon}\|_2\leq\epsilon$, and $\|\psi(x_s)\|_2\leq\bar{C}_{\psi}$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), and (iv) follows by ([45](https://arxiv.org/html/2606.07392#S8.E45)) together with $n_{at}\leq T$. The above inequality implies that

$$|H_{at}(\rho_t)-H_{at}(\rho_{\epsilon})|=\left|\sum_{s\in\mathcal{S}_{at}}\Bigl\{[\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})]-\mathbb{E}[\ell_{as}(\rho_t)-\ell_{as}(\rho_{\epsilon})\mid\mathcal{G}_{s-1}]\Bigr\}\right|\leq\sum_{s\in\mathcal{S}_{at}}\frac{4}{n_{at}}\leq 4.$$

Therefore,

$$|H_{at}(\rho_t)-H_{at}(\rho_{\epsilon})|\leq 4.\tag{48}$$

Note that $\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2}=\bigl\|\bigl((\rho_t-\rho_a)^\top\psi(x_s)\bigr)_{s\in\mathcal{S}_{at}}\bigr\|_2$. Thus by the triangle inequality,

$$\begin{aligned}
\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2}-\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_{\epsilon}-\rho_a)^\top\psi(x_s)\}^2}
&\leq\left\|\bigl((\rho_t-\rho_{\epsilon})^\top\psi(x_s)\bigr)_{s\in\mathcal{S}_{at}}\right\|_2\\
&\leq\sqrt{n_{at}(\bar{C}_{\psi}\epsilon)^2}=\frac{\sqrt{n_{at}}}{T}\leq\frac{1}{\sqrt{n_{at}}},
\end{aligned}\tag{49}$$

where the last two steps use ([45](https://arxiv.org/html/2606.07392#S8.E45)) and $n_{at}\leq T$. On the event that ([46](https://arxiv.org/html/2606.07392#S8.E46)) holds uniformly over all $\rho\in\mathcal{N}_{\epsilon}$ and $t\in[T]$, we have

$$\begin{aligned}
-H_{at}(\rho_t)&\leq_{(i)}-H_{at}(\rho_{\epsilon})+4\\
&\leq_{(ii)}6\sqrt{\Gamma_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\{(\rho_{\epsilon}-\rho_a)^\top\psi(x_s)\}^2}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta)+4\\
&\leq_{(iii)}6\sqrt{\Gamma_{at}(\delta)}\left(\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2}+\frac{1}{\sqrt{n_{at}}}\right)+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta)+4\\
&\leq_{(iv)}6\sqrt{2\Gamma_{at}(\delta)}\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+1}+(2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6)\Gamma_{at}(\delta)+4\\
&\leq_{(v)}\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+\left(\frac{144}{c_0}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{c_0}{8},
\end{aligned}$$

where (i) follows from ([48](https://arxiv.org/html/2606.07392#S8.E48)), (ii) follows from ([46](https://arxiv.org/html/2606.07392#S8.E46)), (iii) holds from ([49](https://arxiv.org/html/2606.07392#S8.E49)), (iv) holds from the fact that $\sqrt{S}+\frac{1}{\sqrt{n_{at}}}\leq\sqrt{2(S+1/n_{at})}$, where $S=\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2$ and $1/n_{at}\leq 1$, and (v) follows by applying $2ab\leq\epsilon a^2+\epsilon^{-1}b^2$ to $a=\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+1}$, $b=6\sqrt{2\Gamma_{at}(\delta)}$, $\epsilon=\frac{c_0}{4}$, so that

$$6\sqrt{2\Gamma_{at}(\delta)}\sqrt{\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+1}\leq\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+\frac{144}{c_0}\Gamma_{at}(\delta)+\frac{c_0}{8}.$$

Hence with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ we have

$$-H_{at}(\rho_t)\leq\frac{c_0}{8}\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2+\left(\frac{144}{c_0}+2\min\{2\bar{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{c_0}{8},$$

where $\Gamma_{at}(\delta)$ is defined as in ([47](https://arxiv.org/html/2606.07392#S8.E47)), thus the result follows. $\blacksquare$
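For readers who want the constants in step (v) spelled out, the arithmetic can be checked as follows. This is a worked restatement using only quantities from the proof, with $S$ denoting $\sum_{s\in\mathcal{S}_{at}}\{(\rho_t-\rho_a)^\top\psi(x_s)\}^2$ as shorthand.

```latex
\begin{aligned}
ab &\le \tfrac{\epsilon}{2}\,a^{2}+\tfrac{1}{2\epsilon}\,b^{2},
  \qquad a=\sqrt{S+1},\quad b=6\sqrt{2\Gamma_{at}(\delta)},\quad \epsilon=\tfrac{c_0}{4},\\
\tfrac{\epsilon}{2}\,a^{2} &= \tfrac{c_0}{8}\,(S+1)=\tfrac{c_0}{8}\,S+\tfrac{c_0}{8},\\
\tfrac{1}{2\epsilon}\,b^{2} &= \tfrac{2}{c_0}\cdot 72\,\Gamma_{at}(\delta)=\tfrac{144}{c_0}\,\Gamma_{at}(\delta).
\end{aligned}
```

Adding the last two lines reproduces the right-hand side of the displayed bound on $6\sqrt{2\Gamma_{at}(\delta)}\sqrt{S+1}$.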

### 8.2 Index Estimation under Known Reward Function

Before introducing the formal proof, we first provide the proof's intuition. The argument follows a standard localized empirical-process approach for M-estimation, adapted to our martingale setting. The estimator's empirical optimality is combined with a population curvature lower bound, while the stochastic deviation is controlled uniformly over the parameter space using Freedman's martingale inequality, a peeling argument, and an $\epsilon$-net. Readers familiar with localized empirical-process and martingale concentration arguments may skip the following intuition and proceed directly to the formal proof.

*(i) The optimality inequality.* Because $\hat{\rho}_{at}$ minimizes the empirical primitive loss, the empirical excess loss at $\hat{\rho}_{at}$ cannot be positive. Plug $\rho=\hat{\rho}_{at}$ into

$$\sum_{s\in\mathcal{S}_{at}}\big\{\ell_{as}(\rho)-\ell_{as}(\rho_{a})\big\}\;=\;\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\big[\ell_{as}(\rho)-\ell_{as}(\rho_{a})\,\big|\,\mathcal{F}_{s-1},x_{s}\big]\;+\;H_{at}(\rho)$$

and use the population curvature lower bound from Lemma [4.2](https://arxiv.org/html/2606.07392#S4.Thmtheorem2) (with a small Lipschitz adjustment so that the curvature holds uniformly over $\rho$, which costs only an additive $1$). After rearranging, one obtains

$$\frac{\kappa\mu_{1}}{2}\sum_{s\in\mathcal{S}_{at}}\big\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\big\}^{2}\;\leq\;-H_{at}(\hat{\rho}_{at})\;+1.$$

The left side is the local quadratic “signal” we want to bound, and the right side is essentially the “noise”.

*(ii) Bounding the noise at one fixed $\rho$.* For each fixed $\rho$, every term inside $H_{at}(\rho)$ is uniformly bounded and has a conditional variance controlled by the same quadratic quantity $\sum_{s\in\mathcal{S}_{at}}\{\psi(x_{s})^{\top}(\rho-\rho_{a})\}^{2}$. Freedman's (Freedman [1975](https://arxiv.org/html/2606.07392#bib.bib2)) martingale inequality (Lemma [8.6](https://arxiv.org/html/2606.07392#S8.Thmtheorem6)) therefore gives, for each fixed $\rho$,

$$|H_{at}(\rho)|\;\lesssim\;\sqrt{\Big(\textstyle\sum_{s}\{\psi(x_{s})^{\top}(\rho-\rho_{a})\}^{2}\Big)\cdot\log(T/\delta)}\;+\;\log(T/\delta).$$

The key point is that the stochastic error enters through the same quadratic form that governs the population curvature, allowing the error term to be absorbed into the curvature in the final bound.

*(iii) Making the bound uniform in $\rho$.* Since $\hat{\rho}_{at}$ is itself random, we need step (ii) to hold for all $\rho$ simultaneously. We do this in two passes. First, a peeling argument splits the parameter space into dyadic shells based on the size of $\sum_{s}\{\psi(x_{s})^{\top}(\rho-\rho_{a})\}^{2}$; on each shell the variance proxy is replaced by a deterministic ceiling and Freedman's bound applies. Summing over the shells costs only a $\log\log$ factor. Second, an $\epsilon$-net argument extends the bound from a finite grid of $\rho$'s to all of $\mathcal{B}$ via the Lipschitz continuity of the loss primitive. The end product is a uniform bound on $|H_{at}(\rho)|$ that still scales with the same quadratic quantity.

*(iv) Closing the loop.* Plug $\rho=\hat{\rho}_{at}$ into the uniform bound from step (iii) and combine with step (i). The resulting inequality has the schematic form $\text{quadratic}\lesssim\sqrt{\text{quadratic}}\cdot\sqrt{\log T}+\log T$. Applying $2ab\leq\epsilon a^{2}+\epsilon^{-1}b^{2}$ with a suitable $\epsilon$ absorbs the square-root term into the quadratic, leaving $\frac{1}{n_{at}}\sum_{s\in\mathcal{S}_{at}}\big\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\big\}^{2}\lesssim\log T/n_{at}$. Folding in the regularization $\eta_{1}\mathbf{I}_{m}$ to ensure invertibility and applying Cauchy-Schwarz then yields the stated confidence radius.
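The absorption step in (iv) can be illustrated numerically. The sketch below is only a schematic with unit constants (it takes $\epsilon=1$ and does not use the paper's constants): for an inequality of the form $x\leq b\sqrt{x}+r$, the bound $b\sqrt{x}\leq x/2+b^{2}/2$ gives $x\leq b^{2}+2r$. The function names `exact_largest_solution` and `absorbed_bound` are hypothetical and introduced here for illustration.

```python
import math


def exact_largest_solution(b: float, r: float) -> float:
    """Largest x >= 0 satisfying x <= b*sqrt(x) + r.

    Obtained by solving the quadratic y^2 - b*y - r = 0 in y = sqrt(x).
    """
    y = (b + math.sqrt(b * b + 4.0 * r)) / 2.0
    return y * y


def absorbed_bound(b: float, r: float) -> float:
    """Bound from 2ab <= eps*a^2 + b^2/eps with a = sqrt(x), eps = 1.

    b*sqrt(x) <= x/2 + b^2/2, so x <= b*sqrt(x) + r implies x <= b^2 + 2r.
    """
    return b * b + 2.0 * r


# Schematic instance: quadratic <= sqrt(quadratic * log T) + log T.
log_T = math.log(1000.0)
b, r = math.sqrt(log_T), log_T
print(exact_largest_solution(b, r), absorbed_bound(b, r))
```

In this instance the absorbed bound equals $3\log T$, so the quadratic term is of order $\log T$, matching the schematic conclusion of step (iv).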

###### Proposition 8\.8\(Restatement of Proposition[4\.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4)\)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1) hold. When $\mu^{*}$ is known, given any $\delta>0$, with probability at least $1-\delta$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},\tag{50}$$

where

$$B_{at}^{*}:=\sqrt{\frac{8}{3\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+\frac{40}{3\kappa\mu_{1}}+\frac{1}{3}+\eta_{1}d_{\mathcal{B}}^{2}},\tag{51}$$

$$\Gamma_{at}^{*}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_{2}\big(1+n_{at}\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\big)\right\rceil+1\right)+\log(6AT/\delta),$$

and $V_{at}(\eta_{1})=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top}$.
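To make the confidence radius in (50) and (51) concrete, the following sketch evaluates $\Gamma_{at}^{*}(\delta)$, $B_{at}^{*}$ and the width $B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}$ with NumPy. All numerical values and the function names (`gamma_star`, `radius_constant`, `inv_weighted_norm`, `confidence_width`) are illustrative assumptions, not taken from the paper; only the formulas themselves come from the proposition.

```python
import math
import numpy as np


def gamma_star(m, n_at, A, T, delta, d_B, C_psi, iota_bar):
    """Gamma*_at(delta): net term + peeling (log-log) term + union-bound term."""
    shells = math.ceil(
        math.log2(1.0 + n_at * min(9.0 * iota_bar**2, (d_B * C_psi) ** 2))
    )
    return (
        m * math.log(1.0 + 2.0 * d_B * C_psi * T)
        + math.log(shells + 1.0)
        + math.log(6.0 * A * T / delta)
    )


def radius_constant(gamma, kappa, mu1, d_B, C_psi, iota_bar, eta1):
    """B*_at as in (51), given a value of Gamma*_at(delta)."""
    km = kappa * mu1
    lead = 144.0 / km + 2.0 * min(2.0 * iota_bar, C_psi * d_B) + 6.0
    inside = (8.0 / (3.0 * km)) * lead * gamma + 40.0 / (3.0 * km) + 1.0 / 3.0
    return math.sqrt(inside + eta1 * d_B**2)


def inv_weighted_norm(psi, V):
    """||psi||_{V^{-1}} = sqrt(psi^T V^{-1} psi)."""
    return float(np.sqrt(psi @ np.linalg.solve(V, psi)))


def confidence_width(psi_t, past_features, B, eta1):
    """B * ||psi_t||_{V^{-1}} with V = eta1*I + sum_s psi_s psi_s^T."""
    m = psi_t.shape[0]
    V = eta1 * np.eye(m) + past_features.T @ past_features
    return B * inv_weighted_norm(psi_t, V)


# Illustrative (assumed) constants.
rng = np.random.default_rng(0)
m, A, T, delta = 3, 4, 1000, 0.01
past = rng.normal(size=(50, m))
psi_t = rng.normal(size=m)
g = gamma_star(m, n_at=len(past), A=A, T=T, delta=delta,
               d_B=1.0, C_psi=1.0, iota_bar=1.0)
B = radius_constant(g, kappa=1.0, mu1=0.5, d_B=1.0, C_psi=1.0,
                    iota_bar=1.0, eta1=1.0)
print(confidence_width(psi_t, past, B, eta1=1.0))
```

Because $V_{at}(\eta_{1})$ only grows in the positive semidefinite order as observations accumulate, the width for a fixed feature vector and fixed $B$ can only shrink when more past features are added.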

###### Proof 8\.9

Proof of Proposition [8.8](https://arxiv.org/html/2606.07392#S8.Thmtheorem8). For any $s\in[T]$, define $\mathcal{G}_{s-1}:=\sigma(\mathcal{F}_{s-1},x_{s})$. Note that

$$\begin{array}{rl}0&\displaystyle\geq_{(1)}\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\hat{\rho}_{at})-\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho_{a})\\ &\displaystyle=_{(2)}\sum_{s\in\mathcal{S}_{at}}\mathbb{E}\left[\ell_{as}(\hat{\rho}_{at})-\ell_{as}(\rho_{a})\mid\mathcal{G}_{s-1}\right]+H_{at}(\hat{\rho}_{at})\\ &\displaystyle\geq_{(3)}\frac{\kappa\mu_{1}}{2}\sum_{s\in\mathcal{S}_{at}}[\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})]^{2}-1+H_{at}(\hat{\rho}_{at}),\end{array}\tag{52}$$

where (1) holds because $\hat{\rho}_{at}=\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\ell_{as}(\rho)$, (2) holds with

$$H_{at}(\rho)=\sum_{s\in\mathcal{S}_{at}}\left\{\ell_{as}(\rho)-\ell_{as}(\rho_{a})-\mathbb{E}[\ell_{as}(\rho)-\ell_{as}(\rho_{a})\mid\mathcal{G}_{s-1}]\right\}$$

defined as in ([12](https://arxiv.org/html/2606.07392#S4.E12)), and (3) follows from Lemma [8.4](https://arxiv.org/html/2606.07392#S8.Thmtheorem4). So ([52](https://arxiv.org/html/2606.07392#S8.E52)) implies that

$$\frac{\kappa\mu_{1}}{2}\sum_{s\in\mathcal{S}_{at}}[\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})]^{2}\leq-H_{at}(\hat{\rho}_{at})+1.\tag{53}$$

Lemma [4.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3) implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$-H_{at}(\hat{\rho}_{at})\leq\frac{\kappa\mu_{1}}{8}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}+\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+4+\frac{\kappa\mu_{1}}{8},\tag{54}$$

where $\Gamma_{at}^{*}(\delta)=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(Q_{at}+1)+\log\left(6AT/\delta\right)$. So ([53](https://arxiv.org/html/2606.07392#S8.E53)) further implies that

$$\frac{3\kappa\mu_{1}}{8}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}\leq\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+5+\frac{\kappa\mu_{1}}{8}.\tag{55}$$

Combining this with the fact that $\eta_{1}\|\hat{\rho}_{at}-\rho_{a}\|^{2}\leq\eta_{1}d_{\mathcal{B}}^{2}$, and recalling that

$$V_{at}(\eta_{1})=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top},$$

([55](https://arxiv.org/html/2606.07392#S8.E55)) then implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}^{2}\leq\frac{8}{3\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}^{*}(\delta)+\frac{40}{3\kappa\mu_{1}}+\frac{1}{3}+\eta_{1}d_{\mathcal{B}}^{2}.\tag{56}$$

By the Cauchy-Schwarz inequality, $|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}$, so the result follows. $\blacksquare$

### 8.3 Regret Analysis under Known Reward Function

###### Proof 8\.10

Proof of Proposition [4.5](https://arxiv.org/html/2606.07392#S4.Thmtheorem5). Let $\mathcal{E}$ denote the event that uniformly over all $a\in[A]$ and $t\in[T]$,

$$|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},\tag{57}$$

where $V_{at}(\eta_{1})$ is defined as in Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) and $B_{at}^{*}$ is defined as in ([51](https://arxiv.org/html/2606.07392#S8.E51)). On event $\mathcal{E}$, $\widetilde{\sigma}_{at}\geq\sigma_{at}^{*}$ for all $a\in[A],t\in[T]$, where $\widetilde{\sigma}_{at}$ is defined as in ([15](https://arxiv.org/html/2606.07392#S4.E15)). Then Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) implies that on $\mathcal{E}$,

$$\begin{array}{rl}\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\,\Big|\,\mathcal{E}\right]&\displaystyle\leq\mathbb{E}\left[\sum_{t=1}^{T}\sum_{a=1}^{A}(\widetilde{\sigma}_{at}-\sigma_{at}^{*})\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right]\\ &\displaystyle\leq\mathbb{E}\left[\sum_{a=1}^{A}\sum_{t=1}^{T}\left(\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\right)-\Lambda(\rho_{a}^{\top}\psi(x_{t}))\right)\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right]\\ &\displaystyle\leq_{(i)}\mathbb{E}\left[\sum_{a=1}^{A}\sum_{t=1}^{T}L\left|\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}^{*}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}-\rho_{a}^{\top}\psi(x_{t})\right|\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right]\\ &\displaystyle\leq_{(ii)}2LB_{T}^{*}\,\mathbb{E}\left[\sum_{a=1}^{A}\sum_{t=1}^{T}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}\,\Big|\,\mathcal{E}\right],\end{array}$$

where (i) follows from the Lipschitz property of $\Lambda$ according to Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), (ii) follows from ([57](https://arxiv.org/html/2606.07392#S8.E57)), and

$$B_{T}^{*}=\sup_{a\in[A],t\leq T}B_{at}^{*}=\sqrt{\frac{8}{3\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{T}(\delta)+\frac{40}{3\kappa\mu_{1}}+\frac{1}{3}+\eta_{1}d_{\mathcal{B}}^{2}},$$

$$\Gamma_{T}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_{2}\big(1+T\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\big)\right\rceil+1\right)+\log(6AT/\delta).$$

Then, following steps similar to those in the proof of Theorem [5.5](https://arxiv.org/html/2606.07392#S5.Thmtheorem5), we have that on $\mathcal{E}$,

$$\begin{array}{rl}\displaystyle\sum_{a=1}^{A}\sum_{t=1}^{T}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}&\displaystyle=\sum_{a=1}^{A}\sum_{s\in\mathcal{S}_{a,T+1}}\|\psi(x_{s})\|_{V_{as}(\eta_{1})^{-1}}\\ &\displaystyle\leq A\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})\,m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.\end{array}$$

Hence on $\mathcal{E}$,

$$\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\,\bigg|\,\mathcal{E}\right]\leq 2LB_{T}^{*}A\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})\,m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.\tag{58}$$

Note that it always holds that $\widetilde{\sigma}_{at}-\sigma_{at}^{*}\leq 2$, and recall from Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) that $\mathbb{P}(\mathcal{E})\geq 1-\delta$. On $\mathcal{E}^{c}$, the per-period regret is at most $1+2A$. Taking $\delta=1/T$ gives a failure event contribution of order $\mathrm{O}(A)$. Thus, ignoring logarithmic factors, we have $\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\widetilde{O}\left(Am\sqrt{T}\right)$. $\blacksquare$
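The per-arm potential bound used above, $\sum_{s}\|\psi(x_{s})\|_{V_{as}(\eta_{1})^{-1}}\leq\sqrt{n(1+\bar{C}_{\psi}^{2}/\eta_{1})\,m\log(1+n\bar{C}_{\psi}^{2}/(\eta_{1}m))}$ for $n$ queries of a single arm, can be checked numerically. The sketch below uses synthetic features whose norms are clipped to $\bar{C}_{\psi}=1$; the data, dimensions, and function names are assumptions made purely for illustration.

```python
import numpy as np


def elliptical_potential_sum(features: np.ndarray, eta1: float) -> float:
    """Sum of ||psi_s||_{V_s^{-1}}, where V_s uses only features before s."""
    m = features.shape[1]
    V = eta1 * np.eye(m)
    total = 0.0
    for psi in features:
        total += float(np.sqrt(psi @ np.linalg.solve(V, psi)))
        V += np.outer(psi, psi)  # update the Gram matrix after scoring psi
    return total


def potential_bound(n: int, m: int, C_psi: float, eta1: float) -> float:
    """sqrt(n (1 + C^2/eta1) m log(1 + n C^2 / (eta1 m)))."""
    return float(
        np.sqrt(n * (1.0 + C_psi**2 / eta1) * m
                * np.log(1.0 + n * C_psi**2 / (eta1 * m)))
    )


rng = np.random.default_rng(0)
n, m, eta1, C_psi = 500, 4, 1.0, 1.0
features = rng.normal(size=(n, m))
# Clip feature norms to at most C_psi so the boundedness assumption holds.
norms = np.linalg.norm(features, axis=1, keepdims=True)
features = features / np.maximum(1.0, norms / C_psi)

print(elliptical_potential_sum(features, eta1), potential_bound(n, m, C_psi, eta1))
```

Summing this per-arm bound over $A$ arms with $n\leq T$ gives the factor $A\sqrt{T(\cdot)}$ appearing in (58).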

## 9 Reward Estimation

Lemma [9.1](https://arxiv.org/html/2606.07392#S9.Thmtheorem1) below is Theorem 1 from Abbasi-Yadkori et al. ([2011](https://arxiv.org/html/2606.07392#bib.bib56)):

###### Lemma 9\.1

Let $\{v_{s}:s\geq 0\}$ be an $\mathbb{R}^{d}$-valued stochastic process adapted to the filtration $\{\mathcal{H}_{s}:s\geq 0\}$, and let $\{\epsilon_{s}:s\geq 1\}$ be a real-valued stochastic process adapted to $\{\mathcal{H}_{s}\}$. Assume that $\epsilon_{s}$ is conditionally sub-Gaussian, that is, there exists some $\gamma>0$ such that for any $u\in\mathbb{R}$, $s\geq 1$,

$$\mathbb{E}\left[\exp\left(u\epsilon_{s}\right)\,\middle|\,\mathcal{H}_{s-1}\right]\leq\exp\left(\frac{u^{2}\gamma^{2}}{2}\right)\quad\textrm{a.s.}$$

Assume that $V$ is a $d\times d$ positive definite matrix. For any $t\geq 0$, define $\bar{V}_{t}=V+\sum_{s=1}^{t}v_{s}v_{s}^{\top}$. Then for any $\delta>0$, with probability at least $1-\delta$, for all $t\geq 0$,

$$\left\|\sum_{s=1}^{t}\epsilon_{s}v_{s}\right\|_{\bar{V}_{t}^{-1}}^{2}\leq 2\gamma^{2}\log\left(\frac{\mathrm{det}(\bar{V}_{t})^{1/2}\mathrm{det}(V)^{-1/2}}{\delta}\right).$$

Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) below provides a stronger result that directly implies Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1):

###### Lemma 9\.2

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S5), [5](https://arxiv.org/html/2606.07392#S5) hold. Fix any $\delta>0$. Then with probability at least $1-\delta$, uniformly over all $x\in\mathcal{X}$, $\omega\in\Omega$, and $t\geq 2$, the following holds:

$$\begin{array}{rl}&\quad\displaystyle\left|\theta_{*}^{\top}\phi(x,\omega)-\hat{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\\ &\displaystyle\leq\frac{2}{\min\{1,\underline{\mu}\}}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(1/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg).\end{array}\tag{59}$$

In particular, with probability at least $1-\delta/2$, uniformly over all $a\in[A]$ and $t\geq 2$,

$$\big|(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{t},\omega_{at})\big|\;\leq\;\beta_{t}\;\big\|\phi(x_{t},\omega_{at})\big\|_{\Phi_{t-1}^{-1}},\tag{60}$$

where

$$\beta_{t}:=\frac{2}{\min\{1,\underline{\mu}\}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(2/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg).\tag{61}$$
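As a quick illustration of how the radius $\beta_{t}$ in (61) behaves, the sketch below evaluates it directly from the formula. The function name `beta_t`, its argument names, and all numerical values are assumptions chosen for illustration; only the formula is taken from the lemma.

```python
import math


def beta_t(t, d, delta, gamma0, mu_lower, C_phi, eta0, alpha_bar):
    """Confidence radius beta_t from (61).

    mu_lower stands for the lower bound on G' (underline-mu in the text),
    alpha_bar for the bound on ||theta_*||_2.
    """
    log_term = d * math.log(1.0 + t * C_phi**2 / eta0) + 2.0 * math.log(2.0 / delta)
    return (2.0 / min(1.0, mu_lower)) * (
        gamma0 * math.sqrt(log_term) + math.sqrt(eta0) * alpha_bar
    )


# Illustrative (assumed) constants: the radius grows like sqrt(d log t).
for t in (10, 100, 1000, 10000):
    print(t, beta_t(t, d=5, delta=0.05, gamma0=1.5, mu_lower=0.25,
                    C_phi=1.0, eta0=1.0, alpha_bar=1.0))
```

The prefactor $2/\min\{1,\underline{\mu}\}$ shows that a flatter link function (smaller $\underline{\mu}$) inflates the radius proportionally, while the dependence on $t$ is only logarithmic.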

The proof of Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) closely follows that of Proposition 1 of Filippi et al. ([2010](https://arxiv.org/html/2606.07392#bib.bib15)).

###### Proof 9\.3

Proof of Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2). Let $g_{t}(\theta)=\sum_{k=1}^{t-1}G(\theta^{\top}\phi(x_{k},\omega_{a_{k}k}))\phi(x_{k},\omega_{a_{k}k})+\eta_{0}\theta$ be the invertible function such that $g_{t}(\tilde{\theta}_{t-1})=\sum_{k=1}^{t-1}r_{k}\phi(x_{k},\omega_{a_{k}k})$, where $\tilde{\theta}_{t-1}$ is the unique solution to ([62](https://arxiv.org/html/2606.07392#S9.E62)):

$$\nabla_{\theta}\left[\mathcal{L}_{t}(\theta)-\frac{\eta_{0}}{2}\|\theta\|_{2}^{2}\right]=-\eta_{0}\theta+\sum_{k=1}^{t-1}\left\{r_{k}-G\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\right\}\phi(x_{k},\omega_{a_{k}k})=0.\tag{62}$$

By Assumption [5](https://arxiv.org/html/2606.07392#S5), $\nabla g_{t}$ is continuous, so by the Fundamental Theorem of Calculus,

$$g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})=\Gamma_{t}(\theta_{*}-\tilde{\theta}_{t-1}),\tag{63}$$

where $\Gamma_{t}=\int_{0}^{1}\nabla g_{t}(s\theta_{*}+(1-s)\tilde{\theta}_{t-1})\,ds$, and

$$\begin{array}{rl}\nabla g_{t}(\theta)&\displaystyle=\eta_{0}\mathbf{I}_{d}+\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\phi(x_{k},\omega_{a_{k}k})^{\top}G^{\prime}\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\\ &\displaystyle\succeq_{(i)}\eta_{0}\mathbf{I}_{d}+\underline{\mu}\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\phi(x_{k},\omega_{a_{k}k})^{\top},\end{array}$$

where (i) follows because $G^{\prime}\left(\theta^{\top}\phi(x_{k},\omega_{a_{k}k})\right)\geq\underline{\mu}$ according to Assumption [5](https://arxiv.org/html/2606.07392#S5). Hence for any $t\in[T]$,

$$\Gamma_{t}\succeq\min\{1,\underline{\mu}\}\Phi_{t-1}\succeq\min\{1,\underline{\mu}\}\eta_{0}\mathbf{I}_{d}\succ 0.\tag{64}$$

So $\Gamma_{t}$ is positive definite and non-singular for any $t\in[T]$. Therefore,

$$\begin{array}{rl}\left|\theta_{*}^{\top}\phi(x,\omega)-\tilde{\theta}_{t-1}^{\top}\phi(x,\omega)\right|&\displaystyle=_{(1)}|\phi(x,\omega)^{\top}\Gamma_{t}^{-1}\{g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\}|\\ &\displaystyle\leq_{(2)}\|\phi(x,\omega)\|_{\Gamma_{t}^{-1}}\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Gamma_{t}^{-1}},\end{array}\tag{65}$$

where (1) of ([65](https://arxiv.org/html/2606.07392#S9.E65)) follows from ([63](https://arxiv.org/html/2606.07392#S9.E63)), and (2) follows from the Cauchy-Schwarz inequality and the fact that $\Gamma_{t}^{-1}$ is positive definite. ([64](https://arxiv.org/html/2606.07392#S9.E64)) implies that $\Gamma_{t}\succeq\min\{1,\underline{\mu}\}\Phi_{t-1}$, which further implies $\max\{1,1/\underline{\mu}\}\Phi_{t-1}^{-1}\succeq\Gamma_{t}^{-1}$, so $\|v\|_{\Gamma_{t}^{-1}}\leq\max\left\{1/\sqrt{\underline{\mu}},1\right\}\|v\|_{\Phi_{t-1}^{-1}}$ for all $v\in\mathbb{R}^{d}$. Hence, ([65](https://arxiv.org/html/2606.07392#S9.E65)) further implies

$$\left|\theta_{*}^{\top}\phi(x,\omega)-\tilde{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\leq\max\bigg\{\frac{1}{\underline{\mu}},1\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}.\tag{66}$$

Further,

$$\begin{array}{rl}\|g_{t}(\theta_{*})-g_{t}(\hat{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}&\displaystyle\leq\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}+\|g_{t}(\tilde{\theta}_{t-1})-g_{t}(\hat{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}}\\ &\displaystyle\leq 2\|g_{t}(\theta_{*})-g_{t}(\tilde{\theta}_{t-1})\|_{\Phi_{t-1}^{-1}},\end{array}\tag{67}$$

where the first inequality of ([67](https://arxiv.org/html/2606.07392#S9.E67)) follows from the triangle inequality, and the second inequality of ([67](https://arxiv.org/html/2606.07392#S9.E67)) follows from the fact that $\theta_{*}\in\Theta$ and the optimality of $\hat{\theta}_{t-1}$ in $\Theta$ by definition. Recall that

$$g_{t}(\tilde{\theta}_{t-1})-g_{t}(\theta_{*})=\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}-\eta_{0}\theta_{*},$$

so ([66](https://arxiv.org/html/2606.07392#S9.E66)) and ([67](https://arxiv.org/html/2606.07392#S9.E67)) imply that

$$\begin{array}{rl}&\displaystyle\quad\left|\theta_{*}^{\top}\phi(x,\omega)-\hat{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\\ &\displaystyle\leq\max\bigg\{\frac{2}{\underline{\mu}},2\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}-\eta_{0}\theta_{*}\right\|_{\Phi_{t-1}^{-1}}\\ &\displaystyle\leq_{(i)}\max\bigg\{\frac{2}{\underline{\mu}},2\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}+\eta_{0}\left\|\theta_{*}\right\|_{\Phi_{t-1}^{-1}}\bigg)\\ &\displaystyle\leq_{(ii)}\max\bigg\{\frac{2}{\underline{\mu}},2\bigg\}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}+\sqrt{\eta_{0}}\|\theta_{*}\|_{2}\bigg)\end{array}$$

holds for all $x,\omega$, where inequality (i) follows from the triangle inequality, and inequality (ii) holds since $\Phi_{t-1}\succeq\eta_{0}\mathbf{I}_{d}$, so that $\|\theta_{*}\|_{\Phi_{t-1}^{-1}}\leq\|\theta_{*}\|_{2}/\sqrt{\eta_{0}}$. We now apply Lemma [9.1](https://arxiv.org/html/2606.07392#S9.Thmtheorem1) to bound

$$\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}.$$

Set $v_{k}=\phi(x_{k},\omega_{a_{k}k})$, $\epsilon_{k}=\zeta_{k}$, $\mathcal{H}_{k}=\sigma(v_{s},\epsilon_{s};s\leq k)$, $\xi_{t}=\sum_{k=1}^{t-1}v_{k}\epsilon_{k}$ and $V_{t}=\sum_{k=1}^{t}v_{k-1}v_{k-1}^{\top}$. Note that $|\epsilon_{k}|\leq\gamma_{0}$, so $\epsilon_{k}$ is $\gamma_{0}$-sub-Gaussian, meaning that for any $u\in\mathbb{R}$, $t\geq 1$,

$$\mathbb{E}\left[\exp\left(u\zeta_{t}\right)\,\middle|\,\mathcal{H}_{t-1}\right]\leq\exp\left(\frac{u^{2}\gamma_{0}^{2}}{2}\right)\quad\textrm{a.s.}$$

Further, by Assumption [5](https://arxiv.org/html/2606.07392#S5), $\|\phi(x_{k},\omega_{a_{k}k})\|_{2}\leq\bar{C}_{\phi}$, implying that $\mathrm{det}(\bar{V}_{t})\leq\left(\eta_{0}+(t-1)\bar{C}_{\phi}^{2}\right)^{d}$. So by Lemma [9.1](https://arxiv.org/html/2606.07392#S9.Thmtheorem1), given any $\delta>0$, with probability at least $1-\delta$, for all $t\geq 2$ we have

$$\left\|\sum_{k=1}^{t-1}\phi(x_{k},\omega_{a_{k}k})\{r_{k}-G(\theta_{*}^{\top}\phi(x_{k},\omega_{a_{k}k}))\}\right\|_{\Phi_{t-1}^{-1}}\leq\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(1/\delta)}.\tag{68}$$

Note that $\gamma_{0}>1$ according to Assumption [5](https://arxiv.org/html/2606.07392#S5), hence with probability at least $1-\delta$, uniformly over all $t\geq 2$ and all $x\in\mathcal{X}$ and $\omega\in\Omega$,

$$\begin{array}{rl}&\quad\displaystyle\left|\theta_{*}^{\top}\phi(x,\omega)-\hat{\theta}_{t-1}^{\top}\phi(x,\omega)\right|\\ &\displaystyle\leq\frac{2}{\min\{1,\underline{\mu}\}}\|\phi(x,\omega)\|_{\Phi_{t-1}^{-1}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(1/\delta)}+\sqrt{\eta_{0}}\|\theta_{*}\|_{2}\bigg).\end{array}$$

Hence ([59](https://arxiv.org/html/2606.07392#S9.E59)) follows from (iv) of Assumption [5](https://arxiv.org/html/2606.07392#S5), which gives $\|\theta_{*}\|_{2}\leq\bar{\alpha}$. This immediately implies the high-probability bound ([60](https://arxiv.org/html/2606.07392#S9.E60)) taken uniformly over all $a\in[A],t\geq 2$. $\blacksquare$
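The norm-comparison step between (64) and (66) is pure linear algebra and can be sanity-checked numerically. In the sketch below, random weights in $[\underline{\mu},2)$ stand in for the derivative values $G^{\prime}(\cdot)$; the data, the dimension, and the helper `inv_norm` are assumptions for illustration only.

```python
import numpy as np


def inv_norm(v: np.ndarray, M: np.ndarray) -> float:
    """||v||_{M^{-1}} = sqrt(v^T M^{-1} v) for a positive definite M."""
    return float(np.sqrt(v @ np.linalg.solve(M, v)))


rng = np.random.default_rng(1)
d, n, eta0, mu_lower = 3, 50, 0.5, 0.2
phis = rng.normal(size=(n, d))
weights = rng.uniform(mu_lower, 2.0, size=n)  # stand-ins for G'(.) >= mu_lower

Phi = eta0 * np.eye(d) + phis.T @ phis
Gamma = eta0 * np.eye(d) + (phis * weights[:, None]).T @ phis

# Gamma - min{1, mu} * Phi should be positive semidefinite, as in (64).
c = min(1.0, mu_lower)
gap_min_eig = float(np.linalg.eigvalsh(Gamma - c * Phi).min())

# Consequence: ||v||_{Gamma^{-1}} <= max{1/sqrt(mu), 1} * ||v||_{Phi^{-1}}.
v = rng.normal(size=d)
lhs = inv_norm(v, Gamma)
rhs = max(1.0 / np.sqrt(mu_lower), 1.0) * inv_norm(v, Phi)
print(gap_min_eig, lhs, rhs)
```

Applying the same comparison to both norms in (65) yields the product factor $\max\{1/\underline{\mu},1\}$ in (66).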

## 10 Minimum Eigenvalue of $\Phi_{t-1}$

In this section, we first verify the examples satisfying Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) (Lemma [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3), Lemma [10.5](https://arxiv.org/html/2606.07392#S10.Thmtheorem5), Lemma [10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7)). Then we show that Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) implies a lower bound on the minimum eigenvalue of $\Phi_{t-1}$ (Proposition [10.10](https://arxiv.org/html/2606.07392#S10.Thmtheorem10)).

We first present Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1), which is useful for proving Lemmas [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3)–[10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7).

###### Lemma 10\.1

Let $Z^{0}_{at}\in\mathbb{R}^{d}$ be a possibly unbounded random vector. Suppose

$$\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)\leq C_{T}\epsilon,\quad\forall v\in\mathbb{S}^{d-1},\ \epsilon>0,$$

and suppose that

$$\mathbb{P}\left(\|Z^{0}_{at}\|_{2}\leq\bar{C}_{\phi}\,\middle|\,\mathcal{F}_{t-1}\right)\geq q_{T}\quad\text{a.s.}$$

for some deterministic $q_{T}\in(0,1]$. Define the bounded vector $Z_{at}$ by the conditional law $Z_{at}\sim\mathcal{L}\left(Z^{0}_{at}\,\middle|\,\|Z^{0}_{at}\|_{2}\leq\bar{C}_{\phi},\mathcal{F}_{t-1}\right)$. If $\frac{C_{T}}{q_{T}}\leq M_{T}$, then $Z_{at}$ satisfies Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2). Moreover, $\|Z_{at}\|_{2}\leq\bar{C}_{\phi}$ almost surely.

###### Proof 10\.2

Proof of Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1). By construction, $\|Z_{at}\|_{2}\leq\bar{C}_{\phi}$ a.s. Fix $v\in\mathbb{S}^{d-1}$ and $\epsilon>0$. Then

$$\begin{aligned}
\mathbb{P}\left(|v^{\top}Z_{at}|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)
&=\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon\,\middle|\,\|Z^{0}_{at}\|\leq\bar{C}_{\phi},\mathcal{F}_{t-1}\right)
=\frac{\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon,\,\|Z^{0}_{at}\|\leq\bar{C}_{\phi}\,\middle|\,\mathcal{F}_{t-1}\right)}{\mathbb{P}\left(\|Z^{0}_{at}\|\leq\bar{C}_{\phi}\,\middle|\,\mathcal{F}_{t-1}\right)}\\
&\leq\frac{\mathbb{P}\left(|v^{\top}Z^{0}_{at}|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)}{q_{T}}\leq\frac{C_{T}}{q_{T}}\epsilon\leq M_{T}\epsilon.
\end{aligned}$$

In particular, the above inequality holds for all $0<\epsilon\leq(2AM_{T})^{-1}$, so Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds. $\blacksquare$

###### Lemma 10\.3 \(Truncated Gaussian\)

Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds under the following conditions:

- (i) Conditional on $\mathcal{F}_{t-1}$, $Z_{at}\sim\mathcal{N}(\mu_{at},\Sigma_{at})$ and $\phi(x_{t},\omega_{at})\sim\mathcal{L}(Z_{at}\mid\|Z_{at}\|_{2}\leq\bar{C}_{\phi},\mathcal{F}_{t-1})$.
- (ii) There exist deterministic sequences $\sigma_{T}>0$, $\bar{\sigma}_{T}>0$ and $\bar{\mu}_{T}\in[0,\bar{C}_{\phi})$ such that, uniformly over $a\in[A]$, $t\in[T]$, $\lambda_{\min}(\Sigma_{at})\geq\frac{1}{\sigma_{T}^{2}}$, $\lambda_{\max}(\Sigma_{at})\leq\bar{\sigma}_{T}^{2}$, $\|\mu_{at}\|_{2}\leq\bar{\mu}_{T}$ a.s.
- (iii) Let $F_{\chi_{d}^{2}}(\cdot)$ denote the cumulative distribution function of a chi-square random variable with $d$ degrees of freedom. Suppose $\sqrt{2/\pi}\,\sigma_{T}\leq M_{T}F_{\chi_{d}^{2}}\left(\frac{(\bar{C}_{\phi}-\bar{\mu}_{T})^{2}}{\bar{\sigma}_{T}^{2}}\right)$, where $M_{T}=\sqrt{\mathrm{polylog}(T)}/A$.

###### Proof 10\.4

Proof of Lemma [10.3](https://arxiv.org/html/2606.07392#S10.Thmtheorem3). Fix $a\in[A]$, $t\in[T]$ and $v\in\mathbb{S}^{d-1}$. Conditional on $\mathcal{F}_{t-1}$,

$$v^{\top}Z_{at}\sim\mathcal{N}(v^{\top}\mu_{at},v^{\top}\Sigma_{at}v).$$

Since $v\in\mathbb{S}^{d-1}$, $v^{\top}\Sigma_{at}v\geq 1/\sigma_{T}^{2}$. Let $f_{atv}$ be the density of $v^{\top}Z_{at}$. It is bounded by $\frac{1}{\sqrt{2\pi}\sqrt{v^{\top}\Sigma_{at}v}}\leq\frac{\sigma_{T}}{\sqrt{2\pi}}$. So for every $\epsilon>0$,

$$\mathbb{P}(|v^{\top}Z_{at}|\leq\epsilon\mid\mathcal{F}_{t-1})=\int_{-\epsilon}^{\epsilon}f_{atv}(u)\,du\leq\frac{2\epsilon\sigma_{T}}{\sqrt{2\pi}}=\sqrt{2/\pi}\,\sigma_{T}\epsilon.$$

Next, note that $Z_{at}=\mu_{at}+\Sigma_{at}^{1/2}Z$ where $Z\sim\mathcal{N}(0,\mathbf{I}_{d})$. Note that

$$\|Z_{at}\|_{2}\overset{(1)}{\leq}\|\mu_{at}\|_{2}+\|\Sigma_{at}^{1/2}Z\|_{2}\overset{(2)}{\leq}\|\mu_{at}\|_{2}+\sqrt{\lambda_{\max}(\Sigma_{at})}\|Z\|_{2}\overset{(3)}{\leq}\bar{\mu}_{T}+\bar{\sigma}_{T}\|Z\|_{2},$$

where (1) follows from the triangle inequality, (2) follows because $\|\Sigma_{at}^{1/2}Z\|_{2}\leq\sqrt{\lambda_{\max}(\Sigma_{at})}\|Z\|_{2}$, and (3) follows from condition (ii) in the lemma. Therefore, the event $\left\{\|Z\|_{2}\leq\frac{\bar{C}_{\phi}-\bar{\mu}_{T}}{\bar{\sigma}_{T}}\right\}$ implies

$$\|Z_{at}\|_{2}\leq\bar{\mu}_{T}+\bar{\sigma}_{T}\frac{\bar{C}_{\phi}-\bar{\mu}_{T}}{\bar{\sigma}_{T}}=\bar{C}_{\phi}.$$

Hence

$$\mathbb{P}(\|Z_{at}\|_{2}\leq\bar{C}_{\phi}\mid\mathcal{F}_{t-1})\geq\mathbb{P}\left(\|Z\|_{2}\leq\frac{\bar{C}_{\phi}-\bar{\mu}_{T}}{\bar{\sigma}_{T}}\,\middle|\,\mathcal{F}_{t-1}\right)=F_{\chi_{d}^{2}}\left(\frac{(\bar{C}_{\phi}-\bar{\mu}_{T})^{2}}{\bar{\sigma}_{T}^{2}}\right)=q_{T}.$$

Applying Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1) with $C_{T}=\sqrt{2/\pi}\,\sigma_{T}$ and $q_{T}=F_{\chi_{d}^{2}}\left(\frac{(\bar{C}_{\phi}-\bar{\mu}_{T})^{2}}{\bar{\sigma}_{T}^{2}}\right)$ gives

$$\mathbb{P}(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1})\leq\frac{\sqrt{2/\pi}\,\sigma_{T}}{q_{T}}\epsilon\leq M_{T}\epsilon.$$

So Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) is satisfied. $\blacksquare$
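The bound of Lemma 10.3 can be sanity-checked numerically. The following is a minimal Monte Carlo sketch, not part of the paper: it assumes $d=2$, $\mu_{at}=0$, $\Sigma_{at}=\mathbf{I}_2$ (so $\sigma_T=\bar{\sigma}_T=1$, $\bar{\mu}_T=0$) and $\bar{C}_\phi=2$, all of which are illustrative choices, and it uses the closed form $F_{\chi_2^2}(x)=1-e^{-x/2}$. The function name `small_ball_check` is hypothetical.

```python
import math
import random


def small_ball_check(eps=0.5, c_phi=2.0, n=100_000, seed=0):
    """Compare an empirical small-ball probability with the Lemma 10.3 bound.

    Illustrative setting (an assumption, not from the paper): d = 2,
    mu = 0, Sigma = I, so sigma_T = sigma_bar_T = 1 and mu_bar_T = 0.
    """
    rng = random.Random(seed)
    v = (1.0, 0.0)  # a fixed unit direction; the law is rotation invariant
    accepted = 0
    hits = 0
    for _ in range(n):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        # Rejection sampling realizes the conditioning on ||Z||_2 <= c_phi.
        if math.hypot(z[0], z[1]) <= c_phi:
            accepted += 1
            if abs(v[0] * z[0] + v[1] * z[1]) <= eps:
                hits += 1
    empirical = hits / accepted

    sigma_t, sigma_bar, mu_bar = 1.0, 1.0, 0.0
    # Chi-square CDF with 2 degrees of freedom: F(x) = 1 - exp(-x / 2).
    q_t = 1.0 - math.exp(-((c_phi - mu_bar) ** 2) / (2.0 * sigma_bar ** 2))
    bound = math.sqrt(2.0 / math.pi) * sigma_t / q_t * eps
    return empirical, bound


if __name__ == "__main__":
    emp, bnd = small_ball_check()
    print(f"empirical={emp:.4f}  bound={bnd:.4f}")
```

In this setting the bound $\sqrt{2/\pi}\,\sigma_T\epsilon/q_T$ is conservative because it replaces the joint probability in the numerator by the unconditional small-ball probability, exactly as in the proof of Lemma 10.1.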

###### Lemma 10\.5 \(Truncated Student’s t distribution\)

Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds under the conditions below:

- (i) $\phi(x_{t},\omega_{at})\sim\mathcal{L}(Z_{at}\mid\|Z_{at}\|_{2}\leq\bar{C}_{\phi},\mathcal{F}_{t-1})$, where $Z_{at}\sim t_{\xi}(\mu_{at},\Sigma_{at})$ conditional on $\mathcal{F}_{t-1}$, and $t_{\xi}(\mu_{at},\Sigma_{at})$ is a $d$-dimensional Student’s $t$ distribution with $\xi>0$ degrees of freedom, location $\mu_{at}$, and scale matrix $\Sigma_{at}$.
- (ii) There exist deterministic sequences $\sigma_{T}>0$, $\bar{\sigma}_{T}>0$ and $\bar{\mu}_{T}\in[0,\bar{C}_{\phi})$ such that, uniformly over all $a\in[A]$, $t\in[T]$, $\lambda_{\min}(\Sigma_{at})\geq 1/\sigma_{T}^{2}$, $\lambda_{\max}(\Sigma_{at})\leq\bar{\sigma}_{T}^{2}$, $\|\mu_{at}\|_{2}\leq\bar{\mu}_{T}$ a.s.
- (iii) $2c_{\xi}\sigma_{T}/q_{T}\leq M_{T}=\sqrt{\mathrm{polylog}(T)}/A$, where $c_{\xi}:=\frac{\Gamma((\xi+1)/2)}{\sqrt{\xi\pi}\,\Gamma(\xi/2)}$, $q_{T}=F_{d,\xi}\left((\bar{C}_{\phi}-\bar{\mu}_{T})^{2}/(d\bar{\sigma}_{T}^{2})\right)$, and $F_{d,\xi}(\cdot)$ denotes the cumulative distribution function of the $F$-distribution with $d$ and $\xi$ degrees of freedom.

###### Proof 10\.6

Proof of Lemma [10.5](https://arxiv.org/html/2606.07392#S10.Thmtheorem5). Fix $a\in[A]$, $t\in[T]$, $v\in\mathbb{S}^{d-1}$. The multivariate Student’s $t$ random vector $Z_{at}$ can be written as $Z_{at}=\mu_{at}+\frac{\Sigma_{at}^{1/2}Z}{\sqrt{S/\xi}}$, where $Z\sim\mathcal{N}(0,\mathbf{I}_{d})$, $S\sim\chi_{\xi}^{2}$ with $\chi_{\xi}^{2}$ the chi-square distribution with $\xi$ degrees of freedom, and $Z$ and $S$ are independent. So

$$v^{\top}Z_{at}=v^{\top}\mu_{at}+\frac{v^{\top}\Sigma_{at}^{1/2}Z}{\sqrt{S/\xi}}.$$

Since $v^{\top}\Sigma_{at}^{1/2}Z\sim\mathcal{N}(0,v^{\top}\Sigma_{at}v)$, it holds that $v^{\top}Z_{at}\sim t_{\xi}(v^{\top}\mu_{at},v^{\top}\Sigma_{at}v)$, where by the eigenvalue lower bound we have $v^{\top}\Sigma_{at}v\geq 1/\sigma_{T}^{2}$. The density $f_{atv}$ of the univariate Student’s $t$ random variable $v^{\top}Z_{at}$ is

$$f_{atv}(y)=\frac{c_{\xi}}{\sqrt{v^{\top}\Sigma_{at}v}}\left(1+\frac{(y-v^{\top}\mu_{at})^{2}}{\xi[v^{\top}\Sigma_{at}v]}\right)^{-(\xi+1)/2}.$$

Hence $\sup_{y}f_{atv}(y)\leq c_{\xi}/\sqrt{v^{\top}\Sigma_{at}v}\leq c_{\xi}\sigma_{T}$. So for every $\epsilon>0$,

$$\mathbb{P}\left(|v^{\top}Z_{at}|\leq\epsilon\mid\mathcal{F}_{t-1}\right)=\int_{-\epsilon}^{\epsilon}f_{atv}(u)\,du\leq 2c_{\xi}\sigma_{T}\epsilon.$$

Next, we lower bound the probability of the truncation event $\{\|Z_{at}\|_{2}\leq\bar{C}_{\phi}\}$. Note that

$$\|Z_{at}\|_{2}\leq\|\mu_{at}\|_{2}+\|\Sigma_{at}^{1/2}Z\|_{2}/\sqrt{S/\xi}\leq\bar{\mu}_{T}+\bar{\sigma}_{T}\|Z\|_{2}/\sqrt{S/\xi}.$$

Therefore, $\|Z\|_{2}/\sqrt{S/\xi}\leq(\bar{C}_{\phi}-\bar{\mu}_{T})/\bar{\sigma}_{T}$ implies $\|Z_{at}\|_{2}\leq\bar{\mu}_{T}+\bar{\sigma}_{T}\frac{\bar{C}_{\phi}-\bar{\mu}_{T}}{\bar{\sigma}_{T}}=\bar{C}_{\phi}$. So

$$\begin{aligned}
\mathbb{P}(\|Z_{at}\|_{2}\leq\bar{C}_{\phi}\mid\mathcal{F}_{t-1})
&\geq\mathbb{P}\left(\frac{\|Z\|_{2}}{\sqrt{S/\xi}}\leq\frac{\bar{C}_{\phi}-\bar{\mu}_{T}}{\bar{\sigma}_{T}}\,\middle|\,\mathcal{F}_{t-1}\right)\\
&=\mathbb{P}\left(\frac{\|Z\|_{2}^{2}/d}{S/\xi}\leq\frac{(\bar{C}_{\phi}-\bar{\mu}_{T})^{2}}{d\bar{\sigma}_{T}^{2}}\,\middle|\,\mathcal{F}_{t-1}\right)=F_{d,\xi}\left(\frac{(\bar{C}_{\phi}-\bar{\mu}_{T})^{2}}{d\bar{\sigma}_{T}^{2}}\right)=q_{T}.
\end{aligned}$$

Hence, applying Lemma [10.1](https://arxiv.org/html/2606.07392#S10.Thmtheorem1), we have $\mathbb{P}(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1})\leq M_{T}\epsilon$, so Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds. $\blacksquare$

###### Lemma 10\.7 \(Uniform distribution on a hyperrectangle\)

Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds under (i)–(ii):

- (i) Conditional on $\mathcal{F}_{t-1}$, $\phi(x_{t},\omega_{at})=\bar{\phi}_{at}+U_{at}$, where $\bar{\phi}_{at}$ is a fixed constant vector, $U_{at,j}\sim\mathrm{Unif}[-R_{at,j},R_{at,j}]$ for $j\in[d]$, and the coordinates $U_{at,j}$ are independent across $j\in[d]$.
- (ii) $R_{at,j}\geq\sqrt{d}/M_{T}$, where $M_{T}=\sqrt{\mathrm{polylog}(T)}/A$, and $\|\bar{\phi}_{at}\|_{2}+\left(\sum_{j=1}^{d}R_{at,j}^{2}\right)^{1/2}\leq\bar{C}_{\phi}$ a.s.

###### Proof 10\.8

Proof of Lemma [10.7](https://arxiv.org/html/2606.07392#S10.Thmtheorem7). Fix $a\in[A]$, $t\in[T]$, $v\in\mathbb{S}^{d-1}$. Since $\|v\|_{2}=1$, there exists a coordinate $j^{*}\in[d]$ such that $|v_{j^{*}}|\geq 1/\sqrt{d}$. Note that $v^{\top}\phi(x_{t},\omega_{at})=v^{\top}\bar{\phi}_{at}+v^{\top}U_{at}=v^{\top}\bar{\phi}_{at}+\sum_{j=1}^{d}v_{j}U_{at,j}$. Conditional on all $\{U_{at,j}:j\neq j^{*}\}$, $v^{\top}\phi(x_{t},\omega_{at})=C+v_{j^{*}}U_{at,j^{*}}$, where $C$ is a fixed constant. Conditional on $\mathcal{F}_{t-1}$, the $U_{at,j}$ are independent and uniform on $[-R_{at,j},R_{at,j}]$. Since $U_{at,j^{*}}\sim\mathrm{Unif}[-R_{at,j^{*}},R_{at,j^{*}}]$, $C+v_{j^{*}}U_{at,j^{*}}$ is uniform on an interval of length $2R_{at,j^{*}}|v_{j^{*}}|$. Therefore, for any interval $I\subset\mathbb{R}$ of length $|I|$, we have

$$\mathbb{P}(C+v_{j^{*}}U_{at,j^{*}}\in I\mid\{U_{at,j}:j\neq j^{*}\},\mathcal{F}_{t-1})\leq\frac{|I|}{2R_{at,j^{*}}|v_{j^{*}}|}.$$

Take $I=[-\epsilon,\epsilon]$. Then

$$\mathbb{P}\left(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1},\{U_{at,j}:j\neq j^{*}\}\right)\leq\frac{2\epsilon}{2R_{at,j^{*}}|v_{j^{*}}|}\leq\frac{\sqrt{d}}{R_{at,j^{*}}}\epsilon.$$

Taking expectation over the conditioned coordinates of $U_{at}$ in the above inequality and using $R_{at,j^{*}}\geq\sqrt{d}/M_{T}$ from condition (ii) gives

$$\mathbb{P}(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\mid\mathcal{F}_{t-1})\leq M_{T}\epsilon$$

with $M_{T}=\sqrt{\mathrm{polylog}(T)}/A$. $\blacksquare$
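The coordinate argument in the proof of Lemma 10.7 can be illustrated with a short simulation. This is a sketch under stated assumptions, not the paper's code: it takes $\bar{\phi}_{at}=0$, $d=3$ and unit radii as illustrative values, and the helper names `uniform_small_ball` and `coordinate_bound` are hypothetical. The bound is evaluated with the smallest radius, which dominates $\sqrt{d}\,\epsilon/R_{at,j^{*}}$ for whichever coordinate $j^{*}$ the proof selects.

```python
import math
import random


def uniform_small_ball(v, radii, eps, n=50_000, seed=1):
    """Empirical P(|v^T U| <= eps) for independent U_j ~ Unif[-R_j, R_j]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        proj = sum(vj * rng.uniform(-r, r) for vj, r in zip(v, radii))
        if abs(proj) <= eps:
            hits += 1
    return hits / n


def coordinate_bound(radii, eps):
    """Upper bound sqrt(d) * eps / min_j R_j implied by the proof of Lemma 10.7."""
    d = len(radii)
    return math.sqrt(d) / min(radii) * eps


if __name__ == "__main__":
    radii = (1.0, 1.0, 1.0)  # illustrative radii, not taken from the paper
    s = 1.0 / math.sqrt(3.0)
    for v in [(1.0, 0.0, 0.0), (s, s, s), (0.6, 0.8, 0.0)]:
        p = uniform_small_ball(v, radii, eps=0.2)
        print(v, round(p, 4), "<=", round(coordinate_bound(radii, 0.2), 4))
```

The bound is loose by a factor of up to $\sqrt{d}$ for coordinate-aligned directions, which is the price of using only the single largest coordinate of $v$.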

###### Lemma 10\.9 \(Theorem 3\.1 of Tropp \([2011](https://arxiv.org/html/2606.07392#bib.bib5)\)\)

Let $\mathcal{F}_{1}\subset\mathcal{F}_{2}\subset\cdots$ be a filtration and consider a finite sequence $\{X_{k}\}$ of positive semi-definite matrices with dimension $d$ adapted to this filtration. Suppose that $\lambda_{\max}(X_{k})\leq R$ almost surely. Define the series $Y\equiv\sum_{k}X_{k}$ and $W\equiv\sum_{k}\mathbb{E}[X_{k}\mid\mathcal{F}_{k-1}]$. Then for all $\mu\geq 0$ and $\gamma\in[0,1)$,

$$\mathbb{P}\left(\lambda_{\min}(Y)\leq(1-\gamma)\mu,\ \lambda_{\min}(W)\geq\mu\right)\leq d\left(\frac{e^{-\gamma}}{(1-\gamma)^{1-\gamma}}\right)^{\mu/R}.$$

###### Proposition 10\.10 \(Minimum eigenvalue\)

Suppose Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) holds. Then

$$\lambda_{\min}\left(\mathbb{E}\left[\phi(x_{t},\omega_{a_{t}t})\phi(x_{t},\omega_{a_{t}t})^{\top}\,\middle|\,\mathcal{F}_{t-1}\right]\right)\geq\frac{1/8}{\mathrm{polylog}(T)}.\tag{69}$$

Further, for any $\delta\in(0,1)$, with probability at least $1-\delta$,

$$\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]\tag{70}$$

holds uniformly for all $t\geq 100\bar{C}_{\phi}^{2}\,\mathrm{polylog}(T)\log(Td/\delta)$.

###### Proof 10\.11

Proof of Proposition [10.10](https://arxiv.org/html/2606.07392#S10.Thmtheorem10). Fix $v\in\mathbb{S}^{d-1}$. Since $a_{t}\in[A]$, for any $\epsilon>0$,

$$\{|v^{\top}\phi(x_{t},\omega_{a_{t}t})|\leq\epsilon\}\subseteq\bigcup_{a=1}^{A}\{|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\}.$$

Therefore, by applying the union bound and Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2),

$$\mathbb{P}\left(|v^{\top}\phi(x_{t},\omega_{a_{t}t})|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)\leq\sum_{a=1}^{A}\mathbb{P}\left(|v^{\top}\phi(x_{t},\omega_{at})|\leq\epsilon\,\middle|\,\mathcal{F}_{t-1}\right)\leq AM_{T}\epsilon.$$

Take $\epsilon=(2AM_{T})^{-1}$. Then the above inequality implies that

$$\mathbb{P}\left(|v^{\top}\phi(x_{t},\omega_{a_{t}t})|>(2AM_{T})^{-1}\,\middle|\,\mathcal{F}_{t-1}\right)\geq\frac{1}{2}.$$

Hence

$$\begin{aligned}
v^{\top}\mathbb{E}[\phi(x_{t},\omega_{a_{t}t})\phi(x_{t},\omega_{a_{t}t})^{\top}\mid\mathcal{F}_{t-1}]v
&=\mathbb{E}[(v^{\top}\phi(x_{t},\omega_{a_{t}t}))^{2}\mid\mathcal{F}_{t-1}]\\
&\geq(2AM_{T})^{-2}\,\mathbb{P}\left(|v^{\top}\phi(x_{t},\omega_{a_{t}t})|>(2AM_{T})^{-1}\,\middle|\,\mathcal{F}_{t-1}\right)\\
&\geq\frac{1/8}{(AM_{T})^{2}}=\frac{1/8}{\mathrm{polylog}(T)}.
\end{aligned}$$

Since this holds for any $v\in\mathbb{S}^{d-1}$, ([69](https://arxiv.org/html/2606.07392#S10.E69)) follows. Setting $\gamma=1/2$, $R=\bar{C}_{\phi}^{2}$ in Lemma [10.9](https://arxiv.org/html/2606.07392#S10.Thmtheorem9), with probability at least $1-\exp\{\log(d)-0.01(t-1)/[\bar{C}_{\phi}^{2}\,\mathrm{polylog}(T)]\}$, $\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]$. In particular, for any $\delta\in(0,1)$, with probability at least $1-\delta$, $\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]$ holds uniformly for all $t\geq 100\bar{C}_{\phi}^{2}\,\mathrm{polylog}(T)\log(Td/\delta)$. $\blacksquare$
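The constants $0.01$ and $100$ in the proof of Proposition 10.10 can be traced with a few lines of arithmetic. The sketch below is illustrative only: it assumes that $\lambda_{\min}(W)\geq\mu=(t-1)/(8\,\mathrm{polylog}(T))$ by summing (69) over $t-1$ rounds, and the function names and the numerical values of $T$, $d$, $\delta$, $\bar{C}_\phi$ and $\mathrm{polylog}(T)$ are hypothetical.

```python
import math


def chernoff_rate(gamma):
    """Return -log(e^{-gamma} / (1 - gamma)^{1 - gamma}), the decay rate in Lemma 10.9."""
    return gamma + (1.0 - gamma) * math.log(1.0 - gamma)


def failure_prob(t, d, c_phi, polylog_t, gamma=0.5):
    """Matrix Chernoff tail d * exp(-rate * mu / R) with R = c_phi^2.

    mu = (t - 1) / (8 * polylog_t) is the assumed lower bound on
    lambda_min(W) obtained by summing (69) over t - 1 rounds.
    """
    mu = (t - 1) / (8.0 * polylog_t)
    return d * math.exp(-chernoff_rate(gamma) * mu / c_phi ** 2)


if __name__ == "__main__":
    # With gamma = 1/2 the rate is about 0.1534, and 0.1534 / 8 is about
    # 0.0192, which is why the proof can use the rounder constant 0.01.
    print(chernoff_rate(0.5), chernoff_rate(0.5) / 8.0)

    # Hypothetical problem sizes, chosen only for illustration.
    T, d, delta, c_phi = 1000, 5, 0.05, 1.0
    polylog_t = math.log(T) ** 2
    burn_in = math.ceil(100 * c_phi ** 2 * polylog_t * math.log(T * d / delta)) + 1
    # Past the burn-in the per-round failure probability is below delta / T,
    # so a union bound over at most T rounds gives total failure at most delta.
    print(failure_prob(burn_in, d, c_phi, polylog_t), delta / T)
```

Under these assumptions the per-round tail at the burn-in time is far below $\delta/T$, which shows that the factor $100$ in the threshold leaves slack relative to the exact Chernoff rate.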

## 11 Index Estimation under Unknown Reward Function

The same argument as for Proposition [4.4](https://arxiv.org/html/2606.07392#S4.Thmtheorem4) extends to Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2), which gives a high-probability bound on the index estimation error when the reward parameter $\theta_{*}$ is unknown and learned online. The main difference is that the primitive loss is evaluated at the plug-in estimate $\hat{\theta}_{t-1}$ rather than at the true parameter $\theta_{*}$. This introduces an additional error term in the optimality inequality:

$$\text{population curvature}\;\;\leq\;\;\text{empirical fluctuation}\;\;+\;\;\text{plug-in error from }\hat{\theta}_{t-1}.$$

Consequently, Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2) yields the same type of confidence radius for $\rho_{a}^{\top}\psi(x_{t})$, with an additional additive term that accounts for uncertainty in the reward parameter.

Recall from Section [5.2](https://arxiv.org/html/2606.07392#S5.SS2) that $\hat{\rho}_{at}\in\operatorname*{arg\,min}_{\rho\in\mathcal{B}}\sum_{s\in\mathcal{S}_{at}}\hat{\ell}_{as,t}(\rho)$, where $\hat{\ell}_{as,t}$ is the loss defined in ([22](https://arxiv.org/html/2606.07392#S5.E22)) with $\hat{\theta}_{t-1}$ plugged in for the unknown $\theta_{*}$. Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3), proved below, characterizes how the estimation error of $\theta_{*}$ propagates into the estimation error of $\rho_{a}^{\top}\psi(x_{t})$.

###### Proof 11\.1

Proof of Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3). Recall from ([24](https://arxiv.org/html/2606.07392#S5.E24)) that

$$J_{at}(\rho):=\sum_{s\in\mathcal{S}_{at}}\left\{[\hat{\ell}_{as,t}(\rho)-\ell_{as}(\rho)]-[\hat{\ell}_{as,t}(\rho_{a})-\ell_{as}(\rho_{a})]\right\}.$$

Note that ([8](https://arxiv.org/html/2606.07392#S4.E8)) and ([22](https://arxiv.org/html/2606.07392#S5.E22)) imply that

$$\begin{aligned}
&[\hat{\ell}_{as,t}(\rho_{a})-\ell_{as}(\rho_{a})]-[\hat{\ell}_{as,t}(\hat{\rho}_{at})-\ell_{as}(\hat{\rho}_{at})]\\
&=\int_{0}^{\rho_{a}^{\top}\psi(x_{s})}\left[(G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u))^{+}-(G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u))^{+}\right]du\\
&\quad-\int_{0}^{\hat{\rho}_{at}^{\top}\psi(x_{s})}\left[(G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u))^{+}-(G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u))^{+}\right]du\\
&=\int_{\rho_{a}^{\top}\psi(x_{s})}^{\hat{\rho}_{at}^{\top}\psi(x_{s})}\left[(G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u))^{+}-(G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))-\Lambda(u))^{+}\right]du\\
&\overset{(i)}{\leq}\left|G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))\right|\cdot|\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})|,
\end{aligned}$$

where (i) uses the 1-Lipschitz property of $z\mapsto z^{+}$. Thus

$$\begin{aligned}
-J_{at}(\hat{\rho}_{at})
&\overset{(i)}{\leq}\sqrt{\sum_{s\in\mathcal{S}_{at}}\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\}^{2}}\sqrt{\sum_{s\in\mathcal{S}_{at}}\left[G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))\right]^{2}}\\
&\overset{(ii)}{\leq}\frac{c_{0}}{8}\sum_{s\in\mathcal{S}_{at}}\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\}^{2}+\frac{2}{c_{0}}\sum_{s\in\mathcal{S}_{at}}\left[G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))\right]^{2}\\
&\overset{(iii)}{\leq}\frac{c_{0}}{8}\sum_{s\in\mathcal{S}_{at}}\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\}^{2}+\frac{2}{c_{0}}\sum_{s\in\mathcal{S}_{at}}L^{2}\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\}^{2},
\end{aligned}$$

where (i) holds by the Cauchy-Schwarz inequality, (ii) holds by applying $2ab\leq\epsilon a^{2}+\epsilon^{-1}b^{2}$ with $\epsilon=\frac{c_{0}}{4}$, $a=\sqrt{\sum_{s\in\mathcal{S}_{at}}\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\}^{2}}$ and $b=\sqrt{\sum_{s\in\mathcal{S}_{at}}\left[G(\hat{\theta}_{t-1}^{\top}\phi(x_{s},\omega_{as}))-G(\theta_{*}^{\top}\phi(x_{s},\omega_{as}))\right]^{2}}$, and (iii) holds by the Lipschitz property of $G$ according to Assumption [5](https://arxiv.org/html/2606.07392#S5). Hence the result follows. $\blacksquare$
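The two elementary inequalities used in this proof, the 1-Lipschitz bound on the integral and the weighted Young step (ii), can be checked numerically. The sketch below is only an illustration: the paper's $\Lambda$ is not specified here, so `math.tanh` stands in as a hypothetical $\Lambda$, and the names `plug_in_gap` and `young_rhs` are made up for this example.

```python
import math
import random


def pos(z):
    """Positive part z^+ = max(z, 0)."""
    return max(z, 0.0)


def plug_in_gap(g_hat, g_true, lo, hi, lam, steps=1000):
    """Midpoint-rule value of the integral over [lo, hi] of
    (g_hat - lam(u))^+ - (g_true - lam(u))^+ ; hi < lo gives a signed integral."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        u = lo + (i + 0.5) * h
        total += pos(g_hat - lam(u)) - pos(g_true - lam(u))
    return h * total


def young_rhs(a, b, c0):
    """Right-hand side (c0 / 8) a^2 + (2 / c0) b^2 of step (ii)."""
    return c0 / 8.0 * a * a + 2.0 / c0 * b * b


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        g_hat, g_true = rng.uniform(-1, 1), rng.uniform(-1, 1)
        lo, hi = rng.uniform(-2, 2), rng.uniform(-2, 2)
        gap = plug_in_gap(g_hat, g_true, lo, hi, math.tanh)
        # 1-Lipschitz property of z -> z^+ bounds the integrand pointwise.
        print(abs(gap) <= abs(g_hat - g_true) * abs(hi - lo) + 1e-12)
```

Step (ii) is the inequality $ab\leq\frac{c_0}{8}a^2+\frac{2}{c_0}b^2$, which is $2ab\leq\epsilon a^{2}+\epsilon^{-1}b^{2}$ with $\epsilon=c_0/4$ divided by two; `young_rhs` evaluates its right-hand side.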

###### Proposition 11\.2 \(Restatement of Proposition [5\.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4)\)

Suppose Assumptions [5](https://arxiv.org/html/2606.07392#S4.E5), [4.1](https://arxiv.org/html/2606.07392#S4.SS1), [5](https://arxiv.org/html/2606.07392#S5), [5](https://arxiv.org/html/2606.07392#S5), [5.2](https://arxiv.org/html/2606.07392#S5.SS2) hold. Let $V_{at}(\eta_{1})$ be defined as in ([13](https://arxiv.org/html/2606.07392#S4.E13)). Given any $\delta>0$, with probability at least $1-2\delta/3$, uniformly over all $a\in[A]$ and $t\in[T]$,

$$\begin{aligned}
|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|
&\leq\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\Bigg[\sqrt{\frac{4\Gamma_{at}(\delta)}{\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)+\frac{20}{\kappa\mu_{1}}+\frac{1}{2}+\eta_{1}d_{\mathcal{B}}^{2}}\\
&\qquad\qquad+\frac{2\sqrt{2}L}{\kappa\mu_{1}}\sqrt{\bar{C}_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2}}\Bigg],
\end{aligned}$$

where

$$\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log\left(\left\lceil\log_{2}\big(1+n_{at}\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\big)\right\rceil+1\right)+\log(6AT/\delta),$$

$$\bar{C}_{at}(\delta):=\frac{4}{\min\{1,\underline{\mu}\}^{2}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(3/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg)^{2}.$$

###### Proof 11\.3

Proof of Proposition[11\.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2)For anys∈\[T\]s\\in\[T\], define𝒢s−1:=σ​\(ℱs−1,xs\)\\mathcal\{G\}\_\{s\-1\}:=\\sigma\(\\mathcal\{F\}\_\{s\-1\},x\_\{s\}\)\. Note that

0≥\(1\)∑s∈𝒮a​tℓ^a​s,t​\(ρ^a​t\)−∑s∈𝒮a​tℓ^a​s,t​\(ρa\)=\(2\)∑s∈𝒮a​t𝔼​\[ℓa​s​\(ρ^a​t\)−ℓa​s​\(ρa\)∣𝒢s−1\]\+Ha​t​\(ρ^a​t\)\+Ja​t​\(ρ^a​t\)≥\(3\)κ​μ12​∑s∈𝒮a​t\[ψ​\(xs\)⊤​\(ρ^a​t−ρa\)\]2−1\+Ha​t​\(ρ^a​t\)\+Ja​t​\(ρ^a​t\),\\begin\{array\}\[\]\{rl\}0&\\geq\_\{\(1\)\}\\displaystyle\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\hat\{\\ell\}\_\{as,t\}\(\\hat\{\\rho\}\_\{at\}\)\-\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\hat\{\\ell\}\_\{as,t\}\(\\rho\_\{a\}\)\\\\ &\\displaystyle=\_\{\(2\)\}\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\mathbb\{E\}\\left\[\\ell\_\{as\}\(\\hat\{\\rho\}\_\{at\}\)\-\\ell\_\{as\}\(\\rho\_\{a\}\)\\mid\\mathcal\{G\}\_\{s\-1\}\\right\]\+H\_\{at\}\(\\hat\{\\rho\}\_\{at\}\)\+J\_\{at\}\(\\hat\{\\rho\}\_\{at\}\)\\\\ &\\displaystyle\\geq\_\{\(3\)\}\\frac\{\\kappa\\mu\_\{1\}\}\{2\}\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\[\\psi\(x\_\{s\}\)^\{\\top\}\(\\hat\{\\rho\}\_\{at\}\-\\rho\_\{a\}\)\]^\{2\}\-1\+H\_\{at\}\(\\hat\{\\rho\}\_\{at\}\)\+J\_\{at\}\(\\hat\{\\rho\}\_\{at\}\),\\end\{array\}\(71\)where \(1\) holds becauseρ^a​t=\\argminρ∈ℬ​1na​t​∑s∈𝒮a​tℓ^a​s,t​\(ρ\)\\displaystyle\\hat\{\\rho\}\_\{at\}=\\argmin\_\{\\rho\\in\\mathcal\{B\}\}\\frac\{1\}\{n\_\{at\}\}\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\hat\{\\ell\}\_\{as,t\}\(\\rho\), \(2\) holds with

Ha​t​\(ρ\):=∑s∈𝒮a​t\{ℓa​s​\(ρ\)−ℓa​s​\(ρa\)−𝔼​\[ℓa​s​\(ρ\)−ℓa​s​\(ρa\)∣𝒢s−1\]\},H\_\{at\}\(\\rho\):=\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\left\\\{\\ell\_\{as\}\(\\rho\)\-\\ell\_\{as\}\(\\rho\_\{a\}\)\-\\mathbb\{E\}\[\\ell\_\{as\}\(\\rho\)\-\\ell\_\{as\}\(\\rho\_\{a\}\)\\mid\\mathcal\{G\}\_\{s\-1\}\]\\right\\\},Ja​t​\(ρ\):=∑s∈𝒮a​t\{\[ℓ^a​s,t​\(ρ\)−ℓa​s​\(ρ\)\]−\[ℓ^a​s,t​\(ρa\)−ℓa​s​\(ρa\)\]\},J\_\{at\}\(\\rho\):=\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\\left\\\{\[\\hat\{\\ell\}\_\{as,t\}\(\\rho\)\-\\ell\_\{as\}\(\\rho\)\]\-\[\\hat\{\\ell\}\_\{as,t\}\(\\rho\_\{a\}\)\-\\ell\_\{as\}\(\\rho\_\{a\}\)\]\\right\\\},and \(3\) follows from Lemma[8\.4](https://arxiv.org/html/2606.07392#S8.Thmtheorem4)\. So \([71](https://arxiv.org/html/2606.07392#S11.E71)\) implies that

κ​μ12​∑s∈𝒮a​t\[ψ​\(xs\)⊤​\(ρ^a​t−ρa\)\]2≤−\(Ha​t​\(ρ^a​t\)\+Ja​t​\(ρ^a​t\)\)\+1\.\\frac\{\\kappa\\mu\_\{1\}\}\{2\}\\sum\_\{s\\in\\mathcal\{S\}\_\{at\}\}\[\\psi\(x\_\{s\}\)^\{\\top\}\(\\hat\{\\rho\}\_\{at\}\-\\rho\_\{a\}\)\]^\{2\}\\leq\-\(H\_\{at\}\(\\hat\{\\rho\}\_\{at\}\)\+J\_\{at\}\(\\hat\{\\rho\}\_\{at\}\)\)\+1\.\(72\)Lemma[4\.3](https://arxiv.org/html/2606.07392#S4.Thmtheorem3)implies that with probability at least1−δ/31\-\\delta/3, uniformly over allt∈\[T\]t\\in\[T\]anda∈\[A\]a\\in\[A\]we have

$$
-H_{at}(\hat{\rho}_{at})\leq\frac{\kappa\mu_{1}}{8}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}+\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{\kappa\mu_{1}}{8},
\tag{73}
$$

where

$$
\Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(Q_{at}+1)+\log\left(\frac{6AT}{\delta}\right).
$$

Lemma [5.3](https://arxiv.org/html/2606.07392#S5.Thmtheorem3) implies

$$
-J_{at}(\hat{\rho}_{at})\leq\frac{\kappa\mu_{1}}{8}\sum_{s\in\mathcal{S}_{at}}\{\psi(x_{s})^{\top}(\hat{\rho}_{at}-\rho_{a})\}^{2}+\frac{2}{\kappa\mu_{1}}\sum_{s\in\mathcal{S}_{at}}L^{2}\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\}^{2}.
$$

Hence, combining this inequality with ([73](https://arxiv.org/html/2606.07392#S11.E73)), with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, $a\in[A]$, we have

$$
\begin{array}{rl}
&\quad\displaystyle-(H_{at}(\hat{\rho}_{at})+J_{at}(\hat{\rho}_{at}))\\
&\displaystyle\leq\frac{\kappa\mu_{1}}{4}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}+\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+4+\frac{\kappa\mu_{1}}{8}\\
&\quad\displaystyle+\frac{2L^{2}}{\kappa\mu_{1}}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\}^{2}.
\end{array}
\tag{74}
$$

Thus ([72](https://arxiv.org/html/2606.07392#S11.E72)) further implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$, we have

$$
\begin{array}{rl}
\displaystyle\frac{\kappa\mu_{1}}{4}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{s})\}^{2}&\displaystyle\leq\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+5+\frac{\kappa\mu_{1}}{8}\\
&\quad\displaystyle+\frac{2L^{2}}{\kappa\mu_{1}}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\}^{2}.
\end{array}
\tag{75}
$$

Combining this with the fact that $\eta_{1}\|\hat{\rho}_{at}-\rho_{a}\|^{2}\leq\eta_{1}d_{\mathcal{B}}^{2}$, and recalling that

$$
V_{at}(\eta_{1})=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top},
$$

the above inequality then implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$, we have

$$
\begin{array}{rl}
\displaystyle\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}^{2}&\displaystyle\leq\frac{4}{\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_{1}}+\frac{1}{2}+\eta_{1}d_{\mathcal{B}}^{2}\\
&\quad\displaystyle+\frac{8L^{2}}{(\kappa\mu_{1})^{2}}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\}^{2}.
\end{array}
\tag{76}
$$

Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2) implies that with probability at least $1-\delta/3$, uniformly over all $t\in[T]$ and $a\in[A]$,

$$
\frac{8L^{2}}{(\kappa\mu_{1})^{2}}\sum_{s\in\mathcal{S}_{at}}\{(\hat{\theta}_{t-1}-\theta_{*})^{\top}\phi(x_{s},\omega_{as})\}^{2}\leq\frac{8L^{2}}{(\kappa\mu_{1})^{2}}\bar{C}_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2},
\tag{77}
$$

where

$$
\bar{C}_{at}(\delta):=\frac{4}{\min\{1,\underline{\mu}\}^{2}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(3/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg)^{2}.
$$

Thus, by a union bound over the events underlying ([76](https://arxiv.org/html/2606.07392#S11.E76)) and ([77](https://arxiv.org/html/2606.07392#S11.E77)), with probability at least $1-2\delta/3$ we have

$$
\begin{array}{rl}
\displaystyle\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}&\displaystyle\leq\sqrt{\frac{4}{\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_{1}}+\frac{1}{2}+\eta_{1}d_{\mathcal{B}}^{2}}\\
&\quad\displaystyle+\frac{2\sqrt{2}L}{\kappa\mu_{1}}\sqrt{\bar{C}_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2}}.
\end{array}
\tag{78}
$$

By the Cauchy-Schwarz inequality, $|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|\leq\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}$, so combining this with ([78](https://arxiv.org/html/2606.07392#S11.E78)), the result follows. $\blacksquare$
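The step from (76) and (77) to (78) is stated without comment. As a reading aid, the sketch below spells it out using only the elementary inequality $\sqrt{x+y}\leq\sqrt{x}+\sqrt{y}$ for $x,y\geq 0$. The shorthand $B_{at}$ and $P_{at}$ is introduced here purely for illustration and is not notation from the paper.

```latex
% Illustrative shorthand (not from the paper):
%   B_{at}: the part of the right-hand side of (76) that does not involve \hat\theta_{t-1}
%   P_{at}: the feature sum on the right-hand side of (77)
B_{at} := \frac{4}{\kappa\mu_1}\left(\frac{144}{\kappa\mu_1}
          + 2\min\{2\overline{\iota}, \bar{C}_\psi d_{\mathcal{B}}\} + 6\right)\Gamma_{at}(\delta)
          + \frac{20}{\kappa\mu_1} + \frac{1}{2} + \eta_1 d_{\mathcal{B}}^2,
\qquad
P_{at} := \sum_{s\in\mathcal{S}_{at}} \|\phi(x_s,\omega_{as})\|_{\Phi_{t-1}^{-1}}^2 .

% Substituting (77) into (76), on the intersection of the two events:
\|\hat\rho_{at}-\rho_a\|_{V_{at}(\eta_1)}^2
  \le B_{at} + \frac{8L^2}{(\kappa\mu_1)^2}\,\bar{C}_{at}(\delta)\,P_{at}.

% Taking square roots and using \sqrt{x+y} \le \sqrt{x} + \sqrt{y}:
\|\hat\rho_{at}-\rho_a\|_{V_{at}(\eta_1)}
  \le \sqrt{B_{at}} + \sqrt{\frac{8L^2}{(\kappa\mu_1)^2}}\,\sqrt{\bar{C}_{at}(\delta)\,P_{at}}
  =   \sqrt{B_{at}} + \frac{2\sqrt{2}\,L}{\kappa\mu_1}\,\sqrt{\bar{C}_{at}(\delta)\,P_{at}} .
```

Each of (76) and (77) fails with probability at most $\delta/3$, which is where the combined failure probability $2\delta/3$ comes from.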

## 12 Regret under Unknown Reward Function

###### Proof 12\.1

Proof of Theorem [5.5](https://arxiv.org/html/2606.07392#S5.Thmtheorem5). Let $\mathcal{E}$ denote the event that the following conditions (i)-(ii) hold:

- (i) Uniformly over all $a\in[A]$, $t\in[T]$,

  $$
  \begin{array}{rl}
  &\quad|(\theta_{*}-\hat{\theta}_{t-1})^{\top}\phi(x_{t},\omega_{at})|\\
  &\displaystyle\leq\frac{2}{\min\{\underline{\mu},1\}}\|\phi(x_{t},\omega_{at})\|_{\Phi_{t-1}^{-1}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(2/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg);
  \end{array}
  $$
- (ii) Uniformly over all $a\in[A]$, $t\in[T]$,

  $$
  \begin{array}{rl}
  \displaystyle|(\hat{\rho}_{at}-\rho_{a})^{\top}\psi(x_{t})|&\displaystyle\leq\bigg[\sqrt{\frac{4}{\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_{1}}+\frac{1}{2}+\eta_{1}d_{\mathcal{B}}^{2}}\\
  &\quad\displaystyle+\frac{2\sqrt{2}L}{\kappa\mu_{1}}\sqrt{C_{at}(\delta)\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2}}\bigg]\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},
  \end{array}
  $$

  where $\displaystyle V_{at}(\eta_{1})=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top}$, $Q_{at}:=\left\lceil\log_{2}\left(1+n_{at}\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\right)\right\rceil$,

  $$
  \Gamma_{at}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(2Q_{at}+2)+\log\left(6AT/\delta\right),
  $$

  $$
  C_{at}(\delta):=\frac{4}{\min\{\underline{\mu},1\}^{2}}\bigg(\gamma_{0}\sqrt{d\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})+2\log(6/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg)^{2}.
  $$

Thus on $\mathcal{E}$, $\widetilde{\mu}_{t}(x_{t},\omega_{at})\geq\mu^{*}(x_{t},\omega_{at})$ holds uniformly over $a\in[A]$, $t\in[T]$, and $\widetilde{\sigma}_{at}\geq\sigma_{at}$ holds uniformly over all $a\in[A]$, $t\in[T]$. Equations ([21](https://arxiv.org/html/2606.07392#S5.E21)) and ([27](https://arxiv.org/html/2606.07392#S5.E27)), Lemma [9.2](https://arxiv.org/html/2606.07392#S9.Thmtheorem2), and Proposition [11.2](https://arxiv.org/html/2606.07392#S11.Thmtheorem2) imply that $\mathcal{E}$ holds with probability at least $1-\delta$. Theorem [3.3](https://arxiv.org/html/2606.07392#S3.Thmtheorem3) implies

$$
\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\ \big|\ \mathcal{E}\right]\leq\mathbb{E}\left[\sum_{t=1}^{T}\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu(\theta_{*};x_{t},\omega_{a_{t}t})\ \big|\ \mathcal{E}\right]+\mathbb{E}\left[\sum_{t=1}^{T}\sum_{a\in\mathcal{A}_{t}}(\widetilde{\sigma}_{at}-\sigma_{at})\ \big|\ \mathcal{E}\right].
\tag{79}
$$

On $\mathcal{E}$, using the fact that

$$
\begin{array}{rl}
&\quad\displaystyle\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu(\theta_{*};x_{t},\omega_{a_{t}t})\\
&\displaystyle\leq\left|\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-G(\hat{\theta}_{t-1}^{\top}\phi(x_{t},\omega_{a_{t}t}))\right|+\left|G(\hat{\theta}_{t-1}^{\top}\phi(x_{t},\omega_{a_{t}t}))-\mu(\theta_{*};x_{t},\omega_{a_{t}t})\right|,
\end{array}
$$

and the Lipschitz property of $G$ by Assumption [5](https://arxiv.org/html/2606.07392#S5), we have

$$
\begin{array}{rl}
&\quad\displaystyle\sum_{t=1}^{T}\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu(\theta_{*};x_{t},\omega_{a_{t}t})\\
&\displaystyle\leq\frac{4L}{\min\{\underline{\mu},1\}}\bigg(\gamma_{0}\sqrt{d\log(1+T\bar{C}_{\phi}^{2}/\eta_{0})+2\log(2/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg)\sum_{t=1}^{T}\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}},
\end{array}
$$

where the period-dependent factor $\log(1+t\bar{C}_{\phi}^{2}/\eta_{0})$ has been bounded by its value at $t=T$ so that it can be pulled out of the sum. Since $\mathrm{det}(\Phi_{t})=\mathrm{det}(\Phi_{t-1})(1+\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2})$ and $\Phi_{0}=\eta_{0}\mathbf{I}_{d}$, we have

$$
\begin{array}{rl}
\displaystyle\log\frac{\mathrm{det}(\Phi_{T})}{\mathrm{det}(\Phi_{0})}&\displaystyle=\sum_{t=1}^{T}\log(1+\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2})\\
&\displaystyle\geq_{(i)}\sum_{t=1}^{T}\frac{\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2}}{1+\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2}}\geq_{(ii)}\sum_{t=1}^{T}\frac{\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2}}{1+\bar{C}_{\phi}^{2}/\eta_{0}},
\end{array}
$$

where (i) uses the fact that $\log(1+a)\geq\frac{a}{1+a}$ for $a\geq 0$ and (ii) uses $\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2}\leq\bar{C}_{\phi}^{2}/\eta_{0}$. Hence the above inequality implies that

$$
\sum_{t=1}^{T}\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2}\leq(1+\bar{C}_{\phi}^{2}/\eta_{0})\log\frac{\mathrm{det}(\Phi_{T})}{\mathrm{det}(\Phi_{0})}\leq_{(1)}(1+\bar{C}_{\phi}^{2}/\eta_{0})d\log\left(1+T\bar{C}_{\phi}^{2}/(\eta_{0}d)\right),
$$

where (1) follows because $\displaystyle\mathrm{det}(\Phi_{T})\leq\left(\frac{\mathrm{tr}(\Phi_{T})}{d}\right)^{d}\leq\left(\eta_{0}+T\bar{C}_{\phi}^{2}/d\right)^{d}$ and $\mathrm{det}(\Phi_{0})=\eta_{0}^{d}$, so that $\displaystyle\log\frac{\mathrm{det}(\Phi_{T})}{\mathrm{det}(\Phi_{0})}\leq d\log\left(1+T\bar{C}_{\phi}^{2}/(\eta_{0}d)\right)$. Thus by the Cauchy-Schwarz inequality,

$$
\sum_{t=1}^{T}\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}\leq\sqrt{T\sum_{t=1}^{T}\|\phi(x_{t},\omega_{a_{t}t})\|_{\Phi_{t-1}^{-1}}^{2}}\leq\sqrt{T(1+\bar{C}_{\phi}^{2}/\eta_{0})d\log\left(1+T\bar{C}_{\phi}^{2}/(\eta_{0}d)\right)},
$$

implying that on $\mathcal{E}$, we have

$$
\sum_{t=1}^{T}\widetilde{\mu}_{t}(x_{t},\omega_{a_{t}t})-\mu(\theta_{*};x_{t},\omega_{a_{t}t})\leq(C_{\mu}\sqrt{d}+4L\gamma_{0}\sqrt{\eta_{0}}\bar{\alpha})\sqrt{T},
\tag{80}
$$

where

$$
C_{\mu}=\frac{4L\left[\gamma_{0}\sqrt{d\log(1+T\bar{C}_{\phi}^{2}/\eta_{0})+2\log(2/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\right]}{\min\{\underline{\mu},1\}}\sqrt{\left(1+\bar{C}_{\phi}^{2}/\eta_{0}\right)\log\left(1+T\bar{C}_{\phi}^{2}/(\eta_{0}d)\right)}.
\tag{81}
$$

Additionally, Assumption [5.2](https://arxiv.org/html/2606.07392#S5.SS2) and Proposition [10.10](https://arxiv.org/html/2606.07392#S10.Thmtheorem10) imply that with probability at least $1-\delta/3$,

$$
\lambda_{\min}(\Phi_{t-1})\geq(t-1)/[16\,\mathrm{polylog}(T)]
$$

holds uniformly for all $t\geq\tilde{\tau}_{0}:=100\bar{C}_{\phi}^{2}\mathrm{polylog}(T)\log(6TAd/\delta)$ and $a\in[A]$, under which

$$
\sum_{s\in\mathcal{S}_{at}}\|\phi(x_{s},\omega_{as})\|_{\Phi_{t-1}^{-1}}^{2}\leq\frac{16\,\mathrm{polylog}(T)\bar{C}_{\phi}^{2}n_{at}}{t-1}\leq 16\,\mathrm{polylog}(T)\bar{C}_{\phi}^{2},
$$

where the last inequality uses the fact that $n_{at}\leq t$. Hence, combining ([76](https://arxiv.org/html/2606.07392#S11.E76)) and ([77](https://arxiv.org/html/2606.07392#S11.E77)), with probability at least $1-\delta$, uniformly over all $t\geq100\bar{C}_{\phi}^{2}\mathrm{polylog}(T)\log(6TAd/\delta)$ we have

$$
\begin{array}{rl}
\displaystyle\|\hat{\rho}_{at}-\rho_{a}\|_{V_{at}(\eta_{1})}&\displaystyle\leq\sqrt{\frac{4}{\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\Gamma_{at}(\delta)+\frac{20}{\kappa\mu_{1}}+\frac{1}{2}+\eta_{1}d_{\mathcal{B}}^{2}}\\
&\quad\displaystyle+\frac{8\sqrt{2}L}{\kappa\mu_{1}}\bar{C}_{\phi}\sqrt{\bar{C}_{at}(\delta)\mathrm{polylog}(T)}.
\end{array}
$$

Furthermore, by ([27](https://arxiv.org/html/2606.07392#S5.E27)) and the Lipschitz property of $\Lambda$ by Assumption [5](https://arxiv.org/html/2606.07392#S4.E5), conditional on event $\mathcal{E}$, with probability at least $1-\delta$, uniformly over all $a\in[A]$, $t\geq\tilde{\tau}_{0}=100\bar{C}_{\phi}^{2}\mathrm{polylog}(T)\log(6TAd/\delta)$,

$$
\widetilde{\sigma}_{at}-\sigma_{at}\leq C_{\sigma}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}},
\tag{82}
$$

where

$$
\begin{array}{rl}
C_{\sigma}=2L\bigg[&\displaystyle\sqrt{\frac{4}{\kappa\mu_{1}}\left(\frac{144}{\kappa\mu_{1}}+2\min\{2\overline{\iota},\bar{C}_{\psi}d_{\mathcal{B}}\}+6\right)\overline{\Gamma}_{a}(\delta)+\frac{20}{\kappa\mu_{1}}+\frac{1}{2}+\eta_{1}d_{\mathcal{B}}^{2}}\\
&\quad\displaystyle+\frac{8\sqrt{2}L}{\kappa\mu_{1}}\bar{C}_{\phi}\sqrt{\bar{C}_{a}(\delta)\mathrm{polylog}(T)}\bigg],
\end{array}
\tag{83}
$$

and $\overline{\Gamma}_{a}(\delta)$, $\overline{Q}_{a}$, $\bar{C}_{a}(\delta)$ are defined as

$$
\overline{\Gamma}_{a}(\delta):=m\log(1+2d_{\mathcal{B}}\bar{C}_{\psi}T)+\log(2\overline{Q}_{a}+2)+\log\left(\frac{6AT}{\delta}\right),\quad\overline{Q}_{a}:=\left\lceil\log_{2}\left(1+T\min\{9\overline{\iota}^{2},d_{\mathcal{B}}^{2}\bar{C}_{\psi}^{2}\}\right)\right\rceil,
$$

$$
\bar{C}_{a}(\delta):=\frac{4}{\min\{\underline{\mu},1\}^{2}}\bigg(\gamma_{0}\sqrt{d\log(1+T\bar{C}_{\phi}^{2}/\eta_{0})+2\log(6/\delta)}+\sqrt{\eta_{0}}\bar{\alpha}\bigg)^{2}.
$$

Define $\displaystyle\tilde{V}_{a,n_{at}}:=\eta_{1}\mathbf{I}_{m}+\sum_{s\in\mathcal{S}_{at}}\psi(x_{s})\psi(x_{s})^{\top}=V_{at}(\eta_{1})$ and let $\tilde{V}_{a,0}:=\eta_{1}\mathbf{I}_{m}$. Note that

$$
\mathrm{det}(\tilde{V}_{a,n_{at}})=\mathrm{det}\left(\tilde{V}_{a,n_{at}-1}\right)\left(1+\|\psi(x_{t_{a}(n_{at})})\|_{\tilde{V}_{a,n_{at}-1}^{-1}}^{2}\right),
$$

where $t_{a}(n_{at})$ is the period in which $a$ is queried for the $n_{at}$-th time. Thus

$$
\begin{array}{rl}
\displaystyle\log\frac{\mathrm{det}(\tilde{V}_{a,n_{aT}})}{\mathrm{det}(\tilde{V}_{a,0})}&\displaystyle=\sum_{s\in\mathcal{S}_{aT}}\log\left(1+\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}\right)\\
&\displaystyle\geq_{(i)}\sum_{s\in\mathcal{S}_{aT}}\frac{\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}}{1+\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}}\geq_{(ii)}\sum_{s\in\mathcal{S}_{aT}}\frac{\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}}{1+\bar{C}_{\psi}^{2}/\eta_{1}},
\end{array}
$$

where inequality (i) uses the fact that $\log(1+a)\geq\frac{a}{1+a}$ for $a\geq 0$ and (ii) uses the fact that $\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}\leq\bar{C}_{\psi}^{2}/\eta_{1}$. Hence the above inequality implies that

$$
\sum_{s\in\mathcal{S}_{aT}}\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}\leq(1+\bar{C}_{\psi}^{2}/\eta_{1})\log\frac{\mathrm{det}(\tilde{V}_{a,n_{aT}})}{\mathrm{det}(\tilde{V}_{a,0})}\leq_{(1)}(1+\bar{C}_{\psi}^{2}/\eta_{1})m\log\left(1+n_{aT}\bar{C}_{\psi}^{2}/(\eta_{1}m)\right),
$$

where (1) follows since $\mathrm{det}(\tilde{V}_{a,0})=\eta_{1}^{m}$ and $\displaystyle\mathrm{det}(\tilde{V}_{a,n_{aT}})\leq\left(\frac{\mathrm{tr}(\tilde{V}_{a,n_{aT}})}{m}\right)^{m}\leq\left(\eta_{1}+\frac{n_{aT}\bar{C}_{\psi}^{2}}{m}\right)^{m}$, so that $\displaystyle\log\frac{\mathrm{det}(\tilde{V}_{a,n_{aT}})}{\mathrm{det}(\tilde{V}_{a,0})}\leq m\log\left(1+\frac{n_{aT}\bar{C}_{\psi}^{2}}{\eta_{1}m}\right)$. Further, for any $a\in[A]$,

$$
\begin{array}{rl}
\displaystyle\sum_{t=\tilde{\tau}_{0}}^{T}\|\psi(x_{t})\|_{\tilde{V}_{a,n_{at}}^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}&\displaystyle\leq\sum_{t=1}^{T}\|\psi(x_{t})\|_{\tilde{V}_{a,n_{at}}^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}=\sum_{s\in\mathcal{S}_{aT}}\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}\\
&\displaystyle\leq_{(i)}\sqrt{n_{aT}\sum_{s\in\mathcal{S}_{aT}}\|\psi(x_{s})\|_{\tilde{V}_{a,n_{as}}^{-1}}^{2}},
\end{array}
$$

where inequality (i) follows from the Cauchy-Schwarz inequality. Note that $n_{aT}\leq T$; thus for any $a\in[A]$,

$$
\sum_{t=\tilde{\tau}_{0}}^{T}\|\psi(x_{t})\|_{\tilde{V}_{a,n_{at}}^{-1}}\mathbb{I}\{a\in\mathcal{A}_{t}\}\leq\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.
$$

Combining with ([82](https://arxiv.org/html/2606.07392#S12.E82)), conditional on event $\mathcal{E}$, with probability at least $1-\delta$, uniformly over all $a\in[A]$, we have

$$
\begin{array}{rl}
\displaystyle\sum_{t=1}^{T}(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_{t}\}&\displaystyle\leq\sum_{t=1}^{\tilde{\tau}_{0}-1}(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_{t}\}+\sum_{t=\tilde{\tau}_{0}}^{T}(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_{t}\}\\
&\displaystyle\leq 2(\tilde{\tau}_{0}-1)+C_{\sigma}\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)},
\end{array}
$$

where the second inequality follows because for $t\in[1,\tilde{\tau}_{0}]$, $\widetilde{\sigma}_{at}-\sigma_{at}\leq 2$ by definition of the indices according to Assumption [5](https://arxiv.org/html/2606.07392#S4.E5). Hence, conditional on event $\mathcal{E}$, with probability at least $1-\delta$,

$$
\begin{array}{rl}
\displaystyle\sum_{t=1}^{T}\sum_{a\in\mathcal{A}_{t}}(\widetilde{\sigma}_{at}-\sigma_{at})&\displaystyle=\sum_{a\in[A]}\sum_{t=1}^{T}(\widetilde{\sigma}_{at}-\sigma_{at})\mathbb{I}\{a\in\mathcal{A}_{t}\}\\
&\displaystyle\leq 2A\tilde{\tau}_{0}+AC_{\sigma}\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.
\end{array}
$$

Note that by definition, $\Delta_{t}(\tilde{\pi})\leq 1+2A$ for any $t\in[T]$. So ([79](https://arxiv.org/html/2606.07392#S12.E79)) and ([80](https://arxiv.org/html/2606.07392#S12.E80)) together further imply that

$$
\begin{array}{rl}
\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\ \big|\ \mathcal{E}\right]&\displaystyle\leq(C_{\mu}\sqrt{d}+4L\gamma_{0}\sqrt{\eta_{0}}\bar{\alpha})\sqrt{T}+2A\tilde{\tau}_{0}+T(1+2A)\delta\\
&\quad\displaystyle+AC_{\sigma}\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.
\end{array}
\tag{84}
$$

Recall that $\mathcal{E}$ holds with probability at least $1-\delta$. Thus

$$
\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\mathbb{I}\{\mathcal{E}^{c}\}\right]\leq T(2A+1)\delta.
\tag{85}
$$

Thus, setting $\delta=1/\sqrt{T}$, ([84](https://arxiv.org/html/2606.07392#S12.E84)) and ([85](https://arxiv.org/html/2606.07392#S12.E85)) imply that, ignoring logarithmic factors, we have

$$
\begin{array}{rl}
\displaystyle\mathbb{E}\bigg[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\bigg]&\displaystyle\leq(C_{\mu}\sqrt{d}+4L\gamma_{0}\sqrt{\eta_{0}}\bar{\alpha}+4A+2)\sqrt{T}+2A\tilde{\tau}_{0}\\
&\displaystyle\quad+AC_{\sigma}\sqrt{T(1+\bar{C}_{\psi}^{2}/\eta_{1})m\log\left(1+T\bar{C}_{\psi}^{2}/(\eta_{1}m)\right)}.
\end{array}
\tag{86}
$$

Recall that $C_{\mu}$ and $C_{\sigma}$ are defined in ([81](https://arxiv.org/html/2606.07392#S12.E81)) and ([83](https://arxiv.org/html/2606.07392#S12.E83)). So, ignoring logarithmic factors, we have

$$
\mathbb{E}\left[\sum_{t=1}^{T}\Delta_{t}(\tilde{\pi})\right]\leq\widetilde{O}\left((d+Am+A\sqrt{dm})\sqrt{T}\right).
$$

Next, we compute $\eta_{0},\eta_{1}$ that minimize the order of the right-hand side of ([86](https://arxiv.org/html/2606.07392#S12.E86)) (ignoring logarithmic-factor dependence). In the following, for functions $f,g:\mathbb{N}\to\mathbb{R}_{+}$, we write $f(T)=\widetilde{\Theta}(g(T))$ if both $f(T)=\widetilde{O}(g(T))$ and $g(T)=\widetilde{O}(f(T))$ hold.
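The log-determinant argument used twice in the proof above (once for $\Phi_t$, once for $\tilde{V}_{a,n}$) can be checked numerically. The following is a minimal sketch, assuming randomly drawn features of norm at most `c_phi`; the function name, the feature distribution, and all parameter values are illustrative choices and are not taken from the paper. It accumulates the potential $\sum_t\|\phi_t\|^2_{\Phi_{t-1}^{-1}}$ and compares it with the two upper bounds in the chain.

```python
import numpy as np


def elliptical_potential_check(T=500, d=5, eta0=1.0, c_phi=1.0, seed=0):
    """Return (potential, log-det bound, trace bound) for random bounded features.

    potential   = sum_t ||phi_t||^2 in the Phi_{t-1}^{-1} norm
    logdet bound = (1 + c_phi^2/eta0) * log(det Phi_T / det Phi_0)
    trace bound  = (1 + c_phi^2/eta0) * d * log(1 + T c_phi^2 / (eta0 d))
    """
    rng = np.random.default_rng(seed)
    Phi = eta0 * np.eye(d)  # Phi_0 = eta0 * I_d
    potential = 0.0
    for _ in range(T):
        phi = rng.normal(size=d)
        # Rescale so that ||phi|| <= c_phi (hypothetical feature distribution).
        phi *= c_phi * rng.uniform() / np.linalg.norm(phi)
        # Squared norm with respect to the matrix *before* the rank-one update.
        potential += phi @ np.linalg.solve(Phi, phi)
        Phi += np.outer(phi, phi)

    _, logdet_T = np.linalg.slogdet(Phi)
    log_det_ratio = logdet_T - d * np.log(eta0)  # det(Phi_0) = eta0^d
    factor = 1.0 + c_phi**2 / eta0
    trace_bound = d * np.log(1.0 + T * c_phi**2 / (eta0 * d))
    return potential, factor * log_det_ratio, factor * trace_bound


if __name__ == "__main__":
    pot, logdet_bound, trace_bound = elliptical_potential_check()
    print(f"potential={pot:.3f}  logdet bound={logdet_bound:.3f}  trace bound={trace_bound:.3f}")
```

The three returned values should appear in nondecreasing order for any seed, since each inequality in the chain holds deterministically whenever the feature norms are bounded by `c_phi`.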

Suppressing logarithmic factors, we have $C_{\mu}=\widetilde{\Theta}\left(\sqrt{1+\frac{1}{\eta_{0}}}\right)$. Moreover, $\overline{\Gamma}_{a}(\delta)=\widetilde{\Theta}(m)$ and $\sqrt{\overline{C}_{a}(\delta)}=\widetilde{\Theta}\left(\sqrt{d}+\sqrt{\eta_{0}}\right)$. Therefore, $C_{\sigma}=\widetilde{\Theta}\left(\sqrt{m+\eta_{1}}+\sqrt{d}+\sqrt{\eta_{0}}\right)$. Substituting these estimates into the regret bound and dropping additive terms independent of $\eta_{0},\eta_{1}$, the relevant log-free objective is

$$
R(\eta_{0},\eta_{1})\asymp\sqrt{T}\left[d\sqrt{1+\frac{1}{\eta_{0}}}+\sqrt{\eta_{0}}+A\sqrt{m}\sqrt{1+\frac{1}{\eta_{1}}}\left(\sqrt{m+\eta_{1}}+\sqrt{d}+\sqrt{\eta_{0}}\right)\right].
$$

Since the factor $\sqrt{T}$ is common to all $\eta_{0},\eta_{1}$-dependent terms, it does not affect the minimizer. Define

$$
g(\eta_{1}):=\sqrt{1+\frac{1}{\eta_{1}}},\qquad b(\eta_{1}):=\sqrt{m+\eta_{1}}+\sqrt{d}.
$$

Then the relevant objective can be written as $r(\eta_{0},\eta_{1})=d\sqrt{1+\frac{1}{\eta_{0}}}+\left[1+A\sqrt{m}\,g(\eta_{1})\right]\sqrt{\eta_{0}}+A\sqrt{m}\,g(\eta_{1})b(\eta_{1})$. We first optimize $\eta_{0}$ for fixed $\eta_{1}$. Let $K(\eta_{1}):=1+A\sqrt{m}\,g(\eta_{1})$. The $\eta_{0}$-dependent part is $\displaystyle f(\eta_{0};\eta_{1})=d\sqrt{1+1/\eta_{0}}+K(\eta_{1})\sqrt{\eta_{0}}$. The first-order condition is $-\frac{d}{2\eta_{0}^{2}\sqrt{1+1/\eta_{0}}}+\frac{K(\eta_{1})}{2\sqrt{\eta_{0}}}=0$, or equivalently, $K(\eta_{1})=\frac{d}{\eta_{0}\sqrt{1+\eta_{0}}}$. Let $\rho(\eta_{1}):=\frac{d}{K(\eta_{1})}$, so that the condition reads $\eta_{0}\sqrt{1+\eta_{0}}=\rho(\eta_{1})$. Then the positive solution satisfies

$$\eta_{0}^{\star}(\eta_{1})\asymp\begin{cases}\rho(\eta_{1}),&\rho(\eta_{1})\lesssim 1,\\[3pt] \rho(\eta_{1})^{2/3},&\rho(\eta_{1})\gtrsim 1.\end{cases}$$

It remains to optimize $\eta_{1}$. By the envelope theorem,

$$\frac{d}{d\eta_{1}}r(\eta_{0}^{\star}(\eta_{1}),\eta_{1})=A\sqrt{m}\left[g^{\prime}(\eta_{1})\left(b(\eta_{1})+\sqrt{\eta_{0}^{\star}(\eta_{1})}\right)+g(\eta_{1})b^{\prime}(\eta_{1})\right].$$

Since $\frac{g^{\prime}(\eta_{1})}{g(\eta_{1})}=-\frac{1}{2\eta_{1}(\eta_{1}+1)}$ and $b^{\prime}(\eta_{1})=\frac{1}{2\sqrt{m+\eta_{1}}}$, the first-order condition is equivalent to

$$\eta_{1}(\eta_{1}+1)=\sqrt{m+\eta_{1}}\left[\sqrt{m+\eta_{1}}+\sqrt{d}+\sqrt{\eta_{0}^{\star}(\eta_{1})}\right].$$

Thus $\eta_{1}^{2}-m=\left[\sqrt{d}+\sqrt{\eta_{0}^{\star}(\eta_{1})}\right]\sqrt{m+\eta_{1}}$.

We next show that the term $\sqrt{\eta_{0}^{\star}(\eta_{1})}$ never changes the polynomial order of the $\eta_{1}$ minimizer. Since $K(\eta_{1})\geq 1$, we have $\rho(\eta_{1})\leq d$. If $\rho(\eta_{1})\lesssim 1$, then $\sqrt{\eta_{0}^{\star}(\eta_{1})}\asymp\rho(\eta_{1})^{1/2}\lesssim 1\lesssim\sqrt{d}$. If $\rho(\eta_{1})\gtrsim 1$, then $\sqrt{\eta_{0}^{\star}(\eta_{1})}\asymp\rho(\eta_{1})^{1/3}\lesssim d^{1/3}\lesssim\sqrt{d}$. Therefore, $\sqrt{d}+\sqrt{\eta_{0}^{\star}(\eta_{1})}\asymp\sqrt{d}$, and the $\eta_{1}$ balancing equation reduces to $\eta_{1}^{2}\asymp m+\sqrt{d}\,\sqrt{m+\eta_{1}}$.

We now solve this equation by considering different regimes. If $d\lesssim m$, then at $\eta_{1}\asymp\sqrt{m}$, $\sqrt{d}\,\sqrt{m+\eta_{1}}\lesssim m$. Hence $\eta_{1}^{\star}\asymp m^{1/2}$. If $m\lesssim d\lesssim m^{3}$, then the solution satisfies $\eta_{1}\lesssim m$, so $m+\eta_{1}\asymp m$. Hence $\eta_{1}^{2}\asymp\sqrt{dm}$, and therefore $\eta_{1}^{\star}\asymp(dm)^{1/4}$. If $d\gtrsim m^{3}$, then the solution satisfies $\eta_{1}\gtrsim m$, so $m+\eta_{1}\asymp\eta_{1}$. Hence $\eta_{1}^{2}\asymp\sqrt{d\eta_{1}}$, and therefore $\eta_{1}^{\star}\asymp d^{1/3}$. Combining the three regimes gives $\eta_{1}^{\star}\asymp\max\left\{m^{1/2},(dm)^{1/4},d^{1/3}\right\}$.
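As a numerical sanity check on the three regimes, the following minimal sketch solves the reduced balancing equation $\eta_{1}^{2}=m+\sqrt{d}\,\sqrt{m+\eta_{1}}$ by bisection and compares the root with $\max\{m^{1/2},(dm)^{1/4},d^{1/3}\}$. Replacing $\asymp$ by exact equality is an assumption made only for illustration, and the function names `solve_eta1` and `eta1_order` are hypothetical.

```python
import math


def solve_eta1(m: float, d: float) -> float:
    """Positive root of h(eta) = eta^2 - m - sqrt(d) * sqrt(m + eta).

    h is convex with h(0) < 0 for m, d > 0, so it has exactly one
    positive root, which bisection locates.
    """
    def h(eta: float) -> float:
        return eta * eta - m - math.sqrt(d) * math.sqrt(m + eta)

    lo, hi = 0.0, 1.0
    while h(hi) < 0.0:      # grow the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):    # shrink the bracket to floating-point precision
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eta1_order(m: float, d: float) -> float:
    """Claimed polynomial order max{m^(1/2), (dm)^(1/4), d^(1/3)}."""
    return max(m ** 0.5, (d * m) ** 0.25, d ** (1.0 / 3.0))


if __name__ == "__main__":
    # One illustrative point per regime: d <~ m, m <~ d <~ m^3, d >~ m^3.
    for m, d in [(100.0, 1.0), (10.0, 1000.0), (2.0, 1.0e6)]:
        root = solve_eta1(m, d)
        print(f"m={m:g}, d={d:g}: root={root:.3f}, order={eta1_order(m, d):.3f}")
```

In each regime the exact root should stay within a constant factor of the claimed order, which is all the $\asymp$ statement asserts.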

At this optimizer, $g(\eta_{1}^{\star})=\Theta(1)$. Therefore, $K(\eta_{1}^{\star})=1+A\sqrt{m}\,g(\eta_{1}^{\star})\asymp A\sqrt{m}$. Substituting this into the expression for $\eta_{0}^{\star}$ gives $\rho(\eta_{1}^{\star})=\frac{d}{K(\eta_{1}^{\star})}\asymp\frac{d}{A\sqrt{m}}$. Thus

$$\eta_{0}^{\star}\asymp\begin{cases}\dfrac{d}{A\sqrt{m}},&d\lesssim A\sqrt{m},\\[5pt] \left(\dfrac{d}{A\sqrt{m}}\right)^{2/3},&d\gtrsim A\sqrt{m}.\end{cases}$$

Equivalently, $\eta_{0}^{\star}\asymp\min\left\{\frac{d}{A\sqrt{m}},\left(\frac{d}{A\sqrt{m}}\right)^{2/3}\right\}$. Consequently, suppressing logarithmic factors, the optimal polynomial choices are

$$\eta_{1}^{\star}\asymp\max\left\{m^{1/2},(dm)^{1/4},d^{1/3}\right\},\qquad \eta_{0}^{\star}\asymp\min\left\{\frac{d}{A\sqrt{m}},\left(\frac{d}{A\sqrt{m}}\right)^{2/3}\right\}.\qquad\blacksquare$$
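The two-case form of $\eta_{0}^{\star}$ can be checked in the same way. The sketch below solves the first-order condition $\eta_{0}\sqrt{1+\eta_{0}}=\rho$ exactly and compares the root with $\min\{\rho,\rho^{2/3}\}$. The helper names are hypothetical, and the values of $d$ and $K$ in the demo are arbitrary illustrative choices rather than quantities from the analysis.

```python
import math


def solve_eta0(rho: float) -> float:
    """Positive root of eta * sqrt(1 + eta) = rho, by bisection.

    The left-hand side is increasing in eta and is at least rho at
    eta = max(rho, 1), so [0, max(rho, 1)] brackets the root.
    """
    lo, hi = 0.0, max(rho, 1.0)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.sqrt(1.0 + mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def eta0_order(rho: float) -> float:
    """Claimed order min{rho, rho^(2/3)}."""
    return min(rho, rho ** (2.0 / 3.0))


def f(eta0: float, d: float, K: float) -> float:
    """eta0-dependent part of the objective: d*sqrt(1 + 1/eta0) + K*sqrt(eta0)."""
    return d * math.sqrt(1.0 + 1.0 / eta0) + K * math.sqrt(eta0)


if __name__ == "__main__":
    d, K = 50.0, 7.0              # illustrative values only
    eta0 = solve_eta0(d / K)
    print(f"eta0*={eta0:.4f}, order={eta0_order(d / K):.4f}, f={f(eta0, d, K):.4f}")
```

Because $\eta_{0}\sqrt{1+\eta_{0}}$ is increasing, the stationary point is unique and is the global minimizer of $f(\cdot\,;\eta_{1})$ on $(0,\infty)$.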

**Algorithm 2** COSMOS Policy Algorithm

**Initialization.** Observe an initial context $x_{0}$. Query each box $a\in[A]$ once and record the observed output-cost pairs $\{(\omega_{a0},c_{a0})\}_{a\in[A]}$.

**for** $t=1,2,\ldots,T$ **do**

- Observe context $x_{t}$. Initialize $\mathcal{A}_{t}=\emptyset$ and $M_{t}=-\infty$;
- For each $a\in[A]$, compute $\hat{\rho}_{at}$ by ([23](https://arxiv.org/html/2606.07392#S5.E23)) and set $\widetilde{\sigma}_{at}=\Lambda\left(\hat{\rho}_{at}^{\top}\psi(x_{t})+B_{at}\|\psi(x_{t})\|_{V_{at}(\eta_{1})^{-1}}\right)$, where $B_{at}$ is defined as in Proposition [5.4](https://arxiv.org/html/2606.07392#S5.Thmtheorem4) and $V_{at}(\eta_{1})$ is defined as in ([13](https://arxiv.org/html/2606.07392#S4.E13));
- Order the boxes so that $\widetilde{\sigma}_{(1)t}\geq\widetilde{\sigma}_{(2)t}\geq\cdots\geq\widetilde{\sigma}_{(A)t}$. Set $\widetilde{\sigma}_{(A+1)t}:=-\infty$;
- **for** $k=1,2,\ldots,A$ **do**
  - Query box $(k)$ and observe its output and cost $(\omega_{(k)t},c_{(k)t})$;
  - Add $(k)$ to the queried set: $\mathcal{A}_{t}\leftarrow\mathcal{A}_{t}\cup\{(k)\}$;
  - Compute the optimistic reward estimate $\widetilde{\mu}_{t}(x_{t},\omega_{(k)t})=G\left(\hat{\theta}_{t-1}^{\top}\phi(x_{t},\omega_{(k)t})+\beta_{t}\,\big\|\phi(x_{t},\omega_{(k)t})\big\|_{\Phi_{t-1}^{-1}}\right)$, where $\Phi_{t-1}$ is defined as in ([20](https://arxiv.org/html/2606.07392#S5.E20)) and $\beta_{t}$ is as given in Lemma [5.1](https://arxiv.org/html/2606.07392#S5.Thmtheorem1);
  - Update $M_{t}\leftarrow\max_{a\in\mathcal{A}_{t}}\widetilde{\mu}_{t}(x_{t},\omega_{at})$;
  - **if** $M_{t}\geq\widetilde{\sigma}_{(k+1)t}$, stop querying and **break**;
- **end for**
- Select and deploy $a_{t}\in\operatorname*{arg\,max}_{a\in\mathcal{A}_{t}}\widetilde{\mu}_{t}(x_{t},\omega_{at})$, and observe reward $r_{t}$;

**end for**
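The control flow of a single period of Algorithm 2 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the optimistic indices $\widetilde{\sigma}_{at}$ are assumed to be precomputed and passed in as `sigma_tilde`, the optimistic reward evaluator $\widetilde{\mu}_{t}$ is passed in as the callable `mu_tilde`, and the estimation steps (GMM, confidence radii) are omitted. The names `cosmos_round`, `query`, and `mu_tilde` are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosmos_round(
    sigma_tilde: Sequence[float],
    query: Callable[[int], Tuple[object, float]],
    mu_tilde: Callable[[object], float],
) -> Tuple[int, List[int], float]:
    """Run the query and selection phases for one period.

    sigma_tilde[a] : optimistic reservation index of box a (assumed given).
    query(a)       : calls box a and returns (output, cost).
    mu_tilde(out)  : optimistic reward estimate of an output.
    Returns (deployed box, boxes queried in order, total query cost).
    """
    # Visit boxes in decreasing order of optimistic index.
    order = sorted(range(len(sigma_tilde)), key=lambda a: sigma_tilde[a], reverse=True)
    queried: List[int] = []
    total_cost = 0.0
    best_box, best_val = -1, -math.inf   # running argmax and M_t
    for k, a in enumerate(order):
        output, cost = query(a)
        queried.append(a)
        total_cost += cost
        val = mu_tilde(output)
        if val > best_val:
            best_box, best_val = a, val
        # Index of the next unopened box; -inf once every box is open.
        next_index = sigma_tilde[order[k + 1]] if k + 1 < len(order) else -math.inf
        if best_val >= next_index:       # stopping rule M_t >= sigma_(k+1)t
            break
    return best_box, queried, total_cost


if __name__ == "__main__":
    rewards = {0: 0.3, 1: 0.6, 2: 0.95}  # toy optimistic rewards per box
    result = cosmos_round([0.9, 0.5, 0.2], lambda a: (a, 0.1), lambda out: rewards[out])
    print(result)
```

In the toy run, box 0 is opened first but its estimate 0.3 is below the next index 0.5, so box 1 is opened. Its estimate 0.6 exceeds the remaining index 0.2, so querying stops and box 1 is deployed even though the unopened box 2 would have scored higher. This is the usual behavior of an index policy that trades query cost against potential gain.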

