Learning in Markovian bandits with non-observable states and constrained decision epochs

arXiv cs.LG 06/29/26, 04:00 AM Papers
Summary
This paper studies regret minimization in Markovian bandits with non-observable states and constrained decision epochs, introducing a generalization called self-degrading Markovian bandits. The authors propose the UCB-NOM algorithm that achieves nearly logarithmic regret and provide bounds that do not depend on the number of states.
arXiv:2606.27448v1 Announce Type: new Abstract: This paper studies the problem of regret minimization in Markovian bandits with \emph{non-observable states} and possibly \emph{constrained} decision epochs. The focus is restricted to a ``pure'' regret benchmark, that compares the performance of the learning algorithm to the best \emph{pure policy} which -- akin to optimal policies of stochastic bandits -- picks the optimal arm from start to finish without ever switching. We introduce a generalization of rested Markovian bandits, \emph{self-degrading Markovian bandits}, for which pure policies are always asymptotically optimal.We show that without prior knowledge on the underlying bandit, the regret of algorithms that switch arms rarely necessarily scales super-logarithmically for every bandit, i.e., as $\omega(\log(T))$, where $T$ is the learning horizon. Despite the unreachability of the logarithmic regime, we design UCB-NOM, an optimistic algorithm inspired by UCB, of which the regret is nearly logarithmic. Lastly, we show that given prior knowledge on the Markovian bandit in the form of a bound on the bias functions of its arm, a proper instantiation of UCB-NOM achieves $O(\log(T))$ regret. We further show that this prior knowledge allows for a $O(\sqrt{T \log(T)})$ worst-case regret bound for UCB-NOM. Notably, our regret bounds do not depend on the number of states of the underlying Markov chains. Our findings suggest that the non-observability of states is a mild inconvenience in self-degrading Markovian bandits.
Original Article
View Cached Full Text
Cached at: 06/29/26, 05:22 AM
# Learning in Markovian bandits with non-observable states and constrained decision epochs
Source: [https://arxiv.org/html/2606.27448](https://arxiv.org/html/2606.27448)
Thomas Hirathomas\.hira@irit\.fr Victor Boone11footnotemark:122footnotemark:2victor\.boone@irit\.fr Urtzi Ayesta22footnotemark:2urtzi\.ayesta@irit\.fr Ina Maria Verloop22footnotemark:2ina\-maria\.verloop@irit\.frEqual contribution\.IRIT, Université de Toulouse, CNRS, Toulouse INP, Toulouse, FranceIkerbasque\-UPV/EHU, University of the Basque Country, Bilbao, Spain

\( March, 2026 \)

###### Abstract

This paper studies the problem of regret minimization in Markovian bandits with*non\-observable states*and possibly*constrained*decision epochs\. The focus is restricted to a “pure” regret benchmark, that compares the performance of the learning algorithm to the best*pure policy*which—akin to optimal policies of stochastic bandits—picks the optimal arm from start to finish without ever switching\. We introduce a generalization of rested Markovian bandits,*self\-degrading Markovian bandits*, for which pure policies are always asymptotically optimal\. We show that without prior knowledge on the underlying bandit, the regret of algorithms that switch arms rarely necessarily scales super\-logarithmically for every bandit, i\.e\., asω\(log⁡\(T\)\)\\omega\(\\log\(T\)\), whereTTis the learning horizon\. Despite the unreachability of the logarithmic regime, we designUCB\-NOM, an optimistic algorithm inspired byUCB, of which the regret is nearly logarithmic\. Lastly, we show that given prior knowledge on the Markovian bandit in the form of a bound on the bias functions of its arm, a proper instantiation ofUCB\-NOMachievesO\(log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\log\(T\)\)regret\. We further show that this prior knowledge allows for aO\(Tlog⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\sqrt\{T\\log\(T\)\}\)worst\-case regret bound forUCB\-NOM\. Notably, our regret bounds do not depend on the number of states of the underlying Markov chains\. Our findings suggest that the non\-observability of states is a mild inconvenience in self\-degrading Markovian bandits\.

## 1Introduction

## 2Learning non\-observable Markovian bandits

We begin this paper by describing the learning problem and introducing notations\. Our learning problem is a class of Markovian bandits \([Section˜2\.1](https://arxiv.org/html/2606.27448#S2.SS1)\) in which states are non\-observable and decision epochs are possibly constrained\. InLABEL:section\_pure\_policies, we motivate*pure policies*as a benchmark by identifying subclasses of restless bandits in which they are actually optimal\. Lastly, in[Section˜2\.2](https://arxiv.org/html/2606.27448#S2.SS2), we motivate the focus on learning algorithms that switch arms rarely\.

### 2\.1Non\-observable Markovian bandits with decision epochs

We consider a Markovian bandit with arms𝒜\\mathcal\{A\}, where every arma∈𝒜a\\in\\mathcal\{A\}is modeled by a Markov decision processMa≡\(𝒮a,\{0,1\},νa,pa0,pa1,sainit\)M\_\{a\}\\equiv\(\\mathcal\{S\}\_\{a\},\\\{0,1\\\},\\nu\_\{a\},p\_\{a\}^\{0\},p\_\{a\}^\{1\},s^\{\\text\{init\}\}\_\{a\}\)with*activated/non\-activated*actions, states𝒮a\\mathcal\{S\}\_\{a\}, a fixed initial statesainits^\{\\text\{init\}\}\_\{a\}, a reward function agnostic to action choicesνa:𝒮a→𝒫\(\[0,1\]\)\\nu\_\{a\}:\\mathcal\{S\}\_\{a\}\\to\\mathcal\{P\}\(\[0,1\]\), and two transition kernelspa0,pa1:𝒮a→𝒫\(𝒮a\)p\_\{a\}^\{0\},p\_\{a\}^\{1\}:\\mathcal\{S\}\_\{a\}\\to\\mathcal\{P\}\(\\mathcal\{S\}\_\{a\}\)that govern a state’s evolution depending on its activation or non\-activation\. Mean rewards are denoted byra\(sa\):=∫xdνa\(sa\)\(x\)r\_\{a\}\(s\_\{a\}\):=\\int x\\,\\mathrm\{d\}\\nu\_\{a\}\(s\_\{a\}\)\(x\)\. The resulting system is obtained by bundling arms in parallel asM≡\(Ma\)a∈𝒜M\\equiv\(M\_\{a\}\)\_\{a\\in\\mathcal\{A\}\}\.

The interaction between the system and the controller goes as follows\. At timet≥1t\\geq 1, the controller selects a*single*armA\(t\)∈𝒜A\(t\)\\in\\mathcal\{A\}to activate, and the system’s state evolves as

Sa\(t\+1\)∼\{pa1\(Sa\(t\)\)ifA\(t\)=a;pa0\(Sa\(t\)\)ifA\(t\)≠a;\.S\_\{a\}\(t\+1\)\\sim\\begin\{cases\}p\_\{a\}^\{1\}\(S\_\{a\}\(t\)\)&\\text\{if $A\(t\)=a$;\}\\\\ p\_\{a\}^\{0\}\(S\_\{a\}\(t\)\)&\\text\{if $A\(t\)\\neq a$;\}\\end\{cases\}\\;\.Moreover, one receives an immediate rewardR\(t\+1\)∼νa\(Sa\(t\)\)R\(t\+1\)\\sim\\nu\_\{a\}\(S\_\{a\}\(t\)\)fora=A\(t\)a=A\(t\),111Because the activated arm is the only one to produce a reward, the assumption that rewards are agnostic to actions can be made without loss of generality\. Equivalently, we could have assumed non\-activated arms produce no reward\.independently of the state’s generation\. By convention,R\(1\)=0R\(1\)=0\. All arms evolve independently conditionally on the current state and activated arm\. In the sequel, the*number of pulls*of armaawill be denotedNa\(T\):=∑t=1T−1𝟏\(A\(t\)=a\)N\_\{a\}\(T\):=\\sum\_\{t=1\}^\{T\-1\}\\mathbf\{1\}\\left\(A\(t\)=a\\right\)\. To simplify the exposition, we further assume that transition kernels under activationspa1p\_\{a\}^\{1\}are unichain — this assumption is fairly standard in restless Markovian bandits\.

###### Assumption 1\.

For everya∈𝒜a\\in\\mathcal\{A\},pa1p\_\{a\}^\{1\}is*unichain*, i\.e\., it has a unique recurrent component of states\.

#### 2\.1\.1Constrained decision epochs

In opposition to standard restless Markovian bandits, we are interested in systems in which the controller is constrained to switch arms*only at decision epochs*, induced by the stream of rewards\. More specifically, epochs consist of an increasing sequence\(τi\)i≥1\(\\tau\_\{i\}\)\_\{i\\geq 1\}of stopping times under the filtration induced by the system’s rewards\(R\(t\)\)t≥1\(R\(t\)\)\_\{t\\geq 1\}\. In\-between two decision epochs, the controller is not allowed to switch arms, that is,

∀i,∀t∈\{τi,…,τi\+1−1\},A\(t\)=A\(τi\)\.\\forall i,\\forall t\\in\\\{\\tau\_\{i\},\\ldots,\\tau\_\{i\+1\}\-1\\\},\\quad A\(t\)=A\(\\tau\_\{i\}\)\.\(1\)Throughout, we make two assumptions on the structure of decision epochs\. In[Assumption˜3](https://arxiv.org/html/2606.27448#Thmassumption3)below,H\(t\)=\(S\(1\),A\(1\),R\(2\),S\(2\),A\(2\),R\(3\),…,S\(t\)\)H\(t\)=\(S\(1\),A\(1\),R\(2\),S\(2\),A\(2\),R\(3\),\\ldots,S\(t\)\)is the*history of play*, i\.e\., the vector of every state, every played arm and every emitted reward up to timett\.

###### Assumption 2\.

There exists a measurable set𝒰τ⊆\[0,1\]\\mathcal\{U\}\_\{\\tau\}\\subseteq\[0,1\]such thatτi\+1=inf\{t\>τi:R\(t\)∈𝒰τ\}\.\\tau\_\{i\+1\}=\\inf\\left\\\{t\>\\tau\_\{i\}:R\(t\)\\in\\mathcal\{U\}\_\{\\tau\}\\right\\\}\.

###### Assumption 3\.

There existsCτ∈R\+C\_\{\\tau\}\\in\\text\{R\}\_\{\+\}such that, for allt≥1t\\geq 1,E\[inf\{τi−t:τi≥t\}\|H\(t\)\]≤Cτ\.\\text\{E\}\\left\[\\inf\\left\\\{\\tau\_\{i\}\-t:\\tau\_\{i\}\\geq t\\right\\\}\|H\(t\)\\right\]\\leq C\_\{\\tau\}\.

[Assumption˜2](https://arxiv.org/html/2606.27448#Thmassumption2)means that decision epochs happen when the immediate reward falls within a special set\. For instance, if𝒰τ=\[0,1\]\\mathcal\{U\}\_\{\\tau\}=\[0,1\], then the decision epochs are trivial withτi=i\\tau\_\{i\}=ifor alli≥1i\\geq 1, so that[Assumption˜2](https://arxiv.org/html/2606.27448#Thmassumption2)captures systems without constraints on decision epochs in particular\.[Assumption˜3](https://arxiv.org/html/2606.27448#Thmassumption3)claims that the expected time before the next decision epoch is bounded uniformly over time conditionally on the history\. This assumption holds, for example, if, for every arma∈𝒜a\\in\\mathcal\{A\}, there exists a statesa∈𝒮as\_\{a\}\\in\\mathcal\{S\}\_\{a\}that is recurrent underpa1p\_\{a\}^\{1\}and such thatν\(sa\)\(𝒰τ\)\>0\\nu\(s\_\{a\}\)\(\\mathcal\{U\}\_\{\\tau\}\)\>0\. Under that scenario,CτC\_\{\\tau\}can typically be bounded in terms of the diameter ofpa1p\_\{a\}^\{1\}\(expected time required to travel from a state to another underpa1p\_\{a\}^\{1\}\) and the probabilityν\(sa\)\(𝒰τ\)\\nu\(s\_\{a\}\)\(\\mathcal\{U\}\_\{\\tau\}\)to emit the right reward from favorable states\.

#### 2\.1\.2Full bandit feedback: non\-observability of states

While the number of arms is known to the controller, the latter is unaware of the current statesSa\(t\)S\_\{a\}\(t\)of arms, nor of how many states\|𝒮a\|\|\\mathcal\{S\}\_\{a\}\|they can possibly be in\. In other words, states are non\-observable\.

Non\-observability implies that the choice of armA\(t\)A\(t\)depends exclusively on the past*history of observations*Ho\(t\):=\(A\(1\),R\(2\),A\(2\),…,R\(t\)\)H\_\{o\}\(t\):=\(A\(1\),R\(2\),A\(2\),\\ldots,R\(t\)\)\. In formal terms, the random armA\(τi\)A\(\\tau\_\{i\}\)isσ\(Ho\(t\)\)\\sigma\(H\_\{o\}\(t\)\)\-measurable\. Note that for each decision epochτi\\tau\_\{i\}, we have\{τi≤t\}∈σ\(R\(2\),…,R\(t\)\)\\\{\\tau\_\{i\}\\leq t\\\}\\in\\sigma\(R\(2\),\\ldots,R\(t\)\)\. That is, decision epochs are measurable with respect to the aggregate reward vector\. This assumption is absolutely necessary for our work, as the controller would be unable to properly change arm otherwise\.

### 2\.2Rarely switching algorithms and pseudo\-regret

In this section, we define the concept of rarely switching policies, which provide a natural class of learning rules when considering regret with respect to pure policies\. Indeed, such a learner should be able to identify the best arm by constructing a good estimator for the gaingag\_\{a\}for every arm\. The gain of a pure policyπa\\pi\_\{a\}only depends on the states that are visited infinitely many times underπa\\pi\_\{a\}; the set of*recurrent states*ofπa\\pi\_\{a\}\. When playing an arm to estimategag\_\{a\}, the process may initially lie outside this set, requiring several steps before reaching it and thus before the estimation ofgag\_\{a\}can begin\. The task is further complicated by the fact that\|𝒮a\|\|\\mathcal\{S\}\_\{a\}\|, the number of states of armaa, is unknown\.

0…\\ldotsiii\+1i\+1…\\ldotsnnr=12r=\\frac\{1\}\{2\}r=23r=\\frac\{2\}\{3\}

Figure 1:An arm that requires to be activated at leastnntimes in a row to be estimated correctly\. All transitions are deterministic, withpa1p\_\{a\}^\{1\}represented in black andpa0p\_\{a\}^\{0\}inred\.An example is provided in[Figure˜1](https://arxiv.org/html/2606.27448#S2.F1), where statennis absorbing underπa\\pi\_\{a\}and the gain of the arm isga=23g\_\{a\}=\\frac\{2\}\{3\}\. The latter can only be estimated accurately if, infinitely often during the learning process, the controller activates the arm more thannntimes in a row\. Asn≥1n\\geq 1can be arbitrarily large in general, it becomes tempting to reduce the number of times one switches from one arm to another as much as possible, and to measure the regret by*counting*the number of times bad arms are pulled\. This leads to the introduction of the*pseudo\-regret*:

PR\(T\):=∑a∈𝒜\(g∗−ga\)Na\(T\)\.\\mathrm\{PR\}\(T\):=\\sum\_\{a\\in\\mathcal\{A\}\}\\left\(g\_\{\*\}\-g\_\{a\}\\right\)N\_\{a\}\(T\)\.\(2\)It is named after its eponym version in multi\-armed bandits\(Lattimore and Szepesvári,[2020](https://arxiv.org/html/2606.27448#bib.bib34), §4\.9\)\.

The pseudo\-regret measures the performance of the learning algorithm as a sum of instantaneous costs: the individual cost of armaais given by the*gain\-gap*g∗−gag\_\{\*\}\-g\_\{a\}\(how far its gain is fromg∗g\_\{\*\}\), and the pseudo regret is the sum of gain\-gaps over time\. The pseudo\-regret is much easier to analyze than the regret, because it is directly expressed in terms of the number of pulls to sub\-optimal arms\. If the pseudo\-regret is large, then the regret is large\. The converse is false in general\. We now go over a class of learning algorithms for which the regret can nevertheless be upper\-bounded in terms of the pseudo\-regret:*rarely\-switching algorithms*\.

###### Definition 1\(Rarely switching algorithms\)\.

Given a sublinear smooth concave functionψ:R\+→R\+\\psi:\\text\{R\}\_\{\+\}\\to\\text\{R\}\_\{\+\}, we say that the algorithm is*ψ\\psi\-rarely switching*if for allM∈ℳM\\in\\mathcal\{M\}, it holds

∀a∈𝒜,∀T≥1,∑t=1T−1𝟏\(A\(t\)=a,A\(t\+1\)≠a\)≤ψ\(Na\(T\)\)\.\\forall a\\in\\mathcal\{A\},\\;\\forall T\\geq 1,\\quad\\sum\_\{t=1\}^\{T\-1\}\\mathbf\{1\}\\left\(A\(t\)=a,A\(t\+1\)\\neq a\\right\)\\leq\\psi\(N\_\{a\}\(T\)\)\.In other words, an algorithm is*rarely switching*if the number of switches to armaagrows sub\-linearly with respect toNa\(t\)N\_\{a\}\(t\), the number of pulls to armaa\.

When algorithms areψ\\psi\-rarely switching, the*pseudo\-regret*\([2](https://arxiv.org/html/2606.27448#S2.E2)\) becomes a natural proxy for the regret: If the algorithm switches arms rarely, then the two quantities are approximately the same \(at first order\), as stated by[Proposition˜1](https://arxiv.org/html/2606.27448#Thmtheorem1)thereafter\. This result encourages to fully focus on the study of pseudo\-regret\.

###### Proposition 1\(Regret and pseudo\-regret\)\.

LetM∈ℳM\\in\\mathcal\{M\}be an arbitrary model with a unique optimal arm\. LetΛ\\Lambdabe aψ\\psi\-rarely switching algorithm\. Then, there exists a positive smooth concave functionψ0=O\(ψ\)\\psi\_\{0\}=\\operatorname\*\{\\mathrm\{O\}\}\(\\psi\)such that, for every horizonT≥1T\\geq 1, we have

\|Reg\(T\)−E\[PR\(T\)\]\|≤ψ0\(E\[PR\(T\)\]\)\.\\left\|\\mathrm\{Reg\}\(T\)\-\\text\{E\}\[\\mathrm\{PR\}\(T\)\]\\right\|\\leq\\psi\_\{0\}\\left\(\\text\{E\}\[\\mathrm\{PR\}\(T\)\]\\right\)\.

The proof of[Proposition˜1](https://arxiv.org/html/2606.27448#Thmtheorem1)is deferred toLABEL:appendix\_rarely\_switching\.

Becauseψ0=O\(ψ\)\\psi\_\{0\}=\\operatorname\*\{\\mathrm\{O\}\}\(\\psi\)is a sublinear function,[Proposition˜1](https://arxiv.org/html/2606.27448#Thmtheorem1)states that the expected regret of a rarely switching algorithm isE\[PR\(T\)\]\+o\(E\[PR\(T\)\]\)\\text\{E\}\[\\mathrm\{PR\}\(T\)\]\+\\operatorname\*\{\\mathrm\{o\}\}\(\\text\{E\}\[\\mathrm\{PR\}\(T\)\]\)\. Accordingly, the regret of such algorithms can be analyzed by controlling the pseudo\-regret directly, which is much easier to handle\.

## 3General regret bounds

As a first sketch of the landscape of the theory, we attack the problem from an*instance\-dependent*viewpoint\. That is, given an instanceM∈ℳM\\in\\mathcal\{M\}and a learning algorithmΛ\\Lambda, we aim at boundingReg\(T;M,Λ\)\\mathrm\{Reg\}\(T;M,\\Lambda\)as accurately as possible whenT→∞T\\to\\infty\. This is in opposition to*worst\-case*, or*minimax*approaches, for which we would boundsupM∈ℳReg\(T;M,Λ\)\\sup\_\{M\\in\\mathcal\{M\}\}\\mathrm\{Reg\}\(T;M,\\Lambda\)\. The minimax approach is proved to be ill\-behaved whenℳ\\mathcal\{M\}is too large, while non\-trivial instance\-dependent bounds can always be obtained\.

Starting with the instance\-dependent setting, we show in[Section˜3\.1](https://arxiv.org/html/2606.27448#S3.SS1)that the expected regret \(LABEL:equation\_regret\_definition\) of every rarely\-switching algorithm grows super\-logarithmically \([Theorem˜2](https://arxiv.org/html/2606.27448#Thmtheorem2)\)\. We further conjecture that the rarely\-switching assumption can even be dropped on the space of self\-degrading instances\. These lower bounds provide a baseline for learning algorithms\. In particular, regret guarantees of orderO\(log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\log\(T\)\)are not achievable\. In[Section˜3\.2](https://arxiv.org/html/2606.27448#S3.SS2), we present our algorithmUCB\-NOM, a variant ofUCB2\(Auer \([2002](https://arxiv.org/html/2606.27448#bib.bib33)\)\), that takes as input a diverging*span\-proxy*functionffas well as a confidence functionδ:N→\[0,1\]\\delta:\\text\{N\}\\to\[0,1\]\. In[Section˜3\.3](https://arxiv.org/html/2606.27448#S3.SS3), we show that the instance\-dependent regret ofUCB\-NOMisO\(f\(T\)2log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(f\(T\)^\{2\}\\log\(T\)\), which implies that it can approximate the idealistic logarithmic regret regime arbitrarily close\. We conclude in[Section˜3\.4](https://arxiv.org/html/2606.27448#S3.SS4)by showing that minimax regret bounds are unattainable for generalℳ\\mathcal\{M\}\.

### 3\.1A super\-logarithmic regret lower bound

To establish*instance\-dependent*lower bounds, we must assume that the learning algorithm of interest is*consistent*, that is, that the algorithm learns the optimal policy in every possible instance inℳ\\mathcal\{M\}\. Formally, a learning algorithmΛ\\Lambdais*consistent*onℳ\\mathcal\{M\}if, for allM∈ℳM\\in\\mathcal\{M\}andε\>0\\varepsilon\>0, it achieves regretReg\(T;M,Λ\)=o\(Tε\),\\mathrm\{Reg\}\(T;M,\\Lambda\)=\\operatorname\*\{\\mathrm\{o\}\}\(T^\{\\varepsilon\}\),whenT→∞T\\to\\infty\. This definition is fairly standard in the multi\-armed bandit literature\(Lattimore and Szepesvári,[2020](https://arxiv.org/html/2606.27448#bib.bib34), §16\)and is necessary to derive the famous lower bound ofLai and Robbins \([1985](https://arxiv.org/html/2606.27448#bib.bib37)\)\. Below, we show that consistency is enough to derive a super\-logarithmic lower bound, provided that the algorithm isψ\\psi\-rarely switching in addition\.

###### Theorem 2\(Lower bound, rarely switching algorithms\)\.

LetΛ\\Lambdabe a learning algorithm that \(1\) is consistent onℳ\\mathcal\{M\}, and \(2\) isψ\\psi\-rarely switching\. For every instanceM∈ℳM\\in\\mathcal\{M\}withg∗<1g\_\{\*\}<1and unique optimal pure policy, it holds that

∀s0∈𝒮,Reg\(T;M,Λ\)=ω\(log⁡\(T\)\)\.\\forall s\_\{0\}\\in\\mathcal\{S\},\\quad\\mathrm\{Reg\}\(T;M,\\Lambda\)=\\omega\(\\log\(T\)\)\.\(3\)

In particular,[Equation˜3](https://arxiv.org/html/2606.27448#S3.E3)implies that no algorithm achievesO\(log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\log\(T\)\)regret, the gold standard for a large variety of RL problems, ranging from multi\-armed bandits \(Lai and Robbins \([1985](https://arxiv.org/html/2606.27448#bib.bib37)\); Auer \([2002](https://arxiv.org/html/2606.27448#bib.bib33)\)\) to Markov decision processes \(Auer and Ortner \([2006](https://arxiv.org/html/2606.27448#bib.bib32)\)\)\.

We further believe that the rarely switching property ofΛ\\Lambdacan be changed toM∈ℳsdM\\in\\mathcal\{M\}\_\{\\mathrm\{sd\}\}, i\.e\., that the regret scales super\-logarithmically on the class of self\-degrading instances, without requiring the rarely switching property\. We note that super\-logarithmic lower bounds have also been found in other settings, including multi\-armed bandits problems for which the eligible set of reward distributions is too large, see\(Lattimore and Szepesvári,[2020](https://arxiv.org/html/2606.27448#bib.bib34), Exercise 16\.4\)\.

###### Proof sketch of[Theorem˜2](https://arxiv.org/html/2606.27448#Thmtheorem2)\(see complete proof inLABEL:appendix\_lowerbound\_instance\_dependent\.\)\.

The proof involves the construction of an alternative modelMaεM\_\{a\}^\{\\varepsilon\}that \(*1*\) is nearly indistinguishable from the originalMaM\_\{a\}from observations only, and \(*2*\) has a different optimal arm\. The alternative modelMaεM\_\{a\}^\{\\varepsilon\}is constructed by adding anε\\varepsilon\-transition to an absorbing states∞s\_\{\\infty\}that yields maximal rewardR=1R=1, see[Figure˜2](https://arxiv.org/html/2606.27448#S3.F2)\. We constructMεM^\{\\varepsilon\}fromMMby picking a suboptimal armaaand changingMaM\_\{a\}toMaεM\_\{a\}^\{\\varepsilon\}, hence makingaathe best arm forMεM^\{\\varepsilon\}\.

Due to consistency, the algorithm learns the optimal policy in bothMMand inMεM^\{\\varepsilon\}—yet the two instances do not have the same optimal policy\. Hence, to play optimal inMM, the algorithm must reject the plausibility that the true underlying environment isMεM^\{\\varepsilon\}\. AsMMandMεM^\{\\varepsilon\}only differ at armaa, the algorithm can only rejectMεM^\{\\varepsilon\}by activating armaa\. As a result, this gives a lower bound on the number of pulls ofaa\.

Initial, trueMaM\_\{a\}MaM\_\{a\}Possible alternativeMaεM\_\{a\}^\{\\varepsilon\}MaM\_\{a\}s∞s\_\{\\infty\}ε\\varepsilonR=1R=1Figure 2:An illustration of the transformationMaεM\_\{a\}^\{\\varepsilon\}\.More precisely, we show thatEM,Λ\[Na\(T\)\]≥1εlog⁡\(T\)\\text\{E\}^\{M,\\Lambda\}\[N\_\{a\}\(T\)\]\\geq\\frac\{1\}\{\\varepsilon\}\\log\(T\)\. Sinceε\>0\\varepsilon\>0is arbitrary, we deduce thatEM,Λ\[Na\(T\)\]=ω\(log⁡\(T\)\)\\text\{E\}^\{M,\\Lambda\}\[N\_\{a\}\(T\)\]=\\omega\(\\log\(T\)\), which directly lower bounds the expected pseudo\-regret\.[Equation˜3](https://arxiv.org/html/2606.27448#S3.E3)now follows by linking the pseudo\-regret to the regret using theψ\\psi\-rarely switching property \([Proposition˜1](https://arxiv.org/html/2606.27448#Thmtheorem1)\)\. ∎

### 3\.2An indexed optimistic algorithm:UCB\-NOM

In this[Section˜3\.2](https://arxiv.org/html/2606.27448#S3.SS2), we presentUCB\-NOM\(AlgorithmLABEL:algo:UCB\-POMB\)\.UCB\-NOMfollows the*optimism\-in\-the\-face\-of\-uncertainty*\(OFU\) principle, and begins by constructing an*optimistic index*g~a\(t\)\\tilde\{g\}\_\{a\}\(t\)for every armaa\. To construct that index, the rationale goes as follows\. First, we provide a natural estimatorg^a\(t\)\\hat\{g\}\_\{a\}\(t\)for the valuegag\_\{a\}, then bound by how much that estimator tends to deviate fromgag\_\{a\}, and based on that bound, we finally construct a high\-probability upper\-boundg~a\(t\)\\tilde\{g\}\_\{a\}\(t\)ofgag\_\{a\}\. Following the OFU principle, our algorithm simply activates the arm maximizingg~a\(t\)\\tilde\{g\}\_\{a\}\(t\), and keeps it active until the number of visitsNa\(t\)N\_\{a\}\(t\)doubles\.

We now provide details on how the index ofUCB\-NOMis constructed\.

##### The empirical value of an arm\.

To estimate the value of an arm \(its gaingag\_\{a\}\), we simply aggregate and take the average of all the rewards collected by pulling the said arm\. We call it the*empirical value*of armaa, and denote itg^a\(t\)\\hat\{g\}\_\{a\}\(t\)\. Formally,

g^a\(t\):=1Na\(t\)∑i=1t−1𝟏\(A\(i\)=a\)R\(i\+1\),\\hat\{g\}\_\{a\}\(t\):=\\frac\{1\}\{N\_\{a\}\(t\)\}\\sum\_\{i=1\}^\{t\-1\}\\mathbf\{1\}\\left\(A\(i\)=a\\right\)\\,R\(i\+1\),\(4\)with the conventiong^a\(t\)=1\\hat\{g\}\_\{a\}\(t\)=1whenNa\(t\)=0N\_\{a\}\(t\)=0\. As dictated by the OFU principle, we are interested in constructing a*high probability upper bound*ofgag\_\{a\}fromg^a\(t\)\\hat\{g\}\_\{a\}\(t\)\. We do so by bounding\|g^a\(t\)−ga\|\|\\hat\{g\}\_\{a\}\(t\)\-g\_\{a\}\|via a concentration result \([Proposition˜3](https://arxiv.org/html/2606.27448#Thmtheorem3)\), that bounds that error as a function of the number of pulls to armaa, the number of switches to armaa, and the bias span\.

[Proposition˜3](https://arxiv.org/html/2606.27448#Thmtheorem3)is at the heart of the construction of the indexg~a\(t\)\\tilde\{g\}\_\{a\}\(t\)of our algorithm as such, we provide its proof in the main body\.

###### Proposition 3\(Error of the empirical estimator\)\.

For all arma∈𝒜a\\in\\mathcal\{A\}and confidence thresholdδ\>0\\delta\>0,

P\(∃t≥1,Na\(t\)\|g^a\(t\)−ga\|\>D⋅ξa\(t,δ\)\)≤δ,\\text\{P\}\\big\(\\exists t\\geq 1,N\_\{a\}\(t\)\\left\|\\hat\{g\}\_\{a\}\(t\)\-g\_\{a\}\\right\|\>D\\cdot\\xi\_\{a\}\(t,\\delta\)\\big\)\\leq\\delta,whereD≡1\+maxa∈𝒜⁡sp⁡\(ba\)D\\equiv 1\+\\max\_\{a\\in\\mathcal\{A\}\}\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)and the error function is given by

ξa\(t,δ\):=Na\(t\)log⁡\(1\+Na\(t\)δ\)\+∑i=1t−1𝟏\(A\(i−1\)≠a,A\(i\)=a\)\.\\xi\_\{a\}\(t,\\delta\):=\\sqrt\{N\_\{a\}\(t\)\\log\\left\(\\frac\{1\+N\_\{a\}\(t\)\}\{\\delta\}\\right\)\}\+\\sum\_\{i=1\}^\{t\-1\}\\mathbf\{1\}\\left\(A\(i\-1\)\\neq a,A\(i\)=a\\right\)\.\(5\)

###### Proof\.

Consider the sequence of stopping times given byτ1a:=inf\{t≥1:A\(t\)=a\}\\tau\_\{1\}^\{a\}:=\\inf\\\{t\\geq 1:A\(t\)=a\\\}andτk\+1a:=inf\{t\>τka:A\(t\)=a\}\\tau\_\{k\+1\}^\{a\}:=\\inf\\\{t\>\\tau\_\{k\}^\{a\}:A\(t\)=a\\\}, that enumerates the time\-instants when the played arm isa∈𝒜a\\in\\mathcal\{A\}\. First, introduce the noise terms

War\(t\)\\displaystyle W\_\{a\}^\{r\}\(t\):=∑k=1Na\(t\)\(R\(τka\+1\)−ra\(S\(τka\)\)\)andWap\(t\):=∑k=1Na\(t\)\(ba\(Sa\(τka\+1\)\)−pa1\(Sa\(τka\)\)ba\),\\displaystyle=\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\Big\(R\(\\tau\_\{k\}^\{a\}\+1\)\-r\_\{a\}\(S\(\\tau\_\{k\}^\{a\}\)\)\\Big\)\\qquad\\text\{and\}\\qquad W\_\{a\}^\{p\}\(t\)=\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\Big\(b\_\{a\}\(S\_\{a\}\(\\tau\_\{k\}^\{a\}\+1\)\)\-p\_\{a\}^\{1\}\(S\_\{a\}\(\\tau\_\{k\}^\{a\}\)\)b\_\{a\}\\Big\),that respectively measure how much the immediate rewards and transitions—both obtained by activating armaa—deviate from their means\. Further, introduce their sumWa\(t\):=War\(t\)\+Wap\(t\)W\_\{a\}\(t\):=W\_\{a\}^\{r\}\(t\)\+W\_\{a\}^\{p\}\(t\)\.

Fort≥1t\\geq 1, we have

Na\(t\)g^a\(t\)\\displaystyle N\_\{a\}\(t\)\\hat\{g\}\_\{a\}\(t\)=∑i=1t−1𝟏\(A\(i\)=a\)R\(i\+1\)\\displaystyle=\\sum\_\{i=1\}^\{t\-1\}\\mathbf\{1\}\\left\(A\(i\)=a\\right\)R\(i\+1\)=∑k=1∞𝟏\(τka<t\)R\(τka\+1\)\\displaystyle=\\sum\_\{k=1\}^\{\\infty\}\\mathbf\{1\}\\left\(\\tau\_\{k\}^\{a\}<t\\right\)R\(\\tau\_\{k\}^\{a\}\+1\)=°1∑k=1Na\(t\)R\(τka\+1\)\\displaystyle\\overset\{\\text\{\\textdegree 1\}\}\{=\}\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}R\(\\tau\_\{k\}^\{a\}\+1\)=°2∑k=1Na\(t\)\(ga\+\(eSa\(τka\)−pa1\(Sa\(τka\)\)\)ba\)\+War\(t\)\\displaystyle\\overset\{\\text\{\\textdegree 2\}\}\{=\}\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\left\(g\_\{a\}\+\\left\(e\_\{S\_\{a\}\(\\tau\_\{k\}^\{a\}\)\}\-p\_\{a\}^\{1\}\(S\_\{a\}\(\\tau\_\{k\}^\{a\}\)\)\\right\)b\_\{a\}\\right\)\+W\_\{a\}^\{r\}\(t\)=°3Na\(t\)ga\+∑k=1Na\(t\)\(eSa\(τka\)−eSa\(τka\+1\)\)ba\+Wa\(t\)\\displaystyle\\overset\{\\text\{\\textdegree 3\}\}\{=\}N\_\{a\}\(t\)g\_\{a\}\+\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\left\(e\_\{S\_\{a\}\(\\tau\_\{k\}^\{a\}\)\}\-e\_\{S\_\{a\}\(\\tau\_\{k\}^\{a\}\+1\)\}\\right\)b\_\{a\}\+W\_\{a\}\(t\)\(6\)where \(°1\) uses thatNa\(t\):=sup\{k:τka<t\}N\_\{a\}\(t\):=\\sup\\\{k:\\tau\_\{k\}^\{a\}<t\\\}; and \(°2\) introduces the canonical basis\(es\)s∈𝒮a\(e\_\{s\}\)\_\{s\\in\\mathcal\{S\}\_\{a\}\}ofR𝒮a\\text\{R\}^\{\\mathcal\{S\}\_\{a\}\}and invokes the Poisson equation underπa\\pi^\{a\}, given byga\+ba\(sa\)=r\(sa\)\+pa1\(sa\)bag\_\{a\}\+b\_\{a\}\(s\_\{a\}\)=r\(s\_\{a\}\)\+p\_\{a\}^\{1\}\(s\_\{a\}\)b\_\{a\}222Here a compact vector notation is used\.\. In \(°3\), we recognize two error terms\. The first is linked to the number of switches, while the second is a martingale noise\.

First, the martingale noiseWa\(t\)W\_\{a\}\(t\)is bounded using a time\-uniform Azuma\-Hoeffding’s inequality \(Bourelet al\.\([2020](https://arxiv.org/html/2606.27448#bib.bib27)\)\)\. We get that, uniformly fort≥1t\\geq 1and with probability at least1−δ1\-\\delta,

\|War\(t\)\+Wap\(t\)\|≤\(1\+sp⁡\(ba\)\)Na\(t\)log⁡\(1\+Na\(t\)δ\)\.\|W\_\{a\}^\{r\}\(t\)\+W\_\{a\}^\{p\}\(t\)\|\\leq\\left\(1\+\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\\right\)\\sqrt\{N\_\{a\}\(t\)\\log\\left\(\\tfrac\{1\+N\_\{a\}\(t\)\}\{\\delta\}\\right\)\}\.For the first error term of[Equation˜6](https://arxiv.org/html/2606.27448#S3.E6), we have:

\|∑k=1Na\(t\)\(eS\(τka\)−eS\(τka\+1\)\)ba\|\\displaystyle\\Bigg\|\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\Big\(e\_\{S\(\\tau\_\{k\}^\{a\}\)\}\-e\_\{S\(\\tau\_\{k\}^\{a\}\+1\)\}\\Big\)b\_\{a\}\\Bigg\|≤sp⁡\(ba\)\(1\+∑k=1Na\(t\)𝟏\(S\(τka\+1\)≠S\(τk\+1a\)\)\)\\displaystyle\\leq\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\\left\(1\+\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\mathbf\{1\}\\left\(S\(\\tau\_\{k\}^\{a\}\+1\)\\neq S\(\\tau\_\{k\+1\}^\{a\}\)\\right\)\\right\)≤sp⁡\(ba\)\(1\+∑k=1Na\(t\)𝟏\(τka\+1≠τk\+1a\)\)\\displaystyle\\leq\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\\left\(1\+\\sum\_\{k=1\}^\{N\_\{a\}\(t\)\}\\mathbf\{1\}\\left\(\\tau\_\{k\}^\{a\}\+1\\neq\\tau\_\{k\+1\}^\{a\}\\right\)\\right\)≤sp⁡\(ba\)∑i=1t−1𝟏\(A\(i−1\)≠a,A\(i\)=a\)\.\\displaystyle\\leq\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\\sum\_\{i=1\}^\{t\-1\}\\mathbf\{1\}\\left\(A\(i\-1\)\\neq a,A\(i\)=a\\right\)\.We conclude by plugging everything back in[Equation˜6](https://arxiv.org/html/2606.27448#S3.E6)\. ∎

##### Building an optimistic estimator\.

The bound on the estimation error established in[Proposition˜3](https://arxiv.org/html/2606.27448#Thmtheorem3)allows for the construction of an*optimistic estimator*g~a\(t\)\\tilde\{g\}\_\{a\}\(t\)ofgag\_\{a\}\. By optimistic, we mean thatg~a\(t\)\\tilde\{g\}\_\{a\}\(t\)serves as a high probability upper bound on the arm’s valuegag\_\{a\}, that is,P\(∃t≥1,g~a\(t\)<ga\)≤δ\.\\text\{P\}\\left\(\\exists t\\geq 1,\\tilde\{g\}\_\{a\}\(t\)<g\_\{a\}\\right\)\\leq\\delta\.A direct consequence of[Proposition˜3](https://arxiv.org/html/2606.27448#Thmtheorem3)is that

P\(∃t≥1:g^a\(t\)\+D⋅ξa\(t,δ\)Na\(t\)<ga\)≤δ\.\\text\{P\}\\left\(\\exists t\\geq 1:\\hat\{g\}\_\{a\}\(t\)\+\\frac\{D\\cdot\\xi\_\{a\}\(t,\\delta\)\}\{N\_\{a\}\(t\)\}<g\_\{a\}\\right\)\\leq\\delta\.\(7\)[Equation˜7](https://arxiv.org/html/2606.27448#S3.E7)suggests thatg^a\(t\)\+D⋅ξa\(t,δ\)Na\(t\)\\hat\{g\}\_\{a\}\(t\)\+\\frac\{D\\cdot\\xi\_\{a\}\(t,\\delta\)\}\{N\_\{a\}\(t\)\}is a natural choice forg~a\(t\)\\tilde\{g\}\_\{a\}\(t\)\. Unfortunately, the quantityDDis unknown and varies depending on the underlying instance of Markovian bandit\. To circumvent this difficulty, we consider a*span proxy function*f:N→R\+f:\\text\{N\}\\to\\text\{R\}\_\{\+\}diverging to\+∞\+\\infty\. Eventually,f\(t\)f\(t\)exceedsDD, so that for every instanceM∈ℳM\\in\\mathcal\{M\}, there is a time thresholdβf∈R\+\\beta\_\{f\}\\in\\text\{R\}\_\{\+\}, s\.t\.,

P\(∃t≥βf:g^a\(t\)\+f\(t\)⋅ξa\(t,δ\)Na\(t\)<ga\)≤δ\.\\text\{P\}\\left\(\\exists t\\geq\\beta\_\{f\}:\\hat\{g\}\_\{a\}\(t\)\+\\frac\{f\(t\)\\cdot\\xi\_\{a\}\(t,\\delta\)\}\{N\_\{a\}\(t\)\}<g\_\{a\}\\right\)\\leq\\delta\.\(8\)That time threshold isβf:=1\+sup\{t≥1:f\(t\)≤D\}\\beta\_\{f\}:=1\+\\sup\\\{t\\geq 1:f\(t\)\\leq D\\\}\. The above justifies our choice of the indexg~a\(t\)\\tilde\{g\}\_\{a\}\(t\)inLABEL:equation\_optimistic\_estimateofLABEL:algo:UCB\-POMB\.

##### Taming the number of switches\.

It is preferable for the optimistic estimator to converge to the empirical estimator whenNa\(t\)N\_\{a\}\(t\)clearly overshoots, or said differently, that the confidence region forgag\_\{a\}shrinks to a singleton whenaais the most visited arm for instance\. To make sure of it, we need the associated errorξa\(t,δ\)/Na\(t\)\\xi\_\{a\}\(t,\\delta\)/N\_\{a\}\(t\)to vanish asymptotically\. In[Equation˜5](https://arxiv.org/html/2606.27448#S3.E5), we see that the error functionξa\(t,δ\)\\xi\_\{a\}\(t,\\delta\)increases with the number of switches∑i=1t−1𝟏\(A\(i−1\)≠a,A\(i\)=a\)\\sum\_\{i=1\}^\{t\-1\}\\mathbf\{1\}\\left\(A\(i\-1\)\\neq a,A\(i\)=a\\right\)\. So,ξa\(t,δ\)/Na\(t\)\\xi\_\{a\}\(t,\\delta\)/N\_\{a\}\(t\)can only be vanishing if the number of switches grows sublinearly inNa\(t\)N\_\{a\}\(t\); In other words, when the algorithm satisfies a rarely switching property \([Definition˜1](https://arxiv.org/html/2606.27448#Thmdefinition1)\)\.

A simple method to obtain a rarely switching algorithm is to rely on the famous*doubling trick*\(seeAueret al\.\([1995](https://arxiv.org/html/2606.27448#bib.bib23)\)and\(Auer,[2002](https://arxiv.org/html/2606.27448#bib.bib33),UCB2\)\): Upon activating armaa, activate it until the visit counts ofNa\(t\)N\_\{a\}\(t\)doubles\. Doing so, the number of switches to armaabecomes of orderlog⁡\(1\+Na\(t\)\)\\log\(1\+N\_\{a\}\(t\)\), which makes the algorithmO\(log\)\\operatorname\*\{\\mathrm\{O\}\}\(\\log\)\-rarely switching\.

Combining the above, we recoverUCB\-NOM\(LABEL:algo:UCB\-POMB\)\.

##### Comparison withRCA\.

Our algorithmUCB\-NOMis closely related toRCAofTekin and Liu \([2012](https://arxiv.org/html/2606.27448#bib.bib40)\), as both are generalizations ofUCB\(Auer,[2002](https://arxiv.org/html/2606.27448#bib.bib33)\)to Markovian bandits\. Two major differences withRCAare worth mentioning\.

- •*Observability of states and regeneration\.*RCAstands for*Regenerative Cycle Algorithm*, and the algorithm ofTekin and Liu \([2012](https://arxiv.org/html/2606.27448#bib.bib40)\)observes the stream of states to calibrate its arm estimates better\. In particular,RCAalways make sure to switch off from armaawhile armaais in a reference statesa∈𝒮as\_\{a\}\\in\\mathcal\{S\}\_\{a\}\. Moreover, when it switches onto armaa, it ignores the first pool of rewards untilsas\_\{a\}is reached\. This makes the empirical estimator ofgag\_\{a\}— theg^a\(t\)\\hat\{g\}\_\{a\}\(t\)— ofTekin and Liu \([2012](https://arxiv.org/html/2606.27448#bib.bib40)\)different from our own\. Their estimator is unbiased—hence better—because it does not require the correction term∑i=1t−1𝟏\(A\(i−1\)≠a,A\(i\)=a\)\\sum\_\{i=1\}^\{t\-1\}\\mathbf\{1\}\\left\(A\(i\-1\)\\neq a,A\(i\)=a\\right\)due to switching arms in the optimistic bonus\. However, because states are non\-observable in our setting, the regeneration technique is unachievable\.
- •*Optimistic bonus\.*In the end, the optimistic bonus is a correction term\. It is about how much the aggregate rewards of a Markov reward process tends to deviate from its mean, i\.e\., is about isolating the trajectory of rewards\(Ria\)i≥1\(R\_\{i\}^\{a\}\)\_\{i\\geq 1\}generated by arma∈𝒜a\\in\\mathcal\{A\}and comparing∑i=1tRia\\sum\_\{i=1\}^\{t\}R\_\{i\}^\{a\}tot⋅gat\\cdot g\_\{a\}\. For both, the optimistic bonus is of the formL/Na\(t\)\\sqrt\{L/N\_\{a\}\(t\)\}, where L≍\{\|𝒮a\|2π^max2εminforTekin and Liu \([2012](https://arxiv.org/html/2606.27448#bib.bib40)\);1\+sp\(ba\)2for us,L\\asymp\\begin\{cases\}\\frac\{\|\\mathcal\{S\}\_\{a\}\|^\{2\}\\hat\{\\pi\}\_\{\\mathrm\{max\}\}^\{2\}\}\{\\varepsilon\_\{\\mathrm\{min\}\}\}&\\text\{for \\cite\[cite\]\{\\@@bibref\{Authors Phrase1YearPhrase2\}\{tekin2012online\}\{\\@@citephrase\{\(\}\}\{\\@@citephrase\{\)\}\}\};\}\\\\ 1\+\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)^\{2\}&\\text\{for us,\}\\end\{cases\}whereπ^max≤1\\hat\{\\pi\}\_\{\\mathrm\{max\}\}\\leq 1is related to the stationary distribution of the chainMaM\_\{a\}andεmin\>0\\varepsilon\_\{\\mathrm\{\\min\}\}\>0is the spectral gap, see\(Tekin and Liu,[2012](https://arxiv.org/html/2606.27448#bib.bib40)\)\. The presence of the spectral gap in the bound ofTekin and Liu \([2012](https://arxiv.org/html/2606.27448#bib.bib40)\)naturally emerges from their analysis: to compare∑i=1tRia\\sum\_\{i=1\}^\{t\}R\_\{i\}^\{a\}andt⋅gat\\cdot g\_\{a\}, the authors focus on the speed of convergence of the empirical measure of presence of\(Sia\)i≥1\(S^\{a\}\_\{i\}\)\_\{i\\geq 1\}, relying on standard Markov chain techniques\. In general, the spectral gap is related to the mixing time astmix=O~\(\|𝒮\|/εmin\)t\_\{\\mathrm\{mix\}\}=\\smash\{\\widetilde\{O\}\(\|\\mathcal\{S\}\|/\\varepsilon\_\{\\mathrm\{min\}\}\)\},333O~\(−\)\\widetilde\{O\}\(\-\)hides logarithmic factors\.see\(Jerison,[2013](https://arxiv.org/html/2606.27448#bib.bib4)\), while the bias span is related to the mixing time withsp⁡\(b\)=O\(tmix\)\\operatorname\{\\mathrm\{sp\}\}\(b\)=\\operatorname\*\{\\mathrm\{O\}\}\(t\_\{\\mathrm\{mix\}\}\), see\(Wanget al\.,[2022](https://arxiv.org/html/2606.27448#bib.bib3)\)\. A dependency insp⁡\(b\)\\operatorname\{\\mathrm\{sp\}\}\(b\)is better than one in\(\|𝒮\|,εmin\)\(\|\\mathcal\{S\}\|,\\varepsilon\_\{\\mathrm\{min\}\}\): we have no dependency in the number of states andsp⁡\(b\)\\operatorname\{\\mathrm\{sp\}\}\(b\)ties the transition structure together with the reward mechanism\. However, oursp\(b\)2\\operatorname\{\\mathrm\{sp\}\}\(b\)^\{2\}is a bit loose — it could be improved using variance reduction techniques\(Boone,[2024](https://arxiv.org/html/2606.27448#bib.bib2), Chapter 5\), but proof details would then be more delicate\.

### 3\.3An asymptotic regret bound forUCB\-NOMmatching the general regret lower bound

We have shown that the regret of every algorithm must beω\(log⁡\(T\)\)\\omega\(\\log\(T\)\)and we have described an algorithm \(LABEL:algo:UCB\-POMB\), parameterized by a bias proxyff\. The next theorem demonstrates that the idealisticlog⁡\(T\)\\log\(T\)regime is approached arbitrarily closely byUCB\-NOM\.

###### Theorem 4\(Upper bound\)\.

Letf:R\+→R\+f:\\text\{R\}\_\{\+\}\\to\\text\{R\}\_\{\+\}an increasing function diverging to infinity and a confidence functionδ\(t\):N→R\+∗\\delta\(t\):\\text\{N\}\\to\\text\{R\}\_\{\+\}^\{\*\}such that∑t≥1δ\(t\)<∞\\sum\_\{t\\geq 1\}\\delta\(t\)<\\inftyandlog⁡\(1δ\(t\)\)=Θ\(log⁡\(t\)\)\\log\(\\frac\{1\}\{\\delta\(t\)\}\)=\\Theta\(\\log\(t\)\)\.444For instance, anyδ\(t\)\\delta\(t\)of the formδ\(t\)=\(1t\)1\+ε\\delta\(t\)=\(\\frac\{1\}\{t\}\)^\{1\+\\varepsilon\}withε\>0\\varepsilon\>0works\.For all instanceM∈ℳM\\in\\mathcal\{M\},

Reg\(T;M,UCB\-NOM\(f\)\)=O\(∑a∉𝒜∗f\(T\)2log⁡\(T\)g∗−ga\+∑a∉𝒜∗\(g∗−ga\)\(βf\)2\),\\mathrm\{Reg\}\(T;M,\\texttt\{UCB\-NOM\}\(f\)\)=\\operatorname\*\{\\mathrm\{O\}\}\\left\(\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\frac\{f\(T\)^\{2\}\\log\(T\)\}\{g\_\{\*\}\-g\_\{a\}\}\+\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\left\(g\_\{\*\}\-g\_\{a\}\\right\)\(\\beta\_\{f\}\)^\{2\}\\right\),\(9\)where𝒜∗:=arg⁡maxa∈𝒜⁡ga\\mathcal\{A\}^\{\*\}:=\\arg\\max\_\{a\\in\\mathcal\{A\}\}g\_\{a\}is the set of optimal arms andβf:=1\+sup\{t≥1:f\(t\)≤D\}\\beta\_\{f\}:=1\+\\sup\\left\\\{t\\geq 1:f\(t\)\\leq D\\right\\\}is a burn\-in cost \(see[Equation˜8](https://arxiv.org/html/2606.27448#S3.E8)\), withD:=max⁡\{maxa⁡sp⁡\(ba\)\+1,Cτ\}D:=\\max\\\{\\max\_\{a\}\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\+1,C\_\{\\tau\}\\\}a threshold that depends onMM\.

The proof, together with error terms of higher orders, can be found inLABEL:appendix:upper\_bound\_instance\_dependent\. Note that the dominant term is of orderf\(T\)2log⁡\(T\)f\(T\)^\{2\}\\log\(T\)and that notably, the delay constantCτC\_\{\\tau\}is absent from it\.

In particular[Theorem˜4](https://arxiv.org/html/2606.27448#Thmtheorem4)shows that the lower bound of[Theorem˜2](https://arxiv.org/html/2606.27448#Thmtheorem2)is*tight*, and thatUCB\-NOMachieves it\. Indeed, the definition ofω\(log⁡\(T\)\)\\omega\(\\log\(T\)\)provided by the lower bound reads as follows: for a given instanceMMand an algorithmΛ\\Lambda, there exists a functionf0→∞f\_\{0\}\\to\\inftysuch thatlim infT→∞Reg\(T;M,Λ\)/\(f0\(T\)log⁡\(T\)\)\>0\\liminf\_\{T\\to\\infty\}\\mathrm\{Reg\}\(T;M,\\Lambda\)/\(f\_\{0\}\(T\)\\log\(T\)\)\>0\. ForUCB\-NOMwith span proxyf1≡\(f0\)13f\_\{1\}\\equiv\(f\_\{0\}\)^\{\\frac\{1\}\{3\}\}, it holds that,

lim supT→∞Reg\(T;M,UCB\-NOM\(f1\)\)Reg\(T;M,Λ\)=0\.\\limsup\_\{T\\to\\infty\}\\frac\{\\mathrm\{Reg\}\(T;M,\\texttt\{UCB\-NOM\}\(f\_\{1\}\)\)\}\{\\mathrm\{Reg\}\(T;M,\\Lambda\)\}=0\.Hence, ourUCB\-NOM\(f1\)\\texttt\{UCB\-NOM\}\(f\_\{1\}\)has better regret thanΛ\\LambdaonMM\. It can even be further improved by running it with span proxyf2≡\(f1\)13f\_\{2\}\\equiv\(f\_\{1\}\)^\{\\frac\{1\}\{3\}\}, orf3≡\(f2\)13f\_\{3\}\\equiv\(f\_\{2\}\)^\{\\frac\{1\}\{3\}\}, and so and so forth\. This means that whatever the algorithm is, it can be forever improved, getting closer and closer to the idealistic \(but non\-reachable\)log⁡\(T\)\\log\(T\)regret regime\.

### 3\.4Worst case regret: what is the best way to choose the span proxyff?

From an asymptotic viewpoint, the regret is simplyO\(f\(T\)2log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(f\(T\)^\{2\}\\log\(T\)\)\. The best choice forffis thus a function diverging\+∞\+\\inftyas slowly as possible\. Unfortunately, there is no such function, becausef\\sqrt\{f\}always diverges slower thanff\. However, this observation only tells half of the story\.

As it turns out, due to the reliance ofUCB\-NOMon an optimistic estimatorg~a\(t\)\\tilde\{g\}\_\{a\}\(t\)to estimate the valuegag\_\{a\}of armaa, the asymptotic regret guarantee of[Theorem˜4](https://arxiv.org/html/2606.27448#Thmtheorem4)hides a burn\-in costβf:=1\+sup\{t≥1:f\(t\)≤D\}\\beta\_\{f\}:=1\+\\sup\\\{t\\geq 1:f\(t\)\\leq D\\\}\(see[Equation˜8](https://arxiv.org/html/2606.27448#S3.E8)\) that is eventually negligible, but dominant nonetheless in the early stages of the learning process\. The regret upper\-bound of[Equation˜9](https://arxiv.org/html/2606.27448#S3.E9)should therefore be read as such: The first term \(inf\(T\)2log⁡\(T\)/\(g∗−ga\)f\(T\)^\{2\}\\log\(T\)/\(g\_\{\*\}\-g\_\{a\}\)\) holds the asymptotic behavior regret, while the second term \(inβf\\beta\_\{f\}\) represents the regret that may be accumulated before the optimistic indices become valid—a*burn\-in cost*\.

From a non\-asymptotic viewpoint, no proper tradeoff between the asymptoticf\(T\)2log⁡\(T\)/\(g∗−ga\)f\(T\)^\{2\}\\log\(T\)/\(g\_\{\*\}\-g\_\{a\}\)and the burn\-in costβf\\beta\_\{f\}can be found\.[Theorem˜5](https://arxiv.org/html/2606.27448#Thmtheorem5)states that independently of the choice of span proxyff, for every horizonT≥1T\\geq 1, there will be an instance where the regret scales linearly\.

###### Theorem 5\(Worst case lower bound\)\.

LetΛ\\Lambdabe a learning algorithm\. Let𝒰τ⊆\[0,1\]\\mathcal\{U\}\_\{\\tau\}\\subseteq\[0,1\]be arbitrary\. ForT≥1T\\geq 1, we have

supM∈ℳReg\(T;M,Λ\)≥114800⋅T\.\\sup\_\{M\\in\\mathcal\{M\}\}\\mathrm\{Reg\}\(T;M,\\Lambda\)\\geq\\frac\{11\}\{4800\}\\cdot T\.\(10\)

The proof can be found inLABEL:appendix\_lowerbound\_worst\_case\.

[Theorem˜5](https://arxiv.org/html/2606.27448#Thmtheorem5)shows that our class of models is so broad that*every*algorithm has linear regret in the worst case \(in particularUCB\-NOM, whatever the choice offf\)\. Hence, no single span proxyffcould be uniformly optimal\. Different choices offfwill perform better on some instances and worse on others\.

## 4Regret bounds with uniformly bounded bias span

In[Section˜3](https://arxiv.org/html/2606.27448#S3), we have shown that for the space of all Markovian bandit problems, the regret of consistent algorithms must scale asω\(log⁡\(T\)\)\\omega\(\\log\(T\)\)and that this bound is achieved byUCB\-NOM\(LABEL:algo:UCB\-POMB\)\. To achieve this result, we have argued how important it is to make the span proxyffdiverge to\+∞\+\\infty, so thatffeventually upper\-bounds the unknown worst bias spanmaxa∈𝒜⁡sp⁡\(ba\)\\max\_\{a\\in\\mathcal\{A\}\}\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\.

What if a bound on the worst bias span is known?

In this section, we show that given prior knowledge onmaxa∈𝒜⁡sp⁡\(ba\)\\max\_\{a\\in\\mathcal\{A\}\}\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\), the span proxy function does not need to diverge to\+∞\+\\inftyanymore and the logarithmic regret regime can even be achieved\. In particular, by runningUCB\-NOMwith span proxyf≡maxa∈𝒜⁡sp⁡\(ba\)f\\equiv\\max\_\{a\\in\\mathcal\{A\}\}\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\), we obtain instance dependent regretO\(log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\log\(T\)\)\([Section˜4\.1](https://arxiv.org/html/2606.27448#S4.SS1)\) and worst\-case regretO\(Tlog⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\sqrt\{T\\log\(T\)\}\)\([Section˜4\.2](https://arxiv.org/html/2606.27448#S4.SS2)\)\.

ForD∈R\+D\\in\\text\{R\}\_\{\+\}, we introduce

ℳD:=\{M∈ℳ:max⁡\{maxa∈𝒜⁡sp⁡\(ba\)\+1,Cτ\}≤D\},\\mathcal\{M\}\_\{D\}:=\\left\\\{M\\in\\mathcal\{M\}:\\max\\left\\\{\\max\_\{a\\in\\mathcal\{A\}\}\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\+1,C\_\{\\tau\}\\right\\\}\\leq D\\right\\\},the collection of instances with bias span and decision epoch delaysCτC\_\{\\tau\}\([Assumption˜3](https://arxiv.org/html/2606.27448#Thmassumption3)\) both bounded byDD\. The second assumption “Cτ≤DC\_\{\\tau\}\\leq D” is not strictly necessary, but will significantly simplify the theory\.

### 4\.1Learning with a bound on the span proxy

Under the assumption thatM∈ℳDM\\in\\mathcal\{M\}\_\{D\}, the idealistic logarithmic regret regime is achieved byUCB\-NOM, when equipped with the span proxy functionf≡Df\\equiv D, see[Theorem˜6](https://arxiv.org/html/2606.27448#Thmtheorem6)below\.

###### Theorem 6\(Instance\-dependent regret\)\.

LetD∈R\+D\\in\\text\{R\}\_\{\+\}andM∈ℳDM\\in\\mathcal\{M\}\_\{D\}with a*unique*optimal arma∗a^\{\*\}\. Letδ\(t\):N→R\+∗\\delta\(t\):\\text\{N\}\\to\\text\{R\}\_\{\+\}^\{\*\}be a confidence function such that∑t≥1δ\(t\)<∞\\sum\_\{t\\geq 1\}\\delta\(t\)<\\inftyandlog⁡\(1δ\(t\)\)=Θ\(log⁡\(t\)\)\\log\(\\frac\{1\}\{\\delta\(t\)\}\)=\\Theta\(\\log\(t\)\)\. When run with span proxyf≡D=2maxa∈𝒜⁡max⁡\{1\+sp⁡\(ba\),3sp⁡\(ba\)\}f\\equiv D=2\\max\_\{a\\in\\mathcal\{A\}\}\\max\\left\\\{1\+\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\),3\\operatorname\{\\mathrm\{sp\}\}\(b\_\{a\}\)\\right\\\}, the regret ofUCB\-NOMsatisfies

Reg\(T;M,UCB\-NOM\(D\)\)=O\(∑a≠a∗D2log⁡\(T\)g∗−ga\)\.\\mathrm\{Reg\}\(T;M,\\texttt\{UCB\-NOM\}\(D\)\)=\\operatorname\*\{\\mathrm\{O\}\}\\left\(\\sum\_\{a\\neq a^\{\*\}\}\\frac\{D^\{2\}\\log\(T\)\}\{g\_\{\*\}\-g\_\{a\}\}\\right\)\.

###### Proof\.

The proof essentially reduces to the analysis performed in the proof of[Theorem˜4](https://arxiv.org/html/2606.27448#Thmtheorem4)\. Details can be found in theLABEL:appendix\_instance\_dependent\_bounded\. ∎

The regret remains logarithmic even if the optimal arm is not unique, but the regret upper bound is slightly degraded\. SeeLABEL:appendix\_instance\_dependent\_boundedfor more details\. A discussion on the relation between our results and similar ones from the stochastic bandits literature can be found in[Section˜5\.3](https://arxiv.org/html/2606.27448#S5.SS3)\.

### 4\.2Worst\-case regret guarantees

Furthermore, with the assumptionM∈ℳDM\\in\\mathcal\{M\}\_\{D\}, we can move beyond the instance\-dependent logarithmic regret previously presented in[Theorem˜6](https://arxiv.org/html/2606.27448#Thmtheorem6)and establish robust guarantees that hold for every instance uniformly\.

###### Theorem 7\(Worst\-case regret\)\.

LetD∈R\+D\\in\\text\{R\}\_\{\+\}andδ\(t\):N→R\+∗\\delta\(t\):\\text\{N\}\\to\\text\{R\}\_\{\+\}^\{\*\}be a confidence function such that∑t≥1t⋅δ\(t\)<∞\\sum\_\{t\\geq 1\}t\\cdot\\delta\(t\)<\\inftyandlog⁡\(1δ\(t\)\)=Θ\(log⁡\(t\)\)\\log\(\\frac\{1\}\{\\delta\(t\)\}\)=\\Theta\(\\log\(t\)\)\.555For instance, anyδ\(t\)=\(1t\)2\+ε\\delta\(t\)=\(\\frac\{1\}\{t\}\)^\{2\+\\varepsilon\}withε\>0\\varepsilon\>0is fine\.When running with span proxyf≡Df\\equiv D,UCB\-NOM\(D\)\\texttt\{UCB\-NOM\}\(D\)has the following worst\-case regret:

supM∈ℳDReg\(T;M,UCB\-NOM\(D\)\)=O\(D\|𝒜\|Tlog⁡\(T\)\)\.\\begin\{gathered\}\\sup\_\{M\\in\\mathcal\{M\}\_\{D\}\}\\mathrm\{Reg\}\(T;M,\\texttt\{UCB\-NOM\}\(D\)\)=\\operatorname\*\{\\mathrm\{O\}\}\\left\(D\\sqrt\{\|\\mathcal\{A\}\|T\\log\(T\)\}\\right\)\.\\end\{gathered\}

###### Proof\.

The proof is deferred toLABEL:appendix\_worst\_case\_upper\. ∎

The tightness of this result is discussed in more detail inLABEL:appendix\_worst\_case\_upper\.

## 5Tuningδ\(t\)\\delta\(t\): Being under\-optimistic is preferable in Markovian bandits

[Section˜4](https://arxiv.org/html/2606.27448#S4)provides the natural choicef≡Df\\equiv Dfor the span proxy—when bounds on maximal span of the bias function and the delays of the decision epochs are available—and removes the need for a diverging span proxy\. In this last section, we are interested in what the best choice for the confidence thresholdδ\(t\)\\delta\(t\)is, given that the span\-proxy is fixed to the value recommended by[Section˜4](https://arxiv.org/html/2606.27448#S4)\. Indeed, while from an asymptotic viewpoint the regret is simplyO\(\(DΔ\)2log⁡\(T\)\)\\smash\{\\operatorname\*\{\\mathrm\{O\}\}\(\(\\frac\{D\}\{\\Delta\}\)^\{2\}\\log\(T\)\)\}for all summable confidence function, in finite horizon, a trade\-off appears\.

- •Beδ\(t\)\\delta\(t\)too large, and the confidence intervals are too narrow to contain the true gains, leading to a failure of the optimistic property\. As the optimistic bonuslog⁡\(1δ\(t\)\)\\smash\{\\log\(\\frac\{1\}\{\\delta\(t\)\}\)\}vanishes, the algorithm’s behavior converges toward a greedy algorithm, that pulls the arm with the highest empirical estimator\. Then,UCB\-NOMis very likely to pull bad up to≈T\\approx Ttimes, resulting in disastrous performance\.
- •Beδ\(t\)\\delta\(t\)too small, and the algorithm is too conservative: The optimistic bonuses dictate everything that the algorithm does\. As the optimistic bonuslog⁡\(1δ\(t\)\)\\smash\{\\log\(\\frac\{1\}\{\\delta\(t\)\}\)\}explodes to\+∞\+\\infty, the algorithm’s behavior converges towards a round\-robin process\.

InLABEL:fig:confidence, we display the regret ofUCB\-NOM\(D\)\(D\)for a reasonably short horizon, as a function of the confidence parameter and for different parameterizations ofδ\(t\)\\delta\(t\)\. To avoid running into deceptive conclusions, the environment is specifically designed so that the greedy algorithm \(UCB\-NOMwithδ=1\\delta=1\) has no chance to work\. But then, even though the environment is unfavorable to greedy algorithms, these experiments suggest that very high confidence parameters are preferred, such asδ\(t\)=α\(t\)t\\smash\{\\delta\(t\)=\\frac\{\\alpha\(t\)\}\{t\}\}orβ\(t\)tlog⁡\(t\)\\smash\{\\frac\{\\beta\(t\)\}\{t\\log\(t\)\}\}with very highα\(t\)\\alpha\(t\)orβ\(t\)\\beta\(t\)\. In the remaining of this section, we explain why\.

![Refer to caption](https://arxiv.org/html/2606.27448v1/x1.png)α\\alphaRegret\(a\)Regret ofUCB\-NOM\(D\)\\texttt\{UCB\-NOM\}\(D\)with horizonT=2000T=2000with confidence parameterδ\(t\):=αt\\delta\(t\):=\\frac\{\\alpha\}\{t\}\. Each point is obtained by averaging over 1000 trajectories\.
![Refer to caption](https://arxiv.org/html/2606.27448v1/x2.png)β\\betaRegret\(b\)Regret ofUCB\-NOM\(D\)\\texttt\{UCB\-NOM\}\(D\)with horizonT=2000T=2000with confidence parameterδ\(t\):=βtlog⁡t\\delta\(t\):=\\frac\{\\beta\}\{t\\log t\}\. Each point is obtained by averaging over 1000 trajectories\.

Figure 3:figure\]fig:confidence Regret ofUCB\-NOM\(D\)\\texttt\{UCB\-NOM\}\(D\)on a Markovian bandits with two arms\. The first arma=1a=1is a stochastic bandit with rewardν1=Bernoulli\(0\.3\)\\nu\_\{1\}=\\mathrm\{Bernoulli\}\(0\.3\)\. The second arma=2a=2is a two states Markovian arm whose transition are, when active, uniform over\{0,1\}\\left\\\{0,1\\right\\\}, and when passive, deterministically directed to 0\. Associated rewards areν2\(0\)=0\\nu\_\{2\}\(0\)=0andν2\(1\)=Bernoulli\(0\.8\)\\nu\_\{2\}\(1\)=\\mathrm\{Bernoulli\}\(0\.8\)\.In the bonus, the term due to switching cost cannot be improved in general\. The environment inLABEL:fig:confidenceis chosen so that this term is chosen in a tight fashion: every time the arm is dropped, the arm goes back to the worst state instantly\. For this environment, the switching bonus perfectly corrects the empirical estimator and we can fully focus on the study of the optimistic bonus\. So, switching terms are casted aside by looking at bounds on the pseudo\-regret directly\.

From the proof of[Theorem˜6](https://arxiv.org/html/2606.27448#Thmtheorem6), keeping dominant terms, we get the upper\-bound of the pseudo\-regret is of the form:

E\[PR\(T\)\]≲C1‖Δ‖1∫1∞δ\(t\)dt\(∑a∉𝒜∗1Δa\)log⁡\(1δ\(T\)\)\.\\text\{E\}\\left\[\\mathrm\{PR\}\(T\)\\right\]\\lesssim C\_\{1\}\\\|\\Delta\\\|\_\{1\}\\int\_\{1\}^\{\\infty\}\\delta\(t\)\\penalty 10000\\ \\mathrm\{d\}t\\left\(\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\frac\{1\}\{\\Delta\_\{a\}\}\\right\)\\log\\left\(\\frac\{1\}\{\\delta\(T\)\}\\right\)\\;\.\(11\)whereC1,C2\>0C\_\{1\},C\_\{2\}\>0are universal constants\.666It is not easy to find a real meaning to the precise values of these constantsC1,C2C\_\{1\},C\_\{2\}, because they are the results of computations that are pretty conservative\. Therefore, the discussion revolves around the dependency inΔ\\DeltaandDDexclusively\.

### 5\.1Confidence functions of the formδ\(t\)=αt\\delta\(t\)=\\frac\{\\alpha\}\{t\}

The bound \([11](https://arxiv.org/html/2606.27448#S5.E11)\) encourages to chooseδ\(t\)≍1t\\delta\(t\)\\asymp\\frac\{1\}\{t\}, so that the two terms are of the same order—we recover confidence parameters that are polynomial in1t\\frac\{1\}\{t\}, as suggested by standard stochastic bandits theory\(Lattimore and Szepesvári,[2020](https://arxiv.org/html/2606.27448#bib.bib34)\)\. Therefore, as a first choice for the confidence function, we takeδ\(t\):=αt\\delta\(t\):=\\frac\{\\alpha\}\{t\}for someα∈R\\alpha\\in\\text\{R\}\.

Rewriting the previous upper\-bound and trashing a few negligible terms, we obtain

E\[PR\(T\)\]≲φ\(α\):=C1‖Δ‖1αlog⁡\(T\)\+C2D2\(∑a∉𝒜∗1Δa\)log⁡\(Tα\)\.\\text\{E\}\[\\mathrm\{PR\}\(T\)\]\\lesssim\\varphi\(\\alpha\):=C\_\{1\}\\\|\\Delta\\\|\_\{1\}\\alpha\\log\(T\)\+C\_\{2\}D^\{2\}\\left\(\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\frac\{1\}\{\\Delta\_\{a\}\}\\right\)\\log\\left\(\\frac\{T\}\{\\alpha\}\\right\)\\;\.Now, computing derivative ofφ\\varphiwith respect toα\\alphagives the optimal value:

α∗=C2D2∑a∉𝒜∗1ΔaC1‖Δ‖1log⁡\(T\)\.\\alpha^\{\*\}=\\frac\{C\_\{2\}D^\{2\}\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\frac\{1\}\{\\Delta\_\{a\}\}\}\{C\_\{1\}\\\|\\Delta\\\|\_\{1\}\\log\(T\)\}\\;\.\(12\)We should highlight the presence of thelog⁡\(T\)\\log\(T\)term which emphasis that our class of function forδ\\deltais not fully satisfactory\. However, this remains a suitable choice as long as the horizon is not too large, i\.e\., for horizon that are sub\-exponential, typicallyT≪exp⁡\{D2∑a1Δa/‖Δ‖1\}\\smash\{T\\ll\\exp\\\{D^\{2\}\\sum\_\{a\}\\frac\{1\}\{\\Delta\_\{a\}\}/\\\|\\Delta\\\|\_\{1\}\\\}\}\. For such horizons and in the*two\-armed case*, we get an optimal optimism of order\(D/Δ\)2\(D/\\Delta\)^\{2\}up to numerical values: this is a quartic in the problem parameters, and therefore is large very easily\. For the environments on which the experiments ofLABEL:fig:confidence, we have\(D/Δ\)2=400\(D/\\Delta\)^\{2\}=400already\.

### 5\.2Confidence functions of the formδ\(t\)=βtlog⁡\(t\)\\delta\(t\)=\\frac\{\\beta\}\{t\\log\(t\)\}

The logarithmic term in the denominator ofα∗\\alpha^\{\*\}in[Equation˜12](https://arxiv.org/html/2606.27448#S5.E12)is a bit unsatisfactory\. To correct it, a simple approach is to take a confidence function of the formδ\(t\):=βtlog⁡\(t\)\\delta\(t\):=\\frac\{\\beta\}\{t\\log\(t\)\}forβ∈R\\beta\\in\\text\{R\}\. Repeating the same calculation, we find a pseudo\-regret upper\-bound of the form

E\[PR\(T\)\]≲φ\(β\):=C1‖Δ‖1β∫eT1tlog⁡tdt⏟log⁡log⁡\(T\)\+C2D2\(∑a∉𝒜∗1Δa\)log⁡\(Tlog⁡\(T\)β\)\.\\text\{E\}\[\\mathrm\{PR\}\(T\)\]\\lesssim\\varphi\(\\beta\):=C\_\{1\}\\\|\\Delta\\\|\_\{1\}\\beta\\underbrace\{\\int\_\{\\mathrm\{e\}\}^\{T\}\\frac\{1\}\{t\\log t\}\\;\\mathrm\{d\}t\}\_\{\\log\\log\(T\)\}\+C\_\{2\}D^\{2\}\\left\(\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\frac\{1\}\{\\Delta\_\{a\}\}\\right\)\\log\\left\(\\frac\{T\\log\(T\)\}\{\\beta\}\\right\)\.Optimizingφ\\varphiwith respect toβ\\beta, we obtain

β∗=C2D2∑a∉𝒜∗1ΔaC1‖Δ‖1log⁡log⁡\(T\)\.\\displaystyle\\beta^\{\*\}=\\frac\{C\_\{2\}D^\{2\}\\sum\_\{a\\notin\\mathcal\{A\}^\{\*\}\}\\frac\{1\}\{\\Delta\_\{a\}\}\}\{C\_\{1\}\\\|\\Delta\\\|\_\{1\}\\log\\log\(T\)\}\\;\.Again,β∗\\beta^\{\*\}depends onTT, but the dependency inTTis mild, aslog⁡log⁡\(T\)\\log\\log\(T\)is basically a numerical constant for any reasonable time horizon \(sayT≤1010T\\leq 10^\{10\}\)\.777The reason why we have remaininglog⁡\(−\)\\log\(\-\)terms is that we are solving equations of the formax\+blog⁡\(x\)=cax\+b\\log\(x\)=c, of which the solution involves the[LambertWWfunction](https://en.wikipedia.org/wiki/Lambert_W_function), that cannot be expressed analytically\.We recover the quartic critical value of\(D/Δ\)2\(D/\\Delta\)^\{2\}, as in the parameterizationδ\(t\)=αlog⁡\(t\)\\delta\(t\)=\\frac\{\\alpha\}\{\\log\(t\)\}\.

### 5\.3Comparison with stochastic bandits

The critical value\(D/Δ\)2\(D/\\Delta\)^\{2\}is reminiscent of what is known in stochastic bandits\. Indeed, for such learning problems, it is known that choosing a confidence parameter scaling asδ\(t\)=σ2Δ21t\\smash\{\\delta\(t\)=\\frac\{\\sigma^\{2\}\}\{\\Delta^\{2\}\}\\frac\{1\}\{t\}\}leads to better regret bounds, see for instance\(Lattimore and Szepesvári,[2020](https://arxiv.org/html/2606.27448#bib.bib34), Chapter 9\.2\)\.888Lattimore and Szepesvári \([2020](https://arxiv.org/html/2606.27448#bib.bib34)\)is focusing on the caseσ=1\\sigma=1\. We refer the reader to their discussion in Chapter 9\.2, where they optimize a regret bound of ordernΔδ\+1Δlog⁡\(1δ\)n\\Delta\\delta\+\\frac\{1\}\{\\Delta\}\\log\(\\frac\{1\}\{\\delta\}\)\. Forσ2\\sigma^\{2\}\-subgaussian bandits withσ≠1\\sigma\\neq 1, the analogue of their bound isnΔδ\+σ2Δlog⁡\(1δ\)\\smash\{n\\Delta\\delta\+\\frac\{\\sigma^\{2\}\}\{\\Delta\}\\log\(\\frac\{1\}\{\\delta\}\)\}, that is optimized forδ≈σ2/\(nΔ2\)\\delta\\approx\\sigma^\{2\}/\(n\\Delta^\{2\}\)\.In our setting, the quantityDDplays a similar role to the variance proxy of subgaussian random variables: according to[Proposition˜3](https://arxiv.org/html/2606.27448#Thmtheorem3), the noise on the empirical estimator of the gain scales asDnlog⁡\(1/δ\)D\\sqrt\{n\\log\(1/\\delta\)\}, while it isσ2nlog⁡\(1/δ\)\\sigma^\{2\}\\sqrt\{n\\log\(1/\\delta\)\}forσ2\\sigma^\{2\}\-subgaussian bandits\. However—and this is the crucial difference with stochastic bandits—the noise can have high magnitude even though rewards remain\[0,1\]\[0,1\]\. This is something that is exclusive to Markovian bandits, because by Hoeffding’s Lemma, every random variable supported in\[0,1\]\[0,1\]is12\\frac\{1\}\{2\}\-subgaussian\. Accordingly, Markovian bandits can “simulate” reward distributions with higher noise*yet*values in\[0,1\]\[0,1\], and the optimistic bonuses must be inflated accordingly\.

## 6Conclusion

In this paper, we considered a class of Markovian bandit learning problems that allow for non\-observable states and constrained decision epochs\. We argued that pure policies provide an appropriate benchmark for Markovian bandits by introducing a subclass, self\-degrading bandits, for which they are optimal\. We first provided a lower bound for the regret of rarely\-switching algorithms, revealing that strict logarithmic regret remains unreachable—in opposition to the stochastic bandit case\. We then introducedUCB\-NOMand showed that, despite the non\-observability of the environment’s inner state, non\-trivial regret guarantees can be achieved\. We have further discussed a structural assumption—a prior knowledge of the bias span—under which the classicalO\(log⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\log\(T\)\)instance\-dependent andO\(Tlog⁡\(T\)\)\\operatorname\*\{\\mathrm\{O\}\}\(\\sqrt\{T\\log\(T\)\}\)worst\-case regrets can be recovered, as soon as the delay between decision epochs remain reasonable\.

We conclude by mentioning several possible directions for future work\. A central assumption in our theory is that the state space𝒮a\\mathcal\{S\}\_\{a\}of every arm can be arbitrarily large\. It would be interesting to study what happens when the state space𝒮a\\mathcal\{S\}\_\{a\}is known in advance\. Our preliminary intuition is that in such a setting, the theory would become considerably more intricate\. Another avenue is to explore alternative information settings, for instance when states are organized into clusters, and rewards depend on the cluster of the current state\. Finally, in line with the well\-studied restless bandit framework, it would be interesting to consider scenarios where both activated and non\-activated arms yield rewards, possibly dependent on the action\.

## Acknowledgements

The authors acknowledge the funding of the French National Research Agency under the PEPR IA FOUNDRY project \(ANR\-23\-PEIA\-0003\), the ANR LabEx CIMI \(grant ANR\-11\-LABX\-0040\) within the French State Programme Investissements d’Avenir, the AI Interdisciplinary Institute ANITI, and the PEPR NAI project funded by the France 2030 program under the Grant agreements n°ANR\-23\-IACL\-0002 and n°ANR\-22\-PEFT\-0003\.

## References

- Gambling in a rigged casino: The adversarial multi\-armed bandit problem\.InProceedings of IEEE 36th Annual Foundations of Computer Science,pp\. 322–331\.Cited by:[§3\.2](https://arxiv.org/html/2606.27448#S3.SS2.SSS0.Px3.p2.5)\.
- P\. Auer and R\. Ortner \(2006\)Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning\.Proceedings of the 19th International Conference on Neural Information Processing Systems\(en\)\.Cited by:[§3\.1](https://arxiv.org/html/2606.27448#S3.SS1.p2.1)\.
- P\. Auer \(2002\)Using Confidence Bounds for Exploitation\-Exploration Trade\-offs\.J\. Mach\. Learn\. Res\.3,pp\. 397–422\.Cited by:[§3\.1](https://arxiv.org/html/2606.27448#S3.SS1.p2.1),[§3\.2](https://arxiv.org/html/2606.27448#S3.SS2.SSS0.Px3.p2.5),[§3\.2](https://arxiv.org/html/2606.27448#S3.SS2.SSS0.Px4.p1.1),[§3](https://arxiv.org/html/2606.27448#S3.p2.5)\.
- V\. Boone \(2024\)Optimal regrets in markov decision processes\.Note:PhD thesis, see victor\-boone\.github\.ioExternal Links:[Link](https://victor-boone.github.io/assets/pdf/VBOONE-manuscript-v1.1.pdf)Cited by:[2nd item](https://arxiv.org/html/2606.27448#S3.I1.i2.p1.17)\.
- H\. Bourel, O\. Maillard, and M\. S\. Talebi \(2020\)Tightening Exploration in Upper Confidence Reinforcement Learning\.InProceedings of the 37th International Conference on Machine Learning,H\. D\. III and A\. Singh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.119,pp\. 1056–1066\.Cited by:[§3\.2](https://arxiv.org/html/2606.27448#S3.SS2.SSS0.Px1.3.p3.3)\.
- D\. Jerison \(2013\)General mixing time bounds for finite markov chains via the absolute spectral gap\.External Links:1310\.8021Cited by:[2nd item](https://arxiv.org/html/2606.27448#S3.I1.i2.p1.17)\.
- T\. L\. Lai and H\. Robbins \(1985\)Asymptotically efficient adaptive allocation rules\.Advances in applied mathematics6\(1\),pp\. 4–22\.Cited by:[§3\.1](https://arxiv.org/html/2606.27448#S3.SS1.p1.8),[§3\.1](https://arxiv.org/html/2606.27448#S3.SS1.p2.1)\.
- T\. Lattimore and C\. Szepesvári \(2020\)Bandit algorithms\.Cambridge University Press\.Cited by:[§2\.2](https://arxiv.org/html/2606.27448#S2.SS2.p2.6),[§3\.1](https://arxiv.org/html/2606.27448#S3.SS1.p1.8),[§3\.1](https://arxiv.org/html/2606.27448#S3.SS1.p3.2),[§5\.1](https://arxiv.org/html/2606.27448#S5.SS1.p1.4),[§5\.3](https://arxiv.org/html/2606.27448#S5.SS3.p1.10),[footnote 8](https://arxiv.org/html/2606.27448#footnote8)\.
- C\. Tekin and M\. Liu \(2012\)Online learning of rested and restless bandits\.IEEE Transactions on Information Theory58\(8\),pp\. 5588–5611\.Cited by:[2nd item](https://arxiv.org/html/2606.27448#S3.Ex14.m1.2.2.2.2.2.1),[1st item](https://arxiv.org/html/2606.27448#S3.I1.i1.p1.8),[2nd item](https://arxiv.org/html/2606.27448#S3.I1.i2.p1.17),[§3\.2](https://arxiv.org/html/2606.27448#S3.SS2.SSS0.Px4.p1.1)\.
- J\. Wang, M\. Wang, and L\. F\. Yang \(2022\)Near sample\-optimal reduction\-based policy learning for average reward mdp\.arXiv preprint arXiv:2212\.00603\.Cited by:[2nd item](https://arxiv.org/html/2606.27448#S3.I1.i2.p1.17)\.

## Appendix AGeneral results

## Appendix BRegret lower bounds \([Theorems˜2](https://arxiv.org/html/2606.27448#Thmtheorem2)and[5](https://arxiv.org/html/2606.27448#Thmtheorem5)\)

## Appendix CRegret upper bounds \([Theorems˜4](https://arxiv.org/html/2606.27448#Thmtheorem4),[6](https://arxiv.org/html/2606.27448#Thmtheorem6)and[7](https://arxiv.org/html/2606.27448#Thmtheorem7)\)

Contents
Learning in Markovian bandits with non-observable states and constrained decision epochs

Similar Articles

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

Contextual Slate GLM Bandits with Limited Adaptivity

Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces

Submit Feedback

Similar Articles

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback
Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning
Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity
Contextual Slate GLM Bandits with Limited Adaptivity
Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces