Evolving Robustness--Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
This paper proposes a quantile Bayesian risk-aware MDP framework for online RL that adaptively balances robustness and exploration over time, providing theoretical regret bounds and demonstrating strong empirical performance.
arXiv:2605.24345v1 Announce Type: new
Source: [https://arxiv.org/html/2605.24345](https://arxiv.org/html/2605.24345)

###### Abstract

In online reinforcement learning, data scarcity creates epistemic uncertainty that makes robustness important early in learning, whereas sufficient exploration is needed to learn the true-environment optimal policy. We study this time-varying robustness–exploration trade-off through a quantile Bayesian risk-aware Markov decision process (BR-MDP), in which the quantile level controls how posterior uncertainty enters the Bellman backup. We characterize this control through an asymptotic normality result for the difference between the quantile BR-MDP value and the value in the true environment. The result implies that upper/lower-tail quantiles induce optimism/pessimism towards epistemic uncertainty, and the magnitude of the optimism/pessimism decreases as data accumulate. Building on this characterization, we propose an online Bayesian risk-aware algorithm with an adaptive quantile schedule that emphasizes robustness early and gradually encourages exploration of less-visited state–action pairs. We establish sublinear Bayesian regret bounds with respect to both the true optimal value and the optimal BR-MDP robust value. Numerical experiments demonstrate strong performance in both exploration-demanding and exploration-costly environments.

Keywords: online reinforcement learning; Bayesian risk optimization; Markov decision process

## 1 Introduction

In online reinforcement learning (RL), an agent sequentially interacts with an unknown environment and uses the collected data to estimate that environment and update the policy used in subsequent interactions. Thus, each action affects both the immediate reward and the information available for future decisions. Limited data lead to epistemic uncertainty in estimating environment parameters (Der Kiureghian and Ditlevsen [2009](https://arxiv.org/html/2605.24345#bib.bib10)). This uncertainty is central to the exploration–exploitation trade-off (Jaksch et al. [2010](https://arxiv.org/html/2605.24345#bib.bib20), Osband et al. [2013](https://arxiv.org/html/2605.24345#bib.bib29), Azar et al. [2017](https://arxiv.org/html/2605.24345#bib.bib3), Ma and Lee [2026](https://arxiv.org/html/2605.24345#bib.bib27)): exploitation chooses actions that appear optimal under current estimates to pursue high estimated cumulative reward, whereas exploration collects information to reduce epistemic uncertainty.

Although regions with higher uncertainty offer a greater incentive to explore, acting in such regions is also risky because unreliable estimates induce estimated optimal policies that can perform poorly in the true environment. This risk is most pronounced early in learning, when data are scarce. It is particularly salient in high-stakes settings with limited interaction budgets (Dulac-Arnold et al. [2021](https://arxiv.org/html/2605.24345#bib.bib14)), such as public health intervention problems (Liang et al. [2025](https://arxiv.org/html/2605.24345#bib.bib23)) and inventory or service systems with costly trial-and-error decisions. Therefore, beyond the classical exploration–exploitation consideration, the robustness of the policy used in interactions (which we refer to as the interaction policy throughout the rest of the paper) is also a primary concern early in learning, as it hedges against the risk induced by acting under epistemic uncertainty. As learning progresses and more data are collected, epistemic uncertainty decreases and such unreliable estimates become less likely. Consequently, robustness becomes less essential, whereas exploring less-visited state-action pairs becomes more urgent for learning the optimal policy in the true environment. This yields an intrinsic and time-varying robustness–exploration trade-off: robustness is valuable early in learning, but maintaining a fixed conservative attitude can later hinder the exploration needed to learn the true-environment optimal policy. Thus, the interaction policy should adapt its treatment of epistemic uncertainty over time, being more robust when data are scarce and becoming less conservative as information accumulates.

To account for epistemic uncertainty, a widely used approach is to formulate robust and distributionally robust MDPs and reinforcement learning problems, which optimize worst-case performance over uncertainty or ambiguity sets (Nilim and El Ghaoui [2005](https://arxiv.org/html/2605.24345#bib.bib28), Iyengar [2005](https://arxiv.org/html/2605.24345#bib.bib19), Xu and Mannor [2010](https://arxiv.org/html/2605.24345#bib.bib47), Wiesemann et al. [2013](https://arxiv.org/html/2605.24345#bib.bib45)). However, much of this literature focuses on offline settings, where the historical dataset remains fixed throughout learning or data are accessed through a generative model or simulator (Panaganti and Kalathil [2022](https://arxiv.org/html/2605.24345#bib.bib32), Zhou et al. [2021](https://arxiv.org/html/2605.24345#bib.bib51), Panaganti et al. [2022](https://arxiv.org/html/2605.24345#bib.bib33), Blanchet et al. [2023](https://arxiv.org/html/2605.24345#bib.bib6)). These results show how robust policies can be learned when data are exogenously available, but do not address how an interaction policy should jointly manage robustness and exploration as data accumulate.

Recent online robust RL works are closer to the present setting (Wang and Zou [2021](https://arxiv.org/html/2605.24345#bib.bib44), Badrinath and Kalathil [2021](https://arxiv.org/html/2605.24345#bib.bib4), Dong et al. [2022](https://arxiv.org/html/2605.24345#bib.bib11), Lu et al. [2024](https://arxiv.org/html/2605.24345#bib.bib26), Wang and Zhou [2023](https://arxiv.org/html/2605.24345#bib.bib42), Ghosh et al. [2026](https://arxiv.org/html/2605.24345#bib.bib17), Wang and Zhou [2025](https://arxiv.org/html/2605.24345#bib.bib43)). Some of these studies rely on an external exploratory behavior policy to interact with the true environment (Wang and Zou [2021](https://arxiv.org/html/2605.24345#bib.bib44), Badrinath and Kalathil [2021](https://arxiv.org/html/2605.24345#bib.bib4), Wang and Zhou [2023](https://arxiv.org/html/2605.24345#bib.bib42)). However, such a behavior policy does not account for robustness during interaction. Another line of work encourages exploration by adding explicit visit-count-dependent bonuses to robust Bellman backups before selecting the interaction policy (Dong et al. [2022](https://arxiv.org/html/2605.24345#bib.bib11), Lu et al. [2024](https://arxiv.org/html/2605.24345#bib.bib26), Ghosh et al. [2026](https://arxiv.org/html/2605.24345#bib.bib17), Wang and Zhou [2025](https://arxiv.org/html/2605.24345#bib.bib43)). These bonuses take larger values early in learning, so exploration can have a greater effect on action selection than the robust value estimate. These methods therefore do not directly ensure that the interaction policy is robust. Moreover, these methods aim to learn an optimal robust policy, whereas our goal is to learn the optimal policy in the true environment while ensuring that the interaction policy is robust, especially early in learning.
This leaves open how the interaction policy should adapt its risk attitude over time: being more robust when data are scarce, but becoming less conservative as more data are collected so that the agent can explore enough to learn the optimal policy in the true environment.

The Bayesian risk-aware Markov decision process (BR-MDP) provides a starting point for addressing this question. It models unknown parameters in the true environment through a posterior distribution and imposes risk measures on future rewards at each transition step, resulting in a data-adaptive stochastic model (Wu et al. [2018](https://arxiv.org/html/2605.24345#bib.bib46), Lin et al. [2022](https://arxiv.org/html/2605.24345#bib.bib24), Wang and Zhou [2023](https://arxiv.org/html/2605.24345#bib.bib42), Lin and Zhou [2025](https://arxiv.org/html/2605.24345#bib.bib25)). Recently, Wang and Zhou ([2025](https://arxiv.org/html/2605.24345#bib.bib43)) incorporated lower-tail CVaR as the risk measure in the BR-MDP, so that solving the resulting BR-MDP yields a risk-averse policy. Based on this formulation, they proposed an online Bayesian risk-averse algorithm to learn an optimal robust policy under the Bayesian risk criterion. However, its regret analysis also relies on an additional exploration bonus that is large when visit counts are small. As a result, the interaction policy may not have the desired robustness to epistemic uncertainty early in learning.

To rigorously characterize the robustness–exploration trade-off in online RL, we first study the BR-MDP formulation with a fixed $\alpha$-quantile as the risk measure, referred to as the $\alpha$-quantile BR-MDP. Our theoretical analysis shows how the $\alpha$-quantile BR-MDP explicitly controls robustness to epistemic uncertainty while maintaining exploration. The optimal policy of the $\alpha$-quantile BR-MDP can be more robust or more exploratory, depending on the quantile level $\alpha$. Therefore, we propose the adaptive quantile BR-MDP (AQ-BRMDP), which uses an adaptive quantile schedule to respond to the evolving robustness–exploration trade-off in online RL. Early in learning, the schedule sets the quantile level to emphasize lower-tail evaluations of the cumulative future rewards, yielding a more robust policy. As more data are collected and the posterior concentrates, the schedule gradually increases the quantile level, especially for less-visited state-action pairs that remain important for learning the optimal policy in the true environment, thereby encouraging exploration toward those pairs. In the discounted infinite-horizon setting, we implement this idea through pseudo-episode construction, as in Xu et al. ([2024](https://arxiv.org/html/2605.24345#bib.bib48)), which partitions the interaction process into intervals with random lengths according to the discount factor. Specifically, at the beginning of each pseudo-episode, we update the posterior belief and quantile schedule, then solve the corresponding $\alpha$-quantile BR-MDP, and finally execute the resulting policy until the next update. This yields an implementable online procedure that adapts the treatment of epistemic uncertainty to the stage of learning and the collected data.
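The pseudo-episode construction described above can be sketched as follows. The text only says that interval lengths are random and governed by the discount factor; the geometric law with continuation probability $\gamma$ used here is an assumption made for illustration, and `pseudo_episode_lengths` is a hypothetical helper name. The posterior update, quantile schedule, and BR-MDP solve that would happen at the start of each interval are not shown.

```python
import random


def pseudo_episode_lengths(total_steps, gamma, rng):
    """Split `total_steps` interactions into consecutive pseudo-episodes.

    Assumption (not stated in the text): each length is geometric with
    continuation probability `gamma`, so the mean length is 1 / (1 - gamma).
    The final pseudo-episode is truncated to fit the interaction budget.
    """
    lengths = []
    remaining = total_steps
    while remaining > 0:
        length = 1
        while rng.random() < gamma:  # keep going with probability gamma
            length += 1
        length = min(length, remaining)
        lengths.append(length)
        remaining -= length
    return lengths


rng = random.Random(0)
lengths = pseudo_episode_lengths(total_steps=100_000, gamma=0.9, rng=rng)
mean_length = sum(lengths) / len(lengths)
print(len(lengths), mean_length)
```

Under this assumption, a larger discount factor gives longer intervals on average, so the policy is recomputed less often.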

We summarize the main contributions of this paper as follows.

1. We formulate the $\alpha$-quantile BR-MDP and show that the quantile level can explicitly control the trade-off between robustness and exploration. We characterize this trade-off through an analytical result: the difference between the value function of the $\alpha$-quantile BR-MDP and the original value function is asymptotically normal. The magnitude of the mean of this limiting distribution increases as the quantile level moves farther into the lower tail, inducing a more robust policy, or farther into the upper tail, inducing a more exploratory policy. It decreases as more data are collected, at a rate of $O(1/\sqrt{N})$, where $N$ is the total number of data points used to estimate the posterior distribution.
2. We use the property above to design an online Bayesian risk-aware algorithm, AQ-BRMDP, that combines adaptive quantile scheduling with pseudo-episodic posterior updates in the discounted infinite-horizon setting. The quantile schedule depends on both the visit counts across state-action pairs and the stage of learning, so AQ-BRMDP can adapt to the evolving robustness–exploration trade-off in online RL settings.
3. Theoretically, we establish that AQ-BRMDP has Bayesian regret bounds of order $\widetilde{O}(\sqrt{T})$, where $T$ is the total number of interactions, with respect to two benchmarks: the optimal value in the true environment and the optimal robust value. Numerically, we evaluate the performance of AQ-BRMDP in two environments: one that requires sustained exploration, and another in which exploratory mistakes are costly. We also implement an extension of the proposed algorithm to continuous state spaces and evaluate its empirical performance in a continuous-state environment.

### 1.1 Related Work

Classical online reinforcement learning studies the exploration–exploitation trade-off for efficiently learning the optimal policy in the true environment. A broad class of optimism-based methods addresses this trade-off through optimistic initialization, optimistic models, confidence sets, or explicit exploration bonuses (Kearns and Singh [2002](https://arxiv.org/html/2605.24345#bib.bib22), Brafman and Tennenholtz [2002](https://arxiv.org/html/2605.24345#bib.bib8), Strehl et al. [2006](https://arxiv.org/html/2605.24345#bib.bib38), Strehl and Littman [2008](https://arxiv.org/html/2605.24345#bib.bib39), Jaksch et al. [2010](https://arxiv.org/html/2605.24345#bib.bib20), Bartlett and Tewari [2009](https://arxiv.org/html/2605.24345#bib.bib5), Azar et al. [2017](https://arxiv.org/html/2605.24345#bib.bib3), He et al. [2021](https://arxiv.org/html/2605.24345#bib.bib18), Jin et al. [2018](https://arxiv.org/html/2605.24345#bib.bib21), Dong et al. [2020](https://arxiv.org/html/2605.24345#bib.bib12)). Posterior-sampling methods instead introduce exploration through the randomness of models sampled from a Bayesian posterior and are typically analyzed through Bayesian regret (Osband et al. [2013](https://arxiv.org/html/2605.24345#bib.bib29), Abbasi-Yadkori and Szepesvári [2015](https://arxiv.org/html/2605.24345#bib.bib1), Russo et al. [2018](https://arxiv.org/html/2605.24345#bib.bib37), Xu et al. [2024](https://arxiv.org/html/2605.24345#bib.bib48)). Related posterior-inference and posterior-quantile methods also use posterior information to direct exploration toward actions whose values remain uncertain (Tiapkin et al. [2022](https://arxiv.org/html/2605.24345#bib.bib41), Tarbouriech et al. [2023](https://arxiv.org/html/2605.24345#bib.bib40), Ma and Lee [2026](https://arxiv.org/html/2605.24345#bib.bib27)).
This literature primarily uses uncertainty to improve learning of the true-environment optimal policy; it does not explicitly control the interaction risk of exploratory actions when epistemic uncertainty is high.

Robust and distributionally robust reinforcement learning originate from robust MDP formulations, where the objective is to optimize performance under worst-case models or over ambiguity sets (Nilim and El Ghaoui [2005](https://arxiv.org/html/2605.24345#bib.bib28), Iyengar [2005](https://arxiv.org/html/2605.24345#bib.bib19), Xu and Mannor [2010](https://arxiv.org/html/2605.24345#bib.bib47), Wiesemann et al. [2013](https://arxiv.org/html/2605.24345#bib.bib45)). This line has developed into learning algorithms for generative-model and offline-data settings (Panaganti and Kalathil [2022](https://arxiv.org/html/2605.24345#bib.bib32), Zhou et al. [2021](https://arxiv.org/html/2605.24345#bib.bib51), Panaganti et al. [2022](https://arxiv.org/html/2605.24345#bib.bib33), Blanchet et al. [2023](https://arxiv.org/html/2605.24345#bib.bib6)), and more recently into online robust RL with interactive data collection (Wang and Zou [2021](https://arxiv.org/html/2605.24345#bib.bib44), Badrinath and Kalathil [2021](https://arxiv.org/html/2605.24345#bib.bib4), Dong et al. [2022](https://arxiv.org/html/2605.24345#bib.bib11), Lu et al. [2024](https://arxiv.org/html/2605.24345#bib.bib26), Ghosh et al. [2026](https://arxiv.org/html/2605.24345#bib.bib17)). These works primarily account for model misspecification and aim to learn robust-optimal policies, but they do not directly address how the behavior policy should balance robustness with the exploration needed to learn the optimal policy in the true environment.

Our work bridges these two lines of work: like robust RL, it accounts for epistemic uncertainty to improve the robustness of the interaction policy; like online RL methods designed for efficient exploration, it ultimately aims to learn the optimal policy in the true environment. The key idea of our work is to use an $\alpha$-quantile Bayesian risk criterion to interpolate between the robustness needed early in learning and the exploration needed later to learn the true-environment optimal policy, rather than treating robustness and exploration as separate mechanisms.

Our paper builds on Bayesian risk optimization and the BR-MDP formulation. Bayesian risk optimization was introduced by Zhou and Xie ([2015](https://arxiv.org/html/2605.24345#bib.bib50)) as a flexible alternative to worst-case robustness in static (one-stage) stochastic optimization, and its statistical properties were established by Wu et al. ([2018](https://arxiv.org/html/2605.24345#bib.bib46)). This perspective was extended to sequential (multi-stage) decision making through the BR-MDP, spanning finite- and infinite-horizon formulations as well as offline and online learning settings (Lin et al. [2022](https://arxiv.org/html/2605.24345#bib.bib24), Lin and Zhou [2025](https://arxiv.org/html/2605.24345#bib.bib25), Wang and Zhou [2023](https://arxiv.org/html/2605.24345#bib.bib42), [2025](https://arxiv.org/html/2605.24345#bib.bib43)). On a related note, a broader Bayesian RL literature studies how posterior uncertainty can be represented and used in sequential decision making (Ghavamzadeh et al. [2015](https://arxiv.org/html/2605.24345#bib.bib16)). In particular, Bayes-adaptive formulations augment the decision state with posterior beliefs and optimize the Bayes-adaptive expected return (Duff [2002](https://arxiv.org/html/2605.24345#bib.bib13), Poupart et al. [2006](https://arxiv.org/html/2605.24345#bib.bib34), Ross et al. [2007](https://arxiv.org/html/2605.24345#bib.bib36)).

Finally, our paper should be distinguished from safe exploration. In this literature, risk is intrinsic to the true environment, i.e., the possibility that certain actions lead to inherently undesirable outcomes (Garcelon et al. [2020](https://arxiv.org/html/2605.24345#bib.bib15), Yamagata and Santos-Rodríguez [2024](https://arxiv.org/html/2605.24345#bib.bib49)). In contrast, the risk we consider does not arise from the environment itself, but from epistemic uncertainty: limited data can cause the agent to act on unreliable estimates of the environment. This risk diminishes as more data are collected.

## 2 Bayesian Risk-Aware MDPs

### 2.1 $\alpha$-quantile Bayesian Risk MDPs

#### Unknown True Environment\.

We consider an infinite-horizon discounted Markov decision process (MDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},\gamma,r,P^{c})$, where $\mathcal{S}$ and $\mathcal{A}$ are finite state and action spaces with $|\mathcal{S}|=S$ and $|\mathcal{A}|=A$, $\gamma\in(0,1)$ is the discount factor, $r:\mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function, and $P^{c}(s'\mid s,a)$ is the true but unknown transition kernel. For notational simplicity, we write $P^{c}_{s,a}(\cdot):=P^{c}(\cdot\mid s,a)$.

Let $\Pi$ be the set of Markovian policies $\pi=\{\pi_{t}\}_{t\geq 0}$, where each $\pi_{t}(\cdot\mid s)$ is a probability distribution over $\mathcal{A}$ for every $s\in\mathcal{S}$. Given an initial state $s_{0}=s$, the objective is to maximize the expected total discounted reward

$$
\sup_{\pi\in\Pi}\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s\right],\quad\forall s\in\mathcal{S}, \tag{1}
$$

where $a_{t}\sim\pi_{t}(\cdot\mid s_{t})$ and $s_{t+1}\sim P^{c}(\cdot\mid s_{t},a_{t})$. For a fixed policy $\pi\in\Pi$, the objective in ([1](https://arxiv.org/html/2605.24345#S2.E1)) defines the value function of $\pi$ at state $s$:

$$
V^{\pi}(s):=\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\middle|\,s_{0}=s\right],\qquad\forall s\in\mathcal{S}. \tag{2}
$$

Accordingly, the optimal value function is $V^{*}(s):=\sup_{\pi\in\Pi}V^{\pi}(s)$, $\forall s\in\mathcal{S}$, and a policy $\pi^{*}$ is optimal if $V^{\pi^{*}}(s)=V^{*}(s)$ for all $s\in\mathcal{S}$.

For tabular MDPs, there exists a stationary deterministic optimal policy (Puterman [1994](https://arxiv.org/html/2605.24345#bib.bib35)). Thus, without loss of optimality, we henceforth restrict attention to stationary deterministic policies of the form $\pi:\mathcal{S}\to\mathcal{A}$. For any such policy, the value function satisfies the Bellman equation, $\forall s\in\mathcal{S}$,

$$
V^{\pi}(s)=r\bigl(s,\pi(s)\bigr)+\gamma\,\mathbb{E}_{P^{c}_{s,\pi(s)}}\left[V^{\pi}(s')\right], \tag{3}
$$

where $\mathbb{E}_{P}[V^{\pi}(s')]=\sum_{s'\in\mathcal{S}}P(s')V^{\pi}(s')=P^{\top}V^{\pi}$ is the expected value (EV) of the successor-state value function with respect to the distribution $P$. The optimal value function $V^{*}$ satisfies the Bellman optimality equation, $\forall s\in\mathcal{S}$,

$$
V^{*}(s)=\max_{a\in\mathcal{A}}\left\{r(s,a)+\gamma\,\mathbb{E}_{P^{c}_{s,a}}\left[V^{*}(s')\right]\right\}. \tag{4}
$$

In practice, the transition kernel $P^{c}$ is unknown and must be learned from data, which introduces epistemic uncertainty. We assume throughout that the reward function $r$ is known, but the proposed approach can be extended to settings with unknown rewards.
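As a point of reference for the known-kernel case, the Bellman optimality equation (4) can be solved by standard value iteration. The sketch below is a minimal illustration, not code from the paper: the two-state kernel `P`, the rewards `r`, and the function names are made up for the example.

```python
def q_values(P, r, gamma, V):
    """One-step lookahead r(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    return [
        [r[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
         for a in range(len(P[s]))]
        for s in range(len(P))
    ]


def value_iteration(P, r, gamma, tol=1e-10, max_iter=100_000):
    """Iterate the Bellman optimality operator until the sup-norm change is below tol."""
    V = [0.0] * len(P)
    for _ in range(max_iter):
        V_new = [max(q_s) for q_s in q_values(P, r, gamma, V)]
        done = max(abs(x - y) for x, y in zip(V_new, V)) < tol
        V = V_new
        if done:
            break
    # Greedy stationary deterministic policy with respect to the final values.
    Q = q_values(P, r, gamma, V)
    policy = [max(range(len(q_s)), key=q_s.__getitem__) for q_s in Q]
    return V, policy


# Toy kernel (made up for illustration): state 1 is absorbing and pays reward 1.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: both actions stay in state 1
]
r = [[0.0, 0.0], [1.0, 1.0]]
V_star, pi_star = value_iteration(P, r, gamma=0.9)
print(V_star, pi_star)
```

In this toy example the absorbing state is worth $1/(1-\gamma)=10$, and moving there from state 0 is worth $\gamma\cdot 10=9$, so the greedy policy moves.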

#### Bayesian Modeling of the Transition Kernel\.

To model epistemic uncertainty in the unknown transition kernel $P^{c}$, we adopt a Bayesian approach. Specifically, we model each unknown transition vector $P^{c}_{s,a}$ by a Bayesian random vector $P_{s,a}$ and place independent Dirichlet conjugate priors on $\{P_{s,a}\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$. After observing a trajectory, the posterior of each $P_{s,a}$ remains Dirichlet by conjugacy, with a parameter vector determined by the prior and the transition counts along the observed trajectory. Details of the Dirichlet definition and posterior update are provided in Appendix [EC.1](https://arxiv.org/html/2605.24345#A1).

When analyzing the BR-MDP under the current posterior, we write $P_{s,a}\sim\mathrm{Dir}(\phi(s,a))$, where $\phi(s,a)=(\phi(s,a,s'))_{s'\in\mathcal{S}}$ is the current posterior parameter vector. We also write $\phi=\{\phi(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ for the collection of current posterior parameters. The dependence of $\phi$ on the prior and the observed trajectory is suppressed for notational simplicity in this section.
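The conjugate update can be illustrated with a short sketch. It uses the standard Dirichlet-multinomial rule (add one to $\phi(s,a,s')$ for each observed transition), which is assumed here to match the update in Appendix EC.1; the function names, the uniform prior, and the transitions are hypothetical.

```python
from copy import deepcopy


def update_dirichlet(phi, transitions):
    """Return posterior parameters after adding one count per observed (s, a, s') triple."""
    posterior = deepcopy(phi)  # leave the prior parameters untouched
    for s, a, s_next in transitions:
        posterior[s][a][s_next] += 1.0
    return posterior


def posterior_mean(phi_sa):
    """Mean of Dir(phi_sa): each parameter divided by the sum of the parameters."""
    total = sum(phi_sa)
    return [x / total for x in phi_sa]


S, A = 2, 2
# Hypothetical uniform prior Dir(1, ..., 1) for every state-action pair.
prior = [[[1.0] * S for _ in range(A)] for _ in range(S)]
# Hypothetical observed transitions (s, a, s').
transitions = [(0, 1, 1), (1, 0, 1), (1, 0, 0), (0, 1, 1)]
phi = update_dirichlet(prior, transitions)
print(phi[0][1], posterior_mean(phi[0][1]))
```

State-action pairs that were never visited keep their prior parameters, which is why their posteriors stay dispersed.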

#### Bayesian Risk Value Function\.

To account for epistemic uncertainty represented by the current posterior over transition kernels, we evaluate policies by applying a risk measure with respect to this posterior distribution\.

For a policy $\pi$, the value function of the BR-MDP with posterior parameter $\phi$ is defined through the Bellman equation (Wang and Zhou [2023](https://arxiv.org/html/2605.24345#bib.bib42), [2025](https://arxiv.org/html/2605.24345#bib.bib43)), $\forall s\in\mathcal{S}$,

$$
V^{\pi}_{\phi,\alpha}(s)=r\bigl(s,\pi(s)\bigr)+\gamma\,\rho^{\alpha}_{\phi(s,\pi(s))}\bigl(P^{\top}V^{\pi}_{\phi,\alpha}\bigr), \tag{5}
$$

where $\rho^{\alpha}_{\phi(s,\pi(s))}(\cdot)$ is a risk measure at level $\alpha\in(0,1)$ with respect to $\mathrm{Dir}(\phi(s,\pi(s)))$. Notably, the Bayesian risk value function $V^{\pi}_{\phi,\alpha}$ depends on the posterior parameter $\phi$. As new data are observed, $\phi$ is updated via ([EC.1](https://arxiv.org/html/2605.24345#A1.E1)), which changes the posterior distribution of the transition kernel and the associated risk evaluation. Consequently, both the value function and the induced optimal policy evolve over time, reflecting the agent's updated belief about the environment.

As shown by Wang and Zhou ([2023](https://arxiv.org/html/2605.24345#bib.bib42), [2025](https://arxiv.org/html/2605.24345#bib.bib43)), the value function of the BR-MDP also admits an equivalent nested representation:

$$
V^{\pi}_{\phi,\alpha}(s_{0})=r(s_{0},a_{0})+\gamma\,\rho^{\alpha}_{\phi(s_{0},a_{0})}\Bigl(\mathbb{E}_{P^{1}}\Bigl[r(s_{1},a_{1})+\gamma\,\rho^{\alpha}_{\phi(s_{1},a_{1})}\Bigl(\mathbb{E}_{P^{2}}\bigl[r(s_{2},a_{2})+\cdots\bigr]\Bigr)\Bigr]\Bigr), \tag{6}
$$

where, independently across stages, $P^{t}\sim\mathrm{Dir}(\phi(s_{t-1},a_{t-1}))$ and $s_{t}\sim P^{t}$ for $t\geq 1$, and $a_{t}=\pi(s_{t})$ for $t\geq 0$. Moreover, the optimal value function can be computed via the following Bayesian risk Bellman optimality equation: $\forall s\in\mathcal{S}$,

$$
V^{*}_{\phi,\alpha}(s)=\max_{a\in\mathcal{A}}\Bigl\{r(s,a)+\gamma\,\rho^{\alpha}_{\phi(s,a)}\bigl(P^{\top}V^{*}_{\phi,\alpha}\bigr)\Bigr\}. \tag{7}
$$

By solving ([7](https://arxiv.org/html/2605.24345#S2.E7)), we can obtain an optimal policy, $\pi^{*}_{\phi,\alpha}$, that is deterministic and stationary.

We also define the corresponding Bellman operator under the policy $\pi$ and the optimal Bellman operator as, $\forall s\in\mathcal{S}$,

$$
\begin{aligned}
(\mathcal{T}^{\pi}_{\phi,\alpha}V)(s)&:=r\bigl(s,\pi(s)\bigr)+\gamma\,\rho^{\alpha}_{\phi(s,\pi(s))}\bigl(P^{\top}V\bigr), &\text{(8)}\\
(\mathcal{T}^{*}_{\phi,\alpha}V)(s)&:=\max_{a\in\mathcal{A}}\Bigl\{r(s,a)+\gamma\,\rho^{\alpha}_{\phi(s,a)}\bigl(P^{\top}V\bigr)\Bigr\}. &\text{(9)}
\end{aligned}
$$

Under structural conditions on the risk measure discussed by Wang and Zhou ([2023](https://arxiv.org/html/2605.24345#bib.bib42)), $\mathcal{T}^{\pi}_{\phi,\alpha}$ and $\mathcal{T}^{*}_{\phi,\alpha}$ are $\gamma$-contractions under $\|\cdot\|_{\infty}$ and therefore admit unique fixed points.

#### $\alpha$-quantile BR-MDP.

We specialize the risk measure $\rho^{\alpha}(\cdot)$ to the $\alpha$-quantile, thereby obtaining an $\alpha$-quantile BR-MDP. We refer to $V^{\pi}_{\phi,\alpha}$ as the corresponding $\alpha$-quantile BR-MDP value function. Specifically, for a random variable $X$ and $\alpha\in(0,1)$, the $\alpha$-quantile of $X$ is defined as $\rho^{\alpha}(X):=\inf\{z\in\mathbb{R}:\mathbb{P}(X\leq z)\geq\alpha\}$. Under this choice, the quantile level $\alpha$ determines which part of the posterior distribution of $P_{s,a}^{\top}V$ is emphasized in the Bellman backup, thereby encoding a risk attitude toward epistemic uncertainty.
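One way to make the quantile backup concrete is to approximate the posterior quantile by Monte Carlo and iterate the resulting operator. This is a sketch under stated assumptions rather than the paper's solver: the sampling scheme, the sample size, the posterior parameters `phi`, and all function names are hypothetical. The posterior draws are fixed before iterating so that every sweep applies the same sampled operator.

```python
import math
import random


def sample_dirichlet(phi_sa, rng):
    """Draw one probability vector from Dir(phi_sa) via normalized gamma variates."""
    g = [rng.gammavariate(k, 1.0) for k in phi_sa]
    total = sum(g)
    return [x / total for x in g]


def quantile(values, alpha):
    """Empirical counterpart of inf{z : P(X <= z) >= alpha}."""
    xs = sorted(values)
    k = max(1, math.ceil(alpha * len(xs)))
    return xs[k - 1]


def quantile_value_iteration(phi, r, gamma, alpha, seed=0, n_samples=500, n_iter=200):
    """Iterate a sampled version of the alpha-quantile Bellman optimality operator."""
    rng = random.Random(seed)
    S, A = len(phi), len(phi[0])
    # Posterior draws are fixed once so that every sweep applies the same operator.
    draws = [[[sample_dirichlet(phi[s][a], rng) for _ in range(n_samples)]
              for a in range(A)] for s in range(S)]
    V = [0.0] * S
    for _ in range(n_iter):
        V = [
            max(
                r[s][a] + gamma * quantile(
                    [sum(p * v for p, v in zip(P, V)) for P in draws[s][a]], alpha)
                for a in range(A)
            )
            for s in range(S)
        ]
    return V


# Hypothetical posterior parameters phi[s][a][s'] and known rewards r[s][a].
phi = [[[20.0, 1.0], [1.0, 3.0]], [[1.0, 5.0], [1.0, 5.0]]]
r = [[0.0, 0.0], [1.0, 1.0]]
V_low = quantile_value_iteration(phi, r, gamma=0.9, alpha=0.1)
V_high = quantile_value_iteration(phi, r, gamma=0.9, alpha=0.9)
print(V_low, V_high)
```

Because both calls use the same seed and hence the same posterior draws, the lower-tail values cannot exceed the upper-tail values, which mirrors the pessimistic versus optimistic reading of $\alpha$.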

To illustrate the effect of the quantile level $\alpha$, we define the one-step adjustment induced by replacing the true EV of successor states with its posterior $\alpha$-quantile as follows:

$$
b^{\alpha}_{\phi(s,a)}(V):=\rho^{\alpha}_{\phi(s,a)}\bigl(P^{\top}V\bigr)-\mathbb{E}_{P^{c}_{s,a}}[V(s')].
$$

The optimal Bellman operator can then be written as

$$
(\mathcal{T}^{*}_{\phi,\alpha}V)(s)=\max_{a\in\mathcal{A}}\Bigl\{r(s,a)+\gamma\,\mathbb{E}_{P^{c}_{s,a}}[V(s')]+\gamma\,b^{\alpha}_{\phi(s,a)}(V)\Bigr\},\qquad\forall s\in\mathcal{S}.
$$

This representation decomposes the Bellman backup into $r(s,a)+\gamma\,\mathbb{E}_{P^{c}_{s,a}}[V(s')]$, the backup under the true transition kernel for $(s,a)$, and the one-step adjustment $b^{\alpha}_{\phi(s,a)}(V)$, which captures both the risk attitude and the belief about epistemic uncertainty in the transition kernel.

We next briefly discuss how the risk level $\alpha$ and the posterior parameter $\phi$ together affect the optimal policy from a Bayesian perspective. Conditional on the current posterior parameter $\phi(s,a)$, the unknown true transition vector $P^{c}_{s,a}$ is viewed as a draw from the posterior $\mathrm{Dir}(\phi(s,a))$ and $b^{\alpha}_{\phi(s,a)}(V)=\rho^{\alpha}_{\phi(s,a)}\bigl(P^{\top}V\bigr)-(P^{c}_{s,a})^{\top}V$. Therefore, by the definition of the $\alpha$-quantile, the posterior probability of $b^{\alpha}_{\phi(s,a)}(V)\geq 0$ is at least $\alpha$. For large $\alpha$, the Bellman backup in ([7](https://arxiv.org/html/2605.24345#S2.E7)) emphasizes the upper tail of the posterior distribution of $P_{s,a}^{\top}V$. In this case, $b^{\alpha}_{\phi(s,a)}(V)$ is positive with high posterior probability $\alpha$ and acts as an implicit bonus for exploration, reflecting an optimistic attitude toward epistemic uncertainty about the transition kernel. Moreover, for such large $\alpha$, the magnitude of $b^{\alpha}_{\phi(s,a)}(V)$ increases with larger epistemic uncertainty (i.e., a more dispersed posterior distribution of $P_{s,a}$). Therefore, $b^{\alpha}_{\phi(s,a)}(V)$ acts as an exploration incentive for state-action pairs $(s,a)$ with high epistemic uncertainty about $P_{s,a}$. Conversely, when $\alpha$ is small, $b^{\alpha}_{\phi(s,a)}(V)$ is negative with high posterior probability $1-\alpha$ and its magnitude increases with larger epistemic uncertainty. As a result, the one-step adjustment $b^{\alpha}_{\phi(s,a)}(V)$ acts as an implicit penalty for epistemic uncertainty about $P_{s,a}$ and thereby yields a policy that is more robust to epistemic uncertainty.
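A two-successor special case makes this discussion concrete. Suppose, purely for illustration, that $(s,a)$ can only lead to $s_{1}$ or $s_{2}$ with $V(s_{1})\geq V(s_{2})$ and posterior parameters $\phi(s,a)=(\phi_{1},\phi_{2})$. The backup is then a nondecreasing affine function of a Beta variable, so its quantile is available in closed form. The last line is a heuristic normal approximation for large $\phi_{1}+\phi_{2}$, with $z_{\alpha}$ the standard normal $\alpha$-quantile; it is not the paper's asymptotic result.

```latex
% Illustrative two-successor case; the normal approximation is a heuristic.
\begin{aligned}
P^{\top}V &= V(s_2) + \bigl(V(s_1)-V(s_2)\bigr)\,p,
  \qquad p \sim \mathrm{Beta}(\phi_1,\phi_2),\\
\rho^{\alpha}_{\phi(s,a)}\bigl(P^{\top}V\bigr)
  &= V(s_2) + \bigl(V(s_1)-V(s_2)\bigr)\,q_{\alpha},
  \qquad q_{\alpha} := \rho^{\alpha}(p),\\
q_{\alpha} &\approx m + z_{\alpha}\sqrt{\frac{m(1-m)}{\phi_1+\phi_2+1}},
  \qquad m=\frac{\phi_1}{\phi_1+\phi_2}.
\end{aligned}
```

Relative to the posterior mean of $P^{\top}V$, the quantile backup therefore shifts by roughly $\bigl(V(s_{1})-V(s_{2})\bigr)\,z_{\alpha}\sqrt{m(1-m)/(\phi_{1}+\phi_{2}+1)}$: positive for $\alpha>1/2$, negative for $\alpha<1/2$, and shrinking like the inverse square root of the accumulated counts.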

Overall, when $b^{\alpha}_{\phi(s,a)}(V)$ is positive, it plays a role analogous to an optimistic exploration bonus in online RL methods such as UCBVI-$\gamma$ (He et al. [2021](https://arxiv.org/html/2605.24345#bib.bib18)), but it is induced by the posterior quantile rather than added as an explicit term derived from concentration inequalities. This adjustment can also be negative when the quantile level is chosen in the lower tail, thereby acting as a pessimistic penalty.
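The sign and size of the one-step adjustment can be checked numerically. The following is a minimal sketch, assuming a toy three-state successor distribution, a fixed value vector, and a Monte Carlo estimate of the posterior quantile; the function name `one_step_adjustment` and all numbers are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def one_step_adjustment(phi, p_true, v, alpha, n_samples=20000, seed=0):
    """Monte Carlo estimate of b = quantile_alpha(P^T V) - (P^c)^T V with P ~ Dir(phi)."""
    rng = np.random.default_rng(seed)
    kernels = rng.dirichlet(phi, size=n_samples)   # posterior draws, shape (n_samples, S)
    backups = kernels @ v                          # P^T V for each posterior draw
    return float(np.quantile(backups, alpha) - p_true @ v)

# Hypothetical true transition vector and value vector (illustrative numbers).
p_true = np.array([0.2, 0.5, 0.3])
v = np.array([0.0, 1.0, 2.0])

# Uniform Dirichlet prior (all ones) plus idealized counts proportional to p_true.
phi_few = 1.0 + 10 * p_true      # few observations: dispersed posterior
phi_many = 1.0 + 1000 * p_true   # many observations: concentrated posterior

bonus_few = one_step_adjustment(phi_few, p_true, v, alpha=0.9)
bonus_many = one_step_adjustment(phi_many, p_true, v, alpha=0.9)
penalty_few = one_step_adjustment(phi_few, p_true, v, alpha=0.1)
penalty_many = one_step_adjustment(phi_many, p_true, v, alpha=0.1)

print("alpha=0.9:", bonus_few, bonus_many)
print("alpha=0.1:", penalty_few, penalty_many)
```

In this toy setting the upper-tail adjustment is positive, the lower-tail adjustment is negative, and both shrink in magnitude as the posterior concentrates, which matches the qualitative discussion above.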

### 2.2 Asymptotic Analysis of $\alpha$-quantile BR-MDP

In the previous subsection, we introduced the $\alpha$-quantile BR-MDP to account for epistemic uncertainty in the unknown transition kernel and discussed how the quantile level $\alpha$ reflects a risk attitude toward this uncertainty. In this subsection, we further refine this interpretation by studying the asymptotic behavior of the gap between the $\alpha$-quantile BR-MDP value function $V^{\pi}_{\phi,\alpha}$ and the true value function $V^{\pi}$ as the Dirichlet posterior concentrates. Let $P_{\pi}^{c}:=\bigl(P^{c}_{s,\pi(s)}\bigr)_{s\in\mathcal{S}}$ denote the true transition matrix induced by policy $\pi$, and let $\operatorname{diag}(x)$ denote the diagonal matrix with diagonal entries given by the entries of $x$. We impose the following regularity conditions on the data-collection scheme.

###### Assumption 1\.

Fix a stationary deterministic policy $\pi$.

1. (i) Let $N$ denote the total number of observed transitions. The data are collected along an on-policy trajectory $\{s_{0},a_{0},s_{1},\ldots,s_{N-1},a_{N-1},s_{N}\}$, where $a_{i}=\pi(s_{i})$ and $s_{i+1}\sim P^{c}_{s_{i},a_{i}}$ for $i=0,\ldots,N-1$. Define the visit count of each state-action pair by $N(s,a):=\sum_{i=0}^{N-1}\mathbb{I}\{s_{i}=s,a_{i}=a\}$. For each $s\in\mathcal{S}$, there exists a constant $\bar{n}_{s}\in(0,1)$ such that $\frac{N(s,\pi(s))}{N}\xrightarrow{\mathrm{a.s.}}\bar{n}_{s}$ and $\sum_{s\in\mathcal{S}}\bar{n}_{s}=1$.
2. (ii) For each state-action pair $(s,a)\in\mathcal{S}\times\mathcal{A}$, the prior on $P_{s,a}$ is a uniform Dirichlet prior with $\phi_{0}(s,a,s^{\prime})=1$, $\forall s^{\prime}\in\mathcal{S}$.

Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3) is a trajectory-based sampling condition. It is weaker than the state-action-wise independent sampling condition used in Wang and Zhou ([2025](https://arxiv.org/html/2605.24345#bib.bib43)), which assumes independent transition samples within each state-action pair and mutual independence across different state-action pairs. Here, the data are collected along a single on-policy trajectory, and the induced temporal dependence is handled in the proof by a martingale central limit theorem (CLT). The ratio condition ensures that each on-policy state-action pair $(s,\pi(s))$ receives a nonvanishing fraction of the total number of observations; it holds, for instance, when the Markov chain induced by $P_{\pi}^{c}$ is ergodic with a strictly positive stationary mass on every state. Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(ii)](https://arxiv.org/html/2605.24345#S2.I1.i2) is used only for notational simplicity. More generally, any fixed Dirichlet prior with strictly positive parameters contributes only $O(1)$ pseudo-counts and therefore does not affect the limiting behavior.

###### Theorem 1 (Asymptotic normality for $\alpha$-quantile BR-MDP).

Fix $\alpha\in(0,1)$ and let $z_{\alpha}:=\Phi^{-1}(\alpha)$, where $\Phi$ is the cdf of the standard normal distribution. Under Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1),

$$\sqrt{N}\,(I-\gamma P_{\pi}^{c})\bigl(V^{\pi}_{\phi,\alpha}-V^{\pi}\bigr)\Rightarrow\mathcal{N}\!\Bigl(\gamma\lambda_{\pi},\ \operatorname{diag}\bigl((\gamma\sigma_{\pi})^{2}\bigr)\Bigr),$$

where, for each $s\in\mathcal{S}$, $\sigma_{\pi}^{2}(s):=\frac{1}{\bar{n}_{s}}\,\operatorname{Var}_{s^{\prime}\sim P^{c}_{s,\pi(s)}}\!\bigl[V^{\pi}(s^{\prime})\bigr]$, and $\lambda_{\pi}(s):=z_{\alpha}\,\sigma_{\pi}(s)$.

The proof of Theorem [1](https://arxiv.org/html/2605.24345#Thmtheorem1) is deferred to Appendix [EC.2](https://arxiv.org/html/2605.24345#A2). In particular, Theorem [1](https://arxiv.org/html/2605.24345#Thmtheorem1) yields the following expansion of the $\alpha$-quantile BR-MDP value function:

$$V^{\pi}_{\phi,\alpha}=V^{\pi}+(I-\gamma P_{\pi}^{c})^{-1}\left(\frac{\gamma\lambda_{\pi}}{\sqrt{N}}+\frac{\gamma\,\operatorname{diag}(\sigma_{\pi})\,Z}{\sqrt{N}}\right)+o_{p}(N^{-1/2}),\tag{10}$$

where $\sigma_{\pi}:=\bigl(\sigma_{\pi}(s)\bigr)_{s\in\mathcal{S}}$, $o_{p}(\cdot)$ denotes little-$o$ notation in probability with respect to the randomness of the observed data, and $Z\sim\mathcal{N}(0,I_{S})$ is a standard multivariate normal vector. For fixed $\alpha\in(0,1)$, ([10](https://arxiv.org/html/2605.24345#S2.E10)) yields a stochastic expansion of $V^{\pi}_{\phi,\alpha}$ around $V^{\pi}$, whose leading term decomposes into a deterministic bias term of order $N^{-1/2}$ and a Gaussian fluctuation term, with remainder $o_{p}(N^{-1/2})$. More specifically, before Bellman propagation, i.e., before multiplying by $(I-\gamma P_{\pi}^{c})^{-1}$ in ([10](https://arxiv.org/html/2605.24345#S2.E10)), the asymptotic mean of the one-step adjustment $b^{\alpha}_{\phi(s,\pi(s))}(V^{\pi})$ is given by

$$\frac{\lambda_{\pi}(s)}{\sqrt{N}}=z_{\alpha}\sqrt{\frac{\operatorname{Var}_{s^{\prime}\sim P^{c}_{s,\pi(s)}}\!\bigl[V^{\pi}(s^{\prime})\bigr]}{\bar{n}_{s}N}}.\tag{11}$$

The magnitude of ([11](https://arxiv.org/html/2605.24345#S2.E11)) is larger for states that are sampled less frequently (i.e., small $\bar{n}_{s}$) or for which the next-state value $V^{\pi}(s^{\prime})$ is more dispersed under the true transition kernel (i.e., large $\operatorname{Var}_{s^{\prime}\sim P^{c}_{s,\pi(s)}}[V^{\pi}(s^{\prime})]$). The sign of ([11](https://arxiv.org/html/2605.24345#S2.E11)) is determined by $z_{\alpha}$. When $\alpha<0.5$, the asymptotic mean of the one-step adjustment is negative, so it discourages visiting less frequently visited states on average; this reflects a conservative attitude toward epistemic uncertainty, and $V^{\pi}_{\phi,\alpha}$ underestimates $V^{\pi}$ on average after Bellman propagation. When $\alpha=0.5$, $z_{0.5}=0$, so the deterministic bias term in the asymptotic expansion vanishes; this makes the $0.5$-quantile BR-MDP asymptotically risk-neutral, similar to applying the posterior expectation in ([5](https://arxiv.org/html/2605.24345#S2.E5)) under the normal approximation. When $\alpha>0.5$, $z_{\alpha}$ becomes positive, so the one-step adjustment provides a bonus for visiting less frequently visited states on average; this encourages exploration, reflects an optimistic attitude, and induces an overestimation due to such exploration incentives. Thus, Theorem [1](https://arxiv.org/html/2605.24345#Thmtheorem1) shows how the quantile level $\alpha$ of the quantile BR-MDP accounts for epistemic uncertainty in the asymptotic regime.

The matrix $(I-\gamma P_{\pi}^{c})^{-1}$ then propagates this one-step adjustment through state transitions. For fixed $N$, the deterministic component in ([10](https://arxiv.org/html/2605.24345#S2.E10)), $(I-\gamma P_{\pi}^{c})^{-1}\gamma\lambda_{\pi}/\sqrt{N}$, represents the mean shift of $V^{\pi}_{\phi,\alpha}$ away from $V^{\pi}$ induced by the $\alpha$-quantile in the Bellman backup. Since $\lambda_{\pi}=z_{\alpha}\sigma_{\pi}\to\pm\infty$ as $\alpha\uparrow 1$ or $\alpha\downarrow 0$, the magnitude of the mean shift increases as the quantile level moves farther into the upper tail for exploration or farther into the lower tail for robustness. On the other hand, for every fixed $\alpha\in(0,1)$, since $\|(I-\gamma P_{\pi}^{c})^{-1}\|_{\infty}\leq(1-\gamma)^{-1}$, the deterministic component $(I-\gamma P_{\pi}^{c})^{-1}\gamma\lambda_{\pi}/\sqrt{N}$ is bounded in sup-norm by $\frac{\gamma}{1-\gamma}\,|z_{\alpha}|\,\|\sigma_{\pi}\|_{\infty}/\sqrt{N}$. Hence, the mean shift induced by a fixed quantile level $\alpha$ shrinks automatically as more data are collected, which implies that the trade-off between robustness and exploration diminishes with more data. As a result, for every fixed $\alpha\in(0,1)$, the difference $V^{\pi}_{\phi,\alpha}-V^{\pi}$ is of order $O_{p}(N^{-1/2})$.

Finally, the asymptotic covariance $\sigma^{2}_{\pi}$ in Theorem [1](https://arxiv.org/html/2605.24345#Thmtheorem1) does not depend on $\alpha$. Thus, $\alpha$ only controls the direction and magnitude of the mean shift, whereas the intrinsic variability is determined by the true transition dynamics and the policy used.
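As a quick arithmetic illustration of (11), the sketch below evaluates the asymptotic mean of the one-step adjustment for a few quantile levels. It assumes hypothetical values for the next-state value variance, the visit fraction $\bar{n}_{s}$, and $N$; the helper name `asymptotic_mean_shift` is introduced here for illustration only.

```python
import math
from statistics import NormalDist

def asymptotic_mean_shift(alpha, var_next_value, n_bar, n_total):
    """Evaluate z_alpha * sqrt(Var[V(s')] / (n_bar * N)), i.e. equation (11)."""
    z_alpha = NormalDist().inv_cdf(alpha)          # standard normal quantile
    return z_alpha * math.sqrt(var_next_value / (n_bar * n_total))

# Hypothetical inputs: unit variance, state visited 25% of the time.
shift_low = asymptotic_mean_shift(0.1, var_next_value=1.0, n_bar=0.25, n_total=400)
shift_mid = asymptotic_mean_shift(0.5, var_next_value=1.0, n_bar=0.25, n_total=400)
shift_high = asymptotic_mean_shift(0.9, var_next_value=1.0, n_bar=0.25, n_total=400)
shift_high_more_data = asymptotic_mean_shift(0.9, var_next_value=1.0, n_bar=0.25, n_total=1600)

print(shift_low, shift_mid, shift_high, shift_high_more_data)
```

With these inputs the shift is negative for $\alpha=0.1$, zero for $\alpha=0.5$, positive for $\alpha=0.9$, and halves when $N$ is quadrupled, consistent with the $N^{-1/2}$ rate.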

### 2.3 Finite-sample Robustness of $\alpha$-quantile BR-MDP

The preceding subsections introduced the $\alpha$-quantile BR-MDP to account for epistemic uncertainty in the unknown transition kernel and characterized, asymptotically, how the quantile level $\alpha$ reflects a risk attitude toward this uncertainty. In this subsection, we further provide a finite-sample robustness interpretation of the $\alpha$-quantile BR-MDP.

With a slight abuse of notation, let $P=\bigl(P_{s,a}\bigr)_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ be the random transition kernel obtained by independently drawing $P_{s,a}\sim\mathrm{Dir}(\phi(s,a))$ for each state-action pair. Let $\mathbb{P}_{\phi}$ denote the posterior probability measure over the transition kernel $P$ and $\rho^{\alpha}_{\phi}(\cdot)$ denote the $\alpha$-quantile with respect to the transition kernel $P$. Given a realized transition kernel $P$ and a policy $\pi$, let $V^{\pi}_{P}$ denote the value function of $\pi$ under the transition kernel $P$. For each state $s\in\mathcal{S}$, we define the posterior $\alpha$-quantile value of policy $\pi$ by

$$V^{\pi,\mathrm{q}}_{\phi,\alpha}(s):=\sup\Bigl\{c\in\mathbb{R}:\mathbb{P}_{\phi}\bigl(V^{\pi}_{P}(s)\geq c\bigr)\geq 1-\alpha\Bigr\}.\tag{12}$$

Thus, $V^{\pi,\mathrm{q}}_{\phi,\alpha}(s)$ is the largest value threshold that $V^{\pi}_{P}(s)$ exceeds with posterior probability at least $1-\alpha$. Smaller values of $\alpha$ correspond to a stronger robustness guarantee.

Assume that, for every fixed policy $\pi$ and state $s$, the posterior distribution of $V^{\pi}_{P}(s)$ is continuous. Then $V^{\pi,\mathrm{q}}_{\phi,\alpha}(s)=\rho^{\alpha}_{\phi}\!\bigl(V^{\pi}_{P}(s)\bigr)$. Here, $\rho^{\alpha}_{\phi}\!\bigl(V^{\pi}_{P}(s)\bigr)$ applies the risk measure to the entire value function $V^{\pi}_{P}$, rather than applying the risk measure at each transition stage, as in the nested formulation of $V^{\pi}_{\phi,\alpha}$ in ([6](https://arxiv.org/html/2605.24345#S2.E6)). Since the quantile functional is non-additive, $V^{\pi,\mathrm{q}}_{\phi,\alpha}$ does not admit a Bellman equation in general. The hardness results of Delage and Mannor ([2010](https://arxiv.org/html/2605.24345#bib.bib9)) imply that directly optimizing the corresponding posterior quantile objective, i.e., $\max_{\pi}V^{\pi,\mathrm{q}}_{\phi,\alpha}(s)$, is NP-hard under general uncertainty in the transition parameters. Even under independent Dirichlet transition priors, exact optimization remains difficult. Instead, the $\alpha$-quantile BR-MDP value function can serve as a tractable lower bound, as shown in the following proposition.
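For a fixed policy, the posterior $\alpha$-quantile value in (12) can be approximated by sampling, even though it cannot be optimized through a Bellman recursion. The sketch below is one way to do this under stated assumptions: a toy three-state chain, hypothetical posterior parameters for the on-policy rows, and an empirical quantile in place of the exact one. The function names are illustrative and not from the paper.

```python
import numpy as np

def policy_value(P_pi, r_pi, gamma):
    """Solve V = r_pi + gamma * P_pi V for a fixed policy under kernel P_pi."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def posterior_quantile_value(phi_pi, r_pi, gamma, alpha, n_samples=4000, seed=0):
    """Empirical alpha-quantile of V_P^pi(s) over kernels P drawn from the posterior."""
    rng = np.random.default_rng(seed)
    S = len(r_pi)
    values = np.empty((n_samples, S))
    for m in range(n_samples):
        # Draw each on-policy row independently from its Dirichlet posterior.
        P = np.vstack([rng.dirichlet(phi_pi[s]) for s in range(S)])
        values[m] = policy_value(P, r_pi, gamma)
    return np.quantile(values, alpha, axis=0)

# Hypothetical posterior parameters: uniform prior plus illustrative counts.
phi_pi = 1.0 + np.array([[8, 2, 0], [1, 5, 4], [0, 3, 7]])
r_pi = np.array([0.0, 0.5, 1.0])   # rewards in [0, 1] under the fixed policy

v_low = posterior_quantile_value(phi_pi, r_pi, gamma=0.9, alpha=0.1)
v_high = posterior_quantile_value(phi_pi, r_pi, gamma=0.9, alpha=0.9)
print(v_low, v_high)
```

Note that the risk measure here is applied once to the whole value function, which is exactly what distinguishes it from the nested, stage-wise formulation.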

###### Proposition 1.

For $\alpha\in(0,1)$, let $\bar{\alpha}:=1-(1-\alpha)^{1/S}$. Assume that, for every fixed policy $\pi$ and state $s$, the posterior distribution of $V^{\pi}_{P}(s)$ is continuous. For any fixed policy $\pi$,

$$\mathbb{P}_{\phi}\Bigl(V^{\pi}_{P}(s)\geq V^{\pi}_{\phi,\bar{\alpha}}(s),\quad\forall s\in\mathcal{S}\Bigr)\geq 1-\alpha.$$

Consequently, for every $s\in\mathcal{S}$, $V^{\pi,\mathrm{q}}_{\phi,\alpha}(s)\geq V^{\pi}_{\phi,\bar{\alpha}}(s)$.

The proof of Proposition [1](https://arxiv.org/html/2605.24345#Thmproposition1) is in Appendix [EC.3](https://arxiv.org/html/2605.24345#A3). As an immediate consequence, for any fixed state $s$, $\max_{\pi}V^{\pi,\mathrm{q}}_{\phi,\alpha}(s)\geq\max_{\pi}V^{\pi}_{\phi,\bar{\alpha}}(s)$. Thus, maximizing the $\bar{\alpha}$-quantile value function gives a conservative surrogate for maximizing the posterior $\alpha$-quantile value. Proposition [1](https://arxiv.org/html/2605.24345#Thmproposition1), together with ([12](https://arxiv.org/html/2605.24345#S2.E12)), shows that, for small $\alpha$, the $\alpha$-quantile BR-MDP admits a finite-sample robustness interpretation: the posterior $\alpha$-quantile value $V^{\pi,\mathrm{q}}_{\phi,\alpha}$ provides a posterior high-probability ($1-\alpha$) performance guarantee, and the nested $\bar{\alpha}$-quantile BR-MDP value $V^{\pi}_{\phi,\bar{\alpha}}$ is a tractable lower bound for this robustness guarantee.
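To get a feel for how conservative the nested level $\bar{\alpha}$ is, the following small sketch evaluates $\bar{\alpha}=1-(1-\alpha)^{1/S}$ for a hypothetical target level and a few state-space sizes. The helper name `nested_quantile_level` is an assumption made for this illustration.

```python
def nested_quantile_level(alpha, num_states):
    """Per-stage level alpha_bar = 1 - (1 - alpha)^(1/S) from Proposition 1."""
    return 1.0 - (1.0 - alpha) ** (1.0 / num_states)

alpha = 0.1                                   # hypothetical target level
levels = {S: nested_quantile_level(alpha, S) for S in (1, 5, 10, 100)}
for S, level in levels.items():
    print(S, level)
```

By construction $(1-\bar{\alpha})^{S}=1-\alpha$, so $\bar{\alpha}$ equals $\alpha$ when $S=1$ and shrinks as the number of states grows, which makes the nested surrogate increasingly conservative for large state spaces.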

## 3 Online RL via Adaptive Quantile Scheduling

Section [2](https://arxiv.org/html/2605.24345#S2) showed that the quantile level in the $\alpha$-quantile BR-MDP provides a flexible way to handle epistemic uncertainty captured by the posterior distribution for different purposes, namely robustness and exploration. We now turn to the online setting, where this flexibility becomes algorithmically useful for adapting to the evolving robustness–exploration trade-off over the course of learning. To illustrate such a trade-off, two examples that separately demonstrate the robustness and exploration sides of this trade-off are presented in Section [EC.4](https://arxiv.org/html/2605.24345#A4). Section [3.1](https://arxiv.org/html/2605.24345#S3.SS1) introduces an adaptive quantile schedule that tracks this evolving trade-off by varying the quantile level across learning stages and across state-action pairs. Then, Section [3.2](https://arxiv.org/html/2605.24345#S3.SS2) embeds this mechanism into a pseudo-episode scheme for discounted infinite-horizon problems, enabling a fully online learning algorithm. Specifically, at the start of each pseudo-episode, we update the posterior, compute the quantile level, solve the corresponding $\alpha$-quantile BR-MDP, and then execute the resulting policy to collect more data until the next update.

### 3.1 Adaptive Quantile Scheduling

To track the evolving robustness–exploration trade\-off identified in Section[EC\.4](https://arxiv.org/html/2605.24345#A4), we let the quantile level vary across pseudo\-episodes and across state\-action pairs\.

We index the pseudo-episodes by $k=1,2,\ldots$. At the beginning of the $k$th pseudo-episode, for each $(s,a)\in\mathcal{S}\times\mathcal{A}$, let $N_{k}(s,a,s^{\prime})$ denote the number of transitions from $(s,a)$ to $s^{\prime}$, and let $N_{k}(s,a):=\sum_{s^{\prime}\in\mathcal{S}}N_{k}(s,a,s^{\prime})$ denote the total number of visits to $(s,a)$. Under the prior $\mathrm{Dir}(1,\ldots,1)$, for each $(s,a)$, the posterior distribution of $P_{s,a}$ at pseudo-episode $k$ is denoted by $\mathrm{Dir}(\phi_{k}(s,a))$ and the posterior parameters are updated as follows:

$$\phi_{k}(s,a,s^{\prime})=1+N_{k}(s,a,s^{\prime}),\quad\phi_{k}(s,a)=\bigl(\phi_{k}(s,a,s^{\prime})\bigr)_{s^{\prime}\in\mathcal{S}}.\tag{13}$$

Let $\phi_{k}=\{\phi_{k}(s,a)\}_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ denote the collection of posterior parameters at the beginning of the $k$th pseudo-episode. To construct the adaptive quantile schedule, we first define the adjusted visit count and its average by $N_{k}^{+}(s,a):=\max\{N_{k}(s,a),1\}$ and $\bar{N}_{k}^{+}:=\frac{1}{SA}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N_{k}^{+}(s,a)$. The relative visit count of $(s,a)$ is then defined as

$$r_{k}(s,a):=\frac{N_{k}^{+}(s,a)}{\bar{N}_{k}^{+}},\quad(s,a)\in\mathcal{S}\times\mathcal{A},$$

which compares the number of visits to $(s,a)$ with the average adjusted visit count over the state-action space. We also define the scaling factor

$$g_{k}:=\frac{\ln(2k)}{\sqrt{k}},$$

whose scaling form is chosen in accordance with the regret guarantee established in Section [4](https://arxiv.org/html/2605.24345#S4).

Given a problem-specified robustness target parameter $\underline{\alpha}\in(0,1)$, which encodes the degree of lower-tail robustness that the decision maker deems sufficient when acting under epistemic uncertainty, and an algorithmic sensitivity parameter $\delta>0$, we define the adaptive quantile schedule as follows:

$$\alpha_{k}(s,a)=\max\left\{1-\delta\,r_{k}(s,a)\,g_{k},\ \underline{\alpha}\right\}.\tag{14}$$

Before explaining the intuition behind the schedule ([14](https://arxiv.org/html/2605.24345#S3.E14)), we extend the formulation of the $\alpha$-quantile BR-MDP with a common quantile level $\alpha$ across all state-action pairs to allow state-action-dependent quantile levels. Specifically, let $\alpha:\mathcal{S}\times\mathcal{A}\to(0,1)$ and replace $\rho^{\alpha}_{\phi(s_{t},a_{t})}$ in ([6](https://arxiv.org/html/2605.24345#S2.E6)) with $\rho^{\alpha(s_{t},a_{t})}_{\phi(s_{t},a_{t})}$. With a slight abuse of notation, we write $\alpha=(\alpha(s,a))_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ and use $\rho^{\alpha}_{\phi(s,a)}$ as shorthand for $\rho^{\alpha(s,a)}_{\phi(s,a)}$. With this notation, the Bellman equations ([5](https://arxiv.org/html/2605.24345#S2.E5))–([7](https://arxiv.org/html/2605.24345#S2.E7)) and the Bellman operators ([8](https://arxiv.org/html/2605.24345#S2.E8))–([9](https://arxiv.org/html/2605.24345#S2.E9)) retain exactly the same form in the state-action-dependent case. Moreover, under the same structural conditions on the risk measure as in Section [2](https://arxiv.org/html/2605.24345#S2), the corresponding Bellman operators remain $\gamma$-contractions under $\|\cdot\|_{\infty}$ and therefore admit unique fixed points.

The factor $g_{k}$ in ([14](https://arxiv.org/html/2605.24345#S3.E14)) depends only on $k$ and controls how far the quantile levels are pushed toward the lower tail across different pseudo-episodes. Early in learning, before visit counts become highly imbalanced across state-action pairs, $r_{k}(s,a)$ typically does not vary much across state-action pairs, whereas the common factor $g_{k}$ is relatively large in the early pseudo-episodes. Accordingly, the quantile levels $\alpha_{k}(s,a)$ remain in the lower-tail regime, so the Bellman backup of the corresponding quantile BR-MDP focuses on lower-tail performance of the EV of successor states and therefore yields a more robust policy as discussed in Section [2](https://arxiv.org/html/2605.24345#S2).

This robustness is not uniform across state–action pairs. The factor $r_{k}(s,a)$ adjusts the quantile level according to relative visit counts. Before the floor $\underline{\alpha}$ becomes binding, a less-visited $(s,a)$ pair has a smaller $r_{k}(s,a)$ and thus a larger $\alpha_{k}(s,a)$. Consequently, in the Bellman backup for such $(s,a)$, the EV of successor states is evaluated at a less extreme lower quantile level. This effect is not confined to the local state–action pair but propagates backward to states from which this less-visited pair can be reached through repeated Bellman backups.

As $k$ increases, the common factor $g_{k}$ decreases and visit counts $N_{k}(s,a)$ can become imbalanced across $(s,a)$. Consequently, $r_{k}(s,a)g_{k}$ may become sufficiently small for less-visited $(s,a)$ pairs, causing the schedule to move the quantile level $\alpha_{k}(s,a)$ toward the upper tail and thereby encouraging exploration of these less-visited pairs, as discussed in Section [2](https://arxiv.org/html/2605.24345#S2). Moreover, in ([14](https://arxiv.org/html/2605.24345#S3.E14)), $\delta$ controls the sensitivity of the schedule to relative visit counts.
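The schedule (14) is simple to compute from transition counts. The sketch below is a minimal illustration, assuming counts are stored in an array of shape `(S, A, S)`; the function name `adaptive_quantile_levels` and the toy counts are assumptions made for this example.

```python
import numpy as np

def adaptive_quantile_levels(counts, k, delta, alpha_floor):
    """Compute alpha_k(s,a) = max{1 - delta * r_k(s,a) * g_k, alpha_floor}.

    counts[s, a, s2] holds N_k(s, a, s2); alpha_floor plays the role of the
    robustness target (the lower bound on the quantile level).
    """
    visits = counts.sum(axis=-1)              # N_k(s,a)
    adjusted = np.maximum(visits, 1.0)        # N_k^+(s,a) = max{N_k(s,a), 1}
    r = adjusted / adjusted.mean()            # relative visit count r_k(s,a)
    g = np.log(2 * k) / np.sqrt(k)            # scaling factor g_k
    return np.maximum(1.0 - delta * r * g, alpha_floor)

# Toy counts for S = 2 states and A = 2 actions (illustrative numbers).
counts = np.zeros((2, 2, 2))
counts[0, 0] = [25, 25]   # heavily visited pair
counts[0, 1] = [1, 1]     # rarely visited pair
counts[1, 0] = [10, 20]
counts[1, 1] = [0, 0]     # never visited pair

alpha_k = adaptive_quantile_levels(counts, k=20, delta=1.0, alpha_floor=0.1)
print(alpha_k)
```

In this toy example the heavily visited pairs are clipped at the floor, while the rarely or never visited pairs receive quantile levels in the upper tail, which is the exploration incentive described above.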

### 3.2 Pseudo-Episodes and Algorithm

In finite\-horizon MDPs, it is natural to update the posterior and recompute the policy at the beginning of each episode\(Osband et al\.[2013](https://arxiv.org/html/2605.24345#bib.bib29), Osband and Van Roy[2016](https://arxiv.org/html/2605.24345#bib.bib30)\)\. In the discounted infinite\-horizon setting considered here, however, no natural episodes are available\. Updating the posterior and recomputing the policy after every transition would be computationally demanding, whereas long deterministic artificial episodes may delay the use of newly collected data\. We therefore partition the interaction stream into pseudo\-episodes, as inXu et al\. \([2024](https://arxiv.org/html/2605.24345#bib.bib48)\)\.

For each $t\geq 1$, after observing the transition at time $t$, we draw an independent pseudo-episode indicator $X_{t+1}\sim\mathrm{Bernoulli}(\gamma)$, and interpret $X_{t+1}=1$ as continuing the current pseudo-episode and $X_{t+1}=0$ as starting a new pseudo-episode at time $t+1$. We initialize $X_{1}=0$, so that time $1$ starts the first pseudo-episode. Then, the pseudo-episode length $L$ follows a Geometric distribution with parameter $1-\gamma$ on $\{1,2,\ldots\}$, denoted by $L\sim\mathrm{Geom}(1-\gamma)$, and is independent of the trajectory. With such a random length, the expectation of the accumulated reward with respect to $L$ is equal to the discounted accumulated reward, i.e., for any realized reward sequence $\{r(s_{i},a_{i})\}_{i\geq 0}$ indexed from the start of the pseudo-episode,

$$\mathbb{E}_{L\sim\mathrm{Geom}(1-\gamma)}\left[\sum_{i=0}^{L-1}r(s_{i},a_{i})\right]=\sum_{i=0}^{\infty}\mathbb{P}(L\geq i+1)\,r(s_{i},a_{i})=\sum_{i=0}^{\infty}\gamma^{i}r(s_{i},a_{i}).$$

Let $K_{T}$ denote the number of pseudo-episodes started by time $T$; then $\mathbb{E}[K_{T}]=1+(T-1)(1-\gamma)$. Hence, the expected number of pseudo-episodes, which equals the number of posterior updates and policy recomputations, grows on the order of $(1-\gamma)T$ rather than $T$.
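Both identities above can be checked by simulation. The sketch below is a hedged illustration with an arbitrary reward sequence, a hypothetical horizon, and fixed random seeds; it compares Monte Carlo averages with the closed-form expressions and does not reproduce any experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, T = 0.9, 1000

def num_pseudo_episodes(T, gamma, rng):
    """X_1 = 0 starts episode 1; each X_t ~ Bernoulli(gamma), t >= 2, restarts when it is 0."""
    continue_flags = rng.random(T - 1) < gamma     # True means X_t = 1 (continue)
    return 1 + int(np.sum(~continue_flags))

mean_K = np.mean([num_pseudo_episodes(T, gamma, rng) for _ in range(2000)])
expected_K = 1 + (T - 1) * (1 - gamma)

# Geometric-length identity: E[sum_{i<L} r_i] equals sum_i gamma^i r_i.
rewards = rng.random(300)                          # arbitrary rewards, treated as zero beyond index 299
lengths = rng.geometric(1 - gamma, size=200_000)   # support {1, 2, ...}
prefix_sums = np.concatenate([[0.0], np.cumsum(rewards)])
mc_return = prefix_sums[np.minimum(lengths, len(rewards))].mean()
discounted_return = float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

print(mean_K, expected_K)
print(mc_return, discounted_return)
```

The two printed pairs should agree up to Monte Carlo error, illustrating why a geometrically distributed pseudo-episode length is a natural stand-in for discounting.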

Let $\mathcal{H}_{t}:=\{(s_{\tau},a_{\tau},s_{\tau+1})\}_{\tau=1}^{t-1}$ denote the interaction history available before time $t$. By incorporating the adaptive quantile schedule with the pseudo-episodic posterior updates, we obtain the Adaptive Quantile BR-MDP (AQ-BRMDP), summarized in Algorithm [1](https://arxiv.org/html/2605.24345#alg1).

**Algorithm 1** AQ-BRMDP

1. **Input:** Discount factor $\gamma$, total learning time $T$, schedule parameters $\delta$ and $\underline{\alpha}$
2. **Initialize** $t\leftarrow 1$, $k\leftarrow 0$, $X_{1}\leftarrow 0$, and $\mathcal{H}_{1}\leftarrow\emptyset$
3. **while** $t\leq T$ **do**
4. **if** $X_{t}=0$ **then**
5. $k\leftarrow k+1$ and $t_{k}\leftarrow t$
6. Update the posterior $\phi_{k}$ using $\mathcal{H}_{t_{k}}$ according to ([13](https://arxiv.org/html/2605.24345#S3.E13))
7. Compute $\alpha_{k}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ via ([14](https://arxiv.org/html/2605.24345#S3.E14))
8. Compute $\pi_{k}$ by solving the $\alpha_{k}$-quantile BR-MDP under $\phi_{k}$ via value iteration (Algorithm [EC.1](https://arxiv.org/html/2605.24345#alg1a) in Appendix [EC.5](https://arxiv.org/html/2605.24345#A5))
9. **end if**
10. Take action $a_{t}\leftarrow\pi_{k}(s_{t})$ and observe the next state $s_{t+1}$
11. Update the history $\mathcal{H}_{t+1}\leftarrow\mathcal{H}_{t}\cup\{(s_{t},a_{t},s_{t+1})\}$
12. **if** $t<T$ **then**
13. Sample $X_{t+1}\sim\mathrm{Bernoulli}(\gamma)$
14. **end if**
15. $t\leftarrow t+1$
16. **end while**
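The control flow of Algorithm 1 can be sketched in a few dozen lines. The code below is an illustrative sketch under several assumptions, not the authors' implementation: the environment is a small random tabular MDP, the initial state is taken to be state 0, and the solver `solve_quantile_brmdp` is a stand-in that approximates posterior quantiles with a fixed batch of Dirichlet samples instead of following Algorithm EC.1, whose details are not reproduced here.

```python
import numpy as np

def quantile_levels(counts, k, delta, alpha_floor):
    """Schedule (14): alpha_k(s,a) = max{1 - delta * r_k(s,a) * g_k, alpha_floor}."""
    n_plus = np.maximum(counts.sum(axis=-1), 1.0)    # adjusted visit counts
    r = n_plus / n_plus.mean()                       # relative visit counts
    g = np.log(2 * k) / np.sqrt(k)                   # scaling factor g_k
    return np.maximum(1.0 - delta * r * g, alpha_floor)

def solve_quantile_brmdp(phi, reward, alpha, gamma, rng, n_samples=200, n_iter=100):
    """Approximate value iteration with sample-based per-(s,a) quantiles (assumed solver)."""
    S, A, _ = phi.shape
    # Normalized independent Gamma draws are Dirichlet(phi[s, a]) samples.
    draws = rng.gamma(phi, size=(n_samples, S, A, S))
    kernels = draws / draws.sum(axis=-1, keepdims=True)
    V = np.zeros(S)
    for _ in range(n_iter):
        backups = kernels @ V                        # shape (n_samples, S, A)
        q = np.array([[np.quantile(backups[:, s, a], alpha[s, a])
                       for a in range(A)] for s in range(S)])
        Q = reward + gamma * q
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V

def aq_brmdp(P_true, reward, gamma, T, delta, alpha_floor, seed=0):
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    counts = np.zeros((S, A, S))                     # sufficient statistic of the history
    s, k, start_new = 0, 0, True                     # start_new encodes X_t = 0 (X_1 = 0)
    policy = np.zeros(S, dtype=int)
    for t in range(1, T + 1):
        if start_new:
            k += 1
            phi = 1.0 + counts                       # posterior update (13)
            alpha = quantile_levels(counts, k, delta, alpha_floor)
            policy, _ = solve_quantile_brmdp(phi, reward, alpha, gamma, rng)
        a = policy[s]
        s_next = rng.choice(S, p=P_true[s, a])       # environment transition
        counts[s, a, s_next] += 1
        if t < T:
            start_new = rng.random() >= gamma        # X_{t+1} = 0 with probability 1 - gamma
        s = s_next
    return counts, k

# Toy problem with 3 states and 2 actions (all numbers are illustrative).
env_rng = np.random.default_rng(1)
P_true = env_rng.dirichlet(np.ones(3), size=(3, 2))  # shape (3, 2, 3)
reward = env_rng.random((3, 2))
T = 200
counts, num_episodes = aq_brmdp(P_true, reward, gamma=0.9, T=T, delta=1.0, alpha_floor=0.1)
print(num_episodes, counts.sum(axis=-1))
```

Storing only the transition counts is enough here because the Dirichlet posterior in (13) depends on the history solely through those counts; the sample sizes and iteration counts are arbitrary choices for a quick run.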

## 4 Regret Analysis

Section [3](https://arxiv.org/html/2605.24345#S3) introduced the adaptive quantile schedule and its pseudo-episodic implementation to account for the evolving robustness–exploration trade-off in online learning. We now establish performance guarantees for AQ-BRMDP. We analyze Bayesian regret relative to two complementary benchmarks. The first benchmark is the true-optimal value $V^{*}$, which measures regret relative to the optimal value function in the true environment. The second benchmark is the robust-optimal value in the $k$th pseudo-episode, $V^{*}_{\phi_{k},\underline{\alpha}}$, which measures regret relative to the optimal value function of the $\underline{\alpha}$-quantile BR-MDP under posterior parameters $\phi_{k}$. In implementation, the quantiles are approximated by sampling transition kernels from the posterior distribution, and the corresponding quantile BR-MDP is solved in the simulator. Compared with the cost of interacting with the true environment, the computational cost of posterior sampling and solving the quantile BR-MDP is treated as negligible. Accordingly, we assume that the policy computed in each pseudo-episode can be made arbitrarily close to the exact optimal policy of the corresponding quantile BR-MDP, and we ignore the resulting approximation error in the regret analysis.

Throughout this section, $s_{k,i}$ denotes the state at the $i$th time step of pseudo-episode $k$, $\pi_{k}=\pi^{*}_{\phi_{k},\alpha_{k}}$ is the policy executed in pseudo-episode $k$, $V_{k}:=V^{*}_{\phi_{k},\alpha_{k}}$ denotes the optimal value of the corresponding adaptive $\alpha_{k}$-quantile BR-MDP, and $L_{k}$ denotes the realized length of pseudo-episode $k$ within the interaction horizon $T$. Bayesian regret (BR) relative to the true-optimal benchmark is defined by

$$BR(T):=\mathbb{E}\!\left[\sum_{k=1}^{K_{T}}\sum_{i=1}^{L_{k}}\Bigl(V^{*}(s_{k,i})-V^{\pi_{k}}(s_{k,i})\Bigr)\right],\tag{15}$$

where the expectation is taken over the prior distribution of the true transition kernel, the trajectory randomness, and the randomness of the pseudo-episode lengths. To quantify whether exploration incurs substantial loss relative to a conservative benchmark, we also define Bayesian regret relative to the robust-optimal benchmark (BR-R) as follows:

$$BR\text{-}R(T):=\mathbb{E}\!\left[\sum_{k=1}^{K_{T}}\sum_{i=1}^{L_{k}}\Bigl(V^{*}_{\phi_{k},\underline{\alpha}}(s_{k,i})-V^{\pi_{k}}_{\phi_{k},\underline{\alpha}}(s_{k,i})\Bigr)\right],\tag{16}$$

where the expectation is taken over the same sources of randomness as in ([15](https://arxiv.org/html/2605.24345#S4.E15)). Let $\mathcal{F}_{t}$ denote the sigma-field generated by the interaction history and the pseudo-episode indicators observed before time $t$:

$$\mathcal{F}_{t}:=\sigma\!\left((s_{\tau},a_{\tau},s_{\tau+1})_{\tau=1}^{t-1},\,X_{1},\ldots,X_{t}\right).$$

###### Lemma 1 (Dual-benchmark optimism).

Fix pseudo-episode $k$ with starting time $t_{k}$. Then:

1. (i) $\mathbb{P}\!\left(\forall s\in\mathcal{S}:\ V^{*}(s)\leq V_{k}(s)\,\middle|\,\mathcal{F}_{t_{k}}\right)\geq 1-\frac{\delta SA\ln(2k)}{\sqrt{k}}.$
2. (ii) $V^{*}_{\phi_{k},\underline{\alpha}}(s)\leq V_{k}(s),\qquad\forall s\in\mathcal{S}.$

We defer the proof of Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1) to Appendix [EC.6.1](https://arxiv.org/html/2605.24345#A6.SS1). Part (i) in Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1) shows that the true optimal value function $V^{*}$ can be upper bounded by $V_{k}$ with posterior probability tending to 1 as learning progresses (i.e., as $k$ increases). Part (ii) in Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1) shows that the robust-optimal value function $V^{*}_{\phi_{k},\underline{\alpha}}$ can also be upper bounded by $V_{k}$ for every state $s$.
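To see when the bound in Lemma 1(i) is informative, the sketch below evaluates $1-\delta SA\ln(2k)/\sqrt{k}$ for hypothetical values of $\delta$, $S$, and $A$. The helper name and the parameter values are assumptions chosen only for illustration.

```python
import math

def optimism_probability_bound(k, delta, num_states, num_actions):
    """Lower bound 1 - delta * S * A * ln(2k) / sqrt(k) from Lemma 1(i)."""
    return 1.0 - delta * num_states * num_actions * math.log(2 * k) / math.sqrt(k)

# Hypothetical problem size and sensitivity parameter.
delta, S, A = 0.01, 5, 2
bounds = {k: optimism_probability_bound(k, delta, S, A) for k in (1, 4, 100, 10_000)}
for k, bound in bounds.items():
    print(k, round(bound, 4))
```

Because $\ln(2k)/\sqrt{k}$ peaks near $k=e^{2}/2\approx 3.7$ and only then decreases, the bound first loosens slightly over the first few pseudo-episodes and afterwards tightens toward 1; it is vacuous whenever $\delta SA\ln(2k)/\sqrt{k}\geq 1$.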

Define the optimistic event𝒢k:=∩\(s,a\)∈𝒮×𝒜\{\(Ps,ac\)⊤Vk≤ρϕk\(s,a\)αk\(P⊤Vk\)\}\\mathcal\{G\}\_\{k\}:=\\cap\_\{\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}\}\\left\\\{\(P^\{c\}\_\{s,a\}\)^\{\\top\}V\_\{k\}\\leq\\rho^\{\\alpha\_\{k\}\}\_\{\\phi\_\{k\}\(s,a\)\}\(P^\{\\top\}V\_\{k\}\)\\right\\\}, which is the event that the BR\-MDP EV overestimates the true EV\. We show that on𝒢k\\mathcal\{G\}\_\{k\},V∗\(s\)≤Vk\(s\)V^\{\*\}\(s\)\\leq V\_\{k\}\(s\)for alls∈𝒮s\\in\\mathcal\{S\}in Appendix[EC\.6\.1](https://arxiv.org/html/2605.24345#A6.SS1)\. Hence, we can decompose the upper bound forBR\(T\)BR\(T\)in \([15](https://arxiv.org/html/2605.24345#S4.E15)\) before taking expectations as follows:

$$\sum_{k=1}^{K_T}\sum_{i=1}^{L_k}\Bigl(V^{*}(s_{k,i}) - V^{\pi_k}(s_{k,i})\Bigr) \le \sum_{k=1}^{K_T}\sum_{i=1}^{L_k}\Bigl(V_k(s_{k,i}) - V^{\pi_k}(s_{k,i})\Bigr) + \frac{1}{1-\gamma}\sum_{k=1}^{K_T} L_k\,\mathbb{I}\{\mathcal{G}_k^{c}\}. \tag{17}$$

Using Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1)(ii), we can decompose the corresponding upper bound for $BR\text{-}R(T)$ in ([16](https://arxiv.org/html/2605.24345#S4.E16)) as follows:

$$\sum_{k=1}^{K_T}\sum_{i=1}^{L_k}\Bigl(V^{*}_{\phi_k,\underline{\alpha}}(s_{k,i}) - V^{\pi_k}_{\phi_k,\underline{\alpha}}(s_{k,i})\Bigr) \le \sum_{k=1}^{K_T}\sum_{i=1}^{L_k}\Bigl(V_k(s_{k,i}) - V^{\pi_k}_{\phi_k,\underline{\alpha}}(s_{k,i})\Bigr). \tag{18}$$
Therefore, the main remaining terms to control are the value-difference terms in ([17](https://arxiv.org/html/2605.24345#S4.E17)) and ([18](https://arxiv.org/html/2605.24345#S4.E18)), together with the non-optimistic-event term in ([17](https://arxiv.org/html/2605.24345#S4.E17)). The next lemma shows that the gap between the value of a fixed policy in the true environment, $V^{\pi}$, and its value in the $\alpha_k$-quantile BR-MDP under the posterior distribution with parameters $\phi_k$, $V^{\pi}_{\phi_k,\alpha_k}$, can be expressed as the cumulative one-step discrepancy between the $\alpha_k$-quantile BR-MDP Bellman backup and the corresponding Bellman backup under the true transition kernel along the realized trajectory.

###### Lemma 2\(Pseudo\-episode value decomposition\)\.

Fix pseudo-episode $k$ with starting time $t_k$, and fix a stationary deterministic policy $\pi$. Let $\{(s_{k,i},a_{k,i})\}_{i\ge 1}$ be the infinite-horizon trajectory generated by following $\pi$ from time $t_k$ in the true MDP, so that $a_{k,i} := \pi(s_{k,i})$ for all $i\ge 1$. Define

$$E_{k,i} := \rho^{\alpha_k}_{\phi_k(s_{k,i},a_{k,i})}\bigl(P^{\top}V^{\pi}_{\phi_k,\alpha_k}\bigr) - (P^{c}_{s_{k,i},a_{k,i}})^{\top}V^{\pi}_{\phi_k,\alpha_k},\qquad i\ge 1.$$

Then

$$\mathbb{E}\left[\sum_{i=1}^{L_k}\Bigl(V^{\pi}_{\phi_k,\alpha_k}(s_{k,i}) - V^{\pi}(s_{k,i})\Bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right] = \mathbb{E}\left[\sum_{i=1}^{L_k} i\,\gamma\,E_{k,i}\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right]. \tag{19}$$

Here the conditional expectation is taken over the trajectory randomness under $\pi$ in the true MDP and the independent pseudo-episode lengths $\{L_k : k\ge 1\}$.
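The factor $i\,\gamma$ on the right-hand side of (19) can be checked numerically on a toy chain. The sketch below is an illustration, not the proof in the appendix. It rests on two assumptions that are not stated in this section: the pseudo-episode length is geometric and independent of the trajectory, with $\mathbb{P}(L_k\ge i)=\gamma^{i-1}$, and the value gap `D` satisfies the one-step recursion $D = \gamma e + \gamma P D$, where the hypothetical per-state discrepancy `e[s]` stands in for $E_{k,i}$ under a fixed policy. All names are made up for the example.

```python
def value_gap(P, e, gamma, iters=2000):
    """Solve D = gamma * e + gamma * P D by fixed-point iteration (assumed recursion)."""
    n = len(e)
    D = [0.0] * n
    for _ in range(iters):
        D = [gamma * e[s] + gamma * sum(P[s][t] * D[t] for t in range(n))
             for s in range(n)]
    return D


def both_sides(P, e, gamma, s0, horizon=2000):
    """Return (lhs, rhs) of the identity for a trajectory started at state s0.

    Uses E[sum_{i<=L} f_i] = sum_i P(L >= i) * E[f_i] with the assumed
    survival probability P(L >= i) = gamma**(i - 1).
    """
    n = len(e)
    D = value_gap(P, e, gamma)
    dist = [1.0 if s == s0 else 0.0 for s in range(n)]  # law of the i-th state
    survive = 1.0                                       # gamma**(i - 1)
    lhs = rhs = 0.0
    for i in range(1, horizon + 1):
        lhs += survive * sum(dist[s] * D[s] for s in range(n))
        rhs += survive * i * gamma * sum(dist[s] * e[s] for s in range(n))
        # Advance the state distribution by one step of the chain.
        dist = [sum(dist[s] * P[s][t] for s in range(n)) for t in range(n)]
        survive *= gamma
    return lhs, rhs


if __name__ == "__main__":
    P = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]]
    e = [0.3, -0.1, 0.2]
    print(both_sides(P, e, gamma=0.9, s0=0))
```

Under these assumptions both sides reduce to $\gamma\,\mu^{\top}(I-\gamma P)^{-2}e$ for the initial distribution $\mu$: the discrepancy at step $i$ is counted once for each of the $i$ value gaps that contain it.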

The proof of Lemma [2](https://arxiv.org/html/2605.24345#Thmlemma2) is deferred to Appendix [EC.6.2](https://arxiv.org/html/2605.24345#A6.SS2). For the robust regret, the following lemma provides an analogous decomposition between the $\alpha_k$-quantile BR-MDP value function and the $\underline{\alpha}$-quantile BR-MDP value function.

###### Lemma 3\.

Fix pseudo-episode $k$ with starting time $t_k$, and fix a stationary deterministic policy $\pi$. Let $\{(s_{k,i},a_{k,i})\}_{i\ge 1}$ be the infinite-horizon trajectory generated by following $\pi$ from time $t_k$ in the true environment, so that $a_{k,i} := \pi(s_{k,i})$ for all $i\ge 1$. Define

$$E^{-}_{k,i} := (P^{c}_{s_{k,i},a_{k,i}})^{\top}V^{\pi}_{\phi_k,\underline{\alpha}} - \rho^{\underline{\alpha}}_{\phi_k(s_{k,i},a_{k,i})}\bigl(P^{\top}V^{\pi}_{\phi_k,\underline{\alpha}}\bigr),$$

and

$$E_{k,i} := \rho^{\alpha_k}_{\phi_k(s_{k,i},a_{k,i})}\bigl(P^{\top}V^{\pi}_{\phi_k,\alpha_k}\bigr) - (P^{c}_{s_{k,i},a_{k,i}})^{\top}V^{\pi}_{\phi_k,\alpha_k}.$$

Then

$$\mathbb{E}\left[\sum_{i=1}^{L_k}\Bigl(V^{\pi}_{\phi_k,\alpha_k}(s_{k,i}) - V^{\pi}_{\phi_k,\underline{\alpha}}(s_{k,i})\Bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right] = \mathbb{E}\left[\sum_{i=1}^{L_k} i\,\gamma\,\bigl(E_{k,i} + E^{-}_{k,i}\bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right]. \tag{20}$$

Here the conditional expectation is taken over the trajectory randomness under $\pi$ in the true environment and the independent pseudo-episode lengths $\{L_k : k\ge 1\}$.

Using [Lemmas 1](https://arxiv.org/html/2605.24345#Thmlemma1), [2](https://arxiv.org/html/2605.24345#Thmlemma2) and [3](https://arxiv.org/html/2605.24345#Thmlemma3) together with Dirichlet-posterior concentration in [Lemma 9](https://arxiv.org/html/2605.24345#Thmlemma9), we upper bound the Bayesian regret of AQ-BRMDP under both benchmarks.

###### Theorem 2\.

Suppose that the schedule parameter $\delta$ in ([14](https://arxiv.org/html/2605.24345#S3.E14)) satisfies $\delta>0$. If $T\ge S^{2}A$, then

$$BR(T) \le \widetilde{O}\left(\frac{\gamma\sqrt{SA\,T\ln\frac{e}{1-\underline{\alpha}}}}{(1-\gamma)^{2}} + \frac{\delta\,SA\,\sqrt{T}}{(1-\gamma)^{2}} + \frac{SA}{(1-\gamma)^{3}}\right),$$

and

$$BR\text{-}R(T) \le \widetilde{O}\left(\frac{\gamma\sqrt{SA\,T\ln\frac{1}{\min\{1-\underline{\alpha},\,\underline{\alpha}\}}}}{(1-\gamma)^{2}} + \frac{SA}{(1-\gamma)^{3}}\right),$$

where $\widetilde{O}(\cdot)$ suppresses polylogarithmic factors in $S$, $A$, $T$, $1/\delta$, and $1/(1-\gamma)$.

Theorem [2](https://arxiv.org/html/2605.24345#Thmtheorem2) formalizes that the adaptive schedule accounts for the evolving robustness–exploration trade-off described in Section [3](https://arxiv.org/html/2605.24345#S3). For fixed problem parameters, the Bayesian regret bound for AQ-BRMDP grows as $\widetilde{O}(\sqrt{T})$ in the total number of interactions $T$. The term $\delta SA\sqrt{T}/(1-\gamma)^{2}$ captures the additional regret associated with using a more robust interaction policy early in learning. Choosing a larger constant $\delta$ would strengthen early robustness, at the cost of a larger $\delta SA\sqrt{T}/(1-\gamma)^{2}$ term in the regret bound. If the focus is on the standard exploration–exploitation trade-off rather than on prioritizing additional early-stage robustness, we can set $\delta=1/\sqrt{SA}$. This choice keeps the robustness-induced regret term at the same order as the first term, $\sqrt{SAT}/(1-\gamma)^{2}$, in its dependence on $S$, $A$, $T$, and $(1-\gamma)^{-1}$, up to the explicit factor $\gamma\sqrt{\ln(e/(1-\underline{\alpha}))}$ and polylogarithmic factors hidden in $\widetilde{O}(\cdot)$. The $T$-independent lower-order term $SA/(1-\gamma)^{3}$ comes from the finite-time analysis of the pseudo-episode construction for discounted infinite-horizon learning. When $T\ge SA\max\{S,\,1/(1-\gamma)^{2}\}$, this lower-order term is absorbed by the $\sqrt{T}$ terms, so the dominant scaling becomes $\widetilde{O}(\sqrt{SAT}/(1-\gamma)^{2})$.
Thus, with this choice of $\delta$, AQ-BRMDP achieves the same Bayesian regret rate as state-of-the-art Bayesian RL algorithms, such as the $\widetilde{O}(H\sqrt{SAT})$ bound for posterior sampling in RL (PSRL) (Osband and Van Roy [2017](https://arxiv.org/html/2605.24345#bib.bib31)), up to logarithmic and horizon-dependent factors.
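The two parameter conditions used in this discussion follow directly from the terms of the bound. The short calculation below only rearranges the terms of Theorem 2 and introduces no new assumptions:

```latex
\begin{align*}
\delta = \frac{1}{\sqrt{SA}}
  &\;\Longrightarrow\;
  \frac{\delta\,SA\,\sqrt{T}}{(1-\gamma)^{2}}
  = \frac{\sqrt{SA}\,\sqrt{T}}{(1-\gamma)^{2}}
  = \frac{\sqrt{SAT}}{(1-\gamma)^{2}}, \\[4pt]
\frac{SA}{(1-\gamma)^{3}} \le \frac{\sqrt{SAT}}{(1-\gamma)^{2}}
  &\;\Longleftrightarrow\;
  \sqrt{T} \ge \frac{\sqrt{SA}}{1-\gamma}
  \;\Longleftrightarrow\;
  T \ge \frac{SA}{(1-\gamma)^{2}}.
\end{align*}
```

Combining the second condition with the standing requirement $T\ge S^{2}A$ of the theorem gives $T\ge SA\max\{S,\,1/(1-\gamma)^{2}\}$.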

The difference in horizon dependence stems from the different definitions of Bayesian regret in finite-horizon episodic and discounted infinite-horizon settings. Finite-horizon episodic PSRL bounds of order $\widetilde{O}(H\sqrt{SAT})$ are typically expressed with a linear dependence on the horizon $H$ (Osband and Van Roy [2017](https://arxiv.org/html/2605.24345#bib.bib31)), because regret is accumulated once per episode through the value gap at the initial state of each episode. In contrast, our discounted infinite-horizon regret accumulates a value gap at every interaction time $t$. Since each value gap itself represents a discounted future loss over an effective horizon of order $(1-\gamma)^{-1}$, this per-step regret criterion introduces an additional effective-horizon factor, leading to the $(1-\gamma)^{-2}$ dependence in our bound.

The bound on the Bayesian robust regret of AQ-BRMDP, $BR\text{-}R(T)$, indicates that the adaptive quantile schedule, while becoming less robust to encourage learning of the true-environment optimal policy, does not incur large cumulative regret relative to the robust-optimal value. The bound grows as $\widetilde{O}(\sqrt{SAT})$, and increases as $\underline{\alpha}$ approaches $0$ or $1$, following an order of $O\left(\sqrt{\ln\frac{1}{\min\{1-\underline{\alpha},\,\underline{\alpha}\}}}\right)$. This dependence is comparable to the state-action and time dependence appearing in recent tabular online robust and distributionally robust RL guarantees. For finite-horizon robust MDPs with $K$ episodes and horizon $H$, Dong et al. ([2022](https://arxiv.org/html/2605.24345#bib.bib11)) obtain frequentist robust-regret bounds of order $\widetilde{O}(H^{2}S\sqrt{AK})$ for $(s,a)$-rectangular uncertainty sets. More recent online $f$-divergence distributionally robust RL results obtain $\sqrt{SAK}$ dependence, for example $\widetilde{O}(\sqrt{H^{4}(1+\sigma)SAK})$ for $\chi^{2}$ ambiguity sets, where $\sigma$ is the radius of the ambiguity set (Ghosh et al. [2026](https://arxiv.org/html/2605.24345#bib.bib17)). The comparison, however, should be interpreted carefully. Existing online robust RL results typically establish frequentist robust regret bounds under a fixed but unknown environment, with the benchmark given by the optimal policy for a worst-case value over an ambiguity set. Such frequentist guarantees are stronger than Bayesian regret guarantees when the model class and benchmark coincide: Bayesian regret is the prior average of the regret in each true environment, so a bound that holds for every environment also holds for the average.
In our setting, however, the BR-MDP formulation itself is posterior-based: epistemic uncertainty is represented by the posterior distribution, and the Bellman backup is evaluated through a posterior risk criterion. Bayesian regret is therefore the natural performance measure, in the same spirit as PSRL, where regret is commonly analyzed after averaging over the prior distribution (Osband et al. [2013](https://arxiv.org/html/2605.24345#bib.bib29), Osband and Van Roy [2017](https://arxiv.org/html/2605.24345#bib.bib31), Xu et al. [2024](https://arxiv.org/html/2605.24345#bib.bib48)).

## 5Experiments

We evaluate AQ-BRMDP in finite-state environments with two complementary settings: RiverSwim in Section [5.1](https://arxiv.org/html/2605.24345#S5.SS1), which requires sustained exploration toward a distant rewarding state, and risky-branch FrozenLake in Section [5.2](https://arxiv.org/html/2605.24345#S5.SS2), where exploratory shortcut actions can lead to sticky holes. Appendix [EC.7.6](https://arxiv.org/html/2605.24345#A7.SS6) reports the sensitivity analysis for the schedule parameter $\delta$ in AQ-BRMDP. We also implement an extension of AQ-BRMDP to continuous-state spaces and evaluate its empirical performance in a continuous-state environment in Appendix [EC.7](https://arxiv.org/html/2605.24345#A7).

We compare AQ-BRMDP with two classes of algorithms: Continuing PSRL and fixed-quantile BR-MDPs, denoted by BRMDP-0.1, BRMDP-0.3, and BRMDP-0.5, which use constant quantile levels $\alpha=0.1, 0.3, 0.5$, respectively. Continuing PSRL uses the same pseudo-episode mechanism but samples one transition kernel from the current posterior and acts optimally in the sampled MDP. The fixed-quantile BR-MDP baselines use a constant quantile level throughout learning; smaller quantiles correspond to more robust policies as discussed in Section [2](https://arxiv.org/html/2605.24345#S2). For BRMDP-based methods, posterior quantiles in Bellman backups are approximated by empirical quantiles from posterior samples; the corresponding value-iteration procedure and Monte Carlo budgets are given in Appendix [EC.7.1](https://arxiv.org/html/2605.24345#A7.SS1). Unless otherwise noted, curves report empirical means over $100$ independent runs with $95\%$ confidence intervals.
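The empirical-quantile approximation of a single backup term can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Appendix EC.7.1: the function names, the sample budget, and the lower empirical quantile convention are hypothetical choices. The sketch draws transition vectors from a Dirichlet posterior with pseudo-counts `counts` for one state-action pair, evaluates $P^{\top}V$ for each draw, and returns the empirical $\alpha$-quantile.

```python
import math
import random


def sample_dirichlet(counts, rng):
    """Draw one probability vector from Dirichlet(counts) via normalized Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def empirical_quantile(values, alpha):
    """Lower empirical alpha-quantile (assumed convention): the smallest sample
    such that at least an alpha fraction of the samples is at or below it."""
    ordered = sorted(values)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


def quantile_backup(counts, V, alpha, num_samples, rng):
    """Approximate the posterior alpha-quantile of P^T V for one (s, a) pair."""
    expected_values = []
    for _ in range(num_samples):
        p = sample_dirichlet(counts, rng)
        expected_values.append(sum(pi * vi for pi, vi in zip(p, V)))
    return empirical_quantile(expected_values, alpha)


if __name__ == "__main__":
    V = [0.0, 1.0, 5.0]        # hypothetical next-state values
    counts = [2.0, 3.0, 1.0]   # hypothetical Dirichlet pseudo-counts
    for alpha in (0.1, 0.5, 0.9):
        print(alpha, quantile_backup(counts, V, alpha, 500, random.Random(0)))
```

Reusing the same seed across quantile levels, as in the usage lines, makes the three estimates share one set of posterior draws, so they are ordered by $\alpha$.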

### 5\.1Exploration\-Demanding Environments

We first consider RiverSwim-$n$ with $n\in\{6,10\}$. The state space is $\mathcal{S}_n=\{1,\ldots,n\}$, the initial state is $1$, and the action space is $\mathcal{A}=\{a_L,a_R\}$. The left action $a_L$ moves deterministically to the left neighboring state, and at state $1$ the next state remains $1$, i.e., $P^{c}(s-1\mid s,a_L)=1$ for $2\le s\le n$ and $P^{c}(1\mid 1,a_L)=1$. The rewards satisfy $r(1,a_L,s')=0.005$, $r(n,a_R,n)=1$, and $r(s,a,s')=0$ otherwise. For the right action, the true transition kernel makes sustained rightward progress difficult: at interior states $2\le s\le n-1$, $P^{c}(s+1\mid s,a_R)=0.35$, $P^{c}(s\mid s,a_R)=0.60$, and $P^{c}(s-1\mid s,a_R)=0.05$; at the left end, $P^{c}(2\mid 1,a_R)=0.60$ and $P^{c}(1\mid 1,a_R)=0.40$; at the right end, $P^{c}(n\mid n,a_R)=0.60$ and $P^{c}(n-1\mid n,a_R)=0.40$. Thus, obtaining reward $1$ requires sustained rightward exploration through the chain under stochastic transitions. The ten-state version lengthens this path and makes sustained exploration harder. The agent maintains a Dirichlet posterior over each transition vector $P^{c}_{s,a}$. A schematic of RiverSwim-6 is provided in Appendix Figure [EC.3](https://arxiv.org/html/2605.24345#A7.F3).
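The transition kernel and rewards described above can be written down directly. The sketch below is one possible encoding, not the authors' code: it uses 0-based state indices (state $1$ in the text is index `0`), encodes $a_L$ as `0` and $a_R$ as `1`, and the function names are hypothetical.

```python
LEFT, RIGHT = 0, 1


def riverswim_kernel(n):
    """Return P[s][a][s_next] for RiverSwim-n with 0-based state indices."""
    P = [[[0.0] * n for _ in range(2)] for _ in range(n)]
    for s in range(n):
        # Left action: deterministic move to the left neighbor (stay at the left end).
        P[s][LEFT][max(s - 1, 0)] = 1.0
        if s == 0:                      # left end
            P[s][RIGHT][1] = 0.60
            P[s][RIGHT][0] = 0.40
        elif s == n - 1:                # right end
            P[s][RIGHT][n - 1] = 0.60
            P[s][RIGHT][n - 2] = 0.40
        else:                           # interior states
            P[s][RIGHT][s + 1] = 0.35
            P[s][RIGHT][s] = 0.60
            P[s][RIGHT][s - 1] = 0.05
    return P


def riverswim_reward(s, a, s_next, n):
    """Reward r(s, a, s_next) with 0-based state indices."""
    if s == 0 and a == LEFT:
        return 0.005
    if s == n - 1 and a == RIGHT and s_next == n - 1:
        return 1.0
    return 0.0


if __name__ == "__main__":
    P = riverswim_kernel(6)
    print(P[2][RIGHT])   # transition vector for the right action at an interior state
```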

#### Results\.

Figure [1](https://arxiv.org/html/2605.24345#S5.F1) reports cumulative true regret, moving-average reward trajectories, and state-occupancy heatmaps for RiverSwim-6 and RiverSwim-10. The reward trajectory at time $t$ is the moving average $\bar{r}_t^{(w)} := w^{-1}\sum_{\tau=t-w+1}^{t} r_\tau$ with a window size of $w=100$. The state-occupancy heatmap reports the empirical state-occupancy measure $\widehat{d}_T(s) := T^{-1}\sum_{t=0}^{T-1}\mathbf{1}\{s_t=s\}$. In RiverSwim-10, the fixed-quantile BR-MDP baselines behave similarly and their curves largely overlap in the cumulative-regret and moving-average-reward panels.
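Both reported statistics are simple to compute from a logged run. The sketch below assumes rewards and states are stored in 0-indexed Python lists and that states are integers in `range(num_states)`. How the first $w-1$ time steps are handled is not specified in the text, so this version simply rejects them. The function names are hypothetical.

```python
def moving_average_reward(rewards, t, w=100):
    """Moving average of rewards over the window t-w+1, ..., t (inclusive)."""
    if t < w - 1:
        raise ValueError("window extends before the start of the run")
    return sum(rewards[t - w + 1 : t + 1]) / w


def state_occupancy(states, num_states):
    """Empirical fraction of time steps spent in each state."""
    counts = [0] * num_states
    for s in states:
        counts[s] += 1
    return [c / len(states) for c in counts]


if __name__ == "__main__":
    print(moving_average_reward([1.0, 2.0, 3.0, 4.0], t=3, w=2))
    print(state_occupancy([0, 0, 1, 2], num_states=3))
```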

We have the following observations\.

1. (1) Sustained exploration by AQ-BRMDP. Compared with Continuing PSRL, AQ-BRMDP achieves lower cumulative regret in both RiverSwim-6 and RiverSwim-10. Its moving-average reward increases more quickly early in learning, indicating that it reaches the right end of the chain earlier. The occupancy heatmaps also show that AQ-BRMDP spends more time in states near the right end of the chain, especially in RiverSwim-10. These observations indicate that AQ-BRMDP supports more effective sustained exploration than Continuing PSRL.
2. (2) Effect of fixed quantile levels. The fixed-quantile BR-MDP baselines show that small quantile levels can be overly conservative in exploration-demanding environments. In RiverSwim-6, BRMDP-0.1 remains concentrated near state $1$, its moving-average reward stays close to the local reward $0.005$, and its cumulative regret is the largest among the fixed-quantile baselines. BRMDP-0.5 reaches states near the right end of the chain more often and attains higher moving-average rewards. In RiverSwim-10, all three fixed-quantile baselines remain concentrated near the initial state and their cumulative regret grows approximately linearly.

These results are consistent with the mechanism in Section [EC.4](https://arxiv.org/html/2605.24345#A4). A fixed lower-tail quantile can make the policy too conservative to move away from the initial region and collect the data needed to reach the right end of the chain. By contrast, the adaptive quantile schedule in AQ-BRMDP reduces this conservatism over learning and improves exploration toward distant high-reward states.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_1_rs_overview.png)

Figure 1: Performance in RiverSwim-6 (left column) and RiverSwim-10 (right column). The first row shows cumulative true regret, the second row shows moving-average rewards with a window size of $100$, and the third row shows state-occupancy heatmaps.

### 5\.2Exploration\-Costly Environments

We next consider a risky-branch variant of FrozenLake. A description of this problem is provided in Appendix Figure [EC.3](https://arxiv.org/html/2605.24345#A7.F3). The environment is a $4\times 4$ grid with states indexed by $\mathcal{S}=\{0,\ldots,15\}$, initial state $s_0=0$, goal state $s_G=15$, and hole states $H=\{5,7,11,12\}$. The action space is $\mathcal{A}=\{\mathrm{Left},\mathrm{Right},\mathrm{Up},\mathrm{Down}\}$. The reward is known: a transition to $s_G$ gives reward $1$, and all other transitions give reward $0$. For each state that is neither a hole nor the goal, the transition is slippery. If the agent chooses an action $a$, then with probability $0.50$ the next state follows the intended move, and with probability $0.25$ each it follows one of the two directions perpendicular to $a$. If a realized move leaves the grid, the agent remains in the current state. For $h\in H$ and action $a$, the agent moves in the intended direction with probability $p_h$ and remains at $h$ with probability $1-p_h$; we set $p_h=0.2$. After the agent reaches the goal, the next state is sampled from a uniform distribution supported on the non-hole, non-goal states. The risky shortcut is at the state-action pair $(2,\mathrm{Down})$. Under the true transition kernel, taking action $\mathrm{Down}$ at state $2$ does not follow the default slippery transition rule. Instead, $P^{c}(10\mid 2,\mathrm{Down})=1-\theta$ and $P^{c}(5\mid 2,\mathrm{Down})=\theta$. Thus, this action can shorten the path to the goal when it succeeds, but with probability $\theta$ it sends the agent into a sticky hole. We consider $\theta=0.7$ and $\theta=0.9$.
The agent treats the entire transition kernel $P^{c}$ as unknown and maintains an independent Dirichlet posterior over each $P^{c}_{s,a}$.
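The true kernel of this environment can be assembled from the rules above. The sketch below is a reading of the description rather than the authors' implementation. It assumes row-major state indexing (state `s` is at row `s // 4`, column `s % 4`), and it assumes that the stay-in-place rule for moves that leave the grid also applies to the intended move out of a hole. All names are hypothetical.

```python
SIZE = 4
GOAL = 15
HOLES = {5, 7, 11, 12}
MOVES = {"Left": (0, -1), "Right": (0, 1), "Up": (-1, 0), "Down": (1, 0)}
PERPENDICULAR = {"Left": ("Up", "Down"), "Right": ("Up", "Down"),
                 "Up": ("Left", "Right"), "Down": ("Left", "Right")}


def move(s, direction):
    """Apply a realized move; the agent stays put if the move leaves the grid."""
    row, col = divmod(s, SIZE)
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < SIZE and 0 <= new_col < SIZE:
        return new_row * SIZE + new_col
    return s


def risky_frozenlake_kernel(theta, p_h=0.2):
    """Return a dict mapping (state, action) to a next-state probability list."""
    num_states = SIZE * SIZE
    restart = [s for s in range(num_states) if s not in HOLES and s != GOAL]
    P = {}
    for s in range(num_states):
        for a in MOVES:
            row = [0.0] * num_states
            if s == GOAL:                      # uniform restart after the goal
                for t in restart:
                    row[t] = 1.0 / len(restart)
            elif s in HOLES:                   # sticky hole
                row[move(s, a)] += p_h
                row[s] += 1.0 - p_h
            elif (s, a) == (2, "Down"):        # risky shortcut
                row[10] = 1.0 - theta
                row[5] = theta
            else:                              # default slippery rule
                row[move(s, a)] += 0.50
                for side in PERPENDICULAR[a]:
                    row[move(s, side)] += 0.25
            P[(s, a)] = row
    return P


if __name__ == "__main__":
    P = risky_frozenlake_kernel(theta=0.7)
    print(P[(2, "Down")])
```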

In this environment, we also compute the posterior $0.1$-quantile value defined in ([12](https://arxiv.org/html/2605.24345#S2.E12)) to evaluate the robustness of the policy executed during learning. For a time step $t$ in pseudo-episode $k$, let $\phi_t := \phi_k$ and $\pi_t := \pi_k$. The posterior $0.1$-quantile value of $\pi_t$ at the initial state $s_0$, $V^{\pi_t,\mathrm{q}}_{\phi_t,0.1}(s_0)$, means that when $\pi_t$ is deployed in the true environment, its value exceeds $V^{\pi_t,\mathrm{q}}_{\phi_t,0.1}(s_0)$ with posterior probability at least $0.9$. Hence, a larger $V^{\pi_t,\mathrm{q}}_{\phi_t,0.1}(s_0)$ indicates better robustness performance under the current posterior. As discussed in Appendix [EC.7.5](https://arxiv.org/html/2605.24345#A7.SS5), we estimate $V^{\pi_t,\mathrm{q}}_{\phi_t,0.1}(s_0)$ via Monte Carlo sampling and report the approximation $\widehat{V}^{\pi_t,\mathrm{q}}_{\phi_t,0.1}(s_0)$ in Figure [3](https://arxiv.org/html/2605.24345#S5.F3).
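One way to form such a Monte Carlo estimate is sketched below. The details of Appendix EC.7.5 are not reproduced here, so this is an assumption-laden illustration: it samples transition kernels under the fixed policy from independent Dirichlet posteriors, evaluates the policy in each sampled MDP by iterating the policy-evaluation recursion, and takes the lower empirical $\alpha$-quantile of the values at $s_0$. Rewards are assumed to depend only on the arrival state, as in the goal reward of this environment. The names `counts_pi`, `r_arrive`, and the sample budget are hypothetical.

```python
import math
import random


def sample_dirichlet(counts, rng):
    """Draw one probability vector from Dirichlet(counts) via normalized Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def policy_value(P_pi, r_arrive, gamma, iters=500):
    """Iterate V(s) = sum_t P_pi[s][t] * (r_arrive[t] + gamma * V(t))."""
    n = len(P_pi)
    V = [0.0] * n
    for _ in range(iters):
        V = [sum(P_pi[s][t] * (r_arrive[t] + gamma * V[t]) for t in range(n))
             for s in range(n)]
    return V


def posterior_quantile_value(counts_pi, r_arrive, gamma, s0,
                             alpha=0.1, num_samples=200, seed=0):
    """Empirical alpha-quantile of the policy value at s0 over posterior kernel samples.

    counts_pi[s] holds the Dirichlet pseudo-counts for the action the policy takes in s.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(num_samples):
        P_pi = [sample_dirichlet(row, rng) for row in counts_pi]
        values.append(policy_value(P_pi, r_arrive, gamma)[s0])
    values.sort()
    return values[max(0, math.ceil(alpha * num_samples) - 1)]


if __name__ == "__main__":
    counts_pi = [[5.0, 1.0], [1.0, 5.0]]   # hypothetical two-state posterior
    print(posterior_quantile_value(counts_pi, [0.0, 1.0], gamma=0.9, s0=0))
```

With this estimator, roughly a $1-\alpha$ fraction of the sampled kernels yield a value at $s_0$ at or above the returned number, which matches the interpretation given in the text.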

#### Results\.

Figures [2](https://arxiv.org/html/2605.24345#S5.F2) and [3](https://arxiv.org/html/2605.24345#S5.F3) show the following patterns.

1. (1) Effective robustness–exploration trade-off. The cumulative robust-regret and true-regret curves together show that AQ-BRMDP achieves lower regret than Continuing PSRL under both criteria. This indicates that the adaptive quantile schedule can keep the policy robust relative to the optimal value of the $\underline{\alpha}$-quantile BR-MDP while also effectively learning the true-optimal policy.
2. (2) Reduced true regret of BRMDP-based policies in risky environments. In both settings, AQ-BRMDP and the BRMDP-$\alpha$ baselines accumulate lower true regret than Continuing PSRL. This indicates that BRMDP-based policies are more effective in these risky-shortcut environments. The reason is that when the shortcut transition is not yet well estimated, a posterior sample may underestimate $P^{c}(5\mid 2,\mathrm{Down})$, making policies that route the agent through state $2$ appear overly favorable. Under the true kernel, however, repeated use of action $\mathrm{Down}$ at state $2$ can move the agent to the sticky hole with high probability and produce larger regret. BRMDP-based policies instead conservatively evaluate the shortcut through posterior quantiles, and therefore tend to avoid the risky branch.
3. (3) Non-monotone effect of fixed quantile levels. The BRMDP-$\alpha$ baselines reduce this early cost by evaluating lower-tail performance and therefore tend to avoid sticky holes and the risky branch. Among these baselines, BRMDP-0.3 generally yields lower regret than BRMDP-0.5, showing that some robustness is useful when the shortcut transition is not yet well estimated. However, BRMDP-0.1 can have higher regret than BRMDP-0.3, indicating that excessive conservatism may also cause large true regret. Thus, the performance of BRMDP-$\alpha$ is not monotone in the quantile level $\alpha$.
4. (4) Posterior robustness under adaptive scheduling. AQ-BRMDP generally attains the largest posterior $0.1$-quantile value in both settings, showing that its executed policies maintain stronger posterior robustness while still improving true regret relative to Continuing PSRL.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_theta07_true_regret.png)

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_theta07_robust_regret.png)

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_theta09_true_regret.png)

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_theta09_robust_regret.png)

Figure 2: Cumulative true regret (left column) and cumulative robust regret with $\underline{\alpha}=0.2$ (right column) in risky-branch FrozenLake for $\theta=0.7$ and $\theta=0.9$.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_theta07_qalpha01_s0.png)

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_theta09_qalpha01_s0.png)

Figure 3: Posterior $0.1$-quantile value $\widehat{V}^{\pi_t,\mathrm{q}}_{\phi_t,0.1}(s_0)$ in risky-branch FrozenLake for $\theta=0.7$ and $\theta=0.9$.

## 6Conclusions

This paper studies the $\alpha$-quantile BR-MDP from both modeling and algorithmic perspectives. From a modeling perspective, we formulate the $\alpha$-quantile BR-MDP, in which the quantile level provides a flexible way to adjust the risk attitude toward epistemic uncertainty. The formulation can induce robust or exploratory behavior. We further characterize this flexibility through an asymptotic normality result: for a fixed policy, the $\alpha$-quantile BR-MDP value function differs from the true value function by a mean term whose sign is determined by whether $\alpha$ is above or below $1/2$, and whose magnitude increases as $\alpha$ moves farther into either tail and shrinks as the posterior concentrates. The posterior quantile value result further provides a finite-sample robustness interpretation for lower-tail quantile evaluation. From an algorithmic perspective, we develop AQ-BRMDP to account for the evolving robustness–exploration trade-off in online RL. The adaptive schedule varies with the pseudo-episode index and relative state-action visit counts, allowing the policy to be more conservative early in learning and less conservative when further exploration is needed. We prove sublinear Bayesian regret bounds with respect to both the true optimal value and the robust-optimal value, and numerically demonstrate that AQ-BRMDP performs effectively in both exploration-demanding and exploration-costly environments. We also implement an extension of the proposed algorithm to continuous-state spaces and evaluate its empirical performance in a continuous-state environment.

Several directions remain open\. The present analysis is developed for finite state and action spaces with direct transition parameterization and Dirichlet posteriors\. Extending the formulation and the theoretical guarantees to continuous\-state and continuous\-action spaces would broaden the scope of the approach\. On the computational side, the method relies on repeated posterior sampling and solving the quantile BR\-MDP at the beginning of each pseudo\-episode\. Although sampling transition kernels from the Dirichlet posterior is carried out in the simulator rather than through the more costly online interaction with the true environment, it can still materially increase computational cost\. It would therefore be useful to understand how previously sampled transition kernels can be reused\.

## Acknowledgments

This work was supported by the Air Force Office of Scientific Research \(AFOSR\) under Grant FA9550\-25\-1\-0310 and the National Science Foundation under Award ECCS\-2419562\.

## References

- Abbasi\-Yadkori and Szepesvári \(2015\)Abbasi\-Yadkori Y, Szepesvári C \(2015\) Bayesian optimal control of smoothly parameterized systems\.*Proceedings of the Thirty\-First Conference on Uncertainty in Artificial Intelligence*, 1–11\.
- Agrawal and Jia \(2023\)Agrawal S, Jia R \(2023\) Optimistic posterior sampling for reinforcement learning: Worst\-case regret bounds\.*Mathematics of Operations Research*48\(1\):363–392\.
- Azar et al\. \(2017\)Azar MG, Osband I, Munos R \(2017\) Minimax regret bounds for reinforcement learning\.*Proceedings of the 34th International Conference on Machine Learning*, volume 70 of*Proceedings of Machine Learning Research*, 263–272\.
- Badrinath and Kalathil \(2021\)Badrinath KP, Kalathil D \(2021\) Robust reinforcement learning using least squares policy iteration with provable performance guarantees\.*Proceedings of the 38th International Conference on Machine Learning*, volume 139 of*Proceedings of Machine Learning Research*, 511–520\.
- Bartlett and Tewari \(2009\)Bartlett PL, Tewari A \(2009\) REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs\.*Proceedings of the Twenty\-Fifth Conference on Uncertainty in Artificial Intelligence*, 35–42 \(AUAI Press\)\.
- Blanchet et al\. \(2023\)Blanchet J, Lu M, Zhang T, Zhong H \(2023\) Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage\.*Advances in Neural Information Processing Systems*, volume 36\.
- Boucheron and Gassiat \(2009\)Boucheron S, Gassiat E \(2009\) A Bernstein–von Mises theorem for discrete probability distributions\.*Electronic Journal of Statistics*3:114–148\.
- Brafman and Tennenholtz \(2002\)Brafman RI, Tennenholtz M \(2002\) R\-MAX: A general polynomial time algorithm for near\-optimal reinforcement learning\.*Journal of Machine Learning Research*3:213–231\.
- Delage and Mannor \(2010\)Delage E, Mannor S \(2010\) Percentile optimization for Markov decision processes with parameter uncertainty\.*Operations Research*58\(1\):203–213, URL[http://dx\.doi\.org/10\.1287/opre\.1080\.0685](http://dx.doi.org/10.1287/opre.1080.0685)\.
- Der Kiureghian and Ditlevsen \(2009\)Der Kiureghian A, Ditlevsen O \(2009\) Aleatory or epistemic? Does it matter?*Structural Safety*31\(2\):105–112\.
- Dong et al\. \(2022\)Dong J, Li J, Wang B, Zhang J \(2022\) Online policy optimization for robust MDP\. ArXiv:2209\.13841\.
- Dong et al\. \(2020\)Dong K, Wang Y, Chen X, Wang L \(2020\) Q\-learning with UCB exploration is sample efficient for infinite\-horizon MDP\.*International Conference on Learning Representations*\.
- Duff \(2002\)Duff MO \(2002\)*Optimal Learning: Computational Procedures for Bayes\-adaptive Markov Decision Processes*\. Ph\.D\. thesis, University of Massachusetts Amherst\.
- Dulac\-Arnold et al\. \(2021\)Dulac\-Arnold G, Levine N, Mankowitz DJ, Li J, Paduraru C, Gowal S, Hester T \(2021\) Challenges of real\-world reinforcement learning: Definitions, benchmarks and analysis\.*Machine Learning*110\(9\):2419–2468\.
- Garcelon et al\. \(2020\)Garcelon E, Ghavamzadeh M, Lazaric A, Pirotta M \(2020\) Conservative exploration in reinforcement learning\.*Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics*, volume 108 of*Proceedings of Machine Learning Research*, 1431–1441\.
- Ghavamzadeh et al\. \(2015\)Ghavamzadeh M, Mannor S, Pineau J, Tamar A \(2015\) Bayesian reinforcement learning: A survey\.*Foundations and Trends in Machine Learning*8\(5–6\):359–483\.
- Ghosh et al\. \(2026\)Ghosh D, Atia GK, Wang Y \(2026\) ORVIT: Near\-optimal online distributionally robust reinforcement learning\.*Proceedings of the AAAI Conference on Artificial Intelligence*40\(25\):21278–21286\.
- He et al\. \(2021\)He J, Zhou D, Gu Q \(2021\) Nearly minimax optimal reinforcement learning for discounted MDPs\.*Advances in Neural Information Processing Systems*, volume 34, 22288–22300\.
- Iyengar \(2005\)Iyengar GN \(2005\) Robust dynamic programming\.*Mathematics of Operations Research*30\(2\):257–280\.
- Jaksch et al\. \(2010\)Jaksch T, Ortner R, Auer P \(2010\) Near\-optimal regret bounds for reinforcement learning\.*Journal of Machine Learning Research*11\(51\):1563–1600\.
- Jin et al\. \(2018\)Jin C, Allen\-Zhu Z, Bubeck S, Jordan MI \(2018\) Is Q\-learning provably efficient?*Advances in Neural Information Processing Systems*, volume 31\.
- Kearns and Singh \(2002\)Kearns M, Singh S \(2002\) Near\-optimal reinforcement learning in polynomial time\.*Machine Learning*49\(2–3\):209–232\.
- Liang et al\. \(2025\)Liang B, Xu L, Taneja A, Tambe M, Janson L \(2025\) Context in public health for underserved communities: A Bayesian approach to online restless bandits\.*Proceedings of the AAAI Conference on Artificial Intelligence*39\(27\):28195–28203\.
- Lin et al\. \(2022\)Lin Y, Ren Y, Zhou E \(2022\) Bayesian risk Markov decision processes\.*Advances in Neural Information Processing Systems*, volume 35, 17430–17442\.
- Lin and Zhou \(2025\)Lin Y, Zhou E \(2025\) Approximate bilevel difference convex programming for Bayesian risk Markov decision processes\.*Proceedings of the AAAI Conference on Artificial Intelligence*39\(25\):26605–26613\.
- Lu et al\. \(2024\)Lu M, Zhong H, Zhang T, Blanchet J \(2024\) Distributionally robust reinforcement learning with interactive data collection: Fundamental hardness and near\-optimal algorithms\.*Advances in Neural Information Processing Systems*, volume 37\.
- Ma and Lee \(2026\)Ma J, Lee WS \(2026\) EUBRL: Epistemic uncertainty directed Bayesian reinforcement learning\.*International Conference on Learning Representations*\.
- Nilim and El Ghaoui \(2005\)Nilim A, El Ghaoui L \(2005\) Robust control of Markov decision processes with uncertain transition matrices\.*Operations Research*53\(5\):780–798\.
- Osband et al\. \(2013\)Osband I, Russo D, Van Roy B \(2013\) \(More\) efficient reinforcement learning via posterior sampling\.*Advances in Neural Information Processing Systems*, volume 26\.
- Osband and Van Roy \(2016\)Osband I, Van Roy B \(2016\) Posterior sampling for reinforcement learning without episodes\.*arXiv preprint arXiv:1608\.02731*\.
- Osband and Van Roy \(2017\)Osband I, Van Roy B \(2017\) Why is posterior sampling better than optimism for reinforcement learning?*Proceedings of the 34th International Conference on Machine Learning*, volume 70 of*Proceedings of Machine Learning Research*, 2701–2710 \(PMLR\)\.
- Panaganti and Kalathil \(2022\)Panaganti K, Kalathil D \(2022\) Sample complexity of robust reinforcement learning with a generative model\.*Proceedings of The 25th International Conference on Artificial Intelligence and Statistics*, volume 151 of*Proceedings of Machine Learning Research*, 9582–9602\.
- Panaganti et al\. \(2022\)Panaganti K, Xu Z, Kalathil D, Ghavamzadeh M \(2022\) Robust reinforcement learning using offline data\.*Advances in Neural Information Processing Systems*, volume 35, 32211–32224\.
- Poupart et al\. \(2006\)Poupart P, Vlassis N, Hoey J, Regan K \(2006\) An analytic solution to discrete Bayesian reinforcement learning\.*Proceedings of the 23rd International Conference on Machine Learning*, 697–704\.
- Puterman \(1994\)Puterman ML \(1994\)*Markov Decision Processes: Discrete Stochastic Dynamic Programming*\(New York: John Wiley & Sons\)\.
- Ross et al\. \(2007\)Ross S, Chaib\-draa B, Pineau J \(2007\) Bayes\-adaptive POMDPs\.*Advances in Neural Information Processing Systems*, volume 20, 1225–1232\.
- Russo et al\. \(2018\)Russo DJ, Van Roy B, Kazerouni A, Osband I, Wen Z \(2018\) A tutorial on Thompson sampling\.*Foundations and Trends in Machine Learning*11\(1\):1–96\.
- Strehl et al\. \(2006\)Strehl AL, Li L, Wiewiora E, Langford J, Littman ML \(2006\) PAC model\-free reinforcement learning\.*Proceedings of the 23rd International Conference on Machine Learning*, 881–888 \(ACM\)\.
- Strehl and Littman \(2008\)Strehl AL, Littman ML \(2008\) An analysis of model\-based interval estimation for Markov decision processes\.*Journal of Computer and System Sciences*74\(8\):1309–1331\.
- Tarbouriech et al\. \(2023\)Tarbouriech J, Lattimore T, O’Donoghue B \(2023\) Probabilistic inference in reinforcement learning done right\. ArXiv:2311\.13294\.
- Tiapkin et al\. \(2022\)Tiapkin D, Belomestny D, Moulines E, Naumov A, Samsonov S, Tang Y, Valko M, Menard P \(2022\) From Dirichlet to Rubin: Optimistic exploration in RL without bonuses\.*Proceedings of the 39th International Conference on Machine Learning*, volume 162 of*Proceedings of Machine Learning Research*, 21380–21431\.
- Wang and Zhou \(2023\)Wang Y, Zhou E \(2023\) Bayesian risk\-averse Q\-learning with streaming observations\.*Advances in Neural Information Processing Systems*, volume 36\.
- Wang and Zhou \(2025\)Wang Y, Zhou E \(2025\) Online Bayesian risk\-averse reinforcement learning\.*arXiv preprint arXiv:2509\.14077*\.
- Wang and Zou \(2021\)Wang Y, Zou S \(2021\) Online robust reinforcement learning with model uncertainty\.*Advances in Neural Information Processing Systems*, volume 34, 7193–7206\.
- Wiesemann et al\. \(2013\)Wiesemann W, Kuhn D, Rustem B \(2013\) Robust Markov decision processes\.*Mathematics of Operations Research*38\(1\):153–183\.
- Wu et al\. \(2018\)Wu D, Zhu H, Zhou E \(2018\) A Bayesian risk approach to data\-driven stochastic optimization: Formulations and asymptotics\.*SIAM Journal on Optimization*28\(2\):1588–1612\.
- Xu and Mannor \(2010\)Xu H, Mannor S \(2010\) Distributionally robust Markov decision processes\.*Advances in Neural Information Processing Systems*, volume 23\.
- Xu et al\. \(2024\)Xu W, Dong S, Van Roy B \(2024\) Posterior sampling for continuing environments\.*Reinforcement Learning Conference \(RLC\)*\.
- Yamagata and Santos\-Rodríguez \(2024\)Yamagata T, Santos\-Rodríguez R \(2024\) Safe and robust reinforcement learning: Principles and practice\. ArXiv:2403\.18539\.
- Zhou and Xie \(2015\)Zhou E, Xie W \(2015\) Simulation optimization when facing input uncertainty\.*Proceedings of the 2015 Winter Simulation Conference*, 3714–3724\.
- Zhou et al\. \(2021\)Zhou Z, Zhou Z, Bai Q, Qiu L, Blanchet J, Glynn P \(2021\) Finite\-sample regret bound for distributionally robust offline tabular reinforcement learning\.*Proceedings of The 24th International Conference on Artificial Intelligence and Statistics*, volume 130 of*Proceedings of Machine Learning Research*, 3331–3339\.

## Appendix: Proofs and Implementation Details

## Appendix EC\.1 Dirichlet Posterior Update for the Transition Kernel

Let $\Delta^{S}:=\{p\in\mathbb{R}_{+}^{S}:\sum_{s'\in\mathcal{S}}p(s')=1\}$ denote the probability simplex. For a parameter vector $\eta\in\mathbb{R}_{++}^{S}$, let $\mathrm{Dir}(\eta)$ denote the Dirichlet distribution on $\Delta^{S}$ with density

$$f(p\mid\eta)\propto\prod_{s'\in\mathcal{S}}p(s')^{\eta(s')-1}.$$

For each state-action pair $(s,a)$, we place the prior $P_{s,a}\sim\mathrm{Dir}(\phi_{0}(s,a))$, where $\phi_{0}(s,a)=(\phi_{0}(s,a,s'))_{s'\in\mathcal{S}}$ is the prior parameter vector with $\phi_{0}(s,a,s')>0$. The prior parameter $\phi_{0}(s,a)$ can be chosen based on prior knowledge; in the absence of such information, a common choice is the uniform Dirichlet prior $\phi_{0}(s,a,s')=1$ for all $s'\in\mathcal{S}$.

Let $N(s,a,s')$ be the number of observed transitions from $(s,a)$ to $s'$, let $N(s,a):=\sum_{s'\in\mathcal{S}}N(s,a,s')$ denote the number of visits to $(s,a)$, and let $N:=\sum_{s,a}N(s,a)$ denote the total number of observations. Define the $\sigma$-field generated by the observed transitions as $\mathcal{F}_{N}=\sigma\{(s_{i},a_{i},s_{i+1}):i=0,\ldots,N-1\}$. By conjugacy, the posterior is

$$P_{s,a}\mid\mathcal{F}_{N}\sim\mathrm{Dir}(\phi_{N}(s,a)),$$

where

$$\phi_{N}(s,a,s')=\phi_{0}(s,a,s')+N(s,a,s'),\qquad\forall s'\in\mathcal{S}.\tag{EC.1}$$

Thus, the posterior parameter $\phi_{N}$ is determined by the prior parameter $\phi_{0}$ and the transition counts induced by the observed trajectory. In the main text, when the current posterior is fixed, we suppress the dependence on $N$ and write $\phi(s,a)$ for $\phi_{N}(s,a)$.
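
As a rough illustration of the conjugate update (EC.1), the following minimal Python sketch accumulates transition counts from a list of $(s,a,s')$ triples and adds them to a uniform prior. The function names `dirichlet_posterior` and `posterior_mean` and the toy trajectory are hypothetical and chosen only for this example; they are not taken from the paper.

```python
from collections import Counter


def dirichlet_posterior(transitions, states, actions, prior=1.0):
    """Posterior parameters phi_N(s, a, s') = phi_0(s, a, s') + N(s, a, s').

    Assumes the same prior value for every (s, a, s'); prior=1.0 is the
    uniform Dirichlet prior mentioned in the text.
    """
    counts = Counter(transitions)  # N(s, a, s'); missing keys count as 0
    return {
        (s, a): {s_next: prior + counts[(s, a, s_next)] for s_next in states}
        for s in states
        for a in actions
    }


def posterior_mean(phi_sa):
    """Mean of Dir(phi_sa): each parameter divided by their sum."""
    total = sum(phi_sa.values())
    return {s_next: value / total for s_next, value in phi_sa.items()}


# Hypothetical toy trajectory, given as (s, a, s') triples.
transitions = [(0, 0, 1), (1, 0, 0), (0, 0, 1), (1, 1, 1), (1, 0, 0)]
phi = dirichlet_posterior(transitions, states=[0, 1], actions=[0, 1])

print(phi[(0, 0)])                  # {0: 1.0, 1: 3.0}
print(posterior_mean(phi[(0, 0)]))  # {0: 0.25, 1: 0.75}
```

An unvisited pair such as `(0, 1)` keeps its prior parameters, which is why the posterior there stays uniform until data arrive.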

## Appendix EC\.2 Proof of Asymptotic Normality

###### Lemma 4 \(Basic properties of the quantile functional\)\.

Let $\rho^{\alpha}$ be the left $\alpha$-quantile functional, $\alpha\in(0,1)$. Then for any real random variables $X,Y$:

1. (i) for any constant $c\in\mathbb{R}$, $\rho^{\alpha}(X+c)=\rho^{\alpha}(X)+c$;
2. (ii) for any constant $c>0$, $\rho^{\alpha}(cX)=c\,\rho^{\alpha}(X)$;
3. (iii) if $X\geq Y$ almost surely, then $\rho^{\alpha}(X)\geq\rho^{\alpha}(Y)$;
4. (iv) if $|X-Y|\leq c$ almost surely for some $c\geq 0$, then $|\rho^{\alpha}(X)-\rho^{\alpha}(Y)|\leq\|X-Y\|_{\infty}\leq c$.
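
The four properties can be sanity-checked on empirical distributions, where the left quantile is an order statistic. The sketch below is only a finite-sample illustration under assumptions of our own: `left_quantile` is a hypothetical helper, $X$ is a simulated Gaussian sample, and $Y$ is coupled to $X$ point by point so that $|X-Y|\leq 0.3$.

```python
import math
import random


def left_quantile(sample, alpha):
    """Left alpha-quantile inf{x : F(x) >= alpha} of the empirical cdf F."""
    ordered = sorted(sample)
    k = math.ceil(alpha * len(ordered))  # smallest k with k / n >= alpha
    return ordered[max(k, 1) - 1]


rng = random.Random(0)
alpha, c = 0.1, 2.5
x = [rng.gauss(0.0, 1.0) for _ in range(1000)]
# y is coupled to x on the same sample points, with |x_i - y_i| <= 0.3.
y = [xi + rng.uniform(-0.3, 0.3) for xi in x]

qx, qy = left_quantile(x, alpha), left_quantile(y, alpha)

# (i) translation equivariance and (ii) positive homogeneity
shift_ok = math.isclose(left_quantile([xi + c for xi in x], alpha), qx + c)
scale_ok = math.isclose(left_quantile([c * xi for xi in x], alpha), c * qx)
# (iii) monotonicity: x_i + 0.3 >= y_i at every sample point
mono_ok = left_quantile([xi + 0.3 for xi in x], alpha) >= qy
# (iv) 1-Lipschitz in the sup norm (small slack for floating-point rounding)
sup_gap = max(abs(xi - yi) for xi, yi in zip(x, y))
lipschitz_ok = abs(qx - qy) <= sup_gap + 1e-12

print(shift_ok, scale_ok, mono_ok, lipschitz_ok)
```

Property (iv) is the one used later to obtain the contraction bound on $\Delta_{N}$.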

Let $(\Omega,\mathcal{F},\mathbb{P})$ be the underlying probability space on which the observed transition process is defined. For each $N$, let $\mathcal{D}_{N}:=\{s_{0},a_{0},s_{1},\ldots,s_{N-1},a_{N-1},s_{N}\}$ be the trajectory data in Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3), and set $\mathcal{F}_{N}:=\sigma(\mathcal{D}_{N})$. For the martingale argument below, write $\mathcal{F}_{i}:=\sigma(s_{0},a_{0},\ldots,s_{i},a_{i})$ for the history before observing $s_{i+1}$, $i=0,\ldots,N-1$. Throughout this proof, $O_{p}(\cdot)$, $o_{p}(\cdot)$, and $\xrightarrow{p}$ are understood with respect to the randomness of the observed data under $\mathbb{P}$ as $N\to\infty$.

For any integer $m\geq 2$ and any probability vector $p\in\mathbb{R}_{+}^{m}$ with $\sum_{i=1}^{m}p_{i}=1$, define

$$\Sigma(p):=\mathrm{diag}(p)-pp^{\top}.$$

Define the transition counts

$$N(s,a,s'):=\sum_{i=0}^{N-1}\mathbb{I}\{s_{i}=s,a_{i}=a,s_{i+1}=s'\},$$

and then $N(s,a)=\sum_{s'\in\mathcal{S}}N(s,a,s')$. For each $(s,a)$ with $N(s,a)>0$, define

$$\widetilde{P}_{N}(s,a)(s')=\frac{N(s,a,s')}{N(s,a)}.$$

Under the uniform Dirichlet prior, the posterior parameter satisfies $\phi(s,a,s')=N(s,a,s')+1$.

Here and below, for a real-valued posterior random variable $X_{N}$ and a real-valued random variable $X$ with cdf $F$, we write

$$X_{N}\mid\mathcal{F}_{N}\Rightarrow X\quad\text{in }\mathbb{P}\text{-probability}$$

to mean conditional weak convergence in probability. More precisely, for every bounded continuous test function $\psi:\mathbb{R}\to\mathbb{R}$,

$$\mathbb{E}[\psi(X_{N})\mid\mathcal{F}_{N}]-\mathbb{E}[\psi(X)]\xrightarrow{\mathbb{P}}0.$$

The conditional expectation is taken with respect to the posterior distribution given $\mathcal{F}_{N}$, whereas the convergence in probability is with respect to the randomness of the observed data.

Lemma [5](https://arxiv.org/html/2605.24345#Thmlemma5) is a fixed-dimensional Dirichlet special case of the Bernstein–von Mises theorem for discrete probability distributions (Boucheron and Gassiat [2009](https://arxiv.org/html/2605.24345#bib.bib7)); we restate it in the form needed here.

###### Lemma 5 \(Posterior CLT for a Dirichlet transition vector\)\.

For $(s,a)\in\mathcal{S}\times\mathcal{A}$, suppose $P_{s,a}\mid\mathcal{F}_{N}\sim\mathrm{Dir}\bigl(\phi(s,a)\bigr)$. Then, conditionally on $\mathcal{F}_{N}$, as $N(s,a)\to\infty$,

$$\sqrt{N(s,a)}\bigl(P_{s,a}-\widetilde{P}_{N}(s,a)\bigr)\Rightarrow\mathcal{N}\bigl(0,\Sigma(P^{c}_{s,a})\bigr)$$

in $\mathbb{P}$-probability.

###### Proof\.

The Bernstein–von Mises (BvM) theorem applies directly to the coordinates $J_{+}:=\{x:P^{c}_{s,a}(x)>0\}$, on which the true transition vector lies in the relative interior of the simplex (Boucheron and Gassiat [2009](https://arxiv.org/html/2605.24345#bib.bib7)). For $x\in J_{0}:=\{x:P^{c}_{s,a}(x)=0\}$, no transition to $x$ is observed almost surely, so $N(s,a,x)=0$. Since the prior parameters are fixed and positive, $\mathbb{E}[P_{s,a}(x)\mid\mathcal{F}_{N}]=O(N(s,a)^{-1})$, and hence, by Markov's inequality applied to the nonnegative variable $P_{s,a}(x)$, $P_{s,a}(x)=O_{p}(N(s,a)^{-1})$. Therefore

$$\sqrt{N(s,a)}\bigl(P_{s,a}(x)-\widetilde{P}_{N}(s,a)(x)\bigr)=\sqrt{N(s,a)}\,P_{s,a}(x)\xrightarrow{p}0.$$

Combining the BvM limit on $J_{+}$ with this degenerate limit on $J_{0}$ gives the stated Gaussian limit with covariance $\Sigma(P^{c}_{s,a})$, possibly singular.

∎
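
A quick deterministic check of the covariance in Lemma 5 is possible because the Dirichlet covariance is available in closed form: for $\mathrm{Dir}(\phi)$ with mean $m$, it equals $(\mathrm{diag}(m)-mm^{\top})/(\sum_{x}\phi(x)+1)$. The sketch below compares $N(s,a)$ times this covariance with $\Sigma(\widetilde{P}_{N})$ under idealized counts proportional to a hypothetical true vector `p_true`. It only checks second moments, not the full Gaussian limit, and all names are illustrative.

```python
def sigma(p):
    """Sigma(p) = diag(p) - p p^T as a nested list."""
    m = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(m)]
            for i in range(m)]


def scaled_posterior_cov(counts, prior=1.0):
    """N(s, a) times the exact covariance of Dir(counts + prior)."""
    n = sum(counts)
    phi = [c + prior for c in counts]
    phi_sum = sum(phi)
    mean = [value / phi_sum for value in phi]
    # Cov of Dir(phi) is (diag(mean) - mean mean^T) / (phi_sum + 1).
    return [[n * entry / (phi_sum + 1.0) for entry in row]
            for row in sigma(mean)]


def max_abs_gap(a, b):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


p_true = [0.5, 0.3, 0.2]  # hypothetical true transition vector
gaps = []
for n in (10, 100, 1000, 10000):
    # Idealized counts, so that the empirical frequencies equal p_true.
    counts = [round(n * pi) for pi in p_true]
    gaps.append(max_abs_gap(scaled_posterior_cov(counts), sigma(p_true)))

print(gaps)  # the gap shrinks as the visit count grows
```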

In the following lemma, we show that conditional weak convergence of random variables implies convergence of the corresponding conditional quantiles\.

###### Lemma 6\.

Let $(Z_{N})_{N\geq 1}$ be real-valued random variables on $(\Omega,\mathcal{F},\mathbb{P})$, and let $(\mathcal{F}_{N})_{N\geq 1}$ be a sequence of sub-$\sigma$-fields of $\mathcal{F}$. For each $N$, define the conditional cdf of $Z_{N}$ given $\mathcal{F}_{N}$ by

$$F_{N}(x):=\mathbb{P}(Z_{N}\leq x\mid\mathcal{F}_{N}),\qquad x\in\mathbb{R}.$$

Assume that there exists a real-valued random variable $Z$ with cdf $F$ such that $Z_{N}\mid\mathcal{F}_{N}\Rightarrow Z$ in $\mathbb{P}$-probability. For $\alpha\in(0,1)$, define

$$q_{N}:=\inf\{x\in\mathbb{R}:F_{N}(x)\geq\alpha\},\qquad q_{\alpha}:=\inf\{x\in\mathbb{R}:F(x)\geq\alpha\}.$$

If $F$ is continuous and strictly increasing on a neighborhood of $q_{\alpha}$, then

$$q_{N}\xrightarrow{p}q_{\alpha}.$$

In particular, if $F=\Phi$ is the standard normal cdf, then $q_{\alpha}=\Phi^{-1}(\alpha)=z_{\alpha}$, and hence $q_{N}\xrightarrow{p}z_{\alpha}$.

###### Proof\.

Fix a continuity point $x$ of $F$ and $\delta>0$. For $\varepsilon>0$, define

$$\psi^{-}_{x,\varepsilon}(t):=\begin{cases}1,&t\leq x-\varepsilon,\\ 1-\dfrac{t-(x-\varepsilon)}{\varepsilon},&x-\varepsilon<t<x,\\ 0,&t\geq x,\end{cases}\qquad\psi^{+}_{x,\varepsilon}(t):=\begin{cases}1,&t\leq x,\\ 1-\dfrac{t-x}{\varepsilon},&x<t<x+\varepsilon,\\ 0,&t\geq x+\varepsilon.\end{cases}$$

Then $\psi^{-}_{x,\varepsilon}(t)\leq\mathbf{1}\{t\leq x\}\leq\psi^{+}_{x,\varepsilon}(t)$ for all $t$, so

$$\mathbb{E}[\psi^{-}_{x,\varepsilon}(Z_{N})\mid\mathcal{F}_{N}]\leq F_{N}(x)\leq\mathbb{E}[\psi^{+}_{x,\varepsilon}(Z_{N})\mid\mathcal{F}_{N}].$$

By the assumed conditional weak convergence,

$$\mathbb{E}[\psi^{\pm}_{x,\varepsilon}(Z_{N})\mid\mathcal{F}_{N}]\xrightarrow{p}\mathbb{E}[\psi^{\pm}_{x,\varepsilon}(Z)].$$

Since $F$ is continuous at $x$, $\mathbb{E}[\psi^{-}_{x,\varepsilon}(Z)]\uparrow F(x)$ and $\mathbb{E}[\psi^{+}_{x,\varepsilon}(Z)]\downarrow F(x)$ as $\varepsilon\downarrow 0$. Choose $\varepsilon>0$ so that

$$F(x)-\frac{\delta}{2}\leq\mathbb{E}[\psi^{-}_{x,\varepsilon}(Z)]\leq F(x)\leq\mathbb{E}[\psi^{+}_{x,\varepsilon}(Z)]\leq F(x)+\frac{\delta}{2}.$$

Then

$$\mathbb{P}\left(F_{N}(x)<F(x)-\delta\right)\leq\mathbb{P}\left(\mathbb{E}[\psi^{-}_{x,\varepsilon}(Z_{N})\mid\mathcal{F}_{N}]<\mathbb{E}[\psi^{-}_{x,\varepsilon}(Z)]-\frac{\delta}{2}\right)\to 0,$$

and similarly,

$$\mathbb{P}\left(F_{N}(x)>F(x)+\delta\right)\leq\mathbb{P}\left(\mathbb{E}[\psi^{+}_{x,\varepsilon}(Z_{N})\mid\mathcal{F}_{N}]>\mathbb{E}[\psi^{+}_{x,\varepsilon}(Z)]+\frac{\delta}{2}\right)\to 0.$$

Hence

$$F_{N}(x)\xrightarrow{p}F(x).$$

For $\varepsilon>0$ small enough that $F(q_{\alpha}-\varepsilon)<\alpha<F(q_{\alpha}+\varepsilon)$, which is possible because $F$ is continuous and strictly increasing on a neighborhood of $q_{\alpha}$, we have

$$F_{N}(q_{\alpha}-\varepsilon)\xrightarrow{p}F(q_{\alpha}-\varepsilon),\qquad F_{N}(q_{\alpha}+\varepsilon)\xrightarrow{p}F(q_{\alpha}+\varepsilon).$$

Hence, $\mathbb{P}\bigl(F_{N}(q_{\alpha}-\varepsilon)<\alpha<F_{N}(q_{\alpha}+\varepsilon)\bigr)\to 1$. By the definition of the left quantile, this implies $\mathbb{P}(q_{\alpha}-\varepsilon<q_{N}\leq q_{\alpha}+\varepsilon)\to 1$. Therefore $q_{N}\xrightarrow{p}q_{\alpha}$. ∎

###### Lemma 7 \(Posterior quantile expansion for bounded linear forms\)\.

For $(s,a)\in\mathcal{S}\times\mathcal{A}$, suppose $N(s,a)\to\infty$. Let $v_{N}$ be an $\mathcal{F}_{N}$-measurable random vector such that $\|v_{N}\|_{\infty}\leq B$ almost surely for some constant $B<\infty$, and assume that $v_{N}\xrightarrow{p}v$ for some deterministic vector $v\in\mathbb{R}^{S}$. Define $\sigma^{2}(s,a;v_{N}):=v_{N}^{\top}\Sigma(P^{c}_{s,a})v_{N}$. Then

$$\rho_{\phi(s,a)}^{\alpha}\bigl(P^{\top}v_{N}\bigr)=\widetilde{P}_{N}(s,a)^{\top}v_{N}+\frac{z_{\alpha}}{\sqrt{N(s,a)}}\,\sigma(s,a;v_{N})+o_{p}\bigl(N(s,a)^{-1/2}\bigr).$$

###### Proof\.

Conditionally on $\mathcal{F}_{N}$, the vector $v_{N}$ is deterministic. If $\sigma(s,a;v)=0$, then $v$ is constant on the support of $P^{c}_{s,a}$, and Lemma [5](https://arxiv.org/html/2605.24345#Thmlemma5) together with $v_{N}\xrightarrow{p}v$ implies $P^{\top}v_{N}-\widetilde{P}_{N}(s,a)^{\top}v_{N}=o_{p}(N(s,a)^{-1/2})$; moreover $\sigma(s,a;v_{N})\xrightarrow{p}0$, so the variance term is also $o_{p}(N(s,a)^{-1/2})$. Thus it remains to consider the case $\sigma(s,a;v)>0$.

By Lemma [5](https://arxiv.org/html/2605.24345#Thmlemma5) and the continuous mapping theorem,

$$Z_{N}:=\frac{\sqrt{N(s,a)}\bigl(P^{\top}v_{N}-\widetilde{P}_{N}(s,a)^{\top}v_{N}\bigr)}{\sigma(s,a;v_{N})}\mid\mathcal{F}_{N}\Rightarrow\mathcal{N}(0,1)$$

in $\mathbb{P}$-probability. Let $F_{N}(x):=\mathbb{P}(Z_{N}\leq x\mid\mathcal{F}_{N})$, $x\in\mathbb{R}$, denote the posterior conditional cdf of $Z_{N}$ given $\mathcal{F}_{N}$. Applying Lemma [6](https://arxiv.org/html/2605.24345#Thmlemma6) with $F=\Phi$, we obtain

$$\rho^{\alpha}(Z_{N})=z_{\alpha}+o_{p}(1).$$

Now write

$$P^{\top}v_{N}=\widetilde{P}_{N}(s,a)^{\top}v_{N}+\frac{\sigma(s,a;v_{N})}{\sqrt{N(s,a)}}\,Z_{N}.$$

Using Lemma [4](https://arxiv.org/html/2605.24345#Thmlemma4)(i)–(ii),

$$\rho_{\phi(s,a)}^{\alpha}\bigl(P^{\top}v_{N}\bigr)=\widetilde{P}_{N}(s,a)^{\top}v_{N}+\frac{\sigma(s,a;v_{N})}{\sqrt{N(s,a)}}\bigl(z_{\alpha}+o_{p}(1)\bigr).$$

Finally, since $\sigma(s,a;v_{N})^{2}=v_{N}^{\top}\Sigma(P^{c}_{s,a})v_{N}\leq\|v_{N}\|_{\infty}^{2}\leq B^{2}$ almost surely, we have $\sigma(s,a;v_{N})=O_{p}(1)$, and therefore

$$\frac{\sigma(s,a;v_{N})}{\sqrt{N(s,a)}}\,o_{p}(1)=o_{p}\bigl(N(s,a)^{-1/2}\bigr).$$

This proves the claim. ∎
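
To get a feel for the expansion in Lemma 7, the following Monte Carlo sketch compares the posterior $\alpha$-quantile of $P^{\top}v$ under a Dirichlet posterior with the first-order approximation $\widetilde{P}_{N}^{\top}v+z_{\alpha}\sigma/\sqrt{N(s,a)}$. Everything numeric here is an assumption made for illustration: the vector `p_hat` plays the role of both $\widetilde{P}_{N}(s,a)$ and $P^{c}_{s,a}$ (idealized counts), `v` is an arbitrary bounded vector, and the Dirichlet draws use the standard normalized-Gamma construction.

```python
import math
import random
from statistics import NormalDist


def sample_dirichlet(phi, rng):
    """Draw from Dir(phi) by normalizing independent Gamma(phi_i, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in phi]
    total = sum(g)
    return [x / total for x in g]


def left_quantile(sample, alpha):
    """Left alpha-quantile of the empirical distribution of `sample`."""
    ordered = sorted(sample)
    return ordered[max(math.ceil(alpha * len(ordered)), 1) - 1]


rng = random.Random(0)
alpha = 0.1
p_hat = [0.5, 0.3, 0.2]   # hypothetical empirical frequencies
v = [1.0, 0.0, 2.0]       # hypothetical bounded vector v_N
n_visits = 2000           # N(s, a)

counts = [round(n_visits * p) for p in p_hat]
phi = [c + 1.0 for c in counts]   # uniform Dirichlet prior

# Monte Carlo estimate of the posterior alpha-quantile of P^T v.
draws = [sum(pi * vi for pi, vi in zip(sample_dirichlet(phi, rng), v))
         for _ in range(20000)]
mc_quantile = left_quantile(draws, alpha)

# First-order expansion: plug-in mean plus z_alpha * sigma / sqrt(N(s, a)).
mean = sum(p * vi for p, vi in zip(p_hat, v))
var = sum(p * vi * vi for p, vi in zip(p_hat, v)) - mean ** 2  # v^T Sigma(p) v
z_alpha = NormalDist().inv_cdf(alpha)
approx = mean + z_alpha * math.sqrt(var / n_visits)

print(mc_quantile, approx)
```

With $\alpha<1/2$ we have $z_{\alpha}<0$, so both numbers sit below the plug-in value $\widetilde{P}_{N}^{\top}v$, which is the pessimism effect discussed in the main text.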

###### Lemma 8 \(Martingale CLT for empirical transition frequencies\)\.

Under Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1),

$$\left(\sqrt{N}\bigl(\widetilde{P}_{N}(s,\pi(s))-P^{c}_{s,\pi(s)}\bigr)^{\top}V^{\pi}\right)_{s\in\mathcal{S}}\Rightarrow\mathcal{N}\left(0,\,\operatorname{diag}\bigl(\sigma_{\pi}^{2}(s)\bigr)_{s\in\mathcal{S}}\right),$$

where

$$\sigma_{\pi}^{2}(s)=\frac{1}{\bar{n}_{s}}(V^{\pi})^{\top}\Sigma(P^{c}_{s,\pi(s)})V^{\pi}.$$

Moreover, for each $s\in\mathcal{S}$, $\|\widetilde{P}_{N}(s,\pi(s))-P^{c}_{s,\pi(s)}\|_{\infty}=O_{p}(N(s,\pi(s))^{-1/2})=O_{p}(N^{-1/2})$.

###### Proof\.

For each $s\in\mathcal{S}$, define

$$D_{i+1}^{s}:=\mathbb{I}\{s_{i}=s,a_{i}=\pi(s)\}\left(V^{\pi}(s_{i+1})-(P^{c}_{s,\pi(s)})^{\top}V^{\pi}\right),\qquad i=0,\ldots,N-1.$$

By Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3), $\mathbb{E}_{s_{i+1}\sim P^{c}_{s,\pi(s)}}[D_{i+1}^{s}\mid\mathcal{F}_{i}]=0$. Hence $M_{N}^{s}:=\sum_{i=0}^{N-1}D_{i+1}^{s}$ is a sum of martingale differences. For $s,\tilde{s}\in\mathcal{S}$, the conditional covariance satisfies

$$\sum_{i=0}^{N-1}\mathbb{E}\left[D_{i+1}^{s}D_{i+1}^{\tilde{s}}\mid\mathcal{F}_{i}\right]=\mathbb{I}\{s=\tilde{s}\}\,N(s,\pi(s))\,(V^{\pi})^{\top}\Sigma(P^{c}_{s,\pi(s)})V^{\pi}.$$

Indeed, if $s\neq\tilde{s}$, the two indicators cannot both be one at the same time. Dividing by $N$ and using Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3),

$$\frac{1}{N}\sum_{i=0}^{N-1}\mathbb{E}\left[D_{i+1}^{s}D_{i+1}^{\tilde{s}}\mid\mathcal{F}_{i}\right]\xrightarrow{\mathrm{a.s.}}\mathbb{I}\{s=\tilde{s}\}\,\bar{n}_{s}\,(V^{\pi})^{\top}\Sigma(P^{c}_{s,\pi(s)})V^{\pi}.$$

Since the state space is finite and $V^{\pi}$ is bounded, the increments are uniformly bounded, so the conditional Lindeberg condition holds. The multivariate martingale CLT gives

$$\left(\frac{M_{N}^{s}}{\sqrt{N}}\right)_{s\in\mathcal{S}}\Rightarrow\mathcal{N}\left(0,\,\operatorname{diag}\left(\bar{n}_{s}(V^{\pi})^{\top}\Sigma(P^{c}_{s,\pi(s)})V^{\pi}\right)_{s\in\mathcal{S}}\right).$$

Finally, note that $M_{N}^{s}=N(s,\pi(s))\bigl(\widetilde{P}_{N}(s,\pi(s))-P^{c}_{s,\pi(s)}\bigr)^{\top}V^{\pi}$. Therefore,

$$\sqrt{N}\bigl(\widetilde{P}_{N}(s,\pi(s))-P^{c}_{s,\pi(s)}\bigr)^{\top}V^{\pi}=\frac{N}{N(s,\pi(s))}\frac{M_{N}^{s}}{\sqrt{N}}.$$

Since $N/N(s,\pi(s))\to 1/\bar{n}_{s}$ almost surely, the limiting variance of the $s$-th coordinate is $\bar{n}_{s}^{-2}\cdot\bar{n}_{s}(V^{\pi})^{\top}\Sigma(P^{c}_{s,\pi(s)})V^{\pi}=\sigma_{\pi}^{2}(s)$. Slutsky’s theorem and Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3) thus imply the stated joint convergence.

It remains to show the rate bound. For each $s,x\in\mathcal{S}$, define

$$D_{i+1}^{s,x}:=\mathbb{I}\{s_{i}=s,a_{i}=\pi(s)\}\left(\mathbb{I}\{s_{i+1}=x\}-P^{c}(x\mid s,\pi(s))\right).$$

The same martingale argument gives $\sum_{i=0}^{N-1}D_{i+1}^{s,x}=O_{p}(N^{1/2})$. Since

$$\widetilde{P}_{N}(s,\pi(s))(x)-P^{c}(x\mid s,\pi(s))=\frac{1}{N(s,\pi(s))}\sum_{i=0}^{N-1}D_{i+1}^{s,x},$$

and $N(s,\pi(s))\asymp N$ by Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3), we obtain $\widetilde{P}_{N}(s,\pi(s))(x)-P^{c}(x\mid s,\pi(s))=O_{p}(N^{-1/2})$. The state space is finite, so the same bound holds in sup norm. ∎
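
The quantities in Lemma 8 can be traced on a small simulated chain. The sketch below is a single-trajectory illustration under assumptions of our own: a hypothetical two-state chain `P_TRUE` induced by a fixed policy, an arbitrary vector `V` standing in for $V^{\pi}$, and the empirical visit fraction $N(s,\pi(s))/N$ used in place of $\bar{n}_{s}$. It reports the standardized statistic $\sqrt{N}\,(\widetilde{P}_{N}-P^{c})^{\top}V^{\pi}/\sigma_{\pi}(s)$, which the lemma suggests is approximately standard normal, and the sup-norm estimation error.

```python
import math
import random

# Hypothetical two-state chain induced by a fixed policy pi.
P_TRUE = [[0.9, 0.1], [0.5, 0.5]]   # rows play the role of P^c_{s, pi(s)}
V = [1.0, 3.0]                      # stand-in for V^pi
N = 20000

rng = random.Random(0)
counts = [[0, 0], [0, 0]]
s = 0
for _ in range(N):
    s_next = 0 if rng.random() < P_TRUE[s][0] else 1
    counts[s][s_next] += 1          # transition count N(s, pi(s), s')
    s = s_next

visit_frac = [sum(row) / N for row in counts]            # N(s, pi(s)) / N
p_hat = [[c / sum(row) for c in row] for row in counts]  # empirical frequencies

stats = []
for st in range(2):
    p = P_TRUE[st]
    mean = sum(pi * vi for pi, vi in zip(p, V))
    var = sum(pi * vi * vi for pi, vi in zip(p, V)) - mean ** 2  # V^T Sigma(p) V
    sigma_pi = math.sqrt(var / visit_frac[st])  # visit fraction replaces n_bar_s
    err = sum((ph - pi) * vi for ph, pi, vi in zip(p_hat[st], p, V))
    stats.append(math.sqrt(N) * err / sigma_pi)

sup_err = max(abs(ph - pi)
              for row_hat, row in zip(p_hat, P_TRUE)
              for ph, pi in zip(row_hat, row))
print(visit_frac, stats, sup_err)
```

For this chain the stationary distribution is $(5/6,\,1/6)$, so the first visit fraction should settle near $0.83$. Checking the Gaussian limit itself would require repeating the simulation over many independent trajectories.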

Next, we are ready to prove Theorem [1](https://arxiv.org/html/2605.24345#Thmtheorem1).

###### Proof of Theorem [1](https://arxiv.org/html/2605.24345#Thmtheorem1)\.

Let $\Delta_{N}:=V_{\phi,\alpha}^{\pi}-V^{\pi}$. For each state $s\in\mathcal{S}$, we define the posterior quantile map, for notational simplicity, $q_{N,s}(v):=\rho_{\phi(s,\pi(s))}^{\alpha}\bigl(P^{\top}v\bigr)$.

Recall that $V_{\phi,\alpha}^{\pi}$ and $V^{\pi}$ are the unique fixed points of ([5](https://arxiv.org/html/2605.24345#S2.E5)) and ([3](https://arxiv.org/html/2605.24345#S2.E3)), respectively. Hence we have the two fixed-point equations:

$$V_{\phi,\alpha}^{\pi}(s)=r(s,\pi(s))+\gamma q_{N,s}(V_{\phi,\alpha}^{\pi}),\qquad V^{\pi}(s)=r(s,\pi(s))+\gamma(P^{c}_{s,\pi(s)})^{\top}V^{\pi}.$$

For each $s\in\mathcal{S}$,

$$\begin{aligned}\Delta_{N}(s)&=\gamma\Big(q_{N,s}(V_{\phi,\alpha}^{\pi})-(P^{c}_{s,\pi(s)})^{\top}V^{\pi}\Big)\\&=\gamma\Big(q_{N,s}(V^{\pi})-(P^{c}_{s,\pi(s)})^{\top}V^{\pi}\Big)+\gamma(P^{c}_{s,\pi(s)})^{\top}\Delta_{N}\\&\quad+\gamma\Big(q_{N,s}(V_{\phi,\alpha}^{\pi})-q_{N,s}(V^{\pi})-(P^{c}_{s,\pi(s)})^{\top}\Delta_{N}\Big).\end{aligned}$$

Therefore, componentwise,

$$\bigl((I-\gamma P_{\pi}^{c})\Delta_{N}\bigr)(s)=\underbrace{\gamma\Big(q_{N,s}(V^{\pi})-(P^{c}_{s,\pi(s)})^{\top}V^{\pi}\Big)}_{B_{N}(s)}+\underbrace{\gamma\Big(q_{N,s}(V_{\phi,\alpha}^{\pi})-q_{N,s}(V^{\pi})-(P^{c}_{s,\pi(s)})^{\top}\Delta_{N}\Big)}_{R_{N}(s)}.\tag{EC.2}$$
We next identify the weak limit of $B_{N}$. Fix $s\in\mathcal{S}$ and $a=\pi(s)$. Define $\widetilde{\sigma}_{\pi}^{2}(s):=(V^{\pi})^{\top}\Sigma(P^{c}_{s,a})V^{\pi}$. Lemma [7](https://arxiv.org/html/2605.24345#Thmlemma7) with $v_{N}\equiv V^{\pi}$ yields

$$q_{N,s}(V^{\pi})=\widetilde{P}_{N}(s,a)^{\top}V^{\pi}+\frac{z_{\alpha}}{\sqrt{N(s,a)}}\,\widetilde{\sigma}_{\pi}(s)+o_{p}\bigl(N(s,a)^{-1/2}\bigr).\tag{EC.3}$$

Combining ([EC.3](https://arxiv.org/html/2605.24345#A2.E3)) with Assumption [1](https://arxiv.org/html/2605.24345#Thmassumption1)[(i)](https://arxiv.org/html/2605.24345#S2.Ex3),

$$\sqrt{N}\Big(q_{N,s}(V^{\pi})-(P^{c}_{s,\pi(s)})^{\top}V^{\pi}\Big)=\sqrt{N}\bigl(\widetilde{P}_{N}(s,\pi(s))-P^{c}_{s,\pi(s)}\bigr)^{\top}V^{\pi}+z_{\alpha}\sqrt{\frac{N}{N(s,\pi(s))}}\,\widetilde{\sigma}_{\pi}(s)+o_{p}(1).$$

By Lemma [8](https://arxiv.org/html/2605.24345#Thmlemma8), the first term on the right-hand side converges jointly to a centered normal vector. Since $N(s,\pi(s))/N\xrightarrow{\mathrm{a.s.}}\bar{n}_{s}$, Slutsky’s theorem gives

$$\left(\sqrt{N}\Big(q_{N,s}(V^{\pi})-(P^{c}_{s,\pi(s)})^{\top}V^{\pi}\Big)\right)_{s\in\mathcal{S}}\Rightarrow\mathcal{N}\left(\lambda_{\pi},\,\operatorname{diag}\bigl(\sigma_{\pi}^{2}(s)\bigr)_{s\in\mathcal{S}}\right),$$

where

$$\sigma_{\pi}^{2}(s)=\frac{(\widetilde{\sigma}_{\pi}(s))^{2}}{\bar{n}_{s}}=\frac{1}{\bar{n}_{s}}(V^{\pi})^{\top}\Big(\operatorname{diag}(P^{c}_{s,\pi(s)})-P^{c}_{s,\pi(s)}(P^{c}_{s,\pi(s)})^{\top}\Big)V^{\pi},\qquad\lambda_{\pi}(s):=z_{\alpha}\sigma_{\pi}(s).$$

Therefore,

$$\sqrt{N}\,B_{N}\Rightarrow\mathcal{N}\Bigl(\gamma\lambda_{\pi},\operatorname{diag}\bigl((\gamma\sigma_{\pi})^{2}\bigr)\Bigr).\tag{EC.4}$$
We now derive the rate of $\Delta_{N}$ itself\. From the fixed\-point equations \([5](https://arxiv.org/html/2605.24345#S2.E5)\) and \([3](https://arxiv.org/html/2605.24345#S2.E3)\),

$$
\Delta_{N}(s)=\gamma\Big(q_{N,s}(V_{\phi,\alpha}^{\pi})-q_{N,s}(V^{\pi})\Big)+B_{N}(s).
$$

Taking sup norms and using Lemma [4](https://arxiv.org/html/2605.24345#Thmlemma4)\(iv\), $\|\Delta_{N}\|_{\infty}\leq\gamma\|\Delta_{N}\|_{\infty}+\|B_{N}\|_{\infty}$\. Hence $\|\Delta_{N}\|_{\infty}\leq\frac{1}{1-\gamma}\|B_{N}\|_{\infty}$\. Since \([EC\.4](https://arxiv.org/html/2605.24345#A2.E4)\) implies that $\sqrt{N}\,B_{N}$ is tight,

$$
\|\Delta_{N}\|_{\infty}=O_{p}(N^{-1/2}).\tag{EC.5}
$$
It remains to show that the remainder $R_{N}$ is negligible\. Fix again $s\in\mathcal{S}$ and set $a=\pi(s)$\. Define, for any $v\in\mathbb{R}^{S}$, $\sigma^{2}(s,a;v):=v^{\top}\Sigma(P^{c}_{s,a})v$\. By \([EC\.5](https://arxiv.org/html/2605.24345#A2.E5)\), $V^{\pi}_{\phi,\alpha}\xrightarrow{p}V^{\pi}$, so the convergence condition in Lemma [7](https://arxiv.org/html/2605.24345#Thmlemma7) is satisfied\. Applying Lemma [7](https://arxiv.org/html/2605.24345#Thmlemma7) twice, first with $v_{N}=V_{\phi,\alpha}^{\pi}$, then with $v_{N}\equiv V^{\pi}$, we have

$$
\begin{aligned}
q_{N,s}(V_{\phi,\alpha}^{\pi})-q_{N,s}(V^{\pi})&=\widetilde{P}_{N}(s,a)^{\top}\Delta_{N}\\
&\quad+\frac{z_{\alpha}}{\sqrt{N(s,a)}}\Big(\sigma(s,a;V_{\phi,\alpha}^{\pi})-\sigma(s,a;V^{\pi})\Big)+o_{p}\bigl(N(s,a)^{-1/2}\bigr).
\end{aligned}\tag{EC.6}
$$

Substituting \([EC\.6](https://arxiv.org/html/2605.24345#A2.E6)\) into \([EC\.2](https://arxiv.org/html/2605.24345#A2.E2)\),

$$
\begin{aligned}
R_{N}(s)&=\gamma\bigl(\widetilde{P}_{N}(s,a)-P^{c}_{s,a}\bigr)^{\top}\Delta_{N}\\
&\quad+\frac{\gamma z_{\alpha}}{\sqrt{N(s,a)}}\Big(\sigma(s,a;V_{\phi,\alpha}^{\pi})-\sigma(s,a;V^{\pi})\Big)+o_{p}\bigl(N(s,a)^{-1/2}\bigr).
\end{aligned}\tag{EC.7}
$$
We estimate the middle term in \([EC\.7](https://arxiv.org/html/2605.24345#A2.E7)\)\. Let $A^{c}(s,a):=\Sigma(P^{c}_{s,a})$\. Since $A^{c}(s,a)$ is positive semidefinite, $\sigma(s,a;v)=\|A^{c}(s,a)^{1/2}v\|_{2}$\. Hence,

$$
\begin{aligned}
\Big|\sigma(s,a;V_{\phi,\alpha}^{\pi})-\sigma(s,a;V^{\pi})\Big|&=\Big|\|A^{c}(s,a)^{1/2}V_{\phi,\alpha}^{\pi}\|_{2}-\|A^{c}(s,a)^{1/2}V^{\pi}\|_{2}\Big|\\
&\leq\|A^{c}(s,a)^{1/2}(V_{\phi,\alpha}^{\pi}-V^{\pi})\|_{2}\leq\|A^{c}(s,a)^{1/2}\|_{\mathrm{op}}\|\Delta_{N}\|_{2}.
\end{aligned}
$$

Since $A^{c}(s,a)\preceq\operatorname{diag}(P^{c}_{s,a})\preceq I$, $\|A^{c}(s,a)\|^{1/2}_{\mathrm{op}}\leq 1$\. Therefore

$$
\Big|\sigma(s,a;V_{\phi,\alpha}^{\pi})-\sigma(s,a;V^{\pi})\Big|\leq\|A^{c}(s,a)\|^{1/2}_{\mathrm{op}}\|\Delta_{N}\|_{2}\leq\sqrt{S}\,\|\Delta_{N}\|_{\infty}=O_{p}(N^{-1/2}).
$$

After division by $\sqrt{N(s,a)}$, the middle term in \([EC\.7](https://arxiv.org/html/2605.24345#A2.E7)\) becomes $O_{p}(N^{-1})$\.

For the first term in \([EC\.7](https://arxiv.org/html/2605.24345#A2.E7)\), Lemma [8](https://arxiv.org/html/2605.24345#Thmlemma8) gives

$$
\|\widetilde{P}_{N}(s,a)-P^{c}_{s,a}\|_{\infty}=O_{p}(N(s,a)^{-1/2})=O_{p}(N^{-1/2}),
$$

while \([EC\.5](https://arxiv.org/html/2605.24345#A2.E5)\) gives $\|\Delta_{N}\|_{\infty}=O_{p}(N^{-1/2})$\. Hence

$$
\bigl(\widetilde{P}_{N}(s,a)-P^{c}_{s,a}\bigr)^{\top}\Delta_{N}=O_{p}(N^{-1}).
$$

Since also $o_{p}\bigl(N(s,a)^{-1/2}\bigr)=o_{p}(N^{-1/2})$, we conclude from \([EC\.7](https://arxiv.org/html/2605.24345#A2.E7)\) that $R_{N}(s)=o_{p}(N^{-1/2})$ for all $s\in\mathcal{S}$\. Because $S<\infty$,

$$
\sqrt{N}\,R_{N}\xrightarrow{p}0.\tag{EC.8}
$$
Finally, multiply \([EC\.2](https://arxiv.org/html/2605.24345#A2.E2)\) by $\sqrt{N}$:

$$
\sqrt{N}\,(I-\gamma P_{\pi}^{c})\Delta_{N}=\sqrt{N}\,B_{N}+\sqrt{N}\,R_{N}.
$$

By \([EC\.4](https://arxiv.org/html/2605.24345#A2.E4)\), \([EC\.8](https://arxiv.org/html/2605.24345#A2.E8)\), and Slutsky’s theorem,

$$
\sqrt{N}\,(I-\gamma P_{\pi}^{c})\bigl(V_{\phi,\alpha}^{\pi}-V^{\pi}\bigr)\Rightarrow\mathcal{N}\!\Bigl(\gamma\lambda_{\pi},\operatorname{diag}\bigl((\gamma\sigma_{\pi})^{2}\bigr)\Bigr).
$$

Since $P_{\pi}^{c}$ is row\-stochastic, $\|\gamma P_{\pi}^{c}\|_{\infty}\leq\gamma<1$, so $(I-\gamma P_{\pi}^{c})^{-1}$ exists and is bounded\. Applying the continuous mapping theorem to the linear map $(I-\gamma P_{\pi}^{c})^{-1}$ yields

$$
\sqrt{N}\,\bigl(V_{\phi,\alpha}^{\pi}-V^{\pi}\bigr)\Rightarrow\mathcal{N}\!\Bigl((I-\gamma P_{\pi}^{c})^{-1}\gamma\lambda_{\pi},\,(I-\gamma P_{\pi}^{c})^{-1}\operatorname{diag}\bigl((\gamma\sigma_{\pi})^{2}\bigr)(I-\gamma P_{\pi}^{c})^{-\top}\Bigr).
$$

This completes the proof\. ∎
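The limiting mean and covariance above can be evaluated numerically for a small chain. The following is a minimal sketch, not the authors' code: the function name `asymptotic_moments` and its arguments are hypothetical, `P_pi` stands for the policy-induced transition matrix $P^{c}_{\pi}$, `n_bar` for the limiting visit fractions $\bar{n}_{s}$, and `z_alpha` for the standard normal quantile $z_{\alpha}$.

```python
import numpy as np

def asymptotic_moments(P_pi, V, n_bar, gamma, z_alpha):
    """Mean and covariance of the limiting normal law of sqrt(N) * (V_quantile - V).

    P_pi  : (S, S) row-stochastic matrix, row s is P^c_{s, pi(s)}
    V     : (S,) value function V^pi in the true environment
    n_bar : (S,) limiting visit fractions of (s, pi(s)), assumed positive
    """
    S = P_pi.shape[0]
    # V^T (diag(p) - p p^T) V is the variance of V(s') under s' ~ p.
    var = (P_pi @ V**2 - (P_pi @ V) ** 2) / n_bar
    sigma = np.sqrt(np.maximum(var, 0.0))  # guard against tiny negative round-off
    lam = z_alpha * sigma
    M = np.linalg.inv(np.eye(S) - gamma * P_pi)  # exists because gamma < 1
    mean = M @ (gamma * lam)
    cov = M @ np.diag((gamma * sigma) ** 2) @ M.T
    return mean, cov
```

Because $(I-\gamma P_{\pi}^{c})^{-1}=\sum_{j\geq 0}\gamma^{j}(P_{\pi}^{c})^{j}$ has nonnegative entries, a negative `z_alpha` (lower-tail quantile) gives a componentwise nonpositive mean in this sketch, which matches the pessimism interpretation stated in the paper.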

## Appendix EC\.3 Proof of Proposition [1](https://arxiv.org/html/2605.24345#Thmproposition1)

###### Proof\.

Fix a policy $\pi$\. By the definition of the left $\bar{\alpha}$\-quantile, for each $s\in\mathcal{S}$,

$$
\mathbb{P}_{\phi}\Bigl(P_{s,\pi(s)}^{\top}V^{\pi}_{\phi,\bar{\alpha}}\geq\rho^{\bar{\alpha}}_{\phi(s,\pi(s))}\!\bigl(P^{\top}V^{\pi}_{\phi,\bar{\alpha}}\bigr)\Bigr)\geq 1-\bar{\alpha}.
$$

Using the Bellman equation for $V^{\pi}_{\phi,\bar{\alpha}}$, this implies

$$
\mathbb{P}_{\phi}\Bigl(r(s,\pi(s))+\gamma\,P_{s,\pi(s)}^{\top}V^{\pi}_{\phi,\bar{\alpha}}\geq V^{\pi}_{\phi,\bar{\alpha}}(s)\Bigr)\geq 1-\bar{\alpha}.
$$

Because the posterior rows $\left(P_{s,\pi(s)}\right)_{s\in\mathcal{S}}$ are independent across states under $\mathbb{P}_{\phi}$, the above events are independent\. Hence,

$$
\mathbb{P}_{\phi}\Bigl(r(s,\pi(s))+\gamma\,P_{s,\pi(s)}^{\top}V^{\pi}_{\phi,\bar{\alpha}}\geq V^{\pi}_{\phi,\bar{\alpha}}(s),\quad\forall s\in\mathcal{S}\Bigr)\geq(1-\bar{\alpha})^{S}=1-\alpha.
$$

On this event,

$$
V^{\pi}_{\phi,\bar{\alpha}}(s)\leq r(s,\pi(s))+\gamma\,P_{s,\pi(s)}^{\top}V^{\pi}_{\phi,\bar{\alpha}},\qquad\forall s\in\mathcal{S}.
$$

Let $\mathcal{T}^{\pi}_{P}$ denote the standard Bellman operator under policy $\pi$ and realized transition matrix $P$:

$$
(\mathcal{T}^{\pi}_{P}V)(s):=r(s,\pi(s))+\gamma\,P_{s,\pi(s)}^{\top}V,\qquad\forall s\in\mathcal{S}.
$$

The preceding event implies $V^{\pi}_{\phi,\bar{\alpha}}\leq\mathcal{T}^{\pi}_{P}V^{\pi}_{\phi,\bar{\alpha}}$ componentwise\. By monotonicity of $\mathcal{T}^{\pi}_{P}$, $V^{\pi}_{\phi,\bar{\alpha}}\leq(\mathcal{T}^{\pi}_{P})^{m}V^{\pi}_{\phi,\bar{\alpha}}$ for all $m\geq 1$\. Since $\mathcal{T}^{\pi}_{P}$ is a $\gamma$\-contraction under $\|\cdot\|_{\infty}$, its iterates converge to the unique fixed point $V^{\pi}_{P}$\. Letting $m\to\infty$ yields $V^{\pi}_{\phi,\bar{\alpha}}\leq V^{\pi}_{P}$ componentwise on the above event\. Then the claim follows immediately from the definition of $V^{\pi,\mathrm{q}}_{\phi,\alpha}(s)$ in \([12](https://arxiv.org/html/2605.24345#S2.E12)\)\. ∎
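The identity $(1-\bar{\alpha})^{S}=1-\alpha$ used in the proof fixes the per-state level $\bar{\alpha}$ once the joint level $\alpha$ and the number of states $S$ are given. A minimal sketch of that conversion follows; the function name `per_state_level` is hypothetical, and the code assumes $\bar{\alpha}$ is defined exactly by this identity.

```python
def per_state_level(alpha: float, num_states: int) -> float:
    """Per-state quantile level alpha_bar solving (1 - alpha_bar)**S = 1 - alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    if num_states < 1:
        raise ValueError("num_states must be a positive integer")
    return 1.0 - (1.0 - alpha) ** (1.0 / num_states)
```

For $S>1$ the per-state level is smaller than $\alpha$, so each state is evaluated more conservatively in order for the bound to hold jointly over all states.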

## Appendix EC\.4 Evolving Robustness–Exploration Trade\-off in Online RL: Illustrative Examples

In online RL, the balance between robustness and exploration evolves over the course of learning\. When the posterior distribution is still dispersed and epistemic uncertainty is relatively large across all state–action pairs, robustness induced by lower\-tail evaluation \(lower quantile level\) can be beneficial\. As more data are collected, the remaining epistemic uncertainty about $P^{c}_{s,a}$ becomes concentrated in less\-visited state–action pairs, which remain critical for learning the optimal policy in the true environment\. In this regime, a higher quantile level encourages exploration of these state–action pairs that have higher epistemic uncertainty\. The examples in this subsection separately illustrate these two mechanisms and show why a fixed lower\-tail rule, i\.e\., choosing actions according to the optimal policy of the $\alpha$\-quantile BR\-MDP with a fixed $\alpha$, fails to learn the optimal policy\.

We begin with a regime in which the balance tilts toward robustness\. For any transition kernel $P$, let $V^{*}_{P}$ denote the optimal value function under $P$\. Given a posterior distribution $\varphi$ over transition kernels, let $\bar{P}_{\varphi}:=\mathbb{E}_{P\sim\varphi}[P]$ denote the posterior\-mean kernel\. We denote by $\pi^{*}_{\varphi,\alpha}$ an optimal risk\-aware policy of the $\alpha$\-quantile BR\-MDP under the posterior $\varphi$, and by $\pi^{*}_{\bar{P}_{\varphi}}$ an optimal risk\-neutral policy under the posterior\-mean kernel $\bar{P}_{\varphi}$\. The latter is obtained by replacing $P^{c}$ with $\bar{P}_{\varphi}$ in \([4](https://arxiv.org/html/2605.24345#S2.E4)\) and solving the resulting Bellman optimality equation\. To make precise when robustness is needed in this regime, we introduce the notion of posterior downside exposure in the following definition\.

###### Definition 1 \(Posterior downside exposure\)\.

Fix a posterior distribution $\varphi$ over transition kernels, an initial state $s_{0}$, a policy $\pi$, and a threshold $\Lambda>0$\. Define the set of transition kernels under which the regret of policy $\pi$ is at least the threshold $\Lambda$ as follows:

$$
\mathcal{D}(\pi,\Lambda):=\Bigl\{P:\;V^{*}_{P}(s_{0})-V^{\pi}_{P}(s_{0})\geq\Lambda\Bigr\}.
$$

We say that $\pi$ has $(\beta,\Lambda)$\-*posterior downside exposure* under $\varphi$ if $\varphi\!\left(\mathcal{D}(\pi,\Lambda)\right)\geq\beta$\.

Definition [1](https://arxiv.org/html/2605.24345#Thmdefinition1) measures how much posterior mass is placed on transition kernels under which the regret of policy $\pi$ is at least $\Lambda$ from the initial state $s_{0}$\.

###### Example 1 \(A posterior downside\-exposure example\)\.

Fix a discount factor $\gamma\in(0,1)$, a constant $c\in(0,\gamma)$, and a loss level $L>0$\. Consider the discounted MDP in Figure [EC\.1](https://arxiv.org/html/2605.24345#A4.F1), where the reward is known to the agent but the transition kernel is unknown\. The current posterior is supported on two kernels,

$$
\varphi=\mu\,\delta_{P^{G}}+(1-\mu)\,\delta_{P^{B}},\qquad\mu\in(0,1),
$$

where $\delta_{x}$ denotes the Dirac measure at $x$, and $P^{G}$ and $P^{B}$ differ only in the transition following the risky action $a_{R}$ at the initial state $s_{0}$, as shown in the figure\.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/posterior_downside_exposure_example.png)

Figure EC\.1: Schematic of Example [1](https://arxiv.org/html/2605.24345#Thmexample1)\. From the initial state $s_{0}$, the safe action $a_{S}$ yields reward $c$ and returns to $s_{0}$\. The risky action $a_{R}$ yields reward $0$\. Under $P^{G}$, it moves to the absorbing state $g$, which yields reward $1$ at every subsequent step; under $P^{B}$, it moves to the absorbing state $b$, which yields reward $-L$ at every subsequent step\.

###### Proposition 2\.

In Example [1](https://arxiv.org/html/2605.24345#Thmexample1), if $\frac{\gamma(1-L)}{2}<c<\gamma(\mu-(1-\mu)L)$, $\mu>1/2$, and $\alpha\leq 1-\mu$, then $\pi^{*}_{\bar{P}_{\varphi}}(s_{0})=a_{R}$ and $\pi^{*}_{\varphi,\alpha}(s_{0})=a_{S}$\. Moreover, for every $\Lambda$ satisfying $\frac{\gamma-c}{1-\gamma}<\Lambda\leq\frac{c+\gamma L}{1-\gamma}$, the optimal policy under the posterior\-mean kernel $\pi^{*}_{\bar{P}_{\varphi}}$ has $(1-\mu,\Lambda)$\-posterior downside exposure, whereas the regret of $\pi^{*}_{\varphi,\alpha}$ is below the threshold $\Lambda$ with posterior probability one, i\.e\.,

$$
\varphi\!\left(\mathcal{D}(\pi^{*}_{\bar{P}_{\varphi}},\Lambda)\right)=1-\mu,\qquad\varphi\!\left(\mathcal{D}(\pi^{*}_{\varphi,\alpha},\Lambda)\right)=0,
$$

where $\mathcal{D}(\cdot,\Lambda)$ is the set introduced in Definition [1](https://arxiv.org/html/2605.24345#Thmdefinition1)\.

The proof of Proposition [2](https://arxiv.org/html/2605.24345#Thmproposition2) is deferred to Appendix [EC\.4\.1](https://arxiv.org/html/2605.24345#A4.SS1)\. The proposition highlights a failure mode of posterior\-mean planning under model uncertainty\. Since the posterior\-mean kernel averages over the possible transition models, it assigns a higher posterior\-mean value to $a_{R}$ than to $a_{S}$\. However, this averaged comparison hides the fact that, with posterior probability $1-\mu$, the selected action $a_{R}$ leads to the bad absorbing state $b$ and receives reward $-L$ thereafter\. Under those posterior models, the posterior\-mean policy suffers regret above the threshold $\Lambda$ specified in Proposition [2](https://arxiv.org/html/2605.24345#Thmproposition2)\. In contrast, when $\alpha\leq 1-\mu$, the $\alpha$\-quantile BR\-MDP is sensitive to this lower\-tail posterior outcome: it penalizes $a_{R}$ for its performance on the models where $a_{R}$ leads to $b$, and therefore selects the safe action $a_{S}$\.
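The contrast between the two planners in Example 1 can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `example1_actions` and the action labels are hypothetical, and it compares the stationary choices at $s_{0}$ (loop on $a_{S}$ forever, or take $a_{R}$ once), which is sufficient in this two-action example.

```python
def example1_actions(gamma, c, L, mu, alpha):
    """Actions chosen at s0 by the posterior-mean planner and by the alpha-quantile BR-MDP."""
    safe = c / (1 - gamma)                # value of repeating a_S forever
    risky_good = gamma / (1 - gamma)      # value of a_R under P^G (reward 1 from the next step on)
    risky_bad = -gamma * L / (1 - gamma)  # value of a_R under P^B (reward -L from the next step on)

    # Posterior-mean planning averages the two outcomes.
    risky_mean = mu * risky_good + (1 - mu) * risky_bad
    # Left alpha-quantile of a two-point law: the bad outcome carries mass 1 - mu.
    risky_quantile = risky_bad if alpha <= 1 - mu else risky_good

    mean_action = "a_R" if risky_mean > safe else "a_S"
    quantile_action = "a_R" if risky_quantile > safe else "a_S"
    return mean_action, quantile_action
```

With, for instance, $\gamma=0.9$, $c=0.2$, $L=1$, $\mu=0.7$, and $\alpha=0.2$, the conditions of Proposition 2 hold, and the sketch returns `("a_R", "a_S")`: the posterior-mean planner takes the risky action while the lower-tail rule stays safe.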

We next consider a regime in which the balance shifts toward exploration\. An action can matter because after observing the transition it generates, the posterior distribution can become substantially more concentrated, or even collapse to a point mass at the true transition kernel\. This becomes important once the main issue is no longer the large posterior downside exposure defined in Definition [1](https://arxiv.org/html/2605.24345#Thmdefinition1) but whether taking the action substantially reduces the remaining epistemic uncertainty that matters for subsequent decisions\. The next example illustrates this situation by contrasting an explorative action, whose transition reveals the kernel, with a safe action that leaves the posterior unchanged\. Let $\varphi^{(s,a,s^{\prime})}$ denote the updated posterior distribution after observing $(s,a,s^{\prime})$\.

###### Example 2 \(An informative exploration example\)\.

Fix a discount factor $\gamma\in(0,1)$ and a constant $c\in(0,\gamma)$\. Consider the discounted MDP in Figure [EC\.2](https://arxiv.org/html/2605.24345#A4.F2), where the reward is known to the agent but the transition kernel is unknown\. The current posterior is supported on two kernels,

$$
\varphi=\mu\,\delta_{P^{G}}+(1-\mu)\,\delta_{P^{B}},\qquad\mu\in(0,1),
$$

where $P^{G}$ and $P^{B}$ differ only along the branch reached after taking the explorative action shown in the figure\.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/posterior_informative_probing_example.png)

Figure EC\.2: Schematic of Example [2](https://arxiv.org/html/2605.24345#Thmexample2)\. From the initial state $s_{0}$, the safe action $a_{S}$ yields reward $c$ and returns to $s_{0}$, whereas the exploratory action $a_{E}$ yields reward $0$ and moves to $y_{G}$ under $P^{G}$ and to $y_{B}$ under $P^{B}$\. At each diagnostic state $y\in\{y_{G},y_{B}\}$, action $a_{S}$ yields reward $c$ and stays at $y$, whereas action $a_{X}$ yields reward $0$ and moves to the absorbing state $g$ under $P^{G}$ and to the absorbing state $b$ under $P^{B}$\. State $g$ yields reward $1$ at every subsequent step, and state $b$ yields reward $0$ at every subsequent step\. All other transitions are identical under $P^{G}$ and $P^{B}$\. Hence, the transition observed after $(s_{0},a_{E})$ reveals the kernel, whereas $(s_{0},a_{S})$ reveals no new information\.

This example isolates a situation in which an action is valuable because of the information it reveals: after taking $a_{E}$ at $s_{0}$, the next\-state observation reveals the kernel, so $\varphi^{(s_{0},a_{E},y_{G})}=\delta_{P^{G}}$ and $\varphi^{(s_{0},a_{E},y_{B})}=\delta_{P^{B}}$\. The optimal subsequent action can then be selected according to the updated posterior: take $a_{X}$ at $y_{G}$ and $a_{S}$ at $y_{B}$\. Therefore, taking $a_{E}$ at $s_{0}$ produces an observation that fully resolves the remaining uncertainty relevant to subsequent decisions\.
By contrast, taking the safe action $a_{S}$ at $s_{0}$ does not change the posterior: $P^{G}(s_{0}\mid s_{0},a_{S})=P^{B}(s_{0}\mid s_{0},a_{S})=1$, so observing $(s_{0},a_{S},s_{0})$ cannot provide additional information for distinguishing $P^{G}$ from $P^{B}$, and hence $\varphi^{(s_{0},a_{S},s_{0})}=\varphi$\.

The next proposition shows that a fixed lower\-tail rule can nevertheless avoid this informative action indefinitely\. Once the agent chooses $a_{S}$, the posterior never changes, so the same decision rule continues to choose $a_{S}$ at every subsequent visit to $s_{0}$\. If the true kernel is $P^{G}$, this yields linear regret\.

###### Proposition 3\.

In Example [2](https://arxiv.org/html/2605.24345#Thmexample2), assume that $\alpha\leq 1-\mu$ and $c<\gamma^{2}$\. Suppose that after each realized transition the agent updates its posterior and then acts according to an optimal policy of the $\alpha$\-quantile BR\-MDP, $\pi_{t}:=\pi^{*}_{\varphi_{t},\alpha}$ at time step $t$\. If the initial state is $s_{0}$ and the prior distribution is $\varphi_{0}=\varphi$, then

$$
\varphi_{t}=\varphi,\qquad a_{t}=\pi_{t}(s_{t})=a_{S}\qquad\text{and}\qquad s_{t+1}=s_{0}\qquad\forall t\geq 0.
$$

If the true kernel is $P^{G}$, then under the cumulative regret criterion $R_{T}:=\sum_{t=0}^{T-1}\Bigl(V^{*}_{P^{G}}(s_{t})-V^{\pi_{t}}_{P^{G}}(s_{t})\Bigr)$, one has $R_{T}=T\,\frac{\gamma^{2}-c}{1-\gamma}$\.

The proof of Proposition [3](https://arxiv.org/html/2605.24345#Thmproposition3) is deferred to Appendix [EC\.4\.2](https://arxiv.org/html/2605.24345#A4.SS2)\. It shows how a fixed lower\-tail rule can create what we call a self\-confirming trap: under the current posterior, the optimal policy of the $\alpha$\-quantile BR\-MDP is suboptimal in the true environment, yet following this policy generates no new information to update the posterior\. As a result, re\-solving the same $\alpha$\-quantile BR\-MDP under the unchanged posterior and fixed quantile level $\alpha$ selects the same action again\. When the true kernel is $P^{G}$, the fixed lower\-tail rule continues to choose $a_{S}$ and the agent therefore never discovers the higher future value available after probing, which leads to linear regret\.
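The self\-confirming trap can be traced step by step. The sketch below is a hypothetical illustration, not the paper's algorithm: `bayes_update` and `fixed_quantile_regret` are made\-up names, the true kernel is assumed to be $P^{G}$, and the loop only tracks regret while the agent remains at $s_{0}$ (it stops as soon as $a_{E}$ would be chosen).

```python
def bayes_update(mu, lik_good, lik_bad):
    """Posterior mass on P^G after an observation with the given likelihoods."""
    return mu * lik_good / (mu * lik_good + (1 - mu) * lik_bad)

def fixed_quantile_regret(gamma, c, mu, alpha, horizon):
    """Regret accumulated at s0 by the fixed alpha-quantile rule when the truth is P^G."""
    regret, actions = 0.0, []
    for _ in range(horizon):
        safe_value = c / (1 - gamma)
        # At a diagnostic state, a_X leads to value 1/(1-gamma) w.p. mu and to 0 w.p. 1-mu;
        # the left alpha-quantile is 0 whenever alpha <= 1 - mu.
        probe_tail = 0.0 if alpha <= 1 - mu else 1 / (1 - gamma)
        diagnostic_value = max(safe_value, gamma * probe_tail)
        explore_value = gamma * diagnostic_value  # a_E has reward 0 at s0
        if explore_value > safe_value:
            actions.append("a_E")
            break
        actions.append("a_S")
        # (s0, a_S, s0) has likelihood 1 under both kernels, so the posterior is unchanged.
        mu = bayes_update(mu, 1.0, 1.0)
        regret += (gamma**2 - c) / (1 - gamma)
    return regret, actions
```

Under the assumptions of Proposition 3 ($\alpha\leq 1-\mu$ and $c<\gamma^{2}$), every step selects $a_{S}$ and the returned regret grows as $T(\gamma^{2}-c)/(1-\gamma)$. In this toy computation, a quantile level above $1-\mu$ makes the first action $a_{E}$, which is the kind of escape that a rising quantile schedule is meant to allow.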

Taken together, these results show that the robustness induced by the lower\-tail quantile can be desirable when posterior downside exposure is substantial due to high epistemic uncertainty, but that using a fixed lower\-tail rule can later hinder informative exploration and create a self\-confirming trap with linear regret\. The reason is that the trade\-off between robustness and exploration evolves over learning\.

### EC\.4\.1 Proof of Proposition [2](https://arxiv.org/html/2605.24345#Thmproposition2)

###### Proof\.

Let $R_{\bar{P}}:=\frac{\gamma(\mu-(1-\mu)L)}{1-\gamma}$\. Under the posterior\-mean kernel $\bar{P}_{\varphi}$, if action $a_{R}$ is taken at $s_{0}$, the value obtained after the transition is $R_{\bar{P}}=\mu\frac{\gamma}{1-\gamma}+(1-\mu)\frac{-\gamma L}{1-\gamma}$\. The optimal value at $s_{0}$ under $\bar{P}_{\varphi}$ satisfies

$$
V^{*}_{\bar{P}_{\varphi}}(s_{0})=\max\left\{c+\gamma V^{*}_{\bar{P}_{\varphi}}(s_{0}),\,R_{\bar{P}}\right\}.
$$

Since $\gamma(\mu-(1-\mu)L)>c$, we have $R_{\bar{P}}>\frac{c}{1-\gamma}$\. Equivalently, $c+\gamma R_{\bar{P}}<R_{\bar{P}}$\. Hence the fixed point is $V^{*}_{\bar{P}_{\varphi}}(s_{0})=R_{\bar{P}}$, and the unique optimal action at $s_{0}$ is $a_{R}$\. Therefore $\pi^{*}_{\bar{P}_{\varphi}}(s_{0})=a_{R}$\.

Next consider the lower\-tail $\alpha$\-quantile BR\-MDP\. Since $g$ and $b$ are absorbing, $V(g)=\frac{1}{1-\gamma}$ and $V(b)=\frac{-L}{1-\gamma}$\. Under $a_{R}$, the posterior distribution of the next\-state value is $\frac{1}{1-\gamma}$ with probability $\mu$ and $\frac{-L}{1-\gamma}$ with probability $1-\mu$\. Since $\alpha\leq 1-\mu$, the left $\alpha$\-quantile is $\frac{-L}{1-\gamma}$\. Thus the BR\-MDP value of taking $a_{R}$ at $s_{0}$ is $R_{\alpha}:=-\frac{\gamma L}{1-\gamma}$\. The BR\-MDP optimal value at $s_{0}$ therefore satisfies

$$
V^{*}_{\varphi,\alpha}(s_{0})=\max\left\{c+\gamma V^{*}_{\varphi,\alpha}(s_{0}),\,R_{\alpha}\right\}.
$$

Because $c>0$ and $L>0$, $\frac{c}{1-\gamma}>-\frac{\gamma L}{1-\gamma}=R_{\alpha}$\. Hence the fixed point is $V^{*}_{\varphi,\alpha}(s_{0})=\frac{c}{1-\gamma}$, and the optimal action is $a_{S}$\. Therefore $\pi^{*}_{\varphi,\alpha}(s_{0})=a_{S}$\.

We now compute the regret under each realized kernel\. Since $c<\gamma(\mu-(1-\mu)L)\leq\gamma$, under $P^{G}$, taking $a_{R}$ is optimal\. Hence $V^{*}_{P^{G}}(s_{0})=\frac{\gamma}{1-\gamma}$\. The posterior\-mean policy chooses $a_{R}$, so its regret under $P^{G}$ is zero\. The BR\-MDP policy chooses $a_{S}$, so

$$
V^{*}_{P^{G}}(s_{0})-V^{\pi^{*}_{\varphi,\alpha}}_{P^{G}}(s_{0})=\frac{\gamma}{1-\gamma}-\frac{c}{1-\gamma}=\frac{\gamma-c}{1-\gamma}.
$$

Under $P^{B}$, taking $a_{S}$ is optimal because $c>0$ and $L>0$\. Hence $V^{*}_{P^{B}}(s_{0})=\frac{c}{1-\gamma}$\. The BR\-MDP policy chooses $a_{S}$, so its regret under $P^{B}$ is zero\. The posterior\-mean policy chooses $a_{R}$, so

$$
V^{*}_{P^{B}}(s_{0})-V^{\pi^{*}_{\bar{P}_{\varphi}}}_{P^{B}}(s_{0})=\frac{c}{1-\gamma}-\left(-\frac{\gamma L}{1-\gamma}\right)=\frac{c+\gamma L}{1-\gamma}.
$$

Finally, the condition $c>\frac{\gamma(1-L)}{2}$ is equivalent to $\frac{\gamma-c}{1-\gamma}<\frac{c+\gamma L}{1-\gamma}$, so the stated interval for $\Lambda$ is nonempty\. For any $\frac{\gamma-c}{1-\gamma}<\Lambda\leq\frac{c+\gamma L}{1-\gamma}$, the posterior\-mean policy has regret at least $\Lambda$ exactly under $P^{B}$, whose posterior probability is $1-\mu$\. Therefore $\varphi\!\left(\mathcal{D}(\pi^{*}_{\bar{P}_{\varphi}},\Lambda)\right)=1-\mu$\.

On the other hand, the BR\-MDP policy has regret $\frac{\gamma-c}{1-\gamma}<\Lambda$ under $P^{G}$, and regret $0$ under $P^{B}$\. Hence its regret is below $\Lambda$ with posterior probability one, and $\varphi\!\left(\mathcal{D}(\pi^{*}_{\varphi,\alpha},\Lambda)\right)=0$\. ∎
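The posterior downside exposure computed in this proof can be checked numerically for the two\-kernel posterior of Example 1. The following sketch uses a hypothetical helper `downside_exposure`; it assumes the stationary values derived above ($c/(1-\gamma)$ for $a_{S}$, $\gamma/(1-\gamma)$ for $a_{R}$ under $P^{G}$, and $-\gamma L/(1-\gamma)$ for $a_{R}$ under $P^{B}$) and takes the better of the two as the optimal value under each kernel.

```python
def downside_exposure(gamma, c, L, mu, action, threshold):
    """Posterior mass of kernels under which the given action at s0 has regret >= threshold."""
    value = {
        "G": {"a_S": c / (1 - gamma), "a_R": gamma / (1 - gamma)},
        "B": {"a_S": c / (1 - gamma), "a_R": -gamma * L / (1 - gamma)},
    }
    mass = {"G": mu, "B": 1 - mu}
    exposure = 0.0
    for kernel in ("G", "B"):
        # Regret at s0 relative to the best stationary choice under this kernel.
        regret = max(value[kernel].values()) - value[kernel][action]
        if regret >= threshold:
            exposure += mass[kernel]
    return exposure
```

For $\gamma=0.9$, $c=0.2$, $L=1$, $\mu=0.7$, the admissible interval is $7<\Lambda\leq 11$. With $\Lambda=9$ the sketch gives exposure $1-\mu=0.3$ for $a_{R}$ and $0$ for $a_{S}$, in line with the proposition.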

### EC\.4\.2 Proof of Proposition [3](https://arxiv.org/html/2605.24345#Thmproposition3)

###### Proof\.

Under the $\alpha$\-quantile BR\-MDP criterion with $\alpha\leq 1-\mu$, taking $a_{X}$ at either diagnostic state yields next\-state value $1/(1-\gamma)$ with posterior mass $\mu$ and $0$ with posterior mass $1-\mu$\. Therefore its left $\alpha$\-quantile is $0$, whereas repeatedly taking the safe self\-loop yields value $c/(1-\gamma)$\. Hence

$$
V^{*}_{\varphi,\alpha}(y_{G})=V^{*}_{\varphi,\alpha}(y_{B})=\frac{c}{1-\gamma},
$$

so taking $a_{E}$ at $s_{0}$ yields value $\gamma c/(1-\gamma)$\. By contrast, repeatedly taking the safe self\-loop at $s_{0}$ yields value $c/(1-\gamma)$, and delaying $a_{E}$ by $k\geq 1$ safe steps yields

$$
\frac{c(1-\gamma^{k})+\gamma^{k+1}c}{1-\gamma}<\frac{c}{1-\gamma}.
$$

Thus $\pi^{*}_{\varphi,\alpha}(s_{0})=a_{S}$\.

Now suppose that after each realized transition the agent updates its posterior and then chooses $\pi_{t}:=\pi^{*}_{\varphi_{t},\alpha}$\. Taking $a_{S}$ at $s_{0}$ always produces the observation $(s_{0},a_{S},s_{0})$, whose likelihood is identical under $P^{G}$ and $P^{B}$\. Hence Bayes updating leaves the posterior unchanged\. Starting from $S_{0}=s_{0}$ and $\varphi_{0}=\varphi$, an induction gives $\varphi_{t}=\varphi$ and $\pi_{t}(s_{0})=a_{S}$ for all $t\geq 0$; since $a_{S}$ returns to $s_{0}$, we also have $S_{t}=s_{0}$ for all $t$\. In particular, the agent never probes\.

Under $P^{G}$, taking $a_{E}$ at $s_{0}$ and then $a_{X}$ at $y_{G}$ yields value $\gamma^{2}/(1-\gamma)$, whereas repeatedly taking the safe self\-loop yields value $c/(1-\gamma)$\. Since $c<\gamma^{2}$, probing strictly dominates staying safe, and delaying $a_{E}$ by $k\geq 1$ safe steps yields $\frac{c(1-\gamma^{k})+\gamma^{k+2}}{1-\gamma}<\frac{\gamma^{2}}{1-\gamma}$\. Hence $V^{*}_{P^{G}}(s_{0})=\gamma^{2}/(1-\gamma)$\. On the other hand, each $\pi_{t}$ chooses $a_{S}$ at $s_{0}$, so $V^{\pi_{t}}_{P^{G}}(s_{0})=c/(1-\gamma)$\. Because $S_{t}=s_{0}$ for all $t$, each summand in $R_{T}$ equals $(\gamma^{2}-c)/(1-\gamma)$, and therefore $R_{T}=T\,\frac{\gamma^{2}-c}{1-\gamma}$\. ∎

## Appendix EC\.5 Value Iteration for $\alpha_{k}$\-quantile BR\-MDP

At the beginning of pseudo\-episode $k$, after computing the posterior parameter collection $\phi_{k}$ and the adaptive quantile schedule $\alpha_{k}$, we solve the corresponding BR\-MDP by value iteration\.

Algorithm EC\.1 Value Iteration for the $\alpha_{k}$\-quantile BR\-MDP in Pseudo\-Episode $k$

1. Input: posterior parameter collection $\phi_{k}$, quantile schedule $\alpha_{k}$, optimal value function $V_{k-1}$ from pseudo\-episode $k-1$, tolerance $\varepsilon_{\mathrm{VI}}$, maximum iterations $M_{\mathrm{VI}}$
2. Initialize $V^{(0)}\leftarrow V_{k-1}$
3. for $m=0,1,\ldots,M_{\mathrm{VI}}-1$ do
4. for all $s\in\mathcal{S}$ do
5. $V^{(m+1)}(s)\leftarrow\max_{a\in\mathcal{A}}\left\{r(s,a)+\gamma\,\rho^{\alpha_{k}}_{\phi_{k}(s,a)}\!\left(P^{\top}V^{(m)}\right)\right\}$
6. end for
7. if $\|V^{(m+1)}-V^{(m)}\|_{\infty}\leq\varepsilon_{\mathrm{VI}}$ then
8. break
9. end if
10. end for
11. Set $V_{k}\leftarrow V^{(m+1)}$
12. For each $s\in\mathcal{S}$, choose $\pi_{k}(s)\in\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\{r(s,a)+\gamma\,\rho^{\alpha_{k}}_{\phi_{k}(s,a)}\!\left(P^{\top}V_{k}\right)\right\}$
13. return $V_{k}$, $\pi_{k}$
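Algorithm EC\.1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the posterior is taken to be Dirichlet with parameters `phi[s, a]`, the quantile $\rho^{\alpha_{k}}_{\phi_{k}(s,a)}$ is approximated by an empirical quantile over Monte Carlo posterior draws (NumPy's default interpolating quantile, not an exact left quantile), and all function and argument names are hypothetical. The same draws are reused in every sweep so that the approximate backup stays a monotone $\gamma$\-contraction.

```python
import numpy as np

def quantile_backup(V, r, samples, alpha, gamma):
    """Q(s, a) = r(s, a) + gamma * empirical alpha(s, a)-quantile of P_{s,a}^T V."""
    S, A = r.shape
    Q = np.empty((S, A))
    for s in range(S):
        for a in range(A):
            next_values = samples[s, a] @ V  # one value per posterior draw
            Q[s, a] = r[s, a] + gamma * np.quantile(next_values, alpha[s, a])
    return Q

def quantile_value_iteration(phi, r, alpha, gamma, V_init,
                             tol=1e-8, max_iter=1000, n_samples=2000, seed=0):
    """Value iteration for a quantile BR-MDP with Dirichlet posteriors (Monte Carlo sketch).

    phi   : (S, A, S) positive Dirichlet parameters
    r     : (S, A) known rewards
    alpha : (S, A) quantile levels in (0, 1)
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    # samples[s, a] has shape (n_samples, S): posterior draws of the row P_{s,a}.
    samples = np.array([[rng.dirichlet(phi[s, a], size=n_samples)
                         for a in range(A)] for s in range(S)])
    V = np.asarray(V_init, dtype=float)
    for _ in range(max_iter):
        V_next = quantile_backup(V, r, samples, alpha, gamma).max(axis=1)
        converged = np.max(np.abs(V_next - V)) <= tol
        V = V_next
        if converged:
            break
    policy = quantile_backup(V, r, samples, alpha, gamma).argmax(axis=1)
    return V, policy
```

Warm\-starting `V_init` from the previous pseudo\-episode, as in step 2 of the algorithm, only affects the number of sweeps, not the fixed point of the sampled operator.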

## Appendix EC\.6 Proofs of Regret Analysis

For a fixed interaction horizon $T$, the last pseudo\-episode may be truncated\. In the proof, with a slight abuse of notation, let $L_{k}$ denote the full geometric length of pseudo\-episode $k$\. Equivalently, if the last pseudo\-episode is truncated by the fixed horizon, we continue it only for the purpose of the proof under the same policy $\pi_{K_{T}}$\. Since the added regret summands are nonnegative, this no\-truncation convention provides an upper bound on the actual regret\. Thus,

$$
BR(T)\leq\mathbb{E}\!\left[\sum_{k=1}^{K_{T}}\sum_{i=1}^{L_{k}}\Bigl(V^{*}(s_{k,i})-V^{\pi_{k}}(s_{k,i})\Bigr)\right].\tag{EC.9}
$$

It remains to bound the right\-hand side under this convention\.

### EC\.6\.1 Proof of Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1)

###### Proof\.

From the Bayesian perspective,

$$P^{c}_{s,a}\mid\mathcal{F}_{t_k}\sim\mathrm{Dir}(\phi_k(s,a)),\qquad\forall(s,a)\in\mathcal{S}\times\mathcal{A},$$

where $\phi_k(s,a)=(\phi_k(s,a,s'))_{s'\in\mathcal{S}}$ denotes the Dirichlet parameter vector. For each $(s,a)\in\mathcal{S}\times\mathcal{A}$, define the event

$$\mathcal{G}_k(s,a):=\left\{(P^{c}_{s,a})^{\top}V_k\leq\rho^{\alpha_k}_{\phi_k(s,a)}(P^{\top}V_k)\right\}.$$

By the definition of the $\alpha$-quantile,

$$\mathbb{P}\!\left(\mathcal{G}_k(s,a)\mid\mathcal{F}_{t_k}\right)=\mathbb{P}\!\left((P^{c}_{s,a})^{\top}V_k\leq\rho^{\alpha_k}_{\phi_k(s,a)}(P^{\top}V_k)\ \middle|\ \mathcal{F}_{t_k}\right)\geq\alpha_k(s,a).$$

Let $\mathcal{G}_k:=\bigcap_{(s,a)\in\mathcal{S}\times\mathcal{A}}\mathcal{G}_k(s,a)$. Then

$$\begin{aligned}
\mathbb{P}\!\left(\mathcal{G}_k\mid\mathcal{F}_{t_k}\right)
&\geq 1-\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\mathbb{P}\!\left(\mathcal{G}_k^{c}(s,a)\mid\mathcal{F}_{t_k}\right)\\
&\geq 1-\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\bigl(1-\alpha_k(s,a)\bigr)\\
&\geq 1-\delta\frac{\ln(2k)}{\sqrt{k}}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\frac{N^{+}_k(s,a)}{\bar{N}^{+}_k}=1-\frac{\delta SA\ln(2k)}{\sqrt{k}},
\end{aligned}$$

where the first inequality is a union bound, the third inequality follows from ([14](https://arxiv.org/html/2605.24345#S3.E14)), and the last equality uses $\bar{N}^{+}_k=\frac{1}{SA}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}N^{+}_k(s,a)$.
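The union-bound arithmetic above can be checked numerically. The sketch below assumes, purely for illustration, that the schedule holds with equality, $1-\alpha_k(s,a)=\delta\,\frac{N_k^{+}(s,a)}{\bar N_k^{+}}\frac{\ln(2k)}{\sqrt{k}}$, and that $N^{+}=\max\{N,1\}$; the function name and the visit counts are hypothetical and are not taken from the paper.

```python
import math

def optimism_failure_bound(counts, k, delta):
    """Sum of 1 - alpha_k(s,a) over all pairs under the assumed schedule."""
    n_plus = [max(n, 1) for n in counts]        # assumption: N+ = max(N, 1)
    n_bar = sum(n_plus) / len(n_plus)           # average count over the S*A pairs
    scale = delta * math.log(2 * k) / math.sqrt(k)
    per_pair = [scale * n / n_bar for n in n_plus]   # 1 - alpha_k(s,a)
    return sum(per_pair), per_pair

counts = [0, 3, 7, 1, 12, 5]                    # hypothetical visit counts, S*A = 6
delta, k = 0.05, 400
total, per_pair = optimism_failure_bound(counts, k, delta)
closed_form = delta * len(counts) * math.log(2 * k) / math.sqrt(k)
print(total, closed_form)
```

Under this assumed schedule, less-visited pairs receive a smaller $1-\alpha_k(s,a)$, that is, a higher quantile level, while the total failure probability depends only on $SA$, $\delta$, and $k$.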

It remains to show that on the event $\mathcal{G}_k$ we have $V^{*}\leq V_k$ pointwise. Define

$$Q^{*}(s,a):=r(s,a)+\gamma\,(P^{c}_{s,a})^{\top}V^{*},\qquad Q_k(s,a):=r(s,a)+\gamma\,\rho^{\alpha_k}_{\phi_k(s,a)}(P^{\top}V_k),$$

so that $V^{*}(s)=\max_{a\in\mathcal{A}}Q^{*}(s,a)$ and $V_k(s)=\max_{a\in\mathcal{A}}Q_k(s,a)$. Let $\Delta(s):=V^{*}(s)-V_k(s)$ and $\Delta_{\max}:=\max_{x\in\mathcal{S}}\Delta(x)$. For each $s\in\mathcal{S}$, choose $a_s^{*}\in\arg\max_{a\in\mathcal{A}}Q^{*}(s,a)$. Then

$$\begin{aligned}
\Delta(s)&=V^{*}(s)-V_k(s)\leq Q^{*}(s,a_s^{*})-Q_k(s,a_s^{*})\\
&=\gamma\Big((P^{c}_{s,a_s^{*}})^{\top}V^{*}-\rho^{\alpha_k}_{\phi_k(s,a_s^{*})}(P^{\top}V_k)\Big)\\
&=\gamma\Big((P^{c}_{s,a_s^{*}})^{\top}(V^{*}-V_k)+(P^{c}_{s,a_s^{*}})^{\top}V_k-\rho^{\alpha_k}_{\phi_k(s,a_s^{*})}(P^{\top}V_k)\Big)\\
&\leq\gamma\Delta_{\max},
\end{aligned}$$

where the last inequality uses $(P^{c}_{s,a_s^{*}})^{\top}(V^{*}-V_k)\leq\Delta_{\max}$ and, on the event $\mathcal{G}_k$, $(P^{c}_{s,a_s^{*}})^{\top}V_k-\rho^{\alpha_k}_{\phi_k(s,a_s^{*})}(P^{\top}V_k)\leq 0$. Taking the maximum over $s\in\mathcal{S}$ gives $\Delta_{\max}\leq\gamma\Delta_{\max}$, which, since $\gamma\in(0,1)$, implies $\Delta(s)\leq\Delta_{\max}\leq 0$. That is, $V^{*}(s)\leq V_k(s)$ for all $s\in\mathcal{S}$ on the event $\mathcal{G}_k$. Combining this implication with $\mathbb{P}\!\left(\mathcal{G}_k\mid\mathcal{F}_{t_k}\right)\geq 1-\frac{\delta SA\ln(2k)}{\sqrt{k}}$ proves part (i).

For part (ii), let $V^{-}:=V^{*}_{\phi_k,\underline{\alpha}}$. Since $\alpha_k(s,a)\geq\underline{\alpha}$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ and the $\alpha$-quantile is nondecreasing in $\alpha$, we have

$$\big(\mathcal{T}^{*}_{\phi_k,\underline{\alpha}}V\big)(s)\leq\big(\mathcal{T}^{*}_{\phi_k,\alpha_k}V\big)(s),\qquad\forall V,\ \forall s\in\mathcal{S}.$$

Therefore,

$$V^{-}=\mathcal{T}^{*}_{\phi_k,\underline{\alpha}}V^{-}\leq\mathcal{T}^{*}_{\phi_k,\alpha_k}V^{-}\leq(\mathcal{T}^{*}_{\phi_k,\alpha_k})^{n}V^{-},\qquad\forall n\geq 1,$$

where the last inequality follows from the monotonicity of $\mathcal{T}^{*}_{\phi_k,\alpha_k}$. Since $\mathcal{T}^{*}_{\phi_k,\alpha_k}$ is a $\gamma$-contraction, its iterates converge to its unique fixed point $V_k$. Letting $n\to\infty$ yields $V^{-}\leq V_k$, that is, $V^{*}_{\phi_k,\underline{\alpha}}(s)\leq V_k(s)$ for all $s\in\mathcal{S}$. This completes the proof. ∎
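The monotonicity argument in part (ii) can be illustrated on a toy problem. The following is a minimal sketch, not the paper's implementation: it replaces the exact posterior quantile by an empirical quantile over a fixed set of Dirichlet draws (an assumption made here so that the backup stays monotone in both $V$ and $\alpha$), and the MDP, the Dirichlet parameters, and all function names are hypothetical.

```python
import math
import random

def dirichlet_sample(phi, rng):
    """Draw one probability vector from Dir(phi) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in phi]
    total = sum(g)
    return [x / total for x in g]

def left_quantile(values, alpha):
    """Empirical left alpha-quantile: smallest x whose empirical CDF is >= alpha."""
    xs = sorted(values)
    return xs[max(0, math.ceil(alpha * len(xs)) - 1)]

def quantile_value_iteration(rewards, samples, alpha, gamma, iters=150):
    """Iterate V <- max_a { r(s,a) + gamma * quantile_{alpha(s,a)}(P^T V) } from V = 0."""
    S, A = len(rewards), len(rewards[0])
    V = [0.0] * S
    for _ in range(iters):
        V = [max(rewards[s][a] + gamma * left_quantile(
                     [sum(p[j] * V[j] for j in range(S)) for p in samples[s][a]],
                     alpha[s][a])
                 for a in range(A))
             for s in range(S)]
    return V

rng = random.Random(0)
S, A, gamma = 3, 2, 0.9
rewards = [[rng.random() for _ in range(A)] for _ in range(S)]   # rewards in [0, 1]
phi = [[[1 + rng.randrange(5) for _ in range(S)] for _ in range(A)] for _ in range(S)]
# Fixed posterior draws, shared by both runs (common random numbers).
samples = [[[dirichlet_sample(phi[s][a], rng) for _ in range(200)]
            for a in range(A)] for s in range(S)]
alpha_low = [[0.3] * A for _ in range(S)]                        # constant lower level
alpha_k = [[0.3 + 0.6 * rng.random() for _ in range(A)] for _ in range(S)]
V_low = quantile_value_iteration(rewards, samples, alpha_low, gamma)
V_k = quantile_value_iteration(rewards, samples, alpha_k, gamma)
print(V_low, V_k)
```

Because both runs share the same draws and start from the same $V$, the iterates obtained with the pointwise larger quantile levels dominate those obtained with the constant lower level, mirroring $V^{*}_{\phi_k,\underline{\alpha}}\leq V_k$.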

### EC.6.2 Proof of Lemma [2](https://arxiv.org/html/2605.24345#Thmlemma2)

###### Proof\.

Define $\Delta(s):=V_{\phi_k,\alpha_k}^{\pi}(s)-V^{\pi}(s)$ and $\Delta_{k,i}:=\Delta(s_{k,i})$. By the Bellman equations for $V_{\phi_k,\alpha_k}^{\pi}$ and $V^{\pi}$, for each $i\geq 1$,

$$\begin{aligned}
\Delta_{k,i}&=\Bigl[r(s_{k,i},a_{k,i})+\gamma\,\rho_{\phi_k(s_{k,i},a_{k,i})}^{\alpha_k}\!\bigl(P^{\top}V_{\phi_k,\alpha_k}^{\pi}\bigr)\Bigr]-\Bigl[r(s_{k,i},a_{k,i})+\gamma\,(P^{c}_{s_{k,i},a_{k,i}})^{\top}V^{\pi}\Bigr]\\
&=\gamma\,E_{k,i}+\gamma\,(P^{c}_{s_{k,i},a_{k,i}})^{\top}\bigl(V_{\phi_k,\alpha_k}^{\pi}-V^{\pi}\bigr).
\end{aligned}$$

Since

$$\mathbb{E}\!\left[\Delta_{k,i+1}\,\middle|\,s_{k,i},a_{k,i},\mathcal{F}_{t_k},P^{c}\right]=(P^{c}_{s_{k,i},a_{k,i}})^{\top}\Delta,$$

the tower property gives

$$\mathbb{E}\!\left[\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\gamma\,\mathbb{E}\!\left[E_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]+\gamma\,\mathbb{E}\!\left[\Delta_{k,i+1}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right].\tag{EC.10}$$

Because rewards are bounded in $[0,1]$, both $V_{\phi_k,\alpha_k}^{\pi}$ and $V^{\pi}$ are bounded by $(1-\gamma)^{-1}$ in sup norm. Hence $|\Delta_{k,i}|\leq\frac{1}{1-\gamma}$ and $|E_{k,i}|\leq\frac{1}{1-\gamma}$ for all $i\geq 1$. Iterating ([EC.10](https://arxiv.org/html/2605.24345#A6.E10)) for $n\geq 1$ yields

$$\mathbb{E}\!\left[\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{h=0}^{n-1}\gamma^{h+1}\mathbb{E}\!\left[E_{k,i+h}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]+\gamma^{n}\mathbb{E}\!\left[\Delta_{k,i+n}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right].$$

The last term converges to zero as $n\to\infty$ because $\gamma\in(0,1)$ and $\Delta_{k,i+n}$ is uniformly bounded. Therefore,

$$\mathbb{E}\!\left[\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{h=0}^{\infty}\gamma^{h+1}\mathbb{E}\!\left[E_{k,i+h}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right].\tag{EC.11}$$
Next, use the pseudo-episode construction. Conditional on $(\mathcal{F}_{t_k},P^{c})$, the random length $L_k$ is independent of the MDP trajectory and satisfies $\mathbb{P}\!\left(L_k\geq i\,\middle|\,\mathcal{F}_{t_k},P^{c}\right)=\gamma^{i-1}$ for $i\geq 1$, because $L_k$ is geometric with success probability $1-\gamma$ on $\{1,2,\dots\}$. Since $|\Delta_{k,i}|\leq(1-\gamma)^{-1}$, we have $\mathbb{E}\bigl[\sum_{i\geq 1}\mathbf{1}\{L_k\geq i\}|\Delta_{k,i}|\,\big|\,\mathcal{F}_{t_k},P^{c}\bigr]\leq(1-\gamma)^{-2}<\infty$. Hence Fubini's theorem and conditional independence give

$$\mathbb{E}\!\left[\sum_{i=1}^{L_k}\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{i=1}^{\infty}\mathbb{E}\!\left[\mathbf{1}\{L_k\geq i\}\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{i=1}^{\infty}\gamma^{i-1}\mathbb{E}\!\left[\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right].\tag{EC.12}$$

Substituting ([EC.11](https://arxiv.org/html/2605.24345#A6.E11)) into ([EC.12](https://arxiv.org/html/2605.24345#A6.E12)) and exchanging the order of summation, which is justified by absolute summability, gives

$$\mathbb{E}\!\left[\sum_{i=1}^{L_k}\Delta_{k,i}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{i=1}^{\infty}\gamma^{i-1}\sum_{h=0}^{\infty}\gamma^{h+1}\mathbb{E}\!\left[E_{k,i+h}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{t=1}^{\infty}t\,\gamma^{t}\,\mathbb{E}\!\left[E_{k,t}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right].\tag{EC.13}$$

Similarly, $\mathbb{E}\bigl[\sum_{t\geq 1}\mathbf{1}\{L_k\geq t\}\,\gamma t\,|E_{k,t}|\,\big|\,\mathcal{F}_{t_k},P^{c}\bigr]<\infty$, so Fubini's theorem and conditional independence give

$$\mathbb{E}\!\left[\sum_{t=1}^{L_k}t\,\gamma\,E_{k,t}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{t=1}^{\infty}t\,\gamma\,\mathbb{E}\!\left[\mathbf{1}\{L_k\geq t\}E_{k,t}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right]=\sum_{t=1}^{\infty}t\,\gamma^{t}\,\mathbb{E}\!\left[E_{k,t}\,\middle|\,\mathcal{F}_{t_k},P^{c}\right].\tag{EC.14}$$

Comparing ([EC.13](https://arxiv.org/html/2605.24345#A6.E13)) and ([EC.14](https://arxiv.org/html/2605.24345#A6.E14)) proves ([19](https://arxiv.org/html/2605.24345#S4.E19)). ∎
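The resummation step in (EC.13) rests on a counting fact: each index $t=i+h$ with $i\geq 1$, $h\geq 0$ arises from exactly $t$ pairs $(i,h)$, each with weight $\gamma^{i-1}\gamma^{h+1}=\gamma^{t}$. A small sketch, with an arbitrary finite sequence `e` standing in for the conditional expectations $\mathbb{E}[E_{k,t}\mid\mathcal{F}_{t_k},P^{c}]$ (an illustrative assumption, with all later terms set to zero), checks that the double and single sums agree.

```python
import random

def double_sum(e, gamma):
    """sum_{i>=1} gamma^(i-1) * sum_{h>=0} gamma^(h+1) * e_{i+h}, with e_t = 0 beyond len(e)."""
    n = len(e)
    total = 0.0
    for i in range(1, n + 1):
        for h in range(0, n - i + 1):
            total += gamma ** (i - 1) * gamma ** (h + 1) * e[i + h - 1]
    return total

def single_sum(e, gamma):
    """sum_{t>=1} t * gamma^t * e_t over the same finite sequence."""
    return sum(t * gamma ** t * e[t - 1] for t in range(1, len(e) + 1))

rng = random.Random(1)
gamma = 0.9
e = [rng.uniform(-10.0, 10.0) for _ in range(60)]   # bounded stand-ins for E[E_{k,t} | ...]
lhs, rhs = double_sum(e, gamma), single_sum(e, gamma)
print(lhs, rhs)
```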

### EC.6.3 Proof of Lemma [3](https://arxiv.org/html/2605.24345#Thmlemma3)

###### Proof\.

Apply Lemma [2](https://arxiv.org/html/2605.24345#Thmlemma2) with the risk profile $\alpha_k$. This gives

$$\mathbb{E}\!\left[\sum_{i=1}^{L_k}\Bigl(V_{\phi_k,\alpha_k}^{\pi}(s_{k,i})-V^{\pi}(s_{k,i})\Bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right]=\mathbb{E}\!\left[\sum_{i=1}^{L_k}i\,\gamma\,E_{k,i}\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right].\tag{EC.15}$$

Next, repeat the same argument as in Lemma [2](https://arxiv.org/html/2605.24345#Thmlemma2) for the constant risk level $\underline{\alpha}$. The corresponding one-step discrepancy is

$$\rho_{\phi_k(s_{k,i},a_{k,i})}^{\underline{\alpha}}\!\bigl(P^{\top}V_{\phi_k,\underline{\alpha}}^{\pi}\bigr)-(P^{c}_{s_{k,i},a_{k,i}})^{\top}V_{\phi_k,\underline{\alpha}}^{\pi}=-E^{-}_{k,i}.$$

Therefore,

$$\mathbb{E}\!\left[\sum_{i=1}^{L_k}\Bigl(V_{\phi_k,\underline{\alpha}}^{\pi}(s_{k,i})-V^{\pi}(s_{k,i})\Bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right]=\mathbb{E}\!\left[\sum_{i=1}^{L_k}i\,\gamma\,(-E^{-}_{k,i})\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right].\tag{EC.16}$$

Subtracting ([EC.16](https://arxiv.org/html/2605.24345#A6.E16)) from ([EC.15](https://arxiv.org/html/2605.24345#A6.E15)) and using linearity of conditional expectation, we obtain

$$\mathbb{E}\!\left[\sum_{i=1}^{L_k}\Bigl(V_{\phi_k,\alpha_k}^{\pi}(s_{k,i})-V_{\phi_k,\underline{\alpha}}^{\pi}(s_{k,i})\Bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right]=\mathbb{E}\!\left[\sum_{i=1}^{L_k}i\,\gamma\,\bigl(E_{k,i}+E^{-}_{k,i}\bigr)\,\middle|\,\mathcal{F}_{t_k},\,P^{c}\right],$$

which is exactly ([20](https://arxiv.org/html/2605.24345#S4.E20)). ∎

### EC.6.4 Proof of Theorem [3](https://arxiv.org/html/2605.24345#Thmtheorem3)

###### Lemma 9 (Dirichlet posterior quantile deviation for BR-MDP).

Fix pseudo-episode $k$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$. Conditional on $\mathcal{F}_{t_k}$, let $P\sim\mathrm{Dir}(\phi_k(s,a))$, where $\phi_k(s,a)=(\phi_k(s,a,s'))_{s'\in\mathcal{S}}$ is the Dirichlet parameter vector. Let $\phi_{k,0}(s,a):=\sum_{s'\in\mathcal{S}}\phi_k(s,a,s')=N_k(s,a)+S$ denote the scalar total concentration parameter. Define the posterior mean $\bar{P}_k(\cdot\mid s,a):=\phi_k(s,a)/\phi_{k,0}(s,a)$. Then for any $\mathcal{F}_{t_k}$-measurable vector $V\in\left[0,\frac{1}{1-\gamma}\right]^{S}$ and any $\alpha\in(0,1)$,

$$\rho_{\phi_k(s,a)}^{\alpha}(P^{\top}V)-\bar{P}_k(\cdot\mid s,a)^{\top}V\leq\frac{1}{1-\gamma}\sqrt{\frac{2}{\phi_{k,0}(s,a)}\ln\!\left(\frac{1}{1-\alpha}\right)},\tag{EC.17}$$

$$\bar{P}_k(\cdot\mid s,a)^{\top}V-\rho_{\phi_k(s,a)}^{\alpha}(P^{\top}V)\leq\frac{1}{1-\gamma}\sqrt{\frac{2}{\phi_{k,0}(s,a)}\ln\!\left(\frac{1}{\alpha}\right)}.\tag{EC.18}$$

###### Proof\.

If $V$ is constant, then $P^{\top}V=\bar{P}_k(\cdot\mid s,a)^{\top}V$ almost surely, and both ([EC.17](https://arxiv.org/html/2605.24345#A6.E17)) and ([EC.18](https://arxiv.org/html/2605.24345#A6.E18)) are trivial. Hence we only consider the non-constant case.

If $S=1$, then $P\equiv\bar{P}_k(\cdot\mid s,a)\equiv 1$, so the result is again trivial. Thus it remains to consider the case $S\geq 2$. In this case, $\phi_{k,0}(s,a)=N_k(s,a)+S\geq S\geq 2$. Define $Y:=(1-\gamma)P^{\top}V$ and $\mu:=(1-\gamma)\bar{P}_k(\cdot\mid s,a)^{\top}V$. By Agrawal and Jia ([2023](https://arxiv.org/html/2605.24345#bib.bib2), Lemma B.1, Lemma B.4 and the proof of Corollary B.2), we have the one-sided Gaussian tail bounds

$$\mathbb{P}\!\left(Y-\mu\geq t\,\middle|\,\mathcal{F}_{t_k}\right)\leq\exp\!\left(-\frac{\phi_{k,0}(s,a)\,t^{2}}{2}\right),\qquad\mathbb{P}\!\left(\mu-Y\geq t\,\middle|\,\mathcal{F}_{t_k}\right)\leq\exp\!\left(-\frac{\phi_{k,0}(s,a)\,t^{2}}{2}\right),$$

for all $t>0$. Taking $t=(1-\gamma)\epsilon$ in these bounds yields, for every $\epsilon>0$,

$$\mathbb{P}\!\left(P^{\top}V-\bar{P}_k(\cdot\mid s,a)^{\top}V\geq\epsilon\,\middle|\,\mathcal{F}_{t_k}\right)\leq\exp\!\left(-\frac{\phi_{k,0}(s,a)(1-\gamma)^{2}\epsilon^{2}}{2}\right),\tag{EC.19}$$

$$\mathbb{P}\!\left(\bar{P}_k(\cdot\mid s,a)^{\top}V-P^{\top}V\geq\epsilon\,\middle|\,\mathcal{F}_{t_k}\right)\leq\exp\!\left(-\frac{\phi_{k,0}(s,a)(1-\gamma)^{2}\epsilon^{2}}{2}\right).\tag{EC.20}$$

We now prove ([EC.17](https://arxiv.org/html/2605.24345#A6.E17)). Let $\epsilon_{+}:=\frac{1}{1-\gamma}\sqrt{\frac{2}{\phi_{k,0}(s,a)}\ln\!\left(\frac{1}{1-\alpha}\right)}$, so that the right-hand side of ([EC.19](https://arxiv.org/html/2605.24345#A6.E19)) at $\epsilon=\epsilon_{+}$ equals $1-\alpha$. Then by ([EC.19](https://arxiv.org/html/2605.24345#A6.E19)),

$$\mathbb{P}\!\left(P^{\top}V\leq\bar{P}_k(\cdot\mid s,a)^{\top}V+\epsilon_{+}\,\middle|\,\mathcal{F}_{t_k}\right)\geq\alpha.$$

By the definition of the left $\alpha$-quantile,

$$\rho_{\phi_k(s,a)}^{\alpha}(P^{\top}V)\leq\bar{P}_k(\cdot\mid s,a)^{\top}V+\epsilon_{+},$$

which proves ([EC.17](https://arxiv.org/html/2605.24345#A6.E17)).

Next, let $\epsilon_{-}:=\frac{1}{1-\gamma}\sqrt{\frac{2}{\phi_{k,0}(s,a)}\ln\!\left(\frac{1}{\alpha}\right)}$. Since $V$ is non-constant and the Dirichlet parameter vector has strictly positive components (indeed, $\phi_k(s,a,s')\geq 1$ for all $s'\in\mathcal{S}$), $P^{\top}V$ has a continuous distribution with a strictly increasing CDF on its support. By ([EC.20](https://arxiv.org/html/2605.24345#A6.E20)),

$$\mathbb{P}\!\left(P^{\top}V\leq\bar{P}_k(\cdot\mid s,a)^{\top}V-\epsilon_{-}\,\middle|\,\mathcal{F}_{t_k}\right)\leq\alpha.$$

Therefore, the preceding bound and the continuity and strict monotonicity of the CDF imply

$$\rho_{\phi_k(s,a)}^{\alpha}(P^{\top}V)\geq\bar{P}_k(\cdot\mid s,a)^{\top}V-\epsilon_{-},$$

which proves ([EC.18](https://arxiv.org/html/2605.24345#A6.E18)). ∎
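A Monte Carlo sanity check of (EC.17) and (EC.18) can be sketched as follows. This is only an illustration under stated assumptions: the Dirichlet parameters, the vector $V$, and the sample size are hypothetical, and the exact posterior quantile is replaced by an empirical quantile over simulated draws, so the check is approximate rather than a proof.

```python
import math
import random

def dirichlet_sample(phi, rng):
    """Draw one probability vector from Dir(phi) via normalized Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in phi]
    total = sum(g)
    return [x / total for x in g]

def left_quantile(values, alpha):
    """Empirical left alpha-quantile: smallest x whose empirical CDF is >= alpha."""
    xs = sorted(values)
    return xs[max(0, math.ceil(alpha * len(xs)) - 1)]

def quantile_deviation(phi, V, gamma, alpha, n_samples=5000, seed=0):
    """Return (empirical quantile, posterior mean, eps_plus, eps_minus) for P^T V."""
    rng = random.Random(seed)
    phi0 = sum(phi)                                   # total concentration phi_{k,0}
    mean = sum(a * v for a, v in zip(phi, V)) / phi0  # posterior mean of P^T V
    draws = [sum(p * v for p, v in zip(dirichlet_sample(phi, rng), V))
             for _ in range(n_samples)]
    q = left_quantile(draws, alpha)
    eps_plus = math.sqrt(2.0 / phi0 * math.log(1.0 / (1.0 - alpha))) / (1.0 - gamma)
    eps_minus = math.sqrt(2.0 / phi0 * math.log(1.0 / alpha)) / (1.0 - gamma)
    return q, mean, eps_plus, eps_minus

gamma = 0.9
phi = [30, 10, 20, 50]            # hypothetical Dirichlet parameters, all >= 1
V = [0.0, 10.0, 4.0, 7.0]         # entries in [0, 1 / (1 - gamma)]
results = {alpha: quantile_deviation(phi, V, gamma, alpha) for alpha in (0.1, 0.5, 0.9)}
for alpha, (q, mean, eps_plus, eps_minus) in results.items():
    print(alpha, q - mean, eps_plus, eps_minus)
```

In this example the empirical quantile sits well inside the interval $[\bar{P}_k^{\top}V-\epsilon_{-},\ \bar{P}_k^{\top}V+\epsilon_{+}]$, which is consistent with the bounds being sub-Gaussian envelopes rather than tight estimates.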

###### Lemma 10.

Let $\gamma\in(0,1)$ and let $L_1,\dots,L_K$ be i.i.d. geometric random variables on $\{1,2,\dots\}$ with $\mathbb{P}(L_k=\ell)=(1-\gamma)\gamma^{\ell-1}$. Define $u:=\max\left\{e,\frac{K}{(1-\gamma)^{2}}\right\}$, $m:=\left\lceil\frac{1}{1-\gamma}\left(\log u+\log\log u+2\right)\right\rceil$, and

$$\mathfrak{T}:=\frac{4\gamma}{1-\gamma}\left(\log u+\log\log u+4\right)^{2}.$$

Then

$$\mathbb{E}\!\left[\frac{1}{1-\gamma}\sum_{k=1}^{K}\sum_{i=1}^{L_k}i\,\gamma\,\mathbb{I}\{L_k>m\}\right]\leq\mathfrak{T}.$$

###### Proof\.

Let $q:=1-\gamma$, and let $L\sim\mathrm{Geom}(q)$ on $\{1,2,\dots\}$. By linearity of expectation,

$$\mathbb{E}\!\left[\frac{1}{1-\gamma}\sum_{k=1}^{K}\sum_{i=1}^{L_k}\gamma i\,\mathbb{I}\{L_k>m\}\right]=\frac{K\gamma}{q}\,\mathbb{E}\!\left[\frac{L(L+1)}{2}\mathbb{I}\{L>m\}\right]\leq\frac{K\gamma}{q}\,\mathbb{E}\!\left[L^{2}\,\mathbb{I}\{L>m\}\right].$$

By the memoryless property of the geometric distribution, conditional on $\{L>m\}$ we can write $L=m+\widetilde{L}$, where $\widetilde{L}\sim\mathrm{Geom}(q)$. Hence

$$\mathbb{E}\!\left[L^{2}\,\mathbb{I}\{L>m\}\right]=\gamma^{m}\,\mathbb{E}\!\left[(m+\widetilde{L})^{2}\right]\leq\gamma^{m}\left(m^{2}+\frac{2m}{q}+\frac{2}{q^{2}}\right).$$

Since $\gamma\leq e^{-q}$, $qm\geq\log u+\log\log u+2$, and $K/q^{2}\leq u$, we have $\gamma^{m}\leq e^{-2}/(u\log u)$. Also $qm\leq\log u+\log\log u+3$. Therefore,

$$\frac{K\gamma}{q}\,\mathbb{E}\!\left[L^{2}\,\mathbb{I}\{L>m\}\right]\leq\frac{\gamma e^{-2}}{q\log u}\left[(\log u+\log\log u+3)^{2}+2(\log u+\log\log u+3)+2\right]\leq\mathfrak{T}.$$

This proves the claim. ∎
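The quantities in Lemma 10 are easy to evaluate directly. The sketch below computes $u$, $m$, and $\mathfrak{T}$ from the definitions, evaluates the left-hand side by summing the geometric series numerically (truncated after many terms, an approximation introduced here), and checks the memoryless identity using $\mathbb{E}[\widetilde{L}^{2}]=(2-q)/q^{2}$, a standard moment of the geometric distribution. The function names and the example values of $\gamma$ and $K$ are illustrative.

```python
import math

def lemma10_quantities(gamma, K):
    """Return (u, m, frak_T) as defined in Lemma 10."""
    q = 1.0 - gamma
    u = max(math.e, K / q ** 2)
    x = math.log(u) + math.log(math.log(u))
    m = math.ceil((x + 2.0) / q)
    frak_T = 4.0 * gamma / q * (x + 4.0) ** 2
    return u, m, frak_T

def truncated_tail(gamma, m, weight, n_terms=20000):
    """Approximate E[weight(L) * 1{L > m}] for L ~ Geom(1 - gamma) on {1, 2, ...}."""
    q = 1.0 - gamma
    return sum(q * gamma ** (ell - 1) * weight(ell)
               for ell in range(m + 1, m + 1 + n_terms))

gamma, K = 0.9, 1000
q = 1.0 - gamma
u, m, frak_T = lemma10_quantities(gamma, K)
# Left-hand side of the lemma: (K * gamma / q) * E[L (L + 1) / 2 * 1{L > m}].
lhs = K * gamma / q * truncated_tail(gamma, m, lambda ell: ell * (ell + 1) / 2.0)
# Memoryless identity: E[L^2 1{L > m}] = gamma^m * (m^2 + 2 m / q + (2 - q) / q^2).
second_moment = truncated_tail(gamma, m, lambda ell: float(ell * ell))
memoryless = gamma ** m * (m ** 2 + 2.0 * m / q + (2.0 - q) / q ** 2)
print(m, lhs, frak_T)
```

For these example values the left-hand side is several orders of magnitude below $\mathfrak{T}$, which suggests the bound is chosen for convenience rather than tightness.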

With a slight abuse of notation, for each time $t$, let $N_t(s,a)$ denote the number of visits to $(s,a)$ before time $t$.

###### Theorem 3\.

Fix a deterministic interaction horizon $T$. Let

$$u_T:=\max\left\{e,\frac{T}{(1-\gamma)^{2}}\right\},\qquad m_T:=\left\lceil\frac{1}{1-\gamma}\left(\log u_T+\log\log u_T+2\right)\right\rceil,\qquad\mathfrak{T}_T:=\frac{4\gamma}{1-\gamma}\left(\log u_T+\log\log u_T+4\right)^{2},$$

and $\bar{T}:=T+m_T$. Define $M_T:=\frac{2(S+1)(\bar{T}+SA)\sqrt{\bar{T}}}{SA\,\delta\,\ln 2}$. Then:

1. The Bayesian regret with respect to the true optimal value satisfies

   $$\begin{aligned}
   BR(T)\leq{}&\frac{\gamma m_T}{1-\gamma}\Bigg(\sqrt{16\,SA\,\bar{T}\,\ln\frac{1}{1-\underline{\alpha}}}+\sqrt{16\,SA\,(\bar{T}+S^{2}A)\left(\ln\left(1+\frac{SA\,M_T}{\bar{T}+S^{2}A}\right)+2\right)}\\
   &\qquad+\sqrt{16\,SA\,\bar{T}\,\ln(2SA\,T^{3}m_T)}\Bigg)+\frac{\gamma SA}{1-\gamma}\,m_T^{2}\left\lceil\log_{2}m_T\right\rceil\\
   &\qquad+\frac{\pi^{2}}{6(1-\gamma)}+\mathfrak{T}_T+\frac{2\delta\,SA\,\sqrt{T}\ln(2T)}{(1-\gamma)^{2}}.
   \end{aligned}\tag{EC.21}$$

2. The Bayesian regret with respect to the optimal BR-MDP robust value satisfies

   $$\begin{aligned}
   BR\text{-}R(T)\leq{}&\frac{\gamma m_T}{1-\gamma}\Bigg(\sqrt{16\,SA\,\bar{T}\,\ln\frac{1}{1-\underline{\alpha}}}+\sqrt{16\,SA\,(\bar{T}+S^{2}A)\left(\ln\left(1+\frac{SA\,M_T}{\bar{T}+S^{2}A}\right)+2\right)}\\
   &\qquad+\sqrt{16\,SA\,\bar{T}\,\ln\frac{1}{\underline{\alpha}}}+2\sqrt{16\,SA\,\bar{T}\,\ln(2SA\,T^{3}m_T)}\Bigg)\\
   &\qquad+\frac{2\gamma SA}{1-\gamma}\,m_T^{2}\left\lceil\log_{2}m_T\right\rceil+\frac{\pi^{2}}{3(1-\gamma)}+2\mathfrak{T}_T.
   \end{aligned}\tag{EC.22}$$
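To get a feel for how the right-hand side of (EC.21) scales, one can evaluate it term by term. The function below is a direct transcription of the bound as stated above; the function name, argument names, and example parameter values are illustrative choices rather than anything prescribed by the paper.

```python
import math

def regret_bound_ec21(S, A, gamma, delta, alpha_low, T):
    """Evaluate the right-hand side of (EC.21); alpha_low plays the role of underline-alpha."""
    q = 1.0 - gamma
    u_T = max(math.e, T / q ** 2)
    x = math.log(u_T) + math.log(math.log(u_T))
    m_T = math.ceil((x + 2.0) / q)
    frak_T = 4.0 * gamma / q * (x + 4.0) ** 2
    T_bar = T + m_T
    M_T = 2.0 * (S + 1) * (T_bar + S * A) * math.sqrt(T_bar) / (S * A * delta * math.log(2.0))
    root1 = math.sqrt(16.0 * S * A * T_bar * math.log(1.0 / (1.0 - alpha_low)))
    root2 = math.sqrt(16.0 * S * A * (T_bar + S * S * A)
                      * (math.log(1.0 + S * A * M_T / (T_bar + S * S * A)) + 2.0))
    root3 = math.sqrt(16.0 * S * A * T_bar * math.log(2.0 * S * A * T ** 3 * m_T))
    return (gamma * m_T / q * (root1 + root2 + root3)
            + gamma * S * A / q * m_T ** 2 * math.ceil(math.log2(m_T))
            + math.pi ** 2 / (6.0 * q)
            + frak_T
            + 2.0 * delta * S * A * math.sqrt(T) * math.log(2.0 * T) / q ** 2)

params = dict(S=5, A=2, gamma=0.9, delta=0.1, alpha_low=0.5)   # hypothetical values
small, large = 10 ** 4, 10 ** 8
per_step_small = regret_bound_ec21(T=small, **params) / small
per_step_large = regret_bound_ec21(T=large, **params) / large
print(per_step_small, per_step_large)
```

The per-step value of the bound shrinks as $T$ grows in this example, which is what a sublinear regret bound implies; the constants are large, so the bound is informative mainly in its growth rate.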

###### Proof\.

For notational convenience, extend the Bernoulli restart process and the corresponding trajectory beyond time $T$ only for the purpose of the proof. Let $I_k^{T}:=\mathbb{I}\{k\leq K_T\}$. Since every pseudo-episode has length at least one, $K_T\leq T$, and sums over pseudo-episodes started by time $T$ can be written as sums over $k=1,\ldots,T$ multiplied by $I_k^{T}$.

Using the upper bound in ([EC.9](https://arxiv.org/html/2605.24345#A6.E9)),

$$BR(T)\leq\mathbb{E}\!\left[\sum_{k=1}^{T}I_k^{T}\sum_{i=1}^{L_k}\Bigl(V^{*}(s_{k,i})-V^{\pi_k}(s_{k,i})\Bigr)\right].$$

Under the same no-truncation convention for the robust-optimal benchmark,

$$BR\text{-}R(T)\leq\mathbb{E}\!\left[\sum_{k=1}^{T}I_k^{T}\sum_{i=1}^{L_k}\Bigl(V^{*}_{\phi_k,\underline{\alpha}}(s_{k,i})-V^{\pi_k}_{\phi_k,\underline{\alpha}}(s_{k,i})\Bigr)\right].$$

For each started pseudo-episode $k$, recall that $\mathcal{G}_k=\bigcap_{s,a}\mathcal{G}_k(s,a)$ is the optimism event from Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1). Since $1-\alpha_k(s,a)\leq\delta\frac{N_k^{+}(s,a)}{\bar{N}_k^{+}}\frac{\ln(2k)}{\sqrt{k}}$, we have

$$\mathbb{P}(\mathcal{G}_k^{c}\mid\mathcal{F}_{t_k})\leq\sum_{s,a}(1-\alpha_k(s,a))\leq\delta\,\frac{\ln(2k)}{\sqrt{k}}\sum_{s,a}\frac{N_k^{+}(s,a)}{\bar{N}_k^{+}}=\delta\,SA\,\frac{\ln(2k)}{\sqrt{k}}.$$

Moreover, on $\mathcal{G}_k$, $V^{*}(s)\leq V_k(s)=V_{\phi_k,\alpha_k}^{*}(s)$ for all $s\in\mathcal{S}$.

For part (i), define $\Delta_{k,i}:=V_k(s_{k,i})-V^{\pi_k}(s_{k,i})$. Then, for $s_{k,i}$,

$$V^{*}(s_{k,i})-V^{\pi_k}(s_{k,i})=\Delta_{k,i}+\Big(V^{*}(s_{k,i})-V_k(s_{k,i})\Big)\leq\Delta_{k,i}+\frac{1}{1-\gamma}\,\mathbb{I}\{\mathcal{G}_k^{c}\}.$$

Therefore,

$$BR(T)\leq\mathbb{E}\!\left[\sum_{k=1}^{T}I_k^{T}\sum_{i=1}^{L_k}\Delta_{k,i}\right]+\frac{1}{1-\gamma}\sum_{k=1}^{T}\mathbb{E}\!\left[I_k^{T}L_k\,\mathbb{I}\{\mathcal{G}_k^{c}\}\right].$$

For the non-optimistic-event term, we first use $I_k^{T}\leq 1$. The event $\mathcal{G}_k^{c}$ is determined by $(\mathcal{F}_{t_k},P^{c})$, whereas the pseudo-episode length $L_k$ is generated by the independent restart randomness within pseudo-episode $k$. Hence $L_k$ is independent of $\mathcal{G}_k^{c}$ and satisfies $\mathbb{E}[L_k]=1/(1-\gamma)$. Thus,

$$\begin{aligned}
\frac{1}{1-\gamma}\sum_{k=1}^{T}\mathbb{E}\!\left[I_k^{T}L_k\,\mathbb{I}\{\mathcal{G}_k^{c}\}\right]
&\leq\frac{1}{1-\gamma}\sum_{k=1}^{T}\mathbb{E}\!\left[L_k\,\mathbb{I}\{\mathcal{G}_k^{c}\}\right]=\frac{1}{(1-\gamma)^{2}}\sum_{k=1}^{T}\mathbb{P}(\mathcal{G}_k^{c})=\frac{1}{(1-\gamma)^{2}}\sum_{k=1}^{T}\mathbb{E}\!\left[\mathbb{P}(\mathcal{G}_k^{c}\mid\mathcal{F}_{t_k})\right]\\
&\leq\frac{\delta SA}{(1-\gamma)^{2}}\sum_{k=1}^{T}\frac{\ln(2k)}{\sqrt{k}}\leq\frac{2\delta SA\,\sqrt{T}\ln(2T)}{(1-\gamma)^{2}}.
\end{aligned}$$

We now bound $\mathbb{E}\bigl[\sum_{k=1}^{T}I_k^{T}\sum_{i=1}^{L_k}\Delta_{k,i}\bigr]$. For each started pseudo-episode $k$, define $\mathcal{P}_k^{1}$ as the event that, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$\Big|(\bar{P}_{k}(\cdot\mid s,a)-P^{c}_{s,a})^{\top}V_{k}\Big|\leq\frac{1}{1-\gamma}\sqrt{\frac{2\ln(2SA\,T\,m_{T}\,t_{k}^{2})}{N_{t_{k}}(s,a)+S}}.$$

Conditioned on $\mathcal{F}_{t_{k}}$, the vector $V_{k}$ is deterministic and $P^{c}_{s,a}\mid\mathcal{F}_{t_{k}}\sim\mathrm{Dir}(\phi_{k}(s,a))$. Hence, by ([EC.19](https://arxiv.org/html/2605.24345#A6.E19)), ([EC.20](https://arxiv.org/html/2605.24345#A6.E20)), and a union bound,

$$\mathbb{P}\big((\mathcal{P}_{k}^{1})^{c}\mid\mathcal{F}_{t_{k}}\big)\leq\frac{1}{T\,m_{T}\,t_{k}^{2}}.$$

Using $\Delta_{k,i}\leq(1-\gamma)^{-1}$, we split according to $\mathcal{P}_{k}^{1}$:

$$\mathbb{E}\!\left[\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\Delta_{k,i}\right]\leq\mathbb{E}\!\left[\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\Delta_{k,i}\mathbb{I}\{\mathcal{P}_{k}^{1}\}\right]+\frac{1}{1-\gamma}\sum_{k=1}^{T}\mathbb{E}\!\left[I_{k}^{T}L_{k}\mathbb{I}\{(\mathcal{P}_{k}^{1})^{c}\}\right].$$

Since $\mathcal{P}_{k}^{1}\in\mathcal{F}_{t_{k}}\vee\sigma(P^{c})$, Lemma [2](https://arxiv.org/html/2605.24345#Thmlemma2) gives

$$\mathbb{E}\!\left[\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\Delta_{k,i}\mathbb{I}\{\mathcal{P}_{k}^{1}\}\right]=\mathbb{E}\!\left[\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}i\,\gamma E_{k,i}\mathbb{I}\{\mathcal{P}_{k}^{1}\}\right],$$

where $E_{k,i}:=\rho^{\alpha_{k}}_{\phi_{k}(s_{k,i},a_{k,i})}(P^{\top}V_{k})-(P^{c}_{s_{k,i},a_{k,i}})^{\top}V_{k}$. Moreover, by the same independence between $L_{k}$ and $\mathbb{I}\{(\mathcal{P}_{k}^{1})^{c}\}$, and since $t_{k}\geq k$ and $m_{T}\geq(1-\gamma)^{-1}$,

$$\frac{1}{1-\gamma}\sum_{k=1}^{T}\mathbb{E}\!\left[I_{k}^{T}L_{k}\mathbb{I}\{(\mathcal{P}_{k}^{1})^{c}\}\right]\leq\frac{\pi^{2}}{6(1-\gamma)}.$$

Let $N'_{t}(s,a):=N_{t}(s,a)+S$, and define

$$\mathcal{B}_{k}:=\left\{N'_{t_{k+1}-1}(s,a)+1\leq 2N'_{t_{k}}(s,a)\ \text{for all }(s,a)\in\mathcal{S}\times\mathcal{A}\right\}.$$

For the last started pseudo-episode, $t_{K_{T}+1}$ is interpreted in the proof-only continuation. Since $E_{k,i}\leq(1-\gamma)^{-1}$, the preceding Bellman-error sum is bounded by

$$\begin{aligned}
&\underbrace{\frac{1}{1-\gamma}\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\gamma i\,\mathbb{I}\{L_{k}>m_{T}\}}_{(I)}+\underbrace{\frac{1}{1-\gamma}\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\gamma i\,\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}^{c}\}}_{(II)}\\
&\qquad+\underbrace{\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\gamma i\,E_{k,i}\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}\}\mathbb{I}\{\mathcal{P}_{k}^{1}\}}_{(III)}.
\end{aligned}$$

By Lemma [10](https://arxiv.org/html/2605.24345#Thmlemma10) with $K=T$, $\mathbb{E}[(I)]\leq\mathfrak{T}_{T}$. For term $(II)$, the witness-pair argument gives

$$\sum_{k=1}^{T}I_{k}^{T}\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}^{c}\}\leq SA\lceil\log_{2}m_{T}\rceil,$$

and hence $\mathbb{E}[(II)]\leq\frac{\gamma SA}{1-\gamma}\,m_{T}^{2}\lceil\log_{2}m_{T}\rceil$.

It remains to bound $(III)$. On $\{L_{k}\leq m_{T}\}\cap\mathcal{B}_{k}\cap\mathcal{P}_{k}^{1}$, we have $i\leq m_{T}$. By Lemma [9](https://arxiv.org/html/2605.24345#Thmlemma9), the definition of $\mathcal{P}_{k}^{1}$, and

$$\ln\frac{1}{1-\alpha_{k}(s,a)}\leq\ln\frac{1}{1-\underline{\alpha}}+\left(\ln\frac{\bar{N}_{k}^{+}\sqrt{k}}{\delta\,N_{k}^{+}(s,a)\ln(2k)}\right)_{+},$$

where $(x)_{+}:=\max\{x,0\}$, we have

$$\begin{aligned}
E_{k,i}&\leq\frac{1}{1-\gamma}\sqrt{\frac{2\ln\frac{1}{1-\underline{\alpha}}}{N_{t_{k}}(s_{k,i},a_{k,i})+S}}+\frac{1}{1-\gamma}\frac{\sqrt{2\left(\ln\frac{\bar{N}_{k}^{+}\sqrt{k}}{\delta\,N_{k}^{+}(s_{k,i},a_{k,i})\ln(2k)}\right)_{+}}}{\sqrt{N_{t_{k}}(s_{k,i},a_{k,i})+S}}\\
&\quad+\frac{1}{1-\gamma}\sqrt{\frac{2\ln(2SA\,T\,m_{T}\,t_{k}^{2})}{N_{t_{k}}(s_{k,i},a_{k,i})+S}}.
\end{aligned}$$

Since $k\leq T$, $t_{k}\leq T$, $\bar{N}_{k}^{+}\leq(\bar{T}+SA)/SA$, $\ln(2k)\geq\ln 2$, and $N_{k}^{+}(s,a)\geq(N_{k}(s,a)+S)/(S+1)$,

$$\left(\ln\frac{\bar{N}_{k}^{+}\sqrt{k}}{\delta\,N_{k}^{+}(s,a)\ln(2k)}\right)_{+}\leq\left(\ln\frac{(S+1)(\bar{T}+SA)\sqrt{\bar{T}}}{SA\,\delta\,\ln 2\,(N_{k}(s,a)+S)}\right)_{+}.$$

Also, $\ln(2SA\,T\,m_{T}\,t_{k}^{2})\leq\ln(2SA\,T^{3}m_{T})$. Therefore,

$$\begin{aligned}
(III)&\leq\frac{\gamma m_{T}}{1-\gamma}\sqrt{2\ln\frac{1}{1-\underline{\alpha}}}\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\frac{\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}\}}{\sqrt{N_{t_{k}}(s_{k,i},a_{k,i})+S}}\\
&\quad+\frac{\gamma m_{T}}{1-\gamma}\sqrt{2}\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\frac{\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}\}\sqrt{\left(\ln\frac{(S+1)(\bar{T}+SA)\sqrt{\bar{T}}}{SA\,\delta\,\ln 2\,(N_{t_{k}}(s_{k,i},a_{k,i})+S)}\right)_{+}}}{\sqrt{N_{t_{k}}(s_{k,i},a_{k,i})+S}}\\
&\quad+\frac{\gamma m_{T}}{1-\gamma}\sqrt{2\ln(2SA\,T^{3}m_{T})}\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\frac{\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}\}}{\sqrt{N_{t_{k}}(s_{k,i},a_{k,i})+S}}.
\end{aligned}$$

The short augmented steps in $(III)$ consist of the first $T$ real interactions plus, only if the last pseudo-episode is short, at most $m_{T}$ additional proof-only steps. Hence their total number is at most $\bar{T}=T+m_{T}$. On $\mathcal{B}_{k}$, for time within pseudo-episode $k$,

$$N_{t}(s_{t},a_{t})+S\leq 2\bigl(N_{t_{k}}(s_{t},a_{t})+S\bigr).$$

Thus the two count-sums without logarithmic weights satisfy

$$\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\frac{\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}\}}{\sqrt{N_{t_{k}}(s_{k,i},a_{k,i})+S}}\leq\sqrt{2}\sum_{t=1}^{\bar{T}}\frac{1}{\sqrt{N_{t}(s_{t},a_{t})+S}}\leq\sqrt{8SA\bar{T}}.$$

For the logarithmically weighted sum, using the definition of $M_{T}$, the same argument gives

$$\begin{aligned}
&\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\frac{\mathbb{I}\{L_{k}\leq m_{T}\}\mathbb{I}\{\mathcal{B}_{k}\}\sqrt{\left(\ln\frac{(S+1)(\bar{T}+SA)\sqrt{\bar{T}}}{SA\,\delta\,\ln 2\,(N_{t_{k}}(s_{k,i},a_{k,i})+S)}\right)_{+}}}{\sqrt{N_{t_{k}}(s_{k,i},a_{k,i})+S}}\\
&\qquad\leq\sqrt{8SA(\bar{T}+S^{2}A)\left(\ln\left(1+\frac{SA\,M_{T}}{\bar{T}+S^{2}A}\right)+2\right)}.
\end{aligned}$$

Combining these bounds yields

$$\begin{aligned}
\mathbb{E}[(III)]&\leq\frac{\gamma m_{T}}{1-\gamma}\Bigg(\sqrt{16SA\bar{T}\ln\frac{1}{1-\underline{\alpha}}}+\sqrt{16SA(\bar{T}+S^{2}A)\left(\ln\left(1+\frac{SA\,M_{T}}{\bar{T}+S^{2}A}\right)+2\right)}\\
&\qquad+\sqrt{16SA\bar{T}\ln(2SA\,T^{3}m_{T})}\Bigg).
\end{aligned}$$

Combining the previous estimates proves ([EC.21](https://arxiv.org/html/2605.24345#A6.E21)).

For part (ii), define $\widetilde{\Delta}_{k,i}:=V_{k}(s_{k,i})-V_{\phi_{k},\underline{\alpha}}^{\pi_{k}}(s_{k,i})$. By Lemma [1](https://arxiv.org/html/2605.24345#Thmlemma1)(ii), $V_{\phi_{k},\underline{\alpha}}^{*}(s)\leq V_{k}(s)$ for all $s\in\mathcal{S}$, so $V_{\phi_{k},\underline{\alpha}}^{*}(s_{k,i})-V_{\phi_{k},\underline{\alpha}}^{\pi_{k}}(s_{k,i})\leq\widetilde{\Delta}_{k,i}$. Hence $BR\text{-}R(T)\leq\mathbb{E}\big[\sum_{k=1}^{T}I_{k}^{T}\sum_{i=1}^{L_{k}}\widetilde{\Delta}_{k,i}\big]$.

Define $\mathcal{P}_{k}^{2}$ as the event that, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$\Big|(\bar{P}_{k}(\cdot\mid s,a)-P^{c}_{s,a})^{\top}V_{\phi_{k},\underline{\alpha}}^{\pi_{k}}\Big|\leq\frac{1}{1-\gamma}\sqrt{\frac{2\ln(2SA\,T\,m_{T}\,t_{k}^{2})}{N_{t_{k}}(s,a)+S}}.$$

Then $\mathbb{P}((\mathcal{P}_{k}^{2})^{c}\mid\mathcal{F}_{t_{k}})\leq 1/(T\,m_{T}\,t_{k}^{2})$, and therefore $\mathbb{P}((\mathcal{P}_{k}^{1}\cap\mathcal{P}_{k}^{2})^{c}\mid\mathcal{F}_{t_{k}})\leq 2/(T\,m_{T}\,t_{k}^{2})$.

On $\mathcal{P}_{k}^{2}$, Lemma [9](https://arxiv.org/html/2605.24345#Thmlemma9) gives, for each $i\geq 1$,

$$E^{-}_{k,i}\leq\frac{1}{1-\gamma}\sqrt{\frac{2\ln\frac{1}{\underline{\alpha}}}{N_{t_{k}}(s_{k,i},a_{k,i})+S}}+\frac{1}{1-\gamma}\sqrt{\frac{2\ln(2SA\,T\,m_{T}\,t_{k}^{2})}{N_{t_{k}}(s_{k,i},a_{k,i})+S}}.$$

As in part (i), splitting only according to $\mathcal{P}_{k}^{1}\cap\mathcal{P}_{k}^{2}$ and using Lemma [3](https://arxiv.org/html/2605.24345#Thmlemma3), the same argument gives

$$\begin{aligned}
BR\text{-}R(T)&\leq\frac{\gamma m_{T}}{1-\gamma}\Bigg(\sqrt{16SA\bar{T}\ln\frac{1}{1-\underline{\alpha}}}+\sqrt{16SA(\bar{T}+S^{2}A)\left(\ln\left(1+\frac{SA\,M_{T}}{\bar{T}+S^{2}A}\right)+2\right)}\\
&\qquad+\sqrt{16SA\bar{T}\ln\frac{1}{\underline{\alpha}}}+2\sqrt{16SA\bar{T}\ln(2SA\,T^{3}m_{T})}\Bigg)\\
&\qquad+\frac{2\gamma SA}{1-\gamma}\,m_{T}^{2}\left\lceil\log_{2}m_{T}\right\rceil+\frac{\pi^{2}}{3(1-\gamma)}+2\mathfrak{T}_{T}.
\end{aligned}$$

This proves ([EC.22](https://arxiv.org/html/2605.24345#A6.E22)). ∎

###### Theorem 4 (Restatement of Theorem [2](https://arxiv.org/html/2605.24345#Thmtheorem2)).

For $\delta>0$, $\underline{\alpha}\in(0,1)$, and $T\geq S^{2}A$,

$$BR(T)\leq\widetilde{O}\!\left(\frac{\gamma\sqrt{SA\,T\,\ln\frac{e}{1-\underline{\alpha}}}}{(1-\gamma)^{2}}+\frac{\delta\,SA\sqrt{T}}{(1-\gamma)^{2}}+\frac{SA}{(1-\gamma)^{3}}\right),$$

and

$$BR\text{-}R(T)\leq\widetilde{O}\!\left(\frac{\gamma\sqrt{SA\,T\,\ln\frac{1}{\min\{1-\underline{\alpha},\underline{\alpha}\}}}}{(1-\gamma)^{2}}+\frac{SA}{(1-\gamma)^{3}}\right),$$

where $\widetilde{O}(\cdot)$ omits polylogarithmic factors in $S,A,T,1/\delta$, and logarithms of $1/(1-\gamma)$. In particular, if $\delta=\frac{1}{\sqrt{SA}}$ in AQ-BRMDP, then

$$BR(T)\leq\widetilde{O}\!\left(\frac{\gamma\sqrt{SA\,T\ln\frac{e}{1-\underline{\alpha}}}}{(1-\gamma)^{2}}+\frac{\sqrt{SA\,T}}{(1-\gamma)^{2}}+\frac{SA}{(1-\gamma)^{3}}\right).$$

###### Proof\.

We first prove the bounds for arbitrary $\delta>0$. By Theorem [3](https://arxiv.org/html/2605.24345#Thmtheorem3),

$$m_{T}=\widetilde{O}\!\left(\frac{1}{1-\gamma}\right),\qquad m_{T}^{2}\left\lceil\log_{2}m_{T}\right\rceil=\widetilde{O}\!\left(\frac{1}{(1-\gamma)^{2}}\right).$$

Therefore,

$$\frac{\gamma SA}{1-\gamma}\,m_{T}^{2}\left\lceil\log_{2}m_{T}\right\rceil=\widetilde{O}\!\left(\frac{SA}{(1-\gamma)^{3}}\right).$$

Also $\mathfrak{T}_{T}=\widetilde{O}((1-\gamma)^{-1})$, which is absorbed by $\widetilde{O}(SA/(1-\gamma)^{3})$.

Since $\bar{T}=T+m_{T}$, the square-root terms involving $\bar{T}$ are bounded by the corresponding $T$-terms plus lower-order terms that are absorbed by $\widetilde{O}(SA/(1-\gamma)^{3})$. Moreover, the logarithmic factor involving $M_{T}$ contributes only polylogarithmic dependence on $S,A,T,1/\delta$, and $1/(1-\gamma)$. Since $T\geq S^{2}A$, the term $\sqrt{SA(\bar{T}+S^{2}A)}$ is absorbed, up to lower-order terms, by $\sqrt{SA\,T}$. For the true-optimal benchmark, this pure $\sqrt{SA\,T}$ term is absorbed into $\sqrt{SA\,T\ln\frac{e}{1-\underline{\alpha}}}$ because $\ln\frac{e}{1-\underline{\alpha}}\geq 1$. Substituting these estimates into ([EC.21](https://arxiv.org/html/2605.24345#A6.E21)) yields

$$BR(T)\leq\widetilde{O}\!\left(\frac{\gamma\sqrt{SA\,T\,\ln\frac{e}{1-\underline{\alpha}}}}{(1-\gamma)^{2}}+\frac{\delta\,SA\sqrt{T}}{(1-\gamma)^{2}}+\frac{SA}{(1-\gamma)^{3}}\right).$$

For the robust-optimal benchmark, since $\min\{1-\underline{\alpha},\underline{\alpha}\}\leq 1/2$, the pure $\sqrt{SA\,T}$ term is absorbed by $\sqrt{SA\,T\ln\frac{1}{\min\{1-\underline{\alpha},\underline{\alpha}\}}}$. Substituting the same estimates into ([EC.22](https://arxiv.org/html/2605.24345#A6.E22)) gives

$$BR\text{-}R(T)\leq\widetilde{O}\!\left(\frac{\gamma\sqrt{SA\,T\,\ln\frac{1}{\min\{1-\underline{\alpha},\underline{\alpha}\}}}}{(1-\gamma)^{2}}+\frac{SA}{(1-\gamma)^{3}}\right).$$

Finally, if $\delta=1/\sqrt{SA}$, then

$$\frac{\delta SA\sqrt{T}}{(1-\gamma)^{2}}=\frac{\sqrt{SA\,T}}{(1-\gamma)^{2}}.$$

Substituting this into the arbitrary-$\delta$ bound for $BR(T)$ gives the stated special case; the $BR\text{-}R(T)$ bound does not contain the $\delta SA\sqrt{T}/(1-\gamma)^{2}$ term and is therefore unchanged up to hidden polylogarithmic factors. ∎

## Appendix EC.7 Implementation Details

This appendix gives the implementation details for the experiments in Section [5](https://arxiv.org/html/2605.24345#S5). For the finite-state experiments, the per-transition reward function $r(s,a,s')$ is known and deterministic, while the entire transition kernel is treated as unknown. When the reward depends on the next state, as in FrozenLake, expected one-step rewards are computed under the sampled posterior transition model. The agent maintains an independent Dirichlet posterior over each transition vector $P^{c}_{s,a}$ and updates the posterior parameters using the observed transition counts. For the continuous-state FrozenLake experiment, the transition kernel is parameterized by some stochastic components; details are given in Appendix [EC.7.7](https://arxiv.org/html/2605.24345#A7.SS7).

### EC.7.1 Value Iteration for the Approximate $\alpha_{k}$-quantile BR-MDP

At the beginning of pseudo-episode $k$, AQ-BRMDP first updates the posterior parameters $\phi_{k}$ using the history data and then computes the adaptive quantile schedule $\alpha_{k}$. These two quantities define the $\alpha_{k}$-quantile BR-MDP to be solved in the current pseudo-episode. The exact posterior quantiles in Bellman backups are generally not available in closed form. We therefore replace the exact posterior quantile by an empirical quantile computed from posterior samples, leading to the approximate $\alpha_{k}$-quantile BR-MDP used in implementation. For a value vector $V$, state-action pair $(s,a)$, and quantile level $\alpha_{k}(s,a)$, we draw posterior transition samples $P_{s,a}^{1},\ldots,P_{s,a}^{M}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{Dir}(\phi_{k}(s,a))$ and compute the sampled one-step Bellman targets $Z_{j}=\sum_{s'\in\mathcal{S}}P_{s,a}^{j}(s')\bigl[r(s,a,s')+\gamma V(s')\bigr]$. We then sort $Z_{(1)}\leq\cdots\leq Z_{(M)}$ and use $Z_{(\lceil M\alpha_{k}(s,a)\rceil)}$ as the empirical posterior quantile.

#### Monte Carlo budget for quantile estimation\.

The posterior-sampling budget used in each Bellman backup controls the Monte Carlo error of the empirical quantile estimator. To choose this budget across different quantile levels, we use the classical asymptotic normal approximation for sample quantiles. If $\widehat{q}_{\alpha}$ is the empirical $\alpha$-quantile computed from $m$ independent samples of a scalar random variable with density $g$ positive at its $\alpha$-quantile $q_{\alpha}$, then the leading variance term is proportional to

$$\frac{\alpha(1-\alpha)}{m\,g(q_{\alpha})^{2}}.$$

Since the exact density of the sampled Bellman target depends on both the posterior parameter and the current value vector, we use the standard normal distribution as a reference distribution to choose how the sample size varies across quantile levels. This choice allocates more posterior samples to tail quantiles.

For AQ-BRMDP, at pseudo-episode $k$ and state-action pair $(s,a)$, let $q_{k,s,a}:=\Phi^{-1}\!\bigl(\alpha_{k}(s,a)\bigr)$, and let $\varphi_{\mathrm{std}}$ denote the standard normal density. The number of posterior samples used to estimate the quantile from $(s,a)$ is chosen as

$$n_{k,s,a}=\min\left\{2048,\,\left\lceil c_{n_{\mathrm{samples}}}\,\frac{\alpha_{k}(s,a)\bigl(1-\alpha_{k}(s,a)\bigr)}{\varphi_{\mathrm{std}}(q_{k,s,a})^{2}}\right\rceil\right\}.\tag{EC.23}$$

For the fixed-level baselines BR-MDP-$\alpha$ with $\alpha\in\{0.1,0.3,0.5\}$, we use the same rule with $\alpha_{k}(s,a)$ replaced by the corresponding fixed value $\alpha$.
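As a quick check of (EC.23), the rule can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `quantile_sample_budget` and its argument names are hypothetical, and the standard normal quantile and density come from the standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def quantile_sample_budget(alpha, c_n_samples, cap=2048):
    """Posterior-sample budget from (EC.23), using a standard normal reference."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    std_normal = NormalDist()          # mean 0, standard deviation 1
    q = std_normal.inv_cdf(alpha)      # q = Phi^{-1}(alpha)
    density = std_normal.pdf(q)        # phi_std(q)
    raw = c_n_samples * alpha * (1.0 - alpha) / density ** 2
    return min(cap, math.ceil(raw))


print(quantile_sample_budget(0.1, 50))    # 147
print(quantile_sample_budget(0.5, 150))   # 236
```

With $\alpha=0.1$ and $c_{n_{\mathrm{samples}}}=50$ the rule returns $147$, which agrees with the value $M=147$ reported in Appendix EC.7.5. Levels closer to $0$ or $1$ receive larger budgets until the cap of $2048$ binds.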

Algorithm EC.2: Value Iteration for the Approximate $\alpha_{k}$-quantile BR-MDP

1. Input: posterior parameters $\phi_{k}$, approximate optimal value $\widehat{V}_{k-1}$ from pseudo-episode $k-1$, quantile schedule $\alpha_{k}$, per-transition reward function $r$, discount factor $\gamma$, Monte Carlo budget coefficient $c_{n_{\mathrm{samples}}}$, tolerance $\varepsilon_{\mathrm{VI}}$, maximum number of iterations $M_{\mathrm{VI}}$
2. for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
3. Compute $n_{k,s,a}$ from ([EC.23](https://arxiv.org/html/2605.24345#A7.E23)) and set $\ell_{k,s,a}\leftarrow\lceil n_{k,s,a}\alpha_{k}(s,a)\rceil$
4. Draw and fix posterior samples $P^{1}_{s,a},\ldots,P^{n_{k,s,a}}_{s,a}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{Dir}(\phi_{k}(s,a))$
5. end for
6. Initialize $V^{(0)}\leftarrow\widehat{V}_{k-1}$
7. for $m=0,1,\ldots,M_{\mathrm{VI}}-1$ do
8. for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ do
9. Using the fixed posterior samples, compute $Z_{j}^{(m)}(s,a)\leftarrow\sum_{s'\in\mathcal{S}}P_{s,a}^{j}(s')\left[r(s,a,s')+\gamma V^{(m)}(s')\right]$ for $j=1,\ldots,n_{k,s,a}$
10. Sort $Z_{(1)}^{(m)}(s,a)\leq\cdots\leq Z_{(n_{k,s,a})}^{(m)}(s,a)$
11. Set $Q_{k}^{(m+1)}(s,a)\leftarrow Z_{(\ell_{k,s,a})}^{(m)}(s,a)$
12. end for
13. Set $V^{(m+1)}(s)\leftarrow\max_{a\in\mathcal{A}}Q_{k}^{(m+1)}(s,a)$ for all $s\in\mathcal{S}$
14. if $\|V^{(m+1)}-V^{(m)}\|_{\infty}\leq\varepsilon_{\mathrm{VI}}$ then
15. break
16. end if
17. end for
18. Set $\widehat{V}_{k}\leftarrow V^{(m+1)}$
19. For each $s\in\mathcal{S}$, choose $\widehat{\pi}_{k}(s)\in\arg\max_{a\in\mathcal{A}}Q_{k}^{(m+1)}(s,a)$
20. return $\widehat{V}_{k}$ and $\widehat{\pi}_{k}$

The posterior transition samples in Algorithm [EC.2](https://arxiv.org/html/2605.24345#alg2) are drawn once at the beginning of pseudo-episode $k$ and held fixed throughout the value-iteration loop. Hence the empirical-quantile Bellman operator is deterministic within the pseudo-episode, and the stopping criterion is applied to a fixed sample-average approximation of the $\alpha_{k}$-quantile BR-MDP. The fixed-level baselines BRMDP-0.1, BRMDP-0.3, and BRMDP-0.5 use Algorithm [EC.2](https://arxiv.org/html/2605.24345#alg2) with $\alpha_{k}(s,a)\equiv 0.1$, $0.3$, and $0.5$, respectively.
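The value-iteration loop of Algorithm EC.2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name, the array layout `(S, A, S)`, the `seed` argument, and the toy instance are all hypothetical, and the per-pair budgets are passed in directly instead of being computed from (EC.23). The order-statistic index is additionally clipped to a valid range to guard against floating-point rounding in $\lceil n\alpha\rceil$.

```python
import numpy as np


def quantile_value_iteration(phi, alpha, reward, gamma, n_samples, v_init,
                             tol=1e-8, max_iter=10_000, seed=0):
    """Empirical-quantile value iteration with posterior samples drawn once.

    phi:       (S, A, S) Dirichlet posterior parameters
    alpha:     (S, A) quantile levels in (0, 1)
    reward:    (S, A, S) per-transition rewards r(s, a, s')
    n_samples: (S, A) integer Monte Carlo budgets
    v_init:    (S,) warm-start value vector
    """
    rng = np.random.default_rng(seed)
    S, A, _ = phi.shape
    # Draw the posterior samples once; they stay fixed across iterations.
    samples = [[rng.dirichlet(phi[s, a], size=int(n_samples[s, a]))
                for a in range(A)] for s in range(S)]
    # 1-based order-statistic index ceil(n * alpha), clipped to [1, n].
    ranks = np.clip(np.ceil(n_samples * alpha).astype(int), 1, n_samples)
    v = np.asarray(v_init, dtype=float).copy()
    q = np.zeros((S, A))
    for _ in range(max_iter):
        for s in range(S):
            for a in range(A):
                # Sampled one-step Bellman targets Z_j, one per posterior sample.
                z = samples[s][a] @ (reward[s, a] + gamma * v)
                q[s, a] = np.sort(z)[ranks[s, a] - 1]
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) <= tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)


# Toy instance (hypothetical): 3 states, 2 actions, Dir(1, 1, 1) posteriors.
S, A, gamma = 3, 2, 0.9
phi = np.ones((S, A, S))
reward = np.zeros((S, A, S))
reward[:, :, 2] = 1.0                      # reward 1 for landing in state 2
alpha = np.full((S, A), 0.2)
n_samples = np.full((S, A), 64)
v_hat, pi_hat = quantile_value_iteration(phi, alpha, reward, gamma,
                                         n_samples, np.zeros(S))
print(v_hat, pi_hat)
```

Because the samples are fixed, each sampled target is $\gamma$-Lipschitz in $V$ under the sup-norm and so is any order statistic of them, which is why the stopping rule on successive iterates is meaningful here.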

### EC.7.2 Implementations of Continuing PSRL

Continuing PSRL uses the same pseudo-episode mechanism as AQ-BRMDP. At the beginning of pseudo-episode $k$, it samples one transition kernel $\widetilde{P}_{k}$ from the current posterior by drawing $\widetilde{P}_{k,s,a}\sim\mathrm{Dir}(\phi_{k}(s,a))$ independently for all $(s,a)$. It then solves the sampled discounted MDP using standard value iteration and executes a greedy policy $\widetilde{\pi}_{k}$ throughout pseudo-episode $k$. When rewards are transition-dependent, the one-step reward in the sampled MDP is evaluated as $\sum_{s'}\widetilde{P}_{k,s,a}(s')\,r(s,a,s')$, so Continuing PSRL also avoids using the true expected reward under $P^{c}$ during learning.
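The planning step described above might look as follows in NumPy. It is a sketch under assumptions: the function name `continuing_psrl_plan`, the `(S, A, S)` array layout, and the zero initialization of the value vector are illustrative choices that the text does not specify.

```python
import numpy as np


def continuing_psrl_plan(phi, reward, gamma, rng, tol=1e-8, max_iter=10_000):
    """Sample one kernel from the posterior and solve the sampled MDP."""
    S, A, _ = phi.shape
    # One Dirichlet draw per (s, a), independently across pairs.
    p = np.array([[rng.dirichlet(phi[s, a]) for a in range(A)]
                  for s in range(S)])
    # Expected one-step reward under the sampled kernel, not the true one.
    r_bar = np.sum(p * reward, axis=2)
    v = np.zeros(S)
    q = r_bar.copy()
    for _ in range(max_iter):
        q = r_bar + gamma * (p @ v)        # standard Bellman optimality backup
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) <= tol
        v = v_new
        if converged:
            break
    return q.argmax(axis=1), v


# Toy usage (hypothetical): action 1 always pays 1, action 0 pays nothing.
rng = np.random.default_rng(0)
phi = np.ones((4, 2, 4))
reward = np.zeros((4, 2, 4))
reward[:, 1, :] = 1.0
policy, value = continuing_psrl_plan(phi, reward, gamma=0.8, rng=rng)
print(policy, value)
```

The returned greedy policy would then be held fixed for the whole pseudo-episode, mirroring how AQ-BRMDP uses its planned policy.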

### EC.7.3 Finite-State Experiment Schematics

Figure [EC.3](https://arxiv.org/html/2605.24345#A7.F3) provides schematic diagrams for the two finite-state experiment families used in Section [5](https://arxiv.org/html/2605.24345#S5).

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_1_rs6_schematic.png)

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_2_risky_branch_schematic.png)

Figure EC.3: Finite-state experiment schematics: RiverSwim-6 (left) and risky-branch FrozenLake (right). In the FrozenLake schematic, the shortcut at $(2,\mathrm{Down})$ moves the agent to state $10$ with probability $1-\theta$ and to the sticky hole at state $5$ with probability $\theta$.

### EC.7.4 Experiment-Specific Settings and Evaluation Details

For AQ-BRMDP, the floor parameter in the adaptive schedule is set to $\underline{\alpha}=0.2$ in all finite-state experiments unless otherwise stated. In all tabular BR-MDP planning calls, we use value-iteration tolerance $\varepsilon_{\mathrm{VI}}=10^{-8}$ and maximum iteration count $M_{\mathrm{VI}}=10{,}000$. The same stopping rule and iteration cap are used for Continuing PSRL, fixed-quantile BR-MDP, and AQ-BRMDP planning. The finite-state RiverSwim and risky-branch FrozenLake experiments were run on a MacBook using CPU execution without GPU acceleration. These experiments were substantially less computationally demanding than the continuous-state FrozenLake experiment reported below.

#### RiverSwim\.

The discount factor is $\gamma=0.9$, and the total interaction horizon is $T=4500$. We set the constants in ([EC.23](https://arxiv.org/html/2605.24345#A7.E23)) and ([14](https://arxiv.org/html/2605.24345#S3.E14)) to $(c_{n_{\mathrm{samples}}},\delta)=(150,5)$ for RiverSwim-6 and $(200,10)$ for RiverSwim-10. The plots for RiverSwim-6 and RiverSwim-10 are truncated to the first $2000$ and $4000$ time steps, respectively. The same truncation windows are used for the corresponding occupancy heatmaps.

#### Risky\-branch FrozenLake\.

In the risky-branch FrozenLake experiments, the discount factor is $\gamma=0.8$ and the total interaction horizon is $T=4500$. We set the constants in ([EC.23](https://arxiv.org/html/2605.24345#A7.E23)) and ([14](https://arxiv.org/html/2605.24345#S3.E14)) to $(c_{n_{\mathrm{samples}}},\delta)=(250,10)$.

### EC.7.5 Estimation of the Posterior $\alpha$-Quantile Value

We estimate the posterior $\alpha$-quantile value by evaluating the current policy under posterior samples of the transition kernel. For each independent run, each algorithm, and diagnostic time $t$, we take the current posterior parameter $\phi_{t}$ and the current policy $\pi_{t}$. We then draw $M=147$ independent transition kernels by drawing $P^{(m)}_{s,a}\sim\mathrm{Dir}(\phi_{t}(s,a))$, $m=1,\ldots,M$, independently for all $(s,a)$. For each sampled transition kernel $P^{(m)}$, we compute the value of $\pi_{t}$ by exact policy evaluation. In particular, letting $P^{(m),\pi_{t}}(s,s'):=P^{(m)}_{s,\pi_{t}(s)}(s')$ and $r^{(m),\pi_{t}}(s):=\sum_{s'\in\mathcal{S}}P^{(m)}_{s,\pi_{t}(s)}(s')\,r(s,\pi_{t}(s),s')$, we compute $V^{\pi_{t}}_{P^{(m)}}=(I-\gamma P^{(m),\pi_{t}})^{-1}r^{(m),\pi_{t}}$. We then form the $M$ sampled values at the initial state, $Y_{m}:=V^{\pi_{t}}_{P^{(m)}}(s_{0})$ for $m=1,\ldots,M$, and let $Y_{(1)}\leq\cdots\leq Y_{(M)}$ denote their order statistics. The empirical estimate of the posterior $0.1$-quantile value is

$$\widehat{V}^{\pi_{t},\mathrm{q}}_{\phi_{t},0.1}(s_{0}):=Y_{(\lceil 0.1M\rceil)}.\tag{EC.24}$$

We set $M=147$ according to ([EC.23](https://arxiv.org/html/2605.24345#A7.E23)) with $c_{n_{\mathrm{samples}}}=50$. This metric is computed every $200$ time steps. For each diagnostic time, we average the estimates over $100$ independent runs, and the $95\%$ confidence bands are computed across these independent runs.
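The diagnostic in (EC.24) can be sketched as below. The function name and argument layout are hypothetical. As a labeled shortcut, the sketch samples only the rows $P^{(m)}_{s,\pi_{t}(s)}$ that the policy actually uses; because the Dirichlet posteriors are independent across $(s,a)$, this gives the same distribution for $V^{\pi_{t}}_{P^{(m)}}$ as sampling the full kernel.

```python
import math

import numpy as np


def posterior_quantile_value(phi, policy, reward, gamma, s0,
                             alpha=0.1, n_models=147, seed=0):
    """Empirical posterior alpha-quantile of V^pi(s0), as in (EC.24)."""
    rng = np.random.default_rng(seed)
    S = phi.shape[0]
    states = np.arange(S)
    values = np.empty(n_models)
    for m in range(n_models):
        # Policy-induced transition matrix under the m-th posterior sample.
        p_pi = np.array([rng.dirichlet(phi[s, policy[s]]) for s in range(S)])
        # Expected one-step reward of the policy under the same sample.
        r_pi = np.sum(p_pi * reward[states, policy], axis=1)
        # Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi.
        v = np.linalg.solve(np.eye(S) - gamma * p_pi, r_pi)
        values[m] = v[s0]
    values.sort()
    return values[math.ceil(alpha * n_models) - 1]   # Y_(ceil(alpha * M))


# Toy usage (hypothetical posterior and policy).
phi = np.ones((4, 2, 4))
reward = np.zeros((4, 2, 4))
reward[:, :, 3] = 1.0
policy = np.array([0, 1, 0, 1])
print(posterior_quantile_value(phi, policy, reward, gamma=0.8, s0=0))
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly while computing the same quantity as $(I-\gamma P^{(m),\pi_{t}})^{-1}r^{(m),\pi_{t}}$.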

### EC.7.6 Sensitivity Analysis for the Schedule Parameter

We examine sensitivity to the schedule parameter $\delta$ on standard FrozenLake without the risky branch, holding the remaining settings fixed as described in Section [5.2](https://arxiv.org/html/2605.24345#S5.SS2). We compare $\delta\in\{5,10,15\}$ over $100$ independent runs per setting. The small differences in Figure [EC.4](https://arxiv.org/html/2605.24345#A7.F4) and Table [EC.1](https://arxiv.org/html/2605.24345#A7.T1) suggest that AQ-BRMDP is not materially sensitive to $\delta$ over the range tested.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/sec5_3_delta_true_regret.png)

Figure EC.4: Sensitivity of AQ-BRMDP to the schedule parameter $\delta$ on FrozenLake without the risky branch. The plot shows cumulative true regret.

Table EC.1: Final cumulative true regret in the sensitivity study. Each entry reports the mean over $100$ runs. Percentages are reductions relative to the $\delta=5$ setting.

### EC.7.7 Continuous-State FrozenLake Experiment

We further evaluate AQ\-BRMDP in a continuous\-state version of FrozenLake\. In this experiment, the state space is continuous, the posterior is placed on a parametric transition model, and the resulting Bellman backups are approximated using neural fitted Q iteration\.

#### Environment\.

The state space is $[0,4]^{2}$, partitioned into the same $4\times 4$ cells as the discrete FrozenLake layout. The start state is $(0.5,0.5)$, the hole cells are $\{5,7,11,12\}$, and the goal cell is $15$. The action space remains $\{\mathrm{Left},\mathrm{Right},\mathrm{Up},\mathrm{Down}\}$. A transition receives reward $1$ if the next state enters the goal cell and reward $0$ otherwise. From the goal cell, the next state is sampled from the known uniform reset distribution over the non-hole, non-goal cells.

For ordinary cells, the realized movement direction is the intended direction with probability $0.50$ and one of the two perpendicular directions with probability $0.25$ each. For hole cells, the agent moves in the intended direction with probability $p_{h}=0.20$ and otherwise remains in place. The continuous movement length is $L=\ell_{0}+U$, where $\ell_{0}=0.75$ and $U\sim\mathrm{Unif}(0,\theta_{L})$ with true value $\theta_{L}=0.50$. If a candidate next state leaves $[0,4]^{2}$, the agent remains in the current state.
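A single non-goal transition of this environment could be simulated roughly as follows. Several details are assumptions made only for this sketch and are not stated in the text: cells are indexed row-major with $x$ selecting the column and $y$ the row, $\mathrm{Down}$ increases $y$, and the goal reset is left out because the text does not say whether the reset is uniform over cell centers or over cell areas. The names `cell_of` and `move` are hypothetical.

```python
import numpy as np

HOLES, GOAL = {5, 7, 11, 12}, 15
DIRS = {"Left": (-1.0, 0.0), "Right": (1.0, 0.0),
        "Up": (0.0, -1.0), "Down": (0.0, 1.0)}
PERP = {"Left": ("Up", "Down"), "Right": ("Up", "Down"),
        "Up": ("Left", "Right"), "Down": ("Left", "Right")}


def cell_of(state):
    """Row-major cell index; x picks the column, y the row (an assumption)."""
    col = min(int(state[0]), 3)
    row = min(int(state[1]), 3)
    return 4 * row + col


def move(state, action, rng, p_h=0.20, ell0=0.75, theta_L=0.50):
    """One transition from a non-goal state; returns (next_state, reward)."""
    cell = cell_of(state)
    if cell == GOAL:
        raise ValueError("the goal reset is not modeled in this sketch")
    if cell in HOLES:
        # Hole cells: move as intended with probability p_h, else stay put.
        if rng.random() >= p_h:
            return state, 0.0
        direction = action
    else:
        # Ordinary cells: intended direction 0.50, each perpendicular 0.25.
        options = (action,) + PERP[action]
        direction = options[rng.choice(3, p=[0.5, 0.25, 0.25])]
    length = ell0 + rng.uniform(0.0, theta_L)      # L = ell0 + U
    dx, dy = DIRS[direction]
    candidate = (state[0] + length * dx, state[1] + length * dy)
    if not (0.0 <= candidate[0] <= 4.0 and 0.0 <= candidate[1] <= 4.0):
        candidate = state                          # leaving [0, 4]^2: stay
    reward = 1.0 if cell_of(candidate) == GOAL else 0.0
    return candidate, reward


rng = np.random.default_rng(0)
print(move((0.5, 0.5), "Down", rng))
```

Returning the realized direction and length as well would expose the latent primitives that the posterior updates rely on; they are omitted here to keep the sketch short.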

#### Posterior model\.

The unknown transition parameters are $(p_{\mathrm{slip}},p_{h},\theta_{L})$. We use the conjugate priors $p_{\mathrm{slip}}\sim\mathrm{Dir}(1,1,1)$, $p_{h}\sim\mathrm{Beta}(1,1)$, and $\theta_{L}\sim\mathrm{Pareto}(a_{0},x_{0})$ with $a_{0}=2$ and $x_{0}=0.25$. The simulator records latent primitives: $M_{t}$ for the ordinary-cell movement mode, $E_{t}$ for the hole mobility indicator, and $L_{t}$ for the attempted movement length when a movement is activated. The Dirichlet posterior is updated from counts of $M_{t}$, the Beta posterior is updated from counts of $E_{t}$, and the Pareto posterior is updated from observations of $U_{t}=L_{t}-\ell_{0}$. Goal reset transitions do not update the posterior.
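The three conjugate updates can be sketched with a small container class. The class and method names are hypothetical, and two conventions are assumed rather than taken from the text: $a_{0}$ is read as the Pareto shape and $x_{0}$ as the Pareto scale, and the first Beta parameter counts hole steps on which the agent moved. The Pareto update uses the standard conjugacy with a $\mathrm{Unif}(0,\theta)$ likelihood: each observation raises the shape by one and the scale becomes the running maximum of the prior scale and the observations.

```python
from dataclasses import dataclass, field


@dataclass
class TransitionPosterior:
    # Dir(1, 1, 1) over (intended, perpendicular-1, perpendicular-2) modes.
    slip_alpha: list = field(default_factory=lambda: [1.0, 1.0, 1.0])
    # Beta(1, 1) over the hole mobility probability p_h.
    hole_a: float = 1.0
    hole_b: float = 1.0
    # Pareto(a0, x0) over theta_L, read here as shape a0 and scale x0.
    pareto_shape: float = 2.0
    pareto_scale: float = 0.25

    def observe_mode(self, mode_index):
        """Count one ordinary-cell movement mode M_t."""
        self.slip_alpha[mode_index] += 1.0

    def observe_hole(self, moved):
        """Count one hole mobility indicator E_t."""
        if moved:
            self.hole_a += 1.0
        else:
            self.hole_b += 1.0

    def observe_length(self, u):
        """Update on U_t = L_t - ell_0 with a Unif(0, theta_L) likelihood."""
        self.pareto_shape += 1.0
        self.pareto_scale = max(self.pareto_scale, u)


post = TransitionPosterior()
post.observe_mode(0)
post.observe_hole(False)
for u in (0.10, 0.42, 0.31):
    post.observe_length(u)
print(post)
```

After the three length observations above, the Pareto posterior has shape $5$ and scale $0.42$, so every posterior draw of $\theta_{L}$ is at least the largest increment seen so far.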

#### Neural fitted Q approximation\.

The Q-function is represented by a neural network $Q_\omega(s,a)$ with architecture $20\to 64\to 64\to 4$, where the four outputs correspond to the four actions. The $20$ input features consist of the normalized two-dimensional coordinates, a $16$-dimensional cell one-hot vector, and two hole/goal indicators. At the start of each pseudo-episode, the network is warm-started from the previous pseudo-episode; it is not reinitialized.
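One way to assemble the $20$-dimensional input is sketched below. The normalization by the side length $4$, the row-major cell indexing, and the ordering of the features are assumptions made for illustration; the text specifies only which feature groups are present.

```python
HOLES = {5, 7, 11, 12}
GOAL = 15


def features(x, y):
    """Return the 20 input features for the continuous state (x, y)."""
    col = min(int(x), 3)
    row = min(int(y), 3)
    cell = 4 * row + col            # assumed row-major cell index
    one_hot = [0.0] * 16
    one_hot[cell] = 1.0
    coords = [x / 4.0, y / 4.0]     # assumed normalization to [0, 1]
    flags = [float(cell in HOLES), float(cell == GOAL)]
    return coords + one_hot + flags
```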

For a sampled posterior model $\theta^b$ and a Q-function $Q$, define the posterior-model Bellman target

$$G^b(s,a;Q):=\mathbb{E}_{s'\sim P_{\theta^b}(\cdot\mid s,a)}\left[R(s,a,s')+\gamma\max_{a'}Q(s',a')\right].$$

For AQ-BRMDP, the Bellman target is the empirical $\alpha_k(s,a)$-quantile of $G^b(s,a;Q)$ over posterior samples. For PSRL, a single posterior model is sampled and the risk-neutral Bellman target under that sampled model is used. The expectations in these targets are estimated by Monte Carlo simulation from the corresponding model.
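The two target constructions can be sketched as below. The quantile estimator used here is the order statistic $\inf\{x: F_n(x)\ge\alpha\}$, which is an assumption since the text does not state which empirical quantile definition is used. The argument `q_fn`, which returns the list of action values at a state, is also a hypothetical interface.

```python
import math


def mc_bellman_target(next_samples, q_fn, gamma):
    """Monte Carlo estimate of G^b(s, a; Q) under one posterior model.

    next_samples: list of (reward, next_state) pairs simulated from that model.
    q_fn: maps a state to the list of Q-values over actions (assumed interface).
    """
    total = 0.0
    for reward, s_next in next_samples:
        total += reward + gamma * max(q_fn(s_next))
    return total / len(next_samples)


def empirical_quantile(values, alpha):
    """Smallest order statistic whose empirical CDF is at least alpha."""
    ordered = sorted(values)
    k = max(1, math.ceil(alpha * len(ordered)))
    return ordered[k - 1]


def aq_brmdp_target(targets_per_model, alpha):
    """AQ-BRMDP target: alpha-quantile of G^b across posterior samples b."""
    return empirical_quantile(targets_per_model, alpha)


def psrl_target(targets_per_model):
    """PSRL target: the risk-neutral value under a single sampled model."""
    return targets_per_model[0]
```

A level $\alpha$ close to $1$ selects an upper-tail target (optimistic), while a small $\alpha$ selects a lower-tail target (pessimistic).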

#### Adaptive quantile schedule in the continuous setting\.

The normalization set consists of the 15 cell centers excluding the goal center $(3.5,3.5)$. For each queried $(s,a)$, the posterior predictive uncertainty $u_k(s,a)$ is computed as the standard deviation of $G^b(s,a;Q_{\omega_{k-1}})$ across posterior samples, and $c_k(s,a)=1/(u_k(s,a)+\varepsilon)$. Let $\bar{c}_k$ be the average of $c_k(\tilde{s},a)$ over the non-goal center states and actions. For non-goal states,

$$\alpha_k(s,a)=\mathrm{clip}\left(1-\delta\,\frac{c_k(s,a)}{\bar{c}_k}\,g_k,\ \underline{\alpha},\ 0.95\right),\qquad g_k=\frac{\log(2k)}{\sqrt{k}}.$$

For goal states, all actions are equivalent and we set $\alpha_k(s,a)=\mathrm{clip}(1-\delta g_k,\underline{\alpha},0.95)$. The posterior sample budget used to estimate each quantile is selected by the same rule as in ([EC.23](https://arxiv.org/html/2605.24345#A7.E23)) and capped at $800$ posterior samples.
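The schedule for non-goal and goal states can be sketched as follows. The values of $\delta$, $\underline{\alpha}$, and $\varepsilon$ are not given in this section, so the defaults and arguments below are placeholders, and the function names are hypothetical.

```python
import math


def g(k):
    """Decay factor g_k = log(2k) / sqrt(k) for pseudo-episode k >= 1."""
    return math.log(2 * k) / math.sqrt(k)


def clip(value, low, high):
    return min(max(value, low), high)


def alpha_non_goal(u_sa, u_centers, k, delta, alpha_low, eps=1e-8):
    """Quantile level for a non-goal (s, a).

    u_sa: posterior std of the Bellman target at the queried (s, a).
    u_centers: posterior stds at all non-goal center states and actions.
    """
    c_sa = 1.0 / (u_sa + eps)
    c_bar = sum(1.0 / (u + eps) for u in u_centers) / len(u_centers)
    return clip(1.0 - delta * (c_sa / c_bar) * g(k), alpha_low, 0.95)


def alpha_goal(k, delta, alpha_low):
    """Quantile level for goal states, where all actions are equivalent."""
    return clip(1.0 - delta * g(k), alpha_low, 0.95)
```

Under this rule a pair with larger posterior uncertainty has a smaller $c_k(s,a)$ and therefore a higher quantile level, and every level rises toward the cap $0.95$ as $g_k\to 0$.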

Algorithm EC.3 Continuous-State AQ-BRMDP with Neural Fitted Q Approximation

1: Input: initial posterior, initial Q-network parameter $\omega_0$, schedule parameters, and replay buffer $\mathcal{B}\leftarrow\emptyset$
2: Initialize pseudo-episode index $k\leftarrow 0$ and restart indicator $X_1\leftarrow 0$
3: for $t=1,\ldots,T$ do
4:   if $X_t=0$ then
5:     $k\leftarrow k+1$
6:     Update the posterior using the observed latent primitives
7:     Draw posterior model samples $\{\theta_k^b\}$ from the current posterior
8:     Compute $\bar{c}_k$ over non-goal center states and fix the rule defining $\alpha_k(s,a)$ for this pseudo-episode
9:     $\omega_k\leftarrow\mathrm{FitQ}(\omega_{k-1},\alpha_k,\{\theta_k^b\},\mathcal{B})$ using fitted-Q regression
10:   end if
11:   Take action $a_t\in\arg\max_{a\in\mathcal{A}}Q_{\omega_k}(s_t,a)$
12:   Observe $s_{t+1}$, reward $r_t$, and latent primitives when applicable
13:   Append the transition and latent primitives to $\mathcal{B}$
14:   Sample $X_{t+1}\sim\mathrm{Bernoulli}(\gamma)$ if $t<T$
15: end for
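The restart mechanism in steps 2 to 4 and 14 of Algorithm EC.3 can be isolated in a few lines. The sketch below only tracks the time steps at which a new pseudo-episode begins; the posterior update, model sampling, and fitted-Q call of steps 5 to 9 are left as a comment, and the function name is hypothetical.

```python
import random


def pseudo_episode_starts(T, gamma, rng):
    """Return the time steps t at which a new pseudo-episode starts."""
    starts = []
    x = 0  # X_1 = 0 forces a pseudo-episode to begin at t = 1
    for t in range(1, T + 1):
        if x == 0:
            # Steps 5-9 would run here: update the posterior, draw models,
            # fix the quantile rule, and refit the Q-network.
            starts.append(t)
        # Steps 11-13 (act, observe, store) are omitted in this sketch.
        if t < T:
            # Step 14: continue the pseudo-episode w.p. gamma, restart otherwise.
            x = 1 if rng.random() < gamma else 0
    return starts
```

Because each step restarts with probability $1-\gamma$, pseudo-episode lengths are geometric with mean $1/(1-\gamma)$, ignoring truncation at the horizon $T$.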

The fitted-Q subroutine initializes at $\omega_{k-1}$ and repeats a standard fitted-Q loop: sample training states, construct clipped Monte Carlo Bellman targets for each state-action pair, and regress $Q_\omega$ onto those targets. For AQ-BRMDP, the target is the empirical $\alpha_k(s,a)$-quantile over posterior models; for PSRL, it is the risk-neutral target under one sampled model.

The continuous\-state experiment compares AQ\-BRMDP with a continuous\-state PSRL baseline using the same pseudo\-episode mechanism, posterior model, Q\-network architecture, and fitted\-Q solver\.

![Refer to caption](https://arxiv.org/html/2605.24345v1/figs/continuous_true_regret.png)

Figure EC.5: Cumulative true regret in continuous-state FrozenLake.

The continuous-state results show that AQ-BRMDP can be implemented with a parametric posterior model and neural fitted-Q approximation in a continuous state space. Figure [EC.5](https://arxiv.org/html/2605.24345#A7.F5) shows that AQ-BRMDP attains smaller cumulative true regret than the continuous-state PSRL baseline while retaining the adaptive quantile mechanism used in the finite-state experiments.

The continuous-state FrozenLake experiments were run on a single NVIDIA GeForce RTX 4090 GPU with 24 GB GPU memory. The mean runtime per independent run was $21.2$ minutes for AQ-BRMDP and $5.5$ minutes for Continuing PSRL, with the difference mainly due to posterior quantile estimation.