On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

arXiv cs.LG 05/22/26, 04:00 AM Papers
Summary
This paper studies risk-sensitive reinforcement learning in finite discounted MDPs with a generative model, focusing on the sample complexity of learning optimal value functions and policies under the optimized certainty equivalent (OCE) risk measure. It provides exact conditions for PAC-learnability, analyzes a model-based approach, and establishes tight lower bounds, including an improved dependence on the risk parameter for CVaR.
arXiv:2605.21763v1 Announce Type: new Abstract: We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family or risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon $\frac{1}{1-\gamma}$ explicit. Specifically, for $\text{CVaR}_\tau$ we show that the correct dependence on $\tau$ is $\frac{1}{\tau^2}$, thus improving by a factor of $\frac{1}{\tau}$ over state-of-the-art although our bound has a suboptimal dependence on $\frac{1}{1-\gamma}$.
Original Article
View Cached Full Text
Cached at: 05/22/26, 08:51 AM
# On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents
Source: [https://arxiv.org/html/2605.21763](https://arxiv.org/html/2605.21763)
Mohammad Sadegh Talebi222Department of Computer Science, University of Copenhagen\. Email:sadegh\.talebi@di\.ku\.dk\.

###### Abstract

We study risk\-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available\. We consider a family or risk measures called the optimized certainty equivalent \(OCE\), which includes important risk measures such as entropic risk, CVaR, and mean\-variance\. Our focus is on the sample complexities of learning the optimal state–action value function \(value learning\) and an optimal policy \(policy learning\) under recursive OCE\. We provide an exact characterization of utility functionsuufor which the corresponding OCE defines an objective that is PAC\-learnable\. We analyze a simple model\-based approach and derive PAC sample complexity bounds\. We establish that wheneveruudoes not have full domaindom\(u\)≠ℝ\\text\{dom\}\(u\)\\neq\\mathbb\{R\}, the corresponding problem is not PAC learnable\. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the sizeSASAof state\-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon11−γ\\frac\{1\}\{1\-\\gamma\}explicit\. Specifically forCVaRτ\\text\{CVaR\}\_\{\\tau\}we show that the correct dependence onτ−1\\tau^\{\-1\}is1τ2\\frac\{1\}\{\\tau^\{2\}\}improving by a factor of1τ\\frac\{1\}\{\\tau\}over state\-of\-the\-art although our bound has a suboptimal dependence on11−γ\\frac\{1\}\{1\-\\gamma\}\.

## 1Introduction

In reinforcement learning \(RL\), the standard objective is to maximize the expected return, defined as the \(possibly discounted\) sum of rewards\[[64](https://arxiv.org/html/2605.21763#bib.bib22)\]\. However, because this objective is inherently*risk\-neutral*, it may be inadequate for many high\-stakes application domains, such as treatment\[[20](https://arxiv.org/html/2605.21763#bib.bib46)\], finance\[[58](https://arxiv.org/html/2605.21763#bib.bib49),[9](https://arxiv.org/html/2605.21763#bib.bib48)\], operations research\[[16](https://arxiv.org/html/2605.21763#bib.bib47)\], and transportation\[[35](https://arxiv.org/html/2605.21763#bib.bib102)\]\. Such applications demand for account the variability of returns, and risks thereof\. One principled approach to addressing this limitation is to optimize a risk measure of the return distribution\. which using concave risk measures leads to well\-defined optimization problems\. Notable risk measures include mean\-variance\[[42](https://arxiv.org/html/2605.21763#bib.bib2)\], value\-at\-risk \(VaR\)\[[17](https://arxiv.org/html/2605.21763#bib.bib5)\], Conditional VaR \(CVaR\)\[[59](https://arxiv.org/html/2605.21763#bib.bib37)\], entropic risk\[[29](https://arxiv.org/html/2605.21763#bib.bib50)\], and entropic VaR \(EVaR\)\[[2](https://arxiv.org/html/2605.21763#bib.bib38)\], all of which have been applied to a wide\-range of scenarios\. Among these, CVaR has become particularly popular for modeling risk\-sensitivity in MDPs\[[15](https://arxiv.org/html/2605.21763#bib.bib51),[10](https://arxiv.org/html/2605.21763#bib.bib52),[13](https://arxiv.org/html/2605.21763#bib.bib53),[6](https://arxiv.org/html/2605.21763#bib.bib54)\], mainly due to a delicate control it offers for the undesirable tail of return distribution\. ERM, as another popular notion, has long been considered for risk\-sensitive control in MDPs and RL\[[29](https://arxiv.org/html/2605.21763#bib.bib50),[11](https://arxiv.org/html/2605.21763#bib.bib32),[28](https://arxiv.org/html/2605.21763#bib.bib31),[30](https://arxiv.org/html/2605.21763#bib.bib23),[21](https://arxiv.org/html/2605.21763#bib.bib25)\]\. However, much of the existing literature focuses on undiscounted settings, despite the prevalence of discounted MDPs; see, e\.g\.,\[[7](https://arxiv.org/html/2605.21763#bib.bib18),[28](https://arxiv.org/html/2605.21763#bib.bib31),[50](https://arxiv.org/html/2605.21763#bib.bib65)\]for notable exceptions\.

In this paper, we study risk\-sensitive RL for discounted MDPs, assuming access to a generative model of the MDP, which is a simulator that generates samples from the true MDP for arbitrary state\-action pairs\. We consider a broad and important class of risk measure that includes risk measures that can be expressed as an Optimized Certainty Equivalent \(OCE\)\[[8](https://arxiv.org/html/2605.21763#bib.bib89)\]\. We refer to Section[2](https://arxiv.org/html/2605.21763#S2)for definitions, and to Appendix[A](https://arxiv.org/html/2605.21763#A1)for a more detailed primer on risk measures\. The class of OCE measures captures many important risks such as CVaR, ERM, and mean\-variance, as special cases \(that are derived under suitable choices of utility functions\)\.

In risk\-sensitive RL, objectives can be formulated in two ways: In the*non\-recursive*\(also called non\-iterated or static\) formulation, the risk measure is directly applied to the total return\[[12](https://arxiv.org/html/2605.21763#bib.bib34),[11](https://arxiv.org/html/2605.21763#bib.bib32),[27](https://arxiv.org/html/2605.21763#bib.bib96)\], while in the*recursive*formulation \(also called iterated, nested, or dynamic\), the risk measure is applied at every stepttto the reward\-to\-go\[[3](https://arxiv.org/html/2605.21763#bib.bib16),[4](https://arxiv.org/html/2605.21763#bib.bib10),[18](https://arxiv.org/html/2605.21763#bib.bib69)\]\. Thenon\-recursive approach may allow the agent to visit high\-risk states, even though the risk of the entire trajectory is still controlled, which might be unacceptable in many safety\-critical applications\. In contrast, the recursive approach may lead to a more cautious behavior by controlling risk at every step, which can be either desirable or overly conservative depending on the application\[[18](https://arxiv.org/html/2605.21763#bib.bib69),[68](https://arxiv.org/html/2605.21763#bib.bib87),[66](https://arxiv.org/html/2605.21763#bib.bib86)\]\. Due to these qualitative differences, the two are considered as orthogonal modeling choices\. From a technical standpoint, a key distinction is that non\-recursive formulations do not generally admit Bellman\-type optimality equations and may result in time\-inconsistent optimal policies \(see\[[32](https://arxiv.org/html/2605.21763#bib.bib44)\]\), whereas recursive formulations preserve Bellman\-type optimality structures\. Motivated by these considerations, we study risk\-sensitive discounted RL with objectives defined via the recursive OCE\.

### 1\.1Main Contributions

We consider risk\-sensitive RL in tabular discounted MDPs under recursive OCE in the generative setting\. Learning performance is assessed in terms of sample complexity, defined as the total numberTTof samples required, for given\(ε,δ\)\(\\varepsilon,\\delta\), to obtain either anε\\varepsilon\-optimal policy \(the*policy learning*problem\) or anε\\varepsilon\-close approximation of the optimal Q\-value in the max\-norm \(the*value learning*problem\), with probability exceeding1−δ1\-\\delta\.

We make the following contributions\. We propose a model\-based algorithm, called Model\-Based OCE Value Iteration \(MB\-OCE\-VI\), and establish PAC\-type bounds on its sample complexity for both value learning and policy learning \(Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)\), under the assumption that the utility functionuudefining the OCE belongs to𝕌1\\mathbb\{U\}\_\{1\}, which is essentially the set of utilities with a full domain —for a more precise definition and context, see Subsection[2\.1](https://arxiv.org/html/2605.21763#S2.SS1)\. These bounds have an optimal dependence on \(up to log\-factors\) on the size of state\-action space,SASA, and hold simultaneously for all OCE objectives defined using utilities in𝕌1\\mathbb\{U\}\_\{1\}\.

We further show that the restriction to OCEs associated to utilitiesu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}is necessary\. We prove this claim by establishing impossibility results on the PAC sample complexity bound for when OCE is defined by a utility functionu∉𝕌1u\\notin\\mathbb\{U\}\_\{1\}\. We thus provide an exact characterization of OCE measures, which are PAC learnable under value and policy learning\.

Finally, we establish worst\-case lower bounds on the sample complexity under OCEs\. For each problem, we present two lower bounds\. The first one \(Theorems[6\.1](https://arxiv.org/html/2605.21763#S6.SS1)and[6\.2](https://arxiv.org/html/2605.21763#S6.SS2)\) is a general lower bound that holds for any OCE defined by a utilityu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}\. These lower bounds has an optimal dependence \(up to log\-factors\) onS,A,1δ,εS,A,\\frac\{1\}\{\\delta\},\\varepsilon, but a complicated dependence on1/\(1−γ\)1/\(1\-\\gamma\)\. The second lower bound \(Theorems[6\.1](https://arxiv.org/html/2605.21763#S6.SS1)and[6\.2](https://arxiv.org/html/2605.21763#S6.SS2)\) has an explicit dependence on1/\(1−γ\)1/\(1\-\\gamma\), but holds for a sub\-class of utilities in𝕌1\\mathbb\{U\}\_\{1\}\. Nevertheless, we show that this sub\-class includes all strongly risk\-averse coherent OCE risk\-measures that notably includes CVaR\. We show that the latter lower bound can outperform the best existing lower bound forCVaRτ\\text\{CVaR\}\_\{\\tau\}in the regime whereτ\\tauis very small\. To the best of our knowledge, these results constitute the first upper and lower bounds on the sample complexity of recursive OCE in discounted MDPs, and are first impossibility results for RL under OCEs\.

### 1\.2Related Work

##### Risk\-neutral discounted RL\.

There is a rich literature on provably\-sample efficient learning algorithms in tabular discounted MDPs, encompassing a variety of settings such as the generative setting\[[34](https://arxiv.org/html/2605.21763#bib.bib41),[26](https://arxiv.org/html/2605.21763#bib.bib30),[1](https://arxiv.org/html/2605.21763#bib.bib15),[60](https://arxiv.org/html/2605.21763#bib.bib56),[45](https://arxiv.org/html/2605.21763#bib.bib28),[33](https://arxiv.org/html/2605.21763#bib.bib82)\], the offline \(or batch\) setting\[[55](https://arxiv.org/html/2605.21763#bib.bib39),[43](https://arxiv.org/html/2605.21763#bib.bib61)\], and the online setting\[[63](https://arxiv.org/html/2605.21763#bib.bib8),[40](https://arxiv.org/html/2605.21763#bib.bib6)\]\. In the case of generative setting, which we also consider, early work includes\[[34](https://arxiv.org/html/2605.21763#bib.bib41),[36](https://arxiv.org/html/2605.21763#bib.bib55)\], which was further improved and followed up by a ling of work, notably\[[26](https://arxiv.org/html/2605.21763#bib.bib30),[60](https://arxiv.org/html/2605.21763#bib.bib56),[67](https://arxiv.org/html/2605.21763#bib.bib57),[44](https://arxiv.org/html/2605.21763#bib.bib29),[33](https://arxiv.org/html/2605.21763#bib.bib82)\]\. Azar et al\.\[[26](https://arxiv.org/html/2605.21763#bib.bib30)\]provide the first minimax\-optimal sample complexity bounds of𝒪~\(SAε2\(1−γ\)3\)\\widetilde\{\\cal O\}\\big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{3\}\}\\big\)for both value learning and policy learning, albeit for substantially limitedε\\varepsilon\-ranges, which is attained by simple model\-based methods\. Further, they establish a lower bound ofΩ~\(SAε2\(1−γ\)3\)\\widetilde\{\\Omega\}\\big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{3\}\}\\big\)for value learning\. Model\-free methods are presented in more recent subsequent work such as\[[60](https://arxiv.org/html/2605.21763#bib.bib56),[67](https://arxiv.org/html/2605.21763#bib.bib57),[33](https://arxiv.org/html/2605.21763#bib.bib82)\]\. Notably,\[[44](https://arxiv.org/html/2605.21763#bib.bib29)\]has recently established an optimal bound valid for the entireε\\varepsilon\-range, using model\-based methods built via the empirical MDP but with reward perturbations or conservative planning\. We note that existing optimal sample complexities rely on techniques that crucially exploit the additivity of the return in terms of rewards; this structural property generally fails for risk\-sensitive objectives, and the corresponding techniques do not carry over\.

##### Risk\-sensitive RL\.

There exists a substantial literature on decision making under a risk measure in bandit and RL settings\. In bandits, risk\-sensitive objectives are typically studied through regret minimization; see, e\.g\.,\[[57](https://arxiv.org/html/2605.21763#bib.bib62),[47](https://arxiv.org/html/2605.21763#bib.bib63),[37](https://arxiv.org/html/2605.21763#bib.bib64)\]\. Extensions to MDPs introduce substantially richer structural and algorithmic challenges\. The literature on RL under risk measures may be broadly categorized by the way the risk measure is applied \(recursive vs\. non\-recursive\) as well as the type of risk measure studied\. Representative examples include CVaR\[[18](https://arxiv.org/html/2605.21763#bib.bib69),[19](https://arxiv.org/html/2605.21763#bib.bib70),[14](https://arxiv.org/html/2605.21763#bib.bib71),[39](https://arxiv.org/html/2605.21763#bib.bib74)\], ERM\[[11](https://arxiv.org/html/2605.21763#bib.bib32),[51](https://arxiv.org/html/2605.21763#bib.bib95),[49](https://arxiv.org/html/2605.21763#bib.bib66),[48](https://arxiv.org/html/2605.21763#bib.bib67),[28](https://arxiv.org/html/2605.21763#bib.bib31)\], mean\-variance risk\[[62](https://arxiv.org/html/2605.21763#bib.bib100),[31](https://arxiv.org/html/2605.21763#bib.bib101),[38](https://arxiv.org/html/2605.21763#bib.bib73)\], and EVaR\[[53](https://arxiv.org/html/2605.21763#bib.bib92),[25](https://arxiv.org/html/2605.21763#bib.bib103)\]\. Among these, CVaR has been the most extensively studied\. Under recursive CVaR,\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\]analyzes sample complexity in the generative setting and provides a lower bound\. Under recursive ERM, recent works such as\[[21](https://arxiv.org/html/2605.21763#bib.bib25),[22](https://arxiv.org/html/2605.21763#bib.bib27),[23](https://arxiv.org/html/2605.21763#bib.bib26),[30](https://arxiv.org/html/2605.21763#bib.bib23),[46](https://arxiv.org/html/2605.21763#bib.bib24)\]study online episodic RL in the regret setting\. To the best of our knowledge, existing work on discounted MDPs under recursive ERM is limited to planning; a notable example is\[[4](https://arxiv.org/html/2605.21763#bib.bib10)\], which provides a thorough theoretical treatment but does not propose learning algorithms\.

There exists a line work in risk\-sensitive RL and control that develop algorithms for an entire family of risk measures\. Two notable families studied in this context are coherent risk measures and OCEs\. While entropic risk is not coherent, it belongs to the OCE; a brief overview of risk measures is provided in Appendix[A](https://arxiv.org/html/2605.21763#A1)\. Existing results for OCE risks\[[66](https://arxiv.org/html/2605.21763#bib.bib86),[68](https://arxiv.org/html/2605.21763#bib.bib87),[56](https://arxiv.org/html/2605.21763#bib.bib97),[41](https://arxiv.org/html/2605.21763#bib.bib4)\]do not address provably sample\-efficient learning under recursive entropic risk in discounted MDPs\. Furthermore, results for coherent risks\[[54](https://arxiv.org/html/2605.21763#bib.bib77),[65](https://arxiv.org/html/2605.21763#bib.bib72),[39](https://arxiv.org/html/2605.21763#bib.bib74),[69](https://arxiv.org/html/2605.21763#bib.bib93)\]do not apply to entropic risk\. In particular,\[[56](https://arxiv.org/html/2605.21763#bib.bib97)\]considers offline RL in discounted MDPs under recursive OCE but does not provide sample\-complexity guarantees\. We also note that a connection between MDPs with recursive coherent risks and distributionally robust MDPs has been established in\[[4](https://arxiv.org/html/2605.21763#bib.bib10)\]\.

## 2Setting: RL under Recursive OCE

##### Notations\.

Forn∈ℕn\\in\\mathbb\{N\}, let\[n\]:=\{1,…,n\}\[n\]:=\\\{1,\\ldots,n\\\}\.𝟙A\\mathbbmss\{1\}\_\{A\}denotes the indicator function of an eventAA\. Given a set𝒳\\mathcal\{X\},Δ\(𝒳\)\\Delta\(\\mathcal\{X\}\)denotes the probability simplex over𝒳\\mathcal\{X\}\. We use the convention that∥⋅∥:=∥⋅∥∞\\\|\\cdot\\\|:=\\\|\\cdot\\\|\_\{\\infty\}\. We letL∞\(Ω,ℱ,ℙ\)L^\{\\infty\}\(\\Omega,\\mathcal\{F\},\\mathbb\{P\}\)denote the space of essentially bounded random variables on the probability space\(Ω,ℱ,ℙ\)\(\\Omega,\\mathcal\{F\},\\mathbb\{P\}\)\.

### 2\.1Optimized Certainty Equivalents

For risk\-averse agents it is natural to be able to rank different random variables based on risk\-measures\. In\[[8](https://arxiv.org/html/2605.21763#bib.bib89)\]they propose the optimised certainty equivalent of a class of utility functions𝕌0\\mathbb\{U\}\_\{0\}to be defined shortly\. This approach not only provides a direct link between microeconomic foundations and risk but also form a unifying framework as risk measures derived as OCEs are guaranteed to be convex and includes several well\-known risk measures such as entropic risk, the mean\-variance criterion, and conditional value\-at\-risk\.

Letu:ℝ→\[−∞,∞\)u:\\mathbb\{R\}\\rightarrow\[\-\\infty,\\infty\)be a closed, proper concave, non\-decreasing function satisfying thatu\(0\)=0u\(0\)=0and1∈∂u\(0\)1\\in\\partial u\(0\), where∂u\\partial udenotes the superdifferential ofuu\. Further, definedom\(u\):=\{t∈ℝ\|u\(t\)\>−∞\}\\text\{dom\}\(u\):=\\\{t\\in\\mathbb\{R\}\|u\(t\)\>\-\\infty\\\}\. The collection of all such functions are denoted by𝕌0\\mathbb\{U\}\_\{0\}\. A subset of great interest in this paper is the collection of finite utility functions, i\.e\. those wheredom\(u\)=ℝ\\text\{dom\}\(u\)=\\mathbb\{R\}, which we denote by𝕌1\\mathbb\{U\}\_\{1\}; that is,𝕌1=\{u∈𝕌0\|dom\(u\)=ℝ\}\\mathbb\{U\}\_\{1\}=\\\{u\\in\\mathbb\{U\}\_\{0\}\|\\text\{dom\}\(u\)=\\mathbb\{R\}\\\}\. We further define the subclass of strongly risk\-averse utility functions𝕌1<:=\{u∈𝕌1\|∀t≠0:u\(t\)<t\}\\mathbb\{U\}\_\{1\}^\{<\}:=\\\{u\\in\\mathbb\{U\}\_\{1\}\|\\forall t\\neq 0:u\(t\)<t\\\}\.

The optimized certainty equivalent \(OCE\) ofuuis the mapOCEu:L∞\(Ω,ℱ,ℙ\)→ℝ\\mathrm\{OCE\}^\{u\}:L^\{\\infty\}\(\\Omega,\\mathcal\{F\},\\mathbb\{P\}\)\\rightarrow\\mathbb\{R\}defined as

OCEu\(X\)=supη∈ℝ\(η\+𝔼\[u\(X−η\)\]\)\.\\displaystyle\\mathrm\{OCE\}^\{u\}\(X\)=\\sup\_\{\\eta\\in\\mathbb\{R\}\}\(\\eta\+\\mathbb\{E\}\[u\(X\-\\eta\)\]\)\\,\.The notion of OCE was first introduced in\[[8](https://arxiv.org/html/2605.21763#bib.bib89)\], where they also show that the negative of the OCE is a convex risk\-measure and thatOCEu\(X\)≤𝔼\[X\]\\mathrm\{OCE\}^\{u\}\(X\)\\leq\\mathbb\{E\}\[X\]\. For brevity, we will often use the notationρ:=OCEu\\rho:=\\mathrm\{OCE\}^\{u\}to refer to the OCE associated to the utility functionuu\. The OCEρ\\rhoadmits the following properties: \(i\)ρ\(0\)=0\\rho\(0\)=0\(normalization\), \(ii\)X≤YX\\leq Yimpliesρ\(X\)≤ρ\(Y\)\\rho\(X\)\\leq\\rho\(Y\)\(monotonicity\), and \(iii\)ρ\(c\)=c\\rho\(c\)=cforc∈ℝc\\in\\mathbb\{R\}\(consistency\)\.

Let𝒮\\mathcal\{S\}be a finite set with sizeS:=\|𝒮\|S:=\|\\mathcal\{S\}\|, andXXbe a random variable with support\{v\(s\)\}s∈𝒮\\\{v\(s\)\\\}\_\{s\\in\\mathcal\{S\}\}with probabilities given byℙ\(X=v\(s\)\)=p\(s\)\\mathbb\{P\}\(X=v\(s\)\)=p\(s\)\. We introduce the short\-handρp\(v\(s\)\):=ρ\(X\)\\rho\_\{p\}\(v\(s\)\):=\\rho\(X\)\.

### 2\.2Discounted Markov Decision Processes and Recursive OCE Objectives

A discounted Markov decision process \(MDP\) is a55\-tupleM=\(𝒮,𝒜,P,R,γ\)M=\(\\mathcal\{S\},\\mathcal\{A\},P,R,\\gamma\), where𝒮=\{1,2,…,S\}\\mathcal\{S\}=\\\{1,2,\\ldots,S\\\}is the finite state space of sizeS:=\|𝒮\|S:=\|\\mathcal\{S\}\|,𝒜=\{1,2,…,A\}\\mathcal\{A\}=\\\{1,2,\\ldots,A\\\}is the finite action space of sizeA:=\|𝒜\|A:=\|\\mathcal\{A\}\|,P:𝒮×𝒜→Δ\(𝒮\)P:\\mathcal\{S\}\\times\\mathcal\{A\}\\rightarrow\\Delta\(\\mathcal\{S\}\)is the transition probability function,R:𝒮×𝒜→\[0,1\]R:\\mathcal\{S\}\\times\\mathcal\{A\}\\rightarrow\[0,1\]is the deterministic reward function, andγ∈\(0,1\)\\gamma\\in\(0,1\)is the discount factor\. A stationary deterministic policy is a mapπ:𝒮→𝒜\\pi:\\mathcal\{S\}\\rightarrow\\mathcal\{A\}\. The agents interaction with the MDPMMis as follows\. At initialization of the process,MMis in some initial states0∈𝒮s\_\{0\}\\in\\mathcal\{S\}\. At each timet≥0t\\geq 0, the agent is in statest∈𝒮s\_\{t\}\\in\\mathcal\{S\}and decides on an actionat∈𝒜a\_\{t\}\\in\\mathcal\{A\}\. The MDP generates a rewardrt:=R\(st,at\)r\_\{t\}:=R\(s\_\{t\},a\_\{t\}\)and a next\-statest\+1∼P\(⋅\|st,at\)s\_\{t\+1\}\\sim P\(\\cdot\|s\_\{t\},a\_\{t\}\)\. The MDP moves tost\+1s\_\{t\+1\}when the next time slot begins, and this process continues ad infinitum\. This process yields a growing sequence\(st,at,rt\)t≥0\(s\_\{t\},a\_\{t\},r\_\{t\}\)\_\{t\\geq 0\}\.

The agent’s goal is to maximize an objective function, as a function of the collected rewards\(rt\)t≥0\(r\_\{t\}\)\_\{t\\geq 0\}, which depends on bothγ\\gammaandρ\\rho\. In classical setting, the value of a policyπ\\piis defined as discounted sum of rewards collected underπ\\pi\. However, the classical objective fails to capture the inherent risk coming from the stochastic transitions and thus the uncertainty about the rewards collected during a trajectory\. Under the recursive OCE criterion, the agent’s objective is defined using the iteration of OCEs\[[5](https://arxiv.org/html/2605.21763#bib.bib12)\]\. The value functionVπV^\{\\pi\}and the state\-action value function \(or Q\-value\)QπQ^\{\\pi\}of a policyπ\\piare defined informally as

Vπ\(s0\)\\displaystyle V^\{\\pi\}\(s\_\{0\}\)=r0\+γρs0,π\(s0\)\(r1\+γρs1,π\(s1\)\(r2\+γρs2,π\(s2\)\(…\)\)\),\\displaystyle=r\_\{0\}\+\\gamma\\rho\_\{s\_\{0\},\\pi\(s\_\{0\}\)\}\\Big\(r\_\{1\}\+\\gamma\\rho\_\{s\_\{1\},\\pi\(s\_\{1\}\)\}\\big\(r\_\{2\}\+\\gamma\\rho\_\{s\_\{2\},\\pi\(s\_\{2\}\)\}\(\\ldots\)\\big\)\\Big\)\\,,Qπ\(s0,a\)\\displaystyle Q^\{\\pi\}\(s\_\{0\},a\)=R\(s0,a\)\+γρs0,a\(r1\+γρs1,π\(s1\)\(r2\+γρs2,π\(s2\)\(…\)\)\)\\displaystyle=R\(s\_\{0\},a\)\+\\gamma\\rho\_\{s\_\{0\},a\}\\Big\(r\_\{1\}\+\\gamma\\rho\_\{s\_\{1\},\\pi\(s\_\{1\}\)\}\\Big\(r\_\{2\}\+\\gamma\\rho\_\{s\_\{2\},\\pi\(s\_\{2\}\)\}\(\\ldots\)\\Big\)\\Big\)\\,where we used the convention thatρs,a:=ρPs,a\\rho\_\{s,a\}:=\\rho\_\{P\_\{s,a\}\}andrt:=R\(st,π\(st\)\)r\_\{t\}:=R\(s\_\{t\},\\pi\(s\_\{t\}\)\)\. For all\(s,a\)\(s,a\), letV∗\(s\):=supπVπ\(s\)V^\{\*\}\(s\):=\\sup\_\{\\pi\}V^\{\\pi\}\(s\)andQ∗\(s,a\):=supπQπ\(s,a\)Q^\{\*\}\(s,a\):=\\sup\_\{\\pi\}Q^\{\\pi\}\(s,a\)denote the optimal value \(at statess\) and Q\-value \(at state\-action\(s,a\)\(s,a\)\), respectively, wheresup\\supis taken over set of all possible policies\. Any policyπ∗\\pi^\{\*\}satisfyingVπ∗=V∗V^\{\\pi^\{\*\}\}=V^\{\*\}is called optimal, and any policyπ\\piobeyingVπ\(s\)≥V∗\(s\)−εV^\{\\pi\}\(s\)\\geq V^\{\*\}\(s\)\-\\varepsilonfor a givenε\>0\\varepsilon\>0is calledε\\varepsilon\-optimal\. As established in\[[5](https://arxiv.org/html/2605.21763#bib.bib12)\], there exists a stationary deterministic optimal policyπ∗:𝒮→𝒜\\pi^\{\*\}:\\mathcal\{S\}\\to\\mathcal\{A\}that achievesV∗\(s\)V^\{\*\}\(s\)for all statessssimultaneously, which is shown to satisfy the optimal Bellman equations:

V∗\(s\)\\displaystyle V^\{\*\}\(s\)=maxa∈𝒜⁡\(R\(s,a\)\+γρs,a\(V∗\(s′\)\)\),\\displaystyle=\\max\_\{a\\in\\mathcal\{A\}\}\\big\(R\(s,a\)\+\\gamma\\rho\_\{s,a\}\(V^\{\*\}\(s^\{\\prime\}\)\)\\big\),\\quad∀s∈𝒮\.\\displaystyle\\forall s\\in\\mathcal\{S\}\.Q∗\(s,a\)\\displaystyle Q^\{\*\}\(s,a\)=R\(s,a\)\+γρs,a\(maxa′∈𝒜⁡Q∗\(s′,a′\)\),\\displaystyle=\\\!R\(s,a\)\+\\gamma\\rho\_\{s,a\}\\big\(\\max\_\{a^\{\\prime\}\\in\\mathcal\{A\}\}Q^\{\*\}\(s^\{\\prime\},a^\{\\prime\}\)\\big\),\\quad∀\(s,a\)∈𝒮×𝒜\.\\displaystyle\\forall\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}\.We introduce the optimal Bellman operator𝒯:ℝ𝒮×𝒜→ℝ𝒮×𝒜\\mathcal\{T\}:\\mathbb\{R\}^\{\\mathcal\{S\}\\times\\mathcal\{A\}\}\\rightarrow\\mathbb\{R\}^\{\\mathcal\{S\}\\times\\mathcal\{A\}\}defined forf:𝒮×𝒜→ℝf:\\mathcal\{S\}\\times\\mathcal\{A\}\\rightarrow\\mathbb\{R\}by

\(𝒯f\)\(s,a\)\\displaystyle\(\\mathcal\{T\}f\)\(s,a\):=R\(s,a\)\+γρs,a\(maxa′∈𝒜⁡f\(s′,a′\)\),∀\(s,a\)∈𝒮×𝒜\.\\displaystyle:=R\(s,a\)\+\\gamma\\rho\_\{s,a\}\\big\(\\max\_\{a^\{\\prime\}\\in\\mathcal\{A\}\}f\(s^\{\\prime\},a^\{\\prime\}\)\\big\)\\,,\\quad\\forall\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}\.It is evident thatQ∗Q^\{\*\}is the unique fixed\-point of𝒯\\mathcal\{T\}:Q∗=𝒯Q∗Q^\{\*\}=\\mathcal\{T\}Q^\{\*\}\.

Since𝒯\\mathcal\{T\}is aγ\\gamma\-contraction \(Lemma[1](https://arxiv.org/html/2605.21763#Thmlemma1)in Appendix[B](https://arxiv.org/html/2605.21763#A2)\), it follows thatQ∗Q^\{\*\}can be efficiently approximated arbitrarily well by a value\-iteration\-type algorithm \(see Algorithm[2](https://arxiv.org/html/2605.21763#algorithm2)\)\.

### 2\.3Learning Performance

Under a given OCE risk measureρ\\rhoapplied recursively, we consider RL algorithms that aim to find anε\\varepsilon\-optimal policy or anε\\varepsilon\-optimal value function for inputε\>0\\varepsilon\>0\. This is done by assuming access to a generative model \(or simulator\) of the MDP, which can produce a samples′∼Ps,as^\{\\prime\}\\sim P\_\{s,a\}for any queried state\-action\(s,a\)\(s,a\)\. We consider two types of such algorithms, which we generically denote by𝒰\\mathcal\{U\}: The first type outputs aQQ\-valueQT𝒰:𝒮×𝒜→ℝQ\_\{T\}^\{\\mathcal\{U\}\}:\\mathcal\{S\}\\times\\mathcal\{A\}\\rightarrow\\mathbb\{R\}, whereas the second outputs a policyπT𝒰:𝒮→𝒜\\pi\_\{T\}^\{\\mathcal\{U\}\}:\\mathcal\{S\}\\rightarrow\\mathcal\{A\}usingTTsamples\.

We evaluate the quality of an algorithm that outputs aQQ\-value by‖Q∗−QT𝒰‖\\\|Q^\{\*\}\-Q\_\{T\}^\{\\mathcal\{U\}\}\\\|, and that outputs a policy by‖V∗−VπT𝒰‖\\\|V^\{\*\}\-V^\{\\pi\_\{T\}^\{\\mathcal\{U\}\}\}\\\|\. Often, we will suppressTTfrom the notation\. This leads to the notion of\(ε,δ\)\(\\varepsilon,\\delta\)\-correct value and policy for input parameters\(ε,δ\)\(\\varepsilon,\\delta\)as formalized below:

###### Definition 1\(\(ε,δ\)\(\\varepsilon,\\delta\)\-correct value and policy\)\.

An algorithm𝒰\\mathcal\{U\}that outputs aQQ\-valueQ𝒰Q^\{\\mathcal\{U\}\}is called*\(ε,δ\)\(\\varepsilon,\\delta\)\-value\-correct*on a set of MDPs𝕄\\mathbb\{M\}ifℙ\(‖Q∗−Q𝒰‖≤ε\)≥1−δ\\mathbb\{P\}\(\\\|Q^\{\*\}\-Q^\{\\mathcal\{U\}\}\\\|\\leq\\varepsilon\)\\geq 1\-\\deltafor allM∈𝕄M\\in\\mathbb\{M\}\. Similarly, an algorithm𝒰\\mathcal\{U\}that outputs a policyπ𝒰\\pi^\{\\mathcal\{U\}\}is called*\(ε,δ\)\(\\varepsilon,\\delta\)\-policy\-correct*on a set of MDPs𝕄\\mathbb\{M\}ifℙ\(‖V∗−Vπ𝒰‖≤ε\)≥1−δ\\mathbb\{P\}\(\\\|V^\{\*\}\-V^\{\\mathcal\{\\pi^\{\\mathcal\{U\}\}\}\}\\\|\\leq\\varepsilon\)\\geq 1\-\\deltafor allM∈𝕄M\\in\\mathbb\{M\}\.

The notion of\(ε,δ\)\(\\varepsilon,\\delta\)\-value\-correctness yields a sample complexity notion in the case of*value learning*, while\(ε,δ\)\(\\varepsilon,\\delta\)\-policy\-correctness serves a similar role for*policy learning*\.

## 3Model\-based Utility Value Iteration

We now present a simple model\-based algorithm, called Model\-Based OCE Value Iteration \(MB\-OCE\-VI\), for value and policy learning settings with the RL objective defined using recursive OCEs, assuming access to a generative model of the MDP\.

1

2

Input:Generative model

PP
Output:Model estimate

P^\\widehat\{P\}
3

4

5

6Function*EstimateModel\(*NN*\)*:

7

∀\\forall\(s,z\)∈𝒮×Z:\(s,z\)\\in\\mathcal\{S\}\\times Z:m\(s,z\)=0m\(s,z\)=0
8for*eachz∈Zz\\in Z*do

9for*i=1,2,…,Ni=1,2,\\ldots,N*do

10

s∼P\(⋅\|z\)s\\sim P\(\\cdot\|z\)
11

m\(s,z\):=m\(s,z\)\+1m\(s,z\):=m\(s,z\)\+1
12end for

13

∀s∈𝒮:P^\(s,z\)=m\(s,z\)N\\forall s\\in\\mathcal\{S\}:\\widehat\{P\}\(s,z\)=\\frac\{m\(s,z\)\}\{N\}
14end for

15return

P^\\widehat\{P\}
16

Algorithm 1Model estimation1

Input:Empirical MDP

M^=\(𝒮,𝒜,P^,R,γ\)\\widehat\{M\}=\(\\mathcal\{S\},\\mathcal\{A\},\\widehat\{P\},R,\\gamma\), OCE

ρ\\rho, number of iterations

kk
Output:Estimate

QkQ\_\{k\}of optimal

QQ\-function

Q∗Q^\{\*\}
2Initialization:

∀\(s,a\)\\forall\(s,a\)set

Q\(s,a\)=12\(1−γ\)Q\(s,a\)=\\frac\{1\}\{2\(1\-\\gamma\)\}
3for*j=0,1,…,k−1j=0,1,\\ldots,k\-1*do

4for*all\(s,a\)∈𝒮×𝒜\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}*do

5

Qj\+1\(s,a\)=R\(s,a\)\+γρs,a\(maxa′⁡Q\(s′,a′\)\)Q\_\{j\+1\}\(s,a\)=R\(s,a\)\+\\gamma\\rho\_\{s,a\}\(\\max\_\{a^\{\\prime\}\}Q\(s^\{\\prime\},a^\{\\prime\}\)\)
6

7end for

8

9end for

10

∀s∈𝒮:\\forall s\\in\\mathcal\{S\}:πk\(s\)=arg⁡maxa∈𝒜⁡Qk\(s,a\)\\pi\_\{k\}\(s\)=\\arg\\\!\\max\_\{a\\in\\mathcal\{A\}\}Q\_\{k\}\(s,a\)
11return

QkQ\_\{k\}and

πk\\pi\_\{k\}

Algorithm 2MB\-OCE\-VIWe introduce some necessary notations\. LetP^\\widehat\{P\}denote the plug\-in estimator of the transition functionPP, built usingNNindependent samples from each state\-action pairs of the MDP\. More precisely, for\(s,a,s′\)∈Z×𝒮\(s,a,s^\{\\prime\}\)\\in Z\\times\\mathcal\{S\},P^\(s′\|s,a\)=n\(s,a,s′\)N\\widehat\{P\}\(s^\{\\prime\}\|s,a\)=\\frac\{n\(s,a,s^\{\\prime\}\)\}\{N\}, wheren\(s,a,s′\)n\(s,a,s^\{\\prime\}\)denotes the number of timess′s^\{\\prime\}was observed under the pair\(s,a\)∈Z\(s,a\)\\in Z\. Further, letM^=\(𝒮,𝒜,R,P^,γ\)\\widehat\{M\}=\(\\mathcal\{S\},\\mathcal\{A\},R,\\widehat\{P\},\\gamma\)be the corresponding empirical MDP built usingP^\\widehat\{P\}as described in algorithm[1](https://arxiv.org/html/2605.21763#algorithm1)\.

We are now ready to introduce MB\-OCE\-VIwhose pseduocode is provided as Algorithm[2](https://arxiv.org/html/2605.21763#algorithm2)\. It is a value\-iteration type algorithm that extends the classical value iteration for risk\-neutral objectives to those defined recursively using the OCEρ\\rhoapplied to the empirical MDPM^\\widehat\{M\}\. The MB\-OCE\-VI algorithm works as follows\. For any inputε\>0\\varepsilon\>0, it first collectsT=NSAT=NSAsamples, which is done by makingNNcalls to the generative model, and then computesP^\\widehat\{P\}\. Then, it runs value\-iteration updates which returns a policyπk\\pi\_\{k\}and a Q\-value estimateQkQ\_\{k\}\. Finally, one can setk=log⁡\(12ε\(1−γ\)\)/log⁡\(1/γ\)k=\\log\\big\(\\frac\{1\}\{2\\varepsilon\(1\-\\gamma\)\}\\big\)/\\log\(1/\\gamma\)\(see Lemma[2](https://arxiv.org/html/2605.21763#Thmlemma2)in Appendix[B](https://arxiv.org/html/2605.21763#A2)\)\.

## 4Sample Complexity Analysis of MB\-OCE\-VI

In this section, we present PAC bounds on the sample complexity of MB\-OCE\-VI for both value and policy learning\. The bounds hold under the assumption that the OCE is defined by a utilityu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}; we refer to Section[2\.1](https://arxiv.org/html/2605.21763#S2.SS1)for the definition of𝕌1\\mathbb\{U\}\_\{1\}\. As it turns out, this restriction is necessary since the set𝕌1\\mathbb\{U\}\_\{1\}will be shown to be precisely the utilities that guarantee continuity of value\-functions between similar MDPs on the same state\-action space\. Forγ∈\(0,1\)\\gamma\\in\(0,1\), we introduceHγ:=11−γH\_\{\\gamma\}:=\\frac\{1\}\{1\-\\gamma\}\.

###### Theorem 1\.

Assumeu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}\. If the total numberTTof calls to the generative model satisfies

T≥32γ2SA\[u\(−Hγ\)\]2ε2\(1−γ\)2log⁡\(8γSAu\+′\(−Hγ\)ε\(1−γ\)2\),\\displaystyle T\\geq 32\\frac\{\\gamma^\{2\}SA\[u\(\-H\_\{\\gamma\}\)\]^\{2\}\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{2\}\}\\log\\bigg\(\\frac\{8\\gamma SAu^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}\{\\varepsilon\(1\-\\gamma\)^\{2\}\}\\bigg\)\\,,then it holds thatℙ\(‖Q∗−Qk‖≥ε\)≤δ\\mathbb\{P\}\(\\\|Q^\{\*\}\-Q\_\{k\}\\\|\\geq\\varepsilon\)\\leq\\delta\. Furthermore, if

T≥128γ4SA\[u\(−Hγ\)\]2ε2\(1−γ\)4log⁡\(16γ2SAu\+′\(−Hγ\)ε\(1−γ\)3\),\\displaystyle T\\geq 128\\frac\{\\gamma^\{4\}SA\[u\(\-H\_\{\\gamma\}\)\]^\{2\}\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{4\}\}\\log\\bigg\(\\frac\{16\\gamma^\{2\}SAu^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}\{\\varepsilon\(1\-\\gamma\)^\{3\}\}\\bigg\)\\,,then it holds thatℙ\(‖V∗−Vπk‖\>ε\)≤δ\\mathbb\{P\}\(\\\|V^\{\*\}\-V^\{\\pi\_\{k\}\}\\\|\>\\varepsilon\)\\leq\\delta\. Here,u\+′u^\{\\prime\}\_\{\+\}denotes the right derivative ofuu\.

Before we give the proof, we present in Table[1](https://arxiv.org/html/2605.21763#S4.T1)a comparison of the policy learning sample complexity upper bounds for a set of specific risk\-measures \(see Appendix[A](https://arxiv.org/html/2605.21763#A1)\)\.

### 4\.1Comparison with Existing Sample Complexity Bounds

Here we provide a comparison between the sample complexity of policy learning in Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)to best existing bounds for some concrete cases of OCE in the recursive setting\.

##### CVaR\.

For CVaR with parameterτ∈\(0,1\)\\tau\\in\(0,1\), Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)gives a policy learning sample complexity of𝒪~\(SAε2\(1−γ\)6τ2\)\\widetilde\{\\mathcal\{O\}\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{6\}\\tau^\{2\}\}\\Big\)\. For recursive CVaR, the best existing bound is due to\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\], and scales as𝒪~\(SAε2\(1−γ\)4τ2\)\\widetilde\{\\mathcal\{O\}\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{4\}\\tau^\{2\}\}\\Big\)\. This shows that our general analysis leaves a gap of1\(1−γ\)2\\frac\{1\}\{\(1\-\\gamma\)^\{2\}\}\. The analysis in\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\]exploits the specific properties of CVaR that cannot be generalized to a generic OCE\. We also mention that there is no specialized result for value learning for CVaR, to the best of our knowledge\.

##### Entropic\.

For the entropic risk with parameterβ\>0\\beta\>0, the best available sample complexities for value learning and policy learning are reported in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\]\. For value learning, Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)gives a bound of𝒪~\(SAε2\(1−γ\)3β\(eβ1−γ−1\)2\)\\widetilde\{\\mathcal\{O\}\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{3\}\\beta\}\\big\(e^\{\\frac\{\\beta\}\{1\-\\gamma\}\}\-1\\big\)^\{2\}\\Big\), which is worse by a factor of1/\(1−γ\)1/\(1\-\\gamma\)compared to the bound in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\], which scales as𝒪~\(SAε2\(1−γ\)2β2\(eβ1−γ−1\)2\)\\widetilde\{\\mathcal\{O\}\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{2\}\\beta^\{2\}\}\\big\(e^\{\\frac\{\\beta\}\{1\-\\gamma\}\}\-1\\big\)^\{2\}\\Big\)\. For policy learning, the resulting bound from Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)scales as𝒪~\(SAε2\(1−γ\)5β\(eβ1−γ−1\)2\)\\widetilde\{\\mathcal\{O\}\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{5\}\\beta\}\\big\(e^\{\\frac\{\\beta\}\{1\-\\gamma\}\}\-1\\big\)^\{2\}\\Big\), which is again off by a factor of1/\(1−γ\)1/\(1\-\\gamma\)compared to the corresponding bound in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\]\. We note that the bounds in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\]uses some proof elements that are specifically tailored to the entropic measure, and cannot be applied for generic OCEs\.

Table 1:Implication of Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)for different risk measures

### 4\.2Proof of Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)

In this section, we prove Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)\. First, we present a lemma, proven in Appendix[C](https://arxiv.org/html/2605.21763#A3), that characterizes the smoothness of Q\-values under the OCE defined by a utilityu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}, when the transition function is perturbed\. This result could of interest, beyond the considered RL setting\.

\{restatable\}

lemmaOCESimLemma LetM=\(𝒮,𝒜,P,R,γ\)M=\(\\mathcal\{S\},\\mathcal\{A\},P,R,\\gamma\)andM~=\(𝒮,𝒜,P~,R,γ\)\\widetilde\{M\}=\(\\mathcal\{S\},\\mathcal\{A\},\\widetilde\{P\},R,\\gamma\)be two MDPs that only differ in their transition function andπ\\pia fixed stationary policy andu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be a utility function\. Then

‖Qπ−Q~π‖≤γ1−γmaxs,asupη∈\[0,Hγ\]\|∑s′∈𝒮\[Ps,a\(s′\)−P~s,a\(s′\)\]u\(Vπ\(s′\)−η\)\|\.\\displaystyle\\\|Q^\{\\pi\}\-\\widetilde\{Q\}^\{\\pi\}\\\|\\leq\\frac\{\\gamma\}\{1\-\\gamma\}\\max\_\{s,a\}\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\bigg\|\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widetilde\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V^\{\\pi\}\(s^\{\\prime\}\)\-\\eta\)\\bigg\|\\,\.
###### Proof of Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)\.

Letε\>0\\varepsilon\>0\. LetQkQ\_\{k\}be theQQ\-function output of OCE\-VI afterkkiterations, and letQ^∗\\widehat\{Q\}^\{\*\}denote the optimal Q\-value inM^\\widehat\{M\}\. Note thatQ^π∗\\widehat\{Q\}^\{\\pi^\{\*\}\}is the Q\-value in the empirical MDP of the optimal policy of the true MDP\. Using a standard decomposition which usesQ^∗≥Q^π∗\\widehat\{Q\}^\{\*\}\\geq\\widehat\{Q\}^\{\\pi^\{\*\}\}, we have for any\(s,a\)∈𝒮×𝒜\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\},

Qk\(s,a\)\\displaystyle Q\_\{k\}\(s,a\)=Q∗\(s,a\)\+Qk\(s,a\)−Q^∗\(s,a\)\+Q^∗\(s,a\)−Q∗\(s,a\)\\displaystyle=Q^\{\*\}\(s,a\)\+Q\_\{k\}\(s,a\)\-\\widehat\{Q\}^\{\*\}\(s,a\)\+\\widehat\{Q\}^\{\*\}\(s,a\)\-Q^\{\*\}\(s,a\)≥Q∗\(s,a\)\+Qk\(s,a\)−Q^∗\(s,a\)\+Q^π∗\(s,a\)−Q∗\(s,a\)\\displaystyle\\geq Q^\{\*\}\(s,a\)\+Q\_\{k\}\(s,a\)\-\\widehat\{Q\}^\{\*\}\(s,a\)\+\\widehat\{Q\}^\{\\pi^\{\*\}\}\(s,a\)\-Q^\{\*\}\(s,a\)≥Q∗\(s,a\)−‖Qk−Q^∗‖−‖Q^π∗−Q∗‖\.\\displaystyle\\geq Q^\{\*\}\(s,a\)\-\\\|Q\_\{k\}\-\\widehat\{Q\}^\{\*\}\\\|\-\\\|\\widehat\{Q\}^\{\\pi^\{\*\}\}\-Q^\{\*\}\\\|\\,\.\(1\)
Therefore, to establishε\\varepsilon\-value\-correctness it suffices to ensure‖Q^π∗−Q∗‖≤ε/2\\\|\\widehat\{Q\}^\{\\pi^\{\*\}\}\-Q^\{\*\}\\\|\\leq\\varepsilon/2and‖Qk−Q^∗‖≤ε/2\\\|Q\_\{k\}\-\\widehat\{Q\}^\{\*\}\\\|\\leq\\varepsilon/2\. By Lemma[2](https://arxiv.org/html/2605.21763#Thmlemma2), we can have‖Qk−Q^∗‖<ε/2\\\|Q\_\{k\}\-\\widehat\{Q\}^\{\*\}\\\|<\\varepsilon/2by pickingk≥log⁡\(1\(1−γ\)ε\)/log⁡\(1/γ\)k\\geq\\log\(\\frac\{1\}\{\(1\-\\gamma\)\\varepsilon\}\)/\\log\(1/\\gamma\)\.

To control‖Q^π∗−Q∗‖\\\|\\widehat\{Q\}^\{\\pi^\{\*\}\}\-Q^\{\*\}\\\|, we apply Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)withM~=M^\\widetilde\{M\}=\\widehat\{M\}andπ=π∗\\pi=\\pi^\{\*\}, which yields

‖Q∗−Q^π∗‖≤γ1−γmaxs,asupη∈\[0,Hγ\]\|∑s′\[Ps,a\(s′\)−P^s,a\(s′\)\]u\(V∗\(s′\)−η\)\|\.\\displaystyle\\\|Q^\{\*\}\-\\widehat\{Q\}^\{\\pi^\{\*\}\}\\\|\\leq\\frac\{\\gamma\}\{1\-\\gamma\}\\max\_\{s,a\}\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\bigg\|\\sum\_\{s^\{\\prime\}\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V^\{\*\}\(s^\{\\prime\}\)\-\\eta\)\\bigg\|\\,\.For a fixed\(s,a\)\(s,a\), letη∗\\eta^\{\*\}be an arbitrary but fixed optimizer of

supη∈\[0,Hγ\]\|∑s′∈𝒮\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(Vπ\(s′\)−η\)\|\.\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\bigg\|\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V^\{\\pi\}\(s^\{\\prime\}\)\-\\eta\)\\bigg\|\.Sinceη⋆\\eta^\{\\star\}is data\-dependent, a direct application of Hoeffding’s inequality is not allowed\. To handle this, we discretize the interval\[0,Hγ\]\[0,H\_\{\\gamma\}\]\. LetDDdenote the corresponding discretized set built using uniform discretization\. The following lemma controls the introduced error, which establishes thatuuis sufficiently regular to get a handle on the number of discretization points needed:

\{restatable\}

lemmaBoundingSimulationTermCombined Letu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}, and defineη¯=minη∈D⁡u\+′\(−Hγ\)\|η∗−η\|\\bar\{\\eta\}=\\min\_\{\\eta\\in D\}u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\eta\|\. If the setDDof discretization points satisfies\|D\|≥u\+′\(−Hγ\)ε\(1−γ\)\|D\|\\geq\\frac\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}\{\\varepsilon\(1\-\\gamma\)\}, then

maxs,a\|∑s′\(Ps,a\(s′\)−\\displaystyle\\max\_\{s,a\}\\bigg\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-P^s,a\(s′\)\)u\(V∗\(s′\)−η∗\)\|≤maxs,a\|∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V∗\(s′\)−η¯\)\|\+ε2\.\\displaystyle\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V^\{\*\}\(s^\{\\prime\}\)\\\!\-\\\!\\eta^\{\*\}\)\\bigg\|\\leq\\max\_\{s,a\}\\bigg\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V^\{\*\}\(s^\{\\prime\}\)\\\!\-\\\!\\bar\{\\eta\}\)\\bigg\|\+\\frac\{\\varepsilon\}\{2\}\\,\.
Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)implies that

ℙ\(maxs,asupη∈\[0,Hγ\]∑s′∈𝒮\[Ps,a\(s′\)−P^s,a\(s′\)\]u\(V∗\(s′\)−η\)\>ε\)≤\\displaystyle\\mathbb\{P\}\\bigg\(\\max\_\{s,a\}\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V^\{\*\}\(s^\{\\prime\}\)\-\\eta\)\>\\varepsilon\\bigg\)\\leqℙ\(∃η¯∈D:maxs,a∑s′∈𝒮\[Ps,a\(s′\)−P^s,a\(s′\)\]u\(V∗\(s′\)−η\)\>ε2\)\.\\displaystyle\\mathbb\{P\}\\bigg\(\\exists\\bar\{\\eta\}\\in D:\\max\_\{s,a\}\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V^\{\*\}\(s^\{\\prime\}\)\-\\eta\)\>\\frac\{\\varepsilon\}\{2\}\\bigg\)\\,\.\(2\)An application of Hoeffding’s inequality \(Lemma[3](https://arxiv.org/html/2605.21763#Thmlemma3)in the appendix\) and taking a union bound overDDand all state\-action pairs, it follows that the right\-hand side of \([2](https://arxiv.org/html/2605.21763#S4.E2)\) will be smaller thanδ\\deltaif any state\-action pair is sampledN=8\[u\(−Hγ\)\]2ε2log⁡\(4SAu\+′\(−Hγ\)ε\(1−γ\)\)N=\\frac\{8\[u\(\-H\_\{\\gamma\}\)\]^\{2\}\}\{\\varepsilon^\{2\}\}\\log\\big\(\\frac\{4SAu^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}\{\\varepsilon\(1\-\\gamma\)\}\\big\)times\. After adjustingε\\varepsilonappropriately, it follows that if

T≥32γ2SA\[u\(−Hγ\)\]2ε2\(1−γ\)2log⁡\(8γSAu\+′\(−Hγ\)ε\(1−γ\)2\),\\displaystyle T\\geq 32\\frac\{\\gamma^\{2\}SA\[u\(\-H\_\{\\gamma\}\)\]^\{2\}\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{2\}\}\\log\\bigg\(\\frac\{8\\gamma SAu^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}\{\\varepsilon\(1\-\\gamma\)^\{2\}\}\\bigg\),thenℙ\(‖Q^π∗−Q∗‖≥ε2\)<δ\\mathbb\{P\}\(\\\|\\widehat\{Q\}^\{\\pi^\{\*\}\}\-Q^\{\*\}\\\|\\geq\\frac\{\\varepsilon\}\{2\}\)<\\delta, showing the first part of the theorem\.

To prove the second result, we use the followig lemma, which is proven in the appendix:\{restatable\}lemmagreedyPolicyBound Letε\>0\\varepsilon\>0\. LetV¯∈ℝS\\overline\{V\}\\in\\mathbb\{R\}^\{S\}be a value function obeying‖V∗−V¯‖<ε\\\|V^\{\*\}\-\\overline\{V\}\\\|<\\varepsilon, andπG:=arg⁡maxa⁡\[R\(s,a\)\+γρs,a\(V¯\(s′\)\)\]\\pi^\{G\}:=\\arg\\\!\\max\_\{a\}\[R\(s,a\)\+\\gamma\\rho\_\{s,a\}\(\\overline\{V\}\(s^\{\\prime\}\)\)\]be a greedy policy with respect toV¯\\overline\{V\}\. Then,‖V∗−VπG‖≤2γ1−γε\\\|V^\{\*\}\-V^\{\\pi\_\{G\}\}\\\|\\leq\\frac\{2\\gamma\}\{1\-\\gamma\}\\varepsilon\. Applying Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)withV¯=V^πk\\overline\{V\}=\\widehat\{V\}^\{\\pi\_\{k\}\}, we have shown that‖V^πk−V∗‖≤ε\\\|\\widehat\{V\}^\{\\pi\_\{k\}\}\-V^\{\*\}\\\|\\leq\\varepsilonwith probability1−δ1\-\\delta\. Further, note thatπk\\pi\_\{k\}by construction is the greedy policy with respect toV^πk\\widehat\{V\}^\{\\pi\_\{k\}\}\. Therefore, the true value ofπk\\pi\_\{k\}satisfies, with probability at least1−δ1\-\\delta,

‖V∗−Vπk‖≤2γ1−γ‖Q∗−Qk‖,\\\|V^\{\*\}\-V^\{\\pi\_\{k\}\}\\\|\\leq\\frac\{2\\gamma\}\{1\-\\gamma\}\\\|Q^\{\*\}\-Q\_\{k\}\\\|,which after properly adjustingε\\varepsilonyields the announced sample complexity for policy learning\. ∎

## 5Impossibility Results

In this section, we present some results for PAC\-learnability under OCEs\. Specifically, we establish that for essentially all utility functionsu∈𝕌0u\\\!\\in\\\!\\mathbb\{U\}\_\{0\}for whichdom\(u\)≠ℝ\\text\{dom\}\(u\)\\\!\\neq\\\!\\mathbb\{R\}, it is impossible to obtain PAC\-bounds for the corresponding learning problems withOCEu\\text\{OCE\}^\{u\}\. This is done by constructing a parametric family of simple MDPs for which the value functions are not continuous in the parameter\. By making this parameter sufficiently small, we can thus have two MDPs with a large gap in value functions, while the number of samples it takes to distinguish them can be made arbitrarily large\.

We will have to require thatuuis not the identity foru≥0u\\geq 0\. A formalized in Proposition[2](https://arxiv.org/html/2605.21763#Thmproposition2), the only utilities this assumption rules out all lead to the same risk measure, namely the expectation\.

###### Theorem 2\(Impossibility, value learning\)\.

Letu∈𝕌0u\\in\\mathbb\{U\}\_\{0\}be a utility function for which \(i\)dom\(u\)≠ℝ\\text\{dom\}\(u\)\\neq\\mathbb\{R\}and \(ii\) there is somet0\>0t\_\{0\}\>0such thatu\(t0\)<t0u\(t\_\{0\}\)<t\_\{0\}\. Then, there exists a class of MDPs𝕄\\mathbb\{M\}withSSstates,AAactions, and discount factorsγ\\gammasuch that no value\-learning algorithm𝒰\\mathcal\{U\}can be\(ε,δ\)\(\\varepsilon,\\delta\)\-correct on𝕄\\mathbb\{M\}\.

###### Proof\.

We consider a class𝕄\\mathbb\{M\}of MDPs with a single actionaaand three statess0,sGs\_\{0\},s\_\{G\}andsBs\_\{B\}, wheresGs\_\{G\}andsBs\_\{B\}are absorbing andR\(sG,a\)=1R\(s\_\{G\},a\)=1andR\(sB,a\)=R\(s0,a\)=0R\(s\_\{B\},a\)=R\(s\_\{0\},a\)=0\. Finally, froms0s\_\{0\}it is possible to transition tosGs\_\{G\}with probabilitypp, and tosBs\_\{B\}with probability1−p1\-p\. The MDPs in𝕄\\mathbb\{M\}differ only in their value ofpp\. Hence, we can parametrize them byppand writeMpM\_\{p\}to represent the MDP in which transition probability tosGs\_\{G\}ispp\. LetQ1Q\_\{1\}andQpQ\_\{p\}denote the Q\-values ofM1M\_\{1\}andMpM\_\{p\}, respectively\. We have

Q1\(s0,a\)=γ1−γ,Qp\(s0,a\)=γsupη∈\[0,Hγ\]\{η\+pu\(Hγ−η\)\+\(1−p\)u\(−η\)\}\.\\displaystyle Q\_\{1\}\(s\_\{0\},a\)=\\frac\{\\gamma\}\{1\-\\gamma\},\\qquad Q\_\{p\}\(s\_\{0\},a\)=\\gamma\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\Big\\\{\\eta\+pu\(H\_\{\\gamma\}\-\\eta\)\+\(1\-p\)u\(\-\\eta\)\\Big\\\}\\,\.By pickingγ\\gammalarge enough, we get that11−γ−ξ\>t0\\frac\{1\}\{1\-\\gamma\}\-\\xi\>t\_\{0\}, whereξ=:−infdom\(u\)\\xi=:\-\\inf\\text\{dom\}\(u\)\. Since for anyη\>ξ\\eta\>\\xi, we have thatη\+u\(Hγ−η\)\+\(1−p\)u\(−η\)=−∞\\eta\+u\(H\_\{\\gamma\}\-\\eta\)\+\(1\-p\)u\(\-\\eta\)=\-\\infty, it holds that

supη∈\[0,Hγ\]\{η\+pu\(Hγ−η\)\+\(1−p\)u\(−η\)\}=supη∈\[0,ξ\]\{η\+pu\(Hγ−η\)\+\(1−p\)u\(−η\)\}\.\\displaystyle\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\Big\\\{\\eta\+pu\(H\_\{\\gamma\}\-\\eta\)\+\(1\-p\)u\(\-\\eta\)\\Big\\\}=\\sup\_\{\\eta\\in\[0,\\xi\]\}\\Big\\\{\\eta\+pu\(H\_\{\\gamma\}\-\\eta\)\+\(1\-p\)u\(\-\\eta\)\\Big\\\}\.Moreover, since

supη∈\[0,Hγ\]\{η\+pu\(Hγ−η\)\+\(1−p\)u\(−η\)\}\\displaystyle\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\Big\\\{\\eta\+pu\(H\_\{\\gamma\}\-\\eta\)\+\(1\-p\)u\(\-\\eta\)\\Big\\\}≤supη∈\[0,Hγ\]\{η\+pu\(Hγ−η\)\}\\displaystyle\\leq\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\Big\\\{\\eta\+pu\(H\_\{\\gamma\}\-\\eta\)\\Big\\\}=ξ\+pu\(Hγ−ξ\)<Hγ,\\displaystyle=\\xi\+pu\(H\_\{\\gamma\}\-\\xi\)<H\_\{\\gamma\},it follows thatQ1\(s0,a\)−Qp\(s0,a\)≥γ\(Hγ−ξ\)=:Δ\>0Q\_\{1\}\(s\_\{0\},a\)\-Q\_\{p\}\(s\_\{0\},a\)\\geq\\gamma\(H\_\{\\gamma\}\-\\xi\)=:\\Delta\>0for allp<1p<1\. Introduce

ℰp:=\{\|Qp∗\(s0,a\)−Q𝒰\(s0,a\)\|≤ε\},ℰ1\\displaystyle\\mathcal\{E\}\_\{p\}:=\\\{\|Q\_\{p\}^\{\*\}\(s\_\{0\},a\)\-Q^\{\\mathcal\{U\}\}\(s\_\{0\},a\)\|\\leq\\varepsilon\\\},\\qquad\\quad\\mathcal\{E\}\_\{1\}:=\{\|Q1∗\(s0,a\)−Q𝒰\(s0,a\)\|≤ε\}\.\\displaystyle:=\\\{\|Q\_\{1\}^\{\*\}\(s\_\{0\},a\)\-Q^\{\\mathcal\{U\}\}\(s\_\{0\},a\)\|\\leq\\varepsilon\\\}\.By pickingε<Δ/2\\varepsilon<\\Delta/2, the outputQ𝒰Q^\{\\mathcal\{U\}\}of any\(ε,δ\)\(\\varepsilon,\\delta\)\-correct algorithm𝒰\\mathcal\{U\}satisfies thatℰ1∩ℰp=∅\\mathcal\{E\}\_\{1\}\\cap\\mathcal\{E\}\_\{p\}=\\emptyset, which implies thatQ𝒰Q^\{\\mathcal\{U\}\}can never beε\\varepsilon\-correct on both MDPs simultaneously\.

The next argument relies on a change\-of\-measure between the one induced by the two MDPs\. Since underℙ1\\mathbb\{P\}\_\{1\}the only possible sequence of transitions from trying actionaains0s\_\{0\}is to observe a transition tosGs\_\{G\}every time, the probability of its occurrence underℙp\\mathbb\{P\}\_\{p\}ispNp^\{N\}, whereNNis the number of samples\. Assuming thatδ<12\\delta<\\frac\{1\}\{2\}, we then have for any\(ε,δ\)\(\\varepsilon,\\delta\)\-correct algorithm𝒰\\mathcal\{U\}that

ℙp\(ℰp∁\)≥ℙp\(ℰ1\)=𝔼p\[𝟙ℰ1\]=pN𝔼1\[𝟙ℰ1\]=pNℙ1\(ℰ1\)≥pN\(1−δ\)≥12pN\.\\displaystyle\\mathbb\{P\}\_\{p\}\(\\mathcal\{E\}^\{\\complement\}\_\{p\}\)\\geq\\mathbb\{P\}\_\{p\}\(\\mathcal\{E\}\_\{1\}\)=\\mathbb\{E\}\_\{p\}\[\\mathbbmss\{1\}\_\{\\mathcal\{E\}\_\{1\}\}\]=p^\{N\}\\mathbb\{E\}\_\{1\}\[\\mathbbmss\{1\}\_\{\\mathcal\{E\}\_\{1\}\}\]=p^\{N\}\\mathbb\{P\}\_\{1\}\(\\mathcal\{E\}\_\{1\}\)\\geq p^\{N\}\(1\-\\delta\)\\geq\\frac\{1\}\{2\}p^\{N\}\\,\.Solving the equation12pN≥δ\\frac\{1\}\{2\}p^\{N\}\\geq\\deltaforNN, we find that ifN≤log⁡\(1/2δ\)log⁡\(1/p\)N\\leq\\frac\{\\log\(1/2\\delta\)\}\{\\log\(1/p\)\}, then any algorithm that is\(ε,δ\)\(\\varepsilon,\\delta\)\-correct onM1M\_\{1\}cannot also be\(ε,δ\)\(\\varepsilon,\\delta\)\-correct onMpM\_\{p\}\. Since this expression tends to infinity asp→1p\\rightarrow 1, the result follows\. ∎

\{restatable\}

\[Impossibility, policy learning\]theoremImpossibilityPolicy Letu∈𝕌0u\\in\\mathbb\{U\}\_\{0\}be a utility function for which \(i\)dom\(u\)≠ℝ\\text\{dom\}\(u\)\\neq\\mathbb\{R\}and \(ii\) there is somet0\>0t\_\{0\}\>0such thatu\(t0\)<t0u\(t\_\{0\}\)<t\_\{0\}\. Then, there exists a class of MDPs𝕄\\mathbb\{M\}withSSstates,AAactions, and discount factorsγ\\gammasuch that no policy\-learning algorithm𝒰\\mathcal\{U\}can be\(ε,δ\)\(\\varepsilon,\\delta\)\-correct on𝕄\\mathbb\{M\}\.

The proof of Theorem[2](https://arxiv.org/html/2605.21763#Thmtheorem2)is quite similar to that of Theorem[2](https://arxiv.org/html/2605.21763#Thmtheorem2)and is postponed to the appendix\. The following result, proven in the appendix, shows that the assumption in Theorems[2](https://arxiv.org/html/2605.21763#Thmtheorem2)\-[2](https://arxiv.org/html/2605.21763#Thmtheorem2)only excludes the utilities that correspond toOCEu≡𝔼\\mathrm\{OCE\}\_\{u\}\\equiv\\mathbb\{E\}\.

###### Proposition 1\.

Letu∈𝕌0u\\in\\mathbb\{U\}\_\{0\}be any utility function for whichu\(t\)=tu\(t\)=tfor allt≥0t\\geq 0\. ThenOCEu\(X\)=𝔼\[X\]\\mathrm\{OCE\}\_\{u\}\(X\)=\\mathbb\{E\}\[X\]\.

In summary, the upper bounds and the above impossibility results now classify exactly the class of utility functions for which the worst\-case sample complexities are finite\. These are precisely the utility functions in𝕌1\\mathbb\{U\}\_\{1\}, i\.e\., the ones with full domaindom\(u\)=ℝ\\text\{dom\}\(u\)=\\mathbb\{R\}, apart from the special utilitiesuufor whichOCEu=𝔼\\text\{OCE\}^\{u\}=\\mathbb\{E\}\(hence, risk\-neutral problem\)\.

## 6Lower Bounds

In this section, we provide sample complexity lower bounds for both policy and value learning\. For each problem, we present two lower bounds\. The first one \(Theorems[6\.1](https://arxiv.org/html/2605.21763#S6.SS1)and[6\.2](https://arxiv.org/html/2605.21763#S6.SS2)\) is a general lower bound that holds for allu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}\. This lower bounds has an optimal dependence \(up to log\-factors\) onS,A,1δ,εS,A,\\frac\{1\}\{\\delta\},\\varepsilon, but a complicated dependence onuuandHγH\_\{\\gamma\}\. The second lower bound \(Theorems[6\.1](https://arxiv.org/html/2605.21763#S6.SS1)and[6\.2](https://arxiv.org/html/2605.21763#S6.SS2)\) enjoys a more interpretable dependence onHγH\_\{\\gamma\}, but holds for a sub\-class of utilities in𝕌1\\mathbb\{U\}\_\{1\}\. The sub\-class includes all strongly risk\-averse coherent OCE risk\-measures includingCVaRτ\\text\{CVaR\}\_\{\\tau\}\.

The MDP constructions used in the proofs are similar to\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\], where only the entropic risk is considered\. The main challenge with general OCE is to get tight lower bounds on the difference in value functions for the MDP building blocks with similar transitions\. These building blocks are shown in Figure[1](https://arxiv.org/html/2605.21763#S6.F1)\(a\) for value learning and \(b\) for policy learning\.

zzsGs^\{G\}sBs^\{B\}qq1−q1\-qR=1R=1R=0R=0\(a\)Hard\-to\-learn MDP construction building block
for value learningsssGs^\{G\}sBs^\{B\}qaq\_\{a\}1−qa1\-q\_\{a\}R=1R=1R=0R=0\(b\)Hard\-to\-learn MDP construction building block
for policy learning
Figure 1:Lower bound building blocks for value learning \(a\) and policy learning \(b\)The idea behind the proofs is to first obtain a tight lower bound on the difference in value functions for two instances with almost similar transition probabilities to ensure only the optimal policy isε\\varepsilon\-good for policy learning and that an outputQ𝒰Q^\{\\mathcal\{U\}\}cannot beε\\varepsilon\-good on both instances for value learning\. Then one has to lower bound the number of samples needed for the learner to correctly identify which instance the samples are from which is harder the more similar the instances are\. Finally, one has to combine the building blocks to get the dependence onSSandAAand then perform some parameter tuning to make the learning as hard as possible\.

### 6\.1Value Learning Lower Bounds

\{restatable\}

theoremLowerboundGenericValue Letu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be a utility function\. There existε¯\(u,Hγ\)\>0\\bar\{\\varepsilon\}\(u,H\_\{\\gamma\}\)\>0, constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0, and a strictly positive functionΦ:𝕌1×\[1,∞\)→\(0,∞\)\\Phi:\\mathbb\{U\}\_\{1\}\\times\[1,\\infty\)\\rightarrow\(0,\\infty\)such that for any RL algorithm𝒰\\mathcal\{U\}that outputs aQQ\-valueQ𝒰Q^\{\\mathcal\{U\}\}, anyδ∈\(0,14\)\\delta\\in\(0,\\frac\{1\}\{4\}\), andε∈\(0,ε¯\)\\varepsilon\\in\\big\(0,\\bar\{\\varepsilon\}\), it holds: if the total numberTTof samples satisfies

T≤SAΦ\(u,Hγ\)c1ε2log⁡\(SAc2δ\),\\displaystyle T\\leq\\frac\{SA\\Phi\(u,H\_\{\\gamma\}\)\}\{c\_\{1\}\\varepsilon^\{2\}\}\\log\\Big\(\\frac\{SA\}\{c\_\{2\}\\delta\}\\Big\),then there exists some MDPMMwithSSstates andAAactions for whichℙ\(‖QM∗−QT𝒰‖\>ε\)≥δ\\mathbb\{P\}\(\\\|Q^\{\*\}\_\{M\}\-Q\_\{T\}^\{\\mathcal\{U\}\}\\\|\>\\varepsilon\)\\geq\\delta\.

\{restatable\}

theoremLowerboundSpecificValue Letu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be a utility function for whichu\+′\(0\)<1<u−′\(0\)u\_\{\+\}^\{\\prime\}\(0\)<1<u^\{\\prime\}\_\{\-\}\(0\)\. There existε¯\(u,Hγ\)\>0\\bar\{\\varepsilon\}\(u,H\_\{\\gamma\}\)\>0and constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0such that for any RL algorithm𝒰\\mathcal\{U\}that outputs aQQ\-valueQ𝒰Q^\{\\mathcal\{U\}\}, anyδ∈\(0,14\)\\delta\\in\(0,\\frac\{1\}\{4\}\), andε∈\(0,ε¯\)\\varepsilon\\in\\big\(0,\\bar\{\\varepsilon\}\), the following holds: if the total numberTTof samples satisfies

T≤SAc1ε2log⁡\(SAc2δ\)\|u\(−Hγ\)\|2\(1−\[1−u\+′\(0\)u\+′\(−Hγ\)−u\+′\(0\)\]2\),\\displaystyle T\\leq\\frac\{SA\}\{c\_\{1\}\\varepsilon^\{2\}\}\\log\\Big\(\\frac\{SA\}\{c\_\{2\}\\delta\}\\Big\)\|u\(\-H\_\{\\gamma\}\)\|^\{2\}\\bigg\(1\-\\bigg\[\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\-u^\{\\prime\}\_\{\+\}\(0\)\}\\bigg\]^\{2\}\\bigg\),then there exists some MDPMMwithSSstates andAAactions for whichℙ\(‖QM∗−QT𝒰‖\>ε\)≥δ\\mathbb\{P\}\(\\\|Q^\{\*\}\_\{M\}\-Q\_\{T\}^\{\\mathcal\{U\}\}\\\|\>\\varepsilon\)\\geq\\delta\.

### 6\.2Policy Learning Lower Bounds

\{restatable\}

theoremLowerboundGenericPolicy Letu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be a utility function\. There existε¯\(u,Hγ\)\>0\\bar\{\\varepsilon\}\(u,H\_\{\\gamma\}\)\>0, a strictly positive functionΦ:𝕌1×\[1,∞\)→\(0,∞\)\\Phi:\\mathbb\{U\}\_\{1\}\\times\[1,\\infty\)\\rightarrow\(0,\\infty\)and constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0such that for any RL algorithm𝒰\\mathcal\{U\}that outputs a policyπ𝒰\\pi\_\{\\mathcal\{U\}\}, anyδ∈\(0,14\)\\delta\\in\(0,\\frac\{1\}\{4\}\), andε∈\(0,ε¯\)\\varepsilon\\in\(0,\\bar\{\\varepsilon\}\), we have: if the total numberTTof samples satisfies

T≤SAΦ\(u,Hγ\)c1ε2log⁡\(Sc2δ\),\\displaystyle T\\leq\\frac\{SA\\Phi\(u,H\_\{\\gamma\}\)\}\{c\_\{1\}\\varepsilon^\{2\}\}\\log\\Big\(\\frac\{S\}\{c\_\{2\}\\delta\}\\Big\),then there exists some MDPMMwithSSstates andAAactions for whichℙ\(‖VM∗−Vπ𝒰‖\>ε\)≥δ\\mathbb\{P\}\(\\\|V^\{\*\}\_\{M\}\-V^\{\\pi\_\{\\mathcal\{U\}\}\}\\\|\>\\varepsilon\)\\geq\\delta\.

\{restatable\}

theoremLowerboundSpecificPolicy Letu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be a utility function for whichu\+′\(0\)<1<u−′\(0\)u\_\{\+\}^\{\\prime\}\(0\)<1<u^\{\\prime\}\_\{\-\}\(0\)\. There existε¯\(u,Hγ\)\>0\\bar\{\\varepsilon\}\(u,H\_\{\\gamma\}\)\>0and constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0such that for any RL algorithm𝒰\\mathcal\{U\}that outputs a policyπ𝒰\\pi\_\{\\mathcal\{U\}\}, anyδ∈\(0,14\)\\delta\\in\(0,\\frac\{1\}\{4\}\), andε∈\(0,ε¯\)\\varepsilon\\in\\big\(0,\\bar\{\\varepsilon\}\), the following holds: if the total numberTTof samples satisfies

T≤SAc1ε2log⁡\(Sc2δ\)\|u\(−Hγ\)\|2\(1−\[1−u\+′\(0\)u\+′\(−Hγ\)−u\+′\(0\)\]2\),\\displaystyle T\\leq\\frac\{SA\}\{c\_\{1\}\\varepsilon^\{2\}\}\\log\\Big\(\\frac\{S\}\{c\_\{2\}\\delta\}\\Big\)\|u\(\-H\_\{\\gamma\}\)\|^\{2\}\\bigg\(1\-\\bigg\[\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\-u^\{\\prime\}\_\{\+\}\(0\)\}\\bigg\]^\{2\}\\bigg\),then there exists some MDPMMwithSSstates andAAactions for whichℙ\(‖VM∗−Vπ𝒰‖\>ε\)≥δ\\mathbb\{P\}\(\\\|V^\{\*\}\_\{M\}\-V^\{\\pi\_\{\\mathcal\{U\}\}\}\\\|\>\\varepsilon\)\\geq\\delta\.

### 6\.3Discussion of Bounds

The presented lower bounds establish the first sample complexity lower bounds for the family of OCEs, to our best knowledge\. For specific OCE measures, there are two relevant lower bounds in the literature: the one in\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\]derived for CVaR \(policy learning\), and the one in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\]for the entropic risk \(value and policy learning\)\. The lower bound in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\]has an exponential dependence onHγH\_\{\\gamma\}\. The dependence onHγH\_\{\\gamma\}in our lower bound is not straightforward, which makes the comparison difficult\. However, it could be that our bound is not as sharp as those in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\], but they hold for a much larger set of problems\. A comparison with the bound in\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\]is provided later\.

Theorems[6\.1](https://arxiv.org/html/2605.21763#S6.SS1)and[6\.2](https://arxiv.org/html/2605.21763#S6.SS2)establish that our upper bounds from Theorem[1](https://arxiv.org/html/2605.21763#Thmtheorem1)are tight inS,A,1ε,1δS,A,\\frac\{1\}\{\\varepsilon\},\\frac\{1\}\{\\delta\}but with a dependence onHγH\_\{\\gamma\}that is not interpretable\.

While the assumption thatu\+′\(0\)<1<u−′\(0\)u^\{\\prime\}\_\{\+\}\(0\)<1<u^\{\\prime\}\_\{\-\}\(0\)excludes entropic risk and the mean\-variance criterion, it is inclusive enough to include all finite strongly risk\-averse coherent risk\-measures, which by Theorem 3\.1 of\[[8](https://arxiv.org/html/2605.21763#bib.bib89)\]are exactly the ones for whichu=\{λ1tt≥0λ2tt<0u=\\begin\{cases\}\\lambda\_\{1\}t\\quad t\\geq 0\\\\ \\lambda\_\{2\}t\\quad t<0\\end\{cases\}for some0≤λ1<1<λ20\\leq\\lambda\_\{1\}<1<\\lambda\_\{2\}\. Furthermore, for any fixed finite strongly risk\-averse coherent risk\-measure, we obtain a lower bound scaling asΩ~\(SAε2\(1−γ\)2\)\\widetilde\{\\Omega\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{2\}\}\\Big\)\. The risk\-measureCVaRτ\\text\{CVaR\}\_\{\\tau\}is obtained by takingλ1=0,λ2=1τ\\lambda\_\{1\}=0,\\lambda\_\{2\}=\\frac\{1\}\{\\tau\}, yielding the following lower bound:

###### Corollary 1\.

ForCVaRτ\\mathrm\{CVaR\}\_\{\\tau\}, there exists constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0such that for any RL algorithm𝒰\\mathcal\{U\}that outputs a policyπ𝒰\\pi\_\{\\mathcal\{U\}\}, anyδ∈\(0,14\)\\delta\\in\(0,\\frac\{1\}\{4\}\), andε∈\(0,ε¯\)\\varepsilon\\in\(0,\{\\color\[rgb\]\{0,0,0\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0,0,0\}\\pgfsys@color@gray@stroke\{0\}\\pgfsys@color@gray@fill\{0\}\\bar\{\\varepsilon\}\}\), if the total numberTTof samples satisfies

T≤SAc1ε2\(1−γ\)2τ2log⁡\(Sc2δ\),\\displaystyle T\\leq\\frac\{SA\}\{c\_\{1\}\\varepsilon^\{2\}\(1\-\\gamma\)^\{2\}\\tau^\{2\}\}\\log\\Big\(\\frac\{S\}\{c\_\{2\}\\delta\}\\Big\),then there exists some MDPMMwithSSstates andAAactions for whichℙ\(‖VM∗−Vπ𝒰‖\>ε\)≥δ\\mathbb\{P\}\(\\\|V^\{\*\}\_\{M\}\-V^\{\\pi\_\{\\mathcal\{U\}\}\}\\\|\>\\varepsilon\)\\geq\\delta\.

Corollary[1](https://arxiv.org/html/2605.21763#Thmcorollary1)implies a sample complexity lower bound ofΩ~\(SAε2\(1−γ\)2τ2\)\\widetilde\{\\Omega\}\\Big\(\\frac\{SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{2\}\\tau^\{2\}\}\\Big\)for policy learning under CVaRτ\. This bound can be compared with the lower bound in\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\], scaling asΩ~\(\(1−γτ\)SAε2\(1−γ\)4τ\)\\widetilde\{\\Omega\}\\Big\(\\frac\{\(1\-\\gamma\\tau\)SA\}\{\\varepsilon^\{2\}\(1\-\\gamma\)^\{4\}\\tau\}\\Big\), which to our knowledge constitutes the best existing lower bound for policy learning under CVaR\. While the two bounds have the same dependence onSASAandlog⁡\(1/δ\)\\log\(1/\\delta\), they differ in terms of dependence onτ\\tauand the effective horizonHγH\_\{\\gamma\}\. Asτ≤1\\tau\\leq 1, our lower bound has a stronger dependence onτ\\tau, while that of\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\]has a better dependence on the effective horizon\. So neither bounds dominates the other uniformly\. The comparison depends on the relative scaling ofτ\\tauandHγH\_\{\\gamma\}\. Ifτ\\tauis sufficiently small \(e\.g\., relative toHγ2H\_\{\\gamma\}^\{2\}\), theτ\\tau\-dependence dominates and our lower bound has better overall scaling\. But for large horizon relative toτ\\tau, the lower bound of\[[18](https://arxiv.org/html/2605.21763#bib.bib69)\]becomes larger\.

## 7Concluding Remarks

We studied the PAC\-learnability and sample complexities of value learning and policy learning in finite discounted MDPs with a generative model, where the objective is defined recursively using a risk measure from the family of OCE measures\. We introduced a simple model\-based algorithm MB\-OCE\-VI and derived PAC bounds on the sample complexity when the utility function defining the OCE has full domain\. Next we show that the assumption of full domain is needed to obtain PAC\-bounds by proving that it is impossible to obtain worst\-case guarantee PAC\-bounds when the domain is not full\. We thus classify exactly which utility functions for which PAC bounds are possible\. Finally, we derive lower bounds for all utilities with full domain that demonstrate the tightness of our upper bounds inS,A,1ε,1δS,A,\\frac\{1\}\{\\varepsilon\},\\frac\{1\}\{\\delta\}but with a complicated dependence on11−γ\\frac\{1\}\{1\-\\gamma\}\. Finally, for a more restricted class of utilities that include all strongly risk\-averse risk measures, we give tighter lower bounds and show that they can outperform the best existing lower bound forCVaRτ\\text\{CVaR\}\_\{\\tau\}in the regime whereτ\\tauis very small\.

While we solve the learnability problem, we leave open some gaps in the effective horizon11−γ\\frac\{1\}\{1\-\\gamma\}\. Closing these gaps is left for future research and it is not immediately clear how much can be gained from improvements on the upper and lower bounds respectively but we conjecture that improvements can be made on both fronts\. Improving the upper bounds however seems to require new analytical techniques as the MB\-OCE\-VI algorithm whenOCEu=𝔼\\mathrm\{OCE\}^\{u\}=\\mathbb\{E\}is shown to be optimal in the classical setting\. Improving the lower bounds might require both new analytical techniques and a new hard\-to\-learn MDP construction\. Finally, we also believe it would be interesting to consider more complex RL settings such as offline RL where data is collected under a behaviour policy which is fixed but unknown or online RL where the data collection process is directly affected by the actions of learning agent\.

## Acknowledgments

The authors would like to acknowledge the support from Independent Research Fund Denmark, grant number 1026\-00397B\. Mohammad Sadegh Talebi was partially supported by Innovation Fund Denmark under Grant 1063\-00031B\.

## References

- \[1\]A\. Agarwal, S\. Kakade, and L\. F\. Yang\(2020\)Model\-based reinforcement learning with a generative model is minimax optimal\.InConference on Learning Theory,pp\. 67–83\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[2\]A\. Ahmadi\-Javid\(2012\)Entropic value\-at\-risk: a new coherent risk measure\.Journal of Optimization Theory and Applications155,pp\. 1105–1123\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[3\]H\. Asienkiewicz and A\. Jaśkiewicz\(2017\)A note on a new class of recursive utilities in Markov decision processes\.Applicationes Mathematicae44,pp\. 149–161\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[4\]N\. Bäuerle and A\. Glauner\(2022\)Markov decision processes with recursive risk measures\.European Journal of Operational Research296\(3\),pp\. 953–966\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[5\]N\. Bäuerle and A\. Jaśkiewicz\(2024\)Markov decision processes with risk\-sensitive criteria: An overview\.Mathematical Methods of Operations Research99\(1\),pp\. 141–178\.Cited by:[§2\.2](https://arxiv.org/html/2605.21763#S2.SS2.p2.25),[§2\.2](https://arxiv.org/html/2605.21763#S2.SS2.p2.8),[Remark 2](https://arxiv.org/html/2605.21763#Thmremark2.p1.1.1)\.
- \[6\]N\. Bäuerle and J\. Ott\(2011\)Markov decision processes with average\-value\-at\-risk criteria\.Mathematical Methods of Operations Research74,pp\. 361–379\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[7\]N\. Bäuerle and U\. Rieder\(2014\)More risk\-sensitive Markov decision processes\.Mathematics of Operations Research39\(1\),pp\. 105–120\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[8\]A\. Ben\-Tal and M\. Teboulle\(2007\)An old\-new concept of convex risk measures: the optimized certainty equivalent\.Mathematical Finance17\(3\),pp\. 449–476\.Cited by:[§C\.2](https://arxiv.org/html/2605.21763#A3.SS2.2.p2.5),[§1](https://arxiv.org/html/2605.21763#S1.p2.1),[§2\.1](https://arxiv.org/html/2605.21763#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.21763#S2.SS1.p3.11),[§6\.3](https://arxiv.org/html/2605.21763#S6.SS3.p3.6)\.
- \[9\]T\. R\. Bielecki and S\. R\. Pliska\(1999\)Risk\-sensitive dynamic asset management\.Applied Mathematics and Optimization39,pp\. 337–360\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[10\]L\. Bisi, D\. Santambrogio, F\. Sandrelli, A\. Tirinzoni, B\. D\. Ziebart, and M\. Restelli\(2022\)Risk\-averse policy optimization via risk\-neutral policy optimization\.Artificial Intelligence311,pp\. 103765\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[11\]V\. S\. Borkar and S\. P\. Meyn\(2002\)Risk\-sensitive optimal control for markov decision processes with monotone cost\.Mathematics of Operations Research27\(1\),pp\. 192–209\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.21763#S1.p1.1),[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[12\]V\. S\. Borkar\(2002\)Q\-learning for risk\-sensitive control\.Mathematics of operations research27\(2\),pp\. 294–311\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[13\]D\. Brown, S\. Niekum, and M\. Petrik\(2020\)Bayesian robust optimization for imitation learning\.Advances in Neural Information Processing Systems33,pp\. 2479–2491\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[14\]Y\. Chen, Y\. Du, P\. Hu, S\. Wang, D\. Wu, and L\. Huang\(2024\)Provably efficient iterated cvar reinforcement learning with function approximation and human feedback\.InThe Twelfth International Conference on Learning Representations,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[15\]Y\. Chow and M\. Ghavamzadeh\(2014\)Algorithms for cvar optimization in mdps\.Advances in neural information processing systems27\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[16\]E\. Delage and S\. Mannor\(2010\)Percentile optimization for Markov decision processes with parameter uncertainty\.Operations research58\(1\),pp\. 203–213\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[17\]M\. A\. H\. Dempster\(2002\)Risk management: value at risk and beyond\.Cambridge University Press\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[18\]Z\. Deng, S\. Khan, and S\. Zou\(2025\)Near\-optimal sample complexity for iterated CVaR reinforcement learning with a generative model\.InThe 28th International Conference on Artificial Intelligence and Statistics,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.21763#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.21763#S4.SS1.SSS0.Px1.p1.4),[§6\.3](https://arxiv.org/html/2605.21763#S6.SS3.p1.2),[§6\.3](https://arxiv.org/html/2605.21763#S6.SS3.p4.15)\.
- \[19\]Y\. Du, S\. Wang, and L\. Huang\(2023\)Provably efficient risk\-sensitive reinforcement learning: iterated CVaR and worst path\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[20\]D\. Ernst, G\. Stan, J\. Goncalves, and L\. Wehenkel\(2006\)Clinical data based optimal STI strategies for HIV: A reinforcement learning approach\.InProceedings of the 45th IEEE Conference on Decision and Control,pp\. 667–672\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[21\]Y\. Fei, Z\. Yang, Y\. Chen, Z\. Wang, and Q\. Xie\(2020\)Risk\-sensitive reinforcement learning: near\-optimal risk\-sample tradeoff in regret\.Advances in Neural Information Processing Systems33,pp\. 22384–22395\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[22\]Y\. Fei, Z\. Yang, Y\. Chen, and Z\. Wang\(2021\)Exponential Bellman equation and improved regret bounds for risk\-sensitive reinforcement learning\.Advances in neural information processing systems34,pp\. 20436–20446\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[23\]Y\. Fei, Z\. Yang, and Z\. Wang\(2021\)Risk\-sensitive reinforcement learning with function approximation: a debiasing approach\.InInternational Conference on Machine Learning,pp\. 3198–3207\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[24\]H\. Föllmer and A\. Schied\(2010\)Convex and coherent risk measures\.Encyclopedia of Quantitative Finance,pp\. 355–363\.Cited by:[Appendix A](https://arxiv.org/html/2605.21763#A1.SS0.SSS0.Px1.p1.4),[Appendix A](https://arxiv.org/html/2605.21763#A1.p1.1)\.
- \[25\]D\. K\. Ganguly, A\. G\. Joseph, S\. Girotra, and S\. SekharRisk\-seeking reinforcement learning via multi\-timescale evar optimization\.Transactions on Machine Learning Research\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[26\]M\. Gheshlaghi Azar, R\. Munos, and H\. J\. Kappen\(2013\)Minimax pac bounds on the sample complexity of reinforcement learning with a generative model\.Machine learning91,pp\. 325–349\.Cited by:[§E\.1\.1](https://arxiv.org/html/2605.21763#A5.SS1.SSS1.3.p3.6),[§E\.1\.2](https://arxiv.org/html/2605.21763#A5.SS1.SSS2.5.p5.3),[§E\.1](https://arxiv.org/html/2605.21763#A5.SS1.p1.1),[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[27\]J\. L\. Hau, E\. Delage, M\. Ghavamzadeh, and M\. Petrik\(2023\)On dynamic programming decompositions of static risk measures in Markov decision processes\.Advances in Neural Information Processing Systems36,pp\. 51734–51757\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[28\]J\. L\. Hau, M\. Petrik, and M\. Ghavamzadeh\(2023\)Entropic risk optimization in discounted mdps\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 47–76\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[29\]R\. A\. Howard and J\. E\. Matheson\(1972\)Risk\-sensitive Markov decision processes\.Management science18\(7\),pp\. 356–369\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[30\]X\. Hu and H\. Leung\(2023\)A tighter problem\-dependent regret bound for risk\-sensitive reinforcement learning\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 5411–5437\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[31\]Y\. Huang, Y\. Jia, and X\. Zhou\(2022\)Achieving mean–variance efficiency by continuous\-time reinforcement learning\.InProceedings of the Third ACM International Conference on AI in Finance,pp\. 377–385\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[32\]S\. C\. Jaquette\(1976\)A utility criterion for markov decision processes\.Management Science23\(1\),pp\. 43–49\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[33\]Y\. Jin, I\. Karmarkar, A\. Sidford, and J\. Wang\(2024\)Truncated variance reduced value iteration\.Advances in Neural Information Processing Systems37,pp\. 117481–117508\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[34\]S\. M\. Kakade\(2003\)On the sample complexity of reinforcement learning\.University of London, University College London \(United Kingdom\)\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[35\]D\. Kamran, C\. F\. Lopez, M\. Lauer, and C\. Stiller\(2020\)Risk\-aware high\-level decisions for automated driving at occluded intersections with reinforcement learning\.In2020 IEEE Intelligent Vehicles Symposium \(IV\),pp\. 1205–1212\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[36\]M\. Kearns and S\. Singh\(1998\)Finite\-sample convergence rates for Q\-learning and indirect algorithms\.Advances in neural information processing systems11\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[37\]N\. Khajonchotpanya, Y\. Xue, and N\. Rujeerapaiboon\(2021\)A revised approach for risk\-averse multi\-armed bandits under CVaR criterion\.Operations Research Letters49\(4\),pp\. 465–472\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[38\]P\. La and M\. Ghavamzadeh\(2013\)Actor\-critic algorithms for risk\-sensitive MDPs\.Advances in neural information processing systems26\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[39\]T\. Lam, A\. Verma, B\. K\. H\. Low, and P\. Jaillet\(2022\)Risk\-aware reinforcement learning with coherent risk measures and non\-linear function approximation\.InThe Eleventh International Conference on Learning Representations,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1),[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1)\.
- \[40\]T\. Lattimore and M\. Hutter\(2014\)Near\-optimal PAC bounds for discounted MDPs\.Theoretical Computer Science558,pp\. 125–143\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[41\]J\. H\. Lee, B\. Saglam, S\. Pougkakiotis, A\. Karbasi, and D\. Kalogerias\(2025\)Risk\-averse constrained reinforcement learning with optimized certainty equivalents\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1)\.
- \[42\]D\. Li and W\. Ng\(2000\)Optimal dynamic portfolio selection: multiperiod mean\-variance formulation\.Mathematical finance10\(3\),pp\. 387–406\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[43\]G\. Li, L\. Shi, Y\. Chen, Y\. Chi, and Y\. Wei\(2024\)Settling the sample complexity of model\-based offline reinforcement learning\.The Annals of Statistics52\(1\),pp\. 233–260\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[44\]G\. Li, Y\. Wei, Y\. Chi, and Y\. Chen\(2024\)Breaking the sample size barrier in model\-based reinforcement learning with a generative model\.Operations Research72\(1\),pp\. 203–221\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[45\]G\. Li, Y\. Wei, Y\. Chi, Y\. Gu, and Y\. Chen\(2020\)Breaking the sample size barrier in model\-based reinforcement learning with a generative model\.Advances in neural information processing systems33,pp\. 12861–12872\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[46\]H\. Liang and Z\. Luo\(2024\)Regret bounds for risk\-sensitive reinforcement learning with lipschitz dynamic risk measures\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 1774–1782\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[47\]O\. Maillard\(2013\)Robust risk\-averse stochastic multi\-armed bandits\.InAlgorithmic Learning Theory: 24th International Conference, ALT 2013, Singapore, October 6\-9, 2013\. Proceedings 24,pp\. 218–233\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[48\]A\. Marthe, S\. Bounan, A\. Garivier, and C\. Vernade\(2025\)Efficient risk\-sensitive planning via entropic risk measures\.arXiv preprint arXiv:2502\.20423\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[49\]A\. Marthe, A\. Garivier, and C\. Vernade\(2023\)Beyond average return in Markov decision processes\.Advances in Neural Information Processing Systems36,pp\. 56488–56507\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[50\]O\. Mihatsch and R\. Neuneier\(2002\)Risk\-sensitive reinforcement learning\.Machine learning49\(2\),pp\. 267–290\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[51\]M\. Moharrami, Y\. Murthy, A\. Roy, and R\. Srikant\(2025\)A policy gradient algorithm for the risk\-sensitive exponential cost mdp\.Mathematics of operations research50\(1\),pp\. 431–458\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[52\]O\. Mortensen and M\. S\. Talebi\(2025\)Recursive entropic risk optimization in discounted mdps: sample complexity bounds with a generative model\.arXiv preprint arXiv:2506\.00286\.Cited by:[§E\.1\.1](https://arxiv.org/html/2605.21763#A5.SS1.SSS1.3.p3.6),[§E\.1\.2](https://arxiv.org/html/2605.21763#A5.SS1.SSS2.5.p5.3),[§E\.1](https://arxiv.org/html/2605.21763#A5.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.21763#S4.SS1.SSS0.Px2.p1.6),[§6\.3](https://arxiv.org/html/2605.21763#S6.SS3.p1.2),[§6](https://arxiv.org/html/2605.21763#S6.p2.1)\.
- \[53\]X\. Ni and L\. Lai\(2022\)Risk\-sensitive reinforcement learning via Entropic\-VaR optimization\.In2022 56th Asilomar Conference on Signals, Systems, and Computers,pp\. 953–959\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[54\]M\. Petrik and D\. Subramanian\(2012\)An approximate solution method for large risk\-averse markov decision processes\.InConference on Uncertainty in Artificial Intelligence,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1)\.
- \[55\]P\. Rashidinejad, B\. Zhu, C\. Ma, J\. Jiao, and S\. Russell\(2021\)Bridging offline reinforcement learning and imitation learning: A tale of pessimism\.Advances in Neural Information Processing Systems34,pp\. 11702–11716\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[56\]M\. Rigter, B\. Lacerda, and N\. Hawes\(2023\)One risk to rule them all: a risk\-sensitive perspective on model\-based offline reinforcement learning\.Advances in neural information processing systems36,pp\. 77520–77545\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1)\.
- \[57\]A\. Sani, A\. Lazaric, and R\. Munos\(2012\)Risk\-aversion in multi\-armed bandits\.Advances in neural information processing systems25\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[58\]M\. G\. Scutella and R\. Recchia\(2013\)Robust portfolio asset allocation and risk measures\.Annals of Operations Research204\(1\),pp\. 145–169\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[59\]A\. Shapiro, D\. Dentcheva, and A\. Ruszczynski\(2021\)Lectures on stochastic programming: Modeling and theory\.SIAM\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[60\]A\. Sidford, M\. Wang, X\. Wu, and Y\. Ye\(2018\)Variance reduced value iteration and faster algorithms for solving Markov decision processes\.InProceedings of the Twenty\-Ninth Annual ACM\-SIAM Symposium on Discrete Algorithms,pp\. 770–787\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[61\]S\. P\. Singh and R\. C\. Yee\(1994\)An upper bound on the loss from approximate optimal\-value functions\.Machine Learning16\(3\),pp\. 227–233\.Cited by:[§C\.1](https://arxiv.org/html/2605.21763#A3.SS1.p1.5)\.
- \[62\]S\. Sood, K\. Papasotiriou, M\. Vaiciulis, and T\. Balch\(2023\)Deep reinforcement learning for optimal portfolio allocation: a comparative study with mean\-variance optimization\.FinPlan2023\(2023\),pp\. 21\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p1.1)\.
- \[63\]A\. L\. Strehl and M\. L\. Littman\(2008\)An analysis of model\-based interval estimation for Markov decision processes\.Journal of Computer and System Sciences74\(8\),pp\. 1309–1331\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[64\]R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§1](https://arxiv.org/html/2605.21763#S1.p1.1)\.
- \[65\]A\. Tamar, Y\. Chow, M\. Ghavamzadeh, and S\. Mannor\(2015\)Policy gradient for coherent risk measures\.Advances in neural information processing systems28\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1)\.
- \[66\]K\. Wang, D\. Liang, N\. Kallus, and W\. Sun\(2025\)A reductions approach to risk\-sensitive reinforcement learning with optimized certainty equivalents\.InForty\-second International Conference on Machine Learning,Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[67\]M\. Wang\(2020\)Randomized linear programming solves the Markov decision problem in nearly linear \(sometimes sublinear\) time\.Mathematics of Operations Research45\(2\),pp\. 517–546\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px1.p1.4)\.
- \[68\]W\. Xu, X\. Gao, and X\. He\(2023\)Regret bounds for markov decision processes with recursive optimized certainty equivalents\.InInternational Conference on Machine Learning,pp\. 38400–38427\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2605.21763#S1.p3.1)\.
- \[69\]Y\. Zhao, J\. E\. Escamill, W\. Lu, and H\. Wang\(2024\)RA\-PbRL: provably efficient risk\-aware preference\-based reinforcement learning\.Advances in Neural Information Processing Systems37,pp\. 60835–60871\.Cited by:[§1\.2](https://arxiv.org/html/2605.21763#S1.SS2.SSS0.Px2.p2.1)\.

## Appendix ARisk Measures

In this section, we briefly introduce risk measures\. See, e\.g\.,\[[24](https://arxiv.org/html/2605.21763#bib.bib90)\]for a good reference that like us model stochastic outcomes as rewards\. We here collect some precise definitions for the reward setting and list some important examples\.

Let\(Ω,ℱ,ℙ\)\(\\Omega,\\mathcal\{F\},\\mathbb\{P\}\)be a background probability space, andℳ\\mathcal\{M\}a convex cone of random variables defined on the background space\. That is, for anyX,Y∈ℳX,Y\\in\\mathcal\{M\}andλ\>0\\lambda\>0, it holds thatX\+Y∈ℳX\+Y\\in\\mathcal\{M\}andλX∈ℳ\\lambda X\\in\\mathcal\{M\}\.

###### Definition 2\(Risk measure\)\.

A functionalψ:ℳ→ℝ\\psi:\\mathcal\{M\\rightarrow\\mathbb\{R\}\}is a*risk measure*if it satisfies the following properties:

ψ\(0\)=0,\\displaystyle\\psi\(0\)=0,\\qquad\(Normalization\)ifX≤Ythenψ\(X\)≥ψ\(Y\),\\displaystyle\\text\{if \}X\\leq Y\\text\{ then \}\\psi\(X\)\\geq\\psi\(Y\),\\qquad\(Monotonicity\)ψ\(X\+c\)=ψ\(X\)−c,∀c∈ℝ\.\\displaystyle\\psi\(X\+c\)=\\psi\(X\)\-c,\\quad\\forall c\\in\\mathbb\{R\}\.\\qquad\(Translation invariance\)If, in addition,ψ\\psisatisfies the properties

ψ\(cX\)=cψ\(X\),∀c\>0,\\displaystyle\\psi\(cX\)=c\\psi\(X\),\\quad\\forall c\>0,\\qquad\(Positive homogeneity\)ψ\(X\+Y\)≤ψ\(X\)\+ψ\(Y\),\\displaystyle\\psi\(X\+Y\)\\leq\\psi\(X\)\+\\psi\(Y\),\\qquad\(Sub\-additivity\)it is called a*coherent risk measure*\. A weaker notion is*convex risk measure*, which is one obeying

ψ\(λX\+\(1−λ\)Y\)≤λψ\(X\)\+\(1−λ\)ψ\(Y\),∀λ∈\[0,1\]\.\\displaystyle\\psi\(\\lambda X\+\(1\-\\lambda\)Y\)\\leq\\lambda\\psi\(X\)\+\(1\-\\lambda\)\\psi\(Y\),\\quad\\forall\\lambda\\in\[0,1\]\\,\.\\qquad\(Convexity\)Finally, a risk\-measureψ\\psiis called*law\-invariant*ifψ\(X\)\\psi\(X\)only depends on the distribution ofXXunderℙ\\mathbb\{P\}\.

We now mention some examples of risk measures\.

##### Entropic Risk Measure \(ERM\)\.

The risk measure given by

ERMβ\(X\)=1βlog⁡\(𝔼\[e−βX\]\)\\displaystyle\\text\{ERM\}\_\{\\beta\}\(X\)=\\frac\{1\}\{\\beta\}\\log\\big\(\\mathbb\{E\}\[e^\{\-\\beta X\}\]\\big\)is known as the entropic risk measure \(ERM\) with parameterβ≠0\\beta\\neq 0\. Notably, ERM is not coherent \(see, e\.g\.,\[[24](https://arxiv.org/html/2605.21763#bib.bib90)\]\) as it is does not satisfy the positive homogeneity property\. Lettingβ→0\\beta\\rightarrow 0one recovers the expectation𝔼\[X\]\\mathbb\{E\}\[X\], and lettingβ→∞\\beta\\rightarrow\\inftyyields the essential infimum risk measure\.

##### Value\-at\-Risk \(VaR\)\.

The risk measure given by

VaRτ\(X\):=qτ\(X\):=inf\{x∈ℝ:FX\(x\)≥τ\}\\displaystyle\\text\{VaR\}\_\{\\tau\}\(X\):=q\_\{\\tau\}\(X\):=\\inf\\\{x\\in\\mathbb\{R\}:F\_\{X\}\(x\)\\geq\\tau\\\}is called the Value\-at\-Risk \(VaR\) at levelτ∈\(0,1\)\\tau\\in\(0,1\)\. VaR is in general not sub\-additive, and hence also not coherent\.

##### Conditional Value\-at\-Risk \(CVaR\)\.

The risk measure given by

CVaRτ\(X\):=1τ∫0τVaRu\(X\)𝑑u\\displaystyle\\text\{CVaR\}\_\{\\tau\}\(X\):=\\frac\{1\}\{\\tau\}\\int\_\{0\}^\{\\tau\}\\text\{VaR\}\_\{u\}\(X\)duis known as the Conditional Value\-at\-Risk \(CVaR\), or sometimes as the expected shortfall \(ES\)\. It is known to be a coherent risk\-measure\.

##### Mean\-variance criterion

The risk\-measureMV\(X\):=12Var\(X\)−𝔼\[X\]\\text\{MV\}\(X\):=\\frac\{1\}\{2\}\\text\{Var\}\(X\)\-\\mathbb\{E\}\[X\]is called the mean\-variance criterion risk measure and is one\-way to explicitly using the variance to account for risk\.

##### Essential infimum\.

The functional−Essinf\(X\)\-\\text\{Essinf\}\(X\)is a risk\-measure\. It can reasonable be said to be the most risk\-averse risk measure as it only takes the worst\-case outcome of a random variable into consideration and ignores all other distributional aspects of the random variable\. It can be obtained aslimβ→∞ERMβ\(X\)\\lim\_\{\\beta\\rightarrow\\infty\}\\text\{ERM\}\_\{\\beta\}\(X\)andlimτ→0VaRτ\(X\)\\lim\_\{\\tau\\rightarrow 0\}\\text\{VaR\}\_\{\\tau\}\(X\)

## Appendix BConvergence of UVI

###### Lemma 1\.

The operator𝒯\\mathcal\{T\}is aγ\\gamma\-contraction with respect to∥⋅∥∞\\\|\\cdot\\\|\_\{\\infty\}\.

###### Proof\.

Consider two mapsQ:ℝS×A→ℝS×AQ:\\mathbb\{R\}^\{S\\times A\}\\rightarrow\\mathbb\{R\}^\{S\\times A\}andW:ℝS×A→ℝS×AW:\\mathbb\{R\}^\{S\\times A\}\\rightarrow\\mathbb\{R\}^\{S\\times A\}, and letQ′=𝒯QQ^\{\\prime\}=\\mathcal\{T\}QandW′=𝒯WW^\{\\prime\}=\\mathcal\{T\}Wbe their respective𝒯\\mathcal\{T\}\-transforms\. Let\(s,a\)\(s,a\)be any pair such that\|Q′\(s,a\)−W′\(s,a\)\|=‖Q′−W′‖∞\|Q^\{\\prime\}\(s,a\)\-W^\{\\prime\}\(s,a\)\|=\\\|Q^\{\\prime\}\-W^\{\\prime\}\\\|\_\{\\infty\}, and assume without loss of generality thatQ′\(s,a\)≥W′\(s,a\)Q^\{\\prime\}\(s,a\)\\geq W^\{\\prime\}\(s,a\)\. Further, define

V\(s\):=maxa⁡Q\(s,a\),X\(s\):=maxa⁡W\(s,a\)\.\\displaystyle V\(s\):=\\max\_\{a\}Q\(s,a\)\\,,\\qquad\\ X\(s\):=\\max\_\{a\}W\(s,a\)\\,\.Then by monotonicity and consistency ofρ\\rhoit holds that

‖Q′−W′‖\\displaystyle\\\|Q^\{\\prime\}\-W^\{\\prime\}\\\|=Q′\(s,a\)−W′\(s,a\)\\displaystyle=Q^\{\\prime\}\(s,a\)\-W^\{\\prime\}\(s,a\)=γ\(ρs,a\(V\(s′\)\)−ρs,a\(X\(s′\)\)\)\\displaystyle=\\gamma\\Big\(\\rho\_\{s,a\}\(V\(s^\{\\prime\}\)\)\-\\rho\_\{s,a\}\(X\(s^\{\\prime\}\)\)\\Big\)=γ\(ρs,a\(X\(s′\)\+V\(s′\)−X\(s′\)−ρs,a\(X\(s′\)\)\)\\displaystyle=\\gamma\\Big\(\\rho\_\{s,a\}\(X\(s^\{\\prime\}\)\+V\(s^\{\\prime\}\)\-X\(s^\{\\prime\}\)\-\\rho\_\{s,a\}\(X\(s^\{\\prime\}\)\)\\Big\)≤γ\(ρs,a\(X\(s′\)\+‖V−X‖\)−ρs,a\(X\(s′\)\)\)\\displaystyle\\leq\\gamma\\Big\(\\rho\_\{s,a\}\(X\(s^\{\\prime\}\)\+\\\|V\-X\\\|\)\-\\rho\_\{s,a\}\(X\(s^\{\\prime\}\)\)\\Big\)=γ‖V−X‖\\displaystyle=\\gamma\\\|V\-X\\\|≤γ‖Q−W‖\.\\displaystyle\\leq\\gamma\\\|Q\-W\\\|\.∎

From this fact we get the following convergence guarantee on OCE\-VI \(Algorithm[2](https://arxiv.org/html/2605.21763#algorithm2)\):

###### Lemma 2\.

Ifk≥log⁡\(12ε\(1−γ\)\)/log⁡\(1/γ\)k\\geq\\log\\big\(\\frac\{1\}\{2\\varepsilon\(1\-\\gamma\)\}\\big\)/\\log\(1/\\gamma\), then‖Q∗−Qk‖≤ε\\\|Q^\{\*\}\-Q\_\{k\}\\\|\\leq\\varepsilon\.

###### Proof\.

Since𝒯\\mathcal\{T\}is aγ\\gamma\-contraction, we have that

‖Qk−Q∗‖\\displaystyle\\\|Q\_\{k\}\-Q^\{\*\}\\\|=‖𝒯Qk−1−𝒯Q∗‖≤γ‖Qk−1−Q∗‖,\\displaystyle=\\\|\\mathcal\{T\}Q\_\{k\-1\}\-\\mathcal\{T\}Q^\{\*\}\\\|\\leq\\gamma\\\|Q\_\{k\-1\}\-Q^\{\*\}\\\|\\,,from which it follows that‖Q0−Q∗‖≤γk‖Q0−Q∗‖\\\|Q\_\{0\}\-Q^\{\*\}\\\|\\leq\\gamma^\{k\}\\\|Q\_\{0\}\-Q^\{\*\}\\\|\. Furthermore, by definition ofQ0Q\_\{0\}, it holds that‖Q0−Q∗‖≤12\(1−γ\)\\\|Q\_\{0\}\-Q^\{\*\}\\\|\\leq\\frac\{1\}\{2\(1\-\\gamma\)\}\. Solvingγk2\(1−γ\)<ε\\frac\{\\gamma^\{k\}\}\{2\(1\-\\gamma\)\}<\\varepsilonforγ\\gammayields the announced result\. ∎

## Appendix CMissing Lemmas for Upper Bounds

### C\.1Proof of Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)

We first prove a result that bounds the value of a greedy policy by the value\-function for which the policy is greedy\. The result is a generalization of\[[61](https://arxiv.org/html/2605.21763#bib.bib83)\]to general risk measures\. We use the notationρs,a\(V\(s′\)\)\\rho\_\{s,a\}\(V\(s^\{\\prime\}\)\)as shorthand forρ\\rhoapplied to the categorical random variableXXwhich takes values in the set\{V\(s′\)\}s′∈𝒮\\\{V\(s^\{\\prime\}\)\\\}\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}with probabilities given byℙ\(X=V\(s′\)\)=P\(s′\|s,a\)\\mathbb\{P\}\(X=V\(s^\{\\prime\}\)\)=P\(s^\{\\prime\}\|s,a\)\.

###### Proof\.

Lets¯\\bar\{s\}be a state such that‖V∗−VG‖=V∗\(s¯\)−VG\(s¯\)\\\|V^\{\*\}\-V^\{G\}\\\|=V^\{\*\}\(\\bar\{s\}\)\-V^\{G\}\(\\bar\{s\}\), whereVG:=VπGV^\{G\}:=V^\{\\pi\_\{G\}\}\. We then consider the two actionsa∗:=π∗\(s¯\)a^\{\*\}:=\\pi^\{\*\}\(\\bar\{s\}\)andaG:=πG\(s¯\)a^\{G\}:=\\pi^\{G\}\(\\bar\{s\}\); ties can be broken arbitrarily\. SinceπG\\pi^\{G\}is greedy with respect toV¯\\overline\{V\}, we have that

R\(s¯,a∗\)\+γρs¯,a∗\(V¯\(s′\)\)≤R\(s¯,aG\)\+γρs¯,aG\(V¯\(s′\)\)\.\\displaystyle R\(\\bar\{s\},a^\{\*\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(\\overline\{V\}\(s^\{\\prime\}\)\)\\leq R\(\\bar\{s\},a^\{G\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{G\}\}\(\\overline\{V\}\(s^\{\\prime\}\)\)\\,\.By assumption, it holds for anys∈𝒮s\\in\\mathcal\{S\}that

V∗\(s\)−ε≤V¯\(s\)≤V∗\(s\)\+ε\.\\displaystyle V^\{\*\}\(s\)\-\\varepsilon\\leq\\overline\{V\}\(s\)\\leq V^\{\*\}\(s\)\+\\varepsilon\.Sinceρ\\rhois monotone and translation invariant it follows that

R\(s¯,a∗\)\+γρs¯,a∗\(V¯\(s′\)\)\\displaystyle R\(\\bar\{s\},a^\{\*\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(\\overline\{V\}\(s^\{\\prime\}\)\)≥R\(s¯,a∗\)\+γρs¯,a∗\(V∗\(s′\)−ε\)\\displaystyle\\geq R\(\\bar\{s\},a^\{\*\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(V^\{\*\}\(s^\{\\prime\}\)\-\\varepsilon\)=R\(s¯,a∗\)\+γρs¯,a∗\(V∗\(s′\)\)−γε,\\displaystyle=R\(\\bar\{s\},a^\{\*\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(V^\{\*\}\(s^\{\\prime\}\)\)\-\\gamma\\varepsilon\\,,and by a similar argument

R\(s¯,aG\)\+γρs¯,aG\(V¯\(s′\)\)≤R\(s¯,aG\)\+γρs¯,aG\(V∗\(s′\)\)\+γε,\\displaystyle R\(\\bar\{s\},a^\{G\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{G\}\}\(\\overline\{V\}\(s^\{\\prime\}\)\)\\leq R\(\\bar\{s\},a^\{G\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{\*\}\(s^\{\\prime\}\)\)\+\\gamma\\varepsilon,yielding the inequality

R\(s¯,a∗\)−R\(s¯,aG\)≤2γε\+γ\(ρs¯,aG\(V∗\(s′\)\)−ρs¯,a∗\(V∗\(s′\)\)\.\\displaystyle R\(\\bar\{s\},a^\{\*\}\)\-R\(\\bar\{s\},a^\{G\}\)\\leq 2\\gamma\\varepsilon\+\\gamma\\big\(\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{\*\}\(s^\{\\prime\}\)\)\-\\rho\_\{\\bar\{s\},a^\{\*\}\}\(V^\{\*\}\(s^\{\\prime\}\)\\big\)\\,\.Combining the previous inequalities we finally see that

V∗\(s¯\)−VG\(s¯\)\\displaystyle V^\{\*\}\(\\bar\{s\}\)\-V^\{G\}\(\\bar\{s\}\)=R\(s¯,a∗\)−R\(s¯,aG\)\+γρs¯,a∗\(V∗\(s′\)\)−γρs¯,aG\(VG\(s′\)\)\\displaystyle=R\(\\bar\{s\},a^\{\*\}\)\-R\(\\bar\{s\},a^\{G\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(V^\{\*\}\(s^\{\\prime\}\)\)\-\\gamma\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{G\}\(s^\{\\prime\}\)\)≤2γε\+γρs¯,aG\(V∗\(s′\)\)−γρs¯,a∗\(V∗\(s′\)\+γρs¯,a∗\(V∗\(s′\)\)−γρs¯,aG\(VG\(s′\)\)\\displaystyle\\leq 2\\gamma\\varepsilon\+\\gamma\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{\*\}\(s^\{\\prime\}\)\)\-\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(V^\{\*\}\(s^\{\\prime\}\)\+\\gamma\\rho\_\{\\bar\{s\},a^\{\*\}\}\(V^\{\*\}\(s^\{\\prime\}\)\)\-\\gamma\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{G\}\(s^\{\\prime\}\)\)=2γε\+γ\(ρs¯,aG\(V∗\(s′\)−ρs¯,aG\(VG\(s′\)\)\)\\displaystyle=2\\gamma\\varepsilon\+\\gamma\\big\(\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{\*\}\(s^\{\\prime\}\)\-\\rho\_\{\\bar\{s\},a^\{G\}\}\(V^\{G\}\(s^\{\\prime\}\)\)\\big\)=2γε\+γ‖V∗−VG‖,\\displaystyle=2\\gamma\\varepsilon\+\\gamma\\\|V^\{\*\}\-V^\{G\}\\\|,from which the result follows\. ∎

### C\.2Proof of Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)

In the following proof, we use the following convention\. We write\[V\|P\]\[V\|P\]for the random variable taking values given by the vectorVVwith probability distributionPP\.

###### Proof\.

We suppressπ\\pifrom the notation since it is fixed throughout\.

Let\(s,a\)\(s,a\)be a state\-action pair such that‖Q−Q~‖=\|Q\(s,a\)−Q~\(s,a\)\|\\\|Q\-\\widetilde\{Q\}\\\|=\|Q\(s,a\)\-\\widetilde\{Q\}\(s,a\)\|and assume thatQ\(s,a\)≥Q~\(s,a\)\.Q\(s,a\)\\geq\\widetilde\{Q\}\(s,a\)\.Note that from monotonicity and consistency ofρ\\rhoit follows that

ρ\(\[V~\|P~\(⋅\|s,a\)\]\)\\displaystyle\\rho\(\[\\widetilde\{V\}\|\\widetilde\{P\}\(\\cdot\|s,a\)\]\)=ρ\(\[V\+V~−V\|P~s,a\]\)\\displaystyle=\\rho\(\[V\+\\widetilde\{V\}\-V\|\\widetilde\{P\}\_\{s,a\}\]\)\(3\)≥ρ\(\[V−‖V−V~‖∞\|P~s,a\]\)\\displaystyle\\geq\\rho\(\[V\-\\\|V\-\\widetilde\{V\}\\\|\_\{\\infty\}\|\\widetilde\{P\}\_\{s,a\}\]\)\(4\)=−‖V−V~‖\+ρ\(\[V\|P~s,a\]\)\\displaystyle=\-\\\|V\-\\widetilde\{V\}\\\|\+\\rho\(\[V\|\\widetilde\{P\}\_\{s,a\}\]\)\(5\)which along with the Bellman recursion implies

‖Q−Q~‖\\displaystyle\\\|Q\-\\widetilde\{Q\}\\\|=R\(s,a\)\+γρ\(\[V\|Ps,a\(⋅\)\]\)−R\(s,a\)−γρ\(\[V~\|P~s,a\]\)\\displaystyle=R\(s,a\)\+\\gamma\\rho\(\[V\|P\{s,a\}\(\\cdot\)\]\)\-R\(s,a\)\-\\gamma\\rho\(\[\\widetilde\{V\}\|\\widetilde\{P\}\_\{s,a\}\]\)\(6\)≤γ‖V−V~‖\+γ\(ρ\(\[V\|Ps,a\(⋅\)\]\)−ρ\(\[V\|P~s,a\]\)\)\\displaystyle\\leq\\gamma\\\|V\-\\widetilde\{V\}\\\|\+\\gamma\\big\(\\rho\(\[V\|P\_\{s,a\}\(\\cdot\)\]\)\-\\rho\(\[V\|\\widetilde\{P\}\_\{s,a\}\]\)\\big\)\(7\)which then implies

‖Q−Q~‖≤γ1−γ\(ρ\(\[V\|Ps,a\]\)−ρ\(\[V\|P~s,a\]\)\)\\displaystyle\\\|Q\-\\widetilde\{Q\}\\\|\\leq\\frac\{\\gamma\}\{1\-\\gamma\}\\big\(\\rho\(\[V\|P\_\{s,a\}\]\)\-\\rho\(\[V\|\\widetilde\{P\}\_\{s,a\}\]\)\\big\)\(8\)Using that the OCE is given by the solution to an optimization problem and using Proposition 2\.1 in\[[8](https://arxiv.org/html/2605.21763#bib.bib89)\]showing that it suffices to optimize over the intervalη∈\[0,Hγ\]\\eta\\in\[0,H\_\{\\gamma\}\], we then have

ρ\(\[V\|Ps,a\(⋅\)\]\)−ρ\(\[V\|P~\(⋅\)\]\)\\displaystyle\\rho\(\[V\|P\_\{s,a\}\(\\cdot\)\]\)\-\\rho\(\[V\|\\widetilde\{P\}\(\\cdot\)\]\)=supη∈\[0,Hγ\]\{η\+∑s′∈SPs,a\(s′\)\[u\(V\(s′\)−η\)\]\}\\displaystyle=\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\\{\\eta\+\\sum\_\{s^\{\\prime\}\\in S\}P\_\{s,a\}\(s^\{\\prime\}\)\[u\(V\(s^\{\\prime\}\)\-\\eta\)\]\\\}\(9\)−supη∈\[0,Hγ\]\{η\+∑s′∈SP~s,a\(s′\)\[u\(V\(s′\)−η\)\]\}\\displaystyle\-\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\\{\\eta\+\\sum\_\{s^\{\\prime\}\\in S\}\\widetilde\{P\}\_\{s,a\}\(s^\{\\prime\}\)\[u\(V\(s^\{\\prime\}\)\-\\eta\)\]\\\}\(10\)≤supη∈\[0,Hγ\]∑s′∈S\[Ps,a\(s′\)−P~s,a\(s′\)\]u\(V\(s′\)−η\)\\displaystyle\\leq\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\sum\_\{s^\{\\prime\}\\in S\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widetilde\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V\(s^\{\\prime\}\)\-\\eta\)\\\(11\)≤supη∈\[0,Hγ\]\|∑s′∈S\[Ps,a\(s′\)−P~s,a\(s′\)\]u\(V\(s′\)−η\)\|\\displaystyle\\leq\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\bigg\|\\sum\_\{s^\{\\prime\}\\in S\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widetilde\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V\(s^\{\\prime\}\)\-\\eta\)\\ \\bigg\|\(12\)thus concluding this case\.

The proof for the caseQ~\(s,a\)<Q\(s,a\)\\widetilde\{Q\}\(s,a\)<Q\(s,a\)is similar, but instead uses the fact that

ρ\(\[V~\|P~s,a\(⋅\)\]\)\\displaystyle\\rho\(\[\\widetilde\{V\}\|\\widetilde\{P\}\_\{s,a\}\(\\cdot\)\]\)=ρ\(\[V~−V\+V\|P~s,a\(⋅\)\]\)≤‖V−V~‖\+ρ\(\[V\|P~s,a\(⋅\)\]\),\\displaystyle=\\rho\(\[\\widetilde\{V\}\-V\+V\|\\widetilde\{P\}\_\{s,a\}\(\\cdot\)\]\)\\leq\\\|V\-\\widetilde\{V\}\\\|\+\\rho\(\[V\|\\widetilde\{P\}\_\{s,a\}\(\\cdot\)\]\),from which we obtain

‖Q−Q~‖\\displaystyle\\\|Q\-\\widetilde\{Q\}\\\|=γ\(ρ\(\[V~\|P~\]\)−ρ\(\[V\|P\]\)\)\\displaystyle=\\gamma\\big\(\\rho\(\[\\widetilde\{V\}\|\\widetilde\{P\}\]\)\-\\rho\(\[V\|P\]\)\\big\)\(13\)≤γ‖V−V~‖\+γ\(ρ\(\[V\|P~\]\)−ρ\(\[V\|P\]\)\)\.\\displaystyle\\leq\\gamma\\\|V\-\\widetilde\{V\}\\\|\+\\gamma\\big\(\\rho\(\[V\|\\widetilde\{P\}\]\)\-\\rho\(\[V\|P\]\)\\big\)\.\(14\)The proof of this case follows by noting that

ρ\(\[V\|P~\]\)−ρ\(\[V\|P\]\)\\displaystyle\\rho\(\[V\|\\widetilde\{P\}\]\)\-\\rho\(\[V\|P\]\)≤supη∈\[0,Hγ\]∑s′∈S\[P~s,a\(s′\)−Ps,a\(s′\)\]u\(V\(s′\)−η\)\\displaystyle\\leq\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\sum\_\{s^\{\\prime\}\\in S\}\[\\widetilde\{P\}\_\{s,a\}\(s^\{\\prime\}\)\-\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V\(s^\{\\prime\}\)\-\\eta\)\\supη∈\[0,Hγ\]\|∑s′∈S\[Ps,a\(s′\)−P~\(s′\)\]u\(V\(s′\)−η\)\|\.\\displaystyle\\sup\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\bigg\|\\sum\_\{s^\{\\prime\}\\in S\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widetilde\{P\}\(s^\{\\prime\}\)\]u\(V\(s^\{\\prime\}\)\-\\eta\)\\ \\bigg\|\.∎

### C\.3Concentration via Hoeffding’s Inequality

The next result is an application of Hoeffding’s inequality to establish a bound on the number of samples needed for the expression inside the supremum in Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)for a fixedη\\etato concentrate\. LetP^s,a=1N∑n=1N𝟙\{Xn=s′\}\\widehat\{P\}\_\{s,a\}=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\mathbbmss\{1\}\_\{\\\{X\_\{n\}=s^\{\\prime\}\\\}\}whereXnX\_\{n\}is sampled from\{1,…,S\}\\\{1,\\ldots,S\\\}with probabilities according toPs,aP\_\{s,a\}\.

###### Lemma 3\(Hoeffding bound\)\.

Fixη∈\[0,Hγ\]\\eta\\in\[0,H\_\{\\gamma\}\]and\(s,a\)∈𝒮×𝒜\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}, and letV∈ℝSV\\in\\mathbb\{R\}^\{S\}\. If the numberNNof samples from the state\-action pair\(s,a\)\(s,a\)satisfiesN≥2ε2\[u\(−Hγ\)\]2log⁡\(2/δ\)N\\geq\\frac\{2\}\{\\varepsilon^\{2\}\}\\big\[u\(\-H\_\{\\gamma\}\)\\big\]^\{2\}\\log\(2/\\delta\), then with probability exceeding1−δ1\-\\delta,

\|∑s′\[Ps,a\(s′\)−P^s,a\(s′\)\]u\(V\(s′\)−η\)\|≤ε\.\\displaystyle\\bigg\|\\sum\_\{s^\{\\prime\}\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V\(s^\{\\prime\}\)\-\\eta\)\\bigg\|\\leq\\varepsilon\\,\.

###### Proof\.

We first observe that for the random variable∑s′∈𝒮𝟙\{Xn=s′\}u\(V\(s′\)−η\)\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}\\mathbbmss\{1\}\_\{\\\{X\_\{n\}=s^\{\\prime\}\\\}\}u\(V\(s^\{\\prime\}\)\-\\eta\), we have that

𝔼\[∑s′𝟙\{Xn=s′\}u\(V\(s′\)−η\)\]\\displaystyle\\mathbb\{E\}\\bigg\[\\sum\_\{s^\{\\prime\}\}\\mathbbmss\{1\}\_\{\\\{X\_\{n\}=s^\{\\prime\}\\\}\}u\(V\(s^\{\\prime\}\)\-\\eta\)\\bigg\]=∑s′𝔼\[𝟙\{Xn=s′\}\]u\(V\(s′\)−η\)\\displaystyle=\\sum\_\{s^\{\\prime\}\}\\mathbb\{E\}\\big\[\\mathbbmss\{1\}\_\{\\\{X\_\{n\}=s^\{\\prime\}\\\}\}\\big\]u\(V\(s^\{\\prime\}\)\-\\eta\)=∑s′P\(s′\|s,a\)u\(V\(s′\)−η\)\\displaystyle=\\sum\_\{s^\{\\prime\}\}P\(s^\{\\prime\}\|s,a\)u\(V\(s^\{\\prime\}\)\-\\eta\)and that it is bounded in\[u\(−Hγ\),u\(Hγ\)\]⊂\[u\(−Hγ\),−u\(−Hγ\)\]\[u\(\-H\_\{\\gamma\}\),u\(H\_\{\\gamma\}\)\]\\subset\[u\(\-H\_\{\\gamma\}\),\-u\(\-H\_\{\\gamma\}\)\]\. Also, since

∑s′P^s,a\(s′\)u\(V\(s′\)−η\)=1N∑n=1N∑s′𝟙\{Xn=s′\}u\(V\(s′\)−η\),\\displaystyle\\sum\_\{s^\{\\prime\}\}\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)u\(V\(s^\{\\prime\}\)\-\\eta\)=\\frac\{1\}\{N\}\\sum\_\{n=1\}^\{N\}\\sum\_\{s^\{\\prime\}\}\\mathbbmss\{1\}\_\{\\\{X\_\{n\}=s^\{\\prime\}\\\}\}u\(V\(s^\{\\prime\}\)\-\\eta\),we have by Hoeffding’s inequality that

ℙ\(\|∑s′\[Ps,a\(s′\)−P^s,a\(s′\)\]u\(V\(s′\)−η\)\|≥ε\)≤2exp⁡\(−2Nε2\(−2u\(−Hγ\)\)2\),\\displaystyle\\mathbb\{P\}\\bigg\(\\bigg\|\\sum\_\{s^\{\\prime\}\}\[P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\]u\(V\(s^\{\\prime\}\)\-\\eta\)\\bigg\|\\geq\\varepsilon\\bigg\)\\leq 2\\exp\\bigg\(\-\\frac\{2N\\varepsilon^\{2\}\}\{\\big\(\-2u\(\-H\_\{\\gamma\}\)\)^\{2\}\}\\bigg\),with the right\-hand side being smaller thanδ\\deltaifN≥2ε2\[u\(−Hγ\)\]2log⁡\(2/δ\)N\\geq\\frac\{2\}\{\\varepsilon^\{2\}\}\\big\[u\(\-H\_\{\\gamma\}\)\\big\]^\{2\}\\log\(2/\\delta\)\. ∎

### C\.4Proof of Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)

The proof of Lemma[4\.2](https://arxiv.org/html/2605.21763#S4.SS2)follows from the following two lemmas \(Lemma[C\.4](https://arxiv.org/html/2605.21763#A3.SS4)and Lemma[C\.4](https://arxiv.org/html/2605.21763#A3.SS4)\)\.

\{restatable\}

lemmaBoundingSimulationTerm LetV∈ℝSV\\in\\mathbb\{R\}^\{S\}\. Forη∗,η¯∈\[0,Hγ\]\\eta^\{\*\},\\bar\{\\eta\}\\in\[0,H\_\{\\gamma\}\], it holds that

maxs,a\|∑s′\(Ps,a\(s′\)−\\displaystyle\\max\_\{s,a\}\\bigg\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-P^s,a\(s′\)\)u\(V\(s′\)−η∗\)\|\\displaystyle\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\\bigg\|≤maxs,a⁡\|∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η¯\)\|\+u\+′\(−Hγ\)\|η∗−η¯\|\.\\displaystyle\\leq\\max\_\{s,a\}\\bigg\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\\bigg\|\+u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|\\,\.
###### Proof\.

Sinceuuis increasing and concave, we have forx,y∈\[−Hγ,Hγ\]x,y\\in\[\-H\_\{\\gamma\},H\_\{\\gamma\}\]wherex≤yx\\leq ythat

u\(y\)\\displaystyle u\(y\)≤u\(x\)\+u\+′\(x\)\(y−x\)≤u\+′\(−Hγ\)\(y−x\),\\displaystyle\\leq u\(x\)\+u^\{\\prime\}\_\{\+\}\(x\)\(y\-x\)\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\(y\-x\)\\,,and so

0≤u\(y\)−u\(x\)≤u\+′\(−Hγ\)\(y−x\)\.\\displaystyle 0\\leq u\(y\)\-u\(x\)\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\(y\-x\)\.Similarly, ify≤xy\\leq x, it holds that

0≤u\(x\)−u\(y\)≤u\+′\(−Hγ\)\(x−y\)\.\\displaystyle 0\\leq u\(x\)\-u\(y\)\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\(x\-y\)\.Combining these, the conclusion follows\. Plugging inx=V\(s′\)−η∗x=V\(s^\{\\prime\}\)\-\\eta^\{\*\}andy=V\(s′\)−η¯y=V\(s^\{\\prime\}\)\-\\bar\{\\eta\}, we thus get

\|u\(V\(s′\)−η∗\)−u\(V\(s′\)−η¯\)\|≤u\+′\(−Hγ\)\|η∗−η¯\|\.\\displaystyle\|u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\-u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\|\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|\\,\.\(15\)Next we notice that

∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η∗\)\\displaystyle\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)=∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η¯\)\\displaystyle=\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\+∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)\[u\(V\(s′\)−η∗\)−u\(V\(s′\)−η¯\)\],\\displaystyle\+\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)\\big\[u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\-u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\\big\]\\,,and since

−u\+′\(−Hγ\)\|η∗−η¯\|≤∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)\[u\(V\(s′\)−η∗\)−u\(V\(s′\)−η¯\)\]≤u\+′\(−Hγ\)\|η∗−η¯\|,\\displaystyle\-u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|\\leq\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)\\big\[u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\-u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\\big\]\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|\\,,we obtain that

\|∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η∗\)−∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η¯\)\|≤u\+′\(−Hγ\)\|η∗−η¯\|\.\\displaystyle\\bigg\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\-\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\\bigg\|\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|\\,\.Finally, by the reverse triangle inequality we have

\|\|∑s′\\displaystyle\\bigg\|\\big\|\\sum\_\{s^\{\\prime\}\}\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η∗\)\|−\|∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η¯\)\|\|\\displaystyle\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\\big\|\-\\big\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\\big\|\\bigg\|≤\|∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η∗\)−∑s′\(Ps,a\(s′\)−P^s,a\(s′\)\)u\(V\(s′\)−η¯\)\|\\displaystyle\\leq\\bigg\|\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\eta^\{\*\}\)\-\\sum\_\{s^\{\\prime\}\}\(P\_\{s,a\}\(s^\{\\prime\}\)\-\\widehat\{P\}\_\{s,a\}\(s^\{\\prime\}\)\)u\(V\(s^\{\\prime\}\)\-\\bar\{\\eta\}\)\\bigg\|≤u\+′\(−Hγ\)\|η∗−η¯\|\\displaystyle\\leq u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|for any\(s,a\)∈𝒮×𝒜\(s,a\)\\in\\mathcal\{S\}\\times\\mathcal\{A\}\. The result follows by taking maximum on both sides\. ∎

\{restatable\}

lemmaBoundingDerivative If the setDDof discretization points satisfies\|D\|≥u\+′\(−Hγ\)ε\(1−γ\)\|D\|\\geq\\frac\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}\{\\varepsilon\(1\-\\gamma\)\}, it holds that

minη¯∈D⁡u\+′\(−Hγ\)\|η∗−η¯\|<ε2\.\\displaystyle\\min\_\{\\bar\{\\eta\}\\in D\}u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\|\\eta^\{\*\}\-\\bar\{\\eta\}\|<\\frac\{\\varepsilon\}\{2\}\\,\.
###### Proof\.

For any interval\[0,L\]\[0,L\]of lengthLLthat is discretized equidistantly by\|D\|\|D\|points, the interval length between any two discretization points isI=L\|D\|\+1I=\\frac\{L\}\{\|D\|\+1\}and thus for any pointx∈\[0,L\]x\\in\[0,L\]its distance to its nearest discretization point isI2=L2\(\|D\|\+1\)\\frac\{I\}\{2\}=\\frac\{L\}\{2\(\|D\|\+1\)\}\. To ensure that the distance of anyx∈\[0,L\]x\\in\[0,L\]to its nearest point is less thanξ\>0\\xi\>0solving forDDshows that it suffices thatD≥L2ξD\\geq\\frac\{L\}\{2\\xi\}\. Plugging inL=HγL=H\_\{\\gamma\}andξ=ε2u\+′\(−Hγ\)\\xi=\\frac\{\\varepsilon\}\{2u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\}the result follows\. ∎

## Appendix DProof of Theorem[2](https://arxiv.org/html/2605.21763#Thmtheorem2)

###### Proof\.

We consider a class of MDPs with33statess0,sG,sBs\_\{0\},s^\{G\},s^\{B\}and22actionsa1a\_\{1\}anda2a\_\{2\}\. We will index the MDPs in𝕄\\mathbb\{M\}by\{1,2\}×\(0,1\)\\\{1,2\\\}\\times\(0,1\)and writeMipM\_\{i\}^\{p\}fori∈\{1,2\}i\\in\\\{1,2\\\}andp∈\(0,1\)p\\in\(0,1\)\. The statesGs^\{G\}is absorbing under any action and yields a reward of 1 under any action\. The statesBs^\{B\}is absorbing under any action and yields zero reward under any action\. The states0s\_\{0\}yields zero reward under any action and under MDPMjp∈𝕄M\_\{j\}^\{p\}\\in\\mathbb\{M\}we haveℙj,p\(sG\|s0,aj\)=p\\mathbb\{P\}\_\{j,p\}\(s^\{G\}\|s\_\{0\},a\_\{j\}\)=p,ℙj,p\(sB\|s0,aj\)=1−p\\mathbb\{P\}\_\{j,p\}\(s^\{B\}\|s\_\{0\},a\_\{j\}\)=1\-pandℙj,p\(sG\|s0,a−\)=1\\mathbb\{P\}\_\{j,p\}\(s^\{G\}\|s\_\{0\},a^\{\-\}\)=1wherea−a^\{\-\}denotes the other action that is notaja\_\{j\}\.

By the exact same argument as in Theorem[2](https://arxiv.org/html/2605.21763#Thmtheorem2)by pickingγ\\gammaso large thatHγ\>ξ=:−infdom\(u\)H\_\{\\gamma\}\>\\xi=:\-\\inf\\text\{dom\}\(u\)we get on any memberM∈𝕄M\\in\\mathbb\{M\}the following lower bound on the value\-gap:V∗\(s0\)−Va−\(s0\)\>γ\(Hγ−ξ\)=:Δ\>0V^\{\*\}\(s\_\{0\}\)\-V^\{a^\{\-\}\}\(s\_\{0\}\)\>\\gamma\(H\_\{\\gamma\}\-\\xi\)=:\\Delta\>0between the optimal actiona∗a^\{\*\}and the other actiona−a^\{\-\}\. Pickingε<Δ\\varepsilon<\\Deltaonly the optimal policies areε\\varepsilon\-good\.

The next part of the argument is again a likelihood\-ratio type argument that ifppis close enough to11the samples needed to tell two MDPs apart will also need to be very large\. Let𝒰\\mathcal\{U\}be an algorithm that picks an actionaabasedn1n\_\{1\}tries of actiona1a\_\{1\}andn2n\_\{2\}tries of actiona2a\_\{2\}where the total number of samplesN=n1\+n2N=n\_\{1\}\+n\_\{2\}\. LetLi\(m1,m2,n1,n2\)L\_\{i\}\(m\_\{1\},m\_\{2\},n\_\{1\},n\_\{2\}\)denote the probability of observingm1m\_\{1\}successes by tryinga1a\_\{1\}forn1n\_\{1\}times, andm2m\_\{2\}successes by tryinga2a\_\{2\}forn2n\_\{2\}times under hypothesisHiH\_\{i\}\. It is clear to see that

L1\(m1,m2,n1,n2\)=pm1𝟙\[m2=n2\],L2\(m1,m2,n1,n2\)=pm2𝟙\[m1=n1\]\\displaystyle L\_\{1\}\(m\_\{1\},m\_\{2\},n\_\{1\},n\_\{2\}\)=p^\{m\_\{1\}\}\\mathbbmss\{1\}\_\{\[m\_\{2\}=n\_\{2\}\]\},\\qquad L\_\{2\}\(m\_\{1\},m\_\{2\},n\_\{1\},n\_\{2\}\)=p^\{m\_\{2\}\}\\mathbbmss\{1\}\_\{\[m\_\{1\}=n\_\{1\}\]\}We want to evaluate this likelihood\-ratio on the event that all tries turn out to be successes, that is on the eventℰ:=\{m1=n1andm2=n2\},\\mathcal\{E\}:=\\\{m\_\{1\}=n\_\{1\}\\text\{ and \}m\_\{2\}=n\_\{2\}\\\},where clearlyℙ1,p\(ℰ\)=pn1\\mathbb\{P\}\_\{1,p\}\(\\mathcal\{E\}\)=p^\{n\_\{1\}\}\. Assumingδ\>12\\delta\>\\frac\{1\}\{2\}, we observe that

ℙ1,p\(B2\)=𝔼1\[𝟙B2\]=𝔼2\[L1\(n1,n2,n1,n2\)L2\(n1,n2,n1,n2\)𝟙ℰ𝟙B2\]≥pn1ℙ2,p\(B2\)\>12pn1≥12pN\.\\displaystyle\\mathbb\{P\}\_\{1,p\}\(B\_\{2\}\)=\\mathbb\{E\}\_\{1\}\[\\mathbbmss\{1\}\_\{B\_\{2\}\}\]=\\mathbb\{E\}\_\{2\}\\bigg\[\\frac\{L\_\{1\}\(n\_\{1\},n\_\{2\},n\_\{1\},n\_\{2\}\)\}\{L\_\{2\}\(n\_\{1\},n\_\{2\},n\_\{1\},n\_\{2\}\)\}\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\\mathbbmss\{1\}\_\{B\_\{2\}\}\\bigg\]\\geq p^\{n\_\{1\}\}\\mathbb\{P\}\_\{2,p\}\(B\_\{2\}\)\>\\frac\{1\}\{2\}p^\{n\_\{1\}\}\\geq\\frac\{1\}\{2\}p^\{N\}\\,\.Solving12pN=δ\\frac\{1\}\{2\}p^\{N\}=\\delta, we find that ifN≤log⁡\(1/2δ\)log⁡\(1/p\)N\\leq\\frac\{\\log\(1/2\\delta\)\}\{\\log\(1/p\)\}, the algorithm that is\(ε,δ\)\(\\varepsilon,\\delta\)\-correct onM2pM\_\{2\}^\{p\}cannot also be\(ε,δ\)\(\\varepsilon,\\delta\)\-correct onM1pM\_\{1\}^\{p\}\. Finally, by taking the limitp→1p\\rightarrow 1the result follows\. ∎

## Appendix ELower Bounds

In this section, we provide two template constructions for lower bounds for value and policy learning and then prove the lower bounds\.

### E\.1Hard\-to\-learn MDP constructions

In this section we describe the template constructions for the hard\-to\-learn MDPs for value learning and policy learning\. The constructions and proof techniques borrow from\[[26](https://arxiv.org/html/2605.21763#bib.bib30)\]and\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\]\.

#### E\.1\.1Value Learning

For a lower bound we construct the following class of MDPs withS′:=S\+2S^\{\\prime\}:=S\+2states andAAactions where the first states are labeledS1,,…,sS,sG,sBS\_\{1\},,\.\.\.,s\_\{S\},s^\{G\},s^\{B\}and the actions are labeleda1,…,aAa\_\{1\},\.\.\.,a\_\{A\}\. The statessGs^\{G\}andsBs^\{B\}are absorbing under any actions andR\(sG,a\)=1R\(s^\{G\},a\)=1for alljjandR\(sB,a\)=0R\(s^\{B\},a\)=0for alla∈Aa\\in A\. For the statess∈\{s1,…,sS\}s\\in\\\{s\_\{1\},\.\.\.,s\_\{S\}\\\}, we have thatR\(s,a\)=0R\(s,a\)=0for alla∈Aa\\in A\. We haveSASAstate\-action pair combinations from\{s1,…,sS\}×A=:Z\\\{s\_\{1\},\.\.\.,s\_\{S\}\\\}\\times A=:Zon which we assume some ordering allowing us to writezi,i∈\[SA\]z\_\{i\},i\\in\[SA\]\. Finally for all state\-action pairszi∈\[SA\]z\_\{i\}\\in\[SA\]we haveP\(sG\|zi\)=qiP\(s^\{G\}\|z\_\{i\}\)=q\_\{i\}andP\(sB\|zi\)=1−qiP\(s^\{B\}\|z\_\{i\}\)=1\-q\_\{i\}for someqi∈\[0,1\]q\_\{i\}\\in\[0,1\]\. The structure of this class of MDPs allows us to get lower bounds on the samples needed to learn theQQ\-value of each state\-action pairziz\_\{i\}and then use the fact that samples used to learn theQQ\-values for different state\-action pairs bring no information on each other to get the final bound\.

In this class of MDPs, we will usually for eachzi∈Zz\_\{i\}\\in Zconsider two MDPsM0M\_\{0\}whereqi=pq\_\{i\}=pandM1M\_\{1\}whereqi=p\+αq\_\{i\}=p\+\\alphawherep∈\(12,1\)p\\in\(\\frac\{1\}\{2\},1\)andα∈\(0,1−p5\)\\alpha\\in\(0,\\frac\{1\-p\}\{5\}\)and denote the corresponding optimal value functions byQ0∗Q^\{\*\}\_\{0\}andQ1∗Q^\{\*\}\_\{1\}\. Furthermore we

###### Theorem 3\.

Assumeγ≥12\\gamma\\geq\\frac\{1\}\{2\},δ<14\\delta<\\frac\{1\}\{4\}andu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}\. If there existsp∈\(12,1\)p\\in\(\\frac\{1\}\{2\},1\)andΔ\>0\\Delta\>0such that for anya∈\(0,1−p5\)a\\in\(0,\\frac\{1\-p\}\{5\}\)and anyzi∈Zz\_\{i\}\\in Zit holds that

Q1∗\(z\)−Q0∗\(z\)≥γαΔ\\displaystyle Q\_\{1\}^\{\*\}\(z\)\-Q\_\{0\}^\{\*\}\(z\)\\geq\\gamma\\alpha\\Delta\(16\)then there existsε¯\(u,Hγ\)\\bar\{\\varepsilon\}\(u,H\_\{\\gamma\}\)and constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0such that if the total number of samples is less than

T≤1c1SAΔ2p\(1−p\)ε2log⁡\(SAc2δ\)\\displaystyle T\\leq\\frac\{1\}\{c\_\{1\}\}\\frac\{SA\\Delta^\{2\}p\(1\-p\)\}\{\\varepsilon^\{2\}\}\\log\(\\frac\{SA\}\{c\_\{2\}\\delta\}\)\(17\)for any algorithm𝒰\\mathcal\{U\}and anyε∈\(0,ε¯\)\\varepsilon\\in\(0,\\bar\{\\varepsilon\}\), there exists some MDPMi∈𝕄M\_\{i\}\\in\\mathbb\{M\}whereℙ\(‖Qm∗−Q𝒰‖\>ε\)\>δ\\mathbb\{P\}\(\\\|Q^\{\*\}\_\{m\}\-Q^\{\\mathcal\{U\}\}\\\|\>\\varepsilon\)\>\\delta\.

###### Proof\.

By assumption we can pickp∈\(12,1\)p\\in\(\\frac\{1\}\{2\},1\)andΔ\>0\\Delta\>0such that forz∈Zz\\in Zit holds that

Q1∗\(s,a\)−Q0∗\(s,a\)≥γαΔ≥αΔ2\.\\displaystyle Q\_\{1\}^\{\*\}\(s,a\)\-Q\_\{0\}^\{\*\}\(s,a\)\\geq\\gamma\\alpha\\Delta\\geq\\frac\{\\alpha\\Delta\}\{2\}\.\(18\)Hence, for anyε<Δ\(1−p\)20\\varepsilon<\\frac\{\\Delta\(1\-p\)\}\{20\}we can pickα=4εΔ\\alpha=\\frac\{4\\varepsilon\}\{\\Delta\}to ensure

Q1∗\(z\)−Q0∗\(z\)≥2ε\\displaystyle Q\_\{1\}^\{\*\}\(z\)\-Q^\{\*\}\_\{0\}\(z\)\\geq 2\\varepsilonand so no outputQ𝒰Q^\{\\mathcal\{U\}\}can beε\\varepsilon\-close to bothQ1∗Q^\{\*\}\_\{1\}andQ2∗Q^\{\*\}\_\{2\}simultaneously and therefore the two setsB1:=\{\|Q1∗−Q𝒰\|≤ε\}B\_\{1\}:=\\\{\|Q\_\{1\}^\{\*\}\-Q^\{\\mathcal\{U\}\}\|\\leq\\varepsilon\\\}andB0:=\{\|Q0∗−Q𝒰\|≤ε\}B\_\{0\}:=\\\{\|Q\_\{0\}^\{\*\}\-Q^\{\\mathcal\{U\}\}\|\\leq\\varepsilon\\\}are disjoint\.

Letttbe the number of times the algorithm triesziz\_\{i\}\. Since𝒰\\mathcal\{U\}is\(ε,δ\)\(\\varepsilon,\\delta\)\-correct it holds thatℙ0\(B0\)≥1−δ≥34\\mathbb\{P\}\_\{0\}\(B\_\{0\}\)\\geq 1\-\\delta\\geq\\frac\{3\}\{4\}\.

Letkkbe the number of transitions fromziz\_\{i\}tosGs^\{G\}in the t trials\. We then defineθ\\thetaby

θ:=exp⁡\(−32α2tp\(1−p\)\)\\displaystyle\\theta:=\\exp\\Big\(\-\\frac\{32\\alpha^\{2\}t\}\{p\(1\-p\)\}\\Big\)and the event

ℰ\\displaystyle\\mathcal\{E\}=\{pt−k≤2p\(1−p\)tlog⁡\(82θ\)\},\\displaystyle=\\bigg\\\{pt\-k\\leq\\sqrt\{2p\(1\-p\)t\\log\(\\frac\{8\}\{2\\theta\}\)\}\\bigg\\\},for which, we haveℙ0\(ℰ\)\>34\\mathbb\{P\}\_\{0\}\(\\mathcal\{E\}\)\>\\frac\{3\}\{4\}by Lemma 16 in\[[26](https://arxiv.org/html/2605.21763#bib.bib30)\]and thusℙ0\(B0∩ℰ\)\>12\\mathbb\{P\}\_\{0\}\(B\_\{0\}\\cap\\mathcal\{E\}\)\>\\frac\{1\}\{2\}\. Now by Theorem 9 in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\], we get that

ℙ1\(B0\)≥ℙ1\(B0∩ℰ\)=𝔼1\[𝟙ℰ𝟙B0\]=𝔼0\[L1L0𝟙ℰ𝟙B0\]≥θ4𝔼0\[𝟙ℰ𝟙B0\]=θ4ℙ0\(ℰ∩B0\)≥θ8\.\\displaystyle\\mathbb\{P\}\_\{1\}\(B\_\{0\}\)\\geq\\mathbb\{P\}\_\{1\}\(B\_\{0\}\\cap\\mathcal\{E\}\)=\\mathbb\{E\}\_\{1\}\[\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\\mathbbmss\{1\}\_\{B\_\{0\}\}\]=\\mathbb\{E\}\_\{0\}\\bigg\[\\frac\{L\_\{1\}\}\{L\_\{0\}\}\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\\mathbbmss\{1\}\_\{B\_\{0\}\}\\bigg\]\\geq\\frac\{\\theta\}\{4\}\\mathbb\{E\}\_\{0\}\[\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\\mathbbmss\{1\}\_\{B\_\{0\}\}\]=\\frac\{\\theta\}\{4\}\\mathbb\{P\}\_\{0\}\(\\mathcal\{E\}\\cap B\_\{0\}\)\\geq\\frac\{\\theta\}\{8\}\\,\.Solving forttinθ8\>δ\\frac\{\\theta\}\{8\}\>\\deltawe find

t<p\(1−p\)32α2log⁡\(18δ\)=Δ2p\(1−p\)512ε2log⁡\(18δ\)\\displaystyle t<\\frac\{p\(1\-p\)\}\{32\\alpha^\{2\}\}\\log\(\\frac\{1\}\{8\\delta\}\)=\\frac\{\\Delta^\{2\}p\(1\-p\)\}\{512\\varepsilon^\{2\}\}\\log\(\\frac\{1\}\{8\\delta\}\)
Since alsoB0⊂B1cB\_\{0\}\\subset B\_\{1\}^\{c\}we conclude that if the algorithm𝒰\\mathcal\{U\}tries the state\-action pairziz\_\{i\}less than

T~\(ε,δ\):=Δ2p\(1−p\)512ε2log⁡\(18δ\)\\displaystyle\\widetilde\{T\}\(\\varepsilon,\\delta\):=\\frac\{\\Delta^\{2\}p\(1\-p\)\}\{512\\varepsilon^\{2\}\}\\log\(\\frac\{1\}\{8\\delta\}\)times under the hypothesisH0iH\_\{0\}^\{i\}, thenℙ1\(B1∁\)\>δ\\mathbb\{P\}\_\{1\}\(B\_\{1\}^\{\\complement\}\)\>\\delta

Letn:=SAn:=SA\. If the total number of transition samples is less thann2T~\(ε,δ\)\\frac\{n\}\{2\}\\widetilde\{T\}\(\\varepsilon,\\delta\)there has to be at leastn/2n/2state\-action pairsziz\_\{i\}that has been tried at mostT~\(ε,δ\)\\widetilde\{T\}\(\\varepsilon,\\delta\)times which we might assume are the state\-action pairs\{zi\}i=1n/2\\\{z\_\{i\}\\\}\_\{i=1\}^\{n/2\}without loss of generalirt\.

LetTiT\_\{i\}be the number of times the algorithm has triedziz\_\{i\}fori≤n/2i\\leq n/2Due to the structure of the MDPs in𝕄\\mathbb\{M\}it suffices to only consider algorithms that outputs an estimate ofQTi𝒰Q^\{\\mathcal\{U\}\}\_\{T\_\{i\}\}based on samples fromziz\_\{i\}since any other samples cannot possibly yield information onQ∗\(zi\)Q^\{\*\}\(z\_\{i\}\)\.

By defining the eventsΛi:=\{\|QM1∗\(zi\)−QTi𝒰\(zi\)\|\>ε\}\\Lambda\_\{i\}:=\\\{\|Q\_\{M\_\{1\}\}^\{\*\}\(z\_\{i\}\)\-Q^\{\\mathcal\{U\}\}\_\{T\_\{i\}\}\(z\_\{i\}\)\|\>\\varepsilon\\\}we therefore have thatΛi\\Lambda\_\{i\}andΛj\\Lambda\_\{j\}are conditionally independent givenTiT\_\{i\}andTjT\_\{j\}\. We then have

ℙ1\(\{Λic\}1≤i≤n/2\\displaystyle\\mathbb\{P\}\_\{1\}\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq n/2\}∩\{Ti≤T~\(ε,δ\)\}1≤i≤n/2\)\\displaystyle\\cap\\\{T\_\{i\}\\leq\\widetilde\{T\}\(\\varepsilon,\\delta\)\\\}\_\{1\\leq i\\leq n/2\}\)=∑t1=0T~\(ε,δ\)…∑tn/2=0T~\(ε,δ\)ℙ1\(\{Ti=ti\}1≤i≤n/2\)ℙ1\(\{Λic\}1≤i≤n/2∩\{Ti=ti\}1≤i≤n/2\)\\displaystyle=\\sum\_\{t\_\{1\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\dots\\sum\_\{t\_\{n/2\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\mathbb\{P\}\_\{1\}\(\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq n/2\}\)\\mathbb\{P\}\_\{1\}\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq n/2\}\\cap\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq n/2\}\)=∑t1=0T~\(ε,δ\)…∑tn/2=0T~\(ε,δ\)ℙ1\(\{Ti=ti\}1≤i≤n/2\)∏1≤i≤n/2ℙ1\(Λic∩\{Ti=ti\}\)\\displaystyle=\\sum\_\{t\_\{1\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\dots\\sum\_\{t\_\{n/2\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\mathbb\{P\}\_\{1\}\(\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq n/2\}\)\\prod\_\{1\\leq i\\leq n/2\}\\mathbb\{P\}\_\{1\}\(\\Lambda\_\{i\}^\{c\}\\cap\\\{T\_\{i\}=t\_\{i\}\\\}\)=∑t1=0T~\(ε,δ\)…∑tn/2=0T~\(ε,δ\)ℙ1\(\{Ti=ti\}1≤i≤n/2\)\(1−δ\)n/2,\\displaystyle=\\sum\_\{t\_\{1\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\dots\\sum\_\{t\_\{n/2\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\mathbb\{P\}\_\{1\}\(\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq n/2\}\)\(1\-\\delta\)^\{n/2\},where we have used the law of total probability from line one to two and from two to three follows from independence\. It now follows directly that

ℙ1\(\{Λic\}1≤i≤n/2\|\{Ti≤T~\(ε,δ\)\}1≤i≤n/2\)≤\(1−δ\)n2\.\\displaystyle\\mathbb\{P\}\_\{1\}\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq n/2\}\|\\\{T\_\{i\}\\leq\\widetilde\{T\}\(\\varepsilon,\\delta\)\\\}\_\{1\\leq i\\leq n/2\}\)\\leq\(1\-\\delta\)^\{\\frac\{n\}\{2\}\}\\,\.Therefore, if the total number of transitionsTTis less thann2T~\(ε,δ\)\\frac\{n\}\{2\}\\widetilde\{T\}\(\\varepsilon,\\delta\), then

ℙ1\(‖Q∗−QT𝒰‖\>ε\)\\displaystyle\\mathbb\{P\}\_\{1\}\(\\\|Q^\{\*\}\-Q^\{\\mathcal\{U\}\}\_\{T\}\\\|\>\\varepsilon\)≥ℙ1\(⋃z∈S×AΛ\(z\)\)\\displaystyle\\geq\\mathbb\{P\}\_\{1\}\\bigg\(\\bigcup\_\{z\\in S\\times A\}\\Lambda\(z\)\\bigg\)=1−ℙ1\(⋂1≤i≤n/2Λic\)\\displaystyle=1\-\\mathbb\{P\}\_\{1\}\\bigg\(\\bigcap\_\{1\\leq i\\leq n/2\}\\Lambda\_\{i\}^\{c\}\\bigg\)≥1−ℙ1\(\{Λic\}1≤i≤n/2\|\{Tzi≤T~\(ε,δ\)\}1≤i≤n/2\)\\displaystyle\\geq 1\-\\mathbb\{P\}\_\{1\}\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq n/2\}\|\\\{T\_\{z\_\{i\}\}\\leq\\widetilde\{T\}\(\\varepsilon,\\delta\)\\\}\_\{1\\leq i\\leq n/2\}\)≥1−\(1−δ\)n/2\\displaystyle\\geq 1\-\(1\-\\delta\)^\{n/2\}≥δn4\.\\displaystyle\\geq\\frac\{\\delta n\}\{4\}\\,\.By settingδ′=δn4\\delta^\{\\prime\}=\\delta\\frac\{n\}\{4\}and substituting backS′S^\{\\prime\}and usingS2≤S−2\\frac\{S\}\{2\}\\leq S\-2forS≥2S\\geq 2

T=SAΔ2p\(1−p\)2048ε2log⁡\(SA64δ\)\\displaystyle T=\\frac\{SA\\Delta^\{2\}p\(1\-p\)\}\{2048\\varepsilon^\{2\}\}\\log\\Big\(\\frac\{SA\}\{64\\delta\}\\Big\)\(19\)on the MDP corresponding to the hypothesisH0:\{H0i\|1≤i≤n\}H\_\{0\}:\\\{H\_\{0\}^\{i\}\|1\\leq i\\leq n\\\}it finally holds thatℙ1\(‖QM1∗−QT𝒰‖\>ε\)\>δ′\\mathbb\{P\}\_\{1\}\(\\\|Q^\{\*\}\_\{M\_\{1\}\}\-Q^\{\\mathcal\{U\}\}\_\{T\}\\\|\>\\varepsilon\)\>\\delta^\{\\prime\}\. ∎

#### E\.1\.2Policy Learning

The class of MDPs we consider hasS\+2S\+2states labelleds1,…,sS,sG,sBs\_\{1\},\\ldots,s\_\{S\},s^\{G\},s^\{B\}andA\+1A\+1actions labelleda0,a1,…,aAa\_\{0\},a\_\{1\},\\ldots,a\_\{A\}\. The statesGs^\{G\}is absorbing and yields a reward ofR=1R=1under all actions\. The statesBs^\{B\}is also absorbing and yields a reward ofR=0R=0under all actions\. All other statess1,…,sSs\_\{1\},\\ldots,s\_\{S\}yields zero reward under all actions and can only tansition to eithersGs^\{G\}orsBs^\{B\}with probabilities depending on the action and the MDP\.

For each statesi∈\{s1,…,sS\}s\_\{i\}\\in\\\{s\_\{1\},\\ldots,s\_\{S\}\\\}we consider thea\+1a\+1hypothesesH0iH^\{i\}\_\{0\}andHliH^\{i\}\_\{l\}forl≠0l\\neq 0defined as

H0i:\\displaystyle H^\{i\}\_\{0\}:q\(si,a0\)=p\+α\\displaystyle q\(s\_\{i\},a\_\{0\}\)=p\+\\alphaq\(si,a\)=pfora≠a0\\displaystyle q\(s\_\{i\},a\)=p\\text\{ for \}a\\neq a\_\{0\}Hli:\\displaystyle H^\{i\}\_\{l\}:q\(si,a0\)=p\+α\\displaystyle q\(s\_\{i\},a\_\{0\}\)=p\+\\alphaq\(si,a\)=pfora∉\{a0,l\}\\displaystyle q\(s\_\{i\},a\)=p\\text\{ for \}a\\notin\\\{a\_\{0\},l\\\}\\quadq\(si,al\)=p\+2α,\\displaystyle q\(s\_\{i\},a\_\{l\}\)=p\+2\\alpha\\,,wherep∈\(12,1\)p\\in\(\\frac\{1\}\{2\},1\)andα∈\(0,1−p10\)\\alpha\\in\(0,\\frac\{1\-p\}\{10\}\)andq\(si,a\):=ℙ\(sG\|si,a\)q\(s\_\{i\},a\):=\\mathbb\{P\}\(s^\{G\}\|s\_\{i\},a\)\.

We useVlj\(si\)V\_\{l\}^\{j\}\(s\_\{i\}\)to denote the value function of statesis\_\{i\}in any of the MDPs whereala\_\{l\}is the optimal action in statesis\_\{i\}under any policy for whichπ\(s\)=aj\\pi\(s\)=a\_\{j\}\. From the construction it is clear that this value function insis\_\{i\}does not depend on the entire policy but only the action taken insis\_\{i\}\.

Note that underHliH\_\{l\}^\{i\}the optimal action isala\_\{l\}with the second best option beinga0a\_\{0\}and all remaining actions being even worse in the sense thatVl∗\(si\)≥Vl0\(si\)≥Vlj\(si\)V^\{\*\}\_\{l\}\(s\_\{i\}\)\\geq V^\{0\}\_\{l\}\(s\_\{i\}\)\\geq V^\{j\}\_\{l\}\(s\_\{i\}\)whereV∗V^\{\*\}is the optimal value\-function andVlV^\{l\}is the value\-function under any policy where actionala\_\{l\}is taken in statesis\_\{i\}\.

###### Theorem 4\.

Letγ≥12\\gamma\\geq\\frac\{1\}\{2\},δ<14\\delta<\\frac\{1\}\{4\}andu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be given\. Furthermore, assume that there existp∈\(12,1\)p\\in\(\\frac\{1\}\{2\},1\)andΔ\>0\\Delta\>0such for everyα∈\(0,1−p10\)\\alpha\\in\(0,\\frac\{1\-p\}\{10\}\), everyl∈\{0,1,…,A\}l\\in\\\{0,1,\\ldots,A\\\}, andsi∈\{s1,…,sS\}s\_\{i\}\\in\\\{s\_\{1\},\\ldots,s\_\{S\}\\\}, it holds that

V0∗\(si\)−V0l\(si\)≥γαΔ,Vl∗\(si\)−Vl0\(si\)\\displaystyle V\_\{0\}^\{\*\}\(s\_\{i\}\)\-V\_\{0\}^\{l\}\(s\_\{i\}\)\\geq\\gamma\\alpha\\Delta,\\qquad V^\{\*\}\_\{l\}\(s\_\{i\}\)\-V^\{0\}\_\{l\}\(s\_\{i\}\)≥γαΔ\.\\displaystyle\\geq\\gamma\\alpha\\Delta\.Then there exist constantsc1,c2\>0c\_\{1\},c\_\{2\}\>0such that if the total number of samples is less than

T≤1c1SAΔ2p\(1−p\)ε2log⁡\(Sc2δ\),\\displaystyle T\\leq\\frac\{1\}\{c\_\{1\}\}\\frac\{SA\\Delta^\{2\}p\(1\-p\)\}\{\\varepsilon^\{2\}\}\\log\(\\frac\{S\}\{c\_\{2\}\\delta\}\),then for any algorithm𝒰\\mathcal\{U\}and anyε∈\(0,Δ\(1−p\)10\)\\varepsilon\\in\(0,\\frac\{\\Delta\(1\-p\)\}\{10\}\), there exists some MDPMi∈𝕄M\_\{i\}\\in\\mathbb\{M\}such thatℙ\(‖Vm∗−Vπ𝒰‖\>ε\)\>δ\\mathbb\{P\}\(\\\|V^\{\*\}\_\{m\}\-V^\{\\pi\_\{\\mathcal\{U\}\}\}\\\|\>\\varepsilon\)\>\\delta\.

###### Proof\.

If we chooseα=2εΔ\\alpha=\\frac\{2\\varepsilon\}\{\\Delta\}for anyε<Δ\(1−p\)20\\varepsilon<\\frac\{\\Delta\(1\-p\)\}\{20\}any suboptimal action isε\\varepsilon\-bad\.

Now that all non\-optimal actions areε\\varepsilon\-bad, we wish to show that any algorithm that is\(ε,δ\)\(\\varepsilon,\\delta\)\-correct onH0iH\_\{0\}^\{i\}, i\.e\. choosing the actiona0a\_\{0\}with probability at least1−δ1\-\\delta, will also have a probability of choosinga0a\_\{0\}onHliH\_\{l\}^\{i\}that is larger thanδ\\deltaprovided thatala\_\{l\}is not tried sufficiently many times underH0iH\_\{0\}^\{i\}\.

Letℙl\\mathbb\{P\}\_\{l\}and𝔼l\\mathbb\{E\}\_\{l\}denote the probability operator and expectation operator under the hypothesisHliH^\{i\}\_\{l\}\. Lett:=tlit:=t^\{i\}\_\{l\}be the number of times the algorithm tries actionllinsis\_\{i\}underH0H\_\{0\}\. Assuming thatδ∈\(0,14\)\\delta\\in\(0,\\frac\{1\}\{4\}\)and using that the algorithm is\(ε,δ\)\(\\varepsilon,\\delta\)\-correct it follows thatℙ0\(B\)≥1−δ≥34\\mathbb\{P\}\_\{0\}\(B\)\\geq 1\-\\delta\\geq\\frac\{3\}\{4\}whereB=\{π𝒰\(si\)=a0\}B=\\\{\\pi^\{\\mathcal\{U\}\}\(s\_\{i\}\)=a\_\{0\}\\\}is the event that the algorithm chooses the actiona0a\_\{0\}\.

Letθ=exp⁡\(−32α2tp\(1−p\)\)\\theta=\\exp\\big\(\-\\frac\{32\\alpha^\{2\}t\}\{p\(1\-p\)\}\\big\)\. Fix somet∈ℕt\\in\\mathbb\{N\}and letkkbe the number of transitions tosGs^\{G\}intttrials\.

Finally, we define the eventℰ\\mathcal\{E\}as

ℰ=\{pt−k≤2p\(1−p\)log⁡\(82θ\)\}\.\\displaystyle\\mathcal\{E\}=\\bigg\\\{pt\-k\\leq\\sqrt\{2p\(1\-p\)\\log\(\\frac\{8\}\{2\\theta\}\)\}\\bigg\\\}\\,\.\(20\)From the Chernoff\-Hoeffding bound and as shown in\[[26](https://arxiv.org/html/2605.21763#bib.bib30)\], we have thatℙ0\(ℰ\)\>34\\mathbb\{P\}\_\{0\}\(\\mathcal\{E\)\}\>\\frac\{3\}\{4\}, and thus,ℙ0\(B∩ℰ\)\>12\\mathbb\{P\}\_\{0\}\(B\\cap\\mathcal\{E\)\}\>\\frac\{1\}\{2\}\. From Theorem 9 in\[[52](https://arxiv.org/html/2605.21763#bib.bib3)\], we get that

ℙ1\(B\)≥ℙ1\(B∩ℰ\)=𝔼1\[𝟙B𝟙ℰ\]≥𝔼0\[L1\(W\)L0\(W\)𝟙ℰ𝟙B\]≥𝔼0\[θ4𝟙ℰ𝟙B\]=θ4ℙ0\(ℰ∩B\)≥θ8\.\\displaystyle\\mathbb\{P\}\_\{1\}\(B\)\\geq\\mathbb\{P\}\_\{1\}\(B\\cap\\mathcal\{E\)\}=\\mathbb\{E\}\_\{1\}\[\\mathbbmss\{1\}\_\{B\}\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\]\\geq\\mathbb\{E\}\_\{0\}\\bigg\[\\frac\{L\_\{1\}\(W\)\}\{L\_\{0\}\(W\)\}\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\\mathbbmss\{1\}\_\{B\}\\bigg\]\\geq\\mathbb\{E\}\_\{0\}\\bigg\[\\frac\{\\theta\}\{4\}\\mathbbmss\{1\}\_\{\\mathcal\{E\}\}\\mathbbmss\{1\}\_\{B\}\\bigg\]=\\frac\{\\theta\}\{4\}\\mathbb\{P\}\_\{0\}\(\\mathcal\{E\}\\cap B\)\\geq\\frac\{\\theta\}\{8\}\\,\.\(21\)Now solving forθ8\>δ\\frac\{\\theta\}\{8\}\>\\delta, we see that if

t<T~\(ε,δ\):=Δ2p\(1−p\)128ε2log⁡\(18δ\)\\displaystyle t<\\widetilde\{T\}\(\\varepsilon,\\delta\):=\\frac\{\\Delta^\{2\}p\(1\-p\)\}\{128\\varepsilon^\{2\}\}\\log\(\\frac\{1\}\{8\\delta\}\)\(22\)thenℙ1\(B\)\>δ\\mathbb\{P\}\_\{1\}\(B\)\>\\deltaand the eventBBis containing the event that the algorithm does not choose the optimal actionala\_\{l\}\.

Since this holds for all theAAhypothesesHli,l=1,2,…,AH\_\{l\}^\{i\},l=1,2,\\ldots,A, it follows that the algorithm needs at leastT~\(ε,δ\):=AT~\(ε,δ\)\\widetilde\{T\}\(\\varepsilon,\\delta\):=A\\widetilde\{T\}\(\\varepsilon,\\delta\)samples to be\(ε,δ\)\(\\varepsilon,\\delta\)\-correct on the statesis\_\{i\}\.

Next we use the fact that the structure of the MDPs is such that the information used to determineπ∗\(si\)\\pi^\{\*\}\(s\_\{i\}\)carries no information to determineπ∗\(sj\)\\pi^\{\*\}\(s\_\{j\}\)fori≠ji\\neq j\.

If the total number of transition samples is less thanS2T~\(ε,δ\)\\frac\{S\}\{2\}\\widetilde\{T\}\(\\varepsilon,\\delta\), there has to be at leastS2\\frac\{S\}\{2\}states in the set\{si\}i=1S\\\{s\_\{i\}\\\}\_\{i=1\}^\{S\}for which at least one action \(apart froma0a\_\{0\}\) has been tried at mostT~\(ε,δ\)\\widetilde\{T\}\(\\varepsilon,\\delta\)times\. We might without loss of generality assume that these are the states\{si\}i=1S/2\\\{s\_\{i\}\\\}\_\{i=1\}^\{S/2\}and that it is actiona1a\_\{1\}that has been tried out at mostT~\(ε,δ\)\\widetilde\{T\}\(\\varepsilon,\\delta\)times in each of these states\.

LetTiT\_\{i\}be the number of times the algorithm has tried sampled any action onsis\_\{i\}fori≤S/2i\\leq S/2\. By the structure of the MDPs in𝕄\\mathbb\{M\}it is suffices to only consider algorithms that outputs an estimate ofπTi𝒰\\pi^\{\\mathcal\{U\}\}\_\{T\_\{i\}\}based on samples fromsis\_\{i\}since any other samples yields no information onπ∗\(si\)\\pi^\{\*\}\(s\_\{i\}\)\.

Let us define the eventsΛi:=\{\|VM1∗\(si\)−VπTi𝒰\(si\)\|\>ε\}\\Lambda\_\{i\}:=\\\{\|V\_\{M\_\{1\}\}^\{\*\}\(s\_\{i\}\)\-V^\{\\pi^\{\\mathcal\{U\}\}\_\{T\_\{i\}\}\}\(s\_\{i\}\)\|\>\\varepsilon\\\}fori=1,…,Si=1,\\ldots,S\. Then, we have thatΛi\\Lambda\_\{i\}andΛj\\Lambda\_\{j\}are conditionally independent givenTiT\_\{i\}andTjT\_\{j\}\. We then have that for the MDPM1∈𝕄M\_\{1\}\\in\\mathbb\{M\}–the one corresponding to the hypothesisH1:=\{H1i\|1≤i≤n\}H\_\{1\}:=\\\{H\_\{1\}^\{i\}\|1\\leq i\\leq n\\\}– it holds that

ℙ\(\{Λic\}1≤i≤S/2\\displaystyle\\mathbb\{P\}\\big\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq S/2\}∩\{Ti≤T~\(ε,δ\)\}1≤i≤S/2\)\\displaystyle\\cap\\\{T\_\{i\}\\leq\\widetilde\{T\}\(\\varepsilon,\\delta\)\\\}\_\{1\\leq i\\leq S/2\}\\big\)=∑t1=0T~\(ε,δ\)…∑tS/2=0T~\(ε,δ\)ℙ\(\{Ti=ti\}1≤i≤S/2\)ℙ\(\{Λic\}1≤i≤S/2∩\{Ti=ti\}1≤i≤S/2\)\\displaystyle=\\sum\_\{t\_\{1\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\dots\\sum\_\{t\_\{S/2\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\mathbb\{P\}\\big\(\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq S/2\}\\big\)\\mathbb\{P\}\\big\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq S/2\}\\cap\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq S/2\}\\big\)=∑t1=0T~\(ε,δ\)…∑tS/2=0T~\(ε,δ\)ℙ\(\{Ti=ti\}1≤i≤S/2\)∏1≤i≤S/2ℙ\(Λic∩\{Ti=ti\}\)\\displaystyle=\\sum\_\{t\_\{1\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\dots\\sum\_\{t\_\{S/2\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\mathbb\{P\}\\big\(\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq S/2\}\\big\)\\prod\_\{1\\leq i\\leq S/2\}\\mathbb\{P\}\\big\(\\Lambda\_\{i\}^\{c\}\\cap\\\{T\_\{i\}=t\_\{i\}\\\}\\big\)=∑t1=0T~\(ε,δ\)…∑tS/2=0T~\(ε,δ\)ℙ\(\{Ti=ti\}1≤i≤S/2\)\(1−δ\)S/2,\\displaystyle=\\sum\_\{t\_\{1\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\dots\\sum\_\{t\_\{S/2\}=0\}^\{\\widetilde\{T\}\(\\varepsilon,\\delta\)\}\\mathbb\{P\}\\big\(\\\{T\_\{i\}=t\_\{i\}\\\}\_\{1\\leq i\\leq S/2\}\\big\)\(1\-\\delta\)^\{S/2\}\\,,where the first line follows from the law of total probability, and the second line from independence\. We now have directly that

ℙ\(\{Λic\}1≤i≤S/2\|\{Ti≤T~\(ε,δ\)\}1≤i≤S/2\)≤\(1−δ\)S2\.\\displaystyle\\mathbb\{P\}\\Big\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq S/2\}\\Big\|\\\{T\_\{i\}\\leq\\widetilde\{T\}\(\\varepsilon,\\delta\)\\\}\_\{1\\leq i\\leq S/2\}\\Big\)\\leq\(1\-\\delta\)^\{\\frac\{S\}\{2\}\}\\,\.Thus, if the total number of transitionsTTis less thanS2T~\(ε,δ\)\\frac\{S\}\{2\}\\widetilde\{T\}\(\\varepsilon,\\delta\)on the MDPM0M\_\{0\}corresponding to the hypothesisH0:\{H0i\|1≤i≤n\}H\_\{0\}:\\\{H\_\{0\}^\{i\}\|1\\leq i\\leq n\\\}, then onM1M\_\{1\}it holds that

ℙ\(‖V∗−VπT𝒰‖\>ε\)\\displaystyle\\mathbb\{P\}\(\\\|V^\{\*\}\-V^\{\\pi^\{\\mathcal\{U\}\}\_\{T\}\}\\\|\>\\varepsilon\)≥ℙ\(⋃1≤i≤S/2Λ\(z\)\)\\displaystyle\\geq\\mathbb\{P\}\\bigg\(\\bigcup\_\{1\\leq i\\leq S/2\}\\Lambda\(z\)\\bigg\)=1−ℙ\(⋂1≤i≤S/2Λic\)\\displaystyle=1\-\\mathbb\{P\}\\bigg\(\\bigcap\_\{1\\leq i\\leq S/2\}\\Lambda\_\{i\}^\{c\}\\bigg\)≥1−ℙ\(\{Λic\}1≤i≤S/2\|\{Tzi≤T~\(ε,δ\)\}1≤i≤S/2\)\\displaystyle\\geq 1\-\\mathbb\{P\}\\Big\(\\\{\\Lambda\_\{i\}^\{c\}\\\}\_\{1\\leq i\\leq S/2\}\\,\\Big\|\\,\\\{T\_\{z\_\{i\}\}\\leq\\widetilde\{T\}\(\\varepsilon,\\delta\)\\\}\_\{1\\leq i\\leq S/2\}\\Big\)≥1−\(1−δ\)S/2\\displaystyle\\geq 1\-\(1\-\\delta\)^\{S/2\}≥δS4,\\displaystyle\\geq\\frac\{\\delta S\}\{4\},whenδS2≤1\\frac\{\\delta S\}\{2\}\\leq 1\. By settingδ′=δS4\\delta^\{\\prime\}=\\delta\\frac\{S\}\{4\}and substituting backS′S^\{\\prime\}andA′A^\{\\prime\}, assuming thatS≥4,A≥2S\\geq 4,A\\geq 2we conclude that if the number of samples is smaller than

T=SAΔ2p\(1−p\)1024log⁡\(S64δ\)\\displaystyle T=\\frac\{SA\\Delta^\{2\}p\(1\-p\)\}\{1024\}\\log\(\\frac\{S\}\{64\\delta\}\)onM0M\_\{0\}, then onM1M\_\{1\}it holds thatℙ\(‖V∗−VπT𝒰‖\>ε\)\>δ\\mathbb\{P\}\(\\\|V^\{\*\}\-V^\{\\pi^\{\\mathcal\{U\}\}\_\{T\}\}\\\|\>\\varepsilon\)\>\\delta\. ∎

### E\.2Proofs of Lower Bounds

In this section, we give the proofs of the lower bounds using the constructions described above\.

#### E\.2\.1Value Learning Lower Bounds

###### Proof\.

On the small MDP type sketched in figure[1](https://arxiv.org/html/2605.21763#S6.F1)we will give a lower bound on the optimal Q\-functions onziz\_\{i\}for two different parameter choices ofqq, namelyq0=pq\_\{0\}=pandq1=p\+αq\_\{1\}=p\+\\alpha\.

Clearly for anyQQwe have that

Q∗\(z\)=γmaxη∈\[0,Hγ\]⁡\{η\+qu\(Hγ−η\)\+\(1−q\)u\(−η\)\}\\displaystyle Q^\{\*\}\(z\)=\\gamma\\max\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\\{\\eta\+qu\(H\_\{\\gamma\}\-\\eta\)\+\(1\-q\)u\(\-\\eta\)\\\}We useη1\\eta\_\{1\}andη0\\eta\_\{0\}to denote the respective maximizers forq1q\_\{1\}andq0q\_\{0\}respectively\. We then have that

Q1∗\(z\)\\displaystyle Q^\{\*\}\_\{1\}\(z\)−Q0∗\(z\)\\displaystyle\-Q^\{\*\}\_\{0\}\(z\)=γ\(η1\+\(p\+α\)u\(Hγ−η1\+\(1−p−α\)u\(−η1\)−η0−pu\(Hγ−η0\)−\(1−p\)u\(−η0\)\)\\displaystyle=\\gamma\\big\(\\eta\_\{1\}\+\(p\+\\alpha\)u\(H\_\{\\gamma\}\-\\eta\_\{1\}\+\(1\-p\-\\alpha\)u\(\-\\eta\_\{1\}\)\-\\eta\_\{0\}\-pu\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-\(1\-p\)u\(\-\\eta\_\{0\}\)\\big\)≥γα\(u\(Hγ−η0\)−u\(−η0\)\)\\displaystyle\\geq\\gamma\\alpha\\big\(u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\\big\)withΔ=\(u\(Hγ−η0\)−u\(−η0\)\)\\Delta=\\big\(u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\\big\)By Theorem[6](https://arxiv.org/html/2605.21763#Thmtheorem6), there existsp∈\(12,1\)p\\in\(\\frac\{1\}\{2\},1\)such thatΔ:=u\(Hγ−η0\)−u\(η0\)\>0\\Delta:=u\(\{H\_\{\\gamma\}\-\\eta\_\{0\}\}\)\-u\(\\eta\_\{0\}\)\>0and so the result follows by Theorem[3](https://arxiv.org/html/2605.21763#Thmtheorem3)withΦ\(u,Hγ\)=Δ2p\(1−p\)\\Phi\(u,H\_\{\\gamma\}\)=\\Delta^\{2\}p\(1\-p\)\. ∎

\\LowerboundSpecificValue

\*

###### Proof\.

The idea of the proof is similar to that of Theorem[6\.1](https://arxiv.org/html/2605.21763#S6.SS1)with the same MDP construction\. Recall that

Q1∗\(z\)−Q0∗\(z\)≥γα\(u\(Hγ−η0\)−u\(−η0\)\)\.\\displaystyle Q\_\{1\}^\{\*\}\(z\)\-Q^\{\*\}\_\{0\}\(z\)\\geq\\gamma\\alpha\\big\(u\\big\(H\_\{\\gamma\}\-\\eta\_\{0\}\\big\)\-u\(\-\\eta\_\{0\}\)\\big\)\.By the assumption onuu, we have by Theorem[5](https://arxiv.org/html/2605.21763#Thmtheorem5), we can pick

p=12\+121−u\+′\(0\)u\+′\(−Hγ\)−u\+′\(0\)\>12p=\\frac\{1\}\{2\}\+\\frac\{1\}\{2\}\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\-u^\{\\prime\}\_\{\+\}\(0\)\}\>\\frac\{1\}\{2\}to ensure thatη0=Hγ\\eta\_\{0\}=H\_\{\\gamma\}and so thatΔ=u\(Hγ−η0\)−u\(−η0\)=−u\(−Hγ\)\\Delta=u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)=\-u\(\-H\_\{\\gamma\}\)\. The result then follows from Theorem[3](https://arxiv.org/html/2605.21763#Thmtheorem3)\. ∎

#### E\.2\.2Policy Learning Lower Bounds

\\LowerboundGenericPolicy

\*

###### Proof\.

We have for anyssandl≠0l\\neq 0

V0∗\(s\)\\displaystyle V\_\{0\}^\{\*\}\(s\)−V0l\(s\)\\displaystyle\-V\_\{0\}^\{l\}\(s\)=γ\(η1\+\(p\+α\)u\(Hγ−η1\)\+\(1−p−α\)u\(−η1\)−η0−pu\(Hγ−η0\)−\(1−p\)u\(−η0\)\)\\displaystyle=\\gamma\\big\(\\eta\_\{1\}\+\(p\+\\alpha\)u\(H\_\{\\gamma\}\-\\eta\_\{1\}\)\+\(1\-p\-\\alpha\)u\(\-\\eta\_\{1\}\)\-\\eta\_\{0\}\-pu\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-\(1\-p\)u\(\-\\eta\_\{0\}\)\\big\)≥γα\(u\(Hγ−η0\)−u\(−η0\)\),\\displaystyle\\geq\\gamma\\alpha\\big\(u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\\big\),whereη1\\eta\_\{1\}is an optimizer ofmaxη∈\[0,Hγ\]\{η\+\(p\+α\)u\(Hγ−η\)−\(1−p−α\)\)u\(−η\)\}\\max\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\\{\\eta\+\(p\+\\alpha\)u\(H\_\{\\gamma\}\-\\eta\)\-\(1\-p\-\\alpha\)\)u\(\-\\eta\)\\\}andη0\\eta\_\{0\}is an optimizer ofmaxη∈\[0,Hγ\]⁡\{η\+pu\(Hγ−η\)−\(1−p\)u\(−η\)\}\\max\_\{\\eta\\in\[0,H\_\{\\gamma\}\]\}\\\{\\eta\+pu\(H\_\{\\gamma\}\-\\eta\)\-\(1\-p\)u\(\-\\eta\)\\\}\. Similarly for alll=1…,Al=1\\ldots,A,

Vl∗\(s\)−Vl0\(s\)≥γα\(u\(Hγ−η1\)−u\(−η1\)\)\.\\displaystyle V\_\{l\}^\{\*\}\(s\)\-V\_\{l\}^\{0\}\(s\)\\geq\\gamma\\alpha\\big\(u\(H\_\{\\gamma\}\-\\eta\_\{1\}\)\-u\(\-\\eta\_\{1\}\)\\big\)\.By Theorem[5](https://arxiv.org/html/2605.21763#Thmtheorem5), there existsp¯0<1\\bar\{p\}\_\{0\}<1so that for allp\>p¯0p\>\\bar\{p\}\_\{0\}it holds thatu\(Hγ−η0\)−u\(−η0\)\>0u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\>0andp¯1<1\\bar\{p\}\_\{1\}<1so that forp\+α\>p¯1p\+\\alpha\>\\bar\{p\}\_\{1\}it holds thatu\(Hγ−η1\)−u\(−η1\)\>0u\(H\_\{\\gamma\}\-\\eta\_\{1\}\)\-u\(\-\\eta\_\{1\}\)\>0\. By pickingpplarge enough andε\\varepsilonsufficiently small thenα\\alphais also sufficiently small so thatp\+2α<1p\+2\\alpha<1and bothu\(Hγ−η0\)−u\(−η0\)\>0u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\>0andu\(Hγ−η1\)−u\(−η1\)\>0u\(H\_\{\\gamma\}\-\\eta\_\{1\}\)\-u\(\-\\eta\_\{1\}\)\>0\. Plugging in

Δ=min⁡\{u\(Hγ−η1\)−u\(−η1\),u\(Hγ−η0\)−u\(−η0\)\},\\Delta=\\min\\\{u\(H\_\{\\gamma\}\-\\eta\_\{1\}\)\-u\(\-\\eta\_\{1\}\),u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\\\},the results follows from Theorem[4](https://arxiv.org/html/2605.21763#Thmtheorem4)withΦ\(u,Hγ\)=Δ2p\(1−p\)\\Phi\(u,H\_\{\\gamma\}\)=\\Delta^\{2\}p\(1\-p\)\. ∎

\\LowerboundSpecificPolicy

\*

###### Proof\.

The idea of the proof is similar to that of Theorem[6\.2](https://arxiv.org/html/2605.21763#S6.SS2)\. Recall that

V0∗\(s\)−V0l\(s\)\\displaystyle V\_\{0\}^\{\*\}\(s\)\-V\_\{0\}^\{l\}\(s\)≥γα\(u\(Hγ−η0\)−u\(−η0\)\),Vl∗\(s\)−Vl0\(s\)≥γα\(u\(Hγ−η1\)−u\(−η1\)\),\\displaystyle\\geq\\gamma\\alpha\\big\(u\(H\_\{\\gamma\}\-\\eta\_\{0\}\)\-u\(\-\\eta\_\{0\}\)\\big\),\\qquad V\_\{l\}^\{\*\}\(s\)\-V\_\{l\}^\{0\}\(s\)\\geq\\gamma\\alpha\\big\(u\(H\_\{\\gamma\}\-\\eta\_\{1\}\)\-u\(\-\\eta\_\{1\}\)\\big\),and by Theorem[5](https://arxiv.org/html/2605.21763#Thmtheorem5)if bothppandp\+α≥max⁡\{12,1−1−u\+′\(0\)u\+′\(−Hγ\)−u\+′\(0\)\}p\+\\alpha\\geq\\max\\\{\\frac\{1\}\{2\},1\-\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\-u^\{\\prime\}\_\{\+\}\(0\)\}\\\}then the above right hand sides are both equal toΔ=\|u\(−Hγ\)\|\\Delta=\|u\(\-H\_\{\\gamma\}\)\|\. For sufficiently smallε\\varepsilonwe get thatα\\alphais sufficiently small so thatp=12\+121−u\+′\(0\)u\+′\(−Hγ\)−u\+′\(0\)p=\\frac\{1\}\{2\}\+\\frac\{1\}\{2\}\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-H\_\{\\gamma\}\)\-u\_\{\+\}^\{\\prime\}\(0\)\}ensures this\. Plugging in forppandΔ\\Deltathe result follows\. ∎

## Appendix FAuxiliary Results

###### Proposition 2\.

Letu∈𝕌0u\\in\\mathbb\{U\}\_\{0\}be any utility function for whichu\(t\)=tu\(t\)=tfor allt≥0t\\geq 0\. ThenOCEu\(X\)=𝔼\[X\]\\mathrm\{OCE\}\_\{u\}\(X\)=\\mathbb\{E\}\[X\]\.

###### Proof\.

Since any OCE satisfies thatOCE\(X\)≤𝔼\[X\]\\mathrm\{OCE\}\(X\)\\leq\\mathbb\{E\}\[X\]we have to show thatOCE\(X\)≥𝔼\[X\]\\mathrm\{OCE\}\(X\)\\geq\\mathbb\{E\}\[X\]\. By pickingη=Essinf\(X\)\\eta=\\text\{Essinf\}\(X\), we get thatu\(X−η\)=X−ηu\(X\-\\eta\)=X\-\\etaand so

η\+𝔼\[u\(X−η\)\]=η\+𝔼\[X−η\]=𝔼\[X\]\.\\displaystyle\\eta\+\\mathbb\{E\}\[u\(X\-\\eta\)\]=\\eta\+\\mathbb\{E\}\[X\-\\eta\]=\\mathbb\{E\}\[X\]\\,\.\(23\)Thus, by taking the supremum over allη\\eta, the result follows\. ∎

###### Proposition 3\.

For piecewise differentiableu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}, it holds that−u\(−Hγ\)≥u\+′\(−γ1−γ\)\-u\(\-H\_\{\\gamma\}\)\\geq u\_\{\+\}^\{\\prime\}\(\-\\frac\{\\gamma\}\{1\-\\gamma\}\)\.

###### Proof\.

Assume thatuuis differentiable\. Then sinceuuis increasing and concave, we have fort≥1t\\geq 1that

u\(−t\)\\displaystyle u\(\-t\)=u\(0\)\+∫0−tu′\(x\)𝑑x=−∫−t0u′\(x\)𝑑x≤−∫−t−t\+1u′\(x\)𝑑x≤−u′\(−t\+1\)\.\\displaystyle=u\(0\)\+\\int\_\{0\}^\{\-t\}u^\{\\prime\}\(x\)dx=\-\\int\_\{\-t\}^\{0\}u^\{\\prime\}\(x\)dx\\leq\-\\int\_\{\-t\}^\{\-t\+1\}u^\{\\prime\}\(x\)dx\\leq\-u^\{\\prime\}\(\-t\+1\)\\,\.Multiplying by−1\-1on both sides and plugging int=Hγt=H\_\{\\gamma\}, we get that−u\(−Hγ\)≥u′\(−γ1−γ\)\-u\(\-H\_\{\\gamma\}\)\\geq u^\{\\prime\}\(\-\\frac\{\\gamma\}\{1\-\\gamma\}\)\. The result then follows from partitioning the integral over the different subdomains on whichuuis differentiable\. ∎

###### Theorem 5\.

Letx\>0x\>0andXXbe a random variable withℙ\(X=x\)=p\\mathbb\{P\}\(X=x\)=pandℙ\(x=0\)=1−p\\mathbb\{P\}\(x=0\)=1\-p\. Letu∈𝕌1<u\\in\\mathbb\{U\}\_\{1\}^\{<\}be a strongly risk\-averse utility function\. If

p≥1−1−u\+′\(0\)u\+′\(−x\)−u\+′\(0\),\\displaystyle p\\geq 1\-\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-x\)\-u^\{\\prime\}\_\{\+\}\(0\)\},thenη∗=x\\eta^\{\*\}=xis a solution to

supη∈\[0,x\]\{η\+pu\(x−η\)\+\(1−p\)u\(−η\)\}\.\\displaystyle\\sup\_\{\\eta\\in\[0,x\]\}\\\{\\eta\+pu\(x\-\\eta\)\+\(1\-p\)u\(\-\\eta\)\\\}\.

###### Proof\.

By definitionη∗=x\\eta^\{\*\}=xis a solution if and only if for allη∈\[0,x\]\\eta\\in\[0,x\], it holds that

η\+pu\(x−η\)\+\(1−p\)u\(−η\)\\displaystyle\\eta\+pu\(x\-\\eta\)\+\(1\-p\)u\(\-\\eta\)≤x\+\(1−p\)u\(−x\),\\displaystyle\\leq x\+\(1\-p\)u\(\-x\),where trivially the inequality holds forη=x\\eta=x, and where forη<x\\eta<xthe inequality is equivalent to

1−p≤x−η−u\(x−η\)−u\(−x\)\+u\(−η\)−u\(x−η\)=1−u\(x−η\)x−η−u\(−x\)\+u\(−η\)x−η−u\(x−η\)x−η\.\\displaystyle 1\-p\\leq\\frac\{x\-\\eta\-u\(x\-\\eta\)\}\{\-u\(\-x\)\+u\(\-\\eta\)\-u\(x\-\\eta\)\}=\\frac\{1\-\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\}\{\\frac\{\-u\(\-x\)\+u\(\-\\eta\)\}\{x\-\\eta\}\-\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\}\.Thus, we wish to find a lower bound on the right\-hand side that holds for allη∈\[0,x\)\\eta\\in\[0,x\)\. Note that

−u\(−x\)\+u\(−η\)x−η\\displaystyle\\frac\{\-u\(\-x\)\+u\(\-\\eta\)\}\{x\-\\eta\}=u\(−x\)−u\(−η\)−x−\(−η\)≤u\+′\(−x\),\\displaystyle=\\frac\{u\(\-x\)\-u\(\-\\eta\)\}\{\-x\-\(\-\\eta\)\}\\leq u^\{\\prime\}\_\{\+\}\(\-x\),and since for anyb\>1b\>1, the mapt↦1−tb−tt\\mapsto\\frac\{1\-t\}\{b\-t\}is decreasing ont∈\[0,1\]t\\in\[0,1\]andu\(x−η\)x−η\>u\+′\(0\)\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\>u^\{\\prime\}\_\{\+\}\(0\), by concavity ofuuwe then have

1−u\(x−η\)x−η−u\(−x\)\+u\(−η\)x−η−u\(x−η\)x−η\\displaystyle\\frac\{1\-\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\}\{\\frac\{\-u\(\-x\)\+u\(\-\\eta\)\}\{x\-\\eta\}\-\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\}≥1−u\(x−η\)x−ηu\+′\(−x\)−u\(x−η\)x−η\\displaystyle\\geq\\frac\{1\-\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\}\{u^\{\\prime\}\_\{\+\}\(\-x\)\-\\frac\{u\(x\-\\eta\)\}\{x\-\\eta\}\}≥1−u\+′\(0\)u\+′\(−x\)−u\+′\(0\),\\displaystyle\\geq\\frac\{1\-u^\{\\prime\}\_\{\+\}\(0\)\}\{u^\{\\prime\}\_\{\+\}\(\-x\)\-u^\{\\prime\}\_\{\+\}\(0\)\},thus proving the lemma\. ∎

###### Theorem 6\.

Letx\>0x\>0andu∈𝕌1u\\in\\mathbb\{U\}\_\{1\}be given\. Then there exists somep¯∈\(12,1\)\\bar\{p\}\\in\(\\frac\{1\}\{2\},1\)such that for allp∈\(p¯,1\)p\\in\(\\bar\{p\},1\),

u\(x−η∗\)−u\(−η∗\)\>0,\\displaystyle u\(x\-\\eta^\{\*\}\)\-u\(\-\\eta^\{\*\}\)\>0,whereη∗\\eta^\{\*\}is the solution to

maxη∈\[0,x\]⁡\{η\+pu\(x−η\)\+\(1−p\)u\(−η\)\}\.\\displaystyle\\max\_\{\\eta\\in\[0,x\]\}\\\{\\eta\+pu\(x\-\\eta\)\+\(1\-p\)u\(\-\\eta\)\\\}\.

###### Proof\.

We first note thatu\(x−η\)≥0u\(x\-\\eta\)\\geq 0and−u\(−η\)≥0\-u\(\-\\eta\)\\geq 0for allη∈\[0,x\]\\eta\\in\[0,x\]andu\(x−η\)−u\(−η\)=0u\(x\-\\eta\)\-u\(\-\\eta\)=0if and only if bothη=0\\eta=0andu\(t\)=0u\(t\)=0for allt≥0t\\geq 0\.

Ifu\(t\)u\(t\)is not identically zero ont≥0t\\geq 0thenu\(t\)\>0u\(t\)\>0for allt\>0t\>0and sou\(x−η\)=0u\(x\-\\eta\)=0implies thatη=x\\eta=xbut since−u\(−x\)≥x\>0\-u\(\-x\)\\geq x\>0we can in this case concludeu\(x−η∗\)−u\(−η∗\)\>0u\(x\-\\eta^\{\*\}\)\-u\(\-\\eta^\{\*\}\)\>0for allp∈\(0,1\)\.p\\in\(0,1\)\.

Now we treat the case whereu\(t\)=0u\(t\)=0for allt≥0t\\geq 0which we partition into two cases:u\(t\)=tu\(t\)=tfort∈\[−x,0\]t\\in\[\-x,0\]\(Case \(i\)\) andu\(−x\)<−xu\(\-x\)<\-x\(Case \(ii\)\)\.

Case \(i\)\.We have thatη\+pu\(x−η\)\+\(1−p\)u\(−η\)=pη\\eta\+pu\(x\-\\eta\)\+\(1\-p\)u\(\-\\eta\)=p\\etawhich is clearly maximized byη∗=x\\eta^\{\*\}=xand sou\(x−η∗\)−u\(−η∗\)=−u\(−x\)\>0u\(x\-\\eta^\{\*\}\)\-u\(\-\\eta^\{\*\}\)=\-u\(\-x\)\>0for allp∈\(0,1\)p\\in\(0,1\)\.

Case \(ii\)\.We note thatη∗=0\\eta^\{\*\}=0is a solution if and only if for allη∈\[0,x\]\\eta\\in\[0,x\]it holds thatη\+\(1−p\)u\(−η\)≤0\\eta\+\(1\-p\)u\(\-\\eta\)\\leq 0or equivalentlyp≤1−η−u\(−η\)p\\leq 1\-\\frac\{\\eta\}\{\-u\(\-\\eta\)\}but sinceu\(−x\)<−xu\(\-x\)<\-xthe inequality is violated ifp\>1\+−x−u\(−x\)p\>1\+\\frac\{\-x\}\{\-u\(\-x\)\}and since−x−u\(−x\)\>−1\\frac\{\-x\}\{\-u\(\-x\)\}\>\-1, we can pick

p¯=1\+x−u\(−x\)2∈\(12,1\),\\displaystyle\\bar\{p\}=\\frac\{1\+\\frac\{x\}\{\-u\(\-x\)\}\}\{2\}\\in\(\\frac\{1\}\{2\},1\),ensuring thatη=0\\eta=0is not a solution \(asη=x\\eta=xviolates the inequality\)\. Thus,u\(x−η∗\)−u\(−η∗\)\>0u\(x\-\\eta^\{\*\}\)\-u\(\-\\eta^\{\*\}\)\>0\. ∎
On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Similar Articles

Lyapunov-Based Sample Complexity Analysis for Weakly-Coupled MDPs

Evolving Robustness--Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs

Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo

Submit Feedback

Similar Articles

Lyapunov-Based Sample Complexity Analysis for Weakly-Coupled MDPs
Evolving Robustness--Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs
Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo