A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning

arXiv cs.LG Papers

Summary

This paper presents a finite-iteration theory for asynchronous categorical distributional temporal-difference learning, bridging the gap between existing theoretical frameworks and practical online implementations.

arXiv:2605.06866v1 Announce Type: new


# A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning
Source: [https://arxiv.org/html/2605.06866](https://arxiv.org/html/2605.06866)
Ege C\. Kaya, Abolfazl Hashemi. Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906. kayae@purdue\.edu, abolfazl@purdue\.edu

###### Abstract

Recent non\-asymptotic analyses have substantially advanced the theory of distributional policy evaluation, but they largely concern synchronous full\-state updates under a generative model, model\-based estimators, accelerated variants, or different approximation architectures\. Standard categorical temporal\-difference learning is typically used in a different regime\. It asynchronously performs a single\-state update at each iteration and, in online settings, is driven by a Markovian trajectory\. This leaves an important gap between existing finite\-iteration theory and the categorical recursions most closely aligned with practical distributional temporal\-difference implementations\. We bridge this gap for two categorical policy\-evaluation methods: scalar categorical temporal\-difference learning in the Cramér geometry and multivariate signed\-categorical temporal\-difference learning in the maximum mean discrepancy geometry\. After suitable isometric embeddings, both algorithms take the form of asynchronous single\-state stochastic\-approximation recursions that contract in a statewise supremum norm\. This permits finite\-iteration guarantees in discounted problems under both i\.i\.d\. and Markovian state sampling, and in undiscounted fixed\-horizon problems under i\.i\.d\. episodic sampling\.

## 1 Introduction

Categorical representations are among the central approximation families in distributional reinforcement learning \(RL\)\[[4](https://arxiv.org/html/2605.06866#bib.bib6),[39](https://arxiv.org/html/2605.06866#bib.bib7),[15](https://arxiv.org/html/2605.06866#bib.bib40),[14](https://arxiv.org/html/2605.06866#bib.bib41),[51](https://arxiv.org/html/2605.06866#bib.bib42),[30](https://arxiv.org/html/2605.06866#bib.bib43),[40](https://arxiv.org/html/2605.06866#bib.bib44),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\. In the scalar case they underlie the categorical methods initiated by C51\[[4](https://arxiv.org/html/2605.06866#bib.bib6)\], while in the multivariate case they support signed\-measure constructions for vector\-valued returns\[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\. These methods replace infinite\-dimensional return\-distribution objects with finite\-dimensional ones while preserving the geometry that governs projected distributional policy evaluation\[[21](https://arxiv.org/html/2605.06866#bib.bib35),[23](https://arxiv.org/html/2605.06866#bib.bib36),[24](https://arxiv.org/html/2605.06866#bib.bib37),[43](https://arxiv.org/html/2605.06866#bib.bib38),[32](https://arxiv.org/html/2605.06866#bib.bib39)\]\. For these categorical methods, the asymptotic picture is by now well understood\[[20](https://arxiv.org/html/2605.06866#bib.bib17),[17](https://arxiv.org/html/2605.06866#bib.bib18),[47](https://arxiv.org/html/2605.06866#bib.bib19),[22](https://arxiv.org/html/2605.06866#bib.bib20),[6](https://arxiv.org/html/2605.06866#bib.bib21),[28](https://arxiv.org/html/2605.06866#bib.bib22),[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\. The scalar categorical temporal\-difference \(CTD\) method admits a projected Bellman contraction in the Cramér metric together with asymptotic guarantees\[[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\. 
The multivariate signed\-categorical temporal\-difference \(MTD\) method admits the analogous maximum mean discrepancy \(MMD\)\-based contraction and theory\[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\. What remains less well understood is the finite\-iteration behavior of these TD methods\.

This matters because standard TD learning is not a synchronous full\-state procedure\[[46](https://arxiv.org/html/2605.06866#bib.bib23),[45](https://arxiv.org/html/2605.06866#bib.bib24),[48](https://arxiv.org/html/2605.06866#bib.bib25),[10](https://arxiv.org/html/2605.06866#bib.bib28)\]\. In practice, each update modifies only the sampled state, and in online RL, the sampled states are generated by a Markovian trajectory\. Finite\-iteration guarantees for asynchronous, trajectory\-driven updates are therefore the natural benchmark for understanding how quickly CTD and MTD approach their projected fixed point in regimes relevant to practice\. Several recent results address neighboring finite\-iteration questions\[[37](https://arxiv.org/html/2605.06866#bib.bib45),[7](https://arxiv.org/html/2605.06866#bib.bib46),[44](https://arxiv.org/html/2605.06866#bib.bib47),[34](https://arxiv.org/html/2605.06866#bib.bib48),[16](https://arxiv.org/html/2605.06866#bib.bib49),[36](https://arxiv.org/html/2605.06866#bib.bib8),[8](https://arxiv.org/html/2605.06866#bib.bib9),[52](https://arxiv.org/html/2605.06866#bib.bib10),[41](https://arxiv.org/html/2605.06866#bib.bib14),[35](https://arxiv.org/html/2605.06866#bib.bib11)\]\. These works establish that non\-asymptotic distributional analysis is feasible, but do not provide a finite\-iteration theory for the standard asynchronous categorical recursion driven by one\-state updates\.

The present paper is organized in two halves\. The first treats discounted policy evaluation under i\.i\.d\. and Markovian sampling, with a framework guided by Chen et al\. \[[12](https://arxiv.org/html/2605.06866#bib.bib3),[13](https://arxiv.org/html/2605.06866#bib.bib2),[11](https://arxiv.org/html/2605.06866#bib.bib1)\], Robbins and Monro \[[38](https://arxiv.org/html/2605.06866#bib.bib26)\], Ljung \[[29](https://arxiv.org/html/2605.06866#bib.bib27)\], Borkar \[[9](https://arxiv.org/html/2605.06866#bib.bib30)\], and Kushner and Yin \[[26](https://arxiv.org/html/2605.06866#bib.bib31)\]\. The second treats a tractable instance of undiscounted policy evaluation, in the finite, fixed\-horizon episodic regime\. Across both halves, the main theme remains the same: after suitable statewise isometric embeddings, CTD and MTD constitute an asynchronous stochastic approximation \(SA\) recursion in a block\-supremum geometry\.

Practical interpretation of the sampling regimes\. Each sampling model considered in this work has a practical RL interpretation\. The discounted i\.i\.d\. regime points to a replay\-based or generative\-model benchmark, where updates are formed from approximately independent samples drawn from a buffer or simulator, also akin to the synchronous analyses often used to study TD\-style methods theoretically\. The discounted Markovian regime is the standard online setting for TD learning, in which samples are generated sequentially along a single behavior trajectory\. In the finite\-horizon undiscounted case, the episodic regime is the standard reset\-based formulation of finite\-horizon RL, where interaction proceeds for a fixed horizon $H$ and then restarts from an initial distribution\. Our main claim is that, once CTD and MTD are written in the right statewise embedding, these update rules become asynchronous SA recursions with contractive structure enabling finite\-iteration control\.
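
The three sampling regimes can be illustrated with a minimal sketch. It assumes a hypothetical state-to-state transition matrix `P_pi` induced by the fixed policy; the function names and arguments are introduced here for illustration and do not come from the paper.

```python
import numpy as np

def sample_iid(rho, n, rng):
    """i.i.d. regime: each state is drawn independently from the law rho."""
    return rng.choice(len(rho), size=n, p=rho)

def sample_markov(P_pi, s0, n, rng):
    """Markovian regime: states follow one trajectory of the chain P_pi."""
    states = [s0]
    for _ in range(n - 1):
        states.append(rng.choice(P_pi.shape[0], p=P_pi[states[-1]]))
    return np.array(states)

def sample_episodic(P_pi, init, H, episodes, rng):
    """Episodic regime: run for a fixed horizon H, then reset from init."""
    return np.stack([
        sample_markov(P_pi, rng.choice(len(init), p=init), H, rng)
        for _ in range(episodes)
    ])

rng = np.random.default_rng(0)
P_pi = np.array([[0.9, 0.1], [0.2, 0.8]])  # hypothetical two-state chain
iid_states = sample_iid(np.array([0.5, 0.5]), 5, rng)
markov_states = sample_markov(P_pi, 0, 5, rng)
episodes = sample_episodic(P_pi, np.array([1.0, 0.0]), H=3, episodes=2, rng=rng)
```

In the i.i.d. sketch consecutive states are independent, whereas in the Markovian sketch each state depends on its predecessor, which is why the Markovian bounds later carry mixing-time factors.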

Contributions\.

1. We establish finite\-iteration guarantees for asynchronous CTD and MTD under discounted i\.i\.d\. and Markovian sampling\.
2. We establish finite\-iteration guarantees for undiscounted fixed\-horizon CTD and MTD under i\.i\.d\. episodic sampling\.
3. We record a deterministic representation\-error decomposition that turns the projected fixed\-point bounds into total\-error bounds with an explicit projection bias term\.

## 2 Related work

Two lines of work are most directly relevant\. On the finite\-iteration side, Chen et al\. \[[12](https://arxiv.org/html/2605.06866#bib.bib3),[13](https://arxiv.org/html/2605.06866#bib.bib2),[11](https://arxiv.org/html/2605.06866#bib.bib1)\] develop non\-asymptotic SA tools for contractive recursions under i\.i\.d\. and Markovian sampling, building on the broader TD and SA literature \[[7](https://arxiv.org/html/2605.06866#bib.bib46),[37](https://arxiv.org/html/2605.06866#bib.bib45),[38](https://arxiv.org/html/2605.06866#bib.bib26),[29](https://arxiv.org/html/2605.06866#bib.bib27),[9](https://arxiv.org/html/2605.06866#bib.bib30),[26](https://arxiv.org/html/2605.06866#bib.bib31)\]\. On the distributional side, Rowland et al\. \[[39](https://arxiv.org/html/2605.06866#bib.bib7)\] and Bellemare et al\. \[[5](https://arxiv.org/html/2605.06866#bib.bib5)\] establish the scalar categorical projection and Cramér contraction theory underlying CTD, while Wiltzer et al\. \[[49](https://arxiv.org/html/2605.06866#bib.bib4)\] provide the multivariate signed\-categorical MMD framework for MTD\. For the undiscounted half, our setup follows the fixed\-horizon viewpoint of De Asis et al\. \[[18](https://arxiv.org/html/2605.06866#bib.bib55)\]\. Our contribution is to bring these threads together for the tabular asynchronous categorical recursions that are used in practice, rather than for synchronous or idealized surrogates\.

Recent non\-asymptotic distributional analyses study several neighboring regimes\. Peng et al\. \[[36](https://arxiv.org/html/2605.06866#bib.bib8)\] and Böck and Heitzinger \[[8](https://arxiv.org/html/2605.06866#bib.bib9)\] analyze CTD with generative\-model access, where updates are synchronous or accelerated\. Zhang et al\. \[[52](https://arxiv.org/html/2605.06866#bib.bib10)\] and Rowland et al\. \[[41](https://arxiv.org/html/2605.06866#bib.bib14)\] study model\-based or direct\-estimation procedures with stronger sampling access\. Wu et al\. \[[50](https://arxiv.org/html/2605.06866#bib.bib15)\] focus on offline policy evaluation, Peng et al\. \[[35](https://arxiv.org/html/2605.06866#bib.bib11)\] study linear function approximation, and Kastner et al\. \[[25](https://arxiv.org/html/2605.06866#bib.bib12)\] consider a KL\-based categorical analysis with a different divergence and asymptotic emphasis\. By contrast, we analyze the exact asynchronous recursions of CTD and MTD\.

## 3 Discounted categorical policy evaluation

We consider a discounted Markov decision process \[[45](https://arxiv.org/html/2605.06866#bib.bib24)\] $(\mathcal{S},\mathcal{A},P,R,\gamma)$ with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, reward function $R(s,a)$, transition kernel $P(\cdot\mid s,a)$, and discount factor $\gamma\in(0,1)$\. We assume that $R:\mathcal{S}\times\mathcal{A}\to[0,1]$ in the scalar case, and $R:\mathcal{S}\times\mathcal{A}\to[0,1]^{q}$, $q\geq 2$, in the multivariate case\. A policy $\pi$ is fixed throughout\. The induced state trajectory is either i\.i\.d\. with law $\rho$ or a Markov chain with stationary distribution $\mu_{\mathcal{S}}$\. In the Markovian case we assume irreducibility and aperiodicity\. The finiteness of $\mathcal{S}$ implies that the Markov chain resulting from the MDP with fixed policy $\pi$ mixes geometrically \[[27](https://arxiv.org/html/2605.06866#bib.bib29)\], i\.e\., that there exist constants $C_{\mathrm{mix}}\geq 1$ and $\sigma_{\mathrm{mix}}\in(0,1)$ such that

$$
\sup_{x\in\mathcal{S}}\left\lVert\Pr(S_{k}\in\cdot\mid S_{0}=x)-\mu_{\mathcal{S}}(\cdot)\right\rVert_{\mathrm{TV}}\leq C_{\mathrm{mix}}\sigma_{\mathrm{mix}}^{k}\qquad\text{for all }k\geq 0. \tag{1}
$$

For $\delta>0$, we define the associated mixing time

$$
t_{\delta}:=\min\Bigl\{k\geq 0:\sup_{x\in\mathcal{S}}\left\lVert\Pr(S_{k}\in\cdot\mid S_{0}=x)-\mu_{\mathcal{S}}(\cdot)\right\rVert_{\mathrm{TV}}\leq\delta\Bigr\}. \tag{2}
$$
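
For a small chain, the mixing time in (2) can be computed by brute force. The following is a minimal sketch, assuming the policy-induced chain is available as a row-stochastic matrix `P_pi` (a hypothetical input, not something the online algorithm has access to); the helper names are illustrative.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Left eigenvector of P_pi for eigenvalue 1, normalized to sum to 1."""
    evals, evecs = np.linalg.eig(P_pi.T)
    idx = np.argmin(np.abs(evals - 1.0))
    mu = np.real(evecs[:, idx])
    return mu / mu.sum()

def mixing_time(P_pi, delta, max_steps=10_000):
    """Smallest k with sup_x TV(Pr(S_k in . | S_0 = x), mu) <= delta, as in (2)."""
    mu = stationary_distribution(P_pi)
    Pk = np.eye(P_pi.shape[0])  # row x holds the law of S_k given S_0 = x
    for k in range(max_steps + 1):
        worst_tv = 0.5 * np.abs(Pk - mu).sum(axis=1).max()
        if worst_tv <= delta:
            return k
        Pk = Pk @ P_pi
    raise RuntimeError("chain did not mix within max_steps")

P_pi = np.array([[0.9, 0.1], [0.2, 0.8]])  # hypothetical two-state chain
t = mixing_time(P_pi, delta=0.1)
```

For this two-state example the second eigenvalue is 0.7 and the worst-case total-variation distance after $k$ steps is $(2/3)\cdot 0.7^{k}$, so the geometric bound (1) holds with $\sigma_{\mathrm{mix}}=0.7$.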

For both CTD and MTD, we work with a block\-supremum contraction metric $\ell_{\infty}$, which admits a statewise isometric embedding $I$ to a product space with contraction norm $\left\lVert\cdot\right\rVert_{2,\infty}$, and a statewise projection $\Pi^{\Theta}$ onto the space of $\Theta$\-supported, state\-indexed representations $\mathcal{F}^{\mathcal{S}}_{\Theta}$\. We can then compose the distributional Bellman operator $T^{\pi}$ to obtain the embedded, projected operator

$$
\mathcal{O}:=I\circ\Pi^{\Theta}T^{\pi}\circ I^{-1}. \tag{3}
$$

This operator then admits a one\-step sampled Bellman target $\widehat{T}(U_{k};s,(R_{k},S_{k+1}))$ computed from a current estimate $U_{k}$ and a random sample $(R_{k},S_{k+1})$, satisfying

$$
\mathbb{E}\left[\widehat{T}(U_{k};S_{k},(R_{k},S_{k+1}))\mid U_{k},S_{k}=s\right]=(\mathcal{O}U_{k})(s), \tag{4}
$$

and a recursion that takes the form

$$
U_{k+1}=U_{k}+\alpha_{k}P_{S_{k}}\bigl(\widehat{T}(U_{k};S_{k},(R_{k},S_{k+1}))-U_{k}(S_{k})\bigr), \tag{5}
$$

where for each $s\in\mathcal{S}$, $P_{s}$ denotes the coordinate projector onto block $s$\.
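
The single-block structure of recursion (5) can be sketched as follows. This assumes the embedded iterate is stored as an array `U` with one row per state; the storage layout and the name `async_sa_step` are choices made for this illustration.

```python
import numpy as np

def async_sa_step(U, s, target, alpha):
    """One step of recursion (5): only block s moves toward the sampled target."""
    U_next = U.copy()
    U_next[s] = U[s] + alpha * (target - U[s])
    return U_next

U0 = np.zeros((3, 2))  # three states, two embedded coordinates per state
U1 = async_sa_step(U0, s=1, target=np.array([1.0, 2.0]), alpha=0.5)
```

All blocks other than the sampled one are left unchanged, which is what distinguishes this recursion from the synchronous full-state updates analyzed in earlier work.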

The point of the discounted analysis is that the exact one\-state recursion already has the ingredients needed for finite\-iteration SA bounds\. Concretely, the proof uses the contraction of the averaged operator in $\lVert\cdot\rVert_{2,\infty}$, a Moreau\-envelope smoothing of $\lVert\cdot\rVert_{2,\infty}$ based on $\lVert\cdot\rVert_{2,p}$ \[[9](https://arxiv.org/html/2605.06866#bib.bib30),[26](https://arxiv.org/html/2605.06866#bib.bib31),[3](https://arxiv.org/html/2605.06866#bib.bib32),[31](https://arxiv.org/html/2605.06866#bib.bib33),[2](https://arxiv.org/html/2605.06866#bib.bib34)\], an affine conditional second\-moment bound or a centered pathwise perturbation bound depending on the method, the samplewise $1$\-Lipschitzness of the one\-step target map in $\lVert\cdot\rVert_{2,\infty}$, and the geometric mixing in the Markovian case\. More precisely, with $p^{\star}:=\max\{2,\lceil\log\lvert\mathcal{S}\rvert\rceil\}$, the smoothing argument introduces a parameter $\vartheta>0$ through the generalized Moreau envelope $M_{\vartheta,p^{\star}}$ of the squared block\-supremum distance\. More details about these common ingredients are deferred to Appendix [A](https://arxiv.org/html/2605.06866#A1), while the formal verification of the discounted finite\-iteration results for CTD and MTD is deferred to Appendices [B](https://arxiv.org/html/2605.06866#A2) and [C](https://arxiv.org/html/2605.06866#A3), respectively\.

### 3\.1 Discounted CTD

For each state $s\in\mathcal{S}$, fix an ordered support $\Theta(s)=\{\theta_{1}(s)<\cdots<\theta_{d}(s)\}\subset\mathbb{R}$\. The class $\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta}$ consists of state\-indexed categorical laws supported on these statewise grids\. The contraction metric $\ell_{\mathrm{C},\infty}$ is the supremum Cramér metric, the embedding $I_{\mathrm{C}}$ is the standard cumulative\-mass isometry $I_{\mathrm{C},s}$ \[[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\] applied to all states, and the statewise projection $\Pi_{\mathrm{C}}^{\Theta}$ is the usual linear\-interpolation categorical projection $\Pi_{\mathrm{C}}^{\Theta(s)}$ applied to all states\.

Given a sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, the sampled Bellman target is

$$
\widehat{T}_{\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{\mathrm{C},S_{k}}\Bigl(\Pi_{\mathrm{C}}^{\Theta(S_{k})}\bigl((f_{R_{k},\gamma})_{\#}I_{\mathrm{C},S_{k+1}}^{-1}(U_{k}(S_{k+1}))\bigr)\Bigr), \tag{6}
$$

where $f_{r,\gamma}(z)=r+\gamma z$ and $\#$ is the measure\-pushforward operator\.
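
The sampled target (6) can be sketched in a few lines. Two assumptions are made here and are not taken from the paper: the cumulative-mass embedding is written as gap-weighted CDF values, $\sqrt{\theta_{i+1}-\theta_{i}}\,F_{i}$ for $i<d$, which makes the Euclidean distance equal to the Cramér distance for laws on a common grid; and all function names are illustrative.

```python
import numpy as np

def categorical_projection(theta, atoms, probs):
    """Linear-interpolation projection of sum_j probs[j]*delta_{atoms[j]} onto the sorted grid theta."""
    out = np.zeros(len(theta))
    for y, w in zip(atoms, probs):
        if y <= theta[0]:
            out[0] += w            # mass below the grid goes to the first atom
        elif y >= theta[-1]:
            out[-1] += w           # mass above the grid goes to the last atom
        else:
            i = np.searchsorted(theta, y, side="right") - 1  # theta[i] <= y < theta[i+1]
            frac = (y - theta[i]) / (theta[i + 1] - theta[i])
            out[i] += w * (1.0 - frac)
            out[i + 1] += w * frac
    return out

def cramer_embed(theta, p):
    """Assumed embedding: CDF values at the first d-1 atoms, scaled by sqrt of grid gaps."""
    return np.sqrt(np.diff(theta)) * np.cumsum(p)[:-1]

def ctd_target_probs(theta_s, theta_next, p_next, r, gamma):
    """Push the next-state law through z -> r + gamma*z, then project onto the grid of s."""
    return categorical_projection(theta_s, r + gamma * theta_next, p_next)

theta = np.linspace(0.0, 2.0, 5)               # shared grid, for simplicity
p_next = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # next-state law: point mass at 1.0
target = ctd_target_probs(theta, theta, p_next, r=0.25, gamma=0.5)

# One asynchronous step on the sampled state, carried out in the embedded space.
alpha = 0.1
p_s = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
U_s = cramer_embed(theta, p_s)
U_s_next = U_s + alpha * (cramer_embed(theta, target) - U_s)
```

Because the assumed embedding is linear in the probability vector, the embedded step coincides with mixing the probability vectors directly, `(1 - alpha) * p_s + alpha * target`, which is the form in which CTD is usually implemented.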

The next theorem states the discounted CTD convergence rates for various step size regimes\. Appendix [B](https://arxiv.org/html/2605.06866#A2) verifies the contraction, Lipschitz continuity, and perturbation ingredients and records the explicit constants\.

###### Theorem 1 \(Discounted asynchronous CTD\)\.

Let $\eta^{\star}\in\mathcal{F}_{\mathrm{C},\Theta}^{\mathcal{S}}$ denote the unique fixed point of $\Pi_{\mathrm{C}}^{\Theta}T^{\pi}$, and let $\eta_{k}:=I_{\mathrm{C}}^{-1}(U_{k})$ be generated by the asynchronous recursion \([5](https://arxiv.org/html/2605.06866#S3.E5)\) with $\widehat{T}=\widehat{T}_{\mathrm{C}}$\.

\(i\) i\.i\.d\. sampling\. Suppose $(S_{k})_{k\geq 0}$ are i\.i\.d\. with law $\rho$ on $\mathcal{S}$, and $\min_{s\in\mathcal{S}}\rho(s)>0$\. There exist explicit constants $C^{\mathrm{iid}}_{\mathrm{C}},c^{\mathrm{iid}}_{\mathrm{C}},\bar{\alpha}^{\mathrm{iid}}_{\mathrm{C}}>0$ and explicit thresholds $h^{\mathrm{iid}}_{\mathrm{C}}(\alpha)$ and $h^{\mathrm{iid}}_{\mathrm{C}}(\alpha,z)$ for which the following results hold in three step size regimes\.

Constant step size\. If $\alpha_{k}\equiv\alpha\leq\bar{\alpha}^{\mathrm{iid}}_{\mathrm{C}}$, then the iterates converge geometrically to an $O(\alpha)$ neighborhood:

$$
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{C}}\,\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}(1-c^{\mathrm{iid}}_{\mathrm{C}}\alpha)^{k}+C^{\mathrm{iid}}_{\mathrm{C}}\alpha. \tag{7}
$$

Linearly\-diminishing step size\. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{iid}}_{\mathrm{C}}$, and $h\geq h^{\mathrm{iid}}_{\mathrm{C}}(\alpha)$, then the leading residual term decays as $O(1/k)$:

$$
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{C}}\,\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{c^{\mathrm{iid}}_{\mathrm{C}}\alpha}+\frac{C^{\mathrm{iid}}_{\mathrm{C}}\alpha^{2}}{c^{\mathrm{iid}}_{\mathrm{C}}\alpha-1}\cdot\frac{1}{k+h}. \tag{8}
$$

Polynomially\-diminishing step size\. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{iid}}_{\mathrm{C}}(\alpha,z)$, then the leading residual term decays as $O(1/k^{z})$:

$$
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{C}}\,\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{\mathrm{iid}}_{\mathrm{C}}\alpha}{1-z}\bigl((k+h)^{1-z}-h^{1-z}\bigr)\right)+\frac{C^{\mathrm{iid}}_{\mathrm{C}}\alpha}{(k+h)^{z}}. \tag{9}
$$

\(ii\) Markovian trajectory sampling\. Suppose $(S_{k})_{k\geq 0}$ is irreducible and aperiodic with stationary distribution $\mu_{\mathcal{S}}$ satisfying $\min_{s\in\mathcal{S}}\mu_{\mathcal{S}}(s)>0$ and the mixing condition \([1](https://arxiv.org/html/2605.06866#S3.E1)\)\. Let $t_{k}:=t_{\alpha_{k}}$ and set $K:=\min\{k\geq 0:k\geq t_{k}\}$\. There exist explicit constants $C^{\mathrm{mk}}_{\mathrm{C}},c^{\mathrm{mk}}_{\mathrm{C}},\bar{\alpha}^{\mathrm{mk}}_{\mathrm{C}}>0$ and explicit thresholds $h^{\mathrm{mk}}_{\mathrm{C}}(\alpha)$ and $h^{\mathrm{mk}}_{\mathrm{C}}(\alpha,z)$ for which the following results hold in three step size regimes\.

Constant step size\. If $\alpha_{k}\equiv\alpha$ and $\alpha t_{\alpha}\leq\bar{\alpha}^{\mathrm{mk}}_{\mathrm{C}}$, then for $k\geq t_{\alpha}$, the iterates converge geometrically to an $\tilde{O}(\alpha)$ neighborhood:

$$
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{C}}\bigl(1+\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)(1-c^{\mathrm{mk}}_{\mathrm{C}}\alpha)^{k-t_{\alpha}}+C^{\mathrm{mk}}_{\mathrm{C}}\alpha t_{\alpha}. \tag{10}
$$

Linearly\-diminishing step size\. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{mk}}_{\mathrm{C}}$, and $h\geq h^{\mathrm{mk}}_{\mathrm{C}}(\alpha)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k)$:

$$
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{C}}\bigl(1+\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\left(\frac{K+h}{k+h}\right)^{c^{\mathrm{mk}}_{\mathrm{C}}\alpha}+\frac{C^{\mathrm{mk}}_{\mathrm{C}}\alpha^{2}}{c^{\mathrm{mk}}_{\mathrm{C}}\alpha-1}\cdot\frac{t_{k}}{k+h}. \tag{11}
$$

Polynomially\-diminishing step size\. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{mk}}_{\mathrm{C}}(\alpha,z)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k^{z})$:

$$
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{C}}\bigl(1+\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\exp\left(\frac{c^{\mathrm{mk}}_{\mathrm{C}}\alpha}{1-z}\bigl((K+h)^{1-z}-(k+h)^{1-z}\bigr)\right)+\frac{C^{\mathrm{mk}}_{\mathrm{C}}\alpha t_{k}}{(k+h)^{z}}. \tag{12}
$$
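
To get a feel for how the three i.i.d. step size regimes trade off transient decay against the residual term, the right-hand sides of (7), (8), and (9) can be evaluated numerically. The sketch below uses placeholder values for the constants `C` and `c` and for the initial squared error `e0`; the paper's explicit constants are given in its Appendix B and are not reproduced here.

```python
import numpy as np

def bound_constant(k, alpha, C, c, e0):
    """Right-hand side of (7): geometric transient plus an O(alpha) floor."""
    return C * e0 * (1.0 - c * alpha) ** k + C * alpha

def bound_linear(k, alpha, h, C, c, e0):
    """Right-hand side of (8); requires c * alpha > 1."""
    return C * e0 * (h / (k + h)) ** (c * alpha) + C * alpha**2 / (c * alpha - 1.0) / (k + h)

def bound_poly(k, alpha, h, z, C, c, e0):
    """Right-hand side of (9) for z in (0, 1)."""
    transient = np.exp(-c * alpha / (1.0 - z) * ((k + h) ** (1.0 - z) - h ** (1.0 - z)))
    return C * e0 * transient + C * alpha / (k + h) ** z

C, c, e0 = 2.0, 0.5, 1.0  # placeholder constants, for illustration only
ks = np.array([0, 10, 100, 1000, 10000])
const_vals = bound_constant(ks, alpha=0.1, C=C, c=c, e0=e0)
lin_vals = bound_linear(ks, alpha=4.0, h=10.0, C=C, c=c, e0=e0)
poly_vals = bound_poly(ks, alpha=1.0, h=1.0, z=0.5, C=C, c=c, e0=e0)
```

With these placeholder values the constant-step bound flattens at `C * alpha`, while the two diminishing schedules keep decreasing in `k`, which matches the qualitative statements of the theorem.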

###### Proof sketch\.

The Cramér embedding makes the projected Bellman map a $\sqrt{\gamma}$\-contraction\. Under i\.i\.d\. sampling, the averaged asynchronous map contracts according to the minimum state mass, while under Markovian sampling, the same drift is recovered after the mixing\-time comparison\. The sampled target is samplewise Lipschitz and its centered perturbation is uniformly bounded by the support radius\. A sufficiently small Moreau envelope smoothing parameter makes the drift constants positive, and the upper bounds on $\alpha$ or lower bounds on $h$ ensure the quadratic remainder is dominated by the negative drift\. Appendix [B](https://arxiv.org/html/2605.06866#A2) gives the exact constants and thresholds\.

∎

The following finite\-sample corollary follows from the linearly\-diminishing step size bounds\.

###### Corollary 2\.

To guarantee $\mathbb{E}[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ with the discounted CTD recursion, it suffices to take $k=O(\varepsilon^{-2})$ samples in the i\.i\.d\. and $k=\tilde{O}(\varepsilon^{-2})$ samples in the Markovian sampling case\.
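
One way to see where the $\varepsilon^{-2}$ rate comes from in the i.i.d. case is the following short derivation from the linearly-diminishing bound (8). It is a sketch rather than the paper's own proof, and $C'$ is a shorthand constant introduced here.

```latex
\begin{align*}
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_k,\eta^\star)\bigr]
  &\le \sqrt{\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_k,\eta^\star)^2\bigr]}
  && \text{(Jensen's inequality)} \\
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_k,\eta^\star)^2\bigr]
  &\le \frac{C'}{k+h},
  \quad
  C' := C^{\mathrm{iid}}_{\mathrm{C}}\,\ell_{\mathrm{C},\infty}(\eta_0,\eta^\star)^2\,h
        + \frac{C^{\mathrm{iid}}_{\mathrm{C}}\alpha^2}{c^{\mathrm{iid}}_{\mathrm{C}}\alpha-1}
  && \text{(by (8), since } \bigl(\tfrac{h}{k+h}\bigr)^{c^{\mathrm{iid}}_{\mathrm{C}}\alpha}
     \le \tfrac{h}{k+h} \text{ when } c^{\mathrm{iid}}_{\mathrm{C}}\alpha>1\text{)} \\
\sqrt{\frac{C'}{k+h}} \le \varepsilon
  &\iff k+h \ge \frac{C'}{\varepsilon^{2}}.
\end{align*}
```

In the Markovian case the same steps apply to (11). The geometric mixing condition (1) gives $t_{\delta}=O(\log(1/\delta))$, so with $\alpha_{k}=\alpha/(k+h)$ the factor $t_{k}$ grows only logarithmically in $k$, which accounts for the $\tilde{O}$ in place of $O$.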

### 3\.2 Discounted MTD

Let $q\geq 2$\. For each state $s\in\mathcal{S}$, fix a $d$\-point support $\Theta(s)=\{\theta_{1}(s),\ldots,\theta_{d}(s)\}\subset\mathbb{R}^{q}$\. The class $\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta}$ consists of state\-indexed, mass\-$1$ signed categorical laws supported on these points\. The contraction metric $\ell_{\mathrm{M},\infty}$ is the supremum MMD metric with a characteristic kernel $\kappa$ induced by a shift\-invariant, $c$\-homogeneous semimetric of strong negative type \[[42](https://arxiv.org/html/2605.06866#bib.bib50),[19](https://arxiv.org/html/2605.06866#bib.bib51),[33](https://arxiv.org/html/2605.06866#bib.bib52),[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\.

The embedding $I_{\mathrm{M}}$ is the per\-state embedding $I_{\mathrm{M},s}(\eta(s)):=K_{s}^{1/2}p(s)$ applied to all states, where $K_{s}:=\bigl(\kappa(\theta_{i}(s),\theta_{j}(s))\bigr)_{i,j=1}^{d}$ is the Gram matrix on the support $\Theta(s)$ \[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\. The statewise projection $\Pi_{\mathrm{M}}^{\Theta}$ is

$$
\Pi^{\Theta(s)}_{\mathrm{M}}\nu:=\operatorname*{arg\,inf}_{p\in\mathbb{R}_{1}^{d}}\mathrm{MMD}_{\kappa}\Bigl(\sum_{i=1}^{d}p_{i}\delta_{\theta_{i}(s)},\nu\Bigr), \tag{13}
$$

applied to all states, where $\mathbb{R}_{1}^{d}$ is the affine subspace of $\mathbb{R}^{d}$ with unit total mass\. The projection is well\-defined for $\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta}$, as shown by Wiltzer et al\. \[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\.

Given a sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, the sampled Bellman target is

$$
\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{\mathrm{M},S_{k}}\Bigl(\Pi_{\mathrm{M}}^{\Theta(S_{k})}\bigl((f_{R_{k},\gamma})_{\#}I_{\mathrm{M},S_{k+1}}^{-1}(U_{k}(S_{k+1}))\bigr)\Bigr). \tag{14}
$$
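
The projection (13) inside the target (14) is an equality-constrained quadratic problem, so it can be solved through its KKT system. The sketch below makes assumptions that are not taken from the paper: it uses the energy-distance form $\kappa(x,y)=-\lVert x-y\rVert_{2}$ as a stand-in for the kernel (this yields the same squared MMD between unit-mass signed measures as a distance-induced kernel, because the anchor terms cancel on zero-mass differences), it works with mass vectors rather than forming $K_{s}^{1/2}$, and all function names are illustrative.

```python
import numpy as np

def energy_kernel(X, Y):
    """Stand-in kernel kappa(x, y) = -||x - y||_2, evaluated pairwise (an assumption)."""
    return -np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

def mmd_project(theta, atoms, weights):
    """Unit-mass signed weights p on theta minimizing MMD^2 to sum_j weights[j]*delta_{atoms[j]}.

    Minimizes p^T K p - 2 p^T b subject to sum(p) = 1 by solving
    [[K, 1], [1^T, 0]] [p; lam] = [b; 1].
    """
    d = theta.shape[0]
    K = energy_kernel(theta, theta)
    b = energy_kernel(theta, atoms) @ weights
    kkt = np.zeros((d + 1, d + 1))
    kkt[:d, :d] = K
    kkt[:d, d] = 1.0
    kkt[d, :d] = 1.0
    sol = np.linalg.solve(kkt, np.concatenate([b, [1.0]]))
    return sol[:d]  # entries may be negative: the representation is signed

def mtd_target_weights(theta_s, theta_next, p_next, r, gamma):
    """Push the next-state law through z -> r + gamma*z, then MMD-project onto theta_s."""
    return mmd_project(theta_s, r + gamma * theta_next, p_next)

theta = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # d = 3 support points in R^2
p_next = np.array([0.2, 0.5, 0.3])
target = mtd_target_weights(theta, theta, p_next, r=np.array([0.1, 0.0]), gamma=0.9)
```

The solve relies on the support points being distinct, so that the kernel matrix is definite on the zero-sum subspace and the KKT system is nonsingular.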

The next theorem gives the MTD rates analogous to Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1)\. The proof backbone is the same as for CTD, but the perturbation geometry is affine in the norm of the current iterate\.

###### Theorem 3 \(Discounted asynchronous MTD\)\.

Let $\eta^{\star}\in\mathcal{F}_{\mathrm{M},\Theta}^{\mathcal{S}}$ denote the unique fixed point of $\Pi_{\mathrm{M}}^{\Theta}T^{\pi}$, and let $\eta_{k}:=I_{\mathrm{M}}^{-1}(U_{k})$ be generated by \([5](https://arxiv.org/html/2605.06866#S3.E5)\) with $\widehat{T}=\widehat{T}_{\mathrm{M}}$\.

\(i\) i\.i\.d\. sampling\. Suppose $(S_{k})_{k\geq 0}$ are i\.i\.d\. with law $\rho$ on $\mathcal{S}$, and $\min_{s\in\mathcal{S}}\rho(s)>0$\. There exist explicit constants $C^{\mathrm{iid}}_{\mathrm{M}},c^{\mathrm{iid}}_{\mathrm{M}},\bar{\alpha}^{\mathrm{iid}}_{\mathrm{M}}>0$ and explicit thresholds $h^{\mathrm{iid}}_{\mathrm{M}}(\alpha)$ and $h^{\mathrm{iid}}_{\mathrm{M}}(\alpha,z)$ for which the following results hold in three step size regimes\.

Constant step size\. If $\alpha_{k}\equiv\alpha\leq\bar{\alpha}^{\mathrm{iid}}_{\mathrm{M}}$, then the iterates converge geometrically to an $O(\alpha)$ neighborhood:

$$
\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{M}}\,\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}(1-c^{\mathrm{iid}}_{\mathrm{M}}\alpha)^{k}+C^{\mathrm{iid}}_{\mathrm{M}}\alpha. \tag{15}
$$

Linearly\-diminishing step size\. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{iid}}_{\mathrm{M}}$, and $h\geq h^{\mathrm{iid}}_{\mathrm{M}}(\alpha)$, then the leading residual term decays as $O(1/k)$:

$$
\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{M}}\,\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{c^{\mathrm{iid}}_{\mathrm{M}}\alpha}+\frac{C^{\mathrm{iid}}_{\mathrm{M}}\alpha^{2}}{c^{\mathrm{iid}}_{\mathrm{M}}\alpha-1}\cdot\frac{1}{k+h}. \tag{16}
$$

Polynomially\-diminishing step size\. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{iid}}_{\mathrm{M}}(\alpha,z)$, then the leading residual term decays as $O(1/k^{z})$:

$$
\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{M}}\,\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{\mathrm{iid}}_{\mathrm{M}}\alpha}{1-z}\bigl((k+h)^{1-z}-h^{1-z}\bigr)\right)+\frac{C^{\mathrm{iid}}_{\mathrm{M}}\alpha}{(k+h)^{z}}. \tag{17}
$$
(ii) Markovian trajectory sampling. Suppose $(S_{k})_{k\geq 0}$ is irreducible and aperiodic with stationary distribution $\mu_{\mathcal{S}}$ satisfying $\min_{s\in\mathcal{S}}\mu_{\mathcal{S}}(s)>0$ and the mixing condition ([1](https://arxiv.org/html/2605.06866#S3.E1)). Let $t_{k}:=t_{\alpha_{k}}$ and set $K:=\min\{k\geq 0:k\geq t_{k}\}$. There exist explicit constants $C^{\mathrm{mk}}_{\mathrm{M}},c^{\mathrm{mk}}_{\mathrm{M}},\bar{\alpha}^{\mathrm{mk}}_{\mathrm{M}}>0$ and explicit thresholds $h^{\mathrm{mk}}_{\mathrm{M}}(\alpha)$ and $h^{\mathrm{mk}}_{\mathrm{M}}(\alpha,z)$ for which the following results hold in three step size regimes.

Constant step size. If $\alpha_{k}\equiv\alpha$ and $\alpha t_{\alpha}\leq\bar{\alpha}^{\mathrm{mk}}_{\mathrm{M}}$, then for $k\geq t_{\alpha}$, the iterates converge geometrically to an $\tilde{O}(\alpha)$ neighborhood:

$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{M}}\bigl(1+\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)(1-c^{\mathrm{mk}}_{\mathrm{M}}\alpha)^{k-t_{\alpha}}+C^{\mathrm{mk}}_{\mathrm{M}}\alpha t_{\alpha}. \tag{18}$$

Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{mk}}_{\mathrm{M}}$, and $h\geq h^{\mathrm{mk}}_{\mathrm{M}}(\alpha)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k)$:

$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{M}}\bigl(1+\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\left(\frac{K+h}{k+h}\right)^{c^{\mathrm{mk}}_{\mathrm{M}}\alpha}+\frac{C^{\mathrm{mk}}_{\mathrm{M}}\alpha^{2}}{c^{\mathrm{mk}}_{\mathrm{M}}\alpha-1}\cdot\frac{t_{k}}{k+h}. \tag{19}$$

Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{mk}}_{\mathrm{M}}(\alpha,z)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k^{z})$:

$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{M}}\bigl(1+\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\exp\left(\frac{c^{\mathrm{mk}}_{\mathrm{M}}\alpha}{1-z}\bigl((K+h)^{1-z}-(k+h)^{1-z}\bigr)\right)+\frac{C^{\mathrm{mk}}_{\mathrm{M}}\alpha t_{k}}{(k+h)^{z}}. \tag{20}$$

###### Proof sketch\.

The Gram-matrix embedding turns the projected MTD Bellman map into a $\gamma^{c/2}$-contraction in the block-supremum MMD geometry. The sample target is again Lipschitz, but the signed-categorical projection gives an affine centered perturbation bound rather than the uniform CTD bound. The smoothed SA propositions therefore apply with an affine noise constant. The smoothing parameter can be chosen small enough to make the drift positive, and the step size thresholds absorb the quadratic remainder. Appendix [C](https://arxiv.org/html/2605.06866#A3) gives the exact constants and thresholds. ∎

Once again, the following finite\-sample result is a simple consequence of the previous theorem\.

###### Corollary 4\.

To guarantee $\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ with the discounted MTD recursion, it suffices to take $k=O(\varepsilon^{-2})$ samples in the i.i.d. case and $k=\tilde{O}(\varepsilon^{-2})$ samples in the Markovian sampling case.
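To make the three i.i.d. regimes concrete, the following minimal sketch evaluates the right-hand sides of (15), (16), and (17) as functions of the iteration count. The values of `C`, `c`, and `e0` are hypothetical placeholders chosen only for illustration; the theorem's explicit constants and the thresholds on $h$ are given in the appendix and are not reproduced here.

```python
import math

# Hypothetical placeholder constants (NOT the paper's explicit values).
# e0 stands for the squared initial error ell(eta_0, eta_star)^2.
C, c, e0 = 1.0, 1.0, 1.0

def bound_constant(k, alpha):
    """Right-hand side of the constant-step bound (15)."""
    return C * e0 * (1.0 - c * alpha) ** k + C * alpha

def bound_linear(k, alpha, h):
    """Right-hand side of (16) for alpha_k = alpha / (k + h); needs c * alpha > 1."""
    if c * alpha <= 1.0:
        raise ValueError("requires alpha > 1 / c")
    transient = C * e0 * (h / (k + h)) ** (c * alpha)
    residual = C * alpha ** 2 / (c * alpha - 1.0) / (k + h)
    return transient + residual

def bound_poly(k, alpha, h, z):
    """Right-hand side of (17) for alpha_k = alpha / (k + h)**z with 0 < z < 1."""
    expo = -c * alpha / (1.0 - z) * ((k + h) ** (1.0 - z) - h ** (1.0 - z))
    return C * e0 * math.exp(expo) + C * alpha / (k + h) ** z

# Compare the three bound shapes at a few iteration counts.
for k in (10, 1_000, 100_000):
    print(k, bound_constant(k, 0.01), bound_linear(k, 2.0, 10), bound_poly(k, 1.0, 10, 0.5))
```

Under these placeholder values the constant-step bound flattens at the floor `C * alpha`, while the two diminishing schedules keep decreasing, which is the qualitative trade-off the theorem describes.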

## 4 Undiscounted fixed-horizon categorical policy evaluation

We now turn to the undiscounted fixed-horizon setting, where the value object is indexed by the remaining horizon \[[45](https://arxiv.org/html/2605.06866#bib.bib24), [1](https://arxiv.org/html/2605.06866#bib.bib57), [18](https://arxiv.org/html/2605.06866#bib.bib55)\]. The aim of this section is to recover a comparable finite-iteration picture without relying on discounting, which means the learned object must now be treated as a horizon-indexed stack rather than a single per-state quantity. Fix a horizon $H\in\mathbb{N}$ and a stationary policy $\pi(\cdot\mid s)$. For each $h\in\{0,1,\dots,H\}$ and each $s\in\mathcal{S}$, let $\eta^{h}(s)$ denote the $h$-step return-distribution object under $\pi$, with terminal layer $\eta^{0}(s)=\delta_{0}$.

For $h\geq 1$, we define the undiscounted fixed-horizon distributional Bellman operator by

$$(T_{H}^{\pi}\eta)^{h}(s):=\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,(f_{R(s,a),1})_{\#}\eta^{h-1}(s'). \tag{21}$$

Thus the $h$th horizon bootstraps from the $(h-1)$st horizon, as in the fixed-horizon TD formulation of De Asis et al. \[[18](https://arxiv.org/html/2605.06866#bib.bib55)\].

For both CTD and MTD, we use exactly the same local categorical projections and local statewise isometric embeddings as in Section [3](https://arxiv.org/html/2605.06866#S3), now applied separately at each horizon and then stacked over $h=1,\dots,H$. For each $s\in\mathcal{S}$, the corresponding embedded iterate is

$$U(s):=\bigl(U^{1}(s),\ldots,U^{H}(s)\bigr),\qquad U:=\bigl(U^{h}(s)\bigr)_{1\leq h\leq H,\ s\in\mathcal{S}}. \tag{22}$$

The contraction of $\mathcal{O}$ in the discounted setting is entirely due to the discount factor $\gamma\in(0,1)$ and hence does not carry over to the undiscounted case. However, after introducing a suitable weighted block metric, we recover a contraction. Fix $\lambda\in(0,1)$ and define the weighted block metric and weighted embedded norm by

$$\ell_{H,\infty}(\eta,\eta'):=\max_{1\leq h\leq H,\ s\in\mathcal{S}}\lambda^{h}\ell\bigl(\eta^{h}(s),{\eta'}^{h}(s)\bigr),\qquad\lVert U\rVert_{H,2,\infty}:=\max_{1\leq h\leq H,\ s\in\mathcal{S}}\lambda^{h}\lVert U^{h}(s)\rVert_{2}. \tag{23}$$

Let $\Pi_{H}^{\Theta}$ and $I_{H}$ denote the horizonwise projection and horizonwise embedding, and define

$$\mathcal{O}_{H}:=I_{H}\circ\Pi_{H}^{\Theta}T_{H}^{\pi}\circ I_{H}^{-1}. \tag{24}$$
###### Proposition 5\.

For both CTD and MTD,

$$\ell_{H,\infty}(\Pi_{H}^{\Theta}T_{H}^{\pi}\eta,\Pi_{H}^{\Theta}T_{H}^{\pi}\eta')\leq\lambda\,\ell_{H,\infty}(\eta,\eta'). \tag{25}$$

By construction, $I_{H}$ is an isometric embedding from the fixed-horizon distribution space equipped with $\ell_{H,\infty}$ into the embedding space equipped with $\lVert\cdot\rVert_{H,2,\infty}$. Hence the preceding statement is equivalent to

$$\lVert\mathcal{O}_{H}U-\mathcal{O}_{H}U'\rVert_{H,2,\infty}\leq\lambda\lVert U-U'\rVert_{H,2,\infty}. \tag{26}$$

###### Proof sketch\.

The Bellman update moves horizon $h$ to $h-1$, and the weight $\lambda^{h}$ therefore contributes exactly one factor $\lambda$. The local categorical projections are nonexpansive in the underlying statewise metric, so the weighted stack inherits a contraction. Details are recorded in Appendices [D](https://arxiv.org/html/2605.06866#A4) and [E](https://arxiv.org/html/2605.06866#A5). ∎
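The weighting mechanism can be checked numerically on a scalar analogue. The sketch below is not the distributional operator of the paper: it replaces return distributions by scalar fixed-horizon values, with a hypothetical two-state transition matrix `P`, hypothetical rewards `r`, and an arbitrary choice of `lam`. It only illustrates why shifting the horizon index by one yields a factor $\lambda$ in the weighted supremum metric.

```python
# Toy 2-state chain; P[s][s2] is a hypothetical policy-induced transition matrix.
P = [[0.7, 0.3], [0.4, 0.6]]
r = [1.0, -0.5]          # hypothetical expected rewards
H, lam = 4, 0.5          # horizon and weight parameter lambda in (0, 1)

def bellman(V):
    """Scalar fixed-horizon backup: (TV)^h(s) = r(s) + sum_s2 P(s, s2) V^{h-1}(s2), with V^0 = 0."""
    out = [[0.0, 0.0]]   # layer h = 0 stays at zero
    for h in range(1, H + 1):
        out.append([r[s] + sum(P[s][s2] * V[h - 1][s2] for s2 in range(2))
                    for s in range(2)])
    return out

def weighted_dist(V, W):
    """Maximum over h = 1..H and s of lam**h * |V^h(s) - W^h(s)|."""
    return max(lam ** h * abs(V[h][s] - W[h][s])
               for h in range(1, H + 1) for s in range(2))

# Two arbitrary horizon stacks that share the terminal layer V^0 = 0.
V = [[0.0, 0.0]] + [[float(h), -float(h)] for h in range(1, H + 1)]
W = [[0.0, 0.0]] + [[0.5 * h, 2.0] for h in range(1, H + 1)]

lhs = weighted_dist(bellman(V), bellman(W))
rhs = lam * weighted_dist(V, W)
print(lhs, rhs, lhs <= rhs + 1e-12)
```

The comparison holds for any pair of stacks sharing the terminal layer, since layer $h$ of the backup depends only on layer $h-1$ of its argument and the transition matrix is an averaging operator.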

For each sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, let

$$\widehat{T}_{H}\bigl(U_{k};S_{k},(R_{k},S_{k+1})\bigr):=\bigl(\widehat{T}_{H}^{1}\bigl(U_{k};S_{k},(R_{k},S_{k+1})\bigr),\ldots,\widehat{T}_{H}^{H}\bigl(U_{k};S_{k},(R_{k},S_{k+1})\bigr)\bigr) \tag{27}$$

denote the stacked one-step sampled Bellman target, where the horizonwise components are defined in the CTD and MTD subsections below. Then, for every $s\in\mathcal{S}$,

$$\mathbb{E}\bigl[\widehat{T}_{H}\bigl(U_{k};S_{k},(R_{k},S_{k+1})\bigr)\mid U_{k},\ S_{k}=s\bigr]=(\mathcal{O}_{H}U_{k})(s). \tag{28}$$

Thus a single sampled transition supplies a Bellman sample for the full stack at the visited state: the observed first reward is shared across all horizons, while the continuation term for horizon $h$ is the current $(h-1)$-step estimate at the next state. If $P_{s}^{H}$ denotes the coordinate projector onto the full horizon stack at state $s$, the online fixed-horizon recursion takes the form

$$U_{k+1}=U_{k}+\alpha_{k}P_{S_{k}}^{H}\Bigl(\widehat{T}_{H}\bigl(U_{k};S_{k},(R_{k},S_{k+1})\bigr)-U_{k}(S_{k})\Bigr). \tag{29}$$
We consider only the standard episodic sampling model in which episodes have fixed horizon $H$, reset distribution $\nu_{0}$, are i.i.d. across resets, and follow the stationary policy $\pi$ within each episode. The recursion in ([29](https://arxiv.org/html/2605.06866#S4.E29)) is still performed after each transition, but its averaged drift depends on the within-episode phase. We therefore analyze the episode-boundary sequence $(U_{mH})_{m\geq 0}$, in keeping with the horizon-stacked fixed-horizon viewpoint of De Asis et al. \[[18](https://arxiv.org/html/2605.06866#bib.bib55)\]. To formalize this phase dependence, define the phase distributions

$$\rho_{t}(s):=\Pr(S_{t}=s),\qquad 0\leq t\leq H-1,\qquad s\in\mathcal{S}, \tag{30}$$

and the lower bound

$$\rho_{\min}:=\min_{0\leq t\leq H-1,\ s\in\mathcal{S}}\rho_{t}(s)>0. \tag{31}$$

This full-support condition is used to obtain a uniform contraction factor across phase-state blocks. For each phase $t\in\{0,\ldots,H-1\}$, define the phasewise averaged increment and phasewise averaged map by

$$\Gamma_{t}(U):=\sum_{s\in\mathcal{S}}\rho_{t}(s)\,P_{s}^{H}\bigl((\mathcal{O}_{H}U)(s)-U(s)\bigr),\qquad G_{t}(U):=U+\Gamma_{t}(U). \tag{32}$$

For each episode $m\geq 0$, define the episodewise first- and second-order step size masses

$$\bar{\alpha}_{m}:=\sum_{u=0}^{H-1}\alpha_{mH+u},\qquad\bar{\alpha}_{m}^{(2)}:=\sum_{u=0}^{H-1}\alpha_{mH+u}^{2}. \tag{33}$$

These are the episodewise analogues of the single-step quantities $\alpha_{k}$ and $\alpha_{k}^{2}$ in the discounted analysis. The phase dependence enters through the averaged maps $G_{t}$, while the weighted block-supremum geometry from Appendix [A](https://arxiv.org/html/2605.06866#A1) remains unchanged. The maps $G_{t}$ are auxiliary proof objects for the episodewise drift analysis rather than additional algorithmic iterates.
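The quantities in (30), (31), and (33) are simple to compute for a small example. The sketch below assumes that within an episode the state follows $S_{0}\sim\nu_{0}$ and then a policy-induced transition matrix, so that $\rho_{t+1}=\rho_{t}P$; the numerical values of `nu0` and `P` and the helper names are hypothetical and serve only as an illustration.

```python
def phase_distributions(nu0, P, H):
    """rho_t(s) = Pr(S_t = s) for t = 0..H-1, using rho_0 = nu0 and rho_{t+1} = rho_t P."""
    n = len(nu0)
    rhos = [list(nu0)]
    for _ in range(H - 1):
        prev = rhos[-1]
        rhos.append([sum(prev[s] * P[s][s2] for s in range(n)) for s2 in range(n)])
    return rhos

def step_masses(alpha, m, H):
    """Episodewise first- and second-order step-size masses from (33).

    `alpha` is a function mapping the global iteration index k to alpha_k.
    """
    steps = [alpha(m * H + u) for u in range(H)]
    return sum(steps), sum(a * a for a in steps)

nu0 = [0.5, 0.5]                      # hypothetical reset distribution
P = [[0.9, 0.1], [0.2, 0.8]]          # hypothetical policy-induced transition matrix
H = 3

rhos = phase_distributions(nu0, P, H)
rho_min = min(min(row) for row in rhos)   # must be positive for the full-support condition (31)

# With a constant step size the masses reduce to H * alpha and H * alpha**2.
const_mass = step_masses(lambda k: 0.1, m=0, H=H)
print(rhos, rho_min, const_mass)
```

For a constant step size the first-order mass equals $H\alpha$, which matches the per-episode factor appearing in the constant-step episode-boundary bounds of this section.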

### 4.1 Undiscounted fixed-horizon CTD

For CTD, each state-horizon pair $(h,s)$ carries an ordered, possibly horizon-dependent scalar support $\Theta_{h}(s)=\{\theta_{h,1}(s)<\cdots<\theta_{h,d}(s)\}\subset\mathbb{R}$. The statewise Cramér metric $\ell_{H,\mathrm{C},\infty}$, cumulative-mass embedding $I_{H,\mathrm{C},h,s}$, and linear-interpolation projection $\Pi^{\Theta_{h}(s)}_{H,\mathrm{C}}$ are defined exactly as in Section [3.1](https://arxiv.org/html/2605.06866#S3.SS1), only now applied separately at each horizon and stacked over $h=1,\dots,H$ \[[39](https://arxiv.org/html/2605.06866#bib.bib7), [5](https://arxiv.org/html/2605.06866#bib.bib5)\].

Given a sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, the sampled Bellman target is computed at every horizon $h=1,\dots,H$ by

$$\widehat{T}_{H,\mathrm{C}}^{h}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{H,\mathrm{C},h,S_{k}}\Bigl(\Pi_{H,\mathrm{C}}^{\Theta_{h}(S_{k})}\bigl((f_{R_{k},1})_{\#}I_{H,\mathrm{C},h-1,S_{k+1}}^{-1}(U_{k}^{h-1}(S_{k+1}))\bigr)\Bigr), \tag{34}$$

with the convention that $U_{k}^{0}(s)\equiv 0$. Stacking these horizonwise targets gives

$$\widehat{T}_{H,\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1})):=\bigl(\widehat{T}_{H,\mathrm{C}}^{1}(U_{k};S_{k},(R_{k},S_{k+1})),\dots,\widehat{T}_{H,\mathrm{C}}^{H}(U_{k};S_{k},(R_{k},S_{k+1}))\bigr). \tag{35}$$

One sampled transition therefore updates the full block at $S_{k}$: all horizons share the same first step and differ only in the bootstrapped tail. The next theorem states the resulting episode-boundary rates.
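A minimal sketch of one such stacked update may help fix ideas. It works directly with probability vectors instead of the embedded coordinates $U$, which assumes that the cumulative-mass embedding is linear in the probability vector so that the update (29) becomes a convex mixture of the current vector and the projected target. It also uses a single shared support for all states and horizons, whereas the paper allows state- and horizon-dependent supports. The function names `project` and `fh_ctd_update` are hypothetical.

```python
from bisect import bisect_right

def project(atoms, probs, support):
    """Linear-interpolation projection of a finite distribution onto a sorted support."""
    out = [0.0] * len(support)
    for x, p in zip(atoms, probs):
        if x <= support[0]:
            out[0] += p                               # clamp mass below the support
        elif x >= support[-1]:
            out[-1] += p                              # clamp mass above the support
        else:
            i = bisect_right(support, x) - 1          # support[i] <= x < support[i + 1]
            w = (support[i + 1] - x) / (support[i + 1] - support[i])
            out[i] += p * w
            out[i + 1] += p * (1.0 - w)
    return out

def fh_ctd_update(pmf, s, r, s_next, support, alpha):
    """One online update of the full horizon stack at state s.

    pmf[h][state] is a probability vector on `support` for h = 0..H;
    layer 0 is held fixed at the point mass at zero return.
    """
    H = len(pmf) - 1
    shifted = [theta + r for theta in support]        # undiscounted shift by the reward
    # Build every horizon's target from the current iterate before writing anything.
    targets = [project(shifted, pmf[h - 1][s_next], support) for h in range(1, H + 1)]
    for h in range(1, H + 1):
        pmf[h][s] = [(1.0 - alpha) * p + alpha * t
                     for p, t in zip(pmf[h][s], targets[h - 1])]

support = [0.0, 1.0, 2.0]
delta0 = [1.0, 0.0, 0.0]
uniform = [1.0 / 3] * 3
# Two states, H = 2: layer 0 is the point mass at zero, layers 1 and 2 start uniform.
pmf = [[delta0[:], delta0[:]]] + [[uniform[:], uniform[:]] for _ in range(2)]
fh_ctd_update(pmf, s=0, r=1.0, s_next=1, support=support, alpha=1.0)
print(pmf[1][0], pmf[2][0])
```

Computing all targets before writing matters when the next state coincides with the visited state, because horizon $h$ must bootstrap from the pre-update layer $h-1$.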

###### Theorem 6 (Undiscounted fixed-horizon episodic CTD).

Assume the episodic sampling model described above and $\rho_{\min}>0$. Let $\eta_{H,\mathrm{C}}^{\star}$ denote the unique fixed point of $\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}$, and let $\eta_{k}:=I_{H,\mathrm{C}}^{-1}(U_{k})$ be generated by ([29](https://arxiv.org/html/2605.06866#S4.E29)) with $\widehat{T}_{H}=\widehat{T}_{H,\mathrm{C}}$. There exist explicit constants $C^{H}_{\mathrm{C}},c^{H}_{\mathrm{C}},\bar{\alpha}^{H}_{\mathrm{C}}>0$ and explicit thresholds $g^{H}_{\mathrm{C}}(\alpha)$ and $g^{H}_{\mathrm{C}}(\alpha,z)$ for which the following episode-boundary results hold.

Constant step size. If $\alpha_{k}\equiv\alpha\leq\bar{\alpha}^{H}_{\mathrm{C}}$, then the episode-boundary error decays geometrically to an $O(\alpha)$ neighborhood:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{C}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\,(1-c^{H}_{\mathrm{C}}H\alpha)^{m}+C^{H}_{\mathrm{C}}\alpha. \tag{36}$$

For the diminishing step size regimes below, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$, where $g$ is the step-size offset.

Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/c^{H}_{\mathrm{C}}$, and $g\geq g^{H}_{\mathrm{C}}(\alpha)$, then the leading episode-boundary residual term decays as $O(1/m)$:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{C}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{c^{H}_{\mathrm{C}}\alpha}+\frac{C^{H}_{\mathrm{C}}\alpha^{2}}{c^{H}_{\mathrm{C}}\alpha-1}\cdot\frac{1}{\tau_{m}}. \tag{37}$$

Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+g)^{z}$, $z\in(0,1)$, and $g\geq g^{H}_{\mathrm{C}}(\alpha,z)$, then the leading episode-boundary residual term decays as $O(1/m^{z})$:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{C}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{H}_{\mathrm{C}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right)+\frac{C^{H}_{\mathrm{C}}\alpha}{\tau_{m}^{z}}. \tag{38}$$

###### Proof sketch\.

Weighting horizon $h$ by $\lambda^{h}$ turns the horizon-to-horizon bootstrap into a contraction. Because the averaged drift depends on the phase within the episode, the proof contracts phasewise averaged maps and aggregates them over the course of an episode. The difference between the averaged trajectory and the online trajectory is controlled by a mean-zero frozen-iterate term summed with a within-episode movement term. Details are given in Appendix [D](https://arxiv.org/html/2605.06866#A4). ∎

###### Corollary 7\.

To guarantee $\mathbb{E}[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})]\leq\varepsilon$ with the undiscounted fixed-horizon CTD recursion, it suffices to take $m=O(\varepsilon^{-2})$ episodes. Equivalently, since each episode has length $H$, one may take $k=mH=O(\varepsilon^{-2})$ transitions.

### 4.2 Undiscounted fixed-horizon MTD

For MTD, each state-horizon pair $(h,s)$ carries a possibly horizon-dependent $d$-point support $\Theta_{h}(s)=\{\theta_{h,1}(s),\dots,\theta_{h,d}(s)\}\subset\mathbb{R}^{q}$. The statewise MMD metric $\ell_{H,\mathrm{M},\infty}$ with a characteristic kernel $\kappa$ induced by a shift-invariant, $c$-homogeneous semimetric of strong negative type, the signed-categorical projection $\Pi^{\Theta_{h}(s)}_{H,\mathrm{M}}$, and the Gram-matrix embedding $I_{H,\mathrm{M},h,s}$ are defined exactly as in Section [3.2](https://arxiv.org/html/2605.06866#S3.SS2), now applied separately at each horizon and stacked over $h=1,\dots,H$.

Given a sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, the sampled Bellman target is computed at every horizon $h=1,\dots,H$ by

$$\widehat{T}_{H,\mathrm{M}}^{h}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{H,\mathrm{M},h,S_{k}}\Bigl(\Pi_{H,\mathrm{M}}^{\Theta_{h}(S_{k})}\bigl((f_{R_{k},1})_{\#}I_{H,\mathrm{M},h-1,S_{k+1}}^{-1}(U_{k}^{h-1}(S_{k+1}))\bigr)\Bigr), \tag{39}$$

again with the convention that $U_{k}^{0}(s)\equiv 0$. Stacking these horizonwise targets gives

$$\widehat{T}_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1})):=\bigl(\widehat{T}_{H,\mathrm{M}}^{1}(U_{k};S_{k},(R_{k},S_{k+1})),\dots,\widehat{T}_{H,\mathrm{M}}^{H}(U_{k};S_{k},(R_{k},S_{k+1}))\bigr). \tag{40}$$

The next theorem follows the same episode-boundary viewpoint as CTD. The weighted contraction mechanism is unchanged, while the perturbation term is now affine in the current boundary iterate.

###### Theorem 8 (Undiscounted fixed-horizon episodic MTD).

Assume the episodic sampling model described above and $\rho_{\min}>0$. Let $\eta_{H,\mathrm{M}}^{\star}$ denote the unique fixed point of $\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}$, and let $\eta_{k}:=I_{H,\mathrm{M}}^{-1}(U_{k})$ be generated by ([29](https://arxiv.org/html/2605.06866#S4.E29)) with $\widehat{T}_{H}=\widehat{T}_{H,\mathrm{M}}$. There exist explicit constants $C^{H}_{\mathrm{M}},c^{H}_{\mathrm{M}},\bar{\alpha}^{H}_{\mathrm{M}}>0$ and explicit thresholds $g^{H}_{\mathrm{M}}(\alpha)$ and $g^{H}_{\mathrm{M}}(\alpha,z)$ for which the following episode-boundary results hold.

Constant step size. If $\alpha_{k}\equiv\alpha\leq\bar{\alpha}^{H}_{\mathrm{M}}$, then the episode-boundary error decays geometrically to an $O(\alpha)$ neighborhood:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{M}}\,\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\,(1-c^{H}_{\mathrm{M}}H\alpha)^{m}+C^{H}_{\mathrm{M}}\alpha. \tag{41}$$

For the diminishing step size regimes below, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$, where $g$ is the step-size offset.

Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/c^{H}_{\mathrm{M}}$, and $g\geq g^{H}_{\mathrm{M}}(\alpha)$, then the leading episode-boundary residual term decays as $O(1/m)$:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{M}}\,\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{c^{H}_{\mathrm{M}}\alpha}+\frac{C^{H}_{\mathrm{M}}\alpha^{2}}{c^{H}_{\mathrm{M}}\alpha-1}\cdot\frac{1}{\tau_{m}}. \tag{42}$$

Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+g)^{z}$, $z\in(0,1)$, and $g\geq g^{H}_{\mathrm{M}}(\alpha,z)$, then the leading episode-boundary residual term decays as $O(1/m^{z})$:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{M}}\,\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{H}_{\mathrm{M}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right)+\frac{C^{H}_{\mathrm{M}}\alpha}{\tau_{m}^{z}}. \tag{43}$$

###### Proof sketch\.

This follows the CTD episode-boundary argument after replacing the Cramér embedding by the Gram-matrix embedding. The phasewise weighted contraction mechanism remains, but the perturbation bound is affine in the current boundary iterate. Details are given in Appendix [E](https://arxiv.org/html/2605.06866#A5). ∎

###### Corollary 9\.

To guarantee $\mathbb{E}[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})]\leq\varepsilon$ with the undiscounted fixed-horizon MTD recursion, it suffices to take $m=O(\varepsilon^{-2})$ episodes. Equivalently, since each episode has length $H$, one may take $k=mH=O(\varepsilon^{-2})$ transitions.

## 5 Representation error

Both the discounted and undiscounted fixed-horizon theorems control the distance to the fixed point of a projected Bellman operator. A common deterministic decomposition turns these projected-fixed-point guarantees into total-error guarantees relative to the exact return-distribution fixed point. Let $\ell$ denote either $\ell_{\mathrm{C},\infty}$ or $\ell_{\mathrm{M},\infty}$ in the discounted setting, or either $\ell_{H,\mathrm{C},\infty}$ or $\ell_{H,\mathrm{M},\infty}$ in the fixed-horizon setting. Let $T$ denote the corresponding Bellman operator with contraction modulus $\beta$ and let $\Pi$ denote the corresponding projection. Furthermore, let $\eta^{\pi}$ be the fixed point of $T$, let $\eta^{\star}$ be the fixed point of $\Pi T$, and set $\varepsilon^{\mathrm{repr}}:=\ell(\Pi\eta^{\pi},\eta^{\pi})$. Then

$$\ell(\eta^{\star},\eta^{\pi})\leq\frac{\varepsilon^{\mathrm{repr}}}{1-\beta},\qquad\mathbb{E}\bigl[\ell(\eta_{k},\eta^{\pi})^{2}\bigr]\leq 2\,\mathbb{E}\bigl[\ell(\eta_{k},\eta^{\star})^{2}\bigr]+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}. \tag{44}$$

The decomposition in ([44](https://arxiv.org/html/2605.06866#S5.E44)) separates two conceptually different sources of error. The first is the algorithmic term, controlled by the finite-iteration bounds of the preceding theorems. The second is the deterministic projection term $\varepsilon^{\mathrm{repr}}/(1-\beta)$, which depends only on how well the chosen categorical support can approximate the exact return-distribution fixed point. Thus, once the support family is fixed, the SA analysis and the discretization bias can be analyzed independently. The full statement and proof are deferred to Appendix [F](https://arxiv.org/html/2605.06866#A6).
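As a small worked example of the arithmetic in (44), the sketch below combines an algorithmic error with a representation error. The numerical inputs and the function names are hypothetical; in practice the algorithmic term would come from one of the preceding theorems and $\varepsilon^{\mathrm{repr}}$ from the chosen support.

```python
def projected_fixed_point_bias(eps_repr, beta):
    """Bound eps_repr / (1 - beta) on the distance between the projected and exact fixed points."""
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must lie in [0, 1)")
    return eps_repr / (1.0 - beta)

def total_error_bound(alg_err_sq, eps_repr, beta):
    """Right-hand side of the second inequality in (44)."""
    return 2.0 * alg_err_sq + 2.0 * projected_fixed_point_bias(eps_repr, beta) ** 2

# Hypothetical inputs: squared algorithmic error 0.01, representation error 0.05,
# contraction modulus 0.9.
bias = projected_fixed_point_bias(eps_repr=0.05, beta=0.9)
total = total_error_bound(alg_err_sq=0.01, eps_repr=0.05, beta=0.9)
print(bias, total)
```

With these inputs the projection term dominates the total, which illustrates that a contraction modulus close to one amplifies the representation error by the factor $1/(1-\beta)$ regardless of how many iterations are run.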

## 6 Discussion of the results

Theorems [1](https://arxiv.org/html/2605.06866#Thmtheorem1) and [3](https://arxiv.org/html/2605.06866#Thmtheorem3) show that the standard categorical recursions are structurally simpler than they may initially appear. After suitable statewise isometric embeddings, both CTD and MTD become asynchronous SA schemes that update one state block at a time and contract in a block-supremum norm. This common structure is what makes a unified finite-iteration analysis possible. The real point of departure is the perturbation geometry: CTD lives in a bounded-noise regime because bounded categorical supports and the Cramér geometry give uniform samplewise control, whereas MTD produces affine perturbations in $\lVert U\rVert_{2,\infty}$ because the signed-categorical MMD projection is affine rather than uniformly bounded.

In the undiscounted fixed-horizon half, the learned object is a horizon-indexed stack, and one sampled transition updates the whole stack block at the visited state because every horizon shares the same first reward and differs only in the bootstrapped tail. There is also no discount-driven contraction property to exploit. Instead, contraction is recovered by weighting horizon layers, and the theory is stated at episode boundaries because the averaged drift changes with the within-episode phase. Across both discounted and undiscounted fixed-horizon settings, the step size results are consistent: constant steps give a controllable $O(\alpha)$ neighborhood (with logarithmic mixing terms entering in the Markovian case), linearly-diminishing step sizes give the sharpest asymptotic decay once the threshold condition is met, and polynomially-diminishing step sizes trade rate for milder admissibility conditions. Further discussion is deferred to Appendix [G](https://arxiv.org/html/2605.06866#A7).

## 7 Conclusion

We established finite-iteration guarantees for asynchronous categorical distributional temporal-difference learning in the scalar Cramér and multivariate signed-categorical MMD settings. The present theory is limited to tabular, finite-state policy evaluation: it does not treat control or policy improvement, and its guarantees are expectation bounds rather than high-probability bounds. Within that scope, the discounted results cover i.i.d. sampled states and Markovian trajectories, while the undiscounted results treat the fixed-horizon episodic regime under the exact online recursions.

## Acknowledgments

The authors are grateful to Prof\. Zaiwei Chen for helpful feedback during the preparation of this manuscript\.

## References

- \[1\]\(2017\)Minimax regret bounds for reinforcement learning\.InInternational conference on machine learning,pp\. 263–272\.Cited by:[§4](https://arxiv.org/html/2605.06866#S4.p1.8)\.
- \[2\]H\. H\. Bauschke and P\. L\. Combettes\(2017\)Convex analysis and monotone operator theory in hilbert spaces\.2nd edition,Springer Publishing Company, Incorporated\.External Links:ISBN 3319483102Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p3.8),[Proposition 13](https://arxiv.org/html/2605.06866#Thmtheorem13)\.
- \[3\]A\. Beck\(2017\)First\-order methods in optimization\.SIAM\.Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p3.8),[Proposition 12](https://arxiv.org/html/2605.06866#Thmtheorem12.p1.4.3)\.
- \[4\]M\. G\. Bellemare, W\. Dabney, and R\. Munos\(2017\)A distributional perspective on reinforcement learning\.InInternational conference on machine learning,pp\. 449–458\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[5\]M\. G\. Bellemare, W\. Dabney, and M\. Rowland\(2023\)Distributional reinforcement learning\.MIT Press\.Cited by:[§B\.1](https://arxiv.org/html/2605.06866#A2.SS1.2.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p1.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.06866#S3.SS1.p1.8),[§4\.1](https://arxiv.org/html/2605.06866#S4.SS1.p1.6)\.
- \[6\]D\. P\. Bertsekas and J\. N\. Tsitsiklis\(1996\)Neuro\-dynamic programming\.Athena Scientific\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[7\]J\. Bhandari, D\. Russo, and R\. Singal\(2018\)A finite time analysis of temporal difference learning with linear function approximation\.InConference on learning theory,pp\. 1691–1692\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[8\]M\. Böck and C\. Heitzinger\(2022\)Speedy categorical distributional reinforcement learning and complexity analysis\.SIAM Journal on Mathematics of Data Science4\(2\),pp\. 675–693\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[9\]V\. S\. BorkarStochastic approximation: a dynamical systems viewpoint\.Vol\.100,Springer\.Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3](https://arxiv.org/html/2605.06866#S3.p3.8)\.
- \[10\]V\. S\. Borkar\(1998\)Asynchronous stochastic approximations\.SIAM Journal on Control and Optimization36\(3\),pp\. 840–851\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[11\]Z\. Chen, S\. T\. Maguluri, S\. Shakkottai, and K\. Shanmugam\(2024\)A lyapunov theory for finite\-sample guarantees of markovian stochastic approximation\.Operations Research72\(4\),pp\. 1352–1367\.Cited by:[§A\.1](https://arxiv.org/html/2605.06866#A1.SS1.4.p1.1),[§A\.3](https://arxiv.org/html/2605.06866#A1.SS3.1.p1.5),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[12\]Z\. Chen, S\. T\. Maguluri, S\. Shakkottai, and K\. Shanmugam\(2020\)Finite\-sample analysis of contractive stochastic approximation using smooth convex envelopes\.Advances in Neural Information Processing Systems33,pp\. 8223–8234\.Cited by:[§A\.1](https://arxiv.org/html/2605.06866#A1.SS1.4.p1.1),[§A\.2](https://arxiv.org/html/2605.06866#A1.SS2.1.p1.12),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[13\]Z\. Chen, S\. Zhang, T\. T\. Doan, J\. Clarke, and S\. T\. Maguluri\(2022\)Finite\-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning\.Automatica146,pp\. 110623\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[14\]W\. Dabney, G\. Ostrovski, D\. Silver, and R\. Munos\(2018\)Implicit quantile networks for distributional reinforcement learning\.InInternational conference on machine learning,pp\. 1096–1105\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[15\]W\. Dabney, M\. Rowland, M\. Bellemare, and R\. Munos\(2018\)Distributional reinforcement learning with quantile regression\.InProceedings of the AAAI conference on artificial intelligence,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[16\]G\. Dalal, B\. Szörényi, G\. Thoppe, and S\. Mannor\(2018\)Finite sample analyses for td \(0\) with function approximation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[17\]P\. Dayan and T\. J\. Sejnowski\(1994\)TD\($\lambda$\) converges with probability 1\.Machine Learning14\(3\),pp\. 295–301\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[18\]K\. De Asis, A\. Chan, S\. Pitis, R\. Sutton, and D\. Graves\(2020\)Fixed\-horizon temporal difference methods for stable reinforcement learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.34,pp\. 3741–3748\.Cited by:[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§4](https://arxiv.org/html/2605.06866#S4.p1.8),[§4](https://arxiv.org/html/2605.06866#S4.p2.3),[§4](https://arxiv.org/html/2605.06866#S4.p6.4)\.
- \[19\]A\. Gretton, K\. M\. Borgwardt, M\. J\. Rasch, B\. Schölkopf, and A\. Smola\(2012\)A kernel two\-sample test\.The journal of machine learning research13\(1\),pp\. 723–773\.Cited by:[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9)\.
- \[20\]L\. Gurvits, L\. Lin, and S\. J\. Hanson\(1994\)Incremental learning of evaluation functions for absorbing markov chains: new methods and theorems\.preprint\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[21\]R\. A\. Howard and J\. E\. Matheson\(1972\)Risk\-sensitive markov decision processes\.Management science18\(7\),pp\. 356–369\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[22\]T\. Jaakkola, M\. Jordan, and S\. Singh\(1993\)Convergence of stochastic iterative dynamic programming algorithms\.Advances in neural information processing systems6\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[23\]S\. C\. Jaquette\(1973\)Markov decision processes with a new optimality criterion: discrete time\.The Annals of Statistics1\(3\),pp\. 496–505\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[24\]S\. C\. Jaquette\(1976\)A utility criterion for markov decision processes\.Management Science23\(1\),pp\. 43–49\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[25\]T\. Kastner, M\. Rowland, Y\. Tang, M\. A\. Erdogdu, and A\. Farahmand\(2025\)Categorical distributional reinforcement learning with kullback\-leibler divergence: convergence and asymptotics\.InForty\-second International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[26\]H\. J\. Kushner and G\. G\. Yin\(2003\)Stochastic approximation and recursive algorithms and applications\.Springer\.Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3](https://arxiv.org/html/2605.06866#S3.p3.8)\.
- \[27\]D\. A\. Levin and Y\. Peres\(2017\)Markov chains and mixing times\.Vol\.107,American Mathematical Soc\.\.Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p1.15)\.
- \[28\]M\. L\. Littman and C\. Szepesvári\(1996\)A generalized reinforcement\-learning model: convergence and applications\.InICML,Vol\.96,pp\. 310–318\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[29\]L\. Ljung\(1977\)Analysis of recursive stochastic algorithms\.IEEE Transactions on Automatic Control22\(4\),pp\. 551–575\.External Links:[Document](https://dx.doi.org/10.1109/TAC.1977.1101561)Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[30\]C\. Lyle, M\. G\. Bellemare, and P\. S\. Castro\(2019\)A comparative analysis of expected and distributional reinforcement learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.33,pp\. 4504–4511\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[31\]J\. Moreau\(1965\)Proximité et dualité dans un espace hilbertien\.Bulletin de la Société mathématique de France93,pp\. 273–299\.Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p3.8),[Proposition 13](https://arxiv.org/html/2605.06866#Thmtheorem13)\.
- \[32\]T\. Morimura, M\. Sugiyama, H\. Kashima, H\. Hachiya, and T\. Tanaka\(2010\)Nonparametric return distribution approximation for reinforcement learning\.InProceedings of the 27th International Conference on Machine Learning \(ICML\-10\),pp\. 799–806\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[33\]K\. Muandet, K\. Fukumizu, B\. Sriperumbudur, and B\. Schölkopf\(2017\-06\)Kernel mean embedding of distributions: a review and beyond\.Found\. Trends Mach\. Learn\.10\(1–2\),pp\. 1–141\.External Links:ISSN 1935\-8237,[Link](https://doi.org/10.1561/2200000060),[Document](https://dx.doi.org/10.1561/2200000060)Cited by:[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9)\.
- \[34\]G\. Patil, L\. Prashanth, D\. Nagaraj, and D\. Precup\(2023\)Finite time analysis of temporal difference learning with linear function approximation: tail averaging and regularisation\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 5438–5448\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[35\]Y\. Peng, K\. Jin, L\. Zhang, and Z\. Zhang\(2025\)A finite sample analysis of distributional td learning with linear function approximation\.arXiv preprint arXiv:2502\.14172\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[36\]Y\. Peng, L\. Zhang, and Z\. Zhang\(2024\)Statistical efficiency of distributional temporal difference learning\.Advances in Neural Information Processing Systems37,pp\. 24724–24761\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[37\]G\. Qu and A\. Wierman\(2020\)Finite\-time analysis of asynchronous stochastic approximation and $Q$\-learning\.InConference on learning theory,pp\. 3185–3205\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[38\]H\. Robbins and S\. Monro\(1951\)A stochastic approximation method\.The annals of mathematical statistics,pp\. 400–407\.Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[39\]M\. Rowland, M\. Bellemare, W\. Dabney, R\. Munos, and Y\. W\. Teh\(2018\)An analysis of categorical distributional reinforcement learning\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 29–37\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.06866#S3.SS1.p1.8),[§4\.1](https://arxiv.org/html/2605.06866#S4.SS1.p1.6)\.
- \[40\]M\. Rowland, R\. Dadashi, S\. Kumar, R\. Munos, M\. G\. Bellemare, and W\. Dabney\(2019\)Statistics and samples in distributional reinforcement learning\.InInternational Conference on Machine Learning,pp\. 5528–5536\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[41\]M\. Rowland, L\. K\. Wenliang, R\. Munos, C\. Lyle, Y\. Tang, and W\. Dabney\(2024\)Near\-minimax\-optimal distributional reinforcement learning with a generative model\.Advances in Neural Information Processing Systems37,pp\. 132774–132823\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[42\]A\. Smola, A\. Gretton, L\. Song, and B\. Schölkopf\(2007\)A hilbert space embedding for distributions\.InInternational conference on algorithmic learning theory,pp\. 13–31\.Cited by:[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9)\.
- \[43\]M\. J\. Sobel\(1982\)The variance of discounted markov decision processes\.Journal of Applied Probability19\(4\),pp\. 794–802\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[44\]R\. Srikant and L\. Ying\(2019\)Finite\-time error bounds for linear stochastic approximation and td learning\.InConference on learning theory,pp\. 2803–2830\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[45\]R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§3](https://arxiv.org/html/2605.06866#S3.p1.15),[§4](https://arxiv.org/html/2605.06866#S4.p1.8)\.
- \[46\]R\. S\. Sutton\(1988\)Learning to predict by the methods of temporal differences\.Machine learning3\(1\),pp\. 9–44\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[47\]J\. N\. Tsitsiklis\(1994\)Asynchronous stochastic approximation and q\-learning\.Machine learning16\(3\),pp\. 185–202\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[48\]J\. Tsitsiklis and B\. Van Roy\(1996\)Analysis of temporal\-difference learning with function approximation\.Advances in neural information processing systems9\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[49\]H\. Wiltzer, J\. Farebrother, A\. Gretton, and M\. Rowland\(2024\)Foundations of multivariate distributional reinforcement learning\.Advances in Neural Information Processing Systems37,pp\. 101297–101336\.Cited by:[§C\.1](https://arxiv.org/html/2605.06866#A3.SS1.2.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p1.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9),[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p2.5),[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p2.8)\.
- \[50\]R\. Wu, M\. Uehara, and W\. Sun\(2023\)Distributional offline policy evaluation with predictive error guarantees\.InInternational Conference on Machine Learning,pp\. 37685–37712\.Cited by:[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[51\]D\. Yang, L\. Zhao, Z\. Lin, T\. Qin, J\. Bian, and T\. Liu\(2019\)Fully parameterized quantile function for distributional reinforcement learning\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[52\]L\. Zhang, Y\. Peng, J\. Liang, W\. Yang, and Z\. Zhang\(2023\)Estimation and inference in distributional reinforcement learning\.arXiv preprint arXiv:2309\.17262\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.


## Appendix A Common finite-iteration ingredients in the block-supremum geometry

We collect the abstract ingredients \[[38](https://arxiv.org/html/2605.06866#bib.bib26),[29](https://arxiv.org/html/2605.06866#bib.bib27),[9](https://arxiv.org/html/2605.06866#bib.bib30),[26](https://arxiv.org/html/2605.06866#bib.bib31)\] used in the proofs of Theorems [1](https://arxiv.org/html/2605.06866#Thmtheorem1) and [3](https://arxiv.org/html/2605.06866#Thmtheorem3). Throughout this appendix, let

$$V := \prod_{s\in\mathcal{S}} E(s) \tag{45}$$

be a finite product of Euclidean state blocks. For $U \in V$ and $p \in [2,\infty)$, define

$$\lVert U\rVert_{2,\infty} := \max_{s\in\mathcal{S}} \lVert U(s)\rVert_{2}, \qquad \lVert U\rVert_{2,p} := \left(\sum_{s\in\mathcal{S}} \lVert U(s)\rVert_{2}^{p}\right)^{1/p}. \tag{46}$$

Because $V$ is finite-dimensional and each block $E(s)$ is Euclidean, all gradients, smoothness statements, and Moreau-envelope constructions below are taken with respect to the product Euclidean inner product on $V$. In particular, $\lVert\cdot\rVert_{2,p}$ is a genuine norm on a finite-dimensional Euclidean space, so the standard smoothness results for squared $\ell_{p}$-type norms apply directly.

For each state $s\in\mathcal{S}$, let $P_{s}: V \to V$ denote the coordinate projector onto block $s$.

### A.1 Norm comparison and smoothing

###### Lemma 10 (Block-norm comparison).

For every $U \in V$ and every $p \in [2,\infty)$,

$$\lVert U\rVert_{2,\infty} \leq \lVert U\rVert_{2,p} \leq \lvert\mathcal{S}\rvert^{1/p} \lVert U\rVert_{2,\infty}. \tag{47}$$

###### Proof.

The left inequality is immediate because the maximum of finitely many nonnegative numbers is bounded above by their $\ell_{p}$ norm. For the right inequality,

$$\lVert U\rVert_{2,p}^{p} = \sum_{s\in\mathcal{S}} \lVert U(s)\rVert_{2}^{p} \leq \sum_{s\in\mathcal{S}} \lVert U\rVert_{2,\infty}^{p} = \lvert\mathcal{S}\rvert \lVert U\rVert_{2,\infty}^{p}. \tag{48}$$

Taking $p$th roots proves the claim. ∎
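As a quick numerical sanity check of (47), the following NumPy sketch evaluates both block norms on a random element of $V$. The helper name `block_norms`, the list-of-arrays representation of $U$, and the chosen sizes are illustrative assumptions rather than anything prescribed by the paper.

```python
import numpy as np

def block_norms(U, p):
    """Return (||U||_{2,inf}, ||U||_{2,p}) for U given as a list of state blocks."""
    per_block = np.array([np.linalg.norm(u) for u in U])  # ||U(s)||_2 per state
    sup_norm = float(per_block.max())
    p_norm = float(np.sum(per_block ** p) ** (1.0 / p))
    return sup_norm, p_norm

# Hypothetical example: 5 states whose blocks have different dimensions.
rng = np.random.default_rng(0)
num_states, p = 5, 3
U = [rng.normal(size=int(rng.integers(1, 4))) for _ in range(num_states)]

sup_norm, p_norm = block_norms(U, p)
lower_ok = sup_norm <= p_norm + 1e-12                              # left side of (47)
upper_ok = p_norm <= num_states ** (1.0 / p) * sup_norm + 1e-12    # right side of (47)
print(lower_ok, upper_ok)
```

Both flags should be true for any input, since the lemma holds for every $U$ and every $p \geq 2$.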

###### Proposition 11 (Choice of the smoothing exponent).

Let

$$p^{\star} := \max\{2, \lceil \log\lvert\mathcal{S}\rvert \rceil\}. \tag{49}$$

Then

$$\lvert\mathcal{S}\rvert^{2/p^{\star}} \leq e^{2}. \tag{50}$$

Consequently,

$$(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}} \leq e^{2}\log\lvert\mathcal{S}\rvert \qquad \text{for } \lvert\mathcal{S}\rvert \geq 2. \tag{51}$$

###### Proof.

If $\lvert\mathcal{S}\rvert \leq e^{2}$, then $p^{\star}=2$ and the claim is immediate. If $\lvert\mathcal{S}\rvert > e^{2}$, then $p^{\star} \geq \log\lvert\mathcal{S}\rvert$, hence

$$\lvert\mathcal{S}\rvert^{2/p^{\star}} = \exp\left(\frac{2\log\lvert\mathcal{S}\rvert}{p^{\star}}\right) \leq e^{2}. \tag{52}$$

The second claim follows by multiplying by $(p^{\star}-1) \leq \log\lvert\mathcal{S}\rvert$, which holds whenever $\lvert\mathcal{S}\rvert \geq 3$. For $\lvert\mathcal{S}\rvert = 2$ we have $p^{\star}=2$, and the inequality reads $2 \leq e^{2}\log 2$, which is checked directly. ∎
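The choice (49) is easy to evaluate in practice. The sketch below computes $p^{\star}$ and checks (50) and (51) over a range of state-space sizes. The function names are hypothetical, and the natural logarithm is assumed, consistent with the constant $e^{2}$ in the statement.

```python
import math

def smoothing_exponent(num_states):
    """p* = max{2, ceil(log |S|)} with the natural logarithm, as in (49)."""
    return max(2, math.ceil(math.log(num_states)))

def check_bounds(num_states):
    """Return whether (50) and (51) hold for this state-space size (>= 2)."""
    p_star = smoothing_exponent(num_states)
    factor = num_states ** (2.0 / p_star)           # |S|^{2/p*}
    bound_50 = factor <= math.e ** 2
    bound_51 = (p_star - 1) * factor <= math.e ** 2 * math.log(num_states)
    return bound_50, bound_51

all_ok = all(all(check_bounds(n)) for n in range(2, 5001))
print(smoothing_exponent(1000), all_ok)
```

The point of the construction is that the dimension-dependent factor $(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}$ grows only logarithmically in $\lvert\mathcal{S}\rvert$.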

###### Proposition 12 (Smooth block potential).

For $p \in [2,\infty)$, define

$$g_{p}(U) := \frac{1}{2}\lVert U\rVert_{2,p}^{2}. \tag{53}$$

Then $g_{p}$ is convex and $(p-1)$-smooth \[[3](https://arxiv.org/html/2605.06866#bib.bib32)\] with respect to $\lVert\cdot\rVert_{2,p}$.

###### Proof.

This is the standard smoothness of the squared $\ell_{p}$ norm for $p \geq 2$, applied to the mixed norm $\lVert\cdot\rVert_{2,p}$ on the finite-dimensional product space $V$. Equivalently, after choosing orthonormal bases in each block $E(s)$, the space $V$ identifies with $\mathbb{R}^{N}$ for some finite $N$, and $\lVert\cdot\rVert_{2,p}$ becomes a norm of $\ell_{p}$ type on grouped coordinates. The standard $\ell_{p}$ smoothness estimate therefore gives

$$g_{p}(U') \leq g_{p}(U) + \langle \nabla g_{p}(U), U'-U \rangle + \frac{p-1}{2}\lVert U'-U\rVert_{2,p}^{2} \tag{54}$$

for all $U, U' \in V$. ∎

###### Proposition 13 (Generalized Moreau envelope \[[31](https://arxiv.org/html/2605.06866#bib.bib33),[2](https://arxiv.org/html/2605.06866#bib.bib34)\] for block-supremum norms).

Fix $U^{\star} \in V$, $\vartheta > 0$, and $p \in [2,\infty)$. Define

$$M_{\vartheta,p}(U) := \inf_{W\in V}\Bigl\{\frac{1}{2}\lVert W-U^{\star}\rVert_{2,\infty}^{2} + \frac{1}{2\vartheta}\lVert U-W\rVert_{2,p}^{2}\Bigr\}. \tag{55}$$

Then $M_{\vartheta,p}$ is convex and $(p-1)/\vartheta$-smooth with respect to $\lVert\cdot\rVert_{2,p}$. Moreover, for every $U \in V$,

$$(1+\vartheta)\,M_{\vartheta,p}(U) \leq \frac{1}{2}\lVert U-U^{\star}\rVert_{2,\infty}^{2} \leq \bigl(1+\vartheta\lvert\mathcal{S}\rvert^{2/p}\bigr)\,M_{\vartheta,p}(U). \tag{56}$$

###### Proof.

This is the generalized Moreau-envelope approximation theorem used by Chen et al. \[[12](https://arxiv.org/html/2605.06866#bib.bib3),[11](https://arxiv.org/html/2605.06866#bib.bib1)\], specialized to

$$h_{1}(U) := \frac{1}{2}\lVert U-U^{\star}\rVert_{2,\infty}^{2}, \qquad h_{2}(U) := \frac{1}{2}\lVert U\rVert_{2,p}^{2}. \tag{57}$$

The smoothness claim follows from Proposition [12](https://arxiv.org/html/2605.06866#Thmtheorem12). The two-sided approximation follows from the standard envelope inequalities together with Lemma [10](https://arxiv.org/html/2605.06866#Thmtheorem10). ∎

### A.2 Abstract i.i.d. finite-iteration framework

Assume first that $(S_{k})_{k\geq 0}$ are i.i.d. with law $\rho$ on $\mathcal{S}$ and let

$$\rho_{\min} := \min_{s\in\mathcal{S}} \rho(s) > 0. \tag{58}$$

Let $\mathcal{O}: V \to V$ be a contraction in $\lVert\cdot\rVert_{2,\infty}$ with modulus $\beta \in (0,1)$ and fixed point $U^{\star}$:

$$\lVert \mathcal{O}U - \mathcal{O}U' \rVert_{2,\infty} \leq \beta \lVert U-U'\rVert_{2,\infty} \quad \text{for all } U, U' \in V. \tag{59}$$

Let $\widehat{T}(U;s,\xi)$ be a sampled target satisfying

$$\mathbb{E}\left[\widehat{T}(U_{k};S_{k},\xi_{k}) \mid U_{k}, S_{k}=s\right] = (\mathcal{O}U_{k})(s). \tag{60}$$

Consider the asynchronous recursion

$$U_{k+1} = U_{k} + \alpha_{k} P_{S_{k}}\left(\widehat{T}(U_{k};S_{k},\xi_{k}) - U_{k}(S_{k})\right). \tag{61}$$

For the centered recursion, write

$$W_{k} := U_{k} - U^{\star}. \tag{62}$$

Assume there exists a finite constant $A^{\mathrm{iid}} \geq 0$ such that

$$\mathbb{E}\left[\left\lVert \widehat{T}(U_{k};S_{k},\xi_{k}) - (\mathcal{O}U_{k})(S_{k}) \right\rVert_{2}^{2} \mid U_{k}, S_{k}\right] \leq A^{\mathrm{iid}}\bigl(1 + \lVert W_{k}\rVert_{2,\infty}^{2}\bigr) \qquad \text{a.s.} \tag{63}$$

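To make the recursion (61) concrete, here is a minimal sketch on a toy problem. Everything specific in it is an assumption made for illustration: blocks of equal dimension stored as rows of an array, the linear operator $(\mathcal{O}U)(s) = \beta\sum_{s'} P(s,s')U(s') + c(s)$ (a $\beta$-contraction in the block-supremum norm because each row of $P$ is a probability vector), uniform state sampling, and a step-size schedule that is not tuned to any condition from the theory.

```python
import numpy as np

def async_update(U, state, sampled_target, step_size):
    """One step of (61): move only block `state` toward the sampled target."""
    U_next = U.copy()
    U_next[state] = U[state] + step_size * (sampled_target - U[state])
    return U_next

# Hypothetical toy problem: n states, blocks of dimension d stored as rows.
rng = np.random.default_rng(0)
n, d, beta = 4, 2, 0.5
P = np.full((n, n), 1.0 / n)                        # row-stochastic matrix
c = rng.normal(size=(n, d))
U_star = np.linalg.solve(np.eye(n) - beta * P, c)   # fixed point of O

U = np.zeros((n, d))
initial_error = np.linalg.norm(U - U_star, axis=1).max()
for k in range(5000):
    s = rng.integers(n)                  # i.i.d. state sample, rho uniform
    s_next = rng.choice(n, p=P[s])       # plays the role of the noise xi_k
    target = beta * U[s_next] + c[s]     # unbiased for (O U)(s), as in (60)
    U = async_update(U, s, target, step_size=10.0 / (k + 100))
final_error = np.linalg.norm(U - U_star, axis=1).max()
print(initial_error, final_error)
```

Only the sampled block changes at each iteration, which is the asynchronous single-state structure that the projector $P_{S_{k}}$ encodes.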
###### Proposition 15 (Abstract i.i.d. finite-iteration bound).

Fix any $\vartheta > 0$ and define

$$\beta_{\rho} := 1-\rho_{\min}(1-\beta), \qquad a_{1} := \frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}, \qquad a_{2} := 1-\beta_{\rho}\sqrt{\frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}}, \tag{64}$$

and

$$a_{3} := \frac{4(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}(A^{\mathrm{iid}}+2)(1+\vartheta)}{\vartheta}, \qquad a_{4} := \frac{2(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}A^{\mathrm{iid}}(1+\vartheta)}{\vartheta}. \tag{65}$$

Since $a_{2} \to 1-\beta_{\rho} > 0$ as $\vartheta \downarrow 0$, the condition $a_{2} > 0$ holds for all sufficiently small $\vartheta$. If $a_{2} > 0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and $\alpha_{0} \leq a_{2}/a_{3}$, then for all $k \geq 0$,

$$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}\prod_{j=0}^{k-1}(1-a_{2}\alpha_{j}) + a_{4}\sum_{i=0}^{k-1}\alpha_{i}^{2}\prod_{j=i+1}^{k-1}(1-a_{2}\alpha_{j}). \tag{66}$$

In particular:

1. if $\alpha_{k} \equiv \alpha$ and $\alpha \leq a_{2}/a_{3}$, then for all $k \geq 0$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}(1-a_{2}\alpha)^{k} + \frac{a_{4}}{a_{2}}\alpha, \tag{67}$$

2. if $\alpha_{k} = \alpha/(k+h)$, $\alpha > 1/a_{2}$, and

   $$h \geq \max\left\{1, \frac{\alpha a_{3}}{a_{2}}\right\}, \tag{68}$$

   then for all $k \geq 0$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}\left(\frac{h}{k+h}\right)^{a_{2}\alpha} + \frac{4e\,\alpha^{2}a_{4}}{a_{2}\alpha-1}\cdot\frac{1}{k+h}, \tag{69}$$

3. if $\alpha_{k} = \alpha/(k+h)^{z}$ with $z \in (0,1)$ and

   $$h \geq \max\left\{1, \left(\frac{\alpha a_{3}}{a_{2}}\right)^{1/z}, \left(\frac{2z}{a_{2}\alpha}\right)^{1/(1-z)}\right\}, \tag{70}$$

   then for all $k \geq 0$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}\exp\left(-\frac{a_{2}\alpha}{1-z}\bigl((k+h)^{1-z}-h^{1-z}\bigr)\right) + \frac{2\alpha a_{4}}{a_{2}}\cdot\frac{1}{(k+h)^{z}}. \tag{71}$$
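The constants in (64) and (65) and the constant-step bound (67) can be evaluated directly. The sketch below does so for made-up problem parameters; the function names and all numerical values are hypothetical and serve only to show how the pieces fit together, including the requirement $a_{2} > 0$.

```python
import math

def iid_constants(num_states, beta, rho_min, A_iid, theta):
    """Constants a1..a4 from (64)-(65), using p* from (49)."""
    p_star = max(2, math.ceil(math.log(num_states)))
    s_factor = num_states ** (2.0 / p_star)             # |S|^{2/p*}
    ratio = (1 + theta * s_factor) / (1 + theta)
    beta_rho = 1 - rho_min * (1 - beta)
    a1 = ratio
    a2 = 1 - beta_rho * math.sqrt(ratio)
    a3 = 4 * (p_star - 1) * s_factor * (A_iid + 2) * (1 + theta) / theta
    a4 = 2 * (p_star - 1) * s_factor * A_iid * (1 + theta) / theta
    return a1, a2, a3, a4

def constant_step_bound(k, alpha, w0_sq, a1, a2, a4):
    """Right-hand side of (67) for a constant step size alpha."""
    return a1 * w0_sq * (1 - a2 * alpha) ** k + (a4 / a2) * alpha

# Hypothetical numbers: 4 states sampled uniformly, beta = 0.9, A_iid = 1.
a1, a2, a3, a4 = iid_constants(num_states=4, beta=0.9, rho_min=0.25,
                               A_iid=1.0, theta=0.01)
if a2 <= 0:
    raise ValueError("theta is too large for this (beta, rho_min)")
alpha = a2 / a3                                          # largest admissible step
bounds = [constant_step_bound(k, alpha, 1.0, a1, a2, a4)
          for k in (0, 10**6, 10**7)]
print(a1, a2, a3, a4, bounds)
```

Shrinking `theta` makes `a2` larger but inflates `a3` and `a4` through the $1/\vartheta$ factor, which is the trade-off behind the free parameter $\vartheta$ in the proposition.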

###### Proof\.

Define the averaged asynchronous operator

$$H_{\rho}(U) := U + \sum_{s\in\mathcal{S}}\rho(s)P_{s}\bigl((\mathcal{O}U)(s)-U(s)\bigr). \tag{72}$$

For each state block $s$ and all $U, U' \in V$,

$$\bigl(H_{\rho}(U)-H_{\rho}(U')\bigr)(s) = (1-\rho(s))\bigl(U(s)-U'(s)\bigr) + \rho(s)\bigl((\mathcal{O}U)(s)-(\mathcal{O}U')(s)\bigr). \tag{73}$$

Since $\mathcal{O}$ is a $\beta$-contraction in $\lVert\cdot\rVert_{2,\infty}$, the triangle inequality together with $\lVert U(s)-U'(s)\rVert_{2} \leq \lVert U-U'\rVert_{2,\infty}$ gives

$$\begin{aligned}
\left\lVert\bigl(H_{\rho}(U)-H_{\rho}(U')\bigr)(s)\right\rVert_{2} &\leq \bigl((1-\rho(s))+\rho(s)\beta\bigr)\lVert U-U'\rVert_{2,\infty} \\
&= \bigl(1-\rho(s)(1-\beta)\bigr)\lVert U-U'\rVert_{2,\infty} \\
&\leq \beta_{\rho}\lVert U-U'\rVert_{2,\infty}.
\end{aligned} \tag{74}$$

Taking the maximum over $s$ on both sides gives

$$\lVert H_{\rho}(U)-H_{\rho}(U')\rVert_{2,\infty} \leq \max_{s\in\mathcal{S}}\bigl(1-\rho(s)(1-\beta)\bigr)\lVert U-U'\rVert_{2,\infty} = \beta_{\rho}\lVert U-U'\rVert_{2,\infty}, \tag{75}$$

since the coefficient $1-\rho(s)(1-\beta)$ is decreasing in $\rho(s)$. Now apply Corollary 2.1 of Chen et al. \[[12](https://arxiv.org/html/2605.06866#bib.bib3)\] to the centered recursion for $W_{k}$, with contraction norm $\lVert\cdot\rVert_{2,\infty}$, smoothing norm $\lVert\cdot\rVert_{2,p^{\star}}$, Lyapunov function $M_{\vartheta,p^{\star}}$ from Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13), and noise constant $A^{\mathrm{iid}}$. The approximation factors in Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13) and the smoothness constant in Proposition [12](https://arxiv.org/html/2605.06866#Thmtheorem12) produce exactly the displayed constants. The constant, linearly diminishing, and polynomially diminishing step-size bounds follow by specializing the displayed recursion. ∎
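The contraction estimate (75) can be probed numerically. The following sketch uses an assumed toy operator, $(\mathcal{O}U)(s) = \beta\sum_{s'}P(s,s')U(s') + c(s)$ with a random row-stochastic $P$, which is a $\beta$-contraction in the block-supremum norm. The operator, the sampling law, and all names are illustrative assumptions, not the paper's CTD operator.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta = 6, 3, 0.8

P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)      # random row-stochastic matrix
c = rng.normal(size=(n, d))
rho = rng.random(n)
rho /= rho.sum()                        # sampling law with rho_min > 0

def sup_norm(U):
    """Block-supremum norm: the largest Euclidean norm over state blocks (rows)."""
    return np.linalg.norm(U, axis=1).max()

def O(U):
    """Toy beta-contraction in the block-supremum norm (an assumption)."""
    return beta * P @ U + c

def H(U):
    """Averaged asynchronous operator (72)."""
    return U + rho[:, None] * (O(U) - U)

beta_rho = 1 - rho.min() * (1 - beta)
worst_ratio = 0.0
for _ in range(200):
    U, U_prime = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    ratio = sup_norm(H(U) - H(U_prime)) / sup_norm(U - U_prime)
    worst_ratio = max(worst_ratio, ratio)
print(worst_ratio, beta_rho)
```

The observed ratio never exceeds `beta_rho`, and it approaches 1 as `rho.min()` shrinks, which reflects how rarely visited states slow the averaged contraction.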

### A.3 Abstract Markovian finite-iteration framework

Assume now that $(S_{k})_{k\geq 0}$ is an irreducible, aperiodic Markov chain on $\mathcal{S}$ with stationary distribution $\mu_{\mathcal{S}}$. Let

$$\mu_{\min} := \min_{s\in\mathcal{S}}\mu_{\mathcal{S}}(s) > 0, \qquad \bar{\beta}_{\mu} := 1-\mu_{\min}(1-\beta). \tag{76}$$

Assume further that the chain satisfies the geometric mixing condition from Section [3](https://arxiv.org/html/2605.06866#S3) and let $t_{\delta}$ denote the corresponding mixing time.

Define the full-sample map

$$\widehat{H}(U;s,\xi) := U + P_{s}\bigl(\widehat{T}(U;s,\xi)-U(s)\bigr) \tag{77}$$

and the expected full-sample map

$$F(U,s) := U + P_{s}\bigl((\mathcal{O}U)(s)-U(s)\bigr). \tag{78}$$

Assume there exist finite constants $A_{1} \geq 0$, $A_{2} \geq 0$, and $B_{2} \geq 0$ such that

$$\lVert\widehat{H}(U;s,\xi)-\widehat{H}(U';s,\xi)\rVert_{2,\infty} \leq A_{1}\lVert U-U'\rVert_{2,\infty} \tag{79}$$

for all $U, U' \in V$, and

$$\Delta_{k} := \widehat{H}(U_{k};S_{k},\xi_{k}) - F(U_{k},S_{k}) \tag{80}$$

satisfies

$$\mathbb{E}\left[\Delta_{k} \mid \mathcal{G}_{k}\right] = 0, \qquad \lVert\Delta_{k}\rVert_{2,\infty} \leq A_{2}\lVert U_{k}\rVert_{2,\infty} + B_{2} \qquad \text{a.s.,} \tag{81}$$

where

$$\mathcal{G}_{k} := \sigma(U_{0},S_{0},\xi_{0},\dots,U_{k},S_{k}). \tag{82}$$

###### Proposition 16 (Abstract Markovian finite-iteration bound).

Fix any $\vartheta > 0$ and define

$$\phi_{1} := \frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}, \qquad \phi_{2} := 1-\bar{\beta}_{\mu}\sqrt{\frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}}, \qquad \phi_{3} := \frac{114(p^{\star}-1)\bigl(1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}\bigr)}{\vartheta}. \tag{83}$$

Since $\phi_{2} \to 1-\bar{\beta}_{\mu} > 0$ as $\vartheta \downarrow 0$, the condition $\phi_{2} > 0$ holds for all sufficiently small $\vartheta$. Set

$$A := A_{1}+A_{2}+1, \qquad c_{1} := \left(2\lVert U_{0}-U^{\star}\rVert_{2,\infty} + \frac{B_{2}}{A}\right)^{2}, \qquad c_{2} := B_{2}^{2}. \tag{84}$$

Let $t_{k} := t_{\alpha_{k}}$ and $K := \min\{k \geq 0 : k \geq t_{k}\}$. If $\phi_{2} > 0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and

$$\sum_{i=k-t_{k}}^{k-1}\alpha_{i} \leq \min\left\{\frac{\phi_{2}}{\phi_{3}A^{2}}, \frac{1}{4A}\right\} \qquad \text{for all } k \geq K, \tag{85}$$

then:

1. if $\alpha_{k} \equiv \alpha$, then for all $k \geq t_{\alpha}$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq \phi_{1}c_{1}(1-\phi_{2}\alpha)^{k-t_{\alpha}} + \frac{\phi_{3}}{\phi_{2}}c_{2}\,\alpha t_{\alpha}; \tag{86}$$

2. if $\alpha_{k} = \alpha/(k+h)$, $\alpha > 1/\phi_{2}$, and $h \geq 1$, then for all $k \geq K$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq \phi_{1}c_{1}\left(\frac{K+h}{k+h}\right)^{\phi_{2}\alpha} + \frac{8e\,\alpha^{2}\phi_{3}c_{2}}{\phi_{2}\alpha-1}\cdot\frac{t_{k}}{k+h}; \tag{87}$$

3. if $\alpha_{k} = \alpha/(k+h)^{z}$ with $z \in (0,1)$, then for all $k \geq K$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq \phi_{1}c_{1}\exp\left(-\frac{\phi_{2}\alpha}{1-z}\bigl((k+h)^{1-z}-(K+h)^{1-z}\bigr)\right) + \frac{4\phi_{3}c_{2}\alpha}{\phi_{2}}\cdot\frac{t_{k}}{(k+h)^{z}}. \tag{88}$$

###### Proof\.

This is the Markovian finite-iteration theorem of Chen et al. \[[11](https://arxiv.org/html/2605.06866#bib.bib1)\], translated into the present block-supremum notation and applied to the centered recursion $W_{k} := U_{k}-U^{\star}$. The contraction factor is $\bar{\beta}_{\mu}$. The smoothness and approximation factors come from Propositions [12](https://arxiv.org/html/2605.06866#Thmtheorem12), [13](https://arxiv.org/html/2605.06866#Thmtheorem13), and [11](https://arxiv.org/html/2605.06866#Thmtheorem11). The constants $A_{1}$, $A_{2}$, and $B_{2}$ enter exactly through the samplewise Lipschitz and pathwise affine perturbation hypotheses. ∎

## Appendix B CTD details and proof of Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1)

We verify the hypotheses of Appendix [A](https://arxiv.org/html/2605.06866#A1) for the discounted scalar categorical temporal-difference recursion and thereby prove Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1). The proof has three parts. First, we record the statewise Cramér isometry and the induced block-supremum contraction of the projected Bellman operator. Second, we verify the samplewise Lipschitz property of the sampled target map. Third, we verify the centered perturbation bounds required by the i.i.d. and Markovian finite-iteration arguments.

### B.1 Statewise Cramér isometry and contraction

For each state $s \in \mathcal{S}$, let

$$\Theta(s) = \{\theta_{1}(s) < \cdots < \theta_{d}(s)\} \subset \mathbb{R} \tag{89}$$

and let $\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta}$ denote the class of state-indexed categorical laws supported on these statewise grids. For

$$\eta(s) = \sum_{i=1}^{d} p_{i}(s)\,\delta_{\theta_{i}(s)}, \tag{90}$$

define the cumulative masses and the grid separations

$$c^{\eta}_{j}(s) := \sum_{i=1}^{j} p_{i}(s), \qquad \Delta_{j}(s) := \theta_{j+1}(s)-\theta_{j}(s), \qquad j = 1,\dots,d-1. \tag{91}$$

The statewise Cramér embedding is

$$I_{s}(\eta(s)) := \bigl(\sqrt{\Delta_{1}(s)}\,c^{\eta}_{1}(s), \dots, \sqrt{\Delta_{d-1}(s)}\,c^{\eta}_{d-1}(s), 0\bigr)^{\top} \in \mathbb{R}^{d}. \tag{92}$$

###### Proposition 17\(Statewise Cramér isometry\)\.

For every state $s\in\mathcal{S}$ and all $\mu,\nu\in\mathcal{F}_{\mathrm{C},\Theta(s)}$,

$$\ell_{\mathrm{C}}(\mu,\nu)^{2}=\sum_{j=1}^{d-1}\Delta_{j}(s)\bigl(c^{\mu}_{j}(s)-c^{\nu}_{j}(s)\bigr)^{2}=\lVert I_{s}(\mu)-I_{s}(\nu)\rVert_{2}^{2}. \tag{93}$$

Consequently, with the product embedding

$$I_{\mathrm{C}}(\eta):=(I_{s}(\eta(s)))_{s\in\mathcal{S}}, \tag{94}$$

we have

$$\lVert I_{\mathrm{C}}(\eta)-I_{\mathrm{C}}(\eta')\rVert_{2,\infty}=\ell_{\mathrm{C},\infty}(\eta,\eta'). \tag{95}$$

###### Proof.

For laws supported on the ordered grid $\Theta(s)$, the cumulative distribution functions $F_{\mu}$ and $F_{\nu}$ are piecewise constant on the intervals

$$[\theta_{j}(s),\theta_{j+1}(s)),\qquad j=1,\dots,d-1, \tag{96}$$

with values $c^{\mu}_{j}(s)$ and $c^{\nu}_{j}(s)$ respectively, and they coincide outside $[\theta_{1}(s),\theta_{d}(s))$. Therefore,

$$\ell_{\mathrm{C}}(\mu,\nu)^{2}=\int_{\mathbb{R}}\bigl(F_{\mu}(z)-F_{\nu}(z)\bigr)^{2}\,\mathrm{d}z=\sum_{j=1}^{d-1}\Delta_{j}(s)\bigl(c^{\mu}_{j}(s)-c^{\nu}_{j}(s)\bigr)^{2}, \tag{97}$$

which is exactly $\lVert I_{s}(\mu)-I_{s}(\nu)\rVert_{2}^{2}$. Taking the maximum over $s\in\mathcal{S}$ gives the product-space identity. ∎
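
As a numerical sanity check on Proposition 17, the following minimal Python sketch builds the embedding of Eq. (92) for a single state and compares its squared Euclidean distance with a midpoint-rule approximation of the integral in Eq. (97). The grid, the two mass vectors, and all function names are illustrative assumptions and do not come from the paper.

```python
import math

def cramer_embedding(theta, p):
    """Embed masses p on the ordered grid theta as in Eq. (92)."""
    out, cum = [], 0.0
    for j in range(len(theta) - 1):
        cum += p[j]                                   # cumulative mass c_j
        out.append(math.sqrt(theta[j + 1] - theta[j]) * cum)
    out.append(0.0)                                   # last coordinate is zero
    return out

def cdf(theta, p, z):
    """Right-continuous CDF of the categorical law at the point z."""
    return sum(m for t, m in zip(theta, p) if t <= z)

def cramer_sq_numeric(theta, p, q, n=4000):
    """Midpoint-rule approximation of the integral of (F_p - F_q)^2."""
    lo, hi = theta[0], theta[-1]
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * h
        total += (cdf(theta, p, z) - cdf(theta, q, z)) ** 2
    return total * h

# Hypothetical example: three atoms, two probability vectors.
theta = [0.0, 0.5, 2.0]
p = [0.2, 0.5, 0.3]
q = [0.6, 0.3, 0.1]

diff = [a - b for a, b in zip(cramer_embedding(theta, p),
                              cramer_embedding(theta, q))]
embedded_sq = sum(x * x for x in diff)        # ||I_s(p) - I_s(q)||_2^2
numeric_sq = cramer_sq_numeric(theta, p, q)   # integral of (F_p - F_q)^2
print(embedded_sq, numeric_sq)
```

Both quantities should agree up to floating-point error, since the integrand is piecewise constant and the midpoints never fall on a grid point in this example.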

Recall from Section [3.1](https://arxiv.org/html/2605.06866#S3.SS1) that

$$\mathcal{O}_{\mathrm{C}}:=I_{\mathrm{C}}\circ\Pi_{\mathrm{C}}^{\Theta}T^{\pi}\circ I_{\mathrm{C}}^{-1}. \tag{98}$$

###### Proposition 18 (Embedded CTD contraction).

For all $U,U'\in I_{\mathrm{C}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta})$,

$$\lVert\mathcal{O}_{\mathrm{C}}U-\mathcal{O}_{\mathrm{C}}U'\rVert_{2,\infty}\leq\sqrt{\gamma}\,\lVert U-U'\rVert_{2,\infty}. \tag{99}$$

###### Proof.

Let $\eta:=I_{\mathrm{C}}^{-1}(U)$ and $\eta':=I_{\mathrm{C}}^{-1}(U')$. By Proposition [17](https://arxiv.org/html/2605.06866#Thmtheorem17) and the $\sqrt{\gamma}$-contraction of the projected operator $\Pi_{\mathrm{C}}^{\Theta}T^{\pi}$ in the supremum Cramér metric [[5](https://arxiv.org/html/2605.06866#bib.bib5)],

$$\lVert\mathcal{O}_{\mathrm{C}}U-\mathcal{O}_{\mathrm{C}}U'\rVert_{2,\infty}=\ell_{\mathrm{C},\infty}(\Pi_{\mathrm{C}}^{\Theta}T^{\pi}\eta,\Pi_{\mathrm{C}}^{\Theta}T^{\pi}\eta')\leq\sqrt{\gamma}\,\ell_{\mathrm{C},\infty}(\eta,\eta')=\sqrt{\gamma}\,\lVert U-U'\rVert_{2,\infty}. \tag{100}$$

∎

### B.2 Unbiased sampled target and samplewise Lipschitz continuity

Recall the sampled CTD target from Section [3.1](https://arxiv.org/html/2605.06866#S3.SS1):

$$\widehat{T}_{\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{S_{k}}\Bigl(\Pi_{\mathrm{C}}^{\Theta(S_{k})}\bigl((f_{R_{k},\gamma})_{\#}I_{S_{k+1}}^{-1}(U_{k}(S_{k+1}))\bigr)\Bigr), \tag{101}$$

where $f_{r,\gamma}(z)=r+\gamma z$.

###### Proposition 19 (Conditional unbiasedness of the sampled CTD target).

For every $U\in I_{\mathrm{C}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta})$ and every state $s\in\mathcal{S}$,

$$\mathbb{E}\left[\widehat{T}_{\mathrm{C}}(U;s,(R_{k},S_{k+1}))\mid U_{k}=U,\ S_{k}=s\right]=(\mathcal{O}_{\mathrm{C}}U)(s). \tag{102}$$

###### Proof.

Condition on $U_{k}=U$ and $S_{k}=s$. By definition,

$$\widehat{T}_{\mathrm{C}}(U;s,(R_{k},S_{k+1}))=I_{s}\Bigl(\Pi_{\mathrm{C}}^{\Theta(s)}\bigl((f_{R_{k},\gamma})_{\#}I_{S_{k+1}}^{-1}(U(S_{k+1}))\bigr)\Bigr). \tag{103}$$

Taking conditional expectation with respect to the one-step transition and reward law under $\pi$ gives exactly the Bellman expectation at state $s$. Since the statewise projection and the embedding $I_{s}$ are deterministic and act affinely on mixtures of laws, the expectation passes through them, which yields

$$\mathbb{E}\left[\widehat{T}_{\mathrm{C}}(U;s,(R_{k},S_{k+1}))\mid U_{k}=U,\ S_{k}=s\right]=I_{s}\bigl((\Pi_{\mathrm{C}}^{\Theta}T^{\pi}I_{\mathrm{C}}^{-1}U)(s)\bigr)=(\mathcal{O}_{\mathrm{C}}U)(s). \tag{104}$$

∎
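
The identity in Proposition 19 can be checked exactly on a toy problem with a finite reward and transition law. The sketch below is a minimal illustration, assuming the standard linear-interpolation form of the categorical projection; the grid, the two next-state laws, the transition list, and every function name are hypothetical. It compares the probability-weighted average of embedded sampled targets with a single projection and embedding of the Bellman mixture.

```python
import bisect
import math

def cramer_embedding(theta, p):
    """Embed masses p on the ordered grid theta as in Eq. (92)."""
    out, cum = [], 0.0
    for j in range(len(theta) - 1):
        cum += p[j]
        out.append(math.sqrt(theta[j + 1] - theta[j]) * cum)
    out.append(0.0)
    return out

def categorical_projection(theta, atoms, masses):
    """Split each atom's mass linearly between its neighbouring grid points."""
    out = [0.0] * len(theta)
    for z, m in zip(atoms, masses):
        if z <= theta[0]:
            out[0] += m
        elif z >= theta[-1]:
            out[-1] += m
        else:
            j = bisect.bisect_right(theta, z) - 1     # theta[j] <= z < theta[j+1]
            w = (theta[j + 1] - z) / (theta[j + 1] - theta[j])
            out[j] += m * w
            out[j + 1] += m * (1.0 - w)
    return out

gamma = 0.5
theta = [0.0, 0.5, 1.0, 1.5, 2.0]                     # shared grid (assumption)
laws = {"a": [0.1, 0.2, 0.3, 0.2, 0.2],               # current law at next state "a"
        "b": [0.5, 0.0, 0.0, 0.25, 0.25]}             # current law at next state "b"
transitions = [(0.25, 0.0, "a"), (0.25, 1.0, "a"), (0.5, 0.3, "b")]  # (prob, r, s')

# Left side of Eq. (102): average of embedded sampled targets.
mean_target = [0.0] * len(theta)
for prob, r, nxt in transitions:
    atoms = [r + gamma * t for t in theta]            # pushforward by f_{r,gamma}
    target = cramer_embedding(theta, categorical_projection(theta, atoms, laws[nxt]))
    mean_target = [m + prob * x for m, x in zip(mean_target, target)]

# Right side: project and embed the Bellman mixture once.
mix_atoms, mix_masses = [], []
for prob, r, nxt in transitions:
    mix_atoms += [r + gamma * t for t in theta]
    mix_masses += [prob * m for m in laws[nxt]]
operator_value = cramer_embedding(
    theta, categorical_projection(theta, mix_atoms, mix_masses))

gap = max(abs(a - b) for a, b in zip(mean_target, operator_value))
print(gap)
```

The two vectors coincide up to rounding because both the projection and the embedding are linear in the masses.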

###### Proposition 20 (CTD samplewise Lipschitz continuity).

For every $s\in\mathcal{S}$, every $(r,s')\in[0,1]\times\mathcal{S}$, and every $U,U'\in I_{\mathrm{C}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta})$,

$$\left\lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{\mathrm{C}}(U';s,(r,s'))\right\rVert_{2}\leq\lVert U-U'\rVert_{2,\infty}. \tag{105}$$

Consequently, if

$$\widehat{H}_{\mathrm{C}}(U;s,(r,s')):=U+P_{s}\bigl(\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-U(s)\bigr), \tag{106}$$

then

$$\lVert\widehat{H}_{\mathrm{C}}(U;s,(r,s'))-\widehat{H}_{\mathrm{C}}(U';s,(r,s'))\rVert_{2,\infty}\leq\lVert U-U'\rVert_{2,\infty}. \tag{107}$$

###### Proof.

Fix $s\in\mathcal{S}$ and $(r,s')\in[0,1]\times\mathcal{S}$. Let

$$\mu:=I_{s'}^{-1}(U(s')),\qquad\nu:=I_{s'}^{-1}(U'(s')). \tag{108}$$

Then

$$\widehat{T}_{\mathrm{C}}(U;s,(r,s'))=I_{s}\bigl(\Pi_{\mathrm{C}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu)\bigr) \tag{109}$$

and similarly for $U'$. Using the statewise Cramér isometry, the nonexpansiveness of the categorical projection in the Cramér metric, and the deterministic pushforward contraction by $\sqrt{\gamma}$,

$$\begin{aligned}
\left\lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{\mathrm{C}}(U';s,(r,s'))\right\rVert_{2}
&=\ell_{\mathrm{C},s}\bigl(\Pi_{\mathrm{C}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu),\Pi_{\mathrm{C}}^{\Theta(s)}((f_{r,\gamma})_{\#}\nu)\bigr)\\
&\leq\ell_{\mathrm{C},s}((f_{r,\gamma})_{\#}\mu,(f_{r,\gamma})_{\#}\nu)\\
&\leq\sqrt{\gamma}\,\ell_{\mathrm{C},s'}(\mu,\nu)\\
&=\sqrt{\gamma}\,\lVert U(s')-U'(s')\rVert_{2}\\
&\leq\lVert U-U'\rVert_{2,\infty}.
\end{aligned} \tag{110}$$

For the full-sample map, every block other than $s$ is unchanged, while the $s$th block is replaced by the sampled target. Therefore,

$$\lVert\widehat{H}_{\mathrm{C}}(U;s,(r,s'))-\widehat{H}_{\mathrm{C}}(U';s,(r,s'))\rVert_{2,\infty}=\max\Bigl\{\max_{x\neq s}\lVert U(x)-U'(x)\rVert_{2},\ \lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{\mathrm{C}}(U';s,(r,s'))\rVert_{2}\Bigr\}, \tag{111}$$

which is bounded by $\lVert U-U'\rVert_{2,\infty}$ by the previous estimate. ∎
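
The chain of inequalities in (110) can be illustrated numerically. The following is a minimal sketch, assuming the standard linear-interpolation form of the categorical projection and using the same grid at $s$ and $s'$; the grid, the sampled pair $(r,s')$, the two mass vectors, and all function names are hypothetical. It evaluates the embedded sampled target of Eq. (101) for two inputs and compares the resulting distance with $\sqrt{\gamma}$ times the input distance.

```python
import bisect
import math

def cramer_embedding(theta, p):
    """Embed masses p on the ordered grid theta as in Eq. (92)."""
    out, cum = [], 0.0
    for j in range(len(theta) - 1):
        cum += p[j]
        out.append(math.sqrt(theta[j + 1] - theta[j]) * cum)
    out.append(0.0)
    return out

def categorical_projection(theta, atoms, masses):
    """Split each atom's mass linearly between its neighbouring grid points."""
    out = [0.0] * len(theta)
    for z, m in zip(atoms, masses):
        if z <= theta[0]:
            out[0] += m
        elif z >= theta[-1]:
            out[-1] += m
        else:
            j = bisect.bisect_right(theta, z) - 1     # theta[j] <= z < theta[j+1]
            w = (theta[j + 1] - z) / (theta[j + 1] - theta[j])
            out[j] += m * w
            out[j + 1] += m * (1.0 - w)
    return out

def sampled_target(theta_s, theta_next, p_next, r, gamma):
    """Embedded sampled target of Eq. (101) for one transition (s, r, s')."""
    atoms = [r + gamma * t for t in theta_next]       # pushforward by f_{r,gamma}
    return cramer_embedding(theta_s, categorical_projection(theta_s, atoms, p_next))

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

gamma, r = 0.5, 0.3
theta = [0.0, 0.5, 1.0, 1.5, 2.0]                     # same grid at s and s'
p = [0.1, 0.2, 0.3, 0.2, 0.2]                         # law behind U(s')
q = [0.3, 0.3, 0.1, 0.1, 0.2]                         # law behind U'(s')

lhs = l2(sampled_target(theta, theta, p, r, gamma),
         sampled_target(theta, theta, q, r, gamma))
rhs = l2(cramer_embedding(theta, p), cramer_embedding(theta, q))
print(lhs, math.sqrt(gamma) * rhs)
```

In this example the target distance is strictly below $\sqrt{\gamma}$ times the input distance, which reflects the slack introduced by the projection step.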

### B.3 Uniform perturbation bound

###### Proposition 21 (Uniform bound on embedded categorical blocks).

For every state $s\in\mathcal{S}$ and every embedded categorical law $U(s)\in I_{s}(\mathcal{F}_{\mathrm{C},\Theta(s)})$,

$$\lVert U(s)\rVert_{2}^{2}\leq\theta_{d}(s)-\theta_{1}(s). \tag{112}$$

###### Proof.

Write

$$U(s)=I_{s}(\eta(s))=\bigl(\sqrt{\Delta_{1}(s)}\,c^{\eta}_{1}(s),\dots,\sqrt{\Delta_{d-1}(s)}\,c^{\eta}_{d-1}(s),0\bigr)^{\top}. \tag{113}$$

Because $\eta(s)$ is a probability law, every cumulative mass $c^{\eta}_{j}(s)$ lies in $[0,1]$. Hence

$$\lVert U(s)\rVert_{2}^{2}=\sum_{j=1}^{d-1}\Delta_{j}(s)\bigl(c^{\eta}_{j}(s)\bigr)^{2}\leq\sum_{j=1}^{d-1}\Delta_{j}(s)=\theta_{d}(s)-\theta_{1}(s). \tag{114}$$

∎
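
As a short worked example (added here for illustration and not part of the original proof), the bound (112) is attained by the point mass at the smallest atom. Taking $\eta(s)=\delta_{\theta_{1}(s)}$, every cumulative mass equals one, so

$$c^{\eta}_{j}(s)=1\ \text{ for } j=1,\dots,d-1\quad\Longrightarrow\quad\lVert I_{s}(\eta(s))\rVert_{2}^{2}=\sum_{j=1}^{d-1}\Delta_{j}(s)=\theta_{d}(s)-\theta_{1}(s).$$

At the other extreme, the point mass at $\theta_{d}(s)$ has $c^{\eta}_{j}(s)=0$ for all $j\leq d-1$ and therefore embeds to the zero vector.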

###### Proposition 22 (CTD centered perturbation bound).

Let the CTD expected full-sample map be

$$F_{\mathrm{C}}(U,s):=U+P_{s}\bigl((\mathcal{O}_{\mathrm{C}}U)(s)-U(s)\bigr) \tag{115}$$

and

$$\Delta_{k}^{\mathrm{C}}:=\widehat{H}_{\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1}))-F_{\mathrm{C}}(U_{k},S_{k}). \tag{116}$$

Then

$$\mathbb{E}\left[\Delta_{k}^{\mathrm{C}}\mid\mathcal{G}_{k}\right]=0 \tag{117}$$

and

$$\lVert\Delta_{k}^{\mathrm{C}}\rVert_{2,\infty}\leq 2B_{\mathrm{C}}\qquad\text{a.s. for all }k, \tag{118}$$

where

$$B_{\mathrm{C}}:=\max_{s\in\mathcal{S}}\sqrt{\theta_{d}(s)-\theta_{1}(s)}. \tag{119}$$

Consequently,

$$\mathbb{E}\left[\left\lVert\widehat{T}_{\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{C}}U_{k})(S_{k})\right\rVert_{2}^{2}\mid U_{k},\ S_{k}\right]\leq 4B_{\mathrm{C}}^{2}. \tag{120}$$

###### Proof.

The conditional mean identity follows from Proposition [19](https://arxiv.org/html/2605.06866#Thmtheorem19). For the pathwise bound, only the $S_{k}$th block is nonzero, so

$$\lVert\Delta_{k}^{\mathrm{C}}\rVert_{2,\infty}=\left\lVert\widehat{T}_{\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{C}}U_{k})(S_{k})\right\rVert_{2}. \tag{121}$$

Both terms on the right are embedded categorical laws at state $S_{k}$. By Proposition [21](https://arxiv.org/html/2605.06866#Thmtheorem21), each has Euclidean norm at most $B_{\mathrm{C}}$. Therefore, the triangle inequality yields

$$\lVert\Delta_{k}^{\mathrm{C}}\rVert_{2,\infty}\leq 2B_{\mathrm{C}}. \tag{122}$$

Squaring gives the conditional second-moment bound. ∎

### B.4 Completion of the proof of Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1)

###### Proof of Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1).

We now verify the hypotheses of Appendix [A](https://arxiv.org/html/2605.06866#A1) for the discounted CTD recursion.

First, Proposition [18](https://arxiv.org/html/2605.06866#Thmtheorem18) gives the block-supremum contraction

$$\lVert\mathcal{O}_{\mathrm{C}}U-\mathcal{O}_{\mathrm{C}}U'\rVert_{2,\infty}\leq\beta_{\mathrm{C}}\lVert U-U'\rVert_{2,\infty},\qquad\beta_{\mathrm{C}}:=\sqrt{\gamma}. \tag{123}$$

Second, Proposition [20](https://arxiv.org/html/2605.06866#Thmtheorem20) gives the samplewise Lipschitz property in the Markovian setting with

$$A_{\mathrm{C},1}=1. \tag{124}$$

Moreover, the proof of Proposition [20](https://arxiv.org/html/2605.06866#Thmtheorem20) gives the stronger target-level estimate

$$\lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{\mathrm{C}}(U';s,(r,s'))\rVert_{2}\leq\beta_{\mathrm{C}}\lVert U-U'\rVert_{2,\infty}\leq\lVert U-U'\rVert_{2,\infty}. \tag{125}$$

Third, Proposition [22](https://arxiv.org/html/2605.06866#Thmtheorem22) gives the centered pathwise perturbation bound

$$\mathbb{E}[\Delta_{k}^{\mathrm{C}}\mid\mathcal{G}_{k}]=0,\qquad\lVert\Delta_{k}^{\mathrm{C}}\rVert_{2,\infty}\leq 2B_{\mathrm{C}}\quad\text{a.s.}, \tag{126}$$

so Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16) applies with

$$A_{\mathrm{C},1}=1,\qquad A_{\mathrm{C},2}=0,\qquad B_{\mathrm{C},2}=2B_{\mathrm{C}},\qquad A_{\mathrm{C}}:=A_{\mathrm{C},1}+A_{\mathrm{C},2}+1=2. \tag{127}$$

In particular, the i.i.d. conditional second-moment condition in Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) holds with

$$A_{\mathrm{C}}^{\mathrm{iid}}:=4B_{\mathrm{C}}^{2}. \tag{128}$$

Applying Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) with

$$\beta=\beta_{\mathrm{C}},\qquad A^{\mathrm{iid}}=A_{\mathrm{C}}^{\mathrm{iid}},\qquad\vartheta=\vartheta_{\mathrm{C},\rho},\qquad p^{\star}=\max\{2,\lceil\log\lvert\mathcal{S}\rvert\rceil\}, \tag{129}$$

yields Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1)(i) with

$$a_{\mathrm{C},1}^{\mathrm{iid}}:=\frac{1+\vartheta_{\mathrm{C},\rho}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{C},\rho}},\qquad a_{\mathrm{C},2}^{\mathrm{iid}}:=1-\bar{\beta}_{\mathrm{C},\rho}\sqrt{\frac{1+\vartheta_{\mathrm{C},\rho}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{C},\rho}}}, \tag{130}$$

where

$$\bar{\beta}_{\mathrm{C},\rho}:=1-\rho_{\min}(1-\beta_{\mathrm{C}}), \tag{131}$$

and

$$a_{\mathrm{C},3}^{\mathrm{iid}}:=\frac{4(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}(4B_{\mathrm{C}}^{2}+2)(1+\vartheta_{\mathrm{C},\rho})}{\vartheta_{\mathrm{C},\rho}},\qquad a_{\mathrm{C},4}^{\mathrm{iid}}:=\frac{8(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}B_{\mathrm{C}}^{2}(1+\vartheta_{\mathrm{C},\rho})}{\vartheta_{\mathrm{C},\rho}}. \tag{132}$$

Applying Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16) with

$$A_{1}=A_{\mathrm{C},1}=1,\qquad A_{2}=A_{\mathrm{C},2}=0,\qquad B_{2}=B_{\mathrm{C},2}=2B_{\mathrm{C}}, \tag{133}$$

yields Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1), part (ii), with

$$\tilde{\phi}_{\mathrm{C},1}:=\frac{1+\vartheta_{\mathrm{C},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{C},\mu}},\qquad\phi_{\mathrm{C},2}:=1-\bar{\beta}_{\mathrm{C},\mu}\sqrt{\frac{1+\vartheta_{\mathrm{C},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{C},\mu}}}, \tag{134}$$

where

$$\bar{\beta}_{\mathrm{C},\mu}:=1-\mu_{\min}(1-\beta_{\mathrm{C}}), \tag{135}$$

and

$$\tilde{\phi}_{\mathrm{C},3}:=\frac{114(p^{\star}-1)\bigl(1+\vartheta_{\mathrm{C},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}\bigr)}{\vartheta_{\mathrm{C},\mu}},\qquad\phi_{\mathrm{C},3}:=4B_{\mathrm{C}}^{2}\tilde{\phi}_{\mathrm{C},3} \tag{136}$$

together with

$$\phi_{\mathrm{C},1}:=8\tilde{\phi}_{\mathrm{C},1},\qquad\phi_{\mathrm{C},4}:=2\tilde{\phi}_{\mathrm{C},1}B_{\mathrm{C}}^{2}, \tag{137}$$

and the step size condition

$$\sum_{i=k-t_{k}}^{k-1}\alpha_{i}\leq\min\left\{\frac{\phi_{\mathrm{C},2}}{4\phi_{\mathrm{C},3}},\frac{1}{8}\right\}\qquad\text{for all }k\geq K. \tag{138}$$

The constant, linearly-diminishing, and polynomially-diminishing step size bounds in Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1) are therefore obtained by the corresponding specializations of Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) and Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16). ∎

### B.5 Proof of Corollary [2](https://arxiv.org/html/2605.06866#Thmtheorem2)

###### Proof.

We use the linearly-diminishing step size bounds in Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1).

For part (i), the linearly-diminishing i.i.d. bound in Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1)(i) gives

$$\mathbb{E}\left[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq a_{\mathrm{C},1}^{\mathrm{iid}}\,\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{a_{\mathrm{C},2}^{\mathrm{iid}}\alpha}+\frac{4e\alpha^{2}a_{\mathrm{C},4}^{\mathrm{iid}}}{a_{\mathrm{C},2}^{\mathrm{iid}}\alpha-1}\cdot\frac{1}{k+h}. \tag{139}$$

Since $\alpha>1/a_{\mathrm{C},2}^{\mathrm{iid}}$, one has $a_{\mathrm{C},2}^{\mathrm{iid}}\alpha>1$, and therefore the first term is also $O((k+h)^{-1})$. Hence

$$\mathbb{E}\left[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\right]=O\left(\frac{1}{k+h}\right). \tag{140}$$

Thus $\mathbb{E}[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ is guaranteed once

$$\mathbb{E}\left[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq\varepsilon^{2}, \tag{141}$$

which holds for $k=O(\varepsilon^{-2})$.

For part (ii), the linearly-diminishing Markovian bound in Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1), part (ii), gives

$$\mathbb{E}\left[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq\bigl(\phi_{\mathrm{C},1}\,\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}+\phi_{\mathrm{C},4}\bigr)\left(\frac{K+h}{k+h}\right)^{\phi_{\mathrm{C},2}\alpha}+\frac{8e\alpha^{2}\phi_{\mathrm{C},3}}{\phi_{\mathrm{C},2}\alpha-1}\cdot\frac{t_{k}}{k+h}. \tag{142}$$

Since $\alpha>1/\phi_{\mathrm{C},2}$, the first term is $O((k+h)^{-1})$. By geometric mixing, $t_{k}=t_{\alpha_{k}}=O(\log(k+h))$, so the second term is

$$O\left(\frac{\log(k+h)}{k+h}\right)=\widetilde{O}\left(\frac{1}{k+h}\right). \tag{143}$$

Therefore

$$\mathbb{E}\left[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\right]=\widetilde{O}\left(\frac{1}{k+h}\right), \tag{144}$$

and hence $\mathbb{E}[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ is ensured for $k=\widetilde{O}(\varepsilon^{-2})$. ∎
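
The passage from a mean-square rate to an iteration count can be written out as a short worked step. This is a sketch under the assumption that the mean-square bound has the form $C/(k+h)$, where $C>0$ is a placeholder for the problem-dependent constants and is not a symbol used in the paper. Writing $\ell:=\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})$, Jensen's inequality gives

$$\mathbb{E}[\ell]\leq\sqrt{\mathbb{E}[\ell^{2}]}\leq\sqrt{\frac{C}{k+h}},$$

so $\mathbb{E}[\ell]\leq\varepsilon$ holds as soon as $k\geq C\varepsilon^{-2}-h$. In the Markovian case the same argument applied to a bound of the form $C\log(k+h)/(k+h)$ changes the requirement only by a logarithmic factor, which the $\widetilde{O}$ notation absorbs.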

## Appendix C MTD details and proof of Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)

We now verify the abstract hypotheses of Appendix [A](https://arxiv.org/html/2605.06866#A1) for the discounted multivariate signed-categorical recursion.

### C.1 Statewise MMD isometry and contraction

For each state $s\in\mathcal{S}$, let

$$\Theta(s)=\{\theta_{1}(s),\dots,\theta_{d}(s)\}\subset\mathbb{R}^{q},\qquad\mathbb{R}_{1}^{d}:=\Bigl\{p\in\mathbb{R}^{d}:\sum_{i=1}^{d}p_{i}=1\Bigr\}. \tag{145}$$

Let $\mathcal{M}(\mathbb{R}^{q})$ denote the space of finite signed Borel measures on $\mathbb{R}^{q}$. Define

$$\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta}:=\Bigl\{\eta:\mathcal{S}\to\mathcal{M}(\mathbb{R}^{q}):\eta(s)=\sum_{i=1}^{d}p_{i}(s)\,\delta_{\theta_{i}(s)},\ p(s)\in\mathbb{R}_{1}^{d}\text{ for all }s\in\mathcal{S}\Bigr\}. \tag{146}$$

Let $\kappa:\mathbb{R}^{q}\times\mathbb{R}^{q}\to\mathbb{R}$ be the characteristic kernel from Section [3.2](https://arxiv.org/html/2605.06866#S3.SS2). For each state $s$, let

$$K_{s}:=\bigl(\kappa(\theta_{i}(s),\theta_{j}(s))\bigr)_{i,j=1}^{d} \tag{147}$$

be the Gram matrix on $\Theta(s)$. For

$$\eta(s)=\sum_{i=1}^{d}p_{i}(s)\,\delta_{\theta_{i}(s)}, \tag{148}$$

define

$$I_{s}(\eta(s)):=K_{s}^{1/2}p(s)\in\mathbb{R}^{d}. \tag{149}$$

###### Proposition 23 (Statewise MMD isometry).

For every state $s\in\mathcal{S}$ and all signed-categorical laws $\mu,\nu$ supported on $\Theta(s)$,

$$\mathrm{MMD}_{\kappa}(\mu,\nu)^{2}=\lVert I_{s}(\mu)-I_{s}(\nu)\rVert_{2}^{2}. \tag{150}$$

Consequently, with

$$I_{\mathrm{M}}(\eta):=(I_{s}(\eta(s)))_{s\in\mathcal{S}}, \tag{151}$$

one has

$$\lVert I_{\mathrm{M}}(\eta)-I_{\mathrm{M}}(\eta')\rVert_{2,\infty}=\ell_{\mathrm{M},\infty}(\eta,\eta'). \tag{152}$$

###### Proof.

Write

$$\mu=\sum_{i=1}^{d}p_{i}\,\delta_{\theta_{i}(s)},\qquad\nu=\sum_{i=1}^{d}q_{i}\,\delta_{\theta_{i}(s)}. \tag{153}$$

Then

$$\mu-\nu=\sum_{i=1}^{d}(p_{i}-q_{i})\,\delta_{\theta_{i}(s)}. \tag{154}$$

By the definition of MMD,

$$\mathrm{MMD}_{\kappa}(\mu,\nu)^{2}=\sum_{i=1}^{d}\sum_{j=1}^{d}(p_{i}-q_{i})(p_{j}-q_{j})\,\kappa(\theta_{i}(s),\theta_{j}(s))=(p-q)^{\top}K_{s}(p-q). \tag{155}$$

Since $K_{s}$ is positive semidefinite,

$$(p-q)^{\top}K_{s}(p-q)=\lVert K_{s}^{1/2}(p-q)\rVert_{2}^{2}=\lVert I_{s}(\mu)-I_{s}(\nu)\rVert_{2}^{2}. \tag{156}$$

Taking the maximum over $s\in\mathcal{S}$ gives the product isometry. ∎
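
The identity (155)-(156) is easy to check numerically. The sketch below is a minimal illustration that uses a Gaussian kernel as a stand-in positive semidefinite kernel; the paper's kernel $\kappa$ is the one fixed in Section 3.2 and may differ. The grid, the weight vectors, and the helper names are hypothetical. It compares the double sum in (155) with the squared Euclidean distance between the embeddings $K_{s}^{1/2}p$ and $K_{s}^{1/2}q$.

```python
import numpy as np

def gauss_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix; a stand-in for the paper's kernel kappa."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def psd_sqrt(K):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(K)
    w = np.clip(w, 0.0, None)          # guard against tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

# Hypothetical grid Theta(s) with d = 4 atoms in R^2.
theta = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = gauss_kernel(theta, theta)
K_half = psd_sqrt(K)

p = np.array([0.7, 0.5, -0.4, 0.2])    # signed weights summing to one
q = np.array([0.1, 0.2, 0.3, 0.4])

diff = p - q
d = len(diff)
mmd_sq = sum(diff[i] * diff[j] * K[i, j] for i in range(d) for j in range(d))
embed_sq = float(np.sum((K_half @ p - K_half @ q) ** 2))
print(mmd_sq, embed_sq)
```

Signed weights are allowed on purpose, since the class in (146) only requires the weights to sum to one.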

Recall from Section [3.2](https://arxiv.org/html/2605.06866#S3.SS2) that

$$\mathcal{O}_{\mathrm{M}}:=I_{\mathrm{M}}\circ\Pi_{\mathrm{M}}^{\Theta}T^{\pi}\circ I_{\mathrm{M}}^{-1}. \tag{157}$$

###### Proposition 24 (Embedded MTD contraction).

For all $U,U'\in I_{\mathrm{M}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta})$,

$$\lVert\mathcal{O}_{\mathrm{M}}U-\mathcal{O}_{\mathrm{M}}U'\rVert_{2,\infty}\leq\gamma^{c/2}\lVert U-U'\rVert_{2,\infty}. \tag{158}$$

###### Proof.

Let $\eta:=I_{\mathrm{M}}^{-1}(U)$ and $\eta':=I_{\mathrm{M}}^{-1}(U')$. By Proposition [23](https://arxiv.org/html/2605.06866#Thmtheorem23) and the $\gamma^{c/2}$-contraction of the projected operator $\Pi_{\mathrm{M}}^{\Theta}T^{\pi}$ in the supremum MMD metric [[49](https://arxiv.org/html/2605.06866#bib.bib4)],

$$\lVert\mathcal{O}_{\mathrm{M}}U-\mathcal{O}_{\mathrm{M}}U'\rVert_{2,\infty}=\ell_{\mathrm{M},\infty}(\Pi_{\mathrm{M}}^{\Theta}T^{\pi}\eta,\Pi_{\mathrm{M}}^{\Theta}T^{\pi}\eta')\leq\gamma^{c/2}\,\ell_{\mathrm{M},\infty}(\eta,\eta')=\gamma^{c/2}\lVert U-U'\rVert_{2,\infty}. \tag{159}$$

∎

### C.2 Unbiased sampled target, Lipschitz continuity, and perturbation control

Recall the sampled MTD target from Section [3.2](https://arxiv.org/html/2605.06866#S3.SS2):

$$\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{S_{k}}\Bigl(\Pi_{\mathrm{M}}^{\Theta(S_{k})}\bigl((f_{R_{k},\gamma})_{\#}I_{S_{k+1}}^{-1}(U_{k}(S_{k+1}))\bigr)\Bigr). \tag{160}$$

###### Proposition 25 (Embedded MTD projection as an affine Euclidean projector).

For each state $s\in\mathcal{S}$, let

$$\mathcal{A}_{s}:=K_{s}^{1/2}\mathbb{R}_{1}^{d}\subset\mathbb{R}^{d}. \tag{161}$$

For every mass-$1$ finite signed measure $\nu$ on $\mathbb{R}^{q}$, define

$$b_{s}(\nu):=\left(\int\kappa(\theta_{i}(s),y)\,d\nu(y)\right)_{i=1}^{d}\in\mathbb{R}^{d} \tag{162}$$

and

$$\zeta_{s}(\nu):=K_{s}^{\dagger/2}b_{s}(\nu), \tag{163}$$

where $K_{s}^{\dagger}$ is the Moore-Penrose pseudoinverse of $K_{s}$. Then $b_{s}(\nu)\in\operatorname{range}(K_{s})$ and

$$I_{\mathrm{M},s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}\nu\bigr)=\operatorname{proj}_{\mathcal{A}_{s}}\bigl(\zeta_{s}(\nu)\bigr), \tag{164}$$

where $\operatorname{proj}_{\mathcal{A}_{s}}$ is the Euclidean orthogonal projection onto $\mathcal{A}_{s}$. Consequently, the map

$$J_{s}(\nu):=I_{\mathrm{M},s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}\nu\bigr) \tag{165}$$

is affine on the affine space of mass-$1$ finite signed measures.

###### Proof.

Let $\mathcal{H}_{\kappa}$ be the RKHS of $\kappa$, let $\varphi:\mathbb{R}^{q}\to\mathcal{H}_{\kappa}$ denote the feature map, and define the linear operator

$$\Phi_{s}:\mathbb{R}^{d}\to\mathcal{H}_{\kappa},\qquad\Phi_{s}p:=\sum_{i=1}^{d}p_{i}\,\varphi(\theta_{i}(s)). \tag{166}$$

Then

$$K_{s}=\Phi_{s}^{*}\Phi_{s}. \tag{167}$$

For a mass-$1$ finite signed measure $\nu$, define its kernel mean embedding by

$$m_{\nu}:=\int\varphi(y)\,d\nu(y)\in\mathcal{H}_{\kappa}. \tag{168}$$

Let $P_{s}$ denote the orthogonal projection in $\mathcal{H}_{\kappa}$ onto $\operatorname{range}(\Phi_{s})$. Since $\Phi_{s}^{*}(I-P_{s})=0$, we have

$$b_{s}(\nu)=\Phi_{s}^{*}m_{\nu}=\Phi_{s}^{*}P_{s}m_{\nu}\in\operatorname{range}(\Phi_{s}^{*}\Phi_{s})=\operatorname{range}(K_{s}). \tag{169}$$

Now fix $p\in\mathbb{R}_{1}^{d}$ and write

$$\mu_{p}:=\sum_{i=1}^{d}p_{i}\,\delta_{\theta_{i}(s)}. \tag{170}$$

By the reproducing-kernel representation of MMD,

$$\mathrm{MMD}_{\kappa}(\mu_{p},\nu)^{2}=\lVert\Phi_{s}p-m_{\nu}\rVert_{\mathcal{H}_{\kappa}}^{2}=\lVert\Phi_{s}p-P_{s}m_{\nu}\rVert_{\mathcal{H}_{\kappa}}^{2}+\lVert(I-P_{s})m_{\nu}\rVert_{\mathcal{H}_{\kappa}}^{2}. \tag{171}$$

We next identify $P_{s}m_{\nu}$ in coefficient form. Since $K_{s}=\Phi_{s}^{*}\Phi_{s}$, the standard Moore-Penrose identity gives

$$\Phi_{s}K_{s}^{\dagger}\Phi_{s}^{*}=P_{s}. \tag{172}$$

Therefore,

$$\Phi_{s}\bigl(K_{s}^{\dagger}b_{s}(\nu)\bigr)=\Phi_{s}K_{s}^{\dagger}\Phi_{s}^{*}m_{\nu}=P_{s}m_{\nu}. \tag{173}$$

Substituting this into the previous display yields

$$\lVert\Phi_{s}p-P_{s}m_{\nu}\rVert_{\mathcal{H}_{\kappa}}=\lVert\Phi_{s}(p-K_{s}^{\dagger}b_{s}(\nu))\rVert_{\mathcal{H}_{\kappa}}. \tag{174}$$

Since $\lVert\Phi_{s}x\rVert_{\mathcal{H}_{\kappa}}^{2}=x^{\top}K_{s}x=\lVert K_{s}^{1/2}x\rVert_{2}^{2}$ for every $x\in\mathbb{R}^{d}$, we obtain

$$\lVert\Phi_{s}p-P_{s}m_{\nu}\rVert_{\mathcal{H}_{\kappa}}=\lVert K_{s}^{1/2}(p-K_{s}^{\dagger}b_{s}(\nu))\rVert_{2}=\lVert K_{s}^{1/2}p-K_{s}^{1/2}K_{s}^{\dagger}b_{s}(\nu)\rVert_{2}. \tag{175}$$

Because $K_{s}$ is symmetric positive semidefinite, spectral calculus gives

$$K_{s}^{1/2}K_{s}^{\dagger}=K_{s}^{\dagger/2}. \tag{176}$$

Hence

$$\lVert\Phi_{s}p-P_{s}m_{\nu}\rVert_{\mathcal{H}_{\kappa}}=\lVert K_{s}^{1/2}p-K_{s}^{\dagger/2}b_{s}(\nu)\rVert_{2}=\lVert K_{s}^{1/2}p-\zeta_{s}(\nu)\rVert_{2}. \tag{177}$$

Therefore

$$\mathrm{MMD}_{\kappa}(\mu_{p},\nu)^{2}=\lVert K_{s}^{1/2}p-\zeta_{s}(\nu)\rVert_{2}^{2}+c_{s}(\nu), \tag{178}$$

where

$$c_{s}(\nu):=\lVert(I-P_{s})m_{\nu}\rVert_{\mathcal{H}_{\kappa}}^{2} \tag{179}$$

does not depend on $p$.

Minimizing over $p\in\mathbb{R}_{1}^{d}$ is therefore equivalent to Euclidean projection of $\zeta_{s}(\nu)$ onto

$$\mathcal{A}_{s}=K_{s}^{1/2}\mathbb{R}_{1}^{d}.\tag{180}$$

Since $I_{\mathrm{M},s}\bigl(\sum_{i=1}^{d}p_{i}\delta_{\theta_{i}(s)}\bigr)=K_{s}^{1/2}p$, it follows that

$$I_{\mathrm{M},s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}\nu\bigr)=\operatorname{proj}_{\mathcal{A}_{s}}\bigl(\zeta_{s}(\nu)\bigr).\tag{181}$$

Finally, $\nu\mapsto b_{s}(\nu)$ is linear, hence so is $\nu\mapsto\zeta_{s}(\nu)=K_{s}^{\dagger/2}b_{s}(\nu)$. Since orthogonal projection onto an affine subspace is an affine map, the composition

$$J_{s}(\nu)=\operatorname{proj}_{\mathcal{A}_{s}}\bigl(\zeta_{s}(\nu)\bigr)\tag{182}$$

is affine on the affine space of mass-$1$ finite signed measures. ∎
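The decomposition in (171) to (178) can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the RKHS feature map is replaced by a hypothetical finite-dimensional matrix `Phi`, the mean embedding by a random vector `m_nu`, and $b_{s}(\nu)$ is taken to be $\Phi_{s}^{*}m_{\nu}$ as display (173) uses it. The helper name `psd_sqrt_and_pinv_sqrt` is not from the paper.

```python
import numpy as np

def psd_sqrt_and_pinv_sqrt(K, tol=1e-10):
    """Return (K^{1/2}, K^{dagger/2}) for a symmetric PSD matrix K."""
    w, V = np.linalg.eigh(K)                  # K = V diag(w) V^T
    keep = w > tol * w.max()                  # treat tiny eigenvalues as exact zeros
    sqrt_w = np.where(keep, np.sqrt(np.clip(w, 0.0, None)), 0.0)
    inv_sqrt_w = np.zeros_like(w)
    inv_sqrt_w[keep] = 1.0 / sqrt_w[keep]
    return (V * sqrt_w) @ V.T, (V * inv_sqrt_w) @ V.T

rng = np.random.default_rng(0)
D, d, rank = 6, 4, 3
# Hypothetical feature matrix, rank-deficient on purpose so the pseudoinverse matters.
Phi = rng.standard_normal((D, rank)) @ rng.standard_normal((rank, d))
m_nu = rng.standard_normal(D)                 # stand-in for the mean embedding m_nu
p = rng.standard_normal(d)
p += (1.0 - p.sum()) / d                      # signed weights with total mass 1

K = Phi.T @ Phi                               # Gram matrix K_s = Phi^* Phi
b = Phi.T @ m_nu                              # b_s(nu) = Phi^* m_nu (assumption, see lead-in)
K_half, K_pinv_half = psd_sqrt_and_pinv_sqrt(K)
zeta = K_pinv_half @ b                        # zeta_s(nu) = K^{dagger/2} b_s(nu)
P = Phi @ K_pinv_half @ K_pinv_half @ Phi.T   # P_s = Phi K^dagger Phi^*, as in (172)

lhs = np.sum((Phi @ p - m_nu) ** 2)           # squared distance on the left of (171)
c = np.sum((m_nu - P @ m_nu) ** 2)            # c_s(nu) from (179)
rhs = np.sum((K_half @ p - zeta) ** 2) + c    # right-hand side of (178)
print(abs(lhs - rhs))                         # should be at round-off level
```

Because `c` does not involve `p`, minimizing `lhs` over mass-1 weight vectors is the same as minimizing the Euclidean term, which is the projection statement in (181).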

###### Proposition 26 (Conditional unbiasedness of the sampled MTD target).

For every admissible $U\in I_{\mathrm{M}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta})$ and every state $s\in\mathcal{S}$,

$$\mathbb{E}\left[\widehat{T}_{\mathrm{M}}(U;s,(R_{k},S_{k+1}))\,\mid\,U_{k}=U,\ S_{k}=s\right]=(\mathcal{O}_{\mathrm{M}}U)(s).\tag{183}$$

###### Proof.

Let $\eta:=I_{\mathrm{M}}^{-1}(U)$ and condition on $U_{k}=U$ and $S_{k}=s$. By Proposition [25](https://arxiv.org/html/2605.06866#Thmtheorem25), the statewise map

$$J_{s}(\nu):=I_{s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}\nu\bigr)\tag{184}$$

is affine on mass-$1$ finite signed measures. Hence

$$\begin{aligned}
\mathbb{E}\left[\widehat{T}_{\mathrm{M}}(U;s,(R_{k},S_{k+1}))\,\mid\,U_{k}=U,\ S_{k}=s\right]
&=\mathbb{E}\left[J_{s}\bigl((f_{R_{k},\gamma})_{\#}\eta(S_{k+1})\bigr)\,\mid\,U_{k}=U,\ S_{k}=s\right]\\
&=J_{s}\left(\mathbb{E}\left[(f_{R_{k},\gamma})_{\#}\eta(S_{k+1})\,\mid\,U_{k}=U,\ S_{k}=s\right]\right)\\
&=J_{s}\bigl((T^{\pi}\eta)(s)\bigr)\\
&=I_{s}\bigl((\Pi_{\mathrm{M}}^{\Theta}T^{\pi}\eta)(s)\bigr)\\
&=(\mathcal{O}_{\mathrm{M}}U)(s).
\end{aligned}\tag{185}$$

∎

###### Proposition 27 (MTD samplewise Lipschitz continuity).

For every $s\in\mathcal{S}$, every $(r,s')\in[0,1]^{q}\times\mathcal{S}$, and all $U,U'\in I_{\mathrm{M}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta})$,

$$\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\right\rVert_{2}\leq\lVert U-U'\rVert_{2,\infty}.\tag{186}$$

Consequently, if

$$\widehat{H}_{\mathrm{M}}(U;s,(r,s')):=U+P_{s}\bigl(\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-U(s)\bigr),\tag{187}$$

then

$$\lVert\widehat{H}_{\mathrm{M}}(U;s,(r,s'))-\widehat{H}_{\mathrm{M}}(U';s,(r,s'))\rVert_{2,\infty}\leq\lVert U-U'\rVert_{2,\infty}.\tag{188}$$

###### Proof.

Fix $s\in\mathcal{S}$ and $(r,s')\in[0,1]^{q}\times\mathcal{S}$. Let

$$\mu:=I_{s'}^{-1}(U(s')),\qquad\nu:=I_{s'}^{-1}(U'(s')).\tag{189}$$

Then

$$\widehat{T}_{\mathrm{M}}(U;s,(r,s'))=I_{s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu)\bigr)\tag{190}$$

and likewise for $U'$. Using the statewise MMD isometry, the Euclidean projection representation from Proposition [25](https://arxiv.org/html/2605.06866#Thmtheorem25), and the $\gamma^{c/2}$ contraction of the pushforward in MMD,

$$\begin{aligned}
\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\right\rVert_{2}
&=\mathrm{MMD}_{\kappa}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu),\Pi_{\mathrm{M}}^{\Theta(s)}((f_{r,\gamma})_{\#}\nu)\bigr)\\
&\leq\mathrm{MMD}_{\kappa}((f_{r,\gamma})_{\#}\mu,(f_{r,\gamma})_{\#}\nu)\\
&\leq\gamma^{c/2}\,\mathrm{MMD}_{\kappa}(\mu,\nu)\\
&=\gamma^{c/2}\,\lVert U(s')-U'(s')\rVert_{2}\\
&\leq\lVert U-U'\rVert_{2,\infty}.
\end{aligned}\tag{191}$$

For the full-sample map, blocks other than $s$ are unchanged, while block $s$ is replaced by the sampled target. Hence, for every $x\neq s$,

$$\bigl(\widehat{H}_{\mathrm{M}}(U;s,(r,s'))-\widehat{H}_{\mathrm{M}}(U';s,(r,s'))\bigr)(x)=U(x)-U'(x),\tag{192}$$

and for the sampled block,

$$\bigl(\widehat{H}_{\mathrm{M}}(U;s,(r,s'))-\widehat{H}_{\mathrm{M}}(U';s,(r,s'))\bigr)(s)=\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s')).\tag{193}$$

Taking the block supremum and using the target-level bound proves the claim. ∎
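The block structure behind (192) and (193) can be illustrated with a toy experiment. The sketch below does not implement the MTD target: it assumes a hypothetical stand-in target `beta * Q @ U[s_next] + shift` that is `beta`-Lipschitz in the next-state block (with `Q` orthogonal), and checks that replacing a single block by such a target never increases the block-supremum distance between two iterates.

```python
import numpy as np

def block_sup_norm(U):
    """||U||_{2,inf}: largest Euclidean norm over state blocks (rows of U)."""
    return np.linalg.norm(U, axis=1).max()

def toy_full_sample_map(U, s, s_next, shift, beta, Q):
    """Replace block s by a toy target that is beta-Lipschitz in block s_next."""
    out = U.copy()
    out[s] = beta * (Q @ U[s_next]) + shift   # hypothetical stand-in for the sampled target
    return out

rng = np.random.default_rng(1)
n_states, d, beta = 5, 3, 0.9
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # orthogonal matrix, hence an isometry

worst_ratio = 0.0
for _ in range(1000):
    U, U2 = rng.standard_normal((2, n_states, d))  # two independent iterates
    s, s_next = rng.integers(n_states, size=2)     # sampled state and next state
    shift = rng.standard_normal(d)                 # plays the role of the reward-dependent part
    diff = (toy_full_sample_map(U, s, s_next, shift, beta, Q)
            - toy_full_sample_map(U2, s, s_next, shift, beta, Q))
    worst_ratio = max(worst_ratio, block_sup_norm(diff) / block_sup_norm(U - U2))

print(worst_ratio)   # never exceeds 1, up to round-off
```

The untouched blocks contribute exactly their original differences, and the replaced block contributes at most `beta` times one block difference, so the maximum over blocks cannot grow.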

###### Proposition 28 (Discounted MTD affine perturbation bound).

Let

$$U^{\star}:=I_{\mathrm{M}}(\eta^{\star}),\qquad\beta_{\mathrm{M}}:=\gamma^{c/2}.\tag{194}$$

Here $\eta^{\star}$ denotes the unique fixed point of $\Pi_{\mathrm{M}}^{\Theta}T^{\pi}$. Define

$$B_{\mathrm{M}}^{\star}:=\max_{s,s'\in\mathcal{S}}\sup_{r\in[0,1]^{q}}\left\lVert\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U^{\star})(s)\right\rVert_{2}\tag{195}$$

and

$$B_{\mathrm{M}}:=2\beta_{\mathrm{M}}\lVert U^{\star}\rVert_{2,\infty}+B_{\mathrm{M}}^{\star}.\tag{196}$$

Then $B_{\mathrm{M}}^{\star}<\infty$, and for every $s\in\mathcal{S}$, every $(r,s')\in[0,1]^{q}\times\mathcal{S}$, and every $U\in I_{\mathrm{M}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta})$,

$$\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U)(s)\right\rVert_{2}\leq 2\beta_{\mathrm{M}}\lVert U\rVert_{2,\infty}+B_{\mathrm{M}}.\tag{197}$$

Consequently, for every $k\geq 0$,

$$\mathbb{E}\left[\left\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\right\rVert_{2}^{2}\,\mid\,U_{k},S_{k}\right]\leq 2B_{\mathrm{M}}^{2}+8\beta_{\mathrm{M}}^{2}\lVert U_{k}\rVert_{2,\infty}^{2}.\tag{198}$$

Moreover, if

$$F_{\mathrm{M}}(U,s):=U+P_{s}\bigl((\mathcal{O}_{\mathrm{M}}U)(s)-U(s)\bigr),\tag{199}$$

$$\widehat{H}_{\mathrm{M}}(U;s,(r,s')):=U+P_{s}\bigl(\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-U(s)\bigr),\tag{200}$$

and

$$\Delta_{k}^{\mathrm{M}}:=\widehat{H}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-F_{\mathrm{M}}(U_{k},S_{k}),\tag{201}$$

then

$$\mathbb{E}\left[\Delta_{k}^{\mathrm{M}}\,\mid\,\mathcal{G}_{k}\right]=0\tag{202}$$

and

$$\lVert\Delta_{k}^{\mathrm{M}}\rVert_{2,\infty}\leq 2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}+B_{\mathrm{M}}\qquad\text{a.s.}\tag{203}$$

In particular, the theorem constants may be taken as

$$C_{1}:=2B_{\mathrm{M}}^{2},\qquad C_{2}:=8\beta_{\mathrm{M}}^{2}.\tag{204}$$

###### Proof.

For fixed $s,s'\in\mathcal{S}$ and $r\in[0,1]^{q}$, the proof of Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) together with Proposition [24](https://arxiv.org/html/2605.06866#Thmtheorem24) gives

$$\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\rVert_{2}\leq\beta_{\mathrm{M}}\lVert U-U'\rVert_{2,\infty}\tag{205}$$

and

$$\lVert(\mathcal{O}_{\mathrm{M}}U)(s)-(\mathcal{O}_{\mathrm{M}}U')(s)\rVert_{2}\leq\beta_{\mathrm{M}}\lVert U-U'\rVert_{2,\infty}.\tag{206}$$

Because $[0,1]^{q}$ is compact, the state set is finite, and the map $r\mapsto\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))$ is continuous for each fixed $(s,s')$, the constant $B_{\mathrm{M}}^{\star}$ is finite. Now add and subtract the same quantities at $U^{\star}$:

$$\begin{aligned}
\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U)(s)\right\rVert_{2}
&\leq\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))\right\rVert_{2}\\
&\quad+\left\lVert\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U^{\star})(s)\right\rVert_{2}+\left\lVert(\mathcal{O}_{\mathrm{M}}U^{\star})(s)-(\mathcal{O}_{\mathrm{M}}U)(s)\right\rVert_{2}\\
&\leq 2\beta_{\mathrm{M}}\lVert U-U^{\star}\rVert_{2,\infty}+B_{\mathrm{M}}^{\star}\\
&\leq 2\beta_{\mathrm{M}}\lVert U\rVert_{2,\infty}+\Bigl(2\beta_{\mathrm{M}}\lVert U^{\star}\rVert_{2,\infty}+B_{\mathrm{M}}^{\star}\Bigr),
\end{aligned}\tag{207}$$

which is exactly the target-level affine bound. Squaring and using $(a+b)^{2}\leq 2a^{2}+2b^{2}$ with

$$a:=2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty},\qquad b:=B_{\mathrm{M}},\tag{208}$$

gives the conditional second-moment bound of ([198](https://arxiv.org/html/2605.06866#A3.E198)). Next, by definition of $\widehat{H}_{\mathrm{M}}$ and $F_{\mathrm{M}}$, only the $S_{k}$-th block can be nonzero, so

$$\lVert\Delta_{k}^{\mathrm{M}}\rVert_{2,\infty}=\left\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\right\rVert_{2},\tag{209}$$

and the pathwise affine bound follows immediately. Finally, the conditional mean identity is exactly Proposition [26](https://arxiv.org/html/2605.06866#Thmtheorem26) written in centered form. ∎
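Spelled out, the squaring step in the proof reads as follows. This only restates the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$ with the choices in (208), applied pathwise before any expectation is taken:

$$\left\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\right\rVert_{2}^{2}\leq\bigl(2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}+B_{\mathrm{M}}\bigr)^{2}\leq 2\bigl(2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}\bigr)^{2}+2B_{\mathrm{M}}^{2}=8\beta_{\mathrm{M}}^{2}\lVert U_{k}\rVert_{2,\infty}^{2}+2B_{\mathrm{M}}^{2}.$$

The right-hand side depends only on $U_{k}$, so taking the conditional expectation given $(U_{k},S_{k})$ leaves it unchanged, which is where the constants $C_{1}=2B_{\mathrm{M}}^{2}$ and $C_{2}=8\beta_{\mathrm{M}}^{2}$ come from.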

### C.3 Completion of the proof of Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)

###### Proof of Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3).

We verify the hypotheses of Appendix [A](https://arxiv.org/html/2605.06866#A1) for discounted MTD.

First, Proposition [24](https://arxiv.org/html/2605.06866#Thmtheorem24) gives the block-supremum contraction with modulus

$$\beta_{\mathrm{M}}=\gamma^{c/2}.\tag{210}$$

Second, Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) gives the samplewise Lipschitz property of the full-sample map with

$$A_{\mathrm{M},1}=1.\tag{211}$$

Moreover, the proof of Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) gives the stronger target-level estimate

$$\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\rVert_{2}\leq\beta_{\mathrm{M}}\lVert U-U'\rVert_{2,\infty}\leq\lVert U-U'\rVert_{2,\infty}.\tag{212}$$

Third, Proposition [28](https://arxiv.org/html/2605.06866#Thmtheorem28) gives the centered pathwise affine perturbation bound

$$\mathbb{E}\left[\Delta_{k}^{\mathrm{M}}\mid\mathcal{G}_{k}\right]=0,\qquad\lVert\Delta_{k}^{\mathrm{M}}\rVert_{2,\infty}\leq 2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}+B_{\mathrm{M}}\qquad\text{a.s.}\tag{213}$$

Hence the Markovian perturbation condition in Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16) holds with

$$A_{\mathrm{M},2}=2\beta_{\mathrm{M}},\qquad B_{\mathrm{M},2}=B_{\mathrm{M}},\qquad A_{\mathrm{M}}:=A_{\mathrm{M},1}+A_{\mathrm{M},2}+1=2\beta_{\mathrm{M}}+2.\tag{214}$$
For the i.i.d. case, Proposition [28](https://arxiv.org/html/2605.06866#Thmtheorem28) also yields the conditional second-moment estimate

$$\mathbb{E}\bigl[\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\rVert_{2}^{2}\,\mid\,U_{k},S_{k}\bigr]\leq C_{1}+C_{2}\lVert U_{k}\rVert_{2,\infty}^{2},\tag{215}$$

with

$$C_{1}=2B_{\mathrm{M}}^{2},\qquad C_{2}=8\beta_{\mathrm{M}}^{2}.\tag{216}$$

Therefore Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) applies with

$$A_{\mathrm{M}}^{\mathrm{iid}}:=C_{1}+2C_{2}(1+\Upsilon_{\mathrm{M}}).\tag{217}$$

Applying Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) with

$$\beta=\beta_{\mathrm{M}},\qquad A^{\mathrm{iid}}=A_{\mathrm{M}}^{\mathrm{iid}},\qquad\vartheta=\vartheta_{\mathrm{M},\rho},\qquad p^{\star}=\max\{2,\lceil\log\lvert\mathcal{S}\rvert\rceil\},\tag{218}$$

yields Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(i) with

$$a_{\mathrm{M},1}^{\mathrm{iid}}:=\frac{1+\vartheta_{\mathrm{M},\rho}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\rho}},\qquad a_{\mathrm{M},2}^{\mathrm{iid}}:=1-\bar{\beta}_{\mathrm{M},\rho}\sqrt{\frac{1+\vartheta_{\mathrm{M},\rho}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\rho}}},\tag{219}$$

where

$$\bar{\beta}_{\mathrm{M},\rho}:=1-\rho_{\min}(1-\beta_{\mathrm{M}}),\tag{220}$$

and

$$a_{\mathrm{M},3}^{\mathrm{iid}}:=\frac{4(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}(A_{\mathrm{M}}^{\mathrm{iid}}+2)(1+\vartheta_{\mathrm{M},\rho})}{\vartheta_{\mathrm{M},\rho}},\qquad a_{\mathrm{M},4}^{\mathrm{iid}}:=\frac{2(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}A_{\mathrm{M}}^{\mathrm{iid}}(1+\vartheta_{\mathrm{M},\rho})}{\vartheta_{\mathrm{M},\rho}}.\tag{221}$$
Applying Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16) with

$$A_{1}=A_{\mathrm{M},1}=1,\qquad A_{2}=A_{\mathrm{M},2}=2\beta_{\mathrm{M}},\qquad B_{2}=B_{\mathrm{M},2}=B_{\mathrm{M}},\tag{222}$$

yields Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(ii) with

$$\tilde{\phi}_{\mathrm{M},1}:=\frac{1+\vartheta_{\mathrm{M},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\mu}},\qquad\phi_{\mathrm{M},2}:=1-\bar{\beta}_{\mathrm{M},\mu}\sqrt{\frac{1+\vartheta_{\mathrm{M},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\mu}}},\tag{223}$$

where

$$\bar{\beta}_{\mathrm{M},\mu}:=1-\mu_{\min}(1-\beta_{\mathrm{M}}),\tag{224}$$

and

$$\tilde{\phi}_{\mathrm{M},3}:=\frac{114(p^{\star}-1)\bigl(1+\vartheta_{\mathrm{M},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}\bigr)}{\vartheta_{\mathrm{M},\mu}},\qquad\phi_{\mathrm{M},3}:=B_{\mathrm{M}}^{2}\tilde{\phi}_{\mathrm{M},3}\tag{225}$$

together with

$$\phi_{\mathrm{M},1}:=8\tilde{\phi}_{\mathrm{M},1},\qquad\phi_{\mathrm{M},4}:=\frac{2\tilde{\phi}_{\mathrm{M},1}B_{\mathrm{M}}^{2}}{A_{\mathrm{M}}^{2}},\tag{226}$$

and the step size condition

$$\sum_{i=k-t_{k}}^{k-1}\alpha_{i}\leq\min\left\{\frac{\phi_{\mathrm{M},2}}{\phi_{\mathrm{M},3}A_{\mathrm{M}}^{2}},\frac{1}{4A_{\mathrm{M}}}\right\}\qquad\text{for all }k\geq K.\tag{227}$$

The constant, linearly-diminishing, and polynomially-diminishing step size bounds in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3) now follow from the corresponding specializations of Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) and Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16). ∎
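To get a feel for the magnitudes involved, the i.i.d. constants in (210) and (218) to (220) can be evaluated for concrete inputs. The helper below is a hypothetical convenience, not part of the paper: `vartheta` (for $\vartheta_{\mathrm{M},\rho}$) and `rho_min` are treated as user-supplied inputs because their definitions lie outside this section, and the logarithm in $p^{\star}$ is assumed to be the natural logarithm.

```python
import math

def mtd_iid_constants(n_states, gamma, c, rho_min, vartheta):
    """Evaluate beta_M, p_star, beta_bar, a1, a2 from (210) and (218)-(220).

    Assumptions: natural log in p_star; vartheta and rho_min supplied by the caller.
    """
    beta = gamma ** (c / 2.0)                                    # (210)
    p_star = max(2, math.ceil(math.log(n_states)))               # (218)
    beta_bar = 1.0 - rho_min * (1.0 - beta)                      # (220)
    a1 = (1.0 + vartheta * n_states ** (2.0 / p_star)) / (1.0 + vartheta)  # (219), first constant
    a2 = 1.0 - beta_bar * math.sqrt(a1)                          # (219), second constant
    return {"beta": beta, "p_star": p_star, "beta_bar": beta_bar, "a1": a1, "a2": a2}

# Illustrative inputs only; none of these values come from the paper.
consts = mtd_iid_constants(n_states=10, gamma=0.9, c=1.0, rho_min=0.1, vartheta=0.05)
for name, value in consts.items():
    print(name, value)
```

Since $a_{\mathrm{M},2}^{\mathrm{iid}}$ must be positive for the rate to be meaningful, a quick evaluation like this can flag input combinations for which the chosen `vartheta` is too large.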

### C.4 Proof of Corollary [4](https://arxiv.org/html/2605.06866#Thmtheorem4)

###### Proof.

We use the linearly-diminishing step size bounds in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3).

For part (i), the linearly-diminishing i.i.d. bound in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(i) gives

$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq a_{\mathrm{M},1}^{\mathrm{iid}}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{a_{\mathrm{M},2}^{\mathrm{iid}}\alpha}+\frac{4e\alpha^{2}a_{\mathrm{M},4}^{\mathrm{iid}}}{a_{\mathrm{M},2}^{\mathrm{iid}}\alpha-1}\cdot\frac{1}{k+h}.\tag{228}$$

Since $\alpha>1/a_{\mathrm{M},2}^{\mathrm{iid}}$, the first term is also $O((k+h)^{-1})$. Therefore

$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]=O\left(\frac{1}{k+h}\right),\tag{229}$$

and hence $\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ is guaranteed for $k=O(\varepsilon^{-2})$.

For part (ii), the linearly-diminishing Markovian bound in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(ii) gives

$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq\bigl(\phi_{\mathrm{M},1}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}+\phi_{\mathrm{M},4}\bigr)\left(\frac{K+h}{k+h}\right)^{\phi_{\mathrm{M},2}\alpha}+\frac{8e\alpha^{2}\phi_{\mathrm{M},3}}{\phi_{\mathrm{M},2}\alpha-1}\cdot\frac{t_{k}}{k+h}.\tag{230}$$

Again the first term is $O((k+h)^{-1})$, while geometric mixing gives $t_{k}=t_{\alpha_{k}}=O(\log(k+h))$, so the second term is

$$O\left(\frac{\log(k+h)}{k+h}\right)=\widetilde{O}\left(\frac{1}{k+h}\right).\tag{231}$$

Thus

$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]=\widetilde{O}\left(\frac{1}{k+h}\right),\tag{232}$$

which implies $\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ for $k=\widetilde{O}(\varepsilon^{-2})$. ∎
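The passage from a mean-square rate to an iteration count can be sketched in one line. Here $C>0$ is a hypothetical constant standing in for the one hidden in the $O(\cdot)$ of part (i); it is not named in the source. By Jensen's inequality,

$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})\right]\leq\sqrt{\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]}\leq\sqrt{\frac{C}{k+h}}\leq\varepsilon\qquad\text{whenever}\qquad k+h\geq\frac{C}{\varepsilon^{2}}.$$

For part (ii) the same argument applies with $C$ replaced by $C\log(k+h)$, which changes the required $k$ only by a logarithmic factor and gives the $\widetilde{O}(\varepsilon^{-2})$ count.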

## Appendix D Undiscounted fixed-horizon CTD

This appendix records the undiscounted fixed-horizon CTD ingredients culminating in the proof of Theorem [6](https://arxiv.org/html/2605.06866#Thmtheorem6).

### D.1 Flattened weighted horizon-state space

To apply the framework of Appendix [A](https://arxiv.org/html/2605.06866#A1), we rewrite the weighted fixed-horizon recursion on a flattened product space indexed by horizon-state pairs. Let

$$\mathcal{S}_{H}:=\{(h,s):h\in\{1,\dots,H\},\ s\in\mathcal{S}\},\qquad\lvert\mathcal{S}_{H}\rvert=H\lvert\mathcal{S}\rvert.\tag{233}$$

Let

$$V_{H,\mathrm{C}}:=\prod_{(h,s)\in\mathcal{S}_{H}}\mathbb{R}^{d}\tag{234}$$

denote the corresponding flattened Euclidean product space. For a horizon-stacked iterate $U=(U^{h}(s))_{1\leq h\leq H,\ s\in\mathcal{S}}$, define the weighted flattening

$$\bar{U}(h,s):=\lambda^{h}U^{h}(s),\qquad(h,s)\in\mathcal{S}_{H}.\tag{235}$$

Then

$$\lVert U\rVert_{H,2,\infty}=\max_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2},\qquad\lVert U\rVert_{H,2,p}=\left(\sum_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2}^{p}\right)^{1/p}.\tag{236}$$

For a single horizon stack $U=(U^{1},\dots,U^{H})$, also define

$$\lVert U\rVert_{H,2}:=\max_{1\leq h\leq H}\lambda^{h}\lVert U^{h}\rVert_{2}.\tag{237}$$

Then

$$\lVert U\rVert_{H,2,\infty}=\max_{s\in\mathcal{S}}\lVert U(s)\rVert_{H,2}.\tag{238}$$

Thus the fixed-horizon weighted norm is an ordinary block-supremum norm on a finite product of Euclidean blocks, and Appendix [A](https://arxiv.org/html/2605.06866#A1) applies with $\lvert\mathcal{S}_{H}\rvert$ in place of $\lvert\mathcal{S}\rvert$. In particular, throughout this appendix we set

$$p_{H}^{\star}:=\max\{2,\lceil\log\lvert\mathcal{S}_{H}\rvert\rceil\}.\tag{239}$$
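The weighted flattening and the two equivalent expressions (236) and (238) for the block-supremum norm can be checked on a random array. The sketch below assumes a layout that is not prescribed by the paper: `U` is stored with shape `(H, n_states, d)`, so that `U[h-1, s]` holds $U^{h}(s)$, and the logarithm in $p_{H}^{\star}$ is taken to be natural.

```python
import math
import numpy as np

def weighted_flatten(U, lam):
    """Return U_bar with U_bar[h-1, s] = lam**h * U[h-1, s], as in (235)."""
    H = U.shape[0]
    weights = lam ** np.arange(1, H + 1)          # lam^1, ..., lam^H
    return weights[:, None, None] * U

def norm_H2inf(U, lam):
    """Block-supremum norm from (236): max over (h, s) of ||U_bar(h, s)||_2."""
    return float(np.linalg.norm(weighted_flatten(U, lam), axis=-1).max())

def norm_H2p(U, lam, p):
    """Block p-norm from (236)."""
    block_norms = np.linalg.norm(weighted_flatten(U, lam), axis=-1).ravel()
    return float((block_norms ** p).sum() ** (1.0 / p))

def norm_H2_stack(stack, lam):
    """Weighted norm (237) of a single horizon stack of shape (H, d)."""
    H = stack.shape[0]
    return float((lam ** np.arange(1, H + 1) * np.linalg.norm(stack, axis=-1)).max())

rng = np.random.default_rng(2)
H, n_states, d, lam = 4, 3, 5, 0.8                # illustrative sizes only
U = rng.standard_normal((H, n_states, d))

sup_flat = norm_H2inf(U, lam)                                                  # (236)
sup_by_state = max(norm_H2_stack(U[:, s, :], lam) for s in range(n_states))   # (238)
p_star_H = max(2, math.ceil(math.log(H * n_states)))                          # (239)
p_norm = norm_H2p(U, lam, p_star_H)
print(sup_flat, sup_by_state, p_star_H, p_norm)
```

The block $p$-norm always lies between the block-supremum norm and $\lvert\mathcal{S}_{H}\rvert^{1/p}$ times it, which is the standard norm comparison that makes a choice of $p$ on the order of $\log\lvert\mathcal{S}_{H}\rvert$ natural.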

### D.2 Fixed-horizon CTD residual map

For each state $s\in\mathcal{S}$, let

$$U(s):=\bigl(U^{1}(s),\dots,U^{H}(s)\bigr)\tag{240}$$

denote the horizon stack at state $s$, and let $P_{s}^{H}$ denote the coordinate projector onto that stack. Define

$$F_{H,\mathrm{C}}(U;s,(r,s')):=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))-U(s)\bigr),\tag{241}$$

where $\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))$ is the full horizon-stacked CTD target from Section [4](https://arxiv.org/html/2605.06866#S4). Then the online fixed-horizon CTD recursion is

$$U_{k+1}=U_{k}+\alpha_{k}F_{H,\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1})).\tag{242}$$

Define also the CTD embedded averaged operator by

$$\mathcal{O}_{H,\mathrm{C}}:=I_{H,\mathrm{C}}\circ\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}\circ I_{H,\mathrm{C}}^{-1}.\tag{243}$$

###### Proposition 29 (Fixed-horizon CTD weighted contraction).

For all $U,U'\in I_{H,\mathrm{C}}(\mathcal{F}_{H,\mathrm{C},\Theta}^{\mathcal{S}})$,

$$\lVert\mathcal{O}_{H,\mathrm{C}}U-\mathcal{O}_{H,\mathrm{C}}U'\rVert_{H,2,\infty}\leq\lambda\lVert U-U'\rVert_{H,2,\infty}.\tag{244}$$

###### Proof.

Let $\eta:=I_{H,\mathrm{C}}^{-1}(U)$ and $\eta':=I_{H,\mathrm{C}}^{-1}(U')$. Fix $(h,s)\in\mathcal{S}_{H}$ with $h\geq 1$. By the statewise CTD isometry and nonexpansiveness of the categorical projection in the Cramér metric,

$$\begin{aligned}
\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{C}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{C}}U')^{h}(s)\rVert_{2}
&=\lambda^{h}\ell_{\mathrm{C}}\left((\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}\eta)^{h}(s),(\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}\eta')^{h}(s)\right)\\
&\leq\lambda^{h}\ell_{\mathrm{C}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta')^{h}(s)\right).
\end{aligned}\tag{245}$$

Now

$$(T_{H}^{\pi}\eta)^{h}(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,(f_{R(s,a),1})_{\#}\eta^{h-1}(s'),\tag{246}$$

and the same formula holds for $\eta'$. Since translation preserves the Cramér metric and the Cramér metric is convex under mixtures,

$$\begin{aligned}
\ell_{\mathrm{C}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta')^{h}(s)\right)
&\leq\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,\ell_{\mathrm{C}}\left(\eta^{h-1}(s'),{\eta'}^{h-1}(s')\right)\\
&\leq\max_{x\in\mathcal{S}}\ell_{\mathrm{C}}\left(\eta^{h-1}(x),{\eta'}^{h-1}(x)\right).
\end{aligned}\tag{247}$$

Therefore

$$\begin{aligned}
\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{C}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{C}}U')^{h}(s)\rVert_{2}
&\leq\lambda\max_{x\in\mathcal{S}}\lambda^{h-1}\ell_{\mathrm{C}}\left(\eta^{h-1}(x),{\eta'}^{h-1}(x)\right)\\
&\leq\lambda\,\ell_{H,\mathrm{C},\infty}(\eta,\eta')=\lambda\lVert U-U'\rVert_{H,2,\infty}.
\end{aligned}\tag{248}$$

Taking the maximum over $(h,s)\in\mathcal{S}_{H}$ proves the claim. ∎

### D\.3One\-step bounds

###### Proposition 30\(Fixed\-horizon CTD samplewise Lipschitz continuity\)\.

For everys∈𝒮s\\in\\mathcal\{S\},\(r,s′\)∈\[0,1\]×𝒮\(r,s^\{\\prime\}\)\\in\[0,1\]\\times\\mathcal\{S\}andU,U′∈IH,C​\(ℱH,C,Θ𝒮\)U,U^\{\\prime\}\\in I\_\{H,\\mathrm\{C\}\}\(\\mathcal\{F\}\_\{H,\\mathrm\{C\},\\Theta\}^\{\\mathcal\{S\}\}\),

∥FH,C​\(U;s,\(r,s′\)\)−FH,C​\(U′;s,\(r,s′\)\)∥H,2,∞≤2​∥U−U′∥H,2,∞\.\\lVert F\_\{H,\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-F\_\{H,\\mathrm\{C\}\}\(U^\{\\prime\};s,\(r,s^\{\\prime\}\)\)\\rVert\_\{H,2,\\infty\}\\leq 2\\lVert U\-U^\{\\prime\}\\rVert\_\{H,2,\\infty\}\.\(249\)

###### Proof\.

For each horizonhh, the local targetT^H,Ch​\(U;s,\(r,s′\)\)\\widehat\{T\}\_\{H,\\mathrm\{C\}\}^\{h\}\(U;s,\(r,s^\{\\prime\}\)\)depends only on the\(h−1\)\(h\-1\)st horizon block ats′s^\{\\prime\}\. By the discounted CTD samplewise Lipschitz bound, that local target is11\-Lipschitz in the Euclidean block norm\. Multiplying byλh\\lambda^\{h\}and using the fact thatλ∈\(0,1\)\\lambda\\in\(0,1\)gives

λh​∥T^H,Ch​\(U;s,\(r,s′\)\)−T^H,Ch​\(U′;s,\(r,s′\)\)∥2≤λh−1​∥Uh−1​\(s′\)−U′⁣h−1​\(s′\)∥2≤∥U−U′∥H,2,∞\.\\lambda^\{h\}\\lVert\\widehat\{T\}\_\{H,\\mathrm\{C\}\}^\{h\}\(U;s,\(r,s^\{\\prime\}\)\)\-\\widehat\{T\}\_\{H,\\mathrm\{C\}\}^\{h\}\(U^\{\\prime\};s,\(r,s^\{\\prime\}\)\)\\rVert\_\{2\}\\leq\\lambda^\{h\-1\}\\lVert U^\{h\-1\}\(s^\{\\prime\}\)\-U^\{\\prime\\,h\-1\}\(s^\{\\prime\}\)\\rVert\_\{2\}\\leq\\lVert U\-U^\{\\prime\}\\rVert\_\{H,2,\\infty\}\.

\(250\)Taking the maximum overhhyields

∥T^H,C​\(U;s,\(r,s′\)\)−T^H,C​\(U′;s,\(r,s′\)\)∥H,2≤∥U−U′∥H,2,∞\.\\lVert\\widehat\{T\}\_\{H,\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-\\widehat\{T\}\_\{H,\\mathrm\{C\}\}\(U^\{\\prime\};s,\(r,s^\{\\prime\}\)\)\\rVert\_\{H,2\}\\leq\\lVert U\-U^\{\\prime\}\\rVert\_\{H,2,\\infty\}\.\(251\)SinceFH,CF\_\{H,\\mathrm\{C\}\}is the projected residual

FH,C​\(U;s,\(r,s′\)\)=PsH​\(T^H,C​\(U;s,\(r,s′\)\)−U​\(s\)\),F\_\{H,\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)=P\_\{s\}^\{H\}\\bigl\(\\widehat\{T\}\_\{H,\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-U\(s\)\\bigr\),\(252\)the differenceFH,C​\(U\)−FH,C​\(U′\)F\_\{H,\\mathrm\{C\}\}\(U\)\-F\_\{H,\\mathrm\{C\}\}\(U^\{\\prime\}\)splits into a projected target difference and a projected current\-stack difference\. Each is bounded by∥U−U′∥H,2,∞\\lVert U\-U^\{\\prime\}\\rVert\_\{H,2,\\infty\}, yielding the factor22through triangle inequality\. ∎
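Written out, the final triangle-inequality step is the following chain, abbreviating $\widehat{T}(U):=\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))$ and $F(U):=F_{H,\mathrm{C}}(U;s,(r,s'))$ (shorthand introduced here only for readability). Because $F(U)-F(U')$ is supported on the stack at state $s$, its block-supremum norm equals the $\lVert\cdot\rVert_{H,2}$ norm of that stack:

$$\lVert F(U)-F(U')\rVert_{H,2,\infty}=\bigl\lVert\bigl(\widehat{T}(U)-\widehat{T}(U')\bigr)-\bigl(U(s)-U'(s)\bigr)\bigr\rVert_{H,2}\leq\lVert\widehat{T}(U)-\widehat{T}(U')\rVert_{H,2}+\lVert U(s)-U'(s)\rVert_{H,2}\leq 2\lVert U-U'\rVert_{H,2,\infty}.$$

The first summand is controlled by (251) and the second by the identity (238).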

###### Proposition 31\(Fixed\-horizon CTD pathwise boundedness\)\.

For every $s\in\mathcal{S}$, $(r,s^{\prime})\in[0,1]\times\mathcal{S}$, and $U\in I_{H,\mathrm{C}}(\mathcal{F}_{H,\mathrm{C},\Theta}^{\mathcal{S}})$,

$$\lVert F_{H,\mathrm{C}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}, \tag{253}$$

where

$$B_{H,\mathrm{C}}:=\max_{1\leq h\leq H,\ s\in\mathcal{S}}\lambda^{h}\sqrt{\theta_{h,d}(s)-\theta_{h,1}(s)}. \tag{254}$$

###### Proof\.

Fix $h$ and $s$. Both the local target $\widehat{T}_{H,\mathrm{C}}^{h}(U;s,(r,s^{\prime}))$ and the current iterate $U^{h}(s)$ are embedded categorical laws supported on $\Theta_{h}(s)$. By the statewise CTD support radius bound, each has Euclidean norm at most $\sqrt{\theta_{h,d}(s)-\theta_{h,1}(s)}$. After weighting by $\lambda^{h}$, both are bounded by $B_{H,\mathrm{C}}$. The triangle inequality then yields the claim. ∎
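As a small illustration of definition (254), the following sketch computes $B_{H,\mathrm{C}}$ from a table of supports. The function name `pathwise_bound_constant` and the nested-list layout `supports[h-1][s]` (sorted atom locations $\theta_{h,1}(s)\leq\dots\leq\theta_{h,d}(s)$) are assumptions made for this example and are not taken from the paper.

```python
import math


def pathwise_bound_constant(supports, lam):
    """Compute B = max over (h, s) of lam**h * sqrt(theta_{h,d}(s) - theta_{h,1}(s)).

    supports[h-1][s] is assumed to hold the sorted atom locations of the
    categorical support for horizon h (1-based) at state s.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    best = 0.0
    for h, per_state in enumerate(supports, start=1):
        for atoms in per_state:
            width = atoms[-1] - atoms[0]  # theta_{h,d}(s) - theta_{h,1}(s)
            best = max(best, lam ** h * math.sqrt(width))
    return best


# Two horizons, two states; the horizon-2 supports are twice as wide.
supports = [
    [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]],
    [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]],
]
B = pathwise_bound_constant(supports, lam=0.9)
print(B)
```

In this toy configuration the maximum is attained at horizon $2$, where the wider support outweighs the extra factor of $\lambda$.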

### D\.4Phasewise averaged maps

Recall the definition of phase distributions from Section [4](https://arxiv.org/html/2605.06866#S4):

$$\rho_{t}(s):=\Pr(S_{t}=s),\qquad 0\leq t\leq H-1,\ s\in\mathcal{S}. \tag{255}$$

For each phase $t\in\{0,\dots,H-1\}$, define

$$\Gamma_{t,\mathrm{C}}(U):=\sum_{s\in\mathcal{S}}\rho_{t}(s)\,P_{s}^{H}\bigl((\mathcal{O}_{H,\mathrm{C}}U)(s)-U(s)\bigr),\qquad G_{t,\mathrm{C}}(U):=U+\Gamma_{t,\mathrm{C}}(U). \tag{256}$$
###### Proposition 32\(Phasewise averaged contraction\)\.

For each phase $t\in\{0,\dots,H-1\}$,

$$\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U^{\prime})\rVert_{H,2,\infty}\leq\bar{\beta}_{t}^{\mathrm{fh}}\lVert U-U^{\prime}\rVert_{H,2,\infty}, \tag{257}$$

where

$$\bar{\beta}_{t}^{\mathrm{fh}}:=1-\rho_{t,\min}(1-\lambda),\qquad\rho_{t,\min}:=\min_{s\in\mathcal{S}}\rho_{t}(s). \tag{258}$$

Hence, with

$$\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}:=1-\rho_{\min}(1-\lambda),\qquad\rho_{\min}:=\min_{0\leq t\leq H-1,\ s\in\mathcal{S}}\rho_{t}(s), \tag{259}$$

we have

$$\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U^{\prime})\rVert_{H,2,\infty}\leq\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}\lVert U-U^{\prime}\rVert_{H,2,\infty}\qquad\text{for all }t. \tag{260}$$

###### Proof\.

At phase $t$, an update at state $s$ acts as the identity on every block $x\neq s$ and as the $\lambda$-contractive projected Bellman map on block $s$. More precisely, for each state block $s$,

$$\bigl(G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U^{\prime})\bigr)(s)=(1-\rho_{t}(s))(U(s)-U^{\prime}(s))\,+\,\rho_{t}(s)\bigl((\mathcal{O}_{H,\mathrm{C}}U)(s)-(\mathcal{O}_{H,\mathrm{C}}U^{\prime})(s)\bigr). \tag{261}$$

By Proposition [29](https://arxiv.org/html/2605.06866#Thmtheorem29),

$$\begin{aligned}\left\lVert\bigl(G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U^{\prime})\bigr)(s)\right\rVert_{H,2}&\leq\bigl((1-\rho_{t}(s))+\rho_{t}(s)\lambda\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}\\&=\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}.\end{aligned} \tag{262}$$

Taking the maximum over $s$ on both sides gives

$$\begin{aligned}\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U^{\prime})\rVert_{H,2,\infty}&\leq\max_{s\in\mathcal{S}}\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}\\&=\bigl(1-\rho_{t,\min}(1-\lambda)\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty},\end{aligned} \tag{263}$$

since the coefficient $1-\rho_{t}(s)(1-\lambda)$ is decreasing in $\rho_{t}(s)$. This proves the first displayed bound with coefficient $\bar{\beta}_{t}^{\mathrm{fh}}$. For the uniform estimate, note that $\rho_{t,\min}\geq\rho_{\min}$ for every phase $t$, hence

$$\bar{\beta}_{t}^{\mathrm{fh}}=1-\rho_{t,\min}(1-\lambda)\leq 1-\rho_{\min}(1-\lambda)=\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}. \tag{264}$$

Substituting this larger phase-independent coefficient into the preceding bound yields the claimed uniform contraction factor. ∎
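The contraction factors in (258) and (259) depend only on the phase marginals and $\lambda$, so they are easy to tabulate. The sketch below is a minimal illustration; the function name `phasewise_contraction_factors` and the layout `rho[t][s]` $=\Pr(S_t=s)$ are assumptions made for this example.

```python
def phasewise_contraction_factors(rho, lam):
    """Return ([beta_t for each phase t], uniform beta) as in (258)-(259).

    rho[t][s] is assumed to hold Pr(S_t = s); each row must sum to one.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    betas = []
    for row in rho:
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError("each phase distribution must sum to 1")
        rho_t_min = min(row)
        betas.append(1.0 - rho_t_min * (1.0 - lam))
    # The uniform factor uses rho_min over all phases, i.e. the largest beta_t.
    return betas, max(betas)


rho = [
    [0.5, 0.5],  # phase 0: uniform over two states
    [0.2, 0.8],  # phase 1: one state is visited rarely
]
betas, beta_uniform = phasewise_contraction_factors(rho, lam=0.9)
print(betas, beta_uniform)
```

The rarely visited state in phase 1 drives the uniform factor closer to one, which is the sense in which $\rho_{\min}$ controls the rate.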

### D\.5Auxiliary phasewise bounds

For episode $m$, write

$$U_{m,t}:=U_{mH+t},\qquad t=0,\dots,H. \tag{265}$$

Also define the phasewise filtration

$$\mathcal{F}_{m,t}:=\sigma\bigl(U_{m,0},(S_{u}^{m},R_{u}^{m},S_{u+1}^{m})_{0\leq u<t}\bigr),\qquad 0\leq t\leq H, \tag{266}$$

where the superscript $m$ indicates transitions within episode $m$. Then $U_{m,t}$ is $\mathcal{F}_{m,t}$-measurable for every $t$.

###### Lemma 33\(Within\-episode deviation\)\.

For all $m$ and all $t\in\{0,\dots,H\}$,

$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}\sum_{u=0}^{t-1}\alpha_{mH+u}. \tag{267}$$

###### Proof\.

From the recursion,

$$U_{m,t}-U_{m,0}=\sum_{u=0}^{t-1}\alpha_{mH+u}F_{H,\mathrm{C}}\bigl(U_{m,u};S_{u}^{m},(R_{u}^{m},S_{u+1}^{m})\bigr). \tag{268}$$

Taking $\lVert\cdot\rVert_{H,2,\infty}$ and using Proposition [31](https://arxiv.org/html/2605.06866#Thmtheorem31) term by term gives

$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq\sum_{u=0}^{t-1}\alpha_{mH+u}\,\lVert F_{H,\mathrm{C}}(U_{m,u};S_{u}^{m},(R_{u}^{m},S_{u+1}^{m}))\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}\sum_{u=0}^{t-1}\alpha_{mH+u}. \tag{269}$$

∎

###### Lemma 34\(Frozen residual decomposition\)\.

For each phase $t\in\{0,\dots,H-1\}$, define

$$\zeta_{m,t}^{\mathrm{C}}:=F_{H,\mathrm{C}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-\Gamma_{t,\mathrm{C}}(U_{m,0}) \tag{270}$$

and

$$\xi_{m,t}^{\mathrm{C}}:=\Bigl(F_{H,\mathrm{C}}(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-F_{H,\mathrm{C}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\Bigr)-\Bigl(\Gamma_{t,\mathrm{C}}(U_{m,t})-\Gamma_{t,\mathrm{C}}(U_{m,0})\Bigr). \tag{271}$$

Then

$$\begin{gathered}F_{H,\mathrm{C}}(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-\Gamma_{t,\mathrm{C}}(U_{m,t})=\zeta_{m,t}^{\mathrm{C}}+\xi_{m,t}^{\mathrm{C}},\\ \mathbb{E}\left[\zeta_{m,t}^{\mathrm{C}}\,\mid\,U_{m,0}\right]=0,\qquad\lVert\zeta_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty}\leq 4B_{H,\mathrm{C}},\end{gathered} \tag{272}$$

and

$$\lVert\xi_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq 8B_{H,\mathrm{C}}\sum_{u=0}^{t-1}\alpha_{mH+u}\leq 8B_{H,\mathrm{C}}\sum_{u=0}^{H-1}\alpha_{mH+u}. \tag{273}$$

###### Proof\.

The decomposition $F_{H,\mathrm{C}}(U_{m,t};\cdot)-\Gamma_{t,\mathrm{C}}(U_{m,t})=\zeta_{m,t}^{\mathrm{C}}+\xi_{m,t}^{\mathrm{C}}$ follows directly from the two definitions. Since episode $m$ is independent of $U_{m,0}$ and its phase-$t$ sample has marginal law $\rho_{t}$,

$$\mathbb{E}\left[\zeta_{m,t}^{\mathrm{C}}\,\mid\,U_{m,0}\right]=0. \tag{274}$$

Moreover, by the triangle inequality,

$$\lVert\zeta_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty}\leq\lVert F_{H,\mathrm{C}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\rVert_{H,2,\infty}+\lVert\Gamma_{t,\mathrm{C}}(U_{m,0})\rVert_{H,2,\infty}. \tag{275}$$

By Proposition [31](https://arxiv.org/html/2605.06866#Thmtheorem31),

$$\lVert F_{H,\mathrm{C}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}, \tag{276}$$

and, since $\Gamma_{t,\mathrm{C}}(U_{m,0})$ is a convex combination of such residuals,

$$\lVert\Gamma_{t,\mathrm{C}}(U_{m,0})\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}. \tag{277}$$

This gives $\lVert\zeta_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty}\leq 4B_{H,\mathrm{C}}$.

Proposition [30](https://arxiv.org/html/2605.06866#Thmtheorem30) also gives

$$\left\lVert F_{H,\mathrm{C}}(U;s,(r,s^{\prime}))-F_{H,\mathrm{C}}(U^{\prime};s,(r,s^{\prime}))\right\rVert_{H,2,\infty}\leq 2\lVert U-U^{\prime}\rVert_{H,2,\infty} \tag{278}$$

for every admissible $(s,r,s^{\prime})$. Averaging the same estimate over the phase-$t$ law yields

$$\left\lVert\Gamma_{t,\mathrm{C}}(U)-\Gamma_{t,\mathrm{C}}(U^{\prime})\right\rVert_{H,2,\infty}\leq 2\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{279}$$

Applying (278) and (279) with $U=U_{m,t}$ and $U^{\prime}=U_{m,0}$, we obtain

$$\lVert\xi_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}. \tag{280}$$

Applying Lemma [33](https://arxiv.org/html/2605.06866#Thmtheorem33) yields the remaining bound. ∎

### D\.6Episode\-level finite\-iteration drift

Since $\Gamma_{t,\mathrm{C}}$ is defined by averaging with respect to the phase-$t$ marginal $\rho_{t}$, the drift argument is formulated for the episode-boundary sequence $(U_{mH})_{m\geq 0}$, conditioning on $U_{m,0}$.

Let

$$W_{m,0}:=U_{m,0}-U^{\star}_{H,\mathrm{C}},\qquad U^{\star}_{H,\mathrm{C}}:=I_{H,\mathrm{C}}(\eta_{H,\mathrm{C}}^{\star}), \tag{281}$$

where $\eta_{H,\mathrm{C}}^{\star}$ denotes the unique fixed point of $\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}$, and define

$$\bar{\alpha}_{m}:=\sum_{u=0}^{H-1}\alpha_{mH+u},\qquad\bar{\alpha}_{m}^{(2)}:=\sum_{u=0}^{H-1}\alpha_{mH+u}^{2}. \tag{282}$$

Also write

$$\lvert\mathcal{S}_{H}\rvert:=H\lvert\mathcal{S}\rvert,\quad\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}:=\max_{0\leq t\leq H-1}\bar{\beta}_{t}^{\mathrm{fh}}=1-\rho_{\min}(1-\lambda),\quad\kappa_{H,\mathrm{C}}:=1-\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}=\rho_{\min}(1-\lambda)>0. \tag{283}$$

Fix any $\vartheta_{H,\mathrm{C}}>0$ and define the smoothed potential

$$M_{H,\mathrm{C}}(W):=\inf_{Z\in V_{H,\mathrm{C}}}\left\{\frac{1}{2}\lVert Z\rVert^{2}_{H,2,\infty}+\frac{1}{2\vartheta_{H,\mathrm{C}}}\lVert W-Z\rVert^{2}_{H,2,p^{\star}_{H}}\right\}. \tag{284}$$

By Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13) on the flattened weighted space,

$$(1+\vartheta_{H,\mathrm{C}})M_{H,\mathrm{C}}(W)\leq\frac{1}{2}\lVert W\rVert^{2}_{H,2,\infty}\leq\bigl(1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)M_{H,\mathrm{C}}(W). \tag{285}$$

Define

$$r^{\mathrm{fh}}_{H,\mathrm{C}}:=\frac{1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}}{1+\vartheta_{H,\mathrm{C}}},\qquad L^{\mathrm{fh}}_{H,\mathrm{C}}:=\frac{p^{\star}_{H}-1}{\vartheta_{H,\mathrm{C}}},\qquad\omega^{\mathrm{fh}}_{H,\mathrm{C}}:=1-\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}\sqrt{r^{\mathrm{fh}}_{H,\mathrm{C}}}. \tag{286}$$

Since $r^{\mathrm{fh}}_{H,\mathrm{C}}\to 1$ as $\vartheta_{H,\mathrm{C}}\downarrow 0$ and $\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}<1$, we have $\omega^{\mathrm{fh}}_{H,\mathrm{C}}>0$ for all sufficiently small $\vartheta_{H,\mathrm{C}}$. Define the explicit CTD constants

$$d^{\mathrm{fh}}_{H,\mathrm{C}}:=8L^{\mathrm{fh}}_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}(1+\vartheta_{H,\mathrm{C}}),\qquad\bar{\alpha}_{H,\mathrm{C}}:=\min\left\{\frac{1}{H},\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{d^{\mathrm{fh}}_{H,\mathrm{C}}}\right\}, \tag{287}$$

and

$$c^{\mathrm{fh}}_{H,\mathrm{C},1}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4},\qquad c^{\mathrm{fh}}_{H,\mathrm{C},2}:=HL^{\mathrm{fh}}_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\left(\frac{128}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}+72\right). \tag{288}$$

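To make the dependence among the constants in (283) and (286) through (288) concrete, here is a minimal sketch that evaluates them numerically. The function name `ctd_constants` and all argument names are hypothetical. In particular, `p_star` stands for $p^{\star}_{H}$ and `theta` for $\vartheta_{H,\mathrm{C}}$; both are treated as user-supplied inputs here because their choice is specified elsewhere in the paper.

```python
import math


def ctd_constants(H, n_states, lam, rho_min, B, p_star, theta):
    """Evaluate the constants of (283) and (286)-(288) for given inputs."""
    size = H * n_states                        # |S_H| = H |S|
    norm_ratio = size ** (2.0 / p_star)        # |S_H|^{2/p*}
    beta = 1.0 - rho_min * (1.0 - lam)         # uniform contraction factor
    r = (1.0 + theta * norm_ratio) / (1.0 + theta)
    L = (p_star - 1.0) / theta                 # smoothness constant
    omega = 1.0 - beta * math.sqrt(r)
    if omega <= 0.0:
        raise ValueError("omega <= 0: choose a smaller theta")
    d = 8.0 * L * norm_ratio * (1.0 + theta)
    alpha_bar = min(1.0 / H, omega / d)        # largest admissible alpha_0
    c1 = omega / 4.0
    c2 = H * L * norm_ratio * B ** 2 * (128.0 / omega + 72.0)
    return {"beta": beta, "r": r, "L": L, "omega": omega,
            "d": d, "alpha_bar": alpha_bar, "c1": c1, "c2": c2}


# Illustrative inputs only; none of these values come from the paper.
consts = ctd_constants(H=3, n_states=2, lam=0.9, rho_min=0.25,
                       B=1.0, p_star=4.0, theta=1e-3)
for name, value in consts.items():
    print(f"{name} = {value:.6g}")
```

With these toy inputs the step-size cap is dominated by $\omega^{\mathrm{fh}}_{H,\mathrm{C}}/d^{\mathrm{fh}}_{H,\mathrm{C}}$ rather than $1/H$, because a small $\vartheta_{H,\mathrm{C}}$ inflates the smoothness constant $L^{\mathrm{fh}}_{H,\mathrm{C}}$.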
###### Proposition 35\(Episodewise CTD potential drift\)\.

Assume that $\omega^{\mathrm{fh}}_{H,\mathrm{C}}>0$, that $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and that

$$\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{C}}. \tag{289}$$

Then, for every episode $m$,

$$\mathbb{E}\left[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}\right]\leq\bigl(1-c^{\mathrm{fh}}_{H,\mathrm{C},1}\bar{\alpha}_{m}\bigr)M_{H,\mathrm{C}}(W_{m,0})+c^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}^{(2)}_{m}. \tag{290}$$

###### Proof\.

For each phase $t\in\{0,\dots,H-1\}$ and each $\alpha\in[0,1]$, define

$$A_{t,\alpha,\mathrm{C}}(U):=U+\alpha\Gamma_{t,\mathrm{C}}(U)=(1-\alpha)U+\alpha G_{t,\mathrm{C}}(U). \tag{291}$$

By Proposition [32](https://arxiv.org/html/2605.06866#Thmtheorem32),

$$\left\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U^{\prime})\right\rVert_{H,2,\infty}\leq\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{292}$$

Hence, by convexity of the norm,

$$\lVert A_{t,\alpha,\mathrm{C}}(U)-A_{t,\alpha,\mathrm{C}}(U^{\prime})\rVert_{H,2,\infty}\leq\bigl((1-\alpha)+\alpha\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}=\bigl(1-\kappa_{H,\mathrm{C}}\alpha\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{293}$$

Define the deterministic phasewise averaged trajectory over episode $m$ by

$$\bar{U}_{m,0}:=U_{m,0},\qquad\bar{U}_{m,t+1}:=A_{t,\alpha_{mH+t},\mathrm{C}}(\bar{U}_{m,t}),\qquad 0\leq t\leq H-1. \tag{294}$$

Since $G_{t,\mathrm{C}}(U^{\star}_{H,\mathrm{C}})=U^{\star}_{H,\mathrm{C}}$, the contraction estimate above, the Moreau-envelope comparison, and the smoothness of the envelope give the deterministic relaxed-map estimate

$$M_{H,\mathrm{C}}\bigl(A_{t,\alpha,\mathrm{C}}(U)-U^{\star}_{H,\mathrm{C}}\bigr)\leq\bigl(1-2\omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha+d^{\mathrm{fh}}_{H,\mathrm{C}}\alpha^{2}\bigr)M_{H,\mathrm{C}}(U-U^{\star}_{H,\mathrm{C}}). \tag{295}$$

Since $\alpha_{mH+t}\leq\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{C}}\leq\omega^{\mathrm{fh}}_{H,\mathrm{C}}/d^{\mathrm{fh}}_{H,\mathrm{C}}$, this implies

$$M_{H,\mathrm{C}}\bigl(A_{t,\alpha_{mH+t},\mathrm{C}}(U)-U^{\star}_{H,\mathrm{C}}\bigr)\leq\bigl(1-\omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha_{mH+t}\bigr)M_{H,\mathrm{C}}\bigl(U-U^{\star}_{H,\mathrm{C}}\bigr). \tag{296}$$

Therefore

$$M_{H,\mathrm{C}}(\bar{U}_{m,H}-U^{\star}_{H,\mathrm{C}})\leq\prod_{u=0}^{H-1}\bigl(1-\omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha_{mH+u}\bigr)M_{H,\mathrm{C}}(W_{m,0})\leq\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\bar{\alpha}_{m}\right)M_{H,\mathrm{C}}(W_{m,0}). \tag{297}$$

Next, define the approximation error

$$D_{m,t}:=U_{m,t}-\bar{U}_{m,t},\qquad D_{m,0}=0. \tag{298}$$

Because each horizonwise CTD target is obtained by linear interpolation on a fixed support, the map $U\mapsto\widehat{T}_{H,\mathrm{C}}(U;s,(r,s^{\prime}))$ is affine for every admissible $(s,r,s^{\prime})$. Hence, $F_{H,\mathrm{C}}$, $\Gamma_{t,\mathrm{C}}$, and $A_{t,\alpha,\mathrm{C}}$ are all affine maps. Let $L_{m,t,\mathrm{C}}$ denote the linear part of $A_{t,\alpha_{mH+t},\mathrm{C}}$. Then

$$A_{t,\alpha_{mH+t},\mathrm{C}}(U)-A_{t,\alpha_{mH+t},\mathrm{C}}(U^{\prime})=L_{m,t,\mathrm{C}}(U-U^{\prime}), \tag{299}$$

and the contraction bound implies that, for every admissible $Z$,

$$\lVert L_{m,t,\mathrm{C}}(Z)\rVert_{H,2,\infty}\leq(1-\kappa_{H,\mathrm{C}}\alpha_{mH+t})\lVert Z\rVert_{H,2,\infty}. \tag{300}$$

Using the recursion for $U_{m,t+1}$, the definition of $\bar{U}_{m,t+1}$, and the decomposition from Lemma [34](https://arxiv.org/html/2605.06866#Thmtheorem34), we obtain

$$\begin{aligned}D_{m,t+1}&=U_{m,t+1}-\bar{U}_{m,t+1}\\&=U_{m,t}+\alpha_{mH+t}F_{H,\mathrm{C}}\bigl(U_{m,t};S^{m}_{t},(R^{m}_{t},S^{m}_{t+1})\bigr)-\bar{U}_{m,t}-\alpha_{mH+t}\Gamma_{t,\mathrm{C}}(\bar{U}_{m,t})\\&=U_{m,t}-\bar{U}_{m,t}+\alpha_{mH+t}\bigl(\Gamma_{t,\mathrm{C}}(U_{m,t})-\Gamma_{t,\mathrm{C}}(\bar{U}_{m,t})\bigr)\\&\quad+\alpha_{mH+t}\Bigl(F_{H,\mathrm{C}}\bigl(U_{m,t};S^{m}_{t},(R^{m}_{t},S^{m}_{t+1})\bigr)-\Gamma_{t,\mathrm{C}}(U_{m,t})\Bigr)\\&=A_{t,\alpha_{mH+t},\mathrm{C}}(U_{m,t})-A_{t,\alpha_{mH+t},\mathrm{C}}(\bar{U}_{m,t})+\alpha_{mH+t}\Bigl(F_{H,\mathrm{C}}\bigl(U_{m,t};S^{m}_{t},(R^{m}_{t},S^{m}_{t+1})\bigr)-\Gamma_{t,\mathrm{C}}(U_{m,t})\Bigr)\\&=L_{m,t,\mathrm{C}}(D_{m,t})+\alpha_{mH+t}\zeta^{\mathrm{C}}_{m,t}+\alpha_{mH+t}\xi^{\mathrm{C}}_{m,t}.\end{aligned} \tag{301}$$

Therefore

$$\mathbb{E}[D_{m,t+1}\mid U_{m,0}]=L_{m,t,\mathrm{C}}\bigl(\mathbb{E}[D_{m,t}\mid U_{m,0}]\bigr)+\alpha_{mH+t}\mathbb{E}[\xi^{\mathrm{C}}_{m,t}\mid U_{m,0}], \tag{302}$$

since $\mathbb{E}[\zeta^{\mathrm{C}}_{m,t}\mid U_{m,0}]=0$. Taking $\left\lVert\cdot\right\rVert_{H,2,\infty}$ gives

$$\begin{gathered}\nu_{m,t+1}:=\left\lVert\mathbb{E}[D_{m,t+1}\mid U_{m,0}]\right\rVert_{H,2,\infty}\\ \leq(1-\kappa_{H,\mathrm{C}}\alpha_{mH+t})\left\lVert\mathbb{E}[D_{m,t}\mid U_{m,0}]\right\rVert_{H,2,\infty}+8B_{H,\mathrm{C}}\alpha_{mH+t}\bar{\alpha}_{m}.\end{gathered} \tag{303}$$

Since $\nu_{m,0}=0$, induction over $t$ yields

$$\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert_{H,2,\infty}\leq 8B_{H,\mathrm{C}}\bar{\alpha}^{2}_{m}. \tag{304}$$

Also, iterating the recursion for $D_{m,t+1}$, using

$$\left\lVert L_{m,t,\mathrm{C}}(Z)\right\rVert_{H,2,\infty}\leq\left\lVert Z\right\rVert_{H,2,\infty}, \tag{305}$$

the identity $D_{m,0}=0$, and the bounds from Lemma [34](https://arxiv.org/html/2605.06866#Thmtheorem34), gives the pathwise upper bound

$$\left\lVert D_{m,H}\right\rVert_{H,2,\infty}\leq\sum_{t=0}^{H-1}\alpha_{mH+t}\bigl(\left\lVert\zeta^{\mathrm{C}}_{m,t}\right\rVert_{H,2,\infty}+\left\lVert\xi^{\mathrm{C}}_{m,t}\right\rVert_{H,2,\infty}\bigr)\leq 4B_{H,\mathrm{C}}\bar{\alpha}_{m}+8B_{H,\mathrm{C}}\bar{\alpha}^{2}_{m}. \tag{306}$$

Since the step sizes are nonincreasing and $\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{C}}\leq 1/H$, we have $\bar{\alpha}_{m}\leq 1$, and hence, pathwise,

$$\left\lVert D_{m,H}\right\rVert_{H,2,\infty}\leq 12B_{H,\mathrm{C}}\bar{\alpha}_{m}. \tag{307}$$

Therefore

$$\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,\infty}\mid U_{m,0}\bigr]\leq 144B^{2}_{H,\mathrm{C}}\bar{\alpha}^{2}_{m}. \tag{308}$$

Let

$$\bar{W}_{m,H}:=\bar{U}_{m,H}-U^{\star}_{H,\mathrm{C}},\qquad\delta_{m}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4L^{\mathrm{fh}}_{H,\mathrm{C}}}. \tag{309}$$

Because $M_{H,\mathrm{C}}$ is nonnegative, convex, $L^{\mathrm{fh}}_{H,\mathrm{C}}$-smooth with respect to $\left\lVert\cdot\right\rVert_{H,2,p^{\star}_{H}}$, and minimized at zero, its gradient satisfies

$$\left\lVert\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H})\right\rVert^{2}_{*}\leq 2L^{\mathrm{fh}}_{H,\mathrm{C}}M_{H,\mathrm{C}}(\bar{W}_{m,H}), \tag{310}$$

where $\left\lVert\cdot\right\rVert_{*}$ denotes the dual norm of $\left\lVert\cdot\right\rVert_{H,2,p^{\star}_{H}}$. Moreover, since $W_{m+1,0}=U_{m,H}-U^{\star}_{H,\mathrm{C}}=\bar{W}_{m,H}+D_{m,H}$, the smoothness of $M_{H,\mathrm{C}}$ gives

$$\begin{gathered}\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}\bigr]\leq M_{H,\mathrm{C}}(\bar{W}_{m,H})+\left\langle\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H}),\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rangle\\ +\frac{L^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].\end{gathered} \tag{311}$$

By Young’s inequality,

$$\begin{gathered}\left\langle\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H}),\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rangle\leq\frac{\delta_{m}}{2}\left\lVert\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H})\right\rVert^{2}_{*}+\frac{1}{2\delta_{m}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}\\ \leq\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}M_{H,\mathrm{C}}(\bar{W}_{m,H})+\frac{2L^{\mathrm{fh}}_{H,\mathrm{C}}}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}.\end{gathered} \tag{312}$$

Hence

$$\begin{gathered}\mathbb{E}[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}]\leq\left(1+\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}\right)M_{H,\mathrm{C}}(\bar{W}_{m,H})\\ +\frac{2L^{\mathrm{fh}}_{H,\mathrm{C}}}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}+\frac{L^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].\end{gathered} \tag{313}$$

Since $\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}\leq 1$, we have

$$\left(1+\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}\right)\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\bar{\alpha}_{m}\right)\leq 1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}. \tag{314}$$

Also,

$$\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}\leq\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,\infty}\leq 64\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\bar{\alpha}^{4}_{m}, \tag{315}$$

and

$$\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr]\leq\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,\infty}\mid U_{m,0}\bigr]\leq 144\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\bar{\alpha}^{2}_{m}. \tag{316}$$

Using $\bar{\alpha}_{m}\leq 1$, $\bar{\alpha}^{2}_{m}\leq H\bar{\alpha}^{(2)}_{m}$, and $\bar{\alpha}^{3}_{m}\leq H\bar{\alpha}_{m}^{(2)}$, we conclude that

$$\mathbb{E}[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}]\leq\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4}\bar{\alpha}_{m}\right)M_{H,\mathrm{C}}(W_{m,0})+HL^{\mathrm{fh}}_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\left(\frac{128}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}+72\right)\bar{\alpha}^{(2)}_{m}. \tag{317}$$

∎
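The proof above uses several elementary scalar inequalities without spelling them out. The block below records one possible way to verify each of them; the paper may have a different argument in mind, and the shorthand $\omega$, $d$, $x$, $x_u$ is introduced here only for readability.

```latex
% (a) From (295) to (296): if 0 <= alpha <= omega/d, then d alpha^2 <= omega alpha, so
\[
  1 - 2\omega\alpha + d\alpha^{2} \;\le\; 1 - 2\omega\alpha + \omega\alpha \;=\; 1 - \omega\alpha .
\]

% (b) Second inequality in (297): put x_u := omega * alpha_{mH+u} in [0,1] and
%     x := sum_u x_u = omega * \bar{alpha}_m <= 1. Using 1 - y <= e^{-y} and
%     e^{-x} <= 1 - x + x^2/2 <= 1 - x/2 for x in [0,1],
\[
  \prod_{u=0}^{H-1} (1 - x_u) \;\le\; e^{-x} \;\le\; 1 - \frac{x}{2} .
\]

% (c) Step-size sums (Cauchy--Schwarz), and the cubic bound via \bar{alpha}_m <= 1:
\[
  \bar{\alpha}_m^{2} = \Bigl(\sum_{u=0}^{H-1} \alpha_{mH+u}\Bigr)^{2}
  \;\le\; H \sum_{u=0}^{H-1} \alpha_{mH+u}^{2} = H\,\bar{\alpha}_m^{(2)},
  \qquad
  \bar{\alpha}_m^{3} \;\le\; \bar{\alpha}_m^{2} \;\le\; H\,\bar{\alpha}_m^{(2)} .
\]

% (d) Inequality (314), with x := omega * \bar{alpha}_m >= 0:
\[
  \Bigl(1 + \frac{x}{4}\Bigr)\Bigl(1 - \frac{x}{2}\Bigr)
  \;=\; 1 - \frac{x}{4} - \frac{x^{2}}{8} \;\le\; 1 - \frac{x}{4} .
\]
```

Combining (c) with (315) and (316) reproduces the two contributions $128/\omega^{\mathrm{fh}}_{H,\mathrm{C}}$ and $72$ in the constant $c^{\mathrm{fh}}_{H,\mathrm{C},2}$.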

###### Proposition 36\(Fixed\-horizon CTD finite\-iteration bound\)\.

Define

$$a^{\mathrm{fh}}_{H,\mathrm{C},1}:=r^{\mathrm{fh}}_{H,\mathrm{C}},\quad a^{\mathrm{fh}}_{H,\mathrm{C},2}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4},\quad a^{\mathrm{fh}}_{H,\mathrm{C},3}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4\bar{\alpha}_{H,\mathrm{C}}},\quad a^{\mathrm{fh}}_{H,\mathrm{C},4}:=2\bigl(1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)c^{\mathrm{fh}}_{H,\mathrm{C},2}. \tag{318}$$

If $a^{\mathrm{fh}}_{H,\mathrm{C},2}>0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and

$$\alpha_{0}\leq\frac{a^{\mathrm{fh}}_{H,\mathrm{C},2}}{a^{\mathrm{fh}}_{H,\mathrm{C},3}}, \tag{319}$$

then, for all episodes $m\geq 0$,

$$\begin{gathered}\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star}_{H,\mathrm{C}})^{2}\bigr]\leq a^{\mathrm{fh}}_{H,\mathrm{C},1}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star}_{H,\mathrm{C}})^{2}\prod_{j=0}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr)\\ +a^{\mathrm{fh}}_{H,\mathrm{C},4}\sum_{i=0}^{m-1}\bar{\alpha}^{(2)}_{i}\prod_{j=i+1}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr).\end{gathered} \tag{320}$$

###### Proof\.

Iterating Proposition [35](https://arxiv.org/html/2605.06866#Thmtheorem35) gives

$$\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m,0})\bigr]\leq M_{H,\mathrm{C}}(W_{0,0})\prod_{j=0}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr)+c^{\mathrm{fh}}_{H,\mathrm{C},2}\sum_{i=0}^{m-1}\bar{\alpha}^{(2)}_{i}\prod_{j=i+1}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr). \tag{321}$$

By the Moreau envelope comparison,

$$\mathbb{E}\bigl[\left\lVert W_{m,0}\right\rVert^{2}_{H,2,\infty}\bigr]\leq 2\bigl(1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m,0})\bigr] \tag{322}$$

and

$$M_{H,\mathrm{C}}(W_{0,0})\leq\frac{1}{2(1+\vartheta_{H,\mathrm{C}})}\left\lVert W_{0,0}\right\rVert^{2}_{H,2,\infty}. \tag{323}$$

Combining these bounds with the isometric embedding identity

$$\left\lVert W_{m,0}\right\rVert_{H,2,\infty}=\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star}_{H,\mathrm{C}}) \tag{324}$$

yields the claim. ∎

###### Corollary 37 (Boundary-iterate step size consequences).

Under the hypotheses of Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36):

(a) if $\alpha_{k}\equiv\alpha$ and

$$
\alpha\leq\frac{a_{H,\mathrm{C},2}^{\mathrm{fh}}}{a_{H,\mathrm{C},3}^{\mathrm{fh}}}, \tag{325}
$$

then for all episodes $m\geq 0$,

$$
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right]\leq a_{H,\mathrm{C},1}^{\mathrm{fh}}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta_{H,\mathrm{C}}^{\star})^{2}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}H\alpha\bigr)^{m}+\frac{a_{H,\mathrm{C},4}^{\mathrm{fh}}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\alpha. \tag{326}
$$

For the two diminishing-step cases below, where $g$ is the step-size offset, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$.

(b) if $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/a_{H,\mathrm{C},2}^{\mathrm{fh}}$, and

$$
g\geq\max\left\{1,\frac{\alpha a_{H,\mathrm{C},3}^{\mathrm{fh}}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\right\}, \tag{327}
$$

then the boundary iterates satisfy

$$
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right]\leq a_{H,\mathrm{C},1}^{\mathrm{fh}}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta_{H,\mathrm{C}}^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha}+\frac{a_{H,\mathrm{C},4}^{\mathrm{fh}}H^{2}\alpha^{2}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha-1}\cdot\frac{1}{\tau_{m}}. \tag{328}
$$

(c) if $\alpha_{k}=\alpha/(k+g)^{z}$ with $z\in(0,1)$ and

$$
g\geq\max\left\{1,\left(\frac{\alpha a_{H,\mathrm{C},3}^{\mathrm{fh}}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\right)^{1/z},\left(\frac{2z}{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha}\right)^{1/(1-z)}\right\}, \tag{329}
$$

then the boundary iterates satisfy

$$
\begin{gathered}
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right]\leq a_{H,\mathrm{C},1}^{\mathrm{fh}}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta_{H,\mathrm{C}}^{\star})^{2}\cdot\exp\left(-\frac{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right)\\
+\frac{2a_{H,\mathrm{C},4}^{\mathrm{fh}}H^{2}\alpha}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\cdot\frac{1}{\tau_{m}^{z}}.
\end{gathered} \tag{330}
$$

###### Proof\.

For part (a), substitute $\bar{\alpha}_{m}=H\alpha$ and $\bar{\alpha}_{m}^{(2)}=H\alpha^{2}$ into Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36) and sum the resulting geometric series.

For part (b), set $q=\tau_{0}/H$ and $\lambda=a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha$. The bounds

$$
\frac{H\alpha}{\tau_{m}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{mH+g},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2}} \tag{331}
$$

imply

$$
\prod_{j=0}^{m-1}(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j})\leq\left(\frac{q}{m+q}\right)^{\lambda}. \tag{332}
$$

The same elementary product-sum estimate gives

$$
\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j})\leq\frac{H\alpha^{2}}{\lambda-1}\cdot\frac{1}{m+q}, \tag{333}
$$

and the displayed bound follows from Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36) after substituting $q=\tau_{0}/H$ and $\tau_{m}=H(m+q)$.

For part (c), keep the same $q$ and set $A=a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha H^{1-z}$. The bounds

$$
\frac{H\alpha}{\tau_{m}^{z}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{(mH+g)^{z}},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2z}} \tag{334}
$$

imply

$$
\prod_{j=0}^{m-1}(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j})\leq\exp\left(-\frac{A}{1-z}\bigl((m+q)^{1-z}-q^{1-z}\bigr)\right). \tag{335}
$$

The lower bound on $g$ implies $q\geq(2z/A)^{1/(1-z)}$. The elementary polynomial product-sum estimate gives

$$
\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j})\leq\frac{2H\alpha^{2}}{A}\cdot\frac{1}{(m+q)^{z}}. \tag{336}
$$

Substituting the definitions of $A$ and $q$ into Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36), using $\tau_{m}=H(m+q)$, and using $H^{2z}\leq H^{2}$, gives the displayed bound. ∎
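As a numerical sanity check of how the bound of Proposition 36 behaves, the following sketch evaluates its right-hand side in two ways: as a scalar one-step recursion and as the explicit product-sum expression. For a constant step it also compares the result with the closed form of part (a). All numerical values (`a1`, `a2`, `a4`, `H`, `alpha`, `e0`, and the offset 20 in the diminishing schedule) are hypothetical placeholders chosen for illustration, not constants from the paper.

```python
import numpy as np

def episode_steps(alpha_fn, m, H):
    """Return (abar_m, abar2_m): sum and sum of squares of the H step sizes of episode m."""
    k = np.arange(m * H, (m + 1) * H)
    a = alpha_fn(k)
    return a.sum(), (a ** 2).sum()

def bound_recursive(alpha_fn, M, H, a1, a2, a4, e0):
    """Right-hand side of the bound, unrolled as x <- (1 - a2*abar) x + a4*abar2."""
    x = a1 * e0
    for m in range(M):
        abar, abar2 = episode_steps(alpha_fn, m, H)
        x = (1.0 - a2 * abar) * x + a4 * abar2
    return x

def bound_explicit(alpha_fn, M, H, a1, a2, a4, e0):
    """Same quantity written as the explicit product-sum expression."""
    steps = [episode_steps(alpha_fn, m, H) for m in range(M)]
    factors = np.array([1.0 - a2 * s[0] for s in steps])
    total = a1 * e0 * np.prod(factors)
    for i in range(M):
        # np.prod of an empty slice is 1.0, matching the empty-product convention.
        total += a4 * steps[i][1] * np.prod(factors[i + 1:])
    return total

# Hypothetical constants, for illustration only.
H, a1, a2, a4, e0 = 4, 2.0, 0.5, 3.0, 1.0
alpha = 0.05                                   # a2 * H * alpha = 0.1 < 1
const_step = lambda k: np.full(k.shape, alpha)
dimin_step = lambda k: 4.0 / (k + 20.0)        # alpha_k = alpha / (k + g), alpha = 4, g = 20

M = 50
rec_const = bound_recursive(const_step, M, H, a1, a2, a4, e0)
exp_const = bound_explicit(const_step, M, H, a1, a2, a4, e0)
closed_a = a1 * e0 * (1.0 - a2 * H * alpha) ** M + (a4 / a2) * alpha
rec_dimin = bound_recursive(dimin_step, M, H, a1, a2, a4, e0)
exp_dimin = bound_explicit(dimin_step, M, H, a1, a2, a4, e0)
```

The recursion and the product-sum form agree for any schedule. The constant-step closed form is an upper bound because the geometric sum is at most $\alpha/a_{H,\mathrm{C},2}^{\mathrm{fh}}$.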

###### Proof of Theorem [6](https://arxiv.org/html/2605.06866#Thmtheorem6).

Combine Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36) with Corollary [37](https://arxiv.org/html/2605.06866#Thmtheorem37). Since $k=mH$ and $H$ is fixed, the rates in the episode index $m$ are equivalent to the stated rates in the number $k$ of transitions. ∎

###### Proof of Corollary [7](https://arxiv.org/html/2605.06866#Thmtheorem7).

By Corollary [37](https://arxiv.org/html/2605.06866#Thmtheorem37)(b),

$$
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right]=O\left(\frac{1}{m+1}\right). \tag{337}
$$

Jensen's inequality then gives

$$
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})\right]=O\left(\frac{1}{\sqrt{m+1}}\right), \tag{338}
$$

so $m=O(\varepsilon^{-2})$ episodes suffice. Since $k=mH$ and $H$ is fixed, this is equivalently $k=O(\varepsilon^{-2})$ transitions. ∎
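The Jensen step can be spelled out as follows. Here $C$ is a generic constant standing in for the factor hidden in the $O(\cdot)$ of the second-moment bound; it is introduced only for this illustration.

$$
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})\right]\leq\sqrt{\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right]}\leq\sqrt{\frac{C}{m+1}}\leq\varepsilon\quad\text{whenever}\quad m+1\geq\frac{C}{\varepsilon^{2}}.
$$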

## Appendix E Fixed-horizon MTD

This appendix records the fixed-horizon MTD ingredients and the proof of Theorem [8](https://arxiv.org/html/2605.06866#Thmtheorem8).

### E.1 Flattened weighted state–horizon space

We use the same weighted flattening as in Appendix [D](https://arxiv.org/html/2605.06866#A4). Let

$$
\mathcal{S}_{H}:=\{(h,s):h\in\{1,\dots,H\},\ s\in\mathcal{S}\},\qquad\lvert\mathcal{S}_{H}\rvert=H\lvert\mathcal{S}\rvert,\qquad p_{H}^{\star}:=\max\{2,\lceil\log\lvert\mathcal{S}_{H}\rvert\rceil\}. \tag{339}
$$

Let

$$
V_{H,\mathrm{M}}:=\prod_{(h,s)\in\mathcal{S}_{H}}\mathbb{R}^{d} \tag{340}
$$

denote the corresponding flattened Euclidean product space. For a horizon-stacked iterate $U=(U^{h}(s))$, define

$$
\bar{U}(h,s):=\lambda^{h}U^{h}(s). \tag{341}
$$

Then

$$
\lVert U\rVert_{H,2,\infty}=\max_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2},\qquad\lVert U\rVert_{H,2,p}=\left(\sum_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2}^{p}\right)^{1/p}. \tag{342}
$$

For a single horizon stack $U=(U^{1},\dots,U^{H})$, also set

$$
\lVert U\rVert_{H,2}:=\max_{1\leq h\leq H}\lambda^{h}\lVert U^{h}\rVert_{2}. \tag{343}
$$

Then

$$
\lVert U\rVert_{H,2,\infty}=\max_{s\in\mathcal{S}}\lVert U(s)\rVert_{H,2}. \tag{344}
$$

Thus Appendix [A](https://arxiv.org/html/2605.06866#A1) applies directly to the flattened weighted process.
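The weighted norms above are straightforward to compute for a concrete array. The sketch below stores a horizon-stacked iterate as an array of shape `(H, S, d)`, a layout assumed here for illustration, and checks the statewise identity together with the standard comparison between the sup-type and $p$-type norms. The weight `lam`, the sizes, and the use of the natural logarithm in `p_star` are assumptions made for the example.

```python
import numpy as np

def weighted_blocks(U, lam):
    """Array of lam**h * ||U^h(s)||_2 with shape (H, S); h runs over 1..H."""
    H = U.shape[0]
    weights = lam ** np.arange(1, H + 1)
    return weights[:, None] * np.linalg.norm(U, axis=2)

def norm_H2inf(U, lam):
    """Weighted sup norm over all (h, s) blocks."""
    return weighted_blocks(U, lam).max()

def norm_H2p(U, lam, p):
    """Weighted p-norm over all (h, s) blocks."""
    return (weighted_blocks(U, lam) ** p).sum() ** (1.0 / p)

def norm_H2_stack(U_s, lam):
    """Weighted norm of one state's horizon stack U(s), shape (H, d)."""
    H = U_s.shape[0]
    return (lam ** np.arange(1, H + 1) * np.linalg.norm(U_s, axis=1)).max()

rng = np.random.default_rng(0)
H, S, d, lam = 5, 3, 4, 0.9            # hypothetical sizes and weight
U = rng.normal(size=(H, S, d))
n = H * S                              # |S_H| = H |S|
p_star = max(2, int(np.ceil(np.log(n))))
sup = norm_H2inf(U, lam)
pnorm = norm_H2p(U, lam, p_star)
statewise = max(norm_H2_stack(U[:, s, :], lam) for s in range(S))
```

The comparison `sup <= pnorm <= n**(1/p_star) * sup` is the usual relation between an $\ell_\infty$ and an $\ell_p$ norm over $n$ blocks.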

### E.2 Fixed-horizon MTD residual map

For each state $s\in\mathcal{S}$, let

$$
U(s):=\bigl(U^{1}(s),\dots,U^{H}(s)\bigr), \tag{345}
$$

and let $P_{s}^{H}$ denote the coordinate projector onto that stack. Define

$$
F_{H,\mathrm{M}}(U;s,(r,s^{\prime})):=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-U(s)\bigr), \tag{346}
$$

where $\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))$ is the full horizon-stacked MTD target from Section [4](https://arxiv.org/html/2605.06866#S4). Then the online fixed-horizon MTD recursion is

$$
U_{k+1}=U_{k}+\alpha_{k}F_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1})). \tag{347}
$$

Define also the MTD embedded averaged operator by

$$
\mathcal{O}_{H,\mathrm{M}}:=I_{H,\mathrm{M}}\circ\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}\circ I_{H,\mathrm{M}}^{-1}. \tag{348}
$$
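The asynchronous character of the recursion is that only the horizon stack of the visited state changes at each step. A minimal sketch follows, assuming the iterate is stored as an array of shape `(H, S, d)` and that the sampled target for the visited state is already available as an array `target_stack` of shape `(H, d)`. Both the storage layout and the function name `async_stack_update` are hypothetical, and the computation of the MTD target itself is not modeled.

```python
import numpy as np

def async_stack_update(U, s, target_stack, alpha):
    """Return U + alpha * P_s^H (target_stack - U(s)) as a new array.

    U has shape (H, S, d). Only the slice U[:, s, :] is modified, which is
    the effect of the coordinate projector onto the stack of state s.
    """
    U_next = U.copy()
    U_next[:, s, :] += alpha * (target_stack - U[:, s, :])
    return U_next

rng = np.random.default_rng(1)
H, S, d = 3, 4, 2                      # hypothetical sizes
U = rng.normal(size=(H, S, d))
target = rng.normal(size=(H, d))       # stand-in for the sampled target at state s
s, alpha = 2, 0.1
U_next = async_stack_update(U, s, target, alpha)
```

After the update, the visited stack equals the convex combination `(1 - alpha) * U[:, s] + alpha * target`, and every other state's stack is untouched.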
###### Proposition 38 (Fixed-horizon MTD weighted contraction).

For all $U,U^{\prime}\in I_{H,\mathrm{M}}(\mathcal{F}_{H,\mathrm{M},\Theta}^{\mathcal{S}})$,

$$
\lVert\mathcal{O}_{H,\mathrm{M}}U-\mathcal{O}_{H,\mathrm{M}}U^{\prime}\rVert_{H,2,\infty}\leq\lambda\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{349}
$$

###### Proof.

Let $\eta:=I_{H,\mathrm{M}}^{-1}(U)$ and $\eta^{\prime}:=I_{H,\mathrm{M}}^{-1}(U^{\prime})$. Fix $(h,s)\in\mathcal{S}_{H}$ with $h\geq 1$. By the statewise MMD isometry and nonexpansiveness of the signed-categorical projection,

$$
\begin{aligned}
\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{M}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{M}}U^{\prime})^{h}(s)\rVert_{2}&=\lambda^{h}\ell_{\mathrm{M}}\left((\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}\eta)^{h}(s),(\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}\eta^{\prime})^{h}(s)\right)\\
&\leq\lambda^{h}\ell_{\mathrm{M}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta^{\prime})^{h}(s)\right).
\end{aligned} \tag{350}
$$

Now

$$
(T_{H}^{\pi}\eta)^{h}(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)\,(f_{R(s,a),1})_{\#}\eta^{h-1}(s^{\prime}), \tag{351}
$$

and likewise for $\eta^{\prime}$. Since $(f_{r,1})_{\#}$ is an isometry in the MMD metric associated with a shift-invariant kernel and MMD is convex under mixtures,

$$
\begin{aligned}
\ell_{\mathrm{M}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta^{\prime})^{h}(s)\right)&\leq\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}\mid s,a)\,\ell_{\mathrm{M}}\left(\eta^{h-1}(s^{\prime}),{\eta^{\prime}}^{h-1}(s^{\prime})\right)\\
&\leq\max_{x\in\mathcal{S}}\ell_{\mathrm{M}}\left(\eta^{h-1}(x),{\eta^{\prime}}^{h-1}(x)\right).
\end{aligned} \tag{352}
$$

Therefore

$$
\begin{aligned}
\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{M}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{M}}U^{\prime})^{h}(s)\rVert_{2}&\leq\lambda\max_{x\in\mathcal{S}}\lambda^{h-1}\ell_{\mathrm{M}}\left(\eta^{h-1}(x),{\eta^{\prime}}^{h-1}(x)\right)\\
&\leq\lambda\ell_{H,\mathrm{M},\infty}(\eta,\eta^{\prime})=\lambda\lVert U-U^{\prime}\rVert_{H,2,\infty}.
\end{aligned} \tag{353}
$$

Taking the maximum over $(h,s)\in\mathcal{S}_{H}$ proves the claim. ∎

### E.3 One-step bounds

###### Proposition 39 (Fixed-horizon MTD samplewise Lipschitzness).

For every $s\in\mathcal{S}$, $(r,s^{\prime})\in[0,1]^{q}\times\mathcal{S}$, and $U,U^{\prime}\in I_{H,\mathrm{M}}(\mathcal{F}_{H,\mathrm{M},\Theta}^{\mathcal{S}})$,

$$
\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-F_{H,\mathrm{M}}(U^{\prime};s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{354}
$$

###### Proof.

For each horizon $h$, the local target $\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s^{\prime}))$ depends only on the $(h-1)$st horizon block at $s^{\prime}$. By Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27),

$$
\left\lVert\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s^{\prime}))-\widehat{T}_{H,\mathrm{M}}^{h}(U^{\prime};s,(r,s^{\prime}))\right\rVert_{2}\leq\lVert U^{h-1}(s^{\prime})-{U^{\prime}}^{h-1}(s^{\prime})\rVert_{2}. \tag{355}
$$

Multiplying by $\lambda^{h}$ and using $\lambda\leq 1$ yields

$$
\lambda^{h}\left\lVert\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s^{\prime}))-\widehat{T}_{H,\mathrm{M}}^{h}(U^{\prime};s,(r,s^{\prime}))\right\rVert_{2}\leq\lambda^{h-1}\lVert U^{h-1}(s^{\prime})-{U^{\prime}}^{h-1}(s^{\prime})\rVert_{2}\leq\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{356}
$$

Taking the maximum over $h$ gives

$$
\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-\widehat{T}_{H,\mathrm{M}}(U^{\prime};s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{357}
$$

Since

$$
F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-U(s)\bigr), \tag{358}
$$

the residual difference splits into a projected target difference and a projected current-stack difference. Bounding both by $\lVert U-U^{\prime}\rVert_{H,2,\infty}$ gives the stated factor $2$. ∎

###### Proposition 40 (Fixed-horizon MTD affine perturbation bound).

Let

$$
U^{\star}_{H,\mathrm{M}}:=I_{H,\mathrm{M}}(\eta_{H,\mathrm{M}}^{\star}). \tag{359}
$$

Here $\eta_{H,\mathrm{M}}^{\star}$ denotes the unique fixed point of $\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}$. Then there exist finite constants $B^{\mathrm{tar}}_{H,\mathrm{M}}\geq 0$ and $B^{\mathrm{res}}_{H,\mathrm{M}}\geq 0$ such that, for every admissible sample $(s,(r,s^{\prime}))$ and every admissible iterate $U$,

$$
\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-(\mathcal{O}_{H,\mathrm{M}}U)(s)\right\rVert_{H,2}\leq 2\lambda\lVert U\rVert_{H,2,\infty}+B^{\mathrm{tar}}_{H,\mathrm{M}}, \tag{360}
$$

and

$$
\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2\lVert U\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}. \tag{361}
$$

Consequently,

$$
\mathbb{E}\Bigl[\left\lVert\widehat{T}_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{H,\mathrm{M}}U_{k})(S_{k})\right\rVert_{H,2}^{2}\mid U_{k},S_{k}\Bigr]\leq 2\bigl(B^{\mathrm{tar}}_{H,\mathrm{M}}\bigr)^{2}+8\lambda^{2}\lVert U_{k}\rVert_{H,2,\infty}^{2}. \tag{362}
$$

###### Proof\.

For fixed $s,s^{\prime}\in\mathcal{S}$ and $r\in[0,1]^{q}$, Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) gives, for each horizon $h$,

$$
\left\lVert\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s^{\prime}))-\widehat{T}_{H,\mathrm{M}}^{h}(U^{\prime};s,(r,s^{\prime}))\right\rVert_{2}\leq\lVert U^{h-1}(s^{\prime})-{U^{\prime}}^{h-1}(s^{\prime})\rVert_{2}. \tag{363}
$$

After multiplying by $\lambda^{h}$ and taking the maximum over $h$, the horizon shift contributes exactly one factor $\lambda$. Therefore

$$
\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-\widehat{T}_{H,\mathrm{M}}(U^{\prime};s,(r,s^{\prime}))\right\rVert_{H,2}\leq\lambda\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{364}
$$

Likewise, Proposition [38](https://arxiv.org/html/2605.06866#Thmtheorem38) yields

$$
\lVert(\mathcal{O}_{H,\mathrm{M}}U)(s)-(\mathcal{O}_{H,\mathrm{M}}U^{\prime})(s)\rVert_{H,2}\leq\lambda\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{365}
$$

Now define

$$
B^{\star,\mathrm{tar}}_{H,\mathrm{M}}:=\max_{s,s^{\prime}\in\mathcal{S}}\sup_{r\in[0,1]^{q}}\left\lVert\widehat{T}_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s^{\prime}))-(\mathcal{O}_{H,\mathrm{M}}U^{\star}_{H,\mathrm{M}})(s)\right\rVert_{H,2}. \tag{366}
$$

As in the discounted case, this constant is finite because $[0,1]^{q}$ is compact, there are only finitely many state pairs, and the embedded projected target depends continuously on $r$.

Then

$$
\begin{gathered}
\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-(\mathcal{O}_{H,\mathrm{M}}U)(s)\right\rVert_{H,2}\leq\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-\widehat{T}_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s^{\prime}))\right\rVert_{H,2}\\
+\left\lVert\widehat{T}_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s^{\prime}))-(\mathcal{O}_{H,\mathrm{M}}U^{\star}_{H,\mathrm{M}})(s)\right\rVert_{H,2}+\left\lVert(\mathcal{O}_{H,\mathrm{M}}U^{\star}_{H,\mathrm{M}})(s)-(\mathcal{O}_{H,\mathrm{M}}U)(s)\right\rVert_{H,2}\\
\leq 2\lambda\lVert U-U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{tar}}_{H,\mathrm{M}}\\
\leq 2\lambda\lVert U\rVert_{H,2,\infty}+\Bigl(2\lambda\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{tar}}_{H,\mathrm{M}}\Bigr).
\end{gathered} \tag{367}
$$

Hence the target-level affine bound holds with

$$
B^{\mathrm{tar}}_{H,\mathrm{M}}:=2\lambda\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{tar}}_{H,\mathrm{M}}. \tag{368}
$$

For the residual map, Proposition [39](https://arxiv.org/html/2605.06866#Thmtheorem39) gives

$$
\left\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-F_{H,\mathrm{M}}(U^{\prime};s,(r,s^{\prime}))\right\rVert_{H,2,\infty}\leq 2\lVert U-U^{\prime}\rVert_{H,2,\infty}. \tag{369}
$$

Define

$$
B^{\star,\mathrm{res}}_{H,\mathrm{M}}:=\max_{s,s^{\prime}\in\mathcal{S}}\sup_{r\in[0,1]^{q}}\lVert F_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s^{\prime}))\rVert_{H,2,\infty}, \tag{370}
$$

which is finite by the same continuity and compactness argument. Then

$$
\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2\lVert U-U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{res}}_{H,\mathrm{M}}\leq 2\lVert U\rVert_{H,2,\infty}+\Bigl(2\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{res}}_{H,\mathrm{M}}\Bigr). \tag{371}
$$

Thus the residual-level affine bound holds with

$$
B^{\mathrm{res}}_{H,\mathrm{M}}:=2\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{res}}_{H,\mathrm{M}}. \tag{372}
$$

Finally, the conditional second-moment estimate follows from the target-level affine bound by squaring and using

$$
(a+b)^{2}\leq 2a^{2}+2b^{2} \tag{373}
$$

with

$$
a:=2\lambda\lVert U_{k}\rVert_{H,2,\infty},\qquad b:=B^{\mathrm{tar}}_{H,\mathrm{M}}. \tag{374}
$$

∎

###### Proposition 41 (Fixed-horizon MTD affine conditional second moment).

There exist finite constants $C_{H,\mathrm{M},1},C_{H,\mathrm{M},2}\geq 0$ such that

$$
\mathbb{E}\left[\lVert\widehat{T}_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{H,\mathrm{M}}U_{k})(S_{k})\rVert_{H,2}^{2}\mid U_{k},S_{k}\right]\leq C_{H,\mathrm{M},1}+C_{H,\mathrm{M},2}\lVert U_{k}\rVert_{H,2,\infty}^{2}. \tag{375}
$$

###### Proof.

Square the pathwise target-level affine bound of Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40) and use $(a+b)^{2}\leq 2a^{2}+2b^{2}$. ∎
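For instance, the conditional second-moment estimate stated at the end of Proposition 40 already has the required affine form, so one admissible (not necessarily tightest) choice of constants is

$$
C_{H,\mathrm{M},1}=2\bigl(B^{\mathrm{tar}}_{H,\mathrm{M}}\bigr)^{2},\qquad C_{H,\mathrm{M},2}=8\lambda^{2}.
$$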

### E.4 Phasewise averaged maps

For each phase $t\in\{0,\dots,H-1\}$, define

$$
\Gamma_{t,\mathrm{M}}(U):=\sum_{s\in\mathcal{S}}\rho_{t}(s)\,P_{s}^{H}\bigl((\mathcal{O}_{H,\mathrm{M}}U)(s)-U(s)\bigr),\qquad G_{t,\mathrm{M}}(U):=U+\Gamma_{t,\mathrm{M}}(U). \tag{376}
$$

###### Proposition 42 (Phasewise averaged contraction).

For each phase $t\in\{0,\dots,H-1\}$,

$$
\lVert G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U^{\prime})\rVert_{H,2,\infty}\leq\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}\lVert U-U^{\prime}\rVert_{H,2,\infty},\qquad\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}:=1-\rho_{\min}(1-\lambda). \tag{377}
$$

###### Proof.

For each state block $s$,

$$
\bigl(G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U^{\prime})\bigr)(s)=(1-\rho_{t}(s))(U(s)-U^{\prime}(s))\,+\,\rho_{t}(s)\bigl((\mathcal{O}_{H,\mathrm{M}}U)(s)-(\mathcal{O}_{H,\mathrm{M}}U^{\prime})(s)\bigr). \tag{378}
$$

By Proposition [38](https://arxiv.org/html/2605.06866#Thmtheorem38),

$$
\begin{aligned}
\left\lVert\bigl(G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U^{\prime})\bigr)(s)\right\rVert_{H,2}&\leq\bigl((1-\rho_{t}(s))+\rho_{t}(s)\lambda\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}\\
&=\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}.
\end{aligned} \tag{379}
$$

Taking the maximum over $s$ on both sides gives

$$
\begin{aligned}
\lVert G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U^{\prime})\rVert_{H,2,\infty}&\leq\max_{s\in\mathcal{S}}\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}\\
&\leq\bigl(1-\rho_{\min}(1-\lambda)\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty},
\end{aligned} \tag{380}
$$

since the coefficient $1-\rho_{t}(s)(1-\lambda)$ is decreasing in $\rho_{t}(s)$ and $\rho_{t}(s)\geq\rho_{\min}$ for every phase-state block. Equivalently,

$$
\bar{\beta}_{t}^{\mathrm{fh}}:=1-\rho_{t,\min}(1-\lambda)\leq 1-\rho_{\min}(1-\lambda)=\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}, \tag{381}
$$

so the phasewise bound is dominated by the claimed uniform contraction factor. ∎
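The averaging argument can be checked numerically on a toy problem. The sketch below does not use the MTD operator; as a labeled stand-in it takes an affine map `O(U) = lam * P @ U + b` with a row-stochastic matrix `P`, which is a `lam`-contraction in the plain sup norm, with one scalar coordinate per state. The names `P`, `b`, `rho`, and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
S, lam = 6, 0.8                        # hypothetical number of states and contraction factor

# Toy lam-contraction in the sup norm (stand-in for the embedded operator).
P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)      # make P row-stochastic
b = rng.normal(size=S)

def O(U):
    return lam * (P @ U) + b

rho = rng.random(S) + 0.1
rho /= rho.sum()                       # sampling distribution over states
rho_min = rho.min()

def G(U):
    """Averaged map: coordinate s moves toward (O U)(s) with weight rho(s)."""
    return U + rho * (O(U) - U)

U, V = rng.normal(size=S), rng.normal(size=S)
lhs = np.abs(G(U) - G(V)).max()
rhs = (1.0 - rho_min * (1.0 - lam)) * np.abs(U - V).max()
```

Each coordinate of `G(U) - G(V)` is a convex combination of an unchanged difference and a contracted one, which is why `lhs` never exceeds `rhs`.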

### E.5 Auxiliary phasewise bounds

For episode $m$, write

$$
U_{m,t}:=U_{mH+t},\qquad t=0,\dots,H. \tag{382}
$$

###### Lemma 43 (Within-episode deviation).

For every episode $m$ and every $t\in\{0,\dots,H\}$,

$$
\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq\sum_{u=0}^{t-1}\alpha_{mH+u}\Bigl(2\lVert U_{m,u}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}\Bigr). \tag{383}
$$

###### Proof.

From the recursion,

$$
U_{m,t}-U_{m,0}=\sum_{u=0}^{t-1}\alpha_{mH+u}F_{H,\mathrm{M}}\bigl(U_{m,u};S_{u}^{m},(R_{u}^{m},S_{u+1}^{m})\bigr). \tag{384}
$$

Take norms and apply the residual-level affine bound from Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40) term by term:

$$
\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq\sum_{u=0}^{t-1}\alpha_{mH+u}\Bigl(2\lVert U_{m,u}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}\Bigr). \tag{385}
$$

∎

###### Lemma 44 (Frozen residual decomposition).

For each phase $t\in\{0,\dots,H-1\}$, define

$$
\zeta_{m,t}^{\mathrm{M}}:=F_{H,\mathrm{M}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-\Gamma_{t,\mathrm{M}}(U_{m,0}) \tag{386}
$$

and

$$
\xi_{m,t}^{\mathrm{M}}:=\Bigl(F_{H,\mathrm{M}}(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-F_{H,\mathrm{M}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\Bigr)-\Bigl(\Gamma_{t,\mathrm{M}}(U_{m,t})-\Gamma_{t,\mathrm{M}}(U_{m,0})\Bigr). \tag{387}
$$

Then

$$
F_{H,\mathrm{M}}(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-\Gamma_{t,\mathrm{M}}(U_{m,t})=\zeta_{m,t}^{\mathrm{M}}+\xi_{m,t}^{\mathrm{M}}, \tag{388}
$$

$$
\mathbb{E}\left[\zeta_{m,t}^{\mathrm{M}}\,\mid\,U_{m,0}\right]=0,\qquad\lVert\zeta_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,0}\rVert_{H,2,\infty}+2B^{\mathrm{res}}_{H,\mathrm{M}}, \tag{389}
$$

and

$$
\lVert\xi_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}. \tag{390}
$$

###### Proof.

Since episode $m$ is independent of $U_{m,0}$ and its phase-$t$ sample is distributed according to the phase-$t$ law used in $\Gamma_{t,\mathrm{M}}$, we have

$$
\mathbb{E}\left[\zeta_{m,t}^{\mathrm{M}}\,\mid\,U_{m,0}\right]=0. \tag{391}
$$

By Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40),

$$
\lVert F_{H,\mathrm{M}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\rVert_{H,2,\infty}\leq 2\lVert U_{m,0}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}, \tag{392}
$$

and, since $\Gamma_{t,\mathrm{M}}(U_{m,0})$ is the average of the same residual map under the phase-$t$ law,

$$
\lVert\Gamma_{t,\mathrm{M}}(U_{m,0})\rVert_{H,2,\infty}\leq 2\lVert U_{m,0}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}. \tag{393}
$$

Hence

$$
\lVert\zeta_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,0}\rVert_{H,2,\infty}+2B^{\mathrm{res}}_{H,\mathrm{M}}. \tag{394}
$$

Proposition [39](https://arxiv.org/html/2605.06866#Thmtheorem39) gives

$$
\left\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))-F_{H,\mathrm{M}}(V;s,(r,s^{\prime}))\right\rVert_{H,2,\infty}\leq 2\lVert U-V\rVert_{H,2,\infty}, \tag{395}
$$

and averaging the same estimate yields

$$
\lVert\Gamma_{t,\mathrm{M}}(U)-\Gamma_{t,\mathrm{M}}(V)\rVert_{H,2,\infty}\leq 2\lVert U-V\rVert_{H,2,\infty}. \tag{396}
$$

Therefore

$$
\lVert\xi_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}. \tag{397}
$$

∎

### E\.6 Episode\-level finite\-iteration drift

Since $\Gamma_{t,\mathrm{M}}$ is defined by averaging with respect to the phase-$t$ marginal $\rho_{t}$, the drift argument is formulated for the episode-boundary sequence $(U_{mH})_{m\geq 0}$, conditioning on $U_{m,0}$.

Let

$$W_{m,0}:=U_{m,0}-U^{\star}_{H,\mathrm{M}},\qquad U^{\star}_{H,\mathrm{M}}:=I_{H,\mathrm{M}}(\eta_{H,\mathrm{M}}^{\star}),\tag{398}$$

and define

$$\bar{\alpha}_{m}:=\sum_{u=0}^{H-1}\alpha_{mH+u},\qquad\bar{\alpha}_{m}^{(2)}:=\sum_{u=0}^{H-1}\alpha_{mH+u}^{2}.\tag{399}$$

Also write

$$\lvert\mathcal{S}_{H}\rvert:=H\lvert\mathcal{S}\rvert,\qquad\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}:=1-\rho_{\min}(1-\lambda),\qquad\kappa_{H,\mathrm{M}}:=1-\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}=\rho_{\min}(1-\lambda)>0.\tag{400}$$

Fix any $\vartheta_{H,\mathrm{M}}>0$ and define the smoothed potential

$$M_{H,\mathrm{M}}(W):=\inf_{Z\in V_{H,\mathrm{M}}}\left\{\frac{1}{2}\left\lVert Z\right\rVert^{2}_{H,2,\infty}+\frac{1}{2\vartheta_{H,\mathrm{M}}}\left\lVert W-Z\right\rVert^{2}_{H,2,p^{\star}_{H}}\right\}.\tag{401}$$

By Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13) on the flattened weighted space,

$$(1+\vartheta_{H,\mathrm{M}})M_{H,\mathrm{M}}(W)\leq\frac{1}{2}\left\lVert W\right\rVert^{2}_{H,2,\infty}\leq(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})M_{H,\mathrm{M}}(W).\tag{402}$$

Define

$$r^{\mathrm{fh}}_{H,\mathrm{M}}:=\frac{1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}}{1+\vartheta_{H,\mathrm{M}}},\qquad L^{\mathrm{fh}}_{H,\mathrm{M}}:=\frac{p^{\star}_{H}-1}{\vartheta_{H,\mathrm{M}}},\qquad\omega^{\mathrm{fh}}_{H,\mathrm{M}}:=1-\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{M}}\sqrt{r^{\mathrm{fh}}_{H,\mathrm{M}}}.\tag{403}$$

Since $r^{\mathrm{fh}}_{H,\mathrm{M}}\to 1$ as $\vartheta_{H,\mathrm{M}}\downarrow 0$ and $\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{M}}<1$, we have $\omega^{\mathrm{fh}}_{H,\mathrm{M}}>0$ for all sufficiently small $\vartheta_{H,\mathrm{M}}$. Define the preliminary threshold and the explicit MTD constants

$$\begin{gathered}d^{\mathrm{fh}}_{H,\mathrm{M}}:=8L^{\mathrm{fh}}_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}(1+\vartheta_{H,\mathrm{M}}),\qquad\widehat{\alpha}_{H,\mathrm{M}}:=\min\left\{\frac{1}{H},\frac{1}{2H\kappa_{H,\mathrm{M}}},\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{d^{\mathrm{fh}}_{H,\mathrm{M}}}\right\},\\ C_{H,\mathrm{M},\mathrm{g}}:=\exp\left(\frac{1}{\kappa_{H,\mathrm{M}}}\right)\max\left\{1,\frac{B^{\mathrm{res}}_{H,\mathrm{M}}}{2\kappa_{H,\mathrm{M}}}\right\},\qquad C_{H,\mathrm{M},\mathrm{w}}:=2C_{H,\mathrm{M},\mathrm{g}}+B^{\mathrm{res}}_{H,\mathrm{M}},\\ C_{H,\mathrm{M},\mathrm{p}}:=HL^{\mathrm{fh}}_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}C^{2}_{H,\mathrm{M},\mathrm{w}}\left(\frac{32}{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}+36\right),\end{gathered}\tag{404}$$

and

$$\bar{\alpha}_{H,\mathrm{M}}:=\min\left\{\widehat{\alpha}_{H,\mathrm{M}},\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{64C_{H,\mathrm{M},\mathrm{p}}(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})}\right\}.\tag{405}$$

Finally, define

$$\begin{gathered}c_{H,\mathrm{M},1}^{\mathrm{fh}}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{4},\qquad c_{H,\mathrm{M},2}^{\mathrm{fh}}:=2C_{H,\mathrm{M},\mathrm{p}}\bigl(1+2\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}^{2}\bigr),\\ c_{H,\mathrm{M},3}^{\mathrm{fh}}:=8C_{H,\mathrm{M},\mathrm{p}}(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}).\end{gathered}\tag{406}$$
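
As a rough numerical companion to the definitions in (400) and (403) through (406), the following Python sketch evaluates the constants for user-supplied inputs. The function name `mtd_constants`, its argument names, and the sample values are illustrative assumptions, not part of the paper; in particular $p^{\star}_{H}$, $\rho_{\min}$, $\lambda$, $B^{\mathrm{res}}_{H,\mathrm{M}}$, and $\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}$ are treated as given inputs rather than derived.

```python
import math


def mtd_constants(H, n_states, p_star, theta, rho_min, lam, B_res, U_star_norm):
    """Evaluate the constants of (400) and (403)-(406).

    Hypothetical helper: all arguments are assumed to be supplied by the user.
    """
    size = H * n_states                     # |S_H| = H |S|
    kappa = rho_min * (1.0 - lam)           # kappa_{H,M}
    beta = 1.0 - kappa                      # bar-beta^{fh}_{H,M}
    s = size ** (2.0 / p_star)              # |S_H|^{2/p*_H}
    r = (1.0 + theta * s) / (1.0 + theta)   # r^{fh}_{H,M}
    L = (p_star - 1.0) / theta              # L^{fh}_{H,M}
    omega = 1.0 - beta * math.sqrt(r)       # omega^{fh}_{H,M}
    if omega <= 0.0:
        raise ValueError("omega is not positive; try a smaller theta")
    d = 8.0 * L * s * (1.0 + theta)         # d^{fh}_{H,M}
    alpha_hat = min(1.0 / H, 1.0 / (2.0 * H * kappa), omega / d)
    C_g = math.exp(1.0 / kappa) * max(1.0, B_res / (2.0 * kappa))
    C_w = 2.0 * C_g + B_res
    C_p = H * L * s * C_w ** 2 * (32.0 / omega + 36.0)
    alpha_bar = min(alpha_hat, omega / (64.0 * C_p * (1.0 + theta * s)))
    return {
        "kappa": kappa, "omega": omega, "alpha_hat": alpha_hat,
        "alpha_bar": alpha_bar, "C_g": C_g, "C_w": C_w, "C_p": C_p,
        "c1": omega / 4.0,
        "c2": 2.0 * C_p * (1.0 + 2.0 * U_star_norm ** 2),
        "c3": 8.0 * C_p * (1.0 + theta * s),
    }


# Illustrative inputs only (not taken from the paper).
consts = mtd_constants(H=2, n_states=3, p_star=4.0, theta=0.01,
                       rho_min=0.2, lam=0.5, B_res=1.0, U_star_norm=1.0)
```

Because $C_{H,\mathrm{M},\mathrm{g}}$ contains the factor $\exp(1/\kappa_{H,\mathrm{M}})$, the resulting threshold `alpha_bar` can be extremely small for small $\kappa_{H,\mathrm{M}}$; the sketch is meant only to make the dependencies between the constants explicit.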
###### Lemma 45 \(Pathwise growth control inside one episode\)\.

Assume that $(\alpha_{k})_{k\geq 0}$ is nonincreasing and that $\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{M}}\leq 1$. Then, for every episode $m$,

$$\max_{0\leq t\leq H}\lVert U_{m,t}\rVert_{H,2,\infty}\leq C_{H,\mathrm{M},\mathrm{g}}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr)\tag{407}$$

pathwise.

###### Proof\.

By Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40),

$$\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2\lVert U\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}},\tag{408}$$

where $B^{\mathrm{res}}_{H,\mathrm{M}}$ is the residual-level affine constant from Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40). Hence the recursion gives

$$\lVert U_{m,t+1}\rVert_{H,2,\infty}\leq\bigl(1+2\alpha_{mH+t}\bigr)\lVert U_{m,t}\rVert_{H,2,\infty}+\alpha_{mH+t}B^{\mathrm{res}}_{H,\mathrm{M}}.\tag{409}$$

Iterating over at most $H$ steps and using

$$\sum_{u=0}^{H-1}\alpha_{mH+u}\leq H\alpha_{0}\leq H\bar{\alpha}_{H,\mathrm{M}}\tag{410}$$

yields

$$\max_{0\leq t\leq H}\lVert U_{m,t}\rVert_{H,2,\infty}\leq e^{2H\bar{\alpha}_{H,\mathrm{M}}}\bigl(\lVert U_{m,0}\rVert_{H,2,\infty}+H\bar{\alpha}_{H,\mathrm{M}}B^{\mathrm{res}}_{H,\mathrm{M}}\bigr).\tag{411}$$

Since $\bar{\alpha}_{H,\mathrm{M}}\leq\widehat{\alpha}_{H,\mathrm{M}}\leq 1/(2H\kappa_{H,\mathrm{M}})$, we have

$$e^{2H\bar{\alpha}_{H,\mathrm{M}}}\leq e^{1/\kappa_{H,\mathrm{M}}},\qquad H\bar{\alpha}_{H,\mathrm{M}}B^{\mathrm{res}}_{H,\mathrm{M}}\leq\frac{B^{\mathrm{res}}_{H,\mathrm{M}}}{2\kappa_{H,\mathrm{M}}}.\tag{412}$$

Therefore

$$\max_{0\leq t\leq H}\lVert U_{m,t}\rVert_{H,2,\infty}\leq C_{H,\mathrm{M},\mathrm{g}}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr).\tag{413}$$

∎
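
The passage from (409) to (411) is a discrete Grönwall-type argument. A minimal Python sketch of it, under the assumption that the norms are replaced by a scalar sequence satisfying (409) with equality (the worst case), is given below; the function names and the sample step sizes are hypothetical and chosen only for illustration.

```python
import math


def worst_case_growth(x0, alphas, B):
    """Iterate x_{t+1} = (1 + 2 a_t) x_t + a_t B, i.e. (409) with equality."""
    xs = [x0]
    for a in alphas:
        xs.append((1.0 + 2.0 * a) * xs[-1] + a * B)
    return xs


def gronwall_bound(x0, alphas, B):
    """Right-hand side of (411), with the exact step-size sum in place of H * alpha_bar."""
    total = sum(alphas)
    return math.exp(2.0 * total) * (x0 + total * B)


# Illustrative values only: H = 5 nonincreasing step sizes, B plays the role of B^res.
alphas = [0.05 / (k + 1) for k in range(5)]
path = worst_case_growth(1.0, alphas, 3.0)
bound = gronwall_bound(1.0, alphas, 3.0)
```

The bound follows by induction from $1+2a\leq e^{2a}$: each step multiplies the accumulated factor by at most $e^{2\alpha_{mH+t}}$ and adds at most $\alpha_{mH+t}B^{\mathrm{res}}_{H,\mathrm{M}}$ inside the bracket.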

###### Proposition 46 \(Episodewise MTD potential drift\)\.

Assume that $\omega^{\mathrm{fh}}_{H,\mathrm{M}}>0$, that $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and that

$$\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{M}}.\tag{414}$$

Then, for every episode $m$,

$$\mathbb{E}\left[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}\right]\leq\bigl(1-c_{H,\mathrm{M},1}^{\mathrm{fh}}\bar{\alpha}_{m}\bigr)M_{H,\mathrm{M}}(W_{m,0})+c_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{m}^{(2)}+c_{H,\mathrm{M},3}^{\mathrm{fh}}\bar{\alpha}_{m}^{(2)}M_{H,\mathrm{M}}(W_{m,0}).\tag{415}$$

Consequently,

$$\mathbb{E}\left[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}\right]\leq\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{8}\bar{\alpha}_{m}\right)M_{H,\mathrm{M}}(W_{m,0})+c_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{m}^{(2)}.\tag{416}$$

###### Proof\.

For each phase $t$ and each $\alpha\in[0,1]$, define

$$A_{t,\alpha,\mathrm{M}}(U):=U+\alpha\Gamma_{t,\mathrm{M}}(U)=(1-\alpha)U+\alpha G_{t,\mathrm{M}}(U).\tag{417}$$

By Proposition [42](https://arxiv.org/html/2605.06866#Thmtheorem42),

$$\lVert A_{t,\alpha,\mathrm{M}}(U)-A_{t,\alpha,\mathrm{M}}(U^{\prime})\rVert_{H,2,\infty}\leq\bigl(1-\kappa_{H,\mathrm{M}}\alpha\bigr)\lVert U-U^{\prime}\rVert_{H,2,\infty}.\tag{418}$$

Define the deterministic phasewise averaged trajectory over episode $m$ by

$$\bar{U}_{m,0}:=U_{m,0},\qquad\bar{U}_{m,t+1}:=A_{t,\alpha_{mH+t},\mathrm{M}}(\bar{U}_{m,t}),\qquad 0\leq t\leq H-1.\tag{419}$$

Since $G_{t,\mathrm{M}}(U^{\star}_{H,\mathrm{M}})=U^{\star}_{H,\mathrm{M}}$, the contraction estimate above, the Moreau-envelope comparison, and the smoothness of the envelope give the deterministic relaxed-map estimate

$$M_{H,\mathrm{M}}\bigl(A_{t,\alpha,\mathrm{M}}(U)-U^{\star}_{H,\mathrm{M}}\bigr)\leq(1-2\omega^{\mathrm{fh}}_{H,\mathrm{M}}\alpha+d^{\mathrm{fh}}_{H,\mathrm{M}}\alpha^{2})M_{H,\mathrm{M}}(U-U^{\star}_{H,\mathrm{M}}).\tag{420}$$

Since $\alpha_{mH+t}\leq\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{M}}\leq\omega^{\mathrm{fh}}_{H,\mathrm{M}}/d^{\mathrm{fh}}_{H,\mathrm{M}}$, this implies

$$M_{H,\mathrm{M}}\bigl(A_{t,\alpha_{mH+t},\mathrm{M}}(U)-U^{\star}_{H,\mathrm{M}}\bigr)\leq(1-\omega^{\mathrm{fh}}_{H,\mathrm{M}}\alpha_{mH+t})M_{H,\mathrm{M}}(U-U^{\star}_{H,\mathrm{M}}).\tag{421}$$

Therefore

$$M_{H,\mathrm{M}}(\bar{U}_{m,H}-U^{\star}_{H,\mathrm{M}})\leq\prod_{u=0}^{H-1}(1-\omega^{\mathrm{fh}}_{H,\mathrm{M}}\alpha_{mH+u})M_{H,\mathrm{M}}(W_{m,0})\leq\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{2}\bar{\alpha}_{m}\right)M_{H,\mathrm{M}}(W_{m,0}).\tag{422}$$

Now define

$$D_{m,t}:=U_{m,t}-\bar{U}_{m,t},\qquad D_{m,0}=0.\tag{423}$$

By Proposition [25](https://arxiv.org/html/2605.06866#Thmtheorem25), each statewise signed-categorical projection is affine. Applying this horizon-by-horizon shows that $U\mapsto\widehat{T}_{H,\mathrm{M}}(U;s,(r,s^{\prime}))$ and $U\mapsto(\mathcal{O}_{H,\mathrm{M}}U)(s)$ are affine for every admissible $(s,r,s^{\prime})$. Hence $F_{H,\mathrm{M}}$, $\Gamma_{t,\mathrm{M}}$, and $A_{t,\alpha,\mathrm{M}}$ are affine. Let $L_{m,t,\mathrm{M}}$ denote the linear part of $A_{t,\alpha_{mH+t},\mathrm{M}}$. Then

$$A_{t,\alpha_{mH+t},\mathrm{M}}(U)-A_{t,\alpha_{mH+t},\mathrm{M}}(U^{\prime})=L_{m,t,\mathrm{M}}(U-U^{\prime}),\tag{424}$$

and for every admissible $Z$

$$\lVert L_{m,t,\mathrm{M}}Z\rVert_{H,2,\infty}\leq\bigl(1-\kappa_{H,\mathrm{M}}\alpha_{mH+t}\bigr)\lVert Z\rVert_{H,2,\infty}.\tag{425}$$

Using the recursion for $U_{m,t+1}$, the definition of $\bar{U}_{m,t+1}$, and the decomposition from Lemma [44](https://arxiv.org/html/2605.06866#Thmtheorem44), we obtain

$$\begin{aligned}D_{m,t+1}&=U_{m,t+1}-\bar{U}_{m,t+1}\\ &=U_{m,t}+\alpha_{mH+t}F_{H,\mathrm{M}}\bigl(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)-\bar{U}_{m,t}-\alpha_{mH+t}\Gamma_{t,\mathrm{M}}(\bar{U}_{m,t})\\ &=U_{m,t}-\bar{U}_{m,t}+\alpha_{mH+t}\Bigl(\Gamma_{t,\mathrm{M}}(U_{m,t})-\Gamma_{t,\mathrm{M}}(\bar{U}_{m,t})\Bigr)\\ &\quad+\alpha_{mH+t}\Bigl(F_{H,\mathrm{M}}\bigl(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)-\Gamma_{t,\mathrm{M}}(U_{m,t})\Bigr)\\ &=A_{t,\alpha_{mH+t},\mathrm{M}}(U_{m,t})-A_{t,\alpha_{mH+t},\mathrm{M}}(\bar{U}_{m,t})+\alpha_{mH+t}\Bigl(F_{H,\mathrm{M}}\bigl(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)-\Gamma_{t,\mathrm{M}}(U_{m,t})\Bigr)\\ &=L_{m,t,\mathrm{M}}D_{m,t}+\alpha_{mH+t}\zeta_{m,t}^{\mathrm{M}}+\alpha_{mH+t}\xi_{m,t}^{\mathrm{M}}.\end{aligned}\tag{426}$$

By Lemma [45](https://arxiv.org/html/2605.06866#Thmtheorem45) and Lemma [43](https://arxiv.org/html/2605.06866#Thmtheorem43),

$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq C_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}_{m}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr),\tag{427}$$

and, since $C_{H,\mathrm{M},\mathrm{g}}\geq 1$,

$$4\lVert U_{m,0}\rVert_{H,2,\infty}+2B^{\mathrm{res}}_{H,\mathrm{M}}\leq 2C_{H,\mathrm{M},\mathrm{w}}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr).\tag{428}$$

Hence Lemma [44](https://arxiv.org/html/2605.06866#Thmtheorem44) gives

$$\lVert\zeta_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 2C_{H,\mathrm{M},\mathrm{w}}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr),\tag{429}$$

and

$$\lVert\xi_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4C_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}_{m}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr).\tag{430}$$

Therefore

$$\mathbb{E}\left[D_{m,t+1}\,\mid\,U_{m,0}\right]=L_{m,t,\mathrm{M}}\mathbb{E}\left[D_{m,t}\,\mid\,U_{m,0}\right]+\alpha_{mH+t}\mathbb{E}\left[\xi_{m,t}^{\mathrm{M}}\,\mid\,U_{m,0}\right],\tag{431}$$

because $\mathbb{E}\left[\zeta_{m,t}^{\mathrm{M}}\,\mid\,U_{m,0}\right]=0$. Hence

$$\begin{gathered}\left\lVert\mathbb{E}\left[D_{m,t+1}\,\mid\,U_{m,0}\right]\right\rVert_{H,2,\infty}\leq\bigl(1-\kappa_{H,\mathrm{M}}\alpha_{mH+t}\bigr)\left\lVert\mathbb{E}\left[D_{m,t}\,\mid\,U_{m,0}\right]\right\rVert_{H,2,\infty}\\ +4C_{H,\mathrm{M},\mathrm{w}}\alpha_{mH+t}\bar{\alpha}_{m}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr).\end{gathered}\tag{432}$$

Since $D_{m,0}=0$, induction gives

$$\left\lVert\mathbb{E}\left[D_{m,H}\,\mid\,U_{m,0}\right]\right\rVert_{H,2,\infty}\leq 4C_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}_{m}^{2}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr).\tag{433}$$
Iterating the recursion for $D_{m,t+1}$ and using the nonexpansiveness bound

$$\lVert L_{m,t,\mathrm{M}}(Z)\rVert_{H,2,\infty}\leq\lVert Z\rVert_{H,2,\infty},\tag{434}$$

the identity $D_{m,0}=0$, and the bounds from Lemma [44](https://arxiv.org/html/2605.06866#Thmtheorem44) gives the pathwise estimate

$$\lVert D_{m,H}\rVert_{H,2,\infty}\leq\sum_{t=0}^{H-1}\alpha_{mH+t}\Bigl(\lVert\zeta_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}+\lVert\xi_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\Bigr)\tag{435}$$

and therefore

$$\lVert D_{m,H}\rVert_{H,2,\infty}\leq\Bigl(2C_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}_{m}+4C_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}_{m}^{2}\Bigr)\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr).\tag{436}$$

Since $\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{M}}\leq\widehat{\alpha}_{H,\mathrm{M}}\leq 1/H$, we have $\bar{\alpha}_{m}\leq 1$, so

$$\lVert D_{m,H}\rVert_{H,2,\infty}\leq 6C_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}_{m}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr)\tag{437}$$

pathwise. Hence

$$\mathbb{E}\left[\lVert D_{m,H}\rVert_{H,2,\infty}^{2}\,\mid\,U_{m,0}\right]\leq 36C_{H,\mathrm{M},\mathrm{w}}^{2}\bar{\alpha}_{m}^{2}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr)^{2}.\tag{438}$$
Define

$$\bar{W}_{m,H}:=\bar{U}_{m,H}-U^{\star}_{H,\mathrm{M}},\qquad\delta_{m}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}{4L^{\mathrm{fh}}_{H,\mathrm{M}}}.\tag{439}$$

Because $M_{H,\mathrm{M}}$ is nonnegative, convex, $L^{\mathrm{fh}}_{H,\mathrm{M}}$-smooth with respect to $\left\lVert\cdot\right\rVert_{H,2,p^{\star}_{H}}$, and minimized at zero, its gradient satisfies

$$\left\lVert\nabla M_{H,\mathrm{M}}(\bar{W}_{m,H})\right\rVert^{2}_{*}\leq 2L^{\mathrm{fh}}_{H,\mathrm{M}}M_{H,\mathrm{M}}(\bar{W}_{m,H}),\tag{440}$$

where $\left\lVert\cdot\right\rVert_{*}$ denotes the dual norm of $\left\lVert\cdot\right\rVert_{H,2,p^{\star}_{H}}$. Therefore, since $W_{m+1,0}=\bar{W}_{m,H}+D_{m,H}$,

$$\begin{gathered}\mathbb{E}\bigl[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}\bigr]\leq M_{H,\mathrm{M}}(\bar{W}_{m,H})+\left\langle\nabla M_{H,\mathrm{M}}(\bar{W}_{m,H}),\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rangle\\ +\frac{L^{\mathrm{fh}}_{H,\mathrm{M}}}{2}\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].\end{gathered}\tag{441}$$

By Young’s inequality,

$$\begin{gathered}\left\langle\nabla M_{H,\mathrm{M}}(\bar{W}_{m,H}),\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rangle\leq\frac{\delta_{m}}{2}\left\lVert\nabla M_{H,\mathrm{M}}(\bar{W}_{m,H})\right\rVert^{2}_{*}+\frac{1}{2\delta_{m}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}\\ \leq\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}{4}M_{H,\mathrm{M}}(\bar{W}_{m,H})+\frac{2L^{\mathrm{fh}}_{H,\mathrm{M}}}{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}.\end{gathered}\tag{442}$$

Hence

$$\begin{gathered}\mathbb{E}[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}]\leq\left(1+\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}{4}\right)M_{H,\mathrm{M}}(\bar{W}_{m,H})\\ +\frac{2L^{\mathrm{fh}}_{H,\mathrm{M}}}{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}+\frac{L^{\mathrm{fh}}_{H,\mathrm{M}}}{2}\mathbb{E}\bigl[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].\end{gathered}\tag{443}$$

Since $\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}\leq 1$, we have

$$\left(1+\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}{4}\right)\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{2}\bar{\alpha}_{m}\right)\leq 1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}}{4}.\tag{444}$$

Also,

$$\begin{aligned}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,p^{\star}_{H}}&\leq\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\left\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rVert^{2}_{H,2,\infty}\\ &\leq 16\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}C^{2}_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}^{4}_{m}(1+\left\lVert U_{m,0}\right\rVert_{H,2,\infty})^{2},\end{aligned}\tag{445}$$

and

$$\begin{aligned}\mathbb{E}[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}]&\leq\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\mathbb{E}[\left\lVert D_{m,H}\right\rVert^{2}_{H,2,\infty}\mid U_{m,0}]\\ &\leq 36\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}C^{2}_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}^{2}_{m}(1+\left\lVert U_{m,0}\right\rVert_{H,2,\infty})^{2}.\end{aligned}\tag{446}$$

Using $\bar{\alpha}_{m}\leq 1$, $\bar{\alpha}^{2}_{m}\leq H\bar{\alpha}^{(2)}_{m}$, and $\bar{\alpha}^{3}_{m}\leq H\bar{\alpha}_{m}^{(2)}$, we obtain

$$\mathbb{E}[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}]\leq\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{4}\bar{\alpha}_{m}\right)M_{H,\mathrm{M}}(W_{m,0})+C_{H,\mathrm{M},\mathrm{p}}\bar{\alpha}^{(2)}_{m}(1+\left\lVert U_{m,0}\right\rVert_{H,2,\infty})^{2}.\tag{447}$$

Since $(1+x)^{2}\leq 2(1+x^{2})$ and

$$\begin{aligned}\left\lVert U_{m,0}\right\rVert^{2}_{H,2,\infty}&\leq 2\left\lVert W_{m,0}\right\rVert^{2}_{H,2,\infty}+2\left\lVert U^{\star}_{H,\mathrm{M}}\right\rVert^{2}_{H,2,\infty}\\ &\leq 4(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})M_{H,\mathrm{M}}(W_{m,0})+2\left\lVert U^{\star}_{H,\mathrm{M}}\right\rVert^{2}_{H,2,\infty},\end{aligned}\tag{448}$$

this yields

$$\mathbb{E}\left[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}\right]\leq\bigl(1-c_{H,\mathrm{M},1}^{\mathrm{fh}}\bar{\alpha}_{m}\bigr)M_{H,\mathrm{M}}(W_{m,0})+c_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{m}^{(2)}+c_{H,\mathrm{M},3}^{\mathrm{fh}}\bar{\alpha}_{m}^{(2)}M_{H,\mathrm{M}}(W_{m,0}),\tag{449}$$

which is the first stated inequality. Finally, since $(\alpha_{k})$ is nonincreasing and $\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{M}}$,

$$\bar{\alpha}_{m}^{(2)}\leq\alpha_{0}\bar{\alpha}_{m}\leq\bar{\alpha}_{H,\mathrm{M}}\bar{\alpha}_{m}\leq\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{64C_{H,\mathrm{M},\mathrm{p}}(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})}\bar{\alpha}_{m}.\tag{450}$$

Therefore

$$c_{H,\mathrm{M},3}^{\mathrm{fh}}\bar{\alpha}_{m}^{(2)}M_{H,\mathrm{M}}(W_{m,0})\leq\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{8}\bar{\alpha}_{m}M_{H,\mathrm{M}}(W_{m,0}),\tag{451}$$

which gives the second inequality. ∎
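
For readers who want the two scalar inequalities behind (422) and (444) spelled out, one possible derivation is sketched below. The shorthand $x_{u}$ and $S$ is introduced here only for illustration and does not appear in the paper; the argument assumes $0\leq S\leq 1$, which is the condition $\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}\leq 1$ used in the proof.

```latex
% Shorthand (an assumption of this sketch, not the paper's notation):
%   x_u := \omega^{fh}_{H,M} \alpha_{mH+u} \ge 0,
%   S   := \sum_{u=0}^{H-1} x_u = \omega^{fh}_{H,M} \bar\alpha_m \in [0,1].
\begin{aligned}
\prod_{u=0}^{H-1}(1-x_u)
  &\le \exp(-S)                 && \text{since } 0\le 1-x\le e^{-x} \text{ for } x\in[0,1],\\
  &\le 1-S+\tfrac{1}{2}S^{2}    && \text{since } S\ge 0,\\
  &\le 1-\tfrac{1}{2}S          && \text{since } S^{2}\le S \text{ for } S\in[0,1];\\[4pt]
\Bigl(1+\tfrac{S}{4}\Bigr)\Bigl(1-\tfrac{S}{2}\Bigr)
  &= 1-\tfrac{S}{4}-\tfrac{S^{2}}{8}
   \le 1-\tfrac{S}{4}.
\end{aligned}
```

The first chain gives the last step of (422), and the second is (444) with $S=\omega^{\mathrm{fh}}_{H,\mathrm{M}}\bar{\alpha}_{m}$.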

###### Proposition 47 \(Fixed\-horizon MTD finite\-iteration bound\)\.

Define

$$\begin{gathered}a_{H,\mathrm{M},1}^{\mathrm{fh}}:=r^{\mathrm{fh}}_{H,\mathrm{M}},\qquad a_{H,\mathrm{M},2}^{\mathrm{fh}}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{8},\qquad a_{H,\mathrm{M},3}^{\mathrm{fh}}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{8\bar{\alpha}_{H,\mathrm{M}}},\\ a_{H,\mathrm{M},4}^{\mathrm{fh}}:=2(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})c_{H,\mathrm{M},2}^{\mathrm{fh}}.\end{gathered}\tag{452}$$

If $a^{\mathrm{fh}}_{H,\mathrm{M},2}>0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and

$$\alpha_{0}\leq\frac{a_{H,\mathrm{M},2}^{\mathrm{fh}}}{a_{H,\mathrm{M},3}^{\mathrm{fh}}},\tag{453}$$

then, for all episodes $m\geq 0$,

$$\begin{aligned}\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}\bigl(\eta_{mH},\eta_{H,\mathrm{M}}^{\star}\bigr)^{2}\right]&\leq a_{H,\mathrm{M},1}^{\mathrm{fh}}\ell_{H,\mathrm{M},\infty}\bigl(\eta_{0},\eta_{H,\mathrm{M}}^{\star}\bigr)^{2}\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\\ &\quad+a_{H,\mathrm{M},4}^{\mathrm{fh}}\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr).\end{aligned}\tag{454}$$

###### Proof\.

Iterating the second bound in Proposition [46](https://arxiv.org/html/2605.06866#Thmtheorem46) gives

$$\mathbb{E}\left[M_{H,\mathrm{M}}(W_{m,0})\right]\leq M_{H,\mathrm{M}}(W_{0,0})\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)+c^{\mathrm{fh}}_{H,\mathrm{M},2}\sum_{i=0}^{m-1}\bar{\alpha}^{(2)}_{i}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr).\tag{455}$$

By the Moreau envelope comparison,

$$\mathbb{E}[\left\lVert W_{m,0}\right\rVert^{2}_{H,2,\infty}]\leq 2(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})\mathbb{E}\bigl[M_{H,\mathrm{M}}(W_{m,0})\bigr]\tag{456}$$

and

$$M_{H,\mathrm{M}}(W_{0,0})\leq\frac{1}{2(1+\vartheta_{H,\mathrm{M}})}\left\lVert W_{0,0}\right\rVert^{2}_{H,2,\infty}.\tag{457}$$

Combining these three bounds with the isometric embedding identity

$$\left\lVert W_{m,0}\right\rVert_{H,2,\infty}=\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star}_{H,\mathrm{M}})\tag{458}$$

yields the claim. ∎
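
To make the structure of (454) concrete, the following Python sketch evaluates its right-hand side for an arbitrary step-size schedule by running the one-step recursion that (455) unrolls. The function names, the schedule interface `alpha(k)`, and all numerical values are assumptions made for illustration; the constants passed in are placeholders rather than values computed from the paper's definitions.

```python
def episode_sums(alpha, H, m):
    """Return (bar-alpha_m, bar-alpha_m^(2)) from (399) for a schedule alpha(k)."""
    steps = [alpha(m * H + u) for u in range(H)]
    return sum(steps), sum(a * a for a in steps)


def boundary_bound(alpha, H, m, a1, a2, a4, err0_sq):
    """Evaluate the right-hand side of (454) after m episodes.

    err0_sq stands for the squared initial distance; a1, a2, a4 stand for
    the constants of (452). All are hypothetical inputs here.
    """
    bias, noise = a1 * err0_sq, 0.0
    for j in range(m):
        s1, s2 = episode_sums(alpha, H, j)
        decay = 1.0 - a2 * s1
        bias *= decay                      # product term of (454)
        noise = decay * noise + a4 * s2    # weighted sum term of (454)
    return bias + noise


# Illustrative constant schedule and placeholder constants.
value = boundary_bound(lambda k: 0.01, H=3, m=50,
                       a1=1.2, a2=0.1, a4=5.0, err0_sq=2.0)
```

With a constant schedule the noise term is a finite geometric sum, so the value returned is never larger than the closed form with residual $a_{H,\mathrm{M},4}^{\mathrm{fh}}\alpha/a_{H,\mathrm{M},2}^{\mathrm{fh}}$.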

###### Corollary 48 \(Boundary\-iterate step size consequences\)\.

Under the hypotheses of Proposition [47](https://arxiv.org/html/2605.06866#Thmtheorem47):

(a) if $\alpha_{k}\equiv\alpha$ and

$$\alpha\leq\frac{a_{H,\mathrm{M},2}^{\mathrm{fh}}}{a_{H,\mathrm{M},3}^{\mathrm{fh}}},\tag{459}$$

then for all episodes $m\geq 0$,

$$\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})^{2}\right]\leq a_{H,\mathrm{M},1}^{\mathrm{fh}}\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta_{H,\mathrm{M}}^{\star})^{2}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}H\alpha\bigr)^{m}+\frac{a_{H,\mathrm{M},4}^{\mathrm{fh}}}{a_{H,\mathrm{M},2}^{\mathrm{fh}}}\alpha.\tag{460}$$

For the two diminishing-step cases below, where $g$ is the step-size offset, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$.

(b) if $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/a_{H,\mathrm{M},2}^{\mathrm{fh}}$, and

$$g\geq\max\left\{1,\frac{\alpha a_{H,\mathrm{M},3}^{\mathrm{fh}}}{a_{H,\mathrm{M},2}^{\mathrm{fh}}}\right\},\tag{461}$$

then the boundary iterates satisfy

$$\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})^{2}\right]\leq a_{H,\mathrm{M},1}^{\mathrm{fh}}\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta_{H,\mathrm{M}}^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{a_{H,\mathrm{M},2}^{\mathrm{fh}}\alpha}+\frac{a_{H,\mathrm{M},4}^{\mathrm{fh}}H^{2}\alpha^{2}}{a_{H,\mathrm{M},2}^{\mathrm{fh}}\alpha-1}\cdot\frac{1}{\tau_{m}}.\tag{462}$$

(c) if $\alpha_{k}=\alpha/(k+g)^{z}$ with $z\in(0,1)$ and

$$g\geq\max\left\{1,\left(\frac{\alpha a_{H,\mathrm{M},3}^{\mathrm{fh}}}{a_{H,\mathrm{M},2}^{\mathrm{fh}}}\right)^{1/z},\left(\frac{2z}{a_{H,\mathrm{M},2}^{\mathrm{fh}}\alpha}\right)^{1/(1-z)}\right\},\tag{463}$$

then the boundary iterates satisfy

$$\begin{gathered}\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})^{2}\right]\leq a_{H,\mathrm{M},1}^{\mathrm{fh}}\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta_{H,\mathrm{M}}^{\star})^{2}\cdot\exp\left(-\frac{a_{H,\mathrm{M},2}^{\mathrm{fh}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right)\\ +\frac{2a_{H,\mathrm{M},4}^{\mathrm{fh}}H^{2}\alpha}{a_{H,\mathrm{M},2}^{\mathrm{fh}}}\cdot\frac{1}{\tau_{m}^{z}}.\end{gathered}\tag{464}$$

###### Proof.

For part (a), substituting $\bar{\alpha}_{m}=H\alpha$ and $\bar{\alpha}_{m}^{(2)}=H\alpha^{2}$ into Proposition [47](https://arxiv.org/html/2605.06866#Thmtheorem47) yields the claim.

For part (b), set $q=\tau_{0}/H$ and $\lambda=a_{H,\mathrm{M},2}^{\mathrm{fh}}\alpha$. The bounds

$$\frac{H\alpha}{\tau_{m}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{mH+g},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2}}\tag{465}$$

imply

$$\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\left(\frac{q}{m+q}\right)^{\lambda}.\tag{466}$$

The same elementary product-sum estimate gives

$$\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\frac{H\alpha^{2}}{\lambda-1}\cdot\frac{1}{m+q},\tag{467}$$

and the displayed bound follows from Proposition [47](https://arxiv.org/html/2605.06866#Thmtheorem47) after substituting $q=\tau_{0}/H$ and $\tau_{m}=H(m+q)$.

For part (c), keep the same $q$ and set $A=a_{H,\mathrm{M},2}^{\mathrm{fh}}\alpha H^{1-z}$. The bounds

$$\frac{H\alpha}{\tau_{m}^{z}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{(mH+g)^{z}},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2z}}\tag{468}$$

imply

$$\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\exp\left(-\frac{A}{1-z}\bigl((m+q)^{1-z}-q^{1-z}\bigr)\right).\tag{469}$$

The lower bound on $g$ implies $q\geq(2z/A)^{1/(1-z)}$. The elementary polynomial product-sum estimate gives

$$\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\frac{2H\alpha^{2}}{A}\cdot\frac{1}{(m+q)^{z}}.\tag{470}$$

Substituting the definitions of $A$ and $q$ into Proposition [47](https://arxiv.org/html/2605.06866#Thmtheorem47), using $\tau_{m}=H(m+q)$, and using $H^{2z}\leq H^{2}$, gives the displayed bound. ∎
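The product and product-sum estimates (466), (467), (469), and (470) can be sanity-checked numerically. The sketch below assumes that $\bar{\alpha}_{m}$ and $\bar{\alpha}_{m}^{(2)}$ denote the sums of $\alpha_{k}$ and $\alpha_{k}^{2}$ over the $H$ transitions of episode $m$; this reading is consistent with (465), (468), and part (a), but the definition itself lives in Proposition 47, outside this section. All numerical values (`H`, `g`, `a2`, the two values of $\alpha$, `z`, `m`) are hypothetical and chosen only so that the admissibility conditions hold.

```python
import math

def episode_sums(alpha, g, z, H, num_episodes):
    """Episodewise sums of alpha_k and alpha_k^2 for alpha_k = alpha / (k + g)^z.

    Assumption: abar[m] sums the H step sizes used in episode m.
    """
    abar, abar2 = [], []
    for m in range(num_episodes):
        steps = [alpha / (k + g) ** z for k in range(m * H, (m + 1) * H)]
        abar.append(sum(steps))
        abar2.append(sum(s * s for s in steps))
    return abar, abar2

def product_and_sum(a2, abar, abar2, m):
    """Left-hand sides of the product bound and the product-sum bound at episode m."""
    prod = 1.0
    for j in range(m):
        prod *= 1.0 - a2 * abar[j]
    total, tail = 0.0, 1.0  # tail = prod_{j=i+1}^{m-1} (1 - a2 * abar[j])
    for i in range(m - 1, -1, -1):
        total += abar2[i] * tail
        tail *= 1.0 - a2 * abar[i]
    return prod, total

# Hypothetical constants, not taken from the paper.
H, g, a2, m = 5, 50.0, 0.5, 200
q = (g + H - 1) / H  # q = tau_0 / H

# Part (b): alpha_k = alpha / (k + g), with lambda = a2 * alpha > 1.
alpha_b = 4.0
lam = a2 * alpha_b
abar, abar2 = episode_sums(alpha_b, g, 1.0, H, m)
prod_b, sum_b = product_and_sum(a2, abar, abar2, m)
rhs_prod_b = (q / (m + q)) ** lam                 # right-hand side of (466)
rhs_sum_b = H * alpha_b ** 2 / (lam - 1) / (m + q)  # right-hand side of (467)

# Part (c): alpha_k = alpha / (k + g)^z with z in (0, 1).
alpha_c, z = 0.5, 0.5
A = a2 * alpha_c * H ** (1 - z)
abar, abar2 = episode_sums(alpha_c, g, z, H, m)
prod_c, sum_c = product_and_sum(a2, abar, abar2, m)
rhs_prod_c = math.exp(-A / (1 - z) * ((m + q) ** (1 - z) - q ** (1 - z)))  # (469)
rhs_sum_c = 2 * H * alpha_c ** 2 / A / (m + q) ** z                        # (470)

print(prod_b <= rhs_prod_b, sum_b <= rhs_sum_b)
print(prod_c <= rhs_prod_c, sum_c <= rhs_sum_c)
```

A check of this kind only probes one parameter setting and is no substitute for the proof; it is useful mainly for catching bookkeeping errors in $q$, $\tau_{m}$, and the powers of $H$.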

###### Proof of Theorem [8](https://arxiv.org/html/2605.06866#Thmtheorem8).

Combine Proposition [47](https://arxiv.org/html/2605.06866#Thmtheorem47) with Corollary [48](https://arxiv.org/html/2605.06866#Thmtheorem48). Since $k=mH$ and $H$ is fixed, the rates in the episode index $m$ are equivalent to the stated rates in the number $k$ of transitions. ∎

###### Proof of Corollary [9](https://arxiv.org/html/2605.06866#Thmtheorem9).

By Corollary [48](https://arxiv.org/html/2605.06866#Thmtheorem48)(b),

$$\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})^{2}\right]=O\left(\frac{1}{m+1}\right).\tag{471}$$

By Jensen's inequality,

$$\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})\right]=O\left(\frac{1}{\sqrt{m+1}}\right).\tag{472}$$

Thus $m=O(\varepsilon^{-2})$ episodes suffice. Since $k=mH$ and $H$ is fixed, this is equivalent to $k=O(\varepsilon^{-2})$ transitions. ∎
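The last step can be spelled out. In the sketch below, $C$ stands for the constant hidden in the $O(\cdot)$ of (471); the symbol $C$ is introduced here purely for illustration and is not the paper's notation.

```latex
\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})\right]
\leq \sqrt{\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta_{H,\mathrm{M}}^{\star})^{2}\right]}
\leq \sqrt{\frac{C}{m+1}}
\leq \varepsilon
\quad\text{whenever}\quad
m \geq \frac{C}{\varepsilon^{2}} - 1 .
```

The first inequality is Jensen's inequality applied to the concave square root, and the threshold on $m$ is what gives the $O(\varepsilon^{-2})$ episode count.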

## Appendix F Representation error for projected categorical policy evaluation

In all projected categorical settings considered in this paper, the algorithm converges to the fixed point of a projected Bellman operator rather than directly to the exact return-distribution fixed point. This appendix records the corresponding deterministic representation error and combines it with the finite-iteration bounds of the main text.

Let $(\mathcal{Y},\ell)$ be a complete metric space. Let $T:\mathcal{Y}\to\mathcal{Y}$ be a contraction with modulus $\beta\in(0,1)$, and let $\Pi:\mathcal{Y}\to\mathcal{Y}$ be nonexpansive in $\ell$. Let $\eta^{\pi}$ be the fixed point of $T$ and $\eta^{\star}$ be the fixed point of $\Pi T$. Let

$$\varepsilon^{\mathrm{repr}}:=\ell(\Pi\eta^{\pi},\eta^{\pi}).\tag{473}$$
###### Proposition 49.

We have

$$\ell(\eta^{\star},\eta^{\pi})\leq\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}.\tag{474}$$

###### Proof.

Using $\eta^{\star}=\Pi T\eta^{\star}$ and $\eta^{\pi}=T\eta^{\pi}$,

$$\begin{aligned}\ell(\eta^{\star},\eta^{\pi})&=\ell(\Pi T\eta^{\star},\eta^{\pi})\\ &\leq\ell(\Pi T\eta^{\star},\Pi T\eta^{\pi})+\ell(\Pi T\eta^{\pi},\eta^{\pi})\\ &\leq\ell(T\eta^{\star},T\eta^{\pi})+\varepsilon^{\mathrm{repr}}\\ &\leq\beta\,\ell(\eta^{\star},\eta^{\pi})+\varepsilon^{\mathrm{repr}}.\end{aligned}\tag{475}$$

Rearranging gives the claim. ∎
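Proposition 49 can be illustrated on a toy problem. The sketch below is a scalar value-function analogue rather than a categorical return-distribution example: it assumes a hypothetical two-state chain with transition matrix `P`, rewards `r`, and discount `beta`, takes $T$ to be the expected-value Bellman operator (a $\beta$-contraction in the sup norm), and takes $\Pi$ to be coordinatewise clipping to $[0,2]$ (nonexpansive in the sup norm). None of these names or numbers come from the paper.

```python
def bellman(v, beta, r, P):
    """Toy contraction T: v -> r + beta * P v (modulus beta in the sup norm)."""
    n = len(v)
    return [r[s] + beta * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]

def clip(v, lo, hi):
    """Toy nonexpansive map Pi: coordinatewise clipping to [lo, hi]."""
    return [min(max(x, lo), hi) for x in v]

def sup_dist(u, v):
    """Sup-norm distance, playing the role of the metric l."""
    return max(abs(a - b) for a, b in zip(u, v))

def fixed_point(op, v0, iters=2000):
    """Approximate the fixed point of a contraction by plain iteration."""
    v = list(v0)
    for _ in range(iters):
        v = op(v)
    return v

# Hypothetical two-state example.
beta = 0.9
r = [1.0, 0.0]
P = [[0.5, 0.5], [0.1, 0.9]]

eta_pi = fixed_point(lambda v: bellman(v, beta, r, P), [0.0, 0.0])
eta_star = fixed_point(lambda v: clip(bellman(v, beta, r, P), 0.0, 2.0), [0.0, 0.0])

eps_repr = sup_dist(clip(eta_pi, 0.0, 2.0), eta_pi)  # analogue of (473)
gap = sup_dist(eta_star, eta_pi)                     # left-hand side of (474)
bound = eps_repr / (1.0 - beta)                      # right-hand side of (474)

print(gap, bound)
```

In examples like this one the bound is typically loose, since the factor $1/(1-\beta)$ accounts for the worst-case accumulation of projection error through the contraction.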

###### Corollary 50.

For every random iterate $\eta_{k}$,

$$\ell(\eta_{k},\eta^{\pi})^{2}\leq 2\ell(\eta_{k},\eta^{\star})^{2}+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}.\tag{476}$$

Consequently,

$$\mathbb{E}[\ell(\eta_{k},\eta^{\pi})^{2}]\leq 2\mathbb{E}[\ell(\eta_{k},\eta^{\star})^{2}]+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}.\tag{477}$$

###### Proof.

By the triangle inequality and $(a+b)^{2}\leq 2a^{2}+2b^{2}$,

$$\ell(\eta_{k},\eta^{\pi})^{2}\leq 2\ell(\eta_{k},\eta^{\star})^{2}+2\ell(\eta^{\star},\eta^{\pi})^{2}.\tag{478}$$

Now apply Proposition [49](https://arxiv.org/html/2605.06866#Thmtheorem49). ∎

#### Instantiation in the present paper.

In the discounted setting, $\ell$ is either $\ell_{\mathrm{C},\infty}$ or $\ell_{\mathrm{M},\infty}$, and $T$ is the corresponding discounted Bellman operator. In the fixed-horizon setting, $\ell$ is either $\ell_{H,\mathrm{C},\infty}$ or $\ell_{H,\mathrm{M},\infty}$, and $T$ is the corresponding fixed-horizon Bellman operator. In every case, the algorithmic term is controlled by the finite-iteration theorems, while the second term depends only on the categorical support family through $\varepsilon^{\mathrm{repr}}$.
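As one worked instance, the constant-step bound (460) can be combined with (477) in the fixed-horizon MTD setting. The display below is a sketch under stated assumptions: it takes $\ell=\ell_{H,\mathrm{M},\infty}$, $\eta^{\star}=\eta_{H,\mathrm{M}}^{\star}$, $k=mH$, and writes $\beta$ for the contraction modulus of the fixed-horizon Bellman operator in this metric, whose explicit value is not given in this section.

```latex
\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\pi})^{2}\right]
\leq
2\,a_{H,\mathrm{M},1}^{\mathrm{fh}}\,
\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta_{H,\mathrm{M}}^{\star})^{2}
\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}H\alpha\bigr)^{m}
+\frac{2\,a_{H,\mathrm{M},4}^{\mathrm{fh}}}{a_{H,\mathrm{M},2}^{\mathrm{fh}}}\,\alpha
+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}.
```

The first two terms are algorithmic and shrink with more episodes or a smaller step size; the last term is fixed by the support family and does not decay with $m$.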

## Appendix G Further discussion of the results

#### Bounded and affine perturbation regimes.

After the statewise isometric embeddings, both CTD and MTD fit the same asynchronous contractive stochastic-approximation template. The main difference is the geometry of the sampled perturbation. In the scalar categorical case, bounded supports together with the Cramér geometry yield uniform samplewise bounds in both the discounted and fixed-horizon undiscounted settings. In the discounted setting,

$$\lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s^{\prime}))-(\mathcal{O}_{\mathrm{C}}U)(s)\rVert_{2}\leq 2B_{\mathrm{C}}.\tag{479}$$

In the fixed-horizon setting,

$$\lVert F_{H,\mathrm{C}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}.\tag{480}$$

Thus CTD falls into a bounded-noise regime throughout the paper. In particular, the i.i.d. conditional second moment is uniformly bounded in the discounted analysis, and the fixed-horizon episodewise argument inherits only support-radius constants.

By contrast, in the multivariate signed-categorical case one obtains only affine perturbation bounds. In the discounted setting,

$$\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s^{\prime}))-(\mathcal{O}_{\mathrm{M}}U)(s)\rVert_{2}\leq 2\beta_{\mathrm{M}}\lVert U\rVert_{2,\infty}+B_{\mathrm{M}},\tag{481}$$

together with a local second-moment estimate

$$\mathbb{E}\left[\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\rVert_{2}^{2}\,\mid\,U_{k},S_{k}\right]\leq C_{1}+C_{2}\lVert U_{k}\rVert_{2,\infty}^{2}.\tag{482}$$

In the fixed-horizon setting, the analogous residual bound is

$$\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2\lVert U\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}},\tag{483}$$

again with an affine conditional second-moment estimate. This is why the MTD theorem constants contain additional growth, bias, and absorption quantities such as $C_{1},C_{2},B_{\mathrm{M}},\Upsilon_{\mathrm{M}}$, whereas the CTD bounds depend only on support-radius constants such as $B_{\mathrm{C}}$.

#### Smoothing constants and the number of asynchronous blocks.

The finite-iteration analysis is driven by a block-supremum contraction norm and a smoothed block-$\ell_{p}$ potential. The only explicit dimension loss in this smoothing step comes from the comparison between $\lVert\cdot\rVert_{2,\infty}$ and $\lVert\cdot\rVert_{2,p}$, and therefore depends only on the number of asynchronously updated blocks. In the discounted appendices, those blocks are indexed only by the state variable and the relevant count is $\lvert\mathcal{S}\rvert$. In the fixed-horizon appendices, the stacked process is first flattened over horizon and state, so the relevant count becomes $\lvert\mathcal{S}_{H}\rvert=H\lvert\mathcal{S}\rvert$. The inner representation dimension $d$, and in the multivariate case the reward dimension $q$, still affect problem-dependent constants, but they do not enter through the Moreau-envelope smoothing comparison itself.
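The norm comparison behind this remark can be checked on a small example. The sketch below assumes the usual mixed-norm definitions, $\lVert U\rVert_{2,\infty}=\max_{s}\lVert U(s)\rVert_{2}$ and $\lVert U\rVert_{2,p}=(\sum_{s}\lVert U(s)\rVert_{2}^{p})^{1/p}$, which the paper defines outside this section; under them, $\lVert U\rVert_{2,\infty}\leq\lVert U\rVert_{2,p}\leq n^{1/p}\lVert U\rVert_{2,\infty}$ for $n$ blocks. The array `U` and the exponent `p` are hypothetical.

```python
import math

def block_sup_norm(U):
    """Maximum over blocks of the Euclidean norm of each block."""
    return max(math.sqrt(sum(x * x for x in block)) for block in U)

def block_p_norm(U, p):
    """(Sum over blocks of the p-th power of each block's Euclidean norm)^(1/p)."""
    return sum(math.sqrt(sum(x * x for x in block)) ** p for block in U) ** (1.0 / p)

# Hypothetical example: n = 4 blocks (states), inner dimension d = 3.
U = [
    [1.0, -2.0, 0.5],
    [0.0, 0.3, 0.1],
    [2.0, 2.0, -1.0],
    [0.2, 0.0, 0.0],
]
n, p = len(U), 8.0

lower = block_sup_norm(U)        # ||U||_{2,inf}
middle = block_p_norm(U, p)      # ||U||_{2,p}
upper = n ** (1.0 / p) * lower   # n^(1/p) * ||U||_{2,inf}

print(lower, middle, upper)
```

The comparison factor `n ** (1.0 / p)` involves only the number of blocks, not the inner dimension of each block, which matches the observation that $d$ does not enter through this step.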

#### Discounted and fixed-horizon contraction mechanisms.

The fixed-horizon undiscounted theorems should not be read as the discounted results with $\gamma=1$. In the discounted setting, the contraction comes directly from the Bellman discount factor and each update modifies only one sampled state block. In the fixed-horizon setting there is no discount-driven contraction. Instead, the contraction is recovered by weighting the horizon layers and exploiting the fact that horizon $h$ bootstraps from horizon $h-1$. The corresponding online recursion is also structurally different: one sampled transition updates the entire horizon stack at the sampled state. Thus the fixed-horizon setting is different both in its contraction mechanism and in its stochastic approximation structure.
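The layer-weighting idea can be illustrated with a scalar expected-value analogue. The sketch below is not the paper's construction: it assumes geometric weights $\rho^{h}$ with a hypothetical $\rho\in(0,1)$, a hypothetical two-state chain `P` with rewards `r`, and an undiscounted stacked operator in which layer $h$ bootstraps from layer $h-1$ and layer $0$ is identically zero. Under these assumptions the stacked operator contracts with modulus $\rho$ in the weighted sup norm, even though there is no discount factor.

```python
import random

def stacked_operator(U, r, P):
    """Undiscounted fixed-horizon backup: new layer h uses old layer h-1.

    U[h] holds the values for horizon h+1; horizon 0 is identically zero.
    """
    n = len(r)
    out = []
    for h in range(len(U)):
        prev = [0.0] * n if h == 0 else U[h - 1]
        out.append([r[s] + sum(P[s][t] * prev[t] for t in range(n)) for s in range(n)])
    return out

def weighted_dist(U, V, rho):
    """Weighted sup distance: max over layers of rho^(h+1) times the sup distance."""
    return max(
        rho ** (h + 1) * max(abs(a - b) for a, b in zip(U[h], V[h]))
        for h in range(len(U))
    )

# Hypothetical problem data and weight.
random.seed(0)
H, rho = 4, 0.8
r = [1.0, 0.0]
P = [[0.5, 0.5], [0.1, 0.9]]

U = [[random.uniform(-1.0, 1.0) for _ in r] for _ in range(H)]
V = [[random.uniform(-1.0, 1.0) for _ in r] for _ in range(H)]

before = weighted_dist(U, V, rho)
after = weighted_dist(stacked_operator(U, r, P), stacked_operator(V, r, P), rho)

print(after <= rho * before)
```

The contraction comes entirely from the shift between layers: the difference at layer $h$ after one backup is controlled by the difference at layer $h-1$ before it, and the weights convert that shift into a factor of $\rho$.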

#### Interpretation of the step size regimes.

Across all theorem statements, the product-sum bounds imply the same qualitative picture. Constant step sizes yield geometric convergence to a controllable $O(\alpha)$ neighborhood of the projected fixed point. Linearly-diminishing step sizes yield the strongest asymptotic decay, namely $O(1/k)$ in the squared error and hence $O(\varepsilon^{-2})$ sample complexity for mean error at most $\varepsilon$. Polynomially-diminishing step sizes trade a weaker asymptotic decay for milder admissibility requirements. In the discounted Markovian case, these conclusions hold up to the expected logarithmic mixing-time overhead inherited from the Markovian finite-iteration theorem. This is the standard price of replacing i.i.d. sampling by a trajectory-generated sample path.
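To make the three regimes concrete, the right-hand sides of the fixed-horizon MTD bounds (460), (462), and (464) can be evaluated directly. The sketch below uses hypothetical values for the constants $a_{H,\mathrm{M},1}^{\mathrm{fh}},\dots,a_{H,\mathrm{M},4}^{\mathrm{fh}}$ (written `a1` to `a4`), for the initial squared distance `d0sq`, and for `H`, `g`, `z`, and the step-size scales; the true constants are defined elsewhere in the paper and are not reproduced here.

```python
import math

def bound_constant(m, alpha, H, a1, a2, a4, d0sq):
    """Right-hand side of (460): geometric decay plus an O(alpha) floor."""
    return a1 * d0sq * (1.0 - a2 * H * alpha) ** m + (a4 / a2) * alpha

def bound_linear(m, alpha, g, H, a1, a2, a4, d0sq):
    """Right-hand side of (462) for alpha_k = alpha / (k + g)."""
    tau0 = g + H - 1
    tau_m = m * H + tau0
    lam = a2 * alpha
    return a1 * d0sq * (tau0 / tau_m) ** lam + a4 * H ** 2 * alpha ** 2 / (lam - 1.0) / tau_m

def bound_poly(m, alpha, g, z, H, a1, a2, a4, d0sq):
    """Right-hand side of (464) for alpha_k = alpha / (k + g)^z, 0 < z < 1."""
    tau0 = g + H - 1
    tau_m = m * H + tau0
    decay = math.exp(-a2 * alpha / (1.0 - z) * (tau_m ** (1.0 - z) - tau0 ** (1.0 - z)))
    return a1 * d0sq * decay + 2.0 * a4 * H ** 2 * alpha / a2 / tau_m ** z

# Hypothetical constants chosen to satisfy (459), (461), and (463) with a3 = 1.
a1, a2, a4, d0sq, H, g = 1.0, 0.5, 1.0, 1.0, 5, 50.0

for m in (10, 100, 1000, 10000):
    print(
        m,
        bound_constant(m, 0.05, H, a1, a2, a4, d0sq),
        bound_linear(m, 4.0, g, H, a1, a2, a4, d0sq),
        bound_poly(m, 0.5, g, 0.5, H, a1, a2, a4, d0sq),
    )
```

With these illustrative values the constant-step bound flattens at the floor $(a_{4}/a_{2})\alpha$, while the two diminishing-step bounds keep decreasing in $m$, at rates governed by $1/\tau_{m}$ and $1/\tau_{m}^{z}$ respectively.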

#### Algorithmic and representation error.

The finite-iteration theorems control distance to the fixed point of a projected Bellman operator. The representation-error decomposition then separates total error into an algorithmic term and a deterministic approximation term. This separation is conceptually useful: the stochastic-approximation analysis explains how quickly the recursion approaches the projected fixed point for a fixed support family, whereas the deterministic term isolates the bias introduced by the categorical approximation itself.
