A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning
Summary
This paper presents a finite-iteration theory for asynchronous categorical distributional temporal-difference learning, bridging the gap between existing theoretical frameworks and practical online implementations.
# A Finite-Iteration Theory for Asynchronous Categorical Distributional Temporal-Difference Learning
Source: [https://arxiv.org/html/2605.06866](https://arxiv.org/html/2605.06866)
Ege C. Kaya, Abolfazl Hashemi. Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47906. kayae@purdue.edu, abolfazl@purdue.edu
###### Abstract
Recent non\-asymptotic analyses have substantially advanced the theory of distributional policy evaluation, but they largely concern synchronous full\-state updates under a generative model, model\-based estimators, accelerated variants, or different approximation architectures\. Standard categorical temporal\-difference learning is typically used in a different regime\. It asynchronously performs a single\-state update at each iteration and, in online settings, is driven by a Markovian trajectory\. This leaves an important gap between existing finite\-iteration theory and the categorical recursions most closely aligned with practical distributional temporal\-difference implementations\. We bridge this gap for two categorical policy\-evaluation methods: scalar categorical temporal\-difference learning in the Cramér geometry and multivariate signed\-categorical temporal\-difference learning in the maximum mean discrepancy geometry\. After suitable isometric embeddings, both algorithms take the form of asynchronous single\-state stochastic\-approximation recursions that contract in a statewise supremum norm\. This permits finite\-iteration guarantees in discounted problems under both i\.i\.d\. and Markovian state sampling, and in undiscounted fixed\-horizon problems under i\.i\.d\. episodic sampling\.
## 1 Introduction
Categorical representations are among the central approximation families in distributional reinforcement learning \(RL\)\[[4](https://arxiv.org/html/2605.06866#bib.bib6),[39](https://arxiv.org/html/2605.06866#bib.bib7),[15](https://arxiv.org/html/2605.06866#bib.bib40),[14](https://arxiv.org/html/2605.06866#bib.bib41),[51](https://arxiv.org/html/2605.06866#bib.bib42),[30](https://arxiv.org/html/2605.06866#bib.bib43),[40](https://arxiv.org/html/2605.06866#bib.bib44),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\. In the scalar case they underlie the categorical methods initiated by C51\[[4](https://arxiv.org/html/2605.06866#bib.bib6)\], while in the multivariate case they support signed\-measure constructions for vector\-valued returns\[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\. These methods replace infinite\-dimensional return\-distribution objects with finite\-dimensional ones while preserving the geometry that governs projected distributional policy evaluation\[[21](https://arxiv.org/html/2605.06866#bib.bib35),[23](https://arxiv.org/html/2605.06866#bib.bib36),[24](https://arxiv.org/html/2605.06866#bib.bib37),[43](https://arxiv.org/html/2605.06866#bib.bib38),[32](https://arxiv.org/html/2605.06866#bib.bib39)\]\. For these categorical methods, the asymptotic picture is by now well understood\[[20](https://arxiv.org/html/2605.06866#bib.bib17),[17](https://arxiv.org/html/2605.06866#bib.bib18),[47](https://arxiv.org/html/2605.06866#bib.bib19),[22](https://arxiv.org/html/2605.06866#bib.bib20),[6](https://arxiv.org/html/2605.06866#bib.bib21),[28](https://arxiv.org/html/2605.06866#bib.bib22),[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\. The scalar categorical temporal\-difference \(CTD\) method admits a projected Bellman contraction in the Cramér metric together with asymptotic guarantees\[[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\. 
The multivariate signed\-categorical temporal\-difference \(MTD\) method admits the analogous maximum mean discrepancy \(MMD\)\-based contraction and theory\[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]\. What remains less well understood is the finite\-iteration behavior of these TD methods\.
This matters because standard TD learning is not a synchronous full\-state procedure\[[46](https://arxiv.org/html/2605.06866#bib.bib23),[45](https://arxiv.org/html/2605.06866#bib.bib24),[48](https://arxiv.org/html/2605.06866#bib.bib25),[10](https://arxiv.org/html/2605.06866#bib.bib28)\]\. In practice, each update modifies only the sampled state, and in online RL, the sampled states are generated by a Markovian trajectory\. Finite\-iteration guarantees for asynchronous, trajectory\-driven updates are therefore the natural benchmark for understanding how quickly CTD and MTD approach their projected fixed point in regimes relevant to practice\. Several recent results address neighboring finite\-iteration questions\[[37](https://arxiv.org/html/2605.06866#bib.bib45),[7](https://arxiv.org/html/2605.06866#bib.bib46),[44](https://arxiv.org/html/2605.06866#bib.bib47),[34](https://arxiv.org/html/2605.06866#bib.bib48),[16](https://arxiv.org/html/2605.06866#bib.bib49),[36](https://arxiv.org/html/2605.06866#bib.bib8),[8](https://arxiv.org/html/2605.06866#bib.bib9),[52](https://arxiv.org/html/2605.06866#bib.bib10),[41](https://arxiv.org/html/2605.06866#bib.bib14),[35](https://arxiv.org/html/2605.06866#bib.bib11)\]\. These works establish that non\-asymptotic distributional analysis is feasible, but do not provide a finite\-iteration theory for the standard asynchronous categorical recursion driven by one\-state updates\.
The present paper is organized in two halves. The first treats discounted policy evaluation under i.i.d. and Markovian sampling, with a framework guided by Chen et al. \[[12](https://arxiv.org/html/2605.06866#bib.bib3),[13](https://arxiv.org/html/2605.06866#bib.bib2),[11](https://arxiv.org/html/2605.06866#bib.bib1)\], Robbins and Monro \[[38](https://arxiv.org/html/2605.06866#bib.bib26)\], Ljung \[[29](https://arxiv.org/html/2605.06866#bib.bib27)\], Borkar \[[9](https://arxiv.org/html/2605.06866#bib.bib30)\], and Kushner and Yin \[[26](https://arxiv.org/html/2605.06866#bib.bib31)\]. The second treats a tractable instance of undiscounted policy evaluation, in the finite, fixed-horizon episodic regime. Across both halves, the main theme remains the same: after suitable statewise isometric embeddings, CTD and MTD each constitute an asynchronous stochastic approximation (SA) recursion in a block-supremum geometry.
Practical interpretation of the sampling regimes. Each sampling model considered in this work has a practical RL interpretation. The discounted i.i.d. regime corresponds to a replay-based or generative-model benchmark, where updates are formed from approximately independent samples drawn from a buffer or simulator; it is also akin to the synchronous analyses often used to study TD-style methods theoretically. The discounted Markovian regime is the standard online setting for TD learning, in which samples are generated sequentially along a single behavior trajectory. In the finite-horizon undiscounted case, the episodic regime is the standard reset-based formulation of finite-horizon RL, where interaction proceeds for a fixed horizon $H$ and then restarts from an initial distribution. Our main claim is that, once CTD and MTD are written in the right statewise embedding, these update rules become asynchronous SA recursions with contractive structure enabling finite-iteration control.
Contributions\.
1. We establish finite-iteration guarantees for asynchronous CTD and MTD under discounted i.i.d. and Markovian sampling.
2. We establish finite-iteration guarantees for undiscounted fixed-horizon CTD and MTD under i.i.d. episodic sampling.
3. We record a deterministic representation-error decomposition that turns the projected fixed-point bounds into total-error bounds with an explicit projection bias term.
## 2 Related work
Two lines of work are most directly relevant. On the finite-iteration side, Chen et al. \[[12](https://arxiv.org/html/2605.06866#bib.bib3),[13](https://arxiv.org/html/2605.06866#bib.bib2),[11](https://arxiv.org/html/2605.06866#bib.bib1)\] develop non-asymptotic SA tools for contractive recursions under i.i.d. and Markovian sampling, building on the broader TD and SA literature \[[7](https://arxiv.org/html/2605.06866#bib.bib46),[37](https://arxiv.org/html/2605.06866#bib.bib45),[38](https://arxiv.org/html/2605.06866#bib.bib26),[29](https://arxiv.org/html/2605.06866#bib.bib27),[9](https://arxiv.org/html/2605.06866#bib.bib30),[26](https://arxiv.org/html/2605.06866#bib.bib31)\]. On the distributional side, Rowland et al. \[[39](https://arxiv.org/html/2605.06866#bib.bib7)\] and Bellemare et al. \[[5](https://arxiv.org/html/2605.06866#bib.bib5)\] establish the scalar categorical projection and Cramér contraction theory underlying CTD, while Wiltzer et al. \[[49](https://arxiv.org/html/2605.06866#bib.bib4)\] provide the multivariate signed-categorical MMD framework for MTD. For the undiscounted half, our setup follows the fixed-horizon viewpoint of De Asis et al. \[[18](https://arxiv.org/html/2605.06866#bib.bib55)\]. Our contribution is to bring these threads together for the tabular asynchronous categorical recursions that are used in practice, rather than for synchronous or idealized surrogates.
Recent non-asymptotic distributional analyses study several neighboring regimes. Peng et al. \[[36](https://arxiv.org/html/2605.06866#bib.bib8)\] and Böck and Heitzinger \[[8](https://arxiv.org/html/2605.06866#bib.bib9)\] analyze CTD with generative-model access, where updates are synchronous or accelerated. Zhang et al. \[[52](https://arxiv.org/html/2605.06866#bib.bib10)\] and Rowland et al. \[[41](https://arxiv.org/html/2605.06866#bib.bib14)\] study model-based or direct-estimation procedures with stronger sampling access. Wu et al. \[[50](https://arxiv.org/html/2605.06866#bib.bib15)\] focus on offline policy evaluation, Peng et al. \[[35](https://arxiv.org/html/2605.06866#bib.bib11)\] study linear function approximation, and Kastner et al. \[[25](https://arxiv.org/html/2605.06866#bib.bib12)\] consider a KL-based categorical analysis with a different divergence and asymptotic emphasis. By contrast, we analyze the exact asynchronous recursions of CTD and MTD.
## 3 Discounted categorical policy evaluation
We consider a discounted Markov decision process \[[45](https://arxiv.org/html/2605.06866#bib.bib24)\] $(\mathcal{S},\mathcal{A},P,R,\gamma)$ with finite state space $\mathcal{S}$, finite action space $\mathcal{A}$, reward function $R(s,a)$, transition kernel $P(\cdot\mid s,a)$, and discount factor $\gamma\in(0,1)$. We assume that $R:\mathcal{S}\times\mathcal{A}\to[0,1]$ in the scalar case, and $R:\mathcal{S}\times\mathcal{A}\to[0,1]^{q}$, $q\geq 2$, in the multivariate case. A policy $\pi$ is fixed throughout. The induced state trajectory is either i.i.d. with law $\rho$ or a Markov chain with stationary distribution $\mu_{\mathcal{S}}$. In the Markovian case we assume irreducibility and aperiodicity. Together with the finiteness of $\mathcal{S}$, this implies that the Markov chain resulting from the MDP with fixed policy $\pi$ mixes geometrically \[[27](https://arxiv.org/html/2605.06866#bib.bib29)\], i.e., that there exist constants $C_{\mathrm{mix}}\geq 1$ and $\sigma_{\mathrm{mix}}\in(0,1)$ such that

$$\sup_{x\in\mathcal{S}}\left\lVert\Pr(S_{k}\in\cdot\mid S_{0}=x)-\mu_{\mathcal{S}}(\cdot)\right\rVert_{\mathrm{TV}}\leq C_{\mathrm{mix}}\sigma_{\mathrm{mix}}^{k}\qquad\text{for all }k\geq 0.\tag{1}$$

For $\delta>0$, we define the associated mixing time

$$t_{\delta}:=\min\Bigl\{k\geq 0:\sup_{x\in\mathcal{S}}\left\lVert\Pr(S_{k}\in\cdot\mid S_{0}=x)-\mu_{\mathcal{S}}(\cdot)\right\rVert_{\mathrm{TV}}\leq\delta\Bigr\}.\tag{2}$$
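As an illustration of definition (2), the following minimal sketch computes the mixing time of a small finite chain by brute force. It assumes the chain is given as a row-stochastic NumPy matrix and uses the half-$\ell_1$ convention for total variation; the function names `stationary_distribution` and `mixing_time` and the `max_steps` cutoff are hypothetical choices made for this example, not part of the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an irreducible row-stochastic matrix P."""
    n = P.shape[0]
    # Solve mu P = mu together with sum(mu) = 1 as one least-squares system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def mixing_time(P, delta, max_steps=10_000):
    """Smallest k >= 0 with max_x TV(P^k(x, .), mu) <= delta, as in (2)."""
    mu = stationary_distribution(P)
    Pk = np.eye(P.shape[0])  # P^0: the chain started at each state x
    for k in range(max_steps + 1):
        # Total variation taken as half the l1 distance (an assumed convention).
        tv = 0.5 * np.abs(Pk - mu).sum(axis=1).max()
        if tv <= delta:
            return k
        Pk = Pk @ P
    raise RuntimeError("chain did not mix within max_steps")

# Two-state example: second eigenvalue 0.7, so TV decays like (2/3) * 0.7**k.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(mixing_time(P, delta=0.1))
```

For this two-state chain the worst-case total variation is $(2/3)\cdot 0.7^{k}$, which is consistent with the geometric form of (1) and makes the logarithmic growth of $t_{\delta}$ in $1/\delta$ visible.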
For both CTD and MTD, we work with a block-supremum contraction metric $\ell_{\infty}$, which admits a statewise isometric embedding $I$ to a product space with contraction norm $\lVert\cdot\rVert_{2,\infty}$, and a statewise projection $\Pi^{\Theta}$ onto the space of $\Theta$-supported, state-indexed representations $\mathcal{F}^{\mathcal{S}}_{\Theta}$. We can then compose the distributional Bellman operator $T^{\pi}$ to obtain the embedded, projected operator

$$\mathcal{O}:=I\circ\Pi^{\Theta}T^{\pi}\circ I^{-1}.\tag{3}$$

This operator then admits a one-step sampled Bellman target $\widehat{T}(U_{k};s,(R_{k},S_{k+1}))$ computed from a current estimate $U_{k}$ and a random sample $(R_{k},S_{k+1})$, satisfying

$$\mathbb{E}\left[\widehat{T}(U_{k};S_{k},(R_{k},S_{k+1}))\mid U_{k},S_{k}=s\right]=(\mathcal{O}U_{k})(s),\tag{4}$$

and a recursion that takes the form

$$U_{k+1}=U_{k}+\alpha_{k}P_{S_{k}}\bigl(\widehat{T}(U_{k};S_{k},(R_{k},S_{k+1}))-U_{k}(S_{k})\bigr),\tag{5}$$

where for each $s\in\mathcal{S}$, $P_{s}$ denotes the coordinate projector onto block $s$.
The point of the discounted analysis is that the exact one-state recursion already has the ingredients needed for finite-iteration SA bounds. Concretely, the proof uses the contraction of the averaged operator in $\lVert\cdot\rVert_{2,\infty}$, a Moreau-envelope smoothing of $\lVert\cdot\rVert_{2,\infty}$ based on $\lVert\cdot\rVert_{2,p}$ \[[9](https://arxiv.org/html/2605.06866#bib.bib30),[26](https://arxiv.org/html/2605.06866#bib.bib31),[3](https://arxiv.org/html/2605.06866#bib.bib32),[31](https://arxiv.org/html/2605.06866#bib.bib33),[2](https://arxiv.org/html/2605.06866#bib.bib34)\], an affine conditional second-moment bound or a centered pathwise perturbation bound depending on the method, the samplewise $1$-Lipschitzness of the one-step target map in $\lVert\cdot\rVert_{2,\infty}$, and the geometric mixing in the Markovian case. More precisely, with $p^{\star}:=\max\{2,\lceil\log\lvert\mathcal{S}\rvert\rceil\}$, the smoothing argument introduces a parameter $\vartheta>0$ through the generalized Moreau envelope $M_{\vartheta,p^{\star}}$ of the squared block-supremum distance. More details about these common ingredients are deferred to Appendix [A](https://arxiv.org/html/2605.06866#A1), while the formal verification of the discounted finite-iteration results for CTD and MTD is deferred to Appendices [B](https://arxiv.org/html/2605.06866#A2) and [C](https://arxiv.org/html/2605.06866#A3), respectively.
### 3.1 Discounted CTD
For each state $s\in\mathcal{S}$, fix an ordered support $\Theta(s)=\{\theta_{1}(s)<\cdots<\theta_{d}(s)\}\subset\mathbb{R}$. Here $\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta}$ is the class of state-indexed categorical laws supported on these statewise grids. The contraction metric $\ell_{\mathrm{C},\infty}$ is the supremum Cramér metric, the embedding $I_{\mathrm{C}}$ is the standard cumulative-mass isometry $I_{\mathrm{C},s}$ \[[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\] applied to all states, and the statewise projection $\Pi_{\mathrm{C}}^{\Theta}$ is the usual linear-interpolation categorical projection $\Pi_{\mathrm{C}}^{\Theta(s)}$ applied to all states.
Given a sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, the sampled Bellman target is

$$\widehat{T}_{\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{\mathrm{C},S_{k}}\Bigl(\Pi_{\mathrm{C}}^{\Theta(S_{k})}\bigl((f_{R_{k},\gamma})_{\#}I_{\mathrm{C},S_{k+1}}^{-1}(U_{k}(S_{k+1}))\bigr)\Bigr),\tag{6}$$

where $f_{r,\gamma}(z)=r+\gamma z$ and $\#$ is the measure-pushforward operator.
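The single-state update built from (5) and (6) can be sketched in probability coordinates as follows. This is a minimal illustration and not the authors' implementation: it assumes every state carries a grid of the same size stored as rows of a NumPy array, and it writes the step as the familiar mixture update on probability vectors, which matches (5) under the assumption that the cumulative-mass embedding is linear. The names `categorical_projection` and `ctd_update` are hypothetical.

```python
import numpy as np

def categorical_projection(atoms, probs, grid):
    """Linear-interpolation projection of sum_j probs[j] * delta_{atoms[j]} onto grid."""
    out = np.zeros(len(grid))
    # Atoms outside the grid are moved to the nearest endpoint.
    atoms = np.clip(atoms, grid[0], grid[-1])
    for z, p in zip(atoms, probs):
        hi = np.searchsorted(grid, z, side="left")  # first index with grid[hi] >= z
        if grid[hi] == z:
            out[hi] += p
        else:
            lo = hi - 1
            w = (grid[hi] - z) / (grid[hi] - grid[lo])  # weight on the left neighbor
            out[lo] += p * w
            out[hi] += p * (1.0 - w)
    return out

def ctd_update(eta, grids, s, r, s_next, gamma, alpha):
    """One asynchronous CTD step: only the row of the sampled state s changes."""
    shifted_atoms = r + gamma * grids[s_next]  # pushforward by f_{r,gamma}(z) = r + gamma z
    target = categorical_projection(shifted_atoms, eta[s_next], grids[s])
    new_eta = eta.copy()
    new_eta[s] = (1.0 - alpha) * eta[s] + alpha * target
    return new_eta

# Two states sharing the grid {0, 1, ..., 10}; rewards in [0, 1] and gamma = 0.9.
grids = np.tile(np.linspace(0.0, 10.0, 11), (2, 1))
eta = np.full((2, 11), 1.0 / 11.0)
eta = ctd_update(eta, grids, s=0, r=1.0, s_next=1, gamma=0.9, alpha=0.5)
```

Only row `s` of `eta` is modified, which is the asynchronous feature that distinguishes this recursion from synchronous full-state updates.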
The next theorem states the discounted CTD convergence rates for various step size regimes\. Appendix[B](https://arxiv.org/html/2605.06866#A2)verifies the contraction, Lipschitz continuity and perturbation ingredients and records the explicit constants\.
###### Theorem 1 (Discounted asynchronous CTD).
Let $\eta^{\star}\in\mathcal{F}_{\mathrm{C},\Theta}^{\mathcal{S}}$ denote the unique fixed point of $\Pi_{\mathrm{C}}^{\Theta}T^{\pi}$, and let $\eta_{k}:=I_{\mathrm{C}}^{-1}(U_{k})$ be generated by the asynchronous recursion ([5](https://arxiv.org/html/2605.06866#S3.E5)) with $\widehat{T}=\widehat{T}_{\mathrm{C}}$.
(i) i.i.d. sampling. Suppose $(S_{k})_{k\geq 0}$ are i.i.d. with law $\rho$ on $\mathcal{S}$, and $\min_{s\in\mathcal{S}}\rho(s)>0$. There exist explicit constants $C^{\mathrm{iid}}_{\mathrm{C}},c^{\mathrm{iid}}_{\mathrm{C}},\bar{\alpha}^{\mathrm{iid}}_{\mathrm{C}}>0$ and explicit thresholds $h^{\mathrm{iid}}_{\mathrm{C}}(\alpha)$ and $h^{\mathrm{iid}}_{\mathrm{C}}(\alpha,z)$ for which the following results hold in three step size regimes.

Constant step size. If $\alpha_{k}\equiv\alpha\leq\bar{\alpha}^{\mathrm{iid}}_{\mathrm{C}}$, then the iterates converge geometrically to an $O(\alpha)$ neighborhood:

$$\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{C}}\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}(1-c^{\mathrm{iid}}_{\mathrm{C}}\alpha)^{k}+C^{\mathrm{iid}}_{\mathrm{C}}\alpha.\tag{7}$$

Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{iid}}_{\mathrm{C}}$, and $h\geq h^{\mathrm{iid}}_{\mathrm{C}}(\alpha)$, then the leading residual term decays as $O(1/k)$:

$$\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{C}}\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{c^{\mathrm{iid}}_{\mathrm{C}}\alpha}+\frac{C^{\mathrm{iid}}_{\mathrm{C}}\alpha^{2}}{c^{\mathrm{iid}}_{\mathrm{C}}\alpha-1}\cdot\frac{1}{k+h}.\tag{8}$$

Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{iid}}_{\mathrm{C}}(\alpha,z)$, then the leading residual term decays as $O(1/k^{z})$:

$$\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{C}}\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{\mathrm{iid}}_{\mathrm{C}}\alpha}{1-z}\bigl((k+h)^{1-z}-h^{1-z}\bigr)\right)+\frac{C^{\mathrm{iid}}_{\mathrm{C}}\alpha}{(k+h)^{z}}.\tag{9}$$
(ii) Markovian trajectory sampling. Suppose $(S_{k})_{k\geq 0}$ is irreducible and aperiodic with stationary distribution $\mu_{\mathcal{S}}$ satisfying $\min_{s\in\mathcal{S}}\mu_{\mathcal{S}}(s)>0$ and the mixing condition ([1](https://arxiv.org/html/2605.06866#S3.E1)). Let $t_{k}:=t_{\alpha_{k}}$ and set $K:=\min\{k\geq 0:k\geq t_{k}\}$. There exist explicit constants $C^{\mathrm{mk}}_{\mathrm{C}},c^{\mathrm{mk}}_{\mathrm{C}},\bar{\alpha}^{\mathrm{mk}}_{\mathrm{C}}>0$ and explicit thresholds $h^{\mathrm{mk}}_{\mathrm{C}}(\alpha)$ and $h^{\mathrm{mk}}_{\mathrm{C}}(\alpha,z)$ for which the following results hold in three step size regimes.

Constant step size. If $\alpha_{k}\equiv\alpha$ and $\alpha t_{\alpha}\leq\bar{\alpha}^{\mathrm{mk}}_{\mathrm{C}}$, then for $k\geq t_{\alpha}$, the iterates converge geometrically to an $\tilde{O}(\alpha)$ neighborhood:

$$\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{C}}\bigl(1+\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)(1-c^{\mathrm{mk}}_{\mathrm{C}}\alpha)^{k-t_{\alpha}}+C^{\mathrm{mk}}_{\mathrm{C}}\alpha t_{\alpha}.\tag{10}$$

Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{mk}}_{\mathrm{C}}$, and $h\geq h^{\mathrm{mk}}_{\mathrm{C}}(\alpha)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k)$:

$$\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{C}}\bigl(1+\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\left(\frac{K+h}{k+h}\right)^{c^{\mathrm{mk}}_{\mathrm{C}}\alpha}+\frac{C^{\mathrm{mk}}_{\mathrm{C}}\alpha^{2}}{c^{\mathrm{mk}}_{\mathrm{C}}\alpha-1}\cdot\frac{t_{k}}{k+h}.\tag{11}$$

Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{mk}}_{\mathrm{C}}(\alpha,z)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k^{z})$:

$$\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{C}}\bigl(1+\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\exp\left(\frac{c^{\mathrm{mk}}_{\mathrm{C}}\alpha}{1-z}\bigl((K+h)^{1-z}-(k+h)^{1-z}\bigr)\right)+\frac{C^{\mathrm{mk}}_{\mathrm{C}}\alpha t_{k}}{(k+h)^{z}}.\tag{12}$$
###### Proof sketch\.
The Cramér embedding makes the projected Bellman map a $\sqrt{\gamma}$-contraction. Under i.i.d. sampling, the averaged asynchronous map contracts according to the minimum state mass, while under Markovian sampling, the same drift is recovered after the mixing-time comparison. The sampled target is samplewise Lipschitz and its centered perturbation is uniformly bounded by the support radius. A sufficiently small Moreau envelope smoothing parameter makes the drift constants positive, and the upper bounds on $\alpha$ or lower bounds on $h$ ensure the quadratic remainder is dominated by the negative drift. Appendix [B](https://arxiv.org/html/2605.06866#A2) gives the exact constants and thresholds.
∎
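To make the three step size regimes of Theorem 1 concrete, the sketch below encodes the schedules and evaluates the right-hand sides of the i.i.d. bounds (7) and (8). The numerical constants passed in are placeholders chosen for illustration only; the actual constants and the thresholds on $h$ are those recorded in Appendix B and are not checked here. All function names are hypothetical.

```python
def constant_step(alpha):
    """alpha_k = alpha for every k."""
    return lambda k: alpha

def linear_step(alpha, h):
    """alpha_k = alpha / (k + h)."""
    return lambda k: alpha / (k + h)

def polynomial_step(alpha, h, z):
    """alpha_k = alpha / (k + h)**z with z in (0, 1)."""
    if not 0.0 < z < 1.0:
        raise ValueError("z must lie strictly between 0 and 1")
    return lambda k: alpha / (k + h) ** z

def iid_constant_bound(k, alpha, C, c, d0_sq):
    """Right-hand side of (7): geometric transient plus an O(alpha) floor."""
    return C * d0_sq * (1.0 - c * alpha) ** k + C * alpha

def iid_linear_bound(k, alpha, h, C, c, d0_sq):
    """Right-hand side of (8); the theorem requires alpha > 1 / c."""
    if c * alpha <= 1.0:
        raise ValueError("need alpha > 1 / c")
    transient = C * d0_sq * (h / (k + h)) ** (c * alpha)
    residual = C * alpha ** 2 / (c * alpha - 1.0) / (k + h)
    return transient + residual

# Placeholder constants (assumptions, not values from the paper).
C, c, d0_sq = 2.0, 0.5, 1.0
for k in (0, 100, 10_000):
    print(k,
          iid_constant_bound(k, alpha=0.01, C=C, c=c, d0_sq=d0_sq),
          iid_linear_bound(k, alpha=4.0, h=10, C=C, c=c, d0_sq=d0_sq))
```

With a constant step the bound flattens at the floor $C\alpha$, whereas the linearly-diminishing schedule keeps decreasing at rate $1/(k+h)$, which is the trade-off the theorem quantifies.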
The following finite\-sample corollary follows from the linearly\-diminishing step size bounds\.
###### Corollary 2\.
To guarantee $\mathbb{E}[\ell_{\mathrm{C},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ with the discounted CTD recursion, it suffices to take $k=O(\varepsilon^{-2})$ samples in the i.i.d. case and $k=\tilde{O}(\varepsilon^{-2})$ samples in the Markovian sampling case.
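One way to see where the $\varepsilon^{-2}$ rate comes from is the following short calculation from the i.i.d. bound (8). The shorthand $C$, $c$, $\ell$ and $D_{0}:=\ell_{\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}$ is introduced here only for readability, standing for $C^{\mathrm{iid}}_{\mathrm{C}}$, $c^{\mathrm{iid}}_{\mathrm{C}}$ and $\ell_{\mathrm{C},\infty}$; this is a sketch of the argument, not the paper's proof.

$$
\begin{aligned}
\mathbb{E}\bigl[\ell(\eta_{k},\eta^{\star})\bigr]
&\leq \sqrt{\mathbb{E}\bigl[\ell(\eta_{k},\eta^{\star})^{2}\bigr]} && \text{(Jensen's inequality)}\\
&\leq \sqrt{C D_{0}\Bigl(\frac{h}{k+h}\Bigr)^{c\alpha}+\frac{C\alpha^{2}}{c\alpha-1}\cdot\frac{1}{k+h}} && \text{(bound (8))}\\
&\leq \sqrt{\frac{C D_{0}h+C\alpha^{2}/(c\alpha-1)}{k+h}} && \text{(since } c\alpha>1 \text{ and } h/(k+h)\leq 1\text{)}.
\end{aligned}
$$

The right-hand side is at most $\varepsilon$ as soon as $k+h\geq\bigl(C D_{0}h+C\alpha^{2}/(c\alpha-1)\bigr)\varepsilon^{-2}$, that is, for $k=O(\varepsilon^{-2})$. In the Markovian case the same steps applied to (11) leave an extra factor $t_{k}$ in the residual. Under the mixing condition (1), $t_{\delta}\leq\max\{0,\lceil\log(C_{\mathrm{mix}}/\delta)/\log(1/\sigma_{\mathrm{mix}})\rceil\}$, so with $\alpha_{k}=\alpha/(k+h)$ the factor $t_{k}$ grows only logarithmically in $k$, which accounts for the $\tilde{O}$ in the Markovian rate.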
### 3.2 Discounted MTD
Let $q\geq 2$. For each state $s\in\mathcal{S}$, fix a $d$-point support $\Theta(s)=\{\theta_{1}(s),\ldots,\theta_{d}(s)\}\subset\mathbb{R}^{q}$. Here $\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta}$ is the class of state-indexed, mass-$1$ signed categorical laws supported on these points. The contraction metric $\ell_{\mathrm{M},\infty}$ is the supremum MMD metric with a characteristic kernel $\kappa$ induced by a shift-invariant, $c$-homogeneous semimetric of strong negative type \[[42](https://arxiv.org/html/2605.06866#bib.bib50),[19](https://arxiv.org/html/2605.06866#bib.bib51),[33](https://arxiv.org/html/2605.06866#bib.bib52),[49](https://arxiv.org/html/2605.06866#bib.bib4)\].
The embedding $I_{\mathrm{M}}$ is the per-state embedding $I_{\mathrm{M},s}(\eta(s)):=K_{s}^{1/2}p(s)$ applied to all states, where $p(s)$ denotes the vector of weights of $\eta(s)$ on $\Theta(s)$ and $K_{s}:=\bigl(\kappa(\theta_{i}(s),\theta_{j}(s))\bigr)_{i,j=1}^{d}$ is the Gram matrix on the support $\Theta(s)$ \[[49](https://arxiv.org/html/2605.06866#bib.bib4)\]. The statewise projection $\Pi_{\mathrm{M}}^{\Theta}$ is

$$\Pi^{\Theta(s)}_{\mathrm{M}}\nu:=\arg\inf_{p\in\mathbb{R}_{1}^{d}}\mathrm{MMD}_{\kappa}\Bigl(\sum_{i=1}^{d}p_{i}\delta_{\theta_{i}(s)},\nu\Bigr),\tag{13}$$

applied to all states, where $\mathbb{R}_{1}^{d}$ is the affine subspace of $\mathbb{R}^{d}$ with unit total mass. The projection is well-defined for $\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta}$, as shown by Wiltzer et al. \[[49](https://arxiv.org/html/2605.06866#bib.bib4)\].
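Because the constraint set in (13) is affine and squared MMD is quadratic in the weights, the projection reduces to a small linear system. The sketch below solves that system for a finitely supported signed measure $\nu$. It substitutes a Gaussian kernel for the paper's kernel class purely so that the Gram matrix is strictly positive definite in this toy setting; that substitution, the `bandwidth` parameter, and the function names are assumptions of this example.

```python
import numpy as np

def gaussian_gram(X, Y, bandwidth=1.0):
    """Gram matrix kappa(x_i, y_j) for a Gaussian kernel (a stand-in kernel)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mmd_projection(support, atoms, weights, bandwidth=1.0):
    """Unit-mass weights p minimizing MMD(sum_i p_i delta_{support_i}, nu).

    nu = sum_j weights[j] * delta_{atoms[j]} may be a signed measure.
    """
    d = support.shape[0]
    K = gaussian_gram(support, support, bandwidth)
    b = gaussian_gram(support, atoms, bandwidth) @ weights
    # KKT system of: minimize p^T K p - 2 p^T b  subject to  sum(p) = 1.
    kkt = np.zeros((d + 1, d + 1))
    kkt[:d, :d] = 2.0 * K
    kkt[:d, d] = 1.0
    kkt[d, :d] = 1.0
    rhs = np.concatenate([2.0 * b, [1.0]])
    return np.linalg.solve(kkt, rhs)[:d]  # drop the Lagrange multiplier

support = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
q = np.array([0.7, 0.5, -0.2])           # signed weights with unit total mass
shifted = 0.3 + 0.9 * support             # atoms after a pushforward r + gamma * theta
print(mmd_projection(support, shifted, q))
```

The returned weights can be negative, which is exactly why the signed-categorical class is used: the projection stays a linear solve rather than a constrained optimization over the simplex.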
Given a sampled transition $(S_{k},A_{k},R_{k},S_{k+1})$, the sampled Bellman target is

$$\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1})):=I_{\mathrm{M},S_{k}}\Bigl(\Pi_{\mathrm{M}}^{\Theta(S_{k})}\bigl((f_{R_{k},\gamma})_{\#}I_{\mathrm{M},S_{k+1}}^{-1}(U_{k}(S_{k+1}))\bigr)\Bigr).\tag{14}$$
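The role of the embedding $p\mapsto K_{s}^{1/2}p$ in (14) is that Euclidean distance between embedded weight vectors equals the MMD between the corresponding signed categorical laws on a common support. The following sketch checks this identity numerically. As before, the Gaussian kernel is a stand-in chosen for this example rather than the kernel class used in the paper, and the function names are hypothetical.

```python
import numpy as np

def gaussian_gram(X, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on the rows of X (a stand-in kernel)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def embed(K, p):
    """Map weights p to K^{1/2} p using the symmetric PSD square root of K."""
    evals, evecs = np.linalg.eigh(K)
    # Clip tiny negative eigenvalues caused by round-off before the square root.
    root = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    return root @ p

def mmd(K, p, q):
    """MMD between two signed categorical laws sharing the support behind K."""
    diff = p - q
    return np.sqrt(max(diff @ K @ diff, 0.0))

support = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K = gaussian_gram(support)
p = np.array([0.7, 0.5, -0.2])
q = np.array([0.2, 0.3, 0.5])
print(np.linalg.norm(embed(K, p) - embed(K, q)), mmd(K, p, q))
```

The two printed numbers agree up to round-off, since $\lVert K^{1/2}(p-q)\rVert_{2}^{2}=(p-q)^{\top}K(p-q)$.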
The next theorem gives the MTD rates analogous to Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1). The proof backbone is the same as for CTD, but the perturbation geometry is affine in the norm of the current iterate.
###### Theorem 3 (Discounted asynchronous MTD).
Let $\eta^{\star}\in\mathcal{F}_{\mathrm{M},\Theta}^{\mathcal{S}}$ denote the unique fixed point of $\Pi_{\mathrm{M}}^{\Theta}T^{\pi}$, and let $\eta_{k}:=I_{\mathrm{M}}^{-1}(U_{k})$ be generated by ([5](https://arxiv.org/html/2605.06866#S3.E5)) with $\widehat{T}=\widehat{T}_{\mathrm{M}}$.
(i) i.i.d. sampling. Suppose $(S_{k})_{k\geq 0}$ are i.i.d. with law $\rho$ on $\mathcal{S}$, and $\min_{s\in\mathcal{S}}\rho(s)>0$. There exist explicit constants $C^{\mathrm{iid}}_{\mathrm{M}},c^{\mathrm{iid}}_{\mathrm{M}},\bar{\alpha}^{\mathrm{iid}}_{\mathrm{M}}>0$ and explicit thresholds $h^{\mathrm{iid}}_{\mathrm{M}}(\alpha)$ and $h^{\mathrm{iid}}_{\mathrm{M}}(\alpha,z)$ for which the following results hold in three step size regimes.

Constant step size. If $\alpha_{k}\equiv\alpha\leq\bar{\alpha}^{\mathrm{iid}}_{\mathrm{M}}$, then the iterates converge geometrically to an $O(\alpha)$ neighborhood:

$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{M}}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}(1-c^{\mathrm{iid}}_{\mathrm{M}}\alpha)^{k}+C^{\mathrm{iid}}_{\mathrm{M}}\alpha.\tag{15}$$

Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/c^{\mathrm{iid}}_{\mathrm{M}}$, and $h\geq h^{\mathrm{iid}}_{\mathrm{M}}(\alpha)$, then the leading residual term decays as $O(1/k)$:

$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{M}}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{c^{\mathrm{iid}}_{\mathrm{M}}\alpha}+\frac{C^{\mathrm{iid}}_{\mathrm{M}}\alpha^{2}}{c^{\mathrm{iid}}_{\mathrm{M}}\alpha-1}\cdot\frac{1}{k+h}.\tag{16}$$

Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{iid}}_{\mathrm{M}}(\alpha,z)$, then the leading residual term decays as $O(1/k^{z})$:

$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{iid}}_{\mathrm{M}}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{\mathrm{iid}}_{\mathrm{M}}\alpha}{1-z}\bigl((k+h)^{1-z}-h^{1-z}\bigr)\right)+\frac{C^{\mathrm{iid}}_{\mathrm{M}}\alpha}{(k+h)^{z}}.\tag{17}$$
\(ii\) Markovian trajectory sampling\.Suppose\(Sk\)k≥0\(S\_\{k\}\)\_\{k\\geq 0\}is irreducible and aperiodic with stationary distributionμ𝒮\\mu\_\{\\mathcal\{S\}\}satisfyingmins∈𝒮μ𝒮\(s\)\>0\\min\_\{s\\in\\mathcal\{S\}\}\\mu\_\{\\mathcal\{S\}\}\(s\)\>0and the mixing condition \([1](https://arxiv.org/html/2605.06866#S3.E1)\)\. Lettk:=tαkt\_\{k\}:=t\_\{\\alpha\_\{k\}\}and setK:=min\{k≥0:k≥tk\}K:=\\min\\\{k\\geq 0:k\\geq t\_\{k\}\\\}\. There exist explicit constantsCMmk,cMmk,α¯Mmk\>0C^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\},c^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\},\\bar\{\\alpha\}^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\>0and explicit thresholdshMmk\(α\)h^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\(\\alpha\)andhMmk\(α,z\)h^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\(\\alpha,z\)for which the following results hold on three step size regimes\.Constant step size\.Ifαk≡α\\alpha\_\{k\}\\equiv\\alphaandαtα≤α¯Mmk\\alpha t\_\{\\alpha\}\\leq\\bar\{\\alpha\}^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}, then fork≥tαk\\geq t\_\{\\alpha\}, the iterates converge geometrically to anO~\(α\)\\tilde\{O\}\(\\alpha\)neighborhood:
𝔼\[ℓM,∞\(ηk,η⋆\)2\]≤CMmk\(1\+ℓC,∞\(η0,η⋆\)2\)\(1−cMmkα\)k−tα\+CMmkαtα\.\\mathbb\{E\}\\bigl\[\\ell\_\{\\mathrm\{M\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)^\{2\}\\bigr\]\\leq C^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\\bigl\(1\+\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{0\},\\eta^\{\\star\}\)^\{2\}\\bigr\)\(1\-c^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\\alpha\)^\{k\-t\_\{\\alpha\}\}\+C^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\\alpha t\_\{\\alpha\}\.\(18\)Linearly\-diminishing step size\.Ifαk=α/\(k\+h\)\\alpha\_\{k\}=\\alpha/\(k\+h\),α\>1/cMmk\\alpha\>1/c^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}, andh≥hMmk\(α\)h\\geq h^\{\\mathrm\{mk\}\}\_\{\\mathrm\{M\}\}\(\\alpha\), then fork≥Kk\\geq K, the leading residual term decays inO~\(1/k\)\\tilde\{O\}\(1/k\):
$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{M}}\bigl(1+\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\left(\frac{K+h}{k+h}\right)^{c^{\mathrm{mk}}_{\mathrm{M}}\alpha}+\frac{C^{\mathrm{mk}}_{\mathrm{M}}\alpha^{2}}{c^{\mathrm{mk}}_{\mathrm{M}}\alpha-1}\cdot\frac{t_{k}}{k+h}. \tag{19}$$
Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+h)^{z}$, $z\in(0,1)$, and $h\geq h^{\mathrm{mk}}_{\mathrm{M}}(\alpha,z)$, then for $k\geq K$, the leading residual term decays as $\tilde{O}(1/k^{z})$:
$$\mathbb{E}\bigl[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\bigr]\leq C^{\mathrm{mk}}_{\mathrm{M}}\bigl(1+\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\bigr)\exp\left(\frac{c^{\mathrm{mk}}_{\mathrm{M}}\alpha}{1-z}\bigl((K+h)^{1-z}-(k+h)^{1-z}\bigr)\right)+\frac{C^{\mathrm{mk}}_{\mathrm{M}}\alpha t_{k}}{(k+h)^{z}}. \tag{20}$$
###### Proof sketch\.
The Gram\-matrix embedding turns the projected MTD Bellman map into aγc/2\\gamma^\{c/2\}\-contraction in the block\-supremum MMD geometry\. The sample target is again Lipschitz, but the signed\-categorical projection gives an affine centered perturbation bound rather than the uniform CTD bound\. The smoothed SA propositions therefore apply with an affine noise constant\. The smoothing parameter can be chosen small enough to make the drift positive, and the step size thresholds absorb the quadratic remainder\. Appendix[C](https://arxiv.org/html/2605.06866#A3)gives the exact constants and thresholds\. ∎
Once again, the following finite\-sample result is a simple consequence of the previous theorem\.
###### Corollary 4\.
To guarantee𝔼\[ℓM,∞\(ηk,η⋆\)\]≤ε\\mathbb\{E\}\[\\ell\_\{\\mathrm\{M\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)\]\\leq\\varepsilonwith the discounted MTD recursion, it suffices to takek=O\(ε−2\)k=O\(\\varepsilon^\{\-2\}\)samples in the i\.i\.d\. andk=O~\(ε−2\)k=\\tilde\{O\}\(\\varepsilon^\{\-2\}\)samples in the Markovian sampling case\.
## 4Undiscounted fixed\-horizon categorical policy evaluation
We now turn to the undiscounted fixed\-horizon setting, where the value object is indexed by the remaining horizon\[[45](https://arxiv.org/html/2605.06866#bib.bib24),[1](https://arxiv.org/html/2605.06866#bib.bib57),[18](https://arxiv.org/html/2605.06866#bib.bib55)\]\. The aim of this section is to recover a comparable finite\-iteration picture without relying on discounting, which means the learned object must now be treated as a horizon\-indexed stack rather than a single per\-state quantity\. Fix a horizonH∈ℕH\\in\\mathbb\{N\}and a stationary policyπ\(⋅∣s\)\\pi\(\\cdot\\mid s\)\. For eachh∈\{0,1,…,H\}h\\in\\\{0,1,\\dots,H\\\}and eachs∈𝒮s\\in\\mathcal\{S\}, letηh\(s\)\\eta^\{h\}\(s\)denote thehh\-step return\-distribution object underπ\\pi, with terminal layerη0\(s\)=δ0\\eta^\{0\}\(s\)=\\delta\_\{0\}\.
Forh≥1h\\geq 1, we define the undiscounted fixed\-horizon distributional Bellman operator by
\(THπη\)h\(s\):=∑a∈𝒜π\(a∣s\)∑s′∈𝒮P\(s′∣s,a\)\(fR\(s,a\),1\)\#ηh−1\(s′\),\(T\_\{H\}^\{\\pi\}\\eta\)^\{h\}\(s\):=\\sum\_\{a\\in\\mathcal\{A\}\}\\pi\(a\\mid s\)\\sum\_\{s^\{\\prime\}\\in\\mathcal\{S\}\}P\(s^\{\\prime\}\\mid s,a\)\\,\(f\_\{R\(s,a\),1\}\)\_\{\\\#\}\\eta^\{h\-1\}\(s^\{\\prime\}\),\(21\)Thus thehhth horizon bootstraps from the\(h−1\)\(h\-1\)st horizon, as in the fixed\-horizon TD formulation ofDe Asiset al\.\[[18](https://arxiv.org/html/2605.06866#bib.bib55)\]\.
For both CTD and MTD, we use exactly the same local categorical projections and local statewise isometric embeddings as in Section[3](https://arxiv.org/html/2605.06866#S3), now applied separately at each horizon and then stacked overh=1,…,Hh=1,\\dots,H\. For eachs∈𝒮s\\in\\mathcal\{S\}, the corresponding embedded iterate is
U\(s\):=\(U1\(s\),…,UH\(s\)\),U:=\(Uh\(s\)\)1≤h≤H,s∈𝒮\.U\(s\):=\\bigl\(U^\{1\}\(s\),\\ldots,U^\{H\}\(s\)\\bigr\),\\qquad U:=\\bigl\(U^\{h\}\(s\)\\bigr\)\_\{1\\leq h\\leq H,\\ s\\in\\mathcal\{S\}\}\.\(22\)
The contraction of𝒪\\mathcal\{O\}in the discounted setting is entirely due to the discount factorγ∈\(0,1\)\\gamma\\in\(0,1\)and hence does not carry over to the undiscounted case\. However, after introducing a suitable weighted block metric, we recover a contraction\. Fixλ∈\(0,1\)\\lambda\\in\(0,1\)and define the weighted block metric and weighted embedded norm by
ℓH,∞\(η,η′\):=max1≤h≤H,s∈𝒮λhℓ\(ηh\(s\),η′h\(s\)\),∥U∥H,2,∞:=max1≤h≤H,s∈𝒮λh∥Uh\(s\)∥2\.\\ell\_\{H,\\infty\}\(\\eta,\\eta^\{\\prime\}\):=\\max\_\{1\\leq h\\leq H,\\ s\\in\\mathcal\{S\}\}\\lambda^\{h\}\\ell\\bigl\(\\eta^\{h\}\(s\),\{\\eta^\{\\prime\}\}^\{h\}\(s\)\\bigr\),\\qquad\\lVert U\\rVert\_\{H,2,\\infty\}:=\\max\_\{1\\leq h\\leq H,\\ s\\in\\mathcal\{S\}\}\\lambda^\{h\}\\lVert U^\{h\}\(s\)\\rVert\_\{2\}\.\(23\)LetΠHΘ\\Pi\_\{H\}^\{\\Theta\}andIHI\_\{H\}denote the horizonwise projection and horizonwise embedding, and define
𝒪H:=IH∘ΠHΘTHπ∘IH−1\.\\mathcal\{O\}\_\{H\}:=I\_\{H\}\\circ\\Pi\_\{H\}^\{\\Theta\}T\_\{H\}^\{\\pi\}\\circ I\_\{H\}^\{\-1\}\.\(24\)
###### Proposition 5\.
For both CTD and MTD,
ℓH,∞\(ΠHΘTHπη,ΠHΘTHπη′\)≤λℓH,∞\(η,η′\)\.\\ell\_\{H,\\infty\}\(\\Pi\_\{H\}^\{\\Theta\}T\_\{H\}^\{\\pi\}\\eta,\\Pi\_\{H\}^\{\\Theta\}T\_\{H\}^\{\\pi\}\\eta^\{\\prime\}\)\\leq\\lambda\\ell\_\{H,\\infty\}\(\\eta,\\eta^\{\\prime\}\)\.\(25\)By construction,IHI\_\{H\}is an isometric embedding from the fixed\-horizon distribution space equipped withℓH,∞\\ell\_\{H,\\infty\}into the embedding space equipped with∥⋅∥H,2,∞\\lVert\\cdot\\rVert\_\{H,2,\\infty\}\. Hence the preceding statement is equivalent to
∥𝒪HU−𝒪HU′∥H,2,∞≤λ∥U−U′∥H,2,∞\.\\lVert\\mathcal\{O\}\_\{H\}U\-\\mathcal\{O\}\_\{H\}U^\{\\prime\}\\rVert\_\{H,2,\\infty\}\\leq\\lambda\\lVert U\-U^\{\\prime\}\\rVert\_\{H,2,\\infty\}\.\(26\)
###### Proof sketch\.
The Bellman update moves horizonhhtoh−1h\-1, and the weightλh\\lambda^\{h\}therefore contributes exactly one factorλ\\lambda\. The local categorical projections are nonexpansive in the underlying statewise metric, so the weighted stack inherits a contraction\. Details are recorded in Appendices[D](https://arxiv.org/html/2605.06866#A4)and[E](https://arxiv.org/html/2605.06866#A5)\. ∎
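The weighting argument in the proof sketch can be checked numerically on a toy operator. The following is a minimal sketch under stated assumptions, not the paper's projected Bellman map: `shift_operator` is a hypothetical stand-in that averages the $(h-1)$-layer over a row-stochastic matrix and adds an offset, which is nonexpansive in the statewise Euclidean norm and bootstraps horizon $h$ from horizon $h-1$. All function names and default parameters are illustrative.

```python
import numpy as np

def weighted_norm(U, lam):
    """max over h = 1..H and states s of lam**h * ||U[h, s]||_2 (layer 0 is ignored)."""
    H = U.shape[0] - 1
    weights = lam ** np.arange(1, H + 1)
    block_norms = np.linalg.norm(U[1:], axis=-1)  # shape (H, n_states)
    return float(np.max(weights[:, None] * block_norms))

def shift_operator(U, P, b):
    """Toy map: (OU)[h, s] = sum_t P[s, t] * U[h-1, t] + b[s]; layer 0 stays fixed."""
    out = U.copy()
    out[1:] = np.einsum("st,htd->hsd", P, U[:-1]) + b[None, :, :]
    return out

def contraction_ratio(lam=0.7, H=4, n_states=5, d=3, seed=0):
    """Ratio ||OU - OV|| / ||U - V|| in the lam-weighted block norm for random stacks."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_states))
    P /= P.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    b = rng.normal(size=(n_states, d))
    U = rng.normal(size=(H + 1, n_states, d))
    V = rng.normal(size=(H + 1, n_states, d))
    V[0] = U[0]  # both stacks share the fixed terminal layer
    num = weighted_norm(shift_operator(U, P, b) - shift_operator(V, P, b), lam)
    den = weighted_norm(U - V, lam)
    return num / den
```

On this toy map the ratio never exceeds `lam`: layer $h$ of the output differs only through layer $h-1$ of the input, and $\lambda^{h}=\lambda\cdot\lambda^{h-1}$.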
For each sampled transition\(Sk,Ak,Rk,Sk\+1\)\(S\_\{k\},A\_\{k\},R\_\{k\},S\_\{k\+1\}\), let
T^H\(Uk;Sk,\(Rk,Sk\+1\)\):=\(T^H1\(Uk;Sk,\(Rk,Sk\+1\)\),…,T^HH\(Uk;Sk,\(Rk,Sk\+1\)\)\)\\widehat\{T\}\_\{H\}\\bigl\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\\bigr\):=\\bigl\(\\widehat\{T\}\_\{H\}^\{1\}\\bigl\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\\bigr\),\\ldots,\\widehat\{T\}\_\{H\}^\{H\}\\bigl\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\\bigr\)\\bigr\)\(27\)denote the stacked one\-step sampled Bellman target, where the horizonwise components are defined in the CTD and MTD subsections below\. Then, for everys∈𝒮s\\in\\mathcal\{S\},
𝔼\[T^H\(Uk;Sk,\(Rk,Sk\+1\)\)∣Uk,Sk=s\]=\(𝒪HUk\)\(s\)\.\\mathbb\{E\}\\bigl\[\\widehat\{T\}\_\{H\}\\bigl\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\\bigr\)\\mid U\_\{k\},\\ S\_\{k\}=s\\bigr\]=\(\\mathcal\{O\}\_\{H\}U\_\{k\}\)\(s\)\.\(28\)Thus a single sampled transition supplies a Bellman sample for the full stack at the visited state: the observed first reward is shared across all horizons, while the continuation term for horizonhhis the current\(h−1\)\(h\-1\)\-step estimate at the next state\. IfPsHP\_\{s\}^\{H\}denotes the coordinate projector onto the full horizon stack at statess, the online fixed\-horizon recursion takes the form
Uk\+1=Uk\+αkPSkH\(T^H\(Uk;Sk,\(Rk,Sk\+1\)\)−Uk\(Sk\)\)\.U\_\{k\+1\}=U\_\{k\}\+\\alpha\_\{k\}P\_\{S\_\{k\}\}^\{H\}\\Bigl\(\\widehat\{T\}\_\{H\}\\bigl\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\\bigr\)\-U\_\{k\}\(S\_\{k\}\)\\Bigr\)\.\(29\)
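For the scalar CTD case, one step of the recursion above can be sketched in probability space. This is an illustrative sketch, not the authors' implementation, and it rests on several assumptions: the cumulative-mass embedding is linear in the atom probabilities, so the embedded update is a convex mixture of probability vectors; a single sorted atom grid is shared across all states and horizons, whereas the paper allows horizon-dependent supports; and the terminal layer is the projection of $\delta_{0}$ onto that grid. The names `project_categorical`, `init_stack`, and `fixed_horizon_ctd_step` are hypothetical.

```python
import numpy as np

def project_categorical(atoms, values, probs):
    """Linear-interpolation projection of sum_j probs[j] * delta(values[j]) onto sorted atoms."""
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Values outside the grid are clipped to the nearest boundary atom.
    v = np.clip(np.asarray(values, dtype=float), atoms[0], atoms[-1])
    # Index of the left neighbouring atom for each clipped value.
    idx = np.clip(np.searchsorted(atoms, v, side="right") - 1, 0, len(atoms) - 2)
    w_hi = (v - atoms[idx]) / (atoms[idx + 1] - atoms[idx])
    out = np.zeros(len(atoms))
    np.add.at(out, idx, probs * (1.0 - w_hi))
    np.add.at(out, idx + 1, probs * w_hi)
    return out

def init_stack(atoms, n_states, H):
    """Layer 0 is the projected delta_0; layers 1..H start uniform (an arbitrary choice)."""
    d = len(atoms)
    P = np.full((H + 1, n_states, d), 1.0 / d)
    P[0] = project_categorical(atoms, [0.0], [1.0])
    return P

def fixed_horizon_ctd_step(P, atoms, s, r, s_next, alpha):
    """Update every horizon layer at state s from one transition (s, r, s_next)."""
    H = P.shape[0] - 1
    old = P.copy()  # every target bootstraps from the pre-update iterate
    for h in range(1, H + 1):
        # Shift the (h-1)-step estimate at s_next by the shared reward, then project.
        target = project_categorical(atoms, np.asarray(atoms) + r, old[h - 1, s_next])
        P[h, s] = (1.0 - alpha) * old[h, s] + alpha * target
    return P
```

Copying the stack before the loop keeps the update faithful to the recursion even when `s_next == s`: every horizon reads the pre-update iterate, and only the block at the visited state changes.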
We consider only the standard episodic sampling model in which episodes have fixed horizonHH, reset distributionν0\\nu\_\{0\}, are i\.i\.d\. across resets, and follow the stationary policyπ\\piwithin each episode\. The recursion in \([29](https://arxiv.org/html/2605.06866#S4.E29)\) is still performed after each transition, but its averaged drift depends on the within\-episode phase\. We therefore analyze the episode\-boundary sequence\(UmH\)m≥0\(U\_\{mH\}\)\_\{m\\geq 0\}, in keeping with the horizon\-stacked fixed\-horizon viewpoint ofDe Asiset al\.\[[18](https://arxiv.org/html/2605.06866#bib.bib55)\]\. To formalize this phase dependence, define the phase distributions
ρt\(s\):=Pr\(St=s\),0≤t≤H−1,s∈𝒮,\\rho\_\{t\}\(s\):=\\Pr\(S\_\{t\}=s\),\\qquad 0\\leq t\\leq H\-1,\\qquad s\\in\\mathcal\{S\},\(30\)and the lower bound
ρmin:=min0≤t≤H−1,s∈𝒮ρt\(s\)\>0\.\\rho\_\{\\min\}:=\\min\_\{0\\leq t\\leq H\-1,\\ s\\in\\mathcal\{S\}\}\\rho\_\{t\}\(s\)\>0\.\(31\)This full\-support condition is used to obtain a uniform contraction factor across phase\-state blocks\. For each phaset∈\{0,…,H−1\}t\\in\\\{0,\\ldots,H\-1\\\}, define the phasewise averaged increment and phasewise averaged map by
Γt\(U\):=∑s∈𝒮ρt\(s\)PsH\(\(𝒪HU\)\(s\)−U\(s\)\),Gt\(U\):=U\+Γt\(U\)\.\\Gamma\_\{t\}\(U\):=\\sum\_\{s\\in\\mathcal\{S\}\}\\rho\_\{t\}\(s\)\\,P\_\{s\}^\{H\}\\bigl\(\(\\mathcal\{O\}\_\{H\}U\)\(s\)\-U\(s\)\\bigr\),\\qquad G\_\{t\}\(U\):=U\+\\Gamma\_\{t\}\(U\)\.\(32\)For each episodem≥0m\\geq 0, define the episodewise first\- and second\-order step size masses
α¯m:=∑u=0H−1αmH\+u,α¯m\(2\):=∑u=0H−1αmH\+u2\.\\bar\{\\alpha\}\_\{m\}:=\\sum\_\{u=0\}^\{H\-1\}\\alpha\_\{mH\+u\},\\qquad\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}:=\\sum\_\{u=0\}^\{H\-1\}\\alpha\_\{mH\+u\}^\{2\}\.\(33\)These are the episodewise analogues of the single\-step quantitiesαk\\alpha\_\{k\}andαk2\\alpha\_\{k\}^\{2\}in the discounted analysis\. The phase dependence enters through the averaged mapsGtG\_\{t\}, while the weighted block\-supremum geometry from Appendix[A](https://arxiv.org/html/2605.06866#A1)remains unchanged\. The mapsGtG\_\{t\}are auxiliary proof objects for the episodewise drift analysis rather than additional algorithmic iterates\.
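As a quick worked instance of these definitions (an illustration, not taken from the appendices), consider the two simplest schedules. For a constant step size $\alpha_{k}\equiv\alpha$, both masses are exact, and for $\alpha_{k}=\alpha/(k+g)$ with offset $g>0$, bounding each of the $H$ terms by its smallest and largest value gives a two-sided estimate:

$$\alpha_{k}\equiv\alpha:\quad \bar{\alpha}_{m}=H\alpha,\qquad \bar{\alpha}_{m}^{(2)}=H\alpha^{2};\qquad\qquad \alpha_{k}=\frac{\alpha}{k+g}:\quad \frac{H\alpha}{mH+g+H-1}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{mH+g}.$$

The first identity appears consistent with the per-episode factor $H\alpha$ in the constant step size bounds stated below, and the denominator $mH+g+H-1$ is the index of the last transition of episode $m$ shifted by the offset.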
### 4\.1Undiscounted fixed\-horizon CTD
For CTD, each state\-horizon pair\(h,s\)\(h,s\)carries an ordered, possibly horizon\-dependent scalar supportΘh\(s\)=\{θh,1\(s\)<⋯<θh,d\(s\)\}⊂ℝ\\Theta\_\{h\}\(s\)=\\\{\\theta\_\{h,1\}\(s\)<\\cdots<\\theta\_\{h,d\}\(s\)\\\}\\subset\\mathbb\{R\}\. The statewise Cramér metricℓH,C,∞\\ell\_\{H,\\mathrm\{C\},\\infty\}, cumulative\-mass embeddingIH,C,h,sI\_\{H,\\mathrm\{C\},h,s\}, and linear\-interpolation projectionΠH,CΘh\(s\)\\Pi^\{\\Theta\_\{h\}\(s\)\}\_\{H,\\mathrm\{C\}\}are defined exactly as in Section[3\.1](https://arxiv.org/html/2605.06866#S3.SS1), only now applied separately at each horizon and stacked overh=1,…,Hh=1,\\dots,H\[[39](https://arxiv.org/html/2605.06866#bib.bib7),[5](https://arxiv.org/html/2605.06866#bib.bib5)\]\.
Given a sampled transition\(Sk,Ak,Rk,Sk\+1\)\(S\_\{k\},A\_\{k\},R\_\{k\},S\_\{k\+1\}\), the sampled Bellman target is computed at every horizonh=1,…,Hh=1,\\dots,Hby
T^H,Ch\(Uk;Sk,\(Rk,Sk\+1\)\):=IH,C,h,Sk\(ΠH,CΘh\(Sk\)\(\(fRk,1\)\#IH,C,h−1,Sk\+1−1\(Ukh−1\(Sk\+1\)\)\)\),\\widehat\{T\}\_\{H,\\mathrm\{C\}\}^\{h\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\):=I\_\{H,\\mathrm\{C\},h,S\_\{k\}\}\\Bigl\(\\Pi\_\{H,\\mathrm\{C\}\}^\{\\Theta\_\{h\}\(S\_\{k\}\)\}\\bigl\(\(f\_\{R\_\{k\},1\}\)\_\{\\\#\}I\_\{H,\\mathrm\{C\},h\-1,S\_\{k\+1\}\}^\{\-1\}\(U\_\{k\}^\{h\-1\}\(S\_\{k\+1\}\)\)\\bigr\)\\Bigr\),\(34\)with the convention thatUk0\(s\)≡0U\_\{k\}^\{0\}\(s\)\\equiv 0\. Stacking these horizonwise targets gives
T^H,C\(Uk;Sk,\(Rk,Sk\+1\)\):=\(T^H,C1\(Uk;Sk,\(Rk,Sk\+1\)\),…,T^H,CH\(Uk;Sk,\(Rk,Sk\+1\)\)\)\.\\widehat\{T\}\_\{H,\\mathrm\{C\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\):=\\bigl\(\\widehat\{T\}\_\{H,\\mathrm\{C\}\}^\{1\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\),\\dots,\\widehat\{T\}\_\{H,\\mathrm\{C\}\}^\{H\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\)\\bigr\)\.\(35\)One sampled transition therefore updates the full block atSkS\_\{k\}: all horizons share the same first step and differ only in the bootstrapped tail\. The next theorem states the resulting episode\-boundary rates\.
###### Theorem 6\(Undiscounted fixed\-horizon episodic CTD\)\.
Assume the episodic sampling model described above andρmin\>0\\rho\_\{\\min\}\>0\. LetηH,C⋆\\eta\_\{H,\\mathrm\{C\}\}^\{\\star\}denote the unique fixed point ofΠH,CΘTHπ\\Pi\_\{H,\\mathrm\{C\}\}^\{\\Theta\}T\_\{H\}^\{\\pi\}, and letηk:=IH,C−1\(Uk\)\\eta\_\{k\}:=I\_\{H,\\mathrm\{C\}\}^\{\-1\}\(U\_\{k\}\)be generated by \([29](https://arxiv.org/html/2605.06866#S4.E29)\) withT^H=T^H,C\\widehat\{T\}\_\{H\}=\\widehat\{T\}\_\{H,\\mathrm\{C\}\}\. There exist explicit constantsCCH,cCH,α¯CH\>0C^\{H\}\_\{\\mathrm\{C\}\},c^\{H\}\_\{\\mathrm\{C\}\},\\bar\{\\alpha\}^\{H\}\_\{\\mathrm\{C\}\}\>0and explicit thresholdsgCH\(α\)g^\{H\}\_\{\\mathrm\{C\}\}\(\\alpha\)andgCH\(α,z\)g^\{H\}\_\{\\mathrm\{C\}\}\(\\alpha,z\)for which the following episode\-boundary results hold\.Constant step size\.Ifαk≡α≤α¯CH\\alpha\_\{k\}\\equiv\\alpha\\leq\\bar\{\\alpha\}^\{H\}\_\{\\mathrm\{C\}\}, then the episode\-boundary error decays geometrically to anO\(α\)O\(\\alpha\)neighborhood:
$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{C}}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}(1-c^{H}_{\mathrm{C}}H\alpha)^{m}+C^{H}_{\mathrm{C}}\alpha. \tag{36}$$
For the diminishing step size regimes below, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$, where $g$ is the step-size offset.
Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/c^{H}_{\mathrm{C}}$, and $g\geq g^{H}_{\mathrm{C}}(\alpha)$, then the leading episode-boundary residual term decays as $O(1/m)$:
$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{C}}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{c^{H}_{\mathrm{C}}\alpha}+\frac{C^{H}_{\mathrm{C}}\alpha^{2}}{c^{H}_{\mathrm{C}}\alpha-1}\cdot\frac{1}{\tau_{m}}. \tag{37}$$
Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+g)^{z}$, $z\in(0,1)$, and $g\geq g^{H}_{\mathrm{C}}(\alpha,z)$, then the leading episode-boundary residual term decays as $O(1/m^{z})$:
$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{C}}\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{H}_{\mathrm{C}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right)+\frac{C^{H}_{\mathrm{C}}\alpha}{\tau_{m}^{z}}. \tag{38}$$
###### Proof sketch\.
Weighting horizonhhbyλh\\lambda^\{h\}turns the horizon\-to\-horizon bootstrap into a contraction\. Because the averaged drift depends on the phase within the episode, the proof contracts phasewise averaged maps and aggregates them over the course of an episode\. The difference between the averaged trajectory and the online trajectory is controlled by a mean\-zero frozen\-iterate term summed with a within\-episode movement term\. Details are given in Appendix[D](https://arxiv.org/html/2605.06866#A4)\. ∎
###### Corollary 7\.
To guarantee𝔼\[ℓH,C,∞\(ηmH,ηH,C⋆\)\]≤ε\\mathbb\{E\}\[\\ell\_\{H,\\mathrm\{C\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{C\}\}^\{\\star\}\)\]\\leq\\varepsilonwith the undiscounted fixed\-horizon CTD recursion, it suffices to takem=O\(ε−2\)m=O\(\\varepsilon^\{\-2\}\)episodes\. Equivalently, since each episode has lengthHH, one may takek=mH=O\(ε−2\)k=mH=O\(\\varepsilon^\{\-2\}\)transitions\.
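One way to see where the $\varepsilon^{-2}$ rate comes from is the following sketch of the standard argument, which is not a reproduction of the paper's proof. Under the linearly-diminishing schedule, the exponent $c^{H}_{\mathrm{C}}\alpha$ exceeds $1$ and $\tau_{0}\leq\tau_{m}$, so the transient term in (37) is at most $\tau_{0}/\tau_{m}$ times its prefactor. Writing $C'$ for the resulting combined constant (a symbol introduced here for illustration only) and applying Jensen's inequality:

$$\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})\bigr]\leq\sqrt{\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]}\leq\sqrt{\frac{C'}{\tau_{m}}},\qquad C':=C^{H}_{\mathrm{C}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star})^{2}\,\tau_{0}+\frac{C^{H}_{\mathrm{C}}\alpha^{2}}{c^{H}_{\mathrm{C}}\alpha-1}.$$

The right-hand side is at most $\varepsilon$ as soon as $\tau_{m}=mH+g+H-1\geq C'/\varepsilon^{2}$, which holds for $m=O(\varepsilon^{-2})$ episodes when $H$, $g$, and the constants are held fixed.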
### 4\.2Undiscounted fixed\-horizon MTD
For MTD, each state\-horizon pair\(h,s\)\(h,s\)carries a possibly horizon\-dependentdd\-point supportΘh\(s\)=\{θh,1\(s\),…,θh,d\(s\)\}⊂ℝq\\Theta\_\{h\}\(s\)=\\\{\\theta\_\{h,1\}\(s\),\\dots,\\theta\_\{h,d\}\(s\)\\\}\\subset\\mathbb\{R\}^\{q\}\. The statewise MMD metricℓH,M,∞\\ell\_\{H,\\mathrm\{M\},\\infty\}with a characteristic kernelκ\\kappainduced by a shift\-invariant,cc\-homogeneous semimetric of strong negative type, the signed\-categorical projectionΠH,MΘh\(s\)\\Pi^\{\\Theta\_\{h\}\(s\)\}\_\{H,\\mathrm\{M\}\}, and the Gram\-matrix embeddingIH,M,h,sI\_\{H,\\mathrm\{M\},h,s\}are defined exactly as in Section[3\.2](https://arxiv.org/html/2605.06866#S3.SS2), now applied separately at each horizon and stacked overh=1,…,Hh=1,\\dots,H\.
Given a sampled transition\(Sk,Ak,Rk,Sk\+1\)\(S\_\{k\},A\_\{k\},R\_\{k\},S\_\{k\+1\}\), the sampled Bellman target is computed at every horizonh=1,…,Hh=1,\\dots,Hby
T^H,Mh\(Uk;Sk,\(Rk,Sk\+1\)\):=IH,M,h,Sk\(ΠH,MΘh\(Sk\)\(\(fRk,1\)\#IH,M,h−1,Sk\+1−1\(Ukh−1\(Sk\+1\)\)\)\),\\widehat\{T\}\_\{H,\\mathrm\{M\}\}^\{h\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\):=I\_\{H,\\mathrm\{M\},h,S\_\{k\}\}\\Bigl\(\\Pi\_\{H,\\mathrm\{M\}\}^\{\\Theta\_\{h\}\(S\_\{k\}\)\}\\bigl\(\(f\_\{R\_\{k\},1\}\)\_\{\\\#\}I\_\{H,\\mathrm\{M\},h\-1,S\_\{k\+1\}\}^\{\-1\}\(U\_\{k\}^\{h\-1\}\(S\_\{k\+1\}\)\)\\bigr\)\\Bigr\),\(39\)again with the convention thatUk0\(s\)≡0U\_\{k\}^\{0\}\(s\)\\equiv 0\. Stacking these horizonwise targets gives
T^H,M\(Uk;Sk,\(Rk,Sk\+1\)\):=\(T^H,M1\(Uk;Sk,\(Rk,Sk\+1\)\),…,T^H,MH\(Uk;Sk,\(Rk,Sk\+1\)\)\)\.\\widehat\{T\}\_\{H,\\mathrm\{M\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\):=\\bigl\(\\widehat\{T\}\_\{H,\\mathrm\{M\}\}^\{1\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\),\\dots,\\widehat\{T\}\_\{H,\\mathrm\{M\}\}^\{H\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\)\\bigr\)\.\(40\)The next theorem follows the same episode\-boundary viewpoint as CTD\. The weighted contraction mechanism is unchanged, while the perturbation term is now affine in the current boundary iterate\.
###### Theorem 8\(Undiscounted fixed\-horizon episodic MTD\)\.
Assume the episodic sampling model described above andρmin\>0\\rho\_\{\\min\}\>0\. LetηH,M⋆\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}denote the unique fixed point ofΠH,MΘTHπ\\Pi\_\{H,\\mathrm\{M\}\}^\{\\Theta\}T\_\{H\}^\{\\pi\}, and letηk:=IH,M−1\(Uk\)\\eta\_\{k\}:=I\_\{H,\\mathrm\{M\}\}^\{\-1\}\(U\_\{k\}\)be generated by \([29](https://arxiv.org/html/2605.06866#S4.E29)\) withT^H=T^H,M\\widehat\{T\}\_\{H\}=\\widehat\{T\}\_\{H,\\mathrm\{M\}\}\. There exist explicit constantsCMH,cMH,α¯MH\>0C^\{H\}\_\{\\mathrm\{M\}\},c^\{H\}\_\{\\mathrm\{M\}\},\\bar\{\\alpha\}^\{H\}\_\{\\mathrm\{M\}\}\>0and explicit thresholdsgMH\(α\)g^\{H\}\_\{\\mathrm\{M\}\}\(\\alpha\)andgMH\(α,z\)g^\{H\}\_\{\\mathrm\{M\}\}\(\\alpha,z\)for which the following episode\-boundary results hold\.Constant step size\.Ifαk≡α≤α¯MH\\alpha\_\{k\}\\equiv\\alpha\\leq\\bar\{\\alpha\}^\{H\}\_\{\\mathrm\{M\}\}, then the episode\-boundary error decays geometrically to anO\(α\)O\(\\alpha\)neighborhood:
$$\mathbb{E}\bigl[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{M}}\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}(1-c^{H}_{\mathrm{M}}H\alpha)^{m}+C^{H}_{\mathrm{M}}\alpha. \tag{41}$$
For the diminishing step size regimes below, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$, where $g$ is the step-size offset.
Linearly-diminishing step size. If $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/c^{H}_{\mathrm{M}}$, and $g\geq g^{H}_{\mathrm{M}}(\alpha)$, then the leading episode-boundary residual term decays as $O(1/m)$:
$$\mathbb{E}\bigl[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{M}}\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{c^{H}_{\mathrm{M}}\alpha}+\frac{C^{H}_{\mathrm{M}}\alpha^{2}}{c^{H}_{\mathrm{M}}\alpha-1}\cdot\frac{1}{\tau_{m}}. \tag{42}$$
Polynomially-diminishing step size. If $\alpha_{k}=\alpha/(k+g)^{z}$, $z\in(0,1)$, and $g\geq g^{H}_{\mathrm{M}}(\alpha,z)$, then the leading episode-boundary residual term decays as $O(1/m^{z})$:
$$\mathbb{E}\bigl[\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star})^{2}\bigr]\leq C^{H}_{\mathrm{M}}\ell_{H,\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\exp\left(-\frac{c^{H}_{\mathrm{M}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right)+\frac{C^{H}_{\mathrm{M}}\alpha}{\tau_{m}^{z}}. \tag{43}$$
###### Proof sketch\.
This follows the CTD episode\-boundary argument after replacing the Cramér embedding by the Gram\-matrix embedding\. The phasewise weighted contraction mechanism remains, but the perturbation bound is affine in the current boundary iterate\. Details are given in Appendix[E](https://arxiv.org/html/2605.06866#A5)\. ∎
###### Corollary 9\.
To guarantee𝔼\[ℓH,M,∞\(ηmH,ηH,M⋆\)\]≤ε\\mathbb\{E\}\[\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)\]\\leq\\varepsilonwith the undiscounted fixed\-horizon MTD recursion, it suffices to takem=O\(ε−2\)m=O\(\\varepsilon^\{\-2\}\)episodes\. Equivalently, since each episode has lengthHH, one may takek=mH=O\(ε−2\)k=mH=O\(\\varepsilon^\{\-2\}\)transitions\.
## 5Representation error
Both the discounted and undiscounted fixed-horizon theorems control the distance to the fixed point of a projected Bellman operator. A common deterministic decomposition turns these projected-fixed-point guarantees into total-error guarantees relative to the exact return-distribution fixed point. Let $\ell$ denote either $\ell_{\mathrm{C},\infty}$ or $\ell_{\mathrm{M},\infty}$ in the discounted setting, or either $\ell_{H,\mathrm{C},\infty}$ or $\ell_{H,\mathrm{M},\infty}$ in the fixed-horizon setting. Let $T$ denote the corresponding Bellman operator with contraction modulus $\beta$ and let $\Pi$ denote the corresponding projection. Furthermore, let $\eta^{\pi}$ be the fixed point of $T$, let $\eta^{\star}$ be the fixed point of $\Pi T$, and set $\varepsilon^{\mathrm{repr}}:=\ell(\Pi\eta^{\pi},\eta^{\pi})$. Then
$$\ell(\eta^{\star},\eta^{\pi})\leq\frac{\varepsilon^{\mathrm{repr}}}{1-\beta},\qquad\mathbb{E}\bigl[\ell(\eta_{k},\eta^{\pi})^{2}\bigr]\leq 2\,\mathbb{E}\bigl[\ell(\eta_{k},\eta^{\star})^{2}\bigr]+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}. \tag{44}$$
The decomposition in ([44](https://arxiv.org/html/2605.06866#S5.E44)) separates two conceptually different sources of error. The first is the algorithmic term, controlled by the finite-iteration bounds of the preceding theorems. The second is the deterministic projection term $\varepsilon^{\mathrm{repr}}/(1-\beta)$, which depends only on how well the chosen categorical support can approximate the exact return-distribution fixed point. Thus, once the support family is fixed, the SA analysis and the discretization bias can be analyzed independently. The full statement and proof are deferred to Appendix [F](https://arxiv.org/html/2605.06866#A6).
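The two inequalities follow from a short standard argument, sketched here under the assumption that $\Pi$ is nonexpansive in $\ell$, so that $\Pi T$ is a $\beta$-contraction. The full statement in the appendix may differ in its details. Using $\eta^{\star}=\Pi T\eta^{\star}$ and $T\eta^{\pi}=\eta^{\pi}$:

$$\ell(\eta^{\star},\eta^{\pi})\leq\ell(\Pi T\eta^{\star},\Pi T\eta^{\pi})+\ell(\Pi T\eta^{\pi},\eta^{\pi})\leq\beta\,\ell(\eta^{\star},\eta^{\pi})+\ell(\Pi\eta^{\pi},\eta^{\pi})=\beta\,\ell(\eta^{\star},\eta^{\pi})+\varepsilon^{\mathrm{repr}}.$$

Rearranging gives the first bound. The second follows from the triangle inequality $\ell(\eta_{k},\eta^{\pi})\leq\ell(\eta_{k},\eta^{\star})+\ell(\eta^{\star},\eta^{\pi})$ together with $(a+b)^{2}\leq 2a^{2}+2b^{2}$, after taking expectations.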
## 6Discussion of the results
Theorems[1](https://arxiv.org/html/2605.06866#Thmtheorem1)and[3](https://arxiv.org/html/2605.06866#Thmtheorem3)show that the standard categorical recursions are structurally simpler than they may initially appear\. After suitable statewise isometric embeddings, both CTD and MTD become asynchronous SA schemes that update one state block at a time and contract in a block\-supremum norm\. This common structure is what makes a unified finite\-iteration analysis possible\. The real point of departure is the perturbation geometry: CTD lives in a bounded\-noise regime because bounded categorical supports and the Cramér geometry give uniform samplewise control, whereas MTD produces affine perturbations in∥U∥2,∞\\lVert U\\rVert\_\{2,\\infty\}because the signed\-categorical MMD projection is affine rather than uniformly bounded\.
In the undiscounted fixed-horizon setting, the learned object is a horizon-indexed stack, and one sampled transition updates the whole stack block at the visited state because every horizon shares the same first reward and differs only in the bootstrapped tail. There is also no discount-driven contraction property to exploit. Instead, contraction is recovered by weighting horizon layers, and the theory is stated at episode boundaries because the averaged drift changes with the within-episode phase. Across both discounted and undiscounted fixed-horizon settings, the step size results are consistent: constant steps give a controllable $O(\alpha)$ neighborhood (with logarithmic mixing terms entering in the Markovian case), linearly-diminishing step sizes give the sharpest asymptotic decay once the threshold condition is met, and polynomially-diminishing step sizes trade rate for milder admissibility conditions. Further discussion is deferred to Appendix [G](https://arxiv.org/html/2605.06866#A7).
## 7Conclusion
We established finite-iteration guarantees for asynchronous categorical distributional temporal-difference learning in the scalar Cramér and multivariate signed-categorical MMD settings. The present theory is scoped to tabular, finite-state policy evaluation: it does not treat control or policy improvement, and its guarantees are expectation bounds rather than high-probability bounds. Within that scope, the discounted results cover i.i.d. sampled states and Markovian trajectories, while the undiscounted results treat the fixed-horizon episodic regime under the exact online recursions.
## Acknowledgments
The authors are grateful to Prof\. Zaiwei Chen for helpful feedback during the preparation of this manuscript\.
## References
- \[1\]\(2017\)Minimax regret bounds for reinforcement learning\.InInternational conference on machine learning,pp\. 263–272\.Cited by:[§4](https://arxiv.org/html/2605.06866#S4.p1.8)\.
- \[2\]H\. H\. Bauschke and P\. L\. Combettes\(2017\)Convex analysis and monotone operator theory in hilbert spaces\.2nd edition,Springer Publishing Company, Incorporated\.External Links:ISBN 3319483102Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p3.8),[Proposition 13](https://arxiv.org/html/2605.06866#Thmtheorem13)\.
- \[3\]A\. Beck\(2017\)First\-order methods in optimization\.SIAM\.Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p3.8),[Proposition 12](https://arxiv.org/html/2605.06866#Thmtheorem12.p1.4.3)\.
- \[4\]M\. G\. Bellemare, W\. Dabney, and R\. Munos\(2017\)A distributional perspective on reinforcement learning\.InInternational conference on machine learning,pp\. 449–458\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[5\]M\. G\. Bellemare, W\. Dabney, and M\. Rowland\(2023\)Distributional reinforcement learning\.MIT Press\.Cited by:[§B\.1](https://arxiv.org/html/2605.06866#A2.SS1.2.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p1.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.06866#S3.SS1.p1.8),[§4\.1](https://arxiv.org/html/2605.06866#S4.SS1.p1.6)\.
- \[6\]D\. P\. Bertsekas and J\. N\. Tsitsiklis\(1996\)Neuro\-dynamic programming\.Athena Scientific\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[7\]J\. Bhandari, D\. Russo, and R\. Singal\(2018\)A finite time analysis of temporal difference learning with linear function approximation\.InConference on learning theory,pp\. 1691–1692\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[8\]M\. Böck and C\. Heitzinger\(2022\)Speedy categorical distributional reinforcement learning and complexity analysis\.SIAM Journal on Mathematics of Data Science4\(2\),pp\. 675–693\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[9\]V\. S\. BorkarStochastic approximation: a dynamical systems viewpoint\.Vol\.100,Springer\.Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3](https://arxiv.org/html/2605.06866#S3.p3.8)\.
- \[10\]V\. S\. Borkar\(1998\)Asynchronous stochastic approximations\.SIAM Journal on Control and Optimization36\(3\),pp\. 840–851\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[11\]Z\. Chen, S\. T\. Maguluri, S\. Shakkottai, and K\. Shanmugam\(2024\)A lyapunov theory for finite\-sample guarantees of markovian stochastic approximation\.Operations Research72\(4\),pp\. 1352–1367\.Cited by:[§A\.1](https://arxiv.org/html/2605.06866#A1.SS1.4.p1.1),[§A\.3](https://arxiv.org/html/2605.06866#A1.SS3.1.p1.5),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[12\]Z\. Chen, S\. T\. Maguluri, S\. Shakkottai, and K\. Shanmugam\(2020\)Finite\-sample analysis of contractive stochastic approximation using smooth convex envelopes\.Advances in Neural Information Processing Systems33,pp\. 8223–8234\.Cited by:[§A\.1](https://arxiv.org/html/2605.06866#A1.SS1.4.p1.1),[§A\.2](https://arxiv.org/html/2605.06866#A1.SS2.1.p1.12),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[13\]Z\. Chen, S\. Zhang, T\. T\. Doan, J\. Clarke, and S\. T\. Maguluri\(2022\)Finite\-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning\.Automatica146,pp\. 110623\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[14\]W\. Dabney, G\. Ostrovski, D\. Silver, and R\. Munos\(2018\)Implicit quantile networks for distributional reinforcement learning\.InInternational conference on machine learning,pp\. 1096–1105\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[15\]W\. Dabney, M\. Rowland, M\. Bellemare, and R\. Munos\(2018\)Distributional reinforcement learning with quantile regression\.InProceedings of the AAAI conference on artificial intelligence,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[16\]G\. Dalal, B\. Szörényi, G\. Thoppe, and S\. Mannor\(2018\)Finite sample analyses for td \(0\) with function approximation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.32\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[17\]P\. Dayan and T\. J\. Sejnowski\(1994\)TD \(λ\\lambda\) converges with probability 1\.Machine Learning14\(3\),pp\. 295–301\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[18\]K\. De Asis, A\. Chan, S\. Pitis, R\. Sutton, and D\. Graves\(2020\)Fixed\-horizon temporal difference methods for stable reinforcement learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.34,pp\. 3741–3748\.Cited by:[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§4](https://arxiv.org/html/2605.06866#S4.p1.8),[§4](https://arxiv.org/html/2605.06866#S4.p2.3),[§4](https://arxiv.org/html/2605.06866#S4.p6.4)\.
- \[19\]A\. Gretton, K\. M\. Borgwardt, M\. J\. Rasch, B\. Schölkopf, and A\. Smola\(2012\)A kernel two\-sample test\.The journal of machine learning research13\(1\),pp\. 723–773\.Cited by:[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9)\.
- \[20\]L\. Gurvits, L\. Lin, and S\. J\. Hanson\(1994\)Incremental learning of evaluation functions for absorbing markov chains: new methods and theorems\.preprint\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[21\]R\. A\. Howard and J\. E\. Matheson\(1972\)Risk\-sensitive markov decision processes\.Management science18\(7\),pp\. 356–369\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[22\]T\. Jaakkola, M\. Jordan, and S\. Singh\(1993\)Convergence of stochastic iterative dynamic programming algorithms\.Advances in neural information processing systems6\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[23\]S\. C\. Jaquette\(1973\)Markov decision processes with a new optimality criterion: discrete time\.The Annals of Statistics1\(3\),pp\. 496–505\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[24\]S\. C\. Jaquette\(1976\)A utility criterion for markov decision processes\.Management Science23\(1\),pp\. 43–49\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[25\]T\. Kastner, M\. Rowland, Y\. Tang, M\. A\. Erdogdu, and A\. Farahmand\(2025\)Categorical distributional reinforcement learning with kullback\-leibler divergence: convergence and asymptotics\.InForty\-second International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[26\]H\. J\. Kushner and G\. G\. Yin\(2003\)Stochastic approximation and recursive algorithms and applications\.Springer\.Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3](https://arxiv.org/html/2605.06866#S3.p3.8)\.
- \[27\]D\. A\. Levin and Y\. Peres\(2017\)Markov chains and mixing times\.Vol\.107,American Mathematical Soc\.\.Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p1.15)\.
- \[28\]M\. L\. Littman and C\. Szepesvári\(1996\)A generalized reinforcement\-learning model: convergence and applications\.InICML,Vol\.96,pp\. 310–318\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[29\]L\. Ljung\(1977\)Analysis of recursive stochastic algorithms\.IEEE Transactions on Automatic Control22\(4\),pp\. 551–575\.External Links:[Document](https://dx.doi.org/10.1109/TAC.1977.1101561)Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[30\]C\. Lyle, M\. G\. Bellemare, and P\. S\. Castro\(2019\)A comparative analysis of expected and distributional reinforcement learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.33,pp\. 4504–4511\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[31\]J\. Moreau\(1965\)Proximité et dualité dans un espace hilbertien\.Bulletin de la Société mathématique de France93,pp\. 273–299\.Cited by:[§3](https://arxiv.org/html/2605.06866#S3.p3.8),[Proposition 13](https://arxiv.org/html/2605.06866#Thmtheorem13)\.
- \[32\]T\. Morimura, M\. Sugiyama, H\. Kashima, H\. Hachiya, and T\. Tanaka\(2010\)Nonparametric return distribution approximation for reinforcement learning\.InProceedings of the 27th International Conference on Machine Learning \(ICML\-10\),pp\. 799–806\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[33\]K\. Muandet, K\. Fukumizu, B\. Sriperumbudur, and B\. Schölkopf\(2017\-06\)Kernel mean embedding of distributions: a review and beyond\.Found\. Trends Mach\. Learn\.10\(1–2\),pp\. 1–141\.External Links:ISSN 1935\-8237,[Link](https://doi.org/10.1561/2200000060),[Document](https://dx.doi.org/10.1561/2200000060)Cited by:[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9)\.
- \[34\]G\. Patil, L\. Prashanth, D\. Nagaraj, and D\. Precup\(2023\)Finite time analysis of temporal difference learning with linear function approximation: tail averaging and regularisation\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 5438–5448\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[35\]Y\. Peng, K\. Jin, L\. Zhang, and Z\. Zhang\(2025\)A finite sample analysis of distributional td learning with linear function approximation\.arXiv preprint arXiv:2502\.14172\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[36\]Y\. Peng, L\. Zhang, and Z\. Zhang\(2024\)Statistical efficiency of distributional temporal difference learning\.Advances in Neural Information Processing Systems37,pp\. 24724–24761\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[37\]G\. Qu and A\. Wierman\(2020\)Finite\-time analysis of asynchronous stochastic approximation and Q\-learning\.InConference on learning theory,pp\. 3185–3205\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[38\]H\. Robbins and S\. Monro\(1951\)A stochastic approximation method\.The annals of mathematical statistics,pp\. 400–407\.Cited by:[Appendix A](https://arxiv.org/html/2605.06866#A1.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p3.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1)\.
- \[39\]M\. Rowland, M\. Bellemare, W\. Dabney, R\. Munos, and Y\. W\. Teh\(2018\)An analysis of categorical distributional reinforcement learning\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 29–37\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3\.1](https://arxiv.org/html/2605.06866#S3.SS1.p1.8),[§4\.1](https://arxiv.org/html/2605.06866#S4.SS1.p1.6)\.
- \[40\]M\. Rowland, R\. Dadashi, S\. Kumar, R\. Munos, M\. G\. Bellemare, and W\. Dabney\(2019\)Statistics and samples in distributional reinforcement learning\.InInternational Conference on Machine Learning,pp\. 5528–5536\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[41\]M\. Rowland, L\. K\. Wenliang, R\. Munos, C\. Lyle, Y\. Tang, and W\. Dabney\(2024\)Near\-minimax\-optimal distributional reinforcement learning with a generative model\.Advances in Neural Information Processing Systems37,pp\. 132774–132823\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[42\]A\. Smola, A\. Gretton, L\. Song, and B\. Schölkopf\(2007\)A hilbert space embedding for distributions\.InInternational conference on algorithmic learning theory,pp\. 13–31\.Cited by:[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9)\.
- \[43\]M\. J\. Sobel\(1982\)The variance of discounted markov decision processes\.Journal of Applied Probability19\(4\),pp\. 794–802\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[44\]R\. Srikant and L\. Ying\(2019\)Finite\-time error bounds for linear stochastic approximation and td learning\.InConference on learning theory,pp\. 2803–2830\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[45\]R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.Vol\.1,MIT press Cambridge\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§3](https://arxiv.org/html/2605.06866#S3.p1.15),[§4](https://arxiv.org/html/2605.06866#S4.p1.8)\.
- \[46\]R\. S\. Sutton\(1988\)Learning to predict by the methods of temporal differences\.Machine learning3\(1\),pp\. 9–44\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[47\]J\. N\. Tsitsiklis\(1994\)Asynchronous stochastic approximation and q\-learning\.Machine learning16\(3\),pp\. 185–202\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[48\]J\. Tsitsiklis and B\. Van Roy\(1996\)Analysis of temporal\-difference learning with function approximation\.Advances in neural information processing systems9\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1)\.
- \[49\]H\. Wiltzer, J\. Farebrother, A\. Gretton, and M\. Rowland\(2024\)Foundations of multivariate distributional reinforcement learning\.Advances in Neural Information Processing Systems37,pp\. 101297–101336\.Cited by:[§C\.1](https://arxiv.org/html/2605.06866#A3.SS1.2.p1.3),[§1](https://arxiv.org/html/2605.06866#S1.p1.1),[§2](https://arxiv.org/html/2605.06866#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p1.9),[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p2.5),[§3\.2](https://arxiv.org/html/2605.06866#S3.SS2.p2.8)\.
- \[50\]R\. Wu, M\. Uehara, and W\. Sun\(2023\)Distributional offline policy evaluation with predictive error guarantees\.InInternational Conference on Machine Learning,pp\. 37685–37712\.Cited by:[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
- \[51\]D\. Yang, L\. Zhao, Z\. Lin, T\. Qin, J\. Bian, and T\. Liu\(2019\)Fully parameterized quantile function for distributional reinforcement learning\.Advances in neural information processing systems32\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p1.1)\.
- \[52\]L\. Zhang, Y\. Peng, J\. Liang, W\. Yang, and Z\. Zhang\(2023\)Estimation and inference in distributional reinforcement learning\.arXiv preprint arXiv:2309\.17262\.Cited by:[§1](https://arxiv.org/html/2605.06866#S1.p2.1),[§2](https://arxiv.org/html/2605.06866#S2.p2.1)\.
## Appendix A Common finite-iteration ingredients in the block-supremum geometry
We collect the abstract ingredients \[[38](https://arxiv.org/html/2605.06866#bib.bib26), [29](https://arxiv.org/html/2605.06866#bib.bib27), [9](https://arxiv.org/html/2605.06866#bib.bib30), [26](https://arxiv.org/html/2605.06866#bib.bib31)\] used in the proofs of Theorems [1](https://arxiv.org/html/2605.06866#Thmtheorem1) and [3](https://arxiv.org/html/2605.06866#Thmtheorem3). Throughout this appendix, let

$$V := \prod_{s\in\mathcal{S}} E(s) \tag{45}$$

be a finite product of Euclidean state blocks. For $U\in V$ and $p\in[2,\infty)$, define

$$\lVert U\rVert_{2,\infty} := \max_{s\in\mathcal{S}}\lVert U(s)\rVert_{2}, \qquad \lVert U\rVert_{2,p} := \left(\sum_{s\in\mathcal{S}}\lVert U(s)\rVert_{2}^{p}\right)^{1/p}. \tag{46}$$

Because $V$ is finite-dimensional and each block $E(s)$ is Euclidean, all gradients, smoothness statements, and Moreau-envelope constructions below are taken with respect to the product Euclidean inner product on $V$. In particular, $\lVert\cdot\rVert_{2,p}$ is a genuine norm on a finite-dimensional Euclidean space, so the standard smoothness results for squared $\ell_{p}$-type norms apply directly.
For each state $s\in\mathcal{S}$, let $P_{s}:V\to V$ denote the coordinate projector onto block $s$.
### A.1 Norm comparison and smoothing
###### Lemma 10 (Block-norm comparison).
For every $U\in V$ and every $p\in[2,\infty)$,

$$\lVert U\rVert_{2,\infty} \leq \lVert U\rVert_{2,p} \leq \lvert\mathcal{S}\rvert^{1/p}\lVert U\rVert_{2,\infty}. \tag{47}$$

###### Proof.
The left inequality is immediate because the maximum of finitely many nonnegative numbers is bounded above by their $\ell_{p}$ norm. For the right inequality,

$$\lVert U\rVert_{2,p}^{p} = \sum_{s\in\mathcal{S}}\lVert U(s)\rVert_{2}^{p} \leq \sum_{s\in\mathcal{S}}\lVert U\rVert_{2,\infty}^{p} = \lvert\mathcal{S}\rvert\,\lVert U\rVert_{2,\infty}^{p}. \tag{48}$$

Taking $p$th roots proves the claim. ∎
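The two-sided comparison in Lemma 10 is easy to sanity-check numerically. The following is a minimal sketch, not part of the paper: it represents an element of $V$ as a list of equal-length blocks (an assumption made only for illustration) and tests (47) on random Gaussian draws. The function names are hypothetical.

```python
import random

def block_sup_norm(U):
    """||U||_{2,inf}: largest Euclidean norm over the state blocks."""
    return max(sum(x * x for x in block) ** 0.5 for block in U)

def block_p_norm(U, p):
    """||U||_{2,p}: l_p norm of the vector of blockwise Euclidean norms."""
    return sum((sum(x * x for x in block) ** 0.5) ** p for block in U) ** (1.0 / p)

def check_norm_comparison(num_states=6, dim=4, p=3.0, trials=200, seed=0):
    """Return True if (47) holds on every random draw (up to rounding)."""
    rng = random.Random(seed)
    for _ in range(trials):
        U = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_states)]
        lower = block_sup_norm(U)
        middle = block_p_norm(U, p)
        upper = num_states ** (1.0 / p) * lower
        if not (lower <= middle + 1e-12 and middle <= upper + 1e-12):
            return False
    return True
```

Calling `check_norm_comparison()` with other values of `p >= 2` exercises the same inequality; the upper constant $\lvert\mathcal{S}\rvert^{1/p}$ shrinks toward 1 as `p` grows.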
###### Proposition 11 (Choice of the smoothing exponent).
Let

$$p^{\star} := \max\{2, \lceil\log\lvert\mathcal{S}\rvert\rceil\}. \tag{49}$$

Then

$$\lvert\mathcal{S}\rvert^{2/p^{\star}} \leq e^{2}. \tag{50}$$

Consequently,

$$(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}} \leq e^{2}\log\lvert\mathcal{S}\rvert \qquad \text{for } \lvert\mathcal{S}\rvert \geq 2. \tag{51}$$

###### Proof.
If $\lvert\mathcal{S}\rvert \leq e^{2}$, then $p^{\star}=2$ and $\lvert\mathcal{S}\rvert^{2/p^{\star}} = \lvert\mathcal{S}\rvert \leq e^{2}$, so the claim is immediate. If $\lvert\mathcal{S}\rvert > e^{2}$, then $p^{\star} \geq \log\lvert\mathcal{S}\rvert$, hence

$$\lvert\mathcal{S}\rvert^{2/p^{\star}} = \exp\left(\frac{2\log\lvert\mathcal{S}\rvert}{p^{\star}}\right) \leq e^{2}. \tag{52}$$

For the second claim, if $\lvert\mathcal{S}\rvert \geq 3$ then $p^{\star}-1 \leq \log\lvert\mathcal{S}\rvert$, and multiplying (50) by this inequality gives (51). If $\lvert\mathcal{S}\rvert = 2$, then $p^{\star}=2$ and (51) reads $2 \leq e^{2}\log 2 \approx 5.12$, which holds directly. ∎
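As a quick illustration of Proposition 11, the sketch below computes $p^{\star}$ from (49) with the natural logarithm and checks (50) and (51) over a range of state-space sizes. It is only a numerical sanity check under that reading of $\log$; the function names are hypothetical.

```python
import math

def smoothing_exponent(num_states):
    """p_star = max(2, ceil(log |S|)) with the natural logarithm, as in (49)."""
    return max(2, math.ceil(math.log(num_states)))

def check_exponent_bounds(max_states=10000):
    """Return True if (50) holds for 1 <= |S| <= max_states and (51) for |S| >= 2."""
    e2 = math.exp(2.0)
    for n in range(1, max_states + 1):
        p_star = smoothing_exponent(n)
        factor = n ** (2.0 / p_star)            # |S|^(2 / p_star)
        if factor > e2 * (1 + 1e-12):           # inequality (50)
            return False
        if n >= 2 and (p_star - 1) * factor > e2 * math.log(n) * (1 + 1e-12):  # (51)
            return False
    return True
```

The point of the choice (49) is visible here: the factor $\lvert\mathcal{S}\rvert^{2/p^{\star}}$ stays bounded by $e^{2}$ while $p^{\star}$ itself grows only logarithmically in $\lvert\mathcal{S}\rvert$.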
###### Proposition 12 (Smooth block potential).
For $p\in[2,\infty)$, define

$$g_{p}(U) := \frac{1}{2}\lVert U\rVert_{2,p}^{2}. \tag{53}$$

Then $g_{p}$ is convex and $(p-1)$-smooth \[[3](https://arxiv.org/html/2605.06866#bib.bib32)\] with respect to $\lVert\cdot\rVert_{2,p}$.
###### Proof.
This is the standard smoothness of the squared $\ell_{p}$ norm for $p\geq 2$, applied to the mixed norm $\lVert\cdot\rVert_{2,p}$ on the finite-dimensional product space $V$. Equivalently, after choosing orthonormal bases in each block $E(s)$, the space $V$ identifies with $\mathbb{R}^{N}$ for some finite $N$, and $\lVert\cdot\rVert_{2,p}$ becomes a norm of $\ell_{p}$ type on grouped coordinates. The standard $\ell_{p}$ smoothness estimate therefore gives

$$g_{p}(U') \leq g_{p}(U) + \langle\nabla g_{p}(U), U'-U\rangle + \frac{p-1}{2}\lVert U'-U\rVert_{2,p}^{2} \tag{54}$$

for all $U,U'\in V$. ∎
###### Proposition 13 (Generalized Moreau envelope \[[31](https://arxiv.org/html/2605.06866#bib.bib33), [2](https://arxiv.org/html/2605.06866#bib.bib34)\] for block-supremum norms).
Fix $U^{\star}\in V$, $\vartheta>0$, and $p\in[2,\infty)$. Define

$$M_{\vartheta,p}(U) := \inf_{W\in V}\Bigl\{\frac{1}{2}\lVert W-U^{\star}\rVert_{2,\infty}^{2} + \frac{1}{2\vartheta}\lVert U-W\rVert_{2,p}^{2}\Bigr\}. \tag{55}$$

Then $M_{\vartheta,p}$ is convex and $(p-1)/\vartheta$-smooth with respect to $\lVert\cdot\rVert_{2,p}$. Moreover, for every $U\in V$,

$$(1+\vartheta)M_{\vartheta,p}(U) \leq \frac{1}{2}\lVert U-U^{\star}\rVert_{2,\infty}^{2} \leq (1+\vartheta\lvert\mathcal{S}\rvert^{2/p})M_{\vartheta,p}(U). \tag{56}$$

###### Proof.
This is the generalized Moreau-envelope approximation theorem used by Chen et al. \[[12](https://arxiv.org/html/2605.06866#bib.bib3), [11](https://arxiv.org/html/2605.06866#bib.bib1)\], specialized to

$$h_{1}(U) := \frac{1}{2}\lVert U-U^{\star}\rVert_{2,\infty}^{2}, \qquad h_{2}(U) := \frac{1}{2}\lVert U\rVert_{2,p}^{2}. \tag{57}$$

The smoothness claim follows from Proposition [12](https://arxiv.org/html/2605.06866#Thmtheorem12). The two-sided approximation follows from the standard envelope inequalities together with Lemma [10](https://arxiv.org/html/2605.06866#Thmtheorem10). ∎
### A.2 Abstract i.i.d. finite-iteration framework
Assume first that $(S_{k})_{k\geq 0}$ are i.i.d. with law $\rho$ on $\mathcal{S}$ and let

$$\rho_{\min} := \min_{s\in\mathcal{S}}\rho(s) > 0. \tag{58}$$

Let $\mathcal{O}:V\to V$ be a contraction in $\lVert\cdot\rVert_{2,\infty}$ with modulus $\beta\in(0,1)$ and fixed point $U^{\star}$:

$$\lVert\mathcal{O}U-\mathcal{O}U'\rVert_{2,\infty} \leq \beta\lVert U-U'\rVert_{2,\infty} \quad \text{for all } U,U'\in V. \tag{59}$$

Let $\widehat{T}(U;s,\xi)$ be a sampled target satisfying

$$\mathbb{E}\left[\widehat{T}(U_{k};S_{k},\xi_{k}) \mid U_{k}, S_{k}=s\right] = (\mathcal{O}U_{k})(s). \tag{60}$$

Consider the asynchronous recursion

$$U_{k+1} = U_{k} + \alpha_{k}P_{S_{k}}\left(\widehat{T}(U_{k};S_{k},\xi_{k}) - U_{k}(S_{k})\right). \tag{61}$$

For the centered recursion, write

$$W_{k} := U_{k} - U^{\star}. \tag{62}$$

Assume there exists a finite constant $A^{\mathrm{iid}}\geq 0$ such that

$$\mathbb{E}\left[\left\lVert\widehat{T}(U_{k};S_{k},\xi_{k}) - (\mathcal{O}U_{k})(S_{k})\right\rVert_{2}^{2} \mid U_{k}, S_{k}\right] \leq A^{\mathrm{iid}}\bigl(1+\lVert W_{k}\rVert_{2,\infty}^{2}\bigr) \qquad \text{a.s.} \tag{63}$$
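To make the recursion (61) concrete, here is a minimal simulation sketch under i.i.d. state sampling. Everything specific in it is an assumption chosen for illustration and not taken from the paper: the toy affine operator $(\mathcal{O}U)(s) = \beta\sum_{s'}P(s,s')U(s') + b(s)$, the transition matrix `P`, the offsets `b`, the sampling law `rho`, and the constant step size. The sampled target $\beta U(s') + b(s)$ with $s'\sim P(s,\cdot)$ is conditionally unbiased in the sense of (60), and only the visited block is updated.

```python
import random

def apply_operator(U, P, b, beta):
    """Toy operator (O U)(s) = beta * sum_t P[s][t] * U(t) + b(s).

    Each row of P is a probability vector, so O is a beta-contraction
    in the block-supremum norm.
    """
    n, d = len(U), len(U[0])
    return [[beta * sum(P[s][t] * U[t][i] for t in range(n)) + b[s][i]
             for i in range(d)] for s in range(n)]

def fixed_point(P, b, beta, sweeps=200):
    """Approximate U_star by iterating the contraction from zero."""
    U = [[0.0] * len(b[0]) for _ in b]
    for _ in range(sweeps):
        U = apply_operator(U, P, b, beta)
    return U

def block_sup_dist(U, U_other):
    """Block-supremum distance between two elements of V."""
    return max(sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5
               for u, v in zip(U, U_other))

def run_async(P, b, beta, rho, alpha, num_iters, seed=0):
    """Asynchronous single-block recursion (61) with a constant step size."""
    rng = random.Random(seed)
    n, d = len(b), len(b[0])
    states = list(range(n))
    U = [[0.0] * d for _ in range(n)]
    for _ in range(num_iters):
        s = rng.choices(states, weights=rho)[0]         # S_k ~ rho, i.i.d.
        s_next = rng.choices(states, weights=P[s])[0]   # xi_k: sampled successor
        target = [beta * U[s_next][i] + b[s][i] for i in range(d)]
        U[s] = [u + alpha * (t - u) for u, t in zip(U[s], target)]  # only block s moves
    return U

def demo():
    """Return (initial error, final error) in the block-supremum norm."""
    P = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]
    b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    beta, rho = 0.5, [0.2, 0.3, 0.5]
    U_star = fixed_point(P, b, beta)
    zero = [[0.0, 0.0] for _ in b]
    U = run_async(P, b, beta, rho, alpha=0.05, num_iters=20000)
    return block_sup_dist(zero, U_star), block_sup_dist(U, U_star)
```

With a constant step size the error does not vanish but settles at a noise floor, which is the qualitative behavior the finite-iteration bounds in this appendix quantify.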
###### Proposition 15 (Abstract i.i.d. finite-iteration bound).
Fix any $\vartheta>0$ and define

$$\beta_{\rho} := 1-\rho_{\min}(1-\beta), \qquad a_{1} := \frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}, \qquad a_{2} := 1-\beta_{\rho}\sqrt{\frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}}, \tag{64}$$

and

$$a_{3} := \frac{4(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}(A^{\mathrm{iid}}+2)(1+\vartheta)}{\vartheta}, \qquad a_{4} := \frac{2(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}A^{\mathrm{iid}}(1+\vartheta)}{\vartheta}. \tag{65}$$

Since $a_{2}\to 1-\beta_{\rho}>0$ as $\vartheta\downarrow 0$, the condition $a_{2}>0$ holds for all sufficiently small $\vartheta$. If $a_{2}>0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and $\alpha_{0}\leq a_{2}/a_{3}$, then for all $k\geq 0$,

$$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}\prod_{j=0}^{k-1}(1-a_{2}\alpha_{j}) + a_{4}\sum_{i=0}^{k-1}\alpha_{i}^{2}\prod_{j=i+1}^{k-1}(1-a_{2}\alpha_{j}). \tag{66}$$

In particular:
1. if $\alpha_{k}\equiv\alpha$ and $\alpha\leq a_{2}/a_{3}$, then for all $k\geq 0$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}(1-a_{2}\alpha)^{k} + \frac{a_{4}}{a_{2}}\alpha, \tag{67}$$

2. if $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/a_{2}$, and

   $$h \geq \max\left\{1, \frac{\alpha a_{3}}{a_{2}}\right\}, \tag{68}$$

   then for all $k\geq 0$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}\left(\frac{h}{k+h}\right)^{a_{2}\alpha} + \frac{4e\,\alpha^{2}a_{4}}{a_{2}\alpha-1}\cdot\frac{1}{k+h}, \tag{69}$$

3. if $\alpha_{k}=\alpha/(k+h)^{z}$ with $z\in(0,1)$ and

   $$h \geq \max\left\{1, \left(\frac{\alpha a_{3}}{a_{2}}\right)^{1/z}, \left(\frac{2z}{a_{2}\alpha}\right)^{1/(1-z)}\right\}, \tag{70}$$

   then for all $k\geq 0$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq a_{1}\lVert W_{0}\rVert_{2,\infty}^{2}\exp\left(-\frac{a_{2}\alpha}{1-z}\bigl((k+h)^{1-z}-h^{1-z}\bigr)\right) + \frac{2\alpha a_{4}}{a_{2}}\cdot\frac{1}{(k+h)^{z}}. \tag{71}$$
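The constants in (64) and (65) and the constant-step bound (67) can be evaluated directly. The sketch below is a small calculator under the definitions above; the function names and the example inputs are hypothetical, and the numbers are meant only to show how restrictive the step-size condition $\alpha\leq a_{2}/a_{3}$ can be.

```python
import math

def iid_constants(num_states, beta, rho_min, A_iid, theta):
    """Return (a1, a2, a3, a4) from (64)-(65), with p_star chosen as in (49)."""
    p_star = max(2, math.ceil(math.log(num_states)))
    s_fac = num_states ** (2.0 / p_star)              # |S|^(2 / p_star)
    ratio = (1 + theta * s_fac) / (1 + theta)
    beta_rho = 1 - rho_min * (1 - beta)
    a1 = ratio
    a2 = 1 - beta_rho * math.sqrt(ratio)
    a3 = 4 * (p_star - 1) * s_fac * (A_iid + 2) * (1 + theta) / theta
    a4 = 2 * (p_star - 1) * s_fac * A_iid * (1 + theta) / theta
    return a1, a2, a3, a4

def constant_step_bound(k, w0_sq, alpha, consts):
    """Right-hand side of (67); w0_sq is the squared initial error."""
    a1, a2, a3, a4 = consts
    if a2 <= 0 or alpha > a2 / a3:
        raise ValueError("need a2 > 0 and alpha <= a2 / a3")
    return a1 * w0_sq * (1 - a2 * alpha) ** k + (a4 / a2) * alpha
```

For example, `iid_constants(4, 0.5, 0.25, 1.0, 0.05)` gives $a_{3}=1008$ and $a_{4}=168$ with a small positive $a_{2}$. Increasing `theta` shrinks $a_{3}$ and $a_{4}$ but eventually drives $a_{2}$ negative, which is the trade-off behind the requirement that $\vartheta$ be sufficiently small.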
###### Proof.
Define the averaged asynchronous operator

$$H_{\rho}(U) := U + \sum_{s\in\mathcal{S}}\rho(s)P_{s}\bigl((\mathcal{O}U)(s)-U(s)\bigr). \tag{72}$$

For each state block $s$ and all $U,U'\in V$,

$$\bigl(H_{\rho}(U)-H_{\rho}(U')\bigr)(s) = (1-\rho(s))(U(s)-U'(s)) + \rho(s)\bigl((\mathcal{O}U)(s)-(\mathcal{O}U')(s)\bigr). \tag{73}$$

Since $\mathcal{O}$ is a $\beta$-contraction in $\lVert\cdot\rVert_{2,\infty}$,

$$\begin{aligned}
\left\lVert\bigl(H_{\rho}(U)-H_{\rho}(U')\bigr)(s)\right\rVert_{2} &\leq \bigl((1-\rho(s))+\rho(s)\beta\bigr)\lVert U-U'\rVert_{2,\infty} \\
&= \bigl(1-\rho(s)(1-\beta)\bigr)\lVert U-U'\rVert_{2,\infty} \\
&\leq \beta_{\rho}\lVert U-U'\rVert_{2,\infty}.
\end{aligned} \tag{74}$$

Taking the maximum over $s$ on both sides gives

$$\lVert H_{\rho}(U)-H_{\rho}(U')\rVert_{2,\infty} \leq \max_{s\in\mathcal{S}}\bigl(1-\rho(s)(1-\beta)\bigr)\lVert U-U'\rVert_{2,\infty} = \beta_{\rho}\lVert U-U'\rVert_{2,\infty}, \tag{75}$$

since the coefficient $1-\rho(s)(1-\beta)$ is decreasing in $\rho(s)$. Now apply Corollary 2.1 of Chen et al. \[[12](https://arxiv.org/html/2605.06866#bib.bib3)\] to the centered recursion for $W_{k}$, with contraction norm $\lVert\cdot\rVert_{2,\infty}$, smoothing norm $\lVert\cdot\rVert_{2,p^{\star}}$, Lyapunov function $M_{\vartheta,p^{\star}}$ from Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13), and noise constant $A^{\mathrm{iid}}$. The approximation factors in Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13) and the smoothness constant in Proposition [12](https://arxiv.org/html/2605.06866#Thmtheorem12) produce exactly the displayed constants. The constant-step, linearly-diminishing, and polynomially-diminishing step size bounds follow by specializing the displayed recursion. ∎
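The contraction modulus $\beta_{\rho}=1-\rho_{\min}(1-\beta)$ of the averaged operator can be probed empirically. The sketch below is an illustration only: it assumes a toy linear operator $(\mathcal{O}U)(s)=\beta\sum_{t}P(s,t)U(t)$ with a hypothetical row-stochastic matrix `P`, forms $H_{\rho}$ blockwise as in (72), and records the worst observed ratio of output to input distances over random pairs.

```python
import random

def block_sup(U):
    """Block-supremum norm ||U||_{2,inf}."""
    return max(sum(x * x for x in blk) ** 0.5 for blk in U)

def toy_operator(U, P, beta):
    """(O U)(s) = beta * sum_t P[s][t] * U(t): a beta-contraction in ||.||_{2,inf}."""
    n, d = len(U), len(U[0])
    return [[beta * sum(P[s][t] * U[t][i] for t in range(n)) for i in range(d)]
            for s in range(n)]

def averaged_operator(U, P, beta, rho):
    """Blockwise form of (72): block s becomes (1 - rho(s)) U(s) + rho(s) (O U)(s)."""
    OU = toy_operator(U, P, beta)
    return [[(1 - rho[s]) * U[s][i] + rho[s] * OU[s][i] for i in range(len(U[0]))]
            for s in range(len(U))]

def worst_ratio(P, beta, rho, dim=3, trials=500, seed=1):
    """Largest observed ||H(U) - H(U')|| / ||U - U'|| over random pairs."""
    rng = random.Random(seed)
    n = len(P)
    worst = 0.0
    for _ in range(trials):
        U = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        U2 = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        HU = averaged_operator(U, P, beta, rho)
        HU2 = averaged_operator(U2, P, beta, rho)
        diff = [[x - y for x, y in zip(a, b)] for a, b in zip(U, U2)]
        hdiff = [[x - y for x, y in zip(a, b)] for a, b in zip(HU, HU2)]
        worst = max(worst, block_sup(hdiff) / block_sup(diff))
    return worst
```

With `beta = 0.5` and `rho = [0.2, 0.3, 0.5]`, the bound (75) predicts a ratio of at most $1-0.2\cdot 0.5=0.9$; rarely visited states are what push the modulus toward 1.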
### A.3 Abstract Markovian finite-iteration framework
Assume now that $(S_{k})_{k\geq 0}$ is an irreducible, aperiodic Markov chain on $\mathcal{S}$ with stationary distribution $\mu_{\mathcal{S}}$. Let

$$\mu_{\min} := \min_{s\in\mathcal{S}}\mu_{\mathcal{S}}(s) > 0, \qquad \bar{\beta}_{\mu} := 1-\mu_{\min}(1-\beta). \tag{76}$$

Assume further that the chain satisfies the geometric mixing condition from Section [3](https://arxiv.org/html/2605.06866#S3) and let $t_{\delta}$ denote the corresponding mixing time.
Define the full-sample map

$$\widehat{H}(U;s,\xi) := U + P_{s}\bigl(\widehat{T}(U;s,\xi)-U(s)\bigr) \tag{77}$$

and the expected full-sample map

$$F(U,s) := U + P_{s}\bigl((\mathcal{O}U)(s)-U(s)\bigr). \tag{78}$$

Assume there exist finite constants $A_{1}\geq 0$, $A_{2}\geq 0$, and $B_{2}\geq 0$ such that

$$\lVert\widehat{H}(U;s,\xi)-\widehat{H}(U';s,\xi)\rVert_{2,\infty} \leq A_{1}\lVert U-U'\rVert_{2,\infty} \tag{79}$$

for all $U,U'\in V$, and

$$\Delta_{k} := \widehat{H}(U_{k};S_{k},\xi_{k}) - F(U_{k},S_{k}) \tag{80}$$

satisfies

$$\mathbb{E}\left[\Delta_{k}\,\mid\,\mathcal{G}_{k}\right] = 0, \qquad \lVert\Delta_{k}\rVert_{2,\infty} \leq A_{2}\lVert U_{k}\rVert_{2,\infty} + B_{2} \qquad \text{a.s.,} \tag{81}$$

where

$$\mathcal{G}_{k} := \sigma(U_{0},S_{0},\xi_{0},\dots,U_{k},S_{k}). \tag{82}$$
###### Proposition 16 (Abstract Markovian finite-iteration bound).
Fix any $\vartheta>0$ and define

$$\phi_{1} := \frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}, \qquad \phi_{2} := 1-\bar{\beta}_{\mu}\sqrt{\frac{1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta}}, \qquad \phi_{3} := \frac{114(p^{\star}-1)\bigl(1+\vartheta\lvert\mathcal{S}\rvert^{2/p^{\star}}\bigr)}{\vartheta}. \tag{83}$$

Since $\phi_{2}\to 1-\bar{\beta}_{\mu}>0$ as $\vartheta\downarrow 0$, the condition $\phi_{2}>0$ holds for all sufficiently small $\vartheta$. Set

$$A := A_{1}+A_{2}+1, \qquad c_{1} := \left(2\lVert U_{0}-U^{\star}\rVert_{2,\infty} + \frac{B_{2}}{A}\right)^{2}, \qquad c_{2} := B_{2}^{2}. \tag{84}$$

Let $t_{k}:=t_{\alpha_{k}}$ and $K:=\min\{k\geq 0 : k\geq t_{k}\}$. If $\phi_{2}>0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and

$$\sum_{i=k-t_{k}}^{k-1}\alpha_{i} \leq \min\left\{\frac{\phi_{2}}{\phi_{3}A^{2}}, \frac{1}{4A}\right\} \qquad \text{for all } k\geq K, \tag{85}$$

then:
1. if $\alpha_{k}\equiv\alpha$, then for all $k\geq t_{\alpha}$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq \phi_{1}c_{1}(1-\phi_{2}\alpha)^{k-t_{\alpha}} + \frac{\phi_{3}}{\phi_{2}}c_{2}\,\alpha t_{\alpha}; \tag{86}$$

2. if $\alpha_{k}=\alpha/(k+h)$, $\alpha>1/\phi_{2}$, and $h\geq 1$, then for all $k\geq K$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq \phi_{1}c_{1}\left(\frac{K+h}{k+h}\right)^{\phi_{2}\alpha} + \frac{8e\,\alpha^{2}\phi_{3}c_{2}}{\phi_{2}\alpha-1}\cdot\frac{t_{k}}{k+h}; \tag{87}$$

3. if $\alpha_{k}=\alpha/(k+h)^{z}$ with $z\in(0,1)$, then for all $k\geq K$,

   $$\mathbb{E}\left[\lVert W_{k}\rVert_{2,\infty}^{2}\right] \leq \phi_{1}c_{1}\exp\left(-\frac{\phi_{2}\alpha}{1-z}\bigl((k+h)^{1-z}-(K+h)^{1-z}\bigr)\right) + \frac{4\phi_{3}c_{2}\alpha}{\phi_{2}}\cdot\frac{t_{k}}{(k+h)^{z}}. \tag{88}$$
###### Proof.
This is the Markovian finite-iteration theorem of Chen et al. \[[11](https://arxiv.org/html/2605.06866#bib.bib1)\], translated into the present block-supremum notation and applied to the centered recursion $W_{k}:=U_{k}-U^{\star}$. The contraction factor is $\bar{\beta}_{\mu}$. The smoothness and approximation factors come from Propositions [12](https://arxiv.org/html/2605.06866#Thmtheorem12), [13](https://arxiv.org/html/2605.06866#Thmtheorem13), and [11](https://arxiv.org/html/2605.06866#Thmtheorem11). The constants $A_{1}$, $A_{2}$, and $B_{2}$ enter exactly through the samplewise Lipschitz and pathwise affine perturbation hypotheses. ∎
## Appendix B CTD details and proof of Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1)
We verify the hypotheses of Appendix [A](https://arxiv.org/html/2605.06866#A1) for the discounted scalar categorical temporal-difference recursion and thereby prove Theorem [1](https://arxiv.org/html/2605.06866#Thmtheorem1). The proof has three parts. First, we record the statewise Cramér isometry and the induced block-supremum contraction of the projected Bellman operator. Second, we verify the samplewise Lipschitz property of the sampled target map. Third, we verify the centered perturbation bounds required by the i.i.d. and Markovian finite-iteration arguments.
### B.1 Statewise Cramér isometry and contraction
For each state $s\in\mathcal{S}$, let

$$\Theta(s) = \{\theta_{1}(s) < \cdots < \theta_{d}(s)\} \subset \mathbb{R} \tag{89}$$

and let $\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta}$ denote the class of state-indexed categorical laws supported on these statewise grids. For

$$\eta(s) = \sum_{i=1}^{d}p_{i}(s)\delta_{\theta_{i}(s)}, \tag{90}$$

define the cumulative masses and the grid separations

$$c^{\eta}_{j}(s) := \sum_{i=1}^{j}p_{i}(s), \qquad \Delta_{j}(s) := \theta_{j+1}(s)-\theta_{j}(s), \qquad j=1,\dots,d-1. \tag{91}$$

The statewise Cramér embedding is

$$I_{s}(\eta(s)) := \bigl(\sqrt{\Delta_{1}(s)}\,c^{\eta}_{1}(s), \dots, \sqrt{\Delta_{d-1}(s)}\,c^{\eta}_{d-1}(s), 0\bigr)^{\top} \in \mathbb{R}^{d}. \tag{92}$$
###### Proposition 17 (Statewise Cramér isometry).
For every state $s\in\mathcal{S}$ and all $\mu,\nu\in\mathcal{F}_{\mathrm{C},\Theta(s)}$,

$$\ell_{\mathrm{C}}(\mu,\nu)^{2} = \sum_{j=1}^{d-1}\Delta_{j}(s)\bigl(c^{\mu}_{j}(s)-c^{\nu}_{j}(s)\bigr)^{2} = \lVert I_{s}(\mu)-I_{s}(\nu)\rVert_{2}^{2}. \tag{93}$$

Consequently, with the product embedding

$$I_{\mathrm{C}}(\eta) := (I_{s}(\eta(s)))_{s\in\mathcal{S}}, \tag{94}$$

we have

$$\lVert I_{\mathrm{C}}(\eta)-I_{\mathrm{C}}(\eta')\rVert_{2,\infty} = \ell_{\mathrm{C},\infty}(\eta,\eta'). \tag{95}$$

###### Proof.
For laws supported on the ordered grid $\Theta(s)$, the cumulative distribution functions are piecewise constant on the intervals

$$[\theta_{j}(s),\theta_{j+1}(s)), \qquad j=1,\dots,d-1, \tag{96}$$

with values $c^{\mu}_{j}(s)$ and $c^{\nu}_{j}(s)$ respectively. Therefore,

$$\ell_{\mathrm{C}}(\mu,\nu)^{2} = \int_{\mathbb{R}}\bigl(F_{\mu}(z)-F_{\nu}(z)\bigr)^{2}\,\mathrm{d}z = \sum_{j=1}^{d-1}\Delta_{j}(s)\bigl(c^{\mu}_{j}(s)-c^{\nu}_{j}(s)\bigr)^{2}, \tag{97}$$

which is exactly $\lVert I_{s}(\mu)-I_{s}(\nu)\rVert_{2}^{2}$. Taking the maximum over $s\in\mathcal{S}$ gives the product-space identity. ∎
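The isometry (93) can be checked on a small example by comparing the squared Euclidean distance between embedded vectors with a direct numerical integral of the squared CDF difference. This is a sketch for a single state, with the grid and probability vectors passed as plain lists; the function names and the example values are hypothetical.

```python
def cramer_embedding(probs, grid):
    """I_s from (92): sqrt(gap_j) times the cumulative mass for j < d, last entry 0."""
    cum, out = 0.0, []
    for j in range(len(grid) - 1):
        cum += probs[j]
        out.append((grid[j + 1] - grid[j]) ** 0.5 * cum)
    out.append(0.0)
    return out

def cdf(probs, grid, z):
    """CDF of the categorical law sum_i probs[i] * delta_{grid[i]} evaluated at z."""
    return sum(p for p, theta in zip(probs, grid) if theta <= z)

def cramer_sq_integral(p, q, grid, pts_per_cell=50):
    """Midpoint-rule integral of (F_p - F_q)^2 over [grid[0], grid[-1]].

    Outside this interval both CDFs agree (0 or 1) when p and q sum to one.
    """
    total = 0.0
    for j in range(len(grid) - 1):
        h = (grid[j + 1] - grid[j]) / pts_per_cell
        for m in range(pts_per_cell):
            z = grid[j] + (m + 0.5) * h
            total += h * (cdf(p, grid, z) - cdf(q, grid, z)) ** 2
    return total

def embedded_sq_dist(p, q, grid):
    """Squared Euclidean distance between the two embedded vectors."""
    return sum((a - b) ** 2
               for a, b in zip(cramer_embedding(p, grid), cramer_embedding(q, grid)))
```

On the non-uniform grid `[0.0, 1.0, 3.0, 4.0]` with `p = [0.1, 0.2, 0.3, 0.4]` and `q = [0.4, 0.3, 0.2, 0.1]`, both quantities equal $1\cdot 0.3^{2} + 2\cdot 0.4^{2} + 1\cdot 0.3^{2} = 0.5$ up to rounding, which shows how the $\sqrt{\Delta_{j}(s)}$ weights account for unequal grid spacing.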
Recall from Section [3.1](https://arxiv.org/html/2605.06866#S3.SS1) that

$$\mathcal{O}_{\mathrm{C}} := I_{\mathrm{C}}\circ\Pi_{\mathrm{C}}^{\Theta}T^{\pi}\circ I_{\mathrm{C}}^{-1}. \tag{98}$$

###### Proposition 18 (Embedded CTD contraction).
For all $U,U'\in I_{\mathrm{C}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{C},\Theta})$,

$$\lVert\mathcal{O}_{\mathrm{C}}U-\mathcal{O}_{\mathrm{C}}U'\rVert_{2,\infty} \leq \sqrt{\gamma}\,\lVert U-U'\rVert_{2,\infty}. \tag{99}$$

###### Proof.
Let $\eta:=I_{\mathrm{C}}^{-1}(U)$ and $\eta':=I_{\mathrm{C}}^{-1}(U')$. By Proposition [17](https://arxiv.org/html/2605.06866#Thmtheorem17) and the $\sqrt{\gamma}$-contraction of $\Pi_{\mathrm{C}}^{\Theta}T^{\pi}$ in the supremum Cramér metric \[[5](https://arxiv.org/html/2605.06866#bib.bib5)\],

$$\lVert\mathcal{O}_{\mathrm{C}}U-\mathcal{O}_{\mathrm{C}}U'\rVert_{2,\infty} = \ell_{\mathrm{C},\infty}(\Pi_{\mathrm{C}}^{\Theta}T^{\pi}\eta, \Pi_{\mathrm{C}}^{\Theta}T^{\pi}\eta') \leq \sqrt{\gamma}\,\ell_{\mathrm{C},\infty}(\eta,\eta') = \sqrt{\gamma}\,\lVert U-U'\rVert_{2,\infty}. \tag{100}$$

∎
### B\.2Unbiased sampled target and samplewise Lipschitz continuity
Recall the sampled CTD target from Section[3\.1](https://arxiv.org/html/2605.06866#S3.SS1):
T^C\(Uk;Sk,\(Rk,Sk\+1\)\):=ISk\(ΠCΘ\(Sk\)\(\(fRk,γ\)\#ISk\+1−1\(Uk\(Sk\+1\)\)\)\),\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\):=I\_\{S\_\{k\}\}\\Bigl\(\\Pi\_\{\\mathrm\{C\}\}^\{\\Theta\(S\_\{k\}\)\}\\bigl\(\(f\_\{R\_\{k\},\\gamma\}\)\_\{\\\#\}I\_\{S\_\{k\+1\}\}^\{\-1\}\(U\_\{k\}\(S\_\{k\+1\}\)\)\\bigr\)\\Bigr\),\(101\)wherefr,γ\(z\)=r\+γzf\_\{r,\\gamma\}\(z\)=r\+\\gamma z\.
###### Proposition 19\(Conditional unbiasedness of the sampled CTD target\)\.
For everyU∈IC\(ℱC,Θ𝒮\)U\\in I\_\{\\mathrm\{C\}\}\(\\mathcal\{F\}^\{\\mathcal\{S\}\}\_\{\\mathrm\{C\},\\Theta\}\)and every states∈𝒮s\\in\\mathcal\{S\},
𝔼\[T^C\(U;s,\(Rk,Sk\+1\)\)∣Uk=U,Sk=s\]=\(𝒪CU\)\(s\)\.\\mathbb\{E\}\\left\[\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U;s,\(R\_\{k\},S\_\{k\+1\}\)\)\\mid U\_\{k\}=U,S\_\{k\}=s\\right\]=\(\\mathcal\{O\}\_\{\\mathrm\{C\}\}U\)\(s\)\.\(102\)
###### Proof\.
Condition onUk=UU\_\{k\}=UandSk=sS\_\{k\}=s\. By definition,
$$\widehat{T}_{\mathrm{C}}(U;s,(R_{k},S_{k+1}))=I_{s}\Bigl(\Pi_{\mathrm{C}}^{\Theta(s)}\bigl((f_{R_{k},\gamma})_{\#}I_{S_{k+1}}^{-1}(U(S_{k+1}))\bigr)\Bigr). \tag{103}$$
Taking the conditional expectation of the pushforward law with respect to the one-step transition and reward law under $\pi$ gives exactly the Bellman target $(T^{\pi}I_{\mathrm{C}}^{-1}U)(s)$ at state $s$. The statewise categorical projection and the embedding $I_{s}$ are both deterministic and act linearly on mixtures of laws, so the expectation can be moved inside them, which yields
𝔼\[T^C\(U;s,\(Rk,Sk\+1\)\)∣Uk=U,Sk=s\]=Is\(\(ΠCΘTπIC−1U\)\(s\)\)=\(𝒪CU\)\(s\)\.\\mathbb\{E\}\\left\[\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U;s,\(R\_\{k\},S\_\{k\+1\}\)\)\\mid U\_\{k\}=U,S\_\{k\}=s\\right\]=I\_\{s\}\\bigl\(\(\\Pi\_\{\\mathrm\{C\}\}^\{\\Theta\}T^\{\\pi\}I\_\{\\mathrm\{C\}\}^\{\-1\}U\)\(s\)\\bigr\)=\(\\mathcal\{O\}\_\{\\mathrm\{C\}\}U\)\(s\)\.\(104\)∎
###### Proposition 20\(CTD samplewise Lipschitz continuity\)\.
For everys∈𝒮s\\in\\mathcal\{S\}, every\(r,s′\)∈\[0,1\]×𝒮\(r,s^\{\\prime\}\)\\in\[0,1\]\\times\\mathcal\{S\}, and everyU,U′∈IC\(ℱC,Θ𝒮\)U,U^\{\\prime\}\\in I\_\{\\mathrm\{C\}\}\(\\mathcal\{F\}^\{\\mathcal\{S\}\}\_\{\\mathrm\{C\},\\Theta\}\),
‖T^C\(U;s,\(r,s′\)\)−T^C\(U′;s,\(r,s′\)\)‖2≤∥U−U′∥2,∞\.\\left\\lVert\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U^\{\\prime\};s,\(r,s^\{\\prime\}\)\)\\right\\rVert\_\{2\}\\leq\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\}\.\(105\)Consequently, if
H^C\(U;s,\(r,s′\)\):=U\+Ps\(T^C\(U;s,\(r,s′\)\)−U\(s\)\),\\widehat\{H\}\_\{\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\):=U\+P\_\{s\}\\bigl\(\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-U\(s\)\\bigr\),\(106\)then
∥H^C\(U;s,\(r,s′\)\)−H^C\(U′;s,\(r,s′\)\)∥2,∞≤∥U−U′∥2,∞\.\\lVert\\widehat\{H\}\_\{\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-\\widehat\{H\}\_\{\\mathrm\{C\}\}\(U^\{\\prime\};s,\(r,s^\{\\prime\}\)\)\\rVert\_\{2,\\infty\}\\leq\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\}\.\(107\)
###### Proof\.
Fixs∈𝒮s\\in\\mathcal\{S\}and\(r,s′\)∈\[0,1\]×𝒮\(r,s^\{\\prime\}\)\\in\[0,1\]\\times\\mathcal\{S\}\. Let
μ:=Is′−1\(U\(s′\)\),ν:=Is′−1\(U′\(s′\)\)\.\\mu:=I\_\{s^\{\\prime\}\}^\{\-1\}\(U\(s^\{\\prime\}\)\),\\qquad\\nu:=I\_\{s^\{\\prime\}\}^\{\-1\}\(U^\{\\prime\}\(s^\{\\prime\}\)\)\.\(108\)Then
T^C\(U;s,\(r,s′\)\)=Is\(ΠCΘ\(s\)\(\(fr,γ\)\#μ\)\)\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)=I\_\{s\}\\bigl\(\\Pi\_\{\\mathrm\{C\}\}^\{\\Theta\(s\)\}\(\(f\_\{r,\\gamma\}\)\_\{\\\#\}\\mu\)\\bigr\)\(109\)and similarly forU′U^\{\\prime\}\. Using the statewise Cramér isometry, the nonexpansiveness of the categorical projection in the Cramér metric, and the deterministic pushforward contraction byγ\\sqrt\{\\gamma\},
$$\begin{aligned}
\left\lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{\mathrm{C}}(U';s,(r,s'))\right\rVert_{2}
&=\ell_{\mathrm{C},s}\bigl(\Pi_{\mathrm{C}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu),\Pi_{\mathrm{C}}^{\Theta(s)}((f_{r,\gamma})_{\#}\nu)\bigr)\\
&\leq\ell_{\mathrm{C},s}\bigl((f_{r,\gamma})_{\#}\mu,(f_{r,\gamma})_{\#}\nu\bigr)\\
&\leq\sqrt{\gamma}\,\ell_{\mathrm{C},s'}(\mu,\nu)\\
&=\sqrt{\gamma}\,\lVert U(s')-U'(s')\rVert_{2}\\
&\leq\lVert U-U'\rVert_{2,\infty}.
\end{aligned} \tag{110}$$
For the full-sample map, every block other than $s$ is unchanged, while the $s$-th block is replaced by the sampled target. Therefore,
$$\lVert\widehat{H}_{\mathrm{C}}(U;s,(r,s'))-\widehat{H}_{\mathrm{C}}(U';s,(r,s'))\rVert_{2,\infty}=\max\Bigl\{\max_{x\neq s}\lVert U(x)-U'(x)\rVert_{2},\ \lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{\mathrm{C}}(U';s,(r,s'))\rVert_{2}\Bigr\}, \tag{111}$$
which is bounded by $\lVert U-U'\rVert_{2,\infty}$ by the previous result. ∎
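The chain of inequalities in the proof of Proposition 20 can be illustrated on a small example. The sketch below assumes the usual linear-interpolation form of the categorical (Cramér) projection with clipping at the grid endpoints; the grids, the reward, the discount, and all function names are illustrative assumptions rather than the paper's implementation. It builds the embedded sampled target of (109) for two laws at the next state and compares their distance with $\sqrt{\gamma}$ times the distance between the embedded inputs.

```python
import numpy as np

def embed_cramer(probs, grid):
    """Map a categorical law on a sorted grid to (sqrt(Delta_j) * c_j)_j, with a trailing 0."""
    cum = np.cumsum(probs)[:-1]
    return np.append(np.sqrt(np.diff(grid)) * cum, 0.0)

def project_categorical(atoms, probs, grid):
    """Project sum_i probs[i] * delta_{atoms[i]} onto the sorted grid.

    Each atom is clipped to the grid range and its mass is split linearly
    between the two neighbouring grid points.
    """
    z = np.clip(atoms, grid[0], grid[-1])
    left = np.clip(np.searchsorted(grid, z, side="right") - 1, 0, len(grid) - 2)
    w_right = (z - grid[left]) / (grid[left + 1] - grid[left])
    out = np.zeros(len(grid))
    np.add.at(out, left, probs * (1.0 - w_right))
    np.add.at(out, left + 1, probs * w_right)
    return out

def sampled_ctd_target(p_next, grid_next, grid_s, r, gamma):
    """Push p_next forward through z -> r + gamma * z, project onto grid_s, then embed."""
    projected = project_categorical(r + gamma * grid_next, p_next, grid_s)
    return embed_cramer(projected, grid_s)

# Illustrative setup (assumptions): rewards in [0, 1], gamma = 0.9, returns in [0, 10].
rng = np.random.default_rng(0)
gamma, r = 0.9, 0.3
grid_next = np.linspace(0.0, 10.0, 11)            # grid at the next state s'
grid_s = 10.0 * np.linspace(0.0, 1.0, 11) ** 2    # a different, non-uniform grid at s
p = rng.dirichlet(np.ones(11))
q = rng.dirichlet(np.ones(11))

lhs = np.linalg.norm(sampled_ctd_target(p, grid_next, grid_s, r, gamma)
                     - sampled_ctd_target(q, grid_next, grid_s, r, gamma))
rhs = np.sqrt(gamma) * np.linalg.norm(embed_cramer(p, grid_next) - embed_cramer(q, grid_next))
print(lhs, rhs)
```

The grids at $s$ and $s'$ are deliberately different here, since the argument only uses nonexpansiveness of the projection and the $\sqrt{\gamma}$ scaling of the pushforward.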
### B\.3Uniform perturbation bound
###### Proposition 21\(Uniform bound on embedded categorical blocks\)\.
For every states∈𝒮s\\in\\mathcal\{S\}and every embedded categorical lawU\(s\)∈Is\(ℱC,Θ\(s\)\)U\(s\)\\in I\_\{s\}\(\\mathcal\{F\}\_\{\\mathrm\{C\},\\Theta\(s\)\}\),
∥U\(s\)∥22≤θd\(s\)−θ1\(s\)\.\\lVert U\(s\)\\rVert\_\{2\}^\{2\}\\leq\\theta\_\{d\}\(s\)\-\\theta\_\{1\}\(s\)\.\(112\)
###### Proof\.
Write
U\(s\)=Is\(η\(s\)\)=\(Δ1\(s\)c1η\(s\),…,Δd−1\(s\)cd−1η\(s\),0\)⊤\.U\(s\)=I\_\{s\}\(\\eta\(s\)\)=\\bigl\(\\sqrt\{\\Delta\_\{1\}\(s\)\}\\,c^\{\\eta\}\_\{1\}\(s\),\\dots,\\sqrt\{\\Delta\_\{d\-1\}\(s\)\}\\,c^\{\\eta\}\_\{d\-1\}\(s\),0\\bigr\)^\{\\top\}\.\(113\)Becauseη\(s\)\\eta\(s\)is a probability law, every cumulative masscjη\(s\)c^\{\\eta\}\_\{j\}\(s\)lies in\[0,1\]\[0,1\]\. Hence
∥U\(s\)∥22=∑j=1d−1Δj\(s\)\(cjη\(s\)\)2≤∑j=1d−1Δj\(s\)=θd\(s\)−θ1\(s\)\.\\lVert U\(s\)\\rVert\_\{2\}^\{2\}=\\sum\_\{j=1\}^\{d\-1\}\\Delta\_\{j\}\(s\)\\bigl\(c^\{\\eta\}\_\{j\}\(s\)\\bigr\)^\{2\}\\leq\\sum\_\{j=1\}^\{d\-1\}\\Delta\_\{j\}\(s\)=\\theta\_\{d\}\(s\)\-\\theta\_\{1\}\(s\)\.\(114\)∎
###### Proposition 22\(CTD centered perturbation bound\)\.
Let the CTD expected full\-sample map be
FC\(U,s\):=U\+Ps\(\(𝒪CU\)\(s\)−U\(s\)\)F\_\{\\mathrm\{C\}\}\(U,s\):=U\+P\_\{s\}\\bigl\(\(\\mathcal\{O\}\_\{\\mathrm\{C\}\}U\)\(s\)\-U\(s\)\\bigr\)\(115\)and
ΔkC:=H^C\(Uk;Sk,\(Rk,Sk\+1\)\)−FC\(Uk,Sk\)\.\\Delta\_\{k\}^\{\\mathrm\{C\}\}:=\\widehat\{H\}\_\{\\mathrm\{C\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\)\-F\_\{\\mathrm\{C\}\}\(U\_\{k\},S\_\{k\}\)\.\(116\)Then
𝔼\[ΔkC∣𝒢k\]=0\\mathbb\{E\}\\left\[\\Delta\_\{k\}^\{\\mathrm\{C\}\}\\,\\mid\\,\\mathcal\{G\}\_\{k\}\\right\]=0\(117\)and
∥ΔkC∥2,∞≤2BCa\.s\. for allk,\\lVert\\Delta\_\{k\}^\{\\mathrm\{C\}\}\\rVert\_\{2,\\infty\}\\leq 2B\_\{\\mathrm\{C\}\}\\qquad\\text\{a\.s\. for all \}k,\(118\)where
BC:=maxs∈𝒮θd\(s\)−θ1\(s\)\.B\_\{\\mathrm\{C\}\}:=\\max\_\{s\\in\\mathcal\{S\}\}\\sqrt\{\\theta\_\{d\}\(s\)\-\\theta\_\{1\}\(s\)\}\.\(119\)Consequently,
𝔼\[‖T^C\(Uk;Sk,\(Rk,Sk\+1\)\)−\(𝒪CUk\)\(Sk\)‖22∣Uk,Sk\]≤4BC2\.\\mathbb\{E\}\\left\[\\left\\lVert\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\)\-\(\\mathcal\{O\}\_\{\\mathrm\{C\}\}U\_\{k\}\)\(S\_\{k\}\)\\right\\rVert\_\{2\}^\{2\}\\,\\mid\\,U\_\{k\},\\ S\_\{k\}\\right\]\\leq 4B\_\{\\mathrm\{C\}\}^\{2\}\.\(120\)
###### Proof\.
The conditional mean identity follows from Proposition[19](https://arxiv.org/html/2605.06866#Thmtheorem19)\. For the pathwise bound, only theSkS\_\{k\}th block is nonzero, so
∥ΔkC∥2,∞=‖T^C\(Uk;Sk,\(Rk,Sk\+1\)\)−\(𝒪CUk\)\(Sk\)‖2\.\\lVert\\Delta\_\{k\}^\{\\mathrm\{C\}\}\\rVert\_\{2,\\infty\}=\\left\\lVert\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\)\-\(\\mathcal\{O\}\_\{\\mathrm\{C\}\}U\_\{k\}\)\(S\_\{k\}\)\\right\\rVert\_\{2\}\.\(121\)Both terms on the right are embedded categorical laws at stateSkS\_\{k\}\. By Proposition[21](https://arxiv.org/html/2605.06866#Thmtheorem21), each has Euclidean norm at mostBCB\_\{\\mathrm\{C\}\}\. Therefore, the triangle inequality yields
∥ΔkC∥2,∞≤2BC\.\\lVert\\Delta\_\{k\}^\{\\mathrm\{C\}\}\\rVert\_\{2,\\infty\}\\leq 2B\_\{\\mathrm\{C\}\}\.\(122\)Squaring gives the conditional second\-moment bound\. ∎
### B\.4Completion of the proof of Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1)
###### Proof of Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1)\.
We now verify the hypotheses of Appendix[A](https://arxiv.org/html/2605.06866#A1)for the discounted CTD recursion\.
First, Proposition[18](https://arxiv.org/html/2605.06866#Thmtheorem18)gives the block\-supremum contraction
∥𝒪CU−𝒪CU′∥2,∞≤βC∥U−U′∥2,∞,βC:=γ\.\\lVert\\mathcal\{O\}\_\{\\mathrm\{C\}\}U\-\\mathcal\{O\}\_\{\\mathrm\{C\}\}U^\{\\prime\}\\rVert\_\{2,\\infty\}\\leq\\beta\_\{\\mathrm\{C\}\}\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\},\\qquad\\beta\_\{\\mathrm\{C\}\}:=\\sqrt\{\\gamma\}\.\(123\)
Second, Proposition[20](https://arxiv.org/html/2605.06866#Thmtheorem20)gives the samplewise Lipschitz property in the Markovian setting with
AC,1=1\.A\_\{\\mathrm\{C\},1\}=1\.\(124\)Moreover, the proof of Proposition[20](https://arxiv.org/html/2605.06866#Thmtheorem20)gives the stronger target\-level estimate
∥T^C\(U;s,\(r,s′\)\)−T^C\(U′;s,\(r,s′\)\)∥2≤βC∥U−U′∥2,∞≤∥U−U′∥2,∞\.\\lVert\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U;s,\(r,s^\{\\prime\}\)\)\-\\widehat\{T\}\_\{\\mathrm\{C\}\}\(U^\{\\prime\};s,\(r,s^\{\\prime\}\)\)\\rVert\_\{2\}\\leq\\beta\_\{\\mathrm\{C\}\}\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\}\\leq\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\}\.\(125\)
Third, Proposition[22](https://arxiv.org/html/2605.06866#Thmtheorem22)gives the centered pathwise perturbation bound
𝔼\[ΔkC∣𝒢k\]=0,∥ΔkC∥2,∞≤2BCa\.s\.,\\mathbb\{E\}\[\\Delta\_\{k\}^\{\\mathrm\{C\}\}\\mid\\mathcal\{G\}\_\{k\}\]=0,\\qquad\\lVert\\Delta\_\{k\}^\{\\mathrm\{C\}\}\\rVert\_\{2,\\infty\}\\leq 2B\_\{\\mathrm\{C\}\}\\quad\\text\{a\.s\.\},\(126\)so Proposition[16](https://arxiv.org/html/2605.06866#Thmtheorem16)applies with
AC,1=1,AC,2=0,BC,2=2BC,AC:=AC,1\+AC,2\+1=2\.A\_\{\\mathrm\{C\},1\}=1,\\qquad A\_\{\\mathrm\{C\},2\}=0,\\qquad B\_\{\\mathrm\{C\},2\}=2B\_\{\\mathrm\{C\}\},\\qquad A\_\{\\mathrm\{C\}\}:=A\_\{\\mathrm\{C\},1\}\+A\_\{\\mathrm\{C\},2\}\+1=2\.\(127\)In particular, the i\.i\.d\. conditional second\-moment condition in Proposition[15](https://arxiv.org/html/2605.06866#Thmtheorem15)holds with
ACiid:=4BC2\.A\_\{\\mathrm\{C\}\}^\{\\mathrm\{iid\}\}:=4B\_\{\\mathrm\{C\}\}^\{2\}\.\(128\)
Applying Proposition[15](https://arxiv.org/html/2605.06866#Thmtheorem15)with
β=βC,Aiid=ACiid,ϑ=ϑC,ρ,p⋆=max\{2,⌈log\|𝒮\|⌉\},\\beta=\\beta\_\{\\mathrm\{C\}\},\\qquad A^\{\\mathrm\{iid\}\}=A\_\{\\mathrm\{C\}\}^\{\\mathrm\{iid\}\},\\qquad\\vartheta=\\vartheta\_\{\\mathrm\{C\},\\rho\},\\qquad p^\{\\star\}=\\max\\\{2,\\lceil\\log\\lvert\\mathcal\{S\}\\rvert\\rceil\\\},\(129\)yields Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1)\(i\) with
aC,1iid:=1\+ϑC,ρ\|𝒮\|2/p⋆1\+ϑC,ρ,aC,2iid:=1−β¯C,ρ1\+ϑC,ρ\|𝒮\|2/p⋆1\+ϑC,ρ,a\_\{\\mathrm\{C\},1\}^\{\\mathrm\{iid\}\}:=\\frac\{1\+\\vartheta\_\{\\mathrm\{C\},\\rho\}\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}\}\{1\+\\vartheta\_\{\\mathrm\{C\},\\rho\}\},\\qquad a\_\{\\mathrm\{C\},2\}^\{\\mathrm\{iid\}\}:=1\-\\bar\{\\beta\}\_\{\\mathrm\{C\},\\rho\}\\sqrt\{\\frac\{1\+\\vartheta\_\{\\mathrm\{C\},\\rho\}\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}\}\{1\+\\vartheta\_\{\\mathrm\{C\},\\rho\}\}\},\(130\)where
β¯C,ρ:=1−ρmin\(1−βC\),\\bar\{\\beta\}\_\{\\mathrm\{C\},\\rho\}:=1\-\\rho\_\{\\min\}\(1\-\\beta\_\{\\mathrm\{C\}\}\),\(131\)and
aC,3iid:=4\(p⋆−1\)\|𝒮\|2/p⋆\(4BC2\+2\)\(1\+ϑC,ρ\)ϑC,ρ,aC,4iid:=8\(p⋆−1\)\|𝒮\|2/p⋆BC2\(1\+ϑC,ρ\)ϑC,ρ\.a\_\{\\mathrm\{C\},3\}^\{\\mathrm\{iid\}\}:=\\frac\{4\(p^\{\\star\}\-1\)\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}\(4B\_\{\\mathrm\{C\}\}^\{2\}\+2\)\(1\+\\vartheta\_\{\\mathrm\{C\},\\rho\}\)\}\{\\vartheta\_\{\\mathrm\{C\},\\rho\}\},\\qquad a\_\{\\mathrm\{C\},4\}^\{\\mathrm\{iid\}\}:=\\frac\{8\(p^\{\\star\}\-1\)\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}B\_\{\\mathrm\{C\}\}^\{2\}\(1\+\\vartheta\_\{\\mathrm\{C\},\\rho\}\)\}\{\\vartheta\_\{\\mathrm\{C\},\\rho\}\}\.\(132\)
Applying Proposition[16](https://arxiv.org/html/2605.06866#Thmtheorem16)with
A1=AC,1=1,A2=AC,2=0,B2=BC,2=2BC,A\_\{1\}=A\_\{\\mathrm\{C\},1\}=1,\\qquad A\_\{2\}=A\_\{\\mathrm\{C\},2\}=0,\\qquad B\_\{2\}=B\_\{\\mathrm\{C\},2\}=2B\_\{\\mathrm\{C\}\},\(133\)yields Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1), part \(ii\), with
ϕ~C,1:=1\+ϑC,μ\|𝒮\|2/p⋆1\+ϑC,μ,ϕC,2:=1−β¯C,μ1\+ϑC,μ\|𝒮\|2/p⋆1\+ϑC,μ,\\tilde\{\\phi\}\_\{\\mathrm\{C\},1\}:=\\frac\{1\+\\vartheta\_\{\\mathrm\{C\},\\mu\}\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}\}\{1\+\\vartheta\_\{\\mathrm\{C\},\\mu\}\},\\qquad\\phi\_\{\\mathrm\{C\},2\}:=1\-\\bar\{\\beta\}\_\{\\mathrm\{C\},\\mu\}\\sqrt\{\\frac\{1\+\\vartheta\_\{\\mathrm\{C\},\\mu\}\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}\}\{1\+\\vartheta\_\{\\mathrm\{C\},\\mu\}\}\},\(134\)where
β¯C,μ:=1−μmin\(1−βC\),\\bar\{\\beta\}\_\{\\mathrm\{C\},\\mu\}:=1\-\\mu\_\{\\min\}\(1\-\\beta\_\{\\mathrm\{C\}\}\),\(135\)and
ϕ~C,3:=114\(p⋆−1\)\(1\+ϑC,μ\|𝒮\|2/p⋆\)ϑC,μ,ϕC,3:=4BC2ϕ~C,3\\tilde\{\\phi\}\_\{\\mathrm\{C\},3\}:=\\frac\{114\(p^\{\\star\}\-1\)\\bigl\(1\+\\vartheta\_\{\\mathrm\{C\},\\mu\}\\lvert\\mathcal\{S\}\\rvert^\{2/p^\{\\star\}\}\\bigr\)\}\{\\vartheta\_\{\\mathrm\{C\},\\mu\}\},\\qquad\\phi\_\{\\mathrm\{C\},3\}:=4B\_\{\\mathrm\{C\}\}^\{2\}\\tilde\{\\phi\}\_\{\\mathrm\{C\},3\}\(136\)together with
ϕC,1:=8ϕ~C,1,ϕC,4:=2ϕ~C,1BC2,\\phi\_\{\\mathrm\{C\},1\}:=8\\tilde\{\\phi\}\_\{\\mathrm\{C\},1\},\\qquad\\phi\_\{\\mathrm\{C\},4\}:=2\\tilde\{\\phi\}\_\{\\mathrm\{C\},1\}B\_\{\\mathrm\{C\}\}^\{2\},\(137\)and the step size condition
∑i=k−tkk−1αi≤min\{ϕC,24ϕC,3,18\}for allk≥K\.\\sum\_\{i=k\-t\_\{k\}\}^\{k\-1\}\\alpha\_\{i\}\\leq\\min\\left\\\{\\frac\{\\phi\_\{\\mathrm\{C\},2\}\}\{4\\phi\_\{\\mathrm\{C\},3\}\},\\frac\{1\}\{8\}\\right\\\}\\qquad\\text\{for all \}k\\geq K\.\(138\)The constant, linearly\-diminishing, and polynomially\-diminishing step size bounds in Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1)are therefore obtained by the corresponding specializations of Proposition[15](https://arxiv.org/html/2605.06866#Thmtheorem15)and Proposition[16](https://arxiv.org/html/2605.06866#Thmtheorem16)\. ∎
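To make the i.i.d. constants in (129) to (132) concrete, they can be evaluated for given problem parameters. The following is a minimal sketch that transcribes those formulas. It assumes the logarithm in $p^{\star}$ is the natural logarithm and treats $\vartheta_{\mathrm{C},\rho}$ and $\rho_{\min}$ as user-supplied inputs, since their choice is specified outside this section. The function name and the example values are hypothetical.

```python
import math

def ctd_iid_constants(num_states, gamma, rho_min, vartheta, B_C):
    """Evaluate the i.i.d. CTD constants of (129)-(132) for given inputs."""
    beta = math.sqrt(gamma)                              # beta_C from (123)
    p_star = max(2, math.ceil(math.log(num_states)))     # (129); natural log is an assumption
    s_pow = num_states ** (2.0 / p_star)                 # |S|^{2 / p*}
    beta_bar = 1.0 - rho_min * (1.0 - beta)              # (131)
    a1 = (1.0 + vartheta * s_pow) / (1.0 + vartheta)     # (130)
    a2 = 1.0 - beta_bar * math.sqrt(a1)                  # (130)
    scale = (p_star - 1) * s_pow * (1.0 + vartheta) / vartheta
    a3 = 4.0 * scale * (4.0 * B_C ** 2 + 2.0)            # (132)
    a4 = 8.0 * scale * B_C ** 2                          # (132)
    return {"p_star": p_star, "beta_bar": beta_bar, "a1": a1, "a2": a2, "a3": a3, "a4": a4}

# Illustrative inputs (assumptions): 100 states, grid range 10, so B_C = sqrt(10).
consts = ctd_iid_constants(num_states=100, gamma=0.9, rho_min=0.01,
                           vartheta=1.0, B_C=math.sqrt(10.0))
print(consts)
```

With an arbitrary `vartheta` the value of `a2` need not be positive; the sketch only evaluates the expressions and does not check the conditions under which Theorem 1 applies.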
### B\.5Proof of Corollary[2](https://arxiv.org/html/2605.06866#Thmtheorem2)
###### Proof\.
We use the linearly\-diminishing step size bounds in Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1)\.
For part \(i\), the linearly\-diminishing i\.i\.d\. bound in Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1)\(i\) gives
𝔼\[ℓC,∞\(ηk,η⋆\)2\]≤aC,1iidℓC,∞\(η0,η⋆\)2\(hk\+h\)aC,2iidα\+4eα2aC,4iidaC,2iidα−1⋅1k\+h\.\\mathbb\{E\}\\left\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)^\{2\}\\right\]\\leq a\_\{\\mathrm\{C\},1\}^\{\\mathrm\{iid\}\}\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{0\},\\eta^\{\\star\}\)^\{2\}\\left\(\\frac\{h\}\{k\+h\}\\right\)^\{a\_\{\\mathrm\{C\},2\}^\{\\mathrm\{iid\}\}\\alpha\}\+\\frac\{4e\\alpha^\{2\}a\_\{\\mathrm\{C\},4\}^\{\\mathrm\{iid\}\}\}\{a\_\{\\mathrm\{C\},2\}^\{\\mathrm\{iid\}\}\\alpha\-1\}\\cdot\\frac\{1\}\{k\+h\}\.\(139\)Sinceα\>1/aC,2iid\\alpha\>1/a\_\{\\mathrm\{C\},2\}^\{\\mathrm\{iid\}\}, one hasaC,2iidα\>1a\_\{\\mathrm\{C\},2\}^\{\\mathrm\{iid\}\}\\alpha\>1, and therefore the first term is alsoO\(\(k\+h\)−1\)O\(\(k\+h\)^\{\-1\}\)\. Hence
𝔼\[ℓC,∞\(ηk,η⋆\)2\]=O\(1k\+h\)\.\\mathbb\{E\}\\left\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)^\{2\}\\right\]=O\\left\(\\frac\{1\}\{k\+h\}\\right\)\.\(140\)Thus𝔼\[ℓC,∞\(ηk,η⋆\)\]≤ε\\mathbb\{E\}\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)\]\\leq\\varepsilonis guaranteed once
𝔼\[ℓC,∞\(ηk,η⋆\)2\]≤ε2,\\mathbb\{E\}\\left\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)^\{2\}\\right\]\\leq\\varepsilon^\{2\},\(141\)which holds fork=O\(ε−2\)k=O\(\\varepsilon^\{\-2\}\)\.
For part \(ii\), the linearly\-diminishing Markovian bound in Theorem[1](https://arxiv.org/html/2605.06866#Thmtheorem1), part \(ii\), gives
𝔼\[ℓC,∞\(ηk,η⋆\)2\]≤\(ϕC,1ℓC,∞\(η0,η⋆\)2\+ϕC,4\)\(K\+hk\+h\)ϕC,2α\+8eα2ϕC,3ϕC,2α−1⋅tkk\+h\.\\mathbb\{E\}\\left\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)^\{2\}\\right\]\\leq\\bigl\(\\phi\_\{\\mathrm\{C\},1\}\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{0\},\\eta^\{\\star\}\)^\{2\}\+\\phi\_\{\\mathrm\{C\},4\}\\bigr\)\\left\(\\frac\{K\+h\}\{k\+h\}\\right\)^\{\\phi\_\{\\mathrm\{C\},2\}\\alpha\}\+\\frac\{8e\\alpha^\{2\}\\phi\_\{\\mathrm\{C\},3\}\}\{\\phi\_\{\\mathrm\{C\},2\}\\alpha\-1\}\\cdot\\frac\{t\_\{k\}\}\{k\+h\}\.\(142\)Sinceα\>1/ϕC,2\\alpha\>1/\\phi\_\{\\mathrm\{C\},2\}, the first term isO\(\(k\+h\)−1\)O\(\(k\+h\)^\{\-1\}\)\. By geometric mixing,tk=tαk=O\(log\(k\+h\)\)t\_\{k\}=t\_\{\\alpha\_\{k\}\}=O\(\\log\(k\+h\)\), so the second term is
O\(log\(k\+h\)k\+h\)=O~\(1k\+h\)\.O\\left\(\\frac\{\\log\(k\+h\)\}\{k\+h\}\\right\)=\\widetilde\{O\}\\left\(\\frac\{1\}\{k\+h\}\\right\)\.\(143\)Therefore
𝔼\[ℓC,∞\(ηk,η⋆\)2\]=O~\(1k\+h\),\\mathbb\{E\}\\left\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)^\{2\}\\right\]=\\widetilde\{O\}\\left\(\\frac\{1\}\{k\+h\}\\right\),\(144\)and hence𝔼\[ℓC,∞\(ηk,η⋆\)\]≤ε\\mathbb\{E\}\[\\ell\_\{\\mathrm\{C\},\\infty\}\(\\eta\_\{k\},\\eta^\{\\star\}\)\]\\leq\\varepsilonis ensured fork=O~\(ε−2\)k=\\widetilde\{O\}\(\\varepsilon^\{\-2\}\)\. ∎
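The passage from the mean-square rate to the iteration count can be written out explicitly. In the worked step below, $C>0$ is a hypothetical constant (not named in the paper) that collects the $k$-independent factors of the i.i.d. bound, so that $\mathbb{E}[\ell_{\mathrm{C},\infty}(\eta_k,\eta^\star)^2]\le C/(k+h)$. The first inequality is Jensen's inequality.

```latex
% C > 0 is a hypothetical constant collecting the k-independent factors of (139).
\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_k,\eta^\star)\bigr]
  \le \sqrt{\mathbb{E}\bigl[\ell_{\mathrm{C},\infty}(\eta_k,\eta^\star)^2\bigr]}
  \le \sqrt{\frac{C}{k+h}}
  \le \varepsilon
\quad\text{whenever}\quad
k \ge \frac{C}{\varepsilon^{2}} - h .
```

In the Markovian case the same argument applies with $C\log(k+h)/(k+h)$ in place of $C/(k+h)$, which is why the iteration count picks up a logarithmic factor and is stated as $\widetilde{O}(\varepsilon^{-2})$.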
## Appendix CMTD details and proof of Theorem[3](https://arxiv.org/html/2605.06866#Thmtheorem3)
We now verify the abstract hypotheses of Appendix[A](https://arxiv.org/html/2605.06866#A1)for the discounted multivariate signed\-categorical recursion\.
### C\.1Statewise MMD isometry and contraction
For each states∈𝒮s\\in\\mathcal\{S\}, let
Θ\(s\)=\{θ1\(s\),…,θd\(s\)\}⊂ℝq,ℝ1d:=\{p∈ℝd:∑i=1dpi=1\}\.\\Theta\(s\)=\\\{\\theta\_\{1\}\(s\),\\dots,\\theta\_\{d\}\(s\)\\\}\\subset\\mathbb\{R\}^\{q\},\\qquad\\mathbb\{R\}\_\{1\}^\{d\}:=\\Bigl\\\{p\\in\\mathbb\{R\}^\{d\}:\\sum\_\{i=1\}^\{d\}p\_\{i\}=1\\Bigr\\\}\.\(145\)Letℳ\(ℝq\)\\mathcal\{M\}\(\\mathbb\{R\}^\{q\}\)denote the space of finite signed Borel measures onℝq\\mathbb\{R\}^\{q\}\. Define
ℱM,Θ𝒮:=\{η:𝒮→ℳ\(ℝq\):η\(s\)=∑i=1dpi\(s\)δθi\(s\),p\(s\)∈ℝ1dfor alls∈𝒮\}\.\\mathcal\{F\}^\{\\mathcal\{S\}\}\_\{\\mathrm\{M\},\\Theta\}:=\\Bigl\\\{\\eta:\\mathcal\{S\}\\to\\mathcal\{M\}\(\\mathbb\{R\}^\{q\}\):\\eta\(s\)=\\sum\_\{i=1\}^\{d\}p\_\{i\}\(s\)\\delta\_\{\\theta\_\{i\}\(s\)\},\\ p\(s\)\\in\\mathbb\{R\}\_\{1\}^\{d\}\\text\{ for all \}s\\in\\mathcal\{S\}\\Bigr\\\}\.\(146\)
Letκ:ℝq×ℝq→ℝ\\kappa:\\mathbb\{R\}^\{q\}\\times\\mathbb\{R\}^\{q\}\\to\\mathbb\{R\}be the characteristic kernel from Section[3\.2](https://arxiv.org/html/2605.06866#S3.SS2)\. For each statess, let
Ks:=\(κ\(θi\(s\),θj\(s\)\)\)i,j=1dK\_\{s\}:=\\bigl\(\\kappa\(\\theta\_\{i\}\(s\),\\theta\_\{j\}\(s\)\)\\bigr\)\_\{i,j=1\}^\{d\}\(147\)be the Gram matrix onΘ\(s\)\\Theta\(s\)\. For
η\(s\)=∑i=1dpi\(s\)δθi\(s\),\\eta\(s\)=\\sum\_\{i=1\}^\{d\}p\_\{i\}\(s\)\\delta\_\{\\theta\_\{i\}\(s\)\},\(148\)define
Is\(η\(s\)\):=Ks1/2p\(s\)∈ℝd\.I\_\{s\}\(\\eta\(s\)\):=K\_\{s\}^\{1/2\}p\(s\)\\in\\mathbb\{R\}^\{d\}\.\(149\)
###### Proposition 23\(Statewise MMD isometry\)\.
For every states∈𝒮s\\in\\mathcal\{S\}and all signed\-categorical lawsμ,ν\\mu,\\nusupported onΘ\(s\)\\Theta\(s\),
MMDκ\(μ,ν\)2=∥Is\(μ\)−Is\(ν\)∥22\.\\mathrm\{MMD\}\_\{\\kappa\}\(\\mu,\\nu\)^\{2\}=\\lVert I\_\{s\}\(\\mu\)\-I\_\{s\}\(\\nu\)\\rVert\_\{2\}^\{2\}\.\(150\)Consequently, with
IM\(η\):=\(Is\(η\(s\)\)\)s∈𝒮,I\_\{\\mathrm\{M\}\}\(\\eta\):=\(I\_\{s\}\(\\eta\(s\)\)\)\_\{s\\in\\mathcal\{S\}\},\(151\)one has
∥IM\(η\)−IM\(η′\)∥2,∞=ℓM,∞\(η,η′\)\.\\lVert I\_\{\\mathrm\{M\}\}\(\\eta\)\-I\_\{\\mathrm\{M\}\}\(\\eta^\{\\prime\}\)\\rVert\_\{2,\\infty\}=\\ell\_\{\\mathrm\{M\},\\infty\}\(\\eta,\\eta^\{\\prime\}\)\.\(152\)
###### Proof\.
Write
μ=∑i=1dpiδθi\(s\),ν=∑i=1dqiδθi\(s\)\.\\mu=\\sum\_\{i=1\}^\{d\}p\_\{i\}\\delta\_\{\\theta\_\{i\}\(s\)\},\\qquad\\nu=\\sum\_\{i=1\}^\{d\}q\_\{i\}\\delta\_\{\\theta\_\{i\}\(s\)\}\.\(153\)Then
μ−ν=∑i=1d\(pi−qi\)δθi\(s\)\.\\mu\-\\nu=\\sum\_\{i=1\}^\{d\}\(p\_\{i\}\-q\_\{i\}\)\\delta\_\{\\theta\_\{i\}\(s\)\}\.\(154\)By the definition of MMD,
MMDκ\(μ,ν\)2=∑i=1d∑j=1d\(pi−qi\)\(pj−qj\)κ\(θi\(s\),θj\(s\)\)=\(p−q\)⊤Ks\(p−q\)\.\\mathrm\{MMD\}\_\{\\kappa\}\(\\mu,\\nu\)^\{2\}=\\sum\_\{i=1\}^\{d\}\\sum\_\{j=1\}^\{d\}\(p\_\{i\}\-q\_\{i\}\)\(p\_\{j\}\-q\_\{j\}\)\\kappa\(\\theta\_\{i\}\(s\),\\theta\_\{j\}\(s\)\)=\(p\-q\)^\{\\top\}K\_\{s\}\(p\-q\)\.\(155\)SinceKsK\_\{s\}is positive semidefinite,
\(p−q\)⊤Ks\(p−q\)=∥Ks1/2\(p−q\)∥22=∥Is\(μ\)−Is\(ν\)∥22\.\(p\-q\)^\{\\top\}K\_\{s\}\(p\-q\)=\\lVert K\_\{s\}^\{1/2\}\(p\-q\)\\rVert\_\{2\}^\{2\}=\\lVert I\_\{s\}\(\\mu\)\-I\_\{s\}\(\\nu\)\\rVert\_\{2\}^\{2\}\.\(156\)Taking the maximum overs∈𝒮s\\in\\mathcal\{S\}gives the product isometry\. ∎
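The statewise MMD isometry (150) is easy to verify numerically. The sketch below assumes a Gaussian kernel purely for illustration (the kernel of Section 3.2 may differ), a small hand-picked grid in $\mathbb{R}^2$, and hypothetical helper names. It computes the symmetric square root $K_s^{1/2}$ by eigendecomposition and compares the quadratic form (155) with the squared Euclidean distance of the embeddings (149).

```python
import numpy as np

def gaussian_gram(points, bandwidth=0.5):
    """Gram matrix of a Gaussian kernel (an illustrative choice of kernel)."""
    sq = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def psd_sqrt(K):
    """Symmetric positive semidefinite square root via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals = np.clip(eigvals, 0.0, None)      # guard against tiny negative round-off
    return (eigvecs * np.sqrt(eigvals)) @ eigvecs.T

# Illustrative grid Theta(s) in R^2 and two signed weight vectors summing to one.
theta = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
p = np.array([0.5, 0.7, -0.4, 0.1, 0.1])       # signed-categorical weights
q = np.full(5, 0.2)

K = gaussian_gram(theta)
K_half = psd_sqrt(K)

mmd_sq = (p - q) @ K @ (p - q)                       # quadratic form in (155)
emb_sq = np.sum((K_half @ p - K_half @ q) ** 2)      # squared distance of embeddings
print(mmd_sq, emb_sq)
```

Note that `p` has a negative entry, which is allowed because the laws here are signed with total mass one.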
Recall from Section[3\.2](https://arxiv.org/html/2605.06866#S3.SS2)that
𝒪M:=IM∘ΠMΘTπ∘IM−1\.\\mathcal\{O\}\_\{\\mathrm\{M\}\}:=I\_\{\\mathrm\{M\}\}\\circ\\Pi\_\{\\mathrm\{M\}\}^\{\\Theta\}T^\{\\pi\}\\circ I\_\{\\mathrm\{M\}\}^\{\-1\}\.\(157\)
###### Proposition 24\(Embedded MTD contraction\)\.
For allU,U′∈IM\(ℱM,Θ𝒮\)U,U^\{\\prime\}\\in I\_\{\\mathrm\{M\}\}\(\\mathcal\{F\}^\{\\mathcal\{S\}\}\_\{\\mathrm\{M\},\\Theta\}\),
∥𝒪MU−𝒪MU′∥2,∞≤γc/2∥U−U′∥2,∞\.\\lVert\\mathcal\{O\}\_\{\\mathrm\{M\}\}U\-\\mathcal\{O\}\_\{\\mathrm\{M\}\}U^\{\\prime\}\\rVert\_\{2,\\infty\}\\leq\\gamma^\{c/2\}\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\}\.\(158\)
###### Proof\.
Letη:=IM−1\(U\)\\eta:=I\_\{\\mathrm\{M\}\}^\{\-1\}\(U\)andη′:=IM−1\(U′\)\\eta^\{\\prime\}:=I\_\{\\mathrm\{M\}\}^\{\-1\}\(U^\{\\prime\}\)\. By Proposition[23](https://arxiv.org/html/2605.06866#Thmtheorem23)and the contraction of𝒪M\\mathcal\{O\}\_\{\\mathrm\{M\}\}in the supremum MMD metric\[[49](https://arxiv.org/html/2605.06866#bib.bib4)\],
∥𝒪MU−𝒪MU′∥2,∞=ℓM,∞\(ΠMΘTπη,ΠMΘTπη′\)≤γc/2ℓM,∞\(η,η′\)=γc/2∥U−U′∥2,∞\.\\lVert\\mathcal\{O\}\_\{\\mathrm\{M\}\}U\-\\mathcal\{O\}\_\{\\mathrm\{M\}\}U^\{\\prime\}\\rVert\_\{2,\\infty\}=\\ell\_\{\\mathrm\{M\},\\infty\}\(\\Pi\_\{\\mathrm\{M\}\}^\{\\Theta\}T^\{\\pi\}\\eta,\\Pi\_\{\\mathrm\{M\}\}^\{\\Theta\}T^\{\\pi\}\\eta^\{\\prime\}\)\\leq\\gamma^\{c/2\}\\ell\_\{\\mathrm\{M\},\\infty\}\(\\eta,\\eta^\{\\prime\}\)=\\gamma^\{c/2\}\\lVert U\-U^\{\\prime\}\\rVert\_\{2,\\infty\}\.\(159\)∎
### C\.2Unbiased sampled target, Lipschitz continuity, and perturbation control
Recall the sampled MTD target from Section[3\.2](https://arxiv.org/html/2605.06866#S3.SS2):
T^M\(Uk;Sk,\(Rk,Sk\+1\)\):=ISk\(ΠMΘ\(Sk\)\(\(fRk,γ\)\#ISk\+1−1\(Uk\(Sk\+1\)\)\)\)\.\\widehat\{T\}\_\{\\mathrm\{M\}\}\(U\_\{k\};S\_\{k\},\(R\_\{k\},S\_\{k\+1\}\)\):=I\_\{S\_\{k\}\}\\Bigl\(\\Pi\_\{\\mathrm\{M\}\}^\{\\Theta\(S\_\{k\}\)\}\\bigl\(\(f\_\{R\_\{k\},\\gamma\}\)\_\{\\\#\}I\_\{S\_\{k\+1\}\}^\{\-1\}\(U\_\{k\}\(S\_\{k\+1\}\)\)\\bigr\)\\Bigr\)\.\(160\)
###### Proposition 25\(Embedded MTD projection as an affine Euclidean projector\)\.
For each states∈𝒮s\\in\\mathcal\{S\}, let
$$\mathcal{A}_{s}:=K_{s}^{1/2}\mathbb{R}_{1}^{d}\subset\mathbb{R}^{d}. \tag{161}$$
For every mass-1 finite signed measure $\nu$ on $\mathbb{R}^{q}$, define
bs\(ν\):=\(∫κ\(θi\(s\),y\)𝑑ν\(y\)\)i=1d∈ℝdb\_\{s\}\(\\nu\):=\\left\(\\int\\kappa\(\\theta\_\{i\}\(s\),y\)\\,d\\nu\(y\)\\right\)\_\{i=1\}^\{d\}\\in\\mathbb\{R\}^\{d\}\(162\)and
ζs\(ν\):=Ks†/2bs\(ν\),\\zeta\_\{s\}\(\\nu\):=K\_\{s\}^\{\\dagger/2\}b\_\{s\}\(\\nu\),\(163\)whereKs†K\_\{s\}^\{\\dagger\}is the Moore\-Penrose pseudoinverse ofKsK\_\{s\}\. Thenbs\(ν\)∈range\(Ks\)b\_\{s\}\(\\nu\)\\in\\operatorname\{range\}\(K\_\{s\}\)and
IM,s\(ΠMΘ\(s\)ν\)=proj𝒜s\(ζs\(ν\)\),I\_\{\\mathrm\{M\},s\}\\bigl\(\\Pi\_\{\\mathrm\{M\}\}^\{\\Theta\(s\)\}\\nu\\bigr\)=\\operatorname\{proj\}\_\{\\mathcal\{A\}\_\{s\}\}\\bigl\(\\zeta\_\{s\}\(\\nu\)\\bigr\),\(164\)whereproj𝒜s\\operatorname\{proj\}\_\{\\mathcal\{A\}\_\{s\}\}is the Euclidean orthogonal projection onto𝒜s\\mathcal\{A\}\_\{s\}\. Consequently, the map
$$J_{s}(\nu):=I_{\mathrm{M},s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}\nu\bigr) \tag{165}$$
is affine on the affine space of mass-1 finite signed measures.
###### Proof\.
Letℋκ\\mathcal\{H\}\_\{\\kappa\}be the RKHS ofκ\\kappa, letφ:ℝq→ℋκ\\varphi:\\mathbb\{R\}^\{q\}\\to\\mathcal\{H\}\_\{\\kappa\}denote the feature map, and define the linear operator
Φs:ℝd→ℋκ,Φsp:=∑i=1dpiφ\(θi\(s\)\)\.\\Phi\_\{s\}:\\mathbb\{R\}^\{d\}\\to\\mathcal\{H\}\_\{\\kappa\},\\qquad\\Phi\_\{s\}p:=\\sum\_\{i=1\}^\{d\}p\_\{i\}\\varphi\(\\theta\_\{i\}\(s\)\)\.\(166\)Then
$$K_{s}=\Phi_{s}^{*}\Phi_{s}. \tag{167}$$
For a mass-1 finite signed measure $\nu$, define its kernel mean embedding by
mν:=∫φ\(y\)𝑑ν\(y\)∈ℋκ\.m\_\{\\nu\}:=\\int\\varphi\(y\)\\,d\\nu\(y\)\\in\\mathcal\{H\}\_\{\\kappa\}\.\(168\)LetPsP\_\{s\}denote the orthogonal projection inℋκ\\mathcal\{H\}\_\{\\kappa\}ontorange\(Φs\)\\operatorname\{range\}\(\\Phi\_\{s\}\)\. SinceΦs∗\(I−Ps\)=0\\Phi\_\{s\}^\{\*\}\(I\-P\_\{s\}\)=0, we have
bs\(ν\)=Φs∗mν=Φs∗Psmν∈range\(Φs∗Φs\)=range\(Ks\)\.b\_\{s\}\(\\nu\)=\\Phi\_\{s\}^\{\*\}m\_\{\\nu\}=\\Phi\_\{s\}^\{\*\}P\_\{s\}m\_\{\\nu\}\\in\\operatorname\{range\}\(\\Phi\_\{s\}^\{\*\}\\Phi\_\{s\}\)=\\operatorname\{range\}\(K\_\{s\}\)\.\(169\)
Now fixp∈ℝ1dp\\in\\mathbb\{R\}\_\{1\}^\{d\}and write
μp:=∑i=1dpiδθi\(s\)\.\\mu\_\{p\}:=\\sum\_\{i=1\}^\{d\}p\_\{i\}\\delta\_\{\\theta\_\{i\}\(s\)\}\.\(170\)By the reproducing\-kernel representation of MMD,
MMDκ\(μp,ν\)2=∥Φsp−mν∥ℋκ2=∥Φsp−Psmν∥ℋκ2\+∥\(I−Ps\)mν∥ℋκ2\.\\mathrm\{MMD\}\_\{\\kappa\}\(\\mu\_\{p\},\\nu\)^\{2\}=\\lVert\\Phi\_\{s\}p\-m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}^\{2\}=\\lVert\\Phi\_\{s\}p\-P\_\{s\}m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}^\{2\}\+\\lVert\(I\-P\_\{s\}\)m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}^\{2\}\.\(171\)
We next identifyPsmνP\_\{s\}m\_\{\\nu\}in coefficient form\. SinceKs=Φs∗ΦsK\_\{s\}=\\Phi\_\{s\}^\{\*\}\\Phi\_\{s\}, the standard Moore\-Penrose identity gives
ΦsKs†Φs∗=Ps\.\\Phi\_\{s\}K\_\{s\}^\{\\dagger\}\\Phi\_\{s\}^\{\*\}=P\_\{s\}\.\(172\)Therefore,
Φs\(Ks†bs\(ν\)\)=ΦsKs†Φs∗mν=Psmν\.\\Phi\_\{s\}\\bigl\(K\_\{s\}^\{\\dagger\}b\_\{s\}\(\\nu\)\\bigr\)=\\Phi\_\{s\}K\_\{s\}^\{\\dagger\}\\Phi\_\{s\}^\{\*\}m\_\{\\nu\}=P\_\{s\}m\_\{\\nu\}\.\(173\)Substituting this into the previous display yields
∥Φsp−Psmν∥ℋκ=∥Φs\(p−Ks†bs\(ν\)\)∥ℋκ\.\\lVert\\Phi\_\{s\}p\-P\_\{s\}m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}=\\lVert\\Phi\_\{s\}\(p\-K\_\{s\}^\{\\dagger\}b\_\{s\}\(\\nu\)\)\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}\.\(174\)Since∥Φsx∥ℋκ2=x⊤Ksx=∥Ks1/2x∥22\\lVert\\Phi\_\{s\}x\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}^\{2\}=x^\{\\top\}K\_\{s\}x=\\lVert K\_\{s\}^\{1/2\}x\\rVert\_\{2\}^\{2\}for everyx∈ℝdx\\in\\mathbb\{R\}^\{d\}, we obtain
∥Φsp−Psmν∥ℋκ=∥Ks1/2\(p−Ks†bs\(ν\)\)∥2=∥Ks1/2p−Ks1/2Ks†bs\(ν\)∥2\.\\lVert\\Phi\_\{s\}p\-P\_\{s\}m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}=\\lVert K\_\{s\}^\{1/2\}\(p\-K\_\{s\}^\{\\dagger\}b\_\{s\}\(\\nu\)\)\\rVert\_\{2\}=\\lVert K\_\{s\}^\{1/2\}p\-K\_\{s\}^\{1/2\}K\_\{s\}^\{\\dagger\}b\_\{s\}\(\\nu\)\\rVert\_\{2\}\.\(175\)BecauseKsK\_\{s\}is symmetric positive semidefinite, spectral calculus gives
Ks1/2Ks†=Ks†/2\.K\_\{s\}^\{1/2\}K\_\{s\}^\{\\dagger\}=K\_\{s\}^\{\\dagger/2\}\.\(176\)Hence
∥Φsp−Psmν∥ℋκ=∥Ks1/2p−Ks†/2bs\(ν\)∥2=∥Ks1/2p−ζs\(ν\)∥2\.\\lVert\\Phi\_\{s\}p\-P\_\{s\}m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}=\\lVert K\_\{s\}^\{1/2\}p\-K\_\{s\}^\{\\dagger/2\}b\_\{s\}\(\\nu\)\\rVert\_\{2\}=\\lVert K\_\{s\}^\{1/2\}p\-\\zeta\_\{s\}\(\\nu\)\\rVert\_\{2\}\.\(177\)Therefore
MMDκ\(μp,ν\)2=∥Ks1/2p−ζs\(ν\)∥22\+cs\(ν\),\\mathrm\{MMD\}\_\{\\kappa\}\(\\mu\_\{p\},\\nu\)^\{2\}=\\lVert K\_\{s\}^\{1/2\}p\-\\zeta\_\{s\}\(\\nu\)\\rVert\_\{2\}^\{2\}\+c\_\{s\}\(\\nu\),\(178\)where
cs\(ν\):=∥\(I−Ps\)mν∥ℋκ2c\_\{s\}\(\\nu\):=\\lVert\(I\-P\_\{s\}\)m\_\{\\nu\}\\rVert\_\{\\mathcal\{H\}\_\{\\kappa\}\}^\{2\}\(179\)does not depend onpp\.
Minimizing overp∈ℝ1dp\\in\\mathbb\{R\}\_\{1\}^\{d\}is therefore equivalent to Euclidean projection ofζs\(ν\)\\zeta\_\{s\}\(\\nu\)onto
𝒜s=Ks1/2ℝ1d\.\\mathcal\{A\}\_\{s\}=K\_\{s\}^\{1/2\}\\mathbb\{R\}\_\{1\}^\{d\}\.\(180\)SinceIM,s\(∑i=1dpiδθi\(s\)\)=Ks1/2pI\_\{\\mathrm\{M\},s\}\\bigl\(\\sum\_\{i=1\}^\{d\}p\_\{i\}\\delta\_\{\\theta\_\{i\}\(s\)\}\\bigr\)=K\_\{s\}^\{1/2\}p, it follows that
IM,s\(ΠMΘ\(s\)ν\)=proj𝒜s\(ζs\(ν\)\)\.I\_\{\\mathrm\{M\},s\}\\bigl\(\\Pi\_\{\\mathrm\{M\}\}^\{\\Theta\(s\)\}\\nu\\bigr\)=\\operatorname\{proj\}\_\{\\mathcal\{A\}\_\{s\}\}\\bigl\(\\zeta\_\{s\}\(\\nu\)\\bigr\)\.\(181\)
Finally,ν↦bs\(ν\)\\nu\\mapsto b\_\{s\}\(\\nu\)is linear, hence so isν↦ζs\(ν\)=Ks†/2bs\(ν\)\\nu\\mapsto\\zeta\_\{s\}\(\\nu\)=K\_\{s\}^\{\\dagger/2\}b\_\{s\}\(\\nu\)\. Since orthogonal projection onto an affine subspace is an affine map, the composition
$$J_{s}(\nu)=\operatorname{proj}_{\mathcal{A}_{s}}\bigl(\zeta_{s}(\nu)\bigr) \tag{182}$$
is affine on the affine space of mass-1 finite signed measures. ∎
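The representation (164) can be sanity-checked by computing the embedded projection in two ways. The sketch below assumes a Gaussian kernel and a well-separated grid so that $K_s$ is invertible (in which case $K_s^{\dagger/2}=K_s^{-1/2}$ and $\mathcal{A}_s$ is a hyperplane); the kernel, the data, and all names are illustrative assumptions. Route 1 solves the sum-to-one constrained MMD minimization through its KKT system, and route 2 projects $\zeta_s(\nu)$ orthogonally onto $\mathcal{A}_s$.

```python
import numpy as np

def gauss(x, y, bandwidth=0.5):
    """Gaussian kernel matrix between point sets x and y (illustrative kernel choice)."""
    sq = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

theta = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
d = len(theta)
K = gauss(theta, theta)
eigvals, eigvecs = np.linalg.eigh(K)                  # K is positive definite here
K_half = (eigvecs * np.sqrt(eigvals)) @ eigvecs.T
K_inv_half = (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

# A mass-1 signed measure nu on off-grid atoms y with weights w (illustrative).
y = np.array([[0.3, 0.2], [1.4, 0.9], [0.5, 1.7]])
w = np.array([0.6, 0.7, -0.3])
b = gauss(theta, y) @ w                               # b_s(nu) as in (162)
zeta = K_inv_half @ b                                 # zeta_s(nu) as in (163)

# Route 1: minimize p^T K p - 2 p^T b subject to sum(p) = 1 via the KKT system.
kkt = np.block([[K, np.ones((d, 1))], [np.ones((1, d)), np.zeros((1, 1))]])
p_opt = np.linalg.solve(kkt, np.append(b, 1.0))[:d]
route1 = K_half @ p_opt                               # embedded MMD projection

# Route 2: orthogonal projection of zeta onto A_s = {u : n^T u = 1}, n = K^{-1/2} 1.
n = K_inv_half @ np.ones(d)
route2 = zeta - n * (n @ zeta - 1.0) / (n @ n)

print(np.max(np.abs(route1 - route2)))
```

For a singular Gram matrix the inverse square root would have to be replaced by the pseudoinverse square root used in the proposition; the invertible case is chosen here only to keep the example short.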
###### Proposition 26\(Conditional unbiasedness of the sampled MTD target\)\.
For every admissibleU∈IM\(ℱM,Θ𝒮\)U\\in I\_\{\\mathrm\{M\}\}\(\\mathcal\{F\}^\{\\mathcal\{S\}\}\_\{\\mathrm\{M\},\\Theta\}\)and every states∈𝒮s\\in\\mathcal\{S\},
𝔼\[T^M\(U;s,\(Rk,Sk\+1\)\)∣Uk=U,Sk=s\]=\(𝒪MU\)\(s\)\.\\mathbb\{E\}\\left\[\\widehat\{T\}\_\{\\mathrm\{M\}\}\(U;s,\(R\_\{k\},S\_\{k\+1\}\)\)\\,\\mid\\,U\_\{k\}=U,\\ S\_\{k\}=s\\right\]=\(\\mathcal\{O\}\_\{\\mathrm\{M\}\}U\)\(s\)\.\(183\)
###### Proof\.
Letη:=IM−1\(U\)\\eta:=I\_\{\\mathrm\{M\}\}^\{\-1\}\(U\)and condition onUk=UU\_\{k\}=UandSk=sS\_\{k\}=s\. By Proposition[25](https://arxiv.org/html/2605.06866#Thmtheorem25), the statewise map
$$J_{s}(\nu):=I_{s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}\nu\bigr) \tag{184}$$
is affine on mass-1 finite signed measures. Hence
$$\begin{aligned}
\mathbb{E}\left[\widehat{T}_{\mathrm{M}}(U;s,(R_{k},S_{k+1}))\,\mid\,U_{k}=U,\ S_{k}=s\right]
&=\mathbb{E}\left[J_{s}\bigl((f_{R_{k},\gamma})_{\#}\eta(S_{k+1})\bigr)\,\mid\,U_{k}=U,\ S_{k}=s\right]\\
&=J_{s}\left(\mathbb{E}\left[(f_{R_{k},\gamma})_{\#}\eta(S_{k+1})\,\mid\,U_{k}=U,\ S_{k}=s\right]\right)\\
&=J_{s}\bigl((T^{\pi}\eta)(s)\bigr)\\
&=I_{s}\bigl((\Pi_{\mathrm{M}}^{\Theta}T^{\pi}\eta)(s)\bigr)\\
&=(\mathcal{O}_{\mathrm{M}}U)(s).
\end{aligned} \tag{185}$$
∎
###### Proposition 27 (MTD samplewise Lipschitz continuity).
For every $s\in\mathcal{S}$, every $(r,s')\in[0,1]^{q}\times\mathcal{S}$, and all $U,U'\in I_{\mathrm{M}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta})$,
$$\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\right\rVert_{2}\leq\lVert U-U'\rVert_{2,\infty}.\tag{186}$$
Consequently, if
$$\widehat{H}_{\mathrm{M}}(U;s,(r,s')):=U+P_{s}\bigl(\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-U(s)\bigr),\tag{187}$$
then
$$\lVert\widehat{H}_{\mathrm{M}}(U;s,(r,s'))-\widehat{H}_{\mathrm{M}}(U';s,(r,s'))\rVert_{2,\infty}\leq\lVert U-U'\rVert_{2,\infty}.\tag{188}$$
###### Proof\.
Fix $s\in\mathcal{S}$ and $(r,s')\in[0,1]^{q}\times\mathcal{S}$. Let
$$\mu:=I_{s'}^{-1}(U(s')),\qquad\nu:=I_{s'}^{-1}(U'(s')).\tag{189}$$
Then
$$\widehat{T}_{\mathrm{M}}(U;s,(r,s'))=I_{s}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu)\bigr)\tag{190}$$
and likewise for $U'$. Using the statewise MMD isometry, the Euclidean projection representation from Proposition [25](https://arxiv.org/html/2605.06866#Thmtheorem25), and the $\gamma^{c/2}$ contraction of the pushforward in MMD,
$$\begin{aligned}
\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\right\rVert_{2}
&=\mathrm{MMD}_{\kappa}\bigl(\Pi_{\mathrm{M}}^{\Theta(s)}((f_{r,\gamma})_{\#}\mu),\Pi_{\mathrm{M}}^{\Theta(s)}((f_{r,\gamma})_{\#}\nu)\bigr)\\
&\leq\mathrm{MMD}_{\kappa}((f_{r,\gamma})_{\#}\mu,(f_{r,\gamma})_{\#}\nu)\\
&\leq\gamma^{c/2}\,\mathrm{MMD}_{\kappa}(\mu,\nu)\\
&=\gamma^{c/2}\,\lVert U(s')-U'(s')\rVert_{2}\\
&\leq\lVert U-U'\rVert_{2,\infty}.
\end{aligned}\tag{191}$$
For the full-sample map, blocks other than $s$ are unchanged, while block $s$ is replaced by the sampled target. Hence, for every $x\neq s$,
$$\bigl(\widehat{H}_{\mathrm{M}}(U;s,(r,s'))-\widehat{H}_{\mathrm{M}}(U';s,(r,s'))\bigr)(x)=U(x)-U'(x),\tag{192}$$
and for the sampled block,
$$\bigl(\widehat{H}_{\mathrm{M}}(U;s,(r,s'))-\widehat{H}_{\mathrm{M}}(U';s,(r,s'))\bigr)(s)=\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s')).\tag{193}$$
Taking the block supremum and using the target-level bound proves the claim. ∎
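The block-supremum argument can be checked numerically on a toy model. The sketch below assumes a hypothetical stand-in for the sampled target, namely `beta * U(s') + shift`, which is $\beta$-Lipschitz in the block at $s'$. It does not implement the MMD projection or the pushforward, and the names `toy_sampled_map` and `worst_ratio` are illustrative.

```python
import math
import random

def block_sup_norm(U):
    """Block-supremum norm: largest Euclidean norm among the state blocks."""
    return max(math.sqrt(sum(x * x for x in block)) for block in U)

def toy_sampled_map(U, s, s_next, beta, shift):
    """Toy full-sample map: copy U, then overwrite block s with a
    beta-Lipschitz target built only from block s_next (an assumption)."""
    V = [list(block) for block in U]
    V[s] = [beta * x + c for x, c in zip(U[s_next], shift)]
    return V

def difference(U, V):
    return [[x - y for x, y in zip(bu, bv)] for bu, bv in zip(U, V)]

def worst_ratio(num_states=4, dim=3, beta=0.9, trials=200, seed=0):
    """Largest observed ratio ||H(U) - H(U')|| / ||U - U'|| over random trials."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        U = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_states)]
        W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_states)]
        s, s_next = rng.randrange(num_states), rng.randrange(num_states)
        shift = [rng.uniform(-1, 1) for _ in range(dim)]
        HU = toy_sampled_map(U, s, s_next, beta, shift)
        HW = toy_sampled_map(W, s, s_next, beta, shift)
        ratio = block_sup_norm(difference(HU, HW)) / block_sup_norm(difference(U, W))
        worst = max(worst, ratio)
    return worst
```

The ratio never exceeds one because untouched blocks keep their differences and the overwritten block shrinks by the factor `beta`. This is the same two-case split used in (192) and (193).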
###### Proposition 28 (Discounted MTD affine perturbation bound).
Let
$$U^{\star}:=I_{\mathrm{M}}(\eta^{\star}),\qquad\beta_{\mathrm{M}}:=\gamma^{c/2}.\tag{194}$$
Here $\eta^{\star}$ denotes the unique fixed point of $\Pi_{\mathrm{M}}^{\Theta}T^{\pi}$. Define
$$B_{\mathrm{M}}^{\star}:=\max_{s,s'\in\mathcal{S}}\sup_{r\in[0,1]^{q}}\left\lVert\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U^{\star})(s)\right\rVert_{2}\tag{195}$$
and
$$B_{\mathrm{M}}:=2\beta_{\mathrm{M}}\lVert U^{\star}\rVert_{2,\infty}+B_{\mathrm{M}}^{\star}.\tag{196}$$
Then $B_{\mathrm{M}}^{\star}<\infty$, and for every $s\in\mathcal{S}$, every $(r,s')\in[0,1]^{q}\times\mathcal{S}$, and every $U\in I_{\mathrm{M}}(\mathcal{F}^{\mathcal{S}}_{\mathrm{M},\Theta})$,
$$\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U)(s)\right\rVert_{2}\leq 2\beta_{\mathrm{M}}\lVert U\rVert_{2,\infty}+B_{\mathrm{M}}.\tag{197}$$
Consequently, for every $k\geq 0$,
$$\mathbb{E}\left[\left\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\right\rVert_{2}^{2}\,\mid\,U_{k},S_{k}\right]\leq 2B_{\mathrm{M}}^{2}+8\beta_{\mathrm{M}}^{2}\lVert U_{k}\rVert_{2,\infty}^{2}.\tag{198}$$
Moreover, if
$$F_{\mathrm{M}}(U,s):=U+P_{s}\bigl((\mathcal{O}_{\mathrm{M}}U)(s)-U(s)\bigr),\tag{199}$$
$$\widehat{H}_{\mathrm{M}}(U;s,(r,s')):=U+P_{s}\bigl(\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-U(s)\bigr),\tag{200}$$
and
$$\Delta_{k}^{\mathrm{M}}:=\widehat{H}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-F_{\mathrm{M}}(U_{k},S_{k}),\tag{201}$$
then
$$\mathbb{E}\left[\Delta_{k}^{\mathrm{M}}\,\mid\,\mathcal{G}_{k}\right]=0\tag{202}$$
and
$$\lVert\Delta_{k}^{\mathrm{M}}\rVert_{2,\infty}\leq 2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}+B_{\mathrm{M}}\qquad\text{a.s.}\tag{203}$$
In particular, the theorem constants may be taken as
$$C_{1}:=2B_{\mathrm{M}}^{2},\qquad C_{2}:=8\beta_{\mathrm{M}}^{2}.\tag{204}$$
###### Proof\.
For fixed $s,s'\in\mathcal{S}$ and $r\in[0,1]^{q}$, the proof of Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) together with Proposition [24](https://arxiv.org/html/2605.06866#Thmtheorem24) gives
$$\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\rVert_{2}\leq\beta_{\mathrm{M}}\lVert U-U'\rVert_{2,\infty}\tag{205}$$
and
$$\lVert(\mathcal{O}_{\mathrm{M}}U)(s)-(\mathcal{O}_{\mathrm{M}}U')(s)\rVert_{2}\leq\beta_{\mathrm{M}}\lVert U-U'\rVert_{2,\infty}.\tag{206}$$
Because $[0,1]^{q}$ is compact, the state set is finite, and the map $r\mapsto\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))$ is continuous for each fixed $(s,s')$, the constant $B_{\mathrm{M}}^{\star}$ is finite. Now add and subtract the same quantities at $U^{\star}$:
$$\begin{aligned}
\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U)(s)\right\rVert_{2}
&\leq\left\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))\right\rVert_{2}\\
&\quad+\left\lVert\widehat{T}_{\mathrm{M}}(U^{\star};s,(r,s'))-(\mathcal{O}_{\mathrm{M}}U^{\star})(s)\right\rVert_{2}+\left\lVert(\mathcal{O}_{\mathrm{M}}U^{\star})(s)-(\mathcal{O}_{\mathrm{M}}U)(s)\right\rVert_{2}\\
&\leq 2\beta_{\mathrm{M}}\lVert U-U^{\star}\rVert_{2,\infty}+B_{\mathrm{M}}^{\star}\\
&\leq 2\beta_{\mathrm{M}}\lVert U\rVert_{2,\infty}+\Bigl(2\beta_{\mathrm{M}}\lVert U^{\star}\rVert_{2,\infty}+B_{\mathrm{M}}^{\star}\Bigr),
\end{aligned}\tag{207}$$
which is exactly the target-level affine bound. Squaring and using $(a+b)^{2}\leq 2a^{2}+2b^{2}$ with
$$a:=2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty},\qquad b:=B_{\mathrm{M}},\tag{208}$$
gives the conditional second-moment bound of ([198](https://arxiv.org/html/2605.06866#A3.E198)). Next, by definition of $\widehat{H}_{\mathrm{M}}$ and $F_{\mathrm{M}}$, only the $S_{k}$-th block can be nonzero, so
$$\lVert\Delta_{k}^{\mathrm{M}}\rVert_{2,\infty}=\left\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\right\rVert_{2},\tag{209}$$
and the pathwise affine bound follows immediately. Finally, the conditional mean identity is exactly Proposition [26](https://arxiv.org/html/2605.06866#Thmtheorem26) written in centered form. ∎
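For reference, the squaring step that produces the constants $2B_{\mathrm{M}}^{2}$ and $8\beta_{\mathrm{M}}^{2}$ can be written out explicitly. The display below only expands the inequality $(a+b)^{2}\leq 2a^{2}+2b^{2}$ with the choices of $a$ and $b$ from (208). It introduces no new assumption.

```latex
\begin{aligned}
\bigl(2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}+B_{\mathrm{M}}\bigr)^{2}
&\leq 2\bigl(2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}\bigr)^{2}+2B_{\mathrm{M}}^{2}\\
&= 8\beta_{\mathrm{M}}^{2}\lVert U_{k}\rVert_{2,\infty}^{2}+2B_{\mathrm{M}}^{2}.
\end{aligned}
```

Since the pathwise bound holds almost surely, the same bound holds for the conditional expectation of the squared norm.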
### C.3 Completion of the proof of Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)
###### Proof of Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3).
We verify the hypotheses of Appendix [A](https://arxiv.org/html/2605.06866#A1) for discounted MTD.
First, Proposition [24](https://arxiv.org/html/2605.06866#Thmtheorem24) gives the block-supremum contraction with modulus
$$\beta_{\mathrm{M}}=\gamma^{c/2}.\tag{210}$$
Second, Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) gives the samplewise Lipschitz property of the full-sample map with
$$A_{\mathrm{M},1}=1.\tag{211}$$
Moreover, the proof of Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) gives the stronger target-level estimate
$$\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{\mathrm{M}}(U';s,(r,s'))\rVert_{2}\leq\beta_{\mathrm{M}}\lVert U-U'\rVert_{2,\infty}\leq\lVert U-U'\rVert_{2,\infty}.\tag{212}$$
Third, Proposition [28](https://arxiv.org/html/2605.06866#Thmtheorem28) gives the centered pathwise affine perturbation bound
$$\mathbb{E}\left[\Delta_{k}^{\mathrm{M}}\mid\mathcal{G}_{k}\right]=0,\qquad\lVert\Delta_{k}^{\mathrm{M}}\rVert_{2,\infty}\leq 2\beta_{\mathrm{M}}\lVert U_{k}\rVert_{2,\infty}+B_{\mathrm{M}}\qquad\text{a.s.}\tag{213}$$
Hence the Markovian perturbation condition in Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16) holds with
$$A_{\mathrm{M},2}=2\beta_{\mathrm{M}},\qquad B_{\mathrm{M},2}=B_{\mathrm{M}},\qquad A_{\mathrm{M}}:=A_{\mathrm{M},1}+A_{\mathrm{M},2}+1=2\beta_{\mathrm{M}}+2.\tag{214}$$
For the i.i.d. case, Proposition [28](https://arxiv.org/html/2605.06866#Thmtheorem28) also yields the conditional second-moment estimate
$$\mathbb{E}\bigl[\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\rVert_{2}^{2}\,\mid\,U_{k},S_{k}\bigr]\leq C_{1}+C_{2}\lVert U_{k}\rVert_{2,\infty}^{2},\tag{215}$$
with
$$C_{1}=2B_{\mathrm{M}}^{2},\qquad C_{2}=8\beta_{\mathrm{M}}^{2}.\tag{216}$$
Therefore Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) applies with
$$A_{\mathrm{M}}^{\mathrm{iid}}:=C_{1}+2C_{2}(1+\Upsilon_{\mathrm{M}}).\tag{217}$$
Applying Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) with
$$\beta=\beta_{\mathrm{M}},\qquad A^{\mathrm{iid}}=A_{\mathrm{M}}^{\mathrm{iid}},\qquad\vartheta=\vartheta_{\mathrm{M},\rho},\qquad p^{\star}=\max\{2,\lceil\log\lvert\mathcal{S}\rvert\rceil\},\tag{218}$$
yields Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(i) with
$$a_{\mathrm{M},1}^{\mathrm{iid}}:=\frac{1+\vartheta_{\mathrm{M},\rho}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\rho}},\qquad a_{\mathrm{M},2}^{\mathrm{iid}}:=1-\bar{\beta}_{\mathrm{M},\rho}\sqrt{\frac{1+\vartheta_{\mathrm{M},\rho}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\rho}}},\tag{219}$$
where
$$\bar{\beta}_{\mathrm{M},\rho}:=1-\rho_{\min}(1-\beta_{\mathrm{M}}),\tag{220}$$
and
$$a_{\mathrm{M},3}^{\mathrm{iid}}:=\frac{4(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}(A_{\mathrm{M}}^{\mathrm{iid}}+2)(1+\vartheta_{\mathrm{M},\rho})}{\vartheta_{\mathrm{M},\rho}},\qquad a_{\mathrm{M},4}^{\mathrm{iid}}:=\frac{2(p^{\star}-1)\lvert\mathcal{S}\rvert^{2/p^{\star}}A_{\mathrm{M}}^{\mathrm{iid}}(1+\vartheta_{\mathrm{M},\rho})}{\vartheta_{\mathrm{M},\rho}}.\tag{221}$$
Applying Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16) with
$$A_{1}=A_{\mathrm{M},1}=1,\qquad A_{2}=A_{\mathrm{M},2}=2\beta_{\mathrm{M}},\qquad B_{2}=B_{\mathrm{M},2}=B_{\mathrm{M}},\tag{222}$$
yields Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(ii) with
$$\tilde{\phi}_{\mathrm{M},1}:=\frac{1+\vartheta_{\mathrm{M},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\mu}},\qquad\phi_{\mathrm{M},2}:=1-\bar{\beta}_{\mathrm{M},\mu}\sqrt{\frac{1+\vartheta_{\mathrm{M},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}}{1+\vartheta_{\mathrm{M},\mu}}},\tag{223}$$
where
$$\bar{\beta}_{\mathrm{M},\mu}:=1-\mu_{\min}(1-\beta_{\mathrm{M}}),\tag{224}$$
and
$$\tilde{\phi}_{\mathrm{M},3}:=\frac{114(p^{\star}-1)\bigl(1+\vartheta_{\mathrm{M},\mu}\lvert\mathcal{S}\rvert^{2/p^{\star}}\bigr)}{\vartheta_{\mathrm{M},\mu}},\qquad\phi_{\mathrm{M},3}:=B_{\mathrm{M}}^{2}\tilde{\phi}_{\mathrm{M},3}\tag{225}$$
together with
$$\phi_{\mathrm{M},1}:=8\tilde{\phi}_{\mathrm{M},1},\qquad\phi_{\mathrm{M},4}:=\frac{2\tilde{\phi}_{\mathrm{M},1}B_{\mathrm{M}}^{2}}{A_{\mathrm{M}}^{2}},\tag{226}$$
and the step size condition
$$\sum_{i=k-t_{k}}^{k-1}\alpha_{i}\leq\min\left\{\frac{\phi_{\mathrm{M},2}}{\phi_{\mathrm{M},3}A_{\mathrm{M}}^{2}},\frac{1}{4A_{\mathrm{M}}}\right\}\qquad\text{for all }k\geq K.\tag{227}$$
The constant, linearly-diminishing, and polynomially-diminishing step size bounds in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3) now follow from the corresponding specializations of Proposition [15](https://arxiv.org/html/2605.06866#Thmtheorem15) and Proposition [16](https://arxiv.org/html/2605.06866#Thmtheorem16). ∎
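The i.i.d. constants in (218) to (221) are closed-form expressions, so they can be evaluated directly. The sketch below assumes that $\log$ denotes the natural logarithm and treats $\vartheta_{\mathrm{M},\rho}$ and $A_{\mathrm{M}}^{\mathrm{iid}}$ as user-supplied inputs, since both are defined elsewhere in the paper. The function name `mtd_iid_constants` and its argument names are hypothetical.

```python
import math

def mtd_iid_constants(gamma, c, rho_min, theta, num_states, a_iid):
    """Evaluate beta_M, p*, and a_{M,1..4}^{iid} from the displayed formulas.

    theta stands for vartheta_{M,rho} and a_iid for A_M^{iid}; both are
    taken as inputs. The natural logarithm in p* is an assumption.
    """
    beta = gamma ** (c / 2.0)                                # beta_M = gamma^{c/2}
    p_star = max(2, math.ceil(math.log(num_states)))         # p* in (218)
    s_pow = num_states ** (2.0 / p_star)                     # |S|^{2/p*}
    ratio = (1.0 + theta * s_pow) / (1.0 + theta)
    beta_bar = 1.0 - rho_min * (1.0 - beta)                  # (220)
    return {
        "beta": beta,
        "p_star": p_star,
        "a1": ratio,                                         # (219), first constant
        "a2": 1.0 - beta_bar * math.sqrt(ratio),             # (219), second constant
        "a3": 4.0 * (p_star - 1) * s_pow * (a_iid + 2.0) * (1.0 + theta) / theta,
        "a4": 2.0 * (p_star - 1) * s_pow * a_iid * (1.0 + theta) / theta,
    }

# Illustrative call with made-up inputs.
example = mtd_iid_constants(gamma=0.25, c=2, rho_min=1.0, theta=1.0,
                            num_states=1, a_iid=3.0)
```

Whether `a2` is positive depends on the choice of $\vartheta_{\mathrm{M},\rho}$, so a caller would need to check its sign before using it as a rate.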
### C.4 Proof of Corollary [4](https://arxiv.org/html/2605.06866#Thmtheorem4)
###### Proof.
We use the linearly-diminishing step size bounds in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3).
For part (i), the linearly-diminishing i.i.d. bound in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(i) gives
$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq a_{\mathrm{M},1}^{\mathrm{iid}}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}\left(\frac{h}{k+h}\right)^{a_{\mathrm{M},2}^{\mathrm{iid}}\alpha}+\frac{4e\alpha^{2}a_{\mathrm{M},4}^{\mathrm{iid}}}{a_{\mathrm{M},2}^{\mathrm{iid}}\alpha-1}\cdot\frac{1}{k+h}.\tag{228}$$
Since $\alpha>1/a_{\mathrm{M},2}^{\mathrm{iid}}$, the exponent $a_{\mathrm{M},2}^{\mathrm{iid}}\alpha$ exceeds $1$, so the first term is also $O((k+h)^{-1})$. Therefore
$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]=O\left(\frac{1}{k+h}\right),\tag{229}$$
and hence, by Jensen's inequality $\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})]\leq\sqrt{\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}]}$, the bound $\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ is guaranteed for $k=O(\varepsilon^{-2})$.
For part (ii), the linearly-diminishing Markovian bound in Theorem [3](https://arxiv.org/html/2605.06866#Thmtheorem3)(ii) gives
$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]\leq\bigl(\phi_{\mathrm{M},1}\ell_{\mathrm{M},\infty}(\eta_{0},\eta^{\star})^{2}+\phi_{\mathrm{M},4}\bigr)\left(\frac{K+h}{k+h}\right)^{\phi_{\mathrm{M},2}\alpha}+\frac{8e\alpha^{2}\phi_{\mathrm{M},3}}{\phi_{\mathrm{M},2}\alpha-1}\cdot\frac{t_{k}}{k+h}.\tag{230}$$
Again the first term is $O((k+h)^{-1})$, while geometric mixing gives $t_{k}=t_{\alpha_{k}}=O(\log(k+h))$, so the second term is
$$O\left(\frac{\log(k+h)}{k+h}\right)=\widetilde{O}\left(\frac{1}{k+h}\right).\tag{231}$$
Thus
$$\mathbb{E}\left[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})^{2}\right]=\widetilde{O}\left(\frac{1}{k+h}\right),\tag{232}$$
which, again by Jensen's inequality, implies $\mathbb{E}[\ell_{\mathrm{M},\infty}(\eta_{k},\eta^{\star})]\leq\varepsilon$ for $k=\widetilde{O}(\varepsilon^{-2})$. ∎
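The passage from a mean-square rate to an iteration count is simple arithmetic. As a sketch, suppose the i.i.d. bound has been collapsed into $\mathbb{E}[\ell^{2}]\leq C/(k+h)$ for some constant $C$. The constant `C` and the helper name `iterations_for_accuracy` are hypothetical and only absorb the explicit constants of the theorem. Since $\mathbb{E}[\ell]\leq\sqrt{\mathbb{E}[\ell^{2}]}$, it then suffices to take $k\geq C/\varepsilon^{2}-h$.

```python
import math

def iterations_for_accuracy(C, h, eps):
    """Smallest nonnegative integer k with sqrt(C / (k + h)) <= eps,
    assuming a bound of the form E[l^2] <= C / (k + h)."""
    return max(0, math.ceil(C / eps ** 2 - h))

# Illustrative numbers only: C = 4, h = 10, eps = 0.5.
k_needed = iterations_for_accuracy(4.0, 10, 0.5)
```

Halving `eps` roughly quadruples the required `k`, which is the $\varepsilon^{-2}$ scaling stated in the corollary. The Markovian case carries an additional logarithmic factor that this sketch ignores.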
## Appendix D Undiscounted fixed-horizon CTD
This appendix records the undiscounted fixed-horizon CTD ingredients culminating in the proof of Theorem [6](https://arxiv.org/html/2605.06866#Thmtheorem6).
### D.1 Flattened weighted horizon-state space
To apply the framework of Appendix [A](https://arxiv.org/html/2605.06866#A1), we rewrite the weighted fixed-horizon recursion on a flattened product space indexed by horizon-state pairs. Let
$$\mathcal{S}_{H}:=\{(h,s):h\in\{1,\dots,H\},\ s\in\mathcal{S}\},\qquad\lvert\mathcal{S}_{H}\rvert=H\lvert\mathcal{S}\rvert.\tag{233}$$
Let
$$V_{H,\mathrm{C}}:=\prod_{(h,s)\in\mathcal{S}_{H}}\mathbb{R}^{d}\tag{234}$$
denote the corresponding flattened Euclidean product space. For a horizon-stacked iterate $U=(U^{h}(s))_{1\leq h\leq H,\ s\in\mathcal{S}}$, define the weighted flattening
$$\bar{U}(h,s):=\lambda^{h}U^{h}(s),\qquad(h,s)\in\mathcal{S}_{H}.\tag{235}$$
Then
$$\lVert U\rVert_{H,2,\infty}=\max_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2},\qquad\lVert U\rVert_{H,2,p}=\left(\sum_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2}^{p}\right)^{1/p}.\tag{236}$$
For a single horizon stack $U=(U^{1},\dots,U^{H})$, also define
$$\lVert U\rVert_{H,2}:=\max_{1\leq h\leq H}\lambda^{h}\lVert U^{h}\rVert_{2}.\tag{237}$$
Then
$$\lVert U\rVert_{H,2,\infty}=\max_{s\in\mathcal{S}}\lVert U(s)\rVert_{H,2}.\tag{238}$$
Thus the fixed-horizon weighted norm is an ordinary block-supremum norm on a finite product of Euclidean blocks, and Appendix [A](https://arxiv.org/html/2605.06866#A1) applies with $\lvert\mathcal{S}_{H}\rvert$ in place of $\lvert\mathcal{S}\rvert$. In particular, throughout this appendix we set
$$p_{H}^{\star}:=\max\{2,\lceil\log\lvert\mathcal{S}_{H}\rvert\rceil\}.\tag{239}$$
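The weighted flattening and the norms in (235) to (239) can be illustrated with a small computation. The sketch below assumes a nested-list layout `U[h-1][s]` for the block $U^{h}(s)$ and a natural logarithm in $p_{H}^{\star}$. The function names and the example numbers are illustrative.

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def flatten_weighted(U, lam):
    """Map U[h-1][s] = U^h(s) to the dictionary {(h, s): lam**h * U^h(s)}."""
    return {(h, s): [lam ** h * x for x in block]
            for h, stack in enumerate(U, start=1)
            for s, block in enumerate(stack)}

def norm_sup(U, lam):
    """Weighted block-supremum norm over all horizon-state pairs."""
    return max(l2(b) for b in flatten_weighted(U, lam).values())

def norm_p(U, lam, p):
    """Weighted block-l_p norm over all horizon-state pairs."""
    return sum(l2(b) ** p for b in flatten_weighted(U, lam).values()) ** (1.0 / p)

def stack_norm(U, s, lam):
    """Weighted norm of the horizon stack at a single state s."""
    return max(lam ** h * l2(stack[s]) for h, stack in enumerate(U, start=1))

def p_star(H, num_states):
    """p_H^* with the natural logarithm (an assumption about the base)."""
    return max(2, math.ceil(math.log(H * num_states)))

# Illustrative iterate with H = 2, |S| = 2, d = 2, and lambda = 0.5.
U = [[[3.0, 4.0], [0.0, 1.0]],
     [[6.0, 8.0], [1.0, 0.0]]]
sup_value = norm_sup(U, 0.5)
per_state = [stack_norm(U, s, 0.5) for s in range(2)]
```

Here the supremum over horizon-state pairs coincides with the maximum over states of the stack norm, which is the identity (238).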
### D.2 Fixed-horizon CTD residual map
For each state $s\in\mathcal{S}$, let
$$U(s):=\bigl(U^{1}(s),\dots,U^{H}(s)\bigr)\tag{240}$$
denote the horizon stack at state $s$, and let $P_{s}^{H}$ denote the coordinate projector onto that stack. Define
$$F_{H,\mathrm{C}}(U;s,(r,s')):=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))-U(s)\bigr),\tag{241}$$
where $\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))$ is the full horizon-stacked CTD target from Section [4](https://arxiv.org/html/2605.06866#S4). Then the online fixed-horizon CTD recursion is
$$U_{k+1}=U_{k}+\alpha_{k}F_{H,\mathrm{C}}(U_{k};S_{k},(R_{k},S_{k+1})).\tag{242}$$
Define also the CTD embedded averaged operator by
$$\mathcal{O}_{H,\mathrm{C}}:=I_{H,\mathrm{C}}\circ\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}\circ I_{H,\mathrm{C}}^{-1}.\tag{243}$$
###### Proposition 29 (Fixed-horizon CTD weighted contraction).
For all $U,U'\in I_{H,\mathrm{C}}(\mathcal{F}_{H,\mathrm{C},\Theta}^{\mathcal{S}})$,
$$\lVert\mathcal{O}_{H,\mathrm{C}}U-\mathcal{O}_{H,\mathrm{C}}U'\rVert_{H,2,\infty}\leq\lambda\lVert U-U'\rVert_{H,2,\infty}.\tag{244}$$
###### Proof\.
Let $\eta:=I_{H,\mathrm{C}}^{-1}(U)$ and $\eta':=I_{H,\mathrm{C}}^{-1}(U')$. Fix $(h,s)\in\mathcal{S}_{H}$ with $h\geq 1$. By the statewise CTD isometry and nonexpansiveness of the categorical projection in the Cramér metric,
$$\begin{aligned}
\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{C}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{C}}U')^{h}(s)\rVert_{2}
&=\lambda^{h}\ell_{\mathrm{C}}\left((\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}\eta)^{h}(s),(\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}\eta')^{h}(s)\right)\\
&\leq\lambda^{h}\ell_{\mathrm{C}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta')^{h}(s)\right).
\end{aligned}\tag{245}$$
Now
$$(T_{H}^{\pi}\eta)^{h}(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,(f_{R(s,a),1})_{\#}\eta^{h-1}(s'),\tag{246}$$
and the same formula holds for $\eta'$. Since translation preserves the Cramér metric and the Cramér metric is convex under mixtures,
$$\begin{aligned}
\ell_{\mathrm{C}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta')^{h}(s)\right)
&\leq\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,\ell_{\mathrm{C}}\left(\eta^{h-1}(s'),{\eta'}^{h-1}(s')\right)\\
&\leq\max_{x\in\mathcal{S}}\ell_{\mathrm{C}}\left(\eta^{h-1}(x),{\eta'}^{h-1}(x)\right).
\end{aligned}\tag{247}$$
Therefore
$$\begin{aligned}
\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{C}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{C}}U')^{h}(s)\rVert_{2}
&\leq\lambda\max_{x\in\mathcal{S}}\lambda^{h-1}\ell_{\mathrm{C}}\left(\eta^{h-1}(x),{\eta'}^{h-1}(x)\right)\\
&\leq\lambda\,\ell_{H,\mathrm{C},\infty}(\eta,\eta')=\lambda\lVert U-U'\rVert_{H,2,\infty}.
\end{aligned}\tag{248}$$
Taking the maximum over $(h,s)\in\mathcal{S}_{H}$ proves the claim. ∎
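The two Cramér-metric facts used in (247), translation invariance and convexity under mixtures, can be checked numerically for categorical laws on a shared uniform grid. The sketch below assumes that $\ell_{\mathrm{C}}$ is the usual Cramér distance, i.e. the $L^{2}$ distance between cumulative distribution functions, which on a uniform grid reduces to a weighted sum of squared CDF differences. All function names are illustrative.

```python
import math
import random

def cramer(p, q, spacing=1.0):
    """Cramer distance between mass vectors on a shared uniform grid:
    sqrt(spacing * sum of squared CDF differences)."""
    cdf_p = cdf_q = acc = 0.0
    for a, b in zip(p, q):
        cdf_p += a
        cdf_q += b
        acc += (cdf_p - cdf_q) ** 2
    return math.sqrt(spacing * acc)

def mixture(weights, dists):
    """Convex combination of mass vectors of equal length."""
    return [sum(w * d[i] for w, d in zip(weights, dists)) for i in range(len(dists[0]))]

def random_pmf(rng, n):
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]

def shift_right(p, cells):
    """Translate a mass vector by `cells` grid points (zero-pad on the left)."""
    return [0.0] * cells + list(p)

def convexity_gap(num_components=3, n=6, trials=200, seed=0):
    """Largest observed value of lhs - rhs in the mixture inequality."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        w = random_pmf(rng, num_components)
        mus = [random_pmf(rng, n) for _ in range(num_components)]
        nus = [random_pmf(rng, n) for _ in range(num_components)]
        lhs = cramer(mixture(w, mus), mixture(w, nus))
        rhs = sum(wi * cramer(m, v) for wi, m, v in zip(w, mus, nus))
        worst = max(worst, lhs - rhs)
    return worst
```

A nonpositive `convexity_gap()` is consistent with the first inequality in (247). Comparing `cramer(p, q)` with `cramer(shift_right(p, 2), shift_right(q, 2))` illustrates the translation step for the undiscounted pushforward.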
### D.3 One-step bounds
###### Proposition 30 (Fixed-horizon CTD samplewise Lipschitz continuity).
For every $s\in\mathcal{S}$, $(r,s')\in[0,1]\times\mathcal{S}$ and $U,U'\in I_{H,\mathrm{C}}(\mathcal{F}_{H,\mathrm{C},\Theta}^{\mathcal{S}})$,
$$\lVert F_{H,\mathrm{C}}(U;s,(r,s'))-F_{H,\mathrm{C}}(U';s,(r,s'))\rVert_{H,2,\infty}\leq 2\lVert U-U'\rVert_{H,2,\infty}.\tag{249}$$
###### Proof\.
For each horizon $h$, the local target $\widehat{T}_{H,\mathrm{C}}^{h}(U;s,(r,s'))$ depends only on the $(h-1)$st horizon block at $s'$. By the discounted CTD samplewise Lipschitz bound, that local target is $1$-Lipschitz in the Euclidean block norm. Multiplying by $\lambda^{h}$ and using the fact that $\lambda\in(0,1)$ gives
$$\lambda^{h}\lVert\widehat{T}_{H,\mathrm{C}}^{h}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{C}}^{h}(U';s,(r,s'))\rVert_{2}\leq\lambda^{h-1}\lVert U^{h-1}(s')-U^{\prime\,h-1}(s')\rVert_{2}\leq\lVert U-U'\rVert_{H,2,\infty}.\tag{250}$$
Taking the maximum over $h$ yields
$$\lVert\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{C}}(U';s,(r,s'))\rVert_{H,2}\leq\lVert U-U'\rVert_{H,2,\infty}.\tag{251}$$
Since $F_{H,\mathrm{C}}$ is the projected residual
$$F_{H,\mathrm{C}}(U;s,(r,s'))=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))-U(s)\bigr),\tag{252}$$
the difference $F_{H,\mathrm{C}}(U)-F_{H,\mathrm{C}}(U')$ splits into a projected target difference and a projected current-stack difference. Each is bounded by $\lVert U-U'\rVert_{H,2,\infty}$, yielding the factor $2$ through the triangle inequality. ∎
###### Proposition 31 (Fixed-horizon CTD pathwise boundedness).
For every $s\in\mathcal{S}$, $(r,s')\in[0,1]\times\mathcal{S}$ and $U\in I_{H,\mathrm{C}}(\mathcal{F}_{H,\mathrm{C},\Theta}^{\mathcal{S}})$,
$$\lVert F_{H,\mathrm{C}}(U;s,(r,s'))\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}},\tag{253}$$
where
$$B_{H,\mathrm{C}}:=\max_{1\leq h\leq H,\ s\in\mathcal{S}}\lambda^{h}\sqrt{\theta_{h,d}(s)-\theta_{h,1}(s)}.\tag{254}$$
###### Proof\.
Fix $h$ and $s$. Both the local target $\widehat{T}_{H,\mathrm{C}}^{h}(U;s,(r,s'))$ and the current iterate $U^{h}(s)$ are embedded categorical laws supported on $\Theta_{h}(s)$. By the statewise CTD support radius bound, each has Euclidean norm at most $\sqrt{\theta_{h,d}(s)-\theta_{h,1}(s)}$. After weighting by $\lambda^{h}$, both are bounded by $B_{H,\mathrm{C}}$. The triangle inequality then yields the claim. ∎
### D.4 Phasewise averaged maps
Recall the definition of phase distributions from Section [4](https://arxiv.org/html/2605.06866#S4):
$$\rho_{t}(s):=\Pr(S_{t}=s),\qquad 0\leq t\leq H-1,\ s\in\mathcal{S}.\tag{255}$$
For each phase $t\in\{0,\dots,H-1\}$, define
$$\Gamma_{t,\mathrm{C}}(U):=\sum_{s\in\mathcal{S}}\rho_{t}(s)\,P_{s}^{H}\bigl((\mathcal{O}_{H,\mathrm{C}}U)(s)-U(s)\bigr),\qquad G_{t,\mathrm{C}}(U):=U+\Gamma_{t,\mathrm{C}}(U).\tag{256}$$
###### Proposition 32 (Phasewise averaged contraction).
For each phase $t\in\{0,\dots,H-1\}$,
$$\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U')\rVert_{H,2,\infty}\leq\bar{\beta}_{t}^{\mathrm{fh}}\lVert U-U'\rVert_{H,2,\infty},\tag{257}$$
where
$$\bar{\beta}_{t}^{\mathrm{fh}}:=1-\rho_{t,\min}(1-\lambda),\qquad\rho_{t,\min}:=\min_{s\in\mathcal{S}}\rho_{t}(s).\tag{258}$$
Hence, with
$$\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}:=1-\rho_{\min}(1-\lambda),\qquad\rho_{\min}:=\min_{0\leq t\leq H-1,\ s\in\mathcal{S}}\rho_{t}(s),\tag{259}$$
we have
$$\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U')\rVert_{H,2,\infty}\leq\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}\lVert U-U'\rVert_{H,2,\infty}\qquad\text{for all }t.\tag{260}$$
###### Proof\.
At phase $t$, the update acts as the identity on all blocks $x$ such that $x\neq s$ and as the $\lambda$-contractive projected Bellman map on the block $s$. More precisely, for each state block $s$,

$$\bigl(G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U')\bigr)(s) = (1-\rho_{t}(s))\bigl(U(s)-U'(s)\bigr) + \rho_{t}(s)\bigl((\mathcal{O}_{H,\mathrm{C}}U)(s)-(\mathcal{O}_{H,\mathrm{C}}U')(s)\bigr). \tag{261}$$

By Proposition [29](https://arxiv.org/html/2605.06866#Thmtheorem29),

$$\begin{aligned}
\bigl\lVert \bigl(G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U')\bigr)(s) \bigr\rVert_{H,2}
&\leq \bigl((1-\rho_{t}(s))+\rho_{t}(s)\lambda\bigr)\lVert U-U'\rVert_{H,2,\infty} \\
&= \bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U'\rVert_{H,2,\infty}.
\end{aligned} \tag{262}$$

Taking the maximum over $s$ on both sides gives

$$\begin{aligned}
\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U')\rVert_{H,2,\infty}
&\leq \max_{s\in\mathcal{S}}\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U'\rVert_{H,2,\infty} \\
&= \bigl(1-\rho_{t,\min}(1-\lambda)\bigr)\lVert U-U'\rVert_{H,2,\infty},
\end{aligned} \tag{263}$$

since the coefficient $1-\rho_{t}(s)(1-\lambda)$ is decreasing in $\rho_{t}(s)$. This proves the first displayed bound with coefficient $\bar{\beta}_{t}^{\mathrm{fh}}$. For the uniform estimate, note that $\rho_{t,\min}\geq\rho_{\min}$ for every phase $t$, hence

$$\bar{\beta}_{t}^{\mathrm{fh}} = 1-\rho_{t,\min}(1-\lambda) \leq 1-\rho_{\min}(1-\lambda) = \bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}}. \tag{264}$$

Substituting this larger phase-independent coefficient into the preceding bound yields the claimed uniform contraction factor. ∎
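The two coefficients in (258) and (259) are simple functions of the phasewise sampling probabilities. The following is a minimal sketch that evaluates them, assuming the probabilities are supplied as a nested list `rho[t][s]`; the function name and the input layout are illustrative and not taken from the paper.

```python
def phase_contraction_factors(rho, lam):
    """Return per-phase and uniform contraction coefficients.

    rho : list of H rows; rho[t][s] is the assumed probability that state s
          is the one updated at phase t (hypothetical input layout).
    lam : modulus of the projected Bellman map, with 0 <= lam < 1.
    """
    rho_t_min = [min(row) for row in rho]                  # rho_{t,min}
    beta_t = [1.0 - r * (1.0 - lam) for r in rho_t_min]    # eq. (258)
    beta_uniform = 1.0 - min(rho_t_min) * (1.0 - lam)      # eq. (259)
    return beta_t, beta_uniform


# Toy example with H = 2 phases and 3 states (made-up numbers).
rho = [[0.5, 0.3, 0.2],
       [0.1, 0.6, 0.3]]
beta_t, beta_uniform = phase_contraction_factors(rho, lam=0.9)
```

In this toy example every per-phase coefficient is at most the uniform one, which is the content of (264).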
### D.5 Auxiliary phasewise bounds

For episode $m$, write

$$U_{m,t} := U_{mH+t}, \qquad t=0,\dots,H. \tag{265}$$

Also define the phasewise filtration

$$\mathcal{F}_{m,t} := \sigma\bigl(U_{m,0},\,(S_{u}^{m},R_{u}^{m},S_{u+1}^{m})_{0\leq u<t}\bigr), \qquad 0\leq t\leq H, \tag{266}$$

where the superscript $m$ indicates transitions within episode $m$. Then $U_{m,t}$ is $\mathcal{F}_{m,t}$-measurable for every $t$.
###### Lemma 33 (Within-episode deviation).

For all $m$ and all $t\in\{0,\dots,H\}$,

$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty} \leq 2B_{H,\mathrm{C}}\sum_{u=0}^{t-1}\alpha_{mH+u}. \tag{267}$$

###### Proof.

From the recursion,

$$U_{m,t}-U_{m,0} = \sum_{u=0}^{t-1}\alpha_{mH+u}\,F_{H,\mathrm{C}}\bigl(U_{m,u};S_{u}^{m},(R_{u}^{m},S_{u+1}^{m})\bigr). \tag{268}$$

Taking $\lVert\cdot\rVert_{H,2,\infty}$ and using Proposition [31](https://arxiv.org/html/2605.06866#Thmtheorem31) term by term gives

$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty} \leq \sum_{u=0}^{t-1}\alpha_{mH+u}\,\bigl\lVert F_{H,\mathrm{C}}\bigl(U_{m,u};S_{u}^{m},(R_{u}^{m},S_{u+1}^{m})\bigr)\bigr\rVert_{H,2,\infty} \leq 2B_{H,\mathrm{C}}\sum_{u=0}^{t-1}\alpha_{mH+u}. \tag{269}$$

∎
###### Lemma 34 (Frozen residual decomposition).

For each phase $t\in\{0,\dots,H-1\}$, define

$$\zeta_{m,t}^{\mathrm{C}} := F_{H,\mathrm{C}}\bigl(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)-\Gamma_{t,\mathrm{C}}(U_{m,0}) \tag{270}$$

and

$$\xi_{m,t}^{\mathrm{C}} := \Bigl(F_{H,\mathrm{C}}\bigl(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)-F_{H,\mathrm{C}}\bigl(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)\Bigr)-\Bigl(\Gamma_{t,\mathrm{C}}(U_{m,t})-\Gamma_{t,\mathrm{C}}(U_{m,0})\Bigr). \tag{271}$$

Then

$$\begin{gathered}
F_{H,\mathrm{C}}\bigl(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)-\Gamma_{t,\mathrm{C}}(U_{m,t}) = \zeta_{m,t}^{\mathrm{C}}+\xi_{m,t}^{\mathrm{C}}, \\
\mathbb{E}\bigl[\zeta_{m,t}^{\mathrm{C}} \mid U_{m,0}\bigr] = 0, \qquad \lVert\zeta_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty} \leq 4B_{H,\mathrm{C}},
\end{gathered} \tag{272}$$

and

$$\lVert\xi_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty} \leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty} \leq 8B_{H,\mathrm{C}}\sum_{u=0}^{t-1}\alpha_{mH+u} \leq 8B_{H,\mathrm{C}}\sum_{u=0}^{H-1}\alpha_{mH+u}. \tag{273}$$
###### Proof\.
Since episode $m$ is independent of $U_{m,0}$ and its phase-$t$ sample has marginal law $\rho_{t}$,

$$\mathbb{E}\bigl[\zeta_{m,t}^{\mathrm{C}} \mid U_{m,0}\bigr] = 0. \tag{274}$$

Moreover, by the triangle inequality,

$$\lVert\zeta_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty} \leq \bigl\lVert F_{H,\mathrm{C}}\bigl(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)\bigr\rVert_{H,2,\infty}+\lVert\Gamma_{t,\mathrm{C}}(U_{m,0})\rVert_{H,2,\infty}. \tag{275}$$

By Proposition [31](https://arxiv.org/html/2605.06866#Thmtheorem31),

$$\bigl\lVert F_{H,\mathrm{C}}\bigl(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m})\bigr)\bigr\rVert_{H,2,\infty} \leq 2B_{H,\mathrm{C}}, \tag{276}$$

and, since $\Gamma_{t,\mathrm{C}}(U_{m,0})$ is a convex combination of such residuals,

$$\lVert\Gamma_{t,\mathrm{C}}(U_{m,0})\rVert_{H,2,\infty} \leq 2B_{H,\mathrm{C}}. \tag{277}$$

This gives $\lVert\zeta_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty}\leq 4B_{H,\mathrm{C}}$.

Proposition [30](https://arxiv.org/html/2605.06866#Thmtheorem30) also gives

$$\bigl\lVert F_{H,\mathrm{C}}(U;s,(r,s'))-F_{H,\mathrm{C}}(U';s,(r,s'))\bigr\rVert_{H,2,\infty} \leq 2\lVert U-U'\rVert_{H,2,\infty} \tag{278}$$

for every admissible $(s,r,s')$. Averaging the same estimate over the phase-$t$ law yields

$$\bigl\lVert\Gamma_{t,\mathrm{C}}(U)-\Gamma_{t,\mathrm{C}}(U')\bigr\rVert_{H,2,\infty} \leq 2\lVert U-U'\rVert_{H,2,\infty}. \tag{279}$$

Therefore, applying the triangle inequality to the two differences in the definition of $\xi_{m,t}^{\mathrm{C}}$,

$$\lVert\xi_{m,t}^{\mathrm{C}}\rVert_{H,2,\infty} \leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}. \tag{280}$$

Applying Lemma [33](https://arxiv.org/html/2605.06866#Thmtheorem33) yields the remaining bound. ∎
### D.6 Episode-level finite-iteration drift

Since $\Gamma_{t,\mathrm{C}}$ is defined by averaging with respect to the phase-$t$ marginal $\rho_{t}$, the drift argument is formulated for the episode-boundary sequence $(U_{mH})_{m\geq 0}$, conditioned on $U_{m,0}$.
Let

$$W_{m,0} := U_{m,0}-U^{\star}_{H,\mathrm{C}}, \qquad U^{\star}_{H,\mathrm{C}} := I_{H,\mathrm{C}}(\eta_{H,\mathrm{C}}^{\star}), \tag{281}$$

where $\eta_{H,\mathrm{C}}^{\star}$ denotes the unique fixed point of $\Pi_{H,\mathrm{C}}^{\Theta}T_{H}^{\pi}$, and define

$$\bar{\alpha}_{m} := \sum_{u=0}^{H-1}\alpha_{mH+u}, \qquad \bar{\alpha}_{m}^{(2)} := \sum_{u=0}^{H-1}\alpha_{mH+u}^{2}. \tag{282}$$

Also write

$$\lvert\mathcal{S}_{H}\rvert := H\lvert\mathcal{S}\rvert, \quad \bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}} := \max_{0\leq t\leq H-1}\bar{\beta}_{t}^{\mathrm{fh}} = 1-\rho_{\min}(1-\lambda), \quad \kappa_{H,\mathrm{C}} := 1-\bar{\beta}_{H,\mathrm{C}}^{\mathrm{fh}} = \rho_{\min}(1-\lambda) > 0. \tag{283}$$
Fix any $\vartheta_{H,\mathrm{C}}>0$ and define the smoothed potential

$$M_{H,\mathrm{C}}(W) := \inf_{Z\in V_{H,\mathrm{C}}}\left\{\frac{1}{2}\lVert Z\rVert^{2}_{H,2,\infty}+\frac{1}{2\vartheta_{H,\mathrm{C}}}\lVert W-Z\rVert^{2}_{H,2,p^{\star}_{H}}\right\}. \tag{284}$$

By Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13) on the flattened weighted space,

$$(1+\vartheta_{H,\mathrm{C}})\,M_{H,\mathrm{C}}(W) \leq \frac{1}{2}\lVert W\rVert^{2}_{H,2,\infty} \leq \bigl(1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)\,M_{H,\mathrm{C}}(W). \tag{285}$$

Define

$$r^{\mathrm{fh}}_{H,\mathrm{C}} := \frac{1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}}{1+\vartheta_{H,\mathrm{C}}}, \qquad L^{\mathrm{fh}}_{H,\mathrm{C}} := \frac{p^{\star}_{H}-1}{\vartheta_{H,\mathrm{C}}}, \qquad \omega^{\mathrm{fh}}_{H,\mathrm{C}} := 1-\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}\sqrt{r^{\mathrm{fh}}_{H,\mathrm{C}}}. \tag{286}$$

Since $r^{\mathrm{fh}}_{H,\mathrm{C}}\to 1$ as $\vartheta_{H,\mathrm{C}}\downarrow 0$ and $\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}<1$, we have $\omega^{\mathrm{fh}}_{H,\mathrm{C}}>0$ for all sufficiently small $\vartheta_{H,\mathrm{C}}$. Define the explicit CTD constants

$$d^{\mathrm{fh}}_{H,\mathrm{C}} := 8L^{\mathrm{fh}}_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}(1+\vartheta_{H,\mathrm{C}}), \qquad \bar{\alpha}_{H,\mathrm{C}} := \min\left\{\frac{1}{H},\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{d^{\mathrm{fh}}_{H,\mathrm{C}}}\right\}, \tag{287}$$

and

$$c^{\mathrm{fh}}_{H,\mathrm{C},1} := \frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4}, \qquad c^{\mathrm{fh}}_{H,\mathrm{C},2} := HL^{\mathrm{fh}}_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\left(\frac{128}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}+72\right). \tag{288}$$
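Because the constants in (283) and (286) to (288) are fully explicit, they can be evaluated numerically for a given problem size. The sketch below does this under stated assumptions: the function name, argument names, and the example inputs are hypothetical, and the exponent $p^{\star}_{H}$ is treated as a user-supplied input. The choice `p_star = 2 * log(|S_H|)` in the example is only an illustration that makes $\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}$ equal to $e$; it is not asserted to be the paper's choice.

```python
import math


def ctd_drift_constants(theta, n_states, H, p_star, rho_min, lam, B):
    """Evaluate the constants of (283), (286), (287), (288).

    All argument names are illustrative. theta plays the role of
    vartheta_{H,C}; B plays the role of B_{H,C}.
    """
    S_H = H * n_states                               # |S_H| = H |S|
    norm_ratio = S_H ** (2.0 / p_star)               # |S_H|^{2/p*}
    beta = 1.0 - rho_min * (1.0 - lam)               # uniform contraction factor
    r = (1.0 + theta * norm_ratio) / (1.0 + theta)   # r^{fh}
    L = (p_star - 1.0) / theta                       # smoothness constant L^{fh}
    omega = 1.0 - beta * math.sqrt(r)                # omega^{fh}
    if omega <= 0.0:
        raise ValueError("omega <= 0: choose a smaller theta")
    d = 8.0 * L * norm_ratio * (1.0 + theta)         # d^{fh}
    alpha_bar = min(1.0 / H, omega / d)              # step-size cap
    c1 = omega / 4.0
    c2 = H * L * norm_ratio * B ** 2 * (128.0 / omega + 72.0)
    return {"norm_ratio": norm_ratio, "beta": beta, "r": r, "L": L,
            "omega": omega, "d": d, "alpha_bar": alpha_bar,
            "c1": c1, "c2": c2}


# Made-up example: 4 states, horizon 5, so |S_H| = 20.
consts = ctd_drift_constants(theta=0.05, n_states=4, H=5,
                             p_star=2.0 * math.log(20),
                             rho_min=0.1, lam=0.5, B=1.0)
```

The `ValueError` branch reflects the requirement $\omega^{\mathrm{fh}}_{H,\mathrm{C}}>0$, which holds only once $\vartheta_{H,\mathrm{C}}$ is small enough.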
###### Proposition 35 (Episodewise CTD potential drift).

Assume that $\omega^{\mathrm{fh}}_{H,\mathrm{C}}>0$, that $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and that

$$\alpha_{0} \leq \bar{\alpha}_{H,\mathrm{C}}. \tag{289}$$

Then, for every episode $m$,

$$\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m+1,0}) \mid U_{m,0}\bigr] \leq \bigl(1-c^{\mathrm{fh}}_{H,\mathrm{C},1}\bar{\alpha}_{m}\bigr)M_{H,\mathrm{C}}(W_{m,0})+c^{\mathrm{fh}}_{H,\mathrm{C},2}\,\bar{\alpha}^{(2)}_{m}. \tag{290}$$
###### Proof\.
For each phase $t\in\{0,\dots,H-1\}$ and each $\alpha\in[0,1]$, define

$$A_{t,\alpha,\mathrm{C}}(U) := U+\alpha\Gamma_{t,\mathrm{C}}(U) = (1-\alpha)U+\alpha G_{t,\mathrm{C}}(U). \tag{291}$$

By Proposition [32](https://arxiv.org/html/2605.06866#Thmtheorem32),

$$\bigl\lVert G_{t,\mathrm{C}}(U)-G_{t,\mathrm{C}}(U')\bigr\rVert_{H,2,\infty} \leq \bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}\lVert U-U'\rVert_{H,2,\infty}. \tag{292}$$

Hence, by convexity of the norm,

$$\lVert A_{t,\alpha,\mathrm{C}}(U)-A_{t,\alpha,\mathrm{C}}(U')\rVert_{H,2,\infty} \leq \bigl((1-\alpha)+\alpha\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{C}}\bigr)\lVert U-U'\rVert_{H,2,\infty} = \bigl(1-\kappa_{H,\mathrm{C}}\alpha\bigr)\lVert U-U'\rVert_{H,2,\infty}. \tag{293}$$

Define the deterministic phasewise averaged trajectory over episode $m$ by

$$\bar{U}_{m,0} := U_{m,0}, \qquad \bar{U}_{m,t+1} := A_{t,\alpha_{mH+t},\mathrm{C}}(\bar{U}_{m,t}), \qquad 0\leq t\leq H-1. \tag{294}$$

Since $G_{t,\mathrm{C}}(U^{\star}_{H,\mathrm{C}})=U^{\star}_{H,\mathrm{C}}$, the contraction estimate above, the Moreau-envelope comparison, and the smoothness of the envelope give the deterministic relaxed-map estimate

$$M_{H,\mathrm{C}}\bigl(A_{t,\alpha,\mathrm{C}}(U)-U^{\star}_{H,\mathrm{C}}\bigr) \leq \bigl(1-2\omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha+d^{\mathrm{fh}}_{H,\mathrm{C}}\alpha^{2}\bigr)M_{H,\mathrm{C}}(U-U^{\star}_{H,\mathrm{C}}). \tag{295}$$

Since $\alpha_{mH+t}\leq\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{C}}\leq\omega^{\mathrm{fh}}_{H,\mathrm{C}}/d^{\mathrm{fh}}_{H,\mathrm{C}}$, this implies

$$M_{H,\mathrm{C}}\bigl(A_{t,\alpha_{mH+t},\mathrm{C}}(U)-U^{\star}_{H,\mathrm{C}}\bigr) \leq \bigl(1-\omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha_{mH+t}\bigr)M_{H,\mathrm{C}}\bigl(U-U^{\star}_{H,\mathrm{C}}\bigr). \tag{296}$$

Therefore

$$M_{H,\mathrm{C}}(\bar{U}_{m,H}-U^{\star}_{H,\mathrm{C}}) \leq \prod_{u=0}^{H-1}\bigl(1-\omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha_{mH+u}\bigr)M_{H,\mathrm{C}}(W_{m,0}) \leq \left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\bar{\alpha}_{m}\right)M_{H,\mathrm{C}}(W_{m,0}). \tag{297}$$

Next, define the approximation error
$$D_{m,t} := U_{m,t}-\bar{U}_{m,t}, \qquad D_{m,0}=0. \tag{298}$$

Because each horizonwise CTD target is obtained by linear interpolation on a fixed support, the map $U\mapsto\widehat{T}_{H,\mathrm{C}}(U;s,(r,s'))$ is affine for every admissible $(s,r,s')$. Hence $F_{H,\mathrm{C}}$, $\Gamma_{t,\mathrm{C}}$, and $A_{t,\alpha,\mathrm{C}}$ are all affine maps. Let $L_{m,t,\mathrm{C}}$ denote the linear part of $A_{t,\alpha_{mH+t},\mathrm{C}}$. Then

$$A_{t,\alpha_{mH+t},\mathrm{C}}(U)-A_{t,\alpha_{mH+t},\mathrm{C}}(U') = L_{m,t,\mathrm{C}}(U-U'), \tag{299}$$

and the contraction bound implies that, for every admissible $Z$,

$$\lVert L_{m,t,\mathrm{C}}(Z)\rVert_{H,2,\infty} \leq (1-\kappa_{H,\mathrm{C}}\alpha_{mH+t})\lVert Z\rVert_{H,2,\infty}. \tag{300}$$

Using the recursion for $U_{m,t+1}$, the definition of $\bar{U}_{m,t+1}$, and the decomposition from Lemma [34](https://arxiv.org/html/2605.06866#Thmtheorem34), we obtain

$$\begin{aligned}
D_{m,t+1} &= U_{m,t+1}-\bar{U}_{m,t+1} \\
&= U_{m,t}+\alpha_{mH+t}F_{H,\mathrm{C}}\bigl(U_{m,t};S^{m}_{t},(R^{m}_{t},S^{m}_{t+1})\bigr)-\bar{U}_{m,t}-\alpha_{mH+t}\Gamma_{t,\mathrm{C}}(\bar{U}_{m,t}) \\
&= U_{m,t}-\bar{U}_{m,t}+\alpha_{mH+t}\bigl(\Gamma_{t,\mathrm{C}}(U_{m,t})-\Gamma_{t,\mathrm{C}}(\bar{U}_{m,t})\bigr) \\
&\quad +\alpha_{mH+t}\Bigl(F_{H,\mathrm{C}}\bigl(U_{m,t};S^{m}_{t},(R^{m}_{t},S^{m}_{t+1})\bigr)-\Gamma_{t,\mathrm{C}}(U_{m,t})\Bigr) \\
&= A_{t,\alpha_{mH+t},\mathrm{C}}(U_{m,t})-A_{t,\alpha_{mH+t},\mathrm{C}}(\bar{U}_{m,t})+\alpha_{mH+t}\Bigl(F_{H,\mathrm{C}}\bigl(U_{m,t};S^{m}_{t},(R^{m}_{t},S^{m}_{t+1})\bigr)-\Gamma_{t,\mathrm{C}}(U_{m,t})\Bigr) \\
&= L_{m,t,\mathrm{C}}(D_{m,t})+\alpha_{mH+t}\zeta^{\mathrm{C}}_{m,t}+\alpha_{mH+t}\xi^{\mathrm{C}}_{m,t}.
\end{aligned} \tag{301}$$

Therefore
$$\mathbb{E}[D_{m,t+1}\mid U_{m,0}] = L_{m,t,\mathrm{C}}\bigl(\mathbb{E}[D_{m,t}\mid U_{m,0}]\bigr)+\alpha_{mH+t}\,\mathbb{E}[\xi^{\mathrm{C}}_{m,t}\mid U_{m,0}], \tag{302}$$

since $\mathbb{E}[\zeta^{\mathrm{C}}_{m,t}\mid U_{m,0}]=0$. Taking $\lVert\cdot\rVert_{H,2,\infty}$ and using (300) together with the bound on $\xi^{\mathrm{C}}_{m,t}$ from Lemma [34](https://arxiv.org/html/2605.06866#Thmtheorem34) gives

$$\begin{aligned}
\nu_{m,t+1} &:= \bigl\lVert\mathbb{E}[D_{m,t+1}\mid U_{m,0}]\bigr\rVert_{H,2,\infty} \\
&\leq (1-\kappa_{H,\mathrm{C}}\alpha_{mH+t})\bigl\lVert\mathbb{E}[D_{m,t}\mid U_{m,0}]\bigr\rVert_{H,2,\infty}+8B_{H,\mathrm{C}}\,\alpha_{mH+t}\,\bar{\alpha}_{m}.
\end{aligned} \tag{303}$$

Since $\nu_{m,0}=0$, induction over $t$ yields

$$\bigl\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rVert_{H,2,\infty} \leq 8B_{H,\mathrm{C}}\,\bar{\alpha}^{2}_{m}. \tag{304}$$

Also, iterating the recursion for $D_{m,t+1}$, using

$$\lVert L_{m,t,\mathrm{C}}(Z)\rVert_{H,2,\infty} \leq \lVert Z\rVert_{H,2,\infty}, \tag{305}$$

the identity $D_{m,0}=0$, and the bounds from Lemma [34](https://arxiv.org/html/2605.06866#Thmtheorem34) give the pathwise upper bound

$$\lVert D_{m,H}\rVert_{H,2,\infty} \leq \sum_{t=0}^{H-1}\alpha_{mH+t}\bigl(\lVert\zeta^{\mathrm{C}}_{m,t}\rVert_{H,2,\infty}+\lVert\xi^{\mathrm{C}}_{m,t}\rVert_{H,2,\infty}\bigr) \leq 4B_{H,\mathrm{C}}\,\bar{\alpha}_{m}+8B_{H,\mathrm{C}}\,\bar{\alpha}^{2}_{m}. \tag{306}$$

Since $\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{C}}\leq 1/H$ and the step sizes are nonincreasing, we have $\bar{\alpha}_{m}\leq 1$, and hence the pathwise bound

$$\lVert D_{m,H}\rVert_{H,2,\infty} \leq 12B_{H,\mathrm{C}}\,\bar{\alpha}_{m}. \tag{307}$$

Therefore

$$\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,\infty}\mid U_{m,0}\bigr] \leq 144B^{2}_{H,\mathrm{C}}\,\bar{\alpha}^{2}_{m}. \tag{308}$$

Let
$$\bar{W}_{m,H} := \bar{U}_{m,H}-U^{\star}_{H,\mathrm{C}}, \qquad \delta_{m} := \frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4L^{\mathrm{fh}}_{H,\mathrm{C}}}. \tag{309}$$

Because $M_{H,\mathrm{C}}$ is nonnegative, convex, $L^{\mathrm{fh}}_{H,\mathrm{C}}$-smooth with respect to $\lVert\cdot\rVert_{H,2,p^{\star}_{H}}$, and minimized at zero, its gradient satisfies

$$\bigl\lVert\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H})\bigr\rVert^{2}_{*} \leq 2L^{\mathrm{fh}}_{H,\mathrm{C}}\,M_{H,\mathrm{C}}(\bar{W}_{m,H}), \tag{310}$$

where $\lVert\cdot\rVert_{*}$ denotes the dual norm of $\lVert\cdot\rVert_{H,2,p^{\star}_{H}}$. Since $W_{m+1,0}=\bar{W}_{m,H}+D_{m,H}$, smoothness gives

$$\begin{aligned}
\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}\bigr] &\leq M_{H,\mathrm{C}}(\bar{W}_{m,H})+\bigl\langle\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H}),\,\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rangle \\
&\quad +\frac{L^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\,\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].
\end{aligned} \tag{311}$$

By Young's inequality,

$$\begin{aligned}
\bigl\langle\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H}),\,\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rangle &\leq \frac{\delta_{m}}{2}\bigl\lVert\nabla M_{H,\mathrm{C}}(\bar{W}_{m,H})\bigr\rVert^{2}_{*}+\frac{1}{2\delta_{m}}\bigl\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rVert^{2}_{H,2,p^{\star}_{H}} \\
&\leq \frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}M_{H,\mathrm{C}}(\bar{W}_{m,H})+\frac{2L^{\mathrm{fh}}_{H,\mathrm{C}}}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}\bigl\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rVert^{2}_{H,2,p^{\star}_{H}}.
\end{aligned} \tag{312}$$

Hence

$$\begin{aligned}
\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}\bigr] &\leq \left(1+\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}\right)M_{H,\mathrm{C}}(\bar{W}_{m,H}) \\
&\quad +\frac{2L^{\mathrm{fh}}_{H,\mathrm{C}}}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}\bigl\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rVert^{2}_{H,2,p^{\star}_{H}}+\frac{L^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\,\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].
\end{aligned} \tag{313}$$

Since $\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}\leq 1$, we have
$$\left(1+\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}\right)\left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\bar{\alpha}_{m}\right) \leq 1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}}{4}. \tag{314}$$

Also,

$$\bigl\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rVert^{2}_{H,2,p^{\star}_{H}} \leq \lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigl\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\bigr\rVert^{2}_{H,2,\infty} \leq 64\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\,\bar{\alpha}^{4}_{m}, \tag{315}$$

and

$$\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr] \leq \lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\,\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,\infty}\mid U_{m,0}\bigr] \leq 144\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\,\bar{\alpha}^{2}_{m}. \tag{316}$$

Using $\bar{\alpha}_{m}\leq 1$, $\bar{\alpha}^{2}_{m}\leq H\bar{\alpha}^{(2)}_{m}$, and $\bar{\alpha}^{3}_{m}\leq H\bar{\alpha}_{m}^{(2)}$, we conclude that

$$\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m+1,0})\mid U_{m,0}\bigr] \leq \left(1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4}\bar{\alpha}_{m}\right)M_{H,\mathrm{C}}(W_{m,0})+HL^{\mathrm{fh}}_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}B^{2}_{H,\mathrm{C}}\left(\frac{128}{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}+72\right)\bar{\alpha}^{(2)}_{m}. \tag{317}$$

∎
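The second inequality in (297) is stated without an intermediate step. One elementary route, offered here as a possible justification rather than as the authors' argument, uses two scalar inequalities. Write $x_u := \omega^{\mathrm{fh}}_{H,\mathrm{C}}\alpha_{mH+u}$, which is an auxiliary symbol introduced only for this sketch. By the definition in (286), $\omega^{\mathrm{fh}}_{H,\mathrm{C}}\leq 1$, and $\bar{\alpha}_{m}\leq H\alpha_{0}\leq 1$, so each $x_u\in[0,1]$ and $\sum_{u}x_u=\omega^{\mathrm{fh}}_{H,\mathrm{C}}\bar{\alpha}_{m}\leq 1$.

```latex
\begin{align*}
\prod_{u=0}^{H-1}(1-x_u)
  &\le \exp\Bigl(-\sum_{u=0}^{H-1} x_u\Bigr)
  && \text{since } 1-x\le e^{-x}\text{ for all real } x,\\
  &\le 1-\frac{1}{2}\sum_{u=0}^{H-1}x_u
  && \text{since } e^{-y}\le 1-\tfrac{y}{2}\text{ for } y\in[0,1],\\
  &= 1-\frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{2}\,\bar{\alpha}_m .
\end{align*}
```

The bound $e^{-y}\leq 1-y/2$ on $[0,1]$ follows from convexity of $e^{-y}$: the function lies below its chord through $(0,1)$ and $(1,e^{-1})$, and that chord lies below $1-y/2$ because $1-e^{-1}>1/2$.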
###### Proposition 36 (Fixed-horizon CTD finite-iteration bound).

Define

$$a^{\mathrm{fh}}_{H,\mathrm{C},1} := r^{\mathrm{fh}}_{H,\mathrm{C}}, \quad a^{\mathrm{fh}}_{H,\mathrm{C},2} := \frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4}, \quad a^{\mathrm{fh}}_{H,\mathrm{C},3} := \frac{\omega^{\mathrm{fh}}_{H,\mathrm{C}}}{4\bar{\alpha}_{H,\mathrm{C}}}, \quad a^{\mathrm{fh}}_{H,\mathrm{C},4} := 2\bigl(1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)c^{\mathrm{fh}}_{H,\mathrm{C},2}. \tag{318}$$

If $a^{\mathrm{fh}}_{H,\mathrm{C},2}>0$, $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and

$$\alpha_{0} \leq \frac{a^{\mathrm{fh}}_{H,\mathrm{C},2}}{a^{\mathrm{fh}}_{H,\mathrm{C},3}}, \tag{319}$$

then, for all episodes $m\geq 0$,

$$\begin{aligned}
\mathbb{E}\bigl[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star}_{H,\mathrm{C}})^{2}\bigr] &\leq a^{\mathrm{fh}}_{H,\mathrm{C},1}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta^{\star}_{H,\mathrm{C}})^{2}\prod_{j=0}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr) \\
&\quad +a^{\mathrm{fh}}_{H,\mathrm{C},4}\sum_{i=0}^{m-1}\bar{\alpha}^{(2)}_{i}\prod_{j=i+1}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr).
\end{aligned} \tag{320}$$
###### Proof\.
Iterating Proposition [35](https://arxiv.org/html/2605.06866#Thmtheorem35) gives

$$\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m,0})\bigr] \leq M_{H,\mathrm{C}}(W_{0,0})\prod_{j=0}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr)+c^{\mathrm{fh}}_{H,\mathrm{C},2}\sum_{i=0}^{m-1}\bar{\alpha}^{(2)}_{i}\prod_{j=i+1}^{m-1}\bigl(1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}\bigr). \tag{321}$$

By the Moreau envelope comparison,

$$\mathbb{E}\bigl[\lVert W_{m,0}\rVert^{2}_{H,2,\infty}\bigr] \leq 2\bigl(1+\vartheta_{H,\mathrm{C}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)\,\mathbb{E}\bigl[M_{H,\mathrm{C}}(W_{m,0})\bigr] \tag{322}$$

and

$$M_{H,\mathrm{C}}(W_{0,0}) \leq \frac{1}{2(1+\vartheta_{H,\mathrm{C}})}\lVert W_{0,0}\rVert^{2}_{H,2,\infty}. \tag{323}$$

Combining these three displays with the isometric embedding identity

$$\lVert W_{m,0}\rVert_{H,2,\infty} = \ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta^{\star}_{H,\mathrm{C}}) \tag{324}$$

gives the claim. ∎
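The right-hand side of (320) can be evaluated in a single pass over episodes, because the bias term and the accumulated noise term both satisfy one-step recursions with the same shrink factor $1-a^{\mathrm{fh}}_{H,\mathrm{C},2}\bar{\alpha}_{j}$. The following is a minimal sketch of that evaluation; the function name, argument names, and numerical constants are hypothetical placeholders, not values from the paper.

```python
def boundary_bound(alphas, H, m, a1, a2, a4, init_err_sq):
    """Evaluate the right-hand side of (320) after m episodes.

    alphas      : per-iteration step sizes, at least m * H entries
    a1, a2, a4  : stand-ins for a^{fh}_{H,C,1}, a^{fh}_{H,C,2}, a^{fh}_{H,C,4}
    init_err_sq : stand-in for ell(eta_0, eta*)^2
    """
    if len(alphas) < m * H:
        raise ValueError("need at least m * H step sizes")
    bias = a1 * init_err_sq
    noise = 0.0
    for j in range(m):
        block = alphas[j * H:(j + 1) * H]
        abar = sum(block)                     # episode sum of step sizes
        abar2 = sum(a * a for a in block)     # episode sum of squared step sizes
        shrink = 1.0 - a2 * abar
        bias *= shrink
        # Older noise contributions shrink; the current episode adds a fresh term.
        noise = shrink * noise + a4 * abar2
    return bias + noise


# Constant-step example with made-up constants.
H, m, alpha = 3, 50, 0.01
a1, a2, a4, e0 = 2.0, 0.5, 10.0, 1.0
value = boundary_bound([alpha] * (H * m), H, m, a1, a2, a4, e0)
closed_form = a1 * e0 * (1.0 - a2 * H * alpha) ** m + a4 * alpha / a2
```

For a constant step size the computed value should not exceed `closed_form`, since the noise sum is a truncated geometric series.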
###### Corollary 37 (Boundary-iterate step size consequences).

Under the hypotheses of Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36):

(a) if $\alpha_{k}\equiv\alpha$ and

$$\alpha \leq \frac{a_{H,\mathrm{C},2}^{\mathrm{fh}}}{a_{H,\mathrm{C},3}^{\mathrm{fh}}}, \tag{325}$$

then for all episodes $m\geq 0$,

$$\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right] \leq a_{H,\mathrm{C},1}^{\mathrm{fh}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta_{H,\mathrm{C}}^{\star})^{2}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}H\alpha\bigr)^{m}+\frac{a_{H,\mathrm{C},4}^{\mathrm{fh}}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\alpha. \tag{326}$$

For the two diminishing-step cases below, where $g$ is the step-size offset, write $\tau_{m}:=mH+g+H-1$, so $\tau_{0}=g+H-1$.

(b) if $\alpha_{k}=\alpha/(k+g)$, $\alpha>1/a_{H,\mathrm{C},2}^{\mathrm{fh}}$, and

$$g \geq \max\left\{1,\frac{\alpha\,a_{H,\mathrm{C},3}^{\mathrm{fh}}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\right\}, \tag{327}$$

then the boundary iterates satisfy

$$\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right] \leq a_{H,\mathrm{C},1}^{\mathrm{fh}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta_{H,\mathrm{C}}^{\star})^{2}\left(\frac{\tau_{0}}{\tau_{m}}\right)^{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha}+\frac{a_{H,\mathrm{C},4}^{\mathrm{fh}}H^{2}\alpha^{2}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha-1}\cdot\frac{1}{\tau_{m}}. \tag{328}$$

(c) if $\alpha_{k}=\alpha/(k+g)^{z}$ with $z\in(0,1)$ and

$$g \geq \max\left\{1,\left(\frac{\alpha\,a_{H,\mathrm{C},3}^{\mathrm{fh}}}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\right)^{1/z},\left(\frac{2z}{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha}\right)^{1/(1-z)}\right\}, \tag{329}$$

then the boundary iterates satisfy

$$\begin{aligned}
\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right] &\leq a_{H,\mathrm{C},1}^{\mathrm{fh}}\,\ell_{H,\mathrm{C},\infty}(\eta_{0},\eta_{H,\mathrm{C}}^{\star})^{2}\cdot\exp\left(-\frac{a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha}{1-z}\bigl(\tau_{m}^{1-z}-\tau_{0}^{1-z}\bigr)\right) \\
&\quad +\frac{2a_{H,\mathrm{C},4}^{\mathrm{fh}}H^{2}\alpha}{a_{H,\mathrm{C},2}^{\mathrm{fh}}}\cdot\frac{1}{\tau_{m}^{z}}.
\end{aligned} \tag{330}$$
###### Proof\.
For part (a), substituting $\bar{\alpha}_{m}=H\alpha$ and $\bar{\alpha}_{m}^{(2)}=H\alpha^{2}$ into Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36) yields the claim.
For part (b), set $q=\tau_{0}/H$ and $\lambda=a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha$. The bounds
$$\frac{H\alpha}{\tau_{m}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{mH+g},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2}} \tag{331}$$
imply
$$\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\left(\frac{q}{m+q}\right)^{\lambda}. \tag{332}$$
The same elementary product-sum estimate gives
$$\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\frac{H\alpha^{2}}{\lambda-1}\cdot\frac{1}{m+q}, \tag{333}$$
and the displayed bound follows from Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36) after substituting $q=\tau_{0}/H$ and $\tau_{m}=H(m+q)$.
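For readers who want the elementary step behind the product bound (332) spelled out, the following is a sketch. It assumes that each factor $1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}$ lies in $[0,1]$ (an assumption made here for the sketch, presumably guaranteed by the step-size conditions of the corollary), and it uses $\tau_{j}=H(j+q)$ together with the lower bound in (331), so that $a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\geq\lambda/(j+q)$.
$$\begin{aligned}\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)&\leq\exp\left(-\sum_{j=0}^{m-1}a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\right)&&\text{since } 1-x\leq e^{-x}\\ &\leq\exp\left(-\lambda\sum_{j=0}^{m-1}\frac{1}{j+q}\right)\\ &\leq\exp\left(-\lambda\int_{0}^{m}\frac{dx}{x+q}\right)&&\text{since } x\mapsto\frac{1}{x+q}\text{ is decreasing}\\ &=\left(\frac{q}{m+q}\right)^{\lambda}.\end{aligned}$$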
For part (c), keep the same $q$ and set $A=a_{H,\mathrm{C},2}^{\mathrm{fh}}\alpha H^{1-z}$. The bounds
$$\frac{H\alpha}{\tau_{m}^{z}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{(mH+g)^{z}},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2z}}, \tag{334}$$
imply
$$\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\exp\left(-\frac{A}{1-z}\bigl((m+q)^{1-z}-q^{1-z}\bigr)\right). \tag{335}$$
The lower bound on $g$ implies $q\geq(2z/A)^{1/(1-z)}$. The elementary polynomial product-sum estimate gives
$$\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{C},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\leq\frac{2H\alpha^{2}}{A}\cdot\frac{1}{(m+q)^{z}}. \tag{336}$$
Substituting the definitions of $A$ and $q$ into Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36), using $\tau_{m}=H(m+q)$, and using $H^{2z}\leq H^{2}$, gives the displayed bound. ∎
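The product bound (335) can be sanity-checked numerically. The sketch below is illustrative only: it takes the worst case allowed by (334), in which the rescaled step equals its lower bound so that each factor is exactly $1-A/(j+q)^{z}$, and the values of `A`, `q`, and `z` are hypothetical toy choices, not constants from the paper.

```python
import math

def product_of_factors(A, q, z, m):
    """Return prod_{j=0}^{m-1} (1 - A / (j + q)**z) for the toy worst case."""
    prod = 1.0
    for j in range(m):
        factor = 1.0 - A / (j + q) ** z
        # The inequality 1 - x <= exp(-x) is applied factor by factor,
        # so the toy parameters must keep every factor nonnegative.
        assert factor >= 0.0
        prod *= factor
    return prod

def exponential_bound(A, q, z, m):
    """Right-hand side of (335)."""
    return math.exp(-A / (1.0 - z) * ((m + q) ** (1.0 - z) - q ** (1.0 - z)))

def check_polynomial_bound(A=0.5, q=2.0, z=0.6, max_m=300):
    """Return True if the product stays below the bound for m = 0, ..., max_m."""
    return all(
        product_of_factors(A, q, z, m) <= exponential_bound(A, q, z, m) + 1e-12
        for m in range(max_m + 1)
    )
```

The check mirrors the usual argument: $1-x\leq e^{-x}$ turns the product into an exponential of a sum, and the sum of $(j+q)^{-z}$ dominates the integral of $(x+q)^{-z}$ over $[0,m]$.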
###### Proof of Theorem [6](https://arxiv.org/html/2605.06866#Thmtheorem6).
Combine Proposition [36](https://arxiv.org/html/2605.06866#Thmtheorem36) with Corollary [37](https://arxiv.org/html/2605.06866#Thmtheorem37). Since $k=mH$ and $H$ is fixed, the rates in the episode index $m$ are equivalent to the stated rates in the number $k$ of transitions. ∎
###### Proof of Corollary [7](https://arxiv.org/html/2605.06866#Thmtheorem7).
By Corollary [37](https://arxiv.org/html/2605.06866#Thmtheorem37)(b),
$$\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})^{2}\right]=O\left(\frac{1}{m+1}\right). \tag{337}$$
Jensen's inequality then gives
$$\mathbb{E}\left[\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})\right]=O\left(\frac{1}{\sqrt{m+1}}\right), \tag{338}$$
so $m=O(\varepsilon^{-2})$ episodes suffice. Since $k=mH$ and $H$ is fixed, this is equivalently $k=O(\varepsilon^{-2})$ transitions. ∎
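To make the Jensen step explicit, write $X:=\ell_{H,\mathrm{C},\infty}(\eta_{mH},\eta_{H,\mathrm{C}}^{\star})$ and let $C$ denote the constant hidden in the $O(\cdot)$ of (337). Both $X$ and $C$ are notation introduced here for the sketch, and the target is read as accuracy $\varepsilon$ in expectation.
$$\mathbb{E}[X]\leq\sqrt{\mathbb{E}[X^{2}]}\leq\sqrt{\frac{C}{m+1}}\leq\varepsilon\qquad\text{whenever}\qquad m+1\geq\frac{C}{\varepsilon^{2}}.$$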
## Appendix E Fixed-horizon MTD
This appendix records the fixed-horizon MTD ingredients and the proof of Theorem [8](https://arxiv.org/html/2605.06866#Thmtheorem8).
### E.1 Flattened weighted state–horizon space
We use the same weighted flattening as in Appendix [D](https://arxiv.org/html/2605.06866#A4). Let
$$\mathcal{S}_{H}:=\{(h,s):h\in\{1,\dots,H\},\ s\in\mathcal{S}\},\qquad\lvert\mathcal{S}_{H}\rvert=H\lvert\mathcal{S}\rvert,\qquad p_{H}^{\star}:=\max\{2,\lceil\log\lvert\mathcal{S}_{H}\rvert\rceil\}. \tag{339}$$
Let
$$V_{H,\mathrm{M}}:=\prod_{(h,s)\in\mathcal{S}_{H}}\mathbb{R}^{d} \tag{340}$$
denote the corresponding flattened Euclidean product space. For a horizon-stacked iterate $U=(U^{h}(s))$, define
$$\bar{U}(h,s):=\lambda^{h}U^{h}(s). \tag{341}$$
Then
$$\lVert U\rVert_{H,2,\infty}=\max_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2},\qquad\lVert U\rVert_{H,2,p}=\left(\sum_{(h,s)\in\mathcal{S}_{H}}\lVert\bar{U}(h,s)\rVert_{2}^{p}\right)^{1/p}. \tag{342}$$
For a single horizon stack $U=(U^{1},\dots,U^{H})$, also set
$$\lVert U\rVert_{H,2}:=\max_{1\leq h\leq H}\lambda^{h}\lVert U^{h}\rVert_{2}. \tag{343}$$
Then
$$\lVert U\rVert_{H,2,\infty}=\max_{s\in\mathcal{S}}\lVert U(s)\rVert_{H,2}. \tag{344}$$
Thus Appendix [A](https://arxiv.org/html/2605.06866#A1) applies directly to the flattened weighted process.
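The weighted norms above are easy to compute on a toy iterate. The following sketch is illustrative, not the paper's implementation: it stores `U[s][h-1]` as a plain list of `d` floats (a hypothetical layout), and the values of `lam`, `p`, and the dimensions are arbitrary. It returns both sides of identity (344) together with the $\ell_p$ form from (342).

```python
import random

def l2(v):
    """Euclidean norm of a list of floats."""
    return sum(x * x for x in v) ** 0.5

def weighted_block_norms(U, lam):
    """Return lam**h * ||U^h(s)||_2 for every (h, s); U[s][h-1] is a length-d list."""
    return [lam ** h * l2(block)
            for stack in U
            for h, block in enumerate(stack, start=1)]

def norm_inf(U, lam):
    """Flattened max form of the (H,2,inf) norm, as in (342)."""
    return max(weighted_block_norms(U, lam))

def norm_p(U, lam, p):
    """Flattened l_p form of the (H,2,p) norm, as in (342)."""
    return sum(v ** p for v in weighted_block_norms(U, lam)) ** (1.0 / p)

def stack_norm(stack, lam):
    """Single-stack weighted norm (343)."""
    return max(lam ** h * l2(block) for h, block in enumerate(stack, start=1))

def demo(num_states=3, H=4, d=2, lam=0.8, p=3, seed=0):
    """Build a random toy iterate and return both sides of (344) plus the p-norm."""
    rng = random.Random(seed)
    U = [[[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(H)]
         for _ in range(num_states)]
    flat = norm_inf(U, lam)
    statewise = max(stack_norm(stack, lam) for stack in U)  # right-hand side of (344)
    return flat, statewise, norm_p(U, lam, p), num_states * H
```

On any such iterate the standard sandwich $\lVert U\rVert_{H,2,\infty}\leq\lVert U\rVert_{H,2,p}\leq\lvert\mathcal{S}_{H}\rvert^{1/p}\lVert U\rVert_{H,2,\infty}$ also holds, which is presumably where the factors $\lvert\mathcal{S}_{H}\rvert^{2/p_{H}^{\star}}$ later in this appendix come from.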
### E.2 Fixed-horizon MTD residual map
For each state $s\in\mathcal{S}$, let
$$U(s):=\bigl(U^{1}(s),\dots,U^{H}(s)\bigr), \tag{345}$$
and let $P_{s}^{H}$ denote the coordinate projector onto that stack. Define
$$F_{H,\mathrm{M}}(U;s,(r,s')):=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-U(s)\bigr), \tag{346}$$
where $\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))$ is the full horizon-stacked MTD target from Section [4](https://arxiv.org/html/2605.06866#S4). Then the online fixed-horizon MTD recursion is
$$U_{k+1}=U_{k}+\alpha_{k}F_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1})). \tag{347}$$
Define also the MTD embedded averaged operator by
$$\mathcal{O}_{H,\mathrm{M}}:=I_{H,\mathrm{M}}\circ\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}\circ I_{H,\mathrm{M}}^{-1}. \tag{348}$$
###### Proposition 38 (Fixed-horizon MTD weighted contraction).
For all $U,U'\in I_{H,\mathrm{M}}(\mathcal{F}_{H,\mathrm{M},\Theta}^{\mathcal{S}})$,
$$\lVert\mathcal{O}_{H,\mathrm{M}}U-\mathcal{O}_{H,\mathrm{M}}U'\rVert_{H,2,\infty}\leq\lambda\lVert U-U'\rVert_{H,2,\infty}. \tag{349}$$
###### Proof.
Let $\eta:=I_{H,\mathrm{M}}^{-1}(U)$ and $\eta':=I_{H,\mathrm{M}}^{-1}(U')$. Fix $(h,s)\in\mathcal{S}_{H}$ with $h\geq 1$. By the statewise MMD isometry and nonexpansiveness of the signed-categorical projection,
$$\begin{aligned}\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{M}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{M}}U')^{h}(s)\rVert_{2}&=\lambda^{h}\ell_{\mathrm{M}}\left((\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}\eta)^{h}(s),(\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}\eta')^{h}(s)\right)\\ &\leq\lambda^{h}\ell_{\mathrm{M}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta')^{h}(s)\right).\end{aligned} \tag{350}$$
Now
$$(T_{H}^{\pi}\eta)^{h}(s)=\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,(f_{R(s,a),1})_{\#}\eta^{h-1}(s'), \tag{351}$$
and likewise for $\eta'$. Since $(f_{r,1})_{\#}$ is an isometry in the MMD metric associated with a shift-invariant kernel and MMD is convex under mixtures,
$$\begin{aligned}\ell_{\mathrm{M}}\left((T_{H}^{\pi}\eta)^{h}(s),(T_{H}^{\pi}\eta')^{h}(s)\right)&\leq\sum_{a\in\mathcal{A}}\pi(a\mid s)\sum_{s'\in\mathcal{S}}P(s'\mid s,a)\,\ell_{\mathrm{M}}\left(\eta^{h-1}(s'),{\eta'}^{h-1}(s')\right)\\ &\leq\max_{x\in\mathcal{S}}\ell_{\mathrm{M}}\left(\eta^{h-1}(x),{\eta'}^{h-1}(x)\right).\end{aligned} \tag{352}$$
Therefore
$$\begin{aligned}\lambda^{h}\lVert(\mathcal{O}_{H,\mathrm{M}}U)^{h}(s)-(\mathcal{O}_{H,\mathrm{M}}U')^{h}(s)\rVert_{2}&\leq\lambda\max_{x\in\mathcal{S}}\lambda^{h-1}\ell_{\mathrm{M}}\left(\eta^{h-1}(x),{\eta'}^{h-1}(x)\right)\\ &\leq\lambda\,\ell_{H,\mathrm{M},\infty}(\eta,\eta')=\lambda\lVert U-U'\rVert_{H,2,\infty}.\end{aligned} \tag{353}$$
Taking the maximum over $(h,s)\in\mathcal{S}_{H}$ proves the claim. ∎
### E.3 One-step bounds
###### Proposition 39 (Fixed-horizon MTD samplewise Lipschitzness).
For every $s\in\mathcal{S}$, $(r,s')\in[0,1]^{q}\times\mathcal{S}$, and $U,U'\in I_{H,\mathrm{M}}(\mathcal{F}_{H,\mathrm{M},\Theta}^{\mathcal{S}})$,
$$\lVert F_{H,\mathrm{M}}(U;s,(r,s'))-F_{H,\mathrm{M}}(U';s,(r,s'))\rVert_{H,2,\infty}\leq 2\lVert U-U'\rVert_{H,2,\infty}. \tag{354}$$
###### Proof.
For each horizon $h$, the local target $\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s'))$ depends only on the $(h-1)$st horizon block at $s'$. By Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27),
$$\left\lVert\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{M}}^{h}(U';s,(r,s'))\right\rVert_{2}\leq\lVert U^{h-1}(s')-{U'}^{h-1}(s')\rVert_{2}. \tag{355}$$
Multiplying by $\lambda^{h}$ yields
$$\lambda^{h}\left\lVert\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{M}}^{h}(U';s,(r,s'))\right\rVert_{2}\leq\lambda^{h-1}\lVert U^{h-1}(s')-{U'}^{h-1}(s')\rVert_{2}\leq\lVert U-U'\rVert_{H,2,\infty}. \tag{356}$$
Taking the maximum over $h$ gives
$$\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{M}}(U';s,(r,s'))\rVert_{H,2,\infty}\leq\lVert U-U'\rVert_{H,2,\infty}. \tag{357}$$
Since
$$F_{H,\mathrm{M}}(U;s,(r,s'))=P_{s}^{H}\bigl(\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-U(s)\bigr), \tag{358}$$
the residual difference splits into a projected target difference and a projected current-stack difference. Bounding both by $\lVert U-U'\rVert_{H,2,\infty}$ gives the stated factor $2$. ∎
###### Proposition 40 (Fixed-horizon MTD affine perturbation bound).
Let
$$U^{\star}_{H,\mathrm{M}}:=I_{H,\mathrm{M}}(\eta_{H,\mathrm{M}}^{\star}). \tag{359}$$
Here $\eta_{H,\mathrm{M}}^{\star}$ denotes the unique fixed point of $\Pi_{H,\mathrm{M}}^{\Theta}T_{H}^{\pi}$. Then there exist finite constants $B^{\mathrm{tar}}_{H,\mathrm{M}}\geq 0$ and $B^{\mathrm{res}}_{H,\mathrm{M}}\geq 0$ such that, for every admissible sample $(s,(r,s'))$ and every admissible iterate $U$,
$$\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-(\mathcal{O}_{H,\mathrm{M}}U)(s)\right\rVert_{H,2}\leq 2\lambda\lVert U\rVert_{H,2,\infty}+B^{\mathrm{tar}}_{H,\mathrm{M}}, \tag{360}$$
and
$$\lVert F_{H,\mathrm{M}}(U;s,(r,s'))\rVert_{H,2,\infty}\leq 2\lVert U\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}. \tag{361}$$
Consequently,
$$\mathbb{E}\Bigl[\left\lVert\widehat{T}_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{H,\mathrm{M}}U_{k})(S_{k})\right\rVert_{H,2}^{2}\mid U_{k},S_{k}\Bigr]\leq 2\bigl(B^{\mathrm{tar}}_{H,\mathrm{M}}\bigr)^{2}+8\lambda^{2}\lVert U_{k}\rVert_{H,2,\infty}^{2}. \tag{362}$$
###### Proof.
For fixed $s,s'\in\mathcal{S}$ and $r\in[0,1]^{q}$, Proposition [27](https://arxiv.org/html/2605.06866#Thmtheorem27) gives, for each horizon $h$,
$$\left\lVert\widehat{T}_{H,\mathrm{M}}^{h}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{M}}^{h}(U';s,(r,s'))\right\rVert_{2}\leq\lVert U^{h-1}(s')-{U'}^{h-1}(s')\rVert_{2}. \tag{363}$$
After multiplying by $\lambda^{h}$ and taking the maximum over $h$, the horizon shift contributes exactly one factor $\lambda$. Therefore
$$\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{M}}(U';s,(r,s'))\right\rVert_{H,2}\leq\lambda\lVert U-U'\rVert_{H,2,\infty}. \tag{364}$$
Likewise, Proposition [38](https://arxiv.org/html/2605.06866#Thmtheorem38) yields
$$\lVert(\mathcal{O}_{H,\mathrm{M}}U)(s)-(\mathcal{O}_{H,\mathrm{M}}U')(s)\rVert_{H,2}\leq\lambda\lVert U-U'\rVert_{H,2,\infty}. \tag{365}$$
Now define
$$B^{\star,\mathrm{tar}}_{H,\mathrm{M}}:=\max_{s,s'\in\mathcal{S}}\sup_{r\in[0,1]^{q}}\left\lVert\widehat{T}_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s'))-(\mathcal{O}_{H,\mathrm{M}}U^{\star}_{H,\mathrm{M}})(s)\right\rVert_{H,2}. \tag{366}$$
As in the discounted case, this constant is finite because $[0,1]^{q}$ is compact, there are only finitely many state pairs, and the embedded projected target depends continuously on $r$.
Then
$$\begin{aligned}\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-(\mathcal{O}_{H,\mathrm{M}}U)(s)\right\rVert_{H,2}&\leq\left\lVert\widehat{T}_{H,\mathrm{M}}(U;s,(r,s'))-\widehat{T}_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s'))\right\rVert_{H,2}\\ &\quad+\left\lVert\widehat{T}_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s'))-(\mathcal{O}_{H,\mathrm{M}}U^{\star}_{H,\mathrm{M}})(s)\right\rVert_{H,2}\\ &\quad+\left\lVert(\mathcal{O}_{H,\mathrm{M}}U^{\star}_{H,\mathrm{M}})(s)-(\mathcal{O}_{H,\mathrm{M}}U)(s)\right\rVert_{H,2}\\ &\leq 2\lambda\lVert U-U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{tar}}_{H,\mathrm{M}}\\ &\leq 2\lambda\lVert U\rVert_{H,2,\infty}+\Bigl(2\lambda\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{tar}}_{H,\mathrm{M}}\Bigr).\end{aligned} \tag{367}$$
Hence the target-level affine bound holds with
$$B^{\mathrm{tar}}_{H,\mathrm{M}}:=2\lambda\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{tar}}_{H,\mathrm{M}}. \tag{368}$$
For the residual map, Proposition [39](https://arxiv.org/html/2605.06866#Thmtheorem39) gives
$$\left\lVert F_{H,\mathrm{M}}(U;s,(r,s'))-F_{H,\mathrm{M}}(U';s,(r,s'))\right\rVert_{H,2,\infty}\leq 2\lVert U-U'\rVert_{H,2,\infty}. \tag{369}$$
Define
$$B^{\star,\mathrm{res}}_{H,\mathrm{M}}:=\max_{s,s'\in\mathcal{S}}\sup_{r\in[0,1]^{q}}\lVert F_{H,\mathrm{M}}(U^{\star}_{H,\mathrm{M}};s,(r,s'))\rVert_{H,2,\infty}, \tag{370}$$
which is finite by the same continuity and compactness argument. Then
$$\lVert F_{H,\mathrm{M}}(U;s,(r,s'))\rVert_{H,2,\infty}\leq 2\lVert U-U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{res}}_{H,\mathrm{M}}\leq 2\lVert U\rVert_{H,2,\infty}+\Bigl(2\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{res}}_{H,\mathrm{M}}\Bigr). \tag{371}$$
Thus the residual-level affine bound holds with
$$B^{\mathrm{res}}_{H,\mathrm{M}}:=2\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}+B^{\star,\mathrm{res}}_{H,\mathrm{M}}. \tag{372}$$
Finally, the conditional second-moment estimate follows from the target-level affine bound by squaring and using
$$(a+b)^{2}\leq 2a^{2}+2b^{2} \tag{373}$$
with
$$a:=2\lambda\lVert U_{k}\rVert_{H,2,\infty},\qquad b:=B^{\mathrm{tar}}_{H,\mathrm{M}}. \tag{374}$$
∎
###### Proposition 41 (Fixed-horizon MTD affine conditional second moment).
There exist finite constants $C_{H,\mathrm{M},1},C_{H,\mathrm{M},2}\geq 0$ such that
$$\mathbb{E}\left[\lVert\widehat{T}_{H,\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{H,\mathrm{M}}U_{k})(S_{k})\rVert_{H,2}^{2}\mid U_{k},S_{k}\right]\leq C_{H,\mathrm{M},1}+C_{H,\mathrm{M},2}\lVert U_{k}\rVert_{H,2,\infty}^{2}. \tag{375}$$
###### Proof.
Square the affine pathwise bound and use $(a+b)^{2}\leq 2a^{2}+2b^{2}$. ∎
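One admissible pair of constants can be read off directly from the conditional second-moment estimate (362) in Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40). The display below simply records that choice; it is not claimed to be the one used later in the paper.
$$C_{H,\mathrm{M},1}=2\bigl(B^{\mathrm{tar}}_{H,\mathrm{M}}\bigr)^{2},\qquad C_{H,\mathrm{M},2}=8\lambda^{2}.$$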
### E.4 Phasewise averaged maps
For each phase $t\in\{0,\dots,H-1\}$, define
$$\Gamma_{t,\mathrm{M}}(U):=\sum_{s\in\mathcal{S}}\rho_{t}(s)\,P_{s}^{H}\bigl((\mathcal{O}_{H,\mathrm{M}}U)(s)-U(s)\bigr),\qquad G_{t,\mathrm{M}}(U):=U+\Gamma_{t,\mathrm{M}}(U). \tag{376}$$
###### Proposition 42 (Phasewise averaged contraction).
For each phase $t\in\{0,\dots,H-1\}$,
$$\lVert G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U')\rVert_{H,2,\infty}\leq\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}\lVert U-U'\rVert_{H,2,\infty},\qquad\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}:=1-\rho_{\min}(1-\lambda). \tag{377}$$
###### Proof.
For each state block $s$,
$$\bigl(G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U')\bigr)(s)=(1-\rho_{t}(s))(U(s)-U'(s))\,+\,\rho_{t}(s)\bigl((\mathcal{O}_{H,\mathrm{M}}U)(s)-(\mathcal{O}_{H,\mathrm{M}}U')(s)\bigr). \tag{378}$$
By Proposition [38](https://arxiv.org/html/2605.06866#Thmtheorem38),
$$\begin{aligned}\left\lVert\bigl(G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U')\bigr)(s)\right\rVert_{H,2}&\leq\bigl((1-\rho_{t}(s))+\rho_{t}(s)\lambda\bigr)\lVert U-U'\rVert_{H,2,\infty}\\ &=\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U'\rVert_{H,2,\infty}.\end{aligned} \tag{379}$$
Taking the maximum over $s$ on both sides gives
$$\begin{aligned}\lVert G_{t,\mathrm{M}}(U)-G_{t,\mathrm{M}}(U')\rVert_{H,2,\infty}&\leq\max_{s\in\mathcal{S}}\bigl(1-\rho_{t}(s)(1-\lambda)\bigr)\lVert U-U'\rVert_{H,2,\infty}\\ &\leq\bigl(1-\rho_{\min}(1-\lambda)\bigr)\lVert U-U'\rVert_{H,2,\infty},\end{aligned} \tag{380}$$
since the coefficient $1-\rho_{t}(s)(1-\lambda)$ is decreasing in $\rho_{t}(s)$ and $\rho_{t}(s)\geq\rho_{\min}$ for every phase-state block. Equivalently,
$$\bar{\beta}_{t}^{\mathrm{fh}}:=1-\rho_{t,\min}(1-\lambda)\leq 1-\rho_{\min}(1-\lambda)=\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}, \tag{381}$$
so the phasewise bound is dominated by the claimed uniform contraction factor. ∎
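The convex-combination argument in this proof can be illustrated on a scalar toy problem. The sketch below is not the MTD operator: it replaces $\mathcal{O}_{H,\mathrm{M}}$ with a hypothetical affine map `toy_operator` that is a `lam`-contraction in the plain sup norm (one scalar per state, no horizon weighting) and checks empirically that the averaged map contracts with factor $1-\rho_{\min}(1-\lambda)$.

```python
import random

def sup_norm(u):
    return max(abs(x) for x in u)

def toy_operator(u, P, c, lam):
    """Toy lam-contraction in the sup norm: (O u)(s) = lam * sum_x P[s][x] u[x] + c[s]."""
    return [lam * sum(p * x for p, x in zip(row, u)) + cs for row, cs in zip(P, c)]

def averaged_map(u, rho, P, c, lam):
    """Scalar analogue of (376): G(u)(s) = u(s) + rho(s) * ((O u)(s) - u(s))."""
    ou = toy_operator(u, P, c, lam)
    return [x + r * (y - x) for x, r, y in zip(u, rho, ou)]

def random_distribution(rng, n):
    """Random probability vector with strictly positive entries."""
    w = [rng.random() + 0.1 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

def worst_ratio(num_states=4, lam=0.9, trials=500, seed=0):
    """Return (largest observed ||G u - G v|| / ||u - v||, claimed factor beta)."""
    rng = random.Random(seed)
    P = [random_distribution(rng, num_states) for _ in range(num_states)]
    c = [rng.uniform(-1.0, 1.0) for _ in range(num_states)]
    rho = random_distribution(rng, num_states)
    beta = 1.0 - min(rho) * (1.0 - lam)
    worst = 0.0
    for _ in range(trials):
        u = [rng.uniform(-5.0, 5.0) for _ in range(num_states)]
        v = [rng.uniform(-5.0, 5.0) for _ in range(num_states)]
        gap = sup_norm([a - b for a, b in zip(u, v)])
        gu = averaged_map(u, rho, P, c, lam)
        gv = averaged_map(v, rho, P, c, lam)
        worst = max(worst, sup_norm([a - b for a, b in zip(gu, gv)]) / gap)
    return worst, beta
```

The observed ratio never exceeds `beta`, and `beta` approaches 1 as the smallest sampling probability shrinks, which is the usual price of single-state updates.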
### E.5 Auxiliary phasewise bounds
For episode $m$, write
$$U_{m,t}:=U_{mH+t},\qquad t=0,\dots,H. \tag{382}$$
###### Lemma 43 (Within-episode deviation).
For every episode $m$ and every $t\in\{0,\dots,H\}$,
$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq\sum_{u=0}^{t-1}\alpha_{mH+u}\Bigl(2\lVert U_{m,u}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}\Bigr). \tag{383}$$
###### Proof.
From the recursion,
$$U_{m,t}-U_{m,0}=\sum_{u=0}^{t-1}\alpha_{mH+u}F_{H,\mathrm{M}}\bigl(U_{m,u};S_{u}^{m},(R_{u}^{m},S_{u+1}^{m})\bigr). \tag{384}$$
Take norms and apply the residual-level affine bound from Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40) term by term:
$$\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}\leq\sum_{u=0}^{t-1}\alpha_{mH+u}\Bigl(2\lVert U_{m,u}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}\Bigr). \tag{385}$$
∎
###### Lemma 44 (Frozen residual decomposition).
For each phase $t\in\{0,\dots,H-1\}$, define
$$\zeta_{m,t}^{\mathrm{M}}:=F_{H,\mathrm{M}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-\Gamma_{t,\mathrm{M}}(U_{m,0}) \tag{386}$$
and
$$\xi_{m,t}^{\mathrm{M}}:=\Bigl(F_{H,\mathrm{M}}(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-F_{H,\mathrm{M}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\Bigr)-\Bigl(\Gamma_{t,\mathrm{M}}(U_{m,t})-\Gamma_{t,\mathrm{M}}(U_{m,0})\Bigr). \tag{387}$$
Then
$$F_{H,\mathrm{M}}(U_{m,t};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))-\Gamma_{t,\mathrm{M}}(U_{m,t})=\zeta_{m,t}^{\mathrm{M}}+\xi_{m,t}^{\mathrm{M}}, \tag{388}$$
$$\mathbb{E}\left[\zeta_{m,t}^{\mathrm{M}}\,\mid\,U_{m,0}\right]=0,\qquad\lVert\zeta_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,0}\rVert_{H,2,\infty}+2B^{\mathrm{res}}_{H,\mathrm{M}}, \tag{389}$$
and
$$\lVert\xi_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}. \tag{390}$$
###### Proof.
Since episode $m$ is independent of $U_{m,0}$ and its phase-$t$ sample is distributed according to the phase-$t$ law used in $\Gamma_{t,\mathrm{M}}$, we have
$$\mathbb{E}\left[\zeta_{m,t}^{\mathrm{M}}\,\mid\,U_{m,0}\right]=0. \tag{391}$$
By Proposition [40](https://arxiv.org/html/2605.06866#Thmtheorem40),
$$\lVert F_{H,\mathrm{M}}(U_{m,0};S_{t}^{m},(R_{t}^{m},S_{t+1}^{m}))\rVert_{H,2,\infty}\leq 2\lVert U_{m,0}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}, \tag{392}$$
and, since $\Gamma_{t,\mathrm{M}}(U_{m,0})$ is the average of the same residual map under the phase-$t$ law,
$$\lVert\Gamma_{t,\mathrm{M}}(U_{m,0})\rVert_{H,2,\infty}\leq 2\lVert U_{m,0}\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}. \tag{393}$$
Hence
$$\lVert\zeta_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,0}\rVert_{H,2,\infty}+2B^{\mathrm{res}}_{H,\mathrm{M}}. \tag{394}$$
Proposition [39](https://arxiv.org/html/2605.06866#Thmtheorem39) gives
$$\left\lVert F_{H,\mathrm{M}}(U;s,(r,s'))-F_{H,\mathrm{M}}(V;s,(r,s'))\right\rVert_{H,2,\infty}\leq 2\lVert U-V\rVert_{H,2,\infty}, \tag{395}$$
and averaging the same estimate yields
$$\lVert\Gamma_{t,\mathrm{M}}(U)-\Gamma_{t,\mathrm{M}}(V)\rVert_{H,2,\infty}\leq 2\lVert U-V\rVert_{H,2,\infty}. \tag{396}$$
Therefore
$$\lVert\xi_{m,t}^{\mathrm{M}}\rVert_{H,2,\infty}\leq 4\lVert U_{m,t}-U_{m,0}\rVert_{H,2,\infty}. \tag{397}$$
∎
### E.6 Episode-level finite-iteration drift
Since $\Gamma_{t,\mathrm{M}}$ is defined by averaging with respect to the phase-$t$ marginal $\rho_{t}$, the drift argument is formulated for the episode-boundary sequence $(U_{mH})_{m\geq 0}$ under the conditioning on $U_{m,0}$.
Let
$$W_{m,0}:=U_{m,0}-U^{\star}_{H,\mathrm{M}},\qquad U^{\star}_{H,\mathrm{M}}:=I_{H,\mathrm{M}}(\eta_{H,\mathrm{M}}^{\star}), \tag{398}$$
and define
$$\bar{\alpha}_{m}:=\sum_{u=0}^{H-1}\alpha_{mH+u},\qquad\bar{\alpha}_{m}^{(2)}:=\sum_{u=0}^{H-1}\alpha_{mH+u}^{2}. \tag{399}$$
Also write
$$\lvert\mathcal{S}_{H}\rvert:=H\lvert\mathcal{S}\rvert,\qquad\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}:=1-\rho_{\min}(1-\lambda),\qquad\kappa_{H,\mathrm{M}}:=1-\bar{\beta}_{H,\mathrm{M}}^{\mathrm{fh}}=\rho_{\min}(1-\lambda)>0. \tag{400}$$
Fix any $\vartheta_{H,\mathrm{M}}>0$ and define the smoothed potential
$$M_{H,\mathrm{M}}(W):=\inf_{Z\in V_{H,\mathrm{M}}}\left\{\frac{1}{2}\left\lVert Z\right\rVert^{2}_{H,2,\infty}+\frac{1}{2\vartheta_{H,\mathrm{M}}}\left\lVert W-Z\right\rVert^{2}_{H,2,p^{\star}_{H}}\right\}. \tag{401}$$
By Proposition [13](https://arxiv.org/html/2605.06866#Thmtheorem13) on the flattened weighted space,
$$(1+\vartheta_{H,\mathrm{M}})M_{H,\mathrm{M}}(W)\leq\frac{1}{2}\left\lVert W\right\rVert^{2}_{H,2,\infty}\leq\bigl(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)M_{H,\mathrm{M}}(W). \tag{402}$$
Define
$$r^{\mathrm{fh}}_{H,\mathrm{M}}:=\frac{1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}}{1+\vartheta_{H,\mathrm{M}}},\qquad L^{\mathrm{fh}}_{H,\mathrm{M}}:=\frac{p^{\star}_{H}-1}{\vartheta_{H,\mathrm{M}}},\qquad\omega^{\mathrm{fh}}_{H,\mathrm{M}}:=1-\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{M}}\sqrt{r^{\mathrm{fh}}_{H,\mathrm{M}}}. \tag{403}$$
Since $r^{\mathrm{fh}}_{H,\mathrm{M}}\to 1$ as $\vartheta_{H,\mathrm{M}}\downarrow 0$ and $\bar{\beta}^{\mathrm{fh}}_{H,\mathrm{M}}<1$, we have $\omega^{\mathrm{fh}}_{H,\mathrm{M}}>0$ for all sufficiently small $\vartheta_{H,\mathrm{M}}$. Define the preliminary threshold and the explicit MTD constants
$$\begin{gathered}d^{\mathrm{fh}}_{H,\mathrm{M}}:=8L^{\mathrm{fh}}_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}(1+\vartheta_{H,\mathrm{M}}),\qquad\widehat{\alpha}_{H,\mathrm{M}}:=\min\left\{\frac{1}{H},\frac{1}{2H\kappa_{H,\mathrm{M}}},\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{d^{\mathrm{fh}}_{H,\mathrm{M}}}\right\},\\ C_{H,\mathrm{M},\mathrm{g}}:=\exp\left(\frac{1}{\kappa_{H,\mathrm{M}}}\right)\max\left\{1,\frac{B^{\mathrm{res}}_{H,\mathrm{M}}}{2\kappa_{H,\mathrm{M}}}\right\},\qquad C_{H,\mathrm{M},\mathrm{w}}:=2C_{H,\mathrm{M},\mathrm{g}}+B^{\mathrm{res}}_{H,\mathrm{M}},\\ C_{H,\mathrm{M},\mathrm{p}}:=HL^{\mathrm{fh}}_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}C^{2}_{H,\mathrm{M},\mathrm{w}}\left(\frac{32}{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}+36\right),\end{gathered} \tag{404}$$
and
$$\bar{\alpha}_{H,\mathrm{M}}:=\min\left\{\widehat{\alpha}_{H,\mathrm{M}},\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{64C_{H,\mathrm{M},\mathrm{p}}\bigl(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)}\right\}. \tag{405}$$
Finally, define
$$\begin{gathered}c_{H,\mathrm{M},1}^{\mathrm{fh}}:=\frac{\omega^{\mathrm{fh}}_{H,\mathrm{M}}}{4},\qquad c_{H,\mathrm{M},2}^{\mathrm{fh}}:=2C_{H,\mathrm{M},\mathrm{p}}\bigl(1+2\lVert U^{\star}_{H,\mathrm{M}}\rVert_{H,2,\infty}^{2}\bigr),\\ c_{H,\mathrm{M},3}^{\mathrm{fh}}:=8C_{H,\mathrm{M},\mathrm{p}}\bigl(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr).\end{gathered} \tag{406}$$
###### Lemma 45\(Pathwise growth control inside one episode\)\.
Assume that\(αk\)k≥0\(\\alpha\_\{k\}\)\_\{k\\geq 0\}is nonincreasing and thatα0≤α¯H,M≤1\\alpha\_\{0\}\\leq\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\\leq 1\. Then, for every episodemm,
max0≤t≤H∥Um,t∥H,2,∞≤CH,M,g\(1\+∥Um,0∥H,2,∞\)\\max\_\{0\\leq t\\leq H\}\\lVert U\_\{m,t\}\\rVert\_\{H,2,\\infty\}\\leq C\_\{H,\\mathrm\{M\},\\mathrm\{g\}\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\(407\)pathwise\.
###### Proof\.
By Proposition[40](https://arxiv.org/html/2605.06866#Thmtheorem40),
∥FH,M\(U;s,\(r,s′\)\)∥H,2,∞≤2∥U∥H,2,∞\+BH,Mres,\\lVert F\_\{H,\\mathrm\{M\}\}\(U;s,\(r,s^\{\\prime\}\)\)\\rVert\_\{H,2,\\infty\}\\leq 2\\lVert U\\rVert\_\{H,2,\\infty\}\+B^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\},\(408\)whereBH,MresB^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\}is the residual\-level affine constant from Proposition[40](https://arxiv.org/html/2605.06866#Thmtheorem40)\. Hence the recursion gives
∥Um,t\+1∥H,2,∞≤\(1\+2αmH\+t\)∥Um,t∥H,2,∞\+αmH\+tBH,Mres\.\\lVert U\_\{m,t\+1\}\\rVert\_\{H,2,\\infty\}\\leq\\bigl\(1\+2\\alpha\_\{mH\+t\}\\bigr\)\\lVert U\_\{m,t\}\\rVert\_\{H,2,\\infty\}\+\\alpha\_\{mH\+t\}B^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\}\.\(409\)Iterating over at mostHHsteps and using
∑u=0H−1αmH\+u≤Hα0≤Hα¯H,M\\sum\_\{u=0\}^\{H\-1\}\\alpha\_\{mH\+u\}\\leq H\\alpha\_\{0\}\\leq H\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\(410\)yields
max0≤t≤H∥Um,t∥H,2,∞≤e2Hα¯H,M\(∥Um,0∥H,2,∞\+Hα¯H,MBH,Mres\)\.\\max\_\{0\\leq t\\leq H\}\\lVert U\_\{m,t\}\\rVert\_\{H,2,\\infty\}\\leq e^\{2H\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\}\\bigl\(\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\+H\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}B^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\}\\bigr\)\.\(411\)Sinceα¯H,M≤α^H,M\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\\leq\\widehat\{\\alpha\}\_\{H,\\mathrm\{M\}\}, we have
e2Hα¯H,M≤e1/κH,M,Hα¯H,MBH,Mres≤BH,Mres2κH,M\.e^\{2H\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\}\\leq e^\{1/\\kappa\_\{H,\\mathrm\{M\}\}\},\\qquad H\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}B^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\}\\leq\\frac\{B^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\}\}\{2\\kappa\_\{H,\\mathrm\{M\}\}\}\.\(412\)Therefore
max0≤t≤H∥Um,t∥H,2,∞≤CH,M,g\(1\+∥Um,0∥H,2,∞\)\.\\max\_\{0\\leq t\\leq H\}\\lVert U\_\{m,t\}\\rVert\_\{H,2,\\infty\}\\leq C\_\{H,\\mathrm\{M\},\\mathrm\{g\}\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\.\(413\)∎
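The growth bound of Lemma 45 can be checked numerically on the scalar worst-case recursion behind (409). The sketch below is illustrative only: the values of `H`, `kappa`, and `b_res` are hypothetical, and the step-size cap uses only the first two entries of the minimum defining $\widehat{\alpha}_{H,\mathrm{M}}$ (the third entry can only make the cap smaller).

```python
import math

def growth_constant(kappa, b_res):
    """C_{H,M,g} = exp(1/kappa) * max(1, B_res / (2 * kappa)), as in (404)."""
    return math.exp(1.0 / kappa) * max(1.0, b_res / (2.0 * kappa))

def worst_case_norms(u0, alphas, b_res):
    """Iterate the scalar upper recursion x <- (1 + 2a) x + a * B_res from (409)."""
    xs = [u0]
    for a in alphas:
        xs.append((1.0 + 2.0 * a) * xs[-1] + a * b_res)
    return xs

# Hypothetical problem constants, chosen only for illustration.
H, kappa, b_res = 5, 0.5, 3.0
u0 = 2.0

# First two entries of the minimum defining the step-size threshold.
alpha_cap = min(1.0 / H, 1.0 / (2.0 * H * kappa))

# A nonincreasing step-size sequence with alpha_0 <= alpha_cap.
alphas = [alpha_cap / (1 + t) for t in range(H)]

xs = worst_case_norms(u0, alphas, b_res)
bound = growth_constant(kappa, b_res) * (1.0 + u0)

print(max(xs), "<=", bound)
```

The recursion is an upper envelope for the norms, so the comparison only illustrates how much slack the constant $C_{H,\mathrm{M},\mathrm{g}}$ leaves for these particular values.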
###### Proposition 46\(Episodewise MTD potential drift\)\.
Assume that $\omega^{\mathrm{fh}}_{H,\mathrm{M}}>0$, that $(\alpha_{k})_{k\geq 0}$ is nonincreasing, and that
$$\alpha_{0}\leq\bar{\alpha}_{H,\mathrm{M}}. \tag{414}$$
Then, for every episode $m$,
𝔼\[MH,M\(Wm\+1,0\)∣Um,0\]≤\(1−cH,M,1fhα¯m\)MH,M\(Wm,0\)\+cH,M,2fhα¯m\(2\)\+cH,M,3fhα¯m\(2\)MH,M\(Wm,0\)\.\\mathbb\{E\}\\left\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m\+1,0\}\)\\mid U\_\{m,0\}\\right\]\\leq\\bigl\(1\-c\_\{H,\\mathrm\{M\},1\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}\\bigr\)M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\+c\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}\+c\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\.
\(415\)Consequently,
𝔼\[MH,M\(Wm\+1,0\)∣Um,0\]≤\(1−ωH,Mfh8α¯m\)MH,M\(Wm,0\)\+cH,M,2fhα¯m\(2\)\.\\mathbb\{E\}\\left\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m\+1,0\}\)\\mid U\_\{m,0\}\\right\]\\leq\\left\(1\-\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{8\}\\bar\{\\alpha\}\_\{m\}\\right\)M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\+c\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}\.\(416\)
###### Proof\.
For each phasettand eachα∈\[0,1\]\\alpha\\in\[0,1\], define
At,α,M\(U\):=U\+αΓt,M\(U\)=\(1−α\)U\+αGt,M\(U\)\.A\_\{t,\\alpha,\\mathrm\{M\}\}\(U\):=U\+\\alpha\\Gamma\_\{t,\\mathrm\{M\}\}\(U\)=\(1\-\\alpha\)U\+\\alpha G\_\{t,\\mathrm\{M\}\}\(U\)\.\(417\)By Proposition[42](https://arxiv.org/html/2605.06866#Thmtheorem42),
∥At,α,M\(U\)−At,α,M\(U′\)∥H,2,∞≤\(1−κH,Mα\)∥U−U′∥H,2,∞\.\\lVert A\_\{t,\\alpha,\\mathrm\{M\}\}\(U\)\-A\_\{t,\\alpha,\\mathrm\{M\}\}\(U^\{\\prime\}\)\\rVert\_\{H,2,\\infty\}\\leq\\bigl\(1\-\\kappa\_\{H,\\mathrm\{M\}\}\\alpha\\bigr\)\\lVert U\-U^\{\\prime\}\\rVert\_\{H,2,\\infty\}\.\(418\)
Define the deterministic phasewise averaged trajectory over episodemmby
U¯m,0:=Um,0,U¯m,t\+1:=At,αmH\+t,M\(U¯m,t\),0≤t≤H−1\.\\bar\{U\}\_\{m,0\}:=U\_\{m,0\},\\qquad\\bar\{U\}\_\{m,t\+1\}:=A\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\(\\bar\{U\}\_\{m,t\}\),\\qquad 0\\leq t\\leq H\-1\.\(419\)
SinceGt,M\(UH,M⋆\)=UH,M⋆G\_\{t,\\mathrm\{M\}\}\(U^\{\\star\}\_\{H,\\mathrm\{M\}\}\)=U^\{\\star\}\_\{H,\\mathrm\{M\}\}, the contraction estimate above, the Moreau\-envelope comparison, and the smoothness of the envelope give the deterministic relaxed\-map estimate
MH,M\(At,α,M\(U\)−UH,M⋆\)≤\(1−2ωH,Mfhα\+dH,Mfhα2\)MH,M\(U−UH,M⋆\)\.M\_\{H,\\mathrm\{M\}\}\\bigl\(A\_\{t,\\alpha,\\mathrm\{M\}\}\(U\)\-U^\{\\star\}\_\{H,\\mathrm\{M\}\}\\bigr\)\\leq\(1\-2\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\alpha\+d^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\alpha^\{2\}\)M\_\{H,\\mathrm\{M\}\}\(U\-U^\{\\star\}\_\{H,\\mathrm\{M\}\}\)\.\(420\)SinceαmH\+t≤α0≤α¯H,M≤ωH,Mfh/dH,Mfh\\alpha\_\{mH\+t\}\\leq\\alpha\_\{0\}\\leq\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\\leq\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}/d^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}, this implies
MH,M\(At,αmH\+t,M\(U\)−UH,M⋆\)≤\(1−ωH,MfhαmH\+t\)MH,M\(U−UH,M⋆\)\.M\_\{H,\\mathrm\{M\}\}\\bigl\(A\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\(U\)\-U^\{\\star\}\_\{H,\\mathrm\{M\}\}\\bigr\)\\leq\(1\-\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\alpha\_\{mH\+t\}\)M\_\{H,\\mathrm\{M\}\}\(U\-U^\{\\star\}\_\{H,\\mathrm\{M\}\}\)\.\(421\)Therefore
MH,M\(U¯m,H−UH,M⋆\)≤∏u=0H−1\(1−ωH,MfhαmH\+u\)MH,M\(Wm,0\)≤\(1−ωH,Mfh2α¯m\)MH,M\(Wm,0\)\.M\_\{H,\\mathrm\{M\}\}\(\\bar\{U\}\_\{m,H\}\-U^\{\\star\}\_\{H,\\mathrm\{M\}\}\)\\leq\\prod\_\{u=0\}^\{H\-1\}\(1\-\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\alpha\_\{mH\+u\}\)M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\\leq\\left\(1\-\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{2\}\\bar\{\\alpha\}\_\{m\}\\right\)M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\.
\(422\)Now define
Dm,t:=Um,t−U¯m,t,Dm,0=0\.D\_\{m,t\}:=U\_\{m,t\}\-\\bar\{U\}\_\{m,t\},\\qquad D\_\{m,0\}=0\.\(423\)By Proposition[25](https://arxiv.org/html/2605.06866#Thmtheorem25), each statewise signed\-categorical projection is affine\. Applying this horizon\-by\-horizon shows thatU↦T^H,M\(U;s,\(r,s′\)\)U\\mapsto\\widehat\{T\}\_\{H,\\mathrm\{M\}\}\(U;s,\(r,s^\{\\prime\}\)\)andU↦\(𝒪H,MU\)\(s\)U\\mapsto\(\\mathcal\{O\}\_\{H,\\mathrm\{M\}\}U\)\(s\)are affine for every admissible\(s,r,s′\)\(s,r,s^\{\\prime\}\)\. HenceFH,MF\_\{H,\\mathrm\{M\}\},Γt,M\\Gamma\_\{t,\\mathrm\{M\}\}, andAt,α,MA\_\{t,\\alpha,\\mathrm\{M\}\}are affine\. LetLm,t,ML\_\{m,t,\\mathrm\{M\}\}denote the linear part ofAt,αmH\+t,MA\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\. Then
At,αmH\+t,M\(U\)−At,αmH\+t,M\(U′\)=Lm,t,M\(U−U′\),A\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\(U\)\-A\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\(U^\{\\prime\}\)=L\_\{m,t,\\mathrm\{M\}\}\(U\-U^\{\\prime\}\),\(424\)and for every admissibleZZ
∥Lm,t,MZ∥H,2,∞≤\(1−κH,MαmH\+t\)∥Z∥H,2,∞\.\\lVert L\_\{m,t,\\mathrm\{M\}\}Z\\rVert\_\{H,2,\\infty\}\\leq\\bigl\(1\-\\kappa\_\{H,\\mathrm\{M\}\}\\alpha\_\{mH\+t\}\\bigr\)\\lVert Z\\rVert\_\{H,2,\\infty\}\.\(425\)Using the recursion forUm,t\+1U\_\{m,t\+1\}, the definition ofU¯m,t\+1\\bar\{U\}\_\{m,t\+1\}, and the decomposition from Lemma[44](https://arxiv.org/html/2605.06866#Thmtheorem44), we obtain
Dm,t\+1=Um,t\+1−U¯m,t\+1=Um,t\+αmH\+tFH,M\(Um,t;Stm,\(Rtm,St\+1m\)\)−U¯m,t−αmH\+tΓt,M\(U¯m,t\)=Um,t−U¯m,t\+αmH\+t\(Γt,M\(Um,t\)−Γt,M\(U¯m,t\)\)\+αmH\+t\(FH,M\(Um,t;Stm,\(Rtm,St\+1m\)\)−Γt,M\(Um,t\)\)=At,αmH\+t,M\(Um,t\)−At,αmH\+t,M\(U¯m,t\)\+αmH\+t\(FH,M\(Um,t;Stm,\(Rtm,St\+1m\)\)−Γt,M\(Um,t\)\)=Lm,t,MDm,t\+αmH\+tζm,tM\+αmH\+tξm,tM\.\\begin\{gathered\}D\_\{m,t\+1\}=U\_\{m,t\+1\}\-\\bar\{U\}\_\{m,t\+1\}\\\\ =U\_\{m,t\}\+\\alpha\_\{mH\+t\}F\_\{H,\\mathrm\{M\}\}\\bigl\(U\_\{m,t\};S\_\{t\}^\{m\},\(R\_\{t\}^\{m\},S\_\{t\+1\}^\{m\}\)\\bigr\)\-\\bar\{U\}\_\{m,t\}\-\\alpha\_\{mH\+t\}\\Gamma\_\{t,\\mathrm\{M\}\}\(\\bar\{U\}\_\{m,t\}\)\\\\ =U\_\{m,t\}\-\\bar\{U\}\_\{m,t\}\+\\alpha\_\{mH\+t\}\\Bigl\(\\Gamma\_\{t,\\mathrm\{M\}\}\(U\_\{m,t\}\)\-\\Gamma\_\{t,\\mathrm\{M\}\}\(\\bar\{U\}\_\{m,t\}\)\\Bigr\)\\\\ \+\\alpha\_\{mH\+t\}\\Bigl\(F\_\{H,\\mathrm\{M\}\}\\bigl\(U\_\{m,t\};S\_\{t\}^\{m\},\(R\_\{t\}^\{m\},S\_\{t\+1\}^\{m\}\)\\bigr\)\-\\Gamma\_\{t,\\mathrm\{M\}\}\(U\_\{m,t\}\)\\Bigr\)\\\\ =A\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\(U\_\{m,t\}\)\-A\_\{t,\\alpha\_\{mH\+t\},\\mathrm\{M\}\}\(\\bar\{U\}\_\{m,t\}\)\+\\alpha\_\{mH\+t\}\\Bigl\(F\_\{H,\\mathrm\{M\}\}\\bigl\(U\_\{m,t\};S\_\{t\}^\{m\},\(R\_\{t\}^\{m\},S\_\{t\+1\}^\{m\}\)\\bigr\)\-\\Gamma\_\{t,\\mathrm\{M\}\}\(U\_\{m,t\}\)\\Bigr\)\\\\ =L\_\{m,t,\\mathrm\{M\}\}D\_\{m,t\}\+\\alpha\_\{mH\+t\}\\zeta\_\{m,t\}^\{\\mathrm\{M\}\}\+\\alpha\_\{mH\+t\}\\xi\_\{m,t\}^\{\\mathrm\{M\}\}\.\\end\{gathered\}
\(426\)By Lemma[45](https://arxiv.org/html/2605.06866#Thmtheorem45)and Lemma[43](https://arxiv.org/html/2605.06866#Thmtheorem43),
∥Um,t−Um,0∥H,2,∞≤CH,M,wα¯m\(1\+∥Um,0∥H,2,∞\),\\lVert U\_\{m,t\}\-U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\leq C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bar\{\\alpha\}\_\{m\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\),\(427\)and, sinceCH,M,g≥1C\_\{H,\\mathrm\{M\},\\mathrm\{g\}\}\\geq 1,
4∥Um,0∥H,2,∞\+2BH,Mres≤2CH,M,w\(1\+∥Um,0∥H,2,∞\)\.4\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\+2B^\{\\mathrm\{res\}\}\_\{H,\\mathrm\{M\}\}\\leq 2C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\.\(428\)Hence Lemma[44](https://arxiv.org/html/2605.06866#Thmtheorem44)gives
∥ζm,tM∥H,2,∞≤2CH,M,w\(1\+∥Um,0∥H,2,∞\),\\lVert\\zeta\_\{m,t\}^\{\\mathrm\{M\}\}\\rVert\_\{H,2,\\infty\}\\leq 2C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\),\(429\)and
∥ξm,tM∥H,2,∞≤4CH,M,wα¯m\(1\+∥Um,0∥H,2,∞\)\.\\lVert\\xi\_\{m,t\}^\{\\mathrm\{M\}\}\\rVert\_\{H,2,\\infty\}\\leq 4C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bar\{\\alpha\}\_\{m\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\.\(430\)Therefore
𝔼\[Dm,t\+1∣Um,0\]=Lm,t,M𝔼\[Dm,t∣Um,0\]\+αmH\+t𝔼\[ξm,tM∣Um,0\],\\mathbb\{E\}\\left\[D\_\{m,t\+1\}\\,\\mid\\,U\_\{m,0\}\\right\]=L\_\{m,t,\\mathrm\{M\}\}\\mathbb\{E\}\\left\[D\_\{m,t\}\\,\\mid\\,U\_\{m,0\}\\right\]\+\\alpha\_\{mH\+t\}\\mathbb\{E\}\\left\[\\xi\_\{m,t\}^\{\\mathrm\{M\}\}\\,\\mid\\,U\_\{m,0\}\\right\],\(431\)because𝔼\[ζm,tM∣Um,0\]=0\\mathbb\{E\}\\left\[\\zeta\_\{m,t\}^\{\\mathrm\{M\}\}\\,\\mid\\,U\_\{m,0\}\\right\]=0\. Hence
∥𝔼\[Dm,t\+1∣Um,0\]∥H,2,∞≤\(1−κH,MαmH\+t\)∥𝔼\[Dm,t∣Um,0\]∥H,2,∞\+4CH,M,wαmH\+tα¯m\(1\+∥Um,0∥H,2,∞\)\.\\begin\{gathered\}\\left\\lVert\\mathbb\{E\}\\left\[D\_\{m,t\+1\}\\,\\mid\\,U\_\{m,0\}\\right\]\\right\\rVert\_\{H,2,\\infty\}\\leq\\bigl\(1\-\\kappa\_\{H,\\mathrm\{M\}\}\\alpha\_\{mH\+t\}\\bigr\)\\left\\lVert\\mathbb\{E\}\\left\[D\_\{m,t\}\\,\\mid\\,U\_\{m,0\}\\right\]\\right\\rVert\_\{H,2,\\infty\}\\\\ \+4C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\alpha\_\{mH\+t\}\\bar\{\\alpha\}\_\{m\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\.\\end\{gathered\}\(432\)SinceDm,0=0D\_\{m,0\}=0, induction gives
∥𝔼\[Dm,H∣Um,0\]∥H,2,∞≤4CH,M,wα¯m2\(1\+∥Um,0∥H,2,∞\)\.\\left\\lVert\\mathbb\{E\}\\left\[D\_\{m,H\}\\,\\mid\\,U\_\{m,0\}\\right\]\\right\\rVert\_\{H,2,\\infty\}\\leq 4C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bar\{\\alpha\}\_\{m\}^\{2\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\.\(433\)
Iterating the recursion forDm,t\+1D\_\{m,t\+1\}, using
∥Lm,t,M\(Z\)∥H,2,∞≤∥Z∥H,2,∞,\\lVert L\_\{m,t,\\mathrm\{M\}\}\(Z\)\\rVert\_\{H,2,\\infty\}\\leq\\lVert Z\\rVert\_\{H,2,\\infty\},\(434\)the identityDm,0=0D\_\{m,0\}=0, and the bounds from Lemma[44](https://arxiv.org/html/2605.06866#Thmtheorem44)give the pathwise estimate
∥Dm,H∥H,2,∞≤∑t=0H−1αmH\+t\(∥ζm,tM∥H,2,∞\+∥ξm,tM∥H,2,∞\)\\lVert D\_\{m,H\}\\rVert\_\{H,2,\\infty\}\\leq\\sum\_\{t=0\}^\{H\-1\}\\alpha\_\{mH\+t\}\\Bigl\(\\lVert\\zeta\_\{m,t\}^\{\\mathrm\{M\}\}\\rVert\_\{H,2,\\infty\}\+\\lVert\\xi\_\{m,t\}^\{\\mathrm\{M\}\}\\rVert\_\{H,2,\\infty\}\\Bigr\)\(435\)and therefore
∥Dm,H∥H,2,∞≤\(2CH,M,wα¯m\+4CH,M,wα¯m2\)\(1\+∥Um,0∥H,2,∞\)\.\\lVert D\_\{m,H\}\\rVert\_\{H,2,\\infty\}\\leq\\Bigl\(2C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bar\{\\alpha\}\_\{m\}\+4C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bar\{\\alpha\}\_\{m\}^\{2\}\\Bigr\)\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\.\(436\)Sinceα0≤α¯H,M≤α^H,M≤1/H\\alpha\_\{0\}\\leq\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\\leq\\widehat\{\\alpha\}\_\{H,\\mathrm\{M\}\}\\leq 1/H, we haveα¯m≤1\\bar\{\\alpha\}\_\{m\}\\leq 1, so
∥Dm,H∥H,2,∞≤6CH,M,wα¯m\(1\+∥Um,0∥H,2,∞\)\\lVert D\_\{m,H\}\\rVert\_\{H,2,\\infty\}\\leq 6C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}\\bar\{\\alpha\}\_\{m\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)\(437\)pathwise\. Hence
𝔼\[∥Dm,H∥H,2,∞2∣Um,0\]≤36CH,M,w2α¯m2\(1\+∥Um,0∥H,2,∞\)2\.\\mathbb\{E\}\\left\[\\lVert D\_\{m,H\}\\rVert\_\{H,2,\\infty\}^\{2\}\\,\\mid\\,U\_\{m,0\}\\right\]\\leq 36C\_\{H,\\mathrm\{M\},\\mathrm\{w\}\}^\{2\}\\bar\{\\alpha\}\_\{m\}^\{2\}\\bigl\(1\+\\lVert U\_\{m,0\}\\rVert\_\{H,2,\\infty\}\\bigr\)^\{2\}\.\(438\)
Define
W¯m,H:=U¯m,H−UH,M⋆,δm:=ωH,Mfhα¯m4LH,Mfh\.\\bar\{W\}\_\{m,H\}:=\\bar\{U\}\_\{m,H\}\-U^\{\\star\}\_\{H,\\mathrm\{M\}\},\\qquad\\delta\_\{m\}:=\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\{4L^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\.\(439\)BecauseMH,MM\_\{H,\\mathrm\{M\}\}is nonnegative, convex,LH,MfhL^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\-smooth with respect to∥⋅∥H,2,pH⋆\\left\\lVert\\cdot\\right\\rVert\_\{H,2,p^\{\\star\}\_\{H\}\}, and minimized at zero, its gradient satisfies
‖∇MH,M\(W¯m,H\)‖∗2≤2LH,MfhMH,M\(W¯m,H\),\\left\\lVert\\nabla M\_\{H,\\mathrm\{M\}\}\(\\bar\{W\}\_\{m,H\}\)\\right\\rVert^\{2\}\_\{\*\}\\leq 2L^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}M\_\{H,\\mathrm\{M\}\}\(\\bar\{W\}\_\{m,H\}\),\(440\)where∥⋅∥∗\\left\\lVert\\cdot\\right\\rVert\_\{\*\}denotes the dual norm of∥⋅∥H,2,pH⋆\\left\\lVert\\cdot\\right\\rVert\_\{H,2,p^\{\\star\}\_\{H\}\}\. Therefore
$$\begin{gathered}\mathbb{E}\bigl[M_{H,\mathrm{M}}(W_{m+1,0})\mid U_{m,0}\bigr]\leq M_{H,\mathrm{M}}(\bar{W}_{m,H})+\left\langle\nabla M_{H,\mathrm{M}}(\bar{W}_{m,H}),\mathbb{E}[D_{m,H}\mid U_{m,0}]\right\rangle\\ +\frac{L^{\mathrm{fh}}_{H,\mathrm{M}}}{2}\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr].\end{gathered} \tag{441}$$
By Young's inequality,
⟨∇MH,M\(W¯m,H\),𝔼\[Dm,H∣Um,0\]⟩≤δm2∥∇MH,M\(W¯m,H\)∥∗2\+12δm∥𝔼\[Dm,H∣Um,0\]∥H,2,pH⋆2≤ωH,Mfhα¯m4MH,M\(W¯m,H\)\+2LH,MfhωH,Mfhα¯m∥𝔼\[Dm,H∣Um,0\]∥H,2,pH⋆2\.\\begin\{gathered\}\\left\\langle\\nabla M\_\{H,\\mathrm\{M\}\}\(\\bar\{W\}\_\{m,H\}\),\\mathbb\{E\}\[D\_\{m,H\}\\mid U\_\{m,0\}\]\\right\\rangle\\leq\\frac\{\\delta\_\{m\}\}\{2\}\\left\\lVert\\nabla M\_\{H,\\mathrm\{M\}\}\(\\bar\{W\}\_\{m,H\}\)\\right\\rVert^\{2\}\_\{\*\}\+\\frac\{1\}\{2\\delta\_\{m\}\}\\left\\lVert\\mathbb\{E\}\[D\_\{m,H\}\\mid U\_\{m,0\}\]\\right\\rVert^\{2\}\_\{H,2,p^\{\\star\}\_\{H\}\}\\\\ \\leq\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\{4\}M\_\{H,\\mathrm\{M\}\}\(\\bar\{W\}\_\{m,H\}\)\+\\frac\{2L^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\\left\\lVert\\mathbb\{E\}\[D\_\{m,H\}\\mid U\_\{m,0\}\]\\right\\rVert^\{2\}\_\{H,2,p^\{\\star\}\_\{H\}\}\.\\end\{gathered\}
\(442\)Hence
𝔼\[MH,M\(Wm\+1,0\)∣Um,0\]≤\(1\+ωH,Mfhα¯m4\)MH,M\(W¯m,H\)\+2LH,MfhωH,Mfhα¯m∥𝔼\[Dm,H∣Um,0\]∥H,2,pH⋆2\+LH,Mfh2𝔼\[∥Dm,H∥H,2,pH⋆2∣Um,0\]\.\\begin\{gathered\}\\mathbb\{E\}\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m\+1,0\}\)\\mid U\_\{m,0\}\]\\leq\\left\(1\+\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\{4\}\\right\)M\_\{H,\\mathrm\{M\}\}\(\\bar\{W\}\_\{m,H\}\)\\\\ \+\\frac\{2L^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\\left\\lVert\\mathbb\{E\}\[D\_\{m,H\}\\mid U\_\{m,0\}\]\\right\\rVert^\{2\}\_\{H,2,p^\{\\star\}\_\{H\}\}\+\\frac\{L^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{2\}\\mathbb\{E\}\\bigl\[\\left\\lVert D\_\{m,H\}\\right\\rVert^\{2\}\_\{H,2,p^\{\\star\}\_\{H\}\}\\mid U\_\{m,0\}\\bigr\]\.\\end\{gathered\}\(443\)SinceωH,Mfhα¯m≤1\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\\leq 1, we have
\(1\+ωH,Mfhα¯m4\)\(1−ωH,Mfh2α¯m\)≤1−ωH,Mfhα¯m4\.\\left\(1\+\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\{4\}\\right\)\\left\(1\-\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{2\}\\bar\{\\alpha\}\_\{m\}\\right\)\\leq 1\-\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\}\{4\}\.\(444\)Also,
$$\begin{aligned}\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\rVert^{2}_{H,2,p^{\star}_{H}}&\leq\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\lVert\mathbb{E}[D_{m,H}\mid U_{m,0}]\rVert^{2}_{H,2,\infty}\\ &\leq 16\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}C^{2}_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}^{4}_{m}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr)^{2},\end{aligned} \tag{445}$$
and
$$\begin{aligned}\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,p^{\star}_{H}}\mid U_{m,0}\bigr]&\leq\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\mathbb{E}\bigl[\lVert D_{m,H}\rVert^{2}_{H,2,\infty}\mid U_{m,0}\bigr]\\ &\leq 36\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}C^{2}_{H,\mathrm{M},\mathrm{w}}\bar{\alpha}^{2}_{m}\bigl(1+\lVert U_{m,0}\rVert_{H,2,\infty}\bigr)^{2}.\end{aligned} \tag{446}$$
Using $\bar{\alpha}_{m}\leq 1$, $\bar{\alpha}^{2}_{m}\leq H\bar{\alpha}^{(2)}_{m}$, and $\bar{\alpha}^{3}_{m}\leq H\bar{\alpha}^{(2)}_{m}$, we obtain
𝔼\[MH,M\(Wm\+1,0\)∣Um,0\]≤\(1−ωH,Mfh4α¯m\)MH,M\(Wm,0\)\+CH,M,pα¯m\(2\)\(1\+‖Um,0‖H,2,∞\)2\.\\mathbb\{E\}\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m\+1,0\}\)\\mid U\_\{m,0\}\]\\leq\\left\(1\-\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{4\}\\bar\{\\alpha\}\_\{m\}\\right\)M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\+C\_\{H,\\mathrm\{M\},\\mathrm\{p\}\}\\bar\{\\alpha\}^\{\(2\)\}\_\{m\}\(1\+\\left\\lVert U\_\{m,0\}\\right\\rVert\_\{H,2,\\infty\}\)^\{2\}\.
\(447\)Since\(1\+x\)2≤2\(1\+x2\)\(1\+x\)^\{2\}\\leq 2\(1\+x^\{2\}\)and
$$\begin{aligned}\lVert U_{m,0}\rVert^{2}_{H,2,\infty}&\leq 2\lVert W_{m,0}\rVert^{2}_{H,2,\infty}+2\lVert U^{\star}_{H,\mathrm{M}}\rVert^{2}_{H,2,\infty}\\ &\leq 4\bigl(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}}\bigr)M_{H,\mathrm{M}}(W_{m,0})+2\lVert U^{\star}_{H,\mathrm{M}}\rVert^{2}_{H,2,\infty},\end{aligned} \tag{448}$$
this yields
𝔼\[MH,M\(Wm\+1,0\)∣Um,0\]≤\(1−cH,M,1fhα¯m\)MH,M\(Wm,0\)\+cH,M,2fhα¯m\(2\)\+cH,M,3fhα¯m\(2\)MH,M\(Wm,0\),\\mathbb\{E\}\\left\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m\+1,0\}\)\\mid U\_\{m,0\}\\right\]\\leq\\bigl\(1\-c\_\{H,\\mathrm\{M\},1\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}\\bigr\)M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\+c\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}\+c\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\),
\(449\)which is the first stated inequality\. Finally, since\(αk\)\(\\alpha\_\{k\}\)is nonincreasing andα0≤α¯H,M\\alpha\_\{0\}\\leq\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\},
α¯m\(2\)≤α0α¯m≤α¯H,Mα¯m≤ωH,Mfh64CH,M,p\(1\+ϑH,M\|𝒮H\|2/pH⋆\)α¯m\.\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}\\leq\\alpha\_\{0\}\\bar\{\\alpha\}\_\{m\}\\leq\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\\bar\{\\alpha\}\_\{m\}\\leq\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{64C\_\{H,\\mathrm\{M\},\\mathrm\{p\}\}\(1\+\\vartheta\_\{H,\\mathrm\{M\}\}\\lvert\\mathcal\{S\}\_\{H\}\\rvert^\{2/p^\{\\star\}\_\{H\}\}\)\}\\bar\{\\alpha\}\_\{m\}\.\(450\)Therefore
cH,M,3fhα¯m\(2\)MH,M\(Wm,0\)≤ωH,Mfh8α¯mMH,M\(Wm,0\),c\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{m\}^\{\(2\)\}M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\\leq\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{8\}\\bar\{\\alpha\}\_\{m\}M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\),\(451\)which gives the second inequality\. ∎
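The final absorption step, which turns the first inequality of Proposition 46 into the second, can be illustrated numerically. The sketch below assumes that $\bar{\alpha}_{m}$ and $\bar{\alpha}_{m}^{(2)}$ denote the sums of the step sizes and of the squared step sizes over episode $m$, which is consistent with (410) and with the substitution used in Corollary 48(a). All numerical values are hypothetical.

```python
def state_factor(theta, n_states, p_star):
    """The recurring factor 1 + theta * |S_H|^(2 / p_star)."""
    return 1.0 + theta * n_states ** (2.0 / p_star)

def step_cap(omega, c_p, theta, n_states, p_star):
    """Second entry of the minimum in (405)."""
    return omega / (64.0 * c_p * state_factor(theta, n_states, p_star))

def drift_factors(omega, c_p, theta, n_states, p_star, alphas):
    """Multipliers of M(W_{m,0}) in the first and second bounds of Proposition 46."""
    c1 = omega / 4.0                                              # c_1 from (406)
    c3 = 8.0 * c_p * state_factor(theta, n_states, p_star)        # c_3 from (406)
    a_bar = sum(alphas)                   # assumed meaning of bar{alpha}_m
    a_bar2 = sum(a * a for a in alphas)   # assumed meaning of bar{alpha}_m^{(2)}
    first = 1.0 - c1 * a_bar + c3 * a_bar2
    second = 1.0 - (omega / 8.0) * a_bar
    return first, second

# Hypothetical constants, for illustration only.
omega, c_p, theta, n_states, p_star, H = 0.3, 50.0, 0.1, 20, 4.0, 5

cap = step_cap(omega, c_p, theta, n_states, p_star)
alphas = [cap / (1.0 + 0.1 * t) for t in range(H)]  # nonincreasing, alpha_0 = cap

first, second = drift_factors(omega, c_p, theta, n_states, p_star, alphas)
print(first, "<=", second)
```

Because the cap scales like $1/C_{H,\mathrm{M},\mathrm{p}}$, admissible step sizes become very small when the constants are large, which is visible in the printed factors being close to one.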
###### Proposition 47\(Fixed\-horizon MTD finite\-iteration bound\)\.
Define
aH,M,1fh:=rH,Mfh,aH,M,2fh:=ωH,Mfh8,aH,M,3fh:=ωH,Mfh8α¯H,M,aH,M,4fh:=2\(1\+ϑH,M\|𝒮H\|2/pH⋆\)cH,M,2fh\.\\begin\{gathered\}a\_\{H,\\mathrm\{M\},1\}^\{\\mathrm\{fh\}\}:=r^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\},\\qquad a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}:=\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{8\},\\qquad a\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}:=\\frac\{\\omega^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\}\}\}\{8\\bar\{\\alpha\}\_\{H,\\mathrm\{M\}\}\},\\\\ a\_\{H,\\mathrm\{M\},4\}^\{\\mathrm\{fh\}\}:=2\(1\+\\vartheta\_\{H,\\mathrm\{M\}\}\\lvert\\mathcal\{S\}\_\{H\}\\rvert^\{2/p^\{\\star\}\_\{H\}\}\)c\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\.\\end\{gathered\}\(452\)IfaH,M,2fh\>0a^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\},2\}\>0,\(αk\)k≥0\(\\alpha\_\{k\}\)\_\{k\\geq 0\}is nonincreasing and
α0≤aH,M,2fhaH,M,3fh,\\alpha\_\{0\}\\leq\\frac\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\}\{a\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\},\(453\)then, for all episodesm≥0m\\geq 0,
$$\begin{aligned}\mathbb{E}\left[\ell_{H,\mathrm{M},\infty}\bigl(\eta_{mH},\eta_{H,\mathrm{M}}^{\star}\bigr)^{2}\right]&\leq a_{H,\mathrm{M},1}^{\mathrm{fh}}\,\ell_{H,\mathrm{M},\infty}\bigl(\eta_{0},\eta_{H,\mathrm{M}}^{\star}\bigr)^{2}\prod_{j=0}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr)\\ &\quad+a_{H,\mathrm{M},4}^{\mathrm{fh}}\sum_{i=0}^{m-1}\bar{\alpha}_{i}^{(2)}\prod_{j=i+1}^{m-1}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}\bar{\alpha}_{j}\bigr).\end{aligned} \tag{454}$$
###### Proof\.
Iterating the second bound in Proposition[46](https://arxiv.org/html/2605.06866#Thmtheorem46)gives
𝔼\[MH,M\(Wm,0\)\]≤MH,M\(W0,0\)∏j=0m−1\(1−aH,M,2fhα¯j\)\+cH,M,2fh∑i=0m−1α¯i\(2\)∏j=i\+1m−1\(1−aH,M,2fhα¯j\)\.\\mathbb\{E\}\\left\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\\right\]\\leq M\_\{H,\\mathrm\{M\}\}\(W\_\{0,0\}\)\\prod\_\{j=0\}^\{m\-1\}\\bigl\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{j\}\\bigr\)\+c^\{\\mathrm\{fh\}\}\_\{H,\\mathrm\{M\},2\}\\sum\_\{i=0\}^\{m\-1\}\\bar\{\\alpha\}^\{\(2\)\}\_\{i\}\\prod\_\{j=i\+1\}^\{m\-1\}\\bigl\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{j\}\\bigr\)\.\(455\)By the Moreau envelope comparison,
𝔼\[‖Wm,0‖H,2,∞2\]≤2\(1\+ϑH,M\|𝒮H\|2/pH⋆\)𝔼\[MH,M\(Wm,0\)\]\\mathbb\{E\}\[\\left\\lVert W\_\{m,0\}\\right\\rVert^\{2\}\_\{H,2,\\infty\}\]\\leq 2\(1\+\\vartheta\_\{H,\\mathrm\{M\}\}\\lvert\\mathcal\{S\}\_\{H\}\\rvert^\{2/p^\{\\star\}\_\{H\}\}\)\\mathbb\{E\}\\bigl\[M\_\{H,\\mathrm\{M\}\}\(W\_\{m,0\}\)\\bigr\]\(456\)and
$$M_{H,\mathrm{M}}(W_{0,0})\leq\frac{1}{2(1+\vartheta_{H,\mathrm{M}})}\lVert W_{0,0}\rVert^{2}_{H,2,\infty}. \tag{457}$$
Combining these two comparisons with the iterated bound, and using the isometric embedding identity
$$\lVert W_{m,0}\rVert_{H,2,\infty}=\ell_{H,\mathrm{M},\infty}(\eta_{mH},\eta^{\star}_{H,\mathrm{M}}), \tag{458}$$
gives the claim, with $a_{H,\mathrm{M},1}^{\mathrm{fh}}=r^{\mathrm{fh}}_{H,\mathrm{M}}$ arising from the product of the two comparison constants and $a_{H,\mathrm{M},4}^{\mathrm{fh}}=2(1+\vartheta_{H,\mathrm{M}}\lvert\mathcal{S}_{H}\rvert^{2/p^{\star}_{H}})c_{H,\mathrm{M},2}^{\mathrm{fh}}$. ∎
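The right-hand side of the bound in Proposition 47 can be evaluated for an arbitrary step-size schedule with a one-pass recursion, since the weighted sum obeys $V_{m+1}=(1-a_{2}\bar{\alpha}_{m})V_{m}+\bar{\alpha}_{m}^{(2)}$ with $V_{0}=0$. The sketch below assumes that $\bar{\alpha}_{m}$ and $\bar{\alpha}_{m}^{(2)}$ are the episode sums of the step sizes and of their squares; the function names and the constants in the usage lines are hypothetical.

```python
def episode_sums(alpha, m, H):
    """Sums of alpha_k and alpha_k^2 over the H transitions of episode m."""
    steps = [alpha(m * H + u) for u in range(H)]
    return sum(steps), sum(a * a for a in steps)

def fh_bound(a1, a2, a4, init_sq, alpha, H, episodes):
    """Evaluate the right-hand side of (454) at m = episodes.

    a1, a2, a4 : the constants of Proposition 47 (supplied by the caller)
    init_sq    : squared initial distance to the projected fixed point
    alpha      : callable k -> alpha_k (assumed nonincreasing and admissible)
    """
    transient = a1 * init_sq
    noise = 0.0
    for j in range(episodes):
        s1, s2 = episode_sums(alpha, j, H)
        factor = 1.0 - a2 * s1
        transient *= factor                  # running product over episodes
        noise = factor * noise + a4 * s2     # one-step update of the weighted sum
    return transient + noise

# Hypothetical constants and schedules, for illustration only.
constant = fh_bound(1.2, 0.5, 3.0, 4.0, lambda k: 0.1, H=2, episodes=50)
harmonic = fh_bound(1.2, 0.5, 3.0, 4.0, lambda k: 3.0 / (k + 30), H=2, episodes=50)
print(constant, harmonic)
```

Evaluating the bound this way avoids forming the nested products explicitly and makes it straightforward to compare schedules before committing to one.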
###### Corollary 48\(Boundary\-iterate step size consequences\)\.
Under the hypotheses of Proposition[47](https://arxiv.org/html/2605.06866#Thmtheorem47):
\(a\)ifαk≡α\\alpha\_\{k\}\\equiv\\alphaand
α≤aH,M,2fhaH,M,3fh,\\alpha\\leq\\frac\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\}\{a\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\},\(459\)then for all episodesm≥0m\\geq 0,
𝔼\[ℓH,M,∞\(ηmH,ηH,M⋆\)2\]≤aH,M,1fhℓH,M,∞\(η0,ηH,M⋆\)2\(1−aH,M,2fhHα\)m\+aH,M,4fhaH,M,2fhα\.\\mathbb\{E\}\\left\[\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\right\]\\leq a\_\{H,\\mathrm\{M\},1\}^\{\\mathrm\{fh\}\}\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{0\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\bigl\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}H\\alpha\\bigr\)^\{m\}\+\\frac\{a\_\{H,\\mathrm\{M\},4\}^\{\\mathrm\{fh\}\}\}\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\}\\alpha\.\(460\)
For the two diminishing-step cases below, where $g$ is the step-size offset, write $\tau_{m}:=mH+g+H-1$, so that $\tau_{0}=g+H-1$.
\(b\)ifαk=α/\(k\+g\)\\alpha\_\{k\}=\\alpha/\(k\+g\),α\>1/aH,M,2fh\\alpha\>1/a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}, and
g≥max\{1,αaH,M,3fhaH,M,2fh\},g\\geq\\max\\left\\\{1,\\frac\{\\alpha a\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\}\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\}\\right\\\},\(461\)then the boundary iterates satisfy
𝔼\[ℓH,M,∞\(ηmH,ηH,M⋆\)2\]≤aH,M,1fhℓH,M,∞\(η0,ηH,M⋆\)2\(τ0τm\)aH,M,2fhα\+aH,M,4fhH2α2aH,M,2fhα−1⋅1τm\.\\mathbb\{E\}\\left\[\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\right\]\\leq a\_\{H,\\mathrm\{M\},1\}^\{\\mathrm\{fh\}\}\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{0\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\left\(\\frac\{\\tau\_\{0\}\}\{\\tau\_\{m\}\}\\right\)^\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\alpha\}\+\\frac\{a\_\{H,\\mathrm\{M\},4\}^\{\\mathrm\{fh\}\}H^\{2\}\\alpha^\{2\}\}\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\alpha\-1\}\\cdot\\frac\{1\}\{\\tau\_\{m\}\}\.\(462\)
\(c\)ifαk=α/\(k\+g\)z\\alpha\_\{k\}=\\alpha/\(k\+g\)^\{z\}withz∈\(0,1\)z\\in\(0,1\)and
g≥max\{1,\(αaH,M,3fhaH,M,2fh\)1/z,\(2zaH,M,2fhα\)1/\(1−z\)\},g\\geq\\max\\left\\\{1,\\left\(\\frac\{\\alpha a\_\{H,\\mathrm\{M\},3\}^\{\\mathrm\{fh\}\}\}\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\}\\right\)^\{1/z\},\\left\(\\frac\{2z\}\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\alpha\}\\right\)^\{1/\(1\-z\)\}\\right\\\},\(463\)then the boundary iterates satisfy
𝔼\[ℓH,M,∞\(ηmH,ηH,M⋆\)2\]≤aH,M,1fhℓH,M,∞\(η0,ηH,M⋆\)2⋅exp\(−aH,M,2fhα1−z\(τm1−z−τ01−z\)\)\+2aH,M,4fhH2αaH,M,2fh⋅1τmz\.\\begin\{gathered\}\\mathbb\{E\}\\left\[\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\right\]\\leq a\_\{H,\\mathrm\{M\},1\}^\{\\mathrm\{fh\}\}\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{0\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\cdot\\exp\\left\(\-\\frac\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\alpha\}\{1\-z\}\\bigl\(\\tau\_\{m\}^\{1\-z\}\-\\tau\_\{0\}^\{1\-z\}\\bigr\)\\right\)\\\\ \+\\frac\{2a\_\{H,\\mathrm\{M\},4\}^\{\\mathrm\{fh\}\}H^\{2\}\\alpha\}\{a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\}\\cdot\\frac\{1\}\{\\tau\_\{m\}^\{z\}\}\.\\end\{gathered\}\(464\)
###### Proof\.
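For part (a), substitute $\bar{\alpha}_{m}=H\alpha$ and $\bar{\alpha}_{m}^{(2)}=H\alpha^{2}$ into Proposition [47](https://arxiv.org/html/2605.06866#Thmtheorem47) and bound the resulting geometric sum by $\sum_{k\geq 0}\bigl(1-a_{H,\mathrm{M},2}^{\mathrm{fh}}H\alpha\bigr)^{k}=1/\bigl(a_{H,\mathrm{M},2}^{\mathrm{fh}}H\alpha\bigr)$.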
For part \(b\), setq=τ0/Hq=\\tau\_\{0\}/Handλ=aH,M,2fhα\\lambda=a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\alpha\. The bounds
$$\frac{H\alpha}{\tau_{m}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{mH+g},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2}} \tag{465}$$
imply
∏j=0m−1\(1−aH,M,2fhα¯j\)≤\(qm\+q\)λ\.\\prod\_\{j=0\}^\{m\-1\}\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{j\}\)\\leq\\left\(\\frac\{q\}\{m\+q\}\\right\)^\{\\lambda\}\.\(466\)The same elementary product\-sum estimate gives
∑i=0m−1α¯i\(2\)∏j=i\+1m−1\(1−aH,M,2fhα¯j\)≤Hα2λ−1⋅1m\+q,\\sum\_\{i=0\}^\{m\-1\}\\bar\{\\alpha\}\_\{i\}^\{\(2\)\}\\prod\_\{j=i\+1\}^\{m\-1\}\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{j\}\)\\leq\\frac\{H\\alpha^\{2\}\}\{\\lambda\-1\}\\cdot\\frac\{1\}\{m\+q\},\(467\)and the displayed bound follows from Proposition[47](https://arxiv.org/html/2605.06866#Thmtheorem47)after substitutingq=τ0/Hq=\\tau\_\{0\}/Handτm=H\(m\+q\)\\tau\_\{m\}=H\(m\+q\)\.
For part \(c\), keep the sameqqand setA=aH,M,2fhαH1−zA=a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\alpha H^\{1\-z\}\. The bounds
$$\frac{H\alpha}{\tau_{m}^{z}}\leq\bar{\alpha}_{m}\leq\frac{H\alpha}{(mH+g)^{z}},\qquad\bar{\alpha}_{m}^{(2)}\leq\frac{H\alpha^{2}}{(mH+g)^{2z}} \tag{468}$$
imply
∏j=0m−1\(1−aH,M,2fhα¯j\)≤exp\(−A1−z\(\(m\+q\)1−z−q1−z\)\)\.\\prod\_\{j=0\}^\{m\-1\}\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{j\}\)\\leq\\exp\\left\(\-\\frac\{A\}\{1\-z\}\\bigl\(\(m\+q\)^\{1\-z\}\-q^\{1\-z\}\\bigr\)\\right\)\.\(469\)The lower bound onggimpliesq≥\(2z/A\)1/\(1−z\)q\\geq\(2z/A\)^\{1/\(1\-z\)\}\. The elementary polynomial product\-sum estimate gives
∑i=0m−1α¯i\(2\)∏j=i\+1m−1\(1−aH,M,2fhα¯j\)≤2Hα2A⋅1\(m\+q\)z\.\\sum\_\{i=0\}^\{m\-1\}\\bar\{\\alpha\}\_\{i\}^\{\(2\)\}\\prod\_\{j=i\+1\}^\{m\-1\}\(1\-a\_\{H,\\mathrm\{M\},2\}^\{\\mathrm\{fh\}\}\\bar\{\\alpha\}\_\{j\}\)\\leq\\frac\{2H\\alpha^\{2\}\}\{A\}\\cdot\\frac\{1\}\{\(m\+q\)^\{z\}\}\.\(470\)Substituting the definitions ofAAandqqinto Proposition[47](https://arxiv.org/html/2605.06866#Thmtheorem47), usingτm=H\(m\+q\)\\tau\_\{m\}=H\(m\+q\), and usingH2z≤H2H^\{2z\}\\leq H^\{2\}, gives the displayed bound\. ∎
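For the constant-step case (a), the passage from Proposition 47 to the closed form (460) is a geometric-sum bound, which the following sketch checks numerically. The constants `a1`, `a2`, `a4`, the initial squared distance, and the step size are hypothetical values chosen so that $a_{2}H\alpha\in(0,1]$.

```python
def constant_step_exact(a1, a2, a4, init_sq, alpha, H, m):
    """Right-hand side of (454) with alpha_k = alpha, summed term by term."""
    rho = a2 * H * alpha                      # per-episode contraction a2 * bar{alpha}_m
    geom = sum((1.0 - rho) ** k for k in range(m))
    return a1 * init_sq * (1.0 - rho) ** m + a4 * H * alpha ** 2 * geom

def constant_step_closed(a1, a2, a4, init_sq, alpha, H, m):
    """Closed-form bound (460): geometric transient plus the floor a4 * alpha / a2."""
    return a1 * init_sq * (1.0 - a2 * H * alpha) ** m + (a4 / a2) * alpha

# Hypothetical constants, for illustration only.
a1, a2, a4, init_sq, alpha, H = 1.5, 0.05, 2.0, 9.0, 0.1, 4

gaps = [
    constant_step_closed(a1, a2, a4, init_sq, alpha, H, m)
    - constant_step_exact(a1, a2, a4, init_sq, alpha, H, m)
    for m in range(0, 201, 20)
]
noise_floor = (a4 / a2) * alpha
print(min(gaps), noise_floor)
```

The gap between the two expressions is the tail of the geometric series, so it shrinks as $m$ grows while the floor $a_{4}\alpha/a_{2}$ remains; halving $\alpha$ halves the floor but also slows the transient.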
###### Proof of Theorem[8](https://arxiv.org/html/2605.06866#Thmtheorem8)\.
Combine Proposition[47](https://arxiv.org/html/2605.06866#Thmtheorem47)with Corollary[48](https://arxiv.org/html/2605.06866#Thmtheorem48)\. Sincek=mHk=mHandHHis fixed, the rates in the episode indexmmare equivalent to the stated rates in the numberkkof transitions\. ∎
###### Proof of Corollary[9](https://arxiv.org/html/2605.06866#Thmtheorem9)\.
By Corollary[48](https://arxiv.org/html/2605.06866#Thmtheorem48)\(b\),
𝔼\[ℓH,M,∞\(ηmH,ηH,M⋆\)2\]=O\(1m\+1\)\.\\mathbb\{E\}\\left\[\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)^\{2\}\\right\]=O\\left\(\\frac\{1\}\{m\+1\}\\right\)\.\(471\)By Jensen’s inequality,
𝔼\[ℓH,M,∞\(ηmH,ηH,M⋆\)\]=O\(1m\+1\)\.\\mathbb\{E\}\\left\[\\ell\_\{H,\\mathrm\{M\},\\infty\}\(\\eta\_\{mH\},\\eta\_\{H,\\mathrm\{M\}\}^\{\\star\}\)\\right\]=O\\left\(\\frac\{1\}\{\\sqrt\{m\+1\}\}\\right\)\.\(472\)Thusm=O\(ε−2\)m=O\(\\varepsilon^\{\-2\}\)episodes suffice\. Sincek=mHk=mHandHHis fixed, this is equivalentlyk=O\(ε−2\)k=O\(\\varepsilon^\{\-2\}\)transitions\. ∎
## Appendix F Representation error for projected categorical policy evaluation
In all projected categorical settings considered in this paper, the algorithm converges to the fixed point of a projected Bellman operator rather than directly to the exact return\-distribution fixed point\. This appendix records the corresponding deterministic representation error and combines it with the finite\-iteration bounds of the main text\.
Let $(\mathcal{Y},\ell)$ be a complete metric space. Let $T:\mathcal{Y}\to\mathcal{Y}$ be a contraction with modulus $\beta\in(0,1)$, and let $\Pi:\mathcal{Y}\to\mathcal{Y}$ be nonexpansive in $\ell$. Let $\eta^{\pi}$ be the fixed point of $T$ and $\eta^{\star}$ be the fixed point of $\Pi T$, which exists and is unique because $\Pi T$ is again a $\beta$-contraction. Let
$$\varepsilon^{\mathrm{repr}}:=\ell(\Pi\eta^{\pi},\eta^{\pi}). \tag{473}$$
###### Proposition 49\.
We have
$$\ell(\eta^{\star},\eta^{\pi})\leq\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}. \tag{474}$$
###### Proof\.
Using $\eta^{\star}=\Pi T\eta^{\star}$ and $\eta^{\pi}=T\eta^{\pi}$,
$$\begin{aligned}
\ell(\eta^{\star},\eta^{\pi})&=\ell(\Pi T\eta^{\star},\eta^{\pi})\\
&\leq\ell(\Pi T\eta^{\star},\Pi T\eta^{\pi})+\ell(\Pi T\eta^{\pi},\eta^{\pi})\\
&\leq\ell(T\eta^{\star},T\eta^{\pi})+\varepsilon^{\mathrm{repr}}\\
&\leq\beta\,\ell(\eta^{\star},\eta^{\pi})+\varepsilon^{\mathrm{repr}}.
\end{aligned} \tag{475}$$
The third line uses the nonexpansiveness of $\Pi$ together with $\Pi T\eta^{\pi}=\Pi\eta^{\pi}$, and the fourth uses the contraction property of $T$. Since $\beta<1$, rearranging gives the claim. ∎
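The bound of Proposition 49 can be checked numerically on a toy instance. The sketch below is an illustration under stated assumptions, not the categorical setting of the paper: $\mathcal{Y}=\mathbb{R}^{3}$ with the supremum metric, `T` is an affine map built from a row-stochastic matrix (hence a $\beta$-contraction in the supremum metric), and `Pi` is coordinatewise clipping to a box (nonexpansive in that metric). All names and numbers are hypothetical.

```python
def sup_dist(x, y):
    """Supremum-metric distance between two vectors."""
    return max(abs(a - b) for a, b in zip(x, y))


BETA = 0.8
M = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.0, 0.2, 0.8]]  # row-stochastic
B = [1.0, -2.0, 0.5]


def T(x):
    """Affine map BETA * M x + B: a BETA-contraction in the supremum metric."""
    return [BETA * sum(m * v for m, v in zip(row, x)) + b for row, b in zip(M, B)]


def Pi(x, lo=-3.0, hi=3.0):
    """Coordinatewise clipping to [lo, hi]: nonexpansive in the supremum metric."""
    return [min(max(v, lo), hi) for v in x]


def fixed_point(op, x0, iters=500):
    """Approximate the fixed point of a contraction by repeated application."""
    x = list(x0)
    for _ in range(iters):
        x = op(x)
    return x


eta_pi = fixed_point(T, [0.0, 0.0, 0.0])                       # fixed point of T
eta_star = fixed_point(lambda x: Pi(T(x)), [0.0, 0.0, 0.0])    # fixed point of Pi T

eps_repr = sup_dist(Pi(eta_pi), eta_pi)
gap = sup_dist(eta_star, eta_pi)
bound = eps_repr / (1.0 - BETA)
print(f"eps_repr={eps_repr:.4f}  gap={gap:.4f}  bound={bound:.4f}")
```

In this instance the fixed point of `T` leaves the box in one coordinate, so the representation error is strictly positive and the bound is not vacuous.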
###### Corollary 50\.
For every random iterate $\eta_{k}$,
$$\ell(\eta_{k},\eta^{\pi})^{2}\leq 2\ell(\eta_{k},\eta^{\star})^{2}+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}. \tag{476}$$
Consequently,
$$\mathbb{E}[\ell(\eta_{k},\eta^{\pi})^{2}]\leq 2\mathbb{E}[\ell(\eta_{k},\eta^{\star})^{2}]+2\left(\frac{\varepsilon^{\mathrm{repr}}}{1-\beta}\right)^{2}. \tag{477}$$
###### Proof\.
By the triangle inequality and $(a+b)^{2}\leq 2a^{2}+2b^{2}$,
$$\ell(\eta_{k},\eta^{\pi})^{2}\leq 2\ell(\eta_{k},\eta^{\star})^{2}+2\ell(\eta^{\star},\eta^{\pi})^{2}. \tag{478}$$
Now apply Proposition [49](https://arxiv.org/html/2605.06866#Thmtheorem49). ∎
#### Instantiation in the present paper\.
In the discounted setting, $\ell$ is either $\ell_{\mathrm{C},\infty}$ or $\ell_{\mathrm{M},\infty}$, and $T$ is the corresponding discounted Bellman operator. In the fixed-horizon setting, $\ell$ is either $\ell_{H,\mathrm{C},\infty}$ or $\ell_{H,\mathrm{M},\infty}$, and $T$ is the corresponding fixed-horizon Bellman operator. In every case, the algorithmic term $2\mathbb{E}[\ell(\eta_{k},\eta^{\star})^{2}]$ in (477) is controlled by the finite-iteration theorems, while the second term depends only on the categorical support family through $\varepsilon^{\mathrm{repr}}$.
## Appendix G Further discussion of the results
#### Bounded and affine perturbation regimes\.
After the statewise isometric embeddings, both CTD and MTD fit the same asynchronous contractive stochastic\-approximation template\. The main difference is the geometry of the sampled perturbation\. In the scalar categorical case, bounded supports together with the Cramér geometry yield uniform samplewise bounds in both the discounted and fixed\-horizon undiscounted settings\. In the discounted setting,
$$\lVert\widehat{T}_{\mathrm{C}}(U;s,(r,s^{\prime}))-(\mathcal{O}_{\mathrm{C}}U)(s)\rVert_{2}\leq 2B_{\mathrm{C}}. \tag{479}$$
In the fixed-horizon setting,
$$\lVert F_{H,\mathrm{C}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2B_{H,\mathrm{C}}. \tag{480}$$
Thus CTD falls into a bounded-noise regime throughout the paper. In particular, the i.i.d. conditional second moment is uniformly bounded in the discounted analysis, and the fixed-horizon episodewise argument inherits only support-radius constants.
By contrast, in the multivariate signed-categorical case one obtains only affine perturbation bounds. In the discounted setting,
$$\lVert\widehat{T}_{\mathrm{M}}(U;s,(r,s^{\prime}))-(\mathcal{O}_{\mathrm{M}}U)(s)\rVert_{2}\leq 2\beta_{\mathrm{M}}\lVert U\rVert_{2,\infty}+B_{\mathrm{M}}, \tag{481}$$
together with a local second-moment estimate
$$\mathbb{E}\left[\lVert\widehat{T}_{\mathrm{M}}(U_{k};S_{k},(R_{k},S_{k+1}))-(\mathcal{O}_{\mathrm{M}}U_{k})(S_{k})\rVert_{2}^{2}\,\mid\,U_{k},S_{k}\right]\leq C_{1}+C_{2}\lVert U_{k}\rVert_{2,\infty}^{2}. \tag{482}$$
In the fixed-horizon setting, the analogous residual bound is
$$\lVert F_{H,\mathrm{M}}(U;s,(r,s^{\prime}))\rVert_{H,2,\infty}\leq 2\lVert U\rVert_{H,2,\infty}+B^{\mathrm{res}}_{H,\mathrm{M}}, \tag{483}$$
again with an affine conditional second-moment estimate. This is why the MTD theorem constants contain additional growth, bias, and absorption quantities such as $C_{1},C_{2},B_{\mathrm{M}},\Upsilon_{\mathrm{M}}$, whereas the CTD bounds depend only on support-radius constants such as $B_{\mathrm{C}}$.
#### Smoothing constants and the number of asynchronous blocks\.
The finite-iteration analysis is driven by a block-supremum contraction norm and a smoothed block-$\ell_{p}$ potential. The only explicit dimension loss in this smoothing step comes from the comparison between $\lVert\cdot\rVert_{2,\infty}$ and $\lVert\cdot\rVert_{2,p}$, and therefore depends only on the number of asynchronously updated blocks. In the discounted appendices, those blocks are indexed only by the state variable and the relevant count is $\lvert\mathcal{S}\rvert$. In the fixed-horizon appendices, the stacked process is first flattened over horizon and state, so the relevant count becomes $\lvert\mathcal{S}_{H}\rvert=H\lvert\mathcal{S}\rvert$. The inner representation dimension $d$, and in the multivariate case the reward dimension $q$, still affect problem-dependent constants, but they do not enter through the Moreau-envelope smoothing comparison itself.
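The dependence on the block count alone can be illustrated with the standard comparison $\lVert U\rVert_{2,\infty}\leq\lVert U\rVert_{2,p}\leq n^{1/p}\lVert U\rVert_{2,\infty}$ for a vector $U$ with $n$ blocks. The sketch below checks this on random data. It assumes that $\lVert\cdot\rVert_{2,p}$ denotes the $\ell_{p}$ norm of the blockwise Euclidean norms, and it does not reproduce the paper's Moreau-envelope constants. The exponent `p` and the sizes are hypothetical.

```python
import math
import random


def block_norms(U):
    """Euclidean norm of each block of U (a list of equal-length lists)."""
    return [math.sqrt(sum(v * v for v in block)) for block in U]


def norm_2_inf(U):
    """Largest blockwise Euclidean norm."""
    return max(block_norms(U))


def norm_2_p(U, p):
    """l_p norm of the blockwise Euclidean norms."""
    return sum(b ** p for b in block_norms(U)) ** (1.0 / p)


rng = random.Random(0)
p = 8.0  # hypothetical smoothing exponent
for n_blocks, d in [(5, 3), (5, 300), (50, 3)]:
    U = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_blocks)]
    ratio = norm_2_p(U, p) / norm_2_inf(U)
    worst = n_blocks ** (1.0 / p)
    print(f"blocks={n_blocks:3d}  d={d:3d}  ratio={ratio:.4f}  worst case={worst:.4f}")
```

The worst-case factor $n^{1/p}$ changes with the number of blocks but not with the inner dimension `d`, which matches the qualitative statement above.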
#### Discounted and fixed\-horizon contraction mechanisms\.
The fixed-horizon undiscounted theorems should not be read as the discounted results with $\gamma=1$. In the discounted setting, the contraction comes directly from the Bellman discount factor and each update modifies only one sampled state block. In the fixed-horizon setting there is no discount-driven contraction. Instead, the contraction is recovered by weighting the horizon layers and exploiting the fact that horizon $h$ bootstraps from horizon $h-1$. The corresponding online recursion is also structurally different: one sampled transition updates the entire horizon stack at the sampled state. Thus the fixed-horizon setting is different both in its contraction mechanism and in its stochastic approximation structure.
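The layer-weighting mechanism can be seen in a scalar analogue. The sketch below is a toy illustration under stated assumptions, not the paper's distributional operator: it uses an undiscounted expected-value backup $(TU)_{h}=r+PU_{h-1}$ with $U_{0}=0$, and geometric weights $\rho^{h}$ with a hypothetical $\rho\in(0,1)$. The weights actually used in the paper are defined in its fixed-horizon appendices and may differ.

```python
import random


def backup(U, P, r):
    """Undiscounted fixed-horizon backup on a stack U (list index i holds layer h = i + 1).

    Layer 1 bootstraps from the all-zero layer 0; layer h bootstraps from input layer h - 1.
    """
    n = len(r)
    prev = [0.0] * n
    out = []
    for layer in U:
        out.append([r[s] + sum(P[s][t] * prev[t] for t in range(n)) for s in range(n)])
        prev = layer  # the next layer bootstraps from the current input layer
    return out


def weighted_dist(U, V, rho):
    """Max over layers h of rho**h times the sup-norm distance between layer h of U and V."""
    return max(
        rho ** (i + 1) * max(abs(u - v) for u, v in zip(lu, lv))
        for i, (lu, lv) in enumerate(zip(U, V))
    )


rng = random.Random(0)
n_states, H, rho = 4, 5, 0.7  # hypothetical sizes and layer weight

# Random row-stochastic transition matrix and rewards.
P = []
for _ in range(n_states):
    row = [rng.random() for _ in range(n_states)]
    total = sum(row)
    P.append([p / total for p in row])
r = [rng.uniform(-1.0, 1.0) for _ in range(n_states)]

U = [[rng.uniform(-5.0, 5.0) for _ in range(n_states)] for _ in range(H)]
V = [[rng.uniform(-5.0, 5.0) for _ in range(n_states)] for _ in range(H)]

before = weighted_dist(U, V, rho)
after = weighted_dist(backup(U, P, r), backup(V, P, r), rho)
print(f"weighted distance before={before:.4f}  after={after:.4f}  ratio={after / before:.4f}")
```

With $\rho=1$ the same backup is only nonexpansive. The factor $\rho$ appears because layer $h$ of the output is compared against layer $h-1$ of the input, whose weight is larger by $1/\rho$.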
#### Interpretation of the step size regimes\.
Across all theorem statements, the product-sum bounds imply the same qualitative picture. Constant step sizes yield geometric convergence to a controllable $O(\alpha)$ neighborhood of the projected fixed point. Linearly diminishing step sizes yield the strongest asymptotic decay, namely $O(1/k)$ in the squared error and hence $O(\varepsilon^{-2})$ sample complexity for mean error at most $\varepsilon$. Polynomially diminishing step sizes trade a weaker asymptotic decay for milder admissibility requirements. In the discounted Markovian case, these conclusions hold up to the expected logarithmic mixing-time overhead inherited from the Markovian finite-iteration theorem. This is the standard price of replacing i.i.d. sampling by a trajectory-generated sample path.
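The qualitative difference between constant and linearly diminishing step sizes already shows up in a one-dimensional contractive recursion. The following sketch is a toy stand-in, not CTD or MTD: it runs $x_{k+1}=x_{k}+\alpha_{k}(\beta x_{k}+b+\xi_{k}-x_{k})$ with bounded i.i.d. noise $\xi_{k}$ and estimates the mean squared distance to the fixed point $b/(1-\beta)$ by Monte Carlo. The schedule $\alpha_{k}=4/(k+10)$ and all other constants are hypothetical choices made for illustration.

```python
import random


def mean_sq_error(step, n_steps, n_runs=300, beta=0.5, b=1.0, x0=5.0, seed=0):
    """Monte Carlo estimate of E[(x_n - x_star)^2] for a scalar contractive recursion.

    step(k) returns the step size used at iteration k.
    """
    rng = random.Random(seed)
    x_star = b / (1.0 - beta)  # fixed point of x -> beta * x + b
    total = 0.0
    for _ in range(n_runs):
        x = x0
        for k in range(n_steps):
            noisy_target = beta * x + b + rng.uniform(-1.0, 1.0)
            x += step(k) * (noisy_target - x)
        total += (x - x_star) ** 2
    return total / n_runs


def constant(alpha):
    return lambda k: alpha


def linear(k):
    return 4.0 / (k + 10)


# Constant steps: the error settles in a neighborhood whose size shrinks with alpha.
for alpha in (0.2, 0.02):
    print(f"constant alpha={alpha}: mse={mean_sq_error(constant(alpha), 2000):.5f}")

# Linearly diminishing steps: the error keeps decreasing as the horizon grows.
for n in (200, 2000):
    print(f"linear schedule, n={n}: mse={mean_sq_error(linear, n):.5f}")
```

The printed values depend on the seed. Only the ordering (smaller plateau for smaller constant steps, continued decay for the diminishing schedule) is the point of the example.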
#### Algorithmic and representation error\.
The finite\-iteration theorems control distance to the fixed point of a projected Bellman operator\. The representation\-error decomposition then separates total error into an algorithmic term and a deterministic approximation term\. This separation is conceptually useful: the stochastic\-approximation analysis explains how quickly the recursion approaches the projected fixed point for a fixed support family, whereas the deterministic term isolates the bias introduced by the categorical approximation itself\.