A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak--Ruppert Averaging
Summary
This paper provides high-probability guarantees for an unprojected linear TD(0) algorithm with Polyak–Ruppert averaging under Markovian sampling, using a single stepsize schedule that achieves both robust curvature-free and fast curvature-dependent convergence rates.
# A Single Stepsize Suffices for Unprojected Linear TD(0): Simultaneous Robust and Fast Rates via Polyak–Ruppert Averaging
Source: [https://arxiv.org/html/2606.24981](https://arxiv.org/html/2606.24981)
Wei-Cheng Lee, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia, weicheng.frank.lee@gmail.com

Francesco Orabona, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia, francesco@orabona.com
###### Abstract
We study linear TD(0) under Markovian sampling, where data are generated along a single trajectory. We provide high-probability guarantees for a plain *unprojected* TD(0) algorithm with Polyak–Ruppert (PR) averaging, using a *single* stepsize schedule $\eta_t \propto 1/(\tau_{\mathrm{mix}}\log(t)\,\sqrt{t})$ that depends on mixing time but requires *no prior knowledge of the curvature parameter $\omega$*. Our first result shows that such a choice of the stepsize guarantees that the TD(0) iterates are automatically and uniformly bounded *with high probability*, without projections and without any stability argument based on $\omega$. Building on this result, we establish a simultaneous high-probability convergence guarantee for the PR average: the same stepsize yields both a robust curvature-free $\widetilde{\mathcal{O}}(\tau_{\mathrm{mix}}/\sqrt{T})$ rate and a fast curvature-dependent $\widetilde{\mathcal{O}}(\tau_{\mathrm{mix}}^{2}/(\omega T))$ rate, with the bound taking the minimum of the two. The core technical ingredient is a Poisson-equation toolkit for geometrically mixing Markov chains, which decomposes Markov noise into a martingale term plus a controlled remainder and enables a new self-bounding inductive argument for pathwise stability.
Keywords: Reinforcement Learning, Temporal Difference Learning, Finite-Time Analysis, Markovian Noise, Stochastic Approximation
## 1 Introduction
Temporal-difference (TD) learning is one of the central algorithmic primitives in reinforcement learning, used for policy evaluation and as a building block for control methods such as actor–critic. When the state space is large, TD is typically combined with function approximation; in the linear case, the resulting TD(0) recursion admits a clean stochastic-approximation interpretation and converges asymptotically under standard conditions (Sutton, [1988](https://arxiv.org/html/2606.24981#bib.bib11); Tsitsiklis and Van Roy, [1996](https://arxiv.org/html/2606.24981#bib.bib14); Kushner, [2010](https://arxiv.org/html/2606.24981#bib.bib20)).
Despite this classical theory, obtaining *finite-time*, *high-probability* guarantees for TD(0) remains challenging, especially in the practically relevant *Markovian sampling* regime where data are generated along a single trajectory.
A major obstacle is that TD(0) is not a standard stochastic gradient method: even in the linear setting, the update is generally biased and non-symmetric, and the noise is temporally correlated. Moreover, the magnitude of the TD(0) updates depends on the magnitude of the iterates. Therefore, existing non-asymptotic analyses rely on additional structure to control the iterates. One common approach is to enforce boundedness via *projections* onto a known ball (e.g., Bhandari et al., [2018](https://arxiv.org/html/2606.24981#bib.bib3)), which simplifies concentration arguments but changes the algorithm and requires *a priori* knowledge of a suitable radius (often tied implicitly to the problem curvature). A second approach is to exploit a *contractive/curvature* structure of the mean dynamics to argue stability without projections; however, this typically yields stepsize conditions or stability bounds that depend on an unknown curvature parameter, and the resulting convergence becomes arbitrarily slow when the curvature is arbitrarily small.
Ideally, one would like to use a *single* stepsize rule that *simultaneously* allows us to obtain rates that are (i) *robust* (curvature-free) and (ii) *fast* when curvature is favorable. Such a stepsize would adapt to the better of the robust and fast high-probability upper bounds, without prior knowledge of the curvature.
#### Robust and fast rates\.
To explain the distinction between robust and fast rates, we first briefly describe the potential function we study (Ollivier, [2018](https://arxiv.org/html/2606.24981#bib.bib13); Liu and Olshevsky, [2021](https://arxiv.org/html/2606.24981#bib.bib4)) (formally introduced later in ([2](https://arxiv.org/html/2606.24981#S3.E2))), which naturally captures the TD fixed point error under Markovian sampling. This potential always has positive curvature $\omega>0$, so one can hope for a “fast” rate of order $\widetilde{\mathcal{O}}(\frac{1}{\omega T})$ (up to mixing and logarithmic factors), where the notation $\widetilde{\mathcal{O}}$ suppresses logarithmic factors. On the other hand, even without curvature assumptions, one can always achieve a “robust” rate of order $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$, again up to mixing and logs. In the finite-time regime, either rate can be the better one depending on $\omega$: ignoring mixing and logarithmic factors, the fast rate is smaller only when $\omega \gtrsim 1/\sqrt{T}$. Hence, both rates should be pursued.
Now, prior work typically achieves these two regimes with *different* stepsize choices and additional algorithmic modifications (e.g., Bhandari et al., [2018](https://arxiv.org/html/2606.24981#bib.bib3)). Moreover, the projection-free high-probability analyses under Markovian noise that we are aware of rely on stability arguments whose constants deteriorate with $\omega$ (e.g., Samsonov et al., [2024](https://arxiv.org/html/2606.24981#bib.bib30); Durmus et al., [2025](https://arxiv.org/html/2606.24981#bib.bib56)).
As far as we know, *obtaining both rates simultaneously with high probability with a single stepsize choice that is independent of $\omega$ and without projections has not previously been established for TD(0)*. In particular, the key technical difficulty in achieving such a result is to control the random iterates, $\bm{\theta}_t$, *pathwise*. Without projections, the update magnitude scales with $\|\bm{\theta}_t\|$, and a naive concentration analysis becomes circular: to bound the error one needs bounded updates, but bounded updates require bounded iterates. Moreover, the need to obtain robust rates (i.e., independent of $\omega$) rules out any strategy that shows that the iterates are bounded through a contractive argument based on $\omega$.
#### Our approach: a self\-bounding argument via Poisson equations\.
In this paper, we show that a simple stepsize, proportional to $\frac{1}{\tau_{\mathrm{mix}}\log t\,\sqrt{t}}$ and independent of $\omega$, guarantees that TD(0) achieves both fast and robust rates with high probability.
Our main technical novelty to circumvent the above issues is a new *self-bounding* argument that closes this loop *with high probability*, *without* appealing to the curvature $\omega$ and *without* artificial projections. Concretely, we develop a Poisson-equation toolkit for geometrically mixing Markov chains that decomposes additive Markov noise into a martingale term plus a controlled remainder. This decomposition allows us to control the Markovian bias terms that appear in the basic potential expansion of TD, and to run an induction in which boundedness at times $\leq t-1$ implies boundedness at time $t$. The resulting uniform bound on $\sup_t \|\bm{\theta}_t\|$ holds with high probability and is independent of the curvature $\omega$.
Once boundedness is established, we can analyze the convergence guarantee. The bounded-iterates event provides the missing ingredient needed to control the quadratic variation of the martingale terms (again through the Poisson decomposition), yielding high-probability bounds for the averaged iterate $\bar{\bm{\theta}}_T$. Importantly, the *same* stepsize delivers:
- a curvature-free robust rate of $\widetilde{\mathcal{O}}(\frac{1}{\sqrt{T}})$, through a classic averaged-stochastic-gradient-descent-like analysis, as one would expect from the choice of the stepsize;
- a curvature-dependent fast rate of $\widetilde{\mathcal{O}}(1/(\omega T))$, without requiring $\omega$ to set the stepsize, thanks to a Polyak–Ruppert averaging analysis.
## 2 Related Work
In this section, we briefly survey the main results on the convergence guarantees for TD learning with linear function approximation, as well as the main tools we use in our proofs\. Later, in Section[5](https://arxiv.org/html/2606.24981#S5), we will give a detailed comparison in terms of convergence rates\.
Early convergence results for TD learning with linear function approximation trace back to Tsitsiklis and Van Roy ([1996](https://arxiv.org/html/2606.24981#bib.bib14)), who interpreted TD updates through the lens of stochastic approximation (Kushner, [2010](https://arxiv.org/html/2606.24981#bib.bib20)). While foundational, that theory is largely asymptotic and does not yield explicit non-asymptotic rates. Finite-time guarantees were later developed in a sequence of works (Korda and La, [2015](https://arxiv.org/html/2606.24981#bib.bib23); Lakshminarayanan and Szepesvári, [2018](https://arxiv.org/html/2606.24981#bib.bib22); Dalal et al., [2018](https://arxiv.org/html/2606.24981#bib.bib21)), but these analyses typically assume i.i.d. samples drawn from the stationary distribution. This assumption sidesteps the temporal dependence present in most reinforcement-learning pipelines, where data are generated sequentially along a single Markov-chain trajectory. Accounting for such Markovian correlations substantially complicates the analysis, even for TD(0).
The seminal work of Bhandari et al. ([2018](https://arxiv.org/html/2606.24981#bib.bib3)) provided the first finite-time treatment under Markovian sampling. They obtained both fast and robust rates, using two different stepsizes. However, their arguments (and even subsequent refinements for the robust rates such as Liu and Olshevsky ([2021](https://arxiv.org/html/2606.24981#bib.bib4))) rely on an explicit projection step to keep the iterates controlled.
The need for a projection is removed in subsequent work in the fast regime by exploiting the contractive nature of the update due to the presence of the curvature. In particular, using a control-theoretic framework, Srikant and Ying ([2019](https://arxiv.org/html/2606.24981#bib.bib16)) derived finite-time error bounds for linear TD with Markovian data without projections by establishing a contraction-like behavior via Lyapunov theory. However, their stepsize depends on the curvature of the potential function, which is generally unknown in practice. Building on this direction, Patil et al. ([2023](https://arxiv.org/html/2606.24981#bib.bib32)) eliminated the need to know the curvature parameter when choosing stepsizes, at the cost of introducing a data-dropping modification of TD. Related progress has also focused on simplifying proofs and relaxing algorithmic modifications: Mitra ([2025](https://arxiv.org/html/2606.24981#bib.bib5)) proposed a streamlined inductive two-step analysis, Li et al. ([2026](https://arxiv.org/html/2606.24981#bib.bib55)) used exponentially decaying stepsizes to avoid data dropping, and Sun et al. ([2022](https://arxiv.org/html/2606.24981#bib.bib48)) extended fast-rate analyses to neural networks in the NTK regime. More recently, Samsonov et al. ([2024](https://arxiv.org/html/2606.24981#bib.bib30)) strengthened the analysis of Patil et al. ([2023](https://arxiv.org/html/2606.24981#bib.bib32)) and obtained high-probability bounds without using projections. Closest to the high-probability projection-free line, Chandak and Borkar ([2025](https://arxiv.org/html/2606.24981#bib.bib61)) obtained all-time high-probability control for unprojected TD(0) without data dropping. Their analysis, however, remains contractive in nature, and the resulting constants and burn-in depend on the curvature.
The only result that removes the need for projections with stepsizes independent of the curvature of the potential function, and without using curvature in the analysis, is Lee and Orabona ([2025](https://arxiv.org/html/2606.24981#bib.bib37)), which proves that the iterates of TD(0) are bounded in expectation. Our approach is inspired by their method, but differs substantially because we need high-probability bounds.
Several closely related works study cheap linear-update methods that are informative but not direct head-to-head baselines for plain TD(0). Raj et al. ([2022](https://arxiv.org/html/2606.24981#bib.bib64)) analyze linear composition optimization and gradient TD methods for off-policy MSPBE minimization, using primal–dual updates rather than the semi-gradient TD(0) recursion studied here, and obtain adaptive finite-time guarantees but require bounded iterates or prior knowledge of the curvature. In the tabular setting, Li et al. ([2020](https://arxiv.org/html/2606.24981#bib.bib63)) show that asynchronous Q-learning, and hence tabular TD, can have a leading statistical term matching the i.i.d. sampling complexity, with mixing entering only as an additive transient cost; they also provide data-driven stepsizes that avoid prior knowledge of $\tau_{\mathrm{mix}}$. These results address the same broad practical question of obtaining fast, stable, cheap linear updates under limited tuning information, but they do not directly apply to unprojected linear TD(0).
As far as we know, there is no prior work that analyzes the possibility of getting both fast and robust rates with a stepsize schedule independent of the curvature and without using projections for linear TD learning\.
The use of Polyak–Ruppert averaging in the analysis of TD learning can be traced back at least to Konda ([2002](https://arxiv.org/html/2606.24981#bib.bib62)). To the best of our knowledge, Korda and La ([2015](https://arxiv.org/html/2606.24981#bib.bib23)) were the first to leverage this technique to establish finite-time guarantees. In particular, Polyak–Ruppert (tail) averaging has been employed to enable stepsize choices that avoid explicit dependence on the curvature parameter $\omega$ (see, e.g., Lakshminarayanan and Szepesvári, [2018](https://arxiv.org/html/2606.24981#bib.bib22); Patil et al., [2023](https://arxiv.org/html/2606.24981#bib.bib32); Samsonov et al., [2024](https://arxiv.org/html/2606.24981#bib.bib30)).
Finally, we make use of the Poisson equation to obtain high-probability bounds. This is a standard tool, used widely in the literature on TD learning and stochastic approximation with Markovian noise (see, e.g., Chandak and Borkar, [2025](https://arxiv.org/html/2606.24981#bib.bib61); Blaser and Zhang, [2024](https://arxiv.org/html/2606.24981#bib.bib60)).
## 3 Setting and Assumptions
We work directly with the finite discounted Markov reward process (MRP) induced by a fixed policy $\mu$. Thus the action variables have already been averaged under $\mu$. Let $\mathcal{S}\coloneqq\{1,\dots,n\}$ be a finite state space, let $P^{\mu}:\mathcal{S}\times\mathcal{S}\to[0,1]$ be the induced transition kernel, and let $r:\mathcal{S}\times\mathcal{S}\to\mathbb{R}$ be the one-step reward function. Given an initial state $s_0$, the state process satisfies $\mathbb{P}(s_{t+1}=j\mid s_t=i)=P^{\mu}(i,j)$ for $i,j\in\mathcal{S}$. We write $\bm{P}^{\mu}\in\mathbb{R}^{n\times n}$ for the corresponding transition matrix, with entries $\bm{P}^{\mu}(i,j)=P^{\mu}(i,j)$, and fix a discount factor $0<\gamma<1$.
We are interested in the *policy evaluation* problem in reinforcement learning (Sutton and Barto, [1998](https://arxiv.org/html/2606.24981#bib.bib34); Mannor et al., [2026](https://arxiv.org/html/2606.24981#bib.bib10)). For the MRP induced by $\mu$, the value function is

$$V^{\mu}(s)\coloneqq\mathbb{E}\!\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,s_{t+1})\;\middle|\;s_0=s\right].$$

The value function $\bm{V}^{\mu}$ can be viewed as a vector in $\mathbb{R}^{n}$ and is the unique fixed point of the Bellman expectation operator $T^{\mu}:\mathbb{R}^{n}\to\mathbb{R}^{n}$ defined by

$$(T^{\mu}\bm{V})(s)\coloneqq r^{\mu}(s)+\gamma\sum_{s'\in\mathcal{S}}P^{\mu}(s,s')\,\bm{V}(s'),\qquad s\in\mathcal{S},$$

where $r^{\mu}(s)\coloneqq\sum_{s'\in\mathcal{S}}P^{\mu}(s,s')\,r(s,s')$.
When $n$ becomes large, solving the Bellman expectation equation via matrix inversion becomes infeasible. Instead, we consider Temporal-Difference (TD) learning with linear function approximation (Sutton, [1988](https://arxiv.org/html/2606.24981#bib.bib11)). Let $\bm{\phi}:\mathcal{S}\to\mathbb{R}^{d}$ be a fixed feature mapping with $d\ll n$, and let $\bm{\theta}\in\mathbb{R}^{d}$. We approximate $\bm{V}^{\mu}$ by $\bm{V}_{\bm{\theta}}(s)\coloneqq\bm{\theta}^{\top}\bm{\phi}(s)$, or in vector form $\bm{V}_{\bm{\theta}}=\bm{\Phi}\bm{\theta}$, where the feature matrix $\bm{\Phi}\in\mathbb{R}^{n\times d}$ has row $\bm{\phi}(s)^{\top}$ corresponding to state $s$.
#### Structural assumption\.
Throughout the paper, we impose the following standard condition on the induced MRP and the feature matrix\.
###### Assumption 3.1 (Ergodic induced MRP and full-rank features).
The transition matrix $\bm{P}^{\mu}$ is irreducible and aperiodic, and the feature matrix $\bm{\Phi}$ has full column rank.
Since $\bm{P}^{\mu}$ is finite, irreducible, and aperiodic, it admits a unique stationary distribution $\pi$. We write $\bm{D}\coloneqq\mathrm{diag}(\pi)$ and use the value-space norm $\|\bm{v}\|_{\bm{D}}\coloneqq\sqrt{\bm{v}^{\top}\bm{D}\bm{v}}$ for $\bm{v}\in\mathbb{R}^{n}$. We also define $\Sigma\coloneqq\bm{\Phi}^{\top}\bm{D}\bm{\Phi}$. Under Assumption [3.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1), $\pi(s)>0$ for every $s\in\mathcal{S}$, and hence $\Sigma\succ 0$.
For any $\bm{\theta}_0\in\mathbb{R}^{d}$, the TD(0) algorithm is defined for $t\geq 0$ by

$$\bm{\theta}_{t+1}=\bm{\theta}_t+\eta_t\bm{g}(\bm{\theta}_t,Z_t),\qquad Z_t\coloneqq(s_t,s_{t+1}),$$

where $\eta_t>0$ is the stepsize and

$$\bm{g}(\bm{\theta},Z_t)\coloneqq\bigl(r(s_t,s_{t+1})+\gamma\bm{\phi}(s_{t+1})^{\top}\bm{\theta}-\bm{\phi}(s_t)^{\top}\bm{\theta}\bigr)\bm{\phi}(s_t).$$

When the dependence on randomness and iterates is clear, we write $\bm{g}_t\coloneqq\bm{g}(\bm{\theta}_t,Z_t)$.
The TD(0) update can equivalently be written as $\bm{\theta}_{t+1}=\bm{\theta}_t+\eta_t(\bm{b}_{Z_t}-\bm{A}_{Z_t}\bm{\theta}_t)$, where

$$\bm{A}_{Z_t}\coloneqq\bm{\phi}(s_t)\bigl(\bm{\phi}(s_t)-\gamma\bm{\phi}(s_{t+1})\bigr)^{\top}\in\mathbb{R}^{d\times d},\qquad\bm{b}_{Z_t}\coloneqq r(s_t,s_{t+1})\bm{\phi}(s_t)\in\mathbb{R}^{d}.$$

Let $\mathcal{Z}\coloneqq\{(i,j)\in\mathcal{S}\times\mathcal{S}:P^{\mu}(i,j)>0\}$ be the state space of the transition chain $Z_t=(s_t,s_{t+1})$. Its Markov kernel is

$$P_Z\bigl((i,j),(j',k)\bigr)=\begin{cases}P^{\mu}(j,k),&j'=j,\\ 0,&j'\neq j,\end{cases}\qquad(i,j),(j',k)\in\mathcal{Z},$$

and its stationary distribution is $\pi_Z(i,j)\coloneqq\pi(i)P^{\mu}(i,j)$ for $(i,j)\in\mathcal{Z}$. Define the population TD matrix and vector by $\bm{A}\coloneqq\mathbb{E}_{Z\sim\pi_Z}[\bm{A}_Z]$ and $\bm{b}\coloneqq\mathbb{E}_{Z\sim\pi_Z}[\bm{b}_Z]$.
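The recursion above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the toy MRP (`P`, `r`, `Phi`), the helper names, the concrete sequence $a_t = 1/(\sqrt{t+1}\,\log(t+e))$, the uniform running average used for the Polyak–Ruppert iterate, and the choice `eta_base = 1 / phi_inf**2` are all hypothetical and are not taken from the paper. In particular, this `eta_base` is far larger than what the theory in Section 4 allows.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi P = pi with sum(pi) = 1 via least squares."""
    n = P.shape[0]
    M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    return np.linalg.lstsq(M, rhs, rcond=None)[0]

def td0_with_pr_average(P, r, Phi, gamma, T, eta_base, seed=0):
    """Unprojected TD(0) along a single trajectory with a running average."""
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    theta = np.zeros(d)        # theta_0 = 0 is an arbitrary choice
    theta_bar = np.zeros(d)    # running mean of theta_0, ..., theta_t
    s = 0
    for t in range(T):
        s_next = rng.choice(n, p=P[s])
        theta_bar += (theta - theta_bar) / (t + 1)
        # Assumed schedule: non-increasing, a_0 = 1, square-summable.
        a_t = 1.0 / (np.sqrt(t + 1.0) * np.log(t + np.e))
        td_error = r[s, s_next] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
        theta = theta + eta_base * a_t * td_error * Phi[s]
        s = s_next
    return theta, theta_bar

# Toy MRP (hypothetical data, for illustration only).
rng = np.random.default_rng(42)
n, d, gamma = 6, 2, 0.9
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
r = rng.uniform(-1.0, 1.0, size=(n, n))
Phi = rng.standard_normal((n, d))

# Population quantities: A = Phi^T D (I - gamma P) Phi, b = Phi^T D r^mu.
pi = stationary_distribution(P)
D = np.diag(pi)
A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
b = Phi.T @ D @ (P * r).sum(axis=1)
theta_star = np.linalg.solve(A, b)

phi_inf = np.linalg.norm(Phi, axis=1).max()
theta_T, theta_bar_T = td0_with_pr_average(
    P, r, Phi, gamma, T=5000, eta_base=1.0 / phi_inf**2
)
print("last-iterate error:", np.linalg.norm(theta_T - theta_star))
print("averaged error:   ", np.linalg.norm(theta_bar_T - theta_star))
```

The closed form for `A` used here follows from taking the expectation of $\bm{A}_Z$ under $\pi_Z$, and `b` uses the averaged reward $r^{\mu}$.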
For $z\in\mathcal{Z}$, define

$$\bm{\xi}(z)\coloneqq\bm{b}_z-\bm{A}_z\bm{\theta}^{*},\qquad\bm{\delta}(z)\coloneqq\bm{A}-\bm{A}_z.$$

When $z=Z_t$, we write

$$\bm{\xi}_t\coloneqq\bm{\xi}(Z_t),\qquad\bm{\delta}_t\coloneqq\bm{\delta}(Z_t).$$
As shown in Lemma [3.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4) below, Assumption [3.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1) implies that $\bm{A}$ is nonsingular. Hence the expected TD linear system $\bm{A}\bm{\theta}^{*}=\bm{b}$ has a unique solution $\bm{\theta}^{*}\in\mathbb{R}^{d}$. It is known that the corresponding value approximation $\bm{V}_{\bm{\theta}^{*}}=\bm{\Phi}\bm{\theta}^{*}$ satisfies the projected Bellman equation (Tsitsiklis and Van Roy, [1996](https://arxiv.org/html/2606.24981#bib.bib14))

$$\bm{\Phi}\bm{\theta}^{*}=\Pi_{\bm{D}}T^{\mu}(\bm{\Phi}\bm{\theta}^{*}),$$

where $\Pi_{\bm{D}}$ is the orthogonal projection operator onto the subspace $\{\bm{\Phi}\bm{x}:\bm{x}\in\mathbb{R}^{d}\}$ with respect to $\|\cdot\|_{\bm{D}}$. Moreover,

$$\|\bm{V}^{\mu}-\bm{V}_{\bm{\theta}^{*}}\|_{\bm{D}}\leq\frac{1}{1-\gamma}\|\bm{V}^{\mu}-\Pi_{\bm{D}}\bm{V}^{\mu}\|_{\bm{D}}.$$
The *mean-path TD update* is defined as

$$\bar{\bm{g}}(\bm{\theta})\coloneqq\mathbb{E}_{Z\sim\pi_Z}\bigl[\bigl(r(s,s')+\gamma\bm{\phi}(s')^{\top}\bm{\theta}-\bm{\phi}(s)^{\top}\bm{\theta}\bigr)\bm{\phi}(s)\bigr].$$

Using $\bm{A}\bm{\theta}^{*}=\bm{b}$, the mean-path TD update is linear in $\bm{e}\coloneqq\bm{\theta}-\bm{\theta}^{*}$ for all $\bm{\theta}\in\mathbb{R}^{d}$:

$$\bar{\bm{g}}(\bm{\theta})=\bm{b}-\bm{A}\bm{\theta}=-\bm{A}(\bm{\theta}-\bm{\theta}^{*}),\qquad\bar{\bm{g}}(\bm{\theta}^{*})=\bm{0}.\tag{1}$$
To characterize the convergence of TD iterates $\bm{\theta}_t$ to the TD fixed point $\bm{\theta}^{*}$, we introduce the Dirichlet semi-norm $\|\bm{x}\|_{\mathrm{Dir}}^{2}\coloneqq\bm{x}^{\top}\bm{L}_{\mathrm{Dir}}\bm{x}$, where

$$\bm{L}_{\mathrm{Dir}}\coloneqq\bm{D}-\tfrac{1}{2}\bigl(\bm{D}\bm{P}^{\mu}+(\bm{P}^{\mu})^{\top}\bm{D}\bigr).$$

By Lemma [F.2](https://arxiv.org/html/2606.24981#A6.Thmtheorem2), $\|\cdot\|_{\mathrm{Dir}}$ is indeed a semi-norm since $\bm{L}_{\mathrm{Dir}}$ is positive semidefinite. The potential function $f$ that we study (Ollivier, [2018](https://arxiv.org/html/2606.24981#bib.bib13); Liu and Olshevsky, [2021](https://arxiv.org/html/2606.24981#bib.bib4)) is defined as

$$f(\bm{\theta})\coloneqq(1-\gamma)\,\|\bm{V}_{\bm{\theta}}-\bm{V}_{\bm{\theta}^{*}}\|_{\bm{D}}^{2}+\gamma\,\|\bm{V}_{\bm{\theta}}-\bm{V}_{\bm{\theta}^{*}}\|_{\mathrm{Dir}}^{2}.\tag{2}$$

In particular, for $\bm{e}=\bm{\theta}-\bm{\theta}^{*}$, since $\bm{e}^{\top}\bm{A}\bm{e}=\bm{e}^{\top}\bm{A}^{\top}\bm{e}$, a direct calculation yields

$$f(\bm{\theta})-f(\bm{\theta}^{*})=\tfrac{1}{2}\bm{e}^{\top}(\bm{A}+\bm{A}^{\top})\bm{e}=\bm{e}^{\top}\bm{A}\bm{e}.\tag{3}$$

Moreover, using ([1](https://arxiv.org/html/2606.24981#S3.E1)), we have

$$\langle\bar{\bm{g}}(\bm{\theta}),\bm{\theta}^{*}-\bm{\theta}\rangle=\langle-\bm{A}\bm{e},-\bm{e}\rangle=\bm{e}^{\top}\bm{A}\bm{e}=f(\bm{\theta})-f(\bm{\theta}^{*}).\tag{4}$$

Since $f(\bm{\theta})-f(\bm{\theta}^{*})\geq 0$ for $\bm{\theta}\neq\bm{\theta}^{*}$, the mean-path update direction $\bar{\bm{g}}(\bm{\theta})$ is aligned with the direction $\bm{\theta}^{*}-\bm{\theta}$. In particular, TD(0) under i.i.d. sampling from $\pi_Z$ acts as a descent method for $f$. Additionally, any convergence guarantee on $f$ can be translated to a guarantee in the projected value norm:

$$\|\bm{\theta}-\bm{\theta}^{*}\|_{\bm{\Sigma}}^{2}=\|\bm{V}_{\bm{\theta}}-\bm{V}_{\bm{\theta}^{*}}\|_{\bm{D}}^{2}\leq\frac{f(\bm{\theta})}{1-\gamma}.\tag{5}$$
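As a sanity check, identities (3) and (5) can be verified numerically on a small random chain. The sketch below is illustrative only: the random kernel `P`, features `Phi`, and error vector `e` are hypothetical, and `e` simply stands in for $\bm{\theta}-\bm{\theta}^{*}$, since both identities depend on $\bm{\theta}$ only through this difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 5, 2, 0.8

# Hypothetical ergodic kernel (all entries positive) and random features.
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
Phi = rng.standard_normal((n, d))

# Stationary distribution: solve pi P = pi, sum(pi) = 1.
M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
rhs = np.zeros(n + 1)
rhs[-1] = 1.0
pi = np.linalg.lstsq(M, rhs, rcond=None)[0]
D = np.diag(pi)

# Dirichlet matrix and population TD matrix.
L_dir = D - 0.5 * (D @ P + P.T @ D)
A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi

e = rng.standard_normal(d)      # stands in for theta - theta*
v = Phi @ e                     # V_theta - V_theta*

f_val = (1 - gamma) * (v @ D @ v) + gamma * (v @ L_dir @ v)   # definition (2)
quad_form = e @ A @ e                                          # right side of (3)
sigma_norm_sq = e @ (Phi.T @ D @ Phi) @ e                      # left side of (5)

print(f_val, quad_form, sigma_norm_sq)
```

The inequality in (5) relies on $\bm{L}_{\mathrm{Dir}}\succeq 0$, which in turn uses that $\pi$ is stationary for $\bm{P}^{\mu}$; replacing `pi` by an arbitrary distribution would generally break it.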
#### Constants and structural consequences\.
We next collect the constants and matrix properties used in the high\-probability analysis\. The boundedness constants are fixed by the finite state space, while the mixing and curvature constants follow from Assumption[3\.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1)\.
###### Lemma 3.2 (Feature and reward bounds).
There exist finite constants $\phi_{\infty},r_{\infty}<\infty$ such that $\|\bm{\phi}(s)\|\leq\phi_{\infty}$ for all $s\in\mathcal{S}$ and $|r(s,s')|\leq r_{\infty}$ for all $s,s'\in\mathcal{S}$. For $z=(s,s')\in\mathcal{Z}$, set

$$\varepsilon_z\coloneqq r(s,s')+\gamma\bm{\phi}(s')^{\top}\bm{\theta}^{*}-\bm{\phi}(s)^{\top}\bm{\theta}^{*}.$$

Then $\bm{\xi}(z)=\varepsilon_z\bm{\phi}(s)$ and

$$\|\bm{\xi}(z)\|\leq r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}\|\bm{\theta}^{*}\|.$$
###### Lemma 3.3 (Geometric mixing of $(Z_t)$).
The transition chain $Z_t=(s_t,s_{t+1})$ has stationary distribution $\pi_Z$. Moreover, there exists a finite constant $\tau_{\mathrm{mix}}\geq 1$ such that

$$\sup_{z\in\mathcal{Z}}\|P_Z^{k}(z,\cdot)-\pi_Z\|_{\mathrm{TV}}\leq 4\cdot 2^{-k/\tau_{\mathrm{mix}}},\qquad k\geq 0,$$

where $P_Z^{k}(z,z')\coloneqq\mathbb{P}(Z_k=z'\mid Z_0=z)$.
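To make the transition chain concrete, the following sketch builds $P_Z$ and $\pi_Z$ from a small state kernel and tracks the worst-case total variation distance over time. The random kernel `P` is hypothetical, all its entries are positive so that $\mathcal{Z}=\mathcal{S}\times\mathcal{S}$, pairs $(i,j)$ are flattened to index `i * n + j`, and total variation is taken as half the $\ell_1$ distance, which is an assumption about the convention. The sketch does not estimate $\tau_{\mathrm{mix}}$ itself.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Hypothetical state kernel with strictly positive entries.
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)

# Stationary distribution of the state chain.
M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
rhs = np.zeros(n + 1)
rhs[-1] = 1.0
pi = np.linalg.lstsq(M, rhs, rcond=None)[0]

# Pair chain: from (i, j) the next pair is (j, k) with probability P[j, k].
P_Z = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            P_Z[i * n + j, j * n + k] = P[j, k]

# pi_Z(i, j) = pi(i) * P(i, j), flattened in the same row-major order.
pi_Z = (pi[:, None] * P).reshape(-1)

def worst_case_tv(kernel, stat, num_steps):
    """Return max_z TV(kernel^k(z, .), stat) for k = 0, ..., num_steps."""
    out = []
    power = np.eye(kernel.shape[0])
    for _ in range(num_steps + 1):
        out.append(0.5 * np.abs(power - stat[None, :]).sum(axis=1).max())
        power = power @ kernel
    return np.array(out)

tv = worst_case_tv(P_Z, pi_Z, num_steps=200)
print(tv[:5], tv[-1])
```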
###### Lemma 3.4 (Curvature of the TD matrix).
The symmetric part of $\bm{A}$ is positive definite. More precisely,

$$\frac{\bm{A}+\bm{A}^{\top}}{2}=(1-\gamma)\bm{\Phi}^{\top}\bm{D}\bm{\Phi}+\gamma\bm{\Phi}^{\top}\bm{L}_{\mathrm{Dir}}\bm{\Phi}\succeq(1-\gamma)\lambda_{\min}(\bm{\Phi}^{\top}\bm{D}\bm{\Phi})\mathbf{I}_d.$$

Consequently, one may take $\omega=(1-\gamma)\lambda_{\min}(\bm{\Phi}^{\top}\bm{D}\bm{\Phi})>0$, and

$$f(\bm{\theta})-f(\bm{\theta}^{*})=\bm{e}^{\top}\frac{\bm{A}+\bm{A}^{\top}}{2}\bm{e}\geq\omega\|\bm{e}\|^{2}=\omega\|\bm{\theta}-\bm{\theta}^{*}\|^{2}.$$

Thus $f$ is $\omega$-star-strongly convex around $\bm{\theta}^{*}$.
The first lemma is immediate from the finiteness of $\mathcal{S}$ and the triangle inequality. Lemma [3.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3) is the standard geometric mixing bound for finite irreducible and aperiodic Markov chains; see, e.g., Levin and Peres ([2017](https://arxiv.org/html/2606.24981#bib.bib26), Theorem 4.9). For Lemma [3.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4), expand $\bm{A}=\bm{\Phi}^{\top}\bm{D}(\mathbf{I}-\gamma\bm{P}^{\mu})\bm{\Phi}$ and symmetrize. Then Lemma [F.2](https://arxiv.org/html/2606.24981#A6.Thmtheorem2) gives $\bm{L}_{\mathrm{Dir}}\succeq 0$, while Assumption [3.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1) gives $\bm{\Phi}^{\top}\bm{D}\bm{\Phi}\succ 0$.
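The symmetrization step for Lemma 3.4 can also be checked numerically. The sketch below, on a hypothetical random chain and feature matrix, computes $\omega=(1-\gamma)\lambda_{\min}(\bm{\Phi}^{\top}\bm{D}\bm{\Phi})$ and compares it with the smallest eigenvalue of the symmetric part of $\bm{A}$. It illustrates the statement and is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, gamma = 6, 3, 0.95

# Hypothetical ergodic kernel and (almost surely full-rank) features.
P = rng.random((n, n)) + 0.1
P /= P.sum(axis=1, keepdims=True)
Phi = rng.standard_normal((n, d))

# Stationary distribution of P.
M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
rhs = np.zeros(n + 1)
rhs[-1] = 1.0
pi = np.linalg.lstsq(M, rhs, rcond=None)[0]
D = np.diag(pi)

A = Phi.T @ D @ (np.eye(n) - gamma * P) @ Phi
sym_A = 0.5 * (A + A.T)

Sigma = Phi.T @ D @ Phi
L_dir = D - 0.5 * (D @ P + P.T @ D)

# Decomposition stated in Lemma 3.4.
decomposition = (1 - gamma) * Sigma + gamma * Phi.T @ L_dir @ Phi

omega = (1 - gamma) * np.linalg.eigvalsh(Sigma).min()
lam_min_sym = np.linalg.eigvalsh(sym_A).min()
print("omega =", omega, " lambda_min(sym A) =", lam_min_sym)
```

Since the Dirichlet term is only positive semidefinite, `lam_min_sym` can be strictly larger than `omega`; the lemma uses `omega` as a convenient lower bound.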
#### Poisson Equation\.
A critical tool for handling Markovian noise is the Poisson equation associated with a bounded measurable function $h:\mathcal{Z}\to\mathbb{R}^{d}$. Define the stationary mean $\pi_Z(h)\coloneqq\mathbb{E}_{Z\sim\pi_Z}[h(Z)]$ and let $\tilde{h}\coloneqq h-\pi_Z(h)$ be the centered function. The Poisson equation associated with $P_Z$ is

$$u-P_Zu=\tilde{h},\qquad\text{i.e.,}\qquad u(z)-\sum_{z'\in\mathcal{Z}}P_Z(z,z')\,u(z')=\tilde{h}(z),\qquad z\in\mathcal{Z}.\tag{6}$$

In Appendix [A](https://arxiv.org/html/2606.24981#A1), we will prove that this Poisson equation admits a solution $u^{*}:\mathcal{Z}\to\mathbb{R}^{d}$ such that $\|u^{*}\|_{\infty}\leq 16\tau_{\mathrm{mix}}\|h\|_{\infty}$, where $\|h\|_{\infty}\coloneqq\sup_{z\in\mathcal{Z}}\|h(z)\|$.
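On a finite chain, equation (6) is a linear system and can be solved directly. The sketch below uses a generic ergodic kernel `K` as a stand-in for $P_Z$ and a random function `h`, both hypothetical. It computes the zero-mean solution in two standard ways, through the linear system $(\mathbf{I}-K+\mathbf{1}\pi^{\top})u=\tilde{h}$ and through the truncated series $\sum_{k}K^{k}\tilde{h}$. Whether either coincides with the construction in Appendix A is an assumption; the sketch also does not check the $16\tau_{\mathrm{mix}}$ bound.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 6, 2

# Hypothetical ergodic kernel standing in for P_Z.
K = rng.random((m, m)) + 0.1
K /= K.sum(axis=1, keepdims=True)

# Stationary distribution of K.
M = np.vstack([K.T - np.eye(m), np.ones((1, m))])
rhs = np.zeros(m + 1)
rhs[-1] = 1.0
pi = np.linalg.lstsq(M, rhs, rcond=None)[0]

# Vector-valued function h on the state space and its centered version.
h = rng.standard_normal((m, d))
h_tilde = h - pi @ h            # subtract the stationary mean from every row

# Route 1: solve (I - K + 1 pi^T) u = h_tilde, which forces pi^T u = 0.
u = np.linalg.solve(np.eye(m) - K + np.outer(np.ones(m), pi), h_tilde)

# Route 2: truncated series sum_{k >= 0} K^k h_tilde.
series = np.zeros_like(h_tilde)
term = h_tilde.copy()
for _ in range(300):
    series += term
    term = K @ term

residual = u - K @ u - h_tilde
print("max residual:", np.abs(residual).max())
print("max |u - series|:", np.abs(u - series).max())
```

Solutions of (6) are unique only up to an additive constant vector; both routes above pick the one with zero stationary mean.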
## 4 Main Results
In this section, we present our main results. We first establish a high-probability uniform bound on the iterates, which is a key step in our convergence analysis. We then derive a simultaneous high-probability convergence guarantee for the Polyak–Ruppert averaged iterate that is both curvature-free (robust) and curvature-aware (fast).
### 4.1 High-Probability Boundedness of the Iterates of Unprojected TD(0)
Our first result shows that, even without any projection, the iterates of TD(0) remain bounded by a constant multiple of $\max\{\|\bm{\theta}_0-\bm{\theta}^{*}\|,\|\bm{\theta}^{*}\|,\frac{r_{\infty}}{\phi_{\infty}}\}$ under a simple stepsize schedule that exploits the update structure of TD(0).
###### Theorem 4\.1\(High\-probability bounded iterates\)\.
Under Assumption[3\.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1), letϕ∞,r∞\\phi\_\{\\infty\},r\_\{\\infty\}andτmix\\tau\_\{\\mathrm\{mix\}\}be the constants introduced in Lemmas[3\.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2)and[3\.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3)\. Consider the following stepsize schedule:
$$\eta_t = \eta_{\mathrm{base}}\, a_t, \quad \text{where} \quad \eta_{\mathrm{base}} := \frac{1}{c\, \tau_{\mathrm{mix}}\, \phi_\infty^2}, \tag{7}$$
for some numerical constant $c > 0$, and a non-increasing positive sequence $(a_t)$ such that $\sum_{t=0}^{\infty} a_t^2$ is finite and $a_0 \leq 1$. Fix any $\delta \in (0,1)$ and let
$$R_{\mathrm{base}} \coloneqq \max\left\{ \|\bm{\theta}_0 - \bm{\theta}^*\|, \|\bm{\theta}^*\|, \frac{r_\infty}{\phi_\infty} \right\}. \tag{8}$$
Define $A_1(\delta) \coloneqq 1536 \sqrt{\sum_{t=0}^{\infty} a_t^2}\, \sqrt{2 \log \frac{2}{\delta}} + 2304$, $A_2 \coloneqq 2706 \sum_{t=0}^{\infty} a_t^2$ (footnote 2: we keep track of all numerical constants because future work may focus on sharpening the minimal value of $c$),
$$c_{\min}(\delta) \coloneqq \frac{A_1(\delta) + \sqrt{A_1^2(\delta) + 4 A_2}}{2}, \qquad \text{and} \qquad \rho \coloneqq \frac{2c}{\sqrt{c^2 - A_1(\delta)\, c - A_2}}. \tag{9}$$
Then, provided that $c > c_{\min}(\delta)$, with probability at least $1 - \delta$, we have
$$\sup_{t \geq 0} \|\bm{\theta}_t\|_2 \leq \rho R_{\mathrm{base}}.$$
Note that a simple choice of $a_t$ that satisfies the assumptions of the theorem is $a_t = \frac{1}{\ln(t+3) \sqrt{t+1}}$.
We provide the complete proof in Appendix[B](https://arxiv.org/html/2606.24981#A2)and a proof sketch below\.
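To get a feel for the constants in Theorem 4.1, the sketch below evaluates $A_1(\delta)$, $A_2$, $c_{\min}(\delta)$ and $\rho$ for the schedule $a_t = 1/(\ln(t+3)\sqrt{t+1})$. The infinite sum $\sum_t a_t^2$ is replaced by a partial sum plus the integral tail bound $\sum_{t \geq N} a_t^2 \leq 1/\ln N$, which is an assumption of this example (it only makes the constants slightly more conservative). The function names, the value $\delta = 0.05$, and the choice $c = 2 c_{\min}$ are hypothetical.

```python
import math


def sum_a_squared_upper(n_terms=200_000):
    """Upper bound on sum_{t>=0} a_t^2 for a_t = 1 / (ln(t+3) * sqrt(t+1))."""
    partial = sum(1.0 / ((t + 1) * math.log(t + 3) ** 2) for t in range(n_terms))
    # Tail: a_t^2 <= 1 / ((t+1) ln^2(t+1)), and x -> 1/(x ln^2 x) is decreasing
    # for x > 1, so sum_{t >= N} a_t^2 <= integral_N^inf dx / (x ln^2 x) = 1 / ln N.
    tail = 1.0 / math.log(n_terms)
    return partial + tail


def c_min(delta, sum_a2):
    """Return (A1, A2, c_min) following the definitions in Theorem 4.1."""
    a1 = 1536.0 * math.sqrt(sum_a2) * math.sqrt(2.0 * math.log(2.0 / delta)) + 2304.0
    a2 = 2706.0 * sum_a2
    return a1, a2, (a1 + math.sqrt(a1 * a1 + 4.0 * a2)) / 2.0


def rho(c, a1, a2):
    """Radius multiplier rho = 2c / sqrt(c^2 - A1 c - A2); requires c > c_min."""
    gap = c * c - a1 * c - a2
    if gap <= 0.0:
        raise ValueError("need c > c_min(delta)")
    return 2.0 * c / math.sqrt(gap)


sum_a2 = sum_a_squared_upper()
a1, a2, cmin = c_min(0.05, sum_a2)  # delta = 0.05 is an illustrative choice
c = 2.0 * cmin                      # hypothetical choice of c above the threshold
r = rho(c, a1, a2)
```

Note that $c_{\min}(\delta)$ is the positive root of $c^2 - A_1(\delta) c - A_2 = 0$, so $\rho$ is finite exactly when $c > c_{\min}(\delta)$, and $\rho > 2$ for every admissible $c$.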
#### Proof Sketch\.
Fixδ\>0\\delta\>0and suppress theδ\\delta\-dependence in the concentration bounds for simplicity\. Starting from the definition of the TD\(0\) update, we have
$$\|\bm{g}_k\| = \left\| \bigl( r_k + \gamma \langle \bm{\phi}(s_{k+1}), \bm{\theta}_k \rangle - \langle \bm{\phi}(s_k), \bm{\theta}_k \rangle \bigr) \bm{\phi}(s_k) \right\| \leq |r_k|\, \|\bm{\phi}(s_k)\| + |1 + \gamma|\, \|\bm{\phi}(s_k)\|^2\, \|\bm{\theta}_k\|. \tag{10}$$
Thus, in each iteration, the update magnitude is at most $r_\infty \phi_\infty + 2 \phi_\infty^2 \|\bm{\theta}_k\|$, which is of the same order as $\|\bm{\theta}_k\|$.
Next, expand‖𝜽t−𝜽∗‖2\\left\\\|\{\\bm\{\\theta\}\_\{t\}\-\\bm\{\\theta\}^\{\*\}\}\\right\\\|^\{2\}using the TD\(0\) update and sum fromk=1k=1tott, to obtain
$$\begin{aligned} \|\bm{\theta}_t - \bm{\theta}^*\|^2 &= \|\bm{\theta}_0 - \bm{\theta}^*\|^2 + \sum_{k=1}^{t} \eta_{k-1}^2 \|\bm{g}_{k-1}\|^2 + 2 \sum_{k=1}^{t} \eta_{k-1} \langle \bm{g}_{k-1} - \bar{\bm{g}}(\bm{\theta}_{k-1}), \bm{\theta}_{k-1} - \bm{\theta}^* \rangle \\ &\quad + 2 \sum_{k=1}^{t} \eta_{k-1} \langle \bar{\bm{g}}(\bm{\theta}_{k-1}), \bm{\theta}_{k-1} - \bm{\theta}^* \rangle. \end{aligned} \tag{11}$$
To build intuition, suppose for the moment that $\bm{g}_k = \bar{\bm{g}}(\bm{\theta}_k)$. Then, equation ([4.1](https://arxiv.org/html/2606.24981#S4.Ex19)) simplifies to
‖𝜽t−𝜽∗‖2−‖𝜽0−𝜽∗‖2≤∑k=1tηk−12‖𝒈¯\(𝜽k−1\)‖2=𝒪\(∑k=1tηk−12\(‖𝜽k−1‖\+‖𝜽k−1‖2\)\),\\displaystyle\\\|\\bm\{\\theta\}\_\{t\}\-\\bm\{\\theta\}^\{\*\}\\\|^\{2\}\-\\\|\\bm\{\\theta\}\_\{0\}\-\\bm\{\\theta\}^\{\*\}\\\|^\{2\}\\leq\\sum\_\{k=1\}^\{t\}\\eta^\{2\}\_\{k\-1\}\\\|\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{k\-1\}\)\\\|^\{2\}=\\mathcal\{O\}\\\!\\left\(\\sum\_\{k=1\}^\{t\}\\eta^\{2\}\_\{k\-1\}\\left\(\\left\\\|\{\\bm\{\\theta\}\_\{k\-1\}\}\\right\\\|\+\\left\\\|\{\\bm\{\\theta\}\_\{k\-1\}\}\\right\\\|^\{2\}\\right\)\\right\),where we use⟨𝒈¯\(𝜽k−1\),𝜽k−1−𝜽∗⟩≤0\\langle\{\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{k\-1\}\)\},\{\\bm\{\\theta\}\_\{k\-1\}\-\\bm\{\\theta\}^\{\*\}\}\\rangle\\leq 0from equation \([4](https://arxiv.org/html/2606.24981#S3.E4)\)\. Hence,‖𝜽t−𝜽∗‖2\\\|\\bm\{\\theta\}\_\{t\}\-\\bm\{\\theta\}^\{\*\}\\\|^\{2\}is governed by a term of the form𝒪\(∑k=1tηk−12\(‖𝜽k−1‖\+‖𝜽k−1‖2\)\)\\mathcal\{O\}\\\!\\left\(\\sum\_\{k=1\}^\{t\}\\eta^\{2\}\_\{k\-1\}\\bigl\(\\left\\\|\{\\bm\{\\theta\}\_\{k\-1\}\}\\right\\\|\+\\left\\\|\{\\bm\{\\theta\}\_\{k\-1\}\}\\right\\\|^\{2\}\\bigr\)\\right\)\. Since\(ηk\)\(\\eta\_\{k\}\)is square\-summable, if‖𝜽k‖\\left\\\|\{\\bm\{\\theta\}\_\{k\}\}\\right\\\|is bounded byRbaseR\_\{\\mathrm\{base\}\}for allk≤t−1k\\leq t\-1, then‖𝜽t−𝜽∗‖\\left\\\|\{\\bm\{\\theta\}\_\{t\}\-\\bm\{\\theta\}^\{\*\}\}\\right\\\|is also bounded by a constant multiple ofRbaseR\_\{\\mathrm\{base\}\}\. Moreover, by the triangle inequality,‖𝜽t‖≤‖𝜽t−𝜽∗‖\+‖𝜽∗‖\\left\\\|\{\\bm\{\\theta\}\_\{t\}\}\\right\\\|\\leq\\left\\\|\{\\bm\{\\theta\}\_\{t\}\-\\bm\{\\theta\}^\{\*\}\}\\right\\\|\+\\left\\\|\{\\bm\{\\theta\}^\{\*\}\}\\right\\\|\. In other words, a bound on‖𝜽k‖\\left\\\|\{\\bm\{\\theta\}\_\{k\}\}\\right\\\|for allk≤t−1k\\leq t\-1implies a bound at timett\. This motivates the following induction hypothesis for someρ\>2\\rho\>2:
$$\max_{1 \leq k \leq t-1} \|\bm{\theta}_{k-1}\| \leq \rho R_{\mathrm{base}}.$$
The core of the proof is then to choose the stepsize parameter $c$ (and hence $\rho$) so that
$$\|\bm{\theta}_t\|^2 \leq 2 \|\bm{\theta}_t - \bm{\theta}^*\|^2 + 2 \|\bm{\theta}^*\|^2 \leq \rho^2 R_{\mathrm{base}}^2,$$
where we use the induction hypothesis to control $\|\bm{\theta}_t - \bm{\theta}^*\|$.
We now return to the Markovian bias term∑k=1tηk−1⟨𝒈k−1−𝒈¯\(𝜽k−1\),𝜽k−1−𝜽∗⟩\\sum\_\{k=1\}^\{t\}\\eta\_\{k\-1\}\\langle\\bm\{g\}\_\{k\-1\}\-\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{k\-1\}\),\\bm\{\\theta\}\_\{k\-1\}\-\\bm\{\\theta\}^\{\*\}\\ranglein \([4\.1](https://arxiv.org/html/2606.24981#S4.Ex19)\)\. In general, forℱt:=σ\(Z0,…,Zt\)\\mathcal\{F\}\_\{t\}:=\\sigma\(Z\_\{0\},\\dots,Z\_\{t\}\),𝔼\[𝒈k−1−𝒈¯\(𝜽k−1\)\|ℱk−2\]\\mathbb\{E\}\\\!\\left\[\\bm\{g\}\_\{k\-1\}\-\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{k\-1\}\)\\;\\middle\|\\;\\mathcal\{F\}\_\{k\-2\}\\right\]need not be zero:𝒈k−1\\bm\{g\}\_\{k\-1\}depends on the fresh randomness inZk−1Z\_\{k\-1\}, and the conditional distribution ofZk−1Z\_\{k\-1\}givenZk−2Z\_\{k\-2\}does not necessarily coincide with the stationary distributionπZ\\pi\_\{Z\}\. To control this bias, freeze the past iterate𝜽k−1\\bm\{\\theta\}\_\{k\-1\}and define the centered scalar forcing termhk−1\(z\)≔⟨𝒈\(𝜽k−1,z\)−𝒈¯\(𝜽k−1\),𝜽k−1−𝜽∗⟩h\_\{k\-1\}\(z\)\\coloneqq\\langle\\bm\{g\}\(\\bm\{\\theta\}\_\{k\-1\},z\)\-\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{k\-1\}\),\\bm\{\\theta\}\_\{k\-1\}\-\\bm\{\\theta\}^\{\*\}\\rangle\. For this forcing term, consider the Poisson equation on the transition chainZtZ\_\{t\}:
$$u_{k-1} : \mathcal{Z} \to \mathbb{R}, \quad u_{k-1}(z) - \mathbb{E}\left[ u_{k-1}(Z_1) \;\middle|\; Z_0 = z \right] = h_{k-1}(z). \tag{12}$$
By Lemma [A.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1), the equation ([12](https://arxiv.org/html/2606.24981#S4.E12)) is solved by $u_{k-1}^*(z) \coloneqq \sum_{i=0}^{\infty} \mathbb{E}\left[ h_{k-1}(Z_i) \;\middle|\; Z_0 = z \right]$, and we also have
$$\|u_{k-1}^*\|_\infty \coloneqq \sup_z |u_{k-1}^*(z)| \leq 16 \tau_{\mathrm{mix}} \|h_{k-1}\|_\infty = \mathcal{O}\left( \tau_{\mathrm{mix}} \|\bm{\theta}_{k-1}\|^2 \right). \tag{13}$$
The solution $u_{k-1}^*$ provides a convenient decomposition of $\sum_{k=1}^{t} \eta_{k-1} h_{k-1}(Z_{k-1})$ into a martingale term $M_t$ plus a remainder term $R_t$:
$$\begin{aligned} \sum_{k=1}^{t} \eta_{k-1} h_{k-1}(Z_{k-1}) &= \sum_{k=1}^{t} \eta_{k-1} \left( u_{k-1}^*(Z_k) - \mathbb{E}\left[ u_{k-1}^*(Z_k) \;\middle|\; Z_{k-1} \right] + u_{k-1}^*(Z_{k-1}) - u_{k-1}^*(Z_k) \right) \\ &= M_t + R_t, \end{aligned} \tag{14}$$
where $\Delta M_k \coloneqq \eta_{k-1} \left( u_{k-1}^*(Z_k) - \mathbb{E}[u_{k-1}^*(Z_k) \mid Z_{k-1}] \right)$ and $M_t \coloneqq \sum_{k=1}^{t} \Delta M_k$. Note that $\|\Delta M_k\|_\infty = \mathcal{O}(\eta_{k-1} \tau_{\mathrm{mix}} \|\bm{\theta}_{k-1}\|^2) = \mathcal{O}(a_{k-1} \|\bm{\theta}_{k-1}\|^2)$ from ([13](https://arxiv.org/html/2606.24981#S4.E13)). Consequently, the accumulated bounded difference for $M_t$ is controlled by $\mathcal{O}(\max_{k \leq t} \|\bm{\theta}_{k-1}\|^2)$. Assuming again that $\max_{k \leq t} \|\bm{\theta}_{k-1}\| \leq \rho R_{\mathrm{base}}$, Pinelis' inequality (Pinelis, [1994](https://arxiv.org/html/2606.24981#bib.bib43)) for martingale concentration (Lemma [B.1](https://arxiv.org/html/2606.24981#A2.Thmtheorem1)) yields $|M_t| = \mathcal{O}(\rho^2 R_{\mathrm{base}}^2)$ with high probability.
To control the remainder term $R_t = \sum_{k=1}^{t} \eta_{k-1} \left( u_{k-1}^*(Z_{k-1}) - u_{k-1}^*(Z_k) \right)$, we rewrite it (see Lemma [A.3](https://arxiv.org/html/2606.24981#A1.Thmtheorem3)) as
$$R_t = \eta_0 u_0^*(Z_0) - \eta_t u_t^*(Z_t) - \sum_{k=1}^{t} (\eta_{k-1} - \eta_k)\, u_{k-1}^*(Z_k) - \sum_{k=1}^{t} \eta_k \bigl( u_{k-1}^*(Z_k) - u_k^*(Z_k) \bigr).$$
Since $\|u_k^* - u_{k-1}^*\|_\infty$ is controlled by $\mathcal{O}(\eta_{k-1} \tau_{\mathrm{mix}} \|\bm{\theta}_{k-1}\|^2)$ (because $\bm{\theta}_k - \bm{\theta}_{k-1} = \eta_{k-1} \bm{g}_{k-1}$), the induction hypothesis again implies $|R_t| = \mathcal{O}(\rho^2 R_{\mathrm{base}}^2)$. Combining the bounds for $M_t$ and $R_t$, we obtain $\sum_{k=1}^{t} \eta_{k-1} h_{k-1}(Z_{k-1}) = \mathcal{O}(\rho^2 R_{\mathrm{base}}^2)$, which fits our inductive proof, since $\|\bm{\theta}_t - \bm{\theta}^*\|^2$ remains of the same order $\mathcal{O}(\rho^2 R_{\mathrm{base}}^2)$ as in the idealized case $\bm{g}_k = \bar{\bm{g}}(\bm{\theta}_k)$.
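The martingale-plus-remainder decomposition can be checked numerically in a simplified setting. The sketch below assumes a fixed forcing function (so the Poisson solution does not change with $k$ and the last sum in the rewritten remainder vanishes), a hypothetical two-state chain, and $\eta_k = a_k$ with $\eta_{\mathrm{base}} = 1$. It verifies the pathwise identity and that the remainder stays below $2 \eta_0 \|u^*\|_\infty$ for a non-increasing stepsize. None of these numerical choices come from the paper.

```python
import math
import random

# Hypothetical two-state chain with P(0 -> 1) = p01 and P(1 -> 0) = p10.
p01, p10 = 0.1, 0.2
P = [[1.0 - p01, p01], [p10, 1.0 - p10]]
pi = [p10 / (p01 + p10), p01 / (p01 + p10)]  # stationary distribution

# Fixed scalar forcing function (illustrative) and its centered version.
h = [1.0, -2.0]
mean = pi[0] * h[0] + pi[1] * h[1]
h_tilde = [x - mean for x in h]

# On a two-state chain, P h~ = (1 - p01 - p10) h~ for centered h~,
# so the Poisson solution is u = h~ / (p01 + p10).
u = [x / (p01 + p10) for x in h_tilde]
Pu = [P[z][0] * u[0] + P[z][1] * u[1] for z in range(2)]


def eta(k):
    """Stepsize eta_k = a_k = 1 / (ln(k+3) sqrt(k+1)), taking eta_base = 1."""
    return 1.0 / (math.log(k + 3) * math.sqrt(k + 1))


# Simulate one trajectory Z_0, ..., Z_t.
rng = random.Random(0)
t = 500
Z = [0]
for _ in range(t):
    z = Z[-1]
    Z.append(0 if rng.random() < P[z][0] else 1)

# Left-hand side, martingale term, and remainder term.
lhs = sum(eta(k - 1) * h_tilde[Z[k - 1]] for k in range(1, t + 1))
M = sum(eta(k - 1) * (u[Z[k]] - Pu[Z[k - 1]]) for k in range(1, t + 1))
R = sum(eta(k - 1) * (u[Z[k - 1]] - u[Z[k]]) for k in range(1, t + 1))

# Summation-by-parts form of the remainder (fixed u, so the last sum is zero).
R_alt = (eta(0) * u[Z[0]] - eta(t) * u[Z[t]]
         - sum((eta(k - 1) - eta(k)) * u[Z[k]] for k in range(1, t + 1)))

u_inf = max(abs(x) for x in u)
```

In this simplified case the remainder is bounded uniformly in `t`, which mirrors the role it plays in the induction: only the boundary terms and the stepsize differences contribute.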
### 4\.2High\-Probability Convergence Rate for Polyak–Ruppert Averaging
Our second result studies the convergence rate of the Polyak–Ruppert \(PR\) averaged iterate\. Let
$$S_T := \sum_{t=1}^{T} \eta_{t-1}, \qquad \bar{\bm{\theta}}_T := \frac{1}{S_T} \sum_{t=1}^{T} \eta_{t-1} \bm{\theta}_{t-1}.$$
Recall the following potential defined in ([2](https://arxiv.org/html/2606.24981#S3.E2)):
$$f(\bm{\theta}) - f(\bm{\theta}^*) = (1 - \gamma)\, \|\bm{V}_{\bm{\theta}} - \bm{V}_{\bm{\theta}^*}\|_{\bm{D}}^2 + \gamma\, \|\bm{V}_{\bm{\theta}} - \bm{V}_{\bm{\theta}^*}\|_{\mathrm{Dir}}^2.$$
We have the following high-probability guarantees for the PR average $\bar{\bm{\theta}}_T$.
###### Theorem 4\.2\(High\-probability rate for PR averaging\)\.
Under Assumption[3\.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1), letϕ∞,r∞,τmix\\phi\_\{\\infty\},r\_\{\\infty\},\\tau\_\{\\mathrm\{mix\}\}andω\\omegabe the constants introduced in Lemmas[3\.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2)–[3\.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4)\. Consider the same stepsize schedule\(ηt\)\(\\eta\_\{t\}\)as in \([7](https://arxiv.org/html/2606.24981#S4.E7)\) and the sameRbaseR\_\{\\mathrm\{base\}\}in \([8](https://arxiv.org/html/2606.24981#S4.E8)\)\. Also, defineH≔∑t=0∞ηt2H\\coloneqq\\sum\_\{t=0\}^\{\\infty\}\\eta\_\{t\}^\{2\}\. Fix anyδ∈\(0,1\)\\delta\\in\(0,1\)and letA1\(δ\)=1536∑t=0∞at22log8δ\+2304A\_\{1\}\(\\delta\)=1536\\sqrt\{\\sum\_\{t=0\}^\{\\infty\}a\_\{t\}^\{2\}\}\\sqrt\{2\\log\\frac\{8\}\{\\delta\}\}\+2304andA2=2706∑t=0∞at2A\_\{2\}=2706\\sum\_\{t=0\}^\{\\infty\}a\_\{t\}^\{2\}\. Define
$$c_{\min}(\delta) \coloneqq \frac{A_1(\delta) + \sqrt{A_1(\delta)^2 + 4 A_2}}{2}, \qquad \rho \coloneqq \frac{2c}{\sqrt{c^2 - A_1(\delta)\, c - A_2}}.$$
Suppose $c > c_{\min}(\delta)$. Then, with probability at least $1 - \delta$, we have for all $T \geq 1$,
$$f(\bar{\bm{\theta}}_T) - f(\bm{\theta}^*) \leq \min\left\{ \frac{C_{\mathrm{fast}}^2}{\omega S_T^2}, \frac{C_{\mathrm{robust}}}{S_T} \right\},$$
where
$$\begin{aligned} C_{\mathrm{fast}} &\coloneqq \rho R_{\mathrm{base}} \left[ 2 + 2 \tau_{\mathrm{mix}} \phi_\infty^2 \left( 264 \eta_0 + 176 \sqrt{H} \sqrt{2 \log \frac{8}{\delta}} \right) + 192 \tau_{\mathrm{mix}} \phi_\infty^4 H \right], \\ C_{\mathrm{robust}} &\coloneqq \rho^2 R_{\mathrm{base}}^2 \left[ 0.5 + 2 \tau_{\mathrm{mix}} \phi_\infty^2 \left( 288 \eta_0 + 192 \sqrt{H} \sqrt{2 \log \frac{8}{\delta}} \right) + \left( 672 \tau_{\mathrm{mix}} + 4.5 \right) \phi_\infty^4 H \right]. \end{aligned}$$
A choice of the stepsize that satisfies the assumptions of the theorem is \(see Appendix[D](https://arxiv.org/html/2606.24981#A4)\)
$$\eta_t = (c\, \phi_\infty^2\, \tau_{\mathrm{mix}})^{-1} a_t, \quad \text{where} \quad a_t = \left( \log(t+3) \sqrt{t+1} \right)^{-1}. \tag{15}$$
The logarithmic correction in $a_t$ is used at the square-summability boundary: it ensures that $\sum_{t=0}^{\infty} \eta_t^2$ is finite while keeping $S_T = \sum_{t=1}^{T} \eta_{t-1}$ within a logarithmic factor of $\sqrt{T}$. Such a choice results in a robust rate of $\widetilde{\mathcal{O}}\left( \frac{\|\bm{\theta}^*\|^2 \tau_{\mathrm{mix}} \phi_\infty^2}{\sqrt{T}} \right)$ and a fast rate of $\widetilde{\mathcal{O}}\left( \frac{\|\bm{\theta}^*\|^2 \tau_{\mathrm{mix}}^2 \phi_\infty^4}{\omega T} \right)$ with high probability.
In Appendix[E](https://arxiv.org/html/2606.24981#A5), we also show an alternative stepsize choice that does not require knowledge ofτmix\\tau\_\{\\mathrm\{mix\}\}, at the cost of a doubly exponential dependence onτmix\\tau\_\{\\mathrm\{mix\}\}in the constants\.
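The algorithm analyzed in Theorem 4.2 can be sketched in a few lines: unprojected TD(0) along a single trajectory, the stepsize schedule (15), and the stepsize-weighted Polyak–Ruppert average. The function name `td0_pr`, the state-dependent reward, the three-state example chain, and the values `c = 1.0` and `tau_mix = 1.0` are assumptions made for illustration; in particular, these values of `c` and `tau_mix` are far smaller than what the theorem requires and are not estimates of any real mixing time.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def td0_pr(P, r, phi, gamma, theta0, T, c, tau_mix, seed=0):
    """Unprojected linear TD(0) with stepsize-weighted Polyak-Ruppert averaging.

    P: row-stochastic transition matrix, r: reward per state (an assumption of
    this sketch), phi: feature vector per state. Returns (theta_bar_T, theta_T, S_T).
    """
    rng = random.Random(seed)
    n, d = len(P), len(theta0)
    phi_inf = max(math.sqrt(dot(f, f)) for f in phi)
    eta_base = 1.0 / (c * tau_mix * phi_inf ** 2)
    theta = list(theta0)
    weighted_sum = [0.0] * d  # accumulates sum_t eta_{t-1} * theta_{t-1}
    S = 0.0                   # accumulates S_T = sum_t eta_{t-1}
    s = 0
    for t in range(T):
        eta = eta_base / (math.log(t + 3) * math.sqrt(t + 1))
        # The average uses the iterate *before* the update, weighted by eta_t.
        for i in range(d):
            weighted_sum[i] += eta * theta[i]
        S += eta
        s_next = rng.choices(range(n), weights=P[s])[0]
        td_error = r[s] + gamma * dot(phi[s_next], theta) - dot(phi[s], theta)
        theta = [theta[i] + eta * td_error * phi[s][i] for i in range(d)]
        s = s_next
    theta_bar = [w / S for w in weighted_sum]
    return theta_bar, theta, S


# Hypothetical three-state example (illustration only).
P = [[0.5, 0.5, 0.0], [0.1, 0.6, 0.3], [0.3, 0.0, 0.7]]
r = [1.0, 0.0, -1.0]
phi = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
theta_bar, theta_T, S_T = td0_pr(P, r, phi, gamma=0.9, theta0=[0.0, 0.0],
                                 T=20_000, c=1.0, tau_mix=1.0)
```

No projection step appears anywhere in the loop, and nothing in the schedule depends on the curvature parameter $\omega$; the only problem-dependent inputs to the stepsize are `tau_mix` and the feature norm bound.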
#### Proof Sketch of Theorem[4\.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2)\.
By Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1), with high probability we have $\sup_{t \geq 0} \|\bm{\theta}_t\|_2 \leq \rho\, R_{\mathrm{base}}$. It then follows that $\sum_{t=1}^{T} \eta_{t-1}^2 \|\bm{g}_{t-1}\|^2$ is of order $\mathcal{O}(R_{\mathrm{base}}^2)$, using the gradient bound ([10](https://arxiv.org/html/2606.24981#S4.E10)). Moreover, the cumulative Markov bias term $\sum_{t=1}^{T} \eta_{t-1} \langle \bm{g}_{t-1} - \bar{\bm{g}}(\bm{\theta}_{t-1}), \bm{\theta}_{t-1} - \bm{\theta}^* \rangle$ is also of order $\mathcal{O}(R_{\mathrm{base}}^2)$, by applying the Poisson equation and Pinelis' inequality.
Rearranging \([4\.1](https://arxiv.org/html/2606.24981#S4.Ex19)\), we obtain
2∑t=1Tηt−1⟨𝒈¯\(𝜽t−1\),𝜽∗−𝜽t−1⟩\\displaystyle 2\\sum\_\{t=1\}^\{T\}\\eta\_\{t\-1\}\\langle\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{t\-1\}\),\\bm\{\\theta\}^\{\*\}\-\\bm\{\\theta\}\_\{t\-1\}\\rangle=‖𝜽0−𝜽∗‖2−‖𝜽T−𝜽∗‖2\\displaystyle=\\\|\\bm\{\\theta\}\_\{0\}\-\\bm\{\\theta\}^\{\*\}\\\|^\{2\}\-\\\|\\bm\{\\theta\}\_\{T\}\-\\bm\{\\theta\}^\{\*\}\\\|^\{2\}\+∑t=1Tηt−12‖𝒈t−1‖2\+2∑t=1Tηt−1⟨𝒈t−1−𝒈¯\(𝜽t−1\),𝜽t−1−𝜽∗⟩\\displaystyle\\quad\+\\sum\_\{t=1\}^\{T\}\\eta\_\{t\-1\}^\{2\}\\\|\\bm\{g\}\_\{t\-1\}\\\|^\{2\}\+2\\sum\_\{t=1\}^\{T\}\\eta\_\{t\-1\}\\langle\\bm\{g\}\_\{t\-1\}\-\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{t\-1\}\),\\bm\{\\theta\}\_\{t\-1\}\-\\bm\{\\theta\}^\{\*\}\\rangle=𝒪\(Rbase2\)\.\\displaystyle=\\mathcal\{O\}\(R\_\{\\mathrm\{base\}\}^\{2\}\)\\,\.The robust rate then follows from the convexity offftogether with the equation \([4](https://arxiv.org/html/2606.24981#S3.E4)\):
f\(𝜽¯T\)−f\(𝜽∗\)\\displaystyle f\(\\bar\{\\bm\{\\theta\}\}\_\{T\}\)\-f\(\\bm\{\\theta\}^\{\*\}\)≤1∑t=1Tηt−1∑t=1Tηt−1⟨𝒈¯\(𝜽t−1\),𝜽∗−𝜽t−1⟩=𝒪\(Rbase2ST\)\.\\displaystyle\\leq\\frac\{1\}\{\\sum\_\{t=1\}^\{T\}\\eta\_\{t\-1\}\}\\sum\_\{t=1\}^\{T\}\\eta\_\{t\-1\}\\langle\{\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{t\-1\}\)\},\{\\bm\{\\theta\}^\{\*\}\-\\bm\{\\theta\}\_\{t\-1\}\}\\rangle=\\mathcal\{O\}\\left\(\\frac\{R\_\{\\mathrm\{base\}\}^\{2\}\}\{S\_\{T\}\}\\right\)\\,\.
To derive the fast rate, recall that the TD update can be written as𝒈\(𝜽t−1,Zt−1\)=−𝑨Zt−1𝜽t−1\+𝒃Zt−1\.\\bm\{g\}\(\\bm\{\\theta\}\_\{t\-1\},Z\_\{t\-1\}\)=\-\\bm\{A\}\_\{Z\_\{t\-1\}\}\\bm\{\\theta\}\_\{t\-1\}\+\\bm\{b\}\_\{Z\_\{t\-1\}\}\\,\.Then the centered error𝒆t:=𝜽t−𝜽∗\\bm\{e\}\_\{t\}:=\\bm\{\\theta\}\_\{t\}\-\\bm\{\\theta\}^\{\*\}can be shown \(see \([24](https://arxiv.org/html/2606.24981#A3.E24)\)\) to evolve as
$$\bm{e}_t = (\mathbf{I}_d - \eta_{t-1} \bm{A})\, \bm{e}_{t-1} + \eta_{t-1} \bm{\xi}_{t-1} + \eta_{t-1} \bm{\delta}_{t-1} \bm{e}_{t-1}. \tag{16}$$
Rearranging yields
$$\bm{A} \bm{e}_{t-1} = \frac{\bm{e}_{t-1} - \bm{e}_t}{\eta_{t-1}} + \bm{\xi}_{t-1} + \bm{\delta}_{t-1} \bm{e}_{t-1}.$$
Define $\bar{\bm{e}}_T := \frac{1}{S_T} \sum_{t=1}^{T} \eta_{t-1} \bm{e}_{t-1}$. Then $\bm{A} \bar{\bm{e}}_T = I_1 + I_2 + I_3$, where
$$I_1 := \frac{1}{S_T} \sum_{t=1}^{T} (\bm{e}_{t-1} - \bm{e}_t), \qquad I_2 := \frac{1}{S_T} \sum_{t=1}^{T} \eta_{t-1} \bm{\xi}_{t-1}, \qquad I_3 := \frac{1}{S_T} \sum_{t=1}^{T} \eta_{t-1} \bm{\delta}_{t-1} \bm{e}_{t-1}.$$
By Lemma [3.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4), we have $(\bm{A} + \bm{A}^\top)/2 \succeq \omega \mathbf{I}_d$. Combining this with the equation ([4](https://arxiv.org/html/2606.24981#S3.E4)) again, we obtain
$$f(\bar{\bm{\theta}}_T) - f(\bm{\theta}^*) = \langle \bm{A} \bar{\bm{e}}_T, \bar{\bm{e}}_T \rangle \leq \|\bm{A} \bar{\bm{e}}_T\|\, \|\bar{\bm{e}}_T\| \leq \frac{1}{\omega}\, \|\bm{A} \bar{\bm{e}}_T\|^2.$$
The fast rate follows by controlling $\|I_1\|$, $\|I_2\|$, and $\|I_3\|$ separately. The term $\|I_1\|$ is bounded using the telescoping sum. For $\|I_2\|$ and $\|I_3\|$, we again apply Lemma [A.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1) with forcing terms $\bm{\xi}_{t-1}$ and $\bm{\delta}_{t-1} \bm{e}_{t-1}$, respectively. Therefore, by Pinelis' inequality and the normalization by $S_T$, both $\|I_2\|$ and $\|I_3\|$ are of order $\mathcal{O}(R_{\mathrm{base}}/S_T)$ with high probability, which completes the proof sketch for the fast rate.
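For readers who want the intermediate steps behind the last inequality and the telescoping bound, they can be spelled out as follows. This is a sketch that uses only the curvature bound of Lemma 3.4, the Cauchy–Schwarz inequality, and the triangle inequality; the bound on $\|\bm{e}_0\| + \|\bm{e}_T\|$ relies on the high-probability event of Theorem 4.1.

```latex
% Step 1: curvature gives a lower bound on the quadratic form.
\omega \|\bar{\bm{e}}_T\|^2
  \le \bar{\bm{e}}_T^\top \frac{\bm{A} + \bm{A}^\top}{2} \bar{\bm{e}}_T
  = \langle \bm{A} \bar{\bm{e}}_T, \bar{\bm{e}}_T \rangle
  \le \|\bm{A} \bar{\bm{e}}_T\| \, \|\bar{\bm{e}}_T\|
  \quad\Longrightarrow\quad
  \|\bar{\bm{e}}_T\| \le \frac{1}{\omega} \|\bm{A} \bar{\bm{e}}_T\| .

% Step 2: substitute back into the Cauchy--Schwarz bound.
\langle \bm{A} \bar{\bm{e}}_T, \bar{\bm{e}}_T \rangle
  \le \|\bm{A} \bar{\bm{e}}_T\| \, \|\bar{\bm{e}}_T\|
  \le \frac{1}{\omega} \|\bm{A} \bar{\bm{e}}_T\|^2
  \le \frac{1}{\omega} \bigl( \|I_1\| + \|I_2\| + \|I_3\| \bigr)^2 .

% Step 3: the first term telescopes.
I_1 = \frac{1}{S_T} \sum_{t=1}^{T} (\bm{e}_{t-1} - \bm{e}_t)
    = \frac{\bm{e}_0 - \bm{e}_T}{S_T},
\qquad
\|I_1\| \le \frac{\|\bm{e}_0\| + \|\bm{e}_T\|}{S_T} .
```

Since each of $\|I_1\|$, $\|I_2\|$, $\|I_3\|$ scales as $1/S_T$, squaring produces the $1/(\omega S_T^2)$ shape of the fast term in Theorem 4.2.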
## 5Detailed Comparison with Prior Results
In this section, we compare Theorems[4\.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1)and[4\.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2)with prior finite\-time results for TD\(0\) with linear function approximation\. This comparison is delicate because prior papers use slightly different assumptions on the problem and the algorithm\. We therefore compare the results on a common footing, while emphasizing which unknown quantities are needed to set the algorithmic hyperparameters\. For simplicity, in this section we assumeϕ∞=1\\phi\_\{\\infty\}=1\.
First, we recall our guarantees under the stepsize schedule in \([15](https://arxiv.org/html/2606.24981#S4.E15)\):ηt=1cτmixt\+1log\(t\+3\)\\eta\_\{t\}=\\frac\{1\}\{c\\,\\tau\_\{\\mathrm\{mix\}\}\\sqrt\{t\+1\}\\log\(t\+3\)\}, whereccsatisfies the assumptions of Theorem[4\.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1)\. Under this choice, TD\(0\) with Polyak–Ruppert averaging*simultaneously*achieves the following three guarantees:
- •bounded iterates, independent ofω\\omega,
- •theω\\omega\-robust rate𝒪\(τmixRbase2logTlog\(1/δ\)T\)\\mathcal\{O\}\\Big\(\\frac\{\\tau\_\{\\mathrm\{mix\}\}R\_\{\\mathrm\{base\}\}^\{2\}\\log T\\sqrt\{\\log\(1/\\delta\)\}\}\{\\sqrt\{T\}\}\\Big\),
- •the fastω\\omega\-aware rate𝒪\(τmix2Rbase2log2Tlog\(1/δ\)ωT\)\\mathcal\{O\}\\Big\(\\frac\{\\tau\_\{\\mathrm\{mix\}\}^\{2\}R\_\{\\mathrm\{base\}\}^\{2\}\\log^\{2\}T\\log\(1/\\delta\)\}\{\\omega T\}\\Big\), with probability at least1−δ1\-\\delta\.
Importantly, all guarantees use a single stepsize schedule that is independent of the curvature parameterω\\omega\.
#### Prior work on bounded iterates\.
To the best of our knowledge, our result is the first to show that TD\(0\) with Markovian sampling has bounded iterates with high probability without usingω\\omegain the stepsize or in a curvature\-based stability argument\.
The closest result is Lee and Orabona \([2025](https://arxiv.org/html/2606.24981#bib.bib37)\), but they only provide expectation bounds, whose rate we match\. While one would expect that it is relatively easy to convert expectation results to high\-probability ones for well\-behaved random variables, here it requires completely different machinery based on the Poisson equation\.
Our proof proceeds by establishing the boundedness of the iterates via a carefully designed inductive argument\. This argument uses only the condition⟨𝒈¯\(𝜽\),𝜽−𝜽∗⟩≤0\\langle\{\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\)\},\{\\bm\{\\theta\}\-\\bm\{\\theta\}^\{\*\}\}\\rangle\\leq 0\. We then control the Markovian bias pathwise using a Poisson\-equation\-based decomposition\. This approach is fundamentally different from the Linear Stochastic Approximation \(LSA\)\-stability\-based analyses inSrikant and Ying \([2019](https://arxiv.org/html/2606.24981#bib.bib16)\); Mouet al\.\([2020](https://arxiv.org/html/2606.24981#bib.bib59)\); Chenet al\.\([2022](https://arxiv.org/html/2606.24981#bib.bib33)\); Patilet al\.\([2023](https://arxiv.org/html/2606.24981#bib.bib32)\); Liet al\.\([2024](https://arxiv.org/html/2606.24981#bib.bib57)\); Samsonovet al\.\([2024](https://arxiv.org/html/2606.24981#bib.bib30)\); Durmuset al\.\([2025](https://arxiv.org/html/2606.24981#bib.bib56)\)that result in a bound on the norm of the iterates that depends onω\\omega, hence incompatible with the robust rate\.
Our approach is also very different from using projections, that is, projecting $\bm{\theta}_k$ onto the ball $\mathcal{B}(\bm{0}, \|\bm{\theta}^*\|)$ (see, e.g., Bhandari et al., [2018](https://arxiv.org/html/2606.24981#bib.bib3); Liu and Olshevsky, [2021](https://arxiv.org/html/2606.24981#bib.bib4); Sun et al., [2021](https://arxiv.org/html/2606.24981#bib.bib47); Prashanth et al., [2021](https://arxiv.org/html/2606.24981#bib.bib58); Patil et al., [2023](https://arxiv.org/html/2606.24981#bib.bib32)).
Such projections play an analytic rather than algorithmic role: they guarantee that\|𝒈\(𝜽,⋅\)\|∞\|\\bm\{g\}\(\\bm\{\\theta\},\\cdot\)\|\_\{\\infty\}is always bounded by a constant depending on‖𝜽∗‖\\left\\\|\{\\bm\{\\theta\}^\{\*\}\}\\right\\\|, so that the following terms can be controlled easily using concentration inequalities:
∑kηk2‖𝒈k‖2,∑kηk⟨𝒈k−𝒈¯\(𝜽k\),𝜽k−𝜽∗⟩\.\\sum\_\{k\}\\eta\_\{k\}^\{2\}\\\|\\bm\{g\}\_\{k\}\\\|^\{2\},\\qquad\\sum\_\{k\}\\eta\_\{k\}\\langle\{\\bm\{g\}\_\{k\}\-\\bar\{\\bm\{g\}\}\(\\bm\{\\theta\}\_\{k\}\)\},\{\\bm\{\\theta\}\_\{k\}\-\\bm\{\\theta\}^\{\*\}\}\\rangle\.These terms scale with‖𝜽∗‖2\\left\\\|\{\\bm\{\\theta\}^\{\*\}\}\\right\\\|^\{2\}instead of an unknown bound on‖𝜽k‖2\\left\\\|\{\\bm\{\\theta\}\_\{k\}\}\\right\\\|^\{2\}\. We remark that this kind of modification of TD\(0\) is not used in practice, and it would require an additional hyperparameter\. Moreover, simply bounding‖𝜽∗‖\\left\\\|\{\\bm\{\\theta\}^\{\*\}\}\\right\\\|with the loose bound\(Bhandariet al\.,[2018](https://arxiv.org/html/2606.24981#bib.bib3), Lemma 7\)‖𝜽∗‖≤2r∞ω\(1−λ\)\\left\\\|\{\\bm\{\\theta\}^\{\*\}\}\\right\\\|\\leq\\frac\{2r\_\{\\infty\}\}\{\\sqrt\{\\omega\}\(1\-\\lambda\)\}once again breaks theω\\omega\-independent result required for the robust rate\.
#### Prior work on robust rates\.
To the best of our knowledge, ours is the first result to achieve the robust rate𝒪~\(‖𝜽∗‖2τmixT\)\\widetilde\{\\mathcal\{O\}\}\(\\frac\{\\left\\\|\{\\bm\{\\theta\}^\{\*\}\}\\right\\\|^\{2\}\\tau\_\{\\mathrm\{mix\}\}\}\{\\sqrt\{T\}\}\)with high probability, without any algorithmic modification and with a simple stepsize schedule independent ofω\\omega\. Once one considers the implicit dependence ofTTonτmix\\tau\_\{\\mathrm\{mix\}\}in some prior work, we match the rate in expectation from prior results\(see, e\.g\., Bhandariet al\.,[2018](https://arxiv.org/html/2606.24981#bib.bib3); Liu and Olshevsky,[2021](https://arxiv.org/html/2606.24981#bib.bib4); Lee and Orabona,[2025](https://arxiv.org/html/2606.24981#bib.bib37)\), up to polylog factors\. On the other hand, we require knowledge ofτmix\\tau\_\{\\mathrm\{mix\}\}to set the stepsizes\. Instead, prior work, either implicitly or explicitly, requires the number of iterationsTTto be large enough with respect to some function ofτmix\\tau\_\{\\mathrm\{mix\}\}for the rate to be𝒪~\(1T\)\\widetilde\{\\mathcal\{O\}\}\(\\frac\{1\}\{\\sqrt\{T\}\}\)\.
Our analysis is fundamentally different from previous ones, with the notable exception of the expectation result of Lee and Orabona ([2025](https://arxiv.org/html/2606.24981#bib.bib37)) that we already discussed. Indeed, the stability-based approach commonly used in the literature, which relies on ([16](https://arxiv.org/html/2606.24981#S4.E16)) to unroll $\bm{\theta}_t - \bm{\theta}^*$, would typically result in a bound on $\|\bm{\theta}_t - \bm{\theta}^*\|$ that scales like $\sum_{i=1}^{t} (1 - \lambda_{\min} \eta)^i$ using inequalities similar to ([17](https://arxiv.org/html/2606.24981#S5.E17)), which only controls $\|\bm{\theta}_t\|$ at the scale $\mathcal{O}(1/\lambda_{\min})$. This dependence is incompatible with the $\omega$-independent bounded-iterates guarantee in Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1).
#### Prior work on fast rates\.
The most relevant prior works on fast rates are Samsonov et al. ([2024](https://arxiv.org/html/2606.24981#bib.bib30)); Durmus et al. ([2025](https://arxiv.org/html/2606.24981#bib.bib56)). (Footnote 3: As observed by Durmus et al. ([2025](https://arxiv.org/html/2606.24981#bib.bib56), p. 7), the results in Mou et al. ([2020](https://arxiv.org/html/2606.24981#bib.bib59)) are based on restrictive assumptions that do not allow a fair comparison with ours and previous results, so we omit them here.) These papers view TD(0) with linear function approximation as a special case of LSA. They prove that there exist constants $a > 0$, $\varkappa_p > 0$, and $\alpha_{p,\infty} > 0$ (depending on $p$ and possibly on $\tau_{\mathrm{mix}}^{-1}$) such that $p\, \alpha_{p,\infty} \leq 1/2$ and TD(0) satisfies the following exponential stability condition: for any moment order $p > 0$ and any stepsize $\alpha \in (0, \alpha_{p,\infty})$
$$\mathbb{E}^{1/p}\left[ \|\Gamma^{\alpha}_{m:n} u\|^p \right] \leq \varkappa_p (1 - \alpha a)^n \|u\|, \quad \text{for any } u \in \mathbb{R}^d, \quad 1 \leq m \leq n, \tag{17}$$
where $\Gamma^{\alpha}_{m:n} = \prod_{k=m}^{n} (\mathbf{I} - \alpha \bm{A}_{Z_k})$ is a product of random matrices. This condition allows one to unroll the recursion and decompose $\bm{\theta}_t - \bm{\theta}^*$ as
$$\bm{\theta}_t - \bm{\theta}^* = \left( \mathbf{I} - \alpha \bm{A}_{Z_t} \right) (\bm{\theta}_{t-1} - \bm{\theta}^*) + \alpha \bm{\xi}_{Z_t} = \Gamma^{\alpha}_{1:t} (\bm{\theta}_0 - \bm{\theta}^*) + \alpha \sum_{i=1}^{t} \Gamma^{\alpha}_{i+1:t} \bm{\xi}_{Z_i}.$$
Then, one can use $p$-th moment bounds on $\Gamma^{\alpha}_{m:n}$ to control $\|\bm{\theta}_t - \bm{\theta}^*\|_{\bm{\Sigma}}$.
Under this approach, and recalling that $\omega\geq(1-\gamma)\lambda_{\min}(\bm{\Phi}^{\top}\bm{D}\bm{\Phi})$, Samsonov et al. ([2024](https://arxiv.org/html/2606.24981#bib.bib30), Theorem 6) show that, with high probability, a data-dropping modification of TD(0) (Nagaraj et al., [2020](https://arxiv.org/html/2606.24981#bib.bib36)) achieves the rate $\mathcal{O}\bigl(\frac{\tau_{\mathrm{mix}}\|\bm{\theta}^{*}\|^{2}\log^{2}(T/\delta)}{(1-\gamma)^{2}\lambda_{\min}^{2}T}\bigr)$ for $\|\bar{\bm{\theta}}_{T}-\bm{\theta}^{*}\|^{2}_{\bm{\Sigma}}$. Their bound also includes an additional exponentially decaying term that depends on $\|\bm{\theta}_{0}-\bm{\theta}^{*}\|^{2}$. The fixed stepsize $\alpha$ in Samsonov et al. ([2024](https://arxiv.org/html/2606.24981#bib.bib30)) is $\omega$-agnostic, but the data-dropping strategy requires knowledge of $\tau_{\mathrm{mix}}$ and is not commonly used in practice.
In parallel, Durmus et al. ([2025](https://arxiv.org/html/2606.24981#bib.bib56), Corollary 2) consider a varying stepsize $\eta_{t}=\mathcal{O}(t^{-2/3})$ that depends on $\lambda_{\min}$ and $\tau_{\mathrm{mix}}$ and obtain a better high-probability guarantee in which $\|\bar{\bm{\theta}}_{T}-\bm{\theta}^{*}\|^{2}_{\bm{\Sigma}}$ decays at rate $\mathcal{O}\bigl(\frac{\tau_{\mathrm{mix}}\|\bm{\theta}^{*}\|^{2}\log(1/\delta)}{(1-\gamma)^{2}\lambda_{\min}T}\bigr)$, plus an exponentially decaying term depending on $\|\bm{\theta}_{0}-\bm{\theta}^{*}\|^{2}$.
To compare these results with our bounds, we can use equation ([5](https://arxiv.org/html/2606.24981#S3.E5)) to translate our bound on $f$ into a bound on $\|\bar{\bm{\theta}}_{T}-\bm{\theta}^{*}\|^{2}_{\bm{\Sigma}}$. Hence, we obtain the fast rate $\widetilde{\mathcal{O}}\bigl(\frac{\tau_{\mathrm{mix}}^{2}R_{\mathrm{base}}^{2}}{(1-\gamma)^{2}\lambda_{\min}T}\bigr)$ in Theorem [4.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2). Up to logarithmic factors, this rate matches the best-known high-probability results in the Markovian sampling regime (Samsonov et al., [2024](https://arxiv.org/html/2606.24981#bib.bib30); Durmus et al., [2025](https://arxiv.org/html/2606.24981#bib.bib56)) and the i.i.d. stationary sampling regime (Li et al., [2024](https://arxiv.org/html/2606.24981#bib.bib57)) for all relevant non-mixing quantities. Indeed, our $R_{\mathrm{base}}^{2}$ has the same order of magnitude as the $\|\bm{\theta}^{*}\|^{2}$ that appears in the other bounds. However, our $\tau_{\mathrm{mix}}^{2}$ dependence is worse by one factor of $\tau_{\mathrm{mix}}$ than the linear $\tau_{\mathrm{mix}}$ dependence obtained in the Markovian bounds of Samsonov et al. ([2024](https://arxiv.org/html/2606.24981#bib.bib30)) and Durmus et al. ([2025](https://arxiv.org/html/2606.24981#bib.bib56)). In addition, our initial-error term $\|\bm{\theta}_{0}-\bm{\theta}^{*}\|^{2}$ decays at a $1/T$ rate rather than exponentially, since our analysis does not rely on a contraction-based argument.
The worse dependence on $\tau_{\mathrm{mix}}$ could be a side effect of our Poisson-equation-based analysis. On the other hand, it could also be due to the fact that we do not use a data-dropping method (Samsonov et al., [2024](https://arxiv.org/html/2606.24981#bib.bib30)) or knowledge of $\lambda_{\min}$ (Durmus et al., [2025](https://arxiv.org/html/2606.24981#bib.bib56)).
More generally, it is unclear whether any of the above results are optimal in their respective settings. To the best of our knowledge, the closest lower bound is $\Omega\bigl(\frac{\|\bm{\theta}^{*}\|^{2}}{(1-\gamma)\lambda_{\min}T}\bigr)$ due to Li et al. ([2024](https://arxiv.org/html/2606.24981#bib.bib57), Theorem 2), which considers TD(0) with linear function approximation under the easier setting of i.i.d. sampling. Thus, when data dropping is not allowed, projections are not used, and $\lambda_{\min}$ is unknown, the optimal dependence of our rate on $\tau_{\mathrm{mix}}$ and $1-\gamma$ remains open.
A reason one might expect such a dependence on $\tau_{\mathrm{mix}}$ in the convergence rate or stepsize is that the TD fixed point $\bm{\theta}^{*}$ is defined with respect to the stationary distribution, and a slowly mixing Markov chain can require more samples to accurately estimate stationary expectations (Nagaraj et al., [2020](https://arxiv.org/html/2606.24981#bib.bib36)). At the same time, tabular results show that one should not expect a multiplicative mixing-time dependence to be necessary in all settings: Li et al. ([2020](https://arxiv.org/html/2606.24981#bib.bib63)) show that, for asynchronous Q-learning and hence tabular TD, the leading statistical term can match the i.i.d. sampling setting, with only an additive transient $\tau_{\mathrm{mix}}$ cost. In the linear function approximation setting, existing results still account for the mixing time in various ways:

- stepsize conditions that explicitly depend on $\tau_{\mathrm{mix}}$ appear in (Srikant and Ying, [2019](https://arxiv.org/html/2606.24981#bib.bib16); Mitra, [2025](https://arxiv.org/html/2606.24981#bib.bib5); Durmus et al., [2025](https://arxiv.org/html/2606.24981#bib.bib56));
- data-dropping approaches that retain only one out of $\mathcal{O}(\tau_{\mathrm{mix}})$ samples are studied in (Patil et al., [2023](https://arxiv.org/html/2606.24981#bib.bib32); Samsonov et al., [2024](https://arxiv.org/html/2606.24981#bib.bib30));
- several results require the total sample budget $T$ to be sufficiently large relative to $\tau_{\mathrm{mix}}$ (Lee and Orabona, [2025](https://arxiv.org/html/2606.24981#bib.bib37); Li et al., [2026](https://arxiv.org/html/2606.24981#bib.bib55)).
## Acknowledgments
We thank Chi\-Jen Lu, Li Chen, and our anonymous reviewers for their insightful comments\. We also acknowledge the use of large language models, including GPT and Gemini, for editorial assistance and for suggesting that Poisson\-equation\-based methods may provide a useful way to handle randomness in our analysis\. The authors independently verified, developed, and finalized all mathematical arguments, results, and references\.
## References
- A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pp. 1691–1692.
- E. Blaser and S. Zhang (2024) Asymptotic and finite sample analysis of nonexpansive stochastic approximations with Markovian noise. arXiv preprint arXiv:2409.19546.
- S. Chandak and V. S. Borkar (2025) A concentration bound for TD(0) with function approximation. Stochastic Systems. External Links: [Document](https://dx.doi.org/10.1287/stsy.2023.0055).
- Z. Chen, S. Zhang, T. T. Doan, J. Clarke, and S. T. Maguluri (2022) Finite-sample analysis of nonlinear stochastic approximation with applications in reinforcement learning. Automatica 146, pp. 110623.
- G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor (2018) Finite sample analyses for TD(0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- A. Durmus, E. Moulines, A. Naumov, and S. Samsonov (2025) Finite-time high-probability bounds for Polyak–Ruppert averaged iterates of linear stochastic approximation. Mathematics of Operations Research 50(2), pp. 935–964.
- V. R. Konda (2002) Actor-critic algorithms. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA. Available via DSpace@MIT. External Links: [Link](https://dspace.mit.edu/handle/1721.1/8120).
- N. Korda and P. La (2015) On TD(0) with function approximation: concentration bounds and a centered variant with exponential convergence. In International Conference on Machine Learning, pp. 626–634.
- H. Kushner (2010) Stochastic approximation: a survey. Wiley Interdisciplinary Reviews: Computational Statistics 2(1), pp. 87–96.
- C. Lakshminarayanan and C. Szepesvári (2018) Linear stochastic approximation: how far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics, pp. 1347–1355.
- W. Lee and F. Orabona (2025) A robust $\widetilde{\mathcal{O}}(1/\sqrt{T})$ rate for unprojected TD learning with linear function approximation. arXiv preprint arXiv:2506.01052.
- D. A. Levin and Y. Peres (2017) Markov chains and mixing times. Vol. 107, American Mathematical Society.
- G. Li, Y. Wei, Y. Chi, Y. Gu, and Y. Chen (2020) Sample complexity of asynchronous Q-learning: sharper analysis and variance reduction. Advances in Neural Information Processing Systems 33, pp. 7031–7043.
- G. Li, W. Wu, Y. Chi, C. Ma, A. Rinaldo, and Y. Wei (2024) High-probability sample complexities for policy evaluation with linear function approximation. IEEE Transactions on Information Theory 70(8), pp. 5969–5999.
- Y. Li, M. Schmidt, R. Babanezhad, and S. Vaswani (2026) Towards parameter-free temporal difference learning. arXiv preprint arXiv:2603.02577.
- R. Liu and A. Olshevsky (2021) Temporal difference learning as gradient splitting. In International Conference on Machine Learning, pp. 6905–6913.
- S. Mannor, Y. Mansour, and A. Tamar (2026) Reinforcement learning: foundations. Cambridge University Press. External Links: [Link](https://sites.google.com/view/rlfoundations/home).
- S. P. Meyn and R. L. Tweedie (2012) Markov chains and stochastic stability. Springer Science & Business Media.
- A. Mitra (2025) A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control 70(2), pp. 1388–1394. External Links: [Document](https://dx.doi.org/10.1109/TAC.2024.3469328).
- W. Mou, C. J. Li, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan (2020) On linear stochastic approximation: fine-grained Polyak-Ruppert and non-asymptotic concentration. In Proceedings of Thirty Third Conference on Learning Theory, pp. 2947–2997.
- D. Nagaraj, X. Wu, G. Bresler, P. Jain, and P. Netrapalli (2020) Least squares regression with Markovian data: fundamental limits and algorithms. Advances in Neural Information Processing Systems 33, pp. 16666–16676.
- Y. Ollivier (2018) Approximate temporal difference learning is a gradient descent for reversible policies. arXiv preprint arXiv:1805.00869.
- G. Patil, L. A. Prashanth, D. Nagaraj, and D. Precup (2023) Finite time analysis of temporal difference learning with linear function approximation: tail averaging and regularisation. In International Conference on Artificial Intelligence and Statistics, pp. 5438–5448.
- I. Pinelis (1994) Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, pp. 1679–1706.
- L. A. Prashanth, N. Korda, and R. Munos (2021) Concentration bounds for temporal difference learning with linear function approximation: the case of batch data and uniform sampling. Machine Learning 110(3), pp. 559–618.
- A. Raj, P. Joulani, A. Gyorgy, and C. Szepesvári (2022) Faster rates, adaptive algorithms, and finite-time bounds for linear composition optimization and gradient TD learning. In International Conference on Artificial Intelligence and Statistics, pp. 7176–7186.
- S. Samsonov, D. Tiapkin, A. Naumov, and E. Moulines (2024) Improved high-probability bounds for the temporal difference learning algorithm via exponential stability. In The Thirty Seventh Annual Conference on Learning Theory, pp. 4511–4547.
- R. Srikant and L. Ying (2019) Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pp. 2803–2830.
- T. Sun, D. Li, and B. Wang (2022) Finite-time analysis of adaptive temporal difference learning with deep neural networks. Advances in Neural Information Processing Systems 35, pp. 19592–19604.
- T. Sun, H. Shen, T. Chen, and D. Li (2021) Adaptive temporal difference learning with linear function approximation. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(12), pp. 8812–8824.
- R. S. Sutton and A. G. Barto (1998) Reinforcement learning: an introduction. Vol. 1, MIT Press.
- R. S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine Learning 3, pp. 9–44.
- J. Tsitsiklis and B. Van Roy (1996) Analysis of temporal-difference learning with function approximation. Advances in Neural Information Processing Systems 9.
## Appendix APoisson Toolkit for Markov Noise
In this section, we develop a Poisson-equation toolkit for Markov concentration. Throughout, let $(Z_{t})_{t\geq 0}$ be the Markov chain on

$$\mathcal{Z}:=\{(i,j)\in\mathcal{S}\times\mathcal{S}:P^{\mu}(i,j)>0\}$$

with kernel $P_{Z}$ and stationary distribution $\pi_{Z}$, as defined in Section [3](https://arxiv.org/html/2606.24981#S3). The state space $\mathcal{Z}$ is finite.
### A\.1Poisson Equation under Geometric Mixing
We start with a standard Poisson equation lemma for $\mathbb{R}^{d}$-valued functions of geometrically mixing chains [Meyn and Tweedie, [2012](https://arxiv.org/html/2606.24981#bib.bib54), Theorem 17.4.3].
###### Lemma A\.1\(Poisson solution under geometric mixing\)\.
Let $h:\mathcal{Z}\to\mathbb{R}^{d}$ be a bounded measurable function with stationary mean zero, i.e., $\mathbb{E}_{\pi_{Z}}[h(Z)]=0$. Then, the Poisson equation

$$u-P_{Z}u=h,\qquad\text{on }\mathcal{Z}, \tag{18}$$

where $(P_{Z}u)(z):=\mathbb{E}[u(Z_{1})\mid Z_{0}=z]$, admits a solution

$$u^{*}(z):=\sum_{k=0}^{\infty}(P_{Z}^{k}h)(z),$$

where $(P_{Z}^{k}h)(z):=\mathbb{E}[h(Z_{k})\mid Z_{0}=z]$. Moreover,

$$\|u^{*}\|_{\infty}\leq 16\,\tau_{\mathrm{mix}}\|h\|_{\infty},\qquad\text{where }\|h\|_{\infty}:=\sup_{z\in\mathcal{Z}}\|h(z)\|\,.$$
###### Proof\.
Define $u^{*}(z):=\sum_{k=0}^{\infty}(P_{Z}^{k}h)(z)$. Since $\mathbb{E}_{\pi_{Z}}[h(Z)]=0$, for every $k\geq 0$ and every $z\in\mathcal{Z}$,

$$(P_{Z}^{k}h)(z)=\int h(z^{\prime})\{P_{Z}^{k}(z,dz^{\prime})-\pi_{Z}(dz^{\prime})\}\,.$$

Using the geometric mixing bound from Lemma [3.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3), we have for all $k\geq 0$,

$$\|P_{Z}^{k}h\|_{\infty}=\sup_{z\in\mathcal{Z}}\left\|\mathbb{E}\bigl[h(Z_{k})\mid Z_{0}=z\bigr]\right\|\leq\sup_{z\in\mathcal{Z}}2\|h\|_{\infty}\|P_{Z}^{k}(z,\cdot)-\pi_{Z}\|_{\mathrm{TV}}\leq 8\cdot 2^{-k/\tau_{\mathrm{mix}}}\|h\|_{\infty}\,.$$

Thus, the series defining $u^{*}$ converges absolutely and uniformly with

$$\|u^{*}\|_{\infty}\leq\sum_{k=0}^{\infty}\|P_{Z}^{k}h\|_{\infty}\leq\frac{8}{1-2^{-1/\tau_{\mathrm{mix}}}}\|h\|_{\infty}\eqqcolon C(\tau_{\mathrm{mix}})\|h\|_{\infty}\,.$$

Moreover,

$$(P_{Z}u^{*})(z)=\sum_{k=0}^{\infty}P_{Z}^{k+1}h(z),$$

so

$$u^{*}(z)-(P_{Z}u^{*})(z)=(P_{Z}^{0}h)(z)=h(z),$$

that is, $u^{*}$ solves ([18](https://arxiv.org/html/2606.24981#A1.E18)).

Let $x=1/\tau_{\mathrm{mix}}$. We scale the function $C(\tau_{\mathrm{mix}})$ by $1/\tau_{\mathrm{mix}}$ and consider the new function $\tilde{C}(x)$:

$$\tilde{C}(x):=\frac{1}{\tau_{\mathrm{mix}}}C\left(\frac{1}{x}\right)=x\cdot\frac{8\cdot 2^{x}}{2^{x}-1}=\frac{8x}{1-2^{-x}}\,.$$

We claim that $\tilde{C}(x)$ is monotonically increasing on $(0,1]$. Let $t=x\ln 2$. Then $\tilde{C}(x)$ can be written as $\frac{8}{\ln 2}\cdot\frac{t}{1-e^{-t}}$. Consider the derivative of $f(t)=\frac{t}{1-e^{-t}}$ for $t>0$:

$$f^{\prime}(t)=\frac{(1-e^{-t})-te^{-t}}{(1-e^{-t})^{2}}=\frac{1-(1+t)e^{-t}}{(1-e^{-t})^{2}}\,.$$

Using the elementary inequality $e^{t}>1+t$ for $t>0$, we have $1>(1+t)e^{-t}$, which implies $f^{\prime}(t)>0$. Thus, $\tilde{C}(x)$ is strictly increasing on $(0,1]$. Consequently, the maximum value of $\tilde{C}(x)$ on the interval $(0,1]$ occurs at $x=1$:

$$\tilde{C}(x)\leq\tilde{C}(1)=\frac{8\cdot 1}{1-2^{-1}}=\frac{8}{0.5}=16\,.$$

Therefore, we have $C(\tau_{\mathrm{mix}})\leq 16\,\tau_{\mathrm{mix}}$, which completes the proof. ∎
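As a numerical illustration of Lemma A.1, the sketch below sums the series $\sum_{k}P_{Z}^{k}h$ on a small finite chain and evaluates the constant $8/(1-2^{-1/\tau_{\mathrm{mix}}})$. Everything here is an assumption made for the example: the function names `stationary_distribution`, `poisson_solution`, and `mixing_constant`, the truncation tolerance, and the toy chain used to exercise them do not come from the paper.

```python
import numpy as np


def stationary_distribution(P):
    """Left eigenvector of the row-stochastic matrix P for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()


def poisson_solution(P, h, tol=1e-12, max_terms=100_000):
    """Truncated series u* = sum_k P^k h for a finite chain.

    P: (n, n) row-stochastic matrix.
    h: (n, d) array with zero mean under the stationary distribution.
    The loop stops once the next term is below tol in sup norm.
    """
    u = np.zeros_like(h, dtype=float)
    term = np.asarray(h, dtype=float)
    for _ in range(max_terms):
        u = u + term
        term = P @ term
        if np.abs(term).max() < tol:
            break
    return u


def mixing_constant(tau_mix):
    """C(tau_mix) = 8 / (1 - 2^(-1/tau_mix)) from the proof of Lemma A.1."""
    return 8.0 / (1.0 - 2.0 ** (-1.0 / tau_mix))
```

For a small ergodic chain, one can center an arbitrary function `f` as `h = f - pi @ f`, call `poisson_solution(P, h)`, and check that `u - P @ u` reproduces `h` up to the truncation tolerance. Comparing `mixing_constant(tau)` with `16 * tau` for a few values `tau >= 1` illustrates the last step of the proof.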
### A\.2Bounds for TD\-type Poisson solutions
We now specialize Lemma [A.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1) to the functions arising from TD. For $z=(s,s^{\prime})\in\mathcal{Z}$, define the scalar function

$$h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}(z):=\bigl\langle\bm{g}(\bm{\theta},z)-\bar{\bm{g}}(\bm{\theta}),\,\bm{\theta}-\bm{\theta}^{*}\bigr\rangle,$$

and the vector-valued functions

$$\begin{aligned}h(z)&:=\varepsilon_{z}\bm{\phi}(s)\in\mathbb{R}^{d},&&\text{where we recall }\varepsilon_{z}=r(s,s^{\prime})+\gamma\bm{\phi}(s^{\prime})^{\top}\bm{\theta}^{*}-\bm{\phi}(s)^{\top}\bm{\theta}^{*},\\h_{\bm{\theta}}(z)&:=\bm{\delta}_{z}\bigl(\bm{\theta}-\bm{\theta}^{*}\bigr)\in\mathbb{R}^{d},&&\text{where we define }\bm{\delta}_{z}\coloneqq\bm{A}-\bm{A}_{z}\in\mathbb{R}^{d\times d}\,.\end{aligned}$$

By construction, $h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}$, $h$, and $h_{\bm{\theta}}$ all have stationary mean zero under $\pi_{Z}$. We study properties of the corresponding Poisson-equation solutions in the following lemma.
###### Lemma A\.2\(Bounds and Lipschitz properties of TD Poisson equation solutions\)\.
In the setting of Section [3](https://arxiv.org/html/2606.24981#S3), let $\phi_{\infty}$, $r_{\infty}$, and $\tau_{\mathrm{mix}}$ be the constants introduced in Lemmas [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3). Then the following statements hold.

1. *(i)* Growth bounds. For all $z\in\mathcal{Z}$,

   $$\begin{aligned}|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}(z)|&\leq(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\bm{\theta}\|)\,\|\bm{\theta}-\bm{\theta}^{*}\|,\\\|h(z)\|&\leq r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}\|\bm{\theta}^{*}\|,\\\|h_{\bm{\theta}}(z)\|&\leq 4\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{*}\|\,.\end{aligned}$$

2. *(ii)* Bounded Poisson solutions. Let $u^{*}_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}:\mathcal{Z}\to\mathbb{R}$ be the infinite series solution of the Poisson equation

   $$u^{*}_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}-P_{Z}u^{*}_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}=h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}},$$

   let $u^{*}:\mathcal{Z}\to\mathbb{R}^{d}$ be the infinite series solution of

   $$u^{*}-P_{Z}u^{*}=h,$$

   and let $u^{*}_{\bm{\theta}}:\mathcal{Z}\to\mathbb{R}^{d}$ be the infinite series solution of

   $$u^{*}_{\bm{\theta}}-P_{Z}u^{*}_{\bm{\theta}}=h_{\bm{\theta}}\,.$$

   Then,

   $$\begin{aligned}\|u^{*}_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}\|_{\infty}&\leq 16\,\tau_{\mathrm{mix}}\|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}\|_{\infty}\leq 16\,\tau_{\mathrm{mix}}(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\bm{\theta}\|)\,\|\bm{\theta}-\bm{\theta}^{*}\|,\\\|u^{*}\|_{\infty}&\leq 16\,\tau_{\mathrm{mix}}\|h\|_{\infty}\leq 16\,\tau_{\mathrm{mix}}(r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}\|\bm{\theta}^{*}\|),\\\|u^{*}_{\bm{\theta}}\|_{\infty}&\leq 16\,\tau_{\mathrm{mix}}\|h_{\bm{\theta}}\|_{\infty}\leq 64\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{*}\|\,.\end{aligned}$$

3. *(iii)* Local Lipschitz continuity in $\bm{\theta}$ and $\bm{\theta}^{\prime}$. With the definitions in the previous point, we also have

   $$\begin{aligned}\|u^{*}_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}-u^{*}_{\bm{\theta}^{\prime},\bm{\theta}^{\prime}-\bm{\theta}^{*}}\|_{\infty}&\leq 16\,\tau_{\mathrm{mix}}\|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}-h_{\bm{\theta}^{\prime},\bm{\theta}^{\prime}-\bm{\theta}^{*}}\|_{\infty}\\&\leq 16\,\tau_{\mathrm{mix}}(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\bm{\theta}\|)\|\bm{\theta}-\bm{\theta}^{\prime}\|\\&\quad+64\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}\|\bm{\theta}^{\prime}-\bm{\theta}^{*}\|\,\|\bm{\theta}-\bm{\theta}^{\prime}\|,\end{aligned}$$

   and

   $$\|u^{*}_{\bm{\theta}}-u^{*}_{\bm{\theta}^{\prime}}\|_{\infty}\leq 64\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{\prime}\|\,.$$
###### Proof\.
*(i) Growth bounds.* By the feature and reward bounds in Lemma [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2), for $z=(s,s^{\prime})$ we have

$$\begin{aligned}\|\bm{g}(\bm{\theta},z)\|&=\bigl|r(s,s^{\prime})+\gamma\bm{\phi}(s^{\prime})^{\top}\bm{\theta}-\bm{\phi}(s)^{\top}\bm{\theta}\bigr|\cdot\|\bm{\phi}(s)\|\\&\leq\bigl(r_{\infty}+(\gamma+1)\phi_{\infty}\|\bm{\theta}\|\bigr)\,\phi_{\infty}\leq\bigl(r_{\infty}+2\phi_{\infty}\|\bm{\theta}\|\bigr)\,\phi_{\infty}\,.\end{aligned}$$

Taking expectations under $\pi_{Z}$ yields the same type of bound for $\|\bar{\bm{g}}(\bm{\theta})\|$, hence

$$\|\bm{g}(\bm{\theta},z)-\bar{\bm{g}}(\bm{\theta})\|\leq 2(r_{\infty}+2\phi_{\infty}\|\bm{\theta}\|)\,\phi_{\infty}=2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\bm{\theta}\|\,.$$

Therefore,

$$|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}(z)|=\bigl|\langle\bm{g}(\bm{\theta},z)-\bar{\bm{g}}(\bm{\theta}),\bm{\theta}-\bm{\theta}^{*}\rangle\bigr|\leq(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\bm{\theta}\|)\,\|\bm{\theta}-\bm{\theta}^{*}\|\,.$$

For $h$, recall that

$$\varepsilon_{z}=r(s,s^{\prime})+\gamma\bm{\phi}(s^{\prime})^{\top}\bm{\theta}^{*}-\bm{\phi}(s)^{\top}\bm{\theta}^{*}\,.$$

Then,

$$\|h(z)\|=\|\varepsilon_{z}\bm{\phi}(s)\|\leq r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}\|\bm{\theta}^{*}\|\,.$$

For $h_{\bm{\theta}}$, recall that

$$\bm{A}_{z}=\bm{\phi}(s)\bigl(\bm{\phi}(s)-\gamma\bm{\phi}(s^{\prime})\bigr)^{\top}\,.$$

Then,

$$\|\bm{A}_{z}\|_{\mathrm{op}}\leq\|\bm{\phi}(s)\|\bigl(\|\bm{\phi}(s)\|+\gamma\|\bm{\phi}(s^{\prime})\|\bigr)\leq(1+\gamma)\phi_{\infty}^{2}\leq 2\phi_{\infty}^{2}\,.$$

Consequently, $\|\bm{A}\|_{\mathrm{op}}\leq 2\phi_{\infty}^{2}$ and

$$\|\bm{\delta}_{z}\|_{\mathrm{op}}\leq\|\bm{A}\|_{\mathrm{op}}+\|\bm{A}_{z}\|_{\mathrm{op}}\leq 4\phi_{\infty}^{2}\,.$$

Thus,

$$\|h_{\bm{\theta}}(z)\|=\|\bm{\delta}_{z}(\bm{\theta}-\bm{\theta}^{*})\|\leq 4\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{*}\|\,.$$
*(ii) Poisson solution sup-norm bounds.* This follows from Lemma [A.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1) applied to the real-valued function $h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}$ and the $\mathbb{R}^{d}$-valued functions $h$ and $h_{\bm{\theta}}$.
*(iii) Lipschitz continuity in $\bm{\theta}$.* *Scalar case $h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}$.* For each $z=(s,s^{\prime})$ and $\bm{\theta}\in\mathbb{R}^{d}$, recall that

$$\bm{A}_{z}:=\bm{\phi}(s)\bigl(\bm{\phi}(s)-\gamma\bm{\phi}(s^{\prime})\bigr)^{\top},\qquad\bm{b}_{z}:=r(s,s^{\prime})\bm{\phi}(s),$$

and

$$\bm{A}=\mathbb{E}_{\pi_{Z}}[\bm{A}_{Z}],\qquad\bm{b}=\mathbb{E}_{\pi_{Z}}[\bm{b}_{Z}],$$

so that $\bar{\bm{g}}(\bm{\theta})=-\bm{A}\bm{\theta}+\bm{b}$ and $\bm{g}(\bm{\theta},z)=-\bm{A}_{z}\bm{\theta}+\bm{b}_{z}$. Also, recall that

$$\bm{\delta}_{z}=\bm{A}-\bm{A}_{z},\qquad\text{and let}\qquad\bm{\zeta}_{z}:=\bm{b}_{z}-\bm{b}\,.$$

Then

$$\bm{g}(\bm{\theta},z)-\bar{\bm{g}}(\bm{\theta})=\bm{\delta}_{z}\bm{\theta}+\bm{\zeta}_{z},$$

and hence

$$h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}(z)=\bigl\langle\bm{\delta}_{z}\bm{\theta}+\bm{\zeta}_{z},\,\bm{\theta}-\bm{\theta}^{*}\bigr\rangle\,.$$

As above, $\|\bm{\delta}_{z}\|_{\mathrm{op}}\leq 4\phi_{\infty}^{2}$, and $\|\bm{b}_{z}\|\leq r_{\infty}\phi_{\infty}$ implies $\|\bm{\zeta}_{z}\|\leq 2r_{\infty}\phi_{\infty}$. We have

$$\begin{aligned}
|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}(z)-h_{\bm{\theta}^{\prime},\bm{\theta}^{\prime}-\bm{\theta}^{*}}(z)|
&=\bigl|\langle\bm{\delta}_{z}\bm{\theta}+\bm{\zeta}_{z},\,\bm{\theta}-\bm{\theta}^{*}\rangle-\langle\bm{\delta}_{z}\bm{\theta}^{\prime}+\bm{\zeta}_{z},\,\bm{\theta}^{\prime}-\bm{\theta}^{*}\rangle\bigr|\\
&=\bigl|\langle\bm{\delta}_{z}\bm{\theta}+\bm{\zeta}_{z},\,\bm{\theta}-\bm{\theta}^{*}-\bm{\theta}^{\prime}+\bm{\theta}^{*}\rangle+\langle\bm{\delta}_{z}\bm{\theta}+\bm{\zeta}_{z}-(\bm{\delta}_{z}\bm{\theta}^{\prime}+\bm{\zeta}_{z}),\,\bm{\theta}^{\prime}-\bm{\theta}^{*}\rangle\bigr|\\
&\leq\|\bm{\delta}_{z}\bm{\theta}+\bm{\zeta}_{z}\|\,\|\bm{\theta}-\bm{\theta}^{\prime}\|+\|\bm{\delta}_{z}(\bm{\theta}-\bm{\theta}^{\prime})\|\,\|\bm{\theta}^{\prime}-\bm{\theta}^{*}\|\\
&\leq\bigl(4\phi_{\infty}^{2}\|\bm{\theta}\|+2r_{\infty}\phi_{\infty}\bigr)\|\bm{\theta}-\bm{\theta}^{\prime}\|+4\phi_{\infty}^{2}\|\bm{\theta}^{\prime}-\bm{\theta}^{*}\|\,\|\bm{\theta}-\bm{\theta}^{\prime}\|\,.
\end{aligned}$$

The second inequality follows from applying Lemma [A.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1) and linearity of the Poisson equation:

$$\|u^{*}_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}-u^{*}_{\bm{\theta}^{\prime},\bm{\theta}^{\prime}-\bm{\theta}^{*}}\|_{\infty}\leq 16\tau_{\mathrm{mix}}\|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}-h_{\bm{\theta}^{\prime},\bm{\theta}^{\prime}-\bm{\theta}^{*}}\|_{\infty}\,.$$
*Vector case $h_{\bm{\theta}}$.* For $h_{\bm{\theta}}$,

$$h_{\bm{\theta}}(z)-h_{\bm{\theta}^{\prime}}(z)=\bm{\delta}_{z}(\bm{\theta}-\bm{\theta}^{\prime}),$$

so

$$\|h_{\bm{\theta}}(z)-h_{\bm{\theta}^{\prime}}(z)\|\leq\|\bm{\delta}_{z}\|_{\mathrm{op}}\|\bm{\theta}-\bm{\theta}^{\prime}\|\leq 4\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{\prime}\|\,.$$

Taking the supremum over $z$ gives

$$\|h_{\bm{\theta}}-h_{\bm{\theta}^{\prime}}\|_{\infty}\leq 4\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{\prime}\|\,.$$

Thus, by Lemma [A.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1) and linearity of the Poisson equation,

$$\|u^{*}_{\bm{\theta}}-u^{*}_{\bm{\theta}^{\prime}}\|_{\infty}\leq 64\tau_{\mathrm{mix}}\phi_{\infty}^{2}\|\bm{\theta}-\bm{\theta}^{\prime}\|\,.$$

∎
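The scalar Lipschitz bound in part (iii) can be sanity-checked numerically. The sketch below is illustrative only: it assumes a small, randomly generated finite set of transitions with weights standing in for $\pi_Z$, and all names (`lipschitz_gap`, `random_ball`, the default parameters) are hypothetical. It compares the left-hand side $\sup_z|h_{\bm{\theta},\bm{\theta}-\bm{\theta}^{*}}(z)-h_{\bm{\theta}^{\prime},\bm{\theta}^{\prime}-\bm{\theta}^{*}}(z)|$ with the right-hand side of the displayed bound.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def matvec(M, v):
    return [dot(row, v) for row in M]

def random_ball(rng, d, radius):
    """Random vector with Euclidean norm at most `radius`."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    scale = radius * rng.random() / norm(v)
    return [scale * x for x in v]

def lipschitz_gap(seed=0, d=3, n=5, gamma=0.9, phi_inf=1.5, r_inf=2.0):
    """Return (lhs, rhs) of the scalar Lipschitz bound on a random toy instance."""
    rng = random.Random(seed)
    # Hypothetical finite set of transitions z = (s, s') with weights pi.
    feats = [(random_ball(rng, d, phi_inf), random_ball(rng, d, phi_inf))
             for _ in range(n)]
    rewards = [rng.uniform(-r_inf, r_inf) for _ in range(n)]
    w = [rng.random() for _ in range(n)]
    pi = [x / sum(w) for x in w]
    # A_z = phi(s) (phi(s) - gamma phi(s'))^T and b_z = r(s, s') phi(s).
    A_z = [[[p[i] * (p[j] - gamma * q[j]) for j in range(d)] for i in range(d)]
           for p, q in feats]
    b_z = [[r * x for x in p] for r, (p, _) in zip(rewards, feats)]
    A = [[sum(pi[k] * A_z[k][i][j] for k in range(n)) for j in range(d)]
         for i in range(d)]
    b = [sum(pi[k] * b_z[k][i] for k in range(n)) for i in range(d)]
    theta, theta2, theta_star = (random_ball(rng, d, 5.0) for _ in range(3))

    def h(k, th):
        # h(z) = <delta_z th + zeta_z, th - theta*>, delta_z = A - A_z, zeta_z = b_z - b.
        delta_th = [x - y for x, y in zip(matvec(A, th), matvec(A_z[k], th))]
        zeta = [x - y for x, y in zip(b_z[k], b)]
        noise = [x + y for x, y in zip(delta_th, zeta)]
        return dot(noise, [x - y for x, y in zip(th, theta_star)])

    diff = norm([x - y for x, y in zip(theta, theta2)])
    rhs = ((4 * phi_inf**2 * norm(theta) + 2 * r_inf * phi_inf) * diff
           + 4 * phi_inf**2 * norm([x - y for x, y in zip(theta2, theta_star)]) * diff)
    lhs = max(abs(h(k, theta) - h(k, theta2)) for k in range(n))
    return lhs, rhs
```

Calling `lipschitz_gap(seed)` for several seeds should always return `lhs <= rhs`. The check does not involve the Poisson solution itself, only the forcing functions.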
### A.3 Poisson martingale decomposition of additive Markov noise
Recall $\mathcal{F}_{t}:=\sigma(Z_{0},\dots,Z_{t})$. Let $u_{t-1}$ be the infinite series solution of the Poisson equation with forcing function $h_{t-1}$ defined as one of $h_{\bm{\theta}_{t-1},\bm{\theta}_{t-1}-\bm{\theta}^{*}}$, $h$, or $h_{\bm{\theta}_{t-1}}$, that is,

$$u_{t-1}-P_{Z}u_{t-1}=h_{t-1}\,.$$

Then, we have

$$\begin{aligned}
\sum_{t=1}^{T}\eta_{t-1}h_{t-1}(Z_{t-1})
&=\sum_{t=1}^{T}\eta_{t-1}\bigl(u_{t-1}(Z_{t-1})-(P_{Z}u_{t-1})(Z_{t-1})\bigr)\\
&=\sum_{t=1}^{T}\eta_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})+u_{t-1}(Z_{t})-(P_{Z}u_{t-1})(Z_{t-1})\bigr)\\
&=\sum_{t=1}^{T}\eta_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})\bigr)+\sum_{t=1}^{T}\eta_{t-1}\Delta M_{t},
\end{aligned}$$

where $\Delta M_{t}:=u_{t-1}(Z_{t})-(P_{Z}u_{t-1})(Z_{t-1})$ is a martingale difference with respect to $(\mathcal{F}_{t})$ since $u_{t-1}\in\mathcal{F}_{t-1}$ and

$$\mathbb{E}[u_{t-1}(Z_{t})\mid\mathcal{F}_{t-1}]=\mathbb{E}[u_{t-1}(Z_{t})\mid\mathcal{F}_{t-1},u_{t-1}]=(P_{Z}u_{t-1})(Z_{t-1})\,.$$

We will use this Poisson decomposition repeatedly throughout the paper.
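To make the Poisson equation concrete, here is a minimal sketch on a hypothetical 3-state chain. It assumes the series form $u=\sum_{k\geq 0}P^{k}h$ for a forcing function $h$ centered under the stationary distribution, truncated at a finite number of terms; the transition matrix and function values are made up for illustration. The sketch checks that $u-Pu=h$ and that the martingale increment $u(Z_t)-(Pu)(Z_{t-1})$ has zero conditional mean from every state.

```python
def mat_vec(P, v):
    """Apply the transition matrix to a function on states: (P v)(i) = sum_j P[i][j] v[j]."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]

def stationary(P, iters=2000):
    """Approximate the stationary distribution by power iteration (pi P = pi)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def poisson_solution(P, h, terms=2000):
    """Truncated series u = sum_{k < terms} P^k h for a centered forcing function h."""
    u = [0.0] * len(h)
    term = list(h)
    for _ in range(terms):
        u = [a + b for a, b in zip(u, term)]
        term = mat_vec(P, term)
    return u

# Hypothetical ergodic chain and test function (illustration only).
P = [[0.9, 0.1, 0.0],
     [0.2, 0.5, 0.3],
     [0.1, 0.3, 0.6]]
pi = stationary(P)
f = [1.0, -2.0, 0.5]
mean_f = sum(p * x for p, x in zip(pi, f))
h = [x - mean_f for x in f]              # centered: E_pi[h] = 0

u = poisson_solution(P, h)
Pu = mat_vec(P, u)

# Residual of the Poisson equation u - P u = h.
residual = max(abs(u[i] - Pu[i] - h[i]) for i in range(3))
# Conditional mean of the increment u(Z_t) - (P u)(Z_{t-1}) given Z_{t-1} = i.
max_cond_mean = max(abs(sum(P[i][j] * (u[j] - Pu[i]) for j in range(3)))
                    for i in range(3))
```

Both `residual` and `max_cond_mean` should be zero up to floating-point error. In the paper the chain lives on transitions $Z_t$ and $h_{t-1}$ changes with the iterate, but the algebra of the decomposition is the same.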
Thus, for all $T\geq 1$, we obtain $\sum_{t=1}^{T}\eta_{t-1}h_{t-1}(Z_{t-1})=M_{T}+R_{T}$, where

$$M_{T}:=\sum_{t=1}^{T}\eta_{t-1}\Delta M_{t},\quad R_{T}:=\sum_{t=1}^{T}\eta_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})\bigr),\tag{19}$$

and $(M_{T})_{T\geq 1}$ is an $\mathbb{R}^{d}$-valued martingale with respect to $(\mathcal{F}_{T})_{T\geq 1}$. Moreover, $R_{T}$ enjoys the following representation:
###### Lemma A.3 (Summation by parts).

$$R_{T}=\eta_{0}u_{0}(Z_{0})-\eta_{T}u_{T}(Z_{T})-\sum_{t=1}^{T}(\eta_{t-1}-\eta_{t})u_{t-1}(Z_{t})-\sum_{t=1}^{T}\eta_{t}\bigl(u_{t-1}(Z_{t})-u_{t}(Z_{t})\bigr)\,.$$

###### Proof.

Recall that $R_{T}=\sum_{t=1}^{T}\eta_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})\bigr)$ and write

$$R_{T}=\underbrace{\sum_{t=1}^{T}\eta_{t-1}u_{t-1}(Z_{t-1})}_{T_{T}^{(1)}}-\underbrace{\sum_{t=1}^{T}\eta_{t-1}u_{t-1}(Z_{t})}_{T_{T}^{(2)}}\,.$$

Reindex the first sum as $T_{T}^{(1)}=\sum_{j=0}^{T-1}\eta_{j}u_{j}(Z_{j})$. For the second sum, add and subtract the terms $\eta_{t}u_{t}(Z_{t})$ and $\eta_{t}u_{t-1}(Z_{t})$:

$$\eta_{t-1}u_{t-1}(Z_{t})=\eta_{t}u_{t}(Z_{t})+(\eta_{t-1}-\eta_{t})u_{t-1}(Z_{t})+\eta_{t}\bigl(u_{t-1}(Z_{t})-u_{t}(Z_{t})\bigr)\,.$$

Summing over $t=1,\dots,T$ yields

$$T_{T}^{(2)}=\sum_{t=1}^{T}\eta_{t}u_{t}(Z_{t})+\sum_{t=1}^{T}(\eta_{t-1}-\eta_{t})u_{t-1}(Z_{t})+\sum_{t=1}^{T}\eta_{t}\bigl(u_{t-1}(Z_{t})-u_{t}(Z_{t})\bigr)\,.$$

Since $T_{T}^{(1)}-\sum_{t=1}^{T}\eta_{t}u_{t}(Z_{t})$ telescopes to $\eta_{0}u_{0}(Z_{0})-\eta_{T}u_{T}(Z_{T})$, we conclude

$$R_{T}=\eta_{0}u_{0}(Z_{0})-\eta_{T}u_{T}(Z_{T})-\sum_{t=1}^{T}(\eta_{t-1}-\eta_{t})u_{t-1}(Z_{t})-\sum_{t=1}^{T}\eta_{t}\bigl(u_{t-1}(Z_{t})-u_{t}(Z_{t})\bigr)\,.$$

∎
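The summation-by-parts identity of Lemma A.3 is purely algebraic, so it can be checked on arbitrary numbers. In the sketch below, `u[k][j]` is a hypothetical stand-in for $u_{k}(Z_{j})$ filled with random values, and `eta` is an arbitrary stepsize sequence; neither is tied to an actual Markov chain.

```python
import random

def remainder_direct(eta, u):
    """R_T = sum_{t=1}^T eta[t-1] * (u[t-1][t-1] - u[t-1][t]), where u[k][j] ~ u_k(Z_j)."""
    T = len(eta) - 1
    return sum(eta[t - 1] * (u[t - 1][t - 1] - u[t - 1][t]) for t in range(1, T + 1))

def remainder_by_parts(eta, u):
    """Right-hand side of Lemma A.3: boundary terms minus the two correction sums."""
    T = len(eta) - 1
    boundary = eta[0] * u[0][0] - eta[T] * u[T][T]
    step_change = sum((eta[t - 1] - eta[t]) * u[t - 1][t] for t in range(1, T + 1))
    drift = sum(eta[t] * (u[t - 1][t] - u[t][t]) for t in range(1, T + 1))
    return boundary - step_change - drift

rng = random.Random(0)
T = 50
eta = [1.0 / (t + 1) ** 0.5 for t in range(T + 1)]       # any sequence works here
u = [[rng.uniform(-1.0, 1.0) for _ in range(T + 1)] for _ in range(T + 1)]
gap = abs(remainder_direct(eta, u) - remainder_by_parts(eta, u))
```

The two expressions agree up to rounding (`gap` is on the order of machine precision). The identity needs no monotonicity of the stepsizes, which is only used later when the individual terms are bounded.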
## Appendix B Bounded Iterates: Proof of Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1)
### B.1 Markov noise and localization
Let $\bm{e}_{t}:=\bm{\theta}_{t}-\bm{\theta}^{*}$ denote the error at time $t$. From the TD update, expanding $\|\bm{e}_{t}\|^{2}$ gives

$$\begin{aligned}
\|\bm{e}_{t}\|^{2}
&=\|\bm{\theta}_{t}-\bm{\theta}^{*}\|^{2}=\|\bm{\theta}_{t-1}+\eta_{t-1}\bm{g}(\bm{\theta}_{t-1},Z_{t-1})-\bm{\theta}^{*}\|^{2}\\
&=\|\bm{\theta}_{t-1}-\bm{\theta}^{*}\|^{2}+\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2}+2\eta_{t-1}\langle\bm{g}(\bm{\theta}_{t-1},Z_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle\\
&=\|\bm{\theta}_{t-1}-\bm{\theta}^{*}\|^{2}+\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2}+2\eta_{t-1}\langle\bm{g}(\bm{\theta}_{t-1},Z_{t-1})-\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle\\
&\quad+2\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle\,.
\end{aligned}$$

Summing from $t=1$ to $T$ yields

$$\|\bm{e}_{T}\|^{2}=\|\bm{e}_{0}\|^{2}+\sum_{t=1}^{T}\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2}+2\sum_{t=1}^{T}\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle+B_{T},\tag{20}$$

where the Markov bias term is

$$B_{T}:=2\sum_{t=1}^{T}\eta_{t-1}\,\langle\bm{g}(\bm{\theta}_{t-1},Z_{t-1})-\bar{\bm{g}}(\bm{\theta}_{t-1}),\,\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle\,.\tag{21}$$

By ([4](https://arxiv.org/html/2606.24981#S3.E4)) and the nonnegativity of $f(\bm{\theta})-f(\bm{\theta}^{*})$,

$$\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle=-\bigl(f(\bm{\theta}_{t-1})-f(\bm{\theta}^{*})\bigr)\leq 0\,.$$

Thus, we have

$$\|\bm{e}_{T}\|^{2}\leq\|\bm{e}_{0}\|^{2}+\sum_{t=1}^{T}\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2}+B_{T}\,.$$

To localize the process, we define
$$T_{\mathrm{exit}}:=\inf\{t\geq 0:\,\|\bm{\theta}_{t}\|>\rho R_{\mathrm{base}}\},$$

where $T_{\mathrm{exit}}$ is a stopping time with respect to $(\mathcal{F}_{t})$ because $\bm{\theta}_{t}$ is $\mathcal{F}_{t}$-measurable. The number $\rho>2$ will be specified later. Define

$$R_{\mathrm{base}}:=\max\left\{\|\bm{\theta}_{0}-\bm{\theta}^{*}\|,\ \|\bm{\theta}^{*}\|,\ \frac{r_{\infty}}{\phi_{\infty}}\right\},\quad R_{\mathrm{max}}:=\rho R_{\mathrm{base}},$$

and the stopped iterates and stepsizes by

$$\tilde{\bm{\theta}}_{t}:=\bm{\theta}_{t\wedge T_{\mathrm{exit}}},\qquad\tilde{\eta}_{t}:=\begin{cases}\eta_{t},&t<T_{\mathrm{exit}},\\ 0,&t\geq T_{\mathrm{exit}}\,.\end{cases}$$

We define the stopped Markov bias term

$$\tilde{B}_{T}:=2\sum_{t=1}^{T}\tilde{\eta}_{t-1}\langle\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})-\bar{\bm{g}}(\tilde{\bm{\theta}}_{t-1}),\,\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\rangle\,.$$

In particular, on the event $\{T_{\mathrm{exit}}=\infty\}$ we have $\tilde{\bm{\theta}}_{t}=\bm{\theta}_{t}$, $\tilde{\eta}_{t}=\eta_{t}$, and $\tilde{B}_{t}=B_{t}$ for all $t$.
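The localization can be illustrated on a finite path. The sketch below is a toy version under the assumption that the iterate norms are already available as a list; the function names and numbers are hypothetical. It computes the exit time, the stopped stepsizes, and the stopped path.

```python
import math

def exit_time(norms, radius):
    """First index t with norms[t] > radius, or math.inf if the path never exits."""
    for t, value in enumerate(norms):
        if value > radius:
            return t
    return math.inf

def stopped_stepsizes(etas, t_exit):
    """eta_tilde[t] = eta[t] for t < t_exit and 0 afterwards."""
    return [eta if t < t_exit else 0.0 for t, eta in enumerate(etas)]

def stopped_path(path, t_exit):
    """theta_tilde[t] = theta[min(t, t_exit)]: the path is frozen at the exit time."""
    return [path[min(t, t_exit)] for t in range(len(path))]

# Toy numbers (illustration only): norms of five iterates and a radius rho * R_base.
norms = [1.0, 2.5, 3.0, 7.2, 4.0]
etas = [0.5, 0.4, 0.3, 0.2, 0.1]
radius = 6.0

t_exit = exit_time(norms, radius)                 # 3
eta_tilde = stopped_stepsizes(etas, t_exit)       # [0.5, 0.4, 0.3, 0.0, 0.0]
norm_tilde = stopped_path(norms, t_exit)          # [1.0, 2.5, 3.0, 7.2, 7.2]
```

Note that the frozen value at the exit time itself exceeds the radius, but its stepsize is zero. This is why the later bounds are stated only for indices with $\tilde{\eta}_{t-1}>0$, where the stopped iterate is still inside the ball.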
### B.2 Poisson representation and bounds for the localized Markov noise
Let $u_{t-1}$ be the infinite series solution of the Poisson equation with forcing function $h_{t-1}:=h_{\tilde{\bm{\theta}}_{t-1},\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}}\in\mathcal{F}_{t-2}$, that is,

$$u_{t-1}-P_{Z}u_{t-1}=h_{t-1}\,.$$

Then, from Section [A.3](https://arxiv.org/html/2606.24981#A1.SS3), we have

$$\begin{aligned}
\sum_{t=1}^{T}\tilde{\eta}_{t-1}h_{t-1}(Z_{t-1})
&=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bigl(u_{t-1}(Z_{t-1})-(P_{Z}u_{t-1})(Z_{t-1})\bigr)\\
&=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})+u_{t-1}(Z_{t})-(P_{Z}u_{t-1})(Z_{t-1})\bigr)\\
&=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})\bigr)+\sum_{t=1}^{T}\tilde{\eta}_{t-1}\Delta M_{t},
\end{aligned}$$

where $\Delta M_{t}:=u_{t-1}(Z_{t})-(P_{Z}u_{t-1})(Z_{t-1})$ is a martingale difference with respect to $(\mathcal{F}_{t})$ since $u_{t-1}\in\mathcal{F}_{t-1}$ and

$$\mathbb{E}[u_{t-1}(Z_{t})\mid\mathcal{F}_{t-1}]=\mathbb{E}[u_{t-1}(Z_{t})\mid\mathcal{F}_{t-1},u_{t-1}]=(P_{Z}u_{t-1})(Z_{t-1})\,.$$

Thus, for all $T\geq 1$, we obtain

$$\tilde{B}_{T}=2\tilde{M}_{T}+2\tilde{R}_{T},\qquad\tilde{M}_{T}:=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\Delta M_{t},\quad\tilde{R}_{T}:=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bigl(u_{t-1}(Z_{t-1})-u_{t-1}(Z_{t})\bigr),\tag{22}$$

and $(\tilde{M}_{T})_{T\geq 1}$ is a real-valued martingale with respect to $(\mathcal{F}_{T})_{T\geq 1}$.
Moreover, if $\tilde{\eta}_{t-1}>0$, we have $\|\tilde{\bm{\theta}}_{t-1}\|\leq R_{\mathrm{max}}$ and $\|\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\|\leq 2R_{\mathrm{max}}$. Thus, Lemma [A.2](https://arxiv.org/html/2606.24981#A1.Thmtheorem2) implies that

$$\begin{aligned}
\|u_{t-1}\|_{\infty}
&\leq 16\tau_{\mathrm{mix}}\bigl(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\tilde{\bm{\theta}}_{t-1}\|\bigr)\|\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\|\\
&\leq 16\tau_{\mathrm{mix}}\phi_{\infty}^{2}\bigl(4R_{\mathrm{max}}^{2}+8R_{\mathrm{max}}^{2}\bigr)\\
&\leq 192\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\,.
\end{aligned}$$

Also,

$$\|\Delta M_{t}\|_{\infty}\leq\|u_{t-1}\|_{\infty}+\|P_{Z}u_{t-1}\|_{\infty}\leq 2\|u_{t-1}\|_{\infty}\leq 384\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\,.$$

Since $\tilde{\eta}_{t-1}=0$ for all $t>T_{\mathrm{exit}}$, the increments $\tilde{\eta}_{t-1}\Delta M_{t}$ satisfy

$$\|\tilde{\eta}_{t-1}\Delta M_{t}\|_{\infty}\leq 384\,\tilde{\eta}_{t-1}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\,.\tag{23}$$

We now use Pinelis' inequality in Lemma [F.1](https://arxiv.org/html/2606.24981#A6.Thmtheorem1). Applying it to the increments

$$X_{t}:=\tilde{\eta}_{t-1}\Delta M_{t},\qquad t\geq 1,$$

with ([23](https://arxiv.org/html/2606.24981#A2.E23)), we obtain the following anytime high-probability guarantee.
###### Lemma B.1 (Pinelis' inequality for the stopped martingale).

Let $\tilde{M}_{T}:=\sum_{t=1}^{T}X_{t}$ with $X_{t}:=\tilde{\eta}_{t-1}\Delta M_{t}$ as above. Then, for all $\delta\in(0,1)$,

$$\mathbb{P}\left\{\sup_{T\geq 1}\ |\tilde{M}_{T}|\;\leq\;384\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\frac{2}{\delta}}\right\}\;\geq\;1-\delta\,.$$

###### Proof.

Setting $r\geq Db_{*}\sqrt{2\log\frac{2}{\delta}}$ in Lemma [F.1](https://arxiv.org/html/2606.24981#A6.Thmtheorem1), we have $2\exp\left(-\frac{r^{2}}{2D^{2}b_{*}^{2}}\right)\leq\delta$. Since $\mathbb{R}^{d}$ with $\ell_{2}$ is a $(2,1)$-smooth Banach space, choosing $b_{*}^{2}=\sum_{t=0}^{\infty}\eta_{t}^{2}\bigl(384\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\bigr)^{2}$ finishes the proof. ∎
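The choice of $r$ in this proof is the radius at which the tail $2\exp(-r^{2}/(2D^{2}b_{*}^{2}))$ equals $\delta$. The sketch below checks this numerically; it assumes $D=1$ (the $(2,1)$-smooth case mentioned above) and a finite-horizon truncation of $\sum_t\eta_t^2$, and the function names and parameter values are hypothetical.

```python
import math

def martingale_scale(etas, tau_mix, phi_inf, r_max):
    """b_* with b_*^2 = sum_t eta_t^2 * (384 tau_mix phi_inf^2 R_max^2)^2 (finite horizon)."""
    k = 384.0 * tau_mix * phi_inf**2 * r_max**2
    return k * math.sqrt(sum(eta**2 for eta in etas))

def pinelis_radius(delta, b_star, D=1.0):
    """Radius r = D * b_* * sqrt(2 log(2 / delta)) used in the proof."""
    return D * b_star * math.sqrt(2.0 * math.log(2.0 / delta))

def pinelis_tail(r, b_star, D=1.0):
    """Tail bound 2 exp(-r^2 / (2 D^2 b_*^2)) evaluated at radius r."""
    return 2.0 * math.exp(-r**2 / (2.0 * D**2 * b_star**2))

# Illustrative values only.
etas = [0.01 / math.sqrt(t + 1) for t in range(1000)]
b_star = martingale_scale(etas, tau_mix=5.0, phi_inf=1.0, r_max=2.0)
delta = 0.05
r = pinelis_radius(delta, b_star)
tail_at_r = pinelis_tail(r, b_star)
```

Here `tail_at_r` reproduces `delta` up to rounding, and any larger radius gives a smaller tail, which is the inequality used in the proof.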
### B.3 Bound on the remainder term
We now bound the remainder $\tilde{R}_{T}$ defined in ([22](https://arxiv.org/html/2606.24981#A2.E22)). The bound is deterministic, holding on every sample path, and uses Lemma [A.3](https://arxiv.org/html/2606.24981#A1.Thmtheorem3) together with the Lipschitz continuity of the Poisson solution $u_{t-1}$ in the iterate.
###### Lemma B.2 (Bound on the localized remainder).

In the setting of Section [3](https://arxiv.org/html/2606.24981#S3), with the constants from Lemmas [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3), for any $T\geq 1$,

$$|\tilde{R}_{T}|\leq 576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}+672\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=1}^{T}\eta_{t-1}^{2}\,.$$
###### Proof.

Using Lemma [A.3](https://arxiv.org/html/2606.24981#A1.Thmtheorem3), we have

$$\tilde{R}_{T}=\tilde{\eta}_{0}u_{0}(Z_{0})-\tilde{\eta}_{T}u_{T}(Z_{T})-\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_{t})u_{t-1}(Z_{t})-\sum_{t=1}^{T}\tilde{\eta}_{t}\bigl(u_{t-1}(Z_{t})-u_{t}(Z_{t})\bigr)\,.$$

If $\tilde{\eta}_{t-1}>0$, we have $\|\tilde{\bm{\theta}}_{t-1}\|\leq R_{\mathrm{max}}$, hence $\|\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\|\leq 2R_{\mathrm{max}}$. Lemma [A.2](https://arxiv.org/html/2606.24981#A1.Thmtheorem2) implies

$$\|u_{t-1}\|_{\infty}\leq 192\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\,.$$

Therefore

$$|\tilde{\eta}_{0}u_{0}(Z_{0})|+|\tilde{\eta}_{T}u_{T}(Z_{T})|\leq 384\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\,.$$

Since $(\eta_{t})$ is non-increasing and stopping only sets later stepsizes to zero, $(\tilde{\eta}_{t})$ is also non-increasing. Hence $\sum_{t=1}^{T}|\tilde{\eta}_{t-1}-\tilde{\eta}_{t}|=\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_{t})\leq\eta_{0}$, so

$$\Bigl|\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_{t})u_{t-1}(Z_{t})\Bigr|\leq 192\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\,.$$

For the $u_{t}-u_{t-1}$ term, when $t$ satisfies $\tilde{\eta}_{t}>0$, we expand the TD recursion $\tilde{\bm{\theta}}_{t}-\tilde{\bm{\theta}}_{t-1}=\tilde{\eta}_{t-1}\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})$ and use Lemma [A.2](https://arxiv.org/html/2606.24981#A1.Thmtheorem2)(iii) to conclude

$$\begin{aligned}
\|u_{t}-u_{t-1}\|_{\infty}
&\leq 16\tau_{\mathrm{mix}}\bigl(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}\|\tilde{\bm{\theta}}_{t}\|\bigr)\|\tilde{\bm{\theta}}_{t}-\tilde{\bm{\theta}}_{t-1}\|+64\tau_{\mathrm{mix}}\phi_{\infty}^{2}\|\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\|\,\|\tilde{\bm{\theta}}_{t}-\tilde{\bm{\theta}}_{t-1}\|\\
&\leq 16\,\tilde{\eta}_{t-1}\tau_{\mathrm{mix}}\bigl(2r_{\infty}\phi_{\infty}+4\phi_{\infty}^{2}R_{\mathrm{max}}\bigr)\bigl(r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}R_{\mathrm{max}}\bigr)\\
&\quad+64\,\tilde{\eta}_{t-1}\tau_{\mathrm{mix}}\phi_{\infty}^{2}\cdot 2R_{\mathrm{max}}\bigl(r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}R_{\mathrm{max}}\bigr)\\
&\leq\tilde{\eta}_{t-1}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\,(16\times 18+64\times 2\times 3)\,\tau_{\mathrm{mix}}\\
&=672\,\tilde{\eta}_{t-1}\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\,.
\end{aligned}$$

Thus,

$$\begin{aligned}
\left|\sum_{t=1}^{T}\tilde{\eta}_{t}\bigl(u_{t-1}(Z_{t})-u_{t}(Z_{t})\bigr)\right|
&\leq\sum_{t=1}^{T}\tilde{\eta}_{t}\|u_{t}-u_{t-1}\|_{\infty}\leq 672\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=1}^{T}\tilde{\eta}_{t}\tilde{\eta}_{t-1}\\
&\leq 672\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=1}^{T}\eta_{t-1}^{2}\,.
\end{aligned}$$

Combining the three bounds, we have

$$|\tilde{R}_{T}|\leq 576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}+672\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=1}^{T}\eta_{t-1}^{2}\,.$$

∎
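The numerical constants in this proof can be re-derived mechanically. The sketch below works in normalized units and takes the worst case $r_{\infty}\phi_{\infty}=\phi_{\infty}^{2}R_{\mathrm{max}}$, which is permitted because $R_{\mathrm{max}}\geq R_{\mathrm{base}}\geq r_{\infty}/\phi_{\infty}$. The function name and the parameter `r_phi` (the ratio $r_{\infty}\phi_{\infty}/(\phi_{\infty}^{2}R_{\mathrm{max}})\leq 1$) are hypothetical.

```python
def remainder_constants(r_phi=1):
    """Re-derive the constants of Lemma B.2 in normalized units.

    The first two values are in units of tau_mix * phi_inf^2 * R_max^2, the third
    in units of tau_mix * phi_inf^4 * R_max^2. `r_phi` is the ratio
    r_inf * phi_inf / (phi_inf^2 * R_max), at most 1 by the definition of R_base.
    """
    # ||u||_inf <= 16 * (2 r_inf phi_inf + 4 phi_inf^2 R_max) * (2 R_max)
    sup_u = 16 * (2 * r_phi + 4) * 2
    # Two boundary terms plus the stepsize-difference sum, each carrying a factor eta_0.
    first_term = 2 * sup_u + sup_u
    # Lipschitz step: 16 * (2 r + 4)(r + 2) + 64 * 2 * (r + 2)
    lipschitz = 16 * (2 * r_phi + 4) * (r_phi + 2) + 64 * 2 * (r_phi + 2)
    return sup_u, first_term, lipschitz

sup_u, first_term, lipschitz = remainder_constants()
```

With the worst-case ratio this returns `(192, 576, 672)`, matching the constants in the lemma; smaller ratios give smaller values.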
### B.4 High-probability bound on the localized Markov noise
Combining Lemmas [B.1](https://arxiv.org/html/2606.24981#A2.Thmtheorem1) and [B.2](https://arxiv.org/html/2606.24981#A2.Thmtheorem2) with the decomposition ([22](https://arxiv.org/html/2606.24981#A2.E22)) yields the following high-probability control of $\tilde{B}_{T}$.
###### Lemma B.3 (Localized high-probability bound for $B_{T}$).

In the setting of Section [3](https://arxiv.org/html/2606.24981#S3), with the constants from Lemmas [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3), for any $\delta\in(0,1)$, we have

$$\mathbb{P}\left\{\frac{1}{2}\sup_{T\geq 1}\ |\tilde{B}_{T}|\;\leq\;384\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\frac{2}{\delta}}+C\right\}\;\geq\;1-\delta,$$

where $C=576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}+672\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2}$.

###### Proof.

From ([22](https://arxiv.org/html/2606.24981#A2.E22)), $\tilde{B}_{T}=2\tilde{M}_{T}+2\tilde{R}_{T}$. By Lemma [B.1](https://arxiv.org/html/2606.24981#A2.Thmtheorem1), with probability at least $1-\delta$,

$$\sup_{T\geq 1}\ |\tilde{M}_{T}|\leq 384\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\frac{2}{\delta}}\,.$$

By Lemma [B.2](https://arxiv.org/html/2606.24981#A2.Thmtheorem2), we also have the deterministic bound

$$\sup_{T\geq 1}\ |\tilde{R}_{T}|\leq 576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}+672\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2}\,.$$

Combining these two bounds yields the desired result. ∎
### B.5 High-probability bounded iterates via bootstrap
We can now bootstrap on $R_{\mathrm{max}}$ to prove high-probability bounded iterates, completing the proof of Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1).
###### Lemma B.4 (Bootstrap inequality for the radius).

In the setting of Section [3](https://arxiv.org/html/2606.24981#S3), with the constants from Lemmas [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) and [3.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3), for all $\delta\in(0,1)$, with probability at least $1-\delta$, we have

$$\begin{aligned}
\sup_{T\geq 0}\ \|\tilde{\bm{e}}_{T}\|^{2}
&\leq\|\bm{e}_{0}\|^{2}+768\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\frac{2}{\delta}}+1152\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\\
&\quad+1344\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2}+9\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2},
\end{aligned}$$

where $\tilde{\bm{e}}_{T}:=\tilde{\bm{\theta}}_{T}-\bm{\theta}^{*}$.
###### Proof.

Applying the expansion ([20](https://arxiv.org/html/2606.24981#A2.E20)) to the stopped process and dropping the term $2\sum_{t=1}^{T}\tilde{\eta}_{t-1}\langle\bar{\bm{g}}(\tilde{\bm{\theta}}_{t-1}),\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\rangle$, which is nonpositive by ([4](https://arxiv.org/html/2606.24981#S3.E4)), we obtain

$$\|\tilde{\bm{e}}_{T}\|^{2}\leq\|\bm{e}_{0}\|^{2}+\sum_{t=1}^{T}\tilde{\eta}_{t-1}^{2}\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|^{2}+\tilde{B}_{T}\,.$$

Lemma [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) implies that, whenever $\tilde{\eta}_{t-1}>0$,

$$\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|\leq r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}R_{\mathrm{max}}\leq 3\phi_{\infty}^{2}R_{\mathrm{max}}\,.$$

Thus,

$$\sum_{t=1}^{T}\tilde{\eta}_{t-1}^{2}\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|^{2}\leq 9\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2}\,.$$

Combining this with Lemma [B.3](https://arxiv.org/html/2606.24981#A2.Thmtheorem3) finishes the proof. ∎
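The step from $r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}R_{\mathrm{max}}$ to $3\phi_{\infty}^{2}R_{\mathrm{max}}$ in the gradient bound relies only on the definition of $R_{\mathrm{base}}$ from Section B.1. A short worked version, using $\rho>2$ so that $R_{\mathrm{max}}\geq R_{\mathrm{base}}$:

```latex
\begin{align*}
R_{\mathrm{max}} = \rho R_{\mathrm{base}} \;\geq\; R_{\mathrm{base}} \;\geq\; \frac{r_{\infty}}{\phi_{\infty}}
  &\;\Longrightarrow\; r_{\infty}\phi_{\infty} \;\leq\; \phi_{\infty}^{2} R_{\mathrm{max}}, \\
r_{\infty}\phi_{\infty} + 2\phi_{\infty}^{2} R_{\mathrm{max}}
  &\;\leq\; \phi_{\infty}^{2} R_{\mathrm{max}} + 2\phi_{\infty}^{2} R_{\mathrm{max}}
  \;=\; 3\phi_{\infty}^{2} R_{\mathrm{max}}, \\
\|\bm{g}(\tilde{\bm{\theta}}_{t-1}, Z_{t-1})\|^{2}
  &\;\leq\; 9\phi_{\infty}^{4} R_{\mathrm{max}}^{2} .
\end{align*}
```

The same substitution $r_{\infty}\phi_{\infty}\leq\phi_{\infty}^{2}R_{\mathrm{max}}$ is what turns the mixed terms in the bounds on $\|u_{t-1}\|_{\infty}$ and $\|u_{t}-u_{t-1}\|_{\infty}$ into pure multiples of $\phi_{\infty}^{2}R_{\mathrm{max}}^{2}$ and $\phi_{\infty}^{4}R_{\mathrm{max}}^{2}$.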
We are now ready to give the proof of Theorem[4\.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1)\.
###### Proof\.
By the triangle inequality and $(a+b)^2\leq 2a^2+2b^2$,
$$\begin{aligned}
\sup_{T\geq 0}\ \|\tilde{\bm{\theta}}_T\|^2
&\leq 2\|\bm{\theta}^*\|^2+2\sup_{T}\ \|\tilde{\bm{e}}_T\|^2\\
&\leq 2\|\bm{\theta}^*\|^2+2\|\bm{\theta}^*-\bm{\theta}_0\|^2+1536\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}^2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}\\
&\quad+2304\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}^2+(2688\tau_{\mathrm{mix}}+18)\phi_\infty^4R_{\mathrm{max}}^2\sum_{t=0}^{\infty}\eta_t^2\\
&\leq R_{\mathrm{base}}^2\left(2+2+1536\tau_{\mathrm{mix}}\phi_\infty^2\rho^2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}\right)\\
&\quad+R_{\mathrm{base}}^2\left(2304\eta_0\tau_{\mathrm{mix}}\phi_\infty^2\rho^2+(2688\tau_{\mathrm{mix}}+18)\phi_\infty^4\rho^2\sum_{t=0}^{\infty}\eta_t^2\right)\,.
\end{aligned}$$
Recall that $\eta_{\mathrm{base}}=\frac{1}{c\tau_{\mathrm{mix}}\phi_\infty^2}$ and $\sqrt{\sum_{t=0}^{\infty}\eta_t^2}=\frac{1}{c\tau_{\mathrm{mix}}\phi_\infty^2}\sqrt{\sum_{t=0}^{\infty}a_t^2}$. It follows that
$$\sup_{T\geq 0}\ \|\tilde{\bm{\theta}}_T\|^2\leq R_{\mathrm{base}}^2\left(4+1536\frac{\rho^2}{c}\sqrt{\sum_{t=0}^{\infty}a_t^2}\sqrt{2\log\frac{2}{\delta}}+2304\frac{\rho^2}{c}+2706\left(\sum_{t=0}^{\infty}a_t^2\right)\frac{\rho^2}{c^2}\right)\,.$$
Provided $\rho>2$, the constant term $4$ is strictly below $\rho^2$ and the remaining terms vanish as $c\to\infty$, so there exists a sufficiently large $c$ such that
$$4+1536\frac{\rho^2}{c}\sqrt{\sum_{t=0}^{\infty}a_t^2}\sqrt{2\log\frac{2}{\delta}}+2304\frac{\rho^2}{c}+2706\left(\sum_{t=0}^{\infty}a_t^2\right)\frac{\rho^2}{c^2}\leq\rho^2.$$
For such $c$, we have
$$\sup_{T\geq 0}\ \|\tilde{\bm{\theta}}_T\|^2\leq\rho^2R_{\mathrm{base}}^2=R_{\mathrm{max}}^2\,.$$
On the event $\{\sup_{T}\ \|\tilde{\bm{\theta}}_T\|^2\leq R_{\mathrm{max}}^2\}$ we must have $T_{\mathrm{exit}}=\infty$ by the definition of $T_{\mathrm{exit}}$. Thus $\tilde{\bm{\theta}}_T=\bm{\theta}_T$ for all $T$, and
$$\sup_{T\geq 0}\ \|\bm{\theta}_T\|^2\leq R_{\mathrm{max}}^2\,.$$
∎
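The existence of a suitable $c$ can also be checked numerically. The following is a minimal sketch, not part of the paper: it treats the value of $\sum_{t}a_t^2$ as a user-supplied number (the hypothetical parameter `sum_a_sq`), and the helper names `lhs` and `smallest_c` are illustrative. Because the left-hand side of the condition is decreasing in $c$, a doubling search followed by bisection locates an approximately minimal admissible $c$.

```python
import math

def lhs(c, rho, sum_a_sq, delta):
    """Left-hand side of the condition on c, with confidence term log(2/delta)."""
    log_term = math.sqrt(2.0 * math.log(2.0 / delta))
    return (4.0
            + 1536.0 * rho**2 / c * math.sqrt(sum_a_sq) * log_term
            + 2304.0 * rho**2 / c
            + 2706.0 * sum_a_sq * rho**2 / c**2)

def smallest_c(rho, sum_a_sq, delta, rel_tol=1e-9):
    """Approximate the smallest c with lhs(c) <= rho**2 (needs rho > 2)."""
    if rho <= 2.0:
        raise ValueError("rho must exceed 2: the constant term 4 must be below rho**2")
    target = rho**2
    hi = 1.0
    # lhs is decreasing in c, so doubling eventually satisfies the condition.
    while lhs(hi, rho, sum_a_sq, delta) > target:
        hi *= 2.0
    # The loop runs at least once because lhs(1) > 2304 * rho**2, so lo fails.
    lo = hi / 2.0
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if lhs(mid, rho, sum_a_sq, delta) <= target:
            hi = mid
        else:
            lo = mid
    return hi
```

For instance, `smallest_c(3.0, 2.0, 0.05)` returns a value of $c$ for which the condition holds with $\rho=3$ under the assumed value $\sum_t a_t^2=2$; the numbers are illustrative only.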
## Appendix C Convergence Rates: Proof of Theorem [4\.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2)
Recall that the TD update can be written as
$$\bm{g}(\bm{\theta}_{t-1},Z_{t-1})=-\bm{A}_{Z_{t-1}}\bm{\theta}_{t-1}+\bm{b}_{Z_{t-1}}.$$
Therefore, the centered error $\bm{e}_t:=\bm{\theta}_t-\bm{\theta}^*$ evolves as
$$\begin{aligned}
\bm{e}_t&=\bm{e}_{t-1}+\eta_{t-1}\bigl(\bm{b}_{Z_{t-1}}-\bm{A}_{Z_{t-1}}\bm{\theta}_{t-1}\bigr)\\
&=\bm{e}_{t-1}+\eta_{t-1}\bigl(-\bm{A}_{Z_{t-1}}(\bm{\theta}^*+\bm{e}_{t-1})+\bm{b}_{Z_{t-1}}\bigr)\\
&=\bm{e}_{t-1}+\eta_{t-1}\bigl(-\bm{A}_{Z_{t-1}}\bm{e}_{t-1}+\bm{\xi}_{t-1}\bigr)\\
&=(\mathbf{I}_d-\eta_{t-1}\bm{A})\bm{e}_{t-1}+\eta_{t-1}\bm{\xi}_{t-1}+\eta_{t-1}\bm{\delta}_{t-1}\bm{e}_{t-1},
\end{aligned}\tag{24}$$
using $\bm{A}_{Z_{t-1}}=\bm{A}-\bm{\delta}_{t-1}$ and $\bm{\xi}_{t-1}=\bm{b}_{Z_{t-1}}-\bm{A}_{Z_{t-1}}\bm{\theta}^*$. Rearranging yields
$$\bm{A}\bm{e}_{t-1}=\frac{\bm{e}_{t-1}-\bm{e}_t}{\eta_{t-1}}+\bm{\xi}_{t-1}+\bm{\delta}_{t-1}\bm{e}_{t-1}\,.\tag{25}$$
Define the following quantities:
$$S_T:=\sum_{t=1}^{T}\eta_{t-1},\qquad\bar{\bm{e}}_T:=\frac{1}{S_T}\sum_{t=1}^{T}\eta_{t-1}\bm{e}_{t-1},\qquad\bar{\bm{\theta}}_T:=\bm{\theta}^*+\bar{\bm{e}}_T\,.$$
Multiplying \([25](https://arxiv.org/html/2606.24981#A3.E25)\) by $\eta_{t-1}$ and summing from $t=1$ to $T$ yields
$$\sum_{t=1}^{T}\eta_{t-1}\bm{A}\bm{e}_{t-1}=\sum_{t=1}^{T}(\bm{e}_{t-1}-\bm{e}_t)+\sum_{t=1}^{T}\eta_{t-1}\bm{\xi}_{t-1}+\sum_{t=1}^{T}\eta_{t-1}\bm{\delta}_{t-1}\bm{e}_{t-1}\,.$$
Dividing by $S_T$ and recalling the definition of $\bar{\bm{e}}_T$ yields
$$\bm{A}\bar{\bm{e}}_T=I_1+I_2+I_3,\tag{26}$$
where
$$I_1:=\frac{1}{S_T}\sum_{t=1}^{T}(\bm{e}_{t-1}-\bm{e}_t),\quad I_2:=\frac{1}{S_T}\sum_{t=1}^{T}\eta_{t-1}\bm{\xi}_{t-1},\quad I_3:=\frac{1}{S_T}\sum_{t=1}^{T}\eta_{t-1}\bm{\delta}_{t-1}\bm{e}_{t-1}\,.$$
Notice that, by Lemma [3\.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4), $(\bm{A}+\bm{A}^\top)/2\succeq\omega\mathbf{I}_d$, so for any $\bm{x}\in\mathbb{R}^d$,
$$\langle\bm{A}\bm{x},\bm{x}\rangle=\bm{x}^\top\frac{\bm{A}+\bm{A}^\top}{2}\bm{x}\geq\omega\|\bm{x}\|^2\,.$$
Combining this with Cauchy–Schwarz,
$$\omega\|\bm{x}\|^2\leq\langle\bm{A}\bm{x},\bm{x}\rangle\leq\|\bm{A}\bm{x}\|\,\|\bm{x}\|,$$
we obtain, for $\bm{x}\neq 0$,
$$\|\bm{x}\|\leq\frac{1}{\omega}\,\|\bm{A}\bm{x}\|.$$
In particular, for $\bm{x}=\bar{\bm{e}}_T$,
$$f(\bar{\bm{\theta}}_T)-f(\bm{\theta}^*)=\langle\bm{A}\bar{\bm{e}}_T,\bar{\bm{e}}_T\rangle\leq\|\bm{A}\bar{\bm{e}}_T\|\,\|\bar{\bm{e}}_T\|\leq\frac{1}{\omega}\,\|\bm{A}\bar{\bm{e}}_T\|^2\,.\tag{27}$$
We now bound $\|I_1\|$, $\|I_2\|$, and $\|I_3\|$ in turn to control $f(\bar{\bm{\theta}}_T)-f(\bm{\theta}^*)$.
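The decomposition (26) is purely algebraic: it follows from the recursion and the definitions of $\bm{\xi}_{t-1}$ and $\bm{\delta}_{t-1}$ alone, so it can be sanity-checked on synthetic data. The sketch below is illustrative only; the matrix `A`, the point `theta_star`, the initial iterate, the noise ranges, and the stepsizes are hypothetical choices, and the per-step matrices are drawn independently rather than along a Markov chain.

```python
import random

def matvec(M, v):
    """Matrix-vector product for small dense matrices stored as nested lists."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def identity_gap(T=50, seed=0):
    """Return the largest coordinate of |A e_bar_T - (I1 + I2 + I3)|."""
    rng = random.Random(seed)
    A = [[1.0, 0.3], [-0.2, 0.8]]        # hypothetical mean matrix
    theta_star = [0.5, -1.0]             # hypothetical reference point
    theta = [2.0, 2.0]                   # hypothetical initial iterate
    d = 2
    etas = [0.1 / (t + 1) ** 0.5 for t in range(T)]
    S_T = sum(etas)
    e0 = [theta[i] - theta_star[i] for i in range(d)]
    sum_eta_e = [0.0] * d
    sum_eta_xi = [0.0] * d
    sum_eta_delta_e = [0.0] * d
    for eta in etas:
        # Per-step matrix A_z and vector b_z (synthetic noise).
        A_z = [[A[i][j] + rng.uniform(-0.5, 0.5) for j in range(d)] for i in range(d)]
        b_z = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        e = [theta[i] - theta_star[i] for i in range(d)]
        Az_star = matvec(A_z, theta_star)
        xi = [b_z[i] - Az_star[i] for i in range(d)]                   # xi = b_z - A_z theta*
        delta = [[A[i][j] - A_z[i][j] for j in range(d)] for i in range(d)]  # delta = A - A_z
        delta_e = matvec(delta, e)
        for i in range(d):
            sum_eta_e[i] += eta * e[i]
            sum_eta_xi[i] += eta * xi[i]
            sum_eta_delta_e[i] += eta * delta_e[i]
        # Unprojected update: theta <- theta + eta * (b_z - A_z theta).
        Az_theta = matvec(A_z, theta)
        theta = [theta[i] + eta * (b_z[i] - Az_theta[i]) for i in range(d)]
    eT = [theta[i] - theta_star[i] for i in range(d)]
    lhs = matvec(A, [s / S_T for s in sum_eta_e])
    rhs = [((e0[i] - eT[i]) + sum_eta_xi[i] + sum_eta_delta_e[i]) / S_T for i in range(d)]
    return max(abs(lhs[i] - rhs[i]) for i in range(d))
```

The returned gap should be at the level of floating-point rounding. The probabilistic content of the proof lies entirely in bounding the three terms, not in the identity itself.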
### C\.1 Bound on the telescoping term $I_1$
The first term is purely deterministic\.
###### Lemma C\.1 \(Bound on $I_1$\)\.
On the bounded\-iterates event $\mathcal{E}_R:=\{\sup_t\ \|\bm{\theta}_t\|\leq R_{\mathrm{max}}\}$, we have
$$\|I_1\|\leq\frac{2R_{\mathrm{max}}}{\sum_{t=1}^{T}\eta_{t-1}}$$
for all $T\geq 1$.
###### Proof\.
By telescoping,
$$\sum_{t=1}^{T}(\bm{e}_{t-1}-\bm{e}_t)=\bm{e}_0-\bm{e}_T,$$
so
$$I_1=\frac{\bm{e}_0-\bm{e}_T}{S_T}.$$
On $\mathcal{E}_R$, $\|\bm{e}_0-\bm{e}_T\|=\|\bm{\theta}_0-\bm{\theta}^*-\bm{\theta}_T+\bm{\theta}^*\|=\|\bm{\theta}_0-\bm{\theta}_T\|\leq\|\bm{\theta}_0\|+\|\bm{\theta}_T\|\leq 2R_{\mathrm{max}}$. Therefore, $\|I_1\|\leq\frac{2R_{\mathrm{max}}}{\sum_{t=1}^{T}\eta_{t-1}}$. ∎
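To see how the bound of Lemma C.1 behaves, the sketch below evaluates $2R_{\mathrm{max}}/S_T$ for a concrete schedule. The schedule used here, $\eta_t=\eta_{\mathrm{base}}/(\sqrt{t+1}\,\log(t+e))$, is an assumption made for illustration (it mimics the $1/(\log(t)\sqrt{t})$ shape mentioned in the abstract but is not taken from the paper), and the function names are hypothetical.

```python
import math

def stepsizes(T, eta_base):
    """Assumed non-increasing schedule eta_t = eta_base / (sqrt(t+1) * log(t+e))."""
    return [eta_base / (math.sqrt(t + 1) * math.log(t + math.e)) for t in range(T)]

def i1_bound(T, eta_base, r_max):
    """Right-hand side 2 * R_max / S_T of the bound on ||I_1||."""
    S_T = sum(stepsizes(T, eta_base))
    return 2.0 * r_max / S_T
```

Since the stepsizes are non-increasing, $S_T\geq T\eta_{T-1}$, so under this assumed schedule the bound decays at least as fast as $2R_{\mathrm{max}}\log(T-1+e)/(\eta_{\mathrm{base}}\sqrt{T})$.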
### C\.2 Bound on $I_2$ via a vector\-valued martingale inequality
###### Lemma C\.2 \(Bounds for $I_2$\)\.
In the setting of Section [3](https://arxiv.org/html/2606.24981#S3), with the constants from Lemmas [3\.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) and [3\.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3), let
$$I_2:=\frac{1}{S_T}\sum_{t=1}^{T}\eta_{t-1}\bm{\xi}_{t-1},$$
where $\bm{\xi}_{t-1}:=\bm{\xi}(Z_{t-1})$ and the stepsize sequence $(\eta_{t-1})$ is non\-increasing. Then, with probability at least $1-\delta$, for all $T\geq 1$, we have
$$\|I_2\|\leq\frac{1}{\sum_{t=1}^{T}\eta_{t-1}}\left(2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}+3\eta_0\right)16\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\,.$$
###### Proof\.
By Lemma[A\.1](https://arxiv.org/html/2606.24981#A1.Thmtheorem1), we have
$$u-P_Zu=\bm{\xi},\quad\text{and}\quad\|u\|_\infty\leq 16\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\,.$$
Then, reasoning as in Section [A\.3](https://arxiv.org/html/2606.24981#A1.SS3), we obtain
$$\begin{aligned}
\sum_{t=1}^{T}\eta_{t-1}\bm{\xi}_{t-1}
&=\sum_{t=1}^{T}\eta_{t-1}\bigl(u(Z_{t-1})-(P_Zu)(Z_{t-1})\bigr)\\
&=\sum_{t=1}^{T}\eta_{t-1}\bigl(u(Z_{t-1})-u(Z_t)+u(Z_t)-(P_Zu)(Z_{t-1})\bigr)\\
&=M_T+R_T,
\end{aligned}$$
where, with $\Delta M_t:=u(Z_t)-(P_Zu)(Z_{t-1})$,
$$M_T:=\sum_{t=1}^{T}\eta_{t-1}\Delta M_t,\quad R_T:=\sum_{t=1}^{T}\eta_{t-1}\bigl(u(Z_{t-1})-u(Z_t)\bigr)\,.$$
Since $\|u\|_\infty\leq 16\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)$, we have
$$\|\Delta M_t\|_\infty\leq\|u\|_\infty+\|P_Zu\|_\infty\leq 2\|u\|_\infty\leq 32\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\,.$$
Using Lemma [F\.1](https://arxiv.org/html/2606.24981#A6.Thmtheorem1), for any $\delta\in(0,1)$, we have with probability at least $1-\delta$,
$$\sup_{T\geq 1}\ \|M_T\|\leq 32\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}\,.$$
For the remainder term $R_T$, we have
$$\|R_T\|=\left\|\sum_{t=1}^{T}\eta_{t-1}\bigl(u(Z_{t-1})-u(Z_t)\bigr)\right\|\leq 3\eta_0\|u\|_\infty\leq 48\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\eta_0,$$
since $\eta_{t-1}$ is non\-increasing. Thus, with probability at least $1-\delta$,
$$\sup_{T\geq 1}\ \|M_T+R_T\|\leq\left(2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}+3\eta_0\right)16\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\,.$$
Hence, with probability at least $1-\delta$,
$$\|I_2\|=\frac{\|M_T+R_T\|}{S_T}\leq\frac{1}{\sum_{t=1}^{T}\eta_{t-1}}\left(2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}+3\eta_0\right)16\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\,.$$
∎
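The martingale-plus-remainder split can be illustrated on a toy example. The sketch below is an assumption-laden illustration, not the paper's setting: it uses a scalar-valued function on a two-state chain with hypothetical transition probabilities `p` and `q`, for which the Poisson equation has the closed-form solution $u=\xi/(p+q)$ whenever $\xi$ is centered under the stationary law. It then checks numerically that $\sum_t\eta_{t-1}\xi(Z_{t-1})=M_T+R_T$ and that the remainder stays within $3\eta_0\|u\|_\infty$.

```python
import random

def poisson_decomposition(T=200, p=0.3, q=0.2, seed=1):
    """Return (weighted sum, M_T, R_T, eta_0 * ||u||_inf) on a two-state chain."""
    rng = random.Random(seed)
    P = [[1 - p, p], [q, 1 - q]]          # transition matrix of the toy chain
    xi = [p, -q]                          # centered: stationary law is (q, p) / (p + q)
    u = [x / (p + q) for x in xi]         # closed-form Poisson solution on this chain
    Pu = [P[z][0] * u[0] + P[z][1] * u[1] for z in range(2)]
    # Check the Poisson equation u - P u = xi.
    assert all(abs(u[z] - Pu[z] - xi[z]) < 1e-12 for z in range(2))
    etas = [0.1 / (t + 1) ** 0.5 for t in range(T)]   # non-increasing stepsizes
    z = 0
    total = M = R = 0.0
    for t in range(1, T + 1):
        z_next = 1 if rng.random() < P[z][1] else 0
        eta = etas[t - 1]
        total += eta * xi[z]                  # eta_{t-1} * xi(Z_{t-1})
        M += eta * (u[z_next] - Pu[z])        # martingale increment
        R += eta * (u[z] - u[z_next])         # telescoping remainder
        z = z_next
    u_inf = max(abs(v) for v in u)
    return total, M, R, etas[0] * u_inf
```

The identity holds exactly (up to rounding) on every trajectory, while the remainder bound uses only that the stepsizes are non-increasing.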
### C\.3 Bound on $I_3$ via a localized vector\-valued martingale inequality
Recall the stopping time and stopped process:
$$T_{\mathrm{exit}}:=\inf\{t\geq 0:\,\|\bm{\theta}_t\|>R_{\mathrm{max}}\},\qquad\tilde{\bm{\theta}}_t:=\bm{\theta}_{t\wedge T_{\mathrm{exit}}},\qquad\tilde{\bm{e}}_t:=\tilde{\bm{\theta}}_t-\bm{\theta}^*,$$
and the stopped stepsizes
$$\tilde{\eta}_t:=\begin{cases}\eta_t,&t<T_{\mathrm{exit}},\\ 0,&t\geq T_{\mathrm{exit}}\,.\end{cases}$$
For $T\geq 1$, define the localized multiplicative\-noise term
$$\tilde{I}_{3,T}:=\frac{1}{S_T}\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bm{\delta}_{t-1}\tilde{\bm{e}}_{t-1}\,.\tag{28}$$
On the event $\{T_{\mathrm{exit}}>T-1\}$ we have $\tilde{\eta}_{t-1}=\eta_{t-1}$ and $\tilde{\bm{e}}_{t-1}=\bm{e}_{t-1}$ for all $1\leq t\leq T$, so $\tilde{I}_{3,T}=I_3$.
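The stopping construction can be made concrete on a finite trajectory. The following sketch is illustrative only (the function name `stop_process` and the list-based representation are assumptions): it computes $T_{\mathrm{exit}}$, freezes the iterates from the exit time on, and zeroes the stepsizes from the exit time on, mirroring the definitions above.

```python
import math

def stop_process(thetas, etas, r_max):
    """Return (T_exit, stopped iterates, stopped stepsizes) for a finite trajectory.

    thetas: list of iterates, each a list of floats.
    etas:   list of stepsizes of the same length.
    T_exit is math.inf when the trajectory never leaves the ball of radius r_max.
    """
    norms = [math.sqrt(sum(x * x for x in th)) for th in thetas]
    # First time the norm strictly exceeds r_max, or infinity if it never does.
    t_exit = next((t for t, n in enumerate(norms) if n > r_max), math.inf)
    # Stopped iterate: theta_{min(t, T_exit)}.
    stopped_thetas = [thetas[min(t, t_exit)] for t in range(len(thetas))]
    # Stopped stepsize: eta_t before the exit time, zero from the exit time on.
    stopped_etas = [eta if t < t_exit else 0.0 for t, eta in enumerate(etas)]
    return t_exit, stopped_thetas, stopped_etas
```

When the trajectory never exits, the stopped and original sequences coincide, which is the mechanism used later to transfer bounds on the stopped process back to the original iterates.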
###### Lemma C\.3 \(Localized bounds for $I_3$\)\.
In the setting of Section [3](https://arxiv.org/html/2606.24981#S3), with the constants from Lemmas [3\.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) and [3\.3](https://arxiv.org/html/2606.24981#S3.Thmtheorem3), consider the stopped process defined above. Fix $\delta\in(0,1)$. We have, with probability at least $1-\delta$, for any $T\geq 1$,
$$\begin{aligned}
\|\tilde{I}_{3,T}\|_2\;&\leq\;\frac{256\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}+384\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}}{S_T}\\
&\quad\;+\frac{192\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2}{S_T}\,.
\end{aligned}$$
###### Proof\.
Recall
$$\bm{\delta}_z=\bm{A}-\bm{A}_z,\qquad h_{\tilde{\bm{\theta}}}(z):=\bm{\delta}_z(\tilde{\bm{\theta}}-\bm{\theta}^*),$$
and $\|\bm{\delta}_z\|_{\mathrm{op}}\leq 4\phi_\infty^2$ for all $z$. Thus, for each $\bm{\theta}$, there exists a Poisson solution $u_{\bm{\theta}}:\mathcal{Z}\to\mathbb{R}^d$ satisfying
$$u_{\bm{\theta}}-P_Zu_{\bm{\theta}}=h_{\bm{\theta}},$$
and Lemma [A\.2](https://arxiv.org/html/2606.24981#A1.Thmtheorem2) gives
$$\|u_{\bm{\theta}}\|_\infty\leq 64\tau_{\mathrm{mix}}\phi_\infty^2\|\bm{\theta}-\bm{\theta}^*\|,\qquad\|u_{\bm{\theta}}-u_{\bm{\theta}'}\|_\infty\leq 64\tau_{\mathrm{mix}}\phi_\infty^2\|\bm{\theta}-\bm{\theta}'\|\,.$$
Reasoning as in Section [A\.3](https://arxiv.org/html/2606.24981#A1.SS3) with $h_{\tilde{\bm{\theta}}_{t-1}}$ yields, for all $T\geq 1$,
$$\begin{aligned}
\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bm{\delta}_{t-1}\tilde{\bm{e}}_{t-1}
&=\sum_{t=1}^{T}\tilde{\eta}_{t-1}h_{\tilde{\bm{\theta}}_{t-1}}(Z_{t-1})=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\Bigl(u_{\tilde{\bm{\theta}}_{t-1}}(Z_{t-1})-\bigl(P_Zu_{\tilde{\bm{\theta}}_{t-1}}\bigr)(Z_{t-1})\Bigr)\\
&=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\Bigl(u_{\tilde{\bm{\theta}}_{t-1}}(Z_{t-1})-u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)+u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)-\bigl(P_Zu_{\tilde{\bm{\theta}}_{t-1}}\bigr)(Z_{t-1})\Bigr)\,.
\end{aligned}$$
Let $\Delta M_t:=u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)-\bigl(P_Zu_{\tilde{\bm{\theta}}_{t-1}}\bigr)(Z_{t-1})$. We have $\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bm{\delta}_{t-1}\tilde{\bm{e}}_{t-1}=\tilde{M}_T+\tilde{R}_T$, where
$$\begin{aligned}
\tilde{M}_T&:=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\Delta M_t,\\
\tilde{R}_T&:=\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bigl(u_{\tilde{\bm{\theta}}_{t-1}}(Z_{t-1})-u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)\bigr)\,.
\end{aligned}$$
If $\tilde{\eta}_{t-1}>0$, then $\|\tilde{\bm{\theta}}_{t-1}\|\leq R_{\mathrm{max}}$ and we have
$$\|\tilde{\eta}_{t-1}\Delta M_t\|_2\leq 2\tilde{\eta}_{t-1}\|u_{\tilde{\bm{\theta}}_{t-1}}\|_\infty\leq 256\tilde{\eta}_{t-1}\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\,.$$
Using Lemma [F\.1](https://arxiv.org/html/2606.24981#A6.Thmtheorem1), for any $\delta\in(0,1)$, we have with probability at least $1-\delta$,
$$\sup_{T\geq 1}\ \|\tilde{M}_T\|_2\leq 256\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{2}{\delta}}\,.$$
For the remainder term $\tilde{R}_T$, using Lemma [A\.3](https://arxiv.org/html/2606.24981#A1.Thmtheorem3), we have
$$\tilde{R}_T=\tilde{\eta}_0u_{\tilde{\bm{\theta}}_0}(Z_0)-\tilde{\eta}_Tu_{\tilde{\bm{\theta}}_T}(Z_T)-\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_t)u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)-\sum_{t=1}^{T}\tilde{\eta}_t\bigl(u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)-u_{\tilde{\bm{\theta}}_t}(Z_t)\bigr)\,.$$
For the two boundary terms, since $\tilde{\eta}_0\leq\eta_0$ and $\tilde{\eta}_T\leq\eta_0$, we obtain
$$\|\tilde{\eta}_0u_{\tilde{\bm{\theta}}_0}(Z_0)\|_2+\|\tilde{\eta}_Tu_{\tilde{\bm{\theta}}_T}(Z_T)\|_2\leq 256\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\,.$$
For the sum of stepsize differences, we have
$$\left\|\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_t)u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)\right\|_2\leq 128\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\sum_{t=1}^{T}|\tilde{\eta}_{t-1}-\tilde{\eta}_t|\,.$$
Since the stopped stepsizes $(\tilde{\eta}_t)$ are non\-increasing and satisfy $\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_t)=\tilde{\eta}_0-\tilde{\eta}_T\leq\eta_0$, we have
$$\left\|\sum_{t=1}^{T}(\tilde{\eta}_{t-1}-\tilde{\eta}_t)u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)\right\|_2\leq 128\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\,.$$
For the last term, by the Lipschitz property of $u_{\tilde{\bm{\theta}}}$ in part \(iii\) of Lemma [A\.2](https://arxiv.org/html/2606.24981#A1.Thmtheorem2), we have
$$\|u_{\tilde{\bm{\theta}}_{t-1}}-u_{\tilde{\bm{\theta}}_t}\|_\infty\leq 64\tau_{\mathrm{mix}}\phi_\infty^2\|\tilde{\bm{\theta}}_{t-1}-\tilde{\bm{\theta}}_t\|_2\,.$$
Expanding the TD recursion, we have
$$\|\tilde{\bm{\theta}}_t-\tilde{\bm{\theta}}_{t-1}\|_2=\tilde{\eta}_{t-1}\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|_2\leq\tilde{\eta}_{t-1}\bigl(r_\infty+2\phi_\infty\|\tilde{\bm{\theta}}_{t-1}\|_2\bigr)\phi_\infty\leq 3\tilde{\eta}_{t-1}\phi_\infty^2R_{\mathrm{max}}\,.$$
Therefore,
$$\begin{aligned}
\left\|\sum_{t=1}^{T}\tilde{\eta}_t\bigl(u_{\tilde{\bm{\theta}}_{t-1}}(Z_t)-u_{\tilde{\bm{\theta}}_t}(Z_t)\bigr)\right\|_2
&\leq\sum_{t=1}^{T}\tilde{\eta}_t\|u_{\tilde{\bm{\theta}}_{t-1}}-u_{\tilde{\bm{\theta}}_t}\|_\infty\\
&\leq 192\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=1}^{T}\tilde{\eta}_t\tilde{\eta}_{t-1}\\
&\leq 192\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2,
\end{aligned}$$
since $(\tilde{\eta}_t)$ is non\-increasing. Combining all the estimates finishes the proof. ∎
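As a reading aid, the constants in Lemma C.3 can be tallied explicitly. The block below is a bookkeeping sketch assembled from the estimates in the proof (the underbrace labels are informal annotations, not notation from the paper): the two boundary terms contribute $256\eta_0$, the stepsize-difference sum contributes $128\eta_0$, and the Lipschitz term carries the factor $64\cdot 3=192$.

```latex
\begin{aligned}
\|\tilde{R}_T\|_2
&\leq \underbrace{256\,\eta_0}_{\text{boundary terms}}\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}
 +\underbrace{128\,\eta_0}_{\text{stepsize differences}}\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}
 +\underbrace{64\cdot 3}_{=192}\,\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2\\
&=384\,\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}
 +192\,\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2,\\
\|\tilde{I}_{3,T}\|_2
&\leq\frac{\|\tilde{M}_T\|_2+\|\tilde{R}_T\|_2}{S_T}.
\end{aligned}
```

Inserting the high-probability bound on $\sup_{T}\|\tilde{M}_T\|_2$ into the last line recovers the three terms in the statement of the lemma.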
### C\.4 Proof of Theorem [4\.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2)
###### Proof\.
Fix $\delta\in(0,1)$ and choose $(c,\rho)$ with $\rho>2$ satisfying
$$4+1536\frac{\rho^2}{c}\sqrt{\sum_{t=0}^{\infty}a_t^2}\sqrt{2\log\frac{8}{\delta}}+2304\frac{\rho^2}{c}+2706\left(\sum_{t=0}^{\infty}a_t^2\right)\frac{\rho^2}{c^2}\leq\rho^2\,.$$
Then, with probability at least $1-\delta/4$, running TD\(0\) with the stepsize schedule $(\eta_t)$ defined in Theorem [4\.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1) guarantees
$$\sup_{t\geq 0}\ \|\bm{\theta}_t\|\leq\rho\max\left\{\|\bm{\theta}_0-\bm{\theta}^*\|,\|\bm{\theta}^*\|,\frac{r_\infty}{\phi_\infty}\right\},$$
where we define the corresponding event $\mathcal{E}_R:=\left\{\sup_{t\geq 0}\ \|\bm{\theta}_t\|\leq\rho\max\left\{\|\bm{\theta}_0-\bm{\theta}^*\|,\|\bm{\theta}^*\|,\frac{r_\infty}{\phi_\infty}\right\}\right\}$.
Next, we apply Lemma [C\.2](https://arxiv.org/html/2606.24981#A3.Thmtheorem2) with confidence parameter $\delta/4$ to obtain an event $\mathcal{E}_2$ such that
$$\mathbb{P}\{\mathcal{E}_2\}\geq 1-\frac{\delta}{4},\quad\text{and on }\mathcal{E}_2:\quad\|I_2\|\leq\frac{2}{S_T}\left(2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}+3\eta_0\right)8\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\,.$$
Similarly, apply Lemma [C\.3](https://arxiv.org/html/2606.24981#A3.Thmtheorem3) with the same $\delta/4$ to obtain an event $\mathcal{E}_3$ such that $\mathbb{P}\{\mathcal{E}_3\}\geq 1-\frac{\delta}{4}$ and, on $\mathcal{E}_3$,
$$\begin{aligned}
\|\tilde{I}_{3,T}\|&\leq\frac{2}{S_T}\left(128\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}+192\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\right)\\
&\quad+\frac{2}{S_T}\left(96\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2\right)\,.
\end{aligned}$$
Moreover, from the proof of Lemma [B\.4](https://arxiv.org/html/2606.24981#A2.Thmtheorem4), we can also obtain an event $\mathcal{E}_4$ such that $\mathbb{P}\{\mathcal{E}_4\}\geq 1-\frac{\delta}{4}$ and, on $\mathcal{E}_4$,
$$\begin{aligned}
\sum_{t=1}^{T}\tilde{\eta}_{t-1}^2\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|^2+\tilde{B}_T
&\leq 768\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}^2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}+1152\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}^2\\
&\quad+1344\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}^2\sum_{t=0}^{\infty}\eta_t^2+9\phi_\infty^4R_{\mathrm{max}}^2\sum_{t=0}^{\infty}\eta_t^2\,.
\end{aligned}$$
In particular, if we choose $R_{\mathrm{max}}=\rho\max\left\{\|\bm{\theta}_0-\bm{\theta}^*\|,\|\bm{\theta}^*\|,\frac{r_\infty}{\phi_\infty}\right\}$, we have $T_{\mathrm{exit}}=\infty$ on $\mathcal{E}_R$. Hence, for all $T\geq 1$, $\tilde{\eta}_{t-1}=\eta_{t-1}$ and $\tilde{\bm{e}}_{t-1}=\bm{e}_{t-1}$ for $1\leq t\leq T$, so $\tilde{I}_{3,T}=I_3$ and $\sum_{t=1}^{T}\tilde{\eta}_{t-1}^2\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|^2+\tilde{B}_T=\sum_{t=1}^{T}\eta_{t-1}^2\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^2+B_T$. Define
$$\mathcal{E}:=\mathcal{E}_R\cap\mathcal{E}_2\cap\mathcal{E}_3\cap\mathcal{E}_4.$$
By a union bound, we have
$$\mathbb{P}\{\mathcal{E}\}\geq 1-\left(\frac{\delta}{4}+\frac{\delta}{4}+\frac{\delta}{4}+\frac{\delta}{4}\right)=1-\delta\,.$$
Moreover, on $\mathcal{E}$,
$$\begin{aligned}
\|\bm{A}\bar{\bm{e}}_T\|&=\|I_1+I_2+I_3\|\leq\|I_1\|+\|I_2\|+\|I_3\|\\
&\leq\frac{2R_{\mathrm{max}}}{\sum_{t=1}^{T}\eta_{t-1}}+\frac{2}{\sum_{t=1}^{T}\eta_{t-1}}\left(2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}+3\eta_0\right)8\tau_{\mathrm{mix}}\bigl(r_\infty\phi_\infty+2\phi_\infty^2\|\bm{\theta}^*\|\bigr)\\
&\quad+\frac{2}{\sum_{t=1}^{T}\eta_{t-1}}\Biggl(128\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}\\
&\qquad\qquad\qquad\qquad+192\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}+96\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2\Biggr)\,.
\end{aligned}$$
Thus, using \([27](https://arxiv.org/html/2606.24981#A3.E27)\), on $\mathcal{E}$, we have
$$f(\bar{\bm{\theta}}_T)-f(\bm{\theta}^*)\leq\frac{C_{\mathrm{fast}}^2}{\omega\bigl(\sum_{t=1}^{T}\eta_{t-1}\bigr)^2},$$
where
$$\begin{aligned}
C_{\mathrm{fast}}&:=2R_{\mathrm{max}}+\left(2\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}+3\eta_0\right)48\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\\
&\quad+256\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\sqrt{\sum_{t=0}^{\infty}\eta_t^2}\sqrt{2\log\frac{8}{\delta}}+384\eta_0\tau_{\mathrm{mix}}\phi_\infty^2R_{\mathrm{max}}\\
&\quad+192\tau_{\mathrm{mix}}\phi_\infty^4R_{\mathrm{max}}\sum_{t=0}^{\infty}\eta_t^2\,.
\end{aligned}$$
On the other hand, using \([20](https://arxiv.org/html/2606.24981#A2.E20)\), we also have
$$\|\bm{e}_{T}\|^{2} = \|\bm{e}_{0}\|^{2} + \sum_{t=1}^{T}\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2} + 2\sum_{t=1}^{T}\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle + B_{T}.$$
Rearranging terms yields
$$\begin{aligned}
2\sum_{t=1}^{T}\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}^{*}-\bm{\theta}_{t-1}\rangle &= \|\bm{e}_{0}\|^{2} - \|\bm{e}_{T}\|^{2} + \sum_{t=1}^{T}\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2} + B_{T} \\
&\leq R_{\mathrm{max}}^{2} + 768\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\frac{8}{\delta}} \\
&\quad + 1152\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2} \\
&\quad + 1344\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2} + 9\phi_{\infty}^{4}R_{\mathrm{max}}^{2}\sum_{t=0}^{\infty}\eta_{t}^{2}.
\end{aligned}$$
Using the convexity of $f$ and ([4](https://arxiv.org/html/2606.24981#S3.E4)), we also have
$$\begin{aligned}
f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*}) &\leq \left(\frac{1}{\sum_{t=1}^{T}\eta_{t-1}}\right)\sum_{t=1}^{T}\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}^{*}-\bm{\theta}_{t-1}\rangle \\
&\leq \frac{R_{\mathrm{max}}^{2}}{\sum_{t=1}^{T}\eta_{t-1}}\left(0.5 + 384\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\frac{8}{\delta}} + 576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}\right) \\
&\quad + \frac{R_{\mathrm{max}}^{2}}{\sum_{t=1}^{T}\eta_{t-1}}\left(672\tau_{\mathrm{mix}}\phi_{\infty}^{4}\sum_{t=0}^{\infty}\eta_{t}^{2} + 4.5\phi_{\infty}^{4}\sum_{t=0}^{\infty}\eta_{t}^{2}\right) \\
&\leq \frac{C_{\mathrm{robust}}}{\sum_{t=1}^{T}\eta_{t-1}},
\end{aligned}$$
where
$$\begin{aligned}
C_{\mathrm{robust}} \coloneqq R_{\mathrm{max}}^{2}\Bigg( & 0.5 + 384\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{\sum_{t=0}^{\infty}\eta_{t}^{2}}\sqrt{2\log\left(\frac{8}{\delta}\right)} \\
& + 576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2} + 672\tau_{\mathrm{mix}}\phi_{\infty}^{4}\sum_{t=0}^{\infty}\eta_{t}^{2} + 4.5\phi_{\infty}^{4}\sum_{t=0}^{\infty}\eta_{t}^{2}\Bigg).
\end{aligned}$$
Combining the two upper bounds for $f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*})$, with probability at least $1-\delta$, we have
$$f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*}) \leq \min\left\{\frac{C_{\mathrm{fast}}^{2}}{\omega\left(\sum_{t=1}^{T}\eta_{t-1}\right)^{2}},\ \frac{C_{\mathrm{robust}}}{\sum_{t=1}^{T}\eta_{t-1}}\right\}.$$
∎
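The minimum in the final bound switches from the robust term to the fast term once the stepsize sum $S_T=\sum_{t=1}^{T}\eta_{t-1}$ exceeds $C_{\mathrm{fast}}^{2}/(\omega C_{\mathrm{robust}})$, which follows by comparing the two fractions directly. The following is a minimal Python sketch of that comparison; the function names and the numeric inputs are illustrative assumptions, not values from the paper.

```python
def fast_bound(s_t: float, c_fast: float, omega: float) -> float:
    """Curvature-dependent term: C_fast^2 / (omega * S_T^2)."""
    return c_fast ** 2 / (omega * s_t ** 2)


def robust_bound(s_t: float, c_robust: float) -> float:
    """Curvature-free term: C_robust / S_T."""
    return c_robust / s_t


def combined_bound(s_t: float, c_fast: float, c_robust: float, omega: float) -> float:
    """Minimum of the two terms, as in the final display of the proof."""
    return min(fast_bound(s_t, c_fast, omega), robust_bound(s_t, c_robust))


def crossover_sum(c_fast: float, c_robust: float, omega: float) -> float:
    """Value of S_T at which both terms coincide (hypothetical helper)."""
    return c_fast ** 2 / (omega * c_robust)


if __name__ == "__main__":
    # Illustrative constants only.
    c_fast, c_robust, omega = 3.0, 2.0, 0.1
    print("crossover S_T:", crossover_sum(c_fast, c_robust, omega))
    for s_t in (10.0, 100.0):
        print(s_t, combined_bound(s_t, c_fast, c_robust, omega))
```

Below the crossover the robust term is the smaller one, so a small $\omega$ only delays the point at which the fast rate takes over; it never makes the bound worse than the robust rate.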
## Appendix D Explicit Derivation of Stepsize Schedules
In this section, we consider specific stepsize schedules $(\eta_{t})$ and derive the corresponding corollaries of Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1) and Theorem [4.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2).
###### Corollary D.1 (High-probability bounded iterates).
In the setting of Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1), consider the following stepsize schedule:
$$\eta_{t} = \frac{1}{c\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{t+1}\,\log(t+3)}, \quad \forall t\geq 0,$$
for some numerical constant $c>0$. Fix any $\delta\in(0,1)$ and let $R_{\mathrm{base}} \coloneqq \max\left\{\|\bm{\theta}_{0}-\bm{\theta}^{*}\|, \|\bm{\theta}^{*}\|, r_{\infty}/\phi_{\infty}\right\}$, $A_{1}(\delta) = 1536\sqrt{\frac{3}{\log 2}}\sqrt{2\log\frac{2}{\delta}}+2304$, and $A_{2} = \frac{8118}{\log 2}$. Then, provided that $c > c_{\min}(\delta) \coloneqq \frac{A_{1}(\delta)+\sqrt{A_{1}^{2}(\delta)+4A_{2}}}{2}$, with probability at least $1-\delta$, we have
$$\sup_{t\geq 0}\ \|\bm{\theta}_{t}\|_{2} \leq \rho R_{\mathrm{base}}, \quad \text{where } \rho = \frac{2c}{\sqrt{c^{2}-A_{1}(\delta)c-A_{2}}}.$$
###### Proof\.
Fix $\delta\in(0,1)$. We substitute $a_{t} = \frac{1}{\sqrt{t+1}\log(t+3)}$, $\forall t\geq 0$, into Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1). Lemma [F.3](https://arxiv.org/html/2606.24981#A6.Thmtheorem3)(i) gives $\sum_{t=0}^{\infty}a_{t}^{2} \leq \frac{3}{\log 2}$. Thus, a sufficient condition on $c$ and $\rho$ is
$$4 + 1536\frac{\rho^{2}}{c}\sqrt{\frac{3}{\log 2}}\sqrt{2\log\frac{2}{\delta}} + 2304\frac{\rho^{2}}{c} + 2706\,\frac{3}{\log 2}\,\frac{\rho^{2}}{c^{2}} \leq \rho^{2}.$$
We now derive an explicit relationship between $c$ and $\rho$ under which the above inequality holds. Denote $A_{1}(\delta) = 1536\sqrt{\frac{3}{\log 2}}\sqrt{2\log\frac{2}{\delta}}+2304$ and $A_{2} = 2706\cdot\frac{3}{\log 2} = \frac{8118}{\log 2}$. Then, the above inequality is equivalent to
$$4 + \frac{\rho^{2}}{c}A_{1}(\delta) + \frac{\rho^{2}}{c^{2}}A_{2} \leq \rho^{2}. \tag{29}$$
For any fixed $c>0$, multiplying ([29](https://arxiv.org/html/2606.24981#A4.E29)) by $c^{2}/\rho^{2}$ and rearranging yields
$$c^{2} - A_{1}(\delta)c - A_{2} \geq \frac{4c^{2}}{\rho^{2}},$$
which requires $c^{2}-A_{1}(\delta)c-A_{2} > 0$ since the right-hand side is positive. In particular, for any $c > c_{\min}(\delta) \coloneqq \frac{A_{1}(\delta)+\sqrt{A_{1}^{2}(\delta)+4A_{2}}}{2}$, the larger root of $c^{2}-A_{1}(\delta)c-A_{2}=0$, choosing
$$\rho \geq \rho_{\min}(c) \coloneqq \frac{2c}{\sqrt{c^{2}-A_{1}(\delta)c-A_{2}}}$$
guarantees the condition. This completes the proof. ∎
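To get a feel for the size of the constants in Corollary D.1, the sketch below evaluates $A_{1}(\delta)$, $A_{2}$, $c_{\min}(\delta)$, and $\rho_{\min}(c)$ numerically and checks the sufficient condition (29). The function names and the parameter `conf_const` (the constant inside the logarithm, $2$ in Corollary D.1) are assumptions made for illustration.

```python
import math

# A_2 = 2706 * 3 / log 2 = 8118 / log 2, as in the proof of Corollary D.1.
A2 = 8118.0 / math.log(2.0)


def a1(delta: float, conf_const: float = 2.0) -> float:
    """A_1(delta) = 1536*sqrt(3/log 2)*sqrt(2*log(conf_const/delta)) + 2304."""
    return (1536.0 * math.sqrt(3.0 / math.log(2.0))
            * math.sqrt(2.0 * math.log(conf_const / delta)) + 2304.0)


def c_min(delta: float, conf_const: float = 2.0) -> float:
    """Larger root of c^2 - A_1(delta)*c - A_2 = 0."""
    a = a1(delta, conf_const)
    return (a + math.sqrt(a * a + 4.0 * A2)) / 2.0


def rho(c: float, delta: float, conf_const: float = 2.0) -> float:
    """rho_min(c) = 2c / sqrt(c^2 - A_1(delta)*c - A_2); needs c > c_min(delta)."""
    gap = c * c - a1(delta, conf_const) * c - A2
    if gap <= 0.0:
        raise ValueError("c must exceed c_min(delta)")
    return 2.0 * c / math.sqrt(gap)


def condition_29_holds(c: float, rho_value: float, delta: float,
                       conf_const: float = 2.0, rtol: float = 1e-9) -> bool:
    """Check 4 + rho^2*A_1/c + rho^2*A_2/c^2 <= rho^2 up to rounding error."""
    a = a1(delta, conf_const)
    lhs = 4.0 + rho_value ** 2 * a / c + rho_value ** 2 * A2 / c ** 2
    return lhs <= rho_value ** 2 * (1.0 + rtol)


if __name__ == "__main__":
    delta = 0.05
    c = 2.0 * c_min(delta)  # any c > c_min(delta) is admissible
    r = rho(c, delta)
    print("c_min:", c_min(delta), "rho:", r, "ok:", condition_29_holds(c, r, delta))
```

Taking $\rho=\rho_{\min}(c)$ makes (29) hold with equality, which is why the check uses a small relative tolerance. As $c$ decreases toward $c_{\min}(\delta)$, $\rho$ blows up, so the radius $\rho R_{\mathrm{base}}$ trades off against the size of the stepsize.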
###### Corollary D.2 (High-probability rate for PR averaging).
Under Assumption [3.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1), with $\phi_{\infty}, r_{\infty}, \tau_{\mathrm{mix}}$ and $\omega$ as in Lemmas [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2)–[3.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4), consider the following stepsize schedule:
$$\eta_{t} = \frac{1}{c\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{t+1}\,\log(t+3)}, \qquad \forall t\geq 0,$$
for some numerical constant $c>0$. Fix any $\delta\in(0,1)$ and let $R_{\mathrm{base}} \coloneqq \max\left\{\|\bm{\theta}_{0}-\bm{\theta}^{*}\|, \|\bm{\theta}^{*}\|, r_{\infty}/\phi_{\infty}\right\}$, $R_{\mathrm{max}} = \rho R_{\mathrm{base}}$, $A_{1}(\delta) = 1536\sqrt{\frac{3}{\log 2}}\sqrt{2\log\frac{8}{\delta}}+2304$, and $A_{2} = \frac{8118}{\log 2}$. Then, provided that $c > c_{\min}(\delta) \coloneqq \frac{A_{1}(\delta)+\sqrt{A_{1}^{2}(\delta)+4A_{2}}}{2}$, with probability at least $1-\delta$, the following upper bound holds for $f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*})$:
$$\begin{aligned}
R_{\mathrm{max}}^{2}\,\min\Bigg\{ & \underbrace{\frac{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\,\log^{2}(T+2)}{\omega\,(\sqrt{T+1}-1)^{2}}\left(1+\frac{264}{c}+\frac{176\sqrt{3}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}}+\frac{288}{c^{2}\tau_{\mathrm{mix}}\log 2}\right)^{2}}_{\text{fast rate}}, \\
& \underbrace{\frac{c\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}\,\log(T+2)}{4(\sqrt{T+1}-1)}\left(1+\frac{1152}{c}+\frac{768\sqrt{3}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}}+\frac{4059}{c^{2}\tau_{\mathrm{mix}}\log 2}\right)}_{\text{robust rate}}\Bigg\},
\end{aligned}$$
where $\rho = \frac{2c}{\sqrt{c^{2}-A_{1}(\delta)c-A_{2}}}$. That is, $f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*}) = \min\{\mathcal{O}(\log^{2}(T)/(\omega T)), \mathcal{O}(\log(T)/\sqrt{T})\}$ up to factors depending on $R_{\mathrm{max}}$, $c$, $\tau_{\mathrm{mix}}$, $\phi_{\infty}$, and $\delta$.
###### Proof\.
The condition for $(c,\rho)$ can be derived as in Corollary [D.1](https://arxiv.org/html/2606.24981#A4.Thmtheorem1). We now focus on simplifying the constants $C_{\mathrm{robust}}$ and $C_{\mathrm{fast}}$. Lemma [F.3](https://arxiv.org/html/2606.24981#A6.Thmtheorem3)(i) gives $\sum_{t=0}^{\infty}\eta_{t}^{2} \leq \frac{3}{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log 2}$. Thus, we have
$$\begin{aligned}
C_{\mathrm{robust}} &\leq R_{\mathrm{max}}^{2}\Bigg[\frac{1}{2} + 384\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{\frac{3}{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log 2}}\sqrt{2\log\frac{8}{\delta}} + 576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2} \\
&\qquad\qquad + \left(672\tau_{\mathrm{mix}}+4.5\right)\phi_{\infty}^{4}\frac{3}{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log 2}\Bigg] \\
&\leq R_{\mathrm{max}}^{2}\left[\frac{1}{2} + \frac{384\sqrt{3}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}} + \frac{576}{c} + \frac{2029.5}{c^{2}\tau_{\mathrm{mix}}\log 2}\right],
\end{aligned}$$
where we used $\eta_{0} \leq 1/(c\tau_{\mathrm{mix}}\phi_{\infty}^{2})$ and $\tau_{\mathrm{mix}}\geq 1$, so that $\frac{2016}{c^{2}\tau_{\mathrm{mix}}\log 2} + \frac{13.5}{c^{2}\tau_{\mathrm{mix}}^{2}\log 2} \leq \frac{2029.5}{c^{2}\tau_{\mathrm{mix}}\log 2}$. Similarly,
$$\begin{aligned}
C_{\mathrm{fast}} &\leq R_{\mathrm{max}}\Bigg[2 + 528\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2} + 352\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{\frac{3}{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log 2}}\sqrt{2\log\frac{8}{\delta}} \\
&\qquad\qquad + 192\tau_{\mathrm{mix}}\phi_{\infty}^{4}\frac{3}{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log 2}\Bigg] \\
&\leq R_{\mathrm{max}}\left[2 + \frac{528}{c} + \frac{352\sqrt{3}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}} + \frac{576}{c^{2}\tau_{\mathrm{mix}}\log 2}\right].
\end{aligned}$$
Recalling $R_{\mathrm{max}} = \rho R_{\mathrm{base}}$, we obtain
$$\begin{aligned}
C_{\mathrm{fast}} &\leq 2\rho R_{\mathrm{base}}\left(1 + \frac{264}{c} + \frac{176\sqrt{3}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}} + \frac{288}{c^{2}\tau_{\mathrm{mix}}\log 2}\right), \\
C_{\mathrm{robust}} &\leq \frac{\rho^{2}R_{\mathrm{base}}^{2}}{2}\left(1 + \frac{1152}{c} + \frac{768\sqrt{3}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}} + \frac{4059}{c^{2}\tau_{\mathrm{mix}}\log 2}\right).
\end{aligned}$$
Next, let $S_{T} \coloneqq \sum_{t=1}^{T}\eta_{t-1} = \sum_{s=0}^{T-1}\eta_{s}$. By Lemma [F.3](https://arxiv.org/html/2606.24981#A6.Thmtheorem3) with $p=1$ and $\kappa = c\tau_{\mathrm{mix}}\phi_{\infty}^{2}$, $S_{T} \geq \frac{2(\sqrt{T+1}-1)}{c\tau_{\mathrm{mix}}\phi_{\infty}^{2}\log(T+2)}$. Therefore,
$$\frac{1}{S_{T}} \leq \frac{c\tau_{\mathrm{mix}}\phi_{\infty}^{2}\log(T+2)}{2(\sqrt{T+1}-1)}, \qquad \frac{1}{S_{T}^{2}} \leq \frac{c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log^{2}(T+2)}{4(\sqrt{T+1}-1)^{2}}.$$
Substituting the above bounds into Theorem [4.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2) yields the claimed result. ∎
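The two facts about the schedule used in this proof, the lower bound on $S_{T}$ and the square-summability bound $\sum_{t}\eta_{t}^{2} \leq 3/(c^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}\log 2)$, can be sanity-checked numerically. The sketch below does so for illustrative parameter values; the function names are assumptions, and the truncated sum only approximates the infinite series from below.

```python
import math


def eta(t: int, c: float, tau_mix: float, phi_inf: float) -> float:
    """Stepsize of Corollary D.2: 1 / (c*tau_mix*phi_inf^2*sqrt(t+1)*log(t+3))."""
    return 1.0 / (c * tau_mix * phi_inf ** 2 * math.sqrt(t + 1) * math.log(t + 3))


def partial_sum(T: int, c: float, tau_mix: float, phi_inf: float) -> float:
    """S_T = eta_0 + ... + eta_{T-1}."""
    return sum(eta(s, c, tau_mix, phi_inf) for s in range(T))


def partial_sum_lower_bound(T: int, c: float, tau_mix: float, phi_inf: float) -> float:
    """Lower bound 2*(sqrt(T+1) - 1) / (kappa*log(T+2)) with kappa = c*tau_mix*phi_inf^2."""
    kappa = c * tau_mix * phi_inf ** 2
    return 2.0 * (math.sqrt(T + 1) - 1.0) / (kappa * math.log(T + 2))


def squared_sum_upper_bound(c: float, tau_mix: float, phi_inf: float) -> float:
    """Upper bound 3 / (c^2*tau_mix^2*phi_inf^4*log 2) on the sum of eta_t^2."""
    return 3.0 / (c ** 2 * tau_mix ** 2 * phi_inf ** 4 * math.log(2.0))


if __name__ == "__main__":
    # Illustrative parameters only.
    c, tau_mix, phi_inf = 10.0, 2.0, 1.0
    for T in (1, 10, 1000):
        print(T, partial_sum(T, c, tau_mix, phi_inf),
              partial_sum_lower_bound(T, c, tau_mix, phi_inf))
```

The lower bound follows from $\log(s+3)\leq\log(T+2)$ for $s\leq T-1$ together with $\sum_{k=1}^{T}k^{-1/2} \geq 2(\sqrt{T+1}-1)$, so $S_{T}$ grows like $\sqrt{T}/\log T$.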
## Appendix E Removing the dependence on $\tau_{\mathrm{mix}}$ in the stepsize
The choice in ([15](https://arxiv.org/html/2606.24981#S4.E15)) depends on the unknown mixing parameter $\tau_{\mathrm{mix}}$, which may make the algorithm impractical. This dependence of the stepsize on $\tau_{\mathrm{mix}}$ can be avoided by running TD(0) with the modified stepsize schedule
$$\eta_{t} = \frac{1}{c\,\phi_{\infty}^{2}\sqrt{t+1}\,\log^{2}(t+3)}, \qquad \forall t\geq 0, \tag{30}$$
for some numerical constant $c>0$. For $T\geq 1$, let
$$S_{T} \coloneqq \sum_{t=1}^{T}\eta_{t-1}, \qquad \bar{\bm{\theta}}_{T} \coloneqq \frac{1}{S_{T}}\sum_{t=1}^{T}\eta_{t-1}\bm{\theta}_{t-1}. \tag{31}$$
###### Lemma E.1 (Deterministic bound on the finite prefix).
Assume the feature and reward bounds in Lemma [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2). For any nonnegative stepsizes $(\eta_{t})_{t\geq 0}$ and any integer $m\geq 0$, the TD(0) iterates satisfy
$$\max_{0\leq s\leq m}\|\bm{\theta}_{s}\| \leq \left(\|\bm{\theta}_{0}\| + \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}\right)\prod_{t=0}^{m-1}\left(1+(1+\gamma)\phi_{\infty}^{2}\eta_{t}\right) - \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}.$$
In particular, for the stepsize ([30](https://arxiv.org/html/2606.24981#A5.E30)), define
$$B_{m} \coloneqq \left(\|\bm{\theta}_{0}\| + \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}\right)\prod_{t=0}^{m-1}\left(1+\frac{1+\gamma}{c\sqrt{t+1}\,\log^{2}(t+3)}\right) - \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}.$$
Then $\max_{0\leq s\leq m}\|\bm{\theta}_{s}\| \leq B_{m}$. Furthermore,
$$B_{m} \leq \left(\|\bm{\theta}_{0}\| + \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}\right)\exp\left(\frac{2(1+\gamma)\sqrt{m}}{c\log^{2}3}\right) - \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}.$$
###### Proof\.
The bounds in Lemma [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2) imply
$$\begin{aligned}
\|\bm{g}(\bm{\theta}_{t},Z_{t})\| &= \left|r_{t} + \gamma\langle\bm{\phi}(s_{t+1}),\bm{\theta}_{t}\rangle - \langle\bm{\phi}(s_{t}),\bm{\theta}_{t}\rangle\right|\,\|\bm{\phi}(s_{t})\| \\
&\leq r_{\infty}\phi_{\infty} + (1+\gamma)\phi_{\infty}^{2}\|\bm{\theta}_{t}\|.
\end{aligned}$$
Therefore,
$$\|\bm{\theta}_{t+1}\| \leq \left(1+(1+\gamma)\phi_{\infty}^{2}\eta_{t}\right)\|\bm{\theta}_{t}\| + r_{\infty}\phi_{\infty}\eta_{t}.$$
Let $b \coloneqq r_{\infty}/((1+\gamma)\phi_{\infty})$ and $u_{t} \coloneqq \|\bm{\theta}_{t}\|+b$. Since $(1+\gamma)\phi_{\infty}^{2}\eta_{t}\,b = r_{\infty}\phi_{\infty}\eta_{t}$, we get $u_{t+1} \leq (1+(1+\gamma)\phi_{\infty}^{2}\eta_{t})u_{t}$. Iterating this inequality yields, for every $s\leq m$,
$$u_{s} \leq u_{0}\prod_{t=0}^{s-1}\left(1+(1+\gamma)\phi_{\infty}^{2}\eta_{t}\right) \leq u_{0}\prod_{t=0}^{m-1}\left(1+(1+\gamma)\phi_{\infty}^{2}\eta_{t}\right).$$
Taking the maximum over $0\leq s\leq m$ and subtracting $b$ proves the first claim. Substituting ([30](https://arxiv.org/html/2606.24981#A5.E30)) gives the displayed formula for $B_{m}$. Finally, using $1+x\leq e^{x}$ and
$$\sum_{t=0}^{m-1}\frac{1}{\sqrt{t+1}\,\log^{2}(t+3)} \leq \frac{2\sqrt{m}}{\log^{2}3}$$
gives the closed-form bound. ∎
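Because Lemma E.1 is deterministic, it can be checked on any bounded sequence of features and rewards. The sketch below runs the TD(0) update with the stepsize (30) on synthetic data and compares the largest iterate norm with $B_{m}$ and with the closed-form bound. The i.i.d. random features and rewards are an assumption made purely for illustration (the paper's data come from a Markov chain), and all function names are hypothetical.

```python
import math
import random


def eta_mixing_free(t, c, phi_inf):
    """Stepsize (30): 1 / (c * phi_inf^2 * sqrt(t+1) * log^2(t+3))."""
    return 1.0 / (c * phi_inf ** 2 * math.sqrt(t + 1) * math.log(t + 3) ** 2)


def prefix_bound(m, theta0_norm, c, gamma, phi_inf, r_inf):
    """B_m from Lemma E.1, computed with the explicit product."""
    b = r_inf / ((1.0 + gamma) * phi_inf)
    growth = 1.0
    for t in range(m):
        growth *= 1.0 + (1.0 + gamma) * phi_inf ** 2 * eta_mixing_free(t, c, phi_inf)
    return (theta0_norm + b) * growth - b


def prefix_bound_closed_form(m, theta0_norm, c, gamma, phi_inf, r_inf):
    """Closed-form upper bound on B_m with the factor exp(2(1+gamma)sqrt(m)/(c log^2 3))."""
    b = r_inf / ((1.0 + gamma) * phi_inf)
    exponent = 2.0 * (1.0 + gamma) * math.sqrt(m) / (c * math.log(3.0) ** 2)
    return (theta0_norm + b) * math.exp(exponent) - b


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _bounded_vector(dim, radius, rng):
    """Random vector with Euclidean norm at most `radius` (synthetic feature)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = radius * rng.random() / _norm(v)
    return [scale * x for x in v]


def simulate_max_norm(m, theta0, c, gamma, phi_inf, r_inf, seed=0):
    """Run m TD(0) steps on synthetic bounded data; return max_{s<=m} ||theta_s||."""
    rng = random.Random(seed)
    theta = list(theta0)
    max_norm = _norm(theta)
    for t in range(m):
        phi = _bounded_vector(len(theta), phi_inf, rng)
        phi_next = _bounded_vector(len(theta), phi_inf, rng)
        reward = rng.uniform(-r_inf, r_inf)
        td_error = reward + gamma * _dot(phi_next, theta) - _dot(phi, theta)
        step = eta_mixing_free(t, c, phi_inf) * td_error
        theta = [th + step * p for th, p in zip(theta, phi)]
        max_norm = max(max_norm, _norm(theta))
    return max_norm
```

On typical runs the simulated norm stays far below $B_{m}$: the lemma only uses the worst-case growth factor at each step and ignores the contraction that TD(0) exhibits on average.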
###### Corollary E.2 (Mixing-free stepsize).
Under Assumption [3.1](https://arxiv.org/html/2606.24981#S3.Thmtheorem1), with $\phi_{\infty}, r_{\infty}, \tau_{\mathrm{mix}}$ and $\omega$ as in Lemmas [3.2](https://arxiv.org/html/2606.24981#S3.Thmtheorem2)–[3.4](https://arxiv.org/html/2606.24981#S3.Thmtheorem4), run TD(0) with the stepsize schedule
$$\eta_{t} = \frac{1}{c\,\phi_{\infty}^{2}\sqrt{t+1}\,\log^{2}(t+3)}, \qquad t\geq 0,$$
for some numerical constant $c>0$, and form the PR average $\bar{\bm{\theta}}_{T}$ as in ([31](https://arxiv.org/html/2606.24981#A5.E31)). Let
$$m^{*} \coloneqq \min\{m\geq 0 : \log(m+3)\geq\tau_{\mathrm{mix}}\}.$$
For $m\geq 0$, define the deterministic prefix bound
$$B_{m} \coloneqq \left(\|\bm{\theta}_{0}\| + \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}\right)\prod_{t=0}^{m-1}\left(1+\frac{1+\gamma}{c\sqrt{t+1}\,\log^{2}(t+3)}\right) - \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}.$$
Fix any $\delta\in(0,1)$ and set
$$R_{\mathrm{base}} \coloneqq \max\left\{B_{m^{*}}+\|\bm{\theta}^{*}\|,\ \|\bm{\theta}^{*}\|,\ \frac{r_{\infty}}{\phi_{\infty}}\right\}.$$
Define
$$A_{1}(\delta) \coloneqq 1536\sqrt{\frac{3}{\log 2}}\sqrt{2\log\frac{8}{\delta}}+2304, \qquad A_{2} \coloneqq \frac{8118}{\log 2},$$
$$c_{\min}(\delta) \coloneqq \frac{A_{1}(\delta)+\sqrt{A_{1}^{2}(\delta)+4A_{2}}}{2}, \qquad \rho \coloneqq \frac{2c}{\sqrt{c^{2}-A_{1}(\delta)c-A_{2}}}, \qquad R_{\mathrm{max}} \coloneqq \rho R_{\mathrm{base}}.$$
Then, provided that $c > c_{\min}(\delta)$, with probability at least $1-\delta$, we have
$$\sup_{t\geq 0}\ \|\bm{\theta}_{t}\| \leq R_{\mathrm{max}},$$
and for every $T\geq 1$, $f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*})$ is upper-bounded by
$$\begin{aligned}
R_{\mathrm{max}}^{2}\,\min\Bigg\{ & \underbrace{\frac{c^{2}\phi_{\infty}^{4}\log^{4}(T+2)}{\omega(\sqrt{T+1}-1)^{2}}\left(1+\frac{264\tau_{\mathrm{mix}}}{c\log^{2}3}+\frac{176\sqrt{3}\,\tau_{\mathrm{mix}}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}}+\frac{288\tau_{\mathrm{mix}}}{c^{2}\log 2}\right)^{2}}_{\text{fast rate}}, \\
& \underbrace{\frac{c\,\phi_{\infty}^{2}\log^{2}(T+2)}{4(\sqrt{T+1}-1)}\left(1+\frac{1152\tau_{\mathrm{mix}}}{c\log^{2}3}+\frac{768\sqrt{3}\,\tau_{\mathrm{mix}}}{c\sqrt{\log 2}}\sqrt{2\log\frac{8}{\delta}}+\frac{4032\tau_{\mathrm{mix}}+27}{c^{2}\log 2}\right)}_{\text{robust rate}}\Bigg\}.
\end{aligned}$$
Equivalently, since $\tau_{\mathrm{mix}}\geq 1$, the horizon dependence is
$$f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*}) = \min\left\{\widetilde{\mathcal{O}}\!\left(\frac{R_{\mathrm{max}}^{2}\tau_{\mathrm{mix}}^{2}\phi_{\infty}^{4}}{\omega T}\right),\ \widetilde{\mathcal{O}}\!\left(\frac{R_{\mathrm{max}}^{2}\tau_{\mathrm{mix}}\phi_{\infty}^{2}}{\sqrt{T}}\right)\right\},$$
where the $\widetilde{\mathcal{O}}(\cdot)$ notation hides logarithmic factors in $T$ and $1/\delta$, as well as constants depending on $c$. Moreover, the prefactor depends doubly exponentially on $\tau_{\mathrm{mix}}$ through its dependence on $R_{\mathrm{base}}$: using $m^{*}\leq e^{\tau_{\mathrm{mix}}}$ and Lemma [E.1](https://arxiv.org/html/2606.24981#A5.Thmtheorem1),
$$B_{m^{*}} \leq \left(\|\bm{\theta}_{0}\| + \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}\right)\exp\left(\frac{2(1+\gamma)e^{\tau_{\mathrm{mix}}/2}}{c\log^{2}3}\right) - \frac{r_{\infty}}{(1+\gamma)\phi_{\infty}}.$$
Consequently, for fixed $c,\gamma,\phi_{\infty},r_{\infty},\bm{\theta}_{0},\bm{\theta}^{*}$, and $\delta$, there is a constant $C_{0}$ independent of $\tau_{\mathrm{mix}}$ such that
$$R_{\mathrm{max}}^{2} \leq C_{0}\exp\left(\frac{4(1+\gamma)e^{\tau_{\mathrm{mix}}/2}}{c\log^{2}3}\right) = \exp\left(\mathcal{O}\bigl(e^{\tau_{\mathrm{mix}}/2}\bigr)\right).$$
Thus the total fast and robust prefactors are both of doubly exponential order $\exp(\mathcal{O}(e^{\tau_{\mathrm{mix}}/2}))$ up to polynomial factors in $\tau_{\mathrm{mix}}$.
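The price of removing $\tau_{\mathrm{mix}}$ from the stepsize is easiest to see numerically. The sketch below computes the burn-in index $m^{*}$ and the exponent $4(1+\gamma)e^{\tau_{\mathrm{mix}}/2}/(c\log^{2}3)$ appearing in the bound on $R_{\mathrm{max}}^{2}$. The function names and the sample values of $c$ and $\gamma$ are illustrative assumptions; the value of $c$ is not checked against $c_{\min}(\delta)$.

```python
import math


def m_star(tau_mix: float) -> int:
    """Smallest integer m >= 0 with log(m + 3) >= tau_mix."""
    m = max(0, math.ceil(math.exp(tau_mix) - 3.0))
    # Guard against floating-point rounding in exp/log.
    while math.log(m + 3) < tau_mix:
        m += 1
    while m > 0 and math.log(m + 2) >= tau_mix:
        m -= 1
    return m


def rmax_sq_exponent(tau_mix: float, c: float, gamma: float) -> float:
    """Exponent 4(1+gamma)e^{tau_mix/2} / (c log^2 3) in the bound on R_max^2."""
    return 4.0 * (1.0 + gamma) * math.exp(tau_mix / 2.0) / (c * math.log(3.0) ** 2)


if __name__ == "__main__":
    # Illustrative values only.
    c, gamma = 1.0e4, 0.9
    for tau in (1.0, 2.0, 5.0, 10.0, 20.0):
        print(tau, m_star(tau), rmax_sq_exponent(tau, c, gamma))
```

The burn-in length $m^{*}$ is itself exponential in $\tau_{\mathrm{mix}}$, and the deterministic prefix bound grows exponentially in $\sqrt{m^{*}}$; this composition is the source of the doubly exponential prefactor.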
###### Proof\.
Let
$$L_{\delta} \coloneqq \sqrt{2\log\frac{8}{\delta}}, \qquad H \coloneqq \sum_{t=0}^{\infty}\eta_{t}^{2}.$$
We first prove the bounded-iterates event. Consider the shifted process
$$\bm{\theta}^{\prime}_{k} \coloneqq \bm{\theta}_{m^{*}+k}, \qquad Z^{\prime}_{k} \coloneqq Z_{m^{*}+k}, \qquad \eta^{\prime}_{k} \coloneqq \eta_{m^{*}+k}.$$
The shifted chain has the same transition kernel, hence the same mixing constant $\tau_{\mathrm{mix}}$. Then
$$\eta^{\prime}_{k} = \frac{1}{c\,\tau_{\mathrm{mix}}\phi_{\infty}^{2}}\,a^{\prime}_{k}, \qquad a^{\prime}_{k} \coloneqq \frac{\tau_{\mathrm{mix}}}{\sqrt{m^{*}+k+1}\,\log^{2}(m^{*}+k+3)}.$$
By the definition of $m^{*}$, we have $\tau_{\mathrm{mix}} \leq \log(m^{*}+3) \leq \log(m^{*}+k+3)$, so for every $k\geq 0$,
$$0 < a^{\prime}_{k} \leq \frac{1}{\sqrt{m^{*}+k+1}\,\log(m^{*}+k+3)}, \qquad a^{\prime}_{0} \leq 1,$$
and $(a^{\prime}_{k})_{k\geq 0}$ is non-increasing. Moreover, Lemma [F.3](https://arxiv.org/html/2606.24981#A6.Thmtheorem3)(i) gives
$$\begin{aligned}
\sum_{k=0}^{\infty}(a^{\prime}_{k})^{2} &= \tau_{\mathrm{mix}}^{2}\sum_{k=0}^{\infty}\frac{1}{(m^{*}+k+1)\log^{4}(m^{*}+k+3)} \\
&\leq \sum_{k=0}^{\infty}\frac{1}{(m^{*}+k+1)\log^{2}(m^{*}+k+3)} \leq \frac{3}{\log 2}.
\end{aligned}$$
Conditioning on $\mathcal{F}_{m^{*}} = \sigma(Z_{0},\dots,Z_{m^{*}})$ and applying Theorem [4.1](https://arxiv.org/html/2606.24981#S4.Thmtheorem1) to the shifted recursion with confidence parameter $\delta/4$, we obtain an event $\mathcal{E}_{R}^{+}$ such that
$$\mathbb{P}\{\mathcal{E}_{R}^{+}\mid\mathcal{F}_{m^{*}}\} \geq 1-\frac{\delta}{4} \quad \text{a.s.},$$
and on $\mathcal{E}_{R}^{+}$,
$$\sup_{k\geq 0}\ \|\bm{\theta}_{m^{*}+k}\| \leq \rho\max\left\{\|\bm{\theta}_{m^{*}}-\bm{\theta}^{*}\|, \|\bm{\theta}^{*}\|, \frac{r_{\infty}}{\phi_{\infty}}\right\} \leq \rho R_{\mathrm{base}} = R_{\mathrm{max}}.$$
The constants $A_{1}(\delta)$, $A_{2}$, and $c_{\min}(\delta)$ are exactly those obtained from the last square-summability bound and the confidence parameter $\delta/4$. Taking expectations in the preceding conditional probability bound gives $\mathbb{P}\{\mathcal{E}_{R}^{+}\} \geq 1-\delta/4$. On the finite prefix, Lemma [E.1](https://arxiv.org/html/2606.24981#A5.Thmtheorem1) gives $\sup_{0\leq s\leq m^{*}}\|\bm{\theta}_{s}\| \leq B_{m^{*}} \leq R_{\mathrm{base}} \leq R_{\mathrm{max}}$. Therefore the event
$$\mathcal{E}_{R} \coloneqq \left\{\sup_{t\geq 0}\ \|\bm{\theta}_{t}\| \leq R_{\mathrm{max}}\right\}$$
satisfies
$$\mathbb{P}\{\mathcal{E}_{R}\} \geq 1-\frac{\delta}{4}.$$
We next follow the proof strategy of Theorem [4.2](https://arxiv.org/html/2606.24981#S4.Thmtheorem2) in Appendix [C](https://arxiv.org/html/2606.24981#A3), spelling out the high-probability events needed for the present mixing-free stepsize. For every $T\geq 1$, the energy decomposition ([20](https://arxiv.org/html/2606.24981#A2.E20)), which will be used for the robust-rate bound, gives
$$\|\bm{e}_{T}\|^{2} = \|\bm{e}_{0}\|^{2} + \sum_{t=1}^{T}\eta_{t-1}^{2}\|\bm{g}(\bm{\theta}_{t-1},Z_{t-1})\|^{2} + 2\sum_{t=1}^{T}\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle + B_{T}, \tag{32}$$
where $\bm{e}_{t} = \bm{\theta}_{t}-\bm{\theta}^{*}$ and
$$B_{T} = 2\sum_{t=1}^{T}\eta_{t-1}\,\langle\bm{g}(\bm{\theta}_{t-1},Z_{t-1})-\bar{\bm{g}}(\bm{\theta}_{t-1}),\,\bm{\theta}_{t-1}-\bm{\theta}^{*}\rangle.$$
Similarly, the PR decomposition ([26](https://arxiv.org/html/2606.24981#A3.E26)), which will be used for the fast-rate bound, gives
$$\bm{A}\bar{\bm{e}}_{T} = I_{1,T}+I_{2,T}+I_{3,T}, \tag{33}$$
where
$$\bar{\bm{e}}_{T} \coloneqq \frac{1}{S_{T}}\sum_{t=1}^{T}\eta_{t-1}\bm{e}_{t-1},$$
and
$$I_{1,T} = \frac{1}{S_{T}}\sum_{t=1}^{T}(\bm{e}_{t-1}-\bm{e}_{t}), \qquad I_{2,T} = \frac{1}{S_{T}}\sum_{t=1}^{T}\eta_{t-1}\bm{\xi}_{t-1}, \qquad I_{3,T} = \frac{1}{S_{T}}\sum_{t=1}^{T}\eta_{t-1}\bm{\delta}_{t-1}\bm{e}_{t-1}.$$
We now introduce three additional high-probability events, $\mathcal{E}_{2}$, $\mathcal{E}_{3}$, and $\mathcal{E}_{4}$, to control the terms appearing in the two decompositions above. First, applying Lemma [C.2](https://arxiv.org/html/2606.24981#A3.Thmtheorem2) with confidence parameter $\delta/4$ and using $r_{\infty}\phi_{\infty}+2\phi_{\infty}^{2}\|\bm{\theta}^{*}\| \leq 3\phi_{\infty}^{2}R_{\mathrm{max}}$, we obtain an event $\mathcal{E}_{2}$ such that
ℙ\{ℰ2\}≥1−δ4,\\mathbb\{P\}\\\{\\mathcal\{E\}\_\{2\}\\\}\\geq 1\-\\frac\{\\delta\}\{4\},and, onℰ2\\mathcal\{E\}\_\{2\}, for allT≥1T\\geq 1,
‖I2,T‖≤48τmixϕ∞2RmaxST\(2HLδ\+3η0\)\.\\left\\\|\{I\_\{2,T\}\}\\right\\\|\\leq\\frac\{48\\tau\_\{\\mathrm\{mix\}\}\\phi\_\{\\infty\}^\{2\}R\_\{\\mathrm\{max\}\}\}\{S\_\{T\}\}\\left\(2\\sqrt\{H\}L\_\{\\delta\}\+3\\eta\_\{0\}\\right\)\\,\.Next, to controlI3,TI\_\{3,T\}in \([33](https://arxiv.org/html/2606.24981#A5.E33)\) and the Markov\-bias termBTB\_\{T\}in \([32](https://arxiv.org/html/2606.24981#A5.E32)\), we use the stopped process at radiusRmaxR\_\{\\mathrm\{max\}\}:
$$T_{\mathrm{exit}}\coloneqq\inf\{t\geq 0:\|\bm{\theta}_{t}\|>R_{\mathrm{max}}\},\quad\tilde{\bm{\theta}}_{t}\coloneqq\bm{\theta}_{t\wedge T_{\mathrm{exit}}},\quad\tilde{\bm{e}}_{t}\coloneqq\tilde{\bm{\theta}}_{t}-\bm{\theta}^{*},\quad\tilde{\eta}_{t}\coloneqq\eta_{t}\mathbf{1}\{t<T_{\mathrm{exit}}\}\,.$$
Define the corresponding localized quantities:
$$\tilde{I}_{3,T}\coloneqq\frac{1}{S_{T}}\sum_{t=1}^{T}\tilde{\eta}_{t-1}\bm{\delta}_{t-1}\tilde{\bm{e}}_{t-1},\qquad\tilde{B}_{T}\coloneqq 2\sum_{t=1}^{T}\tilde{\eta}_{t-1}\,\langle\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})-\bar{\bm{g}}(\tilde{\bm{\theta}}_{t-1}),\,\tilde{\bm{\theta}}_{t-1}-\bm{\theta}^{*}\rangle\,.$$
Applying Lemma [C.3](https://arxiv.org/html/2606.24981#A3.Thmtheorem3) with confidence parameter $\delta/4$ gives an event $\mathcal{E}_{3}$ such that
$$\mathbb{P}\{\mathcal{E}_{3}\}\geq 1-\frac{\delta}{4},$$
and, on $\mathcal{E}_{3}$, for all $T\geq 1$,
$$\|\tilde{I}_{3,T}\|\leq\frac{1}{S_{T}}\left(256\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}\sqrt{H}L_{\delta}+384\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}+192\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}H\right)\,.$$
Finally, Lemma [B.3](https://arxiv.org/html/2606.24981#A2.Thmtheorem3) with confidence parameter $\delta/4$, together with the deterministic quadratic-term bound used in the proof of Lemma [B.4](https://arxiv.org/html/2606.24981#A2.Thmtheorem4), gives an event $\mathcal{E}_{4}$ such that
$$\mathbb{P}\{\mathcal{E}_{4}\}\geq 1-\frac{\delta}{4},$$
and, on $\mathcal{E}_{4}$, for all $T\geq 1$,
$$\begin{aligned}
\sum_{t=1}^{T}\tilde{\eta}_{t-1}^{2}\|\bm{g}(\tilde{\bm{\theta}}_{t-1},Z_{t-1})\|^{2}+\tilde{B}_{T}
&\leq 768\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{H}L_{\delta}+1152\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\\
&\quad+1344\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}H+9\phi_{\infty}^{4}R_{\mathrm{max}}^{2}H\,.
\end{aligned}\tag{34}$$
Define
$$\mathcal{E}\coloneqq\mathcal{E}_{R}\cap\mathcal{E}_{2}\cap\mathcal{E}_{3}\cap\mathcal{E}_{4}\,.$$
Since each of $\mathcal{E}_{R}$, $\mathcal{E}_{2}$, $\mathcal{E}_{3}$, and $\mathcal{E}_{4}$ has probability at least $1-\delta/4$, the union bound gives $\mathbb{P}\{\mathcal{E}\}\geq 1-\delta$. On $\mathcal{E}_{R}$, we have $T_{\mathrm{exit}}=\infty$, so the stopped and original processes coincide. Hence, on $\mathcal{E}$, $\tilde{I}_{3,T}=I_{3,T}$ and $\tilde{B}_{T}=B_{T}$ for every $T\geq 1$.
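To make the localization device concrete, here is a minimal Python sketch of the stopped process on a finite trajectory. The function name `stop_process` and the toy inputs are illustrative assumptions, not part of the paper; the sketch only mirrors the definitions of $T_{\mathrm{exit}}$, $\tilde{\bm{\theta}}_{t}$, and $\tilde{\eta}_{t}$, and shows that nothing changes when the iterates never leave the ball.

```python
import math

def stop_process(thetas, etas, r_max):
    """Return (t_exit, stopped_thetas, stopped_etas) for the radius r_max.

    thetas: list of iterates theta_0, theta_1, ... (each a list of floats).
    etas:   list of stepsizes eta_0, eta_1, ... of the same length.
    """
    norms = [math.sqrt(sum(x * x for x in theta)) for theta in thetas]
    # T_exit: first index whose norm exceeds r_max, or infinity if none does.
    t_exit = next((t for t, n in enumerate(norms) if n > r_max), math.inf)
    # theta_tilde_t = theta_{min(t, T_exit)}: the iterate is frozen from T_exit on.
    stopped_thetas = [thetas[t if t < t_exit else t_exit] for t in range(len(thetas))]
    # eta_tilde_t = eta_t * 1{t < T_exit}: stepsizes are switched off from T_exit on.
    stopped_etas = [eta if t < t_exit else 0.0 for t, eta in enumerate(etas)]
    return t_exit, stopped_thetas, stopped_etas
```

If the trajectory stays inside the ball, `t_exit` is `math.inf` and both returned lists equal the inputs, which is the finite-horizon analogue of $\tilde{I}_{3,T}=I_{3,T}$ and $\tilde{B}_{T}=B_{T}$ on $\mathcal{E}_{R}$.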
We now derive the fast-rate control on $\mathcal{E}$. By Lemma [C.1](https://arxiv.org/html/2606.24981#A3.Thmtheorem1),
$$\|I_{1,T}\|\leq\frac{2R_{\mathrm{max}}}{S_{T}}\,.$$
Combining ([33](https://arxiv.org/html/2606.24981#A5.E33)) with the preceding bounds on $I_{1,T}$, $I_{2,T}$, and $\tilde{I}_{3,T}=I_{3,T}$, for all $T\geq 1$,
$$\|\bm{A}\bar{\bm{e}}_{T}\|\leq\frac{R_{\mathrm{max}}}{S_{T}}\left[2+528\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}+352\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{H}L_{\delta}+192\tau_{\mathrm{mix}}\phi_{\infty}^{4}H\right]\,.\tag{35}$$
For the present stepsize,
$$\eta_{0}=\frac{1}{c\phi_{\infty}^{2}\log^{2}3},\qquad H=\frac{1}{c^{2}\phi_{\infty}^{4}}\sum_{t=0}^{\infty}\frac{1}{(t+1)\log^{4}(t+3)}\leq\frac{3}{c^{2}\phi_{\infty}^{4}\log 2},\tag{36}$$
where the last inequality follows from $\log(t+3)\geq\log 3>1$ and Lemma [F.3](https://arxiv.org/html/2606.24981#A6.Thmtheorem3)(i). Hence
$$\begin{aligned}
&2+528\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}+352\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{H}L_{\delta}+192\tau_{\mathrm{mix}}\phi_{\infty}^{4}H\\
&\qquad\leq 2\left(1+\frac{264\tau_{\mathrm{mix}}}{c\log^{2}3}+\frac{176\sqrt{3}\tau_{\mathrm{mix}}}{c\sqrt{\log 2}}L_{\delta}+\frac{288\tau_{\mathrm{mix}}}{c^{2}\log 2}\right)\,.
\end{aligned}$$
Using ([27](https://arxiv.org/html/2606.24981#A3.E27)), ([35](https://arxiv.org/html/2606.24981#A5.E35)), and the preceding simplification,
$$f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*})\leq\frac{4R_{\mathrm{max}}^{2}}{\omega S_{T}^{2}}\left(1+\frac{264\tau_{\mathrm{mix}}}{c\log^{2}3}+\frac{176\sqrt{3}\tau_{\mathrm{mix}}}{c\sqrt{\log 2}}L_{\delta}+\frac{288\tau_{\mathrm{mix}}}{c^{2}\log 2}\right)^{2}\,.\tag{37}$$
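The constants entering (35) and the simplification before (37) are easy to mis-add by hand. As a sanity check only, the following Python sketch recombines the coefficients from the bounds on $I_{1,T}$, $I_{2,T}$, and $\tilde{I}_{3,T}$ and compares both sides of the simplification when $H$ is replaced by its upper bound from (36). The function name `fast_bracket` and the sample arguments are hypothetical; $L_{\delta}$ is passed in as a plain number because it is defined elsewhere in the paper.

```python
import math

# Coefficients collected from the bounds on I_{1,T}, I_{2,T}, and I~_{3,T}.
COEF_SQRT_H = 48 * 2 + 256   # sqrt(H) * L_delta terms: 96 + 256 = 352
COEF_ETA0 = 48 * 3 + 384     # eta_0 terms: 144 + 384 = 528
COEF_H = 192                 # H term, contributed by I~_{3,T} only

def fast_bracket(tau, c, phi, L_delta):
    """Return (lhs, rhs) of the simplification, with H set to its bound in (36)."""
    eta0 = 1.0 / (c * phi**2 * math.log(3) ** 2)
    h_ub = 3.0 / (c**2 * phi**4 * math.log(2))
    lhs = (2 + COEF_ETA0 * eta0 * tau * phi**2
           + COEF_SQRT_H * tau * phi**2 * math.sqrt(h_ub) * L_delta
           + COEF_H * tau * phi**4 * h_ub)
    rhs = 2 * (1 + 264 * tau / (c * math.log(3) ** 2)
               + 176 * math.sqrt(3) * tau * L_delta / (c * math.sqrt(math.log(2)))
               + 288 * tau / (c**2 * math.log(2)))
    return lhs, rhs
```

With $H$ at its upper bound the two sides agree up to floating-point rounding, so the displayed inequality is tight in that substitution and all of its slack comes from (36).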
We next derive the robust-rate control on $\mathcal{E}$. Using the robust rate decomposition ([32](https://arxiv.org/html/2606.24981#A5.E32)), the identity $\tilde{B}_{T}=B_{T}$ on $\mathcal{E}$, and ([34](https://arxiv.org/html/2606.24981#A5.E34)),
$$\begin{aligned}
2\sum_{t=1}^{T}\eta_{t-1}\langle\bar{\bm{g}}(\bm{\theta}_{t-1}),\bm{\theta}^{*}-\bm{\theta}_{t-1}\rangle
&\leq R_{\mathrm{max}}^{2}+768\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\sqrt{H}L_{\delta}+1152\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}R_{\mathrm{max}}^{2}\\
&\quad+1344\tau_{\mathrm{mix}}\phi_{\infty}^{4}R_{\mathrm{max}}^{2}H+9\phi_{\infty}^{4}R_{\mathrm{max}}^{2}H\,.
\end{aligned}$$
By convexity of $f$ and ([4](https://arxiv.org/html/2606.24981#S3.E4)),
$$\begin{aligned}
f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*})
&\leq\frac{R_{\mathrm{max}}^{2}}{S_{T}}\left[\frac{1}{2}+384\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{H}L_{\delta}+576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}\right.\\
&\qquad\qquad\left.+672\tau_{\mathrm{mix}}\phi_{\infty}^{4}H+\frac{9}{2}\phi_{\infty}^{4}H\right]\,.
\end{aligned}$$
Using ([36](https://arxiv.org/html/2606.24981#A5.E36)),
$$\begin{aligned}
&\frac{1}{2}+384\tau_{\mathrm{mix}}\phi_{\infty}^{2}\sqrt{H}L_{\delta}+576\eta_{0}\tau_{\mathrm{mix}}\phi_{\infty}^{2}+672\tau_{\mathrm{mix}}\phi_{\infty}^{4}H+\frac{9}{2}\phi_{\infty}^{4}H\\
&\qquad\leq\frac{1}{2}\left(1+\frac{1152\tau_{\mathrm{mix}}}{c\log^{2}3}+\frac{768\sqrt{3}\tau_{\mathrm{mix}}}{c\sqrt{\log 2}}L_{\delta}+\frac{4032\tau_{\mathrm{mix}}+27}{c^{2}\log 2}\right)\,.
\end{aligned}$$
Therefore,
$$f(\bar{\bm{\theta}}_{T})-f(\bm{\theta}^{*})\leq\frac{R_{\mathrm{max}}^{2}}{2S_{T}}\left(1+\frac{1152\tau_{\mathrm{mix}}}{c\log^{2}3}+\frac{768\sqrt{3}\tau_{\mathrm{mix}}}{c\sqrt{\log 2}}L_{\delta}+\frac{4032\tau_{\mathrm{mix}}+27}{c^{2}\log 2}\right)\,.\tag{38}$$
Finally, Lemma [F.3](https://arxiv.org/html/2606.24981#A6.Thmtheorem3) with $p=2$ and $\kappa=c\phi_{\infty}^{2}$ gives
$$S_{T}\geq\frac{2(\sqrt{T+1}-1)}{c\phi_{\infty}^{2}\log^{2}(T+2)},$$
and hence
$$\frac{1}{S_{T}}\leq\frac{c\phi_{\infty}^{2}\log^{2}(T+2)}{2(\sqrt{T+1}-1)},\qquad\frac{1}{S_{T}^{2}}\leq\frac{c^{2}\phi_{\infty}^{4}\log^{4}(T+2)}{4(\sqrt{T+1}-1)^{2}}\,.\tag{39}$$
Substituting ([39](https://arxiv.org/html/2606.24981#A5.E39)) into ([37](https://arxiv.org/html/2606.24981#A5.E37)) and ([38](https://arxiv.org/html/2606.24981#A5.E38)) gives the displayed explicit rates. The big-$\widetilde{\mathcal{O}}$ statement follows directly from the displayed formula, and the doubly-exponential dependence on $\tau_{\mathrm{mix}}$ follows from Lemma [E.1](https://arxiv.org/html/2606.24981#A5.Thmtheorem1) and $m^{*}\leq e^{\tau_{\mathrm{mix}}}$. ∎
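For a feel of how the two regimes trade off, the substitution of (39) into (37) and (38) can be evaluated numerically. The Python sketch below is an illustration under assumptions: the function name `explicit_rates` is hypothetical, all problem constants ($\tau_{\mathrm{mix}}$, $c$, $\phi_{\infty}$, $L_{\delta}$, $R_{\mathrm{max}}$, $\omega$) are supplied by the caller as plain numbers, and the returned values are the upper bounds valid on $\mathcal{E}$, not the actual suboptimality.

```python
import math

def explicit_rates(T, tau, c, phi, L_delta, r_max, omega):
    """Return (fast, robust, best) upper bounds on f(theta_bar_T) - f(theta*)."""
    # Upper bound on 1/S_T from (39).
    s_inv = c * phi**2 * math.log(T + 2) ** 2 / (2.0 * (math.sqrt(T + 1) - 1.0))
    log3_sq = math.log(3) ** 2
    sqrt_log2 = math.sqrt(math.log(2))
    # Bracket appearing in (37).
    k_fast = (1 + 264 * tau / (c * log3_sq)
              + 176 * math.sqrt(3) * tau * L_delta / (c * sqrt_log2)
              + 288 * tau / (c**2 * math.log(2)))
    # Bracket appearing in (38).
    k_robust = (1 + 1152 * tau / (c * log3_sq)
                + 768 * math.sqrt(3) * tau * L_delta / (c * sqrt_log2)
                + (4032 * tau + 27) / (c**2 * math.log(2)))
    fast = 4.0 * r_max**2 / omega * s_inv**2 * k_fast**2   # curvature-dependent
    robust = 0.5 * r_max**2 * s_inv * k_robust             # curvature-free
    return fast, robust, min(fast, robust)
```

For a well-conditioned problem and a long horizon the curvature-dependent bound is the smaller one, while for very small $\omega$ the curvature-free bound takes over, which is the minimum-of-two behaviour described in the abstract.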
## Appendix F Technical Lemmas
### F.1 Pinelis' Inequality
###### Lemma F.1 (Pinelis [1994](https://arxiv.org/html/2606.24981#bib.bib43), Thm. 3.5).
Let $(f_{j})_{j\geq 0}$ be a martingale in a $(2,D)$-smooth Banach space $(X,\|\cdot\|)$ with $f_{0}=0$. Let $d_{j}:=f_{j}-f_{j-1}$, $f^{\ast}:=\sup_{j\geq 0}\|f_{j}\|$, and $\|d_{j}\|_{\infty}:=\operatorname*{ess\,sup}\|d_{j}\|$. If $\sum_{j\geq 1}\|d_{j}\|_{\infty}^{2}\leq b_{\ast}^{2}$, then for all $r\geq 0$,
$$\mathbb{P}\!\left(f^{\ast}\geq r\right)\leq 2\exp\!\left(-\frac{r^{2}}{2D^{2}b_{\ast}^{2}}\right)\,.$$
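In high-probability arguments the tail bound is typically inverted: setting the right-hand side equal to a confidence level $\delta$ gives the radius $r=D\,b_{\ast}\sqrt{2\log(2/\delta)}$. The short Python sketch below performs this inversion; the function names are illustrative assumptions, and the choice $D=1$ in the usage corresponds to a Hilbert space such as Euclidean $\mathbb{R}^{d}$.

```python
import math

def pinelis_tail(r, D, b_star):
    """Right-hand side of the tail bound, capped at 1 since it bounds a probability."""
    return min(1.0, 2.0 * math.exp(-r**2 / (2.0 * D**2 * b_star**2)))

def pinelis_radius(delta, D, b_star):
    """Radius r at which the tail bound equals delta, for delta in (0, 1)."""
    return D * b_star * math.sqrt(2.0 * math.log(2.0 / delta))
```

For example, `pinelis_radius(0.05, 1.0, 2.0)` returns the deviation level that the running maximum $f^{\ast}$ exceeds with probability at most $0.05$ when $D=1$ and $b_{\ast}=2$.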
### F.2 Miscellaneous lemmas
###### Lemma F.2 (Dirichlet matrix).
The Dirichlet matrix
$$\bm{L}_{\mathrm{Dir}}:=\bm{D}-\frac{1}{2}\bigl(\bm{D}\bm{P}^{\mu}+(\bm{P}^{\mu})^{\top}\bm{D}\bigr)$$
is symmetric and positive semidefinite. More precisely, if
$$\bm{P}_{\mathrm{sym}}:=\frac{1}{2}\Bigl(\bm{P}^{\mu}+\bm{D}^{-1}(\bm{P}^{\mu})^{\top}\bm{D}\Bigr),$$
then, for every $\bm{x}\in\mathbb{R}^{n}$,
$$\bm{x}^{\top}\bm{L}_{\mathrm{Dir}}\bm{x}=\frac{1}{2}\sum_{i,j}\pi(i)\bm{P}_{\mathrm{sym}}(i,j)(x_{i}-x_{j})^{2}=\frac{1}{2}\sum_{i,j}\pi(i)P^{\mu}(i,j)(x_{i}-x_{j})^{2}\geq 0\,.$$
###### Proof.
First, $\bm{L}_{\mathrm{Dir}}$ is symmetric because $\bm{D}^{\top}=\bm{D}$:
$$\bm{L}_{\mathrm{Dir}}^{\top}=\left(\bm{D}-\frac{1}{2}\bigl(\bm{D}\bm{P}^{\mu}+(\bm{P}^{\mu})^{\top}\bm{D}\bigr)\right)^{\top}=\bm{D}-\frac{1}{2}\bigl((\bm{P}^{\mu})^{\top}\bm{D}+\bm{D}\bm{P}^{\mu}\bigr)=\bm{L}_{\mathrm{Dir}}\,.$$
Define the time-reversal kernel
$$\bm{P}_{\mathrm{rev}}^{\mu}:=\bm{D}^{-1}(\bm{P}^{\mu})^{\top}\bm{D},\qquad\bm{P}_{\mathrm{rev}}^{\mu}(i,j)=\frac{\pi(j)P^{\mu}(j,i)}{\pi(i)}\,.$$
Since $\pi^{\top}\bm{P}^{\mu}=\pi^{\top}$, $\bm{P}_{\mathrm{rev}}^{\mu}$ is row-stochastic:
$$\sum_{j}\bm{P}_{\mathrm{rev}}^{\mu}(i,j)=\frac{1}{\pi(i)}\sum_{j}\pi(j)P^{\mu}(j,i)=1\,.$$
Hence $\bm{P}_{\mathrm{sym}}=\frac{1}{2}(\bm{P}^{\mu}+\bm{P}_{\mathrm{rev}}^{\mu})$ is also row-stochastic and satisfies detailed balance:
$$\pi(i)\bm{P}_{\mathrm{sym}}(i,j)=\frac{1}{2}\bigl(\pi(i)P^{\mu}(i,j)+\pi(j)P^{\mu}(j,i)\bigr)=\pi(j)\bm{P}_{\mathrm{sym}}(j,i)\,.$$
Moreover,
$$\bm{D}\bm{P}_{\mathrm{sym}}=\frac{1}{2}\bigl(\bm{D}\bm{P}^{\mu}+(\bm{P}^{\mu})^{\top}\bm{D}\bigr),$$
so $\bm{L}_{\mathrm{Dir}}=\bm{D}-\bm{D}\bm{P}_{\mathrm{sym}}$. Therefore, for any $\bm{x}\in\mathbb{R}^{n}$,
$$\begin{aligned}
\bm{x}^{\top}\bm{L}_{\mathrm{Dir}}\bm{x}
&=\sum_{i}\pi(i)x_{i}^{2}-\sum_{i,j}\pi(i)\bm{P}_{\mathrm{sym}}(i,j)x_{i}x_{j}\\
&=\frac{1}{2}\sum_{i,j}\pi(i)\bm{P}_{\mathrm{sym}}(i,j)(x_{i}-x_{j})^{2}\geq 0,
\end{aligned}$$
where the second equality uses that $\bm{P}_{\mathrm{sym}}$ is row-stochastic and satisfies detailed balance. Finally, using the displayed identity for $\pi(i)\bm{P}_{\mathrm{sym}}(i,j)$ and swapping indices in the second term gives
$$\frac{1}{2}\sum_{i,j}\pi(i)\bm{P}_{\mathrm{sym}}(i,j)(x_{i}-x_{j})^{2}=\frac{1}{2}\sum_{i,j}\pi(i)P^{\mu}(i,j)(x_{i}-x_{j})^{2}\,.$$
This proves the claimed identity and $\bm{L}_{\mathrm{Dir}}\succeq 0$. ∎
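The identity of Lemma F.2 does not require reversibility of $\bm{P}^{\mu}$, which can be checked numerically on a small non-reversible chain. The following pure-Python sketch is a sanity check under assumptions: the helper names, the random $4$-state chain, and the use of power iteration to approximate $\pi$ are all illustrative choices, not taken from the paper.

```python
import random

def random_chain(n, seed=0):
    """Random row-stochastic matrix with strictly positive entries."""
    rng = random.Random(seed)
    rows = [[rng.uniform(0.1, 1.0) for _ in range(n)] for _ in range(n)]
    return [[v / sum(row) for v in row] for row in rows]

def stationary(P, iters=2000):
    """Approximate the stationary distribution pi (pi^T P = pi^T) by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def dirichlet_forms(P, x):
    """Return (x^T L_Dir x, Dirichlet energy) with D = diag(pi)."""
    n = len(P)
    pi = stationary(P)
    # x^T L_Dir x with L_Dir = D - (D P + P^T D) / 2.
    quad = sum(pi[i] * x[i] ** 2 for i in range(n)) - sum(
        0.5 * (pi[i] * P[i][j] + P[j][i] * pi[j]) * x[i] * x[j]
        for i in range(n) for j in range(n))
    # (1/2) * sum_{i,j} pi(i) P(i,j) (x_i - x_j)^2.
    energy = 0.5 * sum(pi[i] * P[i][j] * (x[i] - x[j]) ** 2
                       for i in range(n) for j in range(n))
    return quad, energy
```

On such a chain the two returned values agree up to floating-point error and are nonnegative, consistent with $\bm{L}_{\mathrm{Dir}}\succeq 0$.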
###### Lemma F.3 (Logarithmic stepsize estimates).
The following estimates hold, where $\log(\cdot)$ denotes the natural logarithm.
*(i) Square summability.* Let $(a_{t})_{t\geq 0}$ be defined by
$$a_{t}=\frac{1}{\sqrt{t+1}\,\log(t+3)}\qquad\forall t\geq 0\,.$$
Define the tail sums
$$Q_{T}\;:=\;\sum_{t=T}^{\infty}a_{t}^{2},\qquad T\in\mathbb{N}\cup\{0\}\,.$$
Then, for all integers $T\geq 0$,
$$Q_{T}\;\leq\;\frac{3}{\log(T+2)}\,.\tag{40}$$
In particular,
$$\sum_{t=0}^{\infty}a_{t}^{2}\;\leq\;\frac{3}{\log 2}\;<\;\infty\,.$$
*(ii) Partial-sum lower bound.* Let $p\geq 0$, $\kappa>0$, and let $(\eta_{t})_{t\geq 0}$ be defined by
$$\eta_{t}=\frac{1}{\kappa\sqrt{t+1}\,\log^{p}(t+3)},\qquad t\geq 0\,.$$
For every integer $T\geq 1$, define
$$S_{T}\coloneqq\sum_{t=1}^{T}\eta_{t-1}\,.$$
Then
$$S_{T}\geq\frac{2(\sqrt{T+1}-1)}{\kappa\log^{p}(T+2)}\,.$$
###### Proof.
For (i), by definition,
$$a_{t}^{2}=\frac{1}{(t+1)\,\log^{2}(t+3)}\,.$$
For every $t\geq 0$, we have $t+1\geq\frac{t+3}{3}$, hence
$$\frac{1}{(t+1)\log^{2}(t+3)}\leq\frac{3}{(t+3)\log^{2}(t+3)}\,.$$
Therefore,
$$Q_{T}\leq\sum_{t=T}^{\infty}\frac{3}{(t+3)\log^{2}(t+3)}\,.$$
Let $n=t+3$. Then
$$\sum_{t=T}^{\infty}\frac{1}{(t+3)\log^{2}(t+3)}=\sum_{n=T+3}^{\infty}\frac{1}{n\log^{2}n}\,.$$
Now define $f(x)=\frac{1}{x\log^{2}x}$ for $x>1$. Since $f$ is decreasing on $(1,\infty)$, the integral test yields, for all $N\geq 3$,
$$\sum_{n=N}^{\infty}f(n)\leq\int_{N-1}^{\infty}f(x)\,dx=\int_{N-1}^{\infty}\frac{1}{x\log^{2}x}\,dx=\frac{1}{\log(N-1)}\,.$$
Applying this with $N=T+3$ gives
$$\sum_{n=T+3}^{\infty}\frac{1}{n\log^{2}n}\leq\frac{1}{\log(T+2)}\,.$$
Substituting back proves ([40](https://arxiv.org/html/2606.24981#A6.E40)). Taking $T=0$ gives the claimed finite upper bound on $\sum_{t=0}^{\infty}a_{t}^{2}$.
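Part (i) can be spot-checked numerically by comparing truncated tail sums with the bound $3/\log(T+2)$. The Python sketch below is only a sanity check, not a proof: the function names and the truncation length are assumptions, and a truncated sum is a lower estimate of $Q_{T}$, so passing the check is necessary but not sufficient for (40).

```python
import math

def a_sq(t):
    """Squared sequence term a_t^2 = 1 / ((t + 1) * log^2(t + 3))."""
    return 1.0 / ((t + 1) * math.log(t + 3) ** 2)

def truncated_tail(T, length=100_000):
    """Truncated tail sum of a_t^2 over t = T, ..., T + length - 1."""
    return sum(a_sq(t) for t in range(T, T + length))

def tail_bound(T):
    """Bound (40) on the full tail sum Q_T."""
    return 3.0 / math.log(T + 2)
```

Because the series converges only at a $1/\log$ rate, the truncated sums approach $Q_{T}$ slowly; the check is meaningful mainly as a guard against a misplaced constant.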
For (ii), substituting the definition of $\eta_{t-1}$ and renaming the summation index $s=t$ gives
$$S_{T}=\frac{1}{\kappa}\sum_{s=1}^{T}\frac{1}{\sqrt{s}\,\log^{p}(s+2)}\,.$$
Since $p\geq 0$, $\log(\cdot)$ is increasing, and $s+2\leq T+2$ for all $1\leq s\leq T$,
$$\frac{1}{\sqrt{s}\,\log^{p}(s+2)}\geq\frac{1}{\sqrt{s}\,\log^{p}(T+2)}\,.$$
Therefore,
$$S_{T}\geq\frac{1}{\kappa\log^{p}(T+2)}\sum_{s=1}^{T}\frac{1}{\sqrt{s}}\,.$$
The function $x\mapsto x^{-1/2}$ is decreasing on $[1,\infty)$, hence
$$\sum_{s=1}^{T}\frac{1}{\sqrt{s}}\geq\int_{1}^{T+1}\frac{1}{\sqrt{x}}\,dx=2(\sqrt{T+1}-1)\,.$$
Combining the last two displays proves the claim. ∎
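The lower bound of part (ii) is the estimate used for $S_{T}$ with $p=2$ and $\kappa=c\phi_{\infty}^{2}$. A minimal Python sketch, with hypothetical function names and sample parameters chosen only for illustration, compares the exact partial sum with the bound:

```python
import math

def eta(t, kappa, p):
    """Stepsize eta_t = 1 / (kappa * sqrt(t + 1) * log^p(t + 3))."""
    return 1.0 / (kappa * math.sqrt(t + 1) * math.log(t + 3) ** p)

def partial_sum(T, kappa, p):
    """S_T = sum_{t=1}^{T} eta_{t-1}."""
    return sum(eta(t - 1, kappa, p) for t in range(1, T + 1))

def partial_sum_lower(T, kappa, p):
    """Lower bound 2 * (sqrt(T + 1) - 1) / (kappa * log^p(T + 2))."""
    return 2.0 * (math.sqrt(T + 1) - 1.0) / (kappa * math.log(T + 2) ** p)
```

Up to the logarithmic factor, both quantities grow like $\sqrt{T}$, which is what turns the $1/S_{T}$ and $1/S_{T}^{2}$ factors into the $1/\sqrt{T}$ and $1/T$ rates.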