Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Summary
This paper proposes STHTD-MP, a behavior-induced Mirror-Prox temporal-difference method for faster off-policy prediction in reinforcement learning. It replaces the covariance metric with the behavior-policy Bellman matrix and provides convergence analysis and experimental comparisons.
View Cached Full Text
Cached at: 05/29/26, 09:10 AM
# Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction
Source: [https://arxiv.org/html/2605.28849](https://arxiv.org/html/2605.28849)
Yuchen ShenShangdong YangChao LiGuang YangWenhao Wang
###### Abstract
Gradient temporal\-difference methods provide stable off\-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary\-variable metric\. Existing Mirror\-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior\-policy transition information can provide a more informative update geometry\. This paper proposes a behavior\-induced Mirror\-Prox temporal\-difference method, called STHTD\-MP, which replaces the covariance metric in the primal\-dual saddle\-point formulation with the symmetric part of the behavior\-policy Bellman matrix\. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror\-Prox prediction\-correction step to the resulting hybrid saddle\-point operator\. We provide a formal convergence analysis for fixed\-policy linear prediction under standard stochastic approximation assumptions: the behavior\-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method\. We further derive projected\-oracle ergodic gap bounds and an exact mean\-operator comparison with GTD2\-MP based on the spectral radius of the deterministic Mirror\-Prox error matrix\. The analysis shows that STHTD\-MP can have a smaller mean contraction factor than GTD2\-MP when the behavior\-induced metric improves the saddle\-point geometry\. Exact numerical mean\-operator analysis on two\-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird’s counterexample is identified as a singular boundary case where the strict assumptions fail\. Experiments over 100 independent runs report means and standard deviations of two scalar summaries – the steady\-state AUC, defined as the time\-average RMSVE over the last 50% of each trajectory, and the final RMSVE – and show that STHTD\-MP is competitive with strong online TD baselines and that its empirical advantage is geometry\- and horizon\-dependent\.
###### keywords:
Reinforcement learning , Off\-policy prediction , Temporal\-difference learning , Mirror\-Prox , Saddle\-point optimization , Behavior\-induced metric
††journal:Neurocomputing\\affiliation
\[aff1\]organization=Nanjing University of Posts and Telecommunications, city=Nanjing, country=China
\\affiliation
\[aff2\]organization=Department of Computer Science and Technology, Nanjing University, city=Nanjing, country=China
\\affiliation
\[aff3\]organization=College of Electronic Countermeasure, National University of Defense Technology, city=Hefei, country=China
## 1Introduction
Temporal\-difference \(TD\) learning is a central mechanism for value prediction in reinforcement learning\[[20](https://arxiv.org/html/2605.28849#bib.bib2),[17](https://arxiv.org/html/2605.28849#bib.bib1)\]\. In off\-policy prediction with function approximation, however, classical TD may diverge because sampling, bootstrapping, and approximation interact in an unstable way\[[1](https://arxiv.org/html/2605.28849#bib.bib4),[21](https://arxiv.org/html/2605.28849#bib.bib3)\]\. This has led to gradient\-TD algorithms such as GTD, GTD2, TDC, and their extensions, which restore stability by introducing an auxiliary variable and optimizing projected Bellman\-error objectives\[[19](https://arxiv.org/html/2605.28849#bib.bib6),[18](https://arxiv.org/html/2605.28849#bib.bib7),[13](https://arxiv.org/html/2605.28849#bib.bib8),[14](https://arxiv.org/html/2605.28849#bib.bib9)\]\.
The stability of gradient\-TD methods does not by itself settle the question of fast learning\. Two issues remain particularly important\. First, many gradient\-TD algorithms use separate primary and auxiliary learning rates, making relative step\-size tuning nontrivial\. Saddle\-point and proximal views of TD learning provide a single\-timescale alternative by rewriting policy evaluation as a primal\-dual problem\[[12](https://arxiv.org/html/2605.28849#bib.bib12),[11](https://arxiv.org/html/2605.28849#bib.bib11)\]\. Second, even within the same saddle\-point framework, the auxiliary metric determines the geometry of the mean operator and can strongly affect convergence speed\. Standard GTD2\-type saddle\-point methods use the feature covariance metricC=𝔼μ\[ϕϕ⊤\]C=\\mathbb\{E\}\_\{\\mu\}\[\\phi\\phi^\{\\top\}\]\. In contrast, hybrid TD ideas suggest that the behavior\-policy transition matrix contains useful geometric information that is not captured byCCalone\[[8](https://arxiv.org/html/2605.28849#bib.bib10)\]\.
This paper asks whether behavior\-policy transition information can be used to obtain a faster Mirror\-Prox TD method\. We answer this question by replacing the covariance metric in the primal\-dual TD objective with the symmetric behavior\-induced Bellman metricH=sym\(Aμ\)H=\\operatorname\{sym\}\(A\_\{\\mu\}\)and applying a Mirror\-Prox prediction\-correction step\[[15](https://arxiv.org/html/2605.28849#bib.bib13),[9](https://arxiv.org/html/2605.28849#bib.bib14)\]\. The resulting method, STHTD\-MP, can be viewed as a behavior\-induced counterpart of GTD2\-MP\. Both methods use Mirror\-Prox, but they shape the saddle\-point operator with different metrics\.
The paper makes the following contributions\.
- 1\.We derive STHTD\-MP, a single\-timescale behavior\-induced Mirror\-Prox TD method for off\-policy linear prediction\. The method uses the symmetric behavior\-policy Bellman matrix as the auxiliary metric\.
- 2\.We prove that the behavior\-induced metric is positive definite under standard finite\-state assumptions and that the resulting joint mean system is Hurwitz\. Under standard stochastic approximation conditions, the underlying single\-timescale hybrid recursion converges to the projected Bellman fixed point\.
- 3\.We provide a convergence\-speed analysis beyond big\-OOnotation\. In addition to stochastic ergodic gap bounds, we derive an exact mean\-operator comparison with GTD2\-MP through the spectral radius of the deterministic Mirror\-Prox error matrix\.
- 4\.We compute the exact finite\-state mean operators for four benchmarks\. The numerical spectral analysis shows that STHTD\-MP has a smaller deterministic mean contraction factor than GTD2\-MP on the two\-state, Random Walk, and Boyan Chain problems, while Baird’s counterexample is a singular boundary case where the strict assumptions fail\.
- 5\.We conduct stochastic experiments with stronger baselines, disjoint tuning/evaluation seeds, and 100 independent evaluation runs\. The results show that STHTD\-MP is competitive among online first\-order TD methods and that the empirical benefit of the behavior\-induced metric depends on the task geometry and evaluation horizon\.
The resulting picture is geometry\-dependent\. The behavior\-induced metric improves the Mirror\-Prox mean operator when it yields a more favorable saddle\-point geometry than the covariance metric\. We make this condition explicit through an exact spectral comparison and show through numerical mean\-operator analysis and stochastic experiments that it holds in several standard prediction benchmarks\.
## 2Background
### 2\.1Notation
We consider a discounted Markov decision process\(𝒮,𝒜,P,r,γ\)\(\\mathcal\{S\},\\mathcal\{A\},P,r,\\gamma\), where𝒮\\mathcal\{S\}is a finite state space,𝒜\\mathcal\{A\}is a finite action space,P\(s′\|s,a\)P\(s^\{\\prime\}\|s,a\)is the transition kernel,r\(s,a,s′\)r\(s,a,s^\{\\prime\}\)is the reward, andγ∈\(0,1\]\\gamma\\in\(0,1\]is the discount factor\. The target policy is denoted byπ\\piand the behavior policy byμ\\mu\. The data are generated byμ\\mu, while the value function ofπ\\piis to be estimated\. Unless otherwise stated, all expectations are taken with respect to the stationary trajectory induced by the behavior policyμ\\mu, with importance sampling used to correct the target\-policy Bellman term\.
For a policyν∈\{π,μ\}\\nu\\in\\\{\\pi,\\mu\\\}, letPν∈ℝ\|𝒮\|×\|𝒮\|P\_\{\\nu\}\\in\\mathbb\{R\}^\{\|\\mathcal\{S\}\|\\times\|\\mathcal\{S\}\|\}denote the state transition matrix induced byν\\nu:
\[Pν\]ss′=∑a∈𝒜ν\(a\|s\)P\(s′\|s,a\)\.\[P\_\{\\nu\}\]\_\{ss^\{\\prime\}\}=\\sum\_\{a\\in\\mathcal\{A\}\}\\nu\(a\|s\)P\(s^\{\\prime\}\|s,a\)\.\(1\)Letdμd\_\{\\mu\}be the stationary distribution ofPμP\_\{\\mu\}, and let
Dμ=diag\(dμ\)D\_\{\\mu\}=\\operatorname\{diag\}\(d\_\{\\mu\}\)\(2\)be the corresponding diagonal weighting matrix\. We use a linear value approximation
vθ\(s\)=θ⊤ϕ\(s\),v\_\{\\theta\}\(s\)=\\theta^\{\\top\}\\phi\(s\),\(3\)whereϕ\(s\)∈ℝd\\phi\(s\)\\in\\mathbb\{R\}^\{d\}is the feature vector andθ∈ℝd\\theta\\in\\mathbb\{R\}^\{d\}is the primary parameter\. The feature matrix is
Φ=\(ϕ\(s1\)⊤⋯ϕ\(s\|𝒮\|\)⊤\)∈ℝ\|𝒮\|×d\.\\Phi=\\begin\{pmatrix\}\\phi\(s\_\{1\}\)^\{\\top\}\\\\ \\cdots\\\\ \\phi\(s\_\{\|\\mathcal\{S\}\|\}\)^\{\\top\}\\end\{pmatrix\}\\in\\mathbb\{R\}^\{\|\\mathcal\{S\}\|\\times d\}\.\(4\)For compactness, we writeϕt=ϕ\(st\)\\phi\_\{t\}=\\phi\(s\_\{t\}\)andϕt\+1=ϕ\(st\+1\)\\phi\_\{t\+1\}=\\phi\(s\_\{t\+1\}\)\. When the next state is sampled according to the target\-policy transition we writeϕt\+1π\\phi\_\{t\+1\}^\{\\pi\}; when it is sampled according to the behavior\-policy transition we writeϕt\+1μ\\phi\_\{t\+1\}^\{\\mu\}\. In the sample\-based off\-policy updates below,ϕt\+1\\phi\_\{t\+1\}is the observed next\-state feature underμ\\mu, andρt\\rho\_\{t\}corrects the target\-policy Bellman term\.
The importance ratio is
ρt=π\(at\|st\)μ\(at\|st\)\.\\rho\_\{t\}=\\frac\{\\pi\(a\_\{t\}\|s\_\{t\}\)\}\{\\mu\(a\_\{t\}\|s\_\{t\}\)\}\.\(5\)The off\-policy TD error is defined as
δt=rt\+γθt⊤ϕt\+1−θt⊤ϕt\.\\delta\_\{t\}=r\_\{t\}\+\\gamma\\theta\_\{t\}^\{\\top\}\\phi\_\{t\+1\}\-\\theta\_\{t\}^\{\\top\}\\phi\_\{t\}\.\(6\)
The key matrices used in the paper are
Aπ=𝔼\[ρtϕt\(ϕt−γϕt\+1\)⊤\],b=𝔼\[ρtrtϕt\]\.A\_\{\\pi\}=\\mathbb\{E\}\\left\[\\rho\_\{t\}\\phi\_\{t\}\(\\phi\_\{t\}\-\\gamma\\phi\_\{t\+1\}\)^\{\\top\}\\right\],\\quad b=\\mathbb\{E\}\\left\[\\rho\_\{t\}r\_\{t\}\\phi\_\{t\}\\right\]\.\(7\)Equivalently, in matrix form,
Aπ=Φ⊤Dμ\(I−γPπ\)Φ,b=Φ⊤Dμrπ,A\_\{\\pi\}=\\Phi^\{\\top\}D\_\{\\mu\}\(I\-\\gamma P\_\{\\pi\}\)\\Phi,\\quad b=\\Phi^\{\\top\}D\_\{\\mu\}r\_\{\\pi\},\(8\)whererπ\(s\)=𝔼a∼π\(⋅\|s\),s′∼P\(⋅\|s,a\)\[r\(s,a,s′\)\]r\_\{\\pi\}\(s\)=\\mathbb\{E\}\_\{a\\sim\\pi\(\\cdot\|s\),s^\{\\prime\}\\sim P\(\\cdot\|s,a\)\}\[r\(s,a,s^\{\\prime\}\)\]\. The projected Bellman equation isAπθ=bA\_\{\\pi\}\\theta=b\. The auxiliary variable is denoted byy∈ℝdy\\in\\mathbb\{R\}^\{d\}, and the joint variable used in the theoretical analysis is
z=\(θy\)∈ℝ2d\.z=\\begin\{pmatrix\}\\theta\\\\ y\\end\{pmatrix\}\\in\\mathbb\{R\}^\{2d\}\.\(9\)
### 2\.2Off\-policy TD and saddle\-point formulation
Gradient\-TD methods stabilize off\-policy learning by introducing an auxiliary variable, but standard variants typically involve separate learning rates for the value and auxiliary variables\[[19](https://arxiv.org/html/2605.28849#bib.bib6),[18](https://arxiv.org/html/2605.28849#bib.bib7)\]\. Recent finite\-sample and stochastic approximation studies further clarify the role of coupled recursions, two\-timescale dynamics, and Markovian noise\[[5](https://arxiv.org/html/2605.28849#bib.bib20),[10](https://arxiv.org/html/2605.28849#bib.bib21),[6](https://arxiv.org/html/2605.28849#bib.bib25)\]\.
The saddle\-point view starts from the objective
minθmaxyL\(θ,y\)=⟨b−Aπθ,y⟩−12‖y‖M2,\\min\_\{\\theta\}\\max\_\{y\}L\(\\theta,y\)=\\langle b\-A\_\{\\pi\}\\theta,y\\rangle\-\\frac\{1\}\{2\}\\\|y\\\|\_\{M\}^\{2\},\(10\)whereMMis a positive definite metric matrix\. IfMMis chosen properly, the solution still satisfiesAπθ=bA\_\{\\pi\}\\theta=bwhile the optimization geometry changes\. Proximal gradient TD and related methods exploit this formulation to obtain stable single\-timescale updates\[[12](https://arxiv.org/html/2605.28849#bib.bib12),[11](https://arxiv.org/html/2605.28849#bib.bib11)\]\.
## 3Single\-Timescale Hybrid TD
Hybrid TD methods use behavior\-policy information to alter the TD update direction\. In this work we define the behavior\-policy Bellman matrix
Aμ=𝔼\[ϕt\(ϕt−γϕt\+1μ\)⊤\],A\_\{\\mu\}=\\mathbb\{E\}\\left\[\\phi\_\{t\}\(\\phi\_\{t\}\-\\gamma\\phi\_\{t\+1\}^\{\\mu\}\)^\{\\top\}\\right\],\(11\)whereϕt\+1μ\\phi\_\{t\+1\}^\{\\mu\}denotes the next\-state feature induced by the behavior policy\. SinceAμA\_\{\\mu\}need not be symmetric, the auxiliary metric is taken as
H=12\(Aμ\+Aμ⊤\)\.H=\\frac\{1\}\{2\}\(A\_\{\\mu\}\+A\_\{\\mu\}^\{\\top\}\)\.\(12\)The mean update of STHTD is
yt\+1\\displaystyle y\_\{t\+1\}=yt\+αt\(b−Aπθt−Hyt\),\\displaystyle=y\_\{t\}\+\\alpha\_\{t\}\(b\-A\_\{\\pi\}\\theta\_\{t\}\-Hy\_\{t\}\),\(13\)θt\+1\\displaystyle\\theta\_\{t\+1\}=θt\+αtAπ⊤yt\.\\displaystyle=\\theta\_\{t\}\+\\alpha\_\{t\}A\_\{\\pi\}^\{\\top\}y\_\{t\}\.Its sample\-based off\-policy form is
θt\+1\\displaystyle\\theta\_\{t\+1\}=θt\+αtρt\(ϕt−γϕt\+1\)ϕt⊤yt,\\displaystyle=\\theta\_\{t\}\+\\alpha\_\{t\}\\rho\_\{t\}\(\\phi\_\{t\}\-\\gamma\\phi\_\{t\+1\}\)\\phi\_\{t\}^\{\\top\}y\_\{t\},\(14\)yt\+1\\displaystyle y\_\{t\+1\}=yt\+αt\[\(ρtδt−ϕt⊤yt\+12γϕt\+1⊤yt\)ϕt\+12γϕt⊤ytϕt\+1\]\.\\displaystyle=y\_\{t\}\+\\alpha\_\{t\}\\Big\[\(\\rho\_\{t\}\\delta\_\{t\}\-\\phi\_\{t\}^\{\\top\}y\_\{t\}\+\\tfrac\{1\}\{2\}\\gamma\\phi\_\{t\+1\}^\{\\top\}y\_\{t\}\)\\phi\_\{t\}\+\\tfrac\{1\}\{2\}\\gamma\\phi\_\{t\}^\{\\top\}y\_\{t\}\\phi\_\{t\+1\}\\Big\]\.Compared with a covariance\-metric auxiliary update, the STHTD auxiliary update contains additional hybrid terms involvingϕt\+1\\phi\_\{t\+1\}\.
## 4Mirror\-Prox Correction
The saddle\-point structure in Eq\. \([10](https://arxiv.org/html/2605.28849#S2.E10)\) induces a monotone operator
F\(z\)=\(−Aπ⊤yAπθ\+Hy−b\),z=\(θ,y\)\.F\(z\)=\\begin\{pmatrix\}\-A\_\{\\pi\}^\{\\top\}y\\\\ A\_\{\\pi\}\\theta\+Hy\-b\\end\{pmatrix\},\\quad z=\(\\theta,y\)\.\(15\)Mirror\-Prox first evaluates the operator at the current point to form an intermediate prediction, and then uses the predicted point to correct the final update\[[15](https://arxiv.org/html/2605.28849#bib.bib13),[9](https://arxiv.org/html/2605.28849#bib.bib14)\]\. Applying this idea to Eq\. \([14](https://arxiv.org/html/2605.28849#S3.E14)\) gives the following update\. Let
θtm\\displaystyle\\theta\_\{t\}^\{m\}=θt\+αtρt\(ϕt−γϕt\+1\)ϕt⊤yt,\\displaystyle=\\theta\_\{t\}\+\\alpha\_\{t\}\\rho\_\{t\}\(\\phi\_\{t\}\-\\gamma\\phi\_\{t\+1\}\)\\phi\_\{t\}^\{\\top\}y\_\{t\},\(16\)ytm\\displaystyle y\_\{t\}^\{m\}=yt\+αt\[\(ρtδt−ϕt⊤yt\+12γϕt\+1⊤yt\)ϕt\+12γϕt⊤ytϕt\+1\]\.\\displaystyle=y\_\{t\}\+\\alpha\_\{t\}\\Big\[\(\\rho\_\{t\}\\delta\_\{t\}\-\\phi\_\{t\}^\{\\top\}y\_\{t\}\+\\tfrac\{1\}\{2\}\\gamma\\phi\_\{t\+1\}^\{\\top\}y\_\{t\}\)\\phi\_\{t\}\+\\tfrac\{1\}\{2\}\\gamma\\phi\_\{t\}^\{\\top\}y\_\{t\}\\phi\_\{t\+1\}\\Big\]\.With the intermediate TD error
δtm=rt\+γ\(θtm\)⊤ϕt\+1−\(θtm\)⊤ϕt,\\delta\_\{t\}^\{m\}=r\_\{t\}\+\\gamma\(\\theta\_\{t\}^\{m\}\)^\{\\top\}\\phi\_\{t\+1\}\-\(\\theta\_\{t\}^\{m\}\)^\{\\top\}\\phi\_\{t\},\(17\)STHTD\-MP performs
θt\+1\\displaystyle\\theta\_\{t\+1\}=θt\+αtρt\(ϕt−γϕt\+1\)ϕt⊤ytm,\\displaystyle=\\theta\_\{t\}\+\\alpha\_\{t\}\\rho\_\{t\}\(\\phi\_\{t\}\-\\gamma\\phi\_\{t\+1\}\)\\phi\_\{t\}^\{\\top\}y\_\{t\}^\{m\},\(18\)yt\+1\\displaystyle y\_\{t\+1\}=yt\+αt\[\(ρtδtm−ϕt⊤ytm\+12γϕt\+1⊤ytm\)ϕt\+12γϕt⊤ytmϕt\+1\]\.\\displaystyle=y\_\{t\}\+\\alpha\_\{t\}\\Big\[\(\\rho\_\{t\}\\delta\_\{t\}^\{m\}\-\\phi\_\{t\}^\{\\top\}y\_\{t\}^\{m\}\+\\tfrac\{1\}\{2\}\\gamma\\phi\_\{t\+1\}^\{\\top\}y\_\{t\}^\{m\}\)\\phi\_\{t\}\+\\tfrac\{1\}\{2\}\\gamma\\phi\_\{t\}^\{\\top\}y\_\{t\}^\{m\}\\phi\_\{t\+1\}\\Big\]\.The method uses one learning rate and roughly doubles the per\-step first\-order update cost, similarly to other extra\-gradient methods\.
## 5Theoretical Analysis
This section formalizes the convergence argument for fixed\-policy linear prediction\. The setting is a fixed target policy with linear features, where the behavior\-induced metric and the corresponding saddle\-point mean dynamics can be analyzed explicitly\.
### 5\.1Assumptions and mean dynamics
LetΦ∈ℝ\|𝒮\|×d\\Phi\\in\\mathbb\{R\}^\{\|\\mathcal\{S\}\|\\times d\}be the feature matrix,Dμ=diag\(dμ\)D\_\{\\mu\}=\\operatorname\{diag\}\(d\_\{\\mu\}\)the stationary distribution matrix of the behavior policy, andPμP\_\{\\mu\}the state transition matrix underμ\\mu\. Recall
Aμ=Φ⊤Dμ\(I−γPμ\)Φ,H=12\(Aμ\+Aμ⊤\)\.A\_\{\\mu\}=\\Phi^\{\\top\}D\_\{\\mu\}\(I\-\\gamma P\_\{\\mu\}\)\\Phi,\\quad H=\\frac\{1\}\{2\}\(A\_\{\\mu\}\+A\_\{\\mu\}^\{\\top\}\)\.\(19\)The mean STHTD recursion in Eq\. \([13](https://arxiv.org/html/2605.28849#S3.E13)\) is
zt\+1=zt\+αt\(𝒢zt\+h\),zt=\(θtyt\),z\_\{t\+1\}=z\_\{t\}\+\\alpha\_\{t\}\(\\mathcal\{G\}z\_\{t\}\+h\),\\quad z\_\{t\}=\\begin\{pmatrix\}\\theta\_\{t\}\\\\ y\_\{t\}\\end\{pmatrix\},\(20\)with
𝒢=\(0Aπ⊤−Aπ−H\),h=\(0b\)\.\\mathcal\{G\}=\\begin\{pmatrix\}0&A\_\{\\pi\}^\{\\top\}\\\\ \-A\_\{\\pi\}&\-H\\end\{pmatrix\},\\quad h=\\begin\{pmatrix\}0\\\\ b\\end\{pmatrix\}\.\(21\)
###### Assumption 1\.
The following conditions hold: \(i\)0<γ<10<\\gamma<1; \(ii\) the Markov chain induced byμ\\muis irreducible and has stationary distributiondμd\_\{\\mu\}withdμ\(s\)\>0d\_\{\\mu\}\(s\)\>0for allss; \(iii\)dμ⊤Pμ=dμ⊤d\_\{\\mu\}^\{\\top\}P\_\{\\mu\}=d\_\{\\mu\}^\{\\top\}; \(iv\)Φ\\Phihas full column rank; \(v\)AπA\_\{\\pi\}is nonsingular; and \(vi\) features and rewards are uniformly bounded\.
###### Lemma 1\(Positive definiteness of the hybrid metric\)\.
Under Assumption[1](https://arxiv.org/html/2605.28849#Thmassumption1),H=\(Aμ\+Aμ⊤\)/2H=\(A\_\{\\mu\}\+A\_\{\\mu\}^\{\\top\}\)/2is positive definite\.
###### Proof\.
For any nonzerox∈ℝdx\\in\\mathbb\{R\}^\{d\}, letv=Φxv=\\Phi x\. SinceΦ\\Phihas full column rank,v≠0v\\neq 0\. We have
x⊤Aμx=v⊤Dμ\(I−γPμ\)v=‖v‖Dμ2−γ⟨v,Pμv⟩Dμ\.x^\{\\top\}A\_\{\\mu\}x=v^\{\\top\}D\_\{\\mu\}\(I\-\\gamma P\_\{\\mu\}\)v=\\\|v\\\|\_\{D\_\{\\mu\}\}^\{2\}\-\\gamma\\langle v,P\_\{\\mu\}v\\rangle\_\{D\_\{\\mu\}\}\.\(22\)By Cauchy–Schwarz,
⟨v,Pμv⟩Dμ≤‖v‖Dμ‖Pμv‖Dμ\.\\langle v,P\_\{\\mu\}v\\rangle\_\{D\_\{\\mu\}\}\\leq\\\|v\\\|\_\{D\_\{\\mu\}\}\\\|P\_\{\\mu\}v\\\|\_\{D\_\{\\mu\}\}\.\(23\)Jensen’s inequality and stationarity ofdμd\_\{\\mu\}imply
‖Pμv‖Dμ2=∑sdμ\(s\)\(∑s′Pμ\(s,s′\)v\(s′\)\)2≤∑s′dμ\(s′\)v\(s′\)2=‖v‖Dμ2\.\\\|P\_\{\\mu\}v\\\|\_\{D\_\{\\mu\}\}^\{2\}=\\sum\_\{s\}d\_\{\\mu\}\(s\)\\left\(\\sum\_\{s^\{\\prime\}\}P\_\{\\mu\}\(s,s^\{\\prime\}\)v\(s^\{\\prime\}\)\\right\)^\{2\}\\leq\\sum\_\{s^\{\\prime\}\}d\_\{\\mu\}\(s^\{\\prime\}\)v\(s^\{\\prime\}\)^\{2\}=\\\|v\\\|\_\{D\_\{\\mu\}\}^\{2\}\.\(24\)Therefore
x⊤Aμx≥\(1−γ\)‖v‖Dμ2\>0\.x^\{\\top\}A\_\{\\mu\}x\\geq\(1\-\\gamma\)\\\|v\\\|\_\{D\_\{\\mu\}\}^\{2\}\>0\.\(25\)Sincex⊤Hx=x⊤Aμxx^\{\\top\}Hx=x^\{\\top\}A\_\{\\mu\}xfor realxx,HHis positive definite\. ∎
###### Theorem 1\(Hurwitz stability of the joint mean system\)\.
Under Assumption[1](https://arxiv.org/html/2605.28849#Thmassumption1), the joint matrix𝒢\\mathcal\{G\}in Eq\. \([21](https://arxiv.org/html/2605.28849#S5.E21)\) is Hurwitz\. Consequently, the ODE
z˙\(t\)=𝒢z\(t\)\+h\\dot\{z\}\(t\)=\\mathcal\{G\}z\(t\)\+h\(26\)has a unique globally exponentially stable equilibrium
z∗=\(θ∗y∗\)=\(Aπ−1b0\)\.z^\{\*\}=\\begin\{pmatrix\}\\theta^\{\*\}\\\\ y^\{\*\}\\end\{pmatrix\}=\\begin\{pmatrix\}A\_\{\\pi\}^\{\-1\}b\\\\ 0\\end\{pmatrix\}\.\(27\)
###### Proof\.
Letλ\\lambdabe an eigenvalue of𝒢\\mathcal\{G\}with eigenvectorw=\(u,v\)≠0w=\(u,v\)\\neq 0\. Since
𝒢\+𝒢⊤=\(000−2H\),\\mathcal\{G\}\+\\mathcal\{G\}^\{\\top\}=\\begin\{pmatrix\}0&0\\\\ 0&\-2H\\end\{pmatrix\},\(28\)we obtain
2Re\(λ\)‖w‖2=w∗\(𝒢\+𝒢⊤\)w=−2v∗Hv≤0\.2\\operatorname\{Re\}\(\\lambda\)\\\|w\\\|^\{2\}=w^\{\*\}\(\\mathcal\{G\}\+\\mathcal\{G\}^\{\\top\}\)w=\-2v^\{\*\}Hv\\leq 0\.\(29\)ThusRe\(λ\)≤0\\operatorname\{Re\}\(\\lambda\)\\leq 0\. SupposeRe\(λ\)=0\\operatorname\{Re\}\(\\lambda\)=0\. SinceHHis positive definite by Lemma[1](https://arxiv.org/html/2605.28849#Thmlemma1),v∗Hv=0v^\{\*\}Hv=0impliesv=0v=0\. The eigenvalue equations are
Aπ⊤v=λu,−Aπu−Hv=λv\.A\_\{\\pi\}^\{\\top\}v=\\lambda u,\\quad\-A\_\{\\pi\}u\-Hv=\\lambda v\.\(30\)Withv=0v=0, the first equation givesλu=0\\lambda u=0\. Ifλ≠0\\lambda\\neq 0, thenu=0u=0, contradictingw≠0w\\neq 0\. Ifλ=0\\lambda=0, the second equation givesAπu=0A\_\{\\pi\}u=0, and nonsingularity ofAπA\_\{\\pi\}again givesu=0u=0, a contradiction\. HenceRe\(λ\)<0\\operatorname\{Re\}\(\\lambda\)<0for every eigenvalue, so𝒢\\mathcal\{G\}is Hurwitz\. The equilibrium follows fromAπ⊤y∗=0A\_\{\\pi\}^\{\\top\}y^\{\*\}=0and−Aπθ∗−Hy∗\+b=0\-A\_\{\\pi\}\\theta^\{\*\}\-Hy^\{\*\}\+b=0\. ∎
###### Assumption 2\.
The stochastic STHTD recursion can be written as
zt\+1=zt\+αt\(𝒢zt\+h\+ξt\+1\),z\_\{t\+1\}=z\_\{t\}\+\\alpha\_\{t\}\(\\mathcal\{G\}z\_\{t\}\+h\+\\xi\_\{t\+1\}\),\(31\)where\{ξt\+1\}\\\{\\xi\_\{t\+1\}\\\}is a martingale\-difference noise sequence with respect to the natural filtration\{ℱt\}\\\{\\mathcal\{F\}\_\{t\}\\\}, i\.e\.,𝔼\[ξt\+1\|ℱt\]=0\\mathbb\{E\}\[\\xi\_\{t\+1\}\|\\mathcal\{F\}\_\{t\}\]=0, and there existscξ\>0c\_\{\\xi\}\>0such that
𝔼\[‖ξt\+1‖2\|ℱt\]≤cξ\(1\+‖zt‖2\)\.\\mathbb\{E\}\[\\\|\\xi\_\{t\+1\}\\\|^\{2\}\|\\mathcal\{F\}\_\{t\}\]\\leq c\_\{\\xi\}\(1\+\\\|z\_\{t\}\\\|^\{2\}\)\.\(32\)The step sizes satisfy∑tαt=∞\\sum\_\{t\}\\alpha\_\{t\}=\\inftyand∑tαt2<∞\\sum\_\{t\}\\alpha\_\{t\}^\{2\}<\\infty\.
###### Lemma 2\(Boundedness of the linear stochastic recursion\)\.
Under Assumptions[1](https://arxiv.org/html/2605.28849#Thmassumption1)and[2](https://arxiv.org/html/2605.28849#Thmassumption2), the iterates of the STHTD recursion are almost surely bounded\.
###### Proof\.
By Theorem[1](https://arxiv.org/html/2605.28849#Thmtheorem1),𝒢\\mathcal\{G\}is Hurwitz\. Hence for any positive definite matrixQQ, the Lyapunov equation
𝒢⊤P\+P𝒢=−Q\\mathcal\{G\}^\{\\top\}P\+P\\mathcal\{G\}=\-Q\(33\)has a unique positive definite solutionPP\. ChooseQ=IQ=Iand define the shifted erroret=zt−z∗e\_\{t\}=z\_\{t\}\-z^\{\*\}, wherez∗z^\{\*\}is the unique equilibrium satisfying𝒢z∗\+h=0\\mathcal\{G\}z^\{\*\}\+h=0\. Then
et\+1=et\+αt\(𝒢et\+ξt\+1\)\.e\_\{t\+1\}=e\_\{t\}\+\\alpha\_\{t\}\(\\mathcal\{G\}e\_\{t\}\+\\xi\_\{t\+1\}\)\.\(34\)LetV\(e\)=e⊤PeV\(e\)=e^\{\\top\}Pe\. ExpandingV\(et\+1\)V\(e\_\{t\+1\}\)gives
V\(et\+1\)\\displaystyle V\(e\_\{t\+1\}\)=V\(et\)\+2αtet⊤P𝒢et\+2αtet⊤Pξt\+1\\displaystyle=V\(e\_\{t\}\)\+2\\alpha\_\{t\}e\_\{t\}^\{\\top\}P\\mathcal\{G\}e\_\{t\}\+2\\alpha\_\{t\}e\_\{t\}^\{\\top\}P\\xi\_\{t\+1\}\+αt2\(𝒢et\+ξt\+1\)⊤P\(𝒢et\+ξt\+1\)\.\\displaystyle\\quad\+\\alpha\_\{t\}^\{2\}\(\\mathcal\{G\}e\_\{t\}\+\\xi\_\{t\+1\}\)^\{\\top\}P\(\\mathcal\{G\}e\_\{t\}\+\\xi\_\{t\+1\}\)\.\(35\)Taking conditional expectation, using𝔼\[ξt\+1∣ℱt\]=0\\mathbb\{E\}\[\\xi\_\{t\+1\}\\mid\\mathcal\{F\}\_\{t\}\]=0, and expanding the quadratic term gives
𝔼\[V\(et\+1\)∣ℱt\]\\displaystyle\\mathbb\{E\}\[V\(e\_\{t\+1\}\)\\mid\\mathcal\{F\}\_\{t\}\]=V\(et\)\+αtet⊤\(P𝒢\+𝒢⊤P\)et\\displaystyle=V\(e\_\{t\}\)\+\\alpha\_\{t\}e\_\{t\}^\{\\top\}\(P\\mathcal\{G\}\+\\mathcal\{G\}^\{\\top\}P\)e\_\{t\}\+αt2et⊤𝒢⊤P𝒢et\+αt2𝔼\[ξt\+1⊤Pξt\+1∣ℱt\]\\displaystyle\\quad\+\\alpha\_\{t\}^\{2\}\\,e\_\{t\}^\{\\top\}\\mathcal\{G\}^\{\\top\}P\\mathcal\{G\}\\,e\_\{t\}\+\\alpha\_\{t\}^\{2\}\\,\\mathbb\{E\}\[\\xi\_\{t\+1\}^\{\\top\}P\\xi\_\{t\+1\}\\mid\\mathcal\{F\}\_\{t\}\]\+2αt2et⊤𝒢⊤P𝔼\[ξt\+1∣ℱt\]\\displaystyle\\quad\+2\\alpha\_\{t\}^\{2\}\\,e\_\{t\}^\{\\top\}\\mathcal\{G\}^\{\\top\}P\\,\\mathbb\{E\}\[\\xi\_\{t\+1\}\\mid\\mathcal\{F\}\_\{t\}\]\(36\)=V\(et\)\+αtet⊤\(P𝒢\+𝒢⊤P\)et\\displaystyle=V\(e\_\{t\}\)\+\\alpha\_\{t\}e\_\{t\}^\{\\top\}\(P\\mathcal\{G\}\+\\mathcal\{G\}^\{\\top\}P\)e\_\{t\}\+αt2et⊤𝒢⊤P𝒢et\+αt2𝔼\[ξt\+1⊤Pξt\+1∣ℱt\],\\displaystyle\\quad\+\\alpha\_\{t\}^\{2\}\\,e\_\{t\}^\{\\top\}\\mathcal\{G\}^\{\\top\}P\\mathcal\{G\}\\,e\_\{t\}\+\\alpha\_\{t\}^\{2\}\\,\\mathbb\{E\}\[\\xi\_\{t\+1\}^\{\\top\}P\\xi\_\{t\+1\}\\mid\\mathcal\{F\}\_\{t\}\],\(37\)where the cross term2αt2et⊤𝒢⊤P𝔼\[ξt\+1∣ℱt\]2\\alpha\_\{t\}^\{2\}e\_\{t\}^\{\\top\}\\mathcal\{G\}^\{\\top\}P\\,\\mathbb\{E\}\[\\xi\_\{t\+1\}\\mid\\mathcal\{F\}\_\{t\}\]vanishes becauseξt\+1\\xi\_\{t\+1\}is a martingale difference\. The deterministic quadratic term is bounded by‖𝒢⊤P𝒢‖2‖et‖2\\\|\\mathcal\{G\}^\{\\top\}P\\mathcal\{G\}\\\|\_\{2\}\\\|e\_\{t\}\\\|^\{2\}\. For the noise term, the second\-moment bound in Assumption[2](https://arxiv.org/html/2605.28849#Thmassumption2)gives𝔼\[‖ξt\+1‖2∣ℱt\]≤cξ\(1\+‖zt‖2\)\\mathbb\{E\}\[\\\|\\xi\_\{t\+1\}\\\|^\{2\}\\mid\\mathcal\{F\}\_\{t\}\]\\leq c\_\{\\xi\}\(1\+\\\|z\_\{t\}\\\|^\{2\}\), and boundedness of the equilibrium implies1\+‖zt‖2≤c0\(1\+‖et‖2\)1\+\\\|z\_\{t\}\\\|^\{2\}\\leq c\_\{0\}\(1\+\\\|e\_\{t\}\\\|^\{2\}\)forc0=2max\(1,‖z∗‖2\)c\_\{0\}=2\\max\(1,\\\|z^\{\*\}\\\|^\{2\}\), so
𝔼\[ξt\+1⊤Pξt\+1∣ℱt\]≤MPcξc0\(1\+‖et‖2\)\.\\mathbb\{E\}\[\\xi\_\{t\+1\}^\{\\top\}P\\xi\_\{t\+1\}\\mid\\mathcal\{F\}\_\{t\}\]\\leq M\_\{P\}\\,c\_\{\\xi\}\\,c\_\{0\}\\,\(1\+\\\|e\_\{t\}\\\|^\{2\}\)\.\(38\)Combining these bounds with Eq\. \([33](https://arxiv.org/html/2605.28849#S5.E33)\) yields
𝔼\[V\(et\+1\)∣ℱt\]\\displaystyle\\mathbb\{E\}\[V\(e\_\{t\+1\}\)\\mid\\mathcal\{F\}\_\{t\}\]≤V\(et\)−αt‖et‖2\+c1αt2\(1\+‖et‖2\),\\displaystyle\\leq V\(e\_\{t\}\)\-\\alpha\_\{t\}\\\|e\_\{t\}\\\|^\{2\}\+c\_\{1\}\\alpha\_\{t\}^\{2\}\(1\+\\\|e\_\{t\}\\\|^\{2\}\),\(39\)withc1=‖𝒢⊤P𝒢‖2\+MPcξc0c\_\{1\}=\\\|\\mathcal\{G\}^\{\\top\}P\\mathcal\{G\}\\\|\_\{2\}\+M\_\{P\}c\_\{\\xi\}c\_\{0\}\. SincePPis positive definite, there exist constantsmP,MP\>0m\_\{P\},M\_\{P\}\>0such thatmP‖e‖2≤V\(e\)≤MP‖e‖2m\_\{P\}\\\|e\\\|^\{2\}\\leq V\(e\)\\leq M\_\{P\}\\\|e\\\|^\{2\}, hence‖et‖2≥V\(et\)/MP\\\|e\_\{t\}\\\|^\{2\}\\geq V\(e\_\{t\}\)/M\_\{P\}\. Choosingc=1/MPc=1/M\_\{P\}and absorbing theαt2\\alpha\_\{t\}^\{2\}proportional term intoCαt2C\\alpha\_\{t\}^\{2\}onceαtc1≤c/2\\alpha\_\{t\}c\_\{1\}\\leq c/2\(which holds for all sufficiently largettsinceαt→0\\alpha\_\{t\}\\to 0\), we obtain
𝔼\[V\(et\+1\)∣ℱt\]≤\(1−cαt/2\)V\(et\)\+Cαt2\.\\mathbb\{E\}\[V\(e\_\{t\+1\}\)\\mid\\mathcal\{F\}\_\{t\}\]\\leq\(1\-c\\alpha\_\{t\}/2\)V\(e\_\{t\}\)\+C\\alpha\_\{t\}^\{2\}\.\(40\)This is a standard Robbins–Siegmund recursion: combined with∑tαt2<∞\\sum\_\{t\}\\alpha\_\{t\}^\{2\}<\\inftyit implies thatV\(et\)V\(e\_\{t\}\)converges almost surely to a finite random variable, and hencesupt‖et‖<∞\\sup\_\{t\}\\\|e\_\{t\}\\\|<\\inftyalmost surely\[[3](https://arxiv.org/html/2605.28849#bib.bib16), Chapter 3\]\. ∎
###### Theorem 2\(Almost\-sure convergence of STHTD\)\.
Under Assumptions[1](https://arxiv.org/html/2605.28849#Thmassumption1)and[2](https://arxiv.org/html/2605.28849#Thmassumption2), the STHTD iterates converge almost surely to the projected Bellman fixed point:
θt→Aπ−1b,yt→0\.\\theta\_\{t\}\\to A\_\{\\pi\}^\{\-1\}b,\\quad y\_\{t\}\\to 0\.\(41\)
###### Proof\.
The proof proceeds in three steps\. First, Theorem[1](https://arxiv.org/html/2605.28849#Thmtheorem1)shows that𝒢\\mathcal\{G\}is Hurwitz and that the ODE in Eq\. \([26](https://arxiv.org/html/2605.28849#S5.E26)\) has the unique globally exponentially stable equilibriumz∗=\(Aπ−1b,0\)z^\{\*\}=\(A\_\{\\pi\}^\{\-1\}b,0\)\. Second, Lemma[2](https://arxiv.org/html/2605.28849#Thmlemma2)establishes almost\-sure boundedness of the stochastic iterates; this boundedness is derived from Hurwitz stability and is not assumed\. Third, Assumption[2](https://arxiv.org/html/2605.28849#Thmassumption2)gives the remaining stochastic approximation conditions: non\-summable and square\-summable step sizes and martingale\-difference noise with controlled second moment\. Therefore, the ODE method for stochastic approximation\[[2](https://arxiv.org/html/2605.28849#bib.bib15),[3](https://arxiv.org/html/2605.28849#bib.bib16)\]applies to the recursion, and every limit point of the stochastic process lies in the internally chain transitive set of the limiting ODE\. Since the ODE has a unique globally asymptotically stable equilibrium, this set is the singleton\{z∗\}\\\{z^\{\*\}\\\}\. Hencezt→z∗z\_\{t\}\\to z^\{\*\}almost surely, which givesθt→Aπ−1b\\theta\_\{t\}\\to A\_\{\\pi\}^\{\-1\}bandyt→0y\_\{t\}\\to 0\. ∎
### 5\.2Convergence\-rate comparison
We now compare the ergodic saddle\-point gap of the single\-step STHTD update and the Mirror\-Prox update\. This rate comparison uses the standard projected variational\-inequality setting over compact convex setsΘ\\ThetaandYY, with fixed horizon\-dependent step sizes and a stochastic oracle\. It complements Theorem[2](https://arxiv.org/html/2605.28849#Thmtheorem2), which establishes almost\-sure convergence for the non\-projected stochastic approximation recursion with diminishing step sizes\.
LetZ=Θ×YZ=\\Theta\\times Y,D=supz,z′∈Z‖z−z′‖D=\\sup\_\{z,z^\{\\prime\}\\in Z\}\\\|z\-z^\{\\prime\}\\\|, and define the monotone saddle\-point operator
F\(z\)=\(−Aπ⊤yAπθ\+Hy−b\)\.F\(z\)=\\begin\{pmatrix\}\-A\_\{\\pi\}^\{\\top\}y\\\\ A\_\{\\pi\}\\theta\+Hy\-b\\end\{pmatrix\}\.\(42\)This operator is the negative of the affine mean\-update direction in Eq\. \([20](https://arxiv.org/html/2605.28849#S5.E20)\), namelyF\(z\)=−\(𝒢z\+h\)F\(z\)=\-\(\\mathcal\{G\}z\+h\)\. Forz¯n=n−1∑t=1nzt\\bar\{z\}\_\{n\}=n^\{\-1\}\\sum\_\{t=1\}^\{n\}z\_\{t\}, define the primal\-dual gap
Gap\(z¯n\)=maxy∈YL\(θ¯n,y\)−minθ∈ΘL\(θ,y¯n\)\.\\operatorname\{Gap\}\(\\bar\{z\}\_\{n\}\)=\\max\_\{y\\in Y\}L\(\\bar\{\\theta\}\_\{n\},y\)\-\\min\_\{\\theta\\in\\Theta\}L\(\\theta,\\bar\{y\}\_\{n\}\)\.\(43\)For the linear convex\-concave saddle objective in Eq\. \([10](https://arxiv.org/html/2605.28849#S2.E10)\), this saddle gap is controlled by the corresponding variational\-inequality gap induced byFF\.
###### Assumption 3\.
The operatorFFis monotone andLL\-Lipschitz onZZ\. The stochastic oracle is i\.i\.d\. across calls and satisfies𝔼\[F^\(z\)\|z\]=F\(z\)\\mathbb\{E\}\[\\widehat\{F\}\(z\)\|z\]=F\(z\)and𝔼‖F^\(z\)−F\(z\)‖2≤σ2\\mathbb\{E\}\\\|\\widehat\{F\}\(z\)\-F\(z\)\\\|^\{2\}\\leq\\sigma^\{2\}\. Moreover,𝔼‖F^\(z\)‖2≤G2\\mathbb\{E\}\\\|\\widehat\{F\}\(z\)\\\|^\{2\}\\leq G^\{2\}onZZ\.
###### Proposition 1\(Rate of projected STHTD\)\.
Under Assumption[3](https://arxiv.org/html/2605.28849#Thmassumption3), the projected single\-step STHTD update with constant step sizeα\\alphasatisfies
𝔼\[Gap\(z¯n\)\]≤D22αn\+αG22\.\\mathbb\{E\}\[\\operatorname\{Gap\}\(\\bar\{z\}\_\{n\}\)\]\\leq\\frac\{D^\{2\}\}\{2\\alpha n\}\+\\frac\{\\alpha G^\{2\}\}\{2\}\.\(44\)Choosing the horizon\-dependent fixed step sizeα=D/\(Gn\)\\alpha=D/\(G\\sqrt\{n\}\)gives
𝔼\[Gap\(z¯n\)\]=O\(DGn\)\.\\mathbb\{E\}\[\\operatorname\{Gap\}\(\\bar\{z\}\_\{n\}\)\]=O\\left\(\\frac\{DG\}\{\\sqrt\{n\}\}\\right\)\.\(45\)
###### Proof\.
The projected update iszt\+1=ΠZ\(zt−αF^\(zt\)\)z\_\{t\+1\}=\\Pi\_\{Z\}\(z\_\{t\}\-\\alpha\\widehat\{F\}\(z\_\{t\}\)\)\. Non\-expansiveness of projection gives, for anyz∈Zz\\in Z,
‖zt\+1−z‖2≤‖zt−z‖2−2α⟨F^\(zt\),zt−z⟩\+α2‖F^\(zt\)‖2\.\\\|z\_\{t\+1\}\-z\\\|^\{2\}\\leq\\\|z\_\{t\}\-z\\\|^\{2\}\-2\\alpha\\langle\\widehat\{F\}\(z\_\{t\}\),z\_\{t\}\-z\\rangle\+\\alpha^\{2\}\\\|\\widehat\{F\}\(z\_\{t\}\)\\\|^\{2\}\.\(46\)Rearranging gives
⟨F^\(zt\),zt−z⟩≤‖zt−z‖2−‖zt\+1−z‖22α\+α2‖F^\(zt\)‖2\.\\langle\\widehat\{F\}\(z\_\{t\}\),z\_\{t\}\-z\\rangle\\leq\\frac\{\\\|z\_\{t\}\-z\\\|^\{2\}\-\\\|z\_\{t\+1\}\-z\\\|^\{2\}\}\{2\\alpha\}\+\\frac\{\\alpha\}\{2\}\\\|\\widehat\{F\}\(z\_\{t\}\)\\\|^\{2\}\.\(47\)Taking conditional expectation and using unbiasedness yields
𝔼⟨F\(zt\),zt−z⟩≤𝔼‖zt−z‖2−𝔼‖zt\+1−z‖22α\+αG22\.\\mathbb\{E\}\\langle F\(z\_\{t\}\),z\_\{t\}\-z\\rangle\\leq\\frac\{\\mathbb\{E\}\\\|z\_\{t\}\-z\\\|^\{2\}\-\\mathbb\{E\}\\\|z\_\{t\+1\}\-z\\\|^\{2\}\}\{2\\alpha\}\+\\frac\{\\alpha G^\{2\}\}\{2\}\.\(48\)For a monotone variational inequality,Gap\(zt\)≤maxz∈Z⟨F\(zt\),zt−z⟩\\operatorname\{Gap\}\(z\_\{t\}\)\\leq\\max\_\{z\\in Z\}\\langle F\(z\_\{t\}\),z\_\{t\}\-z\\rangle\. Summing fromt=1t=1tonn, using‖zt−z‖≤D\\\|z\_\{t\}\-z\\\|\\leq D, and applying convexity of the gap to the ergodic average gives Eq\. \([44](https://arxiv.org/html/2605.28849#S5.E44)\)\. ∎
###### Proposition 2\(Rate of stochastic STHTD\-MP\)\.
Under the projected setting and the i\.i\.d\. oracle condition in Assumption[3](https://arxiv.org/html/2605.28849#Thmassumption3), the projected stochastic Mirror\-Prox version satisfies the standard bound
𝔼\[Gap\(z¯n\)\]≤C1LD2n\+C2σDn,\\mathbb\{E\}\[\\operatorname\{Gap\}\(\\bar\{z\}\_\{n\}\)\]\\leq C\_\{1\}\\frac\{LD^\{2\}\}\{n\}\+C\_\{2\}\\frac\{\\sigma D\}\{\\sqrt\{n\}\},\(49\)for universal constantsC1,C2\>0C\_\{1\},C\_\{2\}\>0and an admissible step size of order1/L1/Lfor the deterministic component\.
###### Proof\.
This is the standard stochastic Mirror\-Prox bound for Lipschitz monotone variational inequalities with unbiased i\.i\.d\. stochastic oracles\[[15](https://arxiv.org/html/2605.28849#bib.bib13),[9](https://arxiv.org/html/2605.28849#bib.bib14)\]\. The deterministic term follows from the extra\-gradient correction, which cancels the leading rotation error of the operator and gives anO\(1/n\)O\(1/n\)term\. The second term is due to stochastic oracle noise; averaging independent oracle fluctuations yields the unavoidableO\(1/n\)O\(1/\\sqrt\{n\}\)contribution\. Substituting the TD saddle\-point operator in Eq\. \([42](https://arxiv.org/html/2605.28849#S5.E42)\) gives Eq\. \([49](https://arxiv.org/html/2605.28849#S5.E49)\)\. ∎
Propositions[1](https://arxiv.org/html/2605.28849#Thmproposition1)and[2](https://arxiv.org/html/2605.28849#Thmproposition2)explain the expected convergence\-speed difference\. For STHTD, the structural operator term and the stochastic term are both absorbed in anO\(n−1/2\)O\(n^\{\-1/2\}\)bound\. For STHTD\-MP, the deterministic Lipschitz term is improved toO\(n−1\)O\(n^\{\-1\}\), while the stochastic noise term remainsO\(n−1/2\)O\(n^\{\-1/2\}\)\.
###### Corollary 1\(Speed advantage under a structure\-dominant regime\)\.
Consider a fixed horizonnnand suppose Assumption[3](https://arxiv.org/html/2605.28849#Thmassumption3)holds for both STHTD and STHTD\-MP\. LetGSTHG\_\{\\rm STH\}be the second\-moment bound in Proposition[1](https://arxiv.org/html/2605.28849#Thmproposition1), and let\(L,D,σMP\)\(L,D,\\sigma\_\{\\rm MP\}\)be the constants in Proposition[2](https://arxiv.org/html/2605.28849#Thmproposition2)\. If
C1LDn\+C2σMP<GSTH,C\_\{1\}\\frac\{LD\}\{\\sqrt\{n\}\}\+C\_\{2\}\\sigma\_\{\\rm MP\}<G\_\{\\rm STH\},\(50\)then the theoretical upper bound on the expected ergodic gap of projected STHTD\-MP is smaller than that of projected STHTD at horizonnn\. This condition is a qualitative structural sufficient condition: the constants are usually not tightly identifiable from sample trajectories, and in practice we use AUC\-based trajectory statistics and exact mean\-operator spectral factors as empirical and numerical proxies\.
###### Proof\.
By Proposition[1](https://arxiv.org/html/2605.28849#Thmproposition1), projected STHTD with the optimized constant step size satisfies
𝔼\[GapSTH\(z¯n\)\]≤O\(DGSTHn\)\.\\mathbb\{E\}\[\\operatorname\{Gap\}\_\{\\rm STH\}\(\\bar\{z\}\_\{n\}\)\]\\leq O\\left\(\\frac\{DG\_\{\\rm STH\}\}\{\\sqrt\{n\}\}\\right\)\.\(51\)By Proposition[2](https://arxiv.org/html/2605.28849#Thmproposition2), projected STHTD\-MP satisfies
𝔼\[GapMP\(z¯n\)\]≤C1LD2n\+C2σMPDn=Dn\(C1LDn\+C2σMP\)\.\\mathbb\{E\}\[\\operatorname\{Gap\}\_\{\\rm MP\}\(\\bar\{z\}\_\{n\}\)\]\\leq C\_\{1\}\\frac\{LD^\{2\}\}\{n\}\+C\_\{2\}\\frac\{\\sigma\_\{\\rm MP\}D\}\{\\sqrt\{n\}\}=\\frac\{D\}\{\\sqrt\{n\}\}\\left\(C\_\{1\}\\frac\{LD\}\{\\sqrt\{n\}\}\+C\_\{2\}\\sigma\_\{\\rm MP\}\\right\)\.\(52\)Thus condition \([50](https://arxiv.org/html/2605.28849#S5.E50)\) is sufficient for the STHTD\-MP bound to be smaller than the STHTD bound, up to the same universal\-constant convention used in the two preceding propositions\. ∎
Corollary[1](https://arxiv.org/html/2605.28849#Thmcorollary1)clarifies the role of Mirror\-Prox\. STHTD\-MP is favored when the deterministic coupling term represented byLLis large enough for theO\(n−1\)O\(n^\{\-1\}\)improvement to matter, while the stochastic oracle noiseσMP\\sigma\_\{\\rm MP\}remains moderate\. In the experiments, this structure is reflected by trajectory\-level statistics such as AUC error and across\-seed variability\.
### 5\.3Mean\-operator comparison with GTD2\-MP
The preceding comparison explains why Mirror\-Prox improves STHTD\. We now compare STHTD\-MP with the closer baseline GTD2\-MP\. Both methods use Mirror\-Prox, but they use different saddle\-point metrics\. GTD2\-MP corresponds to the covariance metric
C=𝔼μ\[ϕtϕt⊤\]=Φ⊤DμΦ,C=\\mathbb\{E\}\_\{\\mu\}\[\\phi\_\{t\}\\phi\_\{t\}^\{\\top\}\]=\\Phi^\{\\top\}D\_\{\\mu\}\\Phi,\(53\)whereas STHTD\-MP uses the behavior\-induced symmetric metric
H=12\(Aμ\+Aμ⊤\)\.H=\\frac\{1\}\{2\}\(A\_\{\\mu\}\+A\_\{\\mu\}^\{\\top\}\)\.\(54\)
The metric also determines the key matrix that appears after eliminating the auxiliary variable in the corresponding mean squared projected Bellman\-error geometry:
BM=Aπ⊤M−1Aπ,M∈\{C,H\}\.B\_\{M\}=A\_\{\\pi\}^\{\\top\}M^\{\-1\}A\_\{\\pi\},\\quad M\\in\\\{C,H\\\}\.\(55\)ForM=CM=C,BMB\_\{M\}is the GTD2 key matrix\. ForM=HM=H, it is the behavior\-induced, or hybrid, key matrix\. A natural structural hypothesis behind the speed comparison is that the hybrid metric improves this key matrix relative to the covariance metric\. The mean\-rate result we use in Corollary[2](https://arxiv.org/html/2605.28849#Thmcorollary2)below compares spectral radii of the exact Mirror\-Prox mean recursion and does not require this hypothesis; we state it here because it is the geometric mechanism that explains the speedup whenever it holds, and Table[3](https://arxiv.org/html/2605.28849#S7.T3)verifies it on the benchmarks we study\.
###### Assumption 4\(Key\-matrix improvement\)\.
The matricesCCandHHare positive definite,AπA\_\{\\pi\}is nonsingular, and the key matrices in Eq\. \([55](https://arxiv.org/html/2605.28849#S5.E55)\) satisfy
λmin\(BH\)\>λmin\(BC\),κ\(BH\)≤κ\(BC\),\\lambda\_\{\\min\}\(B\_\{H\}\)\>\\lambda\_\{\\min\}\(B\_\{C\}\),\\quad\\kappa\(B\_\{H\}\)\\leq\\kappa\(B\_\{C\}\),\(56\)whereκ\(B\)=λmax\(B\)/λmin\(B\)\\kappa\(B\)=\\lambda\_\{\\max\}\(B\)/\\lambda\_\{\\min\}\(B\)\.
Under this condition, the idealized metric\-preconditioned mean dynamics associated withHHhas a better key\-matrix geometry than the corresponding GTD2 dynamics associated withCC\. For a fixed positive definite key matrixBB, the best constant\-step\-size linear factor of gradient descent on the associated quadratic is\(κ\(B\)−1\)/\(κ\(B\)\+1\)\(\\kappa\(B\)\-1\)/\(\\kappa\(B\)\+1\)\. Hence a smaller key\-matrix condition number directly indicates a better attainable mean linear factor, while a largerλmin\\lambda\_\{\\min\}indicates stronger contraction for sufficiently small common step sizes\. The two requirements onλmin\\lambda\_\{\\min\}andκ\\kappaare not logically independent in general, since the smallest and largest eigenvalues can both shift under the metric change, but they hold jointly on the standard finite\-state benchmarks reported in Table[3](https://arxiv.org/html/2605.28849#S7.T3)\.
For a generic metricM∈\{C,H\}M\\in\\\{C,H\\\}, define the linear saddle\-point operator matrix
KM=\(0−Aπ⊤AπM\)\.K\_\{M\}=\\begin\{pmatrix\}0&\-A\_\{\\pi\}^\{\\top\}\\\\ A\_\{\\pi\}&M\\end\{pmatrix\}\.\(57\)The deterministic Mirror\-Prox error recursion for the mean operator is
et\+1=RM\(α\)et,RM\(α\)=I−αKM\+α2KM2\.e\_\{t\+1\}=R\_\{M\}\(\\alpha\)e\_\{t\},\\quad R\_\{M\}\(\\alpha\)=I\-\\alpha K\_\{M\}\+\\alpha^\{2\}K\_\{M\}^\{2\}\.\(58\)Thus the exact asymptotic linear convergence factor is
qM\(α\)=ρ\(RM\(α\)\),q\_\{M\}\(\\alpha\)=\\rho\(R\_\{M\}\(\\alpha\)\),\(59\)whereρ\(⋅\)\\rho\(\\cdot\)denotes the spectral radius\.
###### Corollary 2\(Exact mean\-rate comparison with GTD2\-MP\)\.
Assume that bothKCK\_\{C\}andKHK\_\{H\}are such thatqC\(αC\)<1q\_\{C\}\(\\alpha\_\{C\}\)<1andqH\(αH\)<1q\_\{H\}\(\\alpha\_\{H\}\)<1for admissible step sizesαC\\alpha\_\{C\}andαH\\alpha\_\{H\}\. If
qH\(αH\)<qC\(αC\),q\_\{H\}\(\\alpha\_\{H\}\)<q\_\{C\}\(\\alpha\_\{C\}\),\(60\)then the deterministic mean STHTD\-MP recursion converges linearly faster than the deterministic mean GTD2\-MP recursion under these step sizes\. In particular, ifαC∗\\alpha\_\{C\}^\{\*\}andαH∗\\alpha\_\{H\}^\{\*\}minimizeqCq\_\{C\}andqHq\_\{H\}on a common admissible grid andqH\(αH∗\)<qC\(αC∗\)q\_\{H\}\(\\alpha\_\{H\}^\{\*\}\)<q\_\{C\}\(\\alpha\_\{C\}^\{\*\}\), then STHTD\-MP has a smaller best attainable mean linear convergence factor on that grid\.
###### Proof\.
For a fixed metricMM, Eq\. \([58](https://arxiv.org/html/2605.28849#S5.E58)\) giveset=RM\(α\)te0e\_\{t\}=R\_\{M\}\(\\alpha\)^\{t\}e\_\{0\}\. IfqM\(α\)<1q\_\{M\}\(\\alpha\)<1, then for anyϵ\>0\\epsilon\>0there exists a constantcϵc\_\{\\epsilon\}such that
‖et‖≤cϵ\(qM\(α\)\+ϵ\)t‖e0‖\.\\\|e\_\{t\}\\\|\\leq c\_\{\\epsilon\}\(q\_\{M\}\(\\alpha\)\+\\epsilon\)^\{t\}\\\|e\_\{0\}\\\|\.\(61\)Therefore the method with the smaller spectral radius has the smaller asymptotic linear factor\. Applying this statement toM=HM=HandM=CM=Cproves the result\. ∎
Corollary[2](https://arxiv.org/html/2605.28849#Thmcorollary2)is stated in terms of the exact mean spectral radius rather than a loose big\-OOconstant\. This makes the comparison numerically verifiable for each finite benchmark and ties the predicted speed difference directly to the geometry ofAπA\_\{\\pi\},CC, andHH\.
## 6Experiments
### 6\.1Protocol
We evaluate on four standard off\-policy prediction benchmarks: the two\-state counterexample, Baird’s counterexample, Random Walk, and Boyan Chain\. The baselines are off\-policy TD, GTD2, TDC, TDRC, GTD2\-MP, HTD, ETD, STHTD, and STHTD\-MP\. GTD2, TDC, and HTD use separate value\-function and auxiliary\-variable learning rates, both of which are tuned\. TDRC follows the original prediction\-setting recommendation and uses a shared learning rate with regularization parameter fixed at1\.01\.0\. We focus the main comparison on online first\-order TD methods; batch least\-squares methods such as LSTD are not included in the main learning\-curve figures\.
All four benchmarks are off\-policy\. In the two\-state counterexample and Baird’s counterexample, the target policy is deterministic while the behavior policy assigns positive probability to non\-target actions, leading to strong importance\-ratio mismatch\. In Random Walk and Boyan Chain, the behavior policy chooses the two nonterminal actions with probability0\.5/0\.50\.5/0\.5, while the target policy uses0\.4/0\.60\.4/0\.6; these two benchmarks are therefore mildly off\-policy\. The Random Walk true values are computed from the target\-policy Bellman equation withγ=0\.99\\gamma=0\.99\.
All algorithms are tuned on the same tuning seeds with a predefined grid\. Reported results are evaluated on 100 disjoint random seeds\. The tuning objective is the average prediction error over the last 20% of the tuning trajectory\. We report mean curves with shaded±\\pmone\-standard\-deviation bands over the 100 evaluation runs\. Tables report two scalar summaries: the steady\-state AUC error, defined as the time\-average RMSVE over the last 50% of each trajectory, and the final error at the last step\. The last\-50% AUC skips the early\-phase transient and therefore evaluates each algorithm by its steady\-state behavior; this is relevant for algorithms with high\-variance early dynamics such as ETD, which converges with a high variance at the beginning on the two\-state counterexample\[[4](https://arxiv.org/html/2605.28849#bib.bib26)\]\. Divergent runs are kept as∞\\inftyor NaN and propagate to the reported means and standard deviations, without any clipping\.
### 6\.2Learning curves
Figures[1](https://arxiv.org/html/2605.28849#S6.F1)–[4](https://arxiv.org/html/2605.28849#S6.F4)show the learning curves\. The proposed STHTD\-MP is designed to change the Mirror\-Prox TD geometry relative to GTD2\-MP, and the curves should be interpreted together with the exact mean\-operator analysis in Section[7](https://arxiv.org/html/2605.28849#S7)\. Empirically, STHTD\-MP is consistently more stable than the underlying STHTD update in the counterexample settings and reaches low final error on the two longer\-horizon non\-counterexample tasks\. The results also show that well\-tuned two\-learning\-rate baselines, TDRC, ETD, and simple TD\-style methods can be very strong in mildly off\-policy prediction tasks\.
Figure 1:Prediction learning curves on the two\-state counterexample\. Curves show means over 100 independent runs; shaded regions show±\\pmone sample standard deviation\. Stochastic single\-step ETD exhibits early\-phase high\-variance transients before converging, which is consistent with the observation of Chen et al\.\[[4](https://arxiv.org/html/2605.28849#bib.bib26)\]that emphatic TD “converges with a high variance at the beginning in the 2\-state counterexample\.”Figure 2:Prediction learning curves on Baird’s counterexample\. Curves show means over 100 independent runs, and shaded regions show±\\pmone sample standard deviation\.Figure 3:Prediction learning curves on Random Walk\. Curves show means over 100 independent runs, and shaded regions show±\\pmone sample standard deviation\.Figure 4:Prediction learning curves on Boyan Chain\. Curves show means over 100 independent runs; shaded regions show±\\pmone sample standard deviation\. The shaded band appears thin because the across\-seed standard deviation is on the order of10−210^\{\-2\}while the ordinate spans tens\.On Boyan Chain, TD, TDRC, HTD, ETD, and STHTD\-MP all converge to the same linear projected fixed point at RMSVE≈0\.167\\approx 0\.167, so their late\-trajectory curves overlap in Figure[4](https://arxiv.org/html/2605.28849#S6.F4)\. The visible oscillation of STHTD is along the time axis of the mean curve, not across\-seed dispersion\. GTD2\-MP and TDC converge more slowly within the20,00020\{,\}000\-step budget, which is consistent with the known slow convergence of TDC\-style updates on Boyan Chain\[[13](https://arxiv.org/html/2605.28849#bib.bib8)\]\.
### 6\.3Quantitative comparison
Table[1](https://arxiv.org/html/2605.28849#S6.T1)reports the steady\-state AUC error, defined as the time\-average RMSVE over the last 50% of each trajectory\. This metric captures the steady\-state behavior of each algorithm rather than the early\-phase transient and is therefore a sharper measure of how fast each algorithm enters its asymptotic regime\. STHTD\-MP attains the lowest steady\-state AUC among the proposed hybrid variants on the two\-state and Baird counterexamples, while TD\-style and regularized\-correction baselines remain strong on the mildly off\-policy tasks\. These results support a geometry\-dependent claim: replacing the covariance metric with the behavior\-induced metric can improve the Mirror\-Prox mean dynamics, while finite\-sample performance also depends on sampling variance, feature conditioning, step\-size tuning, and task difficulty\.
The Random Walk results illustrate how the geometric advantage interacts with sampling noise\. STHTD\-MP improves on GTD2\-MP by roughly2\.5×2\.5\\timesin steady\-state AUC \(Table[1](https://arxiv.org/html/2605.28849#S6.T1)\), and the exact mean\-operator analysis in Section[7](https://arxiv.org/html/2605.28849#S7)shows that the hybrid Mirror\-Prox factorqH\(αH∗\)q\_\{H\}\(\\alpha\_\{H\}^\{\*\}\)is smaller thanqC\(αC∗\)q\_\{C\}\(\\alpha\_\{C\}^\{\*\}\)on this benchmark \(Table[4](https://arxiv.org/html/2605.28849#S7.T4)\)\. At the same time, semi\-gradient TD, TDRC, and HTD all reach a slightly lower or comparable AUC\. Random Walk is only mildly off\-policy, with importance ratiosρ∈\[0\.8,1\.2\]\\rho\\in\[0\.8,1\.2\], so the projected Bellman fixed point is stable even for plain semi\-gradient TD and the dominant source of finite\-sample error is the importance\-weighted noise variance rather than the Mirror\-Prox structural error\. The geometric advantage of the hybrid metric, which manifests in the deterministic operator factorqq, is therefore partially absorbed by single\-update sampling noise of the auxiliary\-variable recursion, which scales with‖ϕ‖2\\\|\\phi\\\|^\{2\}rather than with the projected Bellman geometry\. On the two\-state counterexample the off\-policy mismatch is large and the deterministic mean\-operator factor dominates the finite\-sample behavior, and STHTD\-MP improves on GTD2\-MP by nine orders of magnitude\.
Table 1:Steady\-state AUC error \(last\-50% time\-average RMSVE\) over 100 independent runs, mean±\\pmsample standard deviation\. Lower is better\. Bold marks the best online first\-order TD method per environment\.Table[2](https://arxiv.org/html/2605.28849#S6.T2)reports the final error at the last step\. Final error favors methods that eventually converge accurately, while the steady\-state AUC reflects how quickly the algorithm reaches that regime\. Values at the10−1210^\{\-12\}level or below in Tables[1](https://arxiv.org/html/2605.28849#S6.T1)and[2](https://arxiv.org/html/2605.28849#S6.T2)are at machine precision and indicate numerical convergence to zero; the explicit values are retained for reproducibility\.
Table 2:Final prediction error \(RMSVE at the last step\) over 100 independent runs, mean±\\pmsample standard deviation\. Lower is better\. Bold marks the best online first\-order TD method per environment\.
### 6\.4Step\-size robustness
The above comparison reports each algorithm at its tuned step size\. A practitioner, however, rarely has the budget to tuneα\\alphaas carefully as we do here\. We therefore evaluate*step\-size robustness*: for each algorithm and each environment we fix the auxiliary\-variable step size atβ=0\.05\\beta=0\.05and the TDRC regularizer at1\.01\.0, and sweep the primary step sizeα∈\{10−4,3×10−4,10−3,3×10−3,10−2,3×10−2,10−1\}\\alpha\\in\\\{10^\{\-4\},3\\times 10^\{\-4\},10^\{\-3\},3\\times 10^\{\-3\},10^\{\-2\},3\\times 10^\{\-2\},10^\{\-1\}\\\}\. Each\(α,algorithm,environment\)\(\\alpha,\\text\{algorithm\},\\text\{environment\}\)cell is evaluated on 30 independent seeds, and we plot the resulting steady\-state AUC \(last\-50% time\-average RMSVE\) againstα\\alphaon log–log axes\. A robust algorithm has both a low minimum AUC over the grid and a wide region ofα\\alphafor which the AUC stays close to that minimum\. This is the same sensitivity protocol used by Chen et al\.\[[4](https://arxiv.org/html/2605.28849#bib.bib26)\]on the same counterexamples\.
\(a\)Two\-state counterexample\.
\(b\)Baird’s counterexample\.
\(c\)Random Walk\.
\(d\)Boyan Chain\.
Figure 5:Step\-size robustness\. Each panel plots the steady\-state AUC \(last\-50% time\-average RMSVE\) versus the primary step sizeα\\alphaover 30 independent seeds; shaded regions show±\\pmone sample standard deviation\. Both axes are logarithmic\. A lower curve indicates better performance, and a flatter curve indicates greater robustness to step\-size mis\-specification\.Figure[5](https://arxiv.org/html/2605.28849#S6.F5)reveals three patterns\.*First*, on the two\-state counterexample, STHTD\-MP is monotone inα\\alphaand reaches∼2×10−21\\sim 2\\times 10^\{\-21\}atα=10−1\\alpha=10^\{\-1\}, beating GTD2\-MP by roughly nine orders of magnitude across the entire grid\. ETD has lower AUC than both at the largest testedα\\alphabut exhibits high variance at smallα\\alpha, consistent with the early\-phase transient discussed earlier\.*Second*, on Baird’s counterexample, STHTD\-MP and GTD2\-MP both bottom out near1\.941\.94aroundα=10−2\\alpha=10^\{\-2\}and both fail atα=10−1\\alpha=10^\{\-1\}\(with 23 and 15 out of 30 seeds diverging respectively\); ETD diverges monotonically asα\\alphagrows, which is consistent with the well\-known observation that emphatic TD diverges on Baird\[[4](https://arxiv.org/html/2605.28849#bib.bib26)\]\.*Third*, on Random Walk and Boyan Chain, STHTD\-MP attains its minimum atα=10−1\\alpha=10^\{\-1\}and matches or beats the strong baselines TD, TDRC, and HTD there, while remaining bounded at everyα\\alphawhere ETD blows up \(e\.g\., Random Walk atα=10−1\\alpha=10^\{\-1\}\)\. The breadth of the low\-AUC region for STHTD\-MP is comparable to or wider than that of GTD2\-MP on every benchmark, supporting the geometric claim that the behavior\-induced metric yields a better\-conditioned Mirror\-Prox mean dynamics\.
## 7Numerical Analysis
The theoretical analysis identifies two mechanisms: Mirror\-Prox reduces the deterministic structure error of the hybrid saddle\-point operator, and the behavior\-induced metric can improve the key matrix that governs the mean dynamics\. The stochastic experiments verify the finite\-sample behavior of the algorithms\. This section connects the two by computing the exact finite\-state matrices behind the benchmarks\. In this way, the theoretical assumptions, the observed learning curves, and the numerical matrix properties form a single closed loop\.
### 7\.1Exact mean\-rate comparison with GTD2\-MP
To test the key\-matrix hypothesis and Corollary[2](https://arxiv.org/html/2605.28849#Thmcorollary2), we compute the exact finite\-state matricesAπA\_\{\\pi\},CC,AμA\_\{\\mu\},HH,KCK\_\{C\}, andKHK\_\{H\}for the four benchmarks\. This is not a Monte Carlo simulation: the transition probabilities, stationary behavior distribution, and feature matrices are used directly to form the mean operators\.
Table[3](https://arxiv.org/html/2605.28849#S7.T3)first verifies the structural hypothesis in Assumption[4](https://arxiv.org/html/2605.28849#Thmassumption4)\. On the two\-state, Random Walk, and Boyan Chain benchmarks, the hybrid key matrixBH=Aπ⊤H−1AπB\_\{H\}=A\_\{\\pi\}^\{\\top\}H^\{\-1\}A\_\{\\pi\}has a larger smallest eigenvalue thanBC=Aπ⊤C−1AπB\_\{C\}=A\_\{\\pi\}^\{\\top\}C^\{\-1\}A\_\{\\pi\}and no worse conditioning\. The improvement is especially clear on Random Walk and Boyan Chain, where the condition number decreases from181\.49181\.49to18\.7118\.71and from63\.1463\.14to11\.2311\.23, respectively\. Baird’s counterexample does not satisfy the strict hypothesis becauseAπA\_\{\\pi\}and the metric matrices are numerically singular in the over\-parameterized feature representation\.
Table 3:Verification of the key\-matrix condition\. HereBC=Aπ⊤C−1AπB\_\{C\}=A\_\{\\pi\}^\{\\top\}C^\{\-1\}A\_\{\\pi\}is the GTD2 key matrix andBH=Aπ⊤H−1AπB\_\{H\}=A\_\{\\pi\}^\{\\top\}H^\{\-1\}A\_\{\\pi\}is the hybrid key matrix\. The factor is\(κ−1\)/\(κ\+1\)\(\\kappa\-1\)/\(\\kappa\+1\), the best linear factor for the corresponding idealized quadratic gradient dynamics\. Lower condition number and lower factor are better\.After verifying this structural condition, we compute the Mirror\-Prox spectral radiiqC\(α\)q\_\{C\}\(\\alpha\)andqH\(α\)q\_\{H\}\(\\alpha\)in Eq\. \([59](https://arxiv.org/html/2605.28849#S5.E59)\), both at the tuned learning rates used in the experiments and at the best learning rates over a common logarithmic grid\. Table[4](https://arxiv.org/html/2605.28849#S7.T4)gives these exact numerical results\. On the two\-state problem, the hybrid metric reduces the operator norm by 78\.2%, increases the Hurwitz margin from 0\.0161 to 0\.1094, and improves the best mean Mirror\-Prox contraction factor from 0\.9936 to 0\.9026\. On Boyan Chain, the operator norm is reduced by 17\.3%, the Hurwitz margin is improved by nearly five times, and the best contraction factor improves from 0\.9975 to 0\.9875\. On Random Walk withγ=0\.99\\gamma=0\.99,‖KH‖2\\\|K\_\{H\}\\\|\_\{2\}is larger than‖KC‖2\\\|K\_\{C\}\\\|\_\{2\}, but the key matrix and Hurwitz margin are both more favorable, and the exact spectral radius still favors STHTD\-MP\. Therefore, for the three benchmarks that satisfy the key\-matrix condition, the exact mean\-operator analysis supports the claim that the proposed hybrid Mirror\-Prox metric has a faster deterministic mean convergence factor than GTD2\-MP\.
Baird’s counterexample has a different mean geometry\. Its feature matrix has more features than states, andAπA\_\{\\pi\}is numerically singular in this representation\. The key\-matrix condition fails and the exact spectral factors are essentially one for both metrics, so the deterministic mean\-rate comparison is inconclusive\. This matches the empirical behavior: Baird is an extreme off\-policy case in which the behavior\-induced metric does not yield the clear contraction advantage observed in the other three benchmarks\.
Table 4:Exact deterministic mean\-rate comparison between GTD2\-MP and STHTD\-MP\. HereC=𝔼μ\[ϕϕ⊤\]C=\\mathbb\{E\}\_\{\\mu\}\[\\phi\\phi^\{\\top\}\]is the GTD2\-MP metric andH=sym\(Aμ\)H=\\operatorname\{sym\}\(A\_\{\\mu\}\)is the STHTD\-MP metric\.q\(α\)=ρ\(I−αK\+α2K2\)q\(\\alpha\)=\\rho\(I\-\\alpha K\+\\alpha^\{2\}K^\{2\}\)is the exact Mirror\-Prox mean contraction factor\. Lowerqqis better\.
### 7\.2Why does STHTD struggle on Baird’s counterexample?
Baird’s counterexample is an extreme off\-policy setting\. In our implementation, the target policy always selects the solid action, while the behavior policy selects it with probability1/71/7\. Consequently, the importance ratio is77for the solid action and0for the dashed action\. Theθ\\thetablock of STHTD is driven by this target\-policy, importance\-weighted direction, but the auxiliary block also contains behavior\-induced hybrid terms involvingϕt\+1\\phi\_\{t\+1\}\. In Baird’s geometry, these two effects can be poorly aligned, producing oscillation and poor stability\.
This diagnosis is also consistent with the exact mean\-operator analysis in Table[4](https://arxiv.org/html/2605.28849#S7.T4)\. For Baird, bothCCandHHlead to nearly zero Hurwitz margins becauseAπA\_\{\\pi\}is numerically singular in the over\-parameterized feature representation\. The resulting mean\-rate comparison is therefore inconclusive\. The observed behavior reflects this singular off\-policy geometry, where the behavior\-induced metric has limited room to improve the covariance\-metric dynamics\.
## 8Related Work
The instability of off\-policy TD with function approximation was established in early theoretical and empirical studies\[[1](https://arxiv.org/html/2605.28849#bib.bib4),[21](https://arxiv.org/html/2605.28849#bib.bib3)\]\. Importance sampling enables off\-policy prediction but can introduce high\-variance updates, especially under strong target\-behavior mismatch\[[16](https://arxiv.org/html/2605.28849#bib.bib5)\]\. Gradient\-TD methods address the stability issue by introducing an auxiliary variable and optimizing projected Bellman\-error objectives\[[19](https://arxiv.org/html/2605.28849#bib.bib6),[18](https://arxiv.org/html/2605.28849#bib.bib7),[13](https://arxiv.org/html/2605.28849#bib.bib8)\]\. Related control\-oriented extensions show how these ideas connect to off\-policy action\-value learning, but they also introduce additional complications beyond fixed\-policy prediction\[[14](https://arxiv.org/html/2605.28849#bib.bib9)\]\.
Two aspects of gradient\-TD are especially relevant here\. First, classical GTD2 and TDC use coupled primal and auxiliary recursions whose empirical behavior depends on relative step\-size tuning\. Finite\-sample and stochastic\-approximation analyses have clarified how two\-timescale structure and Markovian noise affect such recursions\[[5](https://arxiv.org/html/2605.28849#bib.bib20),[10](https://arxiv.org/html/2605.28849#bib.bib21),[6](https://arxiv.org/html/2605.28849#bib.bib25)\]\. Second, the auxiliary metric changes the geometry of the saddle\-point operator\. Hybrid TD methods were motivated by faster learning through a mixture of TD and gradient\-TD directions\[[8](https://arxiv.org/html/2605.28849#bib.bib10)\]\. Our work keeps this behavior\-policy geometric ingredient but formulates it as a symmetric positive definite metric in a single\-timescale saddle\-point system\.
Proximal gradient TD and related saddle\-point formulations provide a natural route to single\-timescale algorithms\[[12](https://arxiv.org/html/2605.28849#bib.bib12),[11](https://arxiv.org/html/2605.28849#bib.bib11)\]\. Mirror\-Prox and stochastic extra\-gradient methods provide a general mechanism for smooth monotone variational inequalities and convex\-concave saddle\-point problems\[[15](https://arxiv.org/html/2605.28849#bib.bib13),[9](https://arxiv.org/html/2605.28849#bib.bib14)\]\. Existing Mirror\-Prox TD\-style methods typically inherit the GTD2 covariance metricC=𝔼μ\[ϕϕ⊤\]C=\\mathbb\{E\}\_\{\\mu\}\[\\phi\\phi^\{\\top\}\]\. The distinction of STHTD\-MP is that the extra\-gradient step is applied after changing the metric toH=sym\(Aμ\)H=\\operatorname\{sym\}\(A\_\{\\mu\}\)\. Thus the contribution is not simply adding Mirror\-Prox to TD, but changing the deterministic Mirror\-Prox mean operator through behavior\-induced geometry\.
A line of recent work has further sharpened the analysis of single\- and two\-timescale gradient TD\. Doan\[[6](https://arxiv.org/html/2605.28849#bib.bib25)\]gives finite\-time bounds for linear two\-timescale stochastic approximation with restarting, while Dalal et al\.\[[5](https://arxiv.org/html/2605.28849#bib.bib20)\]and Kaledin et al\.\[[10](https://arxiv.org/html/2605.28849#bib.bib21)\]establish finite\-time and Markovian\-noise analyses that apply to TDC\-type recursions\. TDRC introduces a regularized correction with a shared learning rate\[[7](https://arxiv.org/html/2605.28849#bib.bib27)\], which can be viewed as a single\-timescale stabilization of TDC and serves as a direct baseline in our experiments\. Hybrid TD methods\[[8](https://arxiv.org/html/2605.28849#bib.bib10)\]also exploit behavior\-policy information, but as a mixture of TD and gradient\-TD directions rather than as a positive definite metric, so the saddle\-point reading of the speedup is absent\. Relative to single\-timescale proximal TD methods\[[11](https://arxiv.org/html/2605.28849#bib.bib11)\], the distinguishing feature of STHTD\-MP is the use of the symmetric partH=sym\(Aμ\)H=\\operatorname\{sym\}\(A\_\{\\mu\}\)of the behavior\-induced Bellman matrix as the saddle\-point metric, and the resulting key\-matrix improvement made measurable in Section[7](https://arxiv.org/html/2605.28849#S7)\.
Recent reinforcement\-learning work has also studied target networks, off\-policy evaluation, and variance\-reduced Markovian stochastic approximation\[[22](https://arxiv.org/html/2605.28849#bib.bib22),[24](https://arxiv.org/html/2605.28849#bib.bib23),[23](https://arxiv.org/html/2605.28849#bib.bib24)\]\. These directions are complementary\. Target\-network and variance\-reduction methods mainly control instability or sampling noise, whereas this paper isolates the effect of the saddle\-point metric on linear off\-policy prediction\. The exact mean\-operator comparison with GTD2\-MP makes this geometric difference measurable in addition to the empirical learning curves\.
## 9Discussion
The present analysis focuses on fixed\-policy off\-policy prediction with linear function approximation\. This setting exposes the role of the behavior\-induced metric cleanly becauseAμA\_\{\\mu\},HH, and the saddle\-point operator are fixed by the behavior policy and the feature map\. Extending the same idea to control would require coupling the metric with changing target policies and bootstrapped action\-value maximization\. Nonlinear and deep function approximation would add another layer of dependence because the metric and the saddle\-point operator would vary with the representation parameters\.
The convergence and rate analyses describe two complementary views of the algorithm\. The stochastic\-approximation theorem establishes almost\-sure stability for the non\-projected linear recursion with diminishing step sizes\. The projected variational\-inequality bounds describe the structural effect of Mirror\-Prox under an unbiased i\.i\.d\. oracle model\. The experiments use Markovian trajectories, so the theoretical rate results are paired with exact finite\-state mean\-operator calculations and trajectory\-level statistics\.
The exact GTD2\-MP comparison is a deterministic mean\-operator comparison for finite benchmarks\. It gives a verifiable contraction factor and identifies when the behavior\-induced metric improves the Mirror\-Prox geometry\. In finite samples, this mean effect interacts with variance, feature conditioning, and the strength of the off\-policy mismatch, which explains why the empirical comparisons are reported with standard deviations across independent runs\.
## 10Conclusion
We presented STHTD\-MP, a behavior\-induced Mirror\-Prox temporal\-difference method for off\-policy prediction with linear function approximation\. Unlike GTD2\-MP, which uses the feature covariance metric, STHTD\-MP uses the symmetric part of the behavior\-policy Bellman matrix as the saddle\-point metric\. This changes the deterministic Mirror\-Prox mean operator rather than merely adding an extra\-gradient step to an existing TD method\.
The theoretical analysis establishes positive definiteness of the behavior\-induced metric under standard finite\-state assumptions, Hurwitz stability of the joint mean system, stochastic\-approximation convergence, and ergodic gap bounds\. More importantly, the exact mean\-rate comparison shows how to compare STHTD\-MP with GTD2\-MP through the spectral radius of the deterministic Mirror\-Prox error matrix\. The numerical mean\-operator analysis supports a faster deterministic contraction factor for STHTD\-MP on the two\-state, Random Walk, and Boyan Chain benchmarks, while Baird’s counterexample is correctly identified as a singular boundary case where the strict comparison assumptions fail\.
The stochastic experiments complement this analysis with finite\-sample learning curves, standard deviations over 100 independent runs, and fair hyperparameter tuning\. They show that STHTD\-MP is competitive with strong online TD baselines, while also revealing that no single metric is uniformly best in every off\-policy setting\. Future work should develop adaptive behavior\-induced metrics with theoretical guarantees and extend the analysis from fixed\-policy prediction to control settings\.
## Acknowledgments
This work was supported by the National Natural Science Foundation of China \(Nos\. 62276142, 62206133, 62202240, 62506172\), the Fundamental Research Funds for the Central Universities \(No\. 2242025K30024\), the Natural Science Foundation of Jiangsu Province \(No\. BK20250658\), and the Scientific Research Start\-up Foundation for Introduced Talents of Nanjing University of Posts and Telecommunications \(No\. NY225026\)\.
## Data and Code Availability
The experimental code and configuration files are available in the public GitHub repositoryGameAI\-NJUPT/STHTD\-MP\. The generated result tables are reproducible from the accompanying prediction\-experiment script\.
## References
- \[1\]L\. Baird\(1995\)Residual algorithms: reinforcement learning with function approximation\.InProceedings of the Twelfth International Conference on Machine Learning,pp\. 30–37\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[2\]V\. S\. Borkar and S\. P\. Meyn\(2000\)The o\.d\.e\. method for convergence of stochastic approximation and reinforcement learning\.SIAM Journal on Control and Optimization38\(2\),pp\. 447–469\.Cited by:[§5\.1](https://arxiv.org/html/2605.28849#S5.SS1.4.p1.6)\.
- \[3\]V\. S\. Borkar\(2023\)Stochastic approximation: a dynamical systems viewpoint\.2 edition,Springer\.Cited by:[§5\.1](https://arxiv.org/html/2605.28849#S5.SS1.3.p1.30),[§5\.1](https://arxiv.org/html/2605.28849#S5.SS1.4.p1.6)\.
- \[4\]X\. Chen, X\. Ma, Y\. Li, G\. Yang, S\. Yang, and Y\. Gao\(2023\)Modified retrace for off\-policy temporal difference learning\.InProceedings of the Thirty\-Ninth Conference on Uncertainty in Artificial Intelligence \(UAI\),Cited by:[Figure 1](https://arxiv.org/html/2605.28849#S6.F1),[Figure 1](https://arxiv.org/html/2605.28849#S6.F1.2.1),[§6\.1](https://arxiv.org/html/2605.28849#S6.SS1.p3.2),[§6\.4](https://arxiv.org/html/2605.28849#S6.SS4.p1.7),[§6\.4](https://arxiv.org/html/2605.28849#S6.SS4.p2.12)\.
- \[5\]G\. Dalal, B\. Szorenyi, G\. Thoppe, and S\. Mannor\(2020\)Finite sample analysis of two\-timescale stochastic approximation with applications to reinforcement learning\.InProceedings of the Thirty Third Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.125,pp\. 1199–1233\.Cited by:[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p2.1),[§8](https://arxiv.org/html/2605.28849#S8.p4.1),[Remark 1](https://arxiv.org/html/2605.28849#Thmremark1.p1.1.1)\.
- \[6\]T\. T\. Doan\(2021\)Finite\-time analysis and restarting scheme for linear two\-time\-scale stochastic approximation\.SIAM Journal on Control and Optimization59\(4\),pp\. 2798–2819\.Cited by:[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p2.1),[§8](https://arxiv.org/html/2605.28849#S8.p4.1),[Remark 1](https://arxiv.org/html/2605.28849#Thmremark1.p1.1.1)\.
- \[7\]S\. Ghiassian, A\. Patterson, S\. Garg, D\. Gupta, A\. White, and M\. White\(2020\)Gradient temporal\-difference learning with regularized corrections\.InProceedings of the 37th International Conference on Machine Learning \(ICML\),Cited by:[§8](https://arxiv.org/html/2605.28849#S8.p4.1)\.
- \[8\]L\. M\. Hackman\(2012\)Faster gradient\-td algorithms\.Master’s Thesis,University of Alberta\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p2.2),[§8](https://arxiv.org/html/2605.28849#S8.p2.1),[§8](https://arxiv.org/html/2605.28849#S8.p4.1)\.
- \[9\]A\. Juditsky, A\. Nemirovski, and C\. Tauvel\(2011\)Solving variational inequalities with stochastic mirror\-prox algorithm\.Stochastic Systems1\(1\),pp\. 17–58\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p3.1),[§4](https://arxiv.org/html/2605.28849#S4.p1.2),[§5\.2](https://arxiv.org/html/2605.28849#S5.SS2.2.p1.2),[§8](https://arxiv.org/html/2605.28849#S8.p3.2)\.
- \[10\]M\. Kaledin, E\. Moulines, A\. Naumov, V\. Tadic, and H\. Wai\(2020\)Finite sample analysis of linear two\-timescale stochastic approximation with markovian noise\.InProceedings of the Thirty Third Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.125,pp\. 2143–2203\.Cited by:[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p2.1),[§8](https://arxiv.org/html/2605.28849#S8.p4.1),[Remark 1](https://arxiv.org/html/2605.28849#Thmremark1.p1.1.1)\.
- \[11\]B\. Liu, I\. Gemp, M\. Ghavamzadeh, J\. Liu, S\. Mahadevan, and M\. Petrik\(2018\)Proximal gradient temporal difference learning: stable reinforcement learning with polynomial sample complexity\.Journal of Artificial Intelligence Research63,pp\. 461–494\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p2.2),[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p2.3),[§8](https://arxiv.org/html/2605.28849#S8.p3.2),[§8](https://arxiv.org/html/2605.28849#S8.p4.1)\.
- \[12\]B\. Liu, J\. Liu, M\. Ghavamzadeh, S\. Mahadevan, and M\. Petrik\(2015\)Finite\-sample analysis of proximal gradient td algorithms\.InProceedings of the Thirty\-First Conference on Uncertainty in Artificial Intelligence,pp\. 504–513\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p2.2),[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p2.3),[§8](https://arxiv.org/html/2605.28849#S8.p3.2)\.
- \[13\]H\. R\. Maei and R\. S\. Sutton\(2010\)GQ\(lambda\): a general gradient algorithm for temporal\-difference prediction learning with eligibility traces\.InProceedings of the 3rd Conference on Artificial General Intelligence,pp\. 100–105\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1),[§6\.2](https://arxiv.org/html/2605.28849#S6.SS2.p2.2),[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[14\]H\. R\. Maei, C\. Szepesvari, S\. Bhatnagar, and R\. S\. Sutton\(2010\)Toward off\-policy learning control with function approximation\.InProceedings of the 27th International Conference on Machine Learning,pp\. 719–726\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[15\]A\. Nemirovski\(2004\)Prox\-method with rate of convergence o\(1/t\) for variational inequalities with lipschitz continuous monotone operators and smooth convex\-concave saddle point problems\.SIAM Journal on Optimization15\(1\),pp\. 229–251\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p3.1),[§4](https://arxiv.org/html/2605.28849#S4.p1.2),[§5\.2](https://arxiv.org/html/2605.28849#S5.SS2.2.p1.2),[§8](https://arxiv.org/html/2605.28849#S8.p3.2)\.
- \[16\]D\. Precup, R\. S\. Sutton, and S\. P\. Singh\(2000\)Eligibility traces for off\-policy policy evaluation\.InProceedings of the Seventeenth International Conference on Machine Learning,pp\. 759–766\.Cited by:[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[17\]R\. S\. Sutton and A\. G\. Barto\(2018\)Reinforcement learning: an introduction\.2 edition,MIT Press\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1)\.
- \[18\]R\. S\. Sutton, H\. R\. Maei, D\. Precup, S\. Bhatnagar, D\. Silver, C\. Szepesvari, and E\. Wiewiora\(2009\)Fast gradient\-descent methods for temporal\-difference learning with linear function approximation\.InProceedings of the 26th Annual International Conference on Machine Learning,pp\. 993–1000\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[19\]R\. S\. Sutton, C\. Szepesvari, and H\. R\. Maei\(2008\)A convergent o\(n\) temporal\-difference algorithm for off\-policy learning with linear function approximation\.InAdvances in Neural Information Processing Systems,Vol\.21,pp\. 1609–1616\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.28849#S2.SS2.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[20\]R\. S\. Sutton\(1988\)Learning to predict by the methods of temporal differences\.Machine Learning3\(1\),pp\. 9–44\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1)\.
- \[21\]J\. N\. Tsitsiklis and B\. Van Roy\(1997\)An analysis of temporal\-difference learning with function approximation\.IEEE Transactions on Automatic Control42\(5\),pp\. 674–690\.Cited by:[§1](https://arxiv.org/html/2605.28849#S1.p1.1),[§8](https://arxiv.org/html/2605.28849#S8.p1.1)\.
- \[22\]M\. Uehara, J\. Huang, and N\. Jiang\(2020\)Minimax weight and q\-function learning for off\-policy evaluation\.InProceedings of the 37th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.119,pp\. 9659–9668\.Cited by:[§8](https://arxiv.org/html/2605.28849#S8.p5.1)\.
- \[23\]H\. Wai, Z\. Yang, Z\. Wang, and M\. Hong\(2021\)Variance\-reduced stochastic methods for markovian stochastic approximation\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 10431–10441\.Cited by:[§8](https://arxiv.org/html/2605.28849#S8.p5.1)\.
- \[24\]S\. Zhang, H\. Yao, and S\. Whiteson\(2021\)Breaking the deadly triad with a target network\.InProceedings of the 38th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.139,pp\. 12621–12631\.Cited by:[§8](https://arxiv.org/html/2605.28849#S8.p5.1)\.Similar Articles
Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction
This paper proposes behavior-aware auxiliary corrections for off-policy temporal-difference prediction, introducing BA-TDC and BA-TDRC algorithms that replace the auxiliary covariance matrix with the behavior Bellman matrix to improve stability and convergence. Theoretical analysis and experiments on standard benchmarks validate the effectiveness of the proposed methods.
HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents
HINT-SD proposes a targeted self-distillation framework that selects failure-relevant actions from full trajectories to improve long-horizon LLM agent training, achieving up to 18.80% improvement and 2.26× speedup over dense feedback baselines.
TEMPO: Temporal Enforcement via Mode-Separated Policy Optimization for Trustworthy LLM Backtesting
Proposes TEMPO, a policy optimization method that trains LLMs to reason exclusively from pre-cutoff information by using a two-mode reward and GRPO-based training, reducing knowledge leakage by 2–13% while improving task performance by 6–13%.
Metric-Gradient Projection for Stable Multi-Agent Policy Learning
Introduces HPML, a method that projects the joint update field of multi-agent systems onto a metric-gradient component to stabilize and improve multi-agent reinforcement learning. It provides theoretical guarantees and shows improved stability and returns on CTDE benchmarks.
Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief
This paper introduces Posterior Hybrid Bayesian Belief (PhyB), a framework that reformulates the expectation in Bayesian RL as a convex combination over dynamics models, enabling efficient regularized offline policy optimization with bounded objective discrepancy and state-of-the-art performance.