Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction
Summary
This paper proposes behavior-aware auxiliary corrections for off-policy temporal-difference prediction, introducing BA-TDC and BA-TDRC algorithms that replace the auxiliary covariance matrix with the behavior Bellman matrix to improve stability and convergence. Theoretical analysis and experiments on standard benchmarks validate the effectiveness of the proposed methods.
# Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction
Source: [https://arxiv.org/html/2605.28855](https://arxiv.org/html/2605.28855)
Zhiang He, Yuchen Shen, Shangdong Yang, Chao Li, Guang Yang, Wenhao Wang
###### Abstract
Temporal-difference learning with function approximation can be unstable under off-policy sampling. TDC stabilizes off-policy TD through an auxiliary covariance correction, and TDRC further regularizes this correction in a single-timescale recursion. This paper studies a behavior-aware replacement of the auxiliary covariance geometry in the linear prediction setting, which is the standard local model for understanding the feature-space dynamics of value-function approximation. We first replace the TDC auxiliary matrix $C$ by the behavior Bellman matrix $A_{\mu}$, yielding BA-TDC, and then regularize the same behavior-aware equation to obtain BA-TDRC. This two-step construction separates the contribution of behavior-aware geometry from the contribution of regularization. The linear analysis also provides a tractable model for an auxiliary-geometry design question that arises in neural-network value approximation, where feature covariances and temporal transition matrices jointly shape the last-layer correction dynamics. We give a finite-state mean-system formulation, prove fixed-point preservation and almost-sure convergence under a Hurwitz stability condition on the instantiated mean system, and compare deterministic mean rates through the spectral radius of the exact linear error recursion. Experiments on the two-state counterexample, Baird’s counterexample, Random Walk, and Boyan Chain show that the behavior-aware replacement can be highly beneficial by itself on some tasks, but that regularization is necessary for robust performance across harder settings.
###### keywords:
Reinforcement learning, Off-policy prediction, Temporal-difference learning, TDRC, Behavior-aware correction, Stochastic approximation
Journal: Neural Networks

Affiliations:

- aff1: Nanjing University of Posts and Telecommunications, Nanjing, China
- aff2: Department of Computer Science and Technology, Nanjing University, Nanjing, China
- aff3: College of Electronic Countermeasure, National University of Defense Technology, Hefei, China
## 1 Introduction
Temporal\-difference \(TD\) learning is a basic mechanism for policy evaluation in reinforcement learning\[[15](https://arxiv.org/html/2605.28855#bib.bib2),[11](https://arxiv.org/html/2605.28855#bib.bib1)\]\. In off\-policy prediction with linear function approximation, however, the combination of bootstrapping, approximation, and off\-policy sampling may produce divergence\[[1](https://arxiv.org/html/2605.28855#bib.bib4),[16](https://arxiv.org/html/2605.28855#bib.bib3)\]\. Gradient\-TD algorithms such as GTD2 and TDC address this problem by introducing an auxiliary variable and optimizing projected Bellman\-error objectives\[[14](https://arxiv.org/html/2605.28855#bib.bib5),[12](https://arxiv.org/html/2605.28855#bib.bib6),[9](https://arxiv.org/html/2605.28855#bib.bib7)\]\. More recently, TDRC added a regularized correction to TDC and used a shared learning rate, producing a practical single\-timescale variant with improved stability\[[5](https://arxiv.org/html/2605.28855#bib.bib8)\]\.
Several related TD variants modify different parts of the off\-policy prediction mechanism\. Proximal TD and saddle\-point TD formulations provide single\-timescale views of gradient\-TD learning\[[8](https://arxiv.org/html/2605.28855#bib.bib18),[7](https://arxiv.org/html/2605.28855#bib.bib17)\]\. Emphatic TD stabilizes off\-policy learning by reweighting updates with follow\-on or emphasis traces\[[13](https://arxiv.org/html/2605.28855#bib.bib11)\], and its convergence properties and related off\-policy evaluation ideas have been further studied in\[[17](https://arxiv.org/html/2605.28855#bib.bib12),[6](https://arxiv.org/html/2605.28855#bib.bib13)\]\. This is an important line of work, but the trace\-based weighting mechanism can still suffer from high variance when importance ratios fluctuate strongly, and emphatic\-style improvements must still control the variance induced by accumulated follow\-on weights\. Other off\-policy variants alter the primary TD correction direction itself\. These methods motivate the broader question of how update geometry affects learning, but the present paper focuses specifically on the correction family around TDC and TDRC: BA\-TDC and BA\-TDRC keep the same primary correction structure and modify only the auxiliary correction geometry\.
This paper asks whether the auxiliary correction term can be made more informative by using behavior-policy transition geometry. In TDC and TDRC, the auxiliary variable is driven by the feature covariance term $C=\mathbb{E}_{\mu}[\phi_{t}\phi_{t}^{\top}]$. This covariance metric ignores how the behavior policy moves features across time. We replace this covariance correction by the behavior-policy Bellman matrix $A_{\mu}=\mathbb{E}_{\mu}[\phi_{t}(\phi_{t}-\gamma\phi_{t+1})^{\top}]$. The unregularized replacement gives BA-TDC; adding the TDRC-style regularizer gives BA-TDRC.
The finite-state linear setting is used because it lets the auxiliary geometry be isolated exactly: the matrices $C$, $A_{\mu}$, $A_{\pi}$, and $D_{\pi}$ can all be computed and compared. This controlled setting is also tied to neural-network value approximation. Deep value functions learn a feature map and a prediction head jointly, and the last-layer or local linearization dynamics are governed by empirical feature covariances and temporal feature-transition matrices rather than by state values alone. The proposed replacement $C\mapsto A_{\mu}$ can therefore be read as a controlled linear model of a broader design issue in neural-network reinforcement learning: auxiliary correction targets should reflect not only which features are sampled, but also how the behavior policy transports learned features across bootstrapped targets. The finite-state analysis isolates this geometry before addressing the additional difficulties of nonlinear feature drift, online matrix estimation, and approximation error in deep networks \[[5](https://arxiv.org/html/2605.28855#bib.bib8),[10](https://arxiv.org/html/2605.28855#bib.bib9)\].
The contributions are as follows\.
- 1\. We derive BA-TDC and BA-TDRC, separating the behavior-aware replacement $C\mapsto A_{\mu}$ from the additional effect of regularization.
- 2\. We formulate their exact finite-state mean dynamics, prove stochastic-approximation convergence under a Hurwitz stability condition on the instantiated mean system, and verify this condition numerically for each benchmark through exact finite-state matrix computation.
- 3\. We compare convergence speed through the spectral radius of the deterministic linear error recursion, giving a verifiable finite-state mean-rate criterion.
- 4\. We evaluate the modular increments TDC $\rightarrow$ BA-TDC and TDRC $\rightarrow$ BA-TDRC on four standard off-policy prediction benchmarks.
## 2 Background
### 2.1 Notation
We consider a finite Markov decision process with state space $\mathcal{S}$, action space $\mathcal{A}$, transition kernel $P$, reward $r$, target policy $\pi$, behavior policy $\mu$, and discount factor $\gamma\in(0,1)$. The data are sampled under $\mu$, while the value function of $\pi$ is estimated. For a policy $\nu\in\{\pi,\mu\}$, let $P_{\nu}$ denote the state-transition matrix induced by $\nu$, and let $d_{\mu}$ be the stationary distribution of $P_{\mu}$. We write $D_{\mu}=\operatorname{diag}(d_{\mu})$.
The value approximation is linear:
$$v_{\theta}(s)=\theta^{\top}\phi(s),\quad\phi(s)\in\mathbb{R}^{d},\quad\theta\in\mathbb{R}^{d}. \tag{1}$$

The feature matrix is $\Phi\in\mathbb{R}^{|\mathcal{S}|\times d}$, whose $s$th row is $\phi(s)^{\top}$. For compactness, we write $\phi_{t}=\phi(s_{t})$ and $\phi_{t+1}=\phi(s_{t+1})$. The importance ratio is

$$\rho_{t}=\frac{\pi(a_{t}|s_{t})}{\mu(a_{t}|s_{t})}. \tag{2}$$

The TD error is written in the compact form

$$\delta_{t}=r_{t}-\theta_{t}^{\top}(\phi_{t}-\gamma\phi_{t+1}). \tag{3}$$

All expectations are taken under the stationary behavior trajectory unless otherwise stated. The standard projected Bellman matrices are

$$A_{\pi}=\mathbb{E}_{\mu}[\rho_{t}\phi_{t}(\phi_{t}-\gamma\phi_{t+1})^{\top}],\quad b=\mathbb{E}_{\mu}[\rho_{t}r_{t}\phi_{t}],\quad C=\mathbb{E}_{\mu}[\phi_{t}\phi_{t}^{\top}]. \tag{4}$$

The projected Bellman fixed point satisfies $A_{\pi}\theta=b$.

We also use the target-policy next-feature coupling matrix

$$D_{\pi}=\mathbb{E}_{\mu}[\rho_{t}\gamma\phi_{t+1}\phi_{t}^{\top}], \tag{5}$$

and the behavior-policy Bellman matrix

$$A_{\mu}=\mathbb{E}_{\mu}[\phi_{t}(\phi_{t}-\gamma\phi_{t+1})^{\top}]. \tag{6}$$

The vector $w\in\mathbb{R}^{d}$ denotes the auxiliary correction variable. The notation $\|x\|$ denotes the Euclidean norm, $I$ denotes the identity matrix, and $\rho(M)$ denotes the spectral radius of a matrix $M$. The analysis treats $z_{t}=(\theta_{t},w_{t})$ as $\mathcal{F}_{t}$-measurable; conditional expectations in the mean-drift derivations are taken with respect to $(s_{t},a_{t},s_{t+1})$ given $\mathcal{F}_{t}$.
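The expectations in Eqs. (4)–(6) have closed forms in the finite-state setting. The following is a minimal sketch, not the authors' code: the function names, the two-state numbers, and the vector `r_pi` (the expected one-step reward under $\pi$ in each state) are illustrative assumptions.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to sum to one."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def mean_matrices(Phi, P_pi, P_mu, r_pi, gamma):
    """Closed forms of A_pi, A_mu, C, D_pi and b for a finite chain."""
    D = np.diag(stationary_distribution(P_mu))      # D_mu
    I = np.eye(Phi.shape[0])
    A_pi = Phi.T @ D @ (I - gamma * P_pi) @ Phi     # target Bellman matrix
    A_mu = Phi.T @ D @ (I - gamma * P_mu) @ Phi     # behavior Bellman matrix
    C = Phi.T @ D @ Phi                             # feature covariance
    D_pi = gamma * Phi.T @ P_pi.T @ D @ Phi         # next-feature coupling
    b = Phi.T @ D @ r_pi
    return A_pi, A_mu, C, D_pi, b

# Hypothetical two-state problem with one feature (numbers are illustrative).
Phi = np.array([[1.0], [2.0]])
P_pi = np.array([[0.0, 1.0], [0.0, 1.0]])   # target policy always moves to state 2
P_mu = np.array([[0.5, 0.5], [0.5, 0.5]])   # behavior policy is uniform
r_pi = np.zeros(2)
A_pi, A_mu, C, D_pi, b = mean_matrices(Phi, P_pi, P_mu, r_pi, gamma=0.9)
```

In this toy instance $A_{\pi}$ is negative while $A_{\mu}$ is positive, and the identity $A_{\pi}=C-D_{\pi}^{\top}$, which follows directly from the definitions, can be used as a sanity check.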
### 2.2 TDC and TDRC
Gradient-TD methods optimize the mean-squared projected Bellman error (MSPBE)

$$J(\theta)=\frac{1}{2}(b-A_{\pi}\theta)^{\top}C^{-1}(b-A_{\pi}\theta). \tag{7}$$

The auxiliary variable in TDC estimates

$$w_{\theta}=C^{-1}(b-A_{\pi}\theta),\quad\text{or equivalently}\quad Cw_{\theta}=b-A_{\pi}\theta. \tag{8}$$

With this auxiliary variable, the negative MSPBE gradient can be written as a correction direction that admits the standard TDC sample update:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}\rho_{t}\left(\delta_{t}\phi_{t}-\gamma\phi_{t+1}\phi_{t}^{\top}w_{t}\right), \tag{9}$$

$$w_{t+1}=w_{t}+\beta_{t}(\rho_{t}\delta_{t}-\phi_{t}^{\top}w_{t})\phi_{t}. \tag{10}$$

The $w$-recursion is a stochastic approximation to $Cw=b-A_{\pi}\theta$, since the sample term $(\rho_{t}\delta_{t}-\phi_{t}^{\top}w_{t})\phi_{t}$ has mean $b-A_{\pi}\theta-Cw$.
The practical weakness of this TDC recursion is that the auxiliary equation can be poorly conditioned and is usually run with a separate step size. The performance of TDC can therefore depend strongly on the relative tuning of the primary step size $\alpha_{t}$ and auxiliary step size $\beta_{t}$, especially in off-policy counterexamples where the correction variable changes rapidly.
TDRC regularizes this correction equation. Instead of solving Eq. ([8](https://arxiv.org/html/2605.28855#S2.E8)), TDRC uses the regularized equation

$$(C+\eta I)w_{\theta}=b-A_{\pi}\theta,\quad\eta>0, \tag{11}$$

which is equivalent to adding a ridge penalty to the auxiliary least-squares problem. This shifts the auxiliary matrix from $C$ to $C+\eta I$, improves conditioning when $C$ is nearly singular, and damps rapid growth of the correction variable. The mean auxiliary drift becomes $b-A_{\pi}\theta-(C+\eta I)w$. In the single-learning-rate form used in this paper, TDRC is

$$\theta_{t+1}=\theta_{t}+\alpha_{t}\rho_{t}\left(\delta_{t}\phi_{t}-\gamma\phi_{t+1}\phi_{t}^{\top}w_{t}\right), \tag{12}$$

$$w_{t+1}=w_{t}+\alpha_{t}\left[(\rho_{t}\delta_{t}-\phi_{t}^{\top}w_{t})\phi_{t}-\eta w_{t}\right], \tag{13}$$

where $\eta>0$ is the regularization parameter. Thus TDRC keeps the TDC primary correction direction but stabilizes the auxiliary recursion through a regularized covariance equation.
This is the point from which BA\-TDRC departs\. We do not change the TDC/TDRC primary correction direction\. Instead, we keep the TDRC idea of a regularized auxiliary equation and replace its covariance matrix by a behavior\-aware Bellman matrix\.
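For reference, one sample step of the single-learning-rate TDRC recursion in Eqs. (12)–(13) might be sketched as follows. The function name `tdrc_step` and its argument layout are assumptions made for illustration; setting `eta=0` gives the TDC auxiliary direction with a shared step size.

```python
import numpy as np

def tdrc_step(theta, w, phi, phi_next, reward, rho, alpha, eta, gamma):
    """One TDRC update on a single transition (hypothetical helper)."""
    # Compact TD error of Eq. (3).
    delta = reward - theta @ (phi - gamma * phi_next)
    # Primary update, Eq. (12): TD term minus the gradient correction.
    theta_new = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    # Auxiliary update, Eq. (13): covariance term plus ridge shrinkage.
    w_new = w + alpha * ((rho * delta - phi @ w) * phi - eta * w)
    return theta_new, w_new

# Illustrative single transition with one feature.
theta1, w1 = tdrc_step(
    theta=np.array([1.0]), w=np.array([0.0]),
    phi=np.array([1.0]), phi_next=np.array([2.0]),
    reward=0.0, rho=1.0, alpha=0.1, eta=1.0, gamma=0.9,
)
```

Both updates are computed from the pre-update values of `theta` and `w`, matching the simultaneous form of the recursion.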
## 3 Behavior-Aware Auxiliary Corrections
The point of departure is the role of the auxiliary variable. In TDC and TDRC, this variable does not approximate the value function directly; it estimates a correction vector for the bias created by off-policy bootstrapping with function approximation. TDC obtains this vector from the covariance equation $Cw_{\theta}=b-A_{\pi}\theta$, and TDRC regularizes the same equation as

$$(C+\eta I)w_{\theta}=b-A_{\pi}\theta. \tag{14}$$

These equations are stable and easy to sample because $C=\mathbb{E}_{\mu}[\phi_{t}\phi_{t}^{\top}]$ is the feature covariance under the behavior distribution. The limitation is that $C$ describes only instantaneous feature occurrence. It says how often features are observed, but not how the behavior policy transports them from the current state to the next state. Off-policy TD instability is inherently temporal: the update couples $\phi_{t}$ and $\phi_{t+1}$ through bootstrapping. A correction variable shaped only by $C$ can therefore be poorly matched to the temporal geometry of the sampled behavior trajectory.
The behavior-policy Bellman matrix

$$A_{\mu}=\mathbb{E}_{\mu}[\phi_{t}(\phi_{t}-\gamma\phi_{t+1})^{\top}] \tag{15}$$

contains this temporal feature transport under the behavior policy. It keeps the current feature $\phi_{t}$ but subtracts the discounted next feature induced by behavior sampling. This suggests a two-step modification. First, replace the TDC auxiliary matrix $C$ by $A_{\mu}$, producing an unregularized behavior-aware correction. Second, add the same type of regularization used by TDRC, producing a regularized behavior-aware correction.
This design deliberately changes only the auxiliary equation. We keep the primary TDC/TDRC update direction unchanged so that the comparison isolates two effects: replacing $C$ by $A_{\mu}$, and then adding regularization. This is important for both theory and experiments: the TD projected fixed point is preserved, while the transient correction geometry can change.
To make the replacement precise, write the covariance-based auxiliary equation in the generic form

$$M_{C}w_{\theta}=b-A_{\pi}\theta,\quad M_{C}=C+\eta I. \tag{16}$$

The corresponding mean residual is

$$b-A_{\pi}\theta-M_{C}w. \tag{17}$$

Indeed, the sample residual used in TDRC is

$$(\rho_{t}\delta_{t}-\phi_{t}^{\top}w_{t})\phi_{t}-\eta w_{t}, \tag{18}$$

whose expectation is

$$\mathbb{E}_{\mu}[(\rho_{t}\delta_{t}-\phi_{t}^{\top}w_{t})\phi_{t}-\eta w_{t}]=b-A_{\pi}\theta-(C+\eta I)w. \tag{19}$$
Here $\eta=0$ corresponds to TDC and $\eta>0$ corresponds to TDRC. The behavior-aware version keeps the same right-hand side $b-A_{\pi}\theta$, because this is the projected Bellman residual whose correction is needed by the primary TDC update. It changes only the metric used to map this residual into the auxiliary variable. Specifically, the behavior-aware family replaces $M_{C}$ by

$$M_{A}=A_{\mu}+\beta I,\quad\beta\geq 0. \tag{20}$$

Here $\beta=0$ gives BA-TDC, while $\beta>0$ gives BA-TDRC. The behavior-aware auxiliary equation is therefore

$$(A_{\mu}+\beta I)w_{\theta}=b-A_{\pi}\theta,\quad\beta\geq 0, \tag{21}$$

and its mean residual is

$$b-A_{\pi}\theta-(A_{\mu}+\beta I)w. \tag{22}$$

The sample counterpart of $A_{\mu}w_{t}$ is $\phi_{t}(\phi_{t}-\gamma\phi_{t+1})^{\top}w_{t}$. Specifically, the behavior-aware sample counterpart of the auxiliary residual in Eq. ([22](https://arxiv.org/html/2605.28855#S3.E22)) is

$$\left(\rho_{t}\delta_{t}-(\phi_{t}-\gamma\phi_{t+1})^{\top}w_{t}\right)\phi_{t}-\beta w_{t}. \tag{23}$$

Taking expectations verifies the replacement explicitly:

$$\begin{aligned}
&\mathbb{E}_{\mu}\left[\left(\rho_{t}\delta_{t}-(\phi_{t}-\gamma\phi_{t+1})^{\top}w_{t}\right)\phi_{t}-\beta w_{t}\right]\\
&\qquad=b-A_{\pi}\theta-A_{\mu}w-\beta w=b-A_{\pi}\theta-(A_{\mu}+\beta I)w.
\end{aligned} \tag{24}$$
Equations ([19](https://arxiv.org/html/2605.28855#S3.E19)) and ([24](https://arxiv.org/html/2605.28855#S3.E24)) show the whole modification: covariance-based corrections use $C+\eta I$, while behavior-aware corrections use $A_{\mu}+\beta I$. No extra-gradient step, emphatic weighting, or control-specific mechanism is introduced.
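The expectation identity in Eq. (24) can be checked by exact enumeration on a small problem. The sketch below assumes a hypothetical two-state, two-action MDP in which action $a$ moves deterministically to state $a$, so that $P_{\nu}(s,s')=\nu(s'|s)$; the policies, rewards, features, and test vectors are all made-up numbers.

```python
import numpy as np

gamma, beta = 0.9, 0.5
Phi = np.array([[1.0, 0.0], [1.0, 2.0]])     # row s is phi(s)
mu = np.full((2, 2), 0.5)                    # mu[s, a], uniform behavior policy
pi = np.array([[0.1, 0.9], [0.2, 0.8]])      # pi[s, a], target policy
R = np.array([[1.0, -1.0], [0.5, 2.0]])      # R[s, a], deterministic rewards
d_mu = np.array([0.5, 0.5])                  # stationary under the uniform mu
theta = np.array([0.3, -0.7])
w = np.array([1.1, 0.4])

def sample_residual(s, a):
    """Behavior-aware auxiliary sample residual of Eq. (23)."""
    phi, phi_next = Phi[s], Phi[a]           # next state equals the action
    rho = pi[s, a] / mu[s, a]
    diff = phi - gamma * phi_next
    delta = R[s, a] - theta @ diff
    return (rho * delta - diff @ w) * phi - beta * w

# Left-hand side of Eq. (24): exact expectation over (s, a) under mu.
expected = sum(d_mu[s] * mu[s, a] * sample_residual(s, a)
               for s in range(2) for a in range(2))

# Right-hand side of Eq. (24) from closed-form matrices.
D, I = np.diag(d_mu), np.eye(2)
A_pi = Phi.T @ D @ (I - gamma * pi) @ Phi    # P_pi equals pi in this toy MDP
A_mu = Phi.T @ D @ (I - gamma * mu) @ Phi    # P_mu equals mu in this toy MDP
b = Phi.T @ D @ (pi * R).sum(axis=1)
closed_form = b - A_pi @ theta - (A_mu + beta * I) @ w
```

The two vectors `expected` and `closed_form` agree to floating-point precision, which is the content of Eq. (24) for this instance.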
Substituting Eq. ([23](https://arxiv.org/html/2605.28855#S3.E23)) into the auxiliary recursion gives the behavior-aware update:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}\rho_{t}\left(\delta_{t}\phi_{t}-\gamma\phi_{t+1}\phi_{t}^{\top}w_{t}\right), \tag{25}$$

$$w_{t+1}=w_{t}+\lambda\alpha_{t}\left[\left(\rho_{t}\delta_{t}-(\phi_{t}-\gamma\phi_{t+1})^{\top}w_{t}\right)\phi_{t}-\beta w_{t}\right]. \tag{26}$$

For BA-TDC, $\beta=0$ and the auxiliary gain $\lambda\alpha_{t}$ is treated as a second step size, as in TDC. BA-TDC is therefore used mainly to isolate the unregularized behavior-aware geometry, not as the final robust single-timescale method. For BA-TDRC, $\beta>0$ and $\lambda$ is a fixed positive constant, so the method remains single-timescale in the same sense as TDRC. This follows the TDRC implementation convention, where the auxiliary update uses a fixed multiple of the primary step size rather than an independently decaying second timescale \[[5](https://arxiv.org/html/2605.28855#bib.bib8)\]; $\lambda$ is therefore a fixed gain ratio, not a new asymptotic timescale. In the implementation, the auxiliary update uses an explicit constant $\alpha_{w}$, and the ratio $\lambda=\alpha_{w}/\alpha$ is determined by the per-environment hyperparameter search rather than fixed across environments; this matches the way TDRC tunes its auxiliary step in \[[5](https://arxiv.org/html/2605.28855#bib.bib8)\].
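A single sample step of Eqs. (25)–(26) might look as follows. This is an illustrative sketch rather than the authors' implementation; the name `ba_step` and the argument `lam` (standing for the gain ratio $\lambda$) are assumptions. Passing `beta=0.0` corresponds to BA-TDC and `beta > 0` to BA-TDRC.

```python
import numpy as np

def ba_step(theta, w, phi, phi_next, reward, rho, alpha, lam, beta, gamma):
    """One behavior-aware update on a single transition (hypothetical helper)."""
    diff = phi - gamma * phi_next                 # phi_t - gamma * phi_{t+1}
    delta = reward - theta @ diff                 # compact TD error, Eq. (3)
    # Primary update, Eq. (25): identical to the TDC/TDRC primary direction.
    theta_new = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    # Auxiliary update, Eq. (26): phi^T w is replaced by diff^T w.
    w_new = w + lam * alpha * ((rho * delta - diff @ w) * phi - beta * w)
    return theta_new, w_new

# Illustrative single transition with one feature.
theta1, w1 = ba_step(
    theta=np.array([1.0]), w=np.array([0.5]),
    phi=np.array([1.0]), phi_next=np.array([2.0]),
    reward=0.0, rho=1.0, alpha=0.1, lam=1.0, beta=1.0, gamma=0.9,
)
```

The only difference from a TDRC step is the inner product in the auxiliary update, which uses the temporal feature difference instead of the current feature alone.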
The distinction is therefore modular: BA-TDC tests the effect of replacing $C$ by $A_{\mu}$ without regularization, and BA-TDRC tests the regularized replacement of $C+\eta I$ by $A_{\mu}+\beta I$. The next section analyzes the consequences of this replacement: the TD fixed point is preserved under a nonsingularity condition, almost-sure convergence follows under a Hurwitz mean-system condition, and a conditional speed advantage follows when the induced mean error matrix has a smaller spectral factor.

Table 1: Modular comparison of covariance-based and behavior-aware corrections. All methods use the same primary TDC correction direction; they differ in the auxiliary equation and time-scale structure.

This comparison also clarifies the intended claim. BA-TDC isolates the behavior-aware geometry, while BA-TDRC combines that geometry with regularization. BA-TDRC is not meant to uniformly dominate TDRC on every off-policy problem; it is expected to help when the behavior Bellman geometry and the regularizer together improve the mean transient or stability structure, and to behave competitively otherwise.
## 4 Theoretical Analysis
### 4.1 Generic Regularized-Correction Mean Dynamics
We analyze the covariance-based and behavior-aware correction equations through a common auxiliary matrix. TDC and TDRC use

$$M_{C}=C+\eta I,\quad\eta\geq 0, \tag{27}$$

where $\eta=0$ gives TDC and $\eta>0$ gives TDRC. BA-TDC and BA-TDRC use

$$M_{A}=A_{\mu}+\beta I,\quad\beta\geq 0, \tag{28}$$

where $\beta=0$ gives BA-TDC and $\beta>0$ gives BA-TDRC. Let the auxiliary gain be $\lambda\alpha_{t}$ with fixed $\lambda>0$. For BA-TDC this is the usual two-stepsize form with a separately tuned auxiliary gain; for BA-TDRC it remains a single-timescale recursion because the ratio $\lambda$ is fixed.
The mean recursion follows from three elementary identities. First, expanding the compact TD-error definition gives the target-policy Bellman residual sampled under $\mu$:

$$\mathbb{E}_{\mu}[\rho_{t}\delta_{t}\phi_{t}]=b-A_{\pi}\theta_{t}. \tag{29}$$

Indeed, the reward term gives $b$, while the feature-difference term gives $A_{\pi}\theta_{t}$ by the definitions of $b$ and $A_{\pi}$. Second, by the definition of $D_{\pi}$,

$$\mathbb{E}_{\mu}[\rho_{t}\gamma\phi_{t+1}\phi_{t}^{\top}w_{t}]=D_{\pi}w_{t}. \tag{30}$$

Thus the primary TDC correction has mean drift

$$\mathbb{E}_{\mu}\left[\rho_{t}(\delta_{t}\phi_{t}-\gamma\phi_{t+1}\phi_{t}^{\top}w_{t})\right]=b-A_{\pi}\theta_{t}-D_{\pi}w_{t}. \tag{31}$$

Third, the auxiliary residual has mean

$$\mathbb{E}_{\mu}\left[(\rho_{t}\delta_{t}-m_{t}^{\top}w_{t})\phi_{t}-rw_{t}\right]=b-A_{\pi}\theta_{t}-Mw_{t}, \tag{32}$$

where $(m_{t},r,M)=(\phi_{t},\eta,C+\eta I)$ for TDRC and $(m_{t},r,M)=(\phi_{t}-\gamma\phi_{t+1},\beta,A_{\mu}+\beta I)$ for BA-TDRC. Equation ([32](https://arxiv.org/html/2605.28855#S4.E32)) also covers TDC and BA-TDC by setting $\eta=0$ or $\beta=0$.
Combining Eqs. ([31](https://arxiv.org/html/2605.28855#S4.E31)) and ([32](https://arxiv.org/html/2605.28855#S4.E32)), the expected recursion for any fixed $M$ and $\lambda$ is

$$z_{t+1}=z_{t}+\alpha_{t}(\mathcal{G}_{M,\lambda}z_{t}+h_{\lambda}),\quad z_{t}=\begin{pmatrix}\theta_{t}\\ w_{t}\end{pmatrix},\quad h_{\lambda}=\begin{pmatrix}b\\ \lambda b\end{pmatrix}, \tag{33}$$

where

$$\mathcal{G}_{M,\lambda}=\begin{pmatrix}-A_{\pi}&-D_{\pi}\\ -\lambda A_{\pi}&-\lambda M\end{pmatrix}. \tag{34}$$

Thus the structural difference between TDRC and BA-TDRC in the mean system is the replacement of $M_{C}=C+\eta I$ by $M_{A}=A_{\mu}+\beta I$; $\lambda$ only fixes the auxiliary-to-primary gain ratio.
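The block structure of Eqs. (33)–(34) is straightforward to assemble numerically. The helper below is a sketch under assumed names (`mean_system`, `lam`); the scalar matrices are made-up illustrative values, not taken from any benchmark in the paper.

```python
import numpy as np

def mean_system(A_pi, D_pi, M, b, lam):
    """Return (G, h) of Eqs. (33)-(34) for an auxiliary matrix M and gain ratio lam."""
    G = np.block([[-A_pi, -D_pi],
                  [-lam * A_pi, -lam * M]])
    h = np.concatenate([b, lam * b])
    return G, h

# Illustrative one-dimensional instance (d = 1), numbers are hypothetical.
A_pi = np.array([[-0.2]])
D_pi = np.array([[2.7]])
A_mu = np.array([[0.475]])
b = np.array([0.3])
beta, lam = 1.0, 1.0

M_A = A_mu + beta * np.eye(1)          # behavior-aware auxiliary matrix, Eq. (28)
G, h = mean_system(A_pi, D_pi, M_A, b, lam)
```

Swapping `M_A` for `C + eta * np.eye(d)` in the same call would give the TDRC mean system, so the two families can be compared by changing a single argument.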
###### Assumption 1\.
The Markov chain under $\mu$ is irreducible and has stationary distribution $d_{\mu}$ with full support. Features and rewards are bounded, the feature matrix $\Phi$ has full column rank, and $A_{\pi}$ is nonsingular.
The nonsingularity of $A_{\pi}=\Phi^{\top}D_{\mu}(I-\gamma P_{\pi})\Phi$ is the usual projected Bellman fixed-point condition for linear off-policy prediction. Full column rank of $\Phi$ is a natural identifiability requirement, but by itself it does not imply nonsingularity of the oblique off-policy matrix $A_{\pi}$; this is why nonsingularity is stated explicitly.
###### Assumption 2\.
For $M=M_{A}=A_{\mu}+\beta I$ with the chosen $\beta\geq 0$, the matrix $M_{A}-D_{\pi}$ is nonsingular.
###### Assumption 3\.
For $M=M_{A}=A_{\mu}+\beta I$ and the chosen $\lambda>0$, the matrix $\mathcal{G}_{M_{A},\lambda}$ in Eq. ([34](https://arxiv.org/html/2605.28855#S4.E34)) is Hurwitz.
Assumption [3](https://arxiv.org/html/2605.28855#Thmassumption3) is a stability condition on the instantiated mean system, not a consequence asserted from Assumption [1](https://arxiv.org/html/2605.28855#Thmassumption1) alone. The behavior-aware matrix $A_{\mu}$ is generally nonsymmetric, and the block coupling through $A_{\pi}$ and $D_{\pi}$ means that stability depends jointly on $A_{\pi}$, $D_{\pi}$, $M_{A}$, and $\lambda$. Assumptions [2](https://arxiv.org/html/2605.28855#Thmassumption2) and [3](https://arxiv.org/html/2605.28855#Thmassumption3) are independent: either can hold while the other fails. For fixed finite-state matrices, Assumption [3](https://arxiv.org/html/2605.28855#Thmassumption3) can be verified by computing $\max_{i}\operatorname{Re}\,\lambda_{i}(\mathcal{G}_{M_{A},\lambda})<0$; this is checked numerically for each benchmark in Table [7](https://arxiv.org/html/2605.28855#S6.T7).
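The eigenvalue test for Assumption 3 can be written in a few lines. This sketch uses an assumed helper name `is_hurwitz` and two made-up matrices; it is not the verification script behind Table 7.

```python
import numpy as np

def is_hurwitz(G):
    """Return (flag, margin): flag is True when every eigenvalue of G has
    negative real part; margin is the largest real part."""
    margin = float(np.max(np.real(np.linalg.eigvals(G))))
    return margin < 0.0, margin

# Hypothetical 2x2 mean-system matrix with negative trace and positive determinant.
G_stable = np.array([[0.2, -2.7], [0.2, -1.475]])
stable_flag, stable_margin = is_hurwitz(G_stable)

# A matrix with a positive eigenvalue fails the test.
G_unstable = np.array([[0.2, 0.0], [0.0, -1.0]])
unstable_flag, unstable_margin = is_hurwitz(G_unstable)
```

Because the test depends on $A_{\pi}$, $D_{\pi}$, $M_{A}$, and $\lambda$ jointly, it has to be rerun whenever $\beta$ or $\lambda$ changes.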
###### Proposition 1\(TD fixed point is preserved\)\.
Under Assumptions [1](https://arxiv.org/html/2605.28855#Thmassumption1) and [2](https://arxiv.org/html/2605.28855#Thmassumption2), any equilibrium of the BA-TDC/BA-TDRC mean recursion satisfies

$$w^{*}=0,\quad\theta^{*}=A_{\pi}^{-1}b. \tag{35}$$

Therefore replacing $C$ by $A_{\mu}$ does not change the TD projected fixed point whenever the nonsingularity condition holds.
###### Proof\.
At equilibrium, Eq. ([33](https://arxiv.org/html/2605.28855#S4.E33)) with $M=M_{A}$ gives two block equations:

$$b-A_{\pi}\theta-D_{\pi}w=0,\quad\lambda(b-A_{\pi}\theta-M_{A}w)=0. \tag{36}$$

Because $\lambda>0$, the second equation is equivalent to $b-A_{\pi}\theta-M_{A}w=0$. Subtracting the second equation from the first yields

$$(M_{A}-D_{\pi})w=0. \tag{37}$$

Assumption [2](https://arxiv.org/html/2605.28855#Thmassumption2) implies $w=0$. Substituting $w=0$ into either block equation gives $A_{\pi}\theta=b$. Assumption [1](https://arxiv.org/html/2605.28855#Thmassumption1) states that $A_{\pi}$ is nonsingular, hence $\theta=A_{\pi}^{-1}b$. This proves both uniqueness of the equilibrium and preservation of the TD projected fixed point. ∎
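Proposition 1 can be illustrated numerically by solving the equilibrium equation $\mathcal{G}_{M_{A},\lambda}z+h_{\lambda}=0$ directly. The matrices below are hypothetical values chosen so that $A_{\pi}$ and $M_{A}-D_{\pi}$ are nonsingular; they are not derived from a specific MDP.

```python
import numpy as np

# Hypothetical 2-dimensional instance.
A_pi = np.array([[1.0, 0.2], [-0.1, 0.8]])
D_pi = np.array([[0.5, 0.1], [0.0, 0.4]])
A_mu = np.array([[0.9, 0.1], [0.2, 0.7]])
b = np.array([1.0, -0.5])
beta, lam = 0.5, 1.0

M_A = A_mu + beta * np.eye(2)
G = np.block([[-A_pi, -D_pi], [-lam * A_pi, -lam * M_A]])
h = np.concatenate([b, lam * b])

# Equilibrium of the mean recursion: G z + h = 0.
z_star = np.linalg.solve(G, -h)
theta_star, w_star = z_star[:2], z_star[2:]

# TD projected fixed point for comparison.
theta_td = np.linalg.solve(A_pi, b)
```

Here `w_star` vanishes and `theta_star` coincides with `theta_td`, as Eq. (35) states. Row-reducing the block matrix also shows that it is nonsingular exactly when both $A_{\pi}$ and $M_{A}-D_{\pi}$ are, which is why the linear solve is well posed under Assumptions 1 and 2.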
The nonsingularity condition has a simple sufficient form. Since $M_{A}-D_{\pi}\in\mathbb{R}^{d\times d}$ and

$$M_{A}-D_{\pi}=\beta I+(A_{\mu}-D_{\pi}), \tag{38}$$

Weyl’s singular-value inequality implies

$$\sigma_{\min}(M_{A}-D_{\pi})\geq\beta-\|D_{\pi}-A_{\mu}\|_{2}. \tag{39}$$

Thus Assumption [2](https://arxiv.org/html/2605.28855#Thmassumption2) is guaranteed whenever $\beta>\|D_{\pi}-A_{\mu}\|_{2}$. This bound is conservative but operational: increasing the regularizer makes fixed-point preservation easier to verify. It is separate from the speed condition in Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6); a problem can preserve the TD fixed point while failing to give BA-TDRC a smaller deterministic mean spectral factor.
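The sufficient condition in Eq. (39) is easy to evaluate for given matrices. The following sketch uses an assumed helper name and hypothetical matrices; it returns both sides of the inequality so that the slack of the bound is visible.

```python
import numpy as np

def fixed_point_margin(A_mu, D_pi, beta):
    """Return (sigma_min(M_A - D_pi), beta - ||D_pi - A_mu||_2) from Eq. (39)."""
    d = A_mu.shape[0]
    M_A = A_mu + beta * np.eye(d)
    sigma_min = np.linalg.svd(M_A - D_pi, compute_uv=False).min()
    lower_bound = beta - np.linalg.norm(D_pi - A_mu, 2)   # spectral norm
    return sigma_min, lower_bound

# Hypothetical matrices (not taken from a benchmark).
A_mu = np.array([[0.9, 0.1], [0.2, 0.7]])
D_pi = np.array([[0.5, 0.1], [0.0, 0.4]])
sigma_min, lower_bound = fixed_point_margin(A_mu, D_pi, beta=0.5)
```

A positive `lower_bound` certifies Assumption 2. A nonpositive value is inconclusive, since the bound is only sufficient; in that case `sigma_min` itself can be inspected.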
### 4.2 Stochastic-Approximation Convergence
Let $z^{*}=(A_{\pi}^{-1}b,0)$ and $e_{t}=z_{t}-z^{*}$. The sampled BA-TDC/BA-TDRC recursion can be written as

$$z_{t+1}=z_{t}+\alpha_{t}(\mathcal{G}_{M_{A},\lambda}z_{t}+h_{\lambda}+\xi_{t+1}), \tag{40}$$

where $\xi_{t+1}$ is the stochastic sampling noise after subtracting the conditional mean drift in Eq. ([33](https://arxiv.org/html/2605.28855#S4.E33)). Since $\mathcal{G}_{M_{A},\lambda}z^{*}+h_{\lambda}=0$ by Proposition [1](https://arxiv.org/html/2605.28855#Thmproposition1), the error recursion is

$$e_{t+1}=e_{t}+\alpha_{t}(\mathcal{G}_{M_{A},\lambda}e_{t}+\xi_{t+1}). \tag{41}$$
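Dropping the noise term in Eq. (41) and fixing a constant step size gives the deterministic recursion $e_{t+1}=(I+\alpha\mathcal{G}_{M_{A},\lambda})e_{t}$, whose contraction rate is governed by a spectral radius. The sketch below is an illustration under assumptions: the matrix `G` and the step size `alpha` are made-up values, and a constant step is used only to expose the mean rate, not to model the decaying schedule required for almost-sure convergence.

```python
import numpy as np

G = np.array([[0.2, -2.7], [0.2, -1.475]])    # hypothetical Hurwitz matrix
alpha = 0.1                                   # constant step, illustration only

T = np.eye(2) + alpha * G                     # one-step error map
rho = float(np.max(np.abs(np.linalg.eigvals(T))))   # spectral radius of T

e = np.array([1.0, 1.0])                      # arbitrary initial error
for _ in range(2000):
    e = T @ e                                 # noise-free error recursion
final_norm = float(np.linalg.norm(e))
```

A spectral radius below one means the noise-free error decays geometrically; comparing this number across auxiliary matrices is one way to read the mean-rate comparison described in the contributions.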
###### Assumption 4\.
The step sizes satisfy $\alpha_{t}>0$, $\sum_{t=0}^{\infty}\alpha_{t}=\infty$, and $\sum_{t=0}^{\infty}\alpha_{t}^{2}<\infty$.
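As an illustration of this assumption (an example added here, not a schedule reported for the experiments), a polynomially decaying step size with hypothetical constants $a>0$ and $\tfrac{1}{2}<p\leq 1$ satisfies both summability conditions:

$$\alpha_{t}=\frac{a}{(t+1)^{p}},\qquad \sum_{t=0}^{\infty}\alpha_{t}=a\sum_{t=0}^{\infty}\frac{1}{(t+1)^{p}}=\infty\ \ (p\leq 1),\qquad \sum_{t=0}^{\infty}\alpha_{t}^{2}=a^{2}\sum_{t=0}^{\infty}\frac{1}{(t+1)^{2p}}<\infty\ \ (2p>1).$$

A constant step size, by contrast, violates the square-summability condition, so results under this assumption do not directly cover constant-step implementations.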
###### Assumption 5\.
The noise sequence satisfies $\mathbb{E}[\xi_{t+1}|\mathcal{F}_{t}]=0$ and $\mathbb{E}[\|\xi_{t+1}\|^{2}|\mathcal{F}_{t}]\leq c(1+\|z_{t}\|^{2})$ for some constant $c>0$.
Under Assumption [1](https://arxiv.org/html/2605.28855#Thmassumption1), this linear-growth bound holds for the Markovian sample update because features and rewards are bounded and the update is affine in $z_{t}$ with bounded random coefficients.
###### Theorem 1\(Almost\-sure convergence to the TD fixed point\)\.
Under Assumptions [1](https://arxiv.org/html/2605.28855#Thmassumption1)–[5](https://arxiv.org/html/2605.28855#Thmassumption5), the BA-TDC/BA-TDRC iterates satisfy

$$\theta_{t}\to A_{\pi}^{-1}b,\quad w_{t}\to 0 \tag{42}$$

almost surely.
###### Proof\.
Let $G = \mathcal{G}_{M_A,\lambda}$. By Assumption [3](https://arxiv.org/html/2605.28855#Thmassumption3), $G$ is Hurwitz. Therefore, for any positive definite matrix $Q$ there is a unique positive definite matrix $P$ solving the Lyapunov equation

$$G^\top P + P G = -Q. \tag{43}$$

Choose $Q = I$ and define $V(e) = e^\top P e$. From Eq. ([41](https://arxiv.org/html/2605.28855#S4.E41)),

$$e_{t+1} = e_t + \alpha_t (G e_t + \xi_{t+1}). \tag{44}$$

Expanding $V(e_{t+1})$ gives

$$\begin{aligned} V(e_{t+1}) &= V(e_t) + 2\alpha_t e_t^\top P (G e_t + \xi_{t+1}) \\ &\quad + \alpha_t^2 (G e_t + \xi_{t+1})^\top P (G e_t + \xi_{t+1}). \end{aligned} \tag{45}$$

Taking conditional expectation and using Assumption [5](https://arxiv.org/html/2605.28855#Thmassumption5) gives

$$\mathbb{E}[e_t^\top P \xi_{t+1} \mid \mathcal{F}_t] = 0. \tag{46}$$

The Lyapunov equation gives

$$2 e_t^\top P G e_t = e_t^\top (G^\top P + P G) e_t = -\|e_t\|^2. \tag{47}$$

Because $P$ and $G$ are fixed finite matrices, the second-order term can be bounded explicitly. Let $\bar{p} = \|P\|_2$ and $\bar{g} = \|G\|_2$. Assumption [5](https://arxiv.org/html/2605.28855#Thmassumption5) gives

$$\mathbb{E}[\|\xi_{t+1}\|^2 \mid \mathcal{F}_t] \leq c(1 + \|z_t\|^2) \leq c(1 + 2\|e_t\|^2 + 2\|z^{*}\|^2), \tag{48}$$

where $z_t = e_t + z^{*}$ and $z^{*} = (A_\pi^{-1} b, 0)$ is finite by Assumption [1](https://arxiv.org/html/2605.28855#Thmassumption1). Therefore

$$\mathbb{E}\big[(G e_t + \xi_{t+1})^\top P (G e_t + \xi_{t+1}) \mid \mathcal{F}_t\big] \leq k_1 \|e_t\|^2 + k_2. \tag{49}$$

For example, one may take $k_1 = 2\bar{p}\bar{g}^2 + 4\bar{p}c$ and $k_2 = 2\bar{p}c(1 + 2\|z^{*}\|^2)$, using $(a+b)^\top P (a+b) \leq 2\bar{p}(\|a\|^2 + \|b\|^2)$. Combining these bounds with Eq. ([45](https://arxiv.org/html/2605.28855#S4.E45)) yields

$$\mathbb{E}[V(e_{t+1}) \mid \mathcal{F}_t] \leq V(e_t) - \alpha_t \|e_t\|^2 + \tilde{c}_1 \alpha_t^2 \|e_t\|^2 + \tilde{c}_2 \alpha_t^2. \tag{50}$$

For all sufficiently large $t$, $\tilde{c}_1 \alpha_t^2 \|e_t\|^2 \leq (\alpha_t/2)\|e_t\|^2$ because $\alpha_t \to 0$. Hence, after discarding finitely many initial terms,

$$\mathbb{E}[V(e_{t+1}) \mid \mathcal{F}_t] \leq V(e_t) - \frac{\alpha_t}{2}\|e_t\|^2 + \tilde{c}_2 \alpha_t^2. \tag{51}$$

Since $\sum_t \alpha_t^2 < \infty$, the Robbins–Siegmund supermartingale theorem implies that $V(e_t)$ converges almost surely to a finite random variable and that

$$\sum_{t=0}^{\infty} \alpha_t \|e_t\|^2 < \infty \quad \text{almost surely.} \tag{52}$$

The rest of the argument separates boundedness from convergence. First, since $P$ is positive definite, almost-sure convergence of $V(e_t)$ to a finite value implies that $e_t$ is almost surely bounded, and hence $z_t = e_t + z^{*}$ is also almost surely bounded. This boundedness is obtained from the Robbins–Siegmund argument above and is not assumed in advance. Second, Eq. ([52](https://arxiv.org/html/2605.28855#S4.E52)) together with $\sum_t \alpha_t = \infty$ implies that the iterates cannot spend positive asymptotic stepsize-weighted time away from zero; in particular, $\liminf_{t\to\infty} \|e_t\| = 0$ almost surely. Third, the limiting ODE associated with Eq. ([41](https://arxiv.org/html/2605.28855#S4.E41)) is $\dot{e} = Ge$. Since $G$ is Hurwitz, the origin is the unique globally asymptotically stable equilibrium, and the only internally chain-transitive invariant set of the limiting ODE is $\{0\}$. Applying the standard ODE theorem for stochastic approximation with martingale-difference noise, bounded iterates, and square-summable stepsizes then gives $e_t \to 0$ almost surely \[[2](https://arxiv.org/html/2605.28855#bib.bib15), [3](https://arxiv.org/html/2605.28855#bib.bib14)\]. Therefore $z_t \to z^{*}$ almost surely. Proposition [1](https://arxiv.org/html/2605.28855#Thmproposition1) identifies $z^{*}$ as $(A_\pi^{-1} b, 0)$, completing the proof. ∎
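The Lyapunov step in the proof can be checked numerically. The following is a minimal sketch, assuming a small hypothetical Hurwitz matrix `G` (toy values, not one of the paper's mean systems) and a helper name `solve_lyapunov` introduced only for illustration. It solves Eq. (43) with $Q = I$ by Kronecker vectorization and confirms the identity in Eq. (47).

```python
import numpy as np

def solve_lyapunov(G, Q):
    """Solve G^T P + P G = -Q for P by vectorizing with Kronecker products."""
    n = G.shape[0]
    I = np.eye(n)
    # Column-major vec: vec(G^T P) = (I kron G^T) vec(P), vec(P G) = (G^T kron I) vec(P).
    K = np.kron(I, G.T) + np.kron(G.T, I)
    p = np.linalg.solve(K, -Q.flatten(order="F"))
    return p.reshape((n, n), order="F")

# Hypothetical Hurwitz matrix with eigenvalues -1 and -2 (illustrative only).
G = np.array([[-1.0, 3.0],
              [0.0, -2.0]])
P = solve_lyapunov(G, np.eye(2))

# Residual of Eq. (43) with Q = I; should be numerically zero.
residual = G.T @ P + P @ G + np.eye(2)

# Eq. (47): 2 e^T P G e equals -||e||^2 for any error vector e.
e = np.array([0.7, -1.3])
drift = 2.0 * e @ P @ G @ e
print(P, drift, -e @ e)
```

If SciPy is available, `scipy.linalg.solve_continuous_lyapunov(G.T, -Q)` should return the same `P`, since that routine solves $AX + XA^{H} = Q$.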
### 4.3 Convergence-Speed Comparison with TDRC
For a constant step size $\alpha$, the deterministic mean error recursion induced by $M$ is

$$e_{t+1} = R_{M,\lambda}(\alpha) e_t, \quad R_{M,\lambda}(\alpha) = I + \alpha \mathcal{G}_{M,\lambda}. \tag{53}$$

Its exact asymptotic linear factor is

$$q_{M,\lambda}(\alpha) = \rho(R_{M,\lambda}(\alpha)), \tag{54}$$

where $\rho(\cdot)$ denotes the spectral radius.
###### Assumption 6 (Behavior-aware speed advantage).
For admissible step sizes $\alpha_A$ and $\alpha_C$, the TDRC and BA-TDRC mean systems are stable and satisfy

$$q_{M_A,\lambda_A}(\alpha_A) < q_{M_C,\lambda_C}(\alpha_C), \quad M_A = A_\mu + \beta I, \quad M_C = C + \eta I. \tag{55}$$

###### Proposition 2 (Faster deterministic mean convergence).
Under Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6), BA-TDRC has a smaller deterministic asymptotic mean linear factor than TDRC under the corresponding step sizes.
###### Proof\.
Let

$$R_A = R_{M_A,\lambda_A}(\alpha_A), \quad R_C = R_{M_C,\lambda_C}(\alpha_C). \tag{56}$$

Admissibility in Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6) means that both deterministic systems are stable, so $\rho(R_A) < 1$ and $\rho(R_C) < 1$. The deterministic errors are

$$e_t^A = R_A^t e_0^A, \quad e_t^C = R_C^t e_0^C. \tag{57}$$

For any fixed matrix $R$, Gelfand's formula gives

$$\lim_{t\to\infty} \|R^t\|^{1/t} = \rho(R). \tag{58}$$

Equivalently, for every $\epsilon > 0$ there exists $c_\epsilon < \infty$ such that

$$\|R^t\| \leq c_\epsilon (\rho(R) + \epsilon)^t \quad \text{for all } t \geq 0. \tag{59}$$

Applying this bound to $R_A$ and $R_C$ shows that the asymptotic linear decay factors are $q_{M_A,\lambda_A}(\alpha_A) = \rho(R_A)$ and $q_{M_C,\lambda_C}(\alpha_C) = \rho(R_C)$. Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6) states that the former is strictly smaller than the latter. Hence the BA-TDRC deterministic mean error has the smaller asymptotic linear factor. This is a conditional comparison of the two instantiated mean systems, not an unconditional dominance result. ∎
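The spectral-radius criterion in Eqs. (54) and (58) can be illustrated as follows. This is a sketch under stated assumptions: `G_a` and `G_c` are hypothetical Hurwitz matrices standing in for two mean systems, and `linear_factor` is an illustrative helper name, not part of the released code.

```python
import numpy as np

def linear_factor(G, alpha):
    """Spectral radius of R = I + alpha * G, the factor q in Eq. (54)."""
    R = np.eye(G.shape[0]) + alpha * G
    return float(np.max(np.abs(np.linalg.eigvals(R))))

# Two hypothetical Hurwitz mean matrices (toy values, not from the paper).
G_a = np.array([[-1.0, 3.0], [0.0, -2.0]])
G_c = np.array([[-0.5, 0.0], [1.0, -2.0]])
alpha = 0.1

q_a = linear_factor(G_a, alpha)  # eigenvalues of R are 0.9 and 0.8
q_c = linear_factor(G_c, alpha)  # eigenvalues of R are 0.95 and 0.8

# Gelfand's formula: the t-th root of ||R^t|| approaches rho(R) as t grows.
R_a = np.eye(2) + alpha * G_a
t = 400
gelfand_estimate = np.linalg.norm(np.linalg.matrix_power(R_a, t), 2) ** (1.0 / t)
print(q_a, q_c, gelfand_estimate)
```

In this toy instance the first system has the smaller factor, so the analogue of Assumption 6 holds. For a non-normal $R$ the norm $\|R^t\|$ can exceed $\rho(R)^t$ by a constant factor, which is why the comparison is asymptotic.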
Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6) is the point at which the covariance matrix $C$ and the behavior matrix $A_\mu$ are separated. Proposition [2](https://arxiv.org/html/2605.28855#Thmproposition2) is therefore a conditional mean-rate statement: its role is to put the two instantiated linear systems under a common spectral-radius criterion, not to assert an unconditional speed advantage. The numerical analysis in Section [6](https://arxiv.org/html/2605.28855#S6) instantiates $M_C$ and $M_A$ on each benchmark and checks whether this assumption holds.
## 5 Experiments
### 5.1 Protocol
The primary evaluation metric is the root mean\-squared projected Bellman error \(RMSPBE\)\. We use RMSPBE because the TDC/TDRC family is derived from projected Bellman\-error correction equations, and because BA\-TDC and BA\-TDRC preserve the same projected TD fixed point rather than optimizing a separate value\-error objective\. Metrics such as RMSVE can be useful diagnostics, but they need not rank the algorithms identically because they measure value\-function error under a state distribution rather than projected Bellman residual\. The main experiments therefore use RMSPBE to evaluate the object directly controlled by the correction geometry; value\-error based conclusions should be treated as supplementary rather than interchangeable\.
The experimental scope is deliberately limited to finite linear off-policy prediction. This makes the exact matrices $A_\pi$, $A_\mu$, $C$, and $D_\pi$ computable, so the fixed-point and mean-rate conditions in the theory can be checked rather than inferred indirectly from learning curves. TDRC has also been studied in nonlinear prediction and control settings \[[5](https://arxiv.org/html/2605.28855#bib.bib8)\]; extending the present behavior-aware replacement to neural-network critics would require additional machinery for estimating $A_\mu$ or its action on learned features while the representation itself changes. We therefore treat the linear experiments as a controlled test of the auxiliary-geometry mechanism, not as a complete empirical claim about nonlinear control.
We evaluate on four off\-policy prediction benchmarks\. The two\-state counterexample is a minimal setting where off\-policy bootstrapping can create severe transient behavior\. Baird’s counterexample is the classical linear off\-policy divergence benchmark\. Random Walk is a mildly off\-policy prediction task where ordinary TD\-style methods are already strong\. Boyan Chain tests a larger linear prediction problem with correlated features and a longer transition structure\. Together these environments separate difficult off\-policy instability from mildly off\-policy prediction, which is important for judging whether behavior\-aware corrections help only in pathological cases or remain competitive in benign cases\.
#### Benchmark Environments
All four benchmarks are off-policy prediction tasks in which the behavior policy $\mu$ differs from the target policy $\pi$. Each is implemented as a finite-state Markov reward process with linear features; importance ratios $\rho_t = \pi(a_t \mid s_t)/\mu(a_t \mid s_t)$ are computed exactly. Table [2](https://arxiv.org/html/2605.28855#S5.T2) summarizes the configuration; the next paragraphs describe each environment.

Table 2: Configuration of the four off-policy prediction benchmarks.

##### Two-state counterexample.
Two states $\{0,1\}$ with scalar features $\phi(0)=1$, $\phi(1)=2$, discount $\gamma=0.9$, and zero reward. Two actions induce deterministic next states: action $0$ leads to state $0$ and action $1$ leads to state $1$. The behavior policy is uniform, $\mu(\cdot \mid s)=0.5$, while the target policy is degenerate, $\pi(1 \mid s)=1$. Importance ratios are therefore $\rho\in\{0,2\}$. This setting exposes the transient instability that gradient-TD corrections are designed to control.
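The instability in this environment can be seen from a one-line calculation. The sketch below assumes the standard textbook form of the expected semi-gradient TD matrix, $\sum_s d_\mu(s)\,\phi(s)\,(\phi(s) - \gamma\,\mathbb{E}[\phi(s')])$, which may differ in notation from the matrices defined earlier in the paper; the function names are illustrative.

```python
# Two-state counterexample: scalar features, action a leads to state a.
phi = {0: 1.0, 1: 2.0}     # phi(0) = 1, phi(1) = 2
gamma = 0.9
d_mu = {0: 0.5, 1: 0.5}    # state distribution induced by the uniform behavior policy

def expected_next_feature(policy):
    """E[phi(s')] when action a leads to state a; policy maps action -> probability."""
    return sum(prob * phi[action] for action, prob in policy.items())

def td_key_scalar(policy):
    """Assumed standard form: sum_s d_mu(s) * phi(s) * (phi(s) - gamma * E[phi(s')])."""
    nxt = expected_next_feature(policy)
    return sum(d_mu[s] * phi[s] * (phi[s] - gamma * nxt) for s in phi)

a_target = td_key_scalar({0: 0.0, 1: 1.0})    # target policy always takes action 1
a_behavior = td_key_scalar({0: 0.5, 1: 0.5})  # uniform behavior policy
print(a_target, a_behavior)
```

Under this convention the target-policy scalar is negative while the behavior-policy scalar is positive. With zero reward, the expected semi-gradient update $\theta \leftarrow \theta - \alpha\, a\, \theta$ therefore grows under the target policy's bootstrapping and contracts under the behavior policy's.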
##### Baird’s counterexample\.
Seven states with an eight-dimensional linear feature representation that is intentionally aliased to make semi-gradient off-policy TD diverge \[[1](https://arxiv.org/html/2605.28855#bib.bib4)\]. Six “dashed” actions move uniformly to one of the six upper states, and a “solid” action moves to the lower state. The behavior policy chooses dashed actions with probability $6/7$ and the solid action with probability $1/7$; the target policy always selects the solid action. This gives $\rho\in\{0,7\}$. The discount is $\gamma=0.99$ and the reward is zero, so the unique fixed point of the projected Bellman equation is $\theta^{*}=0$.
##### Random Walk\.
A five-state chain with two terminal absorbing states; interior states use a five-dimensional one-hot feature. The agent moves left or right by one step; reaching the right terminal yields reward $+1$ and reaching the left terminal yields reward $0$. The behavior policy is uniform, $\mu(\text{left})=\mu(\text{right})=0.5$, while the target policy is slightly biased to the right, $\pi(\text{right})=0.6$ and $\pi(\text{left})=0.4$. Importance ratios are $\rho\in\{0.8,1.2\}$, close to one. The discount is $\gamma=0.99$. Episodes that reach a terminal state restart from the center.
##### Boyan Chain\.
A 13-state chain with four-dimensional piecewise-linear features adapted from \[[4](https://arxiv.org/html/2605.28855#bib.bib16)\], with $\gamma=0.9$ and reward $-3$ per non-terminal step. At interior states $s\in\{0,\ldots,10\}$, two actions advance one or two steps respectively; both are taken with behavior probability $0.5$, while the target policy chooses the one-step action with probability $0.4$ and the two-step action with probability $0.6$, giving $\rho\in\{0.8,1.2\}$. States $11$ and $12$ have deterministic transitions, after which the chain resets to state $0$. Although the importance ratios are mild, the correlated feature representation and the longer transition structure produce a non-trivial linear prediction problem.
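The importance-ratio sets quoted for the four environments follow directly from $\rho = \pi(a \mid s)/\mu(a \mid s)$. A small sketch, assuming hypothetical action labels and treating Baird's six dashed actions as a single aggregated choice with total behavior probability $6/7$:

```python
def importance_ratios(pi, mu):
    """Return rho(a) = pi(a) / mu(a) for every action the behavior policy can take."""
    return {a: pi.get(a, 0.0) / mu[a] for a in mu if mu[a] > 0}

# Action labels are illustrative; probabilities are the ones stated in the text.
two_state = importance_ratios({0: 0.0, 1: 1.0}, {0: 0.5, 1: 0.5})
baird = importance_ratios({"dashed": 0.0, "solid": 1.0},
                          {"dashed": 6 / 7, "solid": 1 / 7})
random_walk = importance_ratios({"left": 0.4, "right": 0.6},
                                {"left": 0.5, "right": 0.5})
boyan = importance_ratios({"one_step": 0.4, "two_step": 0.6},
                          {"one_step": 0.5, "two_step": 0.5})

for name, ratios in [("two-state", two_state), ("Baird", baird),
                     ("Random Walk", random_walk), ("Boyan Chain", boyan)]:
    print(name, sorted(ratios.values()))
```

The two counterexamples produce ratios of zero and a large value, whereas Random Walk and Boyan Chain stay within $0.2$ of one, which matches the split between hard and mild off-policy settings described above.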
The experiments are organized into three parts. First, the main comparison evaluates BA-TDRC against TD, GTD2, TDC, TDRC, and GTD2-MP. BA-TDC is intentionally omitted from the main comparison because it is an ablation-only variant rather than the regularized method proposed for robust use. Second, a modular ablation compares TDC, BA-TDC, TDRC, and BA-TDRC to isolate the effect of the behavior-aware replacement and the additional effect of regularization. The ablation therefore uses a narrower baseline set than the main comparison by design. Third, a step-size robustness study plots BA-TDRC alone across primary step sizes in each environment. ETD and Hybrid TD are related off-policy TD variants, but they change the update through emphatic weighting or direction interpolation rather than through the TDC/TDRC auxiliary-equation metric; they are outside the correction-family comparison studied here.

Hyperparameters are tuned on eight disjoint seeds using the average RMSPBE over the last 20% of the tuning trajectory. Final evaluation uses 100 disjoint seeds. For a metric curve $e_0, e_1, \ldots, e_T$, we define the steady-state AUC as

$$\operatorname{AUC}_{\mathrm{ss}} = \frac{1}{T - \lfloor T/2 \rfloor + 1} \sum_{t=\lfloor T/2 \rfloor}^{T} e_t. \tag{60}$$

For even $T$ this gives $\lfloor T/2 \rfloor + 1$ terms; the boundary effect is negligible for the trajectory lengths used. The quantity is the last-50% time-average RMSPBE after discarding the initial transient. Tables report the sample mean and sample standard deviation of $\operatorname{AUC}_{\mathrm{ss}}$ and of the final RMSPBE across 100 runs. Curves show mean $\pm$ sample standard deviation, not confidence intervals. Divergent runs are not clipped.
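Eq. (60) translates into a few lines of code. The sketch below is an illustrative reading of the definition, with `steady_state_auc` a hypothetical helper name rather than a function from the released repository.

```python
def steady_state_auc(curve):
    """Average of e_t over t = floor(T/2), ..., T for a curve e_0, ..., e_T (Eq. 60)."""
    if not curve:
        raise ValueError("curve must contain at least one value")
    T = len(curve) - 1
    start = T // 2
    tail = curve[start:]              # e_{floor(T/2)}, ..., e_T
    return sum(tail) / (T - start + 1)

# Toy RMSPBE curves (made-up numbers) with even and odd T.
auc_even = steady_state_auc([4.0, 3.0, 2.0, 1.0, 0.0])       # T = 4, averages e_2..e_4
auc_odd = steady_state_auc([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])   # T = 5, averages e_2..e_5
print(auc_even, auc_odd)
```

For $T = 4$ the average runs over three terms, which agrees with the $\lfloor T/2 \rfloor + 1$ count stated for even $T$.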
Table 3: Hyperparameter search spaces used for tuning. The objective is the average RMSPBE over the last 20% of the tuning trajectory on eight disjoint tuning seeds. Final results use 100 separate evaluation seeds.
### 5.2 Main Comparison
Figure 1: Main comparison on the two-state counterexample. Curves show mean RMSPBE over 100 runs and shaded regions show one sample standard deviation.

Figure 2: Main comparison on Baird’s counterexample. Curves show mean RMSPBE over 100 runs and shaded regions show one sample standard deviation.

Figure 3: Main comparison on Random Walk. Curves show mean RMSPBE over 100 runs and shaded regions show one sample standard deviation.

Figure 4: Main comparison on Boyan Chain. Curves show mean RMSPBE over 100 runs and shaded regions show one sample standard deviation.

Figures [1](https://arxiv.org/html/2605.28855#S5.F1)–[4](https://arxiv.org/html/2605.28855#S5.F4) compare BA-TDRC with the main baselines. BA-TDRC reaches near-zero RMSPBE on the two-state counterexample, is competitive with TDC and TDRC on Baird’s counterexample, matches the strongest TD-style methods on Random Walk, and obtains the best result on Boyan Chain. These results support BA-TDRC as a regularized behavior-aware correction, but they do not imply uniform dominance over every baseline in every environment.
Table 4: Steady-state AUC error (last-50% time-average RMSPBE), mean $\pm$ sample standard deviation over 100 runs. Lower is better. † The Hurwitz condition in Assumption [3](https://arxiv.org/html/2605.28855#Thmassumption3) is not satisfied under this hyperparameter setting (see Table [7](https://arxiv.org/html/2605.28855#S6.T7)); this result is an empirical stress test, not covered by Theorem [1](https://arxiv.org/html/2605.28855#Thmtheorem1).

Table 5: Final RMSPBE at the last step, mean $\pm$ sample standard deviation over 100 runs. Lower is better. † The Hurwitz condition in Assumption [3](https://arxiv.org/html/2605.28855#Thmassumption3) is not satisfied under this hyperparameter setting (see Table [7](https://arxiv.org/html/2605.28855#S6.T7)); this result is an empirical stress test, not covered by Theorem [1](https://arxiv.org/html/2605.28855#Thmtheorem1).

Tables [4](https://arxiv.org/html/2605.28855#S5.T4) and [5](https://arxiv.org/html/2605.28855#S5.T5) show the main quantitative pattern without the ablation-only BA-TDC variant. BA-TDRC is strongest on the two-state and Boyan Chain tasks, slightly improves over TDC and TDRC on Baird in steady-state AUC, and matches the best Random Walk result. The Baird entries marked by † are not covered by the Hurwitz condition used in Theorem [1](https://arxiv.org/html/2605.28855#Thmtheorem1); they are retained to show the empirical stress-test behavior under the same tuning protocol. GTD2-MP remains best on Baird, so BA-TDRC should be viewed as improving the TDC/TDRC correction family rather than replacing all gradient-TD alternatives.
On Random Walk, off-policy bias is mild because the behavior policy is uniform ($\mu(\cdot \mid s)=0.5$) and the target policy is only slightly biased ($\pi(\text{right} \mid s)=0.6$), giving importance ratios $\rho\in\{0.8,1.2\}$ close to one. Under such mildly off-policy conditions, the tuned BA-TDRC hyperparameters yield a large effective regularization on the auxiliary recursion, so the auxiliary variable $w_t$ stays close to zero and the primary update reduces to off-policy semi-gradient TD. The tuned TDRC selects a similarly small auxiliary effect. This explains why BA-TDRC, TDRC, and TD coincide on Random Walk: the corrected and uncorrected updates are numerically indistinguishable in this regime. This also agrees with the mean-operator analysis in Table [7](https://arxiv.org/html/2605.28855#S6.T7): TDRC has a slightly smaller deterministic factor there, but the difference is too small to produce a visible advantage in the stochastic finite-horizon RMSPBE curves.
### 5.3 Modular Ablation
Figure 5: Ablation on the two-state counterexample. The comparison isolates TDC $\rightarrow$ BA-TDC and TDRC $\rightarrow$ BA-TDRC.

Figure 6: Ablation on Baird’s counterexample. The comparison isolates TDC $\rightarrow$ BA-TDC and TDRC $\rightarrow$ BA-TDRC.

Figure 7: Ablation on Random Walk. The comparison isolates TDC $\rightarrow$ BA-TDC and TDRC $\rightarrow$ BA-TDRC.

Figure 8: Ablation on Boyan Chain. The comparison isolates TDC $\rightarrow$ BA-TDC and TDRC $\rightarrow$ BA-TDRC.

Table 6: Ablation AUC results for the covariance-to-behavior replacement and regularization. Lower is better.

The ablation shows that the behavior-aware replacement and regularization have different roles. Replacing $C$ by $A_\mu$ alone is highly effective on the two-state counterexample and helpful on Random Walk, but it fails on Baird and Boyan Chain. This is expected from the structure of the unregularized auxiliary equation: $A_\mu$ contains temporal feature transport and is not a symmetric covariance matrix, so in aliased or strongly coupled feature spaces its inverse can amplify noise and transient residuals rather than damping them. Baird’s counterexample is especially sensitive because the feature representation is over-parameterized relative to the state dynamics and the importance ratios are extreme. In this setting, the nonsymmetric behavior-aware auxiliary direction is repeatedly excited by rare high-ratio solid transitions, producing the nonmonotone growth visible in the ablation curve. Boyan Chain has milder ratios but strongly correlated features. Adding $\beta I$ shifts the auxiliary matrix away from such poorly conditioned behavior-aware inversions, preserves the two-state gain, and removes the severe BA-TDC failures.
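The conditioning argument can be made concrete with a toy matrix. The sketch below assumes a hypothetical nonsymmetric, nearly singular matrix `M` chosen only to show the mechanism; it is not $A_\mu$ for any of the benchmarks, and adding $\beta I$ is not guaranteed to improve the conditioning of every nonsymmetric matrix.

```python
import numpy as np

# Hypothetical nonsymmetric, nearly singular auxiliary matrix (toy values only).
M = np.array([[0.001, 0.5],
              [0.0, 0.002]])
beta = 1.0
M_reg = M + beta * np.eye(2)

cond_before = np.linalg.cond(M)      # 2-norm condition number without regularization
cond_after = np.linalg.cond(M_reg)   # after the beta * I shift

# Size of the auxiliary solution w for a unit-norm right-hand side r in M w = r.
r = np.array([0.0, 1.0])
w_before = np.linalg.solve(M, r)
w_after = np.linalg.solve(M_reg, r)
print(cond_before, cond_after)
print(np.linalg.norm(w_before), np.linalg.norm(w_after))
```

In this instance the unregularized solve amplifies a unit residual by several orders of magnitude, while the shifted system keeps the solution at order one, which is the damping role the paragraph attributes to $\beta I$.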
### 5.4 Step-Size Robustness
Figure 9: BA-TDRC step-size robustness across the four off-policy benchmarks. Each panel reports steady-state RMSPBE AUC against the primary step size $\alpha$ over 100 seeds, with the regularizer fixed to $\beta=1.0$ and the auxiliary-to-primary gain ratio fixed to $\lambda=1$; these fixed values differ from the per-environment tuning used in the main comparison.

Figure [9](https://arxiv.org/html/2605.28855#S5.F9) reports BA-TDRC robustness alone, rather than mixing robustness curves from unrelated baselines. The four panels show how sensitive BA-TDRC is to the primary step size in each environment. The two-state case has a wide low-error region at larger step sizes, Random Walk and Boyan Chain have smoother low-error regions, and Baird is more sensitive, consistent with its role as a difficult off-policy counterexample. The vertical axes are logarithmic and span very different ranges because BA-TDRC reaches extremely small RMSPBE on the two-state problem, whereas high step sizes on Baird can amplify the already difficult off-policy transient by many orders of magnitude. In these robustness plots, the regularizer is fixed to $\beta=1.0$ and the auxiliary-to-primary gain ratio is fixed to $\lambda=1$ across all environments, so that only the primary step size $\alpha$ varies along the horizontal axis. This is a controlled robustness sweep and is therefore not the same hyperparameter setting as the per-environment BA-TDRC in the main comparison, where $\beta$ and $\lambda$ are tuned per environment.
## 6 Numerical Mean-Operator Analysis
For each benchmark, we construct the exact finite-state matrices $A_\pi$, $A_\mu$, $C$, and $D_\pi$ and instantiate the two auxiliary matrices in the theory: $M_C = C + \eta I$ for TDRC and $M_A = A_\mu + \beta I$ for BA-TDRC. The values of $\eta$, $\beta$, and the auxiliary gain ratios are the tuned values used in the main experiments. Table [7](https://arxiv.org/html/2605.28855#S6.T7) reports three checks that connect the analysis to the experiments: (i) whether $M_A - D_\pi$ is nonsingular, so the TD fixed point is preserved by Proposition [1](https://arxiv.org/html/2605.28855#Thmproposition1); (ii) the Hurwitz margins for the TDRC and BA-TDRC mean systems, where a positive value verifies the corresponding mean-system stability condition; and (iii) whether the behavior-aware speed advantage condition in Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6), namely $q_{M_A,\lambda_A} < q_{M_C,\lambda_C}$, holds.
Table 7: Exact mean-operator verification of the theory. The column $\sigma_{\min}(M_A - D_\pi)$ checks fixed-point preservation; positive Hurwitz margins indicate stable continuous-time mean systems; lower best $q$ is faster and verifies Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6).

The numerical analysis forms the bridge between the theory and the experiments. First, $\sigma_{\min}(M_A - D_\pi) > 0$ in all four benchmarks, so the condition used to prove preservation of the same TD fixed point is verified. Second, the Hurwitz condition required for the almost-sure convergence theorem is verified for the tuned two-state, Random Walk, and Boyan Chain BA-TDRC mean systems, but not for Baird under the selected constant-gain configuration. The Baird BA-TDRC margin is only slightly negative ($-3.92\times 10^{-4}$), while the corresponding TDRC margin is essentially zero and positive only at numerical precision ($4.96\times 10^{-18}$). Thus both regularized correction systems are close to the stability boundary on Baird. With the tuned finite step size and regularization, the stochastic recursion remains empirically well controlled over the evaluated horizon, but this behavior is outside the theorem’s coverage and should be read as a stress test of the update. Third, the speed assumption $q_{M_A,\lambda_A} < q_{M_C,\lambda_C}$ holds clearly on the two-state counterexample, matching the large empirical gain. The other three benchmarks do not satisfy the deterministic mean-speed condition under the tuned RMSPBE settings, so Proposition [2](https://arxiv.org/html/2605.28855#Thmproposition2) is not invoked for those cases.
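The first two checks in Table 7 reduce to standard linear-algebra computations. The sketch below assumes that the Hurwitz margin is defined as minus the largest real part of the eigenvalues of the mean matrix, which is a common convention but is an assumption here, and it uses hypothetical toy matrices rather than the benchmark operators.

```python
import numpy as np

def hurwitz_margin(G):
    """Assumed convention: minus the largest real part of the eigenvalues of G."""
    return float(-np.max(np.linalg.eigvals(G).real))

def sigma_min(M):
    """Smallest singular value; strictly positive exactly when M is nonsingular."""
    return float(np.linalg.svd(M, compute_uv=False)[-1])

# Toy matrices (illustrative only, not the benchmark mean systems).
G_stable = np.array([[-1.0, 3.0], [0.0, -2.0]])    # eigenvalues -1 and -2
G_marginal = np.array([[0.0, 1.0], [-1.0, 0.0]])   # eigenvalues +i and -i
M_singular = np.array([[1.0, 2.0], [2.0, 4.0]])    # rank one

margin_stable = hurwitz_margin(G_stable)
margin_marginal = hurwitz_margin(G_marginal)
smin_identity = sigma_min(np.eye(2))
smin_singular = sigma_min(M_singular)
print(margin_stable, margin_marginal, smin_identity, smin_singular)
```

A margin on the order of $10^{-18}$, like the TDRC value reported for Baird, is below double-precision resolution, so in practice such a check is probably better read against a small tolerance than against exact zero.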
This gap between the mean-rate condition and the empirical results is informative. Assumption [6](https://arxiv.org/html/2605.28855#Thmassumption6) compares asymptotic deterministic linear factors of the exact mean recursion; it does not account for finite-horizon bias, Markovian sampling variance, or the way regularization changes the magnitude and variability of the auxiliary vector along stochastic trajectories. On Random Walk, the off-policy mismatch is mild and the tuned regularization drives $w_t$ close to zero, making BA-TDRC behave like off-policy semi-gradient TD. On Boyan Chain, the speed factors of TDRC and BA-TDRC are nearly identical ($0.9855$ versus $0.9857$), so the deterministic asymptotic comparison is too fine to explain the observed AUC difference; finite-sample damping and auxiliary-variance effects can dominate over the small mean-factor disadvantage. The present experiments support this interpretation qualitatively through the ablation and robustness patterns, but they do not provide a separate variance decomposition. A complete theory connecting behavior-aware regularization to stochastic finite-sample RMSPBE remains open.
## 7 Conclusion
We proposed behavior-aware auxiliary corrections for off-policy temporal-difference prediction. BA-TDC isolates the replacement of the covariance matrix $C$ by the behavior Bellman matrix $A_\mu$, while BA-TDRC combines the same replacement with TDRC-style regularization. We gave the mean-system formulation, established convergence under a Hurwitz condition verified from exact finite-state matrices for three of the four tuned benchmark settings, derived a conditional mean-rate comparison, and evaluated the modular increments on four standard off-policy benchmarks. Baird serves as an empirical stress test outside the coverage of the convergence theorem. The RMSPBE results show that behavior-aware geometry alone can be very effective on the two-state counterexample but unreliable on harder tasks, whereas the regularized BA-TDRC variant is competitive with or better than TDC/TDRC across the tested prediction problems. The method should therefore be viewed as a modular correction-geometry change whose benefit depends on the behavior-induced mean operator and its interaction with regularization, rather than as a uniformly dominant replacement for TDRC on every off-policy prediction task. Extending the idea to neural-network critics is a natural next step, but it requires controlling the additional error from learned, time-varying feature maps and from online estimates of behavior-aware feature-transition operators.
## Data and Code Availability
The experimental code is publicly available at[https://github\.com/GameAI\-NJUPT/BA\-TDRC](https://github.com/GameAI-NJUPT/BA-TDRC)\. Generated result tables and figures are not included in the repository and can be reproduced by running the scripts with the documented protocols\.
## References
- \[1\] L. Baird (1995). Residual algorithms: reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37.
- \[2\] V. S. Borkar and S. P. Meyn (2000). The o.d.e. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2), pp. 447–469.
- \[3\] V. S. Borkar (2023). Stochastic approximation: a dynamical systems viewpoint. 2nd edition, Springer.
- \[4\] J. A. Boyan (2002). Technical update: least-squares temporal difference learning. Machine Learning 49(2–3), pp. 233–246.
- \[5\] S. Ghiassian, A. Patterson, S. Garg, D. Gupta, A. White, and M. White (2020). Gradient temporal-difference learning with regularized corrections. In Proceedings of the 37th International Conference on Machine Learning, pp. 3524–3534.
- \[6\] A. Hallak and S. Mannor (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 1372–1383.
- \[7\] B. Liu, I. Gemp, M. Ghavamzadeh, J. Liu, S. Mahadevan, and M. Petrik (2018). Proximal gradient temporal difference learning: stable reinforcement learning with polynomial sample complexity. Journal of Artificial Intelligence Research 63, pp. 461–494.
- \[8\] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik (2015). Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pp. 504–513.
- \[9\] H. R. Maei and R. S. Sutton (2010). GQ(lambda): a general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the 3rd Conference on Artificial General Intelligence, pp. 100–105.
- \[10\] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- \[11\] R. S. Sutton and A. G. Barto (2018). Reinforcement learning: an introduction. 2nd edition, MIT Press.
- \[12\] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvari, and E. Wiewiora (2009). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 993–1000.
- \[13\] R. S. Sutton, A. R. Mahmood, and M. White (2016). An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research 17(73), pp. 1–29.
- \[14\] R. S. Sutton, C. Szepesvari, and H. R. Maei (2008). A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, Vol. 21, pp. 1609–1616.
- \[15\] R. S. Sutton (1988). Learning to predict by the methods of temporal differences. Machine Learning 3(1), pp. 9–44.
- \[16\] J. N. Tsitsiklis and B. Van Roy (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control 42(5), pp. 674–690.
- \[17\] H\. Yu \(2015\) On convergence of emphatic temporal\-difference learning\. In Proceedings of the 28th Conference on Learning Theory, Proceedings of Machine Learning Research, Vol\. 40, pp\. 1724–1751\. Cited by: [§1](https://arxiv.org/html/2605.28855#S1.p2.1)\.