# On the Divergence of Differential Temporal Difference Learning without Local Clocks
Source: [https://arxiv.org/html/2605.06874](https://arxiv.org/html/2605.06874)
David Antrobius, Department of Computer Science, University of Virginia, davidantrobius@email.virginia.edu, and Shangtong Zhang, Department of Computer Science, University of Virginia, shangtong@virginia.edu
###### Abstract
Learning rate is a critical component of reinforcement learning (RL). This work uses global and local clocks to distinguish two types of learning rates. The former is of the standard form $\alpha_t$, which depends only on the time step $t$ (i.e., a global clock). The latter is of the form $\alpha_{\nu(S_t, t)}$, where $\nu(s, t)$ counts the number of visits to state $s$ until time $t$ (i.e., a local clock). In discounted RL, every algorithm known to be convergent with a local clock is also known to be convergent with a global clock, and vice versa; we are not aware of any counterexample. The key contribution of this work is to show that this nice correspondence breaks down in average-reward RL. Specifically, we construct a counterexample showing that although differential temporal difference learning is convergent with a local clock, it can diverge with a global clock. This counterexample closes the open problem in [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31) and [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7).
## 1 Introduction
Tabular methods are a fundamental class of algorithms in reinforcement learning (RL, [Sutton and Barto (2018)](https://arxiv.org/html/2605.06874#bib.bib29)). Looking back at the history of tabular RL, we can see a clear pattern: early theoretical analyses of tabular RL methods often use a local clock in the learning rate. Here, by a local clock in a learning rate, we mean a learning rate of the form $\alpha_{\nu(S_t, t)}$, where $\nu(s, t)$ counts the number of visits to state $s$ until time $t$. This is in contrast to a learning rate with a global clock, which is of the form $\alpha_t$ and depends only on the time step $t$. In short, the global clock refers to the time step $t$ in the learning rate, while the local clock refers to the state-dependent visit count $\nu(s, t)$. For example, classical convergence proofs for $Q$-learning usually use local clocks ([Watkins, 1989](https://arxiv.org/html/2605.06874#bib.bib34); [Watkins and Dayan, 1992](https://arxiv.org/html/2605.06874#bib.bib33); [Jaakkola et al., 1993](https://arxiv.org/html/2605.06874#bib.bib16); [Tsitsiklis, 1994](https://arxiv.org/html/2605.06874#bib.bib30); [Bertsekas and Tsitsiklis, 1996](https://arxiv.org/html/2605.06874#bib.bib4)).
We argue that the use of local clocks in theoretical analysis is mostly a matter of technical convenience. Chapter 7 of [Borkar (2009)](https://arxiv.org/html/2605.06874#bib.bib9) gives a thorough exposition of the role of local clocks in the convergence analysis of stochastic approximation algorithms. Intuitively, local clocks allow more balanced updates across states, essentially converting the Markovian samples in RL into i.i.d. samples, which significantly simplifies the analysis. However, a local clock is not really a practitioner's first choice. For example, the entire textbook [Sutton and Barto (2018)](https://arxiv.org/html/2605.06874#bib.bib29) neither mentions nor uses any local clock in the learning rates. A global clock is preferred by most practitioners.
This theorist-practitioner gap is not a problem in the discounted RL setting. Although early theoretical analyses of discounted RL methods often use local clocks, recent works have shown that the same algorithms are also convergent with global clocks. For example, the convergence of $Q$-learning with a global clock is also established by [Lee and He (2020)](https://arxiv.org/html/2605.06874#bib.bib17); [Chen et al. (2024)](https://arxiv.org/html/2605.06874#bib.bib11); [Liu et al. (2025a)](https://arxiv.org/html/2605.06874#bib.bib18). To our knowledge, there is no known counterexample showing that an algorithm that is convergent with a local clock can diverge with a global clock in discounted RL.
In average-reward RL, early theoretical analyses, such as that of RVI $Q$-learning, also use local clocks in the learning rates ([Abounadi et al., 2001](https://arxiv.org/html/2605.06874#bib.bib1)). The seminal work [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31) develops a new set of temporal difference algorithms for average-reward RL that removes the need for a reference state in RVI $Q$-learning. The (off-policy) policy evaluation algorithm in [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31) is called differential TD, and the control algorithm is called differential $Q$-learning. In the main text of [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31), both algorithms are presented with a global clock in the learning rates. However, in the convergence analysis in the appendix, both algorithms are shown to be convergent only with a local clock in the learning rates. So the open problem is: do these algorithms also converge with a global clock in the learning rates? [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) partially answer this question by showing that when $\eta$ is sufficiently small, differential TD is convergent even with a global clock. Here $\eta$ is a hyperparameter in differential TD that we will introduce later. However, there is still a gap, because [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31) show that with a local clock, differential TD is convergent for any $\eta > 0$. Does differential TD also converge with a global clock for any $\eta > 0$? This is an open problem from [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7).
One may optimistically conjecture that the answer to the above question is yes. After all, in the discounted RL setting, algorithms that are convergent with a local clock are also convergent with a global clock, and we are not aware of any counterexample in the literature. The key contribution of this work is to show that this nice correspondence breaks down in average-reward RL. In particular, we construct a counterexample showing that differential TD can diverge with a global clock. This negative result is surprising and interesting because it shows that the choice of learning rates can have a much more significant impact on the convergence of RL algorithms in the average-reward setting than in the discounted setting.
## 2 Background
We consider an infinite-horizon Markov Decision Process (MDP, [Bellman (1957)](https://arxiv.org/html/2605.06874#bib.bib3)) defined by a tuple $(\mathcal{S}, \mathcal{A}, p, p_0, r)$, where $\mathcal{S}$ is the finite state space, $\mathcal{A}$ is the finite action space, $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0,1]$ is the transition function, $p_0: \mathcal{S} \to [0,1]$ is the initial state distribution, and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function. At time step $0$, an initial state $S_0$ is drawn from $p_0$. At each time step $t \geq 0$, an action $A_t$ is chosen according to a policy $\pi: \mathcal{A} \times \mathcal{S} \to [0,1]$ as $A_t \sim \pi(\cdot|S_t)$. Then a reward $R_{t+1} = r(S_t, A_t)$ is emitted and the next state $S_{t+1}$ is drawn from $p(\cdot|S_t, A_t)$.
In the discounted RL setting, we consider a discount factor $\gamma \in [0,1)$. The goal of policy evaluation is to estimate the value function of a given policy $\pi$, defined as $v_\pi(s) \doteq \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i R_{t+i+1} \mid S_t = s\right]$. In the average-reward RL setting, we consider the average reward $J_\pi \doteq \lim_{T\to\infty} \frac{1}{T} \mathbb{E}\left[\sum_{t=0}^{T-1} R_{t+1}\right]$ and are interested in the differential value function of a policy $\pi$, defined as $\bar{v}_\pi(s) \doteq \mathbb{E}\left[\sum_{i=0}^{\infty} (R_{t+i+1} - J_\pi) \mid S_t = s\right]$. Under mild conditions, $J_\pi$ is independent of $S_0$.
Temporal difference (TD) learning is the most fundamental class of algorithms for policy evaluation in RL. In this work, we consider off-policy TD, where the data is generated by a behavior policy $\mu$ (i.e., $A_t \sim \mu(\cdot|S_t)$) while we want to estimate the value function of a different target policy $\pi$. In the discounted setting, the TD update is given by
$$\delta_t = R_{t+1} + \gamma v_t(S_{t+1}) - v_t(S_t), \quad (1)$$

$$v_{t+1}(s) = \begin{cases} v_t(s) + \alpha_t \rho_t \delta_t, & \text{if } s = S_t, \\ v_t(s), & \text{otherwise}, \end{cases} \quad (2)$$

where $\rho_t = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)}$ is the importance sampling ratio. Here, $\alpha_t$ is the learning rate, e.g., $\alpha_t = \frac{1}{t+1}$, and we refer to [(1)](https://arxiv.org/html/2605.06874#S2.E1) as *discounted TD with a global clock*. Let $\nu(s, t) \doteq \sum_{i=0}^{t} \mathbb{I}_{\{S_i = s\}}$ be the number of times state $s$ has been visited by time $t$, with $\mathbb{I}$ being the indicator function. If we replace $\alpha_t$ with $\alpha_{\nu(S_t, t)}$ in [(1)](https://arxiv.org/html/2605.06874#S2.E1), we get *discounted TD with a local clock*. It is well known that discounted TD with either clock is convergent under standard assumptions (e.g., the Markov chain induced by $\mu$ is ergodic, $\pi$ is absolutely continuous w.r.t. $\mu$, and $\alpha_t$ satisfies the Robbins-Monro conditions).
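To make the update concrete, here is a minimal tabular sketch of the discounted off-policy TD update (1)-(2) with a global clock. The MDP, policies, and the Robbins-Monro step-size schedule $\alpha_t = (t+1)^{-0.7}$ are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

# Hypothetical MDP: random transitions and rewards (illustration only).
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # p[s, a, s']
r = rng.standard_normal((n_states, n_actions))
mu = np.full((n_states, n_actions), 1.0 / n_actions)   # behavior policy
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # target policy

v = np.zeros(n_states)
s = 0
for t in range(50_000):
    a = rng.choice(n_actions, p=mu[s])
    s_next = rng.choice(n_states, p=p[s, a])
    rho = pi[s, a] / mu[s, a]                    # importance sampling ratio
    delta = r[s, a] + gamma * v[s_next] - v[s]   # TD error, Eq. (1)
    v[s] += rho * delta / (t + 1) ** 0.7         # global-clock step size alpha_t
    s = s_next

# Closed-form target for comparison: v_pi = (I - gamma P_pi)^{-1} r_pi.
P_pi = np.einsum('sa,sax->sx', pi, p)
r_pi = np.einsum('sa,sa->s', pi, r)
v_exact = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(np.round(v, 2), np.round(v_exact, 2))
```

A local-clock variant would replace the divisor $(t+1)^{0.7}$ with a per-state visit count.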
[Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31) extend off-policy TD to the average-reward setting and propose the following update for differential TD:
$$\delta_t = R_{t+1} - \hat{J}_t + v_t(S_{t+1}) - v_t(S_t), \quad (3)$$

$$\hat{J}_{t+1} = \hat{J}_t + \alpha_t \eta \rho_t \delta_t, \quad (4)$$

$$v_{t+1}(s) = \begin{cases} v_t(s) + \alpha_t \rho_t \delta_t, & \text{if } s = S_t, \\ v_t(s), & \text{otherwise}, \end{cases} \quad (5)$$

where $\eta > 0$ is a hyperparameter and $\{\hat{J}_t\}$ is an estimate of the average reward $J_\pi$. We refer to [(3)](https://arxiv.org/html/2605.06874#S2.E3) as *differential TD with a global clock*. Similarly, by replacing both appearances of $\alpha_t$ in [(3)](https://arxiv.org/html/2605.06874#S2.E3) with $\alpha_{\nu(S_t, t)}$, we get *differential TD with a local clock*. [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31) prove the almost sure convergence of differential TD with a local clock for any $\eta > 0$ in their appendix, although their main text only presents differential TD with a global clock and does not mention the local clock at all. [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) show that when $\eta$ is sufficiently small, differential TD with a global clock is also convergent. It is thus an open problem whether differential TD with a global clock is convergent for an $\eta$ that is not sufficiently small. Surprisingly, we will show in this paper that the answer is no.
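For concreteness, a minimal sketch of updates (3)-(5) with a global clock follows; the MDP, policies, step-size schedule, and the value $\eta = 0.5$ are hypothetical illustrations (in particular, this is not the divergent counterexample constructed later):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, eta = 3, 2, 0.5

# Hypothetical MDP (illustration only).
p = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.standard_normal((n_states, n_actions))
mu = np.full((n_states, n_actions), 1.0 / n_actions)   # behavior policy
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # target policy

v = np.zeros(n_states)
J_hat = 0.0                       # running estimate of the average reward J_pi
s = 0
for t in range(50_000):
    a = rng.choice(n_actions, p=mu[s])
    s_next = rng.choice(n_states, p=p[s, a])
    rho = pi[s, a] / mu[s, a]
    alpha = 1.0 / (t + 1) ** 0.7                  # global-clock step size
    delta = r[s, a] - J_hat + v[s_next] - v[s]    # Eq. (3)
    J_hat += alpha * eta * rho * delta            # Eq. (4)
    v[s] += alpha * rho * delta                   # Eq. (5)
    s = s_next

print(round(J_hat, 3))
```

Note that the same $\delta_t$ drives both the scalar average-reward estimate (4) and the tabular value update (5); the coupling through $\eta$ is exactly what the stability analysis below is about.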
We now introduce some definitions for matrix analysis\.
###### Definition 2\.1\.
A matrix $A$ is called an $M$-matrix if $A$ can be expressed as $A = sI - B$, where $s > 0$, $B \geq 0$ (i.e., all entries of $B$ are nonnegative), and $s \geq \rho(B)$, with $\rho(B)$ being the spectral radius of $B$.
###### Definition 2\.2\.
A matrix $A$ is called positive stable if all eigenvalues of $A$ have strictly positive real parts. A matrix $A$ is Hurwitz if $-A$ is positive stable.
## 3 How Is Differential TD Different from Discounted TD?
We now provide the necessary background to illustrate the fundamental difference between how discounted and differential TD interact with global and local clocks. This material is largely due to [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7); we provide details here only for completeness and to set the stage for our counterexample in the next section.
The ODE method is a powerful tool for establishing the almost sure convergence of RL algorithms. Take discounted TD with a local clock as an example. To apply the ODE method, we study the ODE@$\infty$ associated with discounted TD with a local clock, which is given by
$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -(I - \gamma P_\pi)x(t), \quad (6)$$

where $I$ is the identity matrix and $P_\pi$ is the transition matrix under policy $\pi$, i.e., $P_\pi(s, s') = \sum_a \pi(a|s) p(s'|s, a)$. The standard ODE method (e.g., [Borkar (2009)](https://arxiv.org/html/2605.06874#bib.bib9); [Borkar et al. (2025)](https://arxiv.org/html/2605.06874#bib.bib8); [Liu et al. (2025a)](https://arxiv.org/html/2605.06874#bib.bib18)) needs to verify that the above ODE is globally asymptotically stable (GAS) and that 0 is its unique equilibrium point. Fortunately, this is the case here because $I - \gamma P_\pi$ is a nonsingular $M$-matrix, and it is well known that a nonsingular $M$-matrix is positive stable ([Plemmons, 1977](https://arxiv.org/html/2605.06874#bib.bib25)). The ODE@$\infty$ associated with discounted TD with a global clock is instead
$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -D_\mu(I - \gamma P_\pi)x(t), \quad (7)$$

where $D_\mu$ is a diagonal matrix with $D_\mu(s, s) = d_\mu(s)$ and $d_\mu$ is the stationary distribution of the Markov chain induced by $\mu$. Suppose the Markov chain induced by $\mu$ is irreducible; then $d_\mu(s) > 0$ for all $s$, so $D_\mu$ is also positive stable. It is also known that the left multiplication of a nonsingular $M$-matrix by a positive stable diagonal matrix is still positive stable ([Fiedler and Pták, 1962](https://arxiv.org/html/2605.06874#bib.bib13); [Plemmons, 1977](https://arxiv.org/html/2605.06874#bib.bib25)), which implies that $D_\mu(I - \gamma P_\pi)$ is also positive stable, and thus the ODE@$\infty$ associated with discounted TD with a global clock is also GAS. The difference between the two ODEs is intuitive. With a local clock, all states are updated at the same "magnitude," so the actual sampling distribution does not matter. But with a global clock, the states are updated at different magnitudes according to how frequently they are visited, so there is a $D_\mu$ term to capture the sampling distribution.
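Both stability facts above are easy to check numerically. The sketch below (a hypothetical random MDP, not from the paper) confirms that the drift matrices of ODEs (6) and (7) have all eigenvalues in the open right half plane:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# Random row-stochastic P_pi and strictly positive sampling distribution d_mu.
P_pi = rng.dirichlet(np.ones(n), size=n)
d_mu = rng.dirichlet(np.ones(n))

A_local = np.eye(n) - gamma * P_pi     # drift matrix of ODE (6), local clock
A_global = np.diag(d_mu) @ A_local     # drift matrix of ODE (7), global clock

# Both are positive stable: every eigenvalue lies in the open right half plane.
print(np.linalg.eigvals(A_local).real.min() > 0)   # -> True
print(np.linalg.eigvals(A_global).real.min() > 0)  # -> True
```

The first check holds because $\rho(\gamma P_\pi) \leq \gamma < 1$; the second is the Fiedler-Pták result cited above.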
For differential TD with a local clock, the corresponding ODE@$\infty$ is
$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -(I - P_\pi + \eta e e^\top)x(t), \quad (8)$$

where $e$ is the all-one vector. The detailed derivation can be found in [Wan et al. (2021)](https://arxiv.org/html/2605.06874#bib.bib31). Although $I - P_\pi + \eta e e^\top$ is in general not an $M$-matrix, it is still positive stable for all $\eta > 0$, so the ODE is still GAS. The ODE@$\infty$ associated with differential TD with a global clock is instead
$$\frac{\mathrm{d}x(t)}{\mathrm{d}t} = -D_\mu(I - P_\pi + \eta e e^\top)x(t). \quad (9)$$

See [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) for the detailed derivation. Unfortunately, the left multiplication of a positive stable matrix by a positive stable diagonal matrix does not necessarily preserve positive stability. So the ODE@$\infty$ associated with differential TD with a global clock is not automatically GAS for all $\eta > 0$. To summarize, the fundamental difference is that *discounted TD has an $M$-matrix, but differential TD has only a positive stable matrix*.
[Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) approach this challenge via a rank-one perturbation perspective. Let
$$A_\eta \doteq D_\mu(I - P_\pi + \eta e e^\top) = D_\mu(I - P_\pi) + \eta d_\mu e^\top. \quad (10)$$

Then $A_\eta$ can be viewed as a rank-one perturbation of a singular $M$-matrix, because $D_\mu(I - P_\pi)$ is a singular $M$-matrix (see [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) for a simple proof) and $\eta d_\mu e^\top$ is a rank-one matrix. Using results on rank-one perturbations ([Bierkens and Ran, 2014](https://arxiv.org/html/2605.06874#bib.bib5)), [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) show that when $\eta$ is sufficiently small, $A_\eta$ is positive stable and thus differential TD with a global clock is convergent.
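The rank-one structure in (10) is easy to see numerically. In the sketch below (a hypothetical random instance; the value $\eta = 10^{-3}$ is an arbitrary small choice), $L = D_\mu(I - P_\pi)$ is singular since $Le = 0$, while the perturbation $\eta\, d_\mu e^\top$ removes the kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random irreducible P_pi (all entries positive a.s.) and positive d_mu.
P_pi = rng.dirichlet(np.ones(n), size=n)
d_mu = rng.dirichlet(np.ones(n))
e = np.ones(n)

L = np.diag(d_mu) @ (np.eye(n) - P_pi)   # singular M-matrix: L e = 0
A_eta = L + 1e-3 * np.outer(d_mu, e)     # rank-one perturbation, eta = 1e-3

print(np.linalg.matrix_rank(L))          # n - 1: one-dimensional kernel span{e}
print(np.linalg.matrix_rank(A_eta))      # n: A_eta has no zero eigenvalue
```

That $A_\eta$ is nonsingular for every $\eta > 0$ is proved in general in Theorem 4.3 below; whether it is positive stable is exactly the open question.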
## 4 Counterexamples
In this section, we provide a full characterization of when $A_\eta$ is positive stable, based on which we construct concrete counterexamples showing that $A_\eta$ is not positive stable for some $\eta > 0$. The following assumption is standard and is used throughout the rest of the paper.
###### Assumption 4\.1\.
The target transition matrix $P_\pi$ is irreducible and $d_\mu(s) > 0$ for all $s$.
We follow the rank-one perturbation perspective in [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) and first study
$$L \doteq D_\mu(I - P_\pi). \quad (11)$$

We use $\operatorname{spec}(L) \doteq \{z \in \mathbb{C} : \det(zI - L) = 0\}$ to denote the spectrum of $L$, i.e., the set of its eigenvalues. Our first observation is that every eigenvalue of $L$ has nonnegative real part and every nonzero eigenvalue has strictly positive real part.
###### Lemma 4\.2\.
For any $\lambda \in \operatorname{spec}(L)$, $\operatorname{Re}(\lambda) \geq 0$. If $\lambda \neq 0$, then $\operatorname{Re}(\lambda) > 0$.
###### Proof\.
Let $\lambda \in \operatorname{spec}(L)$ and let $x \neq 0$ satisfy $Lx = \lambda x$. Choose $k$ such that $|x_k| = \max_j |x_j|$. The $k$-th row of $Lx = \lambda x$ gives

$$d_\mu(k)\left(x_k - \sum_j P_\pi(k, j) x_j\right) = \lambda x_k,$$

so

$$\sum_j P_\pi(k, j) x_j = \left(1 - \frac{\lambda}{d_\mu(k)}\right) x_k.$$

Since $P_\pi$ is row-stochastic,

$$\left|1 - \frac{\lambda}{d_\mu(k)}\right| |x_k| \leq \sum_j P_\pi(k, j) |x_j| \leq |x_k|.$$

Thus $|1 - \lambda/d_\mu(k)| \leq 1$. Writing $\lambda = a + ib$ yields

$$\left(1 - \frac{a}{d_\mu(k)}\right)^2 + \left(\frac{b}{d_\mu(k)}\right)^2 \leq 1.$$

Rearranging terms, we have $2 d_\mu(k) \operatorname{Re}(\lambda) \geq |\lambda|^2$, which completes the proof. ∎
Lemma [4.2](https://arxiv.org/html/2605.06874#S4.Thmtheorem2) clarifies that *0 is the only possible eigenvalue of $L$ on the imaginary axis, and all the other eigenvalues of $L$ are in the open right half plane*. Now, define a function $g: \mathbb{C} \to \mathbb{C}$ as
$$g(z) \doteq e^\top (zI - L)^{-1} d_\mu. \quad (12)$$

It is easy to see that $g(z)$ is well-defined for every $z \notin \operatorname{spec}(L)$. In particular, let $i$ be the imaginary unit. Then for every $\omega > 0$, $i\omega \notin \operatorname{spec}(L)$ thanks to Lemma [4.2](https://arxiv.org/html/2605.06874#S4.Thmtheorem2), so $g(i\omega)$ is well-defined. This allows us to define
$$\eta_* \doteq \inf\{\eta > 0 \mid \exists\, \omega > 0 \text{ such that } 1 - \eta g(i\omega) = 0\}, \quad (13)$$

with the convention that $\inf \emptyset = +\infty$. This $\eta_*$ is the threshold we want.
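In small examples, $g(i\omega)$ and the threshold $\eta_*$ in (13) can be probed numerically. The sketch below is a crude grid search over $\omega$ on a hypothetical random instance (the grid range, its resolution, and the tolerance on $\operatorname{Im} g(i\omega)$ are arbitrary choices): a real $\eta > 0$ with $1 - \eta g(i\omega) = 0$ exists only where $g(i\omega)$ is real and positive, in which case $\eta = 1/g(i\omega)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
P_pi = rng.dirichlet(np.ones(n), size=n)
d_mu = rng.dirichlet(np.ones(n))
L = np.diag(d_mu) @ (np.eye(n) - P_pi)
e = np.ones(n)

def g(z):
    """g(z) = e^T (zI - L)^{-1} d_mu, as in Eq. (12)."""
    return e @ np.linalg.solve(z * np.eye(n) - L, d_mu)

# Scan omega > 0 for (approximately) real, positive values of g(i*omega).
etas = []
for w in np.linspace(1e-3, 5.0, 20_000):
    val = g(1j * w)
    if abs(val.imag) < 1e-4 and val.real > 0:
        etas.append(1.0 / val.real)

eta_star = min(etas) if etas else np.inf   # crude estimate of eta_* in Eq. (13)
print(eta_star > 0)   # -> True (eta_* is positive, possibly +infinity)
```

A root-finding pass around each near-real grid point would sharpen the estimate; the grid search only locates candidate crossings.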
###### Theorem 4\.3\(Maximal Stability Threshold\)\.
For every $\eta \in (0, \eta_*)$, $A_\eta$ is positive stable. Furthermore, the above statement does not hold for any $\eta_*' > \eta_*$: for every $\eta_*' > \eta_*$, there exists some $\eta_c \in [\eta_*, \eta_*')$ such that $A_{\eta_c}$ is not positive stable.
###### Proof\.
We view $A_\eta$ as a rank-one perturbation of $L$ and study how the eigenvalues of $A_\eta$ move as $\eta$ increases from $0$. Since eigenvalues depend continuously on the entries of the matrix, they also depend continuously on $\eta$. By Lemma 4.9 of [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7), there exists $\eta_0 > 0$ such that $A_\eta$ is positive stable for every $\eta \in (0, \eta_0]$.¹ In other words, all the eigenvalues of $A_\eta$ are in the open right half plane for all sufficiently small $\eta > 0$. Thus, for $A_\eta$ to lose positive stability as $\eta$ increases, at least one eigenvalue must cross the imaginary axis for some $\eta > 0$.

¹ Lemma 4.9 of [Blaser et al. (2026)](https://arxiv.org/html/2605.06874#bib.bib7) proves this using Lemma 2.11 of [Bierkens and Ran (2014)](https://arxiv.org/html/2605.06874#bib.bib5). For completeness, Appendix [A](https://arxiv.org/html/2605.06874#A1) gives a self-contained perturbation proof for our setting.
First, we show that an eigenvalue cannot cross the imaginary axis through the origin, because $A_\eta$ has no zero eigenvalue for any $\eta > 0$. To see this, let $d_\pi$ be the stationary distribution of $P_\pi$ and define $\ell^\top \doteq d_\pi^\top D_\mu^{-1}$. If $A_\eta x = 0$ for some $x \neq 0$ and $\eta > 0$, then

$$0 = \ell^\top A_\eta x = \ell^\top D_\mu(I - P_\pi)x + \eta \ell^\top d_\mu e^\top x = \eta e^\top x, \quad (14)$$

where we have used $\ell^\top D_\mu(I - P_\pi) = d_\pi^\top(I - P_\pi) = 0$ and $\ell^\top d_\mu = d_\pi^\top e = 1$. Then $0 = A_\eta x = Lx + \eta d_\mu e^\top x = Lx$, so $x$ is in the kernel of $L$. Since $D_\mu$ is full rank, the kernel of $L$ is the same as the kernel of $I - P_\pi$. But by the Perron-Frobenius theorem, the kernel of $I - P_\pi$ is one-dimensional and is spanned by the all-one vector $e$. So there must exist $c \neq 0$ such that $x = ce$, yielding $0 = e^\top x = c e^\top e \neq 0$, which is a contradiction.
Now we characterize every possible non-origin crossing of the imaginary axis. Since $A_\eta$ is a real matrix, its nonreal eigenvalues come in conjugate pairs, so it suffices to characterize every possible crossing at $i\omega$ with $\omega > 0$. We thus study when $i\omega$ is an eigenvalue of $A_\eta$ for some $\omega > 0$ and $\eta > 0$, i.e., when $\det(i\omega I - A_\eta) = 0$. By the rank-one determinant identity $\det(M + uv^\top) = \det(M) + v^\top \operatorname{adj}(M) u$, with $\operatorname{adj}(M)$ being the adjugate of $M$, we have

$$0 = \det(i\omega I - A_\eta) = \det(i\omega I - L - \eta d_\mu e^\top) = \det(i\omega I - L) - \eta e^\top \operatorname{adj}(i\omega I - L) d_\mu. \quad (15)$$

Since $i\omega \notin \operatorname{spec}(L)$, we have $\operatorname{adj}(i\omega I - L) = \det(i\omega I - L)(i\omega I - L)^{-1}$, so

$$0 = \det(i\omega I - L) - \eta e^\top \operatorname{adj}(i\omega I - L) d_\mu = \det(i\omega I - L)\left(1 - \eta g(i\omega)\right). \quad (16)$$

Again, since $i\omega \notin \operatorname{spec}(L)$, we have $\det(i\omega I - L) \neq 0$. So the above is equivalent to $1 - \eta g(i\omega) = 0$. By definition, $\eta_*$ is the infimum of such $\eta$, which completes the proof. ∎
Theorem [4.3](https://arxiv.org/html/2605.06874#S4.Thmtheorem3) gives the largest $\eta_*$ such that the claim "$A_\eta$ is positive stable for every $\eta \in (0, \eta_*)$" holds. In the following, we show that the stability region can be more complicated than just $(0, \eta_*)$.
###### Theorem 4\.4\(Exact Stability Region\)\.
The stability region

$$\Pi \doteq \{\eta > 0 \mid A_\eta \text{ is positive stable}\}$$

is always a finite union of open intervals, possibly empty and possibly including unbounded intervals.
###### Proof\.
To characterize $\Pi$, we use the Routh-Hurwitz criterion ([Gantmacher, 1980](https://arxiv.org/html/2605.06874#bib.bib15)). To this end, we consider the characteristic polynomial of $-A_\eta$, defined as

$$q_\eta(z) \doteq \det(zI + A_\eta) = z^m + a_1(\eta) z^{m-1} + \cdots + a_m(\eta), \quad (17)$$

where $m = |\mathcal{S}|$ is the number of states and $\{a_j(\eta)\}_{j=1}^m$ are the coefficients of the characteristic polynomial. The roots of $q_\eta(z) = \det(zI + A_\eta)$ are exactly $z = -\lambda$ with $\lambda \in \operatorname{spec}(A_\eta)$. Therefore, $A_\eta$ is positive stable if and only if every root of $q_\eta$ has negative real part, i.e., if and only if $q_\eta$ is a Hurwitz polynomial. By the Routh-Hurwitz criterion, a real monic polynomial is Hurwitz if and only if all of its Hurwitz determinants are positive. Using the convention that $a_0(\eta) \doteq 1$ and $a_j(\eta) \doteq 0$ whenever $j < 0$ or $j > m$, for $k = 1, \ldots, m$, the $k$-th Hurwitz determinant of the polynomial $q_\eta$ is defined as

$$\Delta_k(\eta) \doteq \det\left(\left[a_{2j-i}(\eta)\right]_{i,j=1}^k\right). \quad (18)$$

Thus $A_\eta$ is positive stable if and only if $\Delta_k(\eta) > 0$ for every $k = 1, \ldots, m$. Since $A_\eta = L + \eta d_\mu e^\top$, we have

$$q_\eta(z) = \det(zI + L + \eta d_\mu e^\top) = \det(zI + L) + \eta e^\top \operatorname{adj}(zI + L) d_\mu. \quad (19)$$

Hence $q_\eta(z) = q_0(z) + \eta q_1(z)$, where $q_0$ and $q_1$ are polynomials in $z$ that do not depend on $\eta$. Consequently, each coefficient $a_j(\eta)$ is linear in $\eta$. Since $\Delta_k(\eta)$ is the determinant of a $k \times k$ matrix whose entries are linear in $\eta$, it is a polynomial in $\eta$ of degree at most $k$. If some $\Delta_k(\eta)$ is zero for all $\eta > 0$, then no $\eta$ can satisfy $\Delta_k(\eta) > 0$, so the stability region is empty. Otherwise, the finitely many real roots of the nonzero polynomials $\Delta_1, \ldots, \Delta_m$ partition $(0, \infty)$ into finitely many intervals, and the sign of each $\Delta_k$ is constant on each such interval. Selecting exactly those intervals on which all determinants are positive gives $\Pi$. Thus the stability region is a finite union of open intervals, possibly empty and possibly including unbounded intervals. ∎
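The Hurwitz determinants in (18) are straightforward to mechanize. A minimal sketch follows, using the convention $\Delta_k = \det([a_{2j-i}]_{i,j=1}^k)$ from the proof; the sanity-check polynomial $(z+1)(z+2)(z+3)$ is a hypothetical stable example, not $q_\eta$ itself:

```python
import numpy as np

def hurwitz_determinants(coeffs):
    """All Hurwitz determinants of a monic real polynomial.

    coeffs = [1, a_1, ..., a_m]; entry (i, j) of the k-th Hurwitz matrix is
    a_{2j-i} (1-indexed), with a_j = 0 for j < 0 or j > m, as in Eq. (18).
    """
    a = list(coeffs)
    m = len(a) - 1
    def entry(i, j):
        k = 2 * j - i
        return a[k] if 0 <= k <= m else 0.0
    return [np.linalg.det(np.array([[entry(i, j) for j in range(1, k + 1)]
                                    for i in range(1, k + 1)]))
            for k in range(1, m + 1)]

# Sanity check on (z + 1)(z + 2)(z + 3): Hurwitz, so all Delta_k > 0.
coeffs = np.poly([-1.0, -2.0, -3.0])   # -> [1, 6, 11, 6]
print(all(d > 0 for d in hurwitz_determinants(coeffs)))   # -> True
```

Applied to $q_\eta$ for a sweep of $\eta$ values, the sign pattern of the returned list recovers the intervals making up $\Pi$.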
We note that Theorems [4.3](https://arxiv.org/html/2605.06874#S4.Thmtheorem3) and [4.4](https://arxiv.org/html/2605.06874#S4.Thmtheorem4) do not rule out the possibility that $\Pi$ is $(0, \infty)$; rather, they characterize what $\Pi$ may look like. Building on these insights, we now construct a concrete counterexample showing that for some $\eta > 0$, $A_\eta$ is not positive stable.
###### Example 4\.5\(Counterexample\)\.
Suppose $m > 22$ is an integer. Then there exist a positive diagonal matrix $D_\mu \in \mathbb{R}^{(m+2) \times (m+2)}$ and an irreducible and aperiodic row-stochastic matrix $P_\pi$ such that $A_\eta$ is positive stable if and only if

$$\eta \in (0, \alpha) \cup (3\alpha, \infty), \qquad \alpha \doteq \frac{m - 22}{m^2 + m - 2}.$$

The smallest instance in this family is obtained at $m = 23$, yielding $\alpha = 1/550$ and a $25$-state system.
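This claim can be verified numerically. The sketch below takes $m = 23$ (so $\alpha = 1/550$), builds $D_\mu$ and $P_\pi$ exactly as in the proof that follows, and tests the positive stability of $A_\eta$ directly via its eigenvalues at probe points inside each region:

```python
import numpy as np

m = 23
alpha = (m - 22) / (m**2 + m - 2)           # = 1/550
n = m + 2                                    # states a_1, ..., a_m, b, c

d_mu = np.full(n, alpha)
d_mu[-1] = 1 - (m + 1) * alpha               # state c
D_mu = np.diag(d_mu)

# Row-stochastic Q: a_i -> b, b -> c, c -> uniform over a_1, ..., a_m.
Q = np.zeros((n, n))
Q[:m, m] = 1.0
Q[m, m + 1] = 1.0
Q[m + 1, :m] = 1.0 / m

P_pi = np.eye(n) + alpha * np.linalg.inv(D_mu) @ (Q - np.eye(n))

def positive_stable(eta):
    """Check A_eta = D_mu (I - P_pi + eta e e^T) via its eigenvalues."""
    A = D_mu @ (np.eye(n) - P_pi + eta * np.ones((n, n)))
    return np.linalg.eigvals(A).real.min() > 0

# Claimed stability region: (0, alpha) U (3 alpha, infinity).
print(positive_stable(0.5 * alpha))   # inside (0, alpha)        -> True
print(positive_stable(2.0 * alpha))   # inside [alpha, 3 alpha]  -> False
print(positive_stable(4.0 * alpha))   # inside (3 alpha, inf)    -> True
```

The three probe values $0.5\alpha$, $2\alpha$, and $4\alpha$ are arbitrary interior points of the three regions; the boundary points $\alpha$ and $3\alpha$ themselves should be avoided, since eigenvalues sit on the imaginary axis there.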
###### Proof\.
Consider $m + 2$ states $a_1, \ldots, a_m, b, c$. Set $\alpha \doteq (m - 22)/(m^2 + m - 2)$. Construct a vector $d_\mu \in \mathbb{R}^{m+2}$ as $d_\mu[a_i] = \alpha$ for $i = 1, \ldots, m$, $d_\mu[b] = \alpha$, and $d_\mu[c] = 1 - (m + 1)\alpha$, and let $D_\mu \doteq \operatorname{diag}(d_\mu)$. Since $m > 22$, we have $\alpha > 0$. Moreover,

$$d_\mu[c] = \frac{22m + 20}{m^2 + m - 2} > 0, \qquad d_\mu[c] - \alpha = \frac{21m + 42}{m^2 + m - 2} > 0.$$

Also, $\sum_s d_\mu[s] = m\alpha + \alpha + d_\mu[c] = 1$. Thus $d_\mu$ is a strictly positive probability vector, and $\alpha \leq d_\mu[s]$ for every state $s$.
Define the row-stochastic matrix $Q$ by

$$Q[a_i, b] = 1 \ (i = 1, \ldots, m), \qquad Q[b, c] = 1, \qquad Q[c, a_i] = \frac{1}{m} \ (i = 1, \ldots, m),$$

with all other entries zero. Define $P_\pi \doteq I + \alpha D_\mu^{-1}(Q - I)$. Then, for each state $s$,

$$P_\pi[s, \cdot] = \left(1 - \frac{\alpha}{d_\mu[s]}\right) e_s^\top + \frac{\alpha}{d_\mu[s]} Q[s, \cdot],$$

where $e_s$ is the standard basis vector at state $s$, not the all-one vector $e$. Therefore, every row of $P_\pi$ is a convex combination of $e_s^\top$ and $Q[s, \cdot]$, so $P_\pi$ is row-stochastic. It is irreducible because all transitions of $Q$ remain available, and it is aperiodic because $P_\pi[c, c] = 1 - \alpha/d_\mu[c] > 0$.
For $\eta>0$, write $t\doteq\eta/\alpha$. Then
$$A_{\eta}=D_{\mu}(I-P_{\pi}+\eta ee^{\top})=\alpha(I-Q+td_{\mu}e^{\top}),$$
since $D_{\mu}(I-P_{\pi})=\alpha(I-Q)$ and $D_{\mu}\eta ee^{\top}=\eta d_{\mu}e^{\top}=\alpha td_{\mu}e^{\top}$. Since $\alpha>0$, multiplication by $\alpha$ does not change positive stability, so it suffices to analyze $B_{t}\doteq I-Q+td_{\mu}e^{\top}$.
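The rescaling identity $A_{\eta}=\alpha B_{t}$ is easy to confirm numerically; the sketch below (a sanity check under the same assumptions as above, using $\eta=2\alpha$ as an arbitrary example) does so for $m=23$:

```python
import numpy as np

# Check A_eta = alpha * B_t with t = eta / alpha, B_t = I - Q + t d_mu e^T.
m = 23
alpha = (m - 22) / (m**2 + m - 2)
eta = 2 * alpha
t = eta / alpha
d_mu = np.full(m + 2, alpha); d_mu[-1] = 1 - (m + 1) * alpha
D_mu = np.diag(d_mu)

Q = np.zeros((m + 2, m + 2))
Q[:m, m] = 1.0; Q[m, m + 1] = 1.0; Q[m + 1, :m] = 1.0 / m

I = np.eye(m + 2)
P_pi = I + alpha * (Q - I) / d_mu[:, None]
e = np.ones((m + 2, 1))

A_eta = D_mu @ (I - P_pi + eta * (e @ e.T))
B_t = I - Q + t * np.outer(d_mu, np.ones(m + 2))
assert np.allclose(A_eta, alpha * B_t)
```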
We next decompose the ambient vector space. Work over $\mathbb{C}$ and order the coordinates as $(a_{1},\ldots,a_{m},b,c)$. Let
$$\mathcal{H}\doteq\{x\in\mathbb{C}^{m+2}:x[b]=x[c]=0,\ \sum_{i=1}^{m}x[a_{i}]=0\}.$$
If $x\in\mathcal{H}$, then $Qx=0$ and $e^{\top}x=0$, so $B_{t}x=x$. Since $\dim(\mathcal{H})=m-1$, the restriction $B_{t}|_{\mathcal{H}}$ is the identity map on $\mathcal{H}$, so this block contributes $m-1$ eigenvalues equal to $1$. Now consider the symmetric subspace
$$\mathcal{U}\doteq\{x\in\mathbb{C}^{m+2}:x[a_{1}]=\cdots=x[a_{m}]\}.$$
Define the linear isomorphism
$$\Phi:\mathbb{C}^{3}\to\mathcal{U},\qquad\Phi(u,v,w)=(u,\ldots,u,v,w).$$
Let $y=(u,v,w)^{\top}$. Then
$$Q\Phi y=\Phi(Cy),\qquad e^{\top}\Phi y=r^{\top}y,$$
where
$$C=\begin{pmatrix}0&1&0\\ 0&0&1\\ 1&0&0\end{pmatrix},\qquad r^{\top}=\begin{pmatrix}m&1&1\end{pmatrix}.$$
Moreover,
$$d_{\mu}e^{\top}\Phi y=\Phi(\bar{d}r^{\top}y),\qquad\bar{d}=\begin{pmatrix}\alpha\\ \alpha\\ 1-(m+1)\alpha\end{pmatrix}.$$
Hence
$$(Q-td_{\mu}e^{\top})\Phi y=\Phi\left((C-t\bar{d}r^{\top})y\right).$$
In particular, $\mathcal{U}$ is invariant under $Q-td_{\mu}e^{\top}$, and therefore also under $B_{t}=I-Q+td_{\mu}e^{\top}$. The matrix of the restricted map $(Q-td_{\mu}e^{\top})|_{\mathcal{U}}$ in the coordinates given by $\Phi$ is $M_{t}\doteq C-t\bar{d}r^{\top}$. Consequently, the restriction of $B_{t}$ to $\mathcal{U}$ is represented by $I-M_{t}$. Finally, $\mathbb{C}^{m+2}=\mathcal{H}\oplus\mathcal{U}$. Indeed, for any $x\in\mathbb{C}^{m+2}$, let $\bar{u}\doteq m^{-1}\sum_{i=1}^{m}x[a_{i}]$. Then
$$x=\underbrace{(x[a_{1}]-\bar{u},\ldots,x[a_{m}]-\bar{u},0,0)}_{\in\mathcal{H}}+\underbrace{(\bar{u},\ldots,\bar{u},x[b],x[c])}_{\in\mathcal{U}},$$
and $\mathcal{H}\cap\mathcal{U}=\{0\}$. Since both subspaces are $B_{t}$-invariant, the matrix of $B_{t}$ in a basis adapted to this direct sum is block diagonal:
$$B_{t}\sim\begin{pmatrix}I_{m-1}&0\\ 0&I-M_{t}\end{pmatrix}.$$
Thus the remaining three eigenvalues of $B_{t}$ are exactly the eigenvalues of $I-M_{t}$. Now compute the characteristic polynomial of $M_{t}$. Let $N=zI-C$. Then $\det N=z^{3}-1$ and
$$\operatorname{adj}(N)=\begin{pmatrix}z^{2}&z&1\\ 1&z^{2}&z\\ z&1&z^{2}\end{pmatrix}.$$
By the matrix determinant lemma,
$$\det(zI-M_{t})=\det(N+t\bar{d}r^{\top})=\det N+tr^{\top}\operatorname{adj}(N)\bar{d},$$
and
$$r^{\top}\operatorname{adj}(N)\bar{d}=z^{2}+z+\left[m-\alpha(m^{2}+m-2)\right]=z^{2}+z+22.$$
Therefore $\det(zI-M_{t})=z^{3}-1+t(z^{2}+z+22)$. The Hurwitz polynomial of $I-M_{t}$ is
$$q_{t}(s)=\det(sI+I-M_{t})=s^{3}+(t+3)s^{2}+(3t+3)s+24t.$$
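Both the block decomposition of the spectrum and the characteristic-polynomial identity can be spot-checked numerically; the sketch below (a sanity check under the same assumptions, with $m=23$ and $t=2$ as arbitrary sample values) compares the spectrum of $B_{t}$ with the predicted blocks and evaluates $\det(zI-M_{t})$ at a few points $z$:

```python
import numpy as np

# Spectrum check: B_t should have m-1 eigenvalues equal to 1 plus the three
# eigenvalues of I - M_t; and det(zI - M_t) should equal z^3 - 1 + t(z^2+z+22).
m, t = 23, 2.0
alpha = (m - 22) / (m**2 + m - 2)
d_mu = np.full(m + 2, alpha); d_mu[-1] = 1 - (m + 1) * alpha

Q = np.zeros((m + 2, m + 2))
Q[:m, m] = 1.0; Q[m, m + 1] = 1.0; Q[m + 1, :m] = 1.0 / m
B_t = np.eye(m + 2) - Q + t * np.outer(d_mu, np.ones(m + 2))

C = np.array([[0., 1., 0.], [0., 0., 1.], [1., 0., 0.]])
r = np.array([m, 1., 1.])
d_bar = np.array([alpha, alpha, 1 - (m + 1) * alpha])
M_t = C - t * np.outer(d_bar, r)

# Spectrum of B_t = {1 with multiplicity m-1} union spec(I - M_t).
eig_B = np.sort_complex(np.linalg.eigvals(B_t))
eig_blocks = np.sort_complex(np.concatenate(
    [np.ones(m - 1), np.linalg.eigvals(np.eye(3) - M_t)]))
assert np.allclose(eig_B, eig_blocks)

# Characteristic polynomial of M_t at a few sample points z.
for z in (0.0, 1.0, -2.0, 0.5 + 1.5j):
    lhs = np.linalg.det(z * np.eye(3) - M_t)
    rhs = z**3 - 1 + t * (z**2 + z + 22)
    assert np.isclose(lhs, rhs)
```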
For a cubic $s^{3}+as^{2}+bs+c$, Hurwitz stability is equivalent to $a>0$, $b>0$, $c>0$, and $ab>c$. Here $a,b,c>0$ for $t>0$, and the remaining condition is
$$(t+3)(3t+3)>24t\quad\Longleftrightarrow\quad 3(t-1)(t-3)>0.$$
Thus the three-dimensional block is positive stable if and only if $t\in(0,1)\cup(3,\infty)$. Since the other $m-1$ eigenvalues of $B_{t}$ are equal to $1$, we conclude that
$$A_{\eta}\text{ is positive stable}\quad\Longleftrightarrow\quad\eta\in(0,\alpha)\cup(3\alpha,\infty).$$
∎
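Both the Routh-Hurwitz conclusion and the final stability claim admit a direct numerical check; the sketch below (a sanity check under the same assumptions, for $m=23$ and a few sample values of $t$) tests whether the roots of $q_{t}$ lie in the open left half plane and whether the eigenvalues of $A_{\eta}$ lie in the open right half plane:

```python
import numpy as np

# Check: q_t(s) = s^3 + (t+3)s^2 + (3t+3)s + 24t is Hurwitz iff t in
# (0,1) or (3,inf), and A_eta is positive stable iff eta = t*alpha lies in
# (0, alpha) or (3*alpha, inf), for m = 23.
m = 23
alpha = (m - 22) / (m**2 + m - 2)
d_mu = np.full(m + 2, alpha); d_mu[-1] = 1 - (m + 1) * alpha
I = np.eye(m + 2)
Q = np.zeros((m + 2, m + 2))
Q[:m, m] = 1.0; Q[m, m + 1] = 1.0; Q[m + 1, :m] = 1.0 / m
P_pi = I + alpha * (Q - I) / d_mu[:, None]

def hurwitz_stable(t):
    """All roots of q_t in the open left half plane."""
    return bool(np.all(np.roots([1.0, t + 3, 3 * t + 3, 24 * t]).real < 0))

def positive_stable(eta):
    """All eigenvalues of A_eta in the open right half plane."""
    A = np.diag(d_mu) @ (I - P_pi + eta * np.ones((m + 2, m + 2)))
    return bool(np.all(np.linalg.eigvals(A).real > 0))

for t in (0.2, 0.9, 3.5, 10.0):      # inside the stable region
    assert hurwitz_stable(t) and positive_stable(t * alpha)
for t in (1.5, 2.0, 2.9):            # inside the unstable interval (1, 3)
    assert not hurwitz_stable(t) and not positive_stable(t * alpha)
```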
## 5 Experiments

We first provide a numerical validation of Example [4.5](https://arxiv.org/html/2605.06874#S4.Thmtheorem5). Figure [1](https://arxiv.org/html/2605.06874#S5.F1) shows the nontrivial eigenvalues of $A_{\eta}/\alpha$ as $\eta$ increases from $0$ to $4\alpha$. The conjugate pair starts in the open right half plane, crosses the imaginary axis at $\eta=\alpha$, remains in the open left half plane for $\eta\in(\alpha,3\alpha)$, and crosses back into the open right half plane at $\eta=3\alpha$.
Figure 1: Phase portrait of the nontrivial eigenvalues of $A_{\eta}/\alpha$ for the counterexample in Example [4.5](https://arxiv.org/html/2605.06874#S4.Thmtheorem5). Left: trajectory of the conjugate pair in the complex plane. Right: real part of this pair as a function of $t=\eta/\alpha$, with the unstable interval $t\in(1,3)$ shaded.

We then construct a concrete MDP instance based on Example [4.5](https://arxiv.org/html/2605.06874#S4.Thmtheorem5) and run the global-clock and local-clock differential TD algorithms on it with the same $\eta$. We consider a finite MDP with $m+2$ states and two actions $\{a_{0},a_{1}\}$, where $m>22$ is an integer. The target policy $\pi$ always selects action $a_{0}$, and the transition matrix of $a_{0}$ is defined to be $P_{\pi}$ from the proof of Example [4.5](https://arxiv.org/html/2605.06874#S4.Thmtheorem5), so the transition matrix of the target policy coincides with $P_{\pi}$. The behavior policy $\mu$ selects action $a_{0}$ with probability $\kappa\in(0,1)$ and action $a_{1}$ otherwise, where $\kappa$ is chosen so that
$$\kappa\leq\min_{i,j:P_{\pi}[i,j]>0}\frac{d_{\mu}[j]}{P_{\pi}[i,j]},$$
where both $P_{\pi}$ and $d_{\mu}$ are defined in the proof of Example [4.5](https://arxiv.org/html/2605.06874#S4.Thmtheorem5). Define the transition matrix of $a_{1}$ as $R\doteq(ed_{\mu}^{\top}-\kappa P_{\pi})/(1-\kappa)$. The matrix $R$ is row-stochastic by the choice of $\kappa$ and the fact that $ed_{\mu}^{\top}$ and $P_{\pi}$ are row-stochastic. The transition matrix of the behavior policy is then
$$P_{\mu}=\kappa P_{\pi}+(1-\kappa)R=ed_{\mu}^{\top}.$$
So $d_{\mu}$ coincides with the stationary distribution of the behavior policy. The reward is always $0$.
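The behavior-policy construction can be verified numerically; the sketch below (a sanity check, not the paper's experiment code, assuming `numpy`, $m=23$, and the state ordering from the proof) checks that a feasible $\kappa$ makes $R$ row-stochastic and that $P_{\mu}=ed_{\mu}^{\top}$:

```python
import numpy as np

# Check the behavior-policy construction for m = 23: with a feasible kappa,
# R = (e d_mu^T - kappa P_pi) / (1 - kappa) is row-stochastic and
# P_mu = kappa P_pi + (1 - kappa) R equals e d_mu^T.
m = 23
alpha = (m - 22) / (m**2 + m - 2)
d_mu = np.full(m + 2, alpha); d_mu[-1] = 1 - (m + 1) * alpha
I = np.eye(m + 2)
Q = np.zeros((m + 2, m + 2))
Q[:m, m] = 1.0; Q[m, m + 1] = 1.0; Q[m + 1, :m] = 1.0 / m
P_pi = I + alpha * (Q - I) / d_mu[:, None]

# Largest kappa allowed by the constraint, capped to stay inside (0, 1).
mask = P_pi > 0
ratios = np.broadcast_to(d_mu, P_pi.shape)[mask] / P_pi[mask]
kappa = min(ratios.min(), 0.99)

E = np.outer(np.ones(m + 2), d_mu)          # e d_mu^T, rank one, row-stochastic
R = (E - kappa * P_pi) / (1 - kappa)
assert np.all(R >= -1e-12)                  # entrywise nonnegative
assert np.allclose(R.sum(axis=1), 1.0)      # row-stochastic
assert np.allclose(kappa * P_pi + (1 - kappa) * R, E)   # P_mu = e d_mu^T
```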
For the reported run, the initial state is fixed to be $a_{1}$. We use the smallest instance, $m=23$, and choose $\eta=2\alpha=\frac{1}{275}$, which lies in the unstable interval $(\alpha,3\alpha)$ for the global clock. We then run algorithm ([3](https://arxiv.org/html/2605.06874#S2.E3)) with both the global clock and the local clock in the learning rates, using the base sequence $\alpha_{n}=0.45/(10^{4}+n)^{0.6}$. The global-clock run uses $n=t+1$, while the local-clock run uses $n=\nu(S_{t},t)$. We choose $v_{0}$ as the real part of an eigenvector corresponding to the eigenvalue of $A_{\eta}$ with the smallest real part, normalized so that $\|v_{0}\|_{2}=1$, and set $\hat{J}_{0}=\eta e^{\top}v_{0}$. With this particular choice of $\hat{J}_{0}$, the results in Wan et al. ([2021](https://arxiv.org/html/2605.06874#bib.bib31)) show that differential TD with a local clock should converge to $0$ for any $\eta>0$. We run the algorithm for $10^{11}$ time steps and plot both $\|v_{t}\|_{2}$ and $\operatorname{dist}(v_{t},\operatorname{span}\{e\})$. Since the reward is always $0$, the latter is the distance from $v_{t}$ to $\{\bar{v}_{\pi}+ce:c\in\mathbb{R}\}$. The result in Figure [2](https://arxiv.org/html/2605.06874#S5.F2) shows that the local-clock recursion contracts to $0$ while the global-clock recursion grows, matching the exact stability calculation above.
Figure 2: Differential TD with $\eta=2\alpha$. Left: $\|v_{t}\|_{2}$. Right: distance from $v_{t}$ to $\operatorname{span}\{e\}$. The global-clock trajectory is oscillatory because the unstable mode is the conjugate pair shown in Figure [1](https://arxiv.org/html/2605.06874#S5.F1). The curve here is a typical run with a single seed, because we are interested in sample-path behavior. We provide 9 more seeds in Appendix [B](https://arxiv.org/html/2605.06874#A2), which show the same phenomenon.
## 6 Related Work

The closest work to ours is Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)), which also shows that an average-reward algorithm converges with a local clock but can diverge with a global clock. However, there are several differences between this work and Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)). First, Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) considers only learning rates of the form $\alpha_{t}=\mathcal{O}(t^{-1})$. By using a local clock with this form of learning rate, Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) implicitly approximates the state-action visitation distribution and thus implicitly applies importance sampling. Their insights and analyses do not apply to learning rates that do not have this form, such as $\alpha_{t}=\mathcal{O}(t^{-\beta})$ for some $\beta\in(1/2,1)$. By contrast, our analysis applies to any learning rate that satisfies the standard Robbins-Monro conditions, including those with $\beta\in(1/2,1)$. Second, Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) focuses on running standard discounted $Q$-learning with $\gamma=1$ directly in the average-reward setting. With the local clock implicitly applying importance sampling, their algorithm converges to the *set* of optimal action-value functions, i.e., the $q$-value estimates themselves are not guaranteed to converge to a unique limit. Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) therefore has to introduce a new centered version for convergence to a unique limit. By contrast, the differential TD algorithm we analyze is designed for the average-reward setting and implicitly applies the reference-state technique from RVI $Q$-learning, which directly guarantees convergence to a unique limit.

Nevertheless, one may ask whether the insights and techniques of Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) can be applied toward understanding the open problems in Wan et al. ([2021](https://arxiv.org/html/2605.06874#bib.bib31)); Blaser et al. ([2026](https://arxiv.org/html/2605.06874#bib.bib7)). Our assessment is negative. At most, Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) can provide an alternative convergence proof for differential TD with a local clock and $\alpha_{t}=\mathcal{O}(t^{-1})$. Combining the results from Blaser et al. ([2026](https://arxiv.org/html/2605.06874#bib.bib7)) and this work, it is easy to see that the effect of the global clock interacts with the hyperparameter $\eta$, but Chen ([2025](https://arxiv.org/html/2605.06874#bib.bib10)) does not have a mechanism that accounts for $\eta$.

This work studies the divergence of differential TD through the lens of the instability of the associated ODE. It is important to note that even if the ODE is not globally asymptotically stable, an algorithm can still converge in a weaker sense, e.g., convergence to a set of solutions (Xie et al., [2025](https://arxiv.org/html/2605.06874#bib.bib35); Wang and Zhang, [2026](https://arxiv.org/html/2605.06874#bib.bib32)), convergence to a bounded set (Meyn, [2024](https://arxiv.org/html/2605.06874#bib.bib23); Liu et al., [2025c](https://arxiv.org/html/2605.06874#bib.bib20), [b](https://arxiv.org/html/2605.06874#bib.bib19)), or convergence to a sample-path-dependent limit (Blaser and Zhang, [2026](https://arxiv.org/html/2605.06874#bib.bib6)). However, as demonstrated by our experiments, the instability of the ODE we encounter does not appear benign. That said, proving that differential TD with a global clock can diverge almost surely remains an open problem.
## 7 Conclusion

This work shows that the choice between local and global clocks can affect the stability of average-reward RL algorithms in an arguably surprising way: for some $\eta>0$, the differential TD algorithm can be stable with a local clock yet unstable with a global clock, a phenomenon that has never been reported in the discounted setting. Our result shows that, in average-reward RL, replacing local clocks with the practically standard global clock is not a benign implementation detail: a theorem proved with local clocks may be misleading for the algorithm people actually implement. This work also contributes to rank-one perturbation theory (Moro and Dopico, [2003](https://arxiv.org/html/2605.06874#bib.bib24); Savchenko, [2004](https://arxiv.org/html/2605.06874#bib.bib28); Ding and Zhou, [2007](https://arxiv.org/html/2605.06874#bib.bib12); Mehl et al., [2011](https://arxiv.org/html/2605.06874#bib.bib21); Ran and Wojtylak, [2012](https://arxiv.org/html/2605.06874#bib.bib26); Fourie et al., [2013](https://arxiv.org/html/2605.06874#bib.bib14); Mehl et al., [2014](https://arxiv.org/html/2605.06874#bib.bib22); Ran and Wojtylak, [2021](https://arxiv.org/html/2605.06874#bib.bib27)). In particular, Bierkens and Ran ([2014](https://arxiv.org/html/2605.06874#bib.bib5)) study the positive stability of rank-one perturbations of a singular $M$-matrix and establish a set of sufficient conditions. One important result concerns the positive stability of $B_{\eta}\doteq L+\eta uv^{\top}$, where $L$ is a singular $M$-matrix with $0$ an algebraically simple eigenvalue and $u$ and $v$ are elementwise nonnegative. Lemma 2.11 of Bierkens and Ran ([2014](https://arxiv.org/html/2605.06874#bib.bib5)) shows that $B_{\eta}$ is positive stable for sufficiently small $\eta>0$. The positive stability of $B_{\eta}$ for sufficiently large $\eta$ is left as Conjecture 2.17 in Bierkens and Ran ([2014](https://arxiv.org/html/2605.06874#bib.bib5)).

Later, Anehila and Ran ([2022](https://arxiv.org/html/2605.06874#bib.bib2)) showed by counterexample that Conjecture 2.17 of Bierkens and Ran ([2014](https://arxiv.org/html/2605.06874#bib.bib5)) is false. Our counterexample further shows that even under additional conditions, such as $u$ being strictly positive and $v$ being the all-one vector, the positive stability of $B_{\eta}$ does not necessarily hold for all $\eta>0$. Together, these counterexamples testify to the hardness of characterizing the positive stability of rank-one perturbations of a singular $M$-matrix. It remains an open problem whether there is some nontrivial sufficient condition that guarantees the positive stability of $B_{\eta}$ for all sufficiently large $\eta$. Notably, as noted by Bierkens and Ran ([2014](https://arxiv.org/html/2605.06874#bib.bib5)); Anehila and Ran ([2022](https://arxiv.org/html/2605.06874#bib.bib2)), such counterexamples are hard to construct, and random sampling is unlikely to find them. Our experience is the same: we tried to find counterexamples by random sampling and failed, and had to construct the counterexample manually based on the theoretical insights in Theorems [4.3](https://arxiv.org/html/2605.06874#S4.Thmtheorem3) & [4.4](https://arxiv.org/html/2605.06874#S4.Thmtheorem4).
## Acknowledgments and Disclosure of Funding
This work is supported in part by the US National Science Foundation under the awards III\-2128019, SLES\-2331904, and CAREER\-2442098, the Commonwealth Cyber Initiative’s Central Virginia Node under the award VV\-1Q26\-001, and a Cisco Faculty Research Award\.
## References
- Abounadi et al. [2001] Jinane Abounadi, Dimitri Bertsekas, and Vivek S Borkar. Learning algorithms for Markov decision processes with average cost. *SIAM Journal on Control and Optimization*, 2001.
- Anehila and Ran [2022] B Anehila and ACM Ran. A note on a conjecture concerning rank one perturbations of singular M-matrices. *Quaestiones Mathematicae*, 2022.
- Bellman [1957] Richard Bellman. A Markovian decision process. *Journal of Mathematics and Mechanics*, 1957.
- Bertsekas and Tsitsiklis [1996] Dimitri P Bertsekas and John N Tsitsiklis. *Neuro-Dynamic Programming*. Athena Scientific, Belmont, MA, 1996.
- Bierkens and Ran [2014] Joris Bierkens and André Ran. A singular M-matrix perturbed by a nonnegative rank one matrix has positive principal minors; is it D-stable? *Linear Algebra and Its Applications*, 2014.
- Blaser and Zhang [2026] Ethan Blaser and Shangtong Zhang. Asymptotic and finite sample analysis of nonexpansive stochastic approximations with Markovian noise. In *Proceedings of the AAAI Conference on Artificial Intelligence*, 2026.
- Blaser et al. [2026] Ethan Blaser, Jiuqi Wang, and Shangtong Zhang. Almost sure convergence of differential temporal difference learning for average reward Markov decision processes. In *Proceedings of the International Conference on Artificial Intelligence and Statistics*, 2026.
- Borkar et al. [2025] Vivek Borkar, Shuhang Chen, Adithya Devraj, Ioannis Kontoyiannis, and Sean Meyn. The ODE method for asymptotic statistics in stochastic approximation and reinforcement learning. *The Annals of Applied Probability*, 2025.
- Borkar [2009] Vivek S Borkar. *Stochastic Approximation: A Dynamical Systems Viewpoint*. Springer, 2009.
- Chen [2025] Zaiwei Chen. Non-asymptotic guarantees for average-reward Q-learning with adaptive stepsizes. In *Advances in Neural Information Processing Systems*, 2025.
- Chen et al. [2024] Zaiwei Chen, Siva T Maguluri, Sanjay Shakkottai, and Karthikeyan Shanmugam. A Lyapunov theory for finite-sample guarantees of Markovian stochastic approximation. *Operations Research*, 2024.
- Ding and Zhou [2007] Jiu Ding and Aihui Zhou. Eigenvalues of rank-one updated matrices with some applications. *Applied Mathematics Letters*, 2007.
- Fiedler and Pták [1962] Miroslav Fiedler and Vlastimil Pták. On matrices with non-positive off-diagonal elements and positive principal minors. *Czechoslovak Mathematical Journal*, 1962.
- Fourie et al. [2013] JH Fourie, GJ Groenewald, DB Janse van Rensburg, and ACM Ran. Rank one perturbations of H-positive real matrices. *Linear Algebra and Its Applications*, 2013.
- Gantmacher [1980] F.R. Gantmacher. *The Theory of Matrices*. Chelsea Publishing Company, 1980.
- Jaakkola et al. [1993] Tommi Jaakkola, Michael Jordan, and Satinder Singh. Convergence of stochastic iterative dynamic programming algorithms. In *Advances in Neural Information Processing Systems*, 1993.
- Lee and He [2020] Donghwan Lee and Niao He. A unified switching system perspective and convergence analysis of Q-learning algorithms. In *Advances in Neural Information Processing Systems*, 2020.
- Liu et al. [2025a] Shuze Liu, Shuhang Chen, and Shangtong Zhang. The ODE method for stochastic approximation and reinforcement learning with Markovian noise. *Journal of Machine Learning Research*, 2025a.
- Liu et al. [2025b] Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Extensions of the Robbins-Siegmund theorem with applications in reinforcement learning. *ArXiv Preprint*, 2025b.
- Liu et al. [2025c] Xinyu Liu, Zixuan Xie, and Shangtong Zhang. Linear $Q$-learning does not diverge in $L^{2}$: Convergence rates to a bounded set. In *Proceedings of the International Conference on Machine Learning*, 2025c.
- Mehl et al. [2011] Christian Mehl, Volker Mehrmann, André CM Ran, and Leiba Rodman. Eigenvalue perturbation theory of classes of structured matrices under generic structured rank one perturbations. *Linear Algebra and Its Applications*, 2011.
- Mehl et al. [2014] Christian Mehl, Volker Mehrmann, André CM Ran, and Leiba Rodman. Eigenvalue perturbation theory of symplectic, orthogonal, and unitary matrices under generic structured rank one perturbations. *BIT Numerical Mathematics*, 2014.
- Meyn [2024] Sean Meyn. The projected Bellman equation in reinforcement learning. *IEEE Transactions on Automatic Control*, 2024.
- Moro and Dopico [2003] Julio Moro and Froilán M Dopico. Low rank perturbation of Jordan structure. *SIAM Journal on Matrix Analysis and Applications*, 2003.
- Plemmons [1977] Robert J. Plemmons. M-matrix characterizations. I—Nonsingular M-matrices. *Linear Algebra and Its Applications*, 1977.
- Ran and Wojtylak [2012] André CM Ran and Michał Wojtylak. Eigenvalues of rank one perturbations of unstructured matrices. *Linear Algebra and Its Applications*, 2012.
- Ran and Wojtylak [2021] André CM Ran and Michał Wojtylak. Global properties of eigenvalues of parametric rank one perturbations for unstructured and structured matrices. *Complex Analysis and Operator Theory*, 2021.
- Savchenko [2004] Sergey Valerievich Savchenko. On the change in the spectral properties of a matrix under perturbations of sufficiently low rank. *Functional Analysis and Its Applications*, 2004.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. *Reinforcement Learning: An Introduction (2nd Edition)*. MIT Press, 2018.
- Tsitsiklis [1994] John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning. *Machine Learning*, 1994.
- Wan et al. [2021] Yi Wan, Abhishek Naik, and Richard S. Sutton. Learning and planning in average-reward Markov decision processes. In *Proceedings of the International Conference on Machine Learning*, 2021.
- Wang and Zhang [2026] Jiuqi Wang and Shangtong Zhang. Almost sure convergence of linear temporal difference learning with arbitrary features. *Journal of Machine Learning Research*, 2026.
- Watkins and Dayan [1992] Christopher JCH Watkins and Peter Dayan. Q-learning. *Machine Learning*, 1992.
- Watkins [1989] Christopher John Cornish Hellaby Watkins. *Learning from Delayed Rewards*. PhD thesis, King's College, Cambridge, 1989.
- Xie et al. [2025] Zixuan Xie, Xinyu Liu, Rohan Chandra, and Shangtong Zhang. Finite sample analysis of linear temporal difference learning with arbitrary features. In *Advances in Neural Information Processing Systems*, 2025.
## Appendix A Small-$\eta$ positive stability

###### Lemma A.1 (Small-$\eta$ positive stability).

Suppose that $P_{\pi}$ is irreducible and $d_{\mu}(s)>0$ for every state $s$. Then there exists $\eta_{0}>0$ such that
$$A_{\eta}=D_{\mu}(I-P_{\pi}+\eta ee^{\top})$$
is positive stable for every $\eta\in(0,\eta_{0}]$.
###### Proof.

Recall that $L\doteq D_{\mu}(I-P_{\pi})$, so $A_{\eta}=L+\eta d_{\mu}e^{\top}$. Lemma [4.2](https://arxiv.org/html/2605.06874#S4.Thmtheorem2) shows that every nonzero eigenvalue of $L$ has strictly positive real part. It remains to understand the zero eigenvalue of $L$. Since $D_{\mu}$ is invertible, $\ker(L)=\ker(I-P_{\pi})$. Because $P_{\pi}$ is irreducible and row-stochastic, $\ker(I-P_{\pi})=\operatorname{span}\{e\}$ by the Perron-Frobenius theorem.

Let $d_{\pi}$ be the stationary distribution of $P_{\pi}$ and define
$$\ell^{\top}\doteq d_{\pi}^{\top}D_{\mu}^{-1}.$$
Then
$$\ell^{\top}L=d_{\pi}^{\top}(I-P_{\pi})=0,\qquad\ell^{\top}e=d_{\pi}^{\top}D_{\mu}^{-1}e>0.$$
The zero eigenvalue of $L$ is algebraically simple. Suppose otherwise. Since its eigenspace $\ker(L)=\operatorname{span}\{e\}$ is one-dimensional, there would exist a generalized eigenvector $y$ with $Ly=e$. Multiplying by $\ell^{\top}$ yields
$$0=\ell^{\top}Ly=\ell^{\top}e>0,$$
which is a contradiction.
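As a numerical cross-check outside the proof, these properties of $L$ can be confirmed for the counterexample instance; the sketch below (assuming `numpy`, $m=23$, and the state ordering from Section 4) verifies that $0$ is an algebraically simple eigenvalue of $L$, that $e\in\ker(L)$, and that all other eigenvalues have positive real part:

```python
import numpy as np

# Check: 0 is an algebraically simple eigenvalue of L = D_mu (I - P_pi) for
# the counterexample instance, e spans ker(L), and all other eigenvalues of L
# have strictly positive real part (consistent with Lemma 4.2).
m = 23
alpha = (m - 22) / (m**2 + m - 2)
d_mu = np.full(m + 2, alpha); d_mu[-1] = 1 - (m + 1) * alpha
I = np.eye(m + 2)
Q = np.zeros((m + 2, m + 2))
Q[:m, m] = 1.0; Q[m, m + 1] = 1.0; Q[m + 1, :m] = 1.0 / m
P_pi = I + alpha * (Q - I) / d_mu[:, None]
L = np.diag(d_mu) @ (I - P_pi)

eigvals = np.linalg.eigvals(L)
assert np.sum(np.abs(eigvals) < 1e-8) == 1               # simple zero eigenvalue
assert np.allclose(L @ np.ones(m + 2), 0.0)              # e is in ker(L)
assert np.all(eigvals[np.abs(eigvals) > 1e-8].real > 0)  # rest in open RHP
```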
We now compute how this simple zero eigenvalue moves under the perturbation. Define
$$F(x,\lambda,\eta)\doteq\begin{pmatrix}A_{\eta}x-\lambda x\\ \ell^{\top}x-\ell^{\top}e\end{pmatrix}.$$
At $(x,\lambda,\eta)=(e,0,0)$, we have $F(e,0,0)=0$. The derivative of $F$ with respect to $(x,\lambda)$ at this point maps $(y,\theta)$ to
$$\begin{pmatrix}Ly-\theta e\\ \ell^{\top}y\end{pmatrix}.$$
This map is injective: if $Ly-\theta e=0$ and $\ell^{\top}y=0$, then multiplying the first equation by $\ell^{\top}$ gives $-\theta\ell^{\top}e=0$, so $\theta=0$. Then $Ly=0$, hence $y=ce$, and $\ell^{\top}y=0$ gives $c=0$. Since the domain and codomain have the same finite dimension, the map is invertible.
By the finite-dimensional implicit function theorem, applied after identifying complex vectors with real vectors, there are differentiable functions $x(\eta)$ and $\lambda(\eta)$ near $0$ such that
$$A_{\eta}x(\eta)=\lambda(\eta)x(\eta),\qquad x(0)=e,\qquad\lambda(0)=0.$$
Differentiating at $\eta=0$ yields
$$d_{\mu}e^{\top}e+Lx^{\prime}(0)=\lambda^{\prime}(0)e.$$
Multiplying by $\ell^{\top}$ gives
$$\lambda^{\prime}(0)=\frac{\ell^{\top}d_{\mu}\,e^{\top}e}{\ell^{\top}e}=\frac{|\mathcal{S}|}{d_{\pi}^{\top}D_{\mu}^{-1}e}>0,$$
because $\ell^{\top}d_{\mu}=d_{\pi}^{\top}D_{\mu}^{-1}d_{\mu}=d_{\pi}^{\top}e=1$ and $e^{\top}e=|\mathcal{S}|$. Therefore the zero eigenvalue of $L$ moves into the open right half plane for all sufficiently small positive $\eta$. All other eigenvalues of $L$ already have strictly positive real part by Lemma [4.2](https://arxiv.org/html/2605.06874#S4.Thmtheorem2), and they remain in the open right half plane for sufficiently small $\eta>0$ by continuity of the roots of the characteristic polynomial. Hence there exists $\eta_{0}>0$ such that $A_{\eta}$ is positive stable for every $\eta\in(0,\eta_{0}]$. ∎
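The derivative formula $\lambda^{\prime}(0)=|\mathcal{S}|/(d_{\pi}^{\top}D_{\mu}^{-1}e)$ can be checked by a finite difference; the sketch below (a numerical sanity check, not part of the proof, assuming `numpy`, $m=23$, and the counterexample instance from Section 4) compares the predicted slope with the observed motion of the perturbed zero eigenvalue:

```python
import numpy as np

# Finite-difference check of lambda'(0) = |S| / (d_pi^T D_mu^{-1} e)
# on the counterexample instance (m = 23, so |S| = m + 2 = 25).
m = 23
alpha = (m - 22) / (m**2 + m - 2)
d_mu = np.full(m + 2, alpha); d_mu[-1] = 1 - (m + 1) * alpha
I = np.eye(m + 2)
Q = np.zeros((m + 2, m + 2))
Q[:m, m] = 1.0; Q[m, m + 1] = 1.0; Q[m + 1, :m] = 1.0 / m
P_pi = I + alpha * (Q - I) / d_mu[:, None]

# Stationary distribution of P_pi: left Perron eigenvector, normalized.
w, V = np.linalg.eig(P_pi.T)
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))])
d_pi /= d_pi.sum()

predicted = (m + 2) / (d_pi @ (1.0 / d_mu))   # |S| / (d_pi^T D_mu^{-1} e)

# The perturbed zero eigenvalue of A_eta is its eigenvalue of smallest modulus
# (all other eigenvalues of L have modulus at least alpha).
eps = 1e-8
A = np.diag(d_mu) @ (I - P_pi + eps * np.ones((m + 2, m + 2)))
lam = np.linalg.eigvals(A)
observed = np.real(lam[np.argmin(np.abs(lam))]) / eps
assert np.isclose(observed, predicted, rtol=1e-3)
```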
## Appendix B Additional Online TD Runs

Figures [3](https://arxiv.org/html/2605.06874#A2.F3) and [4](https://arxiv.org/html/2605.06874#A2.F4) repeat the differential TD experiment from Figure [2](https://arxiv.org/html/2605.06874#S5.F2) for nine additional random seeds.

Figure 3: Additional differential TD runs with $\eta=2\alpha$, showing $\|v_{t}\|_{2}$.
Figure 4: Additional differential TD runs with $\eta=2\alpha$, showing the distance from $v_{t}$ to $\operatorname{span}\{e\}$.