Exploring Starts Are Not Enough: Counterexamples and a Fix for Monte Carlo Exploring Starts
Summary
This paper presents counterexamples showing that Monte Carlo Exploring Starts can converge to suboptimal solutions in tabular reinforcement learning, and provides a modification that guarantees convergence to optimality by scaling learning rates inversely to update frequencies.
View Cached Full Text
Cached at: 06/16/26, 11:39 AM
# Counterexamples and a Fix for Monte Carlo Exploring Starts
Source: [https://arxiv.org/html/2606.15247](https://arxiv.org/html/2606.15247)
Octave Oliviers Department of Engineering University of Cambridge Cambridge, UK ofao2@cam\.ac\.uk &Glenn Vinnicombe Department of Engineering University of Cambridge Cambridge, UK gv103@cam\.ac\.uk
###### Abstract
The asymptotic behaviour of Monte Carlo Exploring Starts \(MCES\) is a long\-standing open question in reinforcement learning, even in the tabular setting\. We investigated the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions\. This paper presents new counterexamples for both initial\-visit and first\-visit MCES and gives a convergence\-restoring modification for the initial\-visit case\. We show that stable suboptimal solutions may exist for initial\-visit MCES with sample\-average updates even when greedy actions are updated more often than non\-greedy actions on average\. However, by scaling learning rates inversely to update frequencies on a state\-by\-state basis, convergence to optimality is guaranteed\. Unlike previous uniformisation methods, this modification is applicable to large\-scale problems that require approximating the estimated value function\. We then extend the example to show that sample\-average first\-visit MCES may also converge to suboptimal solutions\. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality\. More broadly, these results highlight that convergence depends critically on the relative size and frequency of updates applied to different actions, making the choice of learning rates and the balance between exploration and exploitation central to the analysis of MCES and the implementation of scalable Monte Carlo control methods\.
## 1Introduction
Reinforcement learning aims to train an agent to behave optimally in a certain environment\. In this paper, we model this interaction as a finite Markov decision process \(MDP\) evolving over discrete steps\. At timett, the agent observes a statest∈𝒮s\_\{t\}\\in\\mathcal\{S\}and selects an actionat∈𝒜\(st\)a\_\{t\}\\in\\mathcal\{A\}\(s\_\{t\}\)according to a policyπ\(at\|st\)\\pi\(a\_\{t\}\|s\_\{t\}\)\. The environment then produces a rewardrtr\_\{t\}and transitions to a new statest\+1s\_\{t\+1\}according to its dynamicsp\(st\+1,rt\|st,at\)p\(s\_\{t\+1\},r\_\{t\}\|s\_\{t\},a\_\{t\}\)\.
For a policyπ\\pi, the action\-value functionqπq\_\{\\pi\}assigns to each pair\(s,a\)\(s,a\)the expected discounted return obtained by taking actionaain statess, and then following policyπ\\pi:
qπ\(s,a\)=𝔼π\[∑t=0∞γtrt\|s0=s,a0=a\]\.q\_\{\\pi\}\(s,a\)=\\mathbb\{E\}\_\{\\pi\}\\left\[\\sum\_\{t=0\}^\{\\infty\}\\gamma^\{t\}r\_\{t\}\\;\\middle\|\\;s\_\{0\}=s,\\ a\_\{0\}=a\\right\]\.Under standard finite\-MDP assumptions, there exists an optimal policyπ∗\\pi\_\{\*\}that satisfiesqπ∗\(s,a\)=maxπqπ\(s,a\)q\_\{\\pi\_\{\*\}\}\(s,a\)=\\max\_\{\\pi\}q\_\{\\pi\}\(s,a\)for all\(s,a\)\(s,a\)\. Moreover,π∗\\pi\_\{\*\}may be chosen to be deterministic and greedy with respect toqπ∗q\_\{\\pi\_\{\*\}\}, meaningπ∗\(s\)∈argmaxa∈𝒜\(s\)qπ∗\(s,a\)\\pi\_\{\*\}\(s\)\\in\\arg\\max\_\{a\\in\\mathcal\{A\}\(s\)\}q\_\{\\pi\_\{\*\}\}\(s,a\)for allss\.
Monte Carlo Exploring Starts \(MCES\) maintains an action\-value estimateqkq\_\{k\}and a policyπk\\pi\_\{k\}greedy with respect toqkq\_\{k\}\. At iterationkk, the algorithm samples an initial state\-action pair according to a distributionσk\\sigma\_\{k\}and simulates an episode starting from that pair:
ℰk:=\{s0,a0,r0,s1,a1,r1,…:\(s0,a0\)∼σk,\(st\+1,rt\)∼p\(⋅,⋅\|st,at\),at=πk\(st\),∀t≥1\}\.\\mathcal\{E\}\_\{k\}:=\\bigg\\\{s\_\{0\},a\_\{0\},r\_\{0\},s\_\{1\},a\_\{1\},r\_\{1\},\\ldots:\(s\_\{0\},a\_\{0\}\)\\sim\\sigma\_\{k\},\\ \(s\_\{t\+1\},r\_\{t\}\)\\sim p\(\\cdot,\\cdot\|s\_\{t\},a\_\{t\}\),\\ a\_\{t\}=\\pi\_\{k\}\(s\_\{t\}\),\\ \\forall\\,t\\geq 1\\bigg\\\}\.It then updates the estimateqkq\_\{k\}at pairs\(s,a\)\(s,a\)appearing inℰk\\mathcal\{E\}\_\{k\}as
qk\+1\(s,a\)=qk\(s,a\)\+ηk\(s,a\)\(q~k\(s,a\)−qk\(s,a\)\),\\displaystyle q\_\{k\+1\}\(s,a\)=q\_\{k\}\(s,a\)\+\\eta\_\{k\}\(s,a\)\\big\(\\tilde\{q\}\_\{k\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\),\(1\)whileqk\+1\(s,a\)=qk\(s,a\)q\_\{k\+1\}\(s,a\)=q\_\{k\}\(s,a\)for non\-updated pairs\. Here,ηk\(s,a\)\\eta\_\{k\}\(s,a\)is a learning rate satisfying the Robbins–Monro conditions∑kηk\(s,a\)=∞\\sum\_\{k\}\\eta\_\{k\}\(s,a\)=\\inftyand∑kηk2\(s,a\)<∞\\sum\_\{k\}\\eta\_\{k\}^\{2\}\(s,a\)<\\inftyfor each\(s,a\)\(s,a\), andq~k\(s,a\)\\tilde\{q\}\_\{k\}\(s,a\)is the observed return from that pair onwards:
q~k\(st,at\)=∑i≥0γirt\+i∀t≥0\.\\displaystyle\\tilde\{q\}\_\{k\}\(s\_\{t\},a\_\{t\}\)=\\sum\_\{i\\geq 0\}\\gamma^\{i\}r\_\{t\+i\}\\qquad\\forall\\,t\\geq 0\.\(2\)We consider two update rules\. The*initial\-visit*\(IV\) method only updatesqkq\_\{k\}at the initial pair\(s0,a0\)\(s\_\{0\},a\_\{0\}\)\. The*first\-visit*\(FV\) method updates each state\-action pair in the episode, using the return from its first occurrence in the episode\. We also consider different choices of learning rates\. The*sample\-average*\(SA\) rule setsηk\(s,a\)=1/nk\(s,a\)\\eta\_\{k\}\(s,a\)=1/n\_\{k\}\(s,a\), wherenk\(s,a\)n\_\{k\}\(s,a\)is the number of times\(s,a\)\(s,a\)has been updated up to and including timekk\. Under this rule,qk\(s,a\)q\_\{k\}\(s,a\)is the sample average of the observed returns for\(s,a\)\(s,a\)up to timekk\. Alternatively, the learning rates\{ηk\}k≥0\\\{\\eta\_\{k\}\\\}\_\{k\\geq 0\}can be independent of\(s,a\)\(s,a\)and just form a scalar sequence\.
To ensure sufficient exploration, the*exploring starts*condition requires thatσk\(s,a\)\>0\\sigma\_\{k\}\(s,a\)\>0for allkkand all\(s,a\)\(s,a\)\. In order to ensure all state\-action pairs are visited infinitely often, we introduce a stricter condition: there exists a constantσmin\>0\\sigma\_\{\\min\}\>0such thatσk\(s,a\)≥σmin\\sigma\_\{k\}\(s,a\)\\geq\\sigma\_\{\\min\}for allkkand all\(s,a\)\(s,a\)\.
For both IV and FV, the observed returnq~k\\tilde\{q\}\_\{k\}is an unbiased estimate ofqπkq\_\{\\pi\_\{k\}\}Singh and Sutton \([1996](https://arxiv.org/html/2606.15247#bib.bib176)\)\. So, defining the zero\-mean perturbationvk\(s,a\):=q~k\(s,a\)−qπk\(s,a\)v\_\{k\}\(s,a\):=\\tilde\{q\}\_\{k\}\(s,a\)\-q\_\{\\pi\_\{k\}\}\(s,a\), equation \([1](https://arxiv.org/html/2606.15247#S1.E1)\) can be written as
qk\+1\(s,a\)=qk\(s,a\)\+ηk\(s,a\)\(qπk\(s,a\)−qk\(s,a\)\+vk\(s,a\)\)\.\\displaystyle q\_\{k\+1\}\(s,a\)=q\_\{k\}\(s,a\)\+\\eta\_\{k\}\(s,a\)\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\+v\_\{k\}\(s,a\)\\big\)\.\(3\)MCES is thus a stochastic approximation to the policy evaluation map associated with the current greedy policy\. The dynamics are fixed as long asqkq\_\{k\}remains in a region where the same policy is greedy\. For a policyπ\\pi, define this region by
Q\(π\)≔\{q:q\(s,π\(s\)\)=maxa∈𝒜\(s\)q\(s,a\),∀s∈𝒮\}\.\\displaystyle Q\(\\pi\)\\coloneqq\\big\\\{q:q\(s,\\pi\(s\)\)=\\max\_\{a\\in\\mathcal\{A\}\(s\)\}q\(s,a\),\\ \\forall\\,s\\in\\mathcal\{S\}\\big\\\}\.\(4\)Discussing the FV/SA variant of MCES, Sutton writes: “It is hard to imagine any RL method simpler or more likely to converge than this…\\ldotsWhile this simplest case remains open we are unlikely to make progress on any control method forλ\>0\\lambda\>0\.”\(Sutton,[1999](https://arxiv.org/html/2606.15247#bib.bib151), p\. 13\)\(Here,λ\\lambdarefers to the parameter in Q\(λ\\lambda\), TD\(λ\\lambda\) etc\.Sutton \([1988](https://arxiv.org/html/2606.15247#bib.bib109)\)\.\) Sutton and Barto later reinforce the importance of this problem by labelling it as “one of the most fundamental open theoretical questions in reinforcement learning\.”\(Sutton and Barto,[2018](https://arxiv.org/html/2606.15247#bib.bib177), p\. 99\)
### 1\.1Previous work
The Robbins–Monro conditions alone do not prevent IV MCES from converging to suboptimal solutionsBertsekas and Tsitsiklis \([1996](https://arxiv.org/html/2606.15247#bib.bib132)\); Wanget al\.\([2022](https://arxiv.org/html/2606.15247#bib.bib144)\)\. In these counterexamples, the sampling distribution is strongly biased towards non\-greedy actions \(i\.e\.σk\(s,a1\)\>σk\(s,a1\)\\sigma\_\{k\}\(s,a\_\{1\}\)\>\\sigma\_\{k\}\(s,a\_\{1\}\)whenqk\(s,a1\)<qk\(s,a2\)q\_\{k\}\(s,a\_\{1\}\)<q\_\{k\}\(s,a\_\{2\}\)\), as described in[Appendix A](https://arxiv.org/html/2606.15247#A1)\. In contrast, with uniform sampling distributions over all state\-action pairs and scalar learning rates, IV MCES converges almost surely to optimalityTsitsiklis \([2002](https://arxiv.org/html/2606.15247#bib.bib140)\); Chen \([2018](https://arxiv.org/html/2606.15247#bib.bib141)\); Liu \([2021](https://arxiv.org/html/2606.15247#bib.bib143)\)\. For large problems, however, sampling uniformly over the state space is not feasible\.
For completeness, we note that MCES also converges to optimality when the MDP exhibits a feedforward structure\(Wanget al\.,[2022](https://arxiv.org/html/2606.15247#bib.bib144); Lubarset al\.,[2021](https://arxiv.org/html/2606.15247#bib.bib154)\), and that more complex multi\-step algorithms manage to recover contraction arguments by combining several multi\-step approximations\(Munoset al\.,[2016](https://arxiv.org/html/2606.15247#bib.bib180)\), or using a look\-ahead mechanism\(Winnicki and Srikant,[2023](https://arxiv.org/html/2606.15247#bib.bib142)\)\. We focus, however, on the behaviour of tabular MCES for general MDPs as even this simple case remains an open question\.
### 1\.2Contributions
We make three contributions\. First, we construct an MDP in which IV/SA MCES does not converge toqπ∗q\_\{\\pi\_\{\*\}\}even when sampling distributions are biased towards greedy actions\. Second, we prove that IV MCES converges toqπ∗q\_\{\\pi\_\{\*\}\}when scaling learning rates inversely to update frequencies on a state\-by\-state basis\. This modification works in large\-scale environments as it does not require uniform sampling over the state space\. Unlike earlier work, the proof derives convergence by analysing the evolution of the policyπk\\pi\_\{k\}instead of the estimateqkq\_\{k\}\. Third, we show that FV/SA MCES can converge to suboptimal solutions when noise is averaged out\. Thus any convergence proof for FV/SA MCES would need to show that the stochastic sampled paths always escape the mean\-field attractors\. Our simulations suggest that this does not happen\. For all simulation horizons we could test, we observed fully sampled runs getting trapped in the same suboptimal basin\. These results provide strong evidence against the global convergence of FV/SA MCES in general MDPs\.
## 2Counterexample for initial\-visit MCES with sample average updates
Consider the MDP in[Figure 1](https://arxiv.org/html/2606.15247#S2.F1)withγ=0\.9\\gamma=0\.9\. Each state has two actions: "move" yields a reward of one while "stay" yields no reward\. The policyπ∗\\pi\_\{\*\}chooses "move" everywhere, denoted bya∗a\_\{\*\}, whileπ¯\\bar\{\\pi\}chooses "stay", denoted bya¯\\bar\{a\}\.
Figure 1:The sampling strategy described on page[1](https://arxiv.org/html/2606.15247#S2.I1.i1)forces IV MC\-O\-PI to cycle indefinitely between six suboptimal policies for every initial condition in𝒬0\\mathcal\{Q\}\_\{0\}\. This figure depicts the expected cycle in the policies\{πk\}k≥0\\\{\\pi\_\{k\}\\\}\_\{k\\geq 0\}\(middle\) and in the estimates\{qk\}k≥0\\\{q\_\{k\}\\\}\_\{k\\geq 0\}\(bottom\)\. The "move" action correspond toa∗a\_\{\*\}, and "stay" toa¯\\bar\{a\}\. The dotted lines are the boundaries between regions, namely whereq\(s,a∗\)=q\(s,a¯\)q\(s,a\_\{\*\}\)=q\(s,\\bar\{a\}\)\.The following sampling strategy forces IV MCES to cycle indefinitely between the suboptimal policiesπ1\\pi\_\{1\},π2\\pi\_\{2\},π3\\pi\_\{3\},π4\\pi\_\{4\},π5\\pi\_\{5\}andπ6\\pi\_\{6\}for every initial condition in the region𝒬0\\mathcal\{Q\}\_\{0\}, which is the intersection between the regions of these six policies and the shaded region in[Figure 1](https://arxiv.org/html/2606.15247#S2.F1):
1. 1\.Whenqk∈Q\(π1\)q\_\{k\}\\in Q\(\\pi\_\{1\}\), start each episode in\(s2,a∗\)\(s\_\{2\},\{a\}\_\{\*\}\)or\(s3,a∗\)\(s\_\{3\},\{a\}\_\{\*\}\)with a probability of 0\.4, or in\(s2,a¯\)\(s\_\{2\},\\overline\{a\}\)with a probability of 0\.2\. MCES is then expected to cross the boundary in states3s\_\{3\}and transition to policyπ2\\pi\_\{2\}\.
2. 2\.Whenqk∈Q\(π2\)q\_\{k\}\\in Q\(\\pi\_\{2\}\), start each episode in\(s1,a¯\)\(s\_\{1\},\\overline\{a\}\)\. MCES will cross the boundary ins1s\_\{1\}and transition toπ3\\pi\_\{3\}\.
3. 3\.Whenqk∈Q\(π3\)q\_\{k\}\\in Q\(\\pi\_\{3\}\), start each episode in\(s1,a∗\)\(s\_\{1\},\{a\}\_\{\*\}\)or\(s2,a∗\)\(s\_\{2\},\{a\}\_\{\*\}\)with a probability of 0\.4, or in\(s1,a¯\)\(s\_\{1\},\\overline\{a\}\)with a probability of 0\.2\. MCES is then expected to cross the boundary ins2s\_\{2\}and transition toπ4\\pi\_\{4\}\.
4. 4\.Whenqk∈Q\(π4\)q\_\{k\}\\in Q\(\\pi\_\{4\}\), start each episode in\(s3,a¯\)\(s\_\{3\},\\overline\{a\}\)\. MCES will cross the boundary ins3s\_\{3\}and transition toπ5\\pi\_\{5\}\.
5. 5\.Whenqk∈Q\(π5\)q\_\{k\}\\in Q\(\\pi\_\{5\}\), start each episode in\(s1,a∗\)\(s\_\{1\},\{a\}\_\{\*\}\)or\(s3,a∗\)\(s\_\{3\},\{a\}\_\{\*\}\)with a probability of 0\.4, or in\(s3,a¯\)\(s\_\{3\},\\overline\{a\}\)with a probability of 0\.2\. MCES is then expected to cross the boundary ins1s\_\{1\}and transition toπ6\\pi\_\{6\}\.
6. 6\.Whenqk∈Q\(π6\)q\_\{k\}\\in Q\(\\pi\_\{6\}\), start each episode in\(s2,a¯\)\(s\_\{2\},\\overline\{a\}\)\. MCES will cross the boundary ins2s\_\{2\}and transition toπ1\\pi\_\{1\}\.
The cycle is robust to noise caused by sampling the initial state\-action pairs in steps 1, 3 and 5\. For example, if\(s2,a¯\)\(s\_\{2\},\\overline\{a\}\)is updated several times in a row during step 1, the solution might switch to policyπ6\\pi\_\{6\}instead ofπ2\\pi\_\{2\}\. Nevertheless, the update distribution underπ6\\pi\_\{6\}directly corrects this by updating\(s2,a¯\)\(s\_\{2\},\\overline\{a\}\)and driving the solution back toπ1\\pi\_\{1\}\. Thus, the sampling strategy guarantees that, for every initial condition in𝒬0\\mathcal\{Q\}\_\{0\}, the greedy policyπk\\pi\_\{k\}contains at least one suboptimal action but never all three\. As a result, solutions cycle indefinitely with potential pairwise repeats betweenπ1\\pi\_\{1\}andπ6\\pi\_\{6\},π3\\pi\_\{3\}andπ2\\pi\_\{2\}, orπ4\\pi\_\{4\}andπ5\\pi\_\{5\}but without ever visiting the policiesπ∗\{\\pi\}\_\{\*\}orπ¯\\overline\{\\pi\}\. This behaviour does not rely on the exact probabilities above\. It persists when increasing the sampling bias towards greedy actions, for instance when replacing the 2\-to\-1 sampling ratio of\(s2,a∗\)\(s\_\{2\},\{a\}\_\{\*\}\)to\(s2,a¯\)\(s\_\{2\},\\overline\{a\}\)during step 1 with an epsilon\-greedy strategy for small epsilon\.
We incorporate exploring starts by initialising each episode using the sampling strategy above with a probability of 0\.9 or by uniformly sampling a state\-action pair with a probability of 0\.1\. Since the wrong state\-action pairs are now sampled at the wrong time infinitely often, solutions can visit the policiesπ∗\{\\pi\}\_\{\*\}orπ¯\\overline\{\\pi\}\. When they do, we sample the initial state\-action pair uniformly\. Additionally, we introduce stochastic transitions by terminating each episode with a probability of 0\.1 after each action\. The valueqπk\(s,a\)q\_\{\\pi\_\{k\}\}\(s,a\)is then always over\- or underestimated, which yields steps that potentially drive solutions out of𝒬0\\mathcal\{Q\}\_\{0\}\.


Figure 2:Initial\-visit MCES with sample\-average updates can converge to a suboptimal solution whereq\(s,a∗\)=q\(s,a¯\)q\(s,a\_\{\*\}\)=q\(s,\\bar\{a\}\)for all states even when the greedy action is sampled at least as frequently as any other action in each state\. The blue line depicts a representative solution\.Despite the variability of initial state\-action pairs and the noisy action\-value estimates, IV MCES still converges to a suboptimal solution when using SA learning rates\. For instance,99\.6%99\.6\\%of solutions converge to a point whereq\(s,a∗\)=q\(s,a¯\)≈1\.615q\(s,a\_\{\*\}\)=q\(s,\\bar\{a\}\)\\approx 1\.615for all states when starting inq0\(s1,a∗\)=q0\(s2,a∗\)=q0\(s3,a¯\)=2q\_\{0\}\(s\_\{1\},a\_\{\*\}\)=q\_\{0\}\(s\_\{2\},a\_\{\*\}\)=q\_\{0\}\(s\_\{3\},\\bar\{a\}\)=2andq0\(s1,a¯\)=q0\(s2,a¯\)=q0\(s3,a∗\)=1q\_\{0\}\(s\_\{1\},\\bar\{a\}\)=q\_\{0\}\(s\_\{2\},\\bar\{a\}\)=q\_\{0\}\(s\_\{3\},a\_\{\*\}\)=1withn1\(s,a\)=1n\_\{1\}\(s,a\)=1for all state\-action pairs \([Figure 2](https://arxiv.org/html/2606.15247#S2.F2)\)\. The remaining0\.4%0\.4\\%escape toqπ∗q\_\{\\pi\_\{\*\}\}\.
The reason solutions converge to the boundary despite frequently visiting the optimal policyπ∗\\pi\_\{\*\}is as follows\. Consider a solution in the regionQ\(π∗\)Q\(\\pi\_\{\*\}\)near the boundary point1\.6151\.615\. Soqk\(s,a∗\)\>qk\(s,a¯\)q\_\{k\}\(s,a\_\{\*\}\)\>q\_\{k\}\(s,\\bar\{a\}\)for every state, andqk\(s,a\)<qπ∗\(s,a\)q\_\{k\}\(s,a\)<q\_\{\{\\pi\}\_\{\*\}\}\(s,a\)for every state\-action pair\. There are two possible updates\. Updating\(s,a∗\)\(s,a\_\{\*\}\)movesqkq\_\{k\}deeper intoQ\(π∗\)Q\(\\pi\_\{\*\}\)on average asqk\(s,a∗\)q\_\{k\}\(s,a\_\{\*\}\)increases compared toqk\(s,a¯\)q\_\{k\}\(s,\\bar\{a\}\)\. In contrast, updating\(s,a¯\)\(s,\\bar\{a\}\)guides the solution towards the boundary ofQ\(π∗\)Q\(\\pi\_\{\*\}\)asqk\(s,a¯\)q\_\{k\}\(s,\\bar\{a\}\)catches up toqk\(s,a∗\)q\_\{k\}\(s,a\_\{\*\}\)\. If the learning rate is large enough,a¯\\bar\{a\}may even become greedy inss\. Thus, solutions consistently leave the optimal region only if updates to\(s,a¯\)\(s,\\bar\{a\}\)receive larger learning rates or are more frequent on average than updates to\(s,a∗\)\(s,a\_\{\*\}\)\. Precisely this behaviour is induced by the sampling strategy, the initial\-visit updates and the sample\-average learning rates\. Specifically, the sampling strategy combined with initial\-visit updates implies that, for every state,a∗a\_\{\*\}is updated more frequently thana¯\\bar\{a\}when solutions cycle between the policiesπ1\\pi\_\{1\},π2\\pi\_\{2\},π3\\pi\_\{3\},π4\\pi\_\{4\},π5\\pi\_\{5\}andπ6\\pi\_\{6\}\. Therefore,1/nk\(s,a∗\)1/n\_\{k\}\(s,\{a\}\_\{\*\}\)decreases faster than1/nk\(s,a¯\)1/n\_\{k\}\(s,\\overline\{a\}\)for each state\. As a result, when noise pushes solutions into the optimal regionQ\(π∗\)Q\(\\pi\_\{\*\}\), all actions are updated equally frequently but updates to\(s,a¯\)\(s,\\overline\{a\}\)are larger than updates to\(s,a∗\)\(s,\{a\}\_\{\*\}\), which drives solutions out ofQ\(π∗\)Q\(\\pi\_\{\*\}\)\. In other words, the sampling strategy, the initial\-visit updates and the learning rates effectively bias updates towards suboptimal actions whenπk=π∗\\pi\_\{k\}=\{\\pi\}\_\{\*\}even though greedy actions are updated more frequently\. As discussed above, this allows solutions to consistently jump in and out of the optimal region, creating a stable suboptimal boundary region when the learning rates decrease\.
Overall, this example shows that IV/SA MCES may converge to suboptimal solutions even when, in each state, the greedy action is updated at least as frequently as any other action\. The asymptotic behaviour is determined by the interplay between learning rates and within\-state update frequencies, rather than either factor alone\. This suggests a remedy: scale learning rates to offset non\-uniform update frequencies in each state and recover convergence to optimality\.
## 3Fix for initial\-visit MCES
Starting from an initialq0q\_\{0\}, IV MCES repeats the following steps fork=0,1,…,∞k=0,1,\\ldots,\\infty:
1. 1\.Select a deterministic policyπk\(s\)∈argmaxaqk\(s,a\)\\pi\_\{k\}\(s\)\\in\\arg\\max\_\{a\}q\_\{k\}\(s,a\)for allss\.
2. 2\.Sample an initial state\-action pair\(sk,ak\)∼σk\(s\_\{k\},a\_\{k\}\)\\sim\\sigma\_\{k\}\.
3. 3\.Update the estimateqkq\_\{k\}as qk\+1\(s,a\)=\{qk\(s,a\)\+ηk\(s,a\)\(q~k\(s,a\)−qk\(s,a\)\),\(s,a\)=\(sk,ak\)qk\(s,a\)otherwise \.\\displaystyle q\_\{k\+1\}\(s,a\)=\\begin\{cases\}q\_\{k\}\(s,a\)\+\\eta\_\{k\}\(s,a\)\\left\(\\tilde\{q\}\_\{k\}\(s,a\)\-q\_\{k\}\(s,a\)\\right\),&\(s,a\)=\(s\_\{k\},a\_\{k\}\)\\\\ q\_\{k\}\(s,a\)&\\text\{otherwise \.\}\\end\{cases\}\(5\)whereq~k\\tilde\{q\}\_\{k\}is the episode return defined in \([2](https://arxiv.org/html/2606.15247#S1.E2)\)\.
Section[2](https://arxiv.org/html/2606.15247#S2), together with Appendix[A](https://arxiv.org/html/2606.15247#A1)and Example 5\.12 inBertsekas and Tsitsiklis \([1996](https://arxiv.org/html/2606.15247#bib.bib132)\), shows that IV MCES may converge to suboptimal solutions when the sampling distributions are biased towards either greedy or non\-greedy actions\. This section presents a way to compensate for such non\-uniform sampling by appropriately normalizing the learning rates, and thereby restoring convergence to optimality\. The approach relies on the chain rule decomposition of the sampling distribution:
σk\(s,a\)=fk\(s\)gk\(a\|s\)\.\\displaystyle\\sigma\_\{k\}\(s,a\)=f\_\{k\}\(s\)\\,g\_\{k\}\(a\|s\)\.\(6\)
To formalise this correction, letℱk\\mathcal\{F\}\_\{k\}denote the information available after computingqkq\_\{k\}, but before selecting policyπk\\pi\_\{k\}, sampling the initial pair\(sk,ak\)\(s\_\{k\},a\_\{k\}\), and generating the episodeℰk\\mathcal\{E\}\_\{k\}\.
###### Theorem 1\.
Assume
1. \(1\)Discounted or episodic MDP\.Either the discount factor satisfiesγ∈\[0,1\)\\gamma\\in\[0,1\)or each episode of the MDP almost surely reaches a terminal state\.
2. \(2\)Uniform tie\-breaking of greedy policies\.There exists a constantβ\>0\\beta\>0such that, if several policies are greedy with respect toqkq\_\{k\}, ℙ\(πk\(a\|s\)=1∣ℱk\)≥β,∀k≥0,∀s∈𝒮,∀a∈argmax𝒜\(s\)qk\(s,⋅\)\.\\mathbb\{P\}\(\\pi\_\{k\}\(a\|s\)=1\\mid\\mathcal\{F\}\_\{k\}\)\\geq\\beta,\\qquad\\forall\\,k\\geq 0,\\ \\forall\\,s\\in\\mathcal\{S\},\\ \\forall\\,a\\in\\arg\\max\_\{\\mathcal\{A\}\(s\)\}q\_\{k\}\(s,\\cdot\)\.\(7\)
3. \(3\)Strict exploring starts\.The sampling distributions\{σk\}k≥0\\\{\\sigma\_\{k\}\\\}\_\{k\\geq 0\}satisfy σk\(s,a\)≥σmin∀k≥0,∀s∈𝒮,∀a∈𝒜\(s\),\\displaystyle\\sigma\_\{k\}\(s,a\)\\geq\\sigma\_\{\\min\}\\qquad\\forall\\,k\\geq 0,\\ \\forall\\,s\\in\\mathcal\{S\},\\ \\forall\\,a\\in\\mathcal\{A\}\(s\),\(8\)for some constantσmin∈\(0,∞\)\\sigma\_\{\\min\}\\in\(0,\\infty\)\.
4. \(4\)Scalar, Robbins–Monro and comparable learning rates\.The sequence\{αk\}k≥0\\\{\\alpha\_\{k\}\\\}\_\{k\\geq 0\}forms a scalar non\-increasing sequence:αk\+1≤αk\\alpha\_\{k\+1\}\\leq\\alpha\_\{k\}for everyk≥0k\\geq 0\. It satisfies ∑k=0∞αk=∞,∑k=0∞αk2<∞,\\displaystyle\\sum\_\{k=0\}^\{\\infty\}\\alpha\_\{k\}=\\infty,\\qquad\\sum\_\{k=0\}^\{\\infty\}\\alpha\_\{k\}^\{2\}<\\infty,\(9\)and there exist a timeKα<∞K\_\{\\alpha\}<\\inftyand a constantρα∈\(0,1\]\\rho\_\{\\alpha\}\\in\(0,1\]such that αk\+i≥ρααk,∀k≥Kα,∀i∈\{0,…,⌊1/αk⌋\}\.\\displaystyle\\alpha\_\{k\+i\}\\geq\\rho\_\{\\alpha\}\\alpha\_\{k\},\\qquad\\forall\\,k\\geq K\_\{\\alpha\},\\ \\forall\\,i\\in\\\{0,\\dots,\\left\\lfloor 1/\\alpha\_\{k\}\\right\\rfloor\\\}\.\(10\)Given that condition \([9](https://arxiv.org/html/2606.15247#S3.E9)\) requires thatαk→0\\alpha\_\{k\}\\to 0and we study the asymptotic behaviour of MCES, we assume, without loss of generality, thatαk∈\(0,1\]\\alpha\_\{k\}\\in\(0,1\]\.
Then IV MCES converges almost surely toqπ∗q\_\{\\pi\_\{\*\}\}when using learning rates
ηk\(s,a\):=αkgk\(a\|s\)\.\\displaystyle\\eta\_\{k\}\(s,a\):=\\dfrac\{\\alpha\_\{k\}\}\{g\_\{k\}\(a\|s\)\}\.\(11\)
The scaling in \([11](https://arxiv.org/html/2606.15247#S3.E11)\) is a local analogue of SA learning rates\. While1/nk\(s,a\)1/n\_\{k\}\(s,a\)gives a greater weight to actions that have been sampled less often in the past, the factor1/gk\(a\|s\)1/g\_\{k\}\(a\|s\)gives a greater weight to actions which are less likely to be sampled at the current step\. The correction is thus local in time: it compensates for the current sampling bias rather than the historical sampling frequency\.
The key idea in proving[Theorem 1](https://arxiv.org/html/2606.15247#Thmtheorem1)is to analyse the sequence of policies\{πk\}k≥0\\\{\\pi\_\{k\}\\\}\_\{k\\geq 0\}rather than the evolution of the estimates\{qk\}k≥0\\\{q\_\{k\}\\\}\_\{k\\geq 0\}\. We find that the normalization in \([11](https://arxiv.org/html/2606.15247#S3.E11)\) causes the associated mean\-field dynamics to generate monotonically improving policies because it equalises the rate of change across actions on a state\-by\-state basis at every step\. This structure allows us to establish convergence of the original stochastic solutions\. The complete proof is given in[Appendix B](https://arxiv.org/html/2606.15247#A2)\.
###### Proof sketch\.
Equation \([5](https://arxiv.org/html/2606.15247#S3.E5)\) with learning rates as in \([11](https://arxiv.org/html/2606.15247#S3.E11)\) is equivalent to
qk\+1\(s,a\)\\displaystyle q\_\{k\+1\}\(s,a\)=qk\(s,a\)\+αk\(σk\(s,a\)gk\(a\|s\)\(qπk\(s,a\)−qk\(s,a\)\)\+wk\(s,a\)\)\\displaystyle=q\_\{k\}\(s,a\)\+\\alpha\_\{k\}\\left\(\\frac\{\\sigma\_\{k\}\(s,a\)\}\{g\_\{k\}\(a\|s\)\}\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\)\+w\_\{k\}\(s,a\)\\right\)=qk\(s,a\)\+αk\(fk\(s\)\(qπk\(s,a\)−qk\(s,a\)\)\+wk\(s,a\)\)\\displaystyle=q\_\{k\}\(s,a\)\+\\alpha\_\{k\}\\left\(f\_\{k\}\(s\)\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\)\+w\_\{k\}\(s,a\)\\right\)\(12\)for every state\-action pair\. Here,wkw\_\{k\}incorporates the perturbations caused by the episode simulation and the asynchronous updates of state\-action pairs\. Given thatq~k\\tilde\{q\}\_\{k\}is an unbiased estimate ofqπkq\_\{\\pi\_\{k\}\}under initial\-visit updates, it follows that𝔼\[wk\(s,a\)∣ℱk\]=0\\mathbb\{E\}\[w\_\{k\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\]=0\.
Step 1\. Mean\-field generates monotonically improving policies
Consider the mean\-field dynamics underlying \([12](https://arxiv.org/html/2606.15247#S3.E12)\):
qk\+1\(s,a\)=qk\(s,a\)\+αkfk\(s\)\(qπk\(s,a\)−qk\(s,a\)\)\.\\displaystyle q\_\{k\+1\}\(s,a\)=q\_\{k\}\(s,a\)\+\\alpha\_\{k\}f\_\{k\}\(s\)\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\)\.\(13\)Since the new policyπk\+1\\pi\_\{k\+1\}will be greedy with respect toqk\+1q\_\{k\+1\}, we haveqk\+1\(s,πk\+1\(s\)\)≥qk\+1\(s,πk\(s\)\)q\_\{k\+1\}\(s,\\pi\_\{k\+1\}\(s\)\)\\geq q\_\{k\+1\}\(s,\\pi\_\{k\}\(s\)\)for every statess\. Using \([13](https://arxiv.org/html/2606.15247#S3.E13)\), this becomes
qk\(s,πk\+1\(s\)\)\+αkfk\(s\)\(qπk\(s,πk\+1\(s\)\)−qk\(s,πk\+1\(s\)\)\)≥qk\(s,πk\(s\)\)\+αkfk\(s\)\(qπk\(s,πk\(s\)\)−qk\(s,πk\(s\)\)\)\.q\_\{k\}\(s,\\pi\_\{k\+1\}\(s\)\)\+\\alpha\_\{k\}f\_\{k\}\(s\)\\big\(q\_\{\\pi\_\{k\}\}\(s,\\pi\_\{k\+1\}\(s\)\)\-q\_\{k\}\(s,\\pi\_\{k\+1\}\(s\)\)\\big\)\\\\ \\geq q\_\{k\}\(s,\\pi\_\{k\}\(s\)\)\+\\alpha\_\{k\}f\_\{k\}\(s\)\\big\(q\_\{\\pi\_\{k\}\}\(s,\\pi\_\{k\}\(s\)\)\-q\_\{k\}\(s,\\pi\_\{k\}\(s\)\)\\big\)\.Rearranging the inequality yields
qπk\(s,πk\+1\(s\)\)−qπk\(s,πk\(s\)\)≥1−αkfk\(s\)αkfk\(s\)\(qk\(s,πk\(s\)\)−qk\(s,πk\+1\(s\)\)\)\.\\displaystyle q\_\{\\pi\_\{k\}\}\(s,\\pi\_\{k\+1\}\(s\)\)\-q\_\{\\pi\_\{k\}\}\(s,\\pi\_\{k\}\(s\)\)\\geq\\dfrac\{1\-\\alpha\_\{k\}f\_\{k\}\(s\)\}\{\\alpha\_\{k\}f\_\{k\}\(s\)\}\\big\(q\_\{k\}\(s,\\pi\_\{k\}\(s\)\)\-q\_\{k\}\(s,\\pi\_\{k\+1\}\(s\)\)\\big\)\.Sinceπk\\pi\_\{k\}is greedy with respect toqkq\_\{k\}and0<αkfk\(s\)≤10<\\alpha\_\{k\}f\_\{k\}\(s\)\\leq 1for sufficiently largekk, it follows that
qπk\(s,πk\+1\(s\)\)−qπk\(s,πk\(s\)\)≥0\\displaystyle q\_\{\\pi\_\{k\}\}\(s,\\pi\_\{k\+1\}\(s\)\)\-q\_\{\\pi\_\{k\}\}\(s,\\pi\_\{k\}\(s\)\)\\geq 0for every statess\. As a result, the Policy Improvement Theorem guarantees that the new policyπk\+1\\pi\_\{k\+1\}is at least as good as policyπk\\pi\_\{k\}Sutton and Barto \([2018](https://arxiv.org/html/2606.15247#bib.bib177)\)\.
Step 2\. Noise cannot consistently prevent policy improvement
We enforce improving transitions by maintaining a positive distance to "bad" regions using three events with fixed non\-zero probabilities:
1. 1\.*Seed*: Force the distance toO\(αk\)O\(\\alpha\_\{k\}\), with probabilityps\>0p\_\{s\}\>0\.
2. 2\.*Grow*: Bring it toO\(1\)O\(1\)using the mean\-field drift, with probabilitypg\>0p\_\{g\}\>0\.
3. 3\.*Lock\-in*: Keep it strictly positive until the next transition, with probabilitypl\>0p\_\{l\}\>0\.
Since, at any time, the next policy will be better with probability at leastpspgpl\>0p\_\{s\}p\_\{g\}p\_\{l\}\>0, the probability of never reaching and locking\-in to the optimal region grows as\(1−\(pspgpl\)\|𝒫\|\)n\\big\(1\-\(p\_\{s\}p\_\{g\}p\_\{l\}\)^\{\|\\mathcal\{P\}\|\}\\big\)^\{n\}, where𝒫\\mathcal\{P\}is the set of deterministic policies\. This probability tends to zero asnntends to infinity because the number of deterministic policies is finite\. As a result, the complement event, i\.e\. locking into the optimal region andqk→qπ∗q\_\{k\}\\to q\_\{\\pi\_\{\*\}\}, occurs eventually with probability 1\. ∎
Sincegk\(⋅∣s\)g\_\{k\}\(\\cdot\\mid s\)is a distribution over𝒜\(s\)\\mathcal\{A\}\(s\), its values scale as1/\|𝒜\(s\)\|1/\|\\mathcal\{A\}\(s\)\|\. Hence,1/gk\(a∣s\)1/g\_\{k\}\(a\\mid s\)is typically larger in states with more available actions\. If necessary, dividing the learning rates in \([11](https://arxiv.org/html/2606.15247#S3.E11)\) by\|𝒜\(s\)\|\|\\mathcal\{A\}\(s\)\|cancels out this effect\. Since this additional factor is constant across actions within a state, it does not affect the argument\. It only introduces extra state\-dependent scaling factors in the proof\.
Earlier work recovers convergence to optimality by scaling the learning rates with1/σk1/\\sigma\_\{k\}Tsitsiklis \([2002](https://arxiv.org/html/2606.15247#bib.bib140)\); Liu \([2021](https://arxiv.org/html/2606.15247#bib.bib143)\)\. However, this requires knowing the sampling distribution overallstate\-action pairs\. In contrast, the learning rates in \([11](https://arxiv.org/html/2606.15247#S3.E11)\) only require the conditional sampling probability of the initial action given the initial state\. Thus, convergence can still be recovered when the initial\-state distribution is unknown, provided the action distribution within each state is known\. This is often the case in practice, for example when episodes are initialized using an epsilon\-greedy policy\. This weaker condition is more practical for real\-world applications, in particular when the state\-space is large\. For instance, chess hasO\(1045\)O\(10^\{45\}\)different board layouts but onlyO\(101\)O\(10^\{1\}\)legal moves per layout\. Keeping track of a distribution over the joint state\-action space is not feasible, whereas representing the action distribution within a given state is straightforward\.
Earlier work recovers convergence to optimality by making updates effectively uniform over all state\-action pairs, whereas[Theorem 1](https://arxiv.org/html/2606.15247#Thmtheorem1)shows that uniform updates over the actions within each state are sufficient\. The earlier work is based on the theorem of Tsitsiklis inTsitsiklis \([2002](https://arxiv.org/html/2606.15247#bib.bib140)\), which uses commutativity in a central way\. Since this technique does not work when updates are only uniform over the actions within each state, a different mathematical approach is required\.
## 4Counterexample for first\-visit MCES with sample average updates
Under first\-visits, the state\-action pairs updated after an episode depend on the transition dynamics induced by the greedy policy on the MDP\. FV MCES is thus harder to analyse than IV MCES and counterexamples are harder to construct\. This section presents such a counterexample: a cyclic MDP with1313states generalising[Figure 1](https://arxiv.org/html/2606.15247#S2.F1), on which FV/SA MCES frequently converges to a suboptimal solution\. Each state has two actions:a∗a\_\{\*\}gives reward11and moves to the next state with probabilityτ\\tauor terminates the episode with probability1−τ1\-\\tau; actiona¯\\bar\{a\}yields no reward and stays in the same state\.
We found the example by first analysing FV/SA MCES without noise, namely whenqk\(s,a\)q\_\{k\}\(s,a\)is updated towards the exact valueqπk\(s,a\)q\_\{\\pi\_\{k\}\}\(s,a\), andnk\(s,a\)n\_\{k\}\(s,a\)is incremented by the probability of updating\(s,a\)\(s,a\)after an episode simulated with policyπk\\pi\_\{k\}\. ForN=3N=3this deterministic process does not converge to a suboptimal solution\. ForN=4N=4it can oscillate indefinitely between suboptimal policies, but this behaviour is fragile as it requires that the greedy policy changes in two states simultaneously\. ForN≥5N\\geq 5the same type of oscillation can proceed through one\-state policy changes; the oscillation alternates between policies with one and two "stay" actions, as depicted in[Figure 4](https://arxiv.org/html/2606.15247#S4.F4)\. This suboptimal behaviour persists under the noise of the fully sampled FV/SA MCES process for sufficiently late starts, becoming increasingly robust to fluctuations early in the simulation asNNgrows\. We describe below the case N=13\.
The deterministic solutions oscillate indefinitely between the 26 policies when initialising episodes as follows\. At each time step, with probabilityϵES\\epsilon\_\{\\mathrm\{ES\}\}sample the initial state–action pair uniformly amongst the 26 pairs; otherwise, choose it according to the current greedy policy\. If that policy has a single "stay" action atsis\_\{i\}, start the episode in\(si\+6,a¯\)\(s\_\{i\+6\},\\bar\{a\}\)\. If the policy has adjacent "stay" actions atsis\_\{i\}andsi\+1s\_\{i\+1\}, start in\(si−1,a¯\)\(s\_\{i\-1\},\\bar\{a\}\)with probabilityλ\\lambdaor in\(si\+1,a∗\)\(s\_\{i\+1\},a\_\{\*\}\)with probability\(1−λ\)\(1\-\\lambda\)\. All state indices are to be taken modulo 13\. Under this initialisation scheme,[Figure 4](https://arxiv.org/html/2606.15247#S4.F4)shows that the deterministic solutions spiral into a fixed point on the the boundary\. This behaviour is robust as a large set of initial conditions produce this spiral\. We derive the valueq^\\widehat\{q\}in Appendix[C\.2](https://arxiv.org/html/2606.15247#A3.SS2), and show that the subsequence\{qk\+26i\}i≥0\\\{q\_\{k\+26i\}\\\}\_\{i\\geq 0\}moves along a straight line towards that fixed point in Appendix[C\.5](https://arxiv.org/html/2606.15247#A3.SS5)\. As a result, it is guaranteed that the deterministic solutions visits only these 26 policies\.
Figure 3:The MDP forms a connected chain of 13 states, each with two actions\. As in[Figure 1](https://arxiv.org/html/2606.15247#S2.F1), solutions cycle between suboptimal policies that only contain 1 or 2 suboptimal actions\. The suboptimal action propagates backwards relative to the state\-transitions\.
Figure 4:Mean\-field solutions of FV/SA MCES robustly converge to a suboptimal point\(q^,q^\)\(\\widehat\{q\},\\widehat\{q\}\)derived in Appendix[C\.2](https://arxiv.org/html/2606.15247#A3.SS2)\. Solution is initialised with countsn0\(s,a∗\):=10n\_\{0\}\(s,a\_\{\*\}\):=10andn0\(s,a¯\):=5n\_\{0\}\(s,\\bar\{a\}\):=5\. The dotted line corresponds to the boundary whereq\(s1,a∗\)=q\(s1,a¯\)q\(s\_\{1\},a\_\{\*\}\)=q\(s\_\{1\},\\bar\{a\}\)
Fully sampled FV/SA MCES solutions are extremely noisy, due to both sample returns and exploring starts\. As a result, they frequently visit off\-cycle policies\. We then initialise episodes to guide the process back to an on\-cycle policy as follows\. On visiting the optimal all\-"move" policy, we start episodes in the "stay" action of the state for which the action\-values are closest to the boundary\. The first visit structure ensures that "move" is then also updated in that, and potentially downstream, states, and also that "move" actions are updated more often than "stay" actions overall\. The1/nk\(s,a\)1/n\_\{k\}\(s,a\)weighting then results in solutions being pushed towards and eventually across the boundary, creating a "stay" state\. In all other off\-cycle policies, we first consider all states where "stay" is greedy and keep only those with the fewest neighbours that also prefer "stay"\. Among that subset, we choose the state whose action values are closest to the decision boundary, then initialize the episode in the "stay" action of that state\. This eventually results in solutions crossing the boundary in the opposite direction, reducing the number of "stay" states until only a singleton or adjacent pair remain\. Exploring starts remain active during the recovery\. The sampling strategy is thus completely determined by the currentqkq\_\{k\}\(as would be the case for anϵ\\epsilon\-greedy policy, for example\) with a uniform lower bound on the probability of sampling any state\-action pair at each step\.
We simulated40,00040,000solutions of FV/SA MCES withγ=0\.9999\\gamma=0\.9999,τ≈0\.5123\\tau\\approx 0\.5123andϵES=10−4\\epsilon\_\{ES\}=10^\{\-4\}for10910^\{9\}steps for a range of initial countsnk\(s,a¯\)=Mn\_\{k\}\(s,\\bar\{a\}\)=M,nk\(s,a∗\)=2\.01Mn\_\{k\}\(s,a\_\{\*\}\)=2\.01MwithMMfrom30003000to1200012000and found that all but a proportion of approximately\(2200/M\)5\.4\(2200/M\)^\{5\.4\}remained trapped in a shrinking strip around the boundary significantly below the optimal value\. In each run, for approximately14\.9%14\.9\\%of steps the greedy policy was the all\-"move" policyπ∗\\pi\_\{\*\}and for approximately1\.5%1\.5\\%of steps it was another off\-cycle policy\. Thus for a large majority of steps the system is in a cycle phase\. For a representative sample of runs which hadn’t escaped at10910^\{9\}steps we continued the simulation up to101210^\{12\}steps and none escaped\. We then ran88simulations fromM=105M=10^\{5\}and all remained trapped at101210^\{12\}steps\. That is, all trials that survived10910^\{9\}steps went on to survive101210^\{12\}steps\.[Figure 5](https://arxiv.org/html/2606.15247#S4.F5)shows representative behaviour from a start atM=1000M=1000, demonstrating an initial lock in followed by the long1/k1/ktail\. Altogether, these simulations provide strong evidence of high probability non\-convergence to optimality from late starts of FV/SA MCES for this MDP; i\.e\. that, for largeMM, starts fromnk\(s,a¯\)=Mn\_\{k\}\(s,\\bar\{a\}\)=Mescape from the stable suboptimal basin with probabilityO\(M−K\)→0O\(M^\{\-K\}\)\\rightarrow 0forK≈5\.4K\\approx 5\.4\.


Figure 5:First\-visit MCES with sample\-average updates can converge to a suboptimal solution whereq\(s,a∗\)=q\(s,a¯\)q\(s,a\_\{\*\}\)=q\(s,\\bar\{a\}\)for all states\.
## 5Conclusion
We have presented two new counterexamples showing that both initial\-visit and first\-visit MCES can converge to suboptimal solutions, and proposed a practical fix that guarantees convergence to optimality in the initial\-visit case\.
Previous counterexamples for initial\-visit MCES rely on sampling distributions that heavily favour non\-greedy actions at every step\. Our example shows that stable suboptimal solutions exist even when greedy actions are updated more frequently, as typically occurs during the exploitation phase\. The suboptimal behaviour arose because the sample\-average learning rates introduced an effective bias towards non\-greedy actions when solutions were in the optimal region\.
We then presented a practical learning\-rate scheme that restores convergence to optimality despite updating actions at different frequencies\. The key idea is to scale learning rates inversely with update frequencies on a state\-by\-state basis, so that the effective update size is uniform across actions within each state\. Under this correction, the underlying mean\-field generates monotonically improving policies\. Since the stochastic perturbations cannot consistently prevent this improvement, initial\-visit MCES eventually converges to the optimal policy and its action values\.
Finally, we constructed a 13\-state MDP in which first\-visit MCES with sample\-average updates converges to a suboptimal solution\. This example establishes that any convergence proof for first\-visit MCES would need to show that solutions always escape suboptimal mean\-field attractors\. In our simulations this occurs with sharply decreasing frequency for later starts, providing strong evidence that exploring starts alone do not guarantee convergence of Monte Carlo control methods to optimality\.
Overall, these results highlight that the asymptotic behaviour of MCES depends critically on the interplay between learning rates and sampling distributions\. By controlling the frequency and magnitude of updates to each action, these parameters determine the effective update size\. Initial experiments \(not presented here\) suggest that MCES converges to optimality when, within each state, the effective update to the greedy action is at least as large as that to any other action\. Establishing this condition formally is an important direction for future work as it may provide a foundation for analysing scalable MCES variants\.
## References
- \[1\]\(1996\)Neuro\-Dynamic Programming\.Athena Scientific,Belmont, Massachusetts\.Cited by:[Appendix A](https://arxiv.org/html/2606.15247#A1.p1.1),[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p1.2),[§3](https://arxiv.org/html/2606.15247#S3.p2.1)\.
- \[2\]Y\. Chen\(2018\)On the convergence of optimistic policy iteration for stochastic shortest path problem\.Technical reportCited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p1.2)\.
- \[3\]J\. Liu\(2021\-07\)On the convergence of reinforcement learning with Monte Carlo Exploring Starts\.Automatica129,pp\. 109693\.External Links:[Document](https://dx.doi.org/10.1016/J.AUTOMATICA.2021.109693),ISSN 0005\-1098Cited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p1.2),[§3](https://arxiv.org/html/2606.15247#S3.p7.3)\.
- \[4\]J\. Lubars, A\. Winnicki, M\. Livesay, and R\. Srikant\(2021\)Optimistic Policy Iteration for MDPs with Acyclic Transient State Structure\.Technical reportCited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p2.1)\.
- \[5\]R\. Munos, T\. Stepleton, A\. Harutyunyan, and M\. G\. Bellemare\(2016\)Safe and Efficient Off\-Policy Reinforcement Learning\.Advances in Neural Information Processing Systems29\.Cited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p2.1)\.
- \[6\]O\. Oliviers and G\. Vinnicombe\(2026\-06\)Convergence of Monte Carlo Optimistic Policy Iteration: Beyond Uniform State\-Action Updates\.Cited by:[Appendix B](https://arxiv.org/html/2606.15247#A2.p2.3)\.
- \[7\]S\. P\. Singh and R\. S\. Sutton\(1996\-03\)Reinforcement Learning with Replacing Eligibility Traces\.Machine Learning22,pp\. 123–158\.Cited by:[Appendix B](https://arxiv.org/html/2606.15247#A2.2.p2.3),[§1](https://arxiv.org/html/2606.15247#S1.p5.3)\.
- \[8\]R\. S\. Sutton and A\. G\. Barto\(2018\)Reinforcement Learning: An Introduction\.2 edition,The MIT Press\.Cited by:[§1](https://arxiv.org/html/2606.15247#S1.p5.10),[§3](https://arxiv.org/html/2606.15247#S3.3.p3.12)\.
- \[9\]R\. S\. Sutton\(1988\-08\)Learning to predict by the methods of temporal differences\.Machine Learning3\(1\),pp\. 9–44\.External Links:[Document](https://dx.doi.org/10.1007/BF00115009),ISSN 0885\-6125Cited by:[§1](https://arxiv.org/html/2606.15247#S1.p5.10)\.
- \[10\]R\. S\. Sutton\(1999\)Open Theoretical Questions in Reinforcement Learning\.InProceedings of the 4th European Conference on Computational Learning Theory,Berlin, Heidelberg,pp\. 11–17\.External Links:[Document](https://dx.doi.org/10.1007/3-540-49097-3%5F2)Cited by:[§1](https://arxiv.org/html/2606.15247#S1.p5.10)\.
- \[11\]J\. N\. Tsitsiklis\(2002\)On the Convergence of Optimistic Policy Iteration\.Journal of Machine Learning Research3,pp\. 59–72\.Cited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p1.2),[§3](https://arxiv.org/html/2606.15247#S3.p7.3),[§3](https://arxiv.org/html/2606.15247#S3.p8.1)\.
- \[12\]C\. Wang, S\. Yuan, K\. Shao, and K\. Ross\(2022\)On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning\.InThe Tenth International Conference on Learning Representations,Cited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p1.2),[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p2.1)\.
- \[13\]A\. Winnicki and R\. Srikant\(2023\)On The Convergence Of Policy Iteration\-Based Reinforcement Learning With Monte Carlo Policy Evaluation\.InProceedings of the 26th International Conference on Artificial Intelligence and Statistics,Cited by:[§1\.1](https://arxiv.org/html/2606.15247#S1.SS1.p2.1)\.
## Appendix ACounterexample for initial\-visit MCES with non\-greedy bias
Bertsekas and Tsitsiklis show in\[[1](https://arxiv.org/html/2606.15247#bib.bib132), Example 5\.12\]that IV MCES can converge to suboptimal solutions when greedy actions are updated less frequently than non\-greedy actions\. The single\-state MDP in[Figure 6](https://arxiv.org/html/2606.15247#A1.F6)is the simplest such example\.
Episodes are initialised to favour the non\-greedy action, meaning whenqk\(s,π∗\(s\)\)≥qk\(s,π¯\(s\)\)q\_\{k\}\(s,\\pi\_\{\*\}\(s\)\)\\geq q\_\{k\}\(s,\\bar\{\\pi\}\(s\)\)the initial state\-action pair is\(s,π∗\(s\)\)\(s,\\pi\_\{\*\}\(s\)\)with a small probabilityϵ\\epsilonand\(s,π¯\(s\)\)\(s,\\bar\{\\pi\}\(s\)\)with probability1−ϵ1\-\\epsilon, with the probabilities reversed otherwise\.
[Figure 7](https://arxiv.org/html/2606.15247#A1.F7)shows simulations withϵ=0\.1\\epsilon=0\.1\. The streamlines indicate the mean\-field flow\. Solutions starting to the right of the separatrix typically converge toqπ∗q\_\{\\pi\_\{\*\}\}, whereas those starting to the left leave the optimal regionQ\(π∗\)Q\(\\pi\_\{\*\}\)and converge to a boundary point\.
\(a\)MDP
\(b\)Policyπ∗\{\\pi\}\_\{\*\}
\(c\)Policyπ¯\\overline\{\\pi\}
Figure 6:MDP that exhibits a stable suboptimal solution for IV MCES when updates are biased towards non\-greedy actions\.Figure 7:The streamlines of the mean\-field flow indicate that there are two basins of attraction when the non\-greedy bias is large \(ϵ=0\.1\\epsilon=0\.1\)\. Solutions starting to the left of the separatrix tend to converge to a suboptimal fixed point on the boundary when the learning rates are sufficiently small\. Those starting to the right converge toqπ∗q\_\{\{\\pi\}\_\{\*\}\}\. The two simulations of MCES in blue confirm this conclusion\.
## Appendix BProof of[Theorem 1](https://arxiv.org/html/2606.15247#Thmtheorem1)
Define the perturbationvk:=q~k−qπkv\_\{k\}:=\\tilde\{q\}\_\{k\}\-q\_\{\\pi\_\{k\}\}whereq~k\\tilde\{q\}\_\{k\}is the episode return defined in \([2](https://arxiv.org/html/2606.15247#S1.E2)\), as well as the indicatoruk\(s,a\):=𝟏\{\(s,a\)=\(sk,ak\)\}u\_\{k\}\(s,a\):=\\mathbf\{1\}\_\{\\\{\(s,a\)=\(s\_\{k\},a\_\{k\}\)\\\}\}that captures the randomness of asynchronous updates\. Combine these two sources of noise into a single stochastic perturbationwkw\_\{k\}defined as
wk\(s,a\)\\displaystyle w\_\{k\}\(s,a\):=1gk\(a\|s\)\(uk\(s,a\)vk\(s,a\)\+\(uk\(s,a\)−σk\(s,a\)\)\(qπk\(s,a\)−qk\(s,a\)\)\)\.\\displaystyle:=\\frac\{1\}\{g\_\{k\}\(a\|s\)\}\\Big\(u\_\{k\}\(s,a\)v\_\{k\}\(s,a\)\+\\big\(u\_\{k\}\(s,a\)\-\\sigma\_\{k\}\(s,a\)\\big\)\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\)\\Big\)\.\(14\)Equation \([5](https://arxiv.org/html/2606.15247#S3.E5)\) with learning rates as in \([11](https://arxiv.org/html/2606.15247#S3.E11)\) is then equivalent, for every state\-action pair, to
qk\+1\(s,a\)\\displaystyle q\_\{k\+1\}\(s,a\)=qk\(s,a\)\+αkuk\(s,a\)gk\(a\|s\)\(qπk\(s,a\)\+vk\(s,a\)−qk\(s,a\)\)\\displaystyle=q\_\{k\}\(s,a\)\+\\alpha\_\{k\}\\frac\{u\_\{k\}\(s,a\)\}\{g\_\{k\}\(a\|s\)\}\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\+v\_\{k\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\)=qk\(s,a\)\+αk\(fk\(s\)\(qπk\(s,a\)−qk\(s,a\)\)\+wk\(s,a\)\)\\displaystyle=q\_\{k\}\(s,a\)\+\\alpha\_\{k\}\\big\(f\_\{k\}\(s\)\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\)\+w\_\{k\}\(s,a\)\\big\)\(15\)whereπk\\pi\_\{k\}is a deterministic policy that is greedy with respect toqkq\_\{k\}\.
[Theorem 1](https://arxiv.org/html/2606.15247#Thmtheorem1)is an adaptation of Proposition 20 in\[[6](https://arxiv.org/html/2606.15247#bib.bib55)\]\. The event construction and policy\-improvement argument are unchanged: stochastic perturbations cannot consistently undermine the policy improvement driven by the mean\-field dynamics, soπk\\pi\_\{k\}eventually stabilises atπ∗\\pi\_\{\*\}andqk→qπ∗q\_\{k\}\\to q\_\{\\pi\_\{\*\}\}\.
The only modification is the normalized learning rateαk/gk\(a∣s\)\\alpha\_\{k\}/g\_\{k\}\(a\\mid s\)in place ofαk\\alpha\_\{k\}\. Sincegk\(a∣s\)∈\(0,1\]g\_\{k\}\(a\\mid s\)\\in\(0,1\], this rescaling is positive and at least as large asαk\\alpha\_\{k\}, so the induction in Lemma 15 is unaffected\. Lemmas 17 and 18 require only thatwkw\_\{k\}is a martingale difference sequence with bounded second moment; this is verified in the lemma below\.
###### Lemma 1\.
There exists a constantcw∈\[0,∞\)c\_\{w\}\\in\[0,\\infty\)such that
𝔼\[wk\(s,a\)∣ℱk\]=0,𝔼\[wk2\(s,a\)∣ℱk\]≤cw\(1\+‖qk‖2\),∀k≥0,∀s∈𝒮,∀a∈𝒜\(s\)\.\\displaystyle\\mathbb\{E\}\[w\_\{k\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\]=0,\\ \\ \\ \\mathbb\{E\}\[w\_\{k\}^\{2\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\]\\leq c\_\{w\}\(1\+\\\|q\_\{k\}\\\|^\{2\}\),\\ \\ \\ \\forall\\,k\\geq 0,\\,\\forall\\,s\\in\\mathcal\{S\},\\,\\forall\\,a\\in\\mathcal\{A\}\(s\)\.\(16\)
###### Proof\.
Fix a timek≥0k\\geq 0, a states∈𝒮s\\in\\mathcal\{S\}and an actiona∈𝒜\(s\)a\\in\\mathcal\{A\}\(s\)\. Letℱk′\\mathcal\{F\}\_\{k\}^\{\\prime\}represent the available information right after selecting the policyπk\\pi\_\{k\}but before sampling the initial state\-action pair\(sk,ak\)\(s\_\{k\},a\_\{k\}\), soℱk⊆ℱk′\\mathcal\{F\}\_\{k\}\\subseteq\\mathcal\{F\}\_\{k\}^\{\\prime\}\.
Under initial\-visit updates,q~k\\tilde\{q\}\_\{k\}is an unbiased estimate ofqπkq\_\{\\pi\_\{k\}\}with uniformly bounded variance\[[7](https://arxiv.org/html/2606.15247#bib.bib176)\]\. Thus there exists a constantcv∈\[0,∞\)c\_\{v\}\\in\[0,\\infty\)such that
𝔼\[vk\(s,a\)∣ℱk\]=0,𝔼\[vk2\(s,a\)∣ℱk\]≤cv,∀k≥0\.\\displaystyle\\mathbb\{E\}\\\!\\left\[v\_\{k\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\\right\]=0,\\qquad\\mathbb\{E\}\\\!\\left\[v\_\{k\}^\{2\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\\right\]\\leq c\_\{v\},\\qquad\\forall\\,k\\geq 0\.\(17\)Further,uk\(s,a\)vk\(s,a\)=0u\_\{k\}\(s,a\)v\_\{k\}\(s,a\)=0on\{uk\(s,a\)=0\}\\\{u\_\{k\}\(s,a\)=0\\\}\. Hence,
𝔼\[uk\(s,a\)vk\(s,a\)∣ℱk′\]=𝔼\[𝟏\{uk\(s,a\)=1\}𝔼\[vk\(s,a\)∣ℱk′,uk\(s,a\)=1\]∣ℱk′\]=0\.\\displaystyle\\mathbb\{E\}\\big\[u\_\{k\}\(s,a\)v\_\{k\}\(s,a\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\\big\]=\\mathbb\{E\}\\Big\[\\mathbf\{1\}\_\{\\\{u\_\{k\}\(s,a\)=1\\\}\}\\ \\mathbb\{E\}\\big\[v\_\{k\}\(s,a\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\},u\_\{k\}\(s,a\)=1\\big\]\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\\Big\]=0\.\(18\)Also,𝔼\[uk\(s,a\)∣ℱk′\]=σk\(s,a\)\\operatorname\*\{\\mathbb\{E\}\}\[u\_\{k\}\(s,a\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\]=\\sigma\_\{k\}\(s,a\)andσk\(s,a\)\\sigma\_\{k\}\(s,a\)and\(qπk\(s,a\)−qk\(s,a\)\)\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\)areℱk′\\mathcal\{F\}^\{\\prime\}\_\{k\}\-measurable\. It follows that
𝔼\[\(uk\(s,a\)−σk\(s,a\)\)\(qπk\(s,a\)−qk\(s,a\)\)∣ℱk′\]=\(𝔼\[uk\(s,a\)∣ℱk′\]−σk\(s,a\)\)\(qπk\(s,a\)−qk\(s,a\)\)=0\.\\mathbb\{E\}\\\!\\left\[\\big\(u\_\{k\}\(s,a\)\-\\sigma\_\{k\}\(s,a\)\\big\)\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\\right\]\\\\ =\\big\(\\mathbb\{E\}\[u\_\{k\}\(s,a\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\]\-\\sigma\_\{k\}\(s,a\)\\big\)\\big\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\\big\)=0\.\(19\)Sincegk\(a\|s\)g\_\{k\}\(a\|s\)isℱk′\\mathcal\{F\}^\{\\prime\}\_\{k\}\-measurable, dividing \([18](https://arxiv.org/html/2606.15247#A2.E18)\) and \([19](https://arxiv.org/html/2606.15247#A2.E19)\) bygk\(a\|s\)g\_\{k\}\(a\|s\)and summing the results yields𝔼\[wk\(s,a\)∣ℱk′\]=0\\mathbb\{E\}\[w\_\{k\}\(s,a\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\]=0\. Applying the tower property then gives
𝔼\[wk\(s,a\)∣ℱk\]=𝔼\[𝔼\[wk\(s,a\)∣ℱk′\]∣ℱk\]=0\.\\displaystyle\\mathbb\{E\}\[w\_\{k\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\]=\\mathbb\{E\}\\big\[\\mathbb\{E\}\[w\_\{k\}\(s,a\)\\mid\\mathcal\{F\}^\{\\prime\}\_\{k\}\]\\mid\\mathcal\{F\}\_\{k\}\\big\]=0\.\(20\)
Finally,gk2\(a\|s\)≥σmin2g\_\{k\}^\{2\}\(a\|s\)\\geq\\sigma\_\{\\min\}^\{2\}by the strict exploring starts condition in \([8](https://arxiv.org/html/2606.15247#S3.E8)\),uk2\(s,a\)≤1u\_\{k\}^\{2\}\(s,a\)\\leq 1and\(uk\(s,a\)−σk\(s,a\)\)2≤1\(u\_\{k\}\(s,a\)\-\\sigma\_\{k\}\(s,a\)\)^\{2\}\\leq 1for every state\-action pair\. Using\(x\+y\)2≤2x2\+2y2\(x\+y\)^\{2\}\\leq 2x^\{2\}\+2y^\{2\}on \([14](https://arxiv.org/html/2606.15247#A2.E14)\) thus yields
wk2\(s,a\)\\displaystyle w\_\{k\}^\{2\}\(s,a\)≤1σmin2\(2vk2\(s,a\)\+2\(qπk\(s,a\)−qk\(s,a\)\)2\)\\displaystyle\\leq\\frac\{1\}\{\\sigma\_\{\\min\}^\{2\}\}\\big\(2v\_\{k\}^\{2\}\(s,a\)\+2\(q\_\{\\pi\_\{k\}\}\(s,a\)\-q\_\{k\}\(s,a\)\)^\{2\}\\big\)≤1σmin2\(2vk2\(s,a\)\+4qπk2\(s,a\)\+4qk2\(s,a\)\)\.\\displaystyle\\leq\\frac\{1\}\{\\sigma\_\{\\min\}^\{2\}\}\\big\(2v\_\{k\}^\{2\}\(s,a\)\+4q\_\{\\pi\_\{k\}\}^\{2\}\(s,a\)\+4q\_\{k\}^\{2\}\(s,a\)\\big\)\.Since all action\-values are bounded, there exists a constantc𝒫∈\[0,∞\)c\_\{\\mathcal\{P\}\}\\in\[0,\\infty\)such that‖qπ‖2≤c𝒫\\\|q\_\{\\pi\}\\\|^\{2\}\\leq c\_\{\\mathcal\{P\}\}for every policyπ\\pi\. As a result, using \([17](https://arxiv.org/html/2606.15247#A2.E17)\) yields
𝔼\[wk2\(s,a\)∣ℱk\]\\displaystyle\\mathbb\{E\}\\\!\\big\[w\_\{k\}^\{2\}\(s,a\)\\mid\\mathcal\{F\}\_\{k\}\\big\]≤2σmin2\(cv\+2c𝒫\+2‖qk‖2\)\.\\displaystyle\\leq\\frac\{2\}\{\\sigma\_\{\\min\}^\{2\}\}\(c\_\{v\}\+2c\_\{\\mathcal\{P\}\}\+2\\\|q\_\{k\}\\\|^\{2\}\)\.≤max\{2σmin2\(cv\+2c𝒫\),4σmin2\}\(1\+‖qk‖2\)\.\\displaystyle\\leq\\max\\left\\\{\\frac\{2\}\{\\sigma\_\{\\min\}^\{2\}\}\(c\_\{v\}\+2c\_\{\\mathcal\{P\}\}\),\\frac\{4\}\{\\sigma\_\{\\min\}^\{2\}\}\\right\\\}\(1\+\\\|q\_\{k\}\\\|^\{2\}\)\.\(21\)Equations \([20](https://arxiv.org/html/2606.15247#A2.E20)\) and \([21](https://arxiv.org/html/2606.15247#A2.E21)\) confirm that the perturbations\{wk\}k≥0\\\{w\_\{k\}\\\}\_\{k\\geq 0\}satisfy the conditions in \([16](https://arxiv.org/html/2606.15247#A2.E16)\)\. ∎
## Appendix C13\-state First\-Visit MCES example
### C\.1MDP, cycle, and notation
This appendix describes the discounted 13\-state example with exploring starts\. The discount and exploring\-start perturbation level are
γ=0\.9999,εES=10−4\.\\gamma=0\.9999,\\qquad\\varepsilon\_\{\\rm ES\}=10^\{\-4\}\.The "move" action is denoted bya∗a\_\{\*\}and the "stay" action bya¯\\bar\{a\}\. A successful "move" sends
si⟼si\+1,s\_\{i\}\\longmapsto s\_\{i\+1\},with indices understood modulo1313\. See Figure[4](https://arxiv.org/html/2606.15247#S4.F4)\. Actiona∗a\_\{\*\}gives reward11and succeeds with probabilityτ\\tau; on success it moves to the next state, and on failure the episode terminates\. Actiona¯\\bar\{a\}gives reward0and remains in the same state\.
A deterministic greedy policy is represented by its "stay" setS⊆\{s1,…,s13\}S\\subseteq\\\{s\_\{1\},\\ldots,s\_\{13\}\\\}\. Its value function satisfies
VS\(si\)=\{γVS\(si\),si∈S,1\+γτVS\(si\+1\),si∉S,V\_\{S\}\(s\_\{i\}\)=\\begin\{cases\}\\gamma V\_\{S\}\(s\_\{i\}\),&s\_\{i\}\\in S,\\\\\[2\.84526pt\] 1\+\\gamma\\tau V\_\{S\}\(s\_\{i\+1\}\),&s\_\{i\}\\notin S,\\end\{cases\}and the exact policy action values are
qS\(si,a∗\)=1\+γτVS\(si\+1\),qS\(si,a¯\)=γVS\(si\)\.q\_\{S\}\(s\_\{i\},a\_\{\*\}\)=1\+\\gamma\\tau V\_\{S\}\(s\_\{i\+1\}\),\\qquad q\_\{S\}\(s\_\{i\},\\bar\{a\}\)=\\gamma V\_\{S\}\(s\_\{i\}\)\.Becauseγ<1\\gamma<1, "stay" states satisfyVS\(si\)=0V\_\{S\}\(s\_\{i\}\)=0\.
The intended adjacent cycle is
\{s1\},\{s1,s13\},\{s13\},\{s13,s12\},…,\{s2\},\{s2,s1\}\.\\\{s\_\{1\}\\\},\\\{s\_\{1\},s\_\{13\}\\\},\\\{s\_\{13\}\\\},\\\{s\_\{13\},s\_\{12\}\\\},\\ldots,\\\{s\_\{2\}\\\},\\\{s\_\{2\},s\_\{1\}\\\}\.\(22\)Equivalently, the local pattern is
\{si\}⟶\{si,si−1\}⟶\{si−1\}\.\\\{s\_\{i\}\\\}\\longrightarrow\\\{s\_\{i\},s\_\{i\-1\}\\\}\\longrightarrow\\\{s\_\{i\-1\}\\\}\.The normal start laws will be chosen later\.
There are two requirements for a deterministic cycle\. Firstly it must converge to a point on the boundary and secondly it must follow the right set of greedy policies as it does so\. The first condition we call the full cycle balance condition and the second the sign\-margin criterion\. These are the subjects of the next two sections\.
### C\.2General full\-cycle balance equations
We require that, provided the trajectory follows the right policies, it converges onto the boundary\.
Let𝒵=\{\(si,a\):1≤i≤13,a∈\{a∗,a¯\}\}\\mathcal\{Z\}=\\\{\(s\_\{i\},a\):1\\leq i\\leq 13,\\ a\\in\\\{a\_\{\*\},\\bar\{a\}\\\}\\\}be the set of state\-action starts\. For phasejjin the intended cycle, letSjS\_\{j\}be the "stay" set,qj=qSjq\_\{j\}=q\_\{S\_\{j\}\}its exact action\-values, andνj\\nu\_\{j\}its nominal start distribution on𝒵\\mathcal\{Z\}\. We impose a rotational symmetry, thatν\{si\}\(si\+k,a\)\\nu\_\{\\\{s\_\{i\}\\\}\}\(s\_\{i\+k\},a\)is a function ofkkonly, so it is only necessary to specify one distribution over𝒵\\mathcal\{Z\}\. With exploring\-starts the mixture is
νjεES=\(1−εES\)νj\+εESU𝒵,U𝒵\(z\)=126\.\\nu\_\{j\}^\{\\varepsilon\_\{\\rm ES\}\}=\(1\-\\varepsilon\_\{\\rm ES\}\)\\nu\_\{j\}\+\\varepsilon\_\{\\rm ES\}U\_\{\\mathcal\{Z\}\},\\qquad U\_\{\\mathcal\{Z\}\}\(z\)=\{1\\over 26\}\.Letμjz\(s,a\)\\mu\_\{j\}^\{z\}\(s,a\)be the probability of visiting\(s,a\)\(s,a\)when the episode starts fromzzand follows policySjS\_\{j\}after the start action\. The phase\-jjfirst\-visit probability under the perturbed start law is
μj\(s,a\)=∑z∈𝒵νjεES\(z\)μjz\(s,a\)\.\\mu\_\{j\}\(s,a\)=\\sum\_\{z\\in\\mathcal\{Z\}\}\\nu\_\{j\}^\{\\varepsilon\_\{\\rm ES\}\}\(z\)\\mu\_\{j\}^\{z\}\(s,a\)\.
For one complete pass through the 26 intended phases, define
Ba\(s\)=∑j=126μj\(s,a\),Aa\(s\)=∑j=126μj\(s,a\)qj\(s,a\)\.B\_\{a\}\(s\)=\\sum\_\{j=1\}^\{26\}\\mu\_\{j\}\(s,a\),\\qquad A\_\{a\}\(s\)=\\sum\_\{j=1\}^\{26\}\\mu\_\{j\}\(s,a\)q\_\{j\}\(s,a\)\.HereBa\(s\)B\_\{a\}\(s\)is the expected number of first\-visit updates to\(s,a\)\(s,a\)per normal full cycle, andAa\(s\)A\_\{a\}\(s\)is the corresponding expected return\-weighted update mass\.
If fractional expected updates are applied over one full cycle, then
N\+\(s,a\)=N\(s,a\)\+Ba\(s\)N^\{\+\}\(s,a\)=N\(s,a\)\+B\_\{a\}\(s\)and
q\+\(s,a\)=N\(s,a\)q\(s,a\)\+Aa\(s\)N\(s,a\)\+Ba\(s\)\.q^\{\+\}\(s,a\)=\{N\(s,a\)q\(s,a\)\+A\_\{a\}\(s\)\\over N\(s,a\)\+B\_\{a\}\(s\)\}\.Equivalently,
q\+\(s,a\)−q\(s,a\)=Ba\(s\)N\(s,a\)\+Ba\(s\)\(Aa\(s\)Ba\(s\)−q\(s,a\)\)\.q^\{\+\}\(s,a\)\-q\(s,a\)=\{B\_\{a\}\(s\)\\over N\(s,a\)\+B\_\{a\}\(s\)\}\\left\(\{A\_\{a\}\(s\)\\over B\_\{a\}\(s\)\}\-q\(s,a\)\\right\)\.Thus the full\-cycle target for coordinate\(s,a\)\(s,a\)isAa\(s\)/Ba\(s\)A\_\{a\}\(s\)/B\_\{a\}\(s\)\.
The boundary cancellation condition asks that, for every state,
Aa∗\(s\)Ba∗\(s\)=Aa¯\(s\)Ba¯\(s\)=q^\.\{A\_\{a\_\{\*\}\}\(s\)\\over B\_\{a\_\{\*\}\}\(s\)\}=\{A\_\{\\bar\{a\}\}\(s\)\\over B\_\{\\bar\{a\}\}\(s\)\}=\\widehat\{q\}\.\(23\)i\.e\. the target is on the boundary\. By rotational symmetry it is enough to impose the scalar equation at one state,
F:=Aa∗\(s1\)Ba¯\(s1\)−Aa¯\(s1\)Ba∗\(s1\)=0,F:=A\_\{a\_\{\*\}\}\(s\_\{1\}\)B\_\{\\bar\{a\}\}\(s\_\{1\}\)\-A\_\{\\bar\{a\}\}\(s\_\{1\}\)B\_\{a\_\{\*\}\}\(s\_\{1\}\)=0,\(24\)The numerator ofFFis a high order polynomial inτ\\tauand a low order polynomial in each component ofν\\nu\.
The corresponding count ratio is
r=Ba∗\(s\)Ba¯\(s\)\.r=\{B\_\{a\_\{\*\}\}\(s\)\\over B\_\{\\bar\{a\}\}\(s\)\}\.\(25\)Choosing initial counts in this ratio makes the two action coordinates have matching first\-order full\-cycle learning rates:
Ba∗\(s\)N0\(s,a∗\)=Ba¯\(s\)N0\(s,a¯\)\.\{B\_\{a\_\{\*\}\}\(s\)\\over N\_\{0\}\(s,a\_\{\*\}\)\}=\{B\_\{\\bar\{a\}\}\(s\)\\over N\_\{0\}\(s,\\bar\{a\}\)\}\.
### C\.3Phase\-sign margin LP
The balance equations center the deterministic cycle on the boundary, but they do not by themselves ensure that the intended greedy policies occur in the right order\. Let
σj\(s\)=\{\+1,s∈Sj,−1,s∉Sj\.\\sigma\_\{j\}\(s\)=\\begin\{cases\}\+1,&s\\in S\_\{j\},\\\\ \-1,&s\\notin S\_\{j\}\.\\end\{cases\}At the boundary levelq^\\widehat\{q\}, define the scaled expected gap increment in phasejjby
gj\(s\)=μj\(s,a¯\)\(qj\(s,a¯\)−q^\)−μj\(s,a∗\)r\(qj\(s,a∗\)−q^\)\.g\_\{j\}\(s\)=\\mu\_\{j\}\(s,\\bar\{a\}\)\\bigl\(q\_\{j\}\(s,\\bar\{a\}\)\-\\widehat\{q\}\\bigr\)\-\{\\mu\_\{j\}\(s,a\_\{\*\}\)\\over r\}\\bigl\(q\_\{j\}\(s,a\_\{\*\}\)\-\\widehat\{q\}\\bigr\)\.\(26\)Let
Ck\(s\)=∑j<kgj\(s\)C\_\{k\}\(s\)=\\sum\_\{j<k\}g\_\{j\}\(s\)be the cumulative gap displacement before phasekk\. The scaled initial gapd0\(s\)d\_\{0\}\(s\)is then chosen by
maximizeρ,\\displaystyle\\rho,\(27\)overd0∈ℝ13,ρ∈ℝ,\\displaystyle d\_\{0\}\\in\\mathbb\{R\}^\{13\},\\ \\rho\\in\\mathbb\{R\},subject toσk\(s\)\(d0\(s\)\+Ck\(s\)\)≥ρ,k=1,…,26,s=s1,…,s13\.\\displaystyle\\sigma\_\{k\}\(s\)\\bigl\(d\_\{0\}\(s\)\+C\_\{k\}\(s\)\\bigr\)\\geq\\rho,\\qquad k=1,\\ldots,6,\\quad s=s\_\{1\},\\ldots,s\_\{13\}\.A positive optimal valueρ\\rhomeans that the phases are followed in the intended order\.
The initial points are then placed symmetrically around the boundary:
q0\(s,a∗\)=q^−d0\(s\)2N0\(s,a¯\),q0\(s,a¯\)=q^\+d0\(s\)2N0\(s,a¯\)\.q\_\{0\}\(s,a\_\{\*\}\)=\\widehat\{q\}\-\{d\_\{0\}\(s\)\\over 2N\_\{0\}\(s,\\bar\{a\}\)\},\\qquad q\_\{0\}\(s,\\bar\{a\}\)=\\widehat\{q\}\+\{d\_\{0\}\(s\)\\over 2N\_\{0\}\(s,\\bar\{a\}\)\}\.The factor1/21/2splits the desired total action gap equally above and belowq^\\widehat\{q\}\.
### C\.4Rotational start search and selected start law
The initial search for a suitable start distributionν\\nuwas over a wide class of rotationally symmetric distributions\. Distributions with small support were preferred, as these would avoid adding too much extra noise into the full simulations\. We started by considering all13213^\{2\}pairs of deterministic starts\. For each, we solved \([24](https://arxiv.org/html/2606.15247#A3.E24)\) forτ\\tau, and for any roots in\(0,1\]\(0,1\]we then solved \([27](https://arxiv.org/html/2606.15247#A3.E27)\) forρ\\rho\. There were no solutions with a positiveρ\\rho\.
We then broadened the search to distributions with a two point support in the two\-"stay" phases\. That is, for offsetso1,o2,o3o\_\{1\},o\_\{2\},o\_\{3\}we set
ν\{si\}\(o1\)\\displaystyle\\nu\_\{\\\{s\_\{i\}\\\}\}^\{\(o\_\{1\}\)\}=δ\(si\+o1,a¯\),\\displaystyle=\\delta\_\{\(s\_\{i\+o\_\{1\}\},\\bar\{a\}\)\},\(28\)ν\{si,si−1\}\(o2,o3,λ\)\\displaystyle\\nu\_\{\\\{s\_\{i\},s\_\{i\-1\}\\\}\}^\{\(o\_\{2\},o\_\{3\},\\lambda\)\}=λδ\(si\+o2,a¯\)\+\(1−λ\)δ\(si\+o3,a∗\)\.\\displaystyle=\\lambda\\delta\_\{\(s\_\{i\+o\_\{2\}\},\\bar\{a\}\)\}\+\(1\-\\lambda\)\\delta\_\{\(s\_\{i\+o\_\{3\}\},a\_\{\*\}\)\}\.\(29\)For a fixed triple and trialτ\\tau, equation \([24](https://arxiv.org/html/2606.15247#A3.E24)\) is solved for a rootλ∈\[0,1\]\\lambda\\in\[0,1\]and then the sign\-margin LP \([27](https://arxiv.org/html/2606.15247#A3.E27)\) is evaluated\. Theτ\\tauwhich gives the maximum marginρ\\rhois then found by gridding\. The offsets\(o1,o2,o3\)=\(−6,\+1,−1\)\(o\_\{1\},o\_\{2\},o\_\{3\}\)=\(\-6,\+1,\-1\)were chosen as giving a good margin with the correspondingλ\\lambdaclose to 1\.
Thus
ν\{si\}\\displaystyle\\nu\_\{\\\{s\_\{i\}\\\}\}=δ\(si−6,a¯\),\\displaystyle=\\delta\_\{\(s\_\{i\-6\},\\bar\{a\}\)\},\(30\)ν\{si,si−1\}\\displaystyle\\nu\_\{\\\{s\_\{i\},s\_\{i\-1\}\\\}\}=λδ\(si\+1,a¯\)\+\(1−λ\)δ\(si−1,a∗\)\.\\displaystyle=\\lambda\\delta\_\{\(s\_\{i\+1\},\\bar\{a\}\)\}\+\(1\-\\lambda\)\\delta\_\{\(s\_\{i\-1\},a\_\{\*\}\)\}\.\(31\)and the correspondingτ\\tauandλ\\lambdaare given by
τ=0\.512333524979486,λ=0\.923714016854384\.\\tau=0\.512333524979486,\\qquad\\lambda=0\.923714016854384\.The resulting boundary quantities are
q^=1\.971056244749326q∗=11−γτ=2\.050366396036596r=2\.010039006519156\.\\widehat\{q\}=1\.971056244749326\\qquad q\_\{\*\}=\{1\\over 1\-\\gamma\\tau\}=2\.050366396036596\\qquad r=2\.010039006519156\.The sign\-margin LP gives
ρ=8\.527647265071263×10−3\\rho=8\.527647265071263\\times 10^\{\-3\}There were no distributions with a two point support in the single\-"stay" phases and a one point support in the two\-"stay" phases which admitted a positive margin\.
Note that this procedure is not specific to the 13 state MDP, it in fact gives a suitable three point start distribution for the corresponding cycle on examples with 5 states and above, although a larger number of states gives a larger margin\.
Figure[4](https://arxiv.org/html/2606.15247#S4.F4)in the main text shows the resulting deterministic cycle\. Note that the corners of the cycle contract linearly towards\(q^,q^\)\(\\widehat\{q\},\\widehat\{q\}\), and so stay in the right greedy region, thus the cycles will persist indefinitely\. The next section confirms this contraction\.
### C\.5Full\-cycle linear interpolation towardq^\\widehat\{q\}
Note that the full\-cycle balance equations also imply a geometric contraction\. Suppose counts are exactly proportional to the full\-cycle masses,
N0\(s,a∗\)=csBa∗\(s\),N0\(s,a¯\)=csBa¯\(s\)\.N\_\{0\}\(s,a\_\{\*\}\)=c\_\{s\}B\_\{a\_\{\*\}\}\(s\),\\qquad N\_\{0\}\(s,\\bar\{a\}\)=c\_\{s\}B\_\{\\bar\{a\}\}\(s\)\.Then aftermmcomplete normal cycles,
N26m\(s,a\)=\(cs\+m\)Ba\(s\)\.N\_\{26m\}\(s,a\)=\(c\_\{s\}\+m\)B\_\{a\}\(s\)\.UsingAa\(s\)=q^Ba\(s\)A\_\{a\}\(s\)=\\widehat\{q\}B\_\{a\}\(s\), the next full\-cycle endpoint is
q26\(m\+1\)\(s,a\)\\displaystyle q\_\{26\(m\+1\)\}\(s,a\)=\(cs\+m\)Ba\(s\)q26m\(s,a\)\+Aa\(s\)\(cs\+m\+1\)Ba\(s\)\\displaystyle=\{\(c\_\{s\}\+m\)B\_\{a\}\(s\)q\_\{26m\}\(s,a\)\+A\_\{a\}\(s\)\\over\(c\_\{s\}\+m\+1\)B\_\{a\}\(s\)\}\(32\)=cs\+mcs\+m\+1q26m\(s,a\)\+1cs\+m\+1q^\.\\displaystyle=\{c\_\{s\}\+m\\over c\_\{s\}\+m\+1\}q\_\{26m\}\(s,a\)\+\{1\\over c\_\{s\}\+m\+1\}\\widehat\{q\}\.\(33\)Therefore the two\-action point
Pm\(s\)=\(q26m\(s,a∗\),q26m\(s,a¯\)\)P\_\{m\}\(s\)=\\bigl\(q\_\{26m\}\(s,a\_\{\*\}\),q\_\{26m\}\(s,\\bar\{a\}\)\\bigr\)satisfies
Pm\+1\(s\)=\(1−αm\)Pm\(s\)\+αm\(q^,q^\),αm=1cs\+m\+1\.P\_\{m\+1\}\(s\)=\(1\-\\alpha\_\{m\}\)P\_\{m\}\(s\)\+\\alpha\_\{m\}\(\\widehat\{q\},\\widehat\{q\}\),\\qquad\\alpha\_\{m\}=\{1\\over c\_\{s\}\+m\+1\}\.\(34\)Equivalently,
Pm\(s\)−\(q^,q^\)=cscs\+m\(P0\(s\)−\(q^,q^\)\)\.P\_\{m\}\(s\)\-\(\\widehat\{q\},\\widehat\{q\}\)=\{c\_\{s\}\\over c\_\{s\}\+m\}\\bigl\(P\_\{0\}\(s\)\-\(\\widehat\{q\},\\widehat\{q\}\)\\bigr\)\.Thus the full\-cycle endpoints move on the line segment toward the boundary point\. The intermediate phase points need not lie precisely on this line, but deviate only by order1/k1/k\.
### C\.6First\-visit probabilities
Fix an anchor statesis\_\{i\}and writed=\(m−i\)mod13d=\(m\-i\)\\bmod 13, sosm=si\+ds\_\{m\}=s\_\{i\+d\}\. Letμ1ε\\mu\_\{1\}^\{\\varepsilon\}andμ2ε\\mu\_\{2\}^\{\\varepsilon\}denote the perturbed first\-visit probabilities in the one\-"stay" and two\-"stay" phases for the selected start law\. The values for the currentτ\\tauandλ\\lambdaare given in the following table for convenience\.
Table 1:Single\-phase first\-visit probabilities by relative distanceddfor the tunedn=13n=13start law\.
### C\.7Sampled recovery strategy
The balance equations and margin LP describe the intended normal cycle\. In the sampled first\-visit MC run, however, return noise, trajectory noise, and rare exploring starts can move the greedy policy off the intended 26\-phase cycle\. The simulation therefore has to choose start distributions even when the observed greedy "stay"\-set is not one of the intended cycle phases\.
Let
S=\{s:q\(s,a¯\)\>q\(s,a∗\)\}S=\\\{s:q\(s,\\bar\{a\}\)\>q\(s,a\_\{\*\}\)\\\}denote the set of states for which the "stay" action is greedy at the start of an update\. IfSSis one of the intended cycle phases, the controller uses the normal start law from Section[C\.4](https://arxiv.org/html/2606.15247#A3.SS4)\. IfSSis off\-cycle, the rollout policy is the current greedy policy with "stay" setSS, and the controller chooses from the2626state\-action starts as follows:
Let the current action gap be
G\(s\)=q\(s,a¯\)−q\(s,a∗\)\.G\(s\)=q\(s,\\bar\{a\}\)\-q\(s,a\_\{\*\}\)\.IfSSis empty then allG\(s\)G\(s\)are negative, in this case we chooses0=argmaxsG\(s\)s\_\{0\}=\\arg\\max\_\{s\}G\(s\)as the state whose action values are nearest the "move"/"stay" boundary\. For other non\-cycle phases then forsi∈Ss\_\{i\}\\in Swe letm\(si\)=𝟏\(si−1∈M\)\+𝟏\(si\+1∈M\)m\(s\_\{i\}\)=\{\\mathbf\{1\}\}\(s\_\{i\-1\}\\in M\)\+\{\\mathbf\{1\}\}\(s\_\{i\+1\}\\in M\)denote the number of neighbours ofsis\_\{i\}inSSand theSmin=\{s∈S:m\(s\)=minm\}S\_\{\\text\{min\}\}=\\\{s\\in S:m\(s\)=\\min m\\\}\. That is,SminS\_\{\\text\{min\}\}is the subset ofSSconsisting of the states with the minimum number of neighbours \(all operations are modulo 13 and sos1s\_\{1\}ands13s\_\{13\}are considered neighbours\)\. Finally we lets0=argmins∈SminG\(s\)s\_\{0\}=\\arg\\min\_\{s\\in S\_\{\\text\{min\}\}\}G\(s\)be the "stay" state with the fewest number of "stay" neighbours whose action values are closest to the "move"/"stay" boundary\. This guarantees, for example, that ifMMconsists of an adjacent pair and one or more isolated states then one of the isolated states is chosen, and in a run of adjacent states then the two ends are chosen in preference to an internal state\. Finally, the starting pair is chosen as the "stay" action in this state\(s0,a¯\)\(s\_\{0\},\\bar\{a\}\), in an attempt to move the action value across the boundary\. If successful this would create a "stay" state ifSSis empty or reduce a "stay" state, movingSScloser to to a cycle phase, otherwise\. If at the next stepSSis still not a cycle phase then the process is repeated\.
For the simulations we present, approximately83\.6%83\.6\\%of updates are normal on\-cycle updates,14\.9%14\.9\\%are attempted recoveries from the optimal all\-"move" policy and the remaining1\.5%1\.5\\%are recoveries from all other non\-cycle policies\.
ChapGPT 5\.5 Pro was used to help refine and help code this example\. The core simulation was hand\-coded\.Similar Articles
#Exploration: A study of count-based exploration for deep reinforcement learning
OpenAI researchers demonstrate that a simple count-based exploration approach using hash codes can achieve near state-of-the-art performance on high-dimensional deep RL benchmarks, challenging the assumption that count-based methods cannot scale to continuous state spaces.
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
This paper proposes an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high, improving performance on text-based and GUI-based benchmarks.
Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying
This paper introduces ReMax, a new objective for reinforcement learning that induces exploration as an emergent property by evaluating policies based on expected maximum return over multiple samples, without explicit exploration bonuses. The authors derive a policy gradient formulation and propose RePPO, a PPO variant that achieves efficient exploration on MinAtar and Craftax benchmarks.
Evolving Robustness--Exploration Trade-off in Online Reinforcement Learning via Quantile Bayesian Risk MDPs
This paper proposes a quantile Bayesian risk-aware MDP framework for online RL that adaptively balances robustness and exploration over time, providing theoretical regret bounds and demonstrating strong empirical performance.
Some considerations on learning to explore via meta-reinforcement learning
OpenAI researchers introduce E-MAML and E-RL², two meta-reinforcement learning algorithms designed to improve exploration in tasks where discovering optimal policies requires significant exploration. The work demonstrates these algorithms' effectiveness on novel environments including Krazy World and maze tasks.