Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
Summary
This paper introduces Approximate Next Policy Sampling (ANPS) as an alternative to conservative policy updates in deep reinforcement learning. It proposes Stable Value Approximate Policy Iteration (SV-API) and SV-RL, which align training data with the next policy's state distribution to allow for larger and safer policy updates.
Cached at: 05/08/26, 07:31 AM
# Replacing Conservative Target Policy Updates in Deep RL
Source: [https://arxiv.org/html/2605.05481](https://arxiv.org/html/2605.05481)
Dillon Sandhu, Ronald Parr
Keywords: Foundations, Policy Improvement
Summary: Conservative policy updates (small changes in policy space) are an integral part of most modern RL algorithms. Their origins trace back to a classic "chicken-and-egg" problem in policy improvement: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy, which is unknown during training. To deal with this, conservative methods constrain the policy update, aiming to ensure that the updated state-visitation distribution stays similar to the training data.

This paper explores an alternative, which we call Approximate Next Policy Sampling (ANPS): shift the training data to be similar to the future policy's state distribution. ANPS aims to improve the value function estimate at the states that are most important to the policy update. In contrast, conservative updates constrain the policy to remain where the value function estimate is already assumed to be trustworthy. We present Stable Value Approximate Policy Iteration (SV-API), a lightweight modification to standard approximate policy iteration algorithms that implements ANPS. SV-API holds the target policy fixed while an iteratively updated behavioral policy gathers relevant experience, only committing to a policy update once the value estimates have stabilized. SV-API matches or exceeds the performance of existing methods on high-dimensional discrete (Atari) and continuous control, while making substantially larger target policy updates. This demonstrates the viability of ANPS as an alternative approach to a classic challenge in RL.
Contribution(s)

1. Proposes Approximate Next Policy Sampling: an alternative to conservative policy updates that explicitly aligns the training distribution with the next policy's state visitation distribution, rather than constraining the policy update.
   Context: As implemented, ANPS allows for unconstrained target policy updates, but must constrain the behavioral policy updates.
2. A general bound (Theorem [3.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3)) that isolates the importance of the next policy's distribution. This shows why conservative updates are only one possible solution to the distribution mismatch problem raised by kakade\_and\_langford:CPI.
   Context: This bound follows straightforwardly from the Performance Difference Lemma (kakade\_and\_langford:CPI).
3. A bound (Theorem [4.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)) proving that our algorithm guarantees policy improvement by controlling training error and behavioral divergence.
   Context: Like standard related bounds (kakade\_and\_langford:CPI, pmlr-v37-schulman15), it assumes bounded value error, which is not guaranteed for deep RL.
4. A practical wrapper (SV-RL) applicable to standard online RL algorithms (e.g., PPO) that implements ANPS by decoupling the behavioral and target policies. We provide empirical evidence that this allows algorithms to hold the target policy fixed, gather relevant data, and safely execute larger policy jumps.
   Context: SV-RL is an initial implementation of ANPS. It relies on a proxy metric (stability of the value estimates) to determine when the value function is accurate enough to update the target policy. Furthermore, it requires off-policy evaluation since it introduces a separate behavioral policy.
###### Abstract
We revisit a classic "chicken-and-egg" problem in reinforcement learning: to safely improve a policy, the value function must be accurate on the state-visitation distribution of the updated policy. That distribution over states is unknown and cannot be sampled for the purposes of training the value function. Conservative updates solve this problem, but at the cost of shrinking the policy update. This paper explores an alternative solution, Approximate Next Policy Sampling (ANPS), which addresses the problem by modifying the training distribution rather than constraining the policy update. ANPS is satisfied if the distribution of the training data approximates that of the next policy. To demonstrate the feasibility and efficacy of ANPS, we introduce Stable Value Approximate Policy Iteration (SV-API). SV-API modifies the standard approximate policy iteration loop to hold the target policy fixed while an iteratively updated behavioral policy gathers relevant experience. It only commits to a new policy once a convergence criterion has been met. If certain stability criteria are met, the update is guaranteed to be safe; otherwise, it remains no less safe than standard approximate policy iteration. Applying SV-API to PPO yields Stable Value PPO (SV-PPO), which matches or improves performance on high-dimensional discrete (Atari) and continuous control benchmarks while executing substantially larger target policy updates. These results demonstrate the viability of ANPS as a new solution to this classic challenge in RL.
## 1 Introduction
Policy Iteration (howard:dp) has long been the “big hammer” of exact MDP algorithms, often solving MDPs which can fit in memory in a shockingly small number of iterations. Combining policy iteration with value function approximation, however, undermines the fundamental reason policy iteration works: monotonic policy improvement. Various mechanisms for enforcing small policy changes from one iteration to the next have become widely adopted means of mitigating this problem. This paper proposes a novel alternative which, even in its simplest realizations, shows promise.
Value function approximation errors can cause the next policy to be worse than the current one (bertsekas1996neuro). This issue can arise even with a “well behaved” function approximator that minimizes error on its training set; indeed, this behavior is part of the problem. States that are rarely visited under the current policy can have high approximation error, which can lead to detrimental changes to the policy at these states. The new policy may visit such states frequently due to policy changes in other states. The effect is worse performance. Ideally, and somewhat counterintuitively, the best distribution for training the current Q-values is the distribution of the next policy, but that typically isn’t available when the Q-functions for the current policy are learned, creating a chicken-and-egg problem.
kakade\_and\_langford:CPI introduced conservative policy iteration as an approach to manage this problem. They proved that by limiting the policy change, the distributions for the current and next policy would be close, thereby minimizing the problem. Follow-up work, TRPO (pmlr-v37-schulman15), adapted the conservative policy update for actor-critic algorithms, with kakade\_and\_langford:CPI’s result playing a central role in the TRPO update derivation. PPO is designed to emulate TRPO using clipping, an efficient heuristic (schulman2017proximalpolicyoptimizationalgorithms). Other modern RL algorithms also regularize changes to the policy to ensure it stays in the region where the Q-values are assumed to be more trustworthy (mpo, museli).
This paper explores an alternative approach to the distribution mismatch problem: rather than restricting the policy update, our approach actively shifts the data collection to match the next policy. We call it Approximate Next Policy Sampling (ANPS). Since the value function is trained to minimize error on sampled states, ANPS leads to a more trustworthy value estimate at (approximately) the states the next policy will visit. If ANPS successfully samples states that the true next policy visits, and the value learning method is able to lower error on these sampled states, then it resolves the chicken-and-egg problem. We propose and analyze an iterative approach to aligning the distributions that can be implemented as a modest change to existing algorithms. We show that this idea works as a viable alternative to conservative policy updates, even exceeding the performance of conservative policy updates in some cases.
### 1.1 Related Work
Standard Approximate Policy Iteration (API) (bertsekas1996neuro) computes the greedy policy relative to a $Q$-function estimate for a fixed policy $\pi_k$. Unfortunately, policy improvement cannot be guaranteed at each step, meaning the approximately greedy policy $\pi_{k+1}$ may perform worse than $\pi_k$. As such, the algorithm may not converge to a locally optimal policy. Since the greedy policy update is discontinuous, it can lead to a large shift in the state distribution (perkins\_2002).
Softer policy update steps are the most widely used solution to this problem. A stochastic policy is used, and changed smoothly. For example, policy gradient methods with function approximation (Sutton2000) avoid the distribution mismatch problem by using an infinitesimal step in policy space. When the policy change is infinitesimal, the change in performance does not depend on the new state visitation distribution, only that of the present policy.
Conservative policy updates (kakade\_and\_langford:CPI, pmlr-v37-schulman15) allow for discrete, non-infinitesimal jumps. The primary reason for conservative policy updates is mistrust in the value function at the state distribution associated with $\pi_{k+1}$.
The Approximate Next Policy Sampling (ANPS) scheme we introduce does not attempt to ensure $\pi'$ has a similar distribution to $\pi$. Instead, ANPS decouples the target and behavioral policies. Given a prospective next policy $\tilde{\pi}$, ANPS samples it directly in order to estimate performance on $d^{\tilde{\pi}}$. This pushes the “small policy change” requirement from the target policy to the sampling policy, allowing for larger jumps in policy space, reminiscent of the large jumps of classical Policy Iteration.
## 2 Preliminaries
We consider a Markov Decision Process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma, s_0)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, reward function $r: \mathcal{S} \times \mathcal{A} \to [0,1]$, discount factor $\gamma \in [0,1)$, and fixed initial state $s_0$.
A stationary policy $\pi: S \to \Delta_A$ specifies a conditional distribution over actions. We overload notation and sometimes use $\pi(s)$ as shorthand for $\pi(\cdot \mid s)$. The problem is to find a policy that maximizes $V^{\pi}(s_0) \doteq \mathbb{E}_{\pi}\left(\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t) \mid s_0\right)$, where the subscript $\pi$ indicates the expectation is over probabilities that are jointly governed by $\pi$ and $P$. The action-value function, $Q^{\pi}$, and advantage function are related to the value function as follows, and we refer to all of them as value functions:

$$Q^{\pi}(s,a) \doteq r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(s,a)}\, \mathbb{E}_{a' \sim \pi(s')}\, Q^{\pi}(s',a') \tag{1}$$

$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\, Q^{\pi}(s,a) \tag{2}$$

$$A^{\pi}(s,a) \doteq Q^{\pi}(s,a) - V^{\pi}(s). \tag{3}$$

Note that the value function is bounded: $\max_{\pi,s,a} Q^{\pi}(s,a) \leq \sum_{i=0}^{\infty} \gamma^{i} = \frac{1}{1-\gamma}$.
A convenient way to express a policy’s performance is in terms of its discounted state-action visitation distribution. This quantity discounts states based on how far they are in the future, and is normalized by $(1-\gamma)$ to ensure it sums to one.

$$d^{\pi}(s,a) \doteq (1-\gamma) \sum_{t=0}^{\infty} \gamma^{t} P(s_t = s, a_t = a)$$

The Performance Difference Lemma (PDL; kakade\_and\_langford:CPI) decomposes the improvement of arbitrary policy $\pi'$ over arbitrary policy $\pi$ into $A^{\pi}$ and $d^{\pi'}$:

$$V^{\pi'}(s_0) - V^{\pi}(s_0) = \frac{1}{1-\gamma}\, \mathbb{E}_{s,a \sim d^{\pi'}}\, A^{\pi}(s,a). \tag{PDL}$$
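The identity can be checked numerically on a small tabular MDP. The following is a minimal sketch, not part of the paper: the random MDP, the helper names (`evaluate`, `visitation`, `pdl_gap`), and the use of truncated iteration in place of an exact linear solve are all assumptions made for illustration.

```python
import random

def random_dist(n, rng):
    """Random probability vector of length n (hypothetical helper)."""
    x = [rng.random() + 1e-3 for _ in range(n)]
    z = sum(x)
    return [v / z for v in x]

def evaluate(P, r, pi, gamma, iters=2000):
    """Iterative policy evaluation; returns Q[s][a] for policy pi (Eqs. 1-2)."""
    nS, nA = len(r), len(r[0])
    Q = [[0.0] * nA for _ in range(nS)]
    for _ in range(iters):
        V = [sum(pi[s][a] * Q[s][a] for a in range(nA)) for s in range(nS)]
        Q = [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(nS))
              for a in range(nA)] for s in range(nS)]
    return Q

def visitation(P, pi, gamma, s0, iters=2000):
    """Discounted state-action visitation d^pi(s, a), truncated after `iters` steps."""
    nS, nA = len(pi), len(pi[0])
    p = [1.0 if s == s0 else 0.0 for s in range(nS)]  # P(s_t = s)
    d = [[0.0] * nA for _ in range(nS)]
    w = 1.0 - gamma                                   # (1 - gamma) * gamma^t
    for _ in range(iters):
        for s in range(nS):
            for a in range(nA):
                d[s][a] += w * p[s] * pi[s][a]
        p = [sum(p[s] * pi[s][a] * P[s][a][t] for s in range(nS) for a in range(nA))
             for t in range(nS)]
        w *= gamma
    return d

def pdl_gap(seed=0, nS=4, nA=3, gamma=0.9, s0=0):
    """Return both sides of the PDL for two random policies on a random MDP."""
    rng = random.Random(seed)
    P = [[random_dist(nS, rng) for _ in range(nA)] for _ in range(nS)]
    r = [[rng.random() for _ in range(nA)] for _ in range(nS)]  # rewards in [0, 1]
    pi = [random_dist(nA, rng) for _ in range(nS)]
    pi_new = [random_dist(nA, rng) for _ in range(nS)]
    Q = evaluate(P, r, pi, gamma)
    V = [sum(pi[s][a] * Q[s][a] for a in range(nA)) for s in range(nS)]
    Q_new = evaluate(P, r, pi_new, gamma)
    v_new_s0 = sum(pi_new[s0][a] * Q_new[s0][a] for a in range(nA))
    d_new = visitation(P, pi_new, gamma, s0)
    lhs = v_new_s0 - V[s0]
    rhs = sum(d_new[s][a] * (Q[s][a] - V[s])
              for s in range(nS) for a in range(nA)) / (1.0 - gamma)
    return lhs, rhs
```

Calling `pdl_gap()` returns the left- and right-hand sides of (PDL); under these assumptions they should agree up to floating-point error, which makes explicit that the advantage of the old policy is averaged over the visitation distribution of the new one.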
## 3 Approximate Policy Iteration (API)
We now describe the distribution shift problem for API, deriving theoretical bounds that emphasize the importance of aligning the training distribution and the next policy’s distribution, a concept we term Next Policy Alignment.
Algorithm [1](https://arxiv.org/html/2605.05481#alg1) presents an on-policy version of API, meaning that it collects data using $\pi_k$. API constructs a sequence of policies $\{\pi_k\}_{k=0}^{\infty}$ by iterating between Data Collection, Policy Evaluation, and Policy Improvement. In each round $k$, the policy evaluation operator (Eval) estimates $q_k^{\pi_k} \approx Q^{\pi_k}$ from dataset $\mathcal{D}_k$. Then, the policy improvement operator ($\Gamma$) returns a new policy as a function of $q_k^{\pi_k}$, with optional dependence on the present policy $\pi_k$ or data batch $\mathcal{D}_k$.
Algorithm 1 On-Policy API

1. API subroutines Eval and $\Gamma$
2. Initialize $\pi_0$
3. for $k = 0, 1, 2, \dots$ do
4. Collect dataset $\mathcal{D}_k$ by rolling out $\pi_k$.
5. $q_k^{\pi_k} \leftarrow \texttt{Eval}(\pi_k, \mathcal{D}_k)$
6. $\pi_{k+1} \leftarrow \Gamma(q_k^{\pi_k}, \pi_k, \mathcal{D}_k)$
7. end for

Algorithm 2 Stable Value API (Abstract)

1. Eval, $\Gamma$, and Stability Criterion $\mathcal{C}$
2. Initialize $\pi_0$, $\beta_0 \leftarrow \pi_0$
3. for $k = 0, 1, 2, \dots$ do
4. Collect dataset $\mathcal{D}_k$ by rolling out $\beta_k$
5. $q_k^{\pi_k} \leftarrow \texttt{Eval}(\pi_k, \mathcal{D}_k)$
6. $\beta_{k+1} \leftarrow \Gamma(q_k^{\pi_k}, \beta_k, \mathcal{D}_k)$
7. if $\mathcal{C}(q_k^{\pi_k}, q_{k-1}^{\pi_{k-1}}, \mathcal{D}_k)$ is True then
8. $\pi_{k+1} \leftarrow \beta_{k+1}$ $\triangleright$ Update target policy
9. else
10. $\pi_{k+1} \leftarrow \pi_k$ $\triangleright$ Hold target policy
11. end if
12. end for
Figure 1: The Stable Value modification to API. SV-API only updates the target policy $\pi$ when the convergence criterion $\mathcal{C}$ is met.

The importance of the next-policy state-action distribution for policy improvement is well known in the literature on API. We provide an analysis that emphasizes this dependence in each round. To quantify the performance of Eval under the training distribution $\mu$, we define the following error metric, which is similar to the Mean Squared Value Error (sutton2018introduction).
###### Definition 3.1 (Weighted Action-Value Error).
The $\mu$-weighted estimation error is

$$\varepsilon(\mu, q^{\pi}) \doteq \sum_{s \in S} \sum_{a \in A} \mu(s,a) \left| Q^{\pi}(s,a) - q^{\pi}(s,a) \right|. \tag{4}$$
Next, we introduce a quantity that typical policy improvement operators maximize.
###### Definition 3.2 (Estimated Expected Advantage).
Given critic $q^{\pi}$, policy $\pi$, and arbitrary policy $\pi'$, the estimated expected advantage of $\pi'$ over $\pi$ is

$$\mathbb{A}^{\pi}_{\pi'} \doteq \mathbb{E}_{s,a \sim d^{\pi'}} \left[ q^{\pi}(s,a) - V^{\pi}(s) \right]. \tag{5}$$

Most policy improvement operators approximately or exactly maximize this quantity by selecting actions of (perceived) high value at each state. For example, the greedy policy update exactly maximizes $\mathbb{A}^{\pi}_{\pi'}$. Actor-critic methods maximize a surrogate, which can be estimated on-policy (pmlr-v37-schulman15).
The following theorem bounds the policy improvement using the quantities in Definition [3.1](https://arxiv.org/html/2605.05481#S3.Thmtheorem1) and Definition [3.2](https://arxiv.org/html/2605.05481#S3.Thmtheorem2).
###### Theorem 3.3 (Value Error and Policy Improvement).
Given action-value estimate $q^{\pi}$ and target policy $\pi$, let the next policy be $\pi' = \Gamma(q^{\pi}, \pi)$. Then,

$$\frac{1}{1-\gamma}\left(\mathbb{A}^{\pi}_{\pi'} - \varepsilon(d^{\pi'}, q^{\pi})\right) \leq V^{\pi'}(s_0) - V^{\pi}(s_0) \leq \frac{1}{1-\gamma}\left(\mathbb{A}^{\pi}_{\pi'} + \varepsilon(d^{\pi'}, q^{\pi})\right) \tag{6}$$
The proof is in the supplemental materials [7](https://arxiv.org/html/2605.05481#S7). Compared to previous results, Theorem [3.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3) includes the function approximation error of the Q-estimate that is used to improve the policy, making the relationship between this error and performance degradation more salient. If $q^{\pi}$ is completely accurate on $d^{\pi_{k+1}}$, the greedy policy will maximize policy improvement. On the other hand, even if $q^{\pi}$ is perfectly accurate on-policy, inaccuracy on $d^{\pi_{k+1}}$ means that the greedy update may fail to improve the policy.
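For intuition, the two-sided bound can be sketched directly from (PDL) and Definitions 3.1 and 3.2. The steps below are an illustrative reconstruction under those definitions only, and are not claimed to match the authors' proof in the supplemental materials.

```latex
\begin{align*}
V^{\pi'}(s_0) - V^{\pi}(s_0)
  &= \frac{1}{1-\gamma}\,\mathbb{E}_{s,a \sim d^{\pi'}}\!\left[ Q^{\pi}(s,a) - V^{\pi}(s) \right]
     && \text{(PDL)} \\
  &= \frac{1}{1-\gamma}\left( \mathbb{E}_{s,a \sim d^{\pi'}}\!\left[ q^{\pi}(s,a) - V^{\pi}(s) \right]
     + \mathbb{E}_{s,a \sim d^{\pi'}}\!\left[ Q^{\pi}(s,a) - q^{\pi}(s,a) \right] \right) \\
  &= \frac{1}{1-\gamma}\left( \mathbb{A}^{\pi}_{\pi'}
     + \mathbb{E}_{s,a \sim d^{\pi'}}\!\left[ Q^{\pi}(s,a) - q^{\pi}(s,a) \right] \right), \\
\left| \mathbb{E}_{s,a \sim d^{\pi'}}\!\left[ Q^{\pi}(s,a) - q^{\pi}(s,a) \right] \right|
  &\leq \sum_{s,a} d^{\pi'}(s,a)\,\left| Q^{\pi}(s,a) - q^{\pi}(s,a) \right|
   = \varepsilon(d^{\pi'}, q^{\pi}).
\end{align*}
```

Bounding the residual term from below and above by $\mp\varepsilon(d^{\pi'}, q^{\pi})$ gives the two inequalities in (6). The only place the critic's error enters is through its average under $d^{\pi'}$, which is why accuracy on the next policy's distribution is the quantity that matters.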
In general, we assume that Eval minimizes $\varepsilon(\mu, q^{\pi})$ by focusing value function capacity on the training data. Thus, $\varepsilon(d^{\pi'}, q^{\pi})$ must be controlled some other way. The following result shows that when the training distribution matches the next round’s on-policy distribution, $\varepsilon(d^{\pi'}, q^{\pi})$ can be controlled indirectly.
###### Definition 3.4 ($\delta$-Next Policy Alignment (NPA)).
Given next policy $\pi'$, a training distribution $\mu(s,a)$ achieves $\delta$-Next Policy Alignment if $d_{TV}(d^{\pi'}, \mu) \leq \delta_{dist}$.
###### Lemma 3.5.
Assume bounded value error: $\max_{s,a,\pi} |Q^{\pi}(s,a) - q^{\pi}(s,a)| \leq \frac{1}{1-\gamma}$. For any two state-action distributions $\mu$ and $d^{\pi'}$:

$$\varepsilon(d^{\pi'}, q^{\pi}) \leq \varepsilon(\mu, q^{\pi}) + \frac{2}{1-\gamma}\, d_{TV}(d^{\pi'}, \mu)$$
The proof is in Appendix [8](https://arxiv.org/html/2605.05481#S8). It relies on a basic distribution shift argument relating the expectation over $d^{\pi'}$ present on the LHS to the expectation over $\mu$.
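The inequality in Lemma 3.5 can be spot-checked on random inputs. This is a minimal sketch under stated assumptions: total variation is taken as half the L1 distance, the absolute errors $|Q^{\pi} - q^{\pi}|$ are drawn at random within the assumed bound $\frac{1}{1-\gamma}$, and the helper names are hypothetical.

```python
import random

def tv_distance(p, q):
    """Total variation distance, taken here as half the L1 distance (assumption)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def weighted_error(mu, abs_err):
    """epsilon(mu, q) from Definition 3.1, given |Q - q| per state-action pair."""
    return sum(m * e for m, e in zip(mu, abs_err))

def random_dist(n, rng):
    """Random probability vector over n flattened state-action pairs."""
    x = [rng.random() + 1e-3 for _ in range(n)]
    z = sum(x)
    return [v / z for v in x]

def lemma_3_5_holds(seed, n=12, gamma=0.9):
    """Check eps(d', q) <= eps(mu, q) + 2/(1-gamma) * TV(d', mu) on one random draw."""
    rng = random.Random(seed)
    mu = random_dist(n, rng)       # training distribution
    d_next = random_dist(n, rng)   # stand-in for d^{pi'}
    bound = 1.0 / (1.0 - gamma)
    abs_err = [rng.uniform(0.0, bound) for _ in range(n)]  # bounded value error
    lhs = weighted_error(d_next, abs_err)
    rhs = weighted_error(mu, abs_err) + 2.0 * bound * tv_distance(d_next, mu)
    return lhs <= rhs + 1e-12
```

The check reflects the distribution shift argument: the gap between the two weighted errors is a sum of $(d^{\pi'} - \mu)$ times a bounded quantity, so it cannot exceed the L1 distance between the distributions times the error bound.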
###### Corollary 3.6.
Assume the conditions of Theorem [3.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3) and Lemma [3.5](https://arxiv.org/html/2605.05481#S3.Thmtheorem5) hold, and that training distribution $\mu_k$ achieves $\delta$-NPA (Definition [3.4](https://arxiv.org/html/2605.05481#S3.Thmtheorem4)). Then, the policy improvement is at least:

$$V^{\pi'}(s_0) - V^{\pi}(s_0) \geq \frac{1}{1-\gamma}\mathbb{A}^{\pi}_{\pi'} - \frac{1}{1-\gamma}\varepsilon(\mu_k, q^{\pi}) - \frac{2}{(1-\gamma)^{2}}\delta_{dist} \tag{7}$$
###### Proof.
Substitute the upper bound on $\varepsilon(d^{\pi'}, q^{\pi})$ from Lemma 3.5 directly into the lower bound of Theorem 3.3. ∎
This shows that when $\delta_{dist}$-NPA holds, the policy improvement suffers two penalties: one related to $\delta_{dist}$, and the other related to the training error. It is similar to bounds due to kakade\_and\_langford:CPI and pmlr-v37-schulman15. The former assumes that the greedy policy can be approximately determined only for the on-policy distribution, naturally constraining the application of the method off-policy. The latter assumes no value function error at all. In contrast, Corollary [3.6](https://arxiv.org/html/2605.05481#S3.Thmtheorem6) explicitly includes the value function error on the training distribution.
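To get a feel for the size of the two penalties, the right-hand side of (7) can be evaluated for concrete numbers. The function names and the example inputs below are illustrative assumptions, not values from the paper.

```python
def npa_lower_bound(est_advantage, train_error, delta_dist, gamma):
    """Right-hand side of Eq. (7): guaranteed improvement under delta-NPA."""
    scale = 1.0 / (1.0 - gamma)
    return (scale * est_advantage
            - scale * train_error
            - 2.0 * scale ** 2 * delta_dist)

def delta_penalty(delta_dist, gamma):
    """The distribution-mismatch term of Eq. (7) on its own."""
    return 2.0 * delta_dist / (1.0 - gamma) ** 2

# Hypothetical inputs: estimated advantage 0.5, training error 0.1,
# alignment 1e-5, discount 0.99.
example_bound = npa_lower_bound(0.5, 0.1, 1e-5, 0.99)
example_penalty = delta_penalty(0.005, 0.99)
```

With $\gamma = 0.99$ the mismatch term is multiplied by $\frac{2}{(1-\gamma)^2} = 20{,}000$, so `example_penalty` comes out at about 100, which equals $\frac{1}{1-\gamma}$, the largest value any policy can attain with rewards in $[0,1]$. The bound is therefore informative only for very small $\delta_{dist}$.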
### 3.1 Conservative Policy Iteration (CPI)
Conservative policy iteration and related methods ensure that the policy improvement operator in Algorithm [1](https://arxiv.org/html/2605.05481#alg1) performs $\delta$-NPA. In the original CPI (kakade\_and\_langford:CPI), the next policy is an exponential moving average (EMA) of the greedy policy with past policies, ensuring $d_{TV}(\pi, \pi')$ is controlled.
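A tabular mixture update of this kind can be sketched as follows. This is an illustration rather than the CPI algorithm itself: the mixing weight `alpha`, the tie-breaking rule, and the function names are assumptions, and the step-size selection used by CPI is omitted.

```python
def greedy_policy(q):
    """Deterministic greedy policy as one-hot rows; ties go to the lowest index."""
    policy = []
    for row in q:
        best = max(range(len(row)), key=lambda a: row[a])
        policy.append([1.0 if a == best else 0.0 for a in range(len(row))])
    return policy

def conservative_update(pi, q, alpha):
    """Mix the current policy with the greedy one: (1 - alpha) * pi + alpha * greedy."""
    greedy = greedy_policy(q)
    return [[(1.0 - alpha) * p + alpha * g for p, g in zip(pi_s, g_s)]
            for pi_s, g_s in zip(pi, greedy)]

def max_policy_tv(pi, pi_new):
    """Largest per-state total variation distance (half the L1 distance)."""
    return max(0.5 * sum(abs(a - b) for a, b in zip(x, y))
               for x, y in zip(pi, pi_new))
```

Because the new policy differs from the old one by `alpha` times the gap to the greedy policy, the per-state total variation distance is at most `alpha`. Repeating the update turns the policy into an exponentially weighted average of past greedy policies, which is how a small `alpha` keeps successive policies close.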
If $\varepsilon(\mu, q^{\pi})$ is bounded, conservative methods achieve monotonic policy improvement, but at the cost of exceedingly small policy improvements. For example, with $\gamma = 0.99$, the worst-case policy change, $\max_{s} d_{TV}(\pi_k(s), \pi_{k+1}(s))$, must be less than one quarter of a percent (computed by applying Theorem 12.2 of agarwal21). Theorem [3.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3) shows that this can waste data and compute if the value function is already accurate enough on $d^{\pi'}$.
In practice, deep actor-critic methods implement much weaker conservatism than the sufficient conditions of Corollary [3.6](https://arxiv.org/html/2605.05481#S3.Thmtheorem6) and similar theorems due to kakade\_and\_langford:CPI and pmlr-v37-schulman15 would suggest. For example, in their Muesli method, museli regularize the update to prefer policies where $d_{TV}(\pi_{new}, \pi_{old}) \leq 0.46$, a jump in policy space that makes the policy improvement bound in Corollary [3.6](https://arxiv.org/html/2605.05481#S3.Thmtheorem6) vacuous. As another example, Engstrom2020Implementation found that TRPO and PPO held the average KL divergence at around 0.05, still at least $20\times$ too large to guarantee policy improvement. These methods are still widely applied, despite this theory-practice gap (stiennon2020learning, OpenAI\_GPT4\_2023).
## 4 Approximate Next Policy Sampling with Stable Value API
Our approach, Approximate Next Policy Sampling (ANPS), achieves NPA through a fundamentally different mechanism. Instead of constraining the next target policy $\pi_{k+1}$ to remain close to $\pi_k$ (as in CPI), ANPS decouples data collection from the target policy entirely.
This decoupling allows the algorithm to execute massive, unconstrained updates to the target policy. It achieves this by introducing an iteratively refined behavioral policy that aligns the training distribution with the anticipated next policy before the target update is made.
We introduce Stable Value API (SV-API, Algorithm [2](https://arxiv.org/html/2605.05481#alg2)) as a general framework for executing ANPS. SV-API separates the optimization process into three distinct policies at each iteration $k$:
- Target Policy ($\pi_k$): Uniquely determines the value target $Q^{\pi_k}$ for evaluation.
- Behavioral Policy ($\beta_k$): Collects the data and is iteratively refined to scout the state space.
- Next Target Policy ($\pi_{k+1}$): The prospective unconstrained update.
SV-API gates the target policy update behind a stability condition, $\mathcal{C}$. The behavioral policy gathers data to estimate $Q^{\pi_k}$ and is updated repeatedly. The true value function, $Q^{\pi_k}$, corresponds to a fixed target policy, meaning the behavioral policy can safely explore and align with the optimal actions for that fixed value function while limiting premature target updates.
The behavioral policy gathers data for the estimation of $Q^{\pi_k}$, and is iteratively updated to help align the behavioral and next target policy distributions. The true value function, $Q^{\pi_k}$, corresponds to a policy that may differ from the sampling policy after the first round.
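The control flow of Algorithm 2 can be sketched as a generic loop over user-supplied callables. This is a hedged illustration, not the authors' implementation: the argument names (`collect`, `evaluate`, `improve`, `is_stable`) are hypothetical, and treating round 0 as "not stable" because no previous critic exists is an assumption.

```python
from typing import Any, Callable, List, Tuple

def sv_api(
    pi0: Any,
    collect: Callable[[Any], Any],            # behavioral policy -> dataset D_k
    evaluate: Callable[[Any, Any], Any],      # (target policy, dataset) -> critic q_k
    improve: Callable[[Any, Any, Any], Any],  # (critic, behavioral policy, dataset) -> policy
    is_stable: Callable[[Any, Any, Any], bool],  # (q_k, q_{k-1}, dataset) -> bool
    rounds: int,
) -> Tuple[Any, List[int]]:
    """Run the SV-API loop; return the final target policy and the update rounds."""
    pi, beta = pi0, pi0          # target and behavioral policies start equal
    q_prev = None
    update_rounds: List[int] = []
    for k in range(rounds):
        data = collect(beta)             # data comes from the behavioral policy
        q = evaluate(pi, data)           # but the critic evaluates the target policy
        beta = improve(q, beta, data)    # the behavioral policy is always updated
        # Assumption: with no previous critic (k = 0) the target is held.
        if q_prev is not None and is_stable(q, q_prev, data):
            pi = beta                    # commit the target policy update
            update_rounds.append(k)
        q_prev = q
    return pi, update_rounds
```

The only difference from the on-policy loop in Algorithm 1 is that the improvement operator is applied to `beta` every round, while `pi` changes only when the gate fires; the critic's target therefore stays fixed between updates.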
Algorithm [2](https://arxiv.org/html/2605.05481#alg2) gates the target policy update behind a stability condition, $\mathcal{C}$, which dictates when the value estimate and distribution have stabilized sufficiently to permit a target policy update. In this work, we explore two concrete instantiations of $\mathcal{C}$:
1. Static Intervals: The target policy is updated every $K$ rounds, where $K$ is a hyperparameter. It is generalized by the following dynamic thresholding scheme.
2. Dynamic Thresholding: The update occurs when the change in the value estimates falls below a hyperparameter $\delta_v$ for at least $K_{min}$ rounds. Specifically, we track the absolute critic change:

   $$\text{diff}_k \doteq \sum_{s,a \in \mathcal{D}} \left| q_k^{\pi_k}(s,a) - q_{k-1}^{\pi_{k-1}}(s,a) \right| \tag{8}$$

   This quantity is a measurable proxy for $\varepsilon(\mu_k, q_k^{\pi_k})$ and $d_{TV}(\mu_{k-1}, \mu_k)$. When either of these is large, $\text{diff}_k$ may also be large. To help ensure transferability across environments, we normalize $\text{diff}_k$ by the return before comparing to $\delta_v$. The full convergence logic, shown in Algorithm [4](https://arxiv.org/html/2605.05481#alg4), also uses a maximum number of rounds spent evaluating a fixed policy, $K_{max}$, to prevent stalling.
Static interval thresholding can be obtained by setting the maximum and minimum number of iterations to the same value, i.e., $K_{min} = K_{max} = K$. We now present an experiment that allows us to observe the theoretical quantities from Section [2](https://arxiv.org/html/2605.05481#S2), and then return to our theoretical analysis.
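One plausible reading of the two criteria can be sketched as a small gate object. The paper's full logic is in its Algorithm 4, which is not reproduced here, so the details below are assumptions: the class name `StabilityGate`, the interpretation of $K_{min}$ as a minimum number of rounds since the last target update, and normalization by dividing $\text{diff}_k$ by the magnitude of a scalar return estimate.

```python
class StabilityGate:
    """Decide when to update the target policy (one reading of criterion C)."""

    def __init__(self, delta_v, k_min, k_max):
        self.delta_v = delta_v   # threshold on the normalized critic change
        self.k_min = k_min       # assumed: minimum rounds between target updates
        self.k_max = k_max       # maximum rounds spent on one target policy
        self.rounds_held = 0

    def step(self, q_new, q_old, return_scale):
        """q_new, q_old: critic outputs on the same batch. Returns True to update."""
        self.rounds_held += 1
        diff = sum(abs(a - b) for a, b in zip(q_new, q_old))   # Eq. (8)
        scaled = diff / max(abs(return_scale), 1e-8)           # normalize by the return
        stable = scaled < self.delta_v and self.rounds_held >= self.k_min
        if stable or self.rounds_held >= self.k_max:
            self.rounds_held = 0
            return True
        return False
```

Setting `k_min == k_max == K` makes the threshold irrelevant and fires the gate every `K` rounds, which recovers the static interval scheme described above.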
### 4\.1An empirical demonstration on the Four Rooms MDP

Figure 2: Snapshots of two rounds of SV-PPO on Four Rooms. True $V^{\pi}$ indicates the true value function of the target policy, which is the same in both rounds. “Dist. @ k” is the moving average empirical state visitation distribution at round $k$. “Value Pred.” is the predicted value.

To examine how $\varepsilon(\mu_k, q^{\pi_k})$ and next-policy alignment interact in practice, we use the Four Rooms environment (journals/ai/SuttonPS99). We compare a conservative API baseline (in the form of PPO) against an implementation of our Stable Value API (Algorithm [2](https://arxiv.org/html/2605.05481#alg2), also applied to PPO). We defer the implementation details to Section [5](https://arxiv.org/html/2605.05481#S5), to focus on learning dynamics.
In standard PPO, data is collected by the target policy. When applied to Four Rooms, PPO immediately updates the policy based on initial, high-error value estimates. As shown in Figure [3](https://arxiv.org/html/2605.05481#S4.F3), this leads to value churn (tang2024churn) and catastrophic forgetting around iteration 80. The critic overestimates the value of rarely visited regions (the bottom room and leftmost wall), leading to degradation of the policy in these regions. The premature policy updates adjust the policy to avoid these regions, meaning the data required to correct the value estimate is never gathered (see Supplemental Materials [13](https://arxiv.org/html/2605.05481#S13) for a detailed explanation).
In contrast, SV-API decouples these elements by holding the target policy fixed while a separate behavioral policy iteratively gathers data. SV-API delays updating the target policy until the change in the value estimates falls below a stability threshold. Figure [2](https://arxiv.org/html/2605.05481#S4.F2) shows how this delay allows for stronger policy evaluation of the very first target policy, which is frozen until round $k = 30$. During these rounds, the behavioral policy acts as a scout, expanding the set of visited states and providing coverage of the goal room. Without this sustained data collection, accurate value estimation of $V^{\pi_0}$ at these distant, high-value states would be impossible.
By evaluating target policies for longer, SV-API suppresses value overestimation, maintains coverage of high-value states, and successfully converges to the optimal policy. Figure [3](https://arxiv.org/html/2605.05481#S4.F3)a plots the value of the sampling policy, while Figure [3](https://arxiv.org/html/2605.05481#S4.F3)b plots $\varepsilon(\mu_{k+1}, q_k^{\pi_k})$, the error term in Theorem [3.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3). Value overestimation and catastrophic forgetting in rarely visited regions lead to spikes in this error for PPO. Figure [3](https://arxiv.org/html/2605.05481#S4.F3)c shows that the degree of NPA is similar for both methods, despite the less frequent target policy updates of SV-API. Figure [3](https://arxiv.org/html/2605.05481#S4.F3)d shows that the metric tracked for convergence is reasonably correlated with low value error and strong NPA.

Figure 3: Four Rooms learning curves. Dots indicate target policy updates for SV-API. (a) Expected return (value) of the data-collection policy for both algorithms; (b) Next-Policy Weighted Mean Squared Value Error $\mathbb{E}_{\mu_{k+1}}(V^{\pi}(s) - v_k(s))^2$; (c) TV distance between successive training distributions $\mu_k$; and (d) Convergence: scaled $\text{diff}_k$ vs. the threshold, $\delta_v$ (dashed line).
### 4.2 Analysis of SV-API
To guarantee monotonic policy improvement, SV-API must ultimately achieve $\delta$-NPA in the round that the target policy is updated. SV-API attempts to achieve this by iteratively aligning the behavioral policy, obtained as $\Gamma(q_{k-1}^{\pi_{k-1}}, \beta_{k-1}, \mathcal{D}_{k-1})$, with the next policy, which is obtained as $\Gamma(q_k^{\pi_k}, \beta_k, \mathcal{D}_k)$.
Because the target policy update is simply an assignment ($\pi_{k+1} \leftarrow \beta_{k+1}$), the difference between this round’s training distribution and the next state distribution is due to the change in the behavioral policy during that final update round. The following lemma from agarwal21 formally links a bounded change in policy space to a bounded change in state-action visitation distribution.
###### Lemma 4\.1\.
For two policiesβ\\betaandβ′\\beta^\{\\prime\}, ifdTV\(β′\(s\),β\(s\)\)≤δ\(1−γ\)d\_\{TV\}\(\\beta^\{\\prime\}\(s\),\\beta\(s\)\)\\leq\\delta\(1\-\\gamma\)for all statesss, thendTV\(dβ′\(s,a\),dβ\(s,a\)\)≤δd\_\{TV\}\(d^\{\\beta^\{\\prime\}\}\(s,a\),d^\{\\beta\}\(s,a\)\)\\leq\\delta\.
###### Corollary 4\.2\.
In update roundkk, Algorithm[2](https://arxiv.org/html/2605.05481#alg2)performsδ\\delta\-NPA ifmaxsdTV\(βk\(⋅\|s\),βk\+1\(⋅\|s\)\)≤δ\(1−γ\)\\max\_\{s\}d\_\{TV\}\(\\beta\_\{k\}\(\\cdot\|s\),\\beta\_\{k\+1\}\(\\cdot\|s\)\)\\leq\\delta\(1\-\\gamma\)\.
This corollary provides the theoretical mechanism for ANPS: the algorithm must iteratively refine the behavioral policy untilβk\\beta\_\{k\}andβk\+1\\beta\_\{k\+1\}are sufficiently similar, and then terminate in that round\. The next theorem applies the corollary to show that SV\-API makes safe, unconstrained, updates to the target policy if the value error is low and the behavioral policy satisfies NPA\.
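The condition in Corollary 4.2 is straightforward to check for tabular policies. The following is a minimal sketch, assuming each policy is stored as a list of per-state action distributions; the function names and the two-state example are hypothetical and are not taken from the paper's implementation.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def max_policy_tv(beta_k, beta_next):
    """Largest per-state TV distance between two tabular policies.

    Each policy is a list with one action distribution per state.
    """
    return max(tv_distance(p, q) for p, q in zip(beta_k, beta_next))


def satisfies_corollary(beta_k, beta_next, delta, gamma):
    """Sufficient condition of Corollary 4.2: max_s TV <= delta * (1 - gamma)."""
    return max_policy_tv(beta_k, beta_next) <= delta * (1.0 - gamma)


# Hypothetical two-state, two-action behavioral policies from successive rounds.
beta_k = [[0.500, 0.500], [0.9000, 0.1000]]
beta_next = [[0.501, 0.499], [0.9005, 0.0995]]

print(max_policy_tv(beta_k, beta_next))
print(satisfies_corollary(beta_k, beta_next, delta=0.2, gamma=0.99))
```

Note how the factor $(1-\gamma)$ makes the condition demanding: with $\gamma = 0.99$, a per-state policy change of only 0.001 in TV distance already corresponds to $\delta = 0.1$ at the level of visitation distributions.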
###### Theorem 4\.3\(SV\-API Improvement\)\.
In Algorithm [2](https://arxiv.org/html/2605.05481#alg2), let the training data distribution $\mu_k$ be the stationary distribution induced by the behavioral policy, $\mu_k \doteq d^{\beta_k}$. Assume the behavioral policy has stabilized such that $d_{TV}(\beta_k(s), \beta_{k+1}(s)) \leq \delta(1-\gamma)$ for all states $s$, and the target policy update occurs ($\pi_{k+1} \leftarrow \beta_{k+1}$). Finally, assume bounded error on the training distribution, $\varepsilon(\mu_k, q^{\pi_k}) \leq \varepsilon$, and globally, $\max_{s,a,\pi} |Q^{\pi}(s,a) - q^{\pi}(s,a)| \leq \frac{1}{1-\gamma}$. Then the performance improvement is bounded below by:

$$V^{\pi_{k+1}}(s_0) - V^{\pi_k}(s_0) \geq \frac{1}{1-\gamma}\left(\mathbb{A}^{\pi_k}_{\pi_{k+1}} - \varepsilon - \frac{2\delta}{1-\gamma}\right) \qquad (9)$$
###### Proof\.
In an update round, the target policy becomes the new behavioral policy:πk\+1=βk\+1\\pi\_\{k\+1\}=\\beta\_\{k\+1\}\. The training data distributionμk\\mu\_\{k\}was generated by rolling outβk\\beta\_\{k\}\. Lemma[4\.1](https://arxiv.org/html/2605.05481#S4.Thmtheorem1)says that a maximum policy divergence ofδ\(1−γ\)\\delta\(1\-\\gamma\)implies a state\-action visitation distribution divergence of at mostδ\\delta, thusdTV\(μk,dπk\+1\)≤δd\_\{TV\}\(\\mu\_\{k\},d^\{\\pi\_\{k\+1\}\}\)\\leq\\delta\(NPA\)\. From here, Corollary[3\.6](https://arxiv.org/html/2605.05481#S3.Thmtheorem6)yields the result\. ∎
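To get a feel for the magnitudes in Equation (9), the right-hand side can be evaluated directly. This is a small illustrative sketch: the function names and the numeric inputs are hypothetical, chosen only to show how the distribution-shift term $2\delta/(1-\gamma)$ competes with the estimated advantage.

```python
def sv_api_improvement_lower_bound(est_advantage, eps, delta, gamma):
    """Right-hand side of Eq. (9): guaranteed improvement at the start state."""
    shift_penalty = 2.0 * delta / (1.0 - gamma)
    return (est_advantage - eps - shift_penalty) / (1.0 - gamma)


def largest_safe_delta(est_advantage, eps, gamma):
    """Largest delta for which the lower bound in Eq. (9) is non-negative."""
    return max(0.0, (est_advantage - eps) * (1.0 - gamma) / 2.0)


gamma = 0.9  # hypothetical discount factor
bound = sv_api_improvement_lower_bound(
    est_advantage=0.5, eps=0.1, delta=0.01, gamma=gamma
)
print(round(bound, 6))
print(round(largest_safe_delta(est_advantage=0.5, eps=0.1, gamma=gamma), 6))
```

In this example the guarantee stays positive only while $\delta$ is below $(\mathbb{A} - \varepsilon)(1-\gamma)/2$, which is why the behavioral policy must have nearly stopped changing before the target update is committed.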
It is instructive to contrast the bound in Theorem[4\.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)with the standard monotonic improvement bounds found in conservative methods like CPI\(kakade\_and\_langford:CPI\)\. In CPI, the updated policy is a mixture of the current policy and the greedy policy\. Setting the mixture coefficient toδ\(1−γ\)\\delta\(1\-\\gamma\)restricts the maximum policy divergence to match the conditions of Theorem[4\.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)\. In exchange, the advantage term is scaled down byδ\(1−γ\)\\delta\(1\-\\gamma\)\. In other words, standard conservatism inherently penalizes the magnitude of the policy improvement step\.111We compare our Theorem[4\.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)to the monotonic policy improvement theorem in more detail in Supplemental Materials[9](https://arxiv.org/html/2605.05481#S9)\.
In contrast, Theorem[4\.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)considers an unconstrained assignment to the target policy\. SV\-API avoids scaling down the policy improvement because it does not explicitly setδ\\delta\. Rather, by iteratively applying the improvement operator to the sequence of behavioral policies, the algorithm waits for the behavioral policy to stabilize, ensuringδ\\delta\-NPA\. If this occurs,δ\\deltais kept under control, but the estimated improvement𝔸πk\+1πk\\mathbb\{A\}^\{\\pi\_\{k\}\}\_\{\\pi\_\{k\+1\}\}can still be large becauseπk\+1\\pi\_\{k\+1\}is the accumulation of multiple training steps\. In short, a convergent sequence of behavioral policies avoids problematic distribution shift while permitting massive leaps in target policy performance\.
This is the motivation behind our proposed dynamic convergence criterion, which simultaneously estimates value function change and behavioral policy change\. When the value network stops changing on shifting distribution, this is a proxy for the underlying training distribution stabilizing\. This indicates that the next behavioral policy may remain similar, and it is safe to make an update\. In this way distribution shift is controlled by the stabilization of the data\-collection process, in addition to whatever conservative propertiesΓ\\Gammamay have\.
## 5Results
We modify PPO to implement Algorithm[2](https://arxiv.org/html/2605.05481#alg2)\. This allows for direct comparison with a strong conservative baseline\. Moreover, by preventing large changes in both the value function and \(behavioral\) policy, PPO’s update helps induce the stabilization dynamic described above, where the change in one controls the other\. Standard PPO can be obtained from the same code by ensuring𝒞\\mathcal\{C\}is always true, for example withKmax=1K\_\{max\}=1\.
SV-PPO (Algorithm [3](https://arxiv.org/html/2605.05481#alg3)) freezes a set of policy parameters, which correspond to the target policy. This frozen network is used to compute importance sampling (IS) ratios for the value target and advantage estimates. These ensure that the estimates track $V^{\pi}$ and $A^{\pi}$ despite potentially off-policy data. We use V-Trace (espeholt2018impalascalabledistributeddeeprl) for the value estimation and ReTrace($\lambda$) (DBLP:retrace) for the advantage estimation.
The clipped IS ratio used in PPO's policy objective corresponds to the ratio of two successive behavioral policies. This implies the sequence of behavioral policies can drift away from the target policy, but at any round $k$, the difference between the final behavioral policy and the next target policy will be controlled in the same fashion as PPO's target policy update. A detailed description of all the changes and the loss function is in Supplemental Material [23](https://arxiv.org/html/2605.05481#S23).
Compared to PPO, no additional training or environment sampling is necessary. In each round, the only additional requirement is running inference of the policy network to obtain the IS weights. Whenever $\delta_v = \infty$ and $K_{min} = 1$, SV-PPO becomes exactly standard PPO. Aggregate results are shown in Table [1](https://arxiv.org/html/2605.05481#S5.T1).
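The off-policy value targets can be illustrated with a plain-Python sketch of the standard V-trace recursion from the IMPALA paper. This is not the authors' implementation: the function name, the single-trajectory list interface, and the default clipping thresholds are assumptions, and episode termination is ignored for brevity. Here `ratios[t]` stands for the frozen target policy's probability of the sampled action divided by the behavioral policy's probability.

```python
def vtrace_targets(rewards, values, bootstrap_value, ratios, gamma,
                   rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for one trajectory of length T.

    rewards[t], values[t] = V(s_t), and ratios[t] = pi(a_t|s_t) / beta(a_t|s_t)
    are lists of length T; bootstrap_value is V(s_T).
    """
    T = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * T
    acc = 0.0  # running value of (target_{t+1} - V(s_{t+1})); zero past the end
    for t in reversed(range(T)):
        rho = min(rho_bar, ratios[t])  # clipped weight on the TD error
        c = min(c_bar, ratios[t])      # clipped trace coefficient
        td_error = rho * (rewards[t] + gamma * next_values[t] - values[t])
        acc = td_error + gamma * c * acc
        targets[t] = values[t] + acc
    return targets


# With ratios of one (on-policy data) the targets reduce to n-step returns.
print(vtrace_targets([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], 0.0,
                     ratios=[1.0, 1.0, 1.0], gamma=0.5))
```

When the ratios shrink toward zero, the corrections vanish and the targets fall back to the current value estimates, which is one way to see why strongly off-policy data carries little learning signal.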
Algorithm 3 Stable Value PPO
Input: target $\pi_{\bar{\theta}}$, value $v_{\phi_0}$, number of environment steps $T$
Params: $\delta_v, K_{min}, K_{max}, \alpha_{ent}$
$\beta_{\theta_0} \leftarrow \pi_{\bar{\theta}}$, $k_{\pi} \leftarrow 0$, $n_{\text{stable}} \leftarrow 0$
for $k = 0, 1, 2, \dots$ do
  $\mathcal{D}_k \leftarrow \{(s_t, a_t, s'_t, r_t)\}_{t=1:T}$, $a_t \sim \beta_{\theta_k}(s_t)$
  1. Evaluate Target Policy (Eval)
  $y_t \leftarrow \text{VTrace}(\mathcal{D}_k, \beta_{\theta_k}, \pi_{\bar{\theta}})$
  $\hat{A}_t \leftarrow \text{Retrace-GAE}(\mathcal{D}_k, \beta_{\theta_k}, \pi_{\bar{\theta}})$
  $\phi_{k+1} \leftarrow \text{Step}_{\phi}\left(\frac{1}{2}\sum_t (y_t - v_{\phi_k}(s_t))^2\right)$
  2. Update Behavioral Policy ($\Gamma$)
  $\theta_{k+1} \leftarrow \text{Step}_{\theta}\left(J^{clip}(\theta_k) + \alpha_{ent} H(\beta_{\theta_k})\right)$
  3. Stability Criterion ($\mathcal{C}$)
  $\text{diff}_k \leftarrow \frac{1}{T}\sum_t |y_t - v_{\phi_k}(s_t)|$
  $\text{stable}, n_{\text{stable}} \leftarrow \texttt{Conv}(y_{1:T}, \text{diff}_k, n_{\text{stable}}, k_{\pi})$
  if conv. then
    $k_{\pi} \leftarrow 0$
    $\bar{\theta} \leftarrow \theta_{k+1}$ ▷ Update target policy
  else
    $k_{\pi} \mathrel{+}= 1$ ▷ Hold target policy
  end if
end for

Algorithm 4 Conv
Input: $y_{1:T}$, $\text{diff}_k$, $n_{\text{stable}}$, $k_{\pi}$
In scope (Alg. [3](https://arxiv.org/html/2605.05481#alg3)): $\delta_v, K_{min}, K_{max}$
Check stability:
  $\overline{y}_T \leftarrow \frac{1}{T}\sum_{t=1}^{T} |y_t|$
  $I_{\text{stable}} \leftarrow \mathbb{I}(\text{diff}_k \leq \overline{y}_T \delta_v)$
Increment number of stable iterations:
  $n_{\text{stable}} \leftarrow (n_{\text{stable}} + 1) \times I_{\text{stable}}$
Stability gating / early stopping:
  if $n_{\text{stable}} \geq K_{min}$ or $k_{\pi} \geq K_{max}$ then
    return True, 0
  else
    return False, $n_{\text{stable}}$
  end if
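The stability check of Algorithm 4 translates almost line for line into code. The sketch below is a plain-Python rendering for illustration only: the argument names mirror the pseudocode, the threshold and counters are passed explicitly rather than taken from an enclosing scope, and `value_diff` is a hypothetical helper for the $\text{diff}_k$ quantity of Algorithm 3.

```python
def value_diff(y, v):
    """diff_k: mean absolute gap between value targets y_t and predictions v(s_t)."""
    return sum(abs(a - b) for a, b in zip(y, v)) / len(y)


def conv(y, diff_k, n_stable, k_pi, delta_v, k_min, k_max):
    """Return (update_target, new_n_stable) following Algorithm 4.

    y        : value targets y_{1:T} for the current round
    diff_k   : output of value_diff for the current round
    n_stable : number of consecutive stable rounds so far
    k_pi     : rounds since the last target policy update
    """
    y_bar = sum(abs(v) for v in y) / len(y)   # scale of the value targets
    is_stable = diff_k <= y_bar * delta_v     # relative-change test
    n_stable = n_stable + 1 if is_stable else 0
    if n_stable >= k_min or k_pi >= k_max:
        return True, 0                        # commit the target update
    return False, n_stable                    # keep holding the target


y = [1.0, -1.0, 2.0]
print(conv(y, diff_k=0.1, n_stable=0, k_pi=0, delta_v=0.1, k_min=2, k_max=8))
print(conv(y, diff_k=0.1, n_stable=1, k_pi=1, delta_v=0.1, k_min=2, k_max=8))
```

Setting `delta_v=float("inf")` with `k_min=1` makes every round count as stable (as long as the mean of $|y_t|$ is non-zero), so the target policy is updated every round, which is the standard PPO special case described in the text.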
### 5\.1Atari
We run on sixteen Atari games: the Atari-10 benchmark (bellemare13arcade, atari_5), plus six well-known and interesting games: Breakout, Asterix, Freeway, Seaquest, Space Invaders, and Ms. Pacman. We run a dynamic variant that sets $\delta_v = 0.01$ and $K_{max} = 33$. This configuration was obtained from tuning on MinAtar (young19minatar). We also run a static interval variant, setting $K = 9$. Results are shown in Figure [4(a)](https://arxiv.org/html/2605.05481#S5.F4.sf1).

\(a\)Atari benchmark suite\. Five Seeds\.

\(b\)BRAX continuous control suite\. Twelve seeds\.
Figure 4: SV-PPO Performance Comparison. Normalized scores relative to the PPO baseline ($100\%$). Bars show the performance of both methods; the colored tip indicates the margin by which the winning method (Dynamic in blue, Static in orange) outperformed the runner-up.

Table 1: Aggregate performance of Static and Dynamic SV-PPO across Brax and Atari environments (12 and 5 training seeds respectively). Scores are reported as a percentage of PPO performance. The left column shows median performance, with first and third quartile performance in the bracket. The mean score shows the geometric mean normalized performance, with multiplicative spread shown in parentheses. KL($\pi$) indicates mean change in the target policy during learning, while KL($\beta$) is the same for the behavioral policy.

Performance on all games is shown in Figure [4](https://arxiv.org/html/2605.05481#S5.F4). Dynamic SV-PPO has better mean converged performance on 11 out of the 16 games, and is statistically significantly better on six ($p=0.1$, Welch's t-test) and significantly worse on zero. Static SV-PPO is statistically significantly better on seven.
Both variants make fewer, larger updates to the target policy. The static interval variant makes exactly one ninth as many target policy updates as PPO, while the dynamic variant makes anywhere from 3% to 95% as many. The average target policy update is larger than PPO's by 96% on average for the dynamic variant, and by 511% for the static variant (see Table [1](https://arxiv.org/html/2605.05481#S5.T1), and Table [3](https://arxiv.org/html/2605.05481#S15.T3) in the supplemental material for a per-game breakdown). Moreover, the size of the behavioral update is smaller, especially for the dynamic variant. This is empirical verification that SV-PPO can keep the change in sampling policy ($\delta$ in Theorem [4.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)) smaller than PPO's heuristic conservative updates can alone.
The majority of games benefited from slower target policy updates, that is, more samples and updates spent learning the value function for a fixed policy. The dynamic variant allows for fast updates at the end (see Figures [5](https://arxiv.org/html/2605.05481#S10.F5) and [6](https://arxiv.org/html/2605.05481#S11.F6) in the supplemental material). In contrast, the static variant slows down the policy updates for all of training. We expected better performance from the dynamic variant, as it helps ensure the policy and value drift is small at the time of the update. The dynamic convergence criterion resulted in infrequent policy updates at the beginning (resembling policy iteration) and frequent refinements to the policy at the end (resembling value iteration). However, the static interval variant improved performance on most games where the stable value principle was helpful. This suggests that the convergence heuristic could be improved such that these games have even fewer policy updates.
### 5\.2Continuous Control
We test on continuous control tasks using the Brax physics simulator\(brax2021github\)\. For continuous control, the policy is a Gaussian, and the additional variance due to the off\-policy correction appears to be quite high, potentially offsetting the benefits of regularizing value learning\. After initially experimenting withKmax=32K\_\{max\}=32, we ended up finding better results settingKmax=8K\_\{max\}=8to control how off\-policy the data could become\. The static variant setsKmax=Kmin=8K\_\{max\}=K\_\{min\}=8, and performs nearly as well as standard PPO with one eighth of the updates\.
The dynamic thresholding variant sets $\delta_v = 0.05$ and $K_{max} = 8$. This performed a median of 1% better across the suite.
SV-PPO can degrade performance on environments where the PPO policy is highly deterministic, such as reacher and ant. We hypothesize that applying Stable Value API to Soft Actor-Critic (SAC) (conf/icml/HaarnojaZAL18) may help, since SAC's entropy regularization could keep the IS ratios closer to one.
## 6Conclusion
This work introduced ANPS, a new mechanism for policy improvement that is based on iteratively refining a behavioral policy until it is suitable to become the new target policy\. We implement ANPS using SV\-API, which waits until the change in the agent’s value and policy predictions falls below a threshold\. Our Theorem[4\.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)shows that if this heuristic measure is in fact low when the state distribution stops changing, then policy improvement can be guaranteed\. Because SV\-API can make significantly larger target policy updates, while still ensuring distribution shift does not destroy the policy improvement, we frame it as a new kind of solution to a key problem in policy improvement\.
To future work we leave \(1\) Improving the convergence condition𝒞\\mathcal\{C\}– our theory indicates what this should correspond to, but those quantities are unmeasurable; \(2\) Determining other ways to perform ANPS \- for example in imagination with a learned model; \(3\) Applying techniques from the exploration literature to the SV\-API’s behavioral policy; and \(4\) Investigating the degree to which off\-policy evaluation harms SV\-API in practice, and determining ways to reduce this harm\.
## References
Supplementary Materials
*The following content was not necessarily subject to peer review\.*
## 7Proof of Theorem[3\.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3)
###### Proof of Theorem[3\.3](https://arxiv.org/html/2605.05481#S3.Thmtheorem3)\.
Start with the[PDL](https://arxiv.org/html/2605.05481#S2.Ex2), and introduceqπq^\{\\pi\}through addition and subtraction:
$$
\begin{aligned}
(1-\gamma)\left(V^{\pi'}(s_0) - V^{\pi}(s_0)\right) &= \mathbb{E}_{s,a \sim d^{\pi'}}\, A^{\pi}(s,a) \qquad \text{(PDL)} \\
&= \mathbb{E}_{s,a \sim d^{\pi'}}\left[q^{\pi}(s,a) - V^{\pi}(s) + Q^{\pi}(s,a) - q^{\pi}(s,a)\right] \\
&= \mathbb{A}^{\pi}_{\pi'} + \mathbb{E}_{s,a \sim d^{\pi'}}\left[Q^{\pi}(s,a) - q^{\pi}(s,a)\right]
\end{aligned}
$$

Subtract $\mathbb{A}^{\pi}_{\pi'}$ from both sides, and take the absolute value:

$$
\begin{aligned}
\left|(1-\gamma)\left(V^{\pi'}(s_0) - V^{\pi}(s_0)\right) - \mathbb{A}^{\pi}_{\pi'}\right| &= \left|\mathbb{E}_{s,a \sim d^{\pi'}}\left[Q^{\pi}(s,a) - q^{\pi}(s,a)\right]\right| \\
&\leq \mathbb{E}_{s,a \sim d^{\pi'}}\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right| \\
&= \varepsilon(d^{\pi'}, q^{\pi})
\end{aligned}
$$

Noting that $|x| \leq a \iff -a \leq x \leq a$ and dividing by $(1-\gamma) > 0$ yields the theorem statement:

$$-\varepsilon(d^{\pi'}, q^{\pi}) \leq (1-\gamma)\left(V^{\pi'}(s_0) - V^{\pi}(s_0)\right) - \mathbb{A}^{\pi}_{\pi'} \leq \varepsilon(d^{\pi'}, q^{\pi})$$

$$\frac{1}{1-\gamma}\left(\mathbb{A}^{\pi}_{\pi'} - \varepsilon(d^{\pi'}, q^{\pi})\right) \leq V^{\pi'}(s_0) - V^{\pi}(s_0) \leq \frac{1}{1-\gamma}\left(\mathbb{A}^{\pi}_{\pi'} + \varepsilon(d^{\pi'}, q^{\pi})\right)$$

∎
## 8Proof of Lemma[3\.5](https://arxiv.org/html/2605.05481#S3.Thmtheorem5)
Given two distributions, $\mu$ and $\mu'$, and an arbitrary action-value estimate $q^{\pi}$, we would like to bound $\varepsilon(\mu', q^{\pi})$ in terms of $\varepsilon(\mu, q^{\pi})$. Dropping the dependence on $q^{\pi}$, we consider the difference $\varepsilon(\mu') - \varepsilon(\mu)$. Since $\varepsilon(\mu)$ is nonnegative for any distribution, we have

$$
\begin{aligned}
\varepsilon(\mu') - \varepsilon(\mu) &\leq \left|\varepsilon(\mu') - \varepsilon(\mu)\right| && (10) \\
&= \left|\mathbb{E}_{s,a \sim \mu'}\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right| - \mathbb{E}_{s,a \sim \mu}\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right|\right| && (11) \\
&= \left|\sum_{s,a} (\mu' - \mu)\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right|\right| && (12) \\
&\leq \left|\sum_{s,a} |\mu' - \mu|\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right|\right| && (13) \\
&= \sum_{s,a} |\mu' - \mu|\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right| && (14) \\
&\leq \max_{s,a,\pi}\left(\left|Q^{\pi}(s,a) - q^{\pi}(s,a)\right|\right) \sum_{s,a} |\mu' - \mu| && (15) \\
&\leq \frac{1}{1-\gamma} \sum_{s,a} |\mu' - \mu| && (16) \\
&= \frac{2}{1-\gamma}\, d_{TV}(\mu', \mu) && (17)
\end{aligned}
$$

Rearranging gives

$$\varepsilon(\mu') \leq \varepsilon(\mu) + \frac{2}{1-\gamma}\, d_{TV}(\mu', \mu) \qquad (18)$$
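The inequality in Equation (18) can be sanity-checked numerically on a small discrete example. The sketch below uses hypothetical distributions over three state-action pairs and hypothetical absolute value errors, each of which respects the assumed global cap of $1/(1-\gamma)$; none of the numbers come from the paper.

```python
def value_error(mu, abs_err):
    """epsilon(mu): expected absolute value error under distribution mu."""
    return sum(m * e for m, e in zip(mu, abs_err))


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def shifted_error_bound(mu, mu_next, abs_err, gamma):
    """Right-hand side of Eq. (18): bound on epsilon(mu_next)."""
    return value_error(mu, abs_err) + 2.0 / (1.0 - gamma) * tv_distance(mu_next, mu)


gamma = 0.9                     # so every |Q - q| must be at most 10
abs_err = [0.1, 0.5, 4.0]       # hypothetical |Q(s,a) - q(s,a)| values
mu = [0.7, 0.2, 0.1]            # training distribution
mu_next = [0.5, 0.2, 0.3]       # next policy's distribution

assert max(abs_err) <= 1.0 / (1.0 - gamma)
print(value_error(mu_next, abs_err))
print(shifted_error_bound(mu, mu_next, abs_err, gamma))
```

The gap between the two printed numbers illustrates that the bound is worst-case: it charges every unit of shifted probability mass the maximum possible error, regardless of where the mass actually moves.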
## 9Comparison of Theorem[4\.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3)to CPI Monotonic Policy Improvement Bound
We compare our Theorem [4.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3) to the related monotonic policy improvement bound from CPI, which we state below. Note that the following bound uses $\epsilon$, which is distinct from $\varepsilon(\mu, q^{\pi})$. CPI uses $\epsilon$ to denote error in the greedy policy selection subroutine. Recall that in our notation, $\varepsilon(\mu, q^{\pi_k})$ measures the explicit mean value estimation error of the critic on its distribution $\mu$. These terms are related, but are not interchangeable.
###### Theorem 9\.1\(agarwal21Theorem 12\.2\)\.
Assume access to a method which outputs a policyπ′\\pi^\{\\prime\}such that𝔸π′πk≥maxπ~𝔸π~πk−ϵ\\mathbb\{A\}^\{\\pi\_\{k\}\}\_\{\\pi^\{\\prime\}\}\\geq\\max\_\{\\tilde\{\\pi\}\}\\mathbb\{A\}^\{\\pi\_\{k\}\}\_\{\\tilde\{\\pi\}\}\-\\epsilon\. Then, the CPI updateπk\+1=\(1−δ\(1−γ\)\)πk\+δ\(1−γ\)π′\\pi\_\{k\+1\}=\(1\-\\delta\(1\-\\gamma\)\)\\pi\_\{k\}\+\\delta\(1\-\\gamma\)\\pi^\{\\prime\}yields the following performance bound:
Vπk\+1\(s0\)−Vπk\(s0\)≥11−γ\(δ\(1−γ\)\[maxπ~𝔸π~πk−ϵ\]−2δ2γ\)\\displaystyle V^\{\\pi\_\{k\+1\}\}\(s\_\{0\}\)\-V^\{\\pi\_\{k\}\}\(s\_\{0\}\)\\geq\\frac\{1\}\{1\-\\gamma\}\\left\(\\delta\(1\-\\gamma\)\[\\max\_\{\\tilde\{\\pi\}\}\\mathbb\{A\}^\{\\pi\_\{k\}\}\_\{\\tilde\{\\pi\}\}\-\\epsilon\]\-2\\delta^\{2\}\\gamma\\right\)\(19\)
Note that the distribution shift penalty in the last term is of $O(\delta^2/(1-\gamma))$, while for SV-API it is of $O(\delta/(1-\gamma)^2)$. This is because, for CPI, the coefficient $\delta(1-\gamma)$ scales down both the expected advantage and the distribution shift. This allows CPI to guarantee monotonic policy improvement for small enough $\delta$. In contrast, the bound in Theorem [4.3](https://arxiv.org/html/2605.05481#S4.Thmtheorem3) is due to an unconstrained policy update, and incurs no multiplier of $\delta$ on the improvement $\mathbb{A}^{\pi_k}_{\pi_{k+1}}$.
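The different scaling of the two bounds can be seen by evaluating Equations (9) and (19) side by side. This is only an illustrative sketch: the function names and inputs are hypothetical, and because CPI's $\epsilon$ and our $\varepsilon$ measure different errors, plugging the same number into both is a simplification made purely to compare the shape of the bounds.

```python
def cpi_lower_bound(max_advantage, eps_greedy, delta, gamma):
    """Right-hand side of Eq. (19), with mixture coefficient delta * (1 - gamma)."""
    alpha = delta * (1.0 - gamma)
    return (alpha * (max_advantage - eps_greedy)
            - 2.0 * delta ** 2 * gamma) / (1.0 - gamma)


def sv_api_lower_bound(est_advantage, eps_value, delta, gamma):
    """Right-hand side of Eq. (9), with no multiplier on the advantage."""
    return (est_advantage - eps_value
            - 2.0 * delta / (1.0 - gamma)) / (1.0 - gamma)


gamma, delta = 0.9, 0.01   # hypothetical discount and shift level
advantage, err = 0.5, 0.1  # hypothetical advantage and error terms

print(cpi_lower_bound(advantage, err, delta, gamma))
print(sv_api_lower_bound(advantage, err, delta, gamma))
```

For these inputs the CPI guarantee is small because the advantage is multiplied by $\delta(1-\gamma)$, while the SV-API guarantee keeps the full advantage but pays a larger additive penalty, so it turns negative sooner as $\delta$ grows.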
## 10Detailed Atari Convergence Threshold
Figure 5:Scaled value diff and convergence thresholdδv\\delta\_\{v\}for SV\-PPO on all 16 Atari games\.
## 11Detailed Atari Learning Curves
Figure 6:Average score and target policy lifespan for SV\-PPO on all 16 Atari games\.
## 12Learning Curves: Atari
Figure 7: Learning Curves for PPO, Dynamic SV-PPO, and Static SV-PPO for Atari (5 seeds, median shown).
## 13Four Rooms Experiment
Four Rooms (journals/ai/SuttonPS99) is a grid world with a fixed starting state (top left room) and goal state (bottom right room). The action is the intended direction of movement, and is executed noisily, with a $20\%$ chance of success. Hugging the walls mitigates the negative effect of an erroneous move. The optimal policy, shown in Figure ([8(a)](https://arxiv.org/html/2605.05481#S13.F8.sf1)), visits all four rooms: generally going through the top room, but occasionally going through the bottom room if brought there via noise.
We use 10,000 rounds of value iteration to solve for the optimal policy on Four Rooms\. We then compute the optimal actions, placing equal weight on ties, and compute the associated discounted stationary distribution\.
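The procedure described above (value iteration, greedy actions with equal weight on ties, then the discounted state distribution) can be sketched on a toy problem. The four-state MDP below is hypothetical and much smaller than Four Rooms, and the discount factor and iteration counts are assumptions made for illustration.

```python
def q_values(P, R, V, gamma, s):
    """One-step lookahead values for every action in state s."""
    return [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
            for a in range(len(P[s]))]


def value_iteration(P, R, gamma, n_iters):
    """P[s][a] is a list of (prob, next_state); R[s][a] is the expected reward."""
    V = [0.0] * len(P)
    for _ in range(n_iters):
        V = [max(q_values(P, R, V, gamma, s)) for s in range(len(P))]
    return V


def greedy_policy_with_ties(P, R, V, gamma, tol=1e-9):
    """Greedy policy that splits probability equally among tied best actions."""
    policy = []
    for s in range(len(P)):
        q = q_values(P, R, V, gamma, s)
        best = max(q)
        ties = [1.0 if best - x <= tol else 0.0 for x in q]
        policy.append([t / sum(ties) for t in ties])
    return policy


def discounted_visitation(P, policy, start, gamma, n_iters):
    """d(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | s_0 = start), truncated."""
    n = len(P)
    dist = [1.0 if s == start else 0.0 for s in range(n)]
    d, weight = [0.0] * n, 1.0 - gamma
    for _ in range(n_iters):
        d = [d[s] + weight * dist[s] for s in range(n)]
        nxt = [0.0] * n
        for s in range(n):
            for a, pa in enumerate(policy[s]):
                for p, s2 in P[s][a]:
                    nxt[s2] += dist[s] * pa * p
        dist, weight = nxt, weight * gamma
    return d


# State 0 branches to state 1 or 2; both lead to the absorbing goal state 3.
P = [[[(1.0, 1)], [(1.0, 2)]], [[(1.0, 3)], [(1.0, 3)]],
     [[(1.0, 3)], [(1.0, 3)]], [[(1.0, 3)], [(1.0, 3)]]]
R = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
gamma = 0.9
V = value_iteration(P, R, gamma, n_iters=100)
policy = greedy_policy_with_ties(P, R, V, gamma)
d = discounted_visitation(P, policy, start=0, gamma=gamma, n_iters=500)
print(V, policy[0], d)
```

Because both routes from state 0 are equally good, the tie-splitting policy visits states 1 and 2 with equal discounted weight, mirroring how the optimal Four Rooms policy places mass on more than one route.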
\(a\)Optimal Policy
\(b\)Associated State Visitation Distribution
Figure 8: Optimal Policies and Associated State Visitation Distribution

### 13.1 Experiment Details
We run an implementation of PPO due to purejaxrl, using a convolutional actor-critic network due to pqn. The observation space is visual (the entire grid, with the agent's position and wall marked). SV-PPO uses dynamic thresholding, and is discussed in detail in the next section. We run for 195 total rounds, each corresponding to an update to the behavioral policy. Based on the strictness of the convergence criterion, SV-PPO updates the target policy 40 times. Figure [3](https://arxiv.org/html/2605.05481#S4.F3) visualizes the convergence logic (corresponding to $\mathcal{C}$ in Algorithm [2](https://arxiv.org/html/2605.05481#alg2)): if the quantity shown in blue on the rightmost chart falls below the dotted line for four consecutive iterations, an update occurs.
### 13\.2Results
In terms of performance, SV\-PPO converges to the optimal policy\. PPO has unstable convergence to a local optimum that always attempts to go through the top rooms even if noise has pushed the agent closer to the bottom\-left room, while the optimal policy is to continue along the bottom route\. Of note is the sudden collapse in the performance of PPO in iterationk=80k=80\. This is due to mis\-generalization of the value network, sometimes called value churn\(tang2024churn\)\. Around iterationk=50k=50, PPO’s CNN critic over\-estimates the value along the left wall\. Byk=60k=60, the PPO policy network learns to avoid the bottom room\. Insufficient sampling and poor generalization leads to catastrophic forgetting occurring at iterationk=90k=90\. This is indicated by a massive degradation in the true value function at these states\. The “recovery” in value that occurs for PPO is actually to avoid these states, which is a sub\-optimal policy\. Detailed visualizations of the learning dynamics are in Figures \([9](https://arxiv.org/html/2605.05481#S13.F9)\) and \([10](https://arxiv.org/html/2605.05481#S13.F10)\)\.
By evaluating a fixed policy for longer, SV\-PPO mitigated value overestimation\. For example, it experienced less\-overestimation along the left wall, since it was fitting a lower value policy in the top right room\. Lower value estimates led to more consistent exploration of the problematic room – instead of just seeking the left wall, as PPO did\. The end result is that SV\-PPO learns a strong policy in the bottom left room, while PPO’s remains weak, even at the end of learning\.
Figure 9: Detailed Examination of PPO's Learning on Four Rooms, highlighting the catastrophic forgetting at iteration 90, and over-exploration of the bottom left corner due to value over-estimation at iteration 100. This leads the agent to avoid the room altogether.

Figure 10: Detailed Examination of SV-PPO's Learning on Four Rooms, highlighting iterations which saw value over-estimation in the bottom left corner, but less widespread than PPO's, and more quickly resolved by exploring the room. The final policy goes through the bottom room just like the optimal policy in Figure ([8](https://arxiv.org/html/2605.05481#S13.F8)).
## 14Static SV\-PPO Atari
Table 2:Static SV\-PPO on Atari\.By strictly gathering data for 9 rounds before every target policy update, SV\-PPO operates on 11% of the target updates of standard PPO as shown in\(b\) Updates\.Column \(a\)indicates the final converged return relative to PPO\.Column \(b\)indicates the fraction of roundskkwhere SV\-PPO updates the target policy\.Column \(c\)shows the size of the target policy jump as a multiplier of PPO’s, demonstrating that SV\-PPO executes substantially larger target updates\.Column \(d\)shows the same quantity for the behavioral policy remains constrained\.Environment\(a\)Score\(% of PPO\)\(b\)Updates\(% of PPO\)\(c\)KL\(πk\|\|πk\+1\\pi\_\{k\}\|\|\\pi\_\{k\+1\}\)\(×\\timesPPO\)\(d\)KL\(βk\|\|βk\+1\\beta\_\{k\}\|\|\\beta\_\{k\+1\}\)\(×\\timesPPO\)Asterix787±\\pm285114\.5x0\.7xFrostbite358±\\pm526117\.6x1\.1xPhoenix186±\\pm22114\.5x0\.8xNameThisGame184±\\pm20115\.3x1\.0xSeaquest125±\\pm17116\.9x1\.1xRiverraid120±\\pm9117\.3x1\.0xSpaceInvaders113±\\pm6114\.7x0\.9xBreakout107±\\pm3116\.2x0\.9xQbert101±\\pm14117\.5x0\.9xDoubleDunk100±\\pm11111\.3x2\.0xBattleZone100±\\pm11116\.6x0\.9xFreeway99±\\pm14114\.9x1\.0xKungFuMaster98±\\pm8116\.1x0\.9xAmidar92±\\pm10115\.7x0\.9xMsPacman90±\\pm14116\.7x0\.9xBowling70±\\pm32114\.9x0\.6x
## 15Dynamic SV\-PPO Atari
Table 3:Dynamic SV\-PPO on Atari 2600 environments\.Column \(a\)indicates the final converged return relative to PPO, with bold indicating significantly different performance\.Column \(b\)indicates the fraction of roundskkwhere SV\-PPO updates the target policy\.Columns \(c\) and \(d\)show statistics about the convergence value \(diffk/y¯T\\text\{diff\}\_\{k\}/\\overline\{y\}\_\{T\}in Algorithm[4](https://arxiv.org/html/2605.05481#alg4)\), normlaized by the thresholdδv\\delta\_\{v\}, such that a value≤1\\leq 1implies stability\.Column \(e\)shows the size of the target policy jump as a multiplier of PPO’s, demonstrating that SV\-PPO executes substantially larger target updates\.Column \(f\)shows the same quantity for the behavioral policy\.Environment\(a\)Score\(% of PPO\)\(b\)Updates\(% of PPO\)\(c\)diff𝒌/𝜹𝒗\\text\{diff\}\_\{k\}/\\delta\_\{v\}\(Median\)\(d\)Std\.diffk\\text\{diff\}\_\{k\}\(% ofδv\\delta\_\{v\}\)\(e\)KL\(πk\|\|πk\+𝟏\\pi\_\{k\}\|\|\\pi\_\{k\+1\}\)\(x PPO\)\(f\)KL\(βk\|\|βk\+𝟏\\beta\_\{k\}\|\|\\beta\_\{k\+1\}\)\(x PPO\)Asterix256±\\pm1517456270\.7x0\.5xPhoenix170±\\pm166185250\.9x0\.5xNameThisGame137±\\pm167575211\.3x1\.1xFrostbite119±\\pm405583371\.3x0\.6xBattleZone114±\\pm104796271\.8x0\.8xRiverraid111±\\pm16773271\.2x0\.7xBreakout111±\\pm54187909\.6x0\.7xFreeway105±\\pm5381061892\.3x0\.9xSpaceInvaders105±\\pm1241100241\.4x0\.6xKungFuMaster103±\\pm37967211\.2x1\.0xSeaquest101±\\pm188547181\.2x1\.0xDoubleDunk99±\\pm131645329927\.0x0\.9xQbert96±\\pm1140100281\.9x0\.6xAmidar94±\\pm275483361\.3x0\.6xMsPacman92±\\pm1927115392\.8x0\.4xBowling77±\\pm744010528652\.4x0\.7x
## 16Dynamic SV\-PPO Brax
Table 4:Dynamic SV\-PPO on Continuous ControlColumn \(a\)indicates the final converged return relative to PPO\.Column\(b\)indicates the fraction of roundskkwhere SV PPO updates the target policy\.Columns \(c\) and \(d\)show statistics about the convergence value \(diffk/y¯T\\text\{diff\}\_\{k\}/\\overline\{y\}\_\{T\}in Algorithm[4](https://arxiv.org/html/2605.05481#alg4)\), normalized by the thresholdδv\\delta\_\{v\}, such that a value≤1\\leq 1implies stability\.Column\(e\)indicates value loss \(MSBE\) at convergence as a multiplier of PPO’s\.Column \(f\)shows the size of the target policy jump as a multiplier of PPO’s, demonstrating that SV\-PPO executes substantially larger target updates\.Column \(g\)shows the same quantity for the behavioral policy\.Column \(h\)shows mean policy entropy of PPO at convergence\.
## 17Static SV\-PPO Brax
Table 5:Static SV\-PPO on Continuous Control\.Column \(a\)indicates the final converged return relative to PPO\.Column\(b\)indicates the fraction of roundskkwhere SV PPO updates the target policy\.Column\(c\)indicates value loss \(MSBE\) at convergence as a multiplier of PPO’s\.Column \(d\)shows the size of the target policy jump as a multiplier of PPO’s, demonstrating that SV\-PPO executes substantially larger target updates\.Column \(e\)shows the same quantity for the behavioral policy\.Column \(f\)shows mean policy entropy of PPO at convergence\.
## 18Game\-By\-Game Comparison: Atari
Figure 11:Game by Game comparison of Static and Dynamic SV\-PPO on Atari\. Each game is placed on the plot twice: a blue circle denotes the dynamic variant, and a triangle denotes the static variant\. A colored line connects the same game for the two variants\. The vertical axis is converged performance \(normalized by PPO’s\) and the horizontal axis is the average change to the target policy, as measured by KL\-divergence\. Thus, if the line slopes “upward” for a certain game, this indicates the static variant increased performance\. Moreover, the origin indicates PPO by construction\.
## 19Game\-By\-Game Comparison: Brax
Figure 12:Game by Game comparison of Static and Dynamic SV\-PPO on Brax\. Each game is placed on the plot twice: a blue circle denotes the dynamic variant, and a triangle denotes the static variant\. A colored line connects the same game for the two variants\. The vertical axis is converged performance \(normalized by PPO’s\) and the horizontal axis is the average change to the target policy, as measured by KL\-divergence\. Thus, if the line slopes “upward” for a certain game, this indicates the static variant increased performance\. Moreover, the origin indicates PPO by construction\.
## 20 Learning Curves: Brax
Figure 13: Learning curves for PPO, Dynamic SV-PPO, and Static SV-PPO for Brax.
## 21 SV-PPO Brax Hyperparameters
Table 6: Hyperparameters for Brax Continuous Control. Shared parameters (due to the PureJAXRL repository [purejaxrl]) are applied across all baselines and variants. SV-PPO specific parameters apply only to the dynamic and static gated algorithms.

| Category | Hyperparameter | Value |
| --- | --- | --- |
| **Shared PPO Hyperparameters** | | |
| Environment & Sampling | Total Timesteps | $5\times 10^{7}$ |
| | Number of Environments | 2048 |
| | Steps per Environment ($T$) | 10 |
| | Discount Factor ($\gamma$) | 0.99 |
| | GAE Parameter ($\lambda$) | 0.95 |
| Optimization (Network) | Learning Rate (Initial) | $3\times 10^{-4}$ |
| | Learning Rate (Final) | $1\times 10^{-5}$ |
| | LR Schedule | Linear |
| | Optimization Epochs | 4 |
| | Minibatch Size | 1024 |
| | Activation Function | Tanh |
| | Max Gradient Norm | 1.0 |
| | Initial Log Std | 0.5 |
| | Normalize Observations | True |
| | Normalize Rewards | True |
| | Warmup Steps | 1000 |
| Loss Coefficients | PPO Clip Range ($\epsilon$) | 0.2 |
| | Value Clip Range | 0.2 |
| | Value Coefficient ($c_2$) | 0.5 |
| | Entropy Coefficient ($\alpha_{ent}$) | 0.001 |
| **Dynamic SV-PPO Specific Hyperparameters** | | |
| Stability Gating | Value Diff Threshold ($\delta_v$) | 0.05 |
| | Min Stable Iters ($K_{min}$) | 2 |
| | Max Policy Delay ($K_{max}$) | 8 |
| | IS Ratio Bounds ($\bar{c},\bar{\rho}$) | $[0.0, 100.0]$ |
| | $K$ Decay Fraction | 0.05 |
| **Static SV-PPO Specific Hyperparameters** | | |
| | Min Stable Iters ($K_{min}$) | 8 |
| | Max Policy Delay ($K_{max}$) | 8 |
| | IS Ratio Bounds ($\bar{c},\bar{\rho}$) | $[0.0, 100.0]$ |
## 22 SV-PPO Atari Implementation and Hyperparameters
Table 7: Hyperparameters for Atari-16. The PPO implementation is due to PureJAXRL, with most hyperparameters due to the original paper [schulman2017proximalpolicyoptimizationalgorithms]. However, following Muesli’s [museli] strong implementation of PPO with sticky actions, we made several changes. First, separate IMPALA networks were used for actor and critic. The discount factor was raised to $\gamma=0.995$ and the maximum learning rate to $3\times 10^{-4}$. The minibatch size was also increased to 1,024. These changes improved performance on all tested games, and were shared by both runs. The SV-PPO stability gating hyperparameters were tuned on the MinAtar environments [young19minatar], which can be run much more quickly than experiments on Atari. SV-PPO specific parameters apply only to the dynamic and static gated algorithms.

| Category | Hyperparameter | Value |
| --- | --- | --- |
| **Shared PPO Hyperparameters** | | |
| Environment & Sampling | Total Timesteps | 200M |
| | Number of Environments | 64 |
| | Steps per Environment ($T$) | 128 |
| | Discount Factor ($\gamma$) | 0.995 |
| | GAE Parameter ($\lambda$) | 0.95 |
| | Frame Skip | 4 |
| | No-op Max | 30 |
| | Repeat Action Probability | 0.25 |
| | Episodic Life | True |
| | Reward Clipping | False |
| Optimization (Network) | Learning Rate (Initial) | $3\times 10^{-4}$ |
| | Learning Rate (Final) | $1\times 10^{-6}$ |
| | LR Schedule | Linear |
| | Optimization Epochs | 4 |
| | Minibatch Size | 1024 |
| | Max Gradient Norm | 1.0 |
| | Layer Normalization | True |
| Loss Coefficients | PPO Clip Range ($\epsilon$) | 0.1 |
| | Value Clip Range | 0.2 |
| | Value Coefficient ($c_2$) | 0.25 |
| | Entropy Coefficient ($\alpha_{ent}$) | 0.01 |
| **Dynamic SV-PPO Specific Hyperparameters** | | |
| Stability Gating | Value Diff Threshold ($\delta_v$) | 0.01 |
| | $K_{min}$ Decay Fraction | 0.2 |
| | Min Stable Iters ($K_{min}$) | 9 decays to 1 |
| | Max Policy Delay ($K_{max}$) | 33 |
| | IS Ratio Bounds ($\bar{c},\bar{\rho}$) | $[0.0, 5.0]$ |
| **Static SV-PPO Specific Hyperparameters** | | |
| Stability Gating | Min Stable Iters ($K_{min}$) | 9 |
| | Max Policy Delay ($K_{max}$) | 9 |
| | IS Ratio Bounds ($\bar{c},\bar{\rho}$) | $[0.0, 5.0]$ |
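One possible reading of how the stability-gating rows in the tables above interact is sketched below. Algorithm 4 is not reproduced in this section, so the logic here is an assumption rather than the authors' procedure: a round counts as stable when the relative value difference is at most $\delta_v$, the target policy is updated after $K_{min}$ consecutive stable rounds, and an update is forced once $K_{max}$ rounds have passed. The class name `StabilityGate`, its fields, and the "consecutive rounds" interpretation of $K_{min}$ are all hypothetical, and the $K_{min}$ decay schedule is omitted.

```python
from dataclasses import dataclass


@dataclass
class StabilityGate:
    """Hypothetical gate deciding when to commit a target policy update."""
    delta_v: float                 # value diff threshold (delta_v in the tables)
    k_min: int                     # assumed: consecutive stable rounds required
    k_max: int                     # maximum rounds a target update may be delayed
    rounds_since_update: int = 0
    stable_streak: int = 0

    def step(self, relative_diff: float) -> bool:
        """Record one round; return True if the target should be updated now.

        relative_diff stands in for diff_k / y_bar_T from the Table 4 caption.
        """
        self.rounds_since_update += 1
        if relative_diff <= self.delta_v:
            self.stable_streak += 1
        else:
            self.stable_streak = 0  # assumption: an unstable round resets the streak
        commit = (self.stable_streak >= self.k_min
                  or self.rounds_since_update >= self.k_max)
        if commit:
            self.rounds_since_update = 0
            self.stable_streak = 0
        return commit
```

Under this reading, the static variant ($K_{min}=K_{max}$) reduces to committing an update every $K_{max}$ rounds at the latest, regardless of the measured value difference.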
## 23 SV-PPO
Let $\pi_{\theta}$ denote the policy actor network, with $\theta_k$ parameterizing the current behavioral policy at round $k$, and $\overline{\theta}_k$ denoting the target policy parameters. The PPO policy optimization objective stays the same, except that the ratio $r_k(\theta)$ is the ratio of behavioral policies.
$$J^{CLIP}(\theta)=\hat{\mathbb{E}}_{t}\left[\min\left(\frac{\beta_{\theta}(a_{t}|s_{t})}{\beta_{\theta_{k}}(a_{t}|s_{t})}\hat{A}_{t},\ \text{clip}\left(\frac{\beta_{\theta}(a_{t}|s_{t})}{\beta_{\theta_{k}}(a_{t}|s_{t})},1-\epsilon,1+\epsilon\right)\hat{A}_{t}\right)\right] \tag{20}$$

In this way, the behavioral policy is allowed to drift further and further away from the target policy in each round. Once the convergence criterion is met, the update is just $\overline{\theta}\leftarrow\theta_{k+1}$.
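As a minimal sketch of Equation (20), the function below evaluates the clipped surrogate on a small batch using plain Python lists. The function name `clipped_surrogate` and its arguments are illustrative assumptions, not part of the PureJAXRL code; the point is only that the ratio is taken between the policy being optimized and the behavioral policy that collected the data, not the target policy.

```python
import math


def clipped_surrogate(logp_new, logp_behavior, advantages, eps=0.2):
    """Batch mean of the clipped surrogate in Equation (20).

    logp_new[i]      : log beta_theta(a_i | s_i), the policy being optimized
    logp_behavior[i] : log beta_{theta_k}(a_i | s_i), the data-collecting policy
    advantages[i]    : advantage estimate A_hat_i
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_behavior, advantages):
        ratio = math.exp(lp_new - lp_old)                 # behavioral-policy ratio
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)   # clip(ratio, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)          # pessimistic of the two terms
    return total / len(advantages)
```

With a positive advantage the objective stops rewarding ratios above $1+\epsilon$, while with a negative advantage a large ratio is still fully penalized, which is the usual PPO behavior.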
The critic training process must ensure that the value function’s fixed point corresponds to $\pi_{\overline{\theta}_k}$. Let the value network be $v_{\phi}$. We use the clipped importance sampling method V-Trace [espeholt2018impalascalabledistributeddeeprl], which can be written in recursive terms:
$$\rho_{t}=\min\left(\bar{\rho},\frac{\pi_{\overline{\theta}_{k}}(a_{t}|x_{t})}{\pi_{\theta_{k}}(a_{t}|x_{t})}\right),\quad c_{i}=\min\left(\lambda,\frac{\pi_{\overline{\theta}_{k}}(a_{i}|x_{i})}{\pi_{\theta_{k}}(a_{i}|x_{i})}\right) \tag{21}$$

$$\delta_{t}=r_{t}+\gamma v_{\phi}(s_{t+1})-v_{\phi}(s_{t})$$

$$y_{t}=v_{\phi}(s_{t})+\rho_{t}\delta_{t}+\gamma c_{t}\left(y_{t+1}-v_{\phi}(s_{t+1})\right)\quad\text{with}\quad y_{T+1}=v_{\phi}(s_{T+1}) \tag{22}$$

Since $c_t$ is at most $\lambda$, this results in TD($\lambda$) when on-policy ($\theta_k=\overline{\theta}_k$). To eliminate bias due to off-policy sampling, $\bar{\rho}$ must be set to infinity. Clipping with $\bar{\rho}<\infty$ stabilizes learning at the cost of adding some bias. We set $\bar{\rho}$ to 5. Since the advantage estimates are also off-policy, we modify the generalized advantage estimation (gae) used by PPO by incorporating the importance sampling weights from Equation ([21](https://arxiv.org/html/2605.05481#S23.E21)), which are originally due to Retrace($\lambda$) (DBLP:retrace).
$$\hat{A}_{t}=\delta_{t}+\gamma c_{t}\hat{A}_{t+1},\quad\hat{A}_{T+1}=0 \tag{23}$$

Finally, we normalize the estimates $\hat{A}_t$ by dividing by the batch standard deviation, but we do not center them, to avoid flipping the sign.
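The recursions in Equations (21) to (23) can be sketched as a single backward pass over one trajectory segment. This is an illustrative pure-Python version under stated assumptions, not the authors' JAX implementation: the function names, the zero-based indexing, the absence of episode-termination handling, and the small constant `tiny` added to the standard deviation are all choices made here for the example.

```python
import math


def vtrace_targets_and_advantages(rewards, values, bootstrap_value,
                                  logp_target, logp_behavior,
                                  gamma=0.995, lam=0.95, rho_bar=5.0):
    """Backward recursion for value targets y_t and advantages A_hat_t.

    values[t] is v_phi(s_t); bootstrap_value is v_phi at the state after the
    last step. logp_target and logp_behavior hold log-probabilities of the
    taken actions under the target and behavioral policies.
    Assumes no episode terminates inside the segment.
    """
    T = len(rewards)
    next_values = list(values[1:]) + [bootstrap_value]
    targets = [0.0] * T
    advantages = [0.0] * T
    next_target = bootstrap_value   # y_{T+1} = v_phi(s_{T+1})
    next_adv = 0.0                  # A_hat_{T+1} = 0
    for t in reversed(range(T)):
        ratio = math.exp(logp_target[t] - logp_behavior[t])
        rho = min(rho_bar, ratio)   # Equation (21)
        c = min(lam, ratio)         # Equation (21)
        delta = rewards[t] + gamma * next_values[t] - values[t]
        targets[t] = (values[t] + rho * delta
                      + gamma * c * (next_target - next_values[t]))  # Equation (22)
        advantages[t] = delta + gamma * c * next_adv                 # Equation (23)
        next_target, next_adv = targets[t], advantages[t]
    return targets, advantages


def scale_without_centering(advantages, tiny=1e-8):
    """Divide by the batch standard deviation without subtracting the mean."""
    mean = sum(advantages) / len(advantages)
    std = math.sqrt(sum((a - mean) ** 2 for a in advantages) / len(advantages))
    return [a / (std + tiny) for a in advantages]
```

Because the scaling divides by a positive number and never subtracts the mean, every advantage keeps its original sign, which is the property the normalization step above is meant to preserve.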