Not All Transitions Matter: Evidence from PPO
Summary
This paper investigates the temporal correlation problem in on-policy reinforcement learning with PPO, showing that randomly dropping a fixed fraction of transitions from rollouts reduces gradient redundancy and stabilizes training without degrading performance.
# Evidence from PPO: Reducing Temporal Correlation in PPO Without Degrading Performance
Source: [https://arxiv.org/html/2605.24071](https://arxiv.org/html/2605.24071)
Ajhesh Basnet (ajheshb@gmail.com), Department of Artificial Intelligence and Data Science, KPR Institute of Engineering and Technology, Coimbatore
###### Abstract
Training a reinforcement learning agent on\-policy means collecting fresh experience at every update, and that experience comes with a hidden problem\. Each state in a rollout is the direct output of the previous one, causally chained together by the agent’s own actions\. Because of this, consecutive transitions are never truly independent\. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests\. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal\.
This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilise training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty (CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5), the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping $25\%$ of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch. Source code: [https://github.com/ajheshbasnet/rollout-slim](https://github.com/ajheshbasnet/rollout-slim).
## 1 Introduction
Supervised learning assumes training samples are Independently and Identically Distributed \(IID\)\. Under this assumption, gradient signals across batches stay diverse and non\-redundant, and learning proceeds stably\. In on\-policy RL, this assumption does not just get bent\. It gets broken by construction\.
Within a single trajectory $\tau$, every state $s_{t+1}$ is a direct causal product of $s_t$. The trajectory is not a collection of independent observations but a chain, where each transition follows from the policy's own prior decisions. Feeding this into a neural network produces gradient vectors that point in nearly the same direction update after update, creating functionally collinear weight changes that slow and destabilize learning in ways that are hard to detect until training already looks wrong.
There is a second problem that is harder to see. Once the policy updates past $\tau_1$, the value network starts operating on states it was never trained on. Stale value estimates feed into corrupted advantage signals, which feed into corrupted policy gradients, which push the agent into new regions of state space the critic has seen even less of, and the cycle keeps tightening. This is a direct instance of the Deadly Triad [[2](https://arxiv.org/html/2605.24071#bib.bib2)]: function approximation combined with bootstrapping under a non-stationary data distribution.
Off\-policy methods like DQN and SAC sidestep this through experience replay\. Transitions get stored, shuffled, and randomly sampled, which approximates IID well enough in practice\. On\-policy methods like PPO\[[1](https://arxiv.org/html/2605.24071#bib.bib1)\]cannot do this\. The data must come from the current policy, so old transitions cannot be reused, and the temporal correlation problem is left sitting there unaddressed in every rollout\.
What makes this worth investigating further is a counterintuitive observation: if you randomly drop a large fraction of transitions from the rollout and still match the performance of the agent that trained on all of them, it strongly implies that most of those transitions were carrying nearly the same gradient signal anyway\. The states are so correlated with each other that the network was not really learning from each one independently — it was seeing the same information repeated\. That is the core finding this paper builds on\.
#### What about vectorized environments?
A common practical response to temporal correlation is to run 6–8 parallel environment copies simultaneously, collecting transitions from multiple independent trajectories at once. This does reduce correlation across the batch because each worker is at a different point in a different episode. It works. But running $N$ environments in parallel means $N$ times the memory and $N$ times the CPU overhead, and on a single machine or constrained hardware that cost adds up fast. The method presented here achieves similar decorrelation benefits from a single environment rollout by subsampling after advantage estimation. It is not a replacement for vectorized environments in every setting, but it is a significantly cheaper path to the same goal.
This paper looks at three methods for reducing that correlation without touching the core PPO objective or hurting performance\.
## 2 Background
### 2.1 On-Policy Trajectory Collection
In on-policy RL, the agent collects a trajectory by running its current policy $\pi_\theta$ in the environment:

$$\tau = (s_0, a_0, r_0) \rightarrow (s_1, a_1, r_1) \rightarrow (s_2, a_2, r_2) \rightarrow \cdots \rightarrow (s_T, a_T, r_T) \tag{1}$$

Every transition $(s_t, a_t, r_t, s_{t+1})$ is causally produced by the one before it, making the trajectory a temporally dependent chain rather than a collection of independent observations.
### 2.2 Temporal Correlation and Its Effect on Gradients
The policy gradient objective is:

$$\nabla J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla \log \pi_\theta(a_t \mid s_t) \cdot A_t\right] \tag{2}$$

This expectation assumes samples come from a stationary distribution. In practice, consecutive states $s_t$ and $s_{t+1}$ are highly similar, so the gradient vectors $\nabla \log \pi_\theta(a_t \mid s_t)$ and $\nabla \log \pi_\theta(a_{t+1} \mid s_{t+1})$ end up nearly parallel.
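The near-parallel claim can be illustrated with a toy example. The sketch below assumes a hypothetical two-action linear-softmax policy (not the network used in the paper) with made-up weights and states, and compares the score-function gradients $\nabla \log \pi_\theta(a \mid s)$ for two adjacent states against a state from a distant part of the state space.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def score_gradient(weights, state, action):
    """Flattened gradient of log pi(action | state) for a linear-softmax policy.

    For logits = W s, the gradient w.r.t. row a' of W is (1[a' == action] - pi(a' | s)) * s.
    """
    logits = [sum(w * s for w, s in zip(row, state)) for row in weights]
    probs = softmax(logits)
    grad = []
    for a, prob in enumerate(probs):
        coeff = (1.0 if a == action else 0.0) - prob
        grad.extend(coeff * s for s in state)
    return grad

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Illustrative values only (assumptions, not taken from the paper).
weights = [[0.2, -0.1, 0.05, 0.3], [-0.3, 0.4, 0.1, -0.2]]
s_t = [0.02, 0.50, -0.03, 0.10]
s_next = [0.03, 0.52, -0.028, 0.09]   # one small step after s_t
s_far = [-1.20, -0.40, 0.15, 0.80]    # a state from elsewhere

g_t = score_gradient(weights, s_t, action=0)
g_next = score_gradient(weights, s_next, action=0)
g_far = score_gradient(weights, s_far, action=0)

sim_adjacent = cosine(g_t, g_next)   # close to 1: nearly parallel
sim_distant = cosine(g_t, g_far)     # noticeably lower
```

In this toy setting the adjacent-state gradients are almost collinear while the distant-state gradient is not, which is the pattern the section describes. A deep network will behave less cleanly, so this should be read as intuition rather than evidence.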
### 2.3 The Non-Stationary Bootstrapping Feedback Loop
The TD update for the value network is:

$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_t + \gamma V(s_{t+1}) - V(s_t)\right] \tag{3}$$

Both $V(s_t)$ and $V(s_{t+1})$ come from the same network whose weights are shifting during training. After training on $\tau_1$, the critic fits those states well and the loss is low. After $n$ more trajectory updates the weights have drifted substantially. Evaluating the critic on $\tau_1$ now gives high error, even though that error was low at the time of original training. This compounds across trajectories and gives rise to the Non-Stationary Bootstrapping Feedback Loop, a specific case of the Deadly Triad [[2](https://arxiv.org/html/2605.24071#bib.bib2)]: function approximation plus bootstrapping plus a shifting data distribution.
To see why this is a problem, consider a simple analogy. Imagine a teacher who trains a student exclusively on calculus. Every example, every drill, every test comes from that one chapter. The student performs well on calculus. Then weeks later the teacher starts asking questions from algebra, assuming the student is equally well-prepared. The student struggles, not because they forgot calculus, but because they were never trained on the new material. The value network faces exactly this situation. It is trained on the state distribution of $\tau_1$, but as the policy updates the agent visits different states, and the distribution the critic is asked to evaluate shifts underneath it. The critic's predictions become unreliable not because the network forgot what it learned, but because what it learned no longer matches what it is being asked about.
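For reference, the TD update in Eq. (3) can be written out as a minimal tabular sketch. The paper uses a neural critic, so the dictionary-based value table, the `done` flag, and the numbers below are illustrative assumptions only; the point is that the target $r_t + \gamma V(s_{t+1})$ is built from the same estimator that is being updated.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Apply one TD(0) update in place and return the TD error.

    values: dict mapping state -> current value estimate (tabular stand-in for a critic).
    """
    # The bootstrap term reuses the current estimate of the next state.
    bootstrap = 0.0 if done else gamma * values[next_state]
    td_error = reward + bootstrap - values[state]
    values[state] += alpha * td_error
    return td_error

values = {"s0": 0.0, "s1": 1.0}
td_error = td0_update(values, "s0", reward=1.0, next_state="s1", alpha=0.5, gamma=0.9)
# td_error = 1.0 + 0.9 * 1.0 - 0.0 = 1.9, so V(s0) moves from 0.0 to 0.95
```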
### 2.4 Why On-Policy Methods Must Discard Old Data
Reusing trajectories from a past policy $\pi_{\theta_{\text{old}}}$ to update the current policy $\pi_\theta$ introduces distribution mismatch. The training data no longer reflects where the current policy actually goes, which biases the gradient in ways that compound over updates. Old trajectories are discarded not as a practical inconvenience but because using them violates the assumptions the objective was derived under.
### 2.5 Related Work
The data efficiency and correlation problems in on\-policy RL have attracted a growing body of work, each approaching the issue from a different angle\. GePPO\[[4](https://arxiv.org/html/2605.24071#bib.bib4)\]extends PPO to an off\-policy setting by deriving policy improvement guarantees that hold under sample reuse, connecting those bounds directly to the clipping mechanism in the original algorithm\. PTR\-PPO\[[5](https://arxiv.org/html/2605.24071#bib.bib5)\]takes a complementary approach, combining on\-policy collection with prioritized replay of older trajectories to squeeze more signal out of each rollout\. PPG\[[6](https://arxiv.org/html/2605.24071#bib.bib6)\]separates the policy and value optimization phases entirely, allowing the critic to train with higher sample reuse without interfering with the policy’s stability\. More recently, PROPS\[[7](https://arxiv.org/html/2605.24071#bib.bib7)\]observed that finite on\-policy samples often fail to match the true on\-policy distribution — that sampling error itself is a source of high\-variance gradients — and addressed this by using an adaptive off\-policy behavior policy to collect data that better approximates the current policy’s distribution\. SAPG\[[8](https://arxiv.org/html/2605.24071#bib.bib8)\]takes a different path: it divides parallel environments into blocks, each optimizing a separate policy, then combines them via an off\-policy update to recover data diversity that single\-policy rollouts inherently lack\.
A separate line of work has studied how the statistical structure of data within a rollout — not just how it is collected — shapes learning\. Hollenstein et al\.\[[9](https://arxiv.org/html/2605.24071#bib.bib9)\]found that correlated action noise in PPO systematically improves exploration, with the optimal noise color sitting between white and pink noise depending on the amount of data collected per update\. Tavakoli et al\.\[[10](https://arxiv.org/html/2605.24071#bib.bib10)\]showed more broadly that action redundancy — where different actions induce nearly identical next\-state transitions — is a fundamental problem in RL and degrades sample efficiency in both discrete and continuous settings\.
What these methods share is that they all intervene at the data collection stage: they change how trajectories are gathered, how old ones are weighted, or how behavior policies are chosen\. Our approach is different\. We leave the rollout and advantage estimation completely unchanged and only subsample transitions after the advantage computation is done, at the point where the gradient update is formed\. This means the reward signal is fully preserved while the optimization batch becomes less redundant — a distinction that turns out to matter, as Methods 1 and 2 in this paper demonstrate when they fail precisely because they intervene earlier and damage the credit assignment signal in the process\.
## 3 Methods
All three methods were evaluated using PPO on five environments: CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5.
### 3.1 Method 1: Fixed K-Step Sampling
Here, transitions are stored in the buffer only once every $K$ steps. Intermediate rewards are not thrown away: they are accumulated and added to the stored transition's reward. For example, with states $s_0, s_1, s_2, s_3, s_4$: $s_0$ is stored, $s_1$ and $s_2$ are skipped but their rewards are accumulated, and $s_3$ is stored with reward $r_0 + r_1 + r_2 + r_3$. This keeps the total reward signal while reducing the number of correlated transitions in the buffer.
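A minimal sketch of this storage rule is shown below. The function name and tuple layout are hypothetical, and the reward bookkeeping is an assumption: the accumulator is reset after every stored transition so that no reward is counted twice, which may differ from the authors' exact accounting in the example above.

```python
def fixed_k_step_buffer(transitions, k):
    """Keep every k-th transition, folding skipped rewards into the next stored one.

    transitions: list of (state, action, reward, next_state) tuples in rollout order.
    Rewards after the last stored step remain unassigned in this sketch.
    """
    buffer = []
    pending_reward = 0.0
    for t, (state, action, reward, next_state) in enumerate(transitions):
        pending_reward += reward
        if t % k == 0:
            # Stored transition carries its own reward plus those of skipped steps.
            buffer.append((state, action, pending_reward, next_state))
            pending_reward = 0.0
    return buffer

# Five transitions s0..s4 with rewards 1, 2, 3, 4, 5 and K = 3.
rollout = [(f"s{t}", 0, float(t + 1), f"s{t + 1}") for t in range(5)]
stored = fixed_k_step_buffer(rollout, k=3)
# stored keeps s0 (reward 1.0) and s3 (reward 2.0 + 3.0 + 4.0 = 9.0)
```

The stored reward at `s3` no longer says which of the three steps earned it, which is the credit assignment issue discussed in the result below.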
#### Motivation\.
Putting a fixed temporal gap between stored samples and accumulating intermediate rewards is supposed to reduce gradient collinearity without throwing away the reward signal from skipped steps entirely\.
#### Result\.
This method works reasonably well only in low\-complexity discrete environments like CartPole\-v1, where the state space is small, the reward signal is dense and simple, and the fixed skip interval is sufficient to capture the key state transitions\. However, it fails in environments of greater complexity\. On Acrobot\-v1, which has a sparser reward structure and requires coordinated multi\-joint control, the method shows early instability\. LunarLander\-v2 exposes the method’s core weakness\. Summing rewards over skipped steps loses the fine\-grained causal signal — the agent cannot tell what it did right or wrong at any specific timestep, and convergence suffers for it\. The fixed skip interval adds another problem on top: it punches the same holes in every trajectory, and those blind spots never rotate out\.
### 3.2 Method 2: Random Adaptive K-Step Sampling
Method 1 always skips the same positions, so the fix is straightforward: stop using a fixed interval and draw the skip randomly instead:

$$\varepsilon \sim \mathcal{N}(0, 1) \tag{4}$$

$$k' = \begin{cases} k & \text{if } \varepsilon > 0 \\ k + 1 & \text{if } \varepsilon \leq 0 \end{cases} \tag{5}$$

#### Motivation.
A fixed $K$ means the same states get skipped on every trajectory without exception. If $K = 2$, only even-indexed states ever enter the buffer and every odd-indexed state is permanently invisible to the optimizer. Randomizing the skip interval via $\varepsilon \sim \mathcal{N}(0, 1)$ fixes that: the gap shifts each time, so the buffer stops having a fixed parity bias and actually sees a broader, more representative slice of the trajectory over time.
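The randomized interval of Eqs. (4) and (5) can be sketched as follows. The function names, the tuple layout, and the choice to redraw the interval after every stored transition are assumptions made for illustration; the reward accumulation mirrors Method 1.

```python
import random

def random_skip_interval(k, rng):
    """Return k if eps > 0, else k + 1, with eps drawn from N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)
    return k if eps > 0 else k + 1

def random_k_step_buffer(transitions, k, seed=0):
    """Store transitions at gaps of k or k + 1 steps, accumulating skipped rewards."""
    rng = random.Random(seed)
    buffer = []
    pending_reward = 0.0
    next_store = 0
    for t, (state, action, reward, next_state) in enumerate(transitions):
        pending_reward += reward
        if t == next_store:
            buffer.append((state, action, pending_reward, next_state))
            pending_reward = 0.0
            # Redraw the gap so the skipped positions change over time.
            next_store = t + random_skip_interval(k, rng)
    return buffer

# States are labelled by their timestep so the stored positions are easy to read.
rollout = [(t, 0, 1.0, t + 1) for t in range(50)]
stored = random_k_step_buffer(rollout, k=2, seed=0)
stored_steps = [state for (state, _, _, _) in stored]
gaps = [b - a for a, b in zip(stored_steps, stored_steps[1:])]
```

Every gap is either `k` or `k + 1`, so both odd and even timesteps can enter the buffer, but rewards are still summed across the skipped steps.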
#### Result\.
Method 2 is a genuine improvement over Method 1 — CartPole\-v1 trains more cleanly, and on Acrobot\-v1 the rotating blind spots help noticeably\. But LunarLander\-v2 still does not work, and the reason is that randomizing the skip did not touch the actual problem: rewards are still being summed across skipped steps, and that summation is what kills precise credit assignment in environments with shaped rewards\. Both methods ultimately share the same flaw — they are fine for small, simple, discrete games, but the moment the environment requires the agent to understand exactly which action caused which outcome, they fall apart\.
### 3.3 Method 3: Random P% Trajectory Subsampling
This method matched standard PPO performance and in several cases exceeded it in stability\.
#### Key insight\.
Methods 1 and 2 both intervened at the data collection stage, before advantage estimation runs, which damaged reward signal integrity\. Method 3 intervenes at the optimisation stage, after advantage estimation, so the ground truth signal is fully preserved while the gradient updates are still decorrelated\.
#### Procedure\.
The full trajectory buffer is collected normally with no skipping\. Advantages are computed over the complete, unmodified transition sequence:
$$\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \tag{6}$$

After this, a randomly chosen $p\%$ of the $N$ transitions is sampled without replacement for the gradient update. The remaining $(100 - p)\%$ are excluded only from the optimisation step. Their reward contributions are already captured in the advantage estimates.
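The ordering described here (advantages first, subsampling second) can be sketched in plain Python. The function names are hypothetical, and resetting the accumulator at episode boundaries is a common GAE convention assumed here rather than something the text specifies. The subsampling fraction is passed as a value in $(0, 1]$.

```python
import math
import random

def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Generalised advantage estimates over the full, unmodified rollout."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * not_done * next_values[t] - values[t]
        # Assumption: the recursion is cut at episode ends.
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages

def subsample_indices(num_transitions, p, seed=None):
    """Choose floor(p * N) transition indices uniformly without replacement."""
    rng = random.Random(seed)
    keep = math.floor(p * num_transitions)
    return sorted(rng.sample(range(num_transitions), keep))

# Step 1: advantages use every transition, so no reward information is dropped.
advantages = compute_gae(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.0],
    dones=[False, False, True],
    gamma=1.0,
    lam=1.0,
)
# Step 2: only afterwards is a subset chosen for the gradient update.
kept = subsample_indices(1400, p=0.75, seed=0)
kept_advantages_demo = [advantages[i] for i in subsample_indices(len(advantages), 0.75, seed=0)]
```

Because `compute_gae` runs before `subsample_indices`, a dropped transition still contributes its reward to the advantages of the transitions that are kept.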
#### Theoretical justification\.
The idea borrows from Dropout\[[3](https://arxiv.org/html/2605.24071#bib.bib3)\], not the mechanism but the logic\. Dropout randomly kills neurons during the forward pass to stop the network from co\-adapting to redundant features\. Here the same principle is applied one level higher: instead of dropping neurons, we drop transitions\. Randomly removing correlated transitions from the gradient update stops the optimizer from repeatedly reinforcing the same near\-collinear gradient directions that on\-policy trajectories naturally produce\. The intervention point is different, Dropout sits inside the network while subsampling sits on the batch, but the principle is the same: inject controlled randomness to break correlated signal pathways before they corrupt optimisation\.
Three things follow naturally from this design:
1. Decorrelation: Randomly selecting which transitions enter the gradient update disrupts the sequential structure of the trajectory directly. No reward information is lost, just the redundant repetition of similar gradients.
2. Memory efficiency: Pushing only $p\%$ of transitions to the GPU per update lowers memory pressure, which in practice means cleaner, less noisy gradient computation.
3. Implicit regularization: Because the optimizer never sees the full correlated batch the same way twice, it cannot overfit to the local redundancy of any single trajectory, which nudges it toward policies that generalise better across state space.
Algorithm 1: PPO with Random $p\%$ Transition Subsampling

1. Policy $\pi_\theta$, critic $V_\phi$, rollout length $T$, subsampling fraction $p \in (0, 1]$, clip $\varepsilon$, discount $\gamma$, GAE parameter $\lambda$
2. for each iteration $k = 1, 2, \dots$ do
3. Execute $\pi_{\theta_k}$ for $T$ timesteps; collect buffer $\mathcal{D} = \{(s_t, a_t, r_t, s_{t+1}, d_t)\}_{t=0}^{T-1}$
4. Compute TD residuals $\delta_t = r_t + \gamma (1 - d_t) V_\phi(s_{t+1}) - V_\phi(s_t)$ for all $t$
5. Compute GAE advantages $\hat{A}_t = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l \delta_{t+l}$ and returns $\hat{R}_t = \hat{A}_t + V_\phi(s_t)$
6. Normalise: $\hat{A}_t \leftarrow \bigl(\hat{A}_t - \bar{A}\bigr) \big/ \bigl(\mathrm{std}(\hat{A}) + \epsilon\bigr)$
7. Draw subsampled index set $\mathcal{I} \subset \{0, \dots, T-1\}$, $|\mathcal{I}| = \lfloor p \cdot T \rfloor$, sampled uniformly without replacement
8. for each optimisation epoch do
9. Partition $\mathcal{I}$ into minibatches $\{\mathcal{M}_j\}$
10. for each minibatch $\mathcal{M}_j$ do
11. Compute probability ratio $\rho_t = \pi_\theta(a_t \mid s_t) \,/\, \pi_{\theta_k}(a_t \mid s_t)$
12. $\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}_{t \in \mathcal{M}_j}\left[\min\left(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t, 1 - \varepsilon, 1 + \varepsilon)\,\hat{A}_t\right)\right]$
13. $\mathcal{L}^{\text{VF}}(\phi) = \mathbb{E}_{t \in \mathcal{M}_j}\left[\bigl(V_\phi(s_t) - \hat{R}_t\bigr)^2\right]$
14. Update $\theta \leftarrow \theta + \alpha_\pi \nabla_\theta \mathcal{L}^{\text{CLIP}}(\theta)$ and $\phi \leftarrow \phi - \alpha_V \nabla_\phi \mathcal{L}^{\text{VF}}(\phi)$
15. end for
16. end for
17. end for
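Steps 7 to 12 of Algorithm 1 can be sketched without any deep learning framework. The helper names, the minibatch size of 256, and the clip value of 0.2 are assumptions for illustration; the rollout length of 1400 is the value stated in the experimental setup. In a real implementation the log-probabilities would come from the policy network and the loss would be differentiated.

```python
import math
import random

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate objective over one minibatch."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # rho_t in Algorithm 1
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)

def make_minibatches(indices, batch_size, rng):
    """Shuffle the subsampled index set and split it into minibatches."""
    order = list(indices)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

rng = random.Random(0)
T, p = 1400, 0.75

# Step 7: draw floor(p * T) indices uniformly without replacement.
kept = rng.sample(range(T), math.floor(p * T))

# Step 9: only the kept indices are partitioned into minibatches.
batches = make_minibatches(kept, batch_size=256, rng=rng)

# Step 12 on a one-element toy minibatch: ratio 1.5 is clipped to 1.2 for a positive advantage.
toy_objective = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
```

The index set is drawn once per iteration, before the epoch loop, so every optimisation epoch revisits the same subset in a different minibatch order.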
#### On early training and convergence behaviour\.
In the early stages of training, the dropped and undropped agents do not look meaningfully different in terms of the actions they take, and that is expected\. Early on the policy is essentially random regardless\. Whether you drop 25% of transitions or keep all of them, the agent is still exploring near\-uniformly and the loss is high either way\. The real difference shows up later\. As training progresses and backpropagation starts shaping the network’s internal representations, the subsampled agent begins generalising across states more smoothly, because its gradient updates were never locked into the redundant sequential structure of any single rollout\. Given enough training steps both agents reach comparable final reward, but the subsampled agent tends to get there more stably\. The reward is the last metric to reflect an improvement in training dynamics\. The real signal is in KL divergence and entropy, and that is exactly where Method 3 shows its edge\.
## 4 Experimental Setup
### 4.1 Environments
Experiments were run on five benchmark environments of increasing complexity:
- CartPole-v1: A pole balanced on a cart moving along a frictionless track. The agent pushes left or right to stop it falling. Rewards come every timestep the pole stays up, the state is just 4 numbers, and the whole thing runs fast. It is the simplest possible sanity check: if a method cannot work here, nothing else matters.
- Acrobot-v1: A two-link pendulum where the agent can only apply torque at the middle joint, and has to swing the free end up past a target height. Unlike CartPole, there is no reward for progress, just a penalty every timestep until the goal is reached. That sparsity makes credit assignment noticeably harder and exposes weaknesses that CartPole would never catch.
- LunarLander-v2: A lander that needs to touch down safely between two flags using main and side thrusters. The reward signal tracks position, velocity, tilt, leg contact, and fuel use all at once, which makes it a genuinely difficult shaped-reward problem. This is the environment where long-horizon credit assignment actually matters, and where Methods 1 and 2 are expected to struggle.
- HalfCheetah-v5: A planar two-legged robot that must learn to run forward as fast as possible. The state space is high-dimensional and the reward is dense but shaped around velocity, making it a strong test of whether the subsampling method holds up under continuous control with a rich observation space.
- Hopper-v5: A single-legged robot that must learn to hop forward without falling over. Despite having fewer degrees of freedom than HalfCheetah, Hopper is notoriously sensitive to instability: small policy errors compound quickly and cause the agent to fall. It tests whether Method 3 can maintain stable training under the kind of fragile dynamics that tend to amplify any noise in the gradient.
### 4.2 Algorithm
All experiments used PPO with 1400 rollout steps per update. The baseline is standard PPO trained on the full trajectory buffer with no subsampling.
### 4.3 Hyperparameters
Tables [1](https://arxiv.org/html/2605.24071#S4.T1) and [2](https://arxiv.org/html/2605.24071#S4.T2) report the full configuration for every run. Pure PPO refers to $p = 100\%$ (no subsampling); Skip-K and Rand-Skip refer to Method 1 and Method 2 respectively.

Table 1: Hyperparameters for CartPole-v1, Acrobot-v1, and LunarLander-v2.

Table 2: Hyperparameters for HalfCheetah-v5 and Hopper-v5 (shared configuration). Epochs per update and hidden dimension $d_{\text{model}}$ are reduced at lower $p$ to match the smaller effective batch.

The following apply to HalfCheetah-v5 and Hopper-v5 across all runs: actor and critic gradient norms are clipped at 0.5 and 0.8 respectively; the entropy coefficient $\beta$ decays linearly from $\beta_0 = 1 \times 10^{-4}$ to zero over the course of training. A rollout length of 2048 was chosen because both environments have an average episode length of approximately 1000 steps. Training on a single environment with a shorter buffer would produce gradient estimates too noisy to learn stable locomotion policies, so two effective environments' worth of experience is collected per update. All forward passes for these two environments were executed under mixed precision (FP16) using torch.autocast, with gradient scaling applied separately to the actor and critic to maintain numerical stability.
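The linear entropy-coefficient decay described above can be expressed as a small schedule function. The function name and the use of an update counter as the progress measure are assumptions; the text only states that $\beta$ decays linearly from $\beta_0 = 1 \times 10^{-4}$ to zero over training.

```python
def entropy_coefficient(step, total_steps, beta_0=1e-4):
    """Linearly decay the entropy coefficient from beta_0 at step 0 to zero at total_steps."""
    # Clamp progress to [0, 1] so the coefficient never goes negative.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return beta_0 * (1.0 - progress)

beta_start = entropy_coefficient(0, 1000)     # 1e-4 at the first update
beta_mid = entropy_coefficient(500, 1000)     # half of beta_0 at the midpoint
beta_end = entropy_coefficient(1000, 1000)    # 0.0 at the end of training
```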
### 4.4 Compute
All experiments were run on a single NVIDIA Tesla T4 GPU (16 GB VRAM) provided through Kaggle's free compute tier. No paid cloud resources were used. Tables [3](https://arxiv.org/html/2605.24071#S4.T3) and [4](https://arxiv.org/html/2605.24071#S4.T4) report wall-clock training times averaged over 5 independent seeds, as recorded by the Weights & Biases run dashboard.

Table 3: Average runtime per run for CartPole-v1, Acrobot-v1, and LunarLander-v2 (T4 GPU).

| Environment | Configuration | Avg. runtime (mins) |
|---|---|---|
| CartPole-v1 | Pure PPO | $\approx 22.8$ |
| CartPole-v1 | $p = 75\%$ | $\approx 20.3$ |
| CartPole-v1 | Skip-K | $\approx 21.6$ |
| CartPole-v1 | Rand-Skip | $\approx 20.1$ |
| Acrobot-v1 | Pure PPO | $\approx 28.2$ |
| Acrobot-v1 | $p = 80\%$ | $\approx 27.8$ |
| Acrobot-v1 | $p = 65\%$ | $\approx 25.4$ |
| LunarLander-v2 | Pure PPO | $\approx 43.4$ |
| LunarLander-v2 | $p = 75\%$ | $\approx 41.7$ |
| LunarLander-v2 | $p = 65\%$ | $\approx 40.03$ |
Table 4:Average runtime per run for HalfCheetah\-v5 and Hopper\-v5 \(T4 GPU\)\.
### 4.5 Evaluation Protocol
Each run was trained across 5 independent seeds\. At each evaluation checkpoint the agent was run for 1 episode per seed; the reported reward is the mean across all 5 seeds, which smooths environment stochasticity and gives a reliable picture of policy performance\.
### 4.6 Evaluation Metrics
The following metrics were tracked throughout training: KL divergence, policy entropy, explained variance, value bias, critic loss, and evaluation reward\. Training stability was assessed by comparing the variance of these metrics across updates between Method 3 and the vanilla PPO baseline\.
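The paper does not spell out how these metrics are computed, so the sketch below uses definitions that are common in PPO implementations and should be treated as assumptions: explained variance as $1 - \mathrm{Var}(R - V) / \mathrm{Var}(R)$, and a sample-based KL estimate as the mean of $\log \pi_{\text{old}} - \log \pi_{\text{new}}$ over the batch.

```python
from statistics import fmean, pvariance

def explained_variance(value_preds, returns):
    """1 - Var(returns - value_preds) / Var(returns); 1.0 means a perfect critic."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when the returns are constant
    residuals = [r - v for r, v in zip(returns, value_preds)]
    return 1.0 - pvariance(residuals) / var_returns

def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(old || new) from log-probabilities of the taken actions."""
    return fmean(lp_old - lp_new for lp_old, lp_new in zip(logp_old, logp_new))

ev_perfect = explained_variance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # critic matches returns
ev_constant = explained_variance([2.0, 2.0, 2.0], [1.0, 2.0, 3.0])   # critic predicts the mean
kl_example = approx_kl([-1.0, -1.0], [-1.5, -1.5])
```

Comparing the spread of such per-update values between the subsampled run and the baseline is one way to read the stability comparison described above.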
## 5 Results
Method 3 matched vanilla PPO on all five environments. Even after dropping $(100 - p)\%$ of transitions per rollout, KL divergence, policy loss, value loss, and evaluation reward all stayed nearly identical to the baseline, and in several cases were measurably more stable, with lower variance across updates. The takeaway is that a large fraction of transitions in a typical on-policy rollout are simply redundant. Cutting them out randomly does not hurt anything; if anything, it regularises the gradient.
Methods 1 and 2 only held together on CartPole-v1. Acrobot-v1 exposed early instability in both, and LunarLander-v2 broke them outright. This is consistent with the central argument: the intervention point is what matters, and both methods got it wrong by touching the data before advantage estimation had a chance to preserve the reward signal.
On HalfCheetah-v5 and Hopper-v5, Method 3 tracked vanilla PPO closely on reward, KL divergence, and entropy across all tested $p$ values. Hopper in particular showed reduced metric variance under subsampling, which is notable given how sensitive that environment is to policy instability.
#### On the choice of $p$.
At $p = 75\%$ all tracked metrics remained healthy across every environment: reward, entropy, and KL all matched vanilla PPO throughout training. Below $75\%$ the reward curve still looks fine, but entropy starts drifting and KL gets noisier: the optimizer is quietly losing the signal diversity it needs for stable exploration. The reward is the last metric to break. $p = 75\%$ is where all metrics still agree, which is why it is treated as the recommended threshold.
## 6 Discussion
What these experiments show, taken together, is that temporal correlation in on\-policy trajectories should be addressed after advantage estimation, not before or during data collection\. Intervening before advantage estimation destroys the reward signal the agent depends on for credit assignment\. Intervening after leaves it completely intact, while the random subsampling of transitions for the weight update introduces enough stochasticity to break the correlated gradient structure\.
A reasonable objection here is that standard PPO already shuffles rollout data into minibatches before each gradient update, so doesn't that handle the correlation? It does not. Shuffling changes the order transitions arrive in but does not remove any of them. The core problem is that the states within a single on-policy rollout are highly similar to one another: they follow causally from the same policy acting in the same environment over a short window of time. Feeding all of them into the optimizer, regardless of order, still pushes the gradient repeatedly in nearly the same direction. Subsampling to $p\%$ actively discards some of that redundant overlap, which is precisely why it reduces gradient collinearity in a way that shuffling cannot. The key result here is not just parity with vanilla PPO but the fact that a strict subset of the rollout transitions is sufficient to recover identical final performance, which is direct evidence that the full correlated batch was contributing less unique gradient signal than its size implied.
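The difference between the two operations is easy to see on index sets. The snippet below is a small illustration with an arbitrary seed, using the 1400-step rollout from the experimental setup and $p = 75\%$: shuffling returns the same set of transitions in a new order, while subsampling leaves a strict subset.

```python
import math
import random

rng = random.Random(0)
rollout_indices = list(range(1400))

# Shuffling: every transition is still present, only the order changes.
shuffled = rollout_indices[:]
rng.shuffle(shuffled)
same_set_after_shuffle = set(shuffled) == set(rollout_indices)

# Subsampling: a quarter of the transitions never reach the optimizer this iteration.
subsampled = rng.sample(rollout_indices, math.floor(0.75 * len(rollout_indices)))
dropped = set(rollout_indices) - set(subsampled)
```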
Method 3 also does not touch anything it should not. Methods 1 and 2 both accumulated rewards across skipped transitions, which quietly broke the Markov assumption: the stored $s_t$ ended up carrying information from future states it never actually observed. Method 3 has no such side effect. The environment, the rollout, advantage estimation, and the clipping objective are all identical to vanilla PPO. The only change is a single random sampling step between advantage estimation and the gradient update.
#### On network size, learning rates, and gradient clipping\.
For the continuous control environments, keeping the network deliberately small turned out to matter more than it might seem\. A larger hidden dimension with more epochs per update causes the critic to overfit quickly to the current batch\. The MSE loss drops fast and then starts oscillating as the value estimates chase a moving target\. A smaller network generalises more smoothly across the state distribution and stays stable for longer\. The learning rate asymmetry between actor and critic follows the same logic\. The critic is trained via MSE and converges much faster, so a higher critic learning rate is fine and actually helps it track the returns more accurately\. The actor, by contrast, needs to change slowly because aggressive policy updates in locomotion tasks compound into instability very quickly\. Keeping the actor learning rate lower ensures that each policy step is a small, grounded improvement rather than an overcorrection\. The critic gradient norm being clipped at 0\.8 compared to the actor at 0\.5 reflects the same asymmetry\. The critic gradient can grow large during early training when value estimates are far off, and clipping it at a slightly higher threshold lets it learn faster without exploding, while the actor is kept on a tighter leash throughout\.
## 7 Conclusion
On-policy rollouts carry more redundancy than is commonly assumed. The experiments here show that randomly dropping $25\%$ of transitions before the gradient update, at the right stage so the reward signal stays intact, is enough to break the correlated gradient structure and stabilise training, without touching anything else in the PPO pipeline. The method matches standard PPO on evaluation reward across five environments while consistently reducing variance in the training metrics that matter most.
The result was stronger than expected. Despite dropping $(100 - p)\%$ of transitions per update, evaluation returns after sufficient training are nearly identical to the undropped baseline and often slightly better. This is not a coincidence: it is direct empirical evidence that the states within a rollout are so highly correlated that a large fraction of them contribute little unique information to the gradient. The network was already learning from redundant signal. Removing some of it does not hurt; it cleans things up.
On the choice of $p$: at $75\%$ all tracked metrics still agree, with reward, entropy, and KL divergence all healthy. Below that the reward holds but entropy drifts and KL gets noisier, and the optimizer is quietly losing grip on stable exploration before the reward has reflected it. Dropping exactly $25\%$ is the practical sweet spot.
## References
- \[1\] Schulman, J\., Wolski, F\., Dhariwal, P\., Radford, A\., & Klimov, O\. \(2017\)\. Proximal Policy Optimization Algorithms\. arXiv preprint arXiv:1707\.06347\.
- \[2\] Sutton, R\. S\., & Barto, A\. G\. \(2018\)\. Reinforcement Learning: An Introduction \(2nd ed\.\)\. MIT Press\.
- \[3\] Srivastava, N\., Hinton, G\., Krizhevsky, A\., Sutskever, I\., & Salakhutdinov, R\. \(2014\)\. Dropout: A Simple Way to Prevent Neural Networks from Overfitting\. Journal of Machine Learning Research, 15\(1\), 1929–1958\.
- \[4\] Queeney, J\., Paschalidis, I\. C\., & Cassandras, C\. G\. \(2021\)\. Generalized Proximal Policy Optimization with Sample Reuse\. arXiv preprint arXiv:2111\.00072\.
- \[5\] Liang, X\., Ma, Y\., Feng, Y\., & Liu, Z\. \(2021\)\. PTR\-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay\. arXiv preprint arXiv:2112\.03798\.
- \[6\] Cobbe, K\., Hilton, J\., Klimov, O\., & Schulman, J\. \(2021\)\. Phasic Policy Gradient\. Proceedings of the 38th International Conference on Machine Learning \(ICML\)\.
- \[7\] Corrado, N\. E\., & Hanna, J\. P\. \(2023\)\. On\-Policy Policy Gradient Reinforcement Learning Without On\-Policy Sampling\. arXiv preprint arXiv:2311\.08290\.
- \[8\] Makoviychuk, V\., et al\. \(2024\)\. SAPG: Split and Aggregate Policy Gradients\. arXiv preprint arXiv:2407\.20230\.
- \[9\] Hollenstein, J\., Martius, G\., & Piater, J\. \(2024\)\. Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling\. Proceedings of the AAAI Conference on Artificial Intelligence\. arXiv preprint arXiv:2312\.11091\.
- \[10\] Tavakoli, A\., Fatemi, M\., & Kormushev, P\. \(2021\)\. Action Redundancy in Reinforcement Learning\. arXiv preprint arXiv:2102\.11329\.
## Appendix: Experimental Results and Graphs
Unless noted otherwise, the figures compare Vanilla PPO, Method 1 \(Fixed K\-Step\), Method 2 \(Random Adaptive K\-Step\), and Method 3 \(Random $p\%$ Subsampling\), with 1400 rollout steps per update\. The HalfCheetah\-v5 and Hopper\-v5 figures compare only Method 3 against the vanilla PPO baseline\.
### CartPole\-v1 \(Fig\. 1–6, $p = 75\%$\)
Figure 1: CartPole\-v1: Training reward\.
Figure 2: CartPole\-v1: Evaluation reward\.
Figure 3: CartPole\-v1: KL divergence\.
Figure 4: CartPole\-v1: Policy entropy\.
Figure 5: CartPole\-v1: Explained variance\.
Figure 6: CartPole\-v1: Value bias\.
### Acrobot\-v1 \(Fig\. 7–12\)
Figure 7: Acrobot\-v1: Evaluation reward\.
Figure 8: Acrobot\-v1: Critic loss\.
Figure 9: Acrobot\-v1: KL divergence\.
Figure 10: Acrobot\-v1: Policy entropy\.
Figure 11: Acrobot\-v1: Explained variance\.
Figure 12: Acrobot\-v1: Value bias\.
### LunarLander\-v2 \(Fig\. 13–18\)
Figure 13: LunarLander\-v2: Training reward\.
Figure 14: LunarLander\-v2: Evaluation reward\.
Figure 15: LunarLander\-v2: KL divergence\.
Figure 16: LunarLander\-v2: Policy entropy\.
Figure 17: LunarLander\-v2: Explained variance\.
Figure 18: LunarLander\-v2: Value bias\.
### HalfCheetah\-v5 \(Fig\. 19–25\)
Figure 19: HalfCheetah\-v5: Training reward\.
Figure 20: HalfCheetah\-v5: Evaluation reward\.
Figure 21: HalfCheetah\-v5: KL divergence\.
Figure 22: HalfCheetah\-v5: Policy entropy\.
Figure 23: HalfCheetah\-v5: Explained variance\.
Figure 24: HalfCheetah\-v5: Value bias\.
Figure 25: HalfCheetah\-v5: Critic loss\.
### Hopper\-v5 \(Fig\. 26–29\)
Figure 26: Hopper\-v5: Training reward\.
Figure 27: Hopper\-v5: Evaluation reward\.
Figure 28: Hopper\-v5: KL divergence\.
Figure 29: Hopper\-v5: Explained variance\.