KLip-PPO: A per-sample KL perspective on PPO-Clip

arXiv cs.LG 06/24/26, 04:00 AM Papers
reinforcement-learning ppo kl-divergence trust-region policy-gradient continuous-control
Summary
This paper shows that the gradient of the clipped surrogate in Proximal Policy Optimization (PPO) is exactly reproduced by a per-sample Kullback-Leibler penalty with a variable coefficient, revealing structural features of the clipped surrogate and suggesting new design directions.
arXiv:2606.23932v1 Announce Type: new Abstract: Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback-Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback-Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the min notation hides. PPO-Clip's implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.
Cached at: 06/24/26, 07:49 AM
# KLip-PPO: A per-sample KL perspective on PPO-Clip
Source: [https://arxiv.org/html/2606.23932](https://arxiv.org/html/2606.23932)
Riccardo Colletti, University of California, Berkeley, riccardo\_colletti \[at\] berkeley.edu; Robin Holzinger, University of California, Berkeley, robin.holzinger \[at\] berkeley.edu

###### Abstract

Proximal Policy Optimization (PPO) is the standard policy-gradient algorithm for on-policy reinforcement learning. The literature presents it in two forms, a clipped surrogate that bounds the importance ratio between successive policies and a Kullback–Leibler penalty between them. These forms are treated as separate algorithms with their own gradients, their own hyperparameters, and their own reference implementations, and a sizeable body of empirical work compares them. We show that the gradient of the clipped surrogate is reproduced exactly by a Kullback–Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop, and on five MuJoCo continuous-control benchmarks the two losses produce indistinguishable training curves. The reformulation exposes a structural feature of the clipped surrogate that the $\min$ notation hides. PPO-Clip's implicit per-sample penalty is a step function at the boundary of the trust region, and the shape of this coefficient is the natural design axis for generalising the algorithm. We sketch the resulting follow-up directions in the discussion.

Code: [https://github.com/learning-mechanisms/KLip-PPO](https://github.com/learning-mechanisms/KLip-PPO). Project page: [https://klip-ppo.org](https://klip-ppo.org/). Public W&B artifacts: [https://wandb.ai/KLip-PPO/KLip-PPO](https://wandb.ai/KLip-PPO/KLip-PPO).

## 1 Introduction

Proximal Policy Optimization \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] is the default policy-gradient algorithm for on-policy reinforcement learning. The method approximates the trust-region step of TRPO \[[13](https://arxiv.org/html/2606.23932#bib.bib13),[9](https://arxiv.org/html/2606.23932#bib.bib9)\] by maximising a surrogate objective that keeps the new policy close to the rollout policy. The original work proposes two surrogates. The first clips the importance ratio between the new and old policies inside a fixed band. The second adds an adaptive Kullback–Leibler penalty between them. Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] report that the clipped variant outperforms the penalty variant on MuJoCo continuous-control benchmarks and recommend it as the default. The community has followed the recommendation. PPO-Clip is the de facto choice in modern open-source implementations \[[12](https://arxiv.org/html/2606.23932#bib.bib12),[7](https://arxiv.org/html/2606.23932#bib.bib7)\], and the clipped surrogate is the building block of token-level extensions such as GRPO \[[16](https://arxiv.org/html/2606.23932#bib.bib16)\] that underpin recent reasoning models \[[2](https://arxiv.org/html/2606.23932#bib.bib2)\].

Subsequent empirical work has scrutinised PPO from many angles. Engstrom et al. \[[4](https://arxiv.org/html/2606.23932#bib.bib4)\] show that "code-level" optimisations explain most of PPO's gain over TRPO and that the clip mechanism itself is not load-bearing for performance. Ilyas et al. \[[8](https://arxiv.org/html/2606.23932#bib.bib8)\] demonstrate that auxiliary optimisations, rather than the clip term, are what actually maintain the trust region. Andrychowicz et al. \[[1](https://arxiv.org/html/2606.23932#bib.bib1)\] train over $2.5\times 10^{5}$ agents to compare design choices and recommend PPO-Clip among five policy losses, though they do not run a standalone PPO-KL variant. Hsu et al. \[[6](https://arxiv.org/html/2606.23932#bib.bib6)\] report that KL-regularised PPO matches or outperforms the clipped variant outside MuJoCo with Gaussian policies, and Sun et al. \[[17](https://arxiv.org/html/2606.23932#bib.bib17)\] argue that ratio clipping is not necessary in PPO at all. Across this body of work the clip and KL forms are treated as alternative algorithmic choices to be compared empirically.

We show that this treatment misunderstands the relationship between the two surrogates. The per-sample gradient of PPO-Clip is reproduced exactly by a Kullback–Leibler surrogate whose coefficient varies per sample, with closed-form dependence on the importance ratio and the advantage. The identity holds at every minibatch step and across the entire inner loop. On HalfCheetah, Hopper, Walker2d, Ant, and Humanoid \[[20](https://arxiv.org/html/2606.23932#bib.bib20)\] the two losses produce indistinguishable training curves. Where the original PPO paper notes that $L^{\mathrm{CLIP}}$ and the unclipped surrogate agree to first order around $\theta_{\mathrm{old}}$ \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\], we strengthen the statement to a per-sample identity that holds at every parameter configuration. Recent work has analysed gradient-level relationships between different KL formulations \[[10](https://arxiv.org/html/2606.23932#bib.bib10)\] and proposed unified clip-plus-KL design frameworks \[[24](https://arxiv.org/html/2606.23932#bib.bib24)\], but to our knowledge no prior work establishes the per-sample equivalence between PPO-Clip and PPO-KL itself.

Making the implicit coefficient explicit clarifies the position of PPO-Clip in the policy-optimisation landscape. The clipped surrogate is a per-sample KL penalty whose coefficient is a step function on the trust-region boundary, and the *shape of this coefficient* is the natural design axis for generalising the algorithm. Soft relaxations of the step, asymmetric and position-conditioned penalties, and off-policy extensions all fit inside the same template; we sketch these directions in Section [6](https://arxiv.org/html/2606.23932#S6).

The paper is organised as follows. Section [2](https://arxiv.org/html/2606.23932#S2) reviews PPO-Clip, PPO-KL, and the existing comparisons between them. Section [3](https://arxiv.org/html/2606.23932#S3) states and proves the per-sample gradient identity. Section [4](https://arxiv.org/html/2606.23932#S4) validates the identity empirically on five MuJoCo continuous-control benchmarks. Section [5](https://arxiv.org/html/2606.23932#S5) concludes, and Section [6](https://arxiv.org/html/2606.23932#S6) surveys natural extensions of the framework.

## 2 Background

We adopt the standard on-policy actor-critic setting \[[18](https://arxiv.org/html/2606.23932#bib.bib18),[19](https://arxiv.org/html/2606.23932#bib.bib19)\]. A behaviour policy $\pi_{\theta}$ collects a rollout of $N$ trajectories of horizon $H$ in a Markov decision process. For each sample $(i,t)$ the rollout records a state $s_t^{(i)}$, an action $a_t^{(i)}$, a reward $r_t^{(i)}$, and an advantage estimate $\hat{A}_t^{(i)}$ produced by generalized advantage estimation \[[14](https://arxiv.org/html/2606.23932#bib.bib14)\]. An update step lifts the rollout policy $\pi_{\theta}$ to a new policy $\pi_{\theta'}$ by repeated stochastic gradient steps on a surrogate objective \[[9](https://arxiv.org/html/2606.23932#bib.bib9),[13](https://arxiv.org/html/2606.23932#bib.bib13)\]; the importance ratio $w_t^{(i)} = \pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)}) / \pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})$ tracks how much $\pi_{\theta'}$ has moved away from $\pi_{\theta}$ on the sampled actions. The two surrogate objectives we study are the proximal-policy \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] pair: PPO-Clip and PPO-KL.

### 2.1 The surrogate objective

Let $\pi_{\theta}$ denote the policy parameterised by $\theta$. PPO collects a rollout under the behaviour policy $\pi_{\theta}$ and updates the parameters to $\theta'$ by repeated stochastic gradient steps on a surrogate objective. The starting point is the importance-sampled return \[[9](https://arxiv.org/html/2606.23932#bib.bib9),[13](https://arxiv.org/html/2606.23932#bib.bib13)\], written for a batch of $N$ episodes of horizon $H$ as

$$
L_{\mathrm{IS}}(\theta') \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} w_t^{(i)}\,\hat{A}_t^{(i)}, \qquad w_t^{(i)} \;=\; \frac{\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)},
$$

where $\hat{A}_t^{(i)}$ is an estimator of the advantage at the transition $(s_t^{(i)}, a_t^{(i)})$. Direct optimisation of $L_{\mathrm{IS}}$ is unstable because the importance ratios $w_t^{(i)}$ can become large when $\theta'$ drifts away from $\theta$. Schulman et al. \[[13](https://arxiv.org/html/2606.23932#bib.bib13)\] address the instability by constraining the KL divergence between $\pi_{\theta'}$ and $\pi_{\theta}$. Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] replace the constrained optimisation by one of two penalised first-order surrogates.
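As a minimal sketch of the importance-sampled return, the snippet below computes the ratio from log-probabilities, which is a common numerical convention and an assumption here rather than something the paper prescribes. The function names and the flattened toy rollout are hypothetical.

```python
import math


def importance_ratio(logp_new, logp_old):
    """w = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(logp_new - logp_old)


def is_surrogate(logp_new, logp_old, advantages, n_episodes):
    """Importance-sampled return: (1/N) * sum over all samples of w * A."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        total += importance_ratio(lp_new, lp_old) * adv
    return total / n_episodes


# Hypothetical flattened rollout: 2 episodes, 4 samples in total.
logp_old = [-1.2, -0.7, -2.0, -0.3]
advantages = [0.5, -1.0, 2.0, 0.25]

# Before any update the two policies coincide, so every ratio equals 1
# and the surrogate reduces to the summed advantages divided by N.
print(is_surrogate(logp_old, logp_old, advantages, n_episodes=2))
```

Once the new policy moves, the log-probability gap grows and the ratios can become large, which is the instability the text describes.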

### 2.2 PPO-Clip

The clipped surrogate of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] is

$$
L_{\mathrm{CLIP}}(\theta') \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} \min\!\Bigl\{ w_t^{(i)}\,\hat{A}_t^{(i)},\;\; \mathrm{clip}\!\left(w_t^{(i)},\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t^{(i)} \Bigr\}, \tag{1}
$$

with $\epsilon \in (0,1)$ a fixed hyperparameter (typically $\epsilon = 0.2$). The $\min$ acts as a one-sided trust region. When $\hat{A}_t^{(i)} > 0$, increasing $w_t^{(i)}$ beyond $1+\epsilon$ no longer increases the surrogate; when $\hat{A}_t^{(i)} < 0$, decreasing $w_t^{(i)}$ below $1-\epsilon$ no longer increases the surrogate. This prevents the optimiser from exploiting large positive surrogate values that would correspond to large policy changes.
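The one-sided behaviour of the $\min$ can be checked on a single sample. The sketch below evaluates the per-sample term of the clipped surrogate for a few hand-picked ratios; the helper names are hypothetical and $\epsilon = 0.2$ is the typical value mentioned above.

```python
def clip(x, lo, hi):
    """Clamp x into the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(w * A, clip(w, 1 - eps, 1 + eps) * A)."""
    return min(w * adv, clip(w, 1.0 - eps, 1.0 + eps) * adv)


# Positive advantage: the term stops growing once w exceeds 1 + eps.
print(clipped_term(1.1, 2.0))   # inside the band, equals w * A
print(clipped_term(1.5, 2.0))   # saturated at (1 + eps) * A
# Negative advantage: the term stops growing once w drops below 1 - eps.
print(clipped_term(0.5, -2.0))  # saturated at (1 - eps) * A
# Moving against the advantage is never clipped.
print(clipped_term(1.5, -2.0))  # equals w * A
```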

### 2.3 PPO-KL

The KL-penalised surrogate of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] adds an explicit divergence term,

$$
\mathcal{L}_{\mathrm{KL}}(\theta') \;=\; L_{\mathrm{IS}}(\theta') \;-\; \beta\,\widehat{D}_{\mathrm{KL}}\!\bigl[\pi_{\theta}\,\|\,\pi_{\theta'}\bigr],
$$

where $\widehat{D}_{\mathrm{KL}}$ is an empirical KL estimate and $\beta > 0$ is a scalar coefficient that is held fixed or adapted at the end of each outer update to keep the empirical KL near a target \[[15](https://arxiv.org/html/2606.23932#bib.bib15),[5](https://arxiv.org/html/2606.23932#bib.bib5)\]. In the fixed-$\beta$ form, $\beta$ is shared across samples and across training. In the adaptive form, $\beta$ is multiplied or divided by a constant whenever the empirical KL exceeds or falls below the target.

In both forms, $\beta$ is a single scalar applied uniformly to all samples in the batch. The comparison of clipping against scalar-KL penalisation has been studied extensively \[[15](https://arxiv.org/html/2606.23932#bib.bib15),[4](https://arxiv.org/html/2606.23932#bib.bib4),[8](https://arxiv.org/html/2606.23932#bib.bib8),[1](https://arxiv.org/html/2606.23932#bib.bib1),[6](https://arxiv.org/html/2606.23932#bib.bib6),[17](https://arxiv.org/html/2606.23932#bib.bib17)\]. The two surrogates are taken throughout this literature as algorithmically distinct; the gradient identity we establish in Section [3](https://arxiv.org/html/2606.23932#S3) shows that, at the per-sample level, they are not.
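The adaptive rule for the scalar coefficient can be sketched in a few lines. The tolerance band of 1.5 and the scaling factor of 2 below are illustrative assumptions, not values taken from this paper, and the function name is hypothetical.

```python
def adapt_beta(beta, kl, kl_target, tol=1.5, factor=2.0):
    """Rescale the scalar KL coefficient after an outer update.

    tol and factor are assumed illustrative constants: beta grows when the
    empirical KL overshoots the target and shrinks when it undershoots.
    """
    if kl > kl_target * tol:
        return beta * factor   # policy moved too far: penalise harder
    if kl < kl_target / tol:
        return beta / factor   # policy barely moved: relax the penalty
    return beta                # within the tolerance band: leave unchanged


print(adapt_beta(1.0, kl=0.05, kl_target=0.01))
print(adapt_beta(1.0, kl=0.001, kl_target=0.01))
print(adapt_beta(1.0, kl=0.01, kl_target=0.01))
```

Whatever constants are chosen, the result is one number shared by every sample in the batch, which is the property the per-sample view relaxes.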

## 3 The per-sample KL view of PPO-Clip

### 3.1 Partition of the rollout

The gradient of $L_{\mathrm{CLIP}}$ depends on which of the two arguments of the inner $\min$ is active for each sample. Fix $\theta'$ and define the importance ratio $w_t^{(i)}$ as in Section [2](https://arxiv.org/html/2606.23932#S2). The pairs $(i,t)$ partition into three disjoint index sets,

$$
\begin{aligned}
\mathcal{I}_{\mathrm{in}} &\;=\; \bigl\{(i,t) \;:\; w_t^{(i)} \in [1-\epsilon,\,1+\epsilon]\bigr\} \;\cup\; \bigl\{(i,t) \;:\; \hat{A}_t^{(i)} = 0\bigr\},\\
\mathcal{I}_{\mathrm{kill}} &\;=\; \bigl\{(i,t) \;:\; w_t^{(i)} > 1+\epsilon \text{ and } \hat{A}_t^{(i)} > 0\bigr\} \;\cup\; \bigl\{(i,t) \;:\; w_t^{(i)} < 1-\epsilon \text{ and } \hat{A}_t^{(i)} < 0\bigr\},\\
\mathcal{I}_{\mathrm{pass}} &\;=\; \bigl\{(i,t) \;:\; w_t^{(i)} > 1+\epsilon \text{ and } \hat{A}_t^{(i)} < 0\bigr\} \;\cup\; \bigl\{(i,t) \;:\; w_t^{(i)} < 1-\epsilon \text{ and } \hat{A}_t^{(i)} > 0\bigr\}.
\end{aligned}
$$

Intuitively, $\mathcal{I}_{\mathrm{in}}$ contains the samples for which the clip is inactive (together with the zero-advantage samples, which contribute no gradient to either surrogate), $\mathcal{I}_{\mathrm{kill}}$ the samples for which the clip suppresses the gradient because the policy is already moving in the advantage-improving direction, and $\mathcal{I}_{\mathrm{pass}}$ the samples for which the clip leaves the unclipped term active because the policy is moving against the advantage.
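The partition translates directly into a small classifier. This is a sketch of the three definitions above; the function name and string labels are hypothetical.

```python
def index_set(w, adv, eps=0.2):
    """Return which of the three index sets the sample (w, A) belongs to."""
    # Inside the band, or zero advantage: the clip is inactive.
    if adv == 0.0 or 1.0 - eps <= w <= 1.0 + eps:
        return "in"
    # Outside the band and already moving in the advantage-improving direction.
    if (w > 1.0 + eps and adv > 0.0) or (w < 1.0 - eps and adv < 0.0):
        return "kill"
    # Outside the band and moving against the advantage.
    return "pass"


for w, adv in [(1.1, 2.0), (1.5, 2.0), (0.5, -2.0), (1.5, -2.0), (0.5, 2.0), (3.0, 0.0)]:
    print(w, adv, index_set(w, adv))
```

Because the first test catches every in-band or zero-advantage sample and the second catches both kill subcases, whatever remains is exactly the pass set, so the three labels are disjoint and exhaustive.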

### 3.2 Gradient of PPO-Clip

The gradient of a sum is the sum of gradients, and the gradient of the inner $\min$ on each sample depends on which of its two arguments is active.

#### Case $\mathcal{I}_{\mathrm{in}}$:

when $w_t^{(i)} \in [1-\epsilon, 1+\epsilon]$, the clipping operator leaves $w_t^{(i)}$ unchanged, so $\mathrm{clip}(w_t^{(i)}, 1-\epsilon, 1+\epsilon) = w_t^{(i)}$ and both arguments of the $\min$ coincide:

$$
\min\!\bigl\{ w_t^{(i)}\hat{A}_t^{(i)},\;\; \mathrm{clip}(w_t^{(i)},\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t^{(i)} \bigr\} \;=\; w_t^{(i)}\hat{A}_t^{(i)}.
$$

Its gradient is $\hat{A}_t^{(i)}\,\nabla_{\theta'} w_t^{(i)}$.

#### Case $\mathcal{I}_{\mathrm{kill}}$:

consider first the subcase $w_t^{(i)} > 1+\epsilon$ and $\hat{A}_t^{(i)} > 0$. The clipped value $\mathrm{clip}(w_t^{(i)}, 1-\epsilon, 1+\epsilon)$ saturates at $1+\epsilon$, so the clipped term equals $(1+\epsilon)\hat{A}_t^{(i)}$, while the unclipped term equals $w_t^{(i)}\hat{A}_t^{(i)}$. Because $\hat{A}_t^{(i)} > 0$ and $w_t^{(i)} > 1+\epsilon$, the clipped term is the smaller of the two and the $\min$ selects it. The clipped value is constant in $\theta'$, so its gradient vanishes. The symmetric subcase $w_t^{(i)} < 1-\epsilon$ and $\hat{A}_t^{(i)} < 0$ is analogous: the clipped term saturates at $(1-\epsilon)\hat{A}_t^{(i)}$, which is smaller (more negative) than the unclipped term $w_t^{(i)}\hat{A}_t^{(i)}$, and the gradient vanishes again. In both subcases the per-sample gradient is zero.

#### Case $\mathcal{I}_{\mathrm{pass}}$:

consider the subcase $w_t^{(i)} > 1+\epsilon$ and $\hat{A}_t^{(i)} < 0$. The clipped term equals $(1+\epsilon)\hat{A}_t^{(i)}$ and the unclipped term equals $w_t^{(i)}\hat{A}_t^{(i)}$. Because $\hat{A}_t^{(i)} < 0$ and $w_t^{(i)} > 1+\epsilon$, the unclipped term is more negative and the $\min$ selects it. The symmetric subcase $w_t^{(i)} < 1-\epsilon$ and $\hat{A}_t^{(i)} > 0$ is analogous. In both subcases the active term is the unclipped $w_t^{(i)}\hat{A}_t^{(i)}$ and its gradient is $\hat{A}_t^{(i)}\,\nabla_{\theta'} w_t^{(i)}$.

Combining the three cases,

$$
\nabla_{\theta'} L_{\mathrm{CLIP}} \;=\; \frac{1}{N}\sum_{(i,t)\,\in\,\mathcal{I}_{\mathrm{in}}\,\cup\,\mathcal{I}_{\mathrm{pass}}} \hat{A}_t^{(i)}\,\nabla_{\theta'} w_t^{(i)},
$$

where samples in $\mathcal{I}_{\mathrm{kill}}$ contribute zero. To make the dependence on the policy explicit, observe that $\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})$ does not depend on $\theta'$, so

$$
\nabla_{\theta'} w_t^{(i)} \;=\; \nabla_{\theta'}\!\left[\frac{\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})}\right] \;=\; \frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}.
$$

Substituting yields

$$
\nabla_{\theta'} L_{\mathrm{CLIP}} \;=\; \frac{1}{N}\sum_{(i,t)\,\in\,\mathcal{I}_{\mathrm{in}}\,\cup\,\mathcal{I}_{\mathrm{pass}}} \hat{A}_t^{(i)}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}. \tag{2}
$$

Table [1](https://arxiv.org/html/2606.23932#S3.T1) summarises the per-sample gradient contribution on each index set.

Table 1: Per-sample contribution of the PPO-Clip objective on the three index sets. The gradient is killed only on $\mathcal{I}_{\mathrm{kill}}$, where further policy change would push the importance ratio further outside the trust region.
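The case analysis can be checked numerically by differentiating the per-sample term with respect to the ratio $w$ itself: the derivative should equal $\hat{A}$ on $\mathcal{I}_{\mathrm{in}}$ and $\mathcal{I}_{\mathrm{pass}}$ and vanish on $\mathcal{I}_{\mathrm{kill}}$. The sketch below uses a central finite difference; the function names, step size, and sample values are hypothetical choices.

```python
def clipped_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip term: min(w * A, clip(w, 1 - eps, 1 + eps) * A)."""
    clipped_w = max(1.0 - eps, min(w, 1.0 + eps))
    return min(w * adv, clipped_w * adv)


def d_term_dw(w, adv, eps=0.2, h=1e-6):
    """Central finite difference of the per-sample term with respect to w."""
    return (clipped_term(w + h, adv, eps) - clipped_term(w - h, adv, eps)) / (2.0 * h)


# (label, w, A), chosen away from the clip boundaries where the term has a kink.
cases = [
    ("in",   1.1,  2.0),
    ("kill", 1.5,  2.0),
    ("kill", 0.5, -2.0),
    ("pass", 1.5, -2.0),
    ("pass", 0.5,  2.0),
]
for label, w, adv in cases:
    print(label, w, adv, round(d_term_dw(w, adv), 6))
```

By the chain rule, this scalar derivative is the factor that multiplies $\nabla_{\theta'} w_t^{(i)}$ in the per-sample gradient.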

### 3.3 Gradient of PPO-KL

The empirical KL estimate $\widehat{D}_{\mathrm{KL}}\!\bigl[\pi_{\theta}\,\|\,\pi_{\theta'}\bigr]$ of Section [2](https://arxiv.org/html/2606.23932#S2) can be written, up to a $\theta'$-independent additive constant that drops out of the gradient, as a sum of negative log-probabilities of the sampled actions under the new policy. After absorbing the constant and allowing the penalty coefficient to vary per sample, the surrogate becomes

$$
\mathcal{L}_{\mathrm{KL}}(\theta') \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} \Bigl[ w_t^{(i)}\,\hat{A}_t^{(i)} + \beta_t^{(i)}\,\log \pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \Bigr].
$$

The standard fixed and adaptive PPO-KL variants of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] correspond to the choice $\beta_t^{(i)} \equiv \beta$, with $\beta$ either fixed or updated between outer iterations.

For the importance\-sampled return term,

$$
\nabla_{\theta'}\!\left[w_t^{(i)}\,\hat{A}_t^{(i)}\right] \;=\; \hat{A}_t^{(i)}\,\nabla_{\theta'} w_t^{(i)} \;=\; \hat{A}_t^{(i)}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)},
$$

using the same calculation as in the PPO-Clip derivation. For the penalty term, the coefficient $\beta_t^{(i)}$ is a *stop-gradient* (detached) coefficient: it is evaluated at the current $\theta'$ and held fixed when differentiating, as is standard for the penalty coefficient of a KL-penalised policy gradient and as the implementation does. Its derivative in $\theta'$ vanishes by convention, so

$$
\nabla_{\theta'}\!\left[\beta_t^{(i)}\log \pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)\right] \;=\; \beta_t^{(i)}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}.
$$

Multiplying numerator and denominator of the right-hand side by $\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})$ converts the new-policy denominator into the rollout-policy denominator that appears in the importance-sampled term,

$$
\beta_t^{(i)}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)} \;=\; \beta_t^{(i)}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)} \cdot \frac{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)} \;=\; \frac{\beta_t^{(i)}}{w_t^{(i)}}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}.
$$

Summing the two contributions and factoring the common $\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right) / \pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)$,

$$
\nabla_{\theta'} \mathcal{L}_{\mathrm{KL}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} \frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)} \Biggl[\, \hat{A}_t^{(i)} \;+\; \frac{\beta_t^{(i)}}{w_t^{(i)}} \,\Biggr]. \tag{3}
$$

For a generic coefficient, every sample contributes to the gradient. The penalty does not zero out any term by itself; it shifts the effective advantage of sample $(i,t)$ from $\hat{A}_t^{(i)}$ to $\hat{A}_t^{(i)} + \beta_t^{(i)} / w_t^{(i)}$.
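As a small worked illustration of the bracket in (3), the sketch below applies one shared scalar coefficient to a few samples. The value $\beta = 0.1$ and the sample list are hypothetical; the point is only that a uniform coefficient shifts every effective advantage and leaves none of them exactly at zero.

```python
def effective_advantage(adv, beta, w):
    """Bracket of equation (3): A + beta / w, defined for w > 0."""
    return adv + beta / w


beta = 0.1  # hypothetical scalar coefficient shared by all samples
samples = [(0.9, 1.0), (1.5, 2.0), (0.5, -2.0)]  # (w, A) pairs

for w, adv in samples:
    print(w, adv, effective_advantage(adv, beta, w))
```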

### 3.4 Gradient identity

Comparison of \([2](https://arxiv.org/html/2606.23932#S3.E2)\) and \([3](https://arxiv.org/html/2606.23932#S3.E3)\) yields the main result\.

###### Theorem 1 (Per-sample gradient identity).

Let $L_{\mathrm{CLIP}}$ be the PPO-Clip surrogate of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] and let $\mathcal{L}_{\mathrm{KL}}$ be the PPO-KL surrogate with per-sample stop-gradient coefficients $\{\beta_t^{(i)}\}$, each evaluated at the current $\theta'$ and held fixed under differentiation. Define

$$
\beta_t^{(i)} \;=\; \begin{cases} \;0, & (i,t) \in \mathcal{I}_{\mathrm{in}} \cup \mathcal{I}_{\mathrm{pass}},\\[2pt] \,-\,w_t^{(i)}\,\hat{A}_t^{(i)}, & (i,t) \in \mathcal{I}_{\mathrm{kill}}. \end{cases} \tag{4}
$$

Then

$$
\nabla_{\theta'} L_{\mathrm{CLIP}} \;=\; \nabla_{\theta'} \mathcal{L}_{\mathrm{KL}}
$$

at every $\theta'$ where $L_{\mathrm{CLIP}}$ is differentiable, namely wherever no sample lies exactly on a clip boundary $w_t^{(i)} = 1 \pm \epsilon$; this excludes only a measure-zero set, on which $L_{\mathrm{CLIP}}$ has a kink.

###### Proof\.

Fix a sample $(i,t)$ and write $g_t^{(i)}(\theta')$ for its per-sample gradient contribution under either surrogate. From \([2](https://arxiv.org/html/2606.23932#S3.E2)\) and \([3](https://arxiv.org/html/2606.23932#S3.E3)\),

$$
g_t^{(i)}\bigl(L_{\mathrm{CLIP}}\bigr) \;=\; \begin{cases} \hat{A}_t^{(i)}\,\dfrac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}, & (i,t) \in \mathcal{I}_{\mathrm{in}} \cup \mathcal{I}_{\mathrm{pass}},\\[12pt] 0, & (i,t) \in \mathcal{I}_{\mathrm{kill}}, \end{cases}
$$

and

$$
g_t^{(i)}\bigl(\mathcal{L}_{\mathrm{KL}}\bigr) \;=\; \frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)} \Biggl[\, \hat{A}_t^{(i)} \;+\; \frac{\beta_t^{(i)}}{w_t^{(i)}} \,\Biggr].
$$

Under the choice \([4](https://arxiv.org/html/2606.23932#S3.E4)\), the bracket of $g_t^{(i)}(\mathcal{L}_{\mathrm{KL}})$ takes two values depending on the index set.

*Case $(i,t) \in \mathcal{I}_{\mathrm{in}} \cup \mathcal{I}_{\mathrm{pass}}$.* The definition \([4](https://arxiv.org/html/2606.23932#S3.E4)\) gives $\beta_t^{(i)} = 0$, so the bracket reduces to $\hat{A}_t^{(i)}$. Hence

$$
g_t^{(i)}\bigl(\mathcal{L}_{\mathrm{KL}}\bigr) \;=\; \hat{A}_t^{(i)}\,\frac{\nabla_{\theta'}\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right)} \;=\; g_t^{(i)}\bigl(L_{\mathrm{CLIP}}\bigr).
$$

*Case $(i,t) \in \mathcal{I}_{\mathrm{kill}}$.* The definition \([4](https://arxiv.org/html/2606.23932#S3.E4)\) gives $\beta_t^{(i)} = -w_t^{(i)}\hat{A}_t^{(i)}$. The bracket evaluates to

$$
\hat{A}_t^{(i)} \;+\; \frac{\beta_t^{(i)}}{w_t^{(i)}} \;=\; \hat{A}_t^{(i)} \;+\; \frac{-w_t^{(i)}\hat{A}_t^{(i)}}{w_t^{(i)}} \;=\; \hat{A}_t^{(i)} - \hat{A}_t^{(i)} \;=\; 0,
$$

where the cancellation uses $w_t^{(i)} > 0$ (since $\pi_{\theta'}\!\left(a_t^{(i)} \mid s_t^{(i)}\right) > 0$ and $\pi_{\theta}\!\left(a_t^{(i)} \mid s_t^{(i)}\right) > 0$ for any sampled action). Therefore

$$
g_t^{(i)}\bigl(\mathcal{L}_{\mathrm{KL}}\bigr) \;=\; 0 \;=\; g_t^{(i)}\bigl(L_{\mathrm{CLIP}}\bigr).
$$

The per-sample contributions agree on every $(i,t)$, so the two batch gradients agree:

$$
\nabla_{\theta'} L_{\mathrm{CLIP}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} g_t^{(i)}\bigl(L_{\mathrm{CLIP}}\bigr) \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H} g_t^{(i)}\bigl(\mathcal{L}_{\mathrm{KL}}\bigr) \;=\; \nabla_{\theta'} \mathcal{L}_{\mathrm{KL}}.
$$

∎
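The identity can be sanity-checked numerically on a toy problem. The sketch below is not the authors' implementation: it assumes a state-free softmax policy over three actions, a hand-picked batch, and central finite differences in place of automatic differentiation. The coefficients of (4) are computed once at the current parameters and then passed as plain numbers, which plays the role of the stop-gradient. Both surrogates are normalised by the sample count instead of the number of episodes $N$; this only rescales the two gradients by the same constant.

```python
import math

EPS = 0.2  # clip range epsilon


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def ratios(theta_new, theta_old, actions):
    """Importance ratio w for each sampled action."""
    p_new, p_old = softmax(theta_new), softmax(theta_old)
    return [p_new[a] / p_old[a] for a in actions]


def implicit_beta(w, adv):
    """Coefficient of equation (4): -w * A on the kill set, 0 elsewhere."""
    killed = (w > 1 + EPS and adv > 0) or (w < 1 - EPS and adv < 0)
    return -w * adv if killed else 0.0


def clip_surrogate(theta_new, theta_old, actions, advs):
    total = 0.0
    for w, adv in zip(ratios(theta_new, theta_old, actions), advs):
        clipped_w = max(1 - EPS, min(w, 1 + EPS))
        total += min(w * adv, clipped_w * adv)
    return total / len(actions)


def kl_surrogate(theta_new, theta_old, actions, advs, betas):
    """Per-sample KL surrogate; betas are plain numbers, i.e. detached."""
    p_new = softmax(theta_new)
    ws = ratios(theta_new, theta_old, actions)
    total = 0.0
    for a, w, adv, beta in zip(actions, ws, advs, betas):
        total += w * adv + beta * math.log(p_new[a])
    return total / len(actions)


def finite_diff_grad(f, theta, h=1e-6):
    """Central finite-difference gradient of f at theta."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += h
        down[k] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


# Hypothetical toy batch: logits of the old and new policy, sampled actions,
# and advantages chosen so that all three index sets are populated.
theta_old = [0.0, 0.0, 0.0]
theta_new = [0.9, -0.4, 0.3]
actions = [0, 0, 1, 1, 2]
advs = [1.0, -0.5, -2.0, 0.7, 1.5]

# Freeze the coefficients at the current theta_new (stop-gradient).
current_w = ratios(theta_new, theta_old, actions)
betas = [implicit_beta(w, adv) for w, adv in zip(current_w, advs)]

grad_clip = finite_diff_grad(
    lambda th: clip_surrogate(th, theta_old, actions, advs), theta_new)
grad_kl = finite_diff_grad(
    lambda th: kl_surrogate(th, theta_old, actions, advs, betas), theta_new)
max_gap = max(abs(gc - gk) for gc, gk in zip(grad_clip, grad_kl))
print(grad_clip)
print(grad_kl)
print(max_gap)
```

The two printed gradients should agree up to finite-difference error. The agreement is specific to the point where the coefficients were frozen: after a parameter step, the ratios change and the coefficients must be recomputed, which matches the theorem's requirement that each $\beta_t^{(i)}$ is evaluated at the current $\theta'$.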

### 3.5 Interpretation

Figure [1](https://arxiv.org/html/2606.23932#S3.F1) reads the equivalence row by row: for each combination of $w_t^{(i)}$ and $\hat{A}_t^{(i)}$, the PPO-Clip gradient and the value of $\beta_t^{(i)}$ that reproduces it under the KL surrogate are listed side by side.

RegionPPO\-Clip gradientEquivalentβt\\beta\_\{t\}wt\(i\)∈\[1−ϵ,1\+ϵ\]w\_\{t\}^\{\(i\)\}\\in\[1\\\!\-\\\!\\epsilon,\\,1\\\!\+\\\!\\epsilon\]A^t\(i\)∇wt\(i\)\\hat\{A\}\_\{t\}^\{\(i\)\}\\,\\nabla w\_\{t\}^\{\(i\)\}βt=0\\beta\_\{t\}=0wt\(i\)\>1\+ϵw\_\{t\}^\{\(i\)\}\>1\\\!\+\\\!\\epsilon,A^t\(i\)\>0\\hat\{A\}\_\{t\}^\{\(i\)\}\>00\(killed\)βt=−wt\(i\)A^t\(i\)\\beta\_\{t\}=\-w\_\{t\}^\{\(i\)\}\\hat\{A\}\_\{t\}^\{\(i\)\}wt\(i\)\>1\+ϵw\_\{t\}^\{\(i\)\}\>1\\\!\+\\\!\\epsilon,A^t\(i\)<0\\hat\{A\}\_\{t\}^\{\(i\)\}<0A^t\(i\)∇wt\(i\)\\hat\{A\}\_\{t\}^\{\(i\)\}\\,\\nabla w\_\{t\}^\{\(i\)\}βt=0\\beta\_\{t\}=0wt\(i\)<1−ϵw\_\{t\}^\{\(i\)\}<1\\\!\-\\\!\\epsilon,A^t\(i\)\>0\\hat\{A\}\_\{t\}^\{\(i\)\}\>0A^t\(i\)∇wt\(i\)\\hat\{A\}\_\{t\}^\{\(i\)\}\\,\\nabla w\_\{t\}^\{\(i\)\}βt=0\\beta\_\{t\}=0wt\(i\)<1−ϵw\_\{t\}^\{\(i\)\}<1\\\!\-\\\!\\epsilon,A^t\(i\)<0\\hat\{A\}\_\{t\}^\{\(i\)\}<00\(killed\)βt=−wt\(i\)A^t\(i\)\\beta\_\{t\}=\-w\_\{t\}^\{\(i\)\}\\hat\{A\}\_\{t\}^\{\(i\)\}Figure 1:Summary of the per\-sample equivalence\. In the green rows the PPO\-Clip gradient is reproduced by a PPO\-KL surrogate withβt=0\\beta\_\{t\}=0\. In the red rows the PPO\-Clip gradient is reproduced by a PPO\-KL surrogate withβt=−wtA^t\\beta\_\{t\}=\-w\_\{t\}\\hat\{A\}\_\{t\}, which is the value that kills the corresponding term of the per\-sample gradient\.The coefficient \([4](https://arxiv.org/html/2606.23932#S3.E4)\) makes explicit what PPO\-Clip implicitly applies\. The clip is a Kullback–Leibler penalty whose strength is zero everywhere except on the killed region, where it takes the value−wt\(i\)A^t\(i\)\-w\_\{t\}^\{\(i\)\}\\hat\{A\}\_\{t\}^\{\(i\)\}\. The sign of the coefficient is consistent with the trust\-region intuition\. 
WhenA^t\(i\)\>0\\hat\{A\}\_\{t\}^\{\(i\)\}\>0andwt\(i\)\>1\+ϵw\_\{t\}^\{\(i\)\}\>1\+\\epsilon, the policy is already over\-weighting a beneficial action, and a negativeβt\(i\)\\beta\_\{t\}^\{\(i\)\}pullslog⁡πθ′\\log\\pi\_\{\\theta^\{\\prime\}\}away from the current direction; whenA^t\(i\)<0\\hat\{A\}\_\{t\}^\{\(i\)\}<0andwt\(i\)<1−ϵw\_\{t\}^\{\(i\)\}<1\-\\epsilon, the policy is already under\-weighting a harmful action, and a positiveβt\(i\)\\beta\_\{t\}^\{\(i\)\}stabilises it\. In both cases the effective advantage of the bracket in \([3](https://arxiv.org/html/2606.23932#S3.E3)\) is driven exactly to zero, reproducing the clip’s behaviour\.
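The row-by-row reading can be checked numerically. The sketch below is illustrative and not the authors' code. It assumes that $\beta_t^{(i)}$ is evaluated at the current parameters and then held fixed when differentiating (a stop-gradient), and it differentiates with respect to the new log-probability $\log\pi_{\theta'}$ instead of $\theta'$, since the chain rule through $\theta'$ is common to both losses. All function names and sample values are hypothetical.

```python
import numpy as np

EPS = 0.2  # clip radius epsilon


def beta_per_sample(w, adv, eps=EPS):
    """Step coefficient: -w * A on the killed region, 0 elsewhere."""
    kill = ((w > 1 + eps) & (adv > 0)) | ((w < 1 - eps) & (adv < 0))
    return np.where(kill, -w * adv, 0.0)


def clip_term(logp_new, logp_old, adv, eps=EPS):
    """Per-sample PPO-Clip term min(w A, clip(w) A)."""
    w = np.exp(logp_new - logp_old)
    return np.minimum(w * adv, np.clip(w, 1 - eps, 1 + eps) * adv)


def kl_term(logp_new, logp_old, adv, beta):
    """Per-sample KL-form term w A + beta * log pi, with beta held constant."""
    w = np.exp(logp_new - logp_old)
    return w * adv + beta * logp_new


def central_diff(f, x, h=1e-6):
    """Elementwise derivative of f at x by central differences."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Samples away from the clip boundary: two in the band, four outside it.
w = np.array([1.1, 0.9, 1.5, 1.5, 0.5, 0.5])
adv = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
logp_old = np.full_like(w, -1.0)
logp_new = logp_old + np.log(w)

beta = beta_per_sample(w, adv)  # computed once, then frozen
grad_clip = central_diff(lambda lp: clip_term(lp, logp_old, adv), logp_new)
grad_kl = central_diff(lambda lp: kl_term(lp, logp_old, adv, beta), logp_new)

print(np.allclose(grad_clip, grad_kl, atol=1e-6))
```

The check deliberately avoids ratios exactly at $1\pm\epsilon$, where the $\min$ is not differentiable.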

The identity in Theorem [1](https://arxiv.org/html/2606.23932#Thmtheorem1) is stronger than the first-order observation of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] that $L_{\mathrm{CLIP}}$ and the unclipped surrogate agree around $w=1$. It holds at every $\theta'$ off the clip boundary, at every minibatch step, and across the entire inner loop. Two recent lines of work approach the clip and the KL term from related but distinct angles: Liu et al. \[[10](https://arxiv.org/html/2606.23932#bib.bib10)\] establish a gradient equivalence between three KL estimators inside RLHF objectives, and Zhang et al. \[[24](https://arxiv.org/html/2606.23932#bib.bib24)\] propose a unified design framework for KL-regularised policy gradient. Neither identifies the per-sample coefficient ([4](https://arxiv.org/html/2606.23932#S3.E4)) that turns PPO-Clip itself into a KL penalty.

## 4 Experiments

### 4.1 Setup

We evaluate four objectives that share the surrogate of Section [2](https://arxiv.org/html/2606.23932#S2) and differ only in the policy loss. These are PPO-Clip, fixed-$\beta$ PPO-KL, adaptive-$\beta$ PPO-KL, and the per-sample PPO-KL of ([4](https://arxiv.org/html/2606.23932#S3.E4)); the first three are the variants of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] and the fourth is the construction of Section [3](https://arxiv.org/html/2606.23932#S3). We train each on the five MuJoCo locomotion tasks \[[20](https://arxiv.org/html/2606.23932#bib.bib20)\] HalfCheetah-v4, Hopper-v4, Walker2d-v4, Ant-v4, and Humanoid-v4 for $10^{6}$ environment steps over five seeds, and include CartPole-v1 and LunarLander-v3 as a low-dimensional and a discrete-action check.

All four variants share the trainer, the rollout collector, and the value head, and use the standard PPO configuration of CleanRL \[[7](https://arxiv.org/html/2606.23932#bib.bib7)\] and Stable-Baselines3 \[[12](https://arxiv.org/html/2606.23932#bib.bib12)\]. This configuration is a two-hidden-layer ($64$-$64$, $\tanh$) actor-critic with orthogonal initialisation and a diagonal Gaussian policy, GAE \[[14](https://arxiv.org/html/2606.23932#bib.bib14)\] with $\gamma = 0.99$ and $\lambda = 0.95$, Adam at $3\cdot 10^{-4}$ with linear annealing, gradient clipping at norm $0.5$, value-loss clipping at $0.2$, observation and reward normalisation, and an inner loop of $K = 10$ epochs over size-$64$ minibatches of each $2048$-step rollout.
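The shared configuration can be collected in one place to see how many gradient steps it implies. The sketch below is a hypothetical container, not the authors' code: the class and field names are invented, $3\cdot 10^{-4}$ is assumed to be the Adam learning rate, and the derived counts assume a single environment and complete rollouts only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    total_steps: int = 1_000_000   # 10^6 environment steps per run
    rollout_steps: int = 2048      # steps collected per rollout
    minibatch_size: int = 64
    epochs: int = 10               # K, inner-loop epochs per rollout
    hidden_sizes: tuple = (64, 64)
    gamma: float = 0.99
    gae_lambda: float = 0.95
    learning_rate: float = 3e-4    # assumed Adam step size, annealed linearly
    max_grad_norm: float = 0.5
    value_clip: float = 0.2

    @property
    def minibatches_per_epoch(self) -> int:
        return self.rollout_steps // self.minibatch_size

    @property
    def updates_per_rollout(self) -> int:
        return self.epochs * self.minibatches_per_epoch

    @property
    def num_rollouts(self) -> int:
        # Only complete rollouts are counted.
        return self.total_steps // self.rollout_steps


cfg = PPOConfig()
print(cfg.minibatches_per_epoch, cfg.updates_per_rollout, cfg.num_rollouts)
```

Under these assumptions each rollout is followed by 320 minibatch updates, which is the inner loop over which the gradient identity is claimed to hold.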

The variants differ in the trust-region knob. PPO-Clip uses $\epsilon = 0.2$, fixed-$\beta$ PPO-KL uses $\beta = 1$, and adaptive-$\beta$ PPO-KL follows Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\], updating $\beta$ once per rollout by a factor of two toward a target $D_{\mathrm{KL}} = 0.02$, the second-order KL $\epsilon^{2}/2$ of the clip radius. The fixed and adaptive variants penalise the analytic Gaussian KL, while the per-sample variant penalises the sampled log-ratio $-\log w_t^{(i)}$, the estimator for which Theorem [1](https://arxiv.org/html/2606.23932#Thmtheorem1) holds. A knob sweep on CartPole-v1, HalfCheetah-v4, and Hopper-v4 over $\epsilon \in \{0.1, 0.2, 0.3\}$, $\beta \in \{0.1, 0.3, 1, 3, 10\}$, and $D_{\mathrm{KL}} \in \{0.003, 0.01, 0.02, 0.03, 0.1\}$ fixes each baseline at its best value.
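The adaptive rule and the choice of target can be sketched in a few lines. The doubling and halving thresholds below (a tolerance factor of 1.5 around the target) are those of the adaptive-KL variant in the original PPO paper; that this paper uses the same thresholds is an assumption, and the function names are hypothetical.

```python
def kl_target_from_clip(eps):
    """Second-order KL associated with a clip radius eps: eps^2 / 2."""
    return eps ** 2 / 2


def update_beta(beta, measured_kl, target_kl, tol=1.5):
    """One per-rollout update of the scalar KL coefficient."""
    if measured_kl > tol * target_kl:
        return beta * 2.0  # policy moved too far: strengthen the penalty
    if measured_kl < target_kl / tol:
        return beta / 2.0  # policy barely moved: relax the penalty
    return beta


target = kl_target_from_clip(0.2)        # approximately 0.02
print(update_beta(1.0, 0.05, target))    # measured KL above the tolerance band
print(update_beta(1.0, 0.005, target))   # measured KL below the tolerance band
```

With $\epsilon = 0.2$ the target evaluates to $0.2^2/2 = 0.02$, which matches the value used for the adaptive baseline.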

All public run histories and per\-run reproducibility artifacts are available in the Weights & Biases project at[https://wandb\.ai/KLip\-PPO/KLip\-PPO](https://wandb.ai/KLip-PPO/KLip-PPO)\.

### 4.2 Results

By Theorem [1](https://arxiv.org/html/2606.23932#Thmtheorem1), PPO-Clip and the per-sample PPO-KL surrogate share a gradient at every step. Their learning curves coincide on all five MuJoCo tasks (Figure [2](https://arxiv.org/html/2606.23932#S4.F2)), and their final returns agree on every task (Table [2](https://arxiv.org/html/2606.23932#S4.T2)). Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] showed only that the clipped objective and the unclipped surrogate ($\beta = 0$) agree to first order when the new policy is close to the one that collected the rollout. The per-sample identity is sharper: with the coefficient $\beta_t$ of ([4](https://arxiv.org/html/2606.23932#S3.E4)) in place, the equality becomes exact and holds at every $\theta'$ off the clip boundary, which is why the curves stay together over the whole run, however far training moves the policy.

The literature has consistently found clipping to outperform a KL penalty on continuous control. The original PPO study reports this on MuJoCo \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\], and subsequent benchmarks repeat it \[[4](https://arxiv.org/html/2606.23932#bib.bib4), [1](https://arxiv.org/html/2606.23932#bib.bib1)\]. Our experiments reproduce the same ordering. PPO-KL with a fixed or an adaptively tuned $\beta$ matches PPO-Clip on the easier tasks but falls behind on the high-dimensional ones (Ant-v4 and Humanoid-v4), where the policy must travel far from its initialisation and the trust region does real work.

The per-sample identity explains the shortfall. Clipping constrains each sample on its own terms, turning the penalty on only for the transitions whose ratio has left the band and scaling it by that sample’s ratio and advantage. The scalar $\beta$ of PPO-KL, fixed or adaptive, instead applies one value to every sample, so a $\beta$ large enough to restrain the few runaway transitions over-penalises the many well-behaved ones, and no single value reproduces what the clip does pointwise \[[6](https://arxiv.org/html/2606.23932#bib.bib6)\]. The per-sample coefficient of ([4](https://arxiv.org/html/2606.23932#S3.E4)) removes that limitation, setting the penalty separately for each sample, and so reproduces the clip exactly.

![Hopper-v4](https://arxiv.org/html/2606.23932v1/x1.png) (a) Hopper-v4
![HalfCheetah-v4](https://arxiv.org/html/2606.23932v1/x2.png) (b) HalfCheetah-v4
![Walker2d-v4](https://arxiv.org/html/2606.23932v1/x3.png) (c) Walker2d-v4
![Ant-v4](https://arxiv.org/html/2606.23932v1/x4.png) (d) Ant-v4
![Humanoid-v4](https://arxiv.org/html/2606.23932v1/x5.png) (e) Humanoid-v4

Figure 2: Episode return on the five MuJoCo tasks (mean over $5$ seeds, $\pm$ std band). PPO-Clip and per-sample PPO-KL are indistinguishable on every task, while the fixed- and adaptive-$\beta$ penalties fall behind on Ant-v4 (d) and Humanoid-v4 (e).

Table 2: Final return (mean $\pm$ std over $5$ seeds, last $10\%$ of training). PPO-Clip and per-sample PPO-KL are identical on every task; the scalar-$\beta$ variants fall behind on the high-dimensional tasks.

## 5 Discussion

Theorem [1](https://arxiv.org/html/2606.23932#Thmtheorem1) makes PPO-Clip’s per-sample gradient exactly the gradient of a Kullback–Leibler penalty,

$$
\nabla_{\theta'} L_{\mathrm{CLIP}} = \nabla_{\theta'}\,\mathbb{E}_t\bigl[\,w_t\hat{A}_t + \beta_t\log\pi_{\theta'}(a_t\mid s_t)\,\bigr], \qquad \beta_t = -\,w_t\hat{A}_t\,\mathbf{1}\bigl[(i,t)\in\mathcal{I}_{\mathrm{kill}}\bigr],
$$

whose coefficient is set for each sample by its own importance ratio and advantage. The clip is in this sense a trust region acting in the space of policy distributions, applied to the samples in $\mathcal{I}_{\mathrm{kill}}$ that it would otherwise discard. We show in Appendix [A](https://arxiv.org/html/2606.23932#A1) that the same identity has a matching form in importance-weight space, so the clipped objective ([1](https://arxiv.org/html/2606.23932#S2.E1)), the weight-space penalty, and the per-sample KL penalty are three equivalent expressions of a single per-sample gradient.

This reading has two consequences. First, it makes precise a question the field has answered only empirically. Clipping and KL penalisation have been compared through benchmark scores \[[15](https://arxiv.org/html/2606.23932#bib.bib15), [4](https://arxiv.org/html/2606.23932#bib.bib4), [8](https://arxiv.org/html/2606.23932#bib.bib8), [1](https://arxiv.org/html/2606.23932#bib.bib1), [6](https://arxiv.org/html/2606.23932#bib.bib6), [17](https://arxiv.org/html/2606.23932#bib.bib17)\], with no theory relating the two objectives. The identity reframes that comparison. Since clipping is itself a KL penalty, the methods here differ only in how they set the penalty’s coefficient. PPO-KL uses one scalar $\beta$ for the whole batch, while PPO-Clip, by Theorem [1](https://arxiv.org/html/2606.23932#Thmtheorem1), uses the per-sample $\beta_t$ of ([4](https://arxiv.org/html/2606.23932#S3.E4)). The benchmark gap read as clipping versus KL is thus a gap between a scalar and a per-sample coefficient, which is where our experiments locate it. Second, it gives the penalty a flexibility that the clipped form hides. Once $\beta_t$ is written out, its step shape is one choice among many, and replacing it with another, such as one that softens the boundary or that varies with the individual sample, stays within the same surrogate family and defines new algorithms (Section [6](https://arxiv.org/html/2606.23932#S6)).

What governs the update is therefore the per-sample gradient coefficient and not the surface choice between a clip and a KL term, a principle that recent analyses of KL regularisation in language-model training independently corroborate. Liu et al. \[[10](https://arxiv.org/html/2606.23932#bib.bib10)\] find that a KL term written as a loss and its associated per-sample reward coefficient induce the same gradient, and Zhang et al. \[[24](https://arxiv.org/html/2606.23932#bib.bib24)\] report that clipped and KL objectives agree once their per-sample importance weights are aligned. Neither isolates the exact coefficient $\beta_t = -w_t\hat{A}_t$ that makes PPO-Clip itself a KL penalty, which is the identity established here; that their gradient-level findings point the same way is evidence that the per-sample view this identity rests on is the correct one.

## 6 Future work

The per-sample reformulation turns the implicit penalty coefficient $\beta_t^{(i)}$ of PPO-Clip into an explicit object. Its functional form, a step function on the trust-region boundary in the present case, can be modified by design without leaving the surrogate-loss family of Section [2](https://arxiv.org/html/2606.23932#S2). We outline five extensions that follow from making different choices for $\beta_t^{(i)}$. Each defines a new policy-optimisation algorithm and is left as the subject of a separate study.

#### Soft relaxations of the boundary\.

The coefficient defined in ([4](https://arxiv.org/html/2606.23932#S3.E4)) is discontinuous at $w_t^{(i)} = 1 \pm \epsilon$. Several proposals in the literature soften this discontinuity from different angles. Trust Region-Guided PPO \[[21](https://arxiv.org/html/2606.23932#bib.bib21)\] keeps the hard clip but lets its width depend on the local KL of the policy. Truly PPO \[[22](https://arxiv.org/html/2606.23932#bib.bib22)\] keeps the clip and adds a rollback term that drags the policy back when the ratio exits the trust region. ESPO \[[17](https://arxiv.org/html/2606.23932#bib.bib17)\] removes ratio clipping altogether and controls the inner loop by early stopping. Simple Policy Optimization \[[23](https://arxiv.org/html/2606.23932#bib.bib23)\] substitutes clipping with a regulariser on the ratio that admits a tighter trust region. Probability Smoothing Policy Optimisation \[[3](https://arxiv.org/html/2606.23932#bib.bib3)\], in the language-model setting, replaces the hard ratio with a soft mixture of old and new policies so that the gradient is non-zero everywhere.

The per-sample form unifies these proposals under one quantity: each is a choice of $\beta_t^{(i)}$ in the template $\mathbb{E}_t\bigl[\,w_t\hat{A}_t + \beta_t\log\pi_{\theta'}(a_t\mid s_t)\,\bigr]$, with the soft variants replacing the step of ([4](https://arxiv.org/html/2606.23932#S3.E4)) by a continuous shape that agrees with it far inside and far outside the trust region. The linear ramp of width $\delta \geq 0$, which interpolates between $0$ and $-w_t^{(i)}\hat{A}_t^{(i)}$ across the boundary, is one explicit member: it recovers the PPO-Clip step as $\delta \to 0$ and the unconstrained surrogate as $\delta \to \infty$. The right shape, as a function of the task and of the inner-loop epoch count $K$, is left as an empirical question.
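One way to picture the linear ramp is as a gate in $[0, 1]$ that multiplies the step value $-w\hat{A}$. The text does not fix the exact parametrisation, so the sketch below is one hypothetical choice: the gate is zero inside the band and grows linearly with the distance past the boundary, saturating after a width $\delta$. The names `ramp_gate` and `beta_soft` are invented for illustration.

```python
import numpy as np


def ramp_gate(w, adv, eps=0.2, delta=0.1):
    """Hypothetical gate in [0, 1]: 0 inside the band, 1 once the ratio is
    at least delta past the boundary on the side PPO-Clip would kill."""
    # Distance past the killing boundary: upper edge for A > 0, lower otherwise.
    over = np.where(adv > 0, w - (1 + eps), (1 - eps) - w)
    if delta == 0:
        return (over > 0).astype(float)  # hard step of PPO-Clip
    return np.clip(over / delta, 0.0, 1.0)


def beta_soft(w, adv, eps=0.2, delta=0.1):
    """Soft per-sample coefficient -w * A * gate (zero whenever A is zero)."""
    return -w * adv * ramp_gate(w, adv, eps, delta)


w = np.array([0.5, 0.9, 1.1, 1.5, 1.25])
adv = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])

print(beta_soft(w, adv, delta=0.0))   # step coefficient of PPO-Clip
print(beta_soft(w, adv, delta=0.1))   # sample at w = 1.25 is only half gated
print(beta_soft(w, adv, delta=1e9))   # gate vanishes: unconstrained surrogate
```

Under this parametrisation the two limits stated above are visible directly: `delta=0.0` returns the step, and a very large `delta` drives every coefficient toward zero.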

#### Position\-aware coefficient for sequence models\.

On token-level applications such as language-model fine-tuning, each sample corresponds to one token of a generated completion. The per-sample coefficient $\beta_t^{(i)}$ can then be allowed to depend on the token’s position within the sequence. A coefficient that is sharper near the answer span and softer in the reasoning prefix, for instance, is a specific position-conditioned $\beta_t^{(i)}$ and is not expressible in standard PPO-Clip, which uses the same trust-region radius for every token. The construction and empirical study of position-aware variants is a direct use of the framework and is left to follow-up work.

#### Age\-conditioned coefficient for off\-policy learning\.

The trust-region argument behind PPO-Clip assumes that the rollout was sampled under a behaviour policy close to the current one. With a replay buffer this assumption fails. The importance ratio $w_t^{(i)}$ on a sample drawn from an older policy can be arbitrarily large, and PPO-Clip discards all such samples by the step coefficient. Letting $\beta_t^{(i)}$ depend on the age of the sample at the time of the update, so that older samples are weighted differently from fresh ones, defines an off-policy variant of the algorithm whose properties can be analysed within the same formulation. The appropriate functional form of the age dependence, and its interaction with the bias-variance tradeoff of off-policy estimation, is an open question.

#### Asymmetric trust regions\.

The coefficient ([4](https://arxiv.org/html/2606.23932#S3.E4)) is symmetric under the swap $(w_t^{(i)} - 1) \mapsto -(w_t^{(i)} - 1)$, in the sense that the same step shape applies on both sides of the trust region. An asymmetric variant softens the coefficient on the side that pulls the policy back toward the rollout distribution and keeps it sharp on the side that pushes it away. This modification is a single change to the per-sample form and is not expressible inside the original $\min$ formulation of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\]. Its analysis is a natural follow-up.

#### A unified per\-sample template\.

The per-sample form is not specific to PPO-Clip. The template

$$
\mathcal{L}_{\mathrm{tmpl}}(\theta') \;=\; \mathbb{E}_t\bigl[\,w_t\hat{A}_t + \beta_t\log\pi_{\theta'}(a_t\mid s_t)\,\bigr] \tag{5}
$$

recovers several existing on-policy algorithms once the dependence of $\beta_t^{(i)}$ on the sample is specified, and Table [3](https://arxiv.org/html/2606.23932#S6.T3) summarises this view. The unconstrained importance-sampled surrogate is the case $\beta_t \equiv 0$. PPO-KL with a fixed scalar \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] corresponds to $\beta_t \equiv \beta \in \mathbb{R}$. Adaptive PPO-KL \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] replaces the constant by a quantity $\beta(t)$ that is updated once per rollout according to the measured KL. PPO-Clip is the per-sample step identified in Theorem [1](https://arxiv.org/html/2606.23932#Thmtheorem1). Token-level PPO-Clip and GRPO \[[16](https://arxiv.org/html/2606.23932#bib.bib16), [2](https://arxiv.org/html/2606.23932#bib.bib2)\] apply the same step independently to each token of a generated sequence. The directions in this section correspond to making different choices of $\beta_t^{(i)}$ within the same template.

| Algorithm | $\beta_t^{(i)}$ form | Depends on | Reference |
| --- | --- | --- | --- |
| Unconstrained surrogate | $0$ | - | - |
| PPO-KL (fixed) | $\beta \in \mathbb{R}$ | none | Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] |
| PPO-KL (adaptive) | $\beta(t) \in \mathbb{R}$ | training time | Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] |
| PPO-Clip | $-w_t\hat{A}_t \cdot \mathbf{1}_{\mathcal{I}_{\mathrm{kill}}}$ | $(w, \hat{A})$ | Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\] |
| Soft-clip (linear ramp) | $-w_t\hat{A}_t \cdot g_{\delta}(w, \hat{A})$ | $(w, \hat{A})$, $\delta$ | this paper, Sec. [6](https://arxiv.org/html/2606.23932#S6) |
| Token-level / GRPO | step shape per token | token $(w, \hat{A})$ | Shao et al. \[[16](https://arxiv.org/html/2606.23932#bib.bib16)\], DeepSeek-AI \[[2](https://arxiv.org/html/2606.23932#bib.bib2)\] |
| Position-aware | per-token shape, conditioned on position | position $+\,(w, \hat{A})$ | future |
| Off-policy | per-sample shape, conditioned on age | age $+\,(w, \hat{A})$ | future |
| Asymmetric | non-symmetric in $\mathrm{sign}(w-1)$ | $(w, \hat{A})$ | future |

Table 3: Instances of the per-sample template ([5](https://arxiv.org/html/2606.23932#S6.E5)). Each row is a particular choice of the dependence of $\beta_t^{(i)}$ on the sample; the rest of the loss is shared.

Several research questions follow from the per-sample template. The published algorithms in the upper part of the table can be trained under identical pipelines and compared on the same metrics, which isolates the effect of the shape of $\beta_t^{(i)}$ from the implementation choices that the literature has shown matter strongly \[[4](https://arxiv.org/html/2606.23932#bib.bib4), [1](https://arxiv.org/html/2606.23932#bib.bib1)\]. The soft, position-aware, age-conditioned and asymmetric directions described in the previous paragraphs enrich the arguments of $\beta_t^{(i)}$, and the corresponding algorithms can be implemented and evaluated inside the same pipeline. On the theory side, the boundedness and monotone-improvement guarantees available for fixed and scheduled scalars in the original PPO analysis \[[9](https://arxiv.org/html/2606.23932#bib.bib9), [13](https://arxiv.org/html/2606.23932#bib.bib13)\] have direct analogues at the per-sample level, and characterising which shapes of $\beta_t^{(i)}$ preserve them is an open problem. The template also crosses domains: the same form covers MuJoCo locomotion, language-model fine-tuning \[[11](https://arxiv.org/html/2606.23932#bib.bib11), [16](https://arxiv.org/html/2606.23932#bib.bib16)\] and off-policy regimes, and $\beta_t^{(i)}$ is the shared design variable across the three.
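The first rows of Table 3 can be read as interchangeable coefficient functions plugged into one formula. The sketch below is a hypothetical illustration, not the authors' implementation: it computes the bracket $\hat{A} + \beta/w$ that multiplies $\nabla w$ in the per-sample gradient, assuming $\beta$ is held fixed during differentiation. All function names are invented.

```python
import numpy as np


def beta_unconstrained(w, adv):
    """beta_t = 0: the plain importance-sampled surrogate."""
    return np.zeros_like(w)


def make_beta_fixed(beta):
    """beta_t = beta for every sample, as in fixed-coefficient PPO-KL."""
    return lambda w, adv: np.full_like(w, beta)


def make_beta_clip(eps):
    """beta_t = -w * A on the killed region, 0 elsewhere (PPO-Clip step)."""
    def beta_fn(w, adv):
        kill = ((w > 1 + eps) & (adv > 0)) | ((w < 1 - eps) & (adv < 0))
        return np.where(kill, -w * adv, 0.0)
    return beta_fn


def effective_advantage(w, adv, beta_fn):
    """Bracket A + beta / w that scales grad w in the per-sample gradient."""
    return adv + beta_fn(w, adv) / w


w = np.array([1.0, 1.5, 0.5])
adv = np.array([1.0, 1.0, -1.0])

for name, fn in [("unconstrained", beta_unconstrained),
                 ("fixed beta=1", make_beta_fixed(1.0)),
                 ("clip eps=0.2", make_beta_clip(0.2))]:
    print(name, effective_advantage(w, adv, fn))
```

In this toy batch only the step coefficient zeroes the two out-of-band samples while leaving the in-band sample untouched; a single scalar shifts all three brackets at once.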

## References

- Andrychowicz et al. \[2021\] Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, Sylvain Gelly, and Olivier Bachem. What matters in on-policy reinforcement learning? A large-scale empirical study. In *International Conference on Learning Representations (ICLR)*, 2021.
- DeepSeek-AI \[2025\] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. *arXiv preprint arXiv:2501.12948*, 2025.
- Dwyer et al. \[2025\] Madeleine Dwyer, Adam Sobey, and Adriane Chapman. It’s not you, it’s clipping: A soft trust-region via probability smoothing for LLM RL. *arXiv preprint arXiv:2509.21282*, 2025.
- Engstrom et al. \[2020\] Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. Implementation matters in deep RL: A case study on PPO and TRPO. In *International Conference on Learning Representations (ICLR)*, 2020.
- Heess et al. \[2017\] Nicolas Heess, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, Tom Erez, Ziyu Wang, S. M. Ali Eslami, Martin Riedmiller, and David Silver. Emergence of locomotion behaviours in rich environments. *arXiv preprint arXiv:1707.02286*, 2017.
- Hsu et al. \[2020\] Chloe Ching-Yun Hsu, Celestine Mendler-Dünner, and Moritz Hardt. Revisiting design choices in proximal policy optimization. *arXiv preprint arXiv:2009.10897*, 2020.
- Huang et al. \[2022\] Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality single-file implementations of deep reinforcement learning algorithms. *Journal of Machine Learning Research*, 23(274):1–18, 2022.
- Ilyas et al. \[2020\] Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. A closer look at deep policy gradients. In *International Conference on Learning Representations (ICLR)*, 2020.
- Kakade and Langford \[2002\] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In *Proceedings of the 19th International Conference on Machine Learning (ICML)*, pages 267–274, 2002.
- Liu et al. \[2025\] Kezhao Liu, Jason Klein Liu, Mingtao Chen, and Yiming Liu. Rethinking KL regularization in RLHF: From value estimation to gradient optimization. *arXiv preprint arXiv:2510.01555*, 2025.
- Ouyang et al. \[2022\] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. *Advances in Neural Information Processing Systems*, 35:27730–27744, 2022.
- Raffin et al. \[2021\] Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-Baselines3: Reliable reinforcement learning implementations. *Journal of Machine Learning Research*, 22(268):1–8, 2021.
- Schulman et al. \[2015\] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In *Proceedings of the 32nd International Conference on Machine Learning (ICML)*, pages 1889–1897, 2015.
- Schulman et al. \[2016\] John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In *International Conference on Learning Representations (ICLR)*, 2016.
- Schulman et al. \[2017\] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. *arXiv preprint arXiv:1707.06347*, 2017.
- Shao et al. \[2024\] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. *arXiv preprint arXiv:2402.03300*, 2024.
- Sun et al. \[2022\] Mingfei Sun, Vitaly Kurin, Guoqing Liu, Sam Devlin, Tao Qin, Katja Hofmann, and Shimon Whiteson. You may not need ratio clipping in PPO. *arXiv preprint arXiv:2202.00079*, 2022.
- Sutton and Barto \[2018\] Richard S. Sutton and Andrew G. Barto. *Reinforcement Learning: An Introduction*. MIT Press, 2nd edition, 2018.
- Sutton et al. \[2000\] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In *Advances in Neural Information Processing Systems (NIPS)*, 2000.
- Todorov et al. \[2012\] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In *IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*, pages 5026–5033, 2012.
- Wang et al. \[2019\] Yuhui Wang, Hao He, Xiaoyang Tan, and Yaozhong Gan. Trust region-guided proximal policy optimization. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2019.
- Wang et al. \[2020\] Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In *Conference on Uncertainty in Artificial Intelligence (UAI)*, 2020.
- Xie et al. \[2025\] Zhengpeng Xie, Qiang Zhang, Fan Yang, Marco Hutter, and Renjing Xu. Simple policy optimization. In *Proceedings of the 42nd International Conference on Machine Learning (ICML)*, pages 68813–68824, 2025.
- Zhang et al. \[2026\] Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, and Andrew Chi-Chih Yao. On the design of KL-regularized policy gradient algorithms for LLM reasoning. In *International Conference on Learning Representations (ICLR)*, 2026.

## Appendix A Weight-space dual: PPO-Clip as a $\Phi$ penalty

The per-sample $\beta_t$ identity of the main text places PPO-Clip in distribution space: the penalty is $\beta_t\log\pi_{\theta'}(a_t\mid s_t)$, a per-sample KL contribution. PPO-Clip also admits a dual formulation in weight space, where the penalty acts directly on the deviation of $w_t$ from the trust region $[1-\epsilon,\, 1+\epsilon]$.

###### Theorem 2 (Weight-space form of PPO-Clip).

The PPO-Clip objective can be written as

$$
L_{\mathrm{CLIP}}(\theta') \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H}\Bigl[\,w_t^{(i)}\hat{A}_t^{(i)} \;-\; \Phi\bigl(w_t^{(i)}, \hat{A}_t^{(i)}\bigr)\Bigr]
$$

with the per-sample weight-space penalty

$$
\Phi(w, \hat{A}) \;=\; \begin{cases}
\bigl(w - (1+\epsilon)\bigr)\,\hat{A} & \text{if } w > 1+\epsilon \text{ and } \hat{A} > 0,\\[3pt]
\bigl(w - (1-\epsilon)\bigr)\,\hat{A} & \text{if } w < 1-\epsilon \text{ and } \hat{A} < 0,\\[3pt]
0 & \text{otherwise.}
\end{cases}
$$

###### Proof.

Fix a sample $(i,t)$ and write $\ell_t^{(i)} = \min\bigl(w_t^{(i)}\hat{A}_t^{(i)},\, \mathrm{clip}(w_t^{(i)})\hat{A}_t^{(i)}\bigr)$ for its PPO-Clip per-sample contribution.

*Case $w_t^{(i)} > 1+\epsilon$ and $\hat{A}_t^{(i)} > 0$.* The clipped weight is $1+\epsilon$ and the clipped product is smaller than the unclipped one, so

$$
\ell_t^{(i)} \;=\; (1+\epsilon)\hat{A}_t^{(i)} \;=\; w_t^{(i)}\hat{A}_t^{(i)} - \bigl(w_t^{(i)} - (1+\epsilon)\bigr)\hat{A}_t^{(i)} \;=\; w_t^{(i)}\hat{A}_t^{(i)} - \Phi\bigl(w_t^{(i)}, \hat{A}_t^{(i)}\bigr).
$$

*Case $w_t^{(i)} < 1-\epsilon$ and $\hat{A}_t^{(i)} < 0$.* The clipped weight is $1-\epsilon$ and again the clipped product is the smaller, so

$$
\ell_t^{(i)} \;=\; (1-\epsilon)\hat{A}_t^{(i)} \;=\; w_t^{(i)}\hat{A}_t^{(i)} - \bigl(w_t^{(i)} - (1-\epsilon)\bigr)\hat{A}_t^{(i)} \;=\; w_t^{(i)}\hat{A}_t^{(i)} - \Phi\bigl(w_t^{(i)}, \hat{A}_t^{(i)}\bigr).
$$

*Otherwise.* The unclipped term is the minimum, so $\ell_t^{(i)} = w_t^{(i)}\hat{A}_t^{(i)} = w_t^{(i)}\hat{A}_t^{(i)} - 0$, and the definition of $\Phi$ gives $\Phi\bigl(w_t^{(i)}, \hat{A}_t^{(i)}\bigr) = 0$.

Summing over $(i,t)$ yields the result. ∎

The penalty $\Phi$ is non-negative on $\mathcal{I}_{\mathrm{kill}}$: when $w > 1+\epsilon$ and $\hat{A} > 0$ the factor $w - (1+\epsilon)$ is positive and so is $\hat{A}$; when $w < 1-\epsilon$ and $\hat{A} < 0$ both factors are negative and their product is positive again. PPO-Clip therefore subtracts a non-negative penalty from the unconstrained surrogate, proportional to how far $w_t^{(i)}$ exceeds the trust-region boundary, on exactly the samples in $\mathcal{I}_{\mathrm{kill}}$.
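Theorem 2 and the sign claim above are easy to spot-check on random samples. The sketch below is illustrative only; the function names, the sampling ranges, and the seed are arbitrary choices and not part of the paper.

```python
import numpy as np


def phi(w, adv, eps=0.2):
    """Weight-space penalty Phi(w, A) from Theorem 2."""
    upper = (w > 1 + eps) & (adv > 0)
    lower = (w < 1 - eps) & (adv < 0)
    return np.where(upper, (w - (1 + eps)) * adv,
                    np.where(lower, (w - (1 - eps)) * adv, 0.0))


def clip_term(w, adv, eps=0.2):
    """Per-sample PPO-Clip term min(w A, clip(w) A)."""
    return np.minimum(w * adv, np.clip(w, 1 - eps, 1 + eps) * adv)


rng = np.random.default_rng(0)
w = rng.uniform(0.2, 2.0, size=10_000)   # importance ratios on both sides of the band
adv = rng.normal(size=10_000)            # advantages of either sign

penalty = phi(w, adv)
matches_min_form = np.allclose(clip_term(w, adv), w * adv - penalty)
is_non_negative = bool((penalty >= 0).all())
print(matches_min_form, is_non_negative)
```

Both checks compare function values, which is the sense in which the first two forms of PPO-Clip are equal; the KL form matches only in gradient.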

Combining the weight\-space identity above with the per\-sampleβt\\beta\_\{t\}gradient identity of the main text, PPO\-Clip admits three forms on every minibatch, the first two equal in value and the third equal in gradient:

1. the $\min$ formulation of Schulman et al. \[[15](https://arxiv.org/html/2606.23932#bib.bib15)\]:
   $$L_{\mathrm{CLIP}} \;=\; \mathbb{E}_t\bigl[\min\bigl(w_t\hat{A}_t,\ \mathrm{clip}(w_t)\hat{A}_t\bigr)\bigr];$$
2. the weight-space form of Theorem [2](https://arxiv.org/html/2606.23932#Thmtheorem2):
   $$L_{\mathrm{CLIP}} \;=\; \mathbb{E}_t\bigl[\,w_t\hat{A}_t \;-\; \Phi(w_t,\hat{A}_t)\bigr],$$
   with penalty acting on $|w_t - 1|$ in importance-weight space;
3. the per-sample KL form of the main theorem, which matches PPO-Clip in gradient:
   $$\nabla_{\theta'}L_{\mathrm{CLIP}} \;=\; \nabla_{\theta'}\,\mathbb{E}_t\bigl[\,w_t\hat{A}_t \;+\; \beta_t\log\pi_{\theta'}(a_t\mid s_t)\bigr],\qquad \beta_t = -w_t\hat{A}_t\cdot\mathbf{1}\bigl[(i,t)\in\mathcal{I}_{\mathrm{kill}}\bigr],$$
   with penalty acting in distribution space.

The first two forms are equal as functions and differ only in surface notation; the third matches them in per-sample gradient and places PPO-Clip inside the PPO-KL family. The space in which the trust region is expressed (importance-weight space for $\Phi$, distribution space for $\beta_t$) changes the notation but not the per-sample gradient, which is the same in all three forms.
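The gradient agreement between the first and third forms can be probed with finite differences. The sketch below is an illustration under stated assumptions: the new log-probability $\log\pi_{\theta'}(a_t\mid s_t)$ is treated as the free scalar variable, $\beta_t$ is evaluated at the current ratio and then held constant while differentiating, and all function names and numeric values are hypothetical.

```python
import math


def ppo_clip_term(logp_new, logp_old, adv, eps):
    """Per-sample min(w * A, clip(w) * A) with w = exp(logp_new - logp_old)."""
    w = math.exp(logp_new - logp_old)
    w_clipped = max(1.0 - eps, min(w, 1.0 + eps))
    return min(w * adv, w_clipped * adv)


def beta_coeff(logp_new, logp_old, adv, eps):
    """beta_t = -w * A on the kill set, 0 elsewhere."""
    w = math.exp(logp_new - logp_old)
    in_kill = (w > 1.0 + eps and adv > 0.0) or (w < 1.0 - eps and adv < 0.0)
    return -w * adv if in_kill else 0.0


def kl_term(logp_new, logp_old, adv, beta):
    """Per-sample w * A + beta * log pi_new, with beta passed in as a constant."""
    w = math.exp(logp_new - logp_old)
    return w * adv + beta * logp_new


def central_diff(f, x, h=1e-6):
    """Symmetric finite-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


eps, logp_old = 0.2, -1.0  # illustrative values (assumption)
gaps = []
for w_target in (0.5, 0.9, 1.0, 1.1, 1.5):  # points away from the band edges
    for adv in (-1.5, 0.7):
        logp_new = logp_old + math.log(w_target)
        beta = beta_coeff(logp_new, logp_old, adv, eps)  # frozen coefficient
        g_clip = central_diff(lambda x: ppo_clip_term(x, logp_old, adv, eps), logp_new)
        g_kl = central_diff(lambda x: kl_term(x, logp_old, adv, beta), logp_new)
        gaps.append(abs(g_clip - g_kl))
max_grad_gap = max(gaps)
```

The test points deliberately avoid $w = 1 \pm \epsilon$, where the clipped term has a kink and a finite difference is not meaningful.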

Table 4: Three forms of the PPO-Clip per-sample loss. All three produce the same per-sample gradient on every $(i,t)$.

## Appendix B Supplementary Figures

This appendix collects supplementary figures that illustrate the geometry of PPO-Clip and PPO-KL and the per-sample equivalence between them. The notation follows the main text, with $w_t$ the importance ratio, $\hat{A}_t$ the GAE advantage estimate, and $\mathcal{I}_{\text{in}}, \mathcal{I}_{\text{kill}}, \mathcal{I}_{\text{pass}}$ the partition of the minibatch.

### B.1 The clipping function

[Figure 3 plot: clipped weight $w_c$ against the importance ratio $w = \pi_{\theta'}/\pi_{\theta}$, with ticks at $1-\epsilon$, $1$, $1+\epsilon$ on both axes. Inside the band the gradient is active and the weights are free to vary; on either side $\nabla_{\theta'} = 0$.]

Figure 3: The clipping function $w_c = \mathrm{clip}(w, 1-\epsilon, 1+\epsilon)$. Inside the band $[1-\epsilon, 1+\epsilon]$ the clipped weight equals the true importance ratio and the gradient flows normally; outside the band the clipped weight is constant and the gradient with respect to $\theta'$ vanishes.
### B.2 The PPO-Clip surrogate

[Figure 4 plots: two panels showing the objective against $w$, with ticks at $1-\epsilon$, $1$, $1+\epsilon$. Positive advantage ($\hat{A} > 0$): curves $w\hat{A}$, $w_c\hat{A}$, and their $\min$ (PPO), labelled bounded. Negative advantage ($\hat{A} < 0$): the same three curves, with the $\min$ (PPO) labelled unbounded.]

Figure 4: The PPO-Clip surrogate, split by the sign of $\hat{A}$. Left panel ($\hat{A} > 0$): for $w > 1+\epsilon$ the minimum follows the clipped term, bounding the upside. Right panel ($\hat{A} < 0$): for $w > 1+\epsilon$ the minimum follows the unclipped term, so the penalty grows without bound. The asymmetry between the two panels is what makes PPO-Clip a one-sided trust region.
### B.3 PPO-Clip flowchart

*Step 1: Collect Samples.* Run policy $\pi_{\theta}$ in the environment for $N$ trajectories of length $H$: $\{\tau^{(i)}\}_{i=1}^{N}$, where each $\tau^{(i)} = (s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, r_H^{(i)})$.

*Step 2: Compute Targets and Fit Value Function.* Targets: $y_t^{(i)} = r(s_t^{(i)}, a_t^{(i)}) + \gamma\,\hat{V}_{\phi}(s_{t+1}^{(i)})$. Train critic $\hat{V}_{\phi}$ by regression on $\{(s_t^{(i)},\; y_t^{(i)})\}$.

*Step 3: Estimate Advantages (GAE).* $\delta_t^{(i)} = r_t^{(i)} + \gamma\,\hat{V}_{\phi}(s_{t+1}^{(i)}) - \hat{V}_{\phi}(s_t^{(i)})$, $\quad \hat{A}_{\mathrm{GAE},t}^{(i)} = \displaystyle\sum_{t'=t}^{\infty}(\gamma\lambda)^{t'-t}\,\delta_{t'}^{(i)}$.

*Step 4: Initialize Inner Loop.* Set $\theta' \leftarrow \theta$. Pre-compute and store $\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})$ for all samples.

*Step 5: Gradient Step on the Clipped Surrogate.*

$$L_{\mathrm{CLIP}}(\theta') = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H}\min\left\{ w_t^{(i)}\,\hat{A}_t^{(i)},\;\; \mathrm{clip}\left(w_t^{(i)},\; 1-\epsilon,\; 1+\epsilon\right)\hat{A}_t^{(i)} \right\}, \qquad w_t^{(i)} = \frac{\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})}$$

Update: $\theta' \leftarrow \theta' + \alpha\,\nabla_{\theta'}L_{\mathrm{CLIP}}(\theta')$.

*Repeated $K$ times?* If no, take another gradient step; if yes, set $\theta \leftarrow \theta'$. The flowchart groups the steps into an outer loop (once per iteration) and an inner loop ($K$ gradient steps).

Figure 5: PPO-Clip. The outer loop collects rollouts and fits the critic; the inner loop takes $K$ gradient steps on the clipped surrogate $L_{\mathrm{CLIP}}$ before refreshing the behaviour policy.
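The advantage step of the flowchart can be sketched for a single trajectory as follows. This is an illustrative sketch, not the authors' implementation: the function name `gae_advantages` is hypothetical, the default `gamma` and `lam` are placeholders, and the infinite sum is truncated at the end of the rollout (deltas beyond the horizon are taken to be zero), which is an assumption made here.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE advantages for one trajectory.

    rewards: list of r_t for t = 0..H-1.
    values:  list of critic values V(s_t) for t = 0..H, so it has one more
             entry than rewards (the last entry is the bootstrap value).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy check with a zero critic: every delta equals the reward.
adv = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

The backward recursion is algebraically the same as the discounted sum of $\delta_{t'}$ in Step 3, restricted to the collected horizon.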
### B.4 PPO-KL flowchart

*Step 1: Collect Samples.* Run policy $\pi_{\theta}$ in the environment for $N$ trajectories of length $H$: $\{\tau^{(i)}\}_{i=1}^{N}$, where each $\tau^{(i)} = (s_1^{(i)}, a_1^{(i)}, r_1^{(i)}, \ldots, s_H^{(i)}, a_H^{(i)}, r_H^{(i)})$.

*Step 2: Compute Targets.* $y_t^{(i)} = r(s_t^{(i)}, a_t^{(i)}) + \gamma\,\hat{V}_{\phi}(s_{t+1}^{(i)})$.

*Step 3: Fit Value Function.* Train critic $\hat{V}_{\phi}$ by regression: $\phi \leftarrow \phi - \alpha_{\phi}\nabla_{\phi}\dfrac{1}{NH}\displaystyle\sum_{i=1}^{N}\sum_{t=1}^{H}\bigl(\hat{V}_{\phi}(s_t^{(i)}) - y_t^{(i)}\bigr)^{2}$.

*Step 4: Estimate Advantages (GAE).* $\delta_t^{(i)} = r_t^{(i)} + \gamma\,\hat{V}_{\phi}(s_{t+1}^{(i)}) - \hat{V}_{\phi}(s_t^{(i)})$, $\quad \hat{A}_{\mathrm{GAE},t}^{(i)} = \displaystyle\sum_{t'=t}^{\infty}(\gamma\lambda)^{t'-t}\,\delta_{t'}^{(i)}$.

*Step 5: Initialize Inner Loop.* Set $\theta' \leftarrow \theta$. Pre-compute and store $\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})$ for all samples.

*Step 6: Gradient Step on the KL-Penalised Objective.*

$$\mathcal{L}_{\mathrm{KL}}(\theta') = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{H}\left[\frac{\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})}{\pi_{\theta}(a_t^{(i)} \mid s_t^{(i)})}\,\hat{A}_t^{(i)} + \beta\log\pi_{\theta'}(a_t^{(i)} \mid s_t^{(i)})\right]$$

Update: $\theta' \leftarrow \theta' + \alpha\,\nabla_{\theta'}\mathcal{L}_{\mathrm{KL}}(\theta')$.

*Repeated $K$ times?* If no, take another gradient step; if yes, continue to Step 7.

*Step 7: Dual Variable Update.* $\beta \leftarrow \beta + \alpha_{\beta}\bigl(D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\theta'}) - \epsilon\bigr)$ ($D_{\mathrm{KL}} > \epsilon \Rightarrow$ increase $\beta$; $D_{\mathrm{KL}} < \epsilon \Rightarrow$ decrease $\beta$).

*Step 8: Adopt New Policy.* $\theta \leftarrow \theta'$, then start the next iteration. The flowchart groups the steps into an outer loop (once per iteration) and an inner loop ($K$ gradient steps).

Figure 6: PPO-KL. The objective is the unclipped importance-weighted advantage plus a KL penalty with a scalar coefficient $\beta$; after the inner loop $\beta$ is adjusted by a dual gradient step that targets a desired KL.
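The dual step on $\beta$ can be illustrated with a few lines of Python. This is a sketch under assumptions: the names `sample_kl` and `dual_update` are hypothetical, the KL divergence is estimated by the sample mean of $\log\pi_{\theta} - \log\pi_{\theta'}$ over actions drawn from $\pi_{\theta}$, and the numeric values are placeholders rather than settings from the paper.

```python
def sample_kl(logp_old, logp_new):
    """Monte Carlo estimate of D_KL(pi_theta || pi_theta').

    Valid as an estimate because the actions were sampled under pi_theta;
    a finite-sample estimate can be slightly negative.
    """
    return sum(lo - ln for lo, ln in zip(logp_old, logp_new)) / len(logp_old)


def dual_update(beta, kl, kl_target, step_size):
    """One dual step: beta rises when kl > kl_target and falls when kl < kl_target."""
    return beta + step_size * (kl - kl_target)


logp_old = [-1.0, -2.0, -0.5]  # stored log pi_theta(a|s), illustrative
logp_new = [-1.2, -2.1, -0.6]  # log pi_theta'(a|s) after the inner loop, illustrative
kl = sample_kl(logp_old, logp_new)
beta_next = dual_update(beta=1.0, kl=kl, kl_target=0.01, step_size=0.5)
```

The sketch mirrors the update as written in the flowchart, which applies no projection of $\beta$ onto non-negative values; whether the reference implementation adds one is not stated here.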
### B.5 Effective gradient multiplier as a function of $w$

[Figure 7 plots: two panels showing the effective gradient against $w$, with ticks at $1-\epsilon$, $1$, $1+\epsilon$. PPO-Clip panel: labels "$\hat{A} > 0$: killed", "$\hat{A} < 0$: active", "killed", and the band $[1-\epsilon,\; 1+\epsilon]$. PPO-KL (scalar $\beta$) panel: level $\hat{A}$, curves $\hat{A} + \tfrac{\beta}{w}$, the point $w = \tfrac{\beta}{|\hat{A}|}$, and the labels "effective trust region" and "$\approx$ inside TR".]

Figure 7: Per-sample effective gradient multiplier as a function of the importance ratio $w$. Left: PPO-Clip is a hard step. The multiplier equals $\hat{A}$ inside $[1-\epsilon, 1+\epsilon]$ and on $\mathcal{I}_{\text{pass}}$, and zero on $\mathcal{I}_{\text{kill}}$. Right: a scalar PPO-KL produces the smooth curve $\hat{A} + \beta/w$ (the asymptote is $\hat{A}$). The two coincide inside the trust region; outside, PPO-Clip is a step function while scalar PPO-KL is a smooth one. The per-sample $\beta_t$ construction of the main text is the choice that recovers the step function.
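The two multiplier shapes can be tabulated directly. The sketch below is illustrative only: function names and the values of `eps`, `beta`, and `adv` are assumptions, and "multiplier" means the scalar that multiplies $\nabla_{\theta'}w$ in the per-sample gradient.

```python
def clip_multiplier(w, adv, eps):
    """PPO-Clip: the multiplier is A everywhere except the kill set, where it is 0."""
    in_kill = (w > 1.0 + eps and adv > 0.0) or (w < 1.0 - eps and adv < 0.0)
    return 0.0 if in_kill else adv


def scalar_kl_multiplier(w, adv, beta):
    """Scalar PPO-KL: the multiplier is A + beta / w for one shared beta."""
    return adv + beta / w


eps, beta, adv = 0.2, 0.3, -1.5  # illustrative values (assumption)
ws = [0.1 * k for k in range(1, 31)]
clip_curve = [clip_multiplier(w, adv, eps) for w in ws]
kl_curve = [scalar_kl_multiplier(w, adv, beta) for w in ws]

# For A < 0 the smooth curve crosses zero only at the single point w = beta / |A|.
w_zero = beta / abs(adv)
kl_at_zero = scalar_kl_multiplier(w_zero, adv, beta)
```

The step takes only two values on the whole grid, while the scalar-$\beta$ curve is strictly monotone in $w$, which is why no single scalar $\beta$ can reproduce the step.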
### B.6 Morphing from scalar to per-sample $\beta$

[Figure 8 plots: three panels showing the multiplier against $w$, with level $\hat{A}$ and ticks at $1-\epsilon$, $1$, $1+\epsilon$. (a) scalar $\beta$, one $\beta$ for every sample: curve $\hat{A} + \beta/w$. (b) adaptive scalar $\beta(t)$, $\beta$ varies in time but is the same per sample: curve $\hat{A} + \beta(t)/w$. (c) per-sample $\beta_t = -w\hat{A}\,\mathbf{1}_{\mathcal{I}_{\mathrm{kill}}}$, $\equiv$ PPO-Clip: step with a "killed" region.]

Figure 8: Morphing from scalar to per-sample $\beta$, shown for the case $\hat{A} > 0$. (a) A constant scalar $\beta$ produces the smooth curve $\hat{A} + \beta/w$, which can never coincide with the PPO-Clip step. (b) Letting $\beta$ adapt in time but stay scalar shifts the curve along its family but cannot reshape it. (c) Letting $\beta$ become per-sample, with $\beta_t = -w_t\hat{A}_t$ exactly on $\mathcal{I}_{\mathrm{kill}}$ and zero elsewhere, snaps the curve onto the PPO-Clip step function. The clip is the per-sample limit of the KL surrogate; scalar penalties never reach it.
### B.7 The $\beta_t$ map in the $(w, \hat{A})$ plane

[Figure 9 plot: the $(w, \hat{A})$ plane with ticks at $1-\epsilon$, $1$, $1+\epsilon$ on the $w$ axis. Regions: $\mathcal{I}_{\mathrm{kill}}$ with $\beta_t = -w\hat{A} < 0$; $\mathcal{I}_{\mathrm{kill}}$ with $\beta_t = -w\hat{A} > 0$; two $\mathcal{I}_{\mathrm{pass}}$ regions with $\beta_t = 0$; and $\mathcal{I}_{\mathrm{in}}$ with $\beta_t = 0$.]

Figure 9: The $\beta_t$ landscape in $(w, \hat{A})$ space. Vertical dashed lines mark the trust region $[1-\epsilon, 1+\epsilon]$. Two corners (top-right $w > 1+\epsilon,\ \hat{A} > 0$ and bottom-left $w < 1-\epsilon,\ \hat{A} < 0$) carry the per-sample coefficient $\beta_t = -w\hat{A}$, with shading indicating $|\beta_t|$. Everywhere else $\beta_t = 0$. The support and value of $\beta_t$ reproduce the PPO-Clip gradient sample by sample.
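The map in Figure 9 amounts to a small lookup on $(w, \hat{A})$. The following sketch spells it out; the function names, the string labels, and `eps = 0.2` are illustrative assumptions, and samples with $\hat{A} = 0$ outside the band are assigned to the pass set here as a convention of this sketch.

```python
def region(w, adv, eps):
    """Classify a sample as 'in', 'kill', or 'pass' from its ratio and advantage."""
    if 1.0 - eps <= w <= 1.0 + eps:
        return "in"
    if (w > 1.0 + eps and adv > 0.0) or (w < 1.0 - eps and adv < 0.0):
        return "kill"
    return "pass"


def beta_t(w, adv, eps):
    """Per-sample coefficient: -w * A on the kill set, 0 elsewhere."""
    return -w * adv if region(w, adv, eps) == "kill" else 0.0


eps = 0.2  # illustrative clip range (assumption)
corners = {
    "top_right": (1.5, 1.0),      # w > 1 + eps, A > 0
    "bottom_left": (0.5, -1.0),   # w < 1 - eps, A < 0
    "bottom_right": (1.5, -1.0),  # w > 1 + eps, A < 0
    "top_left": (0.5, 1.0),       # w < 1 - eps, A > 0
    "centre": (1.0, 1.0),         # inside the band
}
table = {name: (region(w, a, eps), beta_t(w, a, eps)) for name, (w, a) in corners.items()}
```

The coefficient is negative in the top-right corner and positive in the bottom-left corner, matching the signs marked on the figure.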
### B.8 The same per-sample gradient via two surrogates

[Figure 10 diagram: a sample $(s_t, a_t, \hat{A}_t, w_t)$ feeds two boxes, the PPO-Clip per-sample term $\ell_t^{\mathrm{CLIP}} = \min\bigl(w_t\hat{A}_t,\ \mathrm{clip}(w_t)\hat{A}_t\bigr)$ and the per-sample KL term $\ell_t^{\mathrm{KL}} = w_t\hat{A}_t + \beta_t\,\log\pi_{\theta'}(a_t \mid s_t)$ with $\beta_t = -w_t\hat{A}_t\,\mathbf{1}_{\mathcal{I}_{\mathrm{kill}}}$. Applying $\nabla_{\theta'}$ to either box gives the same output:]

$$g_t \;=\; \begin{cases}\hat{A}_t\,\nabla_{\theta'}w_t & \text{if } (i,t) \in \mathcal{I}_{\mathrm{in}} \cup \mathcal{I}_{\mathrm{pass}},\\[2pt] 0 & \text{if } (i,t) \in \mathcal{I}_{\mathrm{kill}}.\end{cases}$$

Figure 10: The per-sample gradient under the two surrogates. The same sample feeds both per-sample terms; differentiating each in $\theta'$ yields the same $g_t$. PPO-Clip selects the active branch of the $\min$ to kill the gradient on $\mathcal{I}_{\mathrm{kill}}$; the per-sample KL surrogate sets $\beta_t = -w_t\hat{A}_t$ on $\mathcal{I}_{\mathrm{kill}}$ and $0$ elsewhere, which makes the bracket $\hat{A}_t + \beta_t/w_t$ vanish on those samples.

## Appendix C Supplementary experimental results

### C.1 Per-task equivalence

Figure [11](https://arxiv.org/html/2606.23932#A3.F11) reports PPO-Clip and the per-sample PPO-KL surrogate on each of the seven tasks separately, namely CartPole-v1, LunarLander-v3, Hopper-v4, HalfCheetah-v4, Walker2d-v4, Ant-v4, and Humanoid-v4. The two are indistinguishable on every environment. Each curve is a mean over $5$ seeds with a $\pm$ std band.

![Refer to caption](https://arxiv.org/html/2606.23932v1/x6.png) (a) CartPole-v1
![Refer to caption](https://arxiv.org/html/2606.23932v1/x7.png) (b) LunarLander-v3
![Refer to caption](https://arxiv.org/html/2606.23932v1/x8.png) (c) Hopper-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x9.png) (d) HalfCheetah-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x10.png) (e) Walker2d-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x11.png) (f) Ant-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x12.png) (g) Humanoid-v4

Figure 11: PPO-Clip and per-sample PPO-KL on each task, mean over $5$ seeds with a $\pm$ std band. The two coincide on every environment.

### C.2 Trust-region knob sweeps

To place each scalar-$\beta$ baseline at a defensible operating point we sweep the trust-region knob of each variant; the main-text baselines use the best value of each knob.

![Refer to caption](https://arxiv.org/html/2606.23932v1/x13.png)

Figure 12: Final return (mean $\pm$ standard error over $5$ seeds) on CartPole-v1, HalfCheetah-v4, and Hopper-v4 as a function of each trust-region knob: clip $\epsilon$ for PPO-Clip, fixed $\beta$ for PPO-KL, and the KL target for adaptive PPO-KL.

### C.3 Clipping partition

Figure [13](https://arxiv.org/html/2606.23932#A3.F13) shows, for PPO-Clip, the fraction of each minibatch that falls in $\mathcal{I}_{\mathrm{kill}}$ and $\mathcal{I}_{\mathrm{pass}}$ over training. This is the empirical view of the partition on which the identity rests: the penalty $\beta_t$ is non-zero only on $\mathcal{I}_{\mathrm{kill}}$, and its reach grows with task difficulty, exceeding half the batch on Humanoid-v4.

![Refer to caption](https://arxiv.org/html/2606.23932v1/x14.png) (a) Hopper-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x15.png) (b) HalfCheetah-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x16.png) (c) Walker2d-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x17.png) (d) Ant-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x18.png) (e) Humanoid-v4

Figure 13: Fraction of the PPO-Clip minibatch in $\mathcal{I}_{\mathrm{kill}}$ and $\mathcal{I}_{\mathrm{pass}}$ over training (mean over $5$ seeds, $\pm$ std band). The per-sample coefficient $\beta_t$ acts only on $\mathcal{I}_{\mathrm{kill}}$.

### C.4 Per-sample coefficient

Figure [14](https://arxiv.org/html/2606.23932#A3.F14) plots the per-sample coefficient $\beta_t$ over training for the per-sample variant. Its median is zero, since most samples lie outside $\mathcal{I}_{\mathrm{kill}}$, while the tails carry the active penalty $-w_t\hat{A}_t$ and widen on the harder tasks.

![Refer to caption](https://arxiv.org/html/2606.23932v1/x19.png) (a) Hopper-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x20.png) (b) HalfCheetah-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x21.png) (c) Walker2d-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x22.png) (d) Ant-v4
![Refer to caption](https://arxiv.org/html/2606.23932v1/x23.png) (e) Humanoid-v4

Figure 14: Per-sample coefficient $\beta_t$ over training for the per-sample variant. The median is zero because $\beta_t$ vanishes on every sample outside the kill region; the shaded bands show how far the active coefficient $-w_t\hat{A}_t$ reaches on the samples in $\mathcal{I}_{\mathrm{kill}}$, widening on the high-dimensional tasks.
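The diagnostics of Sections C.3 and C.4 (partition fractions and the median of $\beta_t$) can be computed from the ratios and advantages of a minibatch. The sketch below uses a tiny hand-written minibatch purely for illustration; the data, the function name `region`, and `eps = 0.2` are assumptions and are not results from the paper.

```python
from collections import Counter
from statistics import median


def region(w, adv, eps):
    """Classify a sample as 'in', 'kill', or 'pass' from its ratio and advantage."""
    if 1.0 - eps <= w <= 1.0 + eps:
        return "in"
    if (w > 1.0 + eps and adv > 0.0) or (w < 1.0 - eps and adv < 0.0):
        return "kill"
    return "pass"


eps = 0.2  # illustrative clip range (assumption)
# Synthetic (w, A) pairs, made up for this example.
minibatch = [
    (1.00, 0.5), (0.95, -1.0), (1.10, 2.0), (1.05, -0.2), (0.90, 1.2),  # inside the band
    (1.30, 1.0), (0.70, -2.0),                                          # kill set
    (1.25, -0.4), (0.75, 0.3),                                          # pass set
]

counts = Counter(region(w, a, eps) for w, a in minibatch)
frac_kill = counts["kill"] / len(minibatch)
frac_pass = counts["pass"] / len(minibatch)

# beta_t is -w * A on the kill set and exactly 0 elsewhere.
betas = [-w * a if region(w, a, eps) == "kill" else 0.0 for w, a in minibatch]
beta_median = median(betas)
```

Whenever fewer than half of the samples fall in the kill set, more than half of the coefficients are exactly zero, so the median is zero and only the tails of the distribution carry the active penalty.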