Value-Gradient Hypothesis of RL for LLMs
Summary
This paper introduces the value-gradient hypothesis to explain why critic-free RL methods like PPO and GRPO work well for LLMs, showing that the actor backward pass carries a value-gradient-like signal. It derives a predictive criterion for when RL is most effective along the pretraining trajectory.
Cached at: 05/22/26, 08:50 AM
# Value-Gradient Hypothesis of RL for LLMs
Source: [https://arxiv.org/html/2605.21654](https://arxiv.org/html/2605.21654)
Arip Asadulaev (MBZUAI), Daniil Ognev (MBZUAI), Karim Salta (Independent), Martin Takac (MBZUAI)
###### Abstract
Reinforcement learning substantially improves pretrained language models, but it remains poorly understood why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective on critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: *the backward pass propagates costates whose conditional expectation equals the value gradient*. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value-gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.
## 1 Introduction

Figure 1: Actual RL gain vs. the gain predicted by the value-impact formula (Section [5](https://arxiv.org/html/2605.21654#S5), Eq. [29](https://arxiv.org/html/2605.21654#S5.E29)).

Large Language Models (LLMs) have recently achieved state-of-the-art reasoning using Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2605.21654#bib.bib3)), which discards the critic entirely. Yet classical Reinforcement Learning (RL) theory predicts that critic-free methods should fail at long-horizon credit assignment. Why don't they? *In this paper we argue that critic-free RL in LLMs is not value-free.* The central claim is that the actor backward pass already carries a value-gradient-like signal. In a differentiable rollout, this signal is exactly the costate propagated by Backpropagation Through Time (BPTT). In a discrete transformer, the same structure survives approximately because attention provides a differentiable pathway for credit transport around the token-sampling bottleneck.
First, under a continuous relaxation (Fairbank and Alonso, [2012](https://arxiv.org/html/2605.21654#bib.bib4)), we show that the Proximal Policy Optimization (PPO)/GRPO actor update is value-gradient-like in expectation. Second, for discrete transformer policies, we show that the empirical costate computed by autodiff approximates the continuous BPTT signal, with an error controlled by the sampling gap and policy entropy. Third, we use this perspective to derive an RL-impact decomposition into usable value-gradient signal and reachable reward headroom, which predicts when RL should be most effective along pretraining (Figure [1](https://arxiv.org/html/2605.21654#S1.F1)). This perspective gives a concrete answer to a practical question: *RL should help most at checkpoints that are simultaneously close enough to the value-gradient regime to transmit useful credit and far enough from saturation to retain reward-improving trajectories*.
Our contributions are: (i) we show that under a differentiable rollout and a shift/additive-noise policy, the local GRPO actor update is value-gradient-like in expectation; (ii) we show that in transformers with discrete token sampling, the empirical costate computed by autodiff approximates the BPTT costate, with an error controlled by the sampling gap and attention-based credit transport; (iii) we derive a predictive RL-impact decomposition into usable value-gradient signal and reachable reward headroom, which can be used empirically for checkpoint selection during pretraining.
## 2 Background
Notation. A prompt/question is denoted by $q \sim P(Q)$. Given $q$, a policy with parameters $\theta$ generates an autoregressive completion $o = (o_1, \dots, o_T)$ with token-level factorisation $\pi_\theta(o \mid q) = \prod_{t=1}^{T} \pi_\theta(o_t \mid s_t)$, where $s_t := (q, o_{<t})$ and $a_t := o_t$. We write $(s_t, a_t)_{t=1}^{T}$ for the token-level trajectory. Let $r(s_t, a_t) \in \mathbb{R}$ denote a per-token reward. *Outcome-only* RL is the case where $r(s_t, a_t) = 0$ for $t < T$ and $r(s_T, a_T) = r(q, o)$ is a terminal reward. With a fixed discount $\gamma \in (0, 1]$, the discounted return and the return-to-go are $R := \sum_{t=1}^{T} \gamma^{t-1}\, r(s_t, a_t)$ and $R_t := \sum_{j=t}^{T} \gamma^{j-t}\, r(s_j, a_j)$.
For any policy $\pi$, define the time-indexed value function, action-value, and advantage as $V^{\pi}_t(s) := \mathbb{E}[R_t \mid s_t = s]$, $Q^{\pi}_t(s, a) := \mathbb{E}[R_t \mid s_t = s,\, a_t = a]$, and $A^{\pi}_t(s, a) := Q^{\pi}_t(s, a) - V^{\pi}_t(s)$. The *value gradient* is the state-gradient of the value function, $G_t^{\pi}(s) := \frac{\partial V_t^{\pi}(s)}{\partial s}$. Unlike the scalar $V_t^{\pi}$, it is a vector field on state space. Throughout the paper we write gradients with the partial-derivative symbol: $\frac{\partial F(\theta)}{\partial \theta}$ for a gradient in parameter space, and $\frac{\partial \phi(s)}{\partial s}$ and $\frac{\partial \phi(s, a)}{\partial a}$ for gradients in state or action space, respectively.
### 2.1 RL for LLMs
GRPO (Shao et al., [2024](https://arxiv.org/html/2605.21654#bib.bib3)) is commonly presented as a PPO (Schulman et al., [2017](https://arxiv.org/html/2605.21654#bib.bib9)) variant that removes the learned critic/value model. For each prompt $q$, it samples a group of $G$ completions $\{o^i\}_{i=1}^{G}$ from an old policy $\pi_{\theta_{\mathrm{old}}}$, where the index $i$ identifies the rollout (and its return, which may be terminal-only or per-token). Let $T_i := |o^i|$ and $s_{i,t} := (q, o^i_{<t})$. In the outcome-only case each completion receives a scalar reward $r_i = r(q, o^i)$. GRPO forms a group-normalised signal $\tilde{r}_i$ and the tokenwise likelihood ratio: for $t = 1, \dots, T_i$,

$$\tilde{r}_i := \frac{r_i - \mathrm{mean}(r)}{\mathrm{std}(r) + \xi_{\mathrm{num}}}, \qquad \rho_{i,t}(\theta) := \frac{\pi_\theta(o^i_t \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(o^i_t \mid s_{i,t})},$$

with $\xi_{\mathrm{num}} > 0$ for numerical stability and $r = (r_1, \dots, r_G)$. GRPO then uses a tokenwise advantage estimate that is constant along the completion, $\widehat{A}_{i,t} := \tilde{r}_i$ for all $t$ (process-supervision variants can use token- or step-level reward-to-go). The objective is the GRPO clipped surrogate with a KL penalty to a fixed reference policy $\pi_{\mathrm{ref}}$:
$$J(\theta) = \mathbb{E}\Bigg[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\Bigg(\min\!\Bigl(\rho_{i,t}(\theta)\,\widehat{A}_{i,t},\; \operatorname{clip}\bigl(\rho_{i,t}(\theta), 1-\varepsilon, 1+\varepsilon\bigr)\,\widehat{A}_{i,t}\Bigr) - \beta\,\mathrm{KL}\!\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)\Bigg)\Bigg], \tag{1}$$

where $\varepsilon > 0$ is the clipping threshold and $\beta > 0$ controls the KL regularisation. GRPO differs from PPO only in how $\widehat{A}_{i,t}$ is constructed: it uses a group-normalised return (computed from multiple rollouts for the same prompt) and typically holds the resulting scalar constant along the trajectory.
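To make the construction concrete, the following is a minimal sketch of the group-normalised advantage and the clipped surrogate of Eq. (1) for a single prompt. It is illustrative only: the helper names, the use of the population standard deviation, the defaults `xi_num=1e-6` and `eps=0.2`, and the omission of the KL term are all assumptions, not details taken from the paper.

```python
import math
from statistics import fmean, pstdev


def group_normalised_rewards(rewards, xi_num=1e-6):
    """r~_i = (r_i - mean(r)) / (std(r) + xi_num) over one prompt's group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption; the text does not specify
    return [(r - mu) / (sigma + xi_num) for r in rewards]


def clipped_term(ratio, advantage, eps=0.2):
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single token."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def grpo_surrogate(logp_new, logp_old, rewards, eps=0.2):
    """Clipped surrogate for one group, with the KL penalty left out (assumption)."""
    adv = group_normalised_rewards(rewards)
    total = 0.0
    for lp_new_i, lp_old_i, a_i in zip(logp_new, logp_old, adv):
        # The advantage a_i is held constant along completion i.
        per_token = [
            clipped_term(math.exp(n - o), a_i, eps)
            for n, o in zip(lp_new_i, lp_old_i)
        ]
        total += sum(per_token) / len(per_token)  # 1/T_i average over tokens
    return total / len(rewards)                   # 1/G average over the group


# Hypothetical group of four completions with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_normalised_rewards(rewards)
logps = [[-1.0, -2.0], [-0.5], [-0.3, -0.2, -0.1], [-1.5]]
on_policy_value = grpo_surrogate(logps, logps, rewards)
```

When the new and old policies coincide, every ratio is 1 and the surrogate reduces to the mean of the group-normalised advantages, which is zero by construction.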
### 2.2 Gradient Estimators
Let $x$ be a random variable, $c$ a differentiable scalar cost, and $F(\theta) := \mathbb{E}_x[c(x)]$ the objective. There are two canonical ways $\theta$ can enter this expectation (Schulman et al., [2015](https://arxiv.org/html/2605.21654#bib.bib6)). Score-function (SF) estimator. If $x \sim p(\,\cdot\,; \theta)$, then by the log-derivative trick
$$\frac{\partial}{\partial\theta}\,\mathbb{E}_{x \sim p(\cdot;\theta)}\!\bigl[c(x)\bigr] \;=\; \mathbb{E}_{x}\!\left[c(x)\,\frac{\partial}{\partial\theta}\log p(x;\theta)\right]. \tag{2}$$

This identity is valid whenever $p(x;\theta)$ is differentiable in $\theta$. Crucially, $c$ need not be differentiable, or even continuous, in $x$. That is exactly why SF is the natural estimator for classical RL: discrete actions, non-differentiable rewards, and unknown dynamics are all admissible. The cost is variance: the estimator uses only the *scalar* $c(x)$, not its slope, and therefore *ignores all local geometry of $c$*.
Pathwise-derivative (PD) estimator. If instead $x = x(z, \theta)$ is a differentiable function of $\theta$ and an exogenous noise variable $z \sim p(z)$ whose distribution does *not* depend on $\theta$ (a *reparameterization*), then differentiation and expectation commute directly:
$$\frac{\partial}{\partial\theta}\,\mathbb{E}_{z}\!\bigl[c(x(z,\theta))\bigr] \;=\; \mathbb{E}_{z}\!\left[\frac{\partial}{\partial\theta}c(x(z,\theta))\right]. \tag{3}$$

PD exploits $\frac{\partial c}{\partial x}$ directly and is typically lower variance than SF when both apply (Rezende et al., [2014](https://arxiv.org/html/2605.21654#bib.bib5)). The price is a stronger regularity requirement: $c \circ x(\cdot, \theta)$ must be (almost everywhere) differentiable, and the sampling must admit a reparameterization.
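The two estimators can be compared on a toy problem where both apply. The sketch below assumes $x \sim \mathcal{N}(\theta, 1)$ and $c(x) = x^2$, so the true gradient of $\mathbb{E}[c(x)] = \theta^2 + 1$ is $2\theta$. The function name, sample size, and seed are illustrative choices, not part of the paper.

```python
import random
from statistics import fmean, pvariance


def sf_and_pd_samples(theta, n=20000, seed=0):
    """Per-sample SF and PD gradient estimates for x ~ N(theta, 1), c(x) = x**2."""
    rng = random.Random(seed)
    sf, pd = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        x = theta + z                # reparameterised sample x = x(z, theta)
        c = x * x
        sf.append(c * (x - theta))   # c(x) * d/dtheta log N(x; theta, 1)
        pd.append(2.0 * x)           # dc/dx * dx/dtheta
    return sf, pd


theta = 1.5
sf, pd = sf_and_pd_samples(theta)
sf_mean, pd_mean = fmean(sf), fmean(pd)
sf_var, pd_var = pvariance(sf), pvariance(pd)
```

Both sample means should land near $2\theta = 3$, while the SF samples have a much larger variance than the PD samples, consistent with the remark that SF uses only the scalar cost and not its slope.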
### 2.3 Costates and Value-gradients
We now lift the single-variable gradient-estimator setup of §[2.2](https://arxiv.org/html/2605.21654#S2.SS2) to trajectories and ask what object the backward pass computes in RL settings.
###### Definition 1 (Differentiable rollout).
The functions $\pi_\theta$, $f_\theta$, $r$ are differentiable in their arguments. The noise law $p(\xi)$ does not depend on $\theta$, and

$$a_t = \pi_\theta(s_t, \xi_t), \qquad s_{t+1} = f_\theta\bigl(s_t, \pi_\theta(s_t, \xi_t)\bigr), \qquad \xi_t \sim p(\xi) \text{ i.i.d., independent of } \theta. \tag{4}$$
Let $D$ denote the total derivative with respect to the state, accounting for both the direct state dependence and the indirect dependence through the policy's action: $Df_\theta(s_t, a_t) = \frac{\partial f_\theta}{\partial s} + \frac{\partial f_\theta}{\partial a}\frac{\partial \pi_\theta}{\partial s}$ for the vector-valued dynamics, and similarly $Dr$ for the reward. Because all randomness is exogenous under Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1), the sampled return $R_1(\theta, \xi_{1:T})$ is a deterministic function of the parameters given the noise. This allows differentiation to commute with expectation, resulting in a pathwise identity analogous to equation [3](https://arxiv.org/html/2605.21654#S2.E3): $\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}\!\left[\frac{\partial R_1}{\partial\theta}\right]$. The gradient $\frac{\partial R_1}{\partial\theta}$ is computed by differentiating the unrolled computation graph of equation [4](https://arxiv.org/html/2605.21654#S2.E4), a process known as *backpropagation through time* (BPTT) (Fairbank and Alonso, [2012](https://arxiv.org/html/2605.21654#bib.bib4)). Crucially, BPTT does not propagate the parameter gradient itself. It propagates the state-sensitivity adjoint, which we call the *costate*:
$$\lambda_t \;:=\; \frac{\partial R_t}{\partial s_t}, \qquad \lambda_{T+1} := 0. \tag{5}$$

Intuitively, the effect of the current state on future return has two pieces: the immediate effect on the current reward, plus the future effect pushed backward through the transition Jacobian. Formally:
###### Proposition 1 (Adjoint recursion; Fairbank and Alonso, [2012](https://arxiv.org/html/2605.21654#bib.bib4)).
Under Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1), the costates satisfy

$$\boxed{\;\lambda_t \;=\; Dr(s_t, a_t) \;+\; \gamma\,\bigl(Df_\theta(s_t, a_t)\bigr)^{\!\top}\lambda_{t+1}\;} \tag{6}$$

for $t = T, \ldots, 1$, and the exact pathwise parameter gradient is

$$\frac{\partial J(\theta)}{\partial\theta} = \mathbb{E}\Bigg[\sum_{t=1}^{T}\gamma^{t-1}\Bigg(\gamma\left(\frac{\partial f_\theta(s_t, a_t)}{\partial\theta}\right)^{\top}\lambda_{t+1} + \left(\frac{\partial \pi_\theta(s_t, \xi_t)}{\partial\theta}\right)^{\top}\left(\frac{\partial r(s_t, a_t)}{\partial a_t} + \gamma\left(\frac{\partial f_\theta(s_t, a_t)}{\partial a_t}\right)^{\top}\lambda_{t+1}\right)\Bigg)\Bigg]. \tag{7}$$
The proof (see Appendix [A](https://arxiv.org/html/2605.21654#A1)) is a direct chain-rule expansion of $R_t = r(s_t, a_t) + \gamma R_{t+1}$.
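The adjoint recursion can be checked numerically on a toy system. The sketch below assumes scalar linear dynamics $s_{t+1} = A s_t + B a_t$, a linear policy $a_t = K s_t + \xi_t$ with fixed noise, and a quadratic reward $r(s, a) = -s^2 - C a^2$. All constants and function names are hypothetical, chosen only so that the recursion of Eq. (6) can be compared with a finite-difference derivative of the return.

```python
GAMMA, A, B, K, C = 0.9, 0.8, 0.5, -0.3, 0.1  # illustrative constants


def rollout(s1, noise):
    """Return states s_1..s_T and actions a_1..a_T for a fixed noise sequence."""
    states, actions = [s1], []
    for xi in noise:
        a = K * states[-1] + xi              # a_t = pi(s_t, xi_t)
        actions.append(a)
        states.append(A * states[-1] + B * a)  # s_{t+1} = f(s_t, a_t)
    return states[:-1], actions


def reward(s, a):
    return -(s * s) - C * a * a


def discounted_return(s1, noise):
    states, actions = rollout(s1, noise)
    return sum(GAMMA ** t * reward(s, a)
               for t, (s, a) in enumerate(zip(states, actions)))


def costates(s1, noise):
    """Backward recursion lambda_t = Dr + gamma * Df * lambda_{t+1}, lambda_{T+1} = 0."""
    states, actions = rollout(s1, noise)
    lam_next, lams = 0.0, []
    for s, a in zip(reversed(states), reversed(actions)):
        dr = -2.0 * s - 2.0 * C * a * K  # direct state effect plus effect via the action
        df = A + B * K                   # total derivative of the transition
        lam_next = dr + GAMMA * df * lam_next
        lams.append(lam_next)
    return lams[::-1]


noise = [0.1, -0.2, 0.05, 0.3]
lam1 = costates(1.0, noise)[0]
h = 1e-6
fd = (discounted_return(1.0 + h, noise) - discounted_return(1.0 - h, noise)) / (2 * h)
```

Because the return is quadratic in the initial state, the central difference `fd` agrees with the first costate `lam1` up to floating-point rounding.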
Costates are value-gradient estimators. Conditioning on $s_t = s$, the value function satisfies $V_t^{\pi}(s) = \mathbb{E}\!\left[r(s_t, a_t) + \gamma V_{t+1}^{\pi}(s_{t+1}) \,\middle|\, s_t = s\right]$. Comparing this value recursion with the sampled costate recursion in equation [6](https://arxiv.org/html/2605.21654#S2.E6), we see that the two have the same form. More precisely, differentiating the value recursion with respect to $s$ and interchanging differentiation and expectation (valid under Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1)) gives

$$\frac{\partial V_t^{\pi}(s)}{\partial s} = \mathbb{E}\!\left[Dr(s_t, a_t) + \gamma\bigl(Df_\theta(s_t, a_t)\bigr)^{\!\top} G_{t+1}^{\pi}(s_{t+1}) \,\middle|\, s_t = s\right]. \tag{8}$$

The costate recursion of equation [6](https://arxiv.org/html/2605.21654#S2.E6) is the per-sample version of this identity, and the value gradient is the corresponding *conditional expectation* over the future noise. Therefore,

$$\mathbb{E}[\lambda_t \mid s_t = s] = \frac{\partial V_t^{\pi}(s)}{\partial s}. \tag{9}$$

Thus the quantity propagated backward by BPTT is, at each step, a Monte Carlo sample of the value gradient. In this sense *critic-free* does not mean *value-free*: the relevant value information is present as a propagated gradient signal rather than as a separately fitted scalar critic.
Takeaway: The key object for this paper is the costate. In a differentiable rollout, the backward signal propagated by BPTT is a Monte Carlo estimator of the value gradient.
## 3 Continuous Lens: Why Critic-Free RL Is Value-Gradient-Like
We now show that the GRPO/PPO actor update equals, in expectation, the BPTT pathwise gradient of a continuous-relaxed rollout. The standard GRPO/PPO derivation is written for discrete token actions, $s_t := (q, o_{<t})$, $a_t := o_t$, $s_{t+1} = (s_t, a_t)$, where the sampling step blocks differentiation through the trajectory. However, the integration-by-parts equivalence of Fairbank and Alonso ([2012](https://arxiv.org/html/2605.21654#bib.bib4)) supplies the missing conceptual link: under an additive-noise parameterization, the expected score-function update can be rewritten as an action-derivative (PD-like) update. To make the backward signal explicit, we introduce a continuous relaxation of the LLM rollout. At the level of hidden states, logits, or other differentiable surrogate states, we apply Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1) with differentiable $f_\theta$, $\pi_\theta$, and $r$. This turns the rollout into a reparameterized computation graph, so the pathwise analysis of §[2.3](https://arxiv.org/html/2605.21654#S2.SS3) applies. The only ingredient we need is the score-function/pathwise bridge.
###### Lemma 1 (SF = PD under a shift policy; Fairbank and Alonso, [2012](https://arxiv.org/html/2605.21654#bib.bib4)).
Fix a state $s$ and let $F(\theta) = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}[r(s, a)]$, with $r$ differentiable in $a$. If

$$\pi_\theta(a \mid s) = \nu\bigl(a - \bar{a}_\theta(s)\bigr),$$

for a differentiable density $\nu$ with vanishing boundary terms, then the score-function and pathwise forms agree in expectation. In particular,

$$\underbrace{\mathbb{E}\!\left[r(s, a)\,\frac{\partial}{\partial\theta}\log\pi_\theta(a \mid s)\right]}_{\text{SF form}} \;=\; \underbrace{\Bigl(\frac{\partial \bar{a}_\theta(s)}{\partial\theta}\Bigr)^{\!\top}\,\mathbb{E}\!\left[\frac{\partial r(s, a)}{\partial a}\right]}_{\text{PD form}}. \tag{10}$$
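The identity in Eq. (10) follows from a short integration-by-parts argument. The derivation below is a sketch that uses only the assumptions stated in the lemma (a shift density $\nu$, differentiable $r$, vanishing boundary terms); the notation $\nabla_a$ for the gradient with respect to the action is introduced here for brevity.

```latex
% Shift policy: \pi_\theta(a \mid s) = \nu(a - \bar{a}_\theta(s)).
\begin{align*}
\frac{\partial}{\partial\theta}\log\pi_\theta(a\mid s)
  &= -\Bigl(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\Bigr)^{\!\top}
     \frac{\nabla_a \nu\bigl(a-\bar{a}_\theta(s)\bigr)}{\nu\bigl(a-\bar{a}_\theta(s)\bigr)}, \\
\mathbb{E}\Bigl[r(s,a)\,\frac{\partial}{\partial\theta}\log\pi_\theta(a\mid s)\Bigr]
  &= -\Bigl(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\Bigr)^{\!\top}
     \int r(s,a)\,\nabla_a \nu\bigl(a-\bar{a}_\theta(s)\bigr)\,da \\
  &= \Bigl(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\Bigr)^{\!\top}
     \int \nu\bigl(a-\bar{a}_\theta(s)\bigr)\,\frac{\partial r(s,a)}{\partial a}\,da
     && \text{(integration by parts)} \\
  &= \Bigl(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\Bigr)^{\!\top}
     \mathbb{E}\Bigl[\frac{\partial r(s,a)}{\partial a}\Bigr].
\end{align*}
```

The sign flip in the third line is where the vanishing-boundary assumption on $\nu$ is used: the boundary term of the integration by parts drops out, leaving only the integral of $\nu$ against the action-gradient of $r$.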
Lemma [1](https://arxiv.org/html/2605.21654#Thmlemma1) is the bridge: it guarantees that the score-function update used in practice matches a pathwise update in expectation. Applied token-by-token, under a shift/additive-noise parameterisation $\pi_\theta(a_t \mid s_t) = \nu\bigl(a_t - \bar{a}_\theta(s_t)\bigr)$, and with the GRPO weight $\widehat{A}_{i,t}$ treated as a stop-gradient scalar, each local score-function term satisfies

$$\mathbb{E}\!\left[\widehat{A}_{i,t}\,r(s_t, a_t)\,\frac{\partial}{\partial\theta}\log\pi_\theta(a_t \mid s_t)\right] \;=\; \Bigl(\frac{\partial \bar{a}_\theta(s_t)}{\partial\theta}\Bigr)^{\!\top}\mathbb{E}\!\left[\frac{\partial\,\bigl(\widehat{A}_{i,t}\,r(s_t, a_t)\bigr)}{\partial a_t}\right]. \tag{11}$$

The right-hand side is PD-shaped: it depends on the slope of the action-to-reward signal, not on the score function. No SF information is lost and the expectation is unchanged. Then, under the continuous rollout of Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1), the backward quantity propagated by GRPO is the costate, and, by §[2.3](https://arxiv.org/html/2605.21654#S2.SS3), it satisfies

$$\lambda_t := \frac{\partial R_t}{\partial s_t}, \quad \mathbb{E}[\lambda_t \mid s_t = s] = \frac{\partial V_t^{\pi}(s)}{\partial s}. \tag{12}$$

Thus, in the continuous surrogate, the GRPO/PPO update is value-gradient-like in expectation: even without a fitted critic, the backward pass carries a Monte Carlo estimator of the state-gradient of future return. The full derivation is deferred to Appendix [A](https://arxiv.org/html/2605.21654#A1).
Takeaway: Under a differentiable continuous relaxation, PPO/GRPO is value-gradient-like in expectation, so critic-free actor updates can still carry a principled credit-assignment signal without a learned critic.
## 4 Discrete GRPO Sends Approximate BPTT Signals
Section [3](https://arxiv.org/html/2605.21654#S3) established, under a continuous relaxation and a shift/additive-noise policy, that the local GRPO gradient is a value-gradient-like update in expectation (Corollary [1](https://arxiv.org/html/2605.21654#Thmcorollary1)). In practice, LLM rollouts use a *discrete* categorical policy. The key observation is that when one computes $\frac{\partial L_{\mathrm{GRPO}}}{\partial\theta}$ by standard autodifferentiation, one is already backpropagating through the transformer's internal computation graph across positions. The attention mechanism creates differentiable pathways between positions; the only place differentiability breaks is at the token-sampling boundary. The question thus becomes: *how much of the BPTT costate structure of §[2.3](https://arxiv.org/html/2605.21654#S2.SS3) survives despite the discrete sampling gaps?*
For a single trajectory, the log-probability at position $t$ depends on the final-layer hidden state, and the GRPO loss (locally, ignoring clipping) is

$$L(\theta) = \sum_{t=1}^{T}\widehat{A}_t \cdot \ell_t(\theta), \qquad \ell_t := \log\pi_\theta(o_t \mid s_t) := \log\mathrm{softmax}\!\bigl(W_{\mathrm{head}} \cdot h_t^{(L)}(\theta)\bigr)_{o_t}, \tag{13}$$

and in a transformer, $h_t^{(L)}$ depends on hidden representations at *all previous positions* through the attention mechanism, $h_t^{(l)} = \mathrm{TransformerBlock}^{(l)}\!\bigl(h_{1:t}^{(l-1)}\bigr)$, $l = 1, \ldots, L$. Consequently, there exist differentiable Jacobians $J_{t \leftarrow t'} := \frac{\partial h_t^{(L)}}{\partial h_{t'}^{(L)}}$, $t' \leq t$, computed through the multi-layer attention stack. These are *exactly the transition Jacobians* that play the role of $Df_\theta$ in §[2.3](https://arxiv.org/html/2605.21654#S2.SS3).
###### Definition 2 (Empirical costate).
The empirical costate at position $t$ is the gradient of the advantage-weighted loss with respect to the final-layer hidden state:

$$\hat{\lambda}_t := \frac{\partial}{\partial h_t^{(L)}}\sum_{k \geq t}\widehat{A}_k \cdot \ell_k. \tag{14}$$
By the chain rule through attention, $\hat{\lambda}_t = \widehat{A}_t \cdot \frac{\partial \ell_t}{\partial h_t^{(L)}} + \sum_{k > t}\widehat{A}_k \cdot \frac{\partial \ell_k}{\partial h_k^{(L)}} \cdot J_{k \leftarrow t}$, or equivalently, in recursive form,

$$\boxed{\;\hat{\lambda}_t \;=\; \widehat{A}_t \cdot \frac{\partial \ell_t}{\partial h_t^{(L)}} \;+\; J_{t+1 \leftarrow t}^{\!\top} \cdot \hat{\lambda}_{t+1}\;}, \qquad \hat{\lambda}_{T+1} := 0. \tag{15}$$
This is *structurally identical* to the BPTT costate recursion of Proposition [1](https://arxiv.org/html/2605.21654#Thmproposition1), $\lambda_t = Dr(s_t, a_t) + \gamma\,\bigl(Df_\theta(s_t, a_t)\bigr)^{\!\top}\lambda_{t+1}$. The correspondence is

$$Dr \;\longleftrightarrow\; \widehat{A}_t \cdot \frac{\partial \ell_t}{\partial h_t^{(L)}} \quad \text{(immediate credit signal)}, \tag{16}$$

$$(Df_\theta)^{\!\top} \;\longleftrightarrow\; J_{t+1 \leftarrow t}^{\!\top} \quad \text{(transition Jacobian through attention)}, \tag{17}$$

$$\lambda_{t+1} \;\longleftrightarrow\; \hat{\lambda}_{t+1} \quad \text{(propagated future signal)}. \tag{18}$$
### 4.1 What is missing vs. exact BPTT
In the continuous relaxation of §[2.3](https://arxiv.org/html/2605.21654#S2.SS3), the transition $s_{t+1} = f_\theta(s_t, a_t)$ is fully differentiable, so $Df_\theta$ captures *everything*, including how a perturbation of $s_t$ would change the continuous action $a_t$, which would in turn change $s_{t+1}$. In discrete GRPO, there is a *non-differentiable gap* at each sampling step:

$$h_t^{(L)} \;\xrightarrow{\;\text{diff}\;}\; z_t \;\xrightarrow{\;\text{sample}\;}\; o_t \;\xrightarrow{\;\text{embed}\;}\; e_t \;\xrightarrow{\;\text{diff}\;}\; h_{t+1}^{(L)}. \tag{19}$$

The Jacobian $J_{t+1 \leftarrow t}^{\mathrm{attn}}$ that autodiff computes goes through the attention pathway: how $h_t^{(L)}$ influences $h_{t+1}^{(L)}$ through attention at position $t+1$, treating the input token $o_t$ as fixed. It does *not* capture the path $h_t^{(L)} \to z_t \to o_t \to e_t \to h_{t+1}^{(L)}$ through the sampled token.
The exact BPTT Jacobian therefore decomposes as

$$Df_\theta^{\mathrm{exact}} \;=\; \underbrace{J_{t+1 \leftarrow t}^{\mathrm{attn}}}_{\text{what GRPO computes}} \;+\; \underbrace{\frac{\partial h_{t+1}^{(L)}}{\partial e_t} \cdot \frac{\partial e_t}{\partial o_t} \cdot \frac{\partial o_t}{\partial z_t} \cdot \frac{\partial z_t}{\partial h_t^{(L)}}}_{\text{missing: through the sampling step}}. \tag{20}$$

The second term is non-differentiable in the discrete case. This is the only gap.
###### Proposition 2 (Sampling-gap bound).
The missing Jacobian term is small when the policy is near-deterministic. Specifically, if the policy entropy at position $t$ is $H_t$, then the effective Jacobian through the sampling step (in a straight-through or Gumbel-Softmax sense) has operator norm bounded by

$$\bigl\|Df_\theta^{\mathrm{exact}} - J_{t+1 \leftarrow t}^{\mathrm{attn}}\bigr\| \;\leq\; C \cdot \sqrt{\frac{H_t}{\log|V|}}, \tag{21}$$

where $|V|$ is the vocabulary size and $C$ depends on embedding norms.
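To get a feel for how the right-hand side of Eq. (21) scales with entropy, the sketch below evaluates $C\sqrt{H_t/\log|V|}$ for a few hand-picked next-token distributions. It assumes natural-log entropy and sets the constant $C$ to 1, since the paper only says that $C$ depends on embedding norms; the function names and the toy vocabulary size are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, skipping zero-probability tokens."""
    return max(0.0, -sum(p * math.log(p) for p in probs if p > 0.0))


def sampling_gap_bound(probs, c=1.0):
    """c * sqrt(H_t / log|V|); c is a placeholder for the constant C."""
    vocab_size = len(probs)
    return c * math.sqrt(entropy(probs) / math.log(vocab_size))


vocab_size = 8  # toy vocabulary
uniform = [1.0 / vocab_size] * vocab_size
peaked = [0.93] + [0.01] * (vocab_size - 1)
one_hot = [1.0] + [0.0] * (vocab_size - 1)

bounds = {
    "uniform": sampling_gap_bound(uniform),
    "peaked": sampling_gap_bound(peaked),
    "one_hot": sampling_gap_bound(one_hot),
}
```

With $C = 1$ the bound is normalised to lie in $[0, 1]$: it reaches its maximum for a uniform policy, shrinks as the distribution sharpens, and vanishes for a deterministic policy.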
Figure 2: Bound inequality plot. Measured value-gradient gap vs. the proposed bound (Section [4](https://arxiv.org/html/2605.21654#S4), Prop. [2](https://arxiv.org/html/2605.21654#Thmproposition2)).

Attention is doing the heavy lifting. In an RNN, $h_{t+1}$ depends on $h_t$ only through the recurrence $h_{t+1} = f(h_t, e_{o_t})$. The *only* pathway is through the discrete token, so the sampling gap blocks *all* temporal credit flow. In a transformer, $h_{t+1}^{(L)}$ depends on $h_t^{(L)}$ through *attention*, a direct, differentiable, content-based pathway that does not go through the sampling step at position $t$. The attention scores

$$\alpha_{t+1,\,t'} \;\propto\; \exp\!\bigl(q_{t+1} \cdot k_{t'}\bigr) \tag{22}$$

create soft, differentiable pointers to all previous hidden states. This gives transformers a *bypass around the discrete sampling bottleneck*: credit can flow from future positions to past hidden states through attention without differentiating through token sampling.
###### Proposition 3 (Attention-pathway rank and magnitude).
For a transformer with $L$ layers and causal attention with $n_{\mathrm{heads}}$ heads of dimension $d_{\mathrm{head}}$, the attention-pathway Jacobian $J_{t+1 \leftarrow t'}^{\mathrm{attn}}$ has rank at most $L \cdot d_{\mathrm{head}} \cdot n_{\mathrm{heads}}$ per layer of attention, and its magnitude is controlled by the attention weights $\alpha_{t,t'}$. When attention is broadly distributed over context, the Jacobian carries richer temporal credit information, partially compensating for the missing sampling-path Jacobian.
### 4.2 Approximation theorem
###### Theorem 1 (Discrete GRPO costates approximate BPTT costates).
Let $\hat{\lambda}_t$ be the empirical costate of equation [14](https://arxiv.org/html/2605.21654#S4.E14) computed by autodifferentiation through the GRPO loss, and let $\lambda_t$ be the exact BPTT costate from Proposition [1](https://arxiv.org/html/2605.21654#Thmproposition1) applied to the relaxed rollout of §[3](https://arxiv.org/html/2605.21654#S3). Then

$$\bigl\|\mathbb{E}[\hat{\lambda}_t \mid h_t] - \mathbb{E}[\lambda_t \mid h_t]\bigr\| \;\leq\; \sum_{k=t+1}^{T}\gamma^{\,k-t} \cdot \underbrace{\bigl\|J_{k \leftarrow k-1}^{\mathrm{attn}}\bigr\|^{k-t-1}}_{\text{Jacobian chain growth}} \cdot \underbrace{\epsilon_k}_{\text{per-step sampling gap}}, \tag{23}$$

where $\epsilon_k$ is the per-step approximation error from the missing sampling Jacobian, bounded by equation [21](https://arxiv.org/html/2605.21654#S4.E21).
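The right-hand side of Eq. (23) is easy to evaluate once per-step quantities are available. The sketch below is a hedged illustration that takes hypothetical lists of attention-Jacobian norms and per-step sampling gaps as inputs; it does not compute either quantity from a real model, and the function name and example values are assumptions.

```python
def costate_error_bound(jac_norms, eps, gamma=1.0):
    """Sum over offsets d = k - t of gamma**d * ||J_k||**(d - 1) * eps_k.

    jac_norms[d - 1] stands for ||J^attn_{k <- k-1}|| and eps[d - 1] for eps_k,
    with k = t + d, matching the indexing of the bound.
    """
    total = 0.0
    for d, (j, e) in enumerate(zip(jac_norms, eps), start=1):
        total += gamma ** d * j ** (d - 1) * e
    return total


horizon = 10
eps = [0.1] * horizon  # hypothetical per-step sampling gaps

contractive = costate_error_bound([0.5] * horizon, eps)  # Jacobian norms below 1
expansive = costate_error_bound([1.5] * horizon, eps)    # Jacobian norms above 1
no_gap = costate_error_bound([1.5] * horizon, [0.0] * horizon)
```

With norms below one the geometric factor keeps the accumulated error bounded as the horizon grows, whereas norms above one let it grow quickly; with zero sampling gap the bound is zero regardless of the Jacobians.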
Combining Theorem [1](https://arxiv.org/html/2605.21654#Thmtheorem1) with identity [9](https://arxiv.org/html/2605.21654#S2.E9),

$$\mathbb{E}[\hat{\lambda}_t \mid h_t] \;\approx\; \mathbb{E}[\lambda_t \mid h_t] \;=\; G_t^{\pi}(h_t),$$

so the empirical costate is, up to the Theorem [1](https://arxiv.org/html/2605.21654#Thmtheorem1) error, a Monte Carlo estimator of the value gradient in the transformer's hidden-state space. The error accumulates through the Jacobian chain but is controlled by the *policy entropy at each step* and by the spectral properties of the attention Jacobians. Figure [2](https://arxiv.org/html/2605.21654#S4.F2) evaluates the entropy bound of Proposition [2](https://arxiv.org/html/2605.21654#Thmproposition2); see §[6](https://arxiv.org/html/2605.21654#S6) for details.
Takeaway: In discrete transformers, autodiff already propagates an empirical costate through attention, so critic-free RL remains close to the BPTT value-gradient picture despite discrete token sampling.
## 5 RL Impact Law Hypotheses
The previous sections explain that the actor backward pass propagates an approximate hidden-state value-gradient signal, and that the gap to the ideal value-gradient regime is controlled by the discrete sampling step together with attention-based credit transport. We now turn this into a predictive theory of *RL readiness*. The key idea is simple: RL should be effective when the actor update contains a usable value-gradient signal. This gives a direct prediction for how much gain RL should produce from a given pretrained checkpoint. For any critic-free normalized policy-gradient method $m$, let $\hat{\lambda}_t^{(m)}$ denote the empirical hidden-state costate produced by the actor backward pass, and let $G_t$ denote the hidden-state value gradient. The quantity
$$
\mathbb{E}_{q,\tau,t}\left[\left\langle\mathbb{E}[\hat{\lambda}_t^{(m)}\mid h_t],\,G_t\right\rangle\right]\tag{24}
$$
measures how much useful RL signal the checkpoint actually sends. Using $\langle u,v\rangle=\|v\|_2^2+\langle u-v,v\rangle$ and bounding the second term by Cauchy-Schwarz, $\langle u-v,v\rangle\geq-\|v\|_2\,\|u-v\|_2$, we obtain the lower bound
$$
\mathbb{E}_{q,\tau,t}\left[\left\langle\mathbb{E}[\hat{\lambda}_t^{(m)}\mid h_t],\,G_t\right\rangle\right]\geq\Sigma(\theta)-\Lambda_m(\theta),\tag{25}
$$
where
$$
\Sigma(\theta):=\mathbb{E}_{q,\tau,t}\!\left[\|G_t\|_2^2\right],\qquad\Lambda_m(\theta):=\mathbb{E}_{q,\tau,t}\left[\|G_t\|_2\,\left\|\mathbb{E}[\hat{\lambda}_t^{(m)}\mid h_t]-G_t\right\|_2\right].\tag{26}
$$
We therefore define the *usable value-gradient signal*
$$
S_m(\theta):=\Sigma(\theta)-\Lambda_m(\theta).\tag{27}
$$
Intuitively, $\Sigma(\theta)$ measures how much value-gradient signal is available, while $\Lambda_m(\theta)$ measures how much of that signal is lost because the actual actor update is still separated from the value-gradient regime. By the discrete approximation result of §[4](https://arxiv.org/html/2605.21654#S4), the gap term inside $\Lambda_m(\theta)$ becomes small when the entropy-controlled sampling gap is small and attention transports credit effectively.

Signal quality alone is not enough. Even a clean RL signal will produce little gain if the current policy has no useful reward headroom left to exploit. We therefore define the *reachable headroom*
$$
\mathcal{H}_\alpha(\theta):=\mathbb{E}_{q\sim P(Q)}\left[\frac{1}{\alpha}\log\mathbb{E}_{\tau\sim\pi_\theta(\cdot\mid q)}\left[e^{\alpha R(\tau)}\right]-\mathbb{E}_{\tau\sim\pi_\theta(\cdot\mid q)}\left[R(\tau)\right]\right],\tag{28}
$$
where $R(\tau)$ is the trajectory return. This quantity measures how much reward can still be gained by reweighting the model's existing trajectory distribution. It is small both when the model is too weak to sample any good trajectories and when it is already close to saturation; it is large only when better trajectories are already present in support and can still be amplified by RL. These two terms combine into the central predictive statement of the framework:
$$
\boxed{\text{RL Impact}\propto\underbrace{S_m(\theta)}_{\text{value-gradient signal}}\times\underbrace{\mathcal{H}_\alpha(\theta)}_{\text{reachable headroom}}}.\tag{29}
$$
For a fixed task, reward, and RL budget $B$, we write this in the simplest one-constant form as
$$
\Delta\mathrm{Perf}_{m,B}(\theta)\approx\kappa_{m,B}\,S_m(\theta)\,\mathcal{H}_\alpha(\theta),\qquad\kappa_{m,B}>0.\tag{30}
$$
Eq. [29](https://arxiv.org/html/2605.21654#S5.E29) is the main prediction of the theory. RL should work best when the checkpoint is already close enough to the value-gradient regime to provide a usable actor signal, but still has enough reward headroom for RL to exploit. Practically, if $\{\theta_N\}$ is a sequence of pretrained *checkpoints*, then the predicted best point to start RL is
$$
N^\star=\arg\max_N\,S_m(\theta_N)\,\mathcal{H}_\alpha(\theta_N).\tag{31}
$$
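The headroom and the checkpoint-selection rule can be illustrated with a minimal sketch. It assumes a single prompt with a finite list of sampled returns and `alpha > 0`; the function names `headroom` and `best_checkpoint` are hypothetical and not taken from the paper, and averaging over prompts is left to the caller.

```python
import math

def headroom(returns, alpha):
    """Per-prompt headroom: (1/alpha) * log mean(exp(alpha * R)) - mean(R).

    Uses a log-sum-exp shift for numerical stability. Assumes alpha > 0.
    """
    n = len(returns)
    peak = max(alpha * r for r in returns)
    log_mean_exp = peak + math.log(sum(math.exp(alpha * r - peak) for r in returns) / n)
    return log_mean_exp / alpha - sum(returns) / n

def best_checkpoint(signal, headroom_values):
    """Index maximizing S_m * H_alpha over checkpoints, plus all scores."""
    scores = [s * h for s, h in zip(signal, headroom_values)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With identical returns (all failures or all successes) the headroom is zero, while a mix of good and bad trajectories gives a positive value, which matches the qualitative description of the headroom term.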
Takeaway. RL gain should scale with two factors at once: how much usable value-gradient signal the model transmits, and how much reward-improving headroom remains in its trajectory distribution.
## 6 Experiments


Figure 3: Results of RL on various OLMo-2 pretraining checkpoints. The left panel shows the consistency of the gains achieved by RL across different numbers of GRPO steps. The right panel shows the behavior of the different components of our study.

We evaluate the proposed theory in two complementary settings. First, we test the entropy-controlled approximation bound from §4 on OLMo-2 (Team OLMo et al., [2024](https://arxiv.org/html/2605.21654#bib.bib16)) checkpoints. Second, we test whether the costate-based impact score predicts which checkpoints benefit most from RL.
Figure 4: Real RL gain vs. predicted gain using the value impact formula (Section [5](https://arxiv.org/html/2605.21654#S5), Eq. [29](https://arxiv.org/html/2605.21654#S5.E29)).

Closed-form RL task. For the RL-impact experiment, we use OLMo-2 1B checkpoints from pretraining steps 50k to 1M in increments of 50k. The task is a controlled label-copying problem. Given a prompt containing a target label in $\{A,B,C,D\}$, the model must put probability mass on the matching answer token. The reward is $R(\theta;q)=p_\theta(y^\star\mid q)$, where $y^\star$ is the correct label. This reward is differentiable in the model logits and avoids decoded-text, classifier, or parser discontinuities. For each checkpoint $\theta_N$, we measure the pre-RL reward $R_{\rm before}(\theta_N)=\mathbb{E}_q\,p_{\theta_N}(y^\star\mid q)$, then run GRPO from the checkpoint and measure the final post-RL reward $R_{\rm after}(\theta_N,K)$, where $K$ is the number of RL updates. We use several RL budgets, $K\in\{10,20,25,30\}$, and report both individual-budget results and the averaged response $\overline{\Delta R}(\theta_N)=\mathbb{E}_K\left[R_{\rm after}(\theta_N,K)-R_{\rm before}(\theta_N)\right]$.
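To make the differentiable reward concrete, here is a small sketch. It assumes, for illustration only, that the softmax is taken over just the four label logits (the paper's reward may be computed over the full vocabulary); the helper names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reward(logits, target):
    """R = p(y* | q): probability assigned to the correct label."""
    return softmax(logits)[target]

def reward_grad(logits, target):
    """Closed-form dR/dlogit_j = p_target * (1[j == target] - p_j)."""
    probs = softmax(logits)
    return [probs[target] * ((1.0 if j == target else 0.0) - probs[j])
            for j in range(len(probs))]

def mean_gain(reward_before, rewards_after_by_budget):
    """Average of R_after(K) - R_before over the RL budgets K."""
    gains = [after - reward_before for after in rewards_after_by_budget]
    return sum(gains) / len(gains)
```

Because the reward is a smooth function of the logits, its gradient is available in closed form, which is what allows the exact costate measurements used later in this section.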
Entropy-bound verification. Figure [2](https://arxiv.org/html/2605.21654#S4.F2) evaluates the entropy bound from §[4](https://arxiv.org/html/2605.21654#S4). In this experiment, we compute the Proposition [2](https://arxiv.org/html/2605.21654#Thmproposition2) formula exactly, comparing the real right-hand side $\epsilon$ with the entropy-based left side. For the OLMo-2 model, we estimate the costate approximation error and compare it to the entropy-controlled upper bound predicted by the theory, using the reward function described above. The purpose of this experiment is to verify our claim that the discrete-sampling gap is controlled by entropy.
Costate-based predictors. For each checkpoint, we compute the true pathwise costate $G_t=\partial R/\partial h_t$ and the detached-reward RL costate $\widehat{\lambda}_t=\partial[\widehat{A}_t\log\pi_\theta(a_t\mid s_t)]/\partial h_t$, using the same prompts, hidden states, sampled actions, and reward evaluations. This matched-trajectory computation avoids mixing the relaxed and discrete distributions. We then estimate $\Sigma=\mathbb{E}\|G_t\|^2$, $\Lambda_m=\mathbb{E}[\|G_t\|\,\|\widehat{\lambda}_t-G_t\|]$, and the headroom term $H_\alpha=\frac{1}{\alpha}\log\mathbb{E}[\exp(\alpha R)]-\mathbb{E}[R]$. The impact score is $P_{\rm gap}=(\Sigma-\Lambda_m)H_\alpha$, which preserves the ordering information.
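The estimators above reduce to a few sample averages. The sketch below assumes the value gradients and costates have already been extracted as matched lists of plain Python vectors and that the headroom has been computed separately; `impact_score` is a hypothetical name, not the authors' code.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))

def impact_score(value_grads, costates, headroom):
    """Return (Sigma, Lambda_m, P_gap) from matched samples.

    value_grads[i] is G_t and costates[i] is the empirical costate for the
    same prompt, position, and sampled action.
    """
    n = len(value_grads)
    sigma = sum(l2_norm(g) ** 2 for g in value_grads) / n
    lam = sum(
        l2_norm(g) * l2_norm([c - x for c, x in zip(costate, g)])
        for g, costate in zip(value_grads, costates)
    ) / n
    return sigma, lam, (sigma - lam) * headroom
```

When the costate equals the value gradient, the gap term vanishes and the score is simply the available signal times the headroom; a costate that carries no information about $G_t$ drives the score toward zero or below.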
RL-response prediction. Figures [1](https://arxiv.org/html/2605.21654#S1.F1), [3](https://arxiv.org/html/2605.21654#S6.F3), and [4](https://arxiv.org/html/2605.21654#S6.F4) show the costate components and the impact predictor. The component plot shows that the raw quantities vary systematically across checkpoints, while the predictor tracks the broad shape of the measured RL gain curve. This supports the mechanism suggested by the theory: checkpoints differ not only in their current reward, but also in the quality of their costate signal and their remaining reward headroom.
Figure [5](https://arxiv.org/html/2605.21654#S6.F5) evaluates the final predictive story. The left panel compares the realized mean post-RL reward against $\widehat{R}_{\rm after}=R_{\rm before}+{\rm Affine}(P_{\rm gap})$, where the affine map calibrates the scale of $P_{\rm gap}$ to the observed reward-gain scale. The right panel uses a scale-free version, $z(R_{\rm before})+z(P_{\rm gap})$, and compares it to $z(\overline{R}_{\rm after})$, where $z(\cdot)$ denotes normalization across checkpoints. This asks whether pretrained competence plus predicted RL readiness explains the final post-RL checkpoint quality. Empirically, the combined predictor correlates more strongly with averaged post-RL reward than either current reward or the impact score alone. Across the RL budgets, the score satisfies ${\rm Spearman}(P_{\rm gap},\overline{\Delta R})\approx 0.60$, while the combined scale-free predictor satisfies ${\rm Spearman}(z(R_{\rm before})+z(P_{\rm gap}),z(\overline{R}_{\rm after}))\approx 0.73$. These results suggest that the theory captures a real checkpoint-dependent RL-readiness signal, but also that current competence remains necessary for predicting the final post-RL reward level.
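For readers who want to reproduce this kind of comparison on their own checkpoint sweep, the following sketch computes the two ingredients used above. It assumes $z(\cdot)$ is the usual z-score (mean zero, unit population standard deviation) and uses average ranks for ties; neither detail is specified in the text, and the function names are hypothetical.

```python
import math

def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def zscore(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def combined_predictor(reward_before, impact):
    """Scale-free predictor z(R_before) + z(P_gap), one value per checkpoint."""
    return [a + b for a, b in zip(zscore(reward_before), zscore(impact))]
```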


Figure 5: Z-scores of the gain after RL vs. z-scores of the predicted RL impact. Correlation (left) and curve per checkpoint (right). The std bars in the left panel indicate the RL gain variance over the various training steps.

Limitations. The current setup is controlled and partially toy-like. The reward is deliberately chosen to be differentiable in the model logits, which allows exact costate measurements. Many parameters affect the observed correspondence, including reward scale, RL learning rate, batch size, group size, and the number of RL updates. Our method may indicate that a checkpoint is ready for efficient RL, but how RL is performed remains an additional optimization question. Indeed, too few RL steps under-realize the predicted headroom, while too many steps saturate the reward and obscure checkpoint differences. For this reason, we report averages across several RL budgets.
## 7 Related Work and Discussion
There is a large body of work on extending the GRPO algorithm. Such methods are essentially additional techniques built on top of GRPO, and our methodology extends to these modifications.

Regarding the interpretation of critic-free RL, it has been shown that GRPO implicitly computes a value-like quantity via U-statistics (Zhou et al., [2026](https://arxiv.org/html/2605.21654#bib.bib13)). However, this account is specific to GRPO, while PPO (Schulman et al., [2017](https://arxiv.org/html/2605.21654#bib.bib9)) and even REINFORCE also perform unexpectedly well in the LLM setting. Moreover, evidence from standard RL suggests that group-normalized episodic baselines are generally insufficient to replace a critic for bootstrapping and temporal credit assignment (de Oliveira, [2025](https://arxiv.org/html/2605.21654#bib.bib12)). Therefore, we argue that the gains of RL in LLMs do not come solely from the GRPO baseline design.
Zheng et al. ([2025](https://arxiv.org/html/2605.21654#bib.bib15)) propose using BPTT directly instead of RL and obtain better empirical results; under our interpretation, this should let the value-gradient signal propagate better. Another interesting observation is that scalar continuous rewards were used in the latest DeepSeek-V4 model (DeepSeek-AI, [2026](https://arxiv.org/html/2605.21654#bib.bib17)) to achieve better results. Recently, value gradient flow-based learning was proposed (Xu et al., [2026](https://arxiv.org/html/2605.21654#bib.bib18)), which achieved sustainable results in RL tasks, signaling that the value signal is useful. However, this method differs from ours: it is actor-free and shows how the value gradient can be used as a policy, whereas we focus on critic-free RL for LLMs.
## 8 Conclusion
This paper argues that critic-free RL in LLM post-training works because the actor backward pass is not value-free: it carries a value-gradient-like signal. In a differentiable rollout this signal is exact in expectation, and in discrete transformers it survives approximately because attention provides a differentiable channel for temporal credit transport while the missing sampling path is controlled by policy entropy. This perspective leads to a simple prediction: RL gains should be largest when a checkpoint has both a strong usable credit-assignment signal and remaining reward headroom. More broadly, the paper suggests that the success of critic-free RL in LLMs is not an exception to credit-assignment theory, but a consequence of hidden-state computation in continuous space.
## References
- e\. a\. de Oliveira \(2025\)Learning without critics? revisiting grpo in classical reinforcement learning environments\.\.Arxiv12\.Cited by:[§7](https://arxiv.org/html/2605.21654#S7.p1.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[§7](https://arxiv.org/html/2605.21654#S7.p2.1)\.
- M\. Fairbank and E\. Alonso \(2012\)Value\-gradient learning\.InThe 2012 international joint conference on neural networks \(ijcnn\),pp\. 1–8\.Cited by:[§1](https://arxiv.org/html/2605.21654#S1.p2.1),[§2\.3](https://arxiv.org/html/2605.21654#S2.SS3.p2.6),[§3](https://arxiv.org/html/2605.21654#S3.p1.6),[Lemma 1](https://arxiv.org/html/2605.21654#Thmlemma1),[Proposition 1](https://arxiv.org/html/2605.21654#Thmproposition1)\.
- D\. J\. Rezende, S\. Mohamed, and D\. Wierstra \(2014\)Stochastic backpropagation and approximate inference in deep generative models\.InInternational conference on machine learning,pp\. 1278–1286\.Cited by:[§2\.2](https://arxiv.org/html/2605.21654#S2.SS2.p2.6)\.
- J\. Schulman, N\. Heess, T\. Weber, and P\. Abbeel \(2015\)Gradient estimation using stochastic computation graphs\.Advances in neural information processing systems28\.Cited by:[§2\.2](https://arxiv.org/html/2605.21654#S2.SS2.p1.5)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[§2\.1](https://arxiv.org/html/2605.21654#S2.SS1.p1.17),[§7](https://arxiv.org/html/2605.21654#S7.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1](https://arxiv.org/html/2605.21654#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.21654#S2.SS1.p1.17)\.
- Team OLMo, P\. Walsh, L\. Soldaini, D\. Groeneveld, K\. Lo, S\. Arora, A\. Bhagia, Y\. Gu, S\. Huang, M\. Jordan, N\. Lambert, D\. Schwenk, O\. Tafjord, T\. Anderson, D\. Atkinson, F\. Brahman, C\. Clark, P\. Dasigi, N\. Dziri, M\. Guerquin, H\. Ivison, P\. W\. Koh, J\. Liu, S\. Malik, W\. Merrill, L\. J\. V\. Miranda, J\. Morrison, T\. Murray, C\. Nam, V\. Pyatkin, A\. Rangapur, M\. Schmitz, S\. Skjonsberg, D\. Wadden, C\. Wilhelm, M\. Wilson, L\. Zettlemoyer, A\. Farhadi, N\. A\. Smith, and H\. Hajishirzi \(2024\)2 OLMo 2 Furious\.External Links:2501\.00656,[Link](https://arxiv.org/abs/2501.00656)Cited by:[§6](https://arxiv.org/html/2605.21654#S6.p1.1)\.
- H\. Xu, K\. Hu, S\. Sojoudi, and A\. Zhang \(2026\)Reinforcement learning via value gradient flow\.arXiv preprint arXiv:2604\.14265\.Cited by:[§7](https://arxiv.org/html/2605.21654#S7.p2.1)\.
- Z\. Zheng, Y\. Gu, W\. Liu, Y\. W\. Teh, and W\. S\. Lee \(2025\)Soft\-grpo: surpassing discrete\-token llm reinforcement learning via gumbel\-reparameterized soft\-thinking policy optimization\.arXiv preprint arXiv:2511\.06411\.Cited by:[§7](https://arxiv.org/html/2605.21654#S7.p2.1)\.
- H\. Zhou, K\. Ye, E\. Xu, J\. Zhu, S\. Gong, and C\. Shi \(2026\)Demystifying group relative policy optimization: its policy gradient is a u\-statistic\.arXiv preprint arXiv:2603\.01162\.Cited by:[§7](https://arxiv.org/html/2605.21654#S7.p1.1)\.
## Appendix A Proofs
This appendix collects the proofs and supporting derivations for the statements used in the main text. We keep the notation of the main paper: the reparameterized action is written as $a_t=\pi_\theta(s_t,\xi_t)$, the differentiable transition as $s_{t+1}=f_\theta(s_t,a_t)$, and the costate as $\lambda_t=\partial R_t/\partial s_t$.
### A.1 Pathwise differentiation under a differentiable rollout
Under Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1), all stochasticity enters through the exogenous noise variables $\xi_{1:T}$, whose law is independent of $\theta$. Hence, for a fixed noise realization, the sampled return $R_1(\theta,\xi_{1:T})$ is a deterministic differentiable function of $\theta$. Therefore,
$$
J(\theta)=\mathbb{E}_{\xi_{1:T}}\!\left[R_1(\theta,\xi_{1:T})\right]=\int R_1(\theta,\xi_{1:T})\,p(\xi_{1:T})\,d\xi_{1:T}.
$$
Under the differentiability and domination assumptions implicit in Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1), differentiation may be interchanged with integration:
$$
\frac{\partial J(\theta)}{\partial\theta}=\int\frac{\partial R_1(\theta,\xi_{1:T})}{\partial\theta}\,p(\xi_{1:T})\,d\xi_{1:T}=\mathbb{E}\!\left[\frac{\partial R_1}{\partial\theta}\right].
$$
Thus, in the differentiable rollout model, the policy-gradient direction can be computed pathwise by differentiating the unrolled trajectory computation.
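A one-step toy problem illustrates the interchange. The sketch assumes a hypothetical return $R_1(\theta,\xi)=-(\theta+\xi-1)^2$ with standard Gaussian noise, chosen only because its expected gradient, $-2(\theta-1)$, is known in closed form; none of these names come from the paper.

```python
import random

def sampled_return(theta, xi):
    """Toy return for a fixed noise draw; deterministic and smooth in theta."""
    return -(theta + xi - 1.0) ** 2

def pathwise_derivative(theta, xi):
    """d/dtheta of sampled_return with the noise held fixed."""
    return -2.0 * (theta + xi - 1.0)

def draw_noise(n, seed=0):
    """Exogenous noise whose distribution does not depend on theta."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def pathwise_gradient(theta, noise):
    """Monte Carlo estimate of dJ/dtheta as the mean per-sample derivative."""
    return sum(pathwise_derivative(theta, xi) for xi in noise) / len(noise)

def finite_difference_gradient(theta, noise, eps=1e-4):
    """Central difference of the sample-average return on the same noise."""
    up = sum(sampled_return(theta + eps, xi) for xi in noise) / len(noise)
    down = sum(sampled_return(theta - eps, xi) for xi in noise) / len(noise)
    return (up - down) / (2.0 * eps)
```

On a shared set of noise draws the two estimates agree up to rounding, and both approach the closed-form gradient as the number of draws grows.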
### A.2 Proof of Proposition [1](https://arxiv.org/html/2605.21654#Thmproposition1): adjoint recursion
###### Proof\.
Recall that
$$
R_t=r(s_t,a_t)+\gamma R_{t+1},\qquad a_t=\pi_\theta(s_t,\xi_t),\qquad s_{t+1}=f_\theta(s_t,a_t),\qquad\lambda_t:=\frac{\partial R_t}{\partial s_t}.
$$
Throughout the proof, the noise variables $\xi_{t:T}$ are held fixed.

First differentiate $R_t$ with respect to $s_t$. The immediate reward contributes the total derivative
$$
Dr(s_t,a_t)=\frac{\partial r(s_t,a_t)}{\partial s}+\left(\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial s}\right)^{\!\top}\frac{\partial r(s_t,a_t)}{\partial a}.
$$
The future return contributes through $s_{t+1}=f_\theta(s_t,a_t)$:
$$
\gamma\left(\frac{\partial f_\theta(s_t,a_t)}{\partial s}+\frac{\partial f_\theta(s_t,a_t)}{\partial a}\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial s}\right)^{\!\top}\lambda_{t+1}=\gamma\bigl(Df_\theta(s_t,a_t)\bigr)^{\!\top}\lambda_{t+1}.
$$
Therefore,
$$
\lambda_t=Dr(s_t,a_t)+\gamma\bigl(Df_\theta(s_t,a_t)\bigr)^{\!\top}\lambda_{t+1},
$$
which is the adjoint recursion equation [6](https://arxiv.org/html/2605.21654#S2.E6). It remains to derive the parameter-gradient expression. Let
$$
\dot{s}_t:=\frac{ds_t}{d\theta}.
$$
The total state sensitivity satisfies
$$
\dot{s}_{t+1}=Df_\theta(s_t,a_t)\,\dot{s}_t+\frac{\partial f_\theta(s_t,a_t)}{\partial\theta}+\frac{\partial f_\theta(s_t,a_t)}{\partial a_t}\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}.
$$
The total derivative of the return is
$$
\frac{dR_1}{d\theta}=\sum_{t=1}^{T}\gamma^{t-1}\left[\dot{s}_t^{\top}Dr(s_t,a_t)+\left(\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}\right)^{\!\top}\frac{\partial r(s_t,a_t)}{\partial a_t}\right].
$$
Using the adjoint identity
$$
Dr(s_t,a_t)=\lambda_t-\gamma\bigl(Df_\theta(s_t,a_t)\bigr)^{\!\top}\lambda_{t+1},
$$
we obtain
$$
\begin{aligned}
\frac{dR_1}{d\theta}=\sum_{t=1}^{T}\gamma^{t-1}\Bigg[&\left(\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}\right)^{\!\top}\frac{\partial r(s_t,a_t)}{\partial a_t}+\dot{s}_t^{\top}\lambda_t\\
&-\gamma\,\dot{s}_t^{\top}\bigl(Df_\theta(s_t,a_t)\bigr)^{\!\top}\lambda_{t+1}\Bigg].
\end{aligned}
$$
From the state-sensitivity recursion,
$$
Df_\theta(s_t,a_t)\,\dot{s}_t=\dot{s}_{t+1}-\frac{\partial f_\theta(s_t,a_t)}{\partial\theta}-\frac{\partial f_\theta(s_t,a_t)}{\partial a_t}\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}.
$$
Substituting this into the previous display gives
$$
\begin{aligned}
\frac{dR_1}{d\theta}=\sum_{t=1}^{T}\gamma^{t-1}\Bigg[&\left(\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}\right)^{\!\top}\frac{\partial r(s_t,a_t)}{\partial a_t}\\
&+\gamma\left(\frac{\partial f_\theta(s_t,a_t)}{\partial\theta}+\frac{\partial f_\theta(s_t,a_t)}{\partial a_t}\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}\right)^{\!\top}\lambda_{t+1}\Bigg]\\
&+\sum_{t=1}^{T}\gamma^{t-1}\left[\dot{s}_t^{\top}\lambda_t-\gamma\,\dot{s}_{t+1}^{\top}\lambda_{t+1}\right].
\end{aligned}
$$
The second sum telescopes:
$$
\sum_{t=1}^{T}\gamma^{t-1}\left[\dot{s}_t^{\top}\lambda_t-\gamma\,\dot{s}_{t+1}^{\top}\lambda_{t+1}\right]=\dot{s}_1^{\top}\lambda_1-\gamma^{T}\dot{s}_{T+1}^{\top}\lambda_{T+1}.
$$
The initial state is the prompt state and is independent of $\theta$, so $\dot{s}_1=0$. Also $\lambda_{T+1}=0$ by definition. Hence the telescoping term vanishes. Collecting the remaining terms yields
$$
\begin{aligned}
\frac{dR_1}{d\theta}=\sum_{t=1}^{T}\gamma^{t-1}\Bigg[&\gamma\left(\frac{\partial f_\theta(s_t,a_t)}{\partial\theta}\right)^{\!\top}\lambda_{t+1}\\
&+\left(\frac{\partial\pi_\theta(s_t,\xi_t)}{\partial\theta}\right)^{\!\top}\left(\frac{\partial r(s_t,a_t)}{\partial a_t}+\gamma\left(\frac{\partial f_\theta(s_t,a_t)}{\partial a_t}\right)^{\!\top}\lambda_{t+1}\right)\Bigg].
\end{aligned}
$$
Taking expectation over the exogenous noise gives equation [7](https://arxiv.org/html/2605.21654#S2.E7). ∎
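The adjoint recursion and the resulting gradient formula can be checked numerically on a scalar system. The dynamics, policy, and reward below are hypothetical choices made only for this sketch: $a_t=\theta s_t+\xi_t$, $f_\theta(s,a)=0.9s+0.5a+0.1\theta$, and $r(s,a)=-s^2-0.1a^2$.

```python
def rollout(theta, noise, s1=1.0):
    """Unroll the toy system for a fixed noise sequence."""
    states, actions = [], []
    s = s1
    for xi in noise:
        a = theta * s + xi                    # reparameterized action
        states.append(s)
        actions.append(a)
        s = 0.9 * s + 0.5 * a + 0.1 * theta   # differentiable transition
    return states, actions

def total_return(theta, noise, gamma):
    """R_1 = sum_t gamma^(t-1) r(s_t, a_t), with t counted from 1."""
    states, actions = rollout(theta, noise)
    return sum(gamma ** t * (-s * s - 0.1 * a * a)
               for t, (s, a) in enumerate(zip(states, actions)))

def adjoint_gradient(theta, noise, gamma):
    """dR_1/dtheta from the backward costate recursion."""
    states, actions = rollout(theta, noise)
    lam_next = 0.0                            # lambda_{T+1} = 0
    grad = 0.0
    for t in reversed(range(len(noise))):
        s, a = states[t], actions[t]
        dr_ds, dr_da = -2.0 * s, -0.2 * a
        # Parameter-gradient term for step t (df/dtheta = 0.1, dpi/dtheta = s).
        grad += gamma ** t * (gamma * 0.1 * lam_next
                              + s * (dr_da + gamma * 0.5 * lam_next))
        # lambda_t = Dr + gamma * Df * lambda_{t+1}, with dpi/ds = theta.
        lam_next = (dr_ds + theta * dr_da) + gamma * (0.9 + 0.5 * theta) * lam_next
    return grad
```

Comparing `adjoint_gradient` against a central finite difference of `total_return` on the same noise sequence is a quick sanity check of the telescoping argument.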
### A.3 Proof of identity equation [9](https://arxiv.org/html/2605.21654#S2.E9): costates are value-gradient estimators
###### Proof\.
By definition,
$$
V_t^{\pi}(s)=\mathbb{E}\!\left[R_t\mid s_t=s\right].
$$
Conditioned on $s_t=s$, the remaining randomness comes only from the future exogenous noises $\xi_{t:T}$, whose distribution is independent of $s$ and $\theta$. Therefore, under the regularity conditions of Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1),
$$
\begin{aligned}
G_t^{\pi}(s)&=\frac{\partial V_t^{\pi}(s)}{\partial s}=\frac{\partial}{\partial s}\mathbb{E}\!\left[R_t\mid s_t=s\right]\\
&=\mathbb{E}\!\left[\frac{\partial R_t}{\partial s_t}\;\middle|\;s_t=s\right]=\mathbb{E}\!\left[\lambda_t\mid s_t=s\right].
\end{aligned}
$$
Thus the BPTT costate is a Monte Carlo sample whose conditional expectation is the value gradient. ∎
### A.4 Proof of Lemma [1](https://arxiv.org/html/2605.21654#Thmlemma1): SF = PD under a shift policy
###### Proof\.
Fix $s$. Let
$$
\pi_\theta(a\mid s)=\nu(a-\bar{a}_\theta(s)),\qquad u:=a-\bar{a}_\theta(s).
$$
Then $a=u+\bar{a}_\theta(s)$, and the density of $u$ is $\nu(u)$, independent of $\theta$. By the chain rule,
$$
\frac{\partial}{\partial\theta}\log\pi_\theta(a\mid s)=-\left(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\right)^{\!\top}\frac{\partial}{\partial u}\log\nu(u).
$$
Therefore,
$$
\begin{aligned}
\mathbb{E}\!\left[r(s,a)\frac{\partial}{\partial\theta}\log\pi_\theta(a\mid s)\right]&=-\left(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\right)^{\!\top}\int r(s,u+\bar{a}_\theta(s))\,\frac{\partial}{\partial u}\log\nu(u)\,\nu(u)\,du\\
&=-\left(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\right)^{\!\top}\int r(s,u+\bar{a}_\theta(s))\,\frac{\partial\nu(u)}{\partial u}\,du.
\end{aligned}
$$
Integrating by parts and using the assumed vanishing boundary terms,
$$
\int r(s,u+\bar{a}_\theta(s))\,\frac{\partial\nu(u)}{\partial u}\,du=-\int\nu(u)\,\frac{\partial r(s,u+\bar{a}_\theta(s))}{\partial u}\,du.
$$
Since
$$
\frac{\partial r(s,u+\bar{a}_\theta(s))}{\partial u}=\frac{\partial r(s,a)}{\partial a},
$$
we obtain
$$
\mathbb{E}\!\left[r(s,a)\frac{\partial}{\partial\theta}\log\pi_\theta(a\mid s)\right]=\left(\frac{\partial\bar{a}_\theta(s)}{\partial\theta}\right)^{\!\top}\mathbb{E}\!\left[\frac{\partial r(s,a)}{\partial a}\right].
$$
This proves equation [10](https://arxiv.org/html/2605.21654#S3.E10). ∎
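The identity can be verified numerically in one dimension. The sketch below assumes Gaussian noise $\nu=\mathcal{N}(0,\sigma^2)$, a hypothetical mean action $\bar{a}_\theta=2\theta$, and a hypothetical reward $r(a)=\sin a+a^2$; both sides are evaluated with a simple trapezoid rule rather than sampling, so the comparison is deterministic.

```python
import math

MEAN_SLOPE = 2.0  # d(mean action)/d(theta) for the toy choice mean = 2 * theta

def gaussian_density(u, sigma):
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(fn, lo, hi, n=20001):
    """Composite trapezoid rule on n equally spaced points."""
    step = (hi - lo) / (n - 1)
    total = 0.5 * (fn(lo) + fn(hi))
    for i in range(1, n - 1):
        total += fn(lo + i * step)
    return total * step

def reward(a):
    return math.sin(a) + a * a

def reward_slope(a):
    return math.cos(a) + 2.0 * a

def score_function_form(theta, sigma):
    """E[r(a) * d/dtheta log pi(a)]; for Gaussian noise the score is slope * u / sigma^2."""
    mean = MEAN_SLOPE * theta
    return integrate(
        lambda u: reward(mean + u) * MEAN_SLOPE * u / sigma ** 2 * gaussian_density(u, sigma),
        -10.0 * sigma, 10.0 * sigma)

def pathwise_form(theta, sigma):
    """(d mean / d theta) * E[dr/da]."""
    mean = MEAN_SLOPE * theta
    return MEAN_SLOPE * integrate(
        lambda u: reward_slope(mean + u) * gaussian_density(u, sigma),
        -10.0 * sigma, 10.0 * sigma)
```

For this toy choice the common value is also available in closed form, $2\bigl(\cos(2\theta)\,e^{-\sigma^2/2}+4\theta\bigr)$, which gives a third independent check.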
### A.5 Local GRPO/PPO value-gradient form
The main text uses the following consequence of Lemma [1](https://arxiv.org/html/2605.21654#Thmlemma1) and Proposition [1](https://arxiv.org/html/2605.21654#Thmproposition1).
###### Corollary 1 (GRPO is a value-gradient update in expectation).
Assume the differentiable rollout of Definition [1](https://arxiv.org/html/2605.21654#Thmdefinition1), the shift/additive-noise policy of Lemma [1](https://arxiv.org/html/2605.21654#Thmlemma1), and treat the GRPO/PPO advantage weights $\widehat{A}_{i,t}$ as stop-gradient scalars. Locally at $\theta=\theta_{\mathrm{old}}$, where clipping is inactive, the expected actor-gradient direction of the GRPO/PPO surrogate is pathwise-equivalent to a BPTT adjoint update. The backward signal in that adjoint update satisfies
$$
\mathbb{E}[\lambda_t\mid s_t=s]=G_t^{\pi}(s).
$$
Thus the local GRPO/PPO actor update is value-gradient-like in expectation.
###### Proof\.
At $\theta=\theta_{\mathrm{old}}$,
$$
\rho_{i,t}(\theta_{\mathrm{old}})=\frac{\pi_{\theta_{\mathrm{old}}}(o_t^i\mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(o_t^i\mid s_{i,t})}=1.
$$
In a local neighbourhood in which the clipping threshold is not active, the clipped surrogate reduces to the unclipped likelihood-ratio term. Hence, with $\widehat{A}_{i,t}$ held fixed,
$$
\frac{\partial}{\partial\theta}\left(\rho_{i,t}(\theta)\widehat{A}_{i,t}\right)\bigg|_{\theta=\theta_{\mathrm{old}}}=\widehat{A}_{i,t}\frac{\partial}{\partial\theta}\log\pi_\theta(o_t^i\mid s_{i,t})\bigg|_{\theta=\theta_{\mathrm{old}}}.
$$
Therefore, the local actor part of the GRPO/PPO gradient is the usual score-function policy-gradient term:
$$
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{T_i}\sum_{t=1}^{T_i}\widehat{A}_{i,t}\frac{\partial}{\partial\theta}\log\pi_\theta(o_t^i\mid s_{i,t})\right]_{\theta=\theta_{\mathrm{old}}}.
$$
The KL penalty in equation [1](https://arxiv.org/html/2605.21654#S2.E1) contributes an ordinary differentiable regularization gradient and does not affect the costate identity.
Now apply Lemma[1](https://arxiv.org/html/2605.21654#Thmlemma1)token\-by\-token\. For any differentiable local scalar signalhi,t\(st,at\)h\_\{i,t\}\(s\_\{t\},a\_\{t\}\)whose stop\-gradient coefficient isA^i,t\\widehat\{A\}\_\{i,t\}, we have
𝔼\[A^i,thi,t\(st,at\)∂∂θlogπθ\(at∣st\)\]=\(∂a¯θ\(st\)∂θ\)⊤𝔼\[∂\(A^i,thi,t\(st,at\)\)∂at\]\.\\mathbb\{E\}\\\!\\left\[\\widehat\{A\}\_\{i,t\}h\_\{i,t\}\(s\_\{t\},a\_\{t\}\)\\frac\{\\partial\}\{\\partial\\theta\}\\log\\pi\_\{\\theta\}\(a\_\{t\}\\mid s\_\{t\}\)\\right\]=\\left\(\\frac\{\\partial\\bar\{a\}\_\{\\theta\}\(s\_\{t\}\)\}\{\\partial\\theta\}\\right\)^\{\\\!\\top\}\\mathbb\{E\}\\\!\\left\[\\frac\{\\partial\\bigl\(\\widehat\{A\}\_\{i,t\}h\_\{i,t\}\(s\_\{t\},a\_\{t\}\)\\bigr\)\}\{\\partial a\_\{t\}\}\\right\]\.Takinghi,t\(st,at\)=r\(st,at\)h\_\{i,t\}\(s\_\{t\},a\_\{t\}\)=r\(s\_\{t\},a\_\{t\}\)gives exactly equation[11](https://arxiv.org/html/2605.21654#S3.E11)\. Thus the score\-function form and the pathwise action\-derivative form have the same expectation under the shift policy\.
Summing these pathwise terms over the rollout gives the BPTT gradient of the differentiable trajectory\. By Proposition[1](https://arxiv.org/html/2605.21654#Thmproposition1), the backward state\-sensitivity propagated by this BPTT computation is the costateλt\\lambda\_\{t\}\. By identity equation[9](https://arxiv.org/html/2605.21654#S2.E9),
𝔼\[λt∣st=s\]=Gtπ\(s\)\.\\mathbb\{E\}\[\\lambda\_\{t\}\\mid s\_\{t\}=s\]=G\_\{t\}^\{\\pi\}\(s\)\.Hence the local GRPO/PPO actor update is value\-gradient\-like in expectation\. ∎
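The equality between the score-function form and the pathwise action-derivative form can be checked numerically in the simplest shift-policy setting. The sketch below is an illustration, not the paper's experiment: it assumes a scalar Gaussian policy $a=\mu+\sigma\varepsilon$ with $\theta=\mu$ (so $\partial\bar{a}_{\theta}/\partial\theta=1$), and the test signal `f` is a hypothetical stand-in for $\widehat{A}_{i,t}h_{i,t}$.

```python
import numpy as np

def sf_and_pathwise(mu=0.3, sigma=0.5, n=400_000, seed=0):
    """Monte Carlo estimates of both sides of the SF/pathwise identity.

    Assumed toy setup (not from the paper): a ~ N(mu, sigma^2), theta = mu,
    and f(a) = sin(a) + 0.5 * a^2 as an arbitrary smooth local signal.
    """
    rng = np.random.default_rng(seed)
    a = mu + sigma * rng.standard_normal(n)   # additive-noise (shift) policy
    f = np.sin(a) + 0.5 * a**2                # local signal f(a)
    df = np.cos(a) + a                        # df/da, the pathwise derivative
    score = (a - mu) / sigma**2               # d/dmu log N(a; mu, sigma^2)
    sf = np.mean(f * score)                   # score-function estimate
    pd = np.mean(df)                          # pathwise estimate
    return sf, pd

mu, sigma = 0.3, 0.5
sf_est, pd_est = sf_and_pathwise(mu, sigma)
# Closed form of E[cos(a) + a] for a Gaussian, used only as a sanity check.
exact = np.cos(mu) * np.exp(-sigma**2 / 2) + mu
print(f"score-function: {sf_est:.4f}  pathwise: {pd_est:.4f}  exact: {exact:.4f}")
```

Both estimators target the same expectation, but the score-function estimate is noticeably noisier at equal sample size, which is the usual practical motivation for pathwise gradients.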
### A.6 Empirical costate recursion in a transformer
###### Derivation of equation [15](https://arxiv.org/html/2605.21654#S4.E15).
For a fixed trajectory, the local GRPO loss without clipping can be written as
$$L(\theta)=\sum_{k=1}^{T}\widehat{A}_{k}\ell_{k}(\theta),\qquad\ell_{k}=\log\pi_{\theta}(o_{k}\mid s_{k}).$$
By Definition [2](https://arxiv.org/html/2605.21654#Thmdefinition2),
$$\hat{\lambda}_{t}=\frac{\partial}{\partial h_{t}^{(L)}}\sum_{k\geq t}\widehat{A}_{k}\ell_{k}.$$
Splitting the $k=t$ term from the future terms gives
$$\hat{\lambda}_{t}=\widehat{A}_{t}\frac{\partial\ell_{t}}{\partial h_{t}^{(L)}}+\sum_{k>t}\widehat{A}_{k}\frac{\partial\ell_{k}}{\partial h_{t}^{(L)}}.$$
For $k>t$, the dependence of $\ell_{k}$ on $h_{t}^{(L)}$ passes through the transformer computation graph. In particular, the first future step contributes through the Jacobian
$$J_{t+1\leftarrow t}=\frac{\partial h_{t+1}^{(L)}}{\partial h_{t}^{(L)}}.$$
Applying the chain rule recursively gives
$$\sum_{k>t}\widehat{A}_{k}\frac{\partial\ell_{k}}{\partial h_{t}^{(L)}}=J_{t+1\leftarrow t}^{\top}\frac{\partial}{\partial h_{t+1}^{(L)}}\sum_{k\geq t+1}\widehat{A}_{k}\ell_{k}=J_{t+1\leftarrow t}^{\top}\hat{\lambda}_{t+1}.$$
Therefore,
$$\hat{\lambda}_{t}=\widehat{A}_{t}\frac{\partial\ell_{t}}{\partial h_{t}^{(L)}}+J_{t+1\leftarrow t}^{\top}\hat{\lambda}_{t+1},\qquad\hat{\lambda}_{T+1}=0,$$
which is equation [15](https://arxiv.org/html/2605.21654#S4.E15). ∎
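The backward recursion can be sanity-checked on a toy model. The sketch below is an illustration under simplifying assumptions that are not in the paper: hidden states follow a Markov chain $h_{t+1}=\tanh(Wh_{t})$ instead of a transformer, and $\ell_{k}=c_{k}^{\top}h_{k}$ is a hypothetical linear readout standing in for the log-probability. All names (`rollout`, `tail_loss`, `costates_by_recursion`, `fd_grad`) are made up for this example. It compares the recursion with a finite-difference gradient of $\sum_{k\geq t}\widehat{A}_{k}\ell_{k}$ with respect to $h_{t}$.

```python
import numpy as np

def rollout(h_start, W, steps):
    """Roll the toy chain h_{k+1} = tanh(W h_k) forward for `steps` steps."""
    hs = [h_start]
    for _ in range(steps):
        hs.append(np.tanh(W @ hs[-1]))
    return hs

def tail_loss(h_t, t, W, A, C):
    """sum_{k >= t} A_k * ell_k, viewed as a function of h_t alone."""
    hs = rollout(h_t, W, len(A) - 1 - t)
    return sum(A[t + j] * (C[t + j] @ h) for j, h in enumerate(hs))

def costates_by_recursion(hs, W, A, C):
    """lambda_t = A_t * d ell_t/d h_t + J_{t+1<-t}^T lambda_{t+1}, zero past the end."""
    T = len(A)
    lam_next = np.zeros_like(hs[0])          # terminal condition
    lams = [None] * T
    for t in reversed(range(T)):
        local = A[t] * C[t]                  # immediate credit term
        if t + 1 < T:
            J = (1.0 - hs[t + 1] ** 2)[:, None] * W   # d h_{t+1} / d h_t
            lam_next = local + J.T @ lam_next
        else:
            lam_next = local
        lams[t] = lam_next
    return lams

def fd_grad(fn, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (fn(x + e) - fn(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
T, d = 6, 4                                   # 0-indexed steps 0..T-1
W = 0.5 * rng.standard_normal((d, d))
A = rng.standard_normal(T)                    # stand-in advantages
C = rng.standard_normal((T, d))               # stand-in readout vectors
hs = rollout(rng.standard_normal(d), W, T - 1)
lams = costates_by_recursion(hs, W, A, C)
max_err = max(
    np.max(np.abs(lams[t] - fd_grad(lambda x: tail_loss(x, t, W, A, C), hs[t])))
    for t in range(T)
)
print("max |recursion - finite difference| =", max_err)
```

In a real transformer, later positions also attend directly to position $t$, so the single-step Jacobian here is only the simplest instance of the chain-rule argument.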
### A.7 Proof of Proposition [2](https://arxiv.org/html/2605.21654#Thmproposition2): sampling-gap bound
###### Proof.
The exact relaxed transition Jacobian decomposes as in equation [20](https://arxiv.org/html/2605.21654#S4.E20):
$$Df_{\theta}^{\mathrm{exact}}=J_{t+1\leftarrow t}^{\mathrm{attn}}+\frac{\partial h_{t+1}^{(L)}}{\partial e_{t}}\frac{\partial e_{t}}{\partial o_{t}}\frac{\partial o_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial h_{t}^{(L)}}.$$
Thus,
$$\left\|Df_{\theta}^{\mathrm{exact}}-J_{t+1\leftarrow t}^{\mathrm{attn}}\right\|\leq\left\|\frac{\partial h_{t+1}^{(L)}}{\partial e_{t}}\right\|\left\|\frac{\partial e_{t}}{\partial o_{t}}\right\|\left\|\frac{\partial o_{t}}{\partial z_{t}}\right\|\left\|\frac{\partial z_{t}}{\partial h_{t}^{(L)}}\right\|.$$
The first, second, and fourth factors are controlled by the embedding matrix, the output head, and the local Lipschitz constants of the transformer block. It remains to control the effective Jacobian through sampling.
For a soft or straight-through relaxation, the categorical sample is replaced by a differentiable soft sample with probabilities
$$p_{t}=\mathrm{softmax}(z_{t}/\tau),$$
where $\tau>0$ is the relaxation temperature. Its Jacobian has the form
$$\frac{\partial o_{t}}{\partial z_{t}}=\frac{1}{\tau}\left(\mathrm{diag}(p_{t})-p_{t}p_{t}^{\top}\right).$$
The matrix $\mathrm{diag}(p_{t})-p_{t}p_{t}^{\top}$ is the covariance matrix of a categorical one-hot vector. Since it is positive semidefinite,
$$\left\|\mathrm{diag}(p_{t})-p_{t}p_{t}^{\top}\right\|\leq\mathrm{Tr}\left(\mathrm{diag}(p_{t})-p_{t}p_{t}^{\top}\right)=1-\|p_{t}\|_{2}^{2}.$$
Let $p_{\max}:=\max_{j}p_{t,j}$. Then
$$1-\|p_{t}\|_{2}^{2}\leq 1-p_{\max}^{2}\leq 2(1-p_{\max}).$$
For a fixed vocabulary size $|V|$, normalized entropy controls the distance from a point mass: there exists a finite constant $C_{V}$ such that
$$1-p_{\max}\leq C_{V}\sqrt{\frac{H_{t}}{\log|V|}}.$$
This follows because the simplex is compact and the ratio
$$\frac{1-p_{\max}}{\sqrt{H_{t}/\log|V|}}$$
has a finite continuous extension at the deterministic vertices, where both the numerator and the entropy vanish. Combining the previous bounds gives
$$\left\|\frac{\partial o_{t}}{\partial z_{t}}\right\|\leq\frac{2C_{V}}{\tau}\sqrt{\frac{H_{t}}{\log|V|}}.$$
Absorbing the fixed vocabulary, temperature, embedding, head, and local Lipschitz constants into a single constant $C$, we obtain
$$\bigl\|Df_{\theta}^{\mathrm{exact}}-J_{t+1\leftarrow t}^{\mathrm{attn}}\bigr\|\leq C\sqrt{\frac{H_{t}}{\log|V|}},$$
which is equation [21](https://arxiv.org/html/2605.21654#S4.E21). ∎
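The deterministic part of this chain of inequalities is easy to verify numerically. The sketch below is an illustration only: the logits are random, the vocabulary size of 50 and temperature 0.7 are arbitrary choices, and the entropy step is omitted because the constant $C_{V}$ is not given explicitly. It checks the closed-form softmax Jacobian against finite differences and then the ordering spectral norm $\leq$ trace $=1-\|p_{t}\|_{2}^{2}\leq 1-p_{\max}^{2}\leq 2(1-p_{\max})$.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

def fd_jacobian(fn, z, eps=1e-6):
    """Central finite-difference Jacobian; column i holds d fn / d z_i."""
    cols = []
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = eps
        cols.append((fn(z + e) - fn(z - e)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
tau = 0.7                                   # arbitrary relaxation temperature
tol = 1e-12
jac_errs, chain_ok = [], []
for scale in [0.1, 1.0, 5.0, 20.0]:         # from near-uniform to near-deterministic
    z = scale * rng.standard_normal(50)
    p = softmax(z / tau)
    cov = np.diag(p) - np.outer(p, p)       # covariance of a one-hot categorical

    # Closed-form Jacobian of softmax(z / tau) versus finite differences.
    jac_fd = fd_jacobian(lambda x: softmax(x / tau), z)
    jac_errs.append(np.max(np.abs(cov / tau - jac_fd)))

    spec = np.linalg.norm(cov, 2)           # spectral norm
    trace = np.trace(cov)
    p_max = p.max()
    chain_ok.append(
        spec <= trace + tol
        and abs(trace - (1.0 - p @ p)) <= 1e-10
        and 1.0 - p @ p <= 1.0 - p_max**2 + tol
        and 1.0 - p_max**2 <= 2.0 * (1.0 - p_max) + tol
    )

print("max Jacobian error:", max(jac_errs))
print("inequality chain holds:", all(chain_ok))
```

As the logit scale grows, $p_{t}$ approaches a vertex of the simplex and every quantity in the chain shrinks toward zero, which is the low-entropy regime in which the missing sampling-path Jacobian becomes negligible.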
### A.8 Proof of Proposition [3](https://arxiv.org/html/2605.21654#Thmproposition3): attention-pathway rank and magnitude
###### Proof.
Consider a single causal-attention head. Its output at position $t$ has the form
$$\mathrm{Attn}_{t}=\sum_{t'\leq t}\alpha_{t,t'}W_{V}h_{t'}.$$
For a fixed earlier position $t'$, the Jacobian of this head output with respect to $h_{t'}$ maps into the $d_{\mathrm{head}}$-dimensional value subspace of that head. Hence its rank is at most $d_{\mathrm{head}}$. This remains true when including the derivative of the attention weights $\alpha_{t,t'}$, because the output of a single head is still a vector in $\mathbb{R}^{d_{\mathrm{head}}}$.
With $n_{\mathrm{heads}}$ heads, the multi-head attention output is the concatenation and output projection of $n_{\mathrm{heads}}$ such head outputs. Therefore, the rank of the cross-position attention contribution in one layer is at most
$$d_{\mathrm{head}}\,n_{\mathrm{heads}}.$$
Across $L$ layers, the attention-pathway Jacobian can be written as a sum of cross-position contributions transported through the intervening layer Jacobians. Since the rank of a sum is at most the sum of ranks, the total attention-pathway rank is bounded by
$$L\,d_{\mathrm{head}}\,n_{\mathrm{heads}}.$$
This proves the stated rank bound. For the magnitude claim, each head contribution contains factors of the attention weight $\alpha_{t,t'}$, value projection $W_{V}$, output projection, and softmax-score derivatives. Under bounded projection norms and bounded hidden-state norms, the contribution from position $t'$ is therefore controlled by the size of the corresponding attention weights. Thus broadly distributed attention provides more cross-position credit pathways, while small attention weights suppress the corresponding Jacobian entries. ∎
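The single-layer rank bound can be illustrated with a small numerical experiment. The sketch below is a toy setup, not the paper's architecture: it assumes one attention layer with random weights, no residual stream or MLP, and per-head output projections `Wo` whose sum is equivalent to concatenation followed by a shared output projection. The helper names and the sizes `d_model=8`, `d_head=2`, `n_heads=2` are arbitrary. It estimates the Jacobian of the output at position $t$ with respect to an earlier hidden state $h_{t'}$ by finite differences and reports its numerical rank.

```python
import numpy as np

def head_output(H, t, Wq, Wk, Wv, Wo):
    """One causal-attention head at position t, projected back to d_model."""
    q = Wq @ H[t]                         # query from position t
    K = H[: t + 1] @ Wk.T                 # keys for positions 0..t
    V = H[: t + 1] @ Wv.T                 # values for positions 0..t
    scores = K @ q / np.sqrt(q.size)
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # attention weights alpha_{t, t'}
    return Wo @ (w @ V)

def jacobian_wrt_position(fn, H, tp, eps=1e-6):
    """Central finite-difference Jacobian of fn(H) with respect to row H[tp]."""
    cols = []
    for i in range(H.shape[1]):
        Hp, Hm = H.copy(), H.copy()
        Hp[tp, i] += eps
        Hm[tp, i] -= eps
        cols.append((fn(Hp) - fn(Hm)) / (2 * eps))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(0)
n_pos, d_model, d_head, n_heads = 5, 8, 2, 2
H = rng.standard_normal((n_pos, d_model))
s = 1.0 / np.sqrt(d_model)
heads = [
    (s * rng.standard_normal((d_head, d_model)),   # Wq
     s * rng.standard_normal((d_head, d_model)),   # Wk
     s * rng.standard_normal((d_head, d_model)),   # Wv
     s * rng.standard_normal((d_model, d_head)))   # Wo (per-head slice)
    for _ in range(n_heads)
]
t, tp = n_pos - 1, 1                               # output position and earlier position

J_one = jacobian_wrt_position(lambda X: head_output(X, t, *heads[0]), H, tp)
J_multi = jacobian_wrt_position(
    lambda X: sum(head_output(X, t, *hd) for hd in heads), H, tp
)
rank_one = np.linalg.matrix_rank(J_one, tol=1e-6)
rank_multi = np.linalg.matrix_rank(J_multi, tol=1e-6)
print("single-head rank:", rank_one, "<= d_head =", d_head)
print("multi-head rank:", rank_multi, "<= d_head * n_heads =", d_head * n_heads)
```

Both Jacobians are `d_model x d_model` matrices, so a rank well below `d_model` reflects the bottleneck through the head's value subspace and not a limitation of the matrix shape.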
### A.9 Proof of Theorem [1](https://arxiv.org/html/2605.21654#Thmtheorem1): discrete GRPO costates approximate BPTT costates
###### Proof.
Define the conditional costate error
$$\delta_{t}:=\mathbb{E}[\hat{\lambda}_{t}\mid h_{t}]-\mathbb{E}[\lambda_{t}\mid h_{t}].$$
The empirical costate recursion is
$$\hat{\lambda}_{t}=\widehat{A}_{t}\frac{\partial\ell_{t}}{\partial h_{t}^{(L)}}+\left(J_{t+1\leftarrow t}^{\mathrm{attn}}\right)^{\!\top}\hat{\lambda}_{t+1}.$$
The exact relaxed BPTT recursion has the form
$$\lambda_{t}=Dr_{t}+\gamma\left(Df_{\theta}^{\mathrm{exact}}\right)^{\!\top}\lambda_{t+1}.$$
Under the local SF-to-PD identification of Lemma [1](https://arxiv.org/html/2605.21654#Thmlemma1), the immediate credit terms align in expectation. Thus the difference between the two conditional recursions is caused by the propagated future error and by the missing sampling-path Jacobian:
$$\delta_{t}=\gamma\left(J_{t+1\leftarrow t}^{\mathrm{attn}}\right)^{\!\top}\delta_{t+1}+\gamma\left(Df_{\theta}^{\mathrm{exact}}-J_{t+1\leftarrow t}^{\mathrm{attn}}\right)^{\!\top}\mathbb{E}[\lambda_{t+1}\mid h_{t}].$$
Taking norms gives
$$\|\delta_{t}\|\leq\gamma\left\|J_{t+1\leftarrow t}^{\mathrm{attn}}\right\|\|\delta_{t+1}\|+\gamma\left\|Df_{\theta}^{\mathrm{exact}}-J_{t+1\leftarrow t}^{\mathrm{attn}}\right\|\left\|\mathbb{E}[\lambda_{t+1}\mid h_{t}]\right\|.$$
Let
$$\epsilon_{t+1}:=\left\|Df_{\theta}^{\mathrm{exact}}-J_{t+1\leftarrow t}^{\mathrm{attn}}\right\|\left\|\mathbb{E}[\lambda_{t+1}\mid h_{t}]\right\|.$$
By Proposition [2](https://arxiv.org/html/2605.21654#Thmproposition2), this per-step error is controlled by the entropy-dependent sampling-gap bound. Hence
$$\|\delta_{t}\|\leq\gamma\left\|J_{t+1\leftarrow t}^{\mathrm{attn}}\right\|\|\delta_{t+1}\|+\gamma\epsilon_{t+1}.$$
Unrolling this inequality backward from $\delta_{T+1}=0$ yields
$$\|\delta_{t}\|\leq\sum_{k=t+1}^{T}\gamma^{k-t}\left(\prod_{j=t+1}^{k-1}\left\|J_{j\leftarrow j-1}^{\mathrm{attn}}\right\|\right)\epsilon_{k}.$$
The main text writes the Jacobian chain compactly as $\|J_{k\leftarrow k-1}^{\mathrm{attn}}\|^{k-t-1}$, which should be read as this product or as a uniform upper bound on the product. Therefore,
$$\bigl\|\mathbb{E}[\hat{\lambda}_{t}\mid h_{t}]-\mathbb{E}[\lambda_{t}\mid h_{t}]\bigr\|\leq\sum_{k=t+1}^{T}\gamma^{k-t}\cdot\underbrace{\bigl\|J_{k\leftarrow k-1}^{\mathrm{attn}}\bigr\|^{k-t-1}}_{\text{Jacobian chain growth}}\cdot\epsilon_{k},$$
which is equation [23](https://arxiv.org/html/2605.21654#S4.E23). ∎
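The unrolling step can be checked with scalar stand-ins for the norms. The sketch below is an illustration with made-up numbers: `jn[j]` plays the role of $\|J_{j\leftarrow j-1}^{\mathrm{attn}}\|$, `eps[k]` plays the role of $\epsilon_{k}$, and $\epsilon_{T+1}$ is set to zero on the reading that $\lambda_{T+1}=0$ makes that term vanish. It iterates the one-step bound with equality, compares it with the unrolled sum, and confirms that replacing the product by a power of the largest Jacobian norm gives a valid (looser) bound.

```python
import numpy as np

def bound_by_recursion(gamma, jn, eps, T):
    """Iterate b_t = gamma * jn[t+1] * b_{t+1} + gamma * eps[t+1], with b_{T+1} = 0."""
    b = np.zeros(T + 2)                    # 1-indexed: b[1..T], b[T+1] = 0
    for t in range(T, 0, -1):
        b[t] = gamma * jn[t + 1] * b[t + 1] + gamma * eps[t + 1]
    return b

def bound_unrolled(gamma, jn, eps, T, t):
    """sum_{k=t+1}^{T} gamma^(k-t) * prod_{j=t+1}^{k-1} jn[j] * eps[k]."""
    total = 0.0
    for k in range(t + 1, T + 1):
        chain = np.prod(jn[t + 1 : k])     # empty product equals 1 when k = t+1
        total += gamma ** (k - t) * chain * eps[k]
    return total

def bound_uniform(gamma, j_max, eps, T, t):
    """Compact form: the product is replaced by j_max^(k-t-1)."""
    return sum(gamma ** (k - t) * j_max ** (k - t - 1) * eps[k]
               for k in range(t + 1, T + 1))

rng = np.random.default_rng(0)
T, gamma = 8, 0.95
jn = rng.uniform(0.5, 1.5, size=T + 2)     # jn[j] ~ ||J_{j <- j-1}||, j = 2..T+1 used
eps = rng.uniform(0.0, 0.1, size=T + 2)    # eps[k] ~ per-step sampling-gap error
eps[T + 1] = 0.0                           # assumed: vanishes because lambda_{T+1} = 0

rec = bound_by_recursion(gamma, jn, eps, T)
unrolled = np.array([bound_unrolled(gamma, jn, eps, T, t) for t in range(1, T + 1)])
uniform = np.array([bound_uniform(gamma, jn.max(), eps, T, t) for t in range(1, T + 1)])

print("recursion matches unrolled sum:", np.allclose(rec[1 : T + 1], unrolled))
print("uniform bound dominates:", bool(np.all(uniform >= unrolled - 1e-12)))
```

When the largest Jacobian norm exceeds $1/\gamma$, the uniform bound grows geometrically with the horizon $k-t$, which is the "Jacobian chain growth" effect labeled in the final inequality.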
### A.10 Proof of the usable-signal lower bound
###### Proof.
Let
$$u_{t}^{(m)}:=\mathbb{E}[\hat{\lambda}_{t}^{(m)}\mid h_{t}],\qquad v_{t}:=G_{t}.$$
Then
$$\langle u_{t}^{(m)},v_{t}\rangle=\|v_{t}\|_{2}^{2}+\langle u_{t}^{(m)}-v_{t},v_{t}\rangle.$$
By Cauchy–Schwarz,
$$\langle u_{t}^{(m)}-v_{t},v_{t}\rangle\geq-\|u_{t}^{(m)}-v_{t}\|_{2}\,\|v_{t}\|_{2}.$$
Therefore,
$$\langle u_{t}^{(m)},v_{t}\rangle\geq\|G_{t}\|_{2}^{2}-\|G_{t}\|_{2}\left\|\mathbb{E}[\hat{\lambda}_{t}^{(m)}\mid h_{t}]-G_{t}\right\|_{2}.$$
Taking expectation over $q,\tau,t$ gives
$$\mathbb{E}_{q,\tau,t}\left[\left\langle\mathbb{E}[\hat{\lambda}_{t}^{(m)}\mid h_{t}],G_{t}\right\rangle\right]\geq\Sigma(\theta)-\Lambda_{m}(\theta),$$
with $\Sigma(\theta)$ and $\Lambda_{m}(\theta)$ defined in equation [26](https://arxiv.org/html/2605.21654#S5.E26). This proves equation [25](https://arxiv.org/html/2605.21654#S5.E25). ∎