GAGPO: Generalized Advantage Grouped Policy Optimization
Summary
GAGPO proposes a critic-free RL method that uses a non-parametric grouped value proxy for step-level credit assignment in multi-turn agentic tasks, outperforming strong baselines on ALFWorld and WebShop.
# GAGPO: Generalized Advantage Grouped Policy Optimization
Source: [https://arxiv.org/html/2605.13217](https://arxiv.org/html/2605.13217)
Siyuan Zhu¹,², Chao Yu¹ (corresponding author), Rongxin Yang¹,², Zongkai Liu¹, Jinjun Hu², Qiwen Chen², Yibo Zhang²
¹School of Computer Science and Engineering, Sun Yat-sen University; ²Meituan
zhusy58@mail2.sysu.edu.cn, yuchao3@mail.sysu.edu.cn, zhangyibo06@meituan.com
###### Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for post-training large language model (LLM) agents. However, credit assignment in multi-turn environments remains a challenge. Agents typically receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to identify which specific intermediate actions led to success or failure. Consequently, effectively propagating delayed outcomes back to individual steps, without relying on costly auxiliary value models, remains an open problem. In this paper, we propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free RL method that enables precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Coupled with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable and localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop demonstrate that GAGPO outperforms strong RL baselines. Further analyses reveal faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, offering a simple yet highly effective framework for multi-turn agentic RL.
## 1 Introduction
Large language models (LLMs) are increasingly evolving from single-turn assistants into agents that can perceive environments, reason over observations, and act through multi-turn interactions (GPT-5 Team, [2025](https://arxiv.org/html/2605.13217#bib.bib2); Gemini 2.5 Team, [2025](https://arxiv.org/html/2605.13217#bib.bib3); Qwen3 Team, [2025](https://arxiv.org/html/2605.13217#bib.bib4)). Reinforcement learning (RL) (Ouyang et al., [2022](https://arxiv.org/html/2605.13217#bib.bib1)) has become a natural post-training paradigm for this transition. From PPO (Schulman et al., [2017](https://arxiv.org/html/2605.13217#bib.bib5)) to critic-free grouped policy optimization methods such as GRPO (Shao et al., [2024](https://arxiv.org/html/2605.13217#bib.bib6)) and its variants (Ahmadian et al., [2024](https://arxiv.org/html/2605.13217#bib.bib7); Yu et al., [2025](https://arxiv.org/html/2605.13217#bib.bib8); Zheng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib10); Gao et al., [2025](https://arxiv.org/html/2605.13217#bib.bib11)), online policy optimization has shown strong performance in reasoning-oriented post-training. More recently, these methods have been extended to multi-turn agent settings, enabling LLMs to improve through search, tool use, and environment interaction (Wang et al., [2025](https://arxiv.org/html/2605.13217#bib.bib12); Jin et al., [2025](https://arxiv.org/html/2605.13217#bib.bib13); Chen et al., [2025](https://arxiv.org/html/2605.13217#bib.bib14)).
Despite this progress, agentic RL in multi-turn environments remains challenging: rewards are sparse and delayed, and policy optimization is typically performed at the token level even though task success is determined by higher-level agent actions. Consequently, intermediate decisions receive weak, noisy, and poorly localized supervision (Feng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib20); Li et al., [2026b](https://arxiv.org/html/2605.13217#bib.bib17)).
Existing approaches only partially address this mismatch. One line of work introduces auxiliary critics, value estimators, or process reward models for denser step-level feedback (Xi et al., [2025](https://arxiv.org/html/2605.13217#bib.bib15); Liu et al., [2025](https://arxiv.org/html/2605.13217#bib.bib16); Li et al., [2026b](https://arxiv.org/html/2605.13217#bib.bib17), [a](https://arxiv.org/html/2605.13217#bib.bib18); Wei et al., [2025](https://arxiv.org/html/2605.13217#bib.bib19)), at the cost of additional training complexity and estimation error. Critic-free alternatives either rely on trajectory-relative or Monte Carlo-style grouped optimization (Feng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib20); He et al., [2026](https://arxiv.org/html/2605.13217#bib.bib21)), which preserves architectural simplicity but yields high-variance, weakly propagated supervision, or rely on tree-structured rollouts with branch-level comparison and turn-wise reward propagation (Ding and Ye, [2025](https://arxiv.org/html/2605.13217#bib.bib29); Zong et al., [2026](https://arxiv.org/html/2605.13217#bib.bib30); Dong et al., [2025](https://arxiv.org/html/2605.13217#bib.bib31)). Despite these advances, agentic RL still lacks a simple critic-free method that performs temporally propagated, step-aligned credit assignment under standard multi-turn rollouts, without auxiliary critics or specialized search procedures.
Figure 1: Overview of GAGPO. GAGPO consists of three stages: (1) rollout grouping, which groups all occurrences of the same environment state across sampled trajectories; (2) step-level credit assignment, which builds a grouped non-parametric value proxy and computes TD/GAE-style step advantages without a learned critic; and (3) group-normalized PPO update, which normalizes step advantages within each rollout group and performs action-level policy optimization with a shared sequence-level importance ratio.
In this paper, we propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for multi-turn agent training. GAGPO treats each environment step, rather than each token, as the basic unit of credit assignment, and constructs a non-parametric grouped value proxy from rollout groups to compute TD/GAE-style (Schulman et al., [2018](https://arxiv.org/html/2605.13217#bib.bib22)) temporal advantages without learning a critic. Unlike methods that broadcast a shared trajectory-level reward to every step, GAGPO propagates outcome supervision through temporal recursion and applies group-wise advantage normalization for stability.
We evaluate GAGPO on ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.13217#bib.bib25)) and WebShop (Yao et al., [2023a](https://arxiv.org/html/2605.13217#bib.bib26)) using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct (Qwen2.5 Team, [2025](https://arxiv.org/html/2605.13217#bib.bib32)). Across both benchmarks and both model scales, GAGPO consistently outperforms strong prompting baselines and RL baselines including PPO, RLOO, GRPO, and GiGPO. Further analyses show faster early-stage learning, improved interaction efficiency, smoother optimization dynamics, and lower-variance step-level advantage signals. These results show that critic-free grouped RL can be extended more effectively to interactive LLM agents when credit is assigned at the level of environment steps and propagated through time.
## 2 Background
### 2.1 Related Works
#### RL for large language models\.
RL has become a standard paradigm for post-training LLMs. Classical RLHF (Ouyang et al., [2022](https://arxiv.org/html/2605.13217#bib.bib1)) relies on PPO (Schulman et al., [2017](https://arxiv.org/html/2605.13217#bib.bib5)) with a learned critic, which is costly and sensitive to value estimation, while preference-based methods such as DPO (Rafailov et al., [2024](https://arxiv.org/html/2605.13217#bib.bib23)) bypass online RL but do not handle exploration or multi-turn interactions. Recent critic-free on-policy methods address these issues with grouped or REINFORCE-style updates, including RLOO (Ahmadian et al., [2024](https://arxiv.org/html/2605.13217#bib.bib7)), GRPO (Shao et al., [2024](https://arxiv.org/html/2605.13217#bib.bib6)), DAPO (Yu et al., [2025](https://arxiv.org/html/2605.13217#bib.bib8)), and GSPO (Zheng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib10)). However, these methods are designed for single-turn generation or sequence-level reasoning. GAGPO extends critic-free grouped RL to multi-turn agent training with temporally propagated, step-aligned credit assignment.
#### Credit assignment for agentic RL\.
Existing agentic RL methods address credit assignment along two directions. The first introduces auxiliary critics or process reward models for denser step-level supervision, e.g., AgentPRM (Xi et al., [2025](https://arxiv.org/html/2605.13217#bib.bib15)), iStar (Liu et al., [2025](https://arxiv.org/html/2605.13217#bib.bib16)), Turn-PPO (Li et al., [2026b](https://arxiv.org/html/2605.13217#bib.bib17)), and SORL (Li et al., [2026a](https://arxiv.org/html/2605.13217#bib.bib18)), but requires extra value or reward modeling. The second pursues finer-grained credit within critic-free grouped optimization, including anchor-state grouping in GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib20)) and tree- or turn-structured rollouts such as Tree-GRPO (Ding and Ye, [2025](https://arxiv.org/html/2605.13217#bib.bib29)), AT2PO (Zong et al., [2026](https://arxiv.org/html/2605.13217#bib.bib30)), and ARPO (Dong et al., [2025](https://arxiv.org/html/2605.13217#bib.bib31)). In contrast, GAGPO stays critic-free and rollout-based, but replaces Monte Carlo or relative-return estimation with a bootstrapped TD/GAE-style temporal estimator, enabling step-aligned credit propagation without an additional critic.
### 2.2 Preliminary
#### Problem setup\.
We consider the problem of training an LLM agent to accomplish tasks through multi-turn interaction with an external environment. The interaction process is modeled as a Markov Decision Process (MDP) $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $P(s_{t+1}\mid s_t,a_t)$ the transition dynamics, $r$ the reward function, and $\gamma\in[0,1]$ the discount factor. At each step $t=1,\dots,T$, the agent receives the environment state $s_t\in\mathcal{S}$ and generates an action $a_t\in\mathcal{A}\subseteq\mathcal{V}^{n}$, where $\mathcal{V}$ is the token vocabulary and $n$ is the maximum action length. The agent policy is parameterized by $\theta$ as $\pi_\theta(a_t\mid s_t)$. After executing $a_t$, the environment returns the next state $s_{t+1}\sim P(\cdot\mid s_t,a_t)$, where $s_{t+1}$ corresponds to the environment response represented by the updated interaction context, yielding a trajectory $\tau=\{(s_1,a_1),\dots,(s_T,a_T)\}$. Under the sparse delayed-reward setting widely studied in agentic RL, this interaction process becomes a sequential decision-making problem with challenging credit assignment.
#### Generalized advantage estimation\.
Policy optimization is commonly based on advantages assigned to sampled actions. A standard estimator is generalized advantage estimation (GAE) (Schulman et al., [2018](https://arxiv.org/html/2605.13217#bib.bib22)), which defines the TD residual $\delta_t=r_t+\gamma V(s_{t+1})-V(s_t)$ and computes
$$\hat{A}_t=\sum_{l=0}^{T-t}(\gamma\lambda)^{l}\,\delta_{t+l},$$
where $V(\cdot)$ is a value function and $\lambda\in[0,1]$ controls the bias–variance trade-off. By recursively propagating TD residuals backward through time, GAE provides a temporally structured credit signal, but relies on a learned value function that is absent in critic-free grouped policy optimization.
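The backward recursion behind this sum can be sketched in a few lines of Python. This is a minimal illustration rather than code from the paper: the function name `gae_advantages` and the list-based interface are assumptions, and `values` is taken to hold one extra entry for the bootstrap value after the final step (0 for a terminal state).

```python
def gae_advantages(rewards, values, gamma, lam):
    """Generalized advantage estimation via the backward recursion.

    rewards[t] is r_t for t = 0..T-1 (0-indexed here for convenience).
    values[t] is V(s_t) for t = 0..T; values[T] is the value after the
    last step and should be 0.0 when the episode terminated.
    """
    num_steps = len(rewards)
    assert len(values) == num_steps + 1
    advantages = [0.0] * num_steps
    running = 0.0  # plays the role of A_{t+1}, zero beyond the horizon
    for t in reversed(range(num_steps)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0.0` returns the one-step TD residuals, while `lam=1.0` with `gamma=1.0` returns the undiscounted return minus `values[t]`.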
Table 1: Performance on ALFWorld and WebShop. Columns Pick through All are ALFWorld results; Score and Succ. are WebShop results.
| Backbone | Type | Method | Pick | Look | Clean | Heat | Cool | Pick2 | All | Score | Succ. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Model | Prompting | GPT-4o | 75.3 | 60.8 | 31.2 | 56.7 | 21.6 | 49.8 | 48.0 | 31.8 | 23.7 |
| Closed-Source Model | Prompting | Gemini-2.5-Pro | 92.8 | 63.3 | 62.1 | 69.0 | 26.6 | 58.7 | 60.3 | 42.5 | 35.9 |
| Qwen2.5-1.5B-Instruct | Prompting | Qwen2.5 | 5.9 | 5.5 | 3.3 | 9.7 | 4.2 | 0.0 | 4.1 | 23.1 | 5.2 |
| Qwen2.5-1.5B-Instruct | Prompting | ReAct | 17.4 | 20.5 | 15.7 | 6.2 | 7.7 | 2.0 | 12.8 | 40.1 | 11.3 |
| Qwen2.5-1.5B-Instruct | Prompting | Reflexion | 35.3 | 22.2 | 21.7 | 13.6 | 19.4 | 3.7 | 21.8 | 55.8 | 21.9 |
| Qwen2.5-1.5B-Instruct | RL Training | PPO (with critic) | 64.8±3.5 | 40.5±6.9 | 57.1±4.9 | 60.6±6.6 | 46.4±4.0 | 47.4±1.9 | 54.4±3.1 | 73.8±3.0 | 51.5±2.9 |
| Qwen2.5-1.5B-Instruct | RL Training | RLOO | 88.3±3.0 | 52.8±8.6 | 71.0±5.9 | 62.8±8.7 | 66.4±5.5 | 56.9±4.7 | 69.7±2.5 | 73.9±5.6 | 52.1±6.7 |
| Qwen2.5-1.5B-Instruct | RL Training | GRPO | 73.1±3.4 | 66.7±10.1 | 80.2±8.2 | 69.6±12.2 | 58.7±4.5 | 67.6±11.0 | 70.3±3.6 | 80.5±2.0 | 66.4±4.4 |
| Qwen2.5-1.5B-Instruct | RL Training | GiGPO | 98.4±2.1 | 72.2±4.9 | 91.1±6.1 | 96.8±6.25 | 82.6±4.5 | 79.7±5.4 | 88.1±1.95 | 79.8±1.2 | 62.5±1.1 |
| Qwen2.5-1.5B-Instruct | RL Training | GAGPO (Ours) | 99.2±3.1 | 83.8±6.3 | 97.3±1.9 | 95.1±3.5 | 84.9±1.8 | 89.8±6.0 | 93.5±1.3 | 88.6±3.3 | 78.1±1.1 |
| Qwen2.5-7B-Instruct | Prompting | Qwen2.5 | 33.4 | 21.6 | 19.3 | 6.9 | 2.8 | 3.2 | 14.8 | 26.4 | 7.8 |
| Qwen2.5-7B-Instruct | Prompting | ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 | 46.2 | 19.5 |
| Qwen2.5-7B-Instruct | Prompting | Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 | 58.1 | 28.8 |
| Qwen2.5-7B-Instruct | RL Training | PPO (with critic) | 92.3±4.0 | 64.0±8.4 | 92.5±2.4 | 89.5±7.0 | 80.3±2.0 | 68.8±8.3 | 80.4±2.7 | 81.4±3.1 | 68.7±5.1 |
| Qwen2.5-7B-Instruct | RL Training | RLOO | 87.6±4.3 | 78.2±8.3 | 87.3±5.8 | 81.3±7.6 | 71.9±5.2 | 48.9±8.4 | 75.5±4.6 | 80.3±3.2 | 65.7±4.0 |
| Qwen2.5-7B-Instruct | RL Training | GRPO | 85.9±6.9 | 69.5±4.8 | 82.7±6.6 | 73.7±6.8 | 65.4±8.4 | 62.6±6.3 | 73.2±4.6 | 80.5±2.1 | 66.8±1.7 |
| Qwen2.5-7B-Instruct | RL Training | GiGPO | 96.2±3.9 | 90.9±9.1 | 95.5±5.1 | 80.9±8.7 | 72.1±8.6 | 90.4±5.1 | 88.8±4.5 | 86.3±2.7 | 73.3±1.9 |
| Qwen2.5-7B-Instruct | RL Training | GAGPO (Ours) | 97.8±1.6 | 97.8±3.1 | 95.8±5.9 | 97.6±3.3 | 92.1±3.0 | 92.6±5.4 | 95.6±0.9 | 90.3±1.2 | 77.5±3.0 |
Figure 2: Learning dynamics on ALFWorld and WebShop over the first 120 training steps for Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. The figure reports ALFWorld success rate, WebShop success rate, and WebShop task score. Across both backbones, GAGPO improves faster than GiGPO and GRPO in the early stage of training and maintains stronger overall performance throughout most of training.
## 3 Method
Generalized Advantage Grouped Policy Optimization (GAGPO) is a critic-free RL algorithm for multi-turn agentic training (Figure [1](https://arxiv.org/html/2605.13217#S1.F1)). Building on the PPO-style grouped optimization framework, GAGPO replaces direct Monte Carlo-style relative advantages with a temporally propagated step-level estimator, and uses a shared sequence-level importance ratio aligned with the action boundary rather than individual tokens. The key idea is to construct a non-parametric value proxy from grouped rollouts and compute TD/GAE-style advantages over environment steps without an additional critic. This design provides (i) *step alignment* with the agent’s decision boundary, (ii) *temporal credit propagation* of delayed outcomes, and (iii) *critic-free bootstrapping*.
Formally, for a given task instance, consider a rollout group $\mathcal{T}=\{\tau^{(i)}\}_{i=1}^{K}$, where $\tau^{(i)}=\{(s^{(i)}_{t},a^{(i)}_{t},r^{(i)}_{t})\}_{t=1}^{T_i}$ and each action $a^{(i)}_{t}=(y^{(i)}_{t,1},\dots,y^{(i)}_{t,m^{(i)}_{t}})$ is a token sequence.
### 3.1 Step-Aligned Grouped Temporal Credit Assignment
Since rewards are sparse and delayed while policy updates operate at the token level, GAGPO treats each *environment step* as the unit of credit assignment: all tokens within the same action $a^{(i)}_{t}$ share a single step-level advantage $\hat{A}^{(i)}_{t}$.
To construct critic-free temporal credit signals, GAGPO organizes rollout steps into state-consistent groups. For each state $s$, the corresponding step group $\mathcal{G}(s)=\{(i,t)\mid s^{(i)}_{t}=s\}$ gathers all occurrences of $s$ across the rollout group, built entirely from collected trajectories at no extra rollout cost.
For each sampled step $(i,t)$, its discounted return is defined as
$$\hat{R}^{(i)}_{t}=\sum_{u=t}^{T_i}\gamma^{\,u-t}\,r^{(i)}_{u},$$
where $\gamma\in[0,1]$ is the discount factor. A non-parametric grouped value proxy for state $s$ is constructed by averaging the discounted returns of steps in the same group:
$$\bar{V}(s)=\frac{1}{|\mathcal{G}(s)|}\sum_{(j,u)\in\mathcal{G}(s)}\hat{R}^{(j)}_{u}.$$
Based on this grouped value proxy, GAGPO computes a temporal-difference residual at each step:
$$\delta^{(i)}_{t}=r^{(i)}_{t}+\gamma\,\bar{V}(s^{(i)}_{t+1})-\bar{V}(s^{(i)}_{t}),$$
where $\bar{V}(s^{(i)}_{T_i+1})=0$ for terminal states. The step-level temporal advantage is then defined recursively in a GAE-style manner:
$$\hat{A}^{(i)}_{t}=\delta^{(i)}_{t}+\gamma\lambda\,\hat{A}^{(i)}_{t+1},\qquad(1)$$
where $\lambda\in[0,1]$ controls the bias–variance trade-off in temporal credit propagation and the recursion is initialized with $\hat{A}^{(i)}_{T_i+1}=0$. Equivalently,
$$\hat{A}^{(i)}_{t}=\sum_{l=0}^{T_i-t}(\gamma\lambda)^{l}\,\delta^{(i)}_{t+l}.$$
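The three steps of this subsection (discounted returns, grouped value proxy, TD/GAE-style recursion) can be sketched as follows. This is an illustrative reading of the equations, not the authors' implementation: the function name `gagpo_step_advantages`, the `(state, reward)` trajectory format, and the toy state keys `"s0"`, `"a1"`, `"b1"` are all assumptions, and states are grouped by exact equality of a hashable key such as the observation text.

```python
from collections import defaultdict


def gagpo_step_advantages(trajectories, gamma, lam):
    """Return (advantages, v_bar) for one rollout group.

    trajectories: list of rollouts, each a list of (state, reward) pairs,
    where state is any hashable key compared by exact match.
    advantages: one list of raw (unnormalized) step advantages per rollout.
    v_bar: dict mapping each state to its grouped value proxy.
    """
    # 1) Discounted return R_hat for every step (i, t).
    returns = []
    for traj in trajectories:
        ret, acc = [0.0] * len(traj), 0.0
        for t in reversed(range(len(traj))):
            acc = traj[t][1] + gamma * acc
            ret[t] = acc
        returns.append(ret)

    # 2) Grouped value proxy: mean return over all occurrences of a state.
    totals, counts = defaultdict(float), defaultdict(int)
    for traj, ret in zip(trajectories, returns):
        for (state, _), r_hat in zip(traj, ret):
            totals[state] += r_hat
            counts[state] += 1
    v_bar = {state: totals[state] / counts[state] for state in totals}

    # 3) TD residuals and backward recursion; the proxy is 0 after the last step.
    advantages = []
    for traj in trajectories:
        adv, running = [0.0] * len(traj), 0.0
        for t in reversed(range(len(traj))):
            state, reward = traj[t]
            v_next = v_bar[traj[t + 1][0]] if t + 1 < len(traj) else 0.0
            delta = reward + gamma * v_next - v_bar[state]
            running = delta + gamma * lam * running
            adv[t] = running
        advantages.append(adv)
    return advantages, v_bar


# Toy group: two rollouts share the start state and diverge afterwards.
group = [
    [("s0", 0.0), ("a1", 1.0)],  # succeeds
    [("s0", 0.0), ("b1", 0.0)],  # fails
]
adv, v_bar = gagpo_step_advantages(group, gamma=1.0, lam=1.0)
print(adv)  # [[0.5, 0.0], [-0.5, 0.0]]
```

In this toy group the shared start state gets a proxy value of 0.5, so all credit lands on the first action, where the two rollouts diverge, and the second step of each rollout receives zero advantage.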
### 3.2 Localized Objective and Group-Normalized PPO Optimization
Many grouped policy optimization methods combine local step-level signals with a trajectory-level reward or relative advantage. However, adding the same episode-level offset to every step makes all actions share an identical global component regardless of their temporal position, reducing contrast among intermediate decisions. GAGPO instead uses the temporal advantage in Eq. [1](https://arxiv.org/html/2605.13217#S3.E1) as the sole optimization signal: episode-level outcomes still influence earlier decisions through temporal recursion, without imposing a uniform offset on all steps.
Although the temporal estimator improves credit localization, step advantage magnitudes still vary across tasks and rollout groups. Since batch-level normalization mixes heterogeneous tasks and disrupts the within-group structure, GAGPO applies *group normalization*: let $\mathcal{B}$ denote all sampled steps in the same rollout group; $\hat{A}^{(i)}_{t}$ is standardized as $A^{(i)}_{t}=(\hat{A}^{(i)}_{t}-\mu_{\mathcal{B}})/(\sigma_{\mathcal{B}}+\epsilon)$, where $\mu_{\mathcal{B}},\sigma_{\mathcal{B}}$ are group statistics, preserving within-group comparisons while mitigating cross-task scale variation.
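A minimal sketch of this group normalization is given below. The helper name `group_normalize`, the use of the population standard deviation, and the default `eps=1e-8` are assumptions made for illustration; the text does not specify these details.

```python
import math


def group_normalize(step_advantages, eps=1e-8):
    """Standardize raw step advantages over all steps of one rollout group.

    step_advantages: list of per-rollout lists of raw advantages.
    Uses the population standard deviation (an assumption).
    """
    flat = [a for traj in step_advantages for a in traj]
    mu = sum(flat) / len(flat)
    sigma = math.sqrt(sum((a - mu) ** 2 for a in flat) / len(flat))
    return [[(a - mu) / (sigma + eps) for a in traj] for traj in step_advantages]
```

The statistics are pooled over every step in the group, not per trajectory, so the relative ordering of steps within a group is preserved after scaling.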
Finally, GAGPO optimizes the policy with a PPO-style clipped objective. Since each action $a^{(i)}_{t}$ is a sequence of tokens, similar to Zheng et al. ([2025](https://arxiv.org/html/2605.13217#bib.bib10)), the same normalized step-level advantage $A^{(i)}_{t}$ is assigned to all tokens within that action. Rather than clipping token-wise importance ratios independently, a length-normalized ratio is computed for each action sequence by averaging token-level log-ratios within the action and exponentiating:
$$s^{(i)}_{t}(\theta)=\exp\!\left(\frac{1}{m^{(i)}_{t}}\sum_{k=1}^{m^{(i)}_{t}}\log\frac{\pi_{\theta}(y^{(i)}_{t,k}\mid s^{(i)}_{t},y^{(i)}_{t,<k})}{\pi_{\theta_{\mathrm{old}}}(y^{(i)}_{t,k}\mid s^{(i)}_{t},y^{(i)}_{t,<k})}\right),$$
where $m^{(i)}_{t}$ is the number of valid tokens in action $a^{(i)}_{t}$. This defines a sequence-level ratio for the entire action, while normalizing for action length.
The clipped objective is then written as
$$\mathcal{L}_{\mathrm{GAGPO}}(\theta)=\mathbb{E}_{(i,t)}\Big[\min\Big(s^{(i)}_{t}(\theta)\,A^{(i)}_{t},\ \mathrm{clip}\big(s^{(i)}_{t}(\theta),1-\epsilon,1+\epsilon\big)\,A^{(i)}_{t}\Big)\Big]-\beta\,D_{\mathrm{KL}}\!\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right),$$
where $\epsilon$ is the PPO clipping coefficient, $\beta$ controls the KL penalty strength, and $\pi_{\mathrm{ref}}$ is a reference policy.
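The action-level ratio and the clipped term for a single step can be sketched as below. This is a scalar illustration under stated assumptions, not the training code: the names `action_ratio` and `clipped_surrogate`, the default `clip_eps=0.2`, and the per-token log-probability lists are hypothetical, and the KL penalty is omitted.

```python
import math


def action_ratio(logp_new, logp_old):
    """Length-normalized sequence-level ratio for one action.

    logp_new / logp_old: per-token log-probabilities of the action's valid
    tokens under the current and the old policy.
    """
    assert logp_new and len(logp_new) == len(logp_old)
    mean_log_ratio = sum(n - o for n, o in zip(logp_new, logp_old)) / len(logp_new)
    return math.exp(mean_log_ratio)


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped term for one (rollout, step) pair, without the KL penalty."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

Because every token in the action shares one ratio and one advantage, the clipping decision is made once per action rather than once per token.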
Overall, GAGPO preserves the simplicity and efficiency of grouped policy optimization while introducing a temporally propagated and step\-aligned credit signal for multi\-turn agent training\.
Figure 3: Average episode length on ALFWorld and WebShop over the first 120 training steps for Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct.
Figure 4: Optimization and advantage statistics of GAGPO and GiGPO on ALFWorld over the first 120 training steps, including gradient norm, entropy loss, and summary statistics of step-level advantages. Compared with GiGPO, GAGPO exhibits smoother gradient dynamics, faster entropy reduction, lower advantage variance, and substantially tighter advantage extrema, indicating more stable optimization and lower-variance credit signals.
Figure 5: Distribution of normalized step-level advantages at training steps 60 and 120 on ALFWorld. The gray region marks $[-1,1]$, and the inset reports the interquartile range (IQR) and the fraction of large-magnitude advantages with $|A|>1$. GAGPO shows smaller spread and lower tail mass than GiGPO at both stages.
## 4 Experiments
### 4.1 Experimental Setup
#### Benchmarks\.
We evaluate GAGPO on two representative multi-turn agent benchmarks, ALFWorld (Shridhar et al., [2021](https://arxiv.org/html/2605.13217#bib.bib25)) and WebShop (Yao et al., [2023a](https://arxiv.org/html/2605.13217#bib.bib26)). ALFWorld requires sequential decision making over embodied household tasks such as finding, manipulating, and composing objects, while WebShop evaluates interactive decision making in an online shopping environment via multi-turn search, comparison, and selection. Both environments are purely text-based with structured, deterministic observations, allowing us to aggregate same-state occurrences via exact textual match when constructing the grouped value proxy $\bar{V}(s)$. Episodes terminate upon task success or upon reaching a fixed interaction budget (50 steps for ALFWorld, 15 for WebShop). Following prior work (Feng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib20); He et al., [2026](https://arxiv.org/html/2605.13217#bib.bib21)), the evaluation reports category-wise success rates and overall average success on ALFWorld, and the average score and success rate on WebShop.
#### Baselines\.
GAGPO is compared against both prompting-based and RL-based baselines. Prompting baselines include direct prompting, ReAct (Yao et al., [2023b](https://arxiv.org/html/2605.13217#bib.bib27)), and Reflexion (Shinn et al., [2023](https://arxiv.org/html/2605.13217#bib.bib28)) on Qwen2.5 backbones, as well as strong closed-source models such as GPT-4o and Gemini-2.5-Pro. RL baselines include PPO (Ouyang et al., [2022](https://arxiv.org/html/2605.13217#bib.bib1)) with a learned critic, RLOO (Ahmadian et al., [2024](https://arxiv.org/html/2605.13217#bib.bib7)), GRPO (Shao et al., [2024](https://arxiv.org/html/2605.13217#bib.bib6)), and GiGPO (Feng et al., [2025](https://arxiv.org/html/2605.13217#bib.bib20)). For prompting baselines, PPO, and RLOO, we follow the results reported by GiGPO under the same backbones, environments, and evaluation protocols. For GRPO and GiGPO, we re-run the baselines under the same training and evaluation pipeline as GAGPO to ensure a controlled comparison. The implementation follows GiGPO exactly in all training and evaluation settings, except for the proposed credit assignment mechanism used by GAGPO.
Table 2: Ablation study on key components of GAGPO. We report ALFWorld overall success rate, WebShop average score, and WebShop success rate on both Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct backbones.
#### Implementation details\.
Experiments are conducted on Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. To ensure a controlled comparison, all training and evaluation settings are kept identical to GiGPO, including the rollout group size, optimizer, learning rate, batch size, mini-batch size, clipping coefficient, KL regularization, and environment settings. GAGPO introduces two method-specific hyperparameters: the discount factor $\gamma$ and the temporal propagation coefficient $\lambda$, which are set to 0.95 and 0.8, respectively, unless otherwise specified. We report the mean and standard deviation over 3 random seeds. Unless otherwise specified, main results in Table [1](https://arxiv.org/html/2605.13217#S2.T1) are reported at the final checkpoint after 160 training steps, while Figures [2](https://arxiv.org/html/2605.13217#S2.F2) and [3](https://arxiv.org/html/2605.13217#S3.F3) visualize the first 120 training steps for clarity. We provide additional hyperparameter sensitivity results and exact-match group-size statistics in Appendices [A](https://arxiv.org/html/2605.13217#A1) and [C](https://arxiv.org/html/2605.13217#A3).
### 4.2 Main Results
Table [1](https://arxiv.org/html/2605.13217#S2.T1) reports the main comparison on ALFWorld and WebShop. GAGPO consistently outperforms all RL baselines on aggregate metrics across both benchmarks and both model scales. On ALFWorld, GAGPO improves the overall score from 88.1 to 93.5 on Qwen2.5-1.5B and from 88.8 to 95.6 on Qwen2.5-7B, achieving gains of 5.4 and 6.8 points over the strongest baseline, respectively. On WebShop with the 1.5B model, GAGPO raises the score from 80.5 to 88.6 and the success rate from 66.4 to 78.1, gains of 8.1 and 11.7 points over the strongest baseline. With the 7B model, it improves the score from 86.3 to 90.3 and the success rate from 73.3 to 77.5, gains of 4.0 and 4.2 points.
### 4.3 Analysis
#### Learning dynamics and interaction efficiency\.
We further analyze the learning dynamics of GAGPO during the first 120 training steps, where differences in credit assignment are most directly reflected in optimization efficiency. For Figures [2](https://arxiv.org/html/2605.13217#S2.F2) and [3](https://arxiv.org/html/2605.13217#S3.F3), each curve reports the mean over three random seeds, shaded regions denote standard deviation, and moving-average smoothing with coefficient 0.6 is applied for visualization. Each training step corresponds to one policy update based on a batch of sampled trajectories, so the curves in Figures [2](https://arxiv.org/html/2605.13217#S2.F2) and [3](https://arxiv.org/html/2605.13217#S3.F3) reflect early-stage policy improvement under the same training budget.
As shown in Figure [2](https://arxiv.org/html/2605.13217#S2.F2), GAGPO consistently improves faster than GiGPO and GRPO across both benchmarks and both model scales. On ALFWorld, the advantage is especially pronounced: GAGPO reaches substantially higher success rates early in training and maintains a clear gap over most of the optimization trajectory. On WebShop, the gains are more moderate but remain consistent in both success rate and task score. Since WebShop task score is computed from the final reward and reflects how well the selected product satisfies the instruction, including partial matches in attributes, options, type, and price, these results suggest that GAGPO improves not only exact task completion but also the quality of the final product selection.
Figure [3](https://arxiv.org/html/2605.13217#S3.F3) provides a complementary view from interaction efficiency. Across both ALFWorld and WebShop, GAGPO generally achieves shorter average episode lengths as training progresses, with the difference being particularly visible on ALFWorld. Because unsuccessful episodes are truncated by the maximum interaction budget, episode length should not be interpreted in isolation. Instead, taken together with the higher success rates in Figure [2](https://arxiv.org/html/2605.13217#S2.F2), the lower episode lengths indicate that GAGPO reaches successful completion in fewer interaction steps on average.
Overall, these results suggest that GAGPO converts training signal into successful behavior more efficiently in the early stage of optimization, consistent with the claim that temporally propagated and step\-aligned credit assignment provides a more effective learning signal for multi\-turn agent training\.
#### Optimization stability and advantage statistics\.
To better understand the source of GAGPO’s gains, we analyze optimization dynamics and step\-level advantage statistics on ALFWorld with Qwen2\.5\-1\.5B\-Instruct over the first 120 training steps\. Since GAGPO and GiGPO share the same rollout grouping and training pipeline, and differ mainly in the step\-level credit estimator, these metrics provide a direct view of whether the proposed temporal estimator yields more stable updates and lower\-variance learning signals\.
As shown in Figure [4](https://arxiv.org/html/2605.13217#S3.F4), GAGPO exhibits consistently smoother optimization dynamics than GiGPO. After the initial warm-up phase, the gradient norm under GAGPO remains lower and less volatile, whereas GiGPO shows frequent high-amplitude fluctuations throughout training. The entropy loss also decreases faster and more monotonically under GAGPO, suggesting that the policy converts exploration into more confident task-specific behavior earlier in training.
The advantage statistics in Figure [5](https://arxiv.org/html/2605.13217#S3.F5) further support this trend. Although both methods keep the normalized mean close to zero, GAGPO produces a more concentrated distribution with substantially lower tail mass. At step 60, GAGPO reduces the interquartile range from 0.67 to 0.33 and the fraction of large-magnitude advantages with $|A|>1$ from 27.3% to 14.9% compared with GiGPO. The gap becomes more pronounced at step 120, where GiGPO exhibits a much broader distribution with an IQR of 1.61 and 43.1% large-magnitude advantages, while GAGPO maintains a compact distribution with an IQR of 0.56 and only 17.2% large-magnitude advantages. Together with the smoother gradient dynamics in Figure [4](https://arxiv.org/html/2605.13217#S3.F4), these results indicate that GAGPO reduces extreme step-level credit signals rather than merely shifting the advantage mean.
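The two summary statistics quoted here, the IQR and the fraction of advantages with magnitude above 1, can be computed from a list of normalized advantages as sketched below. The helper name `advantage_spread` is hypothetical, and the quartile convention (the default "exclusive" method of Python's `statistics.quantiles`) is an assumption; the paper does not state which convention it uses.

```python
import statistics


def advantage_spread(advantages, threshold=1.0):
    """Return (IQR, tail fraction) for a list of normalized step advantages.

    The tail fraction is the share of entries with |A| > threshold.
    """
    q1, _, q3 = statistics.quantiles(advantages, n=4)
    tail = sum(abs(a) > threshold for a in advantages) / len(advantages)
    return q3 - q1, tail
```

Both quantities describe spread rather than location, which is why they can differ sharply between methods even when the normalized mean is close to zero for both.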
This behavior is consistent with the design of the proposed estimator\. GiGPO combines trajectory\-level relative feedback with step\-level Monte Carlo\-style signals, which can introduce large variations when delayed outcomes are assigned to multiple intermediate actions\. In contrast, GAGPO propagates outcome supervision through TD/GAE\-style temporal recursion and applies a single group\-wise normalization to the resulting step\-level advantages\. This yields a more localized and lower\-variance optimization signal, reducing the chance that PPO updates are dominated by noisy high\-magnitude advantages\. Importantly, the sharper concentration around zero should not be interpreted as weakened learning signal, since GAGPO simultaneously achieves higher task performance and smoother optimization; rather, it suggests that well\-learned or low\-disagreement states receive smaller residual updates while informative steps still provide effective credit for policy improvement\.
### 4\.4 Ablation Study
We conduct ablations to examine the role of each component in GAGPO\. As shown in Table [2](https://arxiv.org/html/2605.13217#S4.T2), the full method consistently achieves the best performance on both Qwen2\.5\-1\.5B\-Instruct and Qwen2\.5\-7B\-Instruct across ALFWorld and WebShop, showing that the gains come from the combination of temporally propagated credit assignment, localized step\-level optimization, and group normalization\.
Removing temporal recursion by setting $\lambda = 0$ leads to clear performance drops on both benchmarks, as the truncated temporal horizon fails to propagate delayed success signals\. Replacing our TD/GAE\-style estimator with the MC\-style step advantage performs slightly better than the myopic $\lambda = 0$ variant by capturing full trajectory returns, but it still falls significantly short of the full GAGPO\. This indicates that our GAE\-style temporal propagation achieves a better bias\-variance trade\-off than both the myopic \(TD\) and high\-variance \(MC\) alternatives\.
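For reference, the two ablated regimes correspond to the limiting cases of the standard GAE estimator of Schulman et al. The block below writes them out with the grouped proxy $\bar{V}$ substituted for a learned critic; the horizon $T$, the zero terminal value, and the superscript notation are assumptions made here for illustration, and the paper's MC-style ablation may differ in detail from the $\lambda = 1$ case.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma\,\bar{V}(s_{t+1}) - \bar{V}(s_t), \qquad \bar{V}(s_T) = 0, \\
\hat{A}_t^{(\lambda)} &= \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\,\delta_{t+l}, \\
\hat{A}_t^{(0)} &= \delta_t
  \quad\text{(one-step TD residual, myopic)}, \\
\hat{A}_t^{(1)} &= \sum_{l=0}^{T-t-1} \gamma^{l}\, r_{t+l} - \bar{V}(s_t)
  \quad\text{(Monte Carlo return minus baseline)}.
\end{aligned}
```

The last line follows because the value terms telescope when $\lambda = 1$. Intermediate values of $\lambda$ interpolate between the two, which is the bias-variance trade-off the ablation probes.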
To evaluate the necessity of step\-aligned policy updates, we ablate the shared action sequence importance ratio, thereby reverting to standard token\-independent PPO clipping\. This variant suffers a noticeable performance drop across both benchmarks\. This decline demonstrates that when assigning a single step\-level advantage to a multi\-token action, treating the token importance ratios independently can cause inconsistent updates and gradient tearing within the action\. By using a shared sequence\-level ratio, GAGPO ensures that the entire action remains a cohesive optimization unit\.
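The contrast between token-independent ratios and a shared action-level ratio can be sketched as follows. This is an illustration only: the shared ratio is assumed here to be the length-normalized (geometric-mean) ratio used by sequence-level methods such as GSPO, which may not match the exact form defined in the paper's method section, and all function names are hypothetical.

```python
import math

def token_ratios(new_logps, old_logps):
    """Per-token importance ratios, as in token-independent PPO clipping."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def shared_action_ratio(new_logps, old_logps):
    """One ratio for the whole action: geometric mean of the token ratios
    (an assumed, length-normalized form)."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for a single ratio."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

# A two-token action whose token log-probabilities moved in opposite directions.
new_logps, old_logps = [-1.0, -2.0], [-1.5, -1.5]
per_token = token_ratios(new_logps, old_logps)
shared = shared_action_ratio(new_logps, old_logps)
print(per_token, shared)
```

With token-level ratios, the first token is clipped while the second is not, so the same step-level advantage produces different update behavior inside one action. The shared ratio equals 1 here, so every token of the action is treated identically.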
We further compare against adding a trajectory\-level reward broadcast term to every step\. Although this variant performs better than several weaker ablations, it remains consistently worse than the full method, suggesting that directly injecting the same episode\-level offset into all steps weakens local credit assignment\. In contrast, GAGPO preserves outcome supervision through temporal recursion while avoiding indiscriminate trajectory\-wide bias\.
Finally, normalization is crucial for stable optimization\. Removing normalization causes the largest overall degradation, while replacing group normalization with standard batch normalization also hurts performance\. This suggests that advantage normalization should respect the grouped rollout structure: group\-wise normalization preserves meaningful within\-task comparisons and improves robustness across heterogeneous trajectories\.
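The difference between the two normalization variants can be illustrated with a small sketch. It assumes that a "group" is the set of step-level advantages collected from rollouts of the same task, which is our reading of "within-task comparisons"; the function names, the `eps` constant, and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize(values, eps=1e-6):
    """Zero-mean, unit-variance scaling of a list of advantages."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def batch_normalize(advantages):
    """Normalize over the whole batch, ignoring group structure."""
    return normalize(advantages)

def group_normalize(advantages, group_ids):
    """Normalize separately inside each group (assumed: one group per task)."""
    members = defaultdict(list)
    for i, g in enumerate(group_ids):
        members[g].append(i)
    out = [0.0] * len(advantages)
    for idx in members.values():
        for i, v in zip(idx, normalize([advantages[i] for i in idx])):
            out[i] = v
    return out

# Toy batch: two tasks whose raw advantages live on very different scales.
raw = [0.0, 1.0, 10.0, 12.0]
tasks = ["easy", "easy", "hard", "hard"]
grouped = group_normalize(raw, tasks)
batched = batch_normalize(raw)
print(grouped, batched)
```

Under batch normalization both steps of the first task come out negative simply because the other task has a larger scale, whereas group-wise normalization keeps the better-versus-worse comparison inside each task.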
## 5 Conclusion
We presented GAGPO, a critic\-free grouped policy optimization method for multi\-turn agentic RL\. By aligning credit assignment with environment steps, propagating sparse outcome supervision through a TD/GAE\-style temporal recursion over a non\-parametric grouped value proxy, and applying group\-wise normalization, GAGPO provides a step\-aligned, temporally consistent, and stable training signal for LLM agents\. Experiments on ALFWorld and WebShop with Qwen2\.5\-1\.5B/7B\-Instruct show consistent gains over strong RL baselines, with improved learning efficiency and optimization stability\. Future work includes extending the grouping mechanism beyond exact state matches to support approximate state aggregation in partially observed environments\.
## 6 Limitations
While GAGPO provides a simple and effective framework for step\-aligned credit assignment in multi\-turn agentic RL, several limitations remain\.
#### Reliance on exact state matching\.
The non\-parametric grouped value proxy $\bar{V}(s)$ is constructed by aggregating rollouts that share the same environment state $s$\. In ALFWorld and WebShop, observations are textual and deterministic, so the same state can be identified via exact string matching\. In environments with stochastic observations, continuous sensory inputs, or partial observability, exact matches become rare and $|\mathcal{G}(s)|$ shrinks toward one, weakening the grouped value proxy and reducing GAGPO toward a per\-trajectory Monte Carlo estimate\. Extending GAGPO to such settings will require approximate state aggregation, such as embedding\-based clustering or learned equivalence relations\.
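A minimal sketch of exact-match grouping shows why the group size collapses when observations are noisy. The function name `group_steps_by_state`, the rollout format (lists of observation strings), and the toy observations are assumptions for illustration.

```python
from collections import defaultdict

def group_steps_by_state(rollouts):
    """Map each observation string to the (rollout, step) indices where it occurs.
    Two steps land in the same group only if their strings are identical."""
    groups = defaultdict(list)
    for i, traj in enumerate(rollouts):
        for t, obs in enumerate(traj):
            groups[obs].append((i, t))
    return dict(groups)

# Deterministic text observations: the shared start state forms a group of two.
clean = [["kitchen", "fridge open"], ["kitchen", "drawer open"]]
clean_groups = group_steps_by_state(clean)

# Hypothetical noise (a per-step tag) makes every string unique,
# so every group becomes a singleton.
noisy = [[f"{obs} #{i}-{t}" for t, obs in enumerate(traj)]
         for i, traj in enumerate(clean)]
noisy_groups = group_steps_by_state(noisy)
print(len(clean_groups), len(noisy_groups))
```

Once every group is a singleton, a mean taken over the group carries no information beyond the single trajectory it came from, which is the degradation toward a per-trajectory estimate described above.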
#### Scope of evaluation and assumptions\.
Our experiments focus on sparse episode\-end rewards, discrete text\-based actions, and two representative benchmarks, ALFWorld and WebShop, with Qwen2\.5\-1\.5B/7B\-Instruct backbones\. We do not study settings with dense process rewards, mixed reward sources, continuous or asynchronous interaction, or substantially larger and more diverse agent domains\. Further validation is needed to determine whether the same temporal estimator remains beneficial in richer long\-horizon environments\.
#### Potential risks\.
Although our experiments are conducted in closed text\-based benchmarks, stronger multi\-turn agent training may lower the barrier to deploying autonomous language agents in open environments\. If used without sufficient safeguards, such agents may execute unreliable action sequences, automate undesirable behavior, or waste external resources\. Practical deployment should therefore pair GAGPO with sandboxing, permission control, and human oversight, especially in safety\-critical settings\.
## References
- A\. Ahmadian, C\. Cremer, M\. Gallé, M\. Fadaee, J\. Kreutzer, O\. Pietquin, A\. Üstün, and S\. Hooker \(2024\) Back to basics: revisiting reinforce style optimization for learning from human feedback in llms\. arXiv:2402\.14740, [Link](https://arxiv.org/abs/2402.14740)\.
- K\. Chen, M\. Cusumano\-Towner, B\. Huval, A\. Petrenko, J\. Hamburger, V\. Koltun, and P\. Krähenbühl \(2025\) Reinforcement learning for long\-horizon interactive llm agents\. arXiv:2502\.01600, [Link](https://arxiv.org/abs/2502.01600)\.
- Z\. Ding and W\. Ye \(2025\) TreeGRPO: tree\-advantage grpo for online rl post\-training of diffusion models\. arXiv:2512\.08153, [Link](https://arxiv.org/abs/2512.08153)\.
- G\. Dong, H\. Mao, K\. Ma, L\. Bao, Y\. Chen, Z\. Wang, Z\. Chen, J\. Du, H\. Wang, F\. Zhang, G\. Zhou, Y\. Zhu, J\. Wen, and Z\. Dou \(2025\) Agentic reinforced policy optimization\. arXiv:2507\.19849, [Link](https://arxiv.org/abs/2507.19849)\.
- L\. Feng, Z\. Xue, T\. Liu, and B\. An \(2025\) Group\-in\-group policy optimization for llm agent training\. arXiv:2505\.10978, [Link](https://arxiv.org/abs/2505.10978)\.
- C\. Gao, C\. Zheng, X\. Chen, K\. Dang, S\. Liu, B\. Yu, A\. Yang, S\. Bai, J\. Zhou, and J\. Lin \(2025\) Soft adaptive policy optimization\. arXiv:2511\.20347, [Link](https://arxiv.org/abs/2511.20347)\.
- Gemini 2\.5 Team \(2025\) Gemini 2\.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities\. arXiv:2507\.06261, [Link](https://arxiv.org/abs/2507.06261)\.
- GPT\-5 Team \(2025\) OpenAI gpt\-5 system card\. arXiv:2601\.03267, [Link](https://arxiv.org/abs/2601.03267)\.
- S\. He, L\. Feng, Q\. Wei, X\. Cheng, L\. Feng, and B\. An \(2026\) Hierarchy\-of\-groups policy optimization for long\-horizon agentic tasks\. arXiv:2602\.22817, [Link](https://arxiv.org/abs/2602.22817)\.
- B\. Jin, H\. Zeng, Z\. Yue, J\. Yoon, S\. Arik, D\. Wang, H\. Zamani, and J\. Han \(2025\) Search\-r1: training llms to reason and leverage search engines with reinforcement learning\. arXiv:2503\.09516, [Link](https://arxiv.org/abs/2503.09516)\.
- C\. Li, A\. Elmahdy, A\. Boyd, Z\. Wang, S\. Zeng, A\. Garcia, P\. Bhatia, T\. Kass\-Hout, C\. Xiao, and M\. Hong \(2026a\) Stabilizing off\-policy training for long\-horizon llm agent via turn\-level importance sampling and clipping\-triggered normalization\. arXiv:2511\.20718, [Link](https://arxiv.org/abs/2511.20718)\.
- J\. Li, P\. Zhou, R\. Meng, M\. P\. Vadera, L\. Li, and Y\. Li \(2026b\) Turn\-ppo: turn\-level advantage estimation with ppo for improved multi\-turn rl in agentic llms\. arXiv:2512\.17008, [Link](https://arxiv.org/abs/2512.17008)\.
- X\. Liu, K\. Wang, Y\. Wu, F\. Huang, Y\. Li, J\. Zhang, and J\. Jiao \(2025\) Agentic reinforcement learning with implicit step rewards\. arXiv:2509\.19199, [Link](https://arxiv.org/abs/2509.19199)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\) Training language models to follow instructions with human feedback\. arXiv:2203\.02155, [Link](https://arxiv.org/abs/2203.02155)\.
- Qwen2\.5 Team \(2025\) Qwen2\.5 technical report\. arXiv:2412\.15115, [Link](https://arxiv.org/abs/2412.15115)\.
- Qwen3 Team \(2025\) Qwen3 technical report\. arXiv:2505\.09388, [Link](https://arxiv.org/abs/2505.09388)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, S\. Ermon, C\. D\. Manning, and C\. Finn \(2024\) Direct preference optimization: your language model is secretly a reward model\. arXiv:2305\.18290, [Link](https://arxiv.org/abs/2305.18290)\.
- J\. Schulman, P\. Moritz, S\. Levine, M\. Jordan, and P\. Abbeel \(2018\) High\-dimensional continuous control using generalized advantage estimation\. arXiv:1506\.02438, [Link](https://arxiv.org/abs/1506.02438)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\) Proximal policy optimization algorithms\. arXiv:1707\.06347, [Link](https://arxiv.org/abs/1707.06347)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. arXiv:2402\.03300, [Link](https://arxiv.org/abs/2402.03300)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\) Reflexion: language agents with verbal reinforcement learning\. arXiv:2303\.11366, [Link](https://arxiv.org/abs/2303.11366)\.
- M\. Shridhar, X\. Yuan, M\. Côté, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2021\) ALFWorld: aligning text and embodied environments for interactive learning\. arXiv:2010\.03768, [Link](https://arxiv.org/abs/2010.03768)\.
- Z\. Wang, K\. Wang, Q\. Wang, P\. Zhang, L\. Li, Z\. Yang, X\. Jin, K\. Yu, M\. N\. Nguyen, L\. Liu, E\. Gottlieb, Y\. Lu, K\. Cho, J\. Wu, L\. Fei\-Fei, L\. Wang, Y\. Choi, and M\. Li \(2025\) RAGEN: understanding self\-evolution in llm agents via multi\-turn reinforcement learning\. arXiv:2504\.20073, [Link](https://arxiv.org/abs/2504.20073)\.
- Q\. Wei, S\. Zeng, C\. Li, W\. Brown, O\. Frunza, W\. Deng, A\. Schneider, Y\. Nevmyvaka, Y\. K\. Zhao, A\. Garcia, and M\. Hong \(2025\) Reinforcing multi\-turn reasoning in llm agents via turn\-level reward design\. arXiv:2505\.11821, [Link](https://arxiv.org/abs/2505.11821)\.
- Z\. Xi, C\. Liao, G\. Li, Y\. Yang, W\. Chen, Z\. Zhang, B\. Wang, S\. Jin, Y\. Zhou, J\. Guan, W\. Wu, T\. Ji, T\. Gui, Q\. Zhang, and X\. Huang \(2025\) AgentPRM: process reward models for llm agents via step\-wise promise and progress\. arXiv:2511\.08325, [Link](https://arxiv.org/abs/2511.08325)\.
- S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2023a\) WebShop: towards scalable real\-world web interaction with grounded language agents\. arXiv:2207\.01206, [Link](https://arxiv.org/abs/2207.01206)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023b\) ReAct: synergizing reasoning and acting in language models\. arXiv:2210\.03629, [Link](https://arxiv.org/abs/2210.03629)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\) DAPO: an open\-source llm reinforcement learning system at scale\. arXiv:2503\.14476, [Link](https://arxiv.org/abs/2503.14476)\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, J\. Zhou, and J\. Lin \(2025\) Group sequence policy optimization\. arXiv:2507\.18071, [Link](https://arxiv.org/abs/2507.18071)\.
- Z\. Zong, D\. Chen, Y\. Li, Q\. Yi, B\. Zhou, C\. Li, B\. Qian, P\. Chen, and J\. Jiang \(2026\) AT2po: agentic turn\-based policy optimization via tree search\. arXiv:2601\.04767, [Link](https://arxiv.org/abs/2601.04767)\.
## Appendix A Hyperparameter Sensitivity
We provide a sensitivity study for the two method\-specific temporal hyperparameters in GAGPO, the discount factor $\gamma$ and the temporal propagation coefficient $\lambda$\. We vary these parameters on ALFWorld with Qwen2\.5\-1\.5B while keeping the remaining training pipeline unchanged, and report representative ALFWorld overall success results in Table [3](https://arxiv.org/html/2605.13217#A1.T3)\. The default setting $(\gamma = 0.95, \lambda = 0.8)$ used in the main experiments yields the strongest overall performance, while nearby settings remain competitive, suggesting that GAGPO does not depend on a brittle single choice\. In contrast, pushing $\gamma$ to 1\.0 leads to a substantial degradation, indicating that overly long\-horizon propagation amplifies estimation noise in practice\. These trends are consistent with the design motivation of GAGPO: effective temporal credit assignment requires a balanced regime between myopic propagation and high\-variance long\-horizon recursion\.
Table 3: Sensitivity study for GAGPO temporal hyperparameters on ALFWorld with Qwen2\.5\-1\.5B\. We report representative ALFWorld overall success under selected $(\gamma, \lambda)$ configurations\. The default setting used in the main paper is highlighted in bold\.
## Appendix B Training Configuration
For reproducibility, we summarize the key training settings used in our implementation in Table [4](https://arxiv.org/html/2605.13217#A2.T4)\. Our implementation follows the official GiGPO/verl\-agent training pipeline \([https://github\.com/langfengQ/verl\-agent](https://github.com/langfengQ/verl-agent)\)\. Unless otherwise specified, we keep all shared training and evaluation settings identical to the GiGPO baseline in our controlled experiments, including the environment setup, prompting format, rollout pipeline, evaluation protocol, optimizer type, batch construction, clipping coefficient, and KL regularization\. The only differences are the proposed credit assignment mechanism and the method\-specific temporal hyperparameters analyzed in Appendix [A](https://arxiv.org/html/2605.13217#A1)\.
Table 4: Key training configurations used in the main experiments\. Shared settings not listed here follow the official GiGPO/verl\-agent training pipeline and are kept identical across controlled comparisons\.
## Appendix C Exact\-Match Group Size Statistics
Because both GAGPO and GiGPO construct rollout groups via exact textual state matching, one possible concern is that the gains of GAGPO might be explained by an easier grouping regime rather than by the proposed temporal estimator itself\. Table [5](https://arxiv.org/html/2605.13217#A3.T5) compares representative group\-size statistics on ALFWorld with Qwen2\.5\-1\.5B at training steps 60 and 120\. The overall grouping regime remains broadly comparable across methods: singleton groups account for only a small fraction of sampled steps in both methods, mean group sizes remain in the same range, and medium\-to\-large groups still make up roughly half of the sampled steps\. GAGPO exhibits slightly lower singleton mass and somewhat more large groups at the later stage, but it does not remove the singleton/small\-group regime or induce a qualitatively different exact\-match grouping pattern\. These results support the interpretation that GAGPO’s gains are driven mainly by improved temporal credit assignment under similar grouping conditions\.
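Statistics of this kind can be derived from the state key of each sampled step, as in the sketch below. The function name `group_size_stats` and the toy keys are hypothetical, and the averaging convention for the mean group size (over groups rather than over steps) is an assumption, since the paper only states that percentages are measured over sampled steps.

```python
from collections import Counter

def group_size_stats(state_keys):
    """state_keys: one exact-match key per sampled step.
    Returns the fraction of steps that sit in a singleton group
    and the mean group size (assumed: averaged over groups)."""
    counts = Counter(state_keys)
    n_steps = len(state_keys)
    singleton_steps = sum(1 for key in state_keys if counts[key] == 1)
    return {
        "singleton_fraction": singleton_steps / n_steps,
        "mean_group_size": n_steps / len(counts),
    }

# Toy keys only: three groups of sizes 3, 1 and 2.
stats = group_size_stats(["a", "a", "a", "b", "c", "c"])
print(stats)
```

Comparing these quantities between two methods at the same training step is one way to check whether they operate under a similar grouping regime.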
Table 5: Representative exact\-match group\-size statistics on ALFWorld with Qwen2\.5\-1\.5B\. Percentages are measured over sampled steps\.