A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
Summary
This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.
Cached at: 05/08/26, 07:09 AM
Source: https://huggingface.co/papers/2605.06200
Abstract
Reinforcement learning for agentic LLMs suffers from sparse rewards and difficult credit assignment. A^2TGPO addresses both by redesigning how information gain is normalized, accumulated, and clipped during policy optimization.
Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models, which add computational overhead, or on tree-based structural rollouts, which merely redistribute the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy’s predicted probability of the ground truth, termed Information Gain (IG), as an intrinsic process signal that needs no external evaluator. However, prior work on using IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but redesigns how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides the cumulative normalized IG by the square root of the number of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn’s clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
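The three components described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no exact formulas, so the function names, the forward-looking direction of the discounted sum, the discount `gamma`, the `tanh`-based clip modulation, and the constants `base_eps`, `alpha`, `lo`, and `hi` are all assumptions chosen only to make the ideas concrete.

```python
import math
from statistics import mean, pstdev

def information_gain(probs):
    """IG per turn: change in the policy's probability of the ground truth.
    probs[0] is the probability before any turn (illustrative convention)."""
    return [after - before for before, after in zip(probs, probs[1:])]

def turn_group_normalize(ig_by_traj, eps=1e-8):
    """Normalize IG within each (prompt, turn-index) group.
    ig_by_traj holds the IG lists of all rollouts for ONE prompt;
    rollouts may have different numbers of turns."""
    out = [[0.0] * len(traj) for traj in ig_by_traj]
    for k in range(max(len(traj) for traj in ig_by_traj)):
        # Peers: only the rollouts that actually reached turn k.
        group = [traj[k] for traj in ig_by_traj if len(traj) > k]
        mu, sigma = mean(group), pstdev(group)
        for i, traj in enumerate(ig_by_traj):
            if len(traj) > k:
                out[i][k] = (traj[k] - mu) / (sigma + eps)
    return out

def rescaled_discounted_accumulation(norm_ig, gamma=0.9):
    """Assumed form: advantage at turn k is the discounted sum of normalized
    IG from turn k onward, divided by sqrt(number of accumulated terms)."""
    n = len(norm_ig)
    adv, running = [0.0] * n, 0.0
    for k in reversed(range(n)):
        running = norm_ig[k] + gamma * running
        adv[k] = running / math.sqrt(n - k)
    return adv

def adaptive_clip_range(norm_ig_k, base_eps=0.2, alpha=0.5, lo=0.05, hi=0.4):
    """Assumed modulation: widen the clip range for high normalized IG,
    narrow it for low normalized IG, bounded to [lo, hi]."""
    eps = base_eps * (1.0 + alpha * math.tanh(norm_ig_k))
    return min(max(eps, lo), hi)

# Two rollouts for the same prompt; the second stops after one turn.
rollout_probs = [[0.1, 0.3, 0.7], [0.1, 0.5]]
igs = [information_gain(p) for p in rollout_probs]
normed = turn_group_normalize(igs)
advantages = [rescaled_discounted_accumulation(t) for t in normed]
clip_ranges = [[adaptive_clip_range(x) for x in t] for t in normed]
```

Dividing by the square root of the term count mirrors the fact that a sum of n independent unit-variance terms has standard deviation sqrt(n), so without the rescaling, early turns in long rollouts would receive systematically larger advantages. In a PPO-style objective, each turn's value from `clip_ranges` would take the place of the single fixed clipping constant.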
Similar Articles
GAGPO: Generalized Advantage Grouped Policy Optimization
GAGPO proposes a critic-free RL method that uses a non-parametric grouped value proxy for step-level credit assignment in multi-turn agentic tasks, outperforming strong baselines on ALFWorld and WebShop.
APPO: Agentic Procedural Policy Optimization
APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.
GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
GD^2PO introduces a conflict-aware filtering mechanism to mitigate multi-reward conflicts in reinforcement learning for large language models, preventing signal cancellation and accelerating training efficiency.
Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization
This paper introduces Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment in reinforcement learning by contrasting model predictions under positive and negative prompts, consistently outperforming GRPO and DAPO baselines on text-to-image generation and chain-of-thought reasoning benchmarks.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.