A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Hugging Face Daily Papers Papers

Summary

This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.
Original Article
View Cached Full Text

Cached at: 05/08/26, 07:09 AM

Paper page - A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Source: https://huggingface.co/papers/2605.06200

Abstract

Reinforcement learning for agentic LLMs suffers from sparse rewards and challenges in credit assignment, which are addressed through A²TGPO that adapts information gain normalization, accumulation, and clipping for improved policy optimization.

Reinforcement learningforagentic large language models(LLMs) typically relies on a sparse,trajectory-level outcome reward, making it difficult to evaluate the contribution of individualtool-callswithin multi-turn interactions. Existing approaches to suchprocess credit assignmenteither depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy’s predicted probability of the ground-truth, termedInformation Gain(IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-GroupPolicy OptimizationwithAdaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i)turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii)variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii)adaptive turn-level clipping: modulates each turn’s clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.

View arXiv pageView PDFGitHub1Add to collection

Get this paper in your agent:

hf papers read 2605\.06200

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.06200 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.06200 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.06200 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

APPO: Agentic Procedural Policy Optimization

Hugging Face Daily Papers

APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Hugging Face Daily Papers

This paper introduces Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment in reinforcement learning by contrasting model predictions under positive and negative prompts, consistently outperforming GRPO and DAPO baselines on text-to-image generation and chain-of-thought reasoning benchmarks.

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG

The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.