A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

Summary

This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.

Reinforcement learning for agentic large language models (LLMs) typically relies on a sparse, trajectory-level outcome reward, making it difficult to evaluate the contribution of individual tool-calls within multi-turn interactions. Existing approaches to such process credit assignment either depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy's predicted probability of the ground-truth, termed Information Gain (IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i) turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii) variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii) adaptive turn-level clipping: modulates each turn's clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.

Original Article

View Cached Full Text

Cached at: 05/08/26, 07:09 AM

Paper page - A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Source: https://huggingface.co/papers/2605.06200

Abstract

Reinforcement learning for agentic LLMs suffers from sparse rewards and challenges in credit assignment, which are addressed through A²TGPO that adapts information gain normalization, accumulation, and clipping for improved policy optimization.

Reinforcement learningforagentic large language models(LLMs) typically relies on a sparse,trajectory-level outcome reward, making it difficult to evaluate the contribution of individualtool-callswithin multi-turn interactions. Existing approaches to suchprocess credit assignmenteither depend on separate external process reward models that introduce additional consumption, or tree-based structural rollout that merely redistributes the outcome signal while constraining trajectory diversity. A promising alternative leverages the per-turn change in the policy’s predicted probability of the ground-truth, termedInformation Gain(IG), as an intrinsic process signal without an external evaluator. However, prior work on leveraging IG signals within the RL training loop faces three systematic challenges: normalizing across turns that face heterogeneous positional contexts can distort the relative standing of individual turns, accumulating a variable number of terms causes advantage magnitudes to drift with trajectory depth, and a fixed clipping range governs policy updates identically for turns with vastly different IG signals. In this paper, we propose A^2TGPO (Agentic Turn-GroupPolicy OptimizationwithAdaptive Turn-level Clipping), which retains IG as the intrinsic signal but re-designs how it is normalized, accumulated, and consumed: (i)turn-group normalization: normalizes IG within each (prompt, turn-index) group so that each turn is compared only against peers at the same interaction depth; (ii)variance-rescaled discounted accumulation: divides cumulative normalized IG by square root of accumulated terms to keep advantage magnitudes comparable across turn positions; and (iii)adaptive turn-level clipping: modulates each turn’s clipping range based on its normalized IG, widening the update region for informative turns and narrowing it for uninformative ones.

View arXiv page View PDF GitHub1 Add to collection

Get this paper in your agent:

hf papers read 2605\.06200

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.06200 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.06200 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.06200 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Paper page - A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

GAGPO: Generalized Advantage Grouped Policy Optimization

APPO: Agentic Procedural Policy Optimization

GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Gradient Extrapolation-Based Policy Optimization

Submit Feedback

Similar Articles

GAGPO: Generalized Advantage Grouped Policy Optimization

APPO: Agentic Procedural Policy Optimization

GD^2PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Gradient Extrapolation-Based Policy Optimization