APPO: Agentic Procedural Policy Optimization

Hugging Face Daily Papers 06/10/26, 05:47 PM Papers

reinforcement-learning agentic-rl tool-use credit-assignment policy-optimization language-model hf-paper

Summary

APPO improves multi-turn tool use in LLM agents by moving branching and credit assignment to fine-grained decision points and applying procedure-level advantage scaling, improving strong agentic RL baselines by nearly 4 points across 13 benchmarks.

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.
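The abstract does not give the Branching Score formula, so the following is only a minimal sketch of the general idea: weight token entropy by a likelihood gain over the continuation, so that high-entropy positions with no downstream gain score zero. Everything here is an assumption made for illustration, not the paper's definition: the function names, the use of a reference policy to measure the gain, and the product form of the score.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def likelihood_gain(policy_logps, ref_logps, start):
    """Mean log-prob gain (policy minus reference) over tokens after `start`.

    Assumption: "policy-induced likelihood gain" is approximated here by
    comparing the current policy against a hypothetical reference policy.
    """
    diffs = [p - r for p, r in zip(policy_logps[start + 1:], ref_logps[start + 1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

def branching_scores(dists, policy_logps, ref_logps):
    """Illustrative score: entropy times the non-negative part of the gain."""
    scores = []
    for t, dist in enumerate(dists):
        gain = max(likelihood_gain(policy_logps, ref_logps, t), 0.0)
        scores.append(token_entropy(dist) * gain)
    return scores

def top_k_positions(scores, k):
    """Indices of the k highest-scoring positions, returned in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

# Toy example: positions 0 and 2 are equally uncertain, but only position 2
# is followed by a continuation that the policy favors over the reference.
dists = [[0.5, 0.5], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
policy_logps = [-1.0, -0.5, -1.0, -0.2]
ref_logps = [-1.0, -0.2, -0.9, -0.4]
scores = branching_scores(dists, policy_logps, ref_logps)
selected = top_k_positions(scores, k=1)
```

In this toy input, entropy alone would rank positions 0 and 2 equally; the gain term is what separates them, which mirrors the abstract's point about filtering spurious high-entropy positions.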

Original Article

Cached at: 06/15/26, 09:03 AM

Paper page - APPO: Agentic Procedural Policy Optimization

Source: https://huggingface.co/papers/2606.12384

Abstract

APPO is an agentic reinforcement learning method that improves multi-turn tool use by refining branching decisions and credit assignment through fine-grained decision points and procedure-level advantage scaling.

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.
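The abstract names procedure-level advantage scaling without defining it. As a rough illustration of what distributing credit across branched rollouts could look like, the sketch below normalizes rewards within a group of sibling rollouts that share a prefix up to a branching position, then applies a smaller weight to the shared prefix tokens than to the post-branch tokens. The group normalization, the `prefix_scale` parameter, and all function names are hypothetical choices for this example and are not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Reward of each sibling rollout, normalized by group mean and std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def scaled_token_advantages(rewards, branch_pos, lengths, prefix_scale=0.5):
    """Per-token advantages for rollouts sharing tokens [0, branch_pos).

    Assumption: shared-prefix tokens get a down-weighted advantage
    (hypothetical `prefix_scale`), post-branch tokens get the full one.
    """
    advantages = group_advantages(rewards)
    per_token = []
    for adv, n_tokens in zip(advantages, lengths):
        per_token.append(
            [adv * prefix_scale if t < branch_pos else adv for t in range(n_tokens)]
        )
    return per_token

# Two sibling rollouts branched at token 2: one succeeds, one fails.
token_advs = scaled_token_advantages(
    rewards=[1.0, 0.0], branch_pos=2, lengths=[4, 3]
)
```

With these inputs the successful rollout gets positive advantages and the failed one negative, and in both the tokens after the branching position carry twice the weight of the shared prefix.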

Get this paper in your agent:

hf papers read 2606.12384

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

GAGPO: Generalized Advantage Grouped Policy Optimization

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

Proximal Policy Optimization

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

SocraticPO: Policy Optimization via Interactive Guidance
