APPO: Agentic Procedural Policy Optimization
Summary
APPO improves multi-turn tool use in LLM agents by moving branching decisions and credit assignment to fine-grained decision points and applying procedure-level advantage scaling. Across 13 benchmarks it improves strong agentic RL baselines by nearly 4 points.
Cached at: 06/15/26, 09:03 AM
Source: https://huggingface.co/papers/2606.12384
Abstract
APPO is an agentic reinforcement learning method that improves multi-turn tool-use capabilities by refining branching decisions and credit assignment through fine-grained decision points and procedure-level advantage scaling.
Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: where to branch and how to assign credit after branching. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose Agentic Procedural Policy Optimization (APPO), which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.
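The abstract does not give the formula for the Branching Score, so the following is only an illustrative sketch of the general idea: weight per-token entropy by a likelihood gain over the tokens that follow, so that high-entropy positions with no downstream gain score zero. The multiplicative form, the `horizon` window, the use of a reference policy for the gain, and all function names are assumptions made for this example, not the paper's definition.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def branching_scores(token_dists, logp_new, logp_ref, horizon=2):
    """Hypothetical branching score per position (NOT the paper's formula).

    token_dists: next-token distribution at each position.
    logp_new / logp_ref: log-probability of each sampled token under the
        current policy and under an assumed reference policy.
    Score at t = entropy at t * positive part of the mean log-likelihood
    gain over the next `horizon` tokens.
    """
    n = len(token_dists)
    scores = []
    for t in range(n):
        window = range(t + 1, min(n, t + 1 + horizon))
        gains = [logp_new[i] - logp_ref[i] for i in window]
        gain = sum(gains) / len(gains) if gains else 0.0
        scores.append(token_entropy(token_dists[t]) * max(0.0, gain))
    return scores

def top_k_positions(scores, k):
    """Indices of the k highest-scoring positions, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy sequence of four tokens. Position 0 is maximally uncertain, but the
# tokens after it show no likelihood gain, so it is filtered out.
dists = [[0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
logp_new = [-1.0, -1.0, -0.2, -0.1]
logp_ref = [-1.0, -1.0, -0.2, -0.6]

scores = branching_scores(dists, logp_new, logp_ref, horizon=2)
branch_at = top_k_positions(scores, k=2)
```

In this toy input, entropy alone would rank positions 0 and 2 equally, while the combined score keeps position 2 and drops position 0, which mirrors the abstract's point that entropy by itself is not a reliable signal.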
Similar Articles
GAGPO: Generalized Advantage Grouped Policy Optimization
GAGPO proposes a critic-free RL method that uses a non-parametric grouped value proxy for step-level credit assignment in multi-turn agentic tasks, outperforming strong baselines on ALFWorld and WebShop.
IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents
This paper introduces IAPO, a reinforcement learning algorithm that improves tool-calling capabilities in multimodal small language models by aligning input attribution with a stronger teacher. Experiments on Qwen2.5-VL-3B show an average 3% improvement in visual question answering accuracy across six test sets.
Proximal Policy Optimization
OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping
This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.
SocraticPO: Policy Optimization via Interactive Guidance
SocraticPO augments RL rollouts with Socratic-style natural language guidance and reward decay to improve scientific reasoning in LLMs, outperforming strong baselines on SciKnowEval benchmarks.