StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
Summary
StepPO introduces a step-centric paradigm for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming token-centric methods in multi-turn interaction tasks.
Source: https://huggingface.co/papers/2604.18401
Abstract
StepPO introduces a step-centric approach for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming existing token-centric methods in multi-turn interaction tasks.
Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
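To make the granularity argument concrete, the sketch below shows one way credit could be assigned per interaction step rather than per token. It is not the StepPO algorithm, whose details the abstract does not give. It assumes a critic-based setup and uses standard generalized advantage estimation with the step as the time index. The `Step` container, the function names, and the choice to share one advantage across all tokens of a step are hypothetical and for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent-environment interaction step (hypothetical container)."""
    num_action_tokens: int  # tokens the policy generated for this step's action
    reward: float           # reward observed after the action is executed
    value: float            # assumed critic estimate of the step's state value


def step_level_gae(steps: List[Step], gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Standard GAE recursion, indexed by interaction step instead of by token."""
    advantages = [0.0] * len(steps)
    next_value = 0.0  # value after the final step is taken to be zero
    running = 0.0
    for k in reversed(range(len(steps))):
        # One-step TD error between consecutive steps of the trajectory.
        delta = steps[k].reward + gamma * next_value - steps[k].value
        running = delta + gamma * lam * running
        advantages[k] = running
        next_value = steps[k].value
    return advantages


def broadcast_to_tokens(steps: List[Step], advantages: List[float]) -> List[float]:
    """Give every token of a step that step's advantage (an illustrative choice)."""
    per_token: List[float] = []
    for step, adv in zip(steps, advantages):
        per_token.extend([adv] * step.num_action_tokens)
    return per_token


if __name__ == "__main__":
    trajectory = [Step(num_action_tokens=3, reward=0.0, value=0.5),
                  Step(num_action_tokens=2, reward=1.0, value=0.5)]
    step_adv = step_level_gae(trajectory, gamma=1.0, lam=1.0)
    print(step_adv)
    print(broadcast_to_tokens(trajectory, step_adv))
```

In this sketch the discount and the bootstrapping apply once per step, so a long action and a short action are treated as single decisions of equal standing. A token-level formulation would instead discount across every generated token.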
Similar Articles
APPO: Agentic Procedural Policy Optimization
APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.
Proximal Policy Optimization
OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
GAGPO: Generalized Advantage Grouped Policy Optimization
GAGPO proposes a critic-free RL method that uses a non-parametric grouped value proxy for step-level credit assignment in multi-turn agentic tasks, outperforming strong baselines on ALFWorld and WebShop.
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Introduces LambdaPO, a novel reinforcement learning framework that improves upon GRPO by decomposing advantage estimation into pairwise preference comparisons and adding a semantic density reward, achieving better performance on math reasoning tasks.
GraphPO: Graph-based Policy Optimization for Reasoning Models
GraphPO is a novel graph-based reinforcement learning framework that represents rollouts as a directed acyclic graph, merging semantically equivalent reasoning paths to reduce redundant exploration and improve credit assignment for large reasoning models.