StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Hugging Face Daily Papers 06/05/26, 12:00 AM Papers

Summary

StepPO introduces a step-centric paradigm for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming token-centric methods in multi-turn interaction tasks.

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm used in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL: it optimizes token-level predictions, while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic units of trajectory representation. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, these components let StepPO optimize agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
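
To make the idea of step-level credit assignment concrete, here is a minimal Python sketch. It is not the StepPO algorithm itself, whose details the abstract does not give. It is a plain REINFORCE-style surrogate under stated assumptions: the `Step` class, the constant `baseline`, the discount `gamma`, and both function names are hypothetical and chosen only for illustration. The point it shows is that every token inside one agent action shares a single step-level advantage, instead of each token receiving its own credit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent decision (hypothetical container, not from the paper)."""
    token_logprobs: List[float]  # log-probability of each token the policy generated for this action
    reward: float                # environment reward observed after the action


def step_returns(steps: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted return-to-go, computed once per step rather than once per token."""
    returns: List[float] = []
    running = 0.0
    for step in reversed(steps):
        running = step.reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def step_level_loss(steps: List[Step], baseline: float = 0.0, gamma: float = 1.0) -> float:
    """REINFORCE-style surrogate loss with one advantage per step.

    The log-probability of a step is the sum of its token log-probabilities,
    so all tokens of the same action are weighted by the same advantage.
    """
    loss = 0.0
    for step, ret in zip(steps, step_returns(steps, gamma)):
        step_logprob = sum(step.token_logprobs)
        advantage = ret - baseline  # assumed: constant baseline, purely illustrative
        loss -= advantage * step_logprob
    return loss / len(steps)


# A three-step trajectory where only the final step is rewarded.
trajectory = [
    Step(token_logprobs=[-0.5, -0.5], reward=0.0),
    Step(token_logprobs=[-2.0], reward=0.0),
    Step(token_logprobs=[-0.25, -0.25], reward=1.0),
]
print(step_returns(trajectory, gamma=0.5))
print(step_level_loss(trajectory, gamma=0.5))
```

In this sketch the number of tokens in an action does not change how many credit-assignment decisions are made: a two-token action and a one-token action each receive exactly one advantage value. A token-centric variant would instead attach a separate weight to every entry of `token_logprobs`.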

Cached at: 06/16/26, 11:34 AM

Source: https://huggingface.co/papers/2604.18401

Abstract

StepPO introduces a step-centric approach for agentic reinforcement learning that aligns policy optimization with agent decision granularity, outperforming existing token-centric methods in multi-turn interaction tasks.

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm as in RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.
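
The granularity mismatch can be written out in generic policy-gradient notation. The abstract does not give the paper's own formulation, so the symbols below (the prompt $x$, tokens $y_i$, step states $s_t$, step actions $a_t$, and advantage estimates $\hat{A}$) are assumptions used only to illustrate the contrast between the two views.

```latex
% Token-level view: each generated token y_i is an action with its own advantage.
\nabla_\theta J_{\text{token}}(\theta)
  = \mathbb{E}\!\left[ \sum_{i=1}^{N} \hat{A}_i \,
      \nabla_\theta \log \pi_\theta\!\left(y_i \mid x, y_{<i}\right) \right]

% Step-level view: each interaction step t is an action a_t taken in state s_t
% (the interaction history so far), with one advantage per step.
\nabla_\theta J_{\text{step}}(\theta)
  = \mathbb{E}\!\left[ \sum_{t=1}^{T} \hat{A}_t \,
      \nabla_\theta \log \pi_\theta\!\left(a_t \mid s_t\right) \right],
\qquad
\log \pi_\theta\!\left(a_t \mid s_t\right)
  = \sum_{j=1}^{|a_t|} \log \pi_\theta\!\left(a_{t,j} \mid s_t, a_{t,<j}\right)
```

Under these assumptions, the step-level log-probability is still computed from token log-probabilities, but credit is assigned over the $T$ interaction steps rather than over the $N$ individual tokens.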

Similar Articles

APPO: Agentic Procedural Policy Optimization

Proximal Policy Optimization

GAGPO: Generalized Advantage Grouped Policy Optimization

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

GraphPO: Graph-based Policy Optimization for Reasoning Models
