ESPO: Early-Stopping Proximal Policy Optimization

Hugging Face Daily Papers

Summary

ESPO introduces an early-stopping mechanism for reinforcement learning that detects and terminates failed reasoning trajectories in LLMs, improving mathematical reasoning performance while reducing compute by over 20%.

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
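
The abstract does not define the surrogate regret, the smoothing, or the stopping threshold, so the following is only a minimal Python sketch under stated assumptions. Here the regret is assumed to be the log-probability gap between the most likely token and the sampled token, the smoothing is assumed to be an exponential moving average, and the "estimated value" is assumed to be a fixed per-step budget. The names surrogate_regret, EarlyStopDetector, expected_per_step, margin, and alpha are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surrogate_regret(logits: Sequence[float], sampled: int) -> float:
    """ASSUMED proxy: log-prob gap between the greedy token and the sampled token."""
    logp = log_softmax(logits)
    return max(logp) - logp[sampled]


class EarlyStopDetector:
    """Hypothetical detector comparing smoothed cumulative regret to a budget."""

    def __init__(self, expected_per_step: float, margin: float, alpha: float = 0.5):
        self.expected_per_step = expected_per_step  # assumed regret budget per step
        self.margin = margin                        # assumed slack before stopping
        self.alpha = alpha                          # EMA smoothing factor
        self.ema = 0.0
        self.cumulative = 0.0
        self.steps = 0

    def update(self, logits: Sequence[float], sampled: int) -> bool:
        """Consume one sampling step; return True if the rollout should stop."""
        regret = surrogate_regret(logits, sampled)
        self.ema = self.alpha * regret + (1.0 - self.alpha) * self.ema
        self.cumulative += self.ema
        self.steps += 1
        expected = self.expected_per_step * self.steps
        return self.cumulative > expected + self.margin


if __name__ == "__main__":
    detector = EarlyStopDetector(expected_per_step=0.1, margin=1.0)
    step_logits = [2.0, 0.0, 0.0]
    for token in [0, 1, 1]:  # one greedy token, then two unlikely ones
        print(token, detector.update(step_logits, token))
```

Because the sketch reuses the logits produced during sampling, it needs no extra forward pass; the actual statistic and threshold used by ESPO may differ from these choices.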

Cached at: 06/02/26, 03:24 AM


Source: https://huggingface.co/papers/2605.29860

Abstract

ESPO improves mathematical reasoning in large language models by detecting and terminating failed trajectories early, leading to better performance and reduced computational waste.

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
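
To illustrate how a terminal reward on a truncated trajectory can concentrate negative TD error at the stopping point, here is a small Python sketch. It assumes zero intermediate reward, a value of 0 for the absorbing failure state, and a terminal reward of -1.0; the abstract does not state the reward value, and the helper td_errors and the critic values below are made up for illustration.

```python
from typing import List


def td_errors(values: List[float], terminal_reward: float, gamma: float = 1.0) -> List[float]:
    """One-step TD errors for a trajectory that ends in an absorbing state.

    values[t] is the critic's estimate V(s_t). Intermediate rewards are
    assumed to be zero and the absorbing state is assumed to have value 0.
    """
    deltas = []
    last = len(values) - 1
    for t, v in enumerate(values):
        if t == last:
            # Final transition enters the absorbing failure state.
            deltas.append(terminal_reward - v)
        else:
            # Ordinary transition: delta_t = gamma * V(s_{t+1}) - V(s_t).
            deltas.append(gamma * values[t + 1] - v)
    return deltas


if __name__ == "__main__":
    # Hypothetical flat critic that has not yet noticed the failure.
    critic_values = [0.5, 0.5, 0.5, 0.5]
    print(td_errors(critic_values, terminal_reward=-1.0))
    # [0.0, 0.0, 0.0, -1.5]
```

With a flat critic the only nonzero error sits on the truncated step. In a real PPO setup the critic values vary and advantage estimation spreads this signal backward, so the picture is less clean than this toy case.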


Get this paper in your agent:

hf papers read 2605.29860

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG

The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

arXiv cs.CL

Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.

CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning

arXiv cs.AI

This paper proposes Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method for safe reinforcement learning that incorporates local constraint sensitivity to improve safety recovery and reduce oscillations near safety boundaries, achieving higher constrained returns on navigation and locomotion benchmarks.