ESPO: Early-Stopping Proximal Policy Optimization

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

ESPO introduces an early-stopping mechanism for reinforcement learning that detects and terminates failed reasoning trajectories in LLMs, improving mathematical reasoning performance while saving more than 20% of rollout tokens.

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while cumulatively saving more than 20% of rollout tokens.
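The stopping rule described above can be illustrated with a minimal sketch. The summary does not define the surrogate regret, the smoothing, or the threshold, so everything specific here is an assumption made for illustration: the regret is taken as the log-probability gap between the greedy token and the sampled token, smoothing is an exponential moving average, and "significantly exceeds its estimated values" is modeled as exceeding a per-step baseline plus a fixed margin. The names `surrogate_regret`, `first_stop_step`, `baseline`, `margin`, and `beta` are hypothetical and do not come from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def surrogate_regret(logits, sampled_token):
    """ASSUMED proxy: log-prob gap between the greedy token and the sampled one.

    It reuses only the logits produced during sampling, as the summary requires,
    but the paper's actual definition may differ.
    """
    logp = log_softmax(logits)
    return max(logp) - logp[sampled_token]


def first_stop_step(step_logits, sampled_tokens, baseline, margin=2.0, beta=0.9):
    """Return the first step where the smoothed cumulative regret exceeds
    baseline[t] + margin, or None if the rollout is never stopped.

    baseline[t] stands in for the "estimated value" of the regret at step t
    (hypothetical; how ESPO estimates it is not stated in the summary).
    """
    smoothed = 0.0
    cumulative = 0.0
    for t, (logits, token) in enumerate(zip(step_logits, sampled_tokens)):
        regret = surrogate_regret(logits, token)
        smoothed = beta * smoothed + (1.0 - beta) * regret  # EMA smoothing (assumed)
        cumulative += smoothed
        if cumulative > baseline[t] + margin:
            return t
    return None


# Toy two-token vocabulary: token 0 is always the greedy choice.
logits = [[2.0, 0.0]] * 5
flat_baseline = [0.0] * 5
print(first_stop_step(logits, [0] * 5, flat_baseline, beta=0.5))
print(first_stop_step(logits, [1] * 5, flat_baseline, beta=0.5))
```

In this toy setup a rollout that always samples the greedy token accumulates no regret and is never stopped, while one that keeps sampling the low-probability token crosses the margin after a couple of steps.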

Original Article

Cached at: 06/02/26, 03:24 AM

Source: https://huggingface.co/papers/2605.29860 Authors:

Abstract

ESPO improves mathematical reasoning in large language models by detecting and terminating failed trajectories early, leading to better performance and reduced computational waste.

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on the fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while cumulatively saving more than 20% of rollout tokens.
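To see why treating a truncated rollout as an absorbing failure state concentrates negative TD errors near the stopping point, consider the standard one-step TD error, delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), with the absorbing state's value fixed at zero. The sketch below is illustrative only: the critic values, the terminal reward of -1.0, and the function name `td_errors` are assumptions, since the abstract does not give the reward magnitude or the critic's estimates.

```python
def td_errors(values, terminal_reward, gamma=1.0):
    """One-step TD errors for a rollout truncated after len(values) steps.

    values[t] is a (hypothetical) critic estimate V(s_t). Intermediate rewards
    are assumed to be zero, and the state after the last step is treated as an
    absorbing failure state with value 0 that pays terminal_reward on entry.
    """
    deltas = []
    last = len(values) - 1
    for t, v in enumerate(values):
        if t == last:
            # Transition into the absorbing failure state: V(next) = 0.
            deltas.append(terminal_reward - v)
        else:
            deltas.append(gamma * values[t + 1] - v)
    return deltas


# A critic that was still moderately optimistic when the rollout was cut.
print(td_errors([0.5, 0.5, 0.5], terminal_reward=-1.0))  # [0.0, 0.0, -1.5]
```

With a flat critic, every TD error before the cut is zero and the entire negative signal lands on the final step, which is the step where failure was detected. Without truncation, the same penalty would only arrive at the maximum horizon, after many post-failure tokens.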

Similar Articles

LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models

SocraticPO: Policy Optimization via Interactive Guidance

Gradient Extrapolation-Based Policy Optimization

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning
