ESPO: Early-Stopping Proximal Policy Optimization
Summary
ESPO introduces an early-stopping mechanism for reinforcement learning that detects and terminates failed reasoning trajectories in LLMs, improving mathematical reasoning performance while reducing compute by over 20%.
Source: https://huggingface.co/papers/2605.29860
Abstract
ESPO improves mathematical reasoning in large language models by detecting and terminating failed trajectories early, leading to better performance and reduced computational waste.
When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% of rollout tokens cumulatively.
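The stopping rule and the absorbing-state treatment described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the per-step surrogate regret is taken to be the log-probability gap between the greedy token and the sampled token, the smoothing is an exponential moving average with factor `beta`, the "estimated values" are passed in as a per-step `baseline`, and the stopping `margin` and `terminal_reward` are hypothetical parameters. This is not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one step's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def surrogate_regret(logits: Sequence[float], sampled: int) -> float:
    """Assumed regret: log-prob gap between the greedy and the sampled token."""
    logp = log_softmax(logits)
    return max(logp) - logp[sampled]


def detect_failure_step(
    step_logits: Sequence[Sequence[float]],
    sampled_tokens: Sequence[int],
    baseline: Sequence[float],
    margin: float = 1.0,   # hypothetical threshold above the baseline
    beta: float = 0.9,     # hypothetical EMA smoothing factor
) -> Optional[int]:
    """Return the first step whose smoothed cumulative regret exceeds
    baseline[t] + margin, or None if the rollout is never stopped."""
    cumulative = 0.0
    smoothed = 0.0
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_tokens)):
        cumulative += surrogate_regret(logits, tok)
        smoothed = beta * smoothed + (1.0 - beta) * cumulative
        if smoothed > baseline[t] + margin:
            return t
    return None


def td_errors(
    values: Sequence[float],
    stop_step: int,
    terminal_reward: float = -1.0,  # hypothetical failure reward
    gamma: float = 1.0,
) -> List[float]:
    """One-step TD errors for a rollout truncated at stop_step.
    The truncated state is absorbing, so its successor value is 0."""
    deltas = []
    for t in range(stop_step + 1):
        is_last = t == stop_step
        reward = terminal_reward if is_last else 0.0
        next_value = 0.0 if is_last else values[t + 1]
        deltas.append(reward + gamma * next_value - values[t])
    return deltas
```

Under these assumptions, a rollout that samples the greedy token at every step accumulates zero regret and is never stopped, while a single low-probability sample can push the smoothed regret past the threshold. With a flat value estimate, the only nonzero TD error falls on the truncated step, which is one way to read the abstract's claim that negative TD errors concentrate near the detected failure.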
Similar Articles
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
Introduces LambdaPO, a novel reinforcement learning framework that improves upon GRPO by decomposing advantage estimation into pairwise preference comparisons and adding a semantic density reward, achieving better performance on math reasoning tasks.
SocraticPO: Policy Optimization via Interactive Guidance
SocraticPO augments RL rollouts with Socratic-style natural language guidance and reward decay to improve scientific reasoning in LLMs, outperforming strong baselines on SciKnowEval benchmarks.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.
CSPO: Constraint-Sensitive Policy Optimization for Safe Reinforcement Learning
This paper proposes Constraint-Sensitive Policy Optimization (CSPO), a first-order primal-dual method for safe reinforcement learning that incorporates local constraint sensitivity to improve safety recovery and reduce oscillations near safety boundaries, achieving higher constrained returns on navigation and locomotion benchmarks.