Near-Future Policy Optimization

Hugging Face Daily Papers

Summary

Proposes Near-Future Policy Optimization (NPO), a mixed-policy RL method that accelerates convergence by learning from a later checkpoint of the same training run, boosting Qwen3-VL-8B-Instruct's average performance from 57.88 to 62.84.

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the "strong enough" (higher Q, more new knowledge to learn) and "close enough" (lower V, more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
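
As an illustration of the mixed-group idea, here is a minimal Python sketch (not the paper's implementation): a fraction of each rollout group is drawn from a later "near-future" checkpoint of the same run, rewards are verified, and GRPO-style group-normalized advantages are computed over the mixed group, with an importance-sampling correction for the off-policy samples. The toy policies, the guide fraction, the toy reward, and the weight clipping are all assumptions made for this sketch.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_group(policy_logits, group_size):
        # Toy stand-in for rollout sampling: draw discrete "trajectories" from a
        # softmax over logits and return them with their log-probabilities.
        probs = np.exp(policy_logits - policy_logits.max())
        probs = probs / probs.sum()
        samples = rng.choice(len(probs), size=group_size, p=probs)
        return samples, np.log(probs[samples])

    def log_prob_under(policy_logits, samples):
        probs = np.exp(policy_logits - policy_logits.max())
        probs = probs / probs.sum()
        return np.log(probs[samples])

    def verifiable_reward(samples):
        # Toy verifiable reward: 1.0 when the sampled answer is the "correct" one.
        return (samples == 3).astype(float)

    def npo_mixed_group(current_logits, guide_logits, group_size=8, guide_frac=0.25):
        # Mix on-policy rollouts with rollouts from a stronger "near-future" guide
        # checkpoint, then compute GRPO-style group-normalized advantages.
        n_guide = int(group_size * guide_frac)
        n_self = group_size - n_guide

        self_samples, self_logp = sample_group(current_logits, n_self)
        guide_samples, guide_logp = sample_group(guide_logits, n_guide)

        samples = np.concatenate([self_samples, guide_samples])
        rewards = verifiable_reward(samples)
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # Off-policy (guide) samples get an importance weight pi_current / pi_guide,
        # clipped here to keep the variance cost of distant samples bounded.
        weights = np.ones(group_size)
        weights[n_self:] = np.exp(log_prob_under(current_logits, guide_samples) - guide_logp)
        return samples, advantages * np.clip(weights, 0.0, 2.0)

    # Example: the guide prefers the correct answer (index 3) more strongly than
    # the current policy, so its rollouts inject positive-advantage trajectories.
    current = np.array([0.5, 0.2, 0.1, 0.4, 0.1])
    guide = np.array([0.1, 0.1, 0.1, 1.5, 0.1])
    print(npo_mixed_group(current, guide))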

Paper page - Near-Future Policy Optimization

Source: https://huggingface.co/papers/2604.20733

Abstract

Mixed-policy reinforcement learning approach using near-future policy optimization to accelerate convergence and improve performance by balancing trajectory quality and variance.
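
To make the quality-versus-variance trade-off concrete, the following sketch illustrates a checkpoint-selection rule in the spirit of the S = Q/V criterion from the abstract above. Estimating Q as a candidate checkpoint's reward gain over the current policy on a probe set and V as a KL-style distance proxy is an assumption made for this example, as are all names and numbers; this is not the paper's actual procedure.

    import numpy as np

    def effective_signal(q_gain, kl_to_current, eps=1e-8):
        # S = Q / V: reward gain per unit of distributional distance.
        return q_gain / (kl_to_current + eps)

    def select_guide_checkpoint(candidates):
        # candidates: list of dicts with hypothetical "q_gain" and "kl" estimates.
        scores = [effective_signal(c["q_gain"], c["kl"]) for c in candidates]
        return int(np.argmax(scores))

    # Example: a slightly-later checkpoint with a modest gain but a small distance
    # can beat a much stronger external teacher that is distributionally far away.
    candidates = [
        {"name": "external_teacher", "q_gain": 0.30, "kl": 1.50},
        {"name": "ckpt_step+200", "q_gain": 0.08, "kl": 0.08},
        {"name": "ckpt_step+50", "q_gain": 0.02, "kl": 0.03},
    ]
    print(candidates[select_guide_checkpoint(candidates)]["name"])  # -> ckpt_step+200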

Similar Articles

Proximal Policy Optimization

OpenAI Blog

OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
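
For reference, the clipped surrogate objective at the heart of PPO is L_CLIP(θ) = E_t[ min( r_t(θ) · A_t, clip(r_t(θ), 1 − ε, 1 + ε) · A_t ) ], where r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t) is the probability ratio between the new and old policies, A_t is the advantage estimate, and ε is the clipping range; clipping removes the incentive to push the ratio far outside [1 − ε, 1 + ε] in a single update.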

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG

The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.