Near-Future Policy Optimization
Summary
Proposes Near-Future Policy Optimization (NPO), a mixed-policy RL method that accelerates convergence by learning from a later checkpoint of the same training run, boosting Qwen3-VL-8B-Instruct performance from 57.88 to 62.84.
Source: https://huggingface.co/papers/2604.20733
Abstract
Mixed-policy reinforcement learning approach using near-future policy optimization to accelerate convergence and improve performance by balancing trajectory quality and variance.
Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong-enough (higher Q, more new knowledge to learn) and close-enough (lower V, more readily absorbed) conditions required to maximize the effective learning signal S = Q/V. We propose Near-Future Policy Optimization (NPO), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose AutoNPO, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes S. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
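The abstract frames checkpoint selection as maximizing the effective learning signal S = Q/V. The sketch below illustrates that selection rule only; the concrete proxies used here (mean verifiable reward for Q, mean KL to the current policy for V) and all names are illustrative assumptions, not the paper's actual estimators or code.

```python
# Minimal sketch of an AutoNPO-style guide-checkpoint selection, under the
# assumption that Q is approximated by mean verifiable reward of a checkpoint's
# trajectories and V by that checkpoint's distance (e.g., KL) from the current
# policy. These proxies are not taken from the paper.

from dataclasses import dataclass


@dataclass
class CheckpointStats:
    step: int
    mean_reward: float    # proxy for Q: quality of the checkpoint's trajectories
    kl_to_current: float  # proxy for V: distributional distance from the current policy


def effective_signal(stats: CheckpointStats, eps: float = 1e-8) -> float:
    """Effective learning signal S = Q / V from the abstract."""
    return stats.mean_reward / (stats.kl_to_current + eps)


def select_guide_checkpoint(candidates: list[CheckpointStats]) -> CheckpointStats:
    """Pick the near-future checkpoint whose trajectories maximize S."""
    return max(candidates, key=effective_signal)


if __name__ == "__main__":
    # Hypothetical later checkpoints from the same training run.
    candidates = [
        CheckpointStats(step=1200, mean_reward=0.61, kl_to_current=0.05),
        CheckpointStats(step=1600, mean_reward=0.66, kl_to_current=0.12),
        CheckpointStats(step=2000, mean_reward=0.68, kl_to_current=0.30),
    ]
    guide = select_guide_checkpoint(candidates)
    print(f"guide checkpoint: step {guide.step}, S = {effective_signal(guide):.2f}")
```

The later checkpoints are stronger (higher reward) but also farther from the current policy; the S = Q/V criterion trades these off, which is the balance the abstract describes.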
Get this paper in your agent:
hf papers read 2604.20733
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Proximal Policy Optimization
OpenAI introduces Proximal Policy Optimization (PPO), a reinforcement learning algorithm that matches or outperforms state-of-the-art methods while being simpler to implement and tune. PPO uses a novel clipped objective function to constrain policy updates and has since become OpenAI's default RL algorithm.
Gradient Extrapolation-Based Policy Optimization
The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.
Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL
This paper introduces Approximate Next Policy Sampling (ANPS) as an alternative to conservative policy updates in deep reinforcement learning. It proposes Stable Value Approximate Policy Iteration (SV-API) and SV-RL, which align training data with the next policy's state distribution to allow for larger and safer policy updates.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
This paper introduces Listwise Policy Optimization (LPO), a method for RLVR that explicitly handles target projection via divergence minimization on the response simplex to improve training stability and performance in LLMs.
Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs
The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO) that improves accuracy in modular AI systems by optimizing language model calls and prompts. It reports an average 11% accuracy improvement across various tasks and provides an open-source implementation in DSPy.