policy-optimization

Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

arXiv cs.LG · 2d ago

This paper introduces Approximate Next Policy Sampling (ANPS) as an alternative to conservative policy updates in deep reinforcement learning. It proposes Stable Value Approximate Policy Iteration (SV-API) and SV-RL, which align training data with the next policy's state distribution to allow for larger and safer policy updates.

A^2TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Hugging Face Daily Papers · 3d ago

This paper introduces A^2TGPO, a reinforcement learning method for agentic LLMs that uses adaptive turn-level clipping and information gain normalization to improve process credit assignment in multi-turn interactions.

Recovering Hidden Reward in Diffusion-Based Policies

Hugging Face Daily Papers · 2026-05-01

This paper explores methods for recovering hidden rewards within diffusion-based policies.

Near-Future Policy Optimization

Hugging Face Daily Papers · 2026-04-22

Proposes Near-Future Policy Optimization (NPO), a mixed-policy RL method that accelerates convergence by learning from a later checkpoint of the same training run, boosting Qwen3-VL-8B-Instruct performance from 57.88 to 62.84.

DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

Hugging Face Daily Papers · 2026-04-15

DiPO introduces a novel reinforcement learning approach for LLMs that uses perplexity-based sample partitioning to disentangle exploration and exploitation subspaces, combined with a bidirectional reward allocation mechanism for more stable policy optimization. The method demonstrates superior performance on mathematical reasoning and function calling tasks.

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

Papers with Code Trending · 2025-08-06

The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO) that improves accuracy in modular AI systems by optimizing language model calls and prompts. It reports an average 11% accuracy improvement across various tasks and provides an open-source implementation in DSPy.
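
For readers unfamiliar with the GRPO baseline that mmGRPO extends, the group-relative advantage at its core can be sketched as below. This illustrates plain GRPO only, not the multi-module method or the DSPy implementation; the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled completion relative to its own group.

    `rewards` holds one scalar reward per completion sampled for the
    same prompt. Each advantage is the reward minus the group mean,
    divided by the group standard deviation (population form, assumed).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four completions, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed; completions that beat their siblings get positive advantages and the rest get negative ones.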

Quantifying generalization in reinforcement learning

OpenAI Blog · 2018-12-06

OpenAI trained 9 agents on the CoinRun environment with varying numbers of training levels to quantify generalization in reinforcement learning, finding substantial overfitting even with 16,000 training levels and that IMPALA-CNN architectures generalize significantly better than Nature-CNN baselines.

Better exploration with parameter noise

OpenAI Blog · 2017-07-27

OpenAI presents parameter noise, a technique that adds adaptive noise to a neural network policy's parameters rather than to its actions, enabling agents to learn tasks significantly faster than with traditional action noise. The method achieves 2x faster learning on HalfCheetah and represents a middle ground between evolution strategies and deep RL approaches like TRPO and DDPG.
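
The difference between the two kinds of noise can be sketched with a toy linear policy. This is a minimal illustration under assumptions, not OpenAI's code: all function names, the linear policy, and the scaling factor `alpha` are hypothetical. The adaptive rule grows the noise scale while the perturbed policy stays close to the unperturbed one in action space, and shrinks it otherwise.

```python
import math
import random


def linear_policy(weights, state):
    """Toy deterministic policy: a dot product of weights and state."""
    return sum(w * s for w, s in zip(weights, state))


def act_with_action_noise(weights, state, sigma, rng):
    """Action noise: fresh noise is added to every action."""
    return linear_policy(weights, state) + rng.gauss(0.0, sigma)


def perturb_parameters(weights, sigma, rng):
    """Parameter noise: perturb the weights once, then act without noise."""
    return [w + rng.gauss(0.0, sigma) for w in weights]


def adapt_sigma(sigma, weights, perturbed, states, target_distance, alpha=1.01):
    """Rescale sigma from the action-space distance between the policies."""
    squared = [(linear_policy(weights, s) - linear_policy(perturbed, s)) ** 2
               for s in states]
    distance = math.sqrt(sum(squared) / len(squared))
    return sigma * alpha if distance <= target_distance else sigma / alpha


rng = random.Random(0)
weights = [0.5, -0.2]
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perturbed = perturb_parameters(weights, sigma=0.1, rng=rng)
new_sigma = adapt_sigma(0.1, weights, perturbed, states, target_distance=0.2)
```

A perturbed policy returns the same action every time it sees the same state, so exploration is consistent across an episode, whereas action noise changes the action on every call.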

Evolution strategies as a scalable alternative to reinforcement learning

OpenAI Blog · 2017-03-24

OpenAI presents evolution strategies (ES) as a scalable black-box optimization alternative to reinforcement learning for training neural network policies. ES simplifies the optimization problem by treating policy training as a stochastic parameter search that repeatedly samples and selects better parameter configurations based on reward feedback.
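
The sample-and-evaluate loop described above can be sketched in a few lines. This is a minimal illustration under assumptions, not OpenAI's implementation: the toy reward, the function names, and the hyperparameters are hypothetical, and the "selection" here is a reward-weighted average of the sampled perturbations.

```python
import random


def toy_reward(params, target=(0.5, 0.1, -0.3)):
    """Hypothetical black-box reward: negative squared distance to a target."""
    return -sum((p - t) ** 2 for p, t in zip(params, target))


def evolution_strategy(reward_fn, theta, sigma=0.1, lr=0.01,
                       population=50, iterations=300, seed=0):
    rng = random.Random(seed)
    theta = list(theta)
    dim = len(theta)
    for _ in range(iterations):
        # Sample one Gaussian perturbation per population member.
        noises = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(population)]
        # Evaluate each perturbed parameter vector; only rewards are used.
        rewards = [reward_fn([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        mean = sum(rewards) / population
        std = (sum((r - mean) ** 2 for r in rewards) / population) ** 0.5
        if std == 0.0:
            continue  # every candidate scored the same, so no search signal
        weights = [(r - mean) / std for r in rewards]
        # Move theta toward perturbations that scored above average.
        for j in range(dim):
            step = sum(w * eps[j] for w, eps in zip(weights, noises))
            theta[j] += lr * step / (population * sigma)
    return theta


start = [0.0, 0.0, 0.0]
final = evolution_strategy(toy_reward, start)
```

The loop never differentiates through `reward_fn`, which is what makes the approach black-box: each candidate evaluation is independent and could be run on a separate worker.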
