Variance reduction for policy gradient with action-dependent factorized baselines
Summary
OpenAI researchers derive a bias-free, action-dependent baseline for variance reduction in policy gradient methods, demonstrating improved learning efficiency on high-dimensional control tasks and in multi-agent and partially observed environments.
Similar Articles
BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization
BiasGRPO proposes a framework using Group Relative Policy Optimization (GRPO) to stabilize social bias mitigation in LLMs by normalizing rewards across sampled completions, outperforming DPO and PPO on multiple benchmarks. The authors also release a compute-efficient bias reward model designed for integration into multi-objective RLHF pipelines.
Evolved Policy Gradients
OpenAI introduces Evolved Policy Gradients (EPG), a meta-learning approach that learns loss functions through evolution rather than learning policies directly, enabling RL agents to generalize better across tasks by leveraging prior experience, much as humans transfer skills.
Hierarchical Variational Policies for Reward-Guided Diffusion
Proposes a hierarchical variational policy framework for reward-guided diffusion, enabling high-quality sampling with reduced inference cost. Achieves a strong quality-speed tradeoff on tasks such as super-resolution.
Reducing Credit Assignment Variance via Counterfactual Reasoning Paths
Introduces Implicit Behavior Policy Optimization (IBPO), a counterfactual comparison-based credit assignment framework that improves training stability and performance in multi-step reasoning tasks for large language models by converting sparse terminal rewards into step-sensitive learning signals.
Better exploration with parameter noise
OpenAI presents parameter noise, a technique that adds adaptive noise to neural network policy parameters rather than action spaces, enabling agents to learn tasks significantly faster than traditional action noise approaches. The method achieves 2x faster learning on HalfCheetah and represents a middle ground between evolution strategies and deep RL approaches like TRPO and DDPG.