Variance reduction for policy gradient with action-dependent factorized baselines
Summary
OpenAI researchers derive a bias-free action-dependent baseline for variance reduction in policy gradient methods, demonstrating improved learning efficiency on high-dimensional control tasks and in multi-agent and partially observed environments.
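The key property claimed in the summary is that a baseline for action dimension i may depend on the *other* action dimensions without biasing the gradient, because the dimensions are sampled independently under a factorized policy. A minimal numeric sketch (the toy reward, policy, and baseline here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: factorized Gaussian policy pi(a) = N(mu, I) over 2 action dims,
# reward r(a) = -sum(a**2). We estimate grad_mu E[r] with the score-function
# (REINFORCE) estimator, with and without a baseline that for dimension i
# depends on the other dimension a_{-i}. Because a_i and a_{-i} are sampled
# independently, subtracting b_i(a_{-i}) leaves the estimator unbiased.

mu = np.array([1.0, -1.0])
n = 100_000
a = mu + rng.standard_normal((n, 2))       # samples from the policy
r = -np.sum(a**2, axis=1, keepdims=True)   # shared return signal
score = a - mu                             # grad_mu log N(a; mu, I)

# Plain estimator: every dimension sees the full return.
g_plain = score * r

# Action-dependent baseline: for dim i, subtract the part of the return
# explained by the other dimension, b_i(a_{-i}) = -a_{-i}^2 - E[a_i^2].
b = np.stack([-a[:, 1]**2 - (mu[0]**2 + 1),
              -a[:, 0]**2 - (mu[1]**2 + 1)], axis=1)
g_base = score * (r - b)

print("mean (plain):   ", g_plain.mean(axis=0))  # both ~ true grad = -2*mu
print("mean (baseline):", g_base.mean(axis=0))
print("var  (plain):   ", g_plain.var(axis=0))
print("var  (baseline):", g_base.var(axis=0))    # markedly smaller
```

Both estimators average to the true gradient (-2, 2), but the baselined one has much lower per-sample variance because the noise contributed by the other action dimension is cancelled.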
Similar Articles
Evolved Policy Gradients
OpenAI introduces Evolved Policy Gradients (EPG), a meta-learning approach that learns loss functions through evolution rather than learning policies directly, enabling RL agents to generalize better across tasks by leveraging prior experience similar to how humans transfer skills.
Better exploration with parameter noise
OpenAI presents parameter noise, a technique that adds adaptive noise to neural network policy parameters rather than action spaces, enabling agents to learn tasks significantly faster than traditional action noise approaches. The method achieves 2x faster learning on HalfCheetah and represents a middle ground between evolution strategies and deep RL approaches like TRPO and DDPG.
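The distinction the summary draws, noise on parameters rather than on actions, can be sketched for a linear policy (the policy, scales, and helper names here are illustrative assumptions, not the OpenAI implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Parameter-space vs action-space exploration for a linear policy a = W @ s.

def rollout_action_noise(W, states, sigma=0.1):
    # Perturb each action independently: exploration is uncorrelated
    # across timesteps.
    return [W @ s + sigma * rng.standard_normal(W.shape[0]) for s in states]

def rollout_param_noise(W, states, sigma=0.1):
    # Perturb the parameters once per episode: the same perturbed policy
    # acts for the whole rollout, so exploration is temporally coherent.
    W_tilde = W + sigma * rng.standard_normal(W.shape)
    return [W_tilde @ s for s in states]

W = np.zeros((1, 2))
states = [np.ones(2)] * 5  # repeat one state to make the contrast visible

print(rollout_action_noise(W, states))  # a different action at every step
print(rollout_param_noise(W, states))   # one consistent perturbed behavior
```

Under action noise the agent jitters step by step; under parameter noise it commits to one perturbed policy per episode, which is the property the post credits for faster learning.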
OpenAI Baselines: ACKTR & A2C
OpenAI releases ACKTR and A2C algorithms as part of its Baselines library, with ACKTR demonstrating improved sample complexity through natural gradient descent while maintaining computational efficiency comparable to first-order methods.
Equivalence between policy gradients and soft Q-learning
OpenAI researchers demonstrate a precise mathematical equivalence between soft (entropy-regularized) Q-learning and policy gradient methods in reinforcement learning, providing theoretical insight into why Q-learning works despite inaccurate value estimates. They validate this equivalence empirically on the Atari benchmark and show a Q-learning method can closely match A3C's learning dynamics.
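The equivalence rests on a standard identity of entropy-regularized RL: at temperature tau, the optimal policy is a Boltzmann distribution over the soft Q-values, and the soft value is their log-sum-exp. A numeric check (the Q-values below are made up for illustration):

```python
import numpy as np

tau = 0.5
Q = np.array([1.0, 2.0, 0.5])            # soft Q-values for one state

V = tau * np.logaddexp.reduce(Q / tau)   # soft value: tau * log sum exp(Q/tau)
pi = np.exp((Q - V) / tau)               # implied policy, already normalized

print("pi:", pi, "sums to", pi.sum())

# Sanity check: V equals the expected Q plus tau times the policy entropy,
# which is exactly the objective a policy gradient method with an entropy
# bonus maximizes -- the hinge of the claimed equivalence.
entropy = -np.sum(pi * np.log(pi))
print("V:", V, "==", pi @ Q + tau * entropy)
```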
OpenAI Baselines: DQN
OpenAI shares lessons learned while implementing DQN as part of their Baselines project, covering debugging tips such as greyscale calibration issues, hyperparameter tuning, and the correct interpretation of the Huber loss in the original Nature paper.
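The Huber loss the summary refers to is the piecewise loss commonly used in place of squared TD error: quadratic inside a threshold delta, linear beyond it, so large TD errors produce bounded gradients. A short sketch (delta=1.0 matches the usual "error clipping" reading; the exact setup in the Nature paper is what the Baselines post discusses):

```python
import numpy as np

def huber(x, delta=1.0):
    # Quadratic for |x| <= delta, linear for |x| > delta.
    quad = np.minimum(np.abs(x), delta)
    lin = np.abs(x) - quad
    return 0.5 * quad**2 + delta * lin

errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber(errors))  # [2.5, 0.125, 0.0, 0.125, 2.5]
```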