OpenAI Baselines: ACKTR & A2C


Summary

OpenAI releases the ACKTR and A2C algorithms as part of its Baselines library. ACKTR improves sample efficiency by following the natural gradient rather than the plain gradient, while keeping per-update computation close to that of first-order methods.

We’re releasing two new OpenAI Baselines implementations: ACKTR and A2C. A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C), which we’ve found gives equal performance. ACKTR is a more sample-efficient reinforcement learning algorithm than TRPO and A2C, and requires only slightly more computation than A2C per update.
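To make the "synchronous, deterministic" distinction concrete, here is a minimal numpy sketch of the data path such an update follows: all parallel environments are stepped in lockstep, the whole batch is gathered, and a single gradient update consumes it. The shapes and the `discounted_returns` helper are illustrative assumptions, not the Baselines API.

```python
# Illustrative sketch of a synchronous actor-critic batch (not the Baselines API):
# rollouts from N parallel environments are collected in lockstep for T steps,
# then one deterministic update uses the whole batch.
import numpy as np

def discounted_returns(rewards, dones, last_value, gamma=0.99):
    """Bootstrap n-step returns backwards through a rollout of shape (T, N)."""
    T, N = rewards.shape
    returns = np.zeros((T, N))
    running = last_value  # value estimate for the state after the rollout
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running * (1.0 - dones[t])
        returns[t] = running
    return returns

# Toy batch: 5-step rollouts from 4 environments stepped synchronously.
rng = np.random.default_rng(0)
rewards = rng.random((5, 4))
dones = np.zeros((5, 4))         # no episode ended in this toy rollout
values = rng.random((5, 4))      # critic's value estimates V(s_t)
last_value = rng.random(4)       # V(s_T), used to bootstrap the returns

returns = discounted_returns(rewards, dones, last_value)
advantages = returns - values    # A(s_t, a_t) scales the policy gradient
print(advantages.shape)          # (5, 4): one update over the whole batch
```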

Source: [https://openai.com/index/openai-baselines-acktr-a2c/](https://openai.com/index/openai-baselines-acktr-a2c/)

For machine learning algorithms, two costs are important to consider: sample complexity and computational complexity. Sample complexity refers to the number of timesteps of interaction between the agent and its environment, and computational complexity refers to the amount of numerical operations that must be performed.

ACKTR has better sample complexity than first-order methods such as A2C because it takes a step in the *natural gradient* direction, rather than the gradient direction (or a rescaled version, as in Adam). The natural gradient gives us the direction in parameter space that achieves the largest (instantaneous) improvement in the objective per unit of change in the output distribution of the network, as measured by the KL divergence. By limiting the KL divergence, we ensure that the new policy does not behave radically differently from the old one, which could cause a collapse in performance.

As for computational complexity, the KFAC update used by ACKTR is only 10–25% more expensive per update step than a standard gradient update. This contrasts with methods like TRPO (i.e., Hessian-free optimization), which require a more expensive conjugate-gradient computation. Two small numerical sketches of these ideas follow below.

In the following video you can see comparisons at different timesteps between agents trained with ACKTR and agents trained with A2C to solve the game Q-Bert. ACKTR agents get higher scores than ones trained with A2C.
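As a rough illustration of the natural gradient itself, here is a minimal numpy sketch for a single softmax policy over three actions, where the Fisher information matrix can be written down exactly; ACKTR's contribution (K-FAC) is a cheap block-wise approximation of this matrix for full neural networks. All names and numbers below are illustrative, not taken from the Baselines code.

```python
# Natural gradient on a tiny softmax policy: solve F x = g instead of
# stepping along the raw gradient g. Everything here is a toy illustration.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([0.0, 0.5, -0.5])
rewards = np.array([1.0, 0.0, 2.0])   # per-action rewards (toy objective)
p = softmax(logits)

# Vanilla policy gradient of J = sum_a pi(a) r(a) with respect to the logits.
g = p * (rewards - p @ rewards)

# Exact Fisher information of a categorical distribution w.r.t. its logits:
# F = E[grad log pi grad log pi^T] = diag(p) - p p^T.
F = np.diag(p) - np.outer(p, p)

# F is singular (the logits are shift-invariant), so damp it slightly,
# as K-FAC-style methods do in practice.
damping = 1e-3
nat_g = np.linalg.solve(F + damping * np.eye(3), g)

# A unit step along nat_g corresponds to a fixed amount of change in the
# policy's output distribution (KL), not a fixed change in the parameters.
print("gradient:        ", g)
print("natural gradient:", nat_g)
```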
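For contrast, here is the kind of conjugate-gradient inner loop that TRPO-style (Hessian-free) methods run on every update to approximate F⁻¹g from Fisher-vector products; each CG iteration costs roughly one extra pass through the network, which is where the extra expense comes from. This is a generic textbook sketch, not the Baselines implementation.

```python
# Conjugate gradient for F x = g, given only a Fisher-vector-product function.
# In TRPO the product fvp(v) requires an extra network pass per iteration.
import numpy as np

def conjugate_gradient(fvp, b, iters=10, tol=1e-10):
    """Approximately solve F x = b using only F-vector products."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - F x (x starts at 0)
    d = r.copy()          # search direction
    rr = r @ r
    for _ in range(iters):
        Fd = fvp(d)       # the expensive step: one F-vector product
        alpha = rr / (d @ Fd)
        x += alpha * d
        r -= alpha * Fd
        rr_new = r @ r
        if rr_new < tol:
            break
        d = r + (rr_new / rr) * d
        rr = rr_new
    return x

p = np.array([0.3, 0.45, 0.25])                      # toy policy distribution
F = np.diag(p) - np.outer(p, p) + 1e-3 * np.eye(3)   # damped Fisher matrix
g = np.array([0.1, -0.2, 0.1])                       # toy policy gradient

x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g, atol=1e-6))              # True: x ~ F^{-1} g
```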
