# OpenAI Baselines: DQN

Source: [https://openai.com/index/openai-baselines-dqn/](https://openai.com/index/openai-baselines-dqn/)

OpenAI shares lessons learned while implementing DQN as part of the Baselines project, covering debugging tips such as a greyscale calibration bug, hyperparameter tuning, and the correct interpretation of error clipping (the Huber loss) in the original Nature paper.

We’re open-sourcing OpenAI Baselines, our internal effort to reproduce reinforcement learning algorithms with performance on par with published results. We’ll release the algorithms over upcoming months; today’s release includes DQN and three of its variants.
When transforming the screen images into greyscale, we had incorrectly calibrated our coefficients for the green color values, which led to the fish disappearing. After we noticed the bug we tweaked the color values and our algorithm was able to see the fish again.
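As a rough illustration (this is not the Baselines code; the weights shown are just the standard luminance coefficients), a mis-set green coefficient in the RGB-to-greyscale conversion can make mostly-green objects nearly invisible to the agent:

```python
import numpy as np

def to_greyscale(frame, coeffs=(0.299, 0.587, 0.114)):
    """Convert an HxWx3 uint8 RGB frame to greyscale using per-channel weights."""
    return (frame.astype(np.float32) @ np.asarray(coeffs, dtype=np.float32)).astype(np.uint8)

# Dummy Atari-sized frame standing in for an environment observation.
frame = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

# Hypothetical bug: a near-zero green weight makes mostly-green sprites
# (like the fish described above) nearly vanish after conversion.
buggy = to_greyscale(frame, coeffs=(0.299, 0.0, 0.114))
fixed = to_greyscale(frame)
```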
To debug issues like this in the future, Gym now contains a [play](https://github.com/openai/gym/blob/master/gym/utils/play.py) function, which lets a researcher easily see the same observations as the AI agent would.
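A minimal usage sketch of the play utility, assuming an Atari environment id such as SeaquestNoFrameskip-v4 is available locally (zoom and fps are optional display settings):

```python
import gym
from gym.utils.play import play

# Opens an interactive window rendering exactly what the agent observes;
# keyboard input is mapped to the environment's discrete actions.
env = gym.make("SeaquestNoFrameskip-v4")
play(env, zoom=3, fps=30)
```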
*Fix bugs, then hyperparameters*: After debugging, we started to calibrate our hyperparameters. We ultimately found that the annealing schedule for epsilon, the hyperparameter that controls the exploration rate, had a huge impact on performance. Our final implementation decreases epsilon to 0.1 over the first million steps and then down to 0.01 over the next 24 million steps. If our implementation had still contained bugs, we would likely have arrived at different hyperparameter settings chosen to compensate for faults we hadn’t yet diagnosed.
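A minimal sketch of that piecewise-linear schedule (the helper name and the 1.0 starting value are our assumptions, not the Baselines API):

```python
def epsilon_at(step):
    """Exploration rate as described above: anneal from 1.0 to 0.1 over the
    first 1M steps, then from 0.1 to 0.01 over the next 24M steps,
    and stay at 0.01 afterwards."""
    if step < 1_000_000:
        return 1.0 + (0.1 - 1.0) * step / 1_000_000
    if step < 25_000_000:
        return 0.1 + (0.01 - 0.1) * (step - 1_000_000) / 24_000_000
    return 0.01

assert abs(epsilon_at(0) - 1.0) < 1e-9
assert abs(epsilon_at(1_000_000) - 0.1) < 1e-9
assert abs(epsilon_at(25_000_000) - 0.01) < 1e-9
```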
*Double check your interpretations of papers*: In the DQN [Nature](https://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) paper the authors write: “We also found it helpful to clip the error term from the update [...] to be between -1 and 1.” There are two ways to interpret this statement: clip the objective, or clip the multiplicative term when computing the gradient. The former seems more natural, but it causes the gradient to be zero on transitions with high error, which leads to suboptimal performance, as found in one [DQN implementation](https://github.com/devsisters/DQN-tensorflow/issues/16). The latter is correct and has a simple mathematical interpretation: it is equivalent to using the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss). You can spot bugs like these by checking that the gradients appear as you expect, which can be done easily within TensorFlow using [compute_gradients](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer#compute_gradients).
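A small NumPy sketch (not Baselines code) contrasting the two readings: the derivative of the Huber loss with respect to the TD error is exactly the error clipped to [-1, 1], whereas clipping inside the squared objective zeroes the gradient whenever the error is large:

```python
import numpy as np

def huber_loss(delta, kappa=1.0):
    """Huber loss: quadratic for |delta| <= kappa, linear beyond."""
    return np.where(np.abs(delta) <= kappa,
                    0.5 * delta ** 2,
                    kappa * (np.abs(delta) - 0.5 * kappa))

def huber_grad(delta, kappa=1.0):
    """d(huber)/d(delta): the TD error clipped to [-kappa, kappa]."""
    return np.clip(delta, -kappa, kappa)

def clipped_objective_grad(delta, kappa=1.0):
    """Gradient of the 'clip the objective' reading, 0.5 * clip(delta)^2:
    zero whenever |delta| > kappa, so high-error transitions stop learning."""
    return np.where(np.abs(delta) <= kappa, delta, 0.0)

td_errors = np.array([-3.0, -0.5, 0.2, 2.5])
print(huber_grad(td_errors))              # [-1.  -0.5  0.2  1. ]
print(clipped_objective_grad(td_errors))  # [ 0.  -0.5  0.2  0. ]
```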
The majority of bugs in this post were spotted by going over the code multiple times and thinking through what could go wrong with each line. Each bug seems obvious in hindsight, but even experienced researchers tend to underestimate how many passes over the code it can take to find all the bugs in an implementation.