# Learning Montezuma’s Revenge from a single demonstration
Source: [https://openai.com/index/learning-montezumas-revenge-from-a-single-demonstration/](https://openai.com/index/learning-montezumas-revenge-from-a-single-demonstration/)
Model-free RL methods like policy gradients and Q-learning explore by taking actions randomly. If, by chance, the random actions lead to a reward, they are *reinforced*, and the agent becomes more likely to take these beneficial actions in the future. This works well if rewards are dense enough for random actions to lead to a reward with reasonable probability. However, many of the more complicated games require long sequences of very specific actions to experience any reward, and such sequences are extremely unlikely to occur randomly.
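A back-of-the-envelope calculation (a sketch, not from the original post) makes the sparse-reward argument concrete: with `k` discrete actions, the chance that uniformly random exploration emits one specific `n`-step sequence is `(1/k)**n`.

```python
# Probability that uniform random exploration produces one specific
# sequence of n actions when there are k discrete actions: (1/k)**n.
def p_random_sequence(n_steps: int, n_actions: int) -> float:
    return (1.0 / n_actions) ** n_steps

# With Atari's 18 discrete actions, a 3-step sequence is plausible to
# stumble on by chance, while a 100-step sequence is effectively
# unreachable within any realistic number of random rollouts.
p_short = p_random_sequence(3, 18)    # ~1.7e-4
p_long = p_random_sequence(100, 18)   # ~3e-126
```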
Although the step-by-step learning done by our agent is much simpler than learning to play from scratch, it is still far from trivial. One challenge our RL agent faces is that it is generally unable to reach the exact state from later on in the demo when it starts from an earlier state. This is partly because the agent plays the game at a different frameskip from what we used for recording the demonstration, but also because the randomness in its actions makes it very unlikely to exactly reproduce any specific sequence of actions. The agent thus needs to generalize between states that are very similar, but not identical. We found that this works well for Montezuma’s Revenge, but much less well for some other Atari games we tried, like Gravitar and Pitfall. One reason for this may be that these latter games require solving a harder vision problem: we found these games difficult to play from a downsampled screen ourselves, and we saw some improvement when using larger and deeper neural network policies.
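The step-by-step training described above can be sketched as a reset-point curriculum. The sketch below stubs out the emulator snapshots and policy rollouts, and every helper name (`demo_states`, `demo_return`, `rollout_return`) is hypothetical, not taken from the original implementation:

```python
# Sketch of a reset-point curriculum: train from a state near the end
# of the demonstration, and move the reset point earlier only once the
# agent reliably matches the demonstrator's return from that point.
demo_states = list(range(1000))           # placeholder emulator snapshots

def demo_return(t):
    # Placeholder: the return the demonstrator earned from step t onward.
    return len(demo_states) - t

def rollout_return(t):
    # Placeholder for "run the current policy from demo_states[t]";
    # this stub agent always matches the demonstration.
    return demo_return(t)

def train_curriculum(threshold=0.8, batch=20, step_back=10):
    t = len(demo_states) - 1              # start near the end of the demo
    while t > 0:
        returns = [rollout_return(t) for _ in range(batch)]
        success = sum(r >= demo_return(t) for r in returns) / batch
        # (a real loop would update the policy on these rollouts here)
        if success >= threshold:
            t = max(t - step_back, 0)     # reset point moves earlier
    return t                              # 0 means: full game from start
```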
Another challenge we encountered is that standard RL algorithms like policy gradients require striking a careful balance between exploration and exploitation: if the agent’s actions are too random, it makes too many mistakes to ever achieve the required final score when starting from the beginning of the game; if the actions are too deterministic, the agent stops learning because it does not explore alternative actions. Achieving the reported result on Montezuma’s Revenge thus required careful tuning of the coefficient of the entropy bonus used in PPO, in combination with other hyperparameters such as the learning rate and the scaling of rewards. For some other games like Gravitar and Pitfall we were unable to find hyperparameters that worked for training the full curriculum. The algorithm also still shows substantial random variation from run to run, with some runs failing to converge for Montezuma’s Revenge. We hope that future advances in RL will yield algorithms that are more robust to random noise and to the choice of hyperparameters.
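The entropy bonus mentioned above adds the policy's entropy to the PPO objective, scaled by a tunable coefficient. A minimal sketch of that term follows; the coefficient values are illustrative defaults, not the ones used for the reported result:

```python
import math

def entropy(probs):
    # Shannon entropy of an action distribution: maximal when the
    # policy is uniform (most random), zero when it is deterministic.
    return -sum(p * math.log(p) for p in probs if p > 0)

def ppo_style_loss(policy_loss, value_loss, action_probs,
                   ent_coef=0.01, vf_coef=0.5):
    # Subtracting the entropy term rewards randomness: a larger
    # ent_coef pushes the agent toward exploration, a smaller one
    # toward exploitation -- the balance discussed above.
    return (policy_loss
            + vf_coef * value_loss
            - ent_coef * entropy(action_probs))
```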
Finally, as is often the case in reinforcement learning, we find that our trained neural net policy does not yet generalize at the level of a human player. [One method](https://arxiv.org/abs/1709.06009v2) to test for generalization ability is to perturb the policy by making actions *sticky*, repeating the last action with probability 0.25 at every frame. Using this evaluation method, our trained policy obtains an average score of 10,000 on Montezuma’s Revenge. Alternatively, we can take random actions with probability 0.01 (repeated for 4 frameskipped steps), which leads to an average score of 8,400 for our policy. Anecdotally, we find that such perturbations also significantly reduce the score of human players on Montezuma’s Revenge, but to a lesser extent. As far as we are aware, our results using perturbed policies are still better than all those published previously. Perturbing the learned policy by starting with between 0 and 30 random no-ops did not significantly hurt results, with the majority of rollouts achieving the final score obtained in our demonstration.
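The sticky-action evaluation can be implemented as a thin environment wrapper; a sketch assuming a gym-style `step` interface (the wrapper class itself is illustrative, not from the original code):

```python
import random

class StickyActions:
    """With probability `repeat_prob`, repeat the previous action
    instead of the one the policy chose (0.25 in the evaluation above)."""

    def __init__(self, env, repeat_prob=0.25, rng=None):
        self.env = env
        self.repeat_prob = repeat_prob
        self.rng = rng or random.Random()
        self.last_action = 0  # Atari action 0 is NOOP

    def step(self, action):
        if self.rng.random() < self.repeat_prob:
            action = self.last_action  # the chosen action is dropped
        self.last_action = action
        return self.env.step(action)
```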
Where most previous work on learning from demonstrations focused on *imitation*, which encourages behavior identical to that seen in the demonstration, we have shown that good results can be achieved by optimizing returns directly. This allows the agent to deviate from the demonstrated behavior, which lets it find new and exciting solutions that the human demonstrator may not have considered. By training on a curriculum of subtasks, created by resetting from demonstration states, we used this technique to solve a difficult reinforcement learning problem requiring long sequences of actions.