Quantifying generalization in reinforcement learning

OpenAI Blog 12/06/18, 08:00 AM Papers

Summary

OpenAI trained 9 agents on the CoinRun environment with varying numbers of training levels to quantify generalization in reinforcement learning, finding substantial overfitting even with 16,000 training levels and that IMPALA-CNN architectures generalize significantly better than Nature-CNN baselines.

We’re releasing CoinRun, a training environment that provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in reinforcement learning. CoinRun strikes a desirable balance in complexity: the environment is simpler than traditional platformer games like Sonic the Hedgehog but still poses a worthy generalization challenge for state-of-the-art algorithms.

Cached at: 04/20/26, 02:55 PM

# Quantifying generalization in reinforcement learning

Source: [https://openai.com/index/quantifying-generalization-in-reinforcement-learning/](https://openai.com/index/quantifying-generalization-in-reinforcement-learning/)

We trained 9 agents to play CoinRun, each with a different number of available training levels. The first 8 agents trained on sets ranging from 100 to 16,000 levels. We trained the final agent on an unrestricted set of levels, so this agent never sees the same level twice.

We trained our agents with policies using a [common](https://www.nature.com/articles/nature14236) 3-layer [convolutional architecture](https://en.wikipedia.org/wiki/Convolutional_neural_network), which we call Nature-CNN. Our agents trained with [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347) ([PPO](https://blog.openai.com/openai-baselines-ppo/#ppo)) for a total of 256M timesteps. Since an episode lasts 100 timesteps on average, agents with fixed training sets will see each training level thousands to millions of times. The final agent, trained with the unrestricted set, will see roughly 2 million distinct levels, each of them exactly once.

We collected each data point in the following graphs by averaging the final agent’s performance across 10,000 episodes. At test time, the agent is evaluated on never-before-seen levels.

We discovered that substantial overfitting occurs when there are fewer than 4,000 training levels. **In fact, we still see overfitting even with 16,000 training levels!** Unsurprisingly, agents trained with the unrestricted set of levels performed best, as these agents had access to the most data. These agents are represented by the dotted line in the following graphs.

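The training budget described above can be turned into a rough episode count. The sketch below is only a back-of-the-envelope estimate: it assumes every episode lasts exactly the stated average of 100 timesteps and that levels in a fixed training set are sampled uniformly. The function names are hypothetical and do not come from the CoinRun code.

```python
# Back-of-the-envelope estimate of how much data each agent sees.
# Assumptions (not from the CoinRun code): every episode lasts exactly the
# stated average length, and fixed-set levels are sampled uniformly.

TOTAL_TIMESTEPS = 256_000_000  # training budget stated in the post
AVG_EPISODE_LENGTH = 100       # average timesteps per episode, as stated


def total_episodes(total_timesteps=TOTAL_TIMESTEPS, avg_len=AVG_EPISODE_LENGTH):
    """Approximate number of episodes played over the whole training run."""
    return total_timesteps // avg_len


def expected_visits_per_level(num_levels, total_timesteps=TOTAL_TIMESTEPS,
                              avg_len=AVG_EPISODE_LENGTH):
    """Approximate number of times each level of a fixed set is replayed."""
    return total_episodes(total_timesteps, avg_len) / num_levels


if __name__ == "__main__":
    # For the unrestricted agent, every episode is a new level.
    print("episodes (distinct levels, unrestricted agent):", total_episodes())
    for n in (100, 16_000):
        print(f"{n} training levels -> about "
              f"{expected_visits_per_level(n):,.0f} visits per level")
```

With these inputs the total comes to about 2.56 million episodes, in line with the "roughly 2 million distinct levels" seen by the unrestricted agent. The per-level repeat count shrinks as the training set grows, since the same budget is spread over more levels.
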
We compared our Nature-CNN baseline against the convolutional architecture used in [IMPALA](https://arxiv.org/abs/1802.01561) and found the IMPALA-CNN agents generalized *much better* with any training set as seen below.
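
For readers unfamiliar with the Nature-CNN baseline, the sketch below traces feature-map sizes through its three convolutional layers. The layer settings (32 filters of 8x8 with stride 4, 64 of 4x4 with stride 2, 64 of 3x3 with stride 1, no padding) are those of the linked Nature paper. The 64x64 input size is an assumption about CoinRun observations, and the helper names are hypothetical.

```python
# Shape arithmetic for the 3-layer "Nature-CNN" convolutional stack.
# Layer settings follow the linked Nature paper; the 64x64 input size used
# below is an assumption about CoinRun observations, not a value from the post.

# (filters, kernel size, stride) for each convolutional layer
NATURE_CNN_LAYERS = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def conv_output_size(size, kernel, stride):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1


def nature_cnn_feature_shape(height, width, layers=NATURE_CNN_LAYERS):
    """Return (height, width, channels) after the last convolutional layer."""
    channels = None
    for filters, kernel, stride in layers:
        height = conv_output_size(height, kernel, stride)
        width = conv_output_size(width, kernel, stride)
        channels = filters
    return height, width, channels


if __name__ == "__main__":
    h, w, c = nature_cnn_feature_shape(64, 64)
    print("feature map:", (h, w, c), "flattened size:", h * w * c)
```

The same helper gives the familiar 7x7x64 feature map for the 84x84 inputs used in the original Atari setting, which is a quick sanity check on the arithmetic.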

Similar Articles

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Gotta Learn Fast: A new benchmark for generalization in RL

Reinforcement Learning Towards Broadly and Persistently Beneficial Models

@OpenAI: As AI takes on longer, higher-stakes tasks, we want models to carry beneficial and safe behavior into new domains beyon…

How AI training scales
