Quantifying generalization in reinforcement learning

OpenAI Blog Papers

Summary

OpenAI trained 9 agents on the CoinRun environment with varying numbers of training levels to quantify generalization in reinforcement learning. The agents overfit substantially even with 16,000 training levels, and IMPALA-CNN architectures generalized significantly better than the Nature-CNN baseline.

We’re releasing CoinRun, a training environment that provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in reinforcement learning. CoinRun strikes a desirable balance in complexity: the environment is simpler than traditional platformer games like Sonic the Hedgehog but still poses a worthy generalization challenge for state-of-the-art algorithms.


Source: [https://openai.com/index/quantifying-generalization-in-reinforcement-learning/](https://openai.com/index/quantifying-generalization-in-reinforcement-learning/)

We trained 9 agents to play CoinRun, each with a different number of available training levels. The first 8 agents trained on sets ranging from 100 to 16,000 levels. We trained the final agent on an unrestricted set of levels, so this agent never sees the same level twice.

We trained our agents with policies using a [common](https://www.nature.com/articles/nature14236) 3-layer [convolutional architecture](https://en.wikipedia.org/wiki/Convolutional_neural_network), which we call Nature-CNN. Our agents trained with [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347) ([PPO](https://blog.openai.com/openai-baselines-ppo/#ppo)) for a total of 256M timesteps. Since an episode lasts 100 timesteps on average, agents with fixed training sets will see each training level thousands to millions of times. The final agent, trained with the unrestricted set, will see roughly 2 million distinct levels, each of them exactly once.

We collected each data point in the following graphs by averaging the final agent’s performance across 10,000 episodes. At test time, the agent is evaluated on never-before-seen levels. We discovered that substantial overfitting occurs when there are fewer than 4,000 training levels. **In fact, we still see overfitting even with 16,000 training levels!** Unsurprisingly, agents trained with the unrestricted set of levels performed best, as these agents had access to the most data. These agents are represented by the dotted line in the following graphs.

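
The evaluation protocol described above (average performance over many episodes, on training levels versus never-before-seen levels) can be sketched roughly as follows. This is a minimal illustration, not the CoinRun code: `generalization_gap`, `make_toy_agent`, the integer level seeds, and the toy agent's 60% success rate on unseen levels are all hypothetical stand-ins chosen for the example.

```python
import random
from statistics import mean
from typing import Callable, Sequence, Tuple


def generalization_gap(
    run_episode: Callable[[int], float],
    train_seeds: Sequence[int],
    num_episodes: int = 10_000,
    seed_space: int = 2**31,
    rng_seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (train score, test score, gap), each score averaged over num_episodes."""
    rng = random.Random(rng_seed)
    train_set = set(train_seeds)

    # Training score: levels sampled with replacement from the fixed training set.
    train_levels = [rng.choice(train_seeds) for _ in range(num_episodes)]

    # Test score: levels drawn from the wider seed space, rejecting training levels.
    test_levels = []
    while len(test_levels) < num_episodes:
        candidate = rng.randrange(seed_space)
        if candidate not in train_set:
            test_levels.append(candidate)

    train_score = mean(run_episode(seed) for seed in train_levels)
    test_score = mean(run_episode(seed) for seed in test_levels)
    return train_score, test_score, train_score - test_score


def make_toy_agent(train_seeds: Sequence[int], unseen_success_rate: float = 0.6):
    """Hypothetical agent: always solves memorized levels, sometimes solves new ones."""
    train_set = set(train_seeds)

    def run_episode(level_seed: int) -> float:
        if level_seed in train_set:
            return 1.0  # memorized training level
        # Deterministic per-level outcome so repeated evaluations agree.
        solved = random.Random(level_seed).random() < unseen_success_rate
        return 1.0 if solved else 0.0

    return run_episode


train_seeds = range(100)  # stand-in for a 100-level training set
train_score, test_score, gap = generalization_gap(make_toy_agent(train_seeds), train_seeds)
print(f"train={train_score:.3f} test={test_score:.3f} gap={gap:.3f}")
```

A large positive gap between the two scores is the overfitting signal discussed above; an agent trained on an unrestricted set would have no fixed training set to score against, so only its test score is meaningful.
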
We compared our Nature-CNN baseline against the convolutional architecture used in [IMPALA](https://arxiv.org/abs/1802.01561) and found the IMPALA-CNN agents generalized *much better* with any training set, as seen in the comparison graphs.
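
For a sense of scale, the sketch below computes the feature-map sizes of the 3-layer Nature-CNN baseline. The layer hyperparameters (32 filters of 8x8 with stride 4, 64 of 4x4 with stride 2, 64 of 3x3 with stride 1) follow the linked Nature paper. The 64x64 input resolution and the absence of padding are assumptions made for this example, and the helper names are hypothetical. The IMPALA-CNN is not sketched here because the text above gives no details of it.

```python
from typing import List, Tuple

# (out_channels, kernel_size, stride) for the three convolutional layers.
NATURE_CNN_LAYERS: List[Tuple[int, int, int]] = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Spatial size after an unpadded convolution."""
    return (size - kernel) // stride + 1


def feature_map_shapes(
    input_size: int, layers: List[Tuple[int, int, int]]
) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each layer for a square input."""
    shapes = []
    size = input_size
    for channels, kernel, stride in layers:
        size = conv_output_size(size, kernel, stride)
        shapes.append((channels, size, size))
    return shapes


# Assumed 64x64 observation; the Nature paper itself used 84x84 frames.
shapes = feature_map_shapes(64, NATURE_CNN_LAYERS)
channels, height, width = shapes[-1]
flat_features = channels * height * width
print(shapes, flat_features)
```

Under these assumptions the final convolutional layer yields a 64x4x4 map, or 1,024 features feeding the fully connected layer.
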
