OpenAI trained 9 agents on the CoinRun environment with varying numbers of training levels to quantify generalization in reinforcement learning, finding that overfitting persists even with 16,000 training levels and that IMPALA-CNN architectures generalize significantly better than Nature-CNN baselines.
We’re releasing CoinRun, a training environment which provides a metric for an agent’s ability to transfer its experience to novel situations and has already helped clarify a longstanding puzzle in reinforcement learning. CoinRun strikes a desirable balance in complexity: the environment is simpler than traditional platformer games like Sonic the Hedgehog but still poses a worthy generalization challenge for state-of-the-art algorithms.
# Quantifying generalization in reinforcement learning
Source: [https://openai.com/index/quantifying-generalization-in-reinforcement-learning/](https://openai.com/index/quantifying-generalization-in-reinforcement-learning/)
We trained 9 agents to play CoinRun, each with a different number of available training levels. The first 8 agents trained on sets ranging from 100 to 16,000 levels. We trained the final agent on an unrestricted set of levels, so this agent never sees the same level twice. We trained our agents with policies using a [common](https://www.nature.com/articles/nature14236) 3-layer [convolutional architecture](https://en.wikipedia.org/wiki/Convolutional_neural_network), which we call Nature-CNN. Our agents trained with [Proximal Policy Optimization](https://arxiv.org/abs/1707.06347) ([PPO](https://blog.openai.com/openai-baselines-ppo/#ppo)) for a total of 256M timesteps. Since an episode lasts 100 timesteps on average, agents with fixed training sets will see each training level thousands to millions of times. The final agent, trained with the unrestricted set, will see roughly 2 million distinct levels, each of them exactly once.
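The "roughly 2 million" figure follows from the two numbers quoted above. As a back-of-the-envelope sketch (the function names are illustrative, and it assumes every episode lasts exactly the quoted average and that levels are sampled uniformly, neither of which is stated in the post):

```python
# Figures quoted in the text above.
TOTAL_TIMESTEPS = 256_000_000   # 256M timesteps of training per agent
AVG_EPISODE_LENGTH = 100        # average timesteps per episode


def total_episodes(timesteps=TOTAL_TIMESTEPS, avg_len=AVG_EPISODE_LENGTH):
    """Approximate number of episodes an agent plays during training."""
    return timesteps // avg_len


def visits_per_level(num_levels, timesteps=TOTAL_TIMESTEPS,
                     avg_len=AVG_EPISODE_LENGTH):
    """Average visits to each level, assuming uniform sampling (an assumption)."""
    return total_episodes(timesteps, avg_len) / num_levels


episodes = total_episodes()
print(f"episodes per agent (also distinct levels for the unrestricted agent): {episodes:,}")

# Smallest and largest fixed training sets mentioned in the text.
for n in (100, 16_000):
    print(f"{n:>6,} levels -> about {visits_per_level(n):,.0f} visits per level")
```

These are rough uniform-sampling averages; actual visit counts depend on how episode lengths vary across levels and over the course of training.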
We collected each data point in the following graphs by averaging each agent’s final performance across 10,000 episodes. At test time, the agent is evaluated on never-before-seen levels. We discovered that substantial overfitting occurs when there are fewer than 4,000 training levels. **In fact, we still see overfitting even with 16,000 training levels!** Unsurprisingly, agents trained with the unrestricted set of levels performed best, as these agents had access to the most data. These agents are represented by the dotted line in the following graphs.
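The evaluation protocol described here, averaging over 10,000 episodes and testing only on unseen levels, can be sketched as follows. This is a toy illustration rather than the CoinRun code: `make_toy_agent` is a hypothetical stand-in for a trained policy, and its success probabilities (0.9 on seen levels, 0.6 on unseen ones) are made-up numbers chosen only to produce a visible gap.

```python
import random
from statistics import mean


def make_toy_agent(train_levels, rng, p_seen=0.9, p_unseen=0.6):
    """Hypothetical stand-in for a trained policy (not a real agent).

    Returns a function that plays one episode on a level seed and
    reports 1.0 for success or 0.0 for failure.
    """
    seen = set(train_levels)

    def run_episode(level_seed):
        p = p_seen if level_seed in seen else p_unseen
        return 1.0 if rng.random() < p else 0.0

    return run_episode


def evaluate(run_episode, sample_level, n_episodes=10_000):
    """Average episode outcome over n_episodes sampled levels."""
    return mean(run_episode(sample_level()) for _ in range(n_episodes))


rng = random.Random(0)
train_levels = list(range(4_000))           # a fixed training set of level seeds
agent = make_toy_agent(train_levels, rng)

# Training performance: levels drawn from the fixed training set.
train_score = evaluate(agent, lambda: rng.choice(train_levels))
# Test performance: seeds drawn from a range disjoint from the training set.
test_score = evaluate(agent, lambda: rng.randrange(10**6, 10**9))

gap = train_score - test_score              # larger gap = more overfitting
print(f"train={train_score:.3f} test={test_score:.3f} gap={gap:.3f}")
```

Keeping the test seeds in a range disjoint from the training seeds is what ensures the test levels are never-before-seen; the difference between the two averages is one simple way to express the overfitting discussed above.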
We compared our Nature-CNN baseline against the convolutional architecture used in [IMPALA](https://arxiv.org/abs/1802.01561) and found the IMPALA-CNN agents generalized *much better* with any training set, as seen below.