Procgen Benchmark

OpenAI Blog

Summary

OpenAI introduces Procgen Benchmark, a suite of procedurally generated environments designed to evaluate generalization in reinforcement learning agents across diverse tasks, addressing overfitting issues in traditional benchmarks like Atari.

We’re releasing Procgen Benchmark, 16 simple-to-use, procedurally generated environments that provide a direct measure of how quickly a reinforcement learning agent learns generalizable skills.

Cached at: 04/20/26, 02:55 PM

# Procgen Benchmark

Source: [https://openai.com/index/procgen-benchmark/](https://openai.com/index/procgen-benchmark/)

[In](https://arxiv.org/abs/1804.06893) [several](https://openai.com/index/quantifying-generalization-in-reinforcement-learning/) [environments](https://arxiv.org/abs/1806.10729), it has been observed that agents can overfit to remarkably large training sets. This evidence raises the possibility that overfitting pervades classic benchmarks like the [Arcade Learning Environment](https://arxiv.org/abs/1207.4708), which has long served as a gold standard in reinforcement learning (RL). While the diversity between different games in the ALE is one of the benchmark’s greatest strengths, the low emphasis on generalization presents a significant drawback. In each game the question must be asked: are agents robustly learning a relevant skill, or are they approximately memorizing specific trajectories?

[CoinRun](https://openai.com/index/quantifying-generalization-in-reinforcement-learning/) was designed to address precisely this issue, by using procedural generation to construct distinct sets of training levels and test levels. While CoinRun has helped us better quantify generalization in RL, it is still only a single environment. It’s likely that CoinRun is not fully representative of the many challenges RL agents must face.

We want the best of both worlds: a benchmark comprised of many diverse environments, each of which fundamentally requires generalization. To fulfill this need, we have created Procgen Benchmark. CoinRun now serves as the inaugural environment in Procgen Benchmark, contributing its diversity to a greater whole.

Previous work, including the [Obstacle Tower Challenge](https://arxiv.org/abs/1902.01378) and the [General Video Game AI framework](https://arxiv.org/abs/1802.10363), has also encouraged using procedural generation to better evaluate generalization in RL. We’ve designed environments in a similar spirit, with two Procgen environments drawing direct inspiration from [GVGAI-based work](https://arxiv.org/abs/1806.10729). Other environments like Dota and StarCraft also provide lots of per-environment complexity, but these environments are hard to iterate on rapidly (and it’s even harder to use more than one such environment at a time).

With Procgen Benchmark, we strive for all of the following: experimental convenience, high diversity within environments, and high diversity across environments.

We found that agents strongly overfit to small training sets in almost all environments. In some cases, agents need access to as many as 10,000 levels to close the generalization gap.

We also saw a peculiar trend emerge in many environments: past a certain threshold, training performance improves as the training set grows! This runs counter to trends found in supervised learning, where training performance commonly decreases with the size of the training set. We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels. A larger training set can improve training performance if the agent learns to generalize *even across levels in the training set*. We previously noticed this effect with CoinRun, and have found that it occurs in many Procgen environments as well.
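
To make the idea of a generalization gap concrete, here is a minimal toy sketch in Python. It is not Procgen's code or API: `make_level`, `solution`, `MemorizingAgent`, and `RuleAgent` are hypothetical names invented for illustration, and the "levels" are just tuples of integers generated from a seed. The sketch assumes the gap is measured as the training score minus the score on held-out levels, and contrasts an agent that memorizes training levels with one that has learned the underlying rule.

```python
import random
from typing import List, Optional, Tuple

Level = Tuple[int, ...]

def make_level(seed: int, length: int = 16) -> Level:
    """Hypothetical procedural generator: each seed deterministically yields one level."""
    rng = random.Random(seed)
    return tuple(rng.randint(0, 3) for _ in range(length))

def solution(level: Level) -> Level:
    """Toy 'skill': the correct action for each obstacle follows one fixed rule."""
    return tuple((kind + 1) % 4 for kind in level)

class MemorizingAgent:
    """Stores the answer for every training level; has no answer for unseen levels."""
    def __init__(self) -> None:
        self.table = {}

    def train(self, levels: List[Level]) -> None:
        for level in levels:
            self.table[level] = solution(level)

    def act(self, level: Level) -> Optional[Level]:
        return self.table.get(level)  # None when the level was never seen

class RuleAgent:
    """Stands in for an agent that has learned the generalizable rule."""
    def act(self, level: Level) -> Level:
        return solution(level)

def score(agent, levels: List[Level]) -> float:
    """Fraction of levels the agent solves exactly."""
    return sum(agent.act(level) == solution(level) for level in levels) / len(levels)

def generalization_gap(agent, train_levels: List[Level], test_levels: List[Level]) -> float:
    """Training score minus held-out score."""
    return score(agent, train_levels) - score(agent, test_levels)

# Training levels come from a small, fixed set of seeds; test levels come from
# other seeds, with any accidental duplicates of training levels filtered out.
train_levels = [make_level(seed) for seed in range(200)]
seen = set(train_levels)
test_levels = [lv for lv in (make_level(s) for s in range(10_000, 10_500)) if lv not in seen]

memorizer = MemorizingAgent()
memorizer.train(train_levels)

print("memorizer gap:", generalization_gap(memorizer, train_levels, test_levels))
print("rule agent gap:", generalization_gap(RuleAgent(), train_levels, test_levels))
```

In this toy setting the memorizer solves every training level and no held-out level, so its gap is 1.0, while the rule-following agent has a gap of 0.0. Real RL agents fall between these extremes, which is why a separate set of test levels is needed to tell the two behaviors apart.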
