Deep double descent

OpenAI Blog

Summary

OpenAI research reveals the 'double descent' phenomenon where test error exhibits a non-monotonic pattern as both model size and training steps increase, challenging traditional understanding of the bias-variance tradeoff in deep learning.

We show that the double descent phenomenon occurs in CNNs, ResNets, and transformers: performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time. This effect is often avoided through careful regularization. While this behavior appears to be fairly universal, we don’t yet fully understand why it happens, and view further study of this phenomenon as an important research direction.

Full Text

Source: [https://openai.com/index/deep-double-descent/](https://openai.com/index/deep-double-descent/)

[Figure: test and train error as a function of both model size and number of optimization steps.]

For a given number of optimization steps (fixed y-coordinate), test and train error exhibit model-size double descent. For a given model size (fixed x-coordinate), as training proceeds, test and train error decrease, increase, and decrease again; we call this phenomenon epoch-wise double descent. *In general, the peak of test error appears systematically when models are just barely able to fit the train set.*

Our intuition is that, for models at the interpolation threshold, there is effectively only one model that fits the train data, and forcing it to fit even slightly noisy or misspecified labels will destroy its global structure. That is, there are no “good models” which both interpolate the train set and perform well on the test set. However, in the over-parameterized regime, there are many models that fit the train set, and good models do exist among them. Moreover, the implicit bias of stochastic gradient descent (SGD) leads it to such good models, for reasons we don’t yet understand. We leave fully understanding the mechanisms behind double descent in deep neural networks as an important open question.
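The interpolation-threshold intuition can be reproduced in miniature with minimum-norm least squares on random features, where the minimum-norm solution stands in for SGD's implicit bias toward "simple" interpolating models. The following is a minimal sketch, not the paper's code: the ReLU random-features setup, sample sizes, noise level, and feature counts are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 2000, 5

# Synthetic task: smooth target plus label noise. The label noise is
# what produces the error peak at the interpolation threshold.
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.tanh(X @ w_true) + 0.2 * rng.normal(size=n)
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

# Sweep model size (number of random ReLU features) through the
# interpolation threshold at n_feat == n_train.
for n_feat in (5, 10, 20, 30, 38, 40, 42, 60, 100, 400, 1600):
    W = rng.normal(size=(d, n_feat))
    phi_tr = np.maximum(X_tr @ W, 0.0)
    phi_te = np.maximum(X_te @ W, 0.0)
    # lstsq returns the minimum-norm solution once the system is
    # underdetermined, mimicking the implicit bias toward simple
    # interpolants discussed above.
    beta, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    test_mse = np.mean((phi_te @ beta - y_te) ** 2)
    print(f"features={n_feat:5d}  test MSE={test_mse:8.3f}")
```

With settings like these, test error typically falls, spikes near 40 features (where the model can just barely fit the noisy train set), and falls again as the model becomes heavily over-parameterized, tracing the model-size double descent curve described above.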

Similar Articles

How AI training scales

OpenAI Blog

OpenAI researchers discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training across a wide range of tasks. They found that more complex tasks and more powerful models tolerate larger batch sizes, suggesting future AI systems can scale further through increased parallelization.

Improved Techniques for Training Consistency Models

OpenAI Blog

OpenAI presents improved techniques for training consistency models that enable high-quality single-step image generation without distillation, achieving significant FID improvements on CIFAR-10 and ImageNet 64×64 through novel loss functions and training strategies.

OpenAI Baselines: ACKTR & A2C

OpenAI Blog

OpenAI releases ACKTR and A2C algorithms as part of its Baselines library, with ACKTR demonstrating improved sample complexity through natural gradient descent while maintaining computational efficiency comparable to first-order methods.

Are Flat Minima an Illusion?

arXiv cs.LG

This paper challenges the common belief that flat minima cause better generalization in neural networks, arguing that 'weakness'—a reparameterization-invariant measure of function simplicity—is the true driver. Empirical results on MNIST and Fashion-MNIST show that weakness predicts generalization while sharpness anticorrelates, and the large-batch generalization advantage vanishes as training data increases.