Deep double descent
Summary
OpenAI research describes the 'double descent' phenomenon in deep learning: test error first falls, then rises, then falls again as model size or the number of training steps increases. This non-monotonic pattern challenges the traditional understanding of the bias-variance tradeoff.
Similar Articles
The Implicit Bias of Depth: From Neural Collapse to Softmax Codes
This paper studies how depth alone induces an implicit low-rank bias in deep unconstrained feature models trained without regularization, shifting the optimal solution from neural collapse to softmax codes, and provides the first asymptotic and dynamic characterization of this bias under gradient descent with cross-entropy loss.
Double descent for least-squares interpolation on contaminated data: A simulation study
This simulation study examines the double descent phenomenon for least-squares interpolation on contaminated data in linear regression, comparing the performance of the least-squares interpolator with robust alternatives.
Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway
This paper shows that discrete Gradient Descent with large step sizes restores symmetry in multi-pathway Deep Linear Networks, countering the symmetry-breaking predicted by Gradient Flow, and leads to signal re-balancing across pathways. The authors theoretically prove that balanced solutions are flatter (less sharp) than sparse ones, and large learning rates drive the network toward stable, balanced configurations.
How AI training scales
OpenAI researchers discovered that the gradient noise scale, a simple statistical metric, predicts the parallelizability of neural network training across a wide range of tasks. They found that more complex tasks and more powerful models tolerate larger batch sizes, suggesting future AI systems can scale further through increased parallelization.
Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity
This paper reveals that Mirror Descent with non-quadratic regularizers can be exponentially more sensitive to initialization than Gradient Descent, even under well-conditioned settings, which has implications for reproducibility in RL and LLM post-training.