Flatland: The Adventures of Gradient Descent with Large Step Sizes

arXiv cs.LG Papers

Summary

This paper addresses the open question of the largest step size for which gradient descent still converges on objectives that are not globally $L$-smooth. It introduces first-order adaptive methods that operate at the edge of stability from the start of training and can drive the sharpness down to its global minimum.

arXiv:2606.06722v1 Announce Type: new Abstract: The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to reply to the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of "large" step sizes that requires only local Lipschitz (or even Hölder) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the Hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally-flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.

Cached at: 06/08/26, 09:18 AM

# Flatland: The Adventures of Gradient Descent with Large Step Sizes
Source: [https://arxiv.org/abs/2606.06722](https://arxiv.org/abs/2606.06722)
[View PDF](https://arxiv.org/pdf/2606.06722)

> Abstract: The training of neural networks often entails objective functions that are not globally $L$-smooth. For these functions, it is both theoretically and practically difficult to reply to the question: what is the largest possible step size that ensures the convergence of gradient descent (GD)? We address this longstanding open question in deep learning by providing a unifying definition of "large" step sizes that requires only local Lipschitz (or even Hölder) continuity of the gradient. We design first-order adaptive methods that provably yield large step sizes and show that they operate at the edge of stability (EoS) right from the start of the training. In particular, the loss decreases nonmonotonically and the product between the step size and sharpness, i.e., the largest eigenvalue of the Hessian, stays above the EoS threshold of 2 throughout training. Using our method, we are also able to minimize the sharpness all the way down to its global minimum. Contrary to expectation, we find that encountering globally-flat regions too early in the training may both slow down convergence and jeopardize the generalization ability of the network. Exploiting a self-stabilization argument, we allow GD to enter slightly sharper valleys and turn unsuccessful training runs into very successful ones.
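
The EoS threshold of 2 quoted in the abstract matches the classical stability bound for GD on a quadratic: with curvature $\lambda$ and step size $\eta$, the iterate along the sharpest direction is multiplied by $(1 - \eta\lambda)$ at every step, so it shrinks only when $\eta\lambda < 2$. The sketch below illustrates that bound on a toy quadratic. It is not the paper's adaptive method; the function names `sharpness` and `run_gd`, the matrix `H`, and the two products 1.9 and 2.1 are assumptions chosen for illustration.

```python
import numpy as np


def sharpness(hessian, iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hessian @ v
        v = w / np.linalg.norm(w)
    # Rayleigh quotient of the (unit-norm) converged vector.
    return float(v @ hessian @ v)


def run_gd(hessian, x0, step_size, num_steps=50):
    """Plain GD on f(x) = 0.5 * x^T H x; returns the loss at every iterate."""
    x = np.asarray(x0, dtype=float)
    losses = []
    for _ in range(num_steps):
        losses.append(0.5 * x @ hessian @ x)
        x = x - step_size * (hessian @ x)  # gradient of f is H x
    losses.append(0.5 * x @ hessian @ x)
    return losses


# Toy quadratic (an assumption for illustration): curvatures 4 and 1.
H = np.diag([4.0, 1.0])
lam = sharpness(H)

# Compare a step size just below and just above the threshold eta * lam = 2.
for product in (1.9, 2.1):
    eta = product / lam
    losses = run_gd(H, [1.0, 1.0], eta)
    print(f"eta * sharpness = {product}: final loss = {losses[-1]:.3e}")
```

On this pure quadratic the run with a product above 2 diverges, because the curvature never changes. The abstract's claim concerns objectives where the sharpness varies along the trajectory, which is the setting in which its self-stabilization argument applies.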

## Submission history

From: Curtis Fox [[view email](https://arxiv.org/show-email/e2ceb203/2606.06722)]

**[v1]** Thu, 4 Jun 2026 21:14:07 UTC (10,979 KB)

Similar Articles

Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

arXiv cs.LG

This paper generalizes non-uniform smoothness assumptions to objectives whose curvature is affine in the objective value, proving convergence rates for steepest descent and diagonal variants of RMSProp and Adam, with applications to logistic regression and neural networks.

Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

arXiv cs.LG

This paper shows that discrete Gradient Descent with large step sizes restores symmetry in multi-pathway Deep Linear Networks, countering the symmetry-breaking predicted by Gradient Flow, and leads to signal re-balancing across pathways. The authors theoretically prove that balanced solutions are flatter (less sharp) than sparse ones, and large learning rates drive the network toward stable, balanced configurations.

Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

arXiv cs.LG

This paper analyzes generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces with deterministic or stochastic rounding, showing that rounding degrades generalization for GD and introduces dimension-dependent errors for stochastic rounding.

Deep double descent

OpenAI Blog

OpenAI research reveals the 'double descent' phenomenon where test error exhibits a non-monotonic pattern as both model size and training steps increase, challenging traditional understanding of the bias-variance tradeoff in deep learning.