Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Summary
This paper generalizes non-uniform smoothness assumptions to objectives whose curvature is affine in the objective value, proving convergence rates for steepest descent and diagonal variants of RMSProp and Adam, with applications to logistic regression and neural networks.
View Cached Full Text
Cached at: 06/01/26, 09:30 AM
# Convergence of Steepest Descent and Adam under Non-Uniform Smoothness Source: [https://arxiv.org/abs/2605.30648](https://arxiv.org/abs/2605.30648) [View PDF](https://arxiv.org/pdf/2605.30648) > Abstract:Recent work has analyzed the convergence of first\-order methods under non\-uniform smoothness assumptions that better model the loss landscape in machine learning tasks\. We generalize this assumption to objectives whose curvature is an affine function of the objective value\. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks\. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam\. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD\. Furthermore, we show that for a class of two\-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step\-size and momentum parameter\. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy\-ball momentum\. ## Submission history From: Sharan Vaswani \[[view email](https://arxiv.org/show-email/53f7e821/2605.30648)\] **\[v1\]**Thu, 28 May 2026 23:05:45 UTC \(79 KB\)
Similar Articles
Flatland: The Adventures of Gradient Descent with Large Step Sizes
This paper addresses the open question of maximum step size for gradient descent convergence on non-L-smooth objectives, introducing adaptive methods that operate at the edge of stability and can minimize sharpness globally.
Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations
This paper establishes a theoretical framework showing that smooth activations in deep neural networks can mitigate the curse of dimensionality in uniform convergence, providing non-asymptotic guarantees and outperforming ReLU networks in worst-case reliability.
Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters
This paper analyzes generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces with deterministic or stochastic rounding, showing that rounding degrades generalization for GD and introduces dimension-dependent errors for stochastic rounding.
Beyond Bounded Variance: Variance-Reduced Normalized Methods for Nonconvex Optimization under Blum-Gladyshev Noise
This paper studies nonconvex stochastic optimization under Blum-Gladyshev noise, where gradient variance grows with distance from initialization. It proves convergence guarantees for normalized SGD with momentum and a variance-reduced STORM method, achieving minimax optimal rates under certain conditions.
A Unified Framework for Gradient Aggregation in Multi-Objective Optimization
This paper presents a unified theoretical framework for gradient aggregation in multi-objective optimization, establishing convergence rates to Pareto stationarity. The authors introduce a sufficient alignment condition and demonstrate its application to existing and new algorithms, such as capped MGDA.