Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

arXiv cs.LG 06/01/26, 04:00 AM Papers

optimization convergence steepest-descent adam smoothness machine-learning

Summary

This paper generalizes non-uniform smoothness assumptions to objectives whose curvature is affine in the objective value, proving convergence rates for steepest descent and diagonal variants of RMSProp and Adam, with applications to logistic regression and neural networks.
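One plausible way to write "curvature affine in the objective value" is sketched below. The constants $L_0$, $L_1$, the symbol $\theta$, and the choice of norm are notation assumed here for illustration; the paper's exact statement may differ. The second display checks the condition for the scalar logistic loss, which is a standard calculus fact rather than a result taken from the paper.

```latex
% Assumed notation: f >= 0 is the objective, L_0, L_1 >= 0 are constants.
\|\nabla^2 f(\theta)\| \;\le\; L_0 + L_1\, f(\theta)
\quad \text{for all } \theta .

% Check for the scalar logistic loss \ell(z) = \log(1 + e^{-z}).
% Let p = 1 / (1 + e^{z}) \in (0, 1), so that \ell(z) = -\log(1 - p). Then
\ell''(z) \;=\; p\,(1 - p) \;\le\; p \;\le\; -\log(1 - p) \;=\; \ell(z),
% i.e. the bound holds with L_0 = 0 and L_1 = 1.
```

Averaging over data points $z_i = y_i x_i$ gives the same bound for logistic regression in the spectral norm with $L_0 = 0$ and $L_1 = \max_i \|z_i\|_2^2$, so the curvature vanishes as the loss does.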

arXiv:2605.30648v1 (announce type: new)

Abstract: Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
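The claim that sign GD can beat GD on separable logistic regression can be illustrated with a small experiment. The following is a minimal sketch, not the paper's setup: the synthetic data, the step sizes, and all function names (`make_toy_data`, `compare_gd_and_sign_gd`, and so on) are assumptions made here. The rows `z_i = y_i * x_i` are drawn with strictly positive entries, which makes the data separable by construction and keeps every gradient coordinate negative, so the behavior of sign GD is easy to reason about.

```python
import numpy as np


def make_toy_data(n=50, d=5, seed=0):
    # Rows are z_i = y_i * x_i. Strictly positive entries make the data
    # separable by construction (w = all-ones has positive margins).
    # This is a hypothetical toy dataset, not one used in the paper.
    rng = np.random.default_rng(seed)
    return rng.uniform(0.5, 1.5, size=(n, d))


def loss(w, Z):
    # mean_i log(1 + exp(-z_i . w)), computed stably via logaddexp.
    return float(np.mean(np.logaddexp(0.0, -(Z @ w))))


def grad(w, Z):
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)) = -exp(-logaddexp(0, m)).
    s = np.exp(-np.logaddexp(0.0, Z @ w))
    return -(Z.T @ s) / Z.shape[0]


def compare_gd_and_sign_gd(steps=100, eta_sign=0.1):
    Z = make_toy_data()
    # GD uses step 1/L, where L = max_i ||z_i||^2 / 4 upper-bounds the
    # uniform smoothness constant of the logistic loss.
    eta_gd = 4.0 / np.max(np.sum(Z**2, axis=1))
    w_gd = np.zeros(Z.shape[1])
    w_sign = np.zeros(Z.shape[1])
    for _ in range(steps):
        # Plain gradient descent.
        w_gd = w_gd - eta_gd * grad(w_gd, Z)
        # Sign GD: constant-length step along the sign of the gradient.
        w_sign = w_sign - eta_sign * np.sign(grad(w_sign, Z))
    return loss(w_gd, Z), loss(w_sign, Z)


if __name__ == "__main__":
    loss_gd, loss_sign = compare_gd_and_sign_gd()
    print(f"GD loss: {loss_gd:.3e}, sign GD loss: {loss_sign:.3e}")
```

On this toy problem sign GD increases every margin by a fixed amount per iteration, so its loss shrinks geometrically, whereas the GD step shrinks together with the gradient and progress slows. The illustrative step sizes are not the ones analyzed in the paper.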

# Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Source: [https://arxiv.org/abs/2605.30648](https://arxiv.org/abs/2605.30648)
[View PDF](https://arxiv.org/pdf/2605.30648)

## Submission history

From: Sharan Vaswani [[view email](https://arxiv.org/show-email/53f7e821/2605.30648)] **[v1]** Thu, 28 May 2026 23:05:45 UTC (79 KB)

Similar Articles

The Convergence Behavior of Adam under Heavy-Tailed Noise

Flatland: The Adventures of Gradient Descent with Large Step Sizes

Learning from the Descent Direction: Adaptive Gradient Descent under One-Sided Hölder Regularity

Mitigating the Curse of Dimensionality in Uniform Convergence of Deep Neural Networks via Smooth Activations

Stability Annealing Selects the Implicit Bias of Smoothed Sign Descent: A Rate-Indexed Barrier Path on Separable Data
