Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

arXiv cs.LG Papers

Summary

This paper generalizes non-uniform smoothness assumptions to objectives whose curvature is affine in the objective value, proving convergence rates for steepest descent and diagonal variants of RMSProp and Adam, with applications to logistic regression and neural networks.

arXiv:2605.30648v1 Announce Type: new Abstract: Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
Original Article
View Cached Full Text

Cached at: 06/01/26, 09:30 AM

# Convergence of Steepest Descent and Adam under Non-Uniform Smoothness
Source: [https://arxiv.org/abs/2605.30648](https://arxiv.org/abs/2605.30648)
[View PDF](https://arxiv.org/pdf/2605.30648)

> Abstract:Recent work has analyzed the convergence of first\-order methods under non\-uniform smoothness assumptions that better model the loss landscape in machine learning tasks\. We generalize this assumption to objectives whose curvature is an affine function of the objective value\. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks\. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam\. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD\. Furthermore, we show that for a class of two\-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step\-size and momentum parameter\. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy\-ball momentum\.

## Submission history

From: Sharan Vaswani \[[view email](https://arxiv.org/show-email/53f7e821/2605.30648)\] **\[v1\]**Thu, 28 May 2026 23:05:45 UTC \(79 KB\)

Similar Articles

Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

arXiv cs.LG

This paper analyzes generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces with deterministic or stochastic rounding, showing that rounding degrades generalization for GD and introduces dimension-dependent errors for stochastic rounding.

A Unified Framework for Gradient Aggregation in Multi-Objective Optimization

arXiv cs.LG

This paper presents a unified theoretical framework for gradient aggregation in multi-objective optimization, establishing convergence rates to Pareto stationarity. The authors introduce a sufficient alignment condition and demonstrate its application to existing and new algorithms, such as capped MGDA.