gradient-descent

#gradient-descent

Regularity-Aware Stochastic MGDA with Adaptive Conflict-Avoidant Update Direction Control

arXiv cs.LG ↗ · yesterday Cached

This paper proposes a regularity-aware stochastic multi-gradient descent method (MoRe) that adaptively switches between conflict-avoidant and scalarization updates. The method improves the convergence rate from Õ(T^{-1/4}) to Õ(T^{-1/2}) in nonconvex settings while maintaining per-iterate conflict avoidance.


Learning in Curved Weight Space: Exponential-Linear Weight Reparameterization for Improved Optimization

arXiv cs.LG ↗ · 2026-07-14 Cached

Introduces SymExpLin (SEL), a weight reparameterization that combines symmetric-exponential and linear pathways to improve optimization in neural networks, reducing training steps by up to 1.49x on transformers.


Understanding Schedule-Free Methods in Nonconvex Optimization: Rate Guarantees and Escaping Saddles

arXiv cs.LG ↗ · 2026-07-13 Cached

This paper provides worst-case convergence analyses for Schedule-Free gradient descent and stochastic gradient descent in nonconvex optimization, establishing optimal rates and strict-saddle avoidance, thus theoretically justifying their empirical success.


@TensorTonic: 7 math ideas every ML engineer uses daily and almost nobody has actually derived: 1. Why gradient descent moves in the …

X AI KOLs Timeline ↗ · 2026-07-11 Cached

This tweet lists 7 fundamental math ideas used daily by ML engineers, with brief explanations emphasizing the underlying derivations, such as why gradient descent moves in the steepest direction and why softmax plus cross-entropy yields a clean gradient.
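The "clean gradient" of softmax plus cross-entropy is the identity that the gradient of the loss with respect to the logits equals the predicted probabilities minus the one-hot target. The following is a minimal sketch that checks this identity numerically with central differences. The logits, the target index, and all function names are illustrative assumptions, not taken from the tweet.

```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    # Negative log-probability assigned to the target class.
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    # Claimed closed form: softmax(z) minus the one-hot target vector.
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(z, target, eps=1e-6):
    # Central finite differences, one logit at a time.
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        diff = cross_entropy(z_plus, target) - cross_entropy(z_minus, target)
        grad.append(diff / (2 * eps))
    return grad

logits = [2.0, -1.0, 0.5]  # hypothetical example values
target = 0
max_gap = max(
    abs(a - n)
    for a, n in zip(analytic_grad(logits, target), numeric_grad(logits, target))
)
print("largest analytic-vs-numeric gap:", max_gap)
```

The components of the analytic gradient sum to zero because the probabilities sum to one, which is a quick sanity check on any implementation.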


Optimal Learning Rate Scaling Depends on Data in Deep Scalar Linear Networks

arXiv cs.LG ↗ · 2026-07-10 Cached

This paper demonstrates that optimal learning rate scaling in deep scalar linear networks is inherently data-dependent, contradicting prior data-agnostic scaling rules. It shows that with data-dependent scaling, convergence becomes depth-independent, including at infinite depth.


Hybrid Least Squares/Gradient Descent Methods for MIONets

arXiv cs.LG ↗ · 2026-07-09 Cached

Proposes a hybrid least squares/gradient descent method for MIONets to accelerate training by using alternating least squares for the last layer parameters of multiple branch networks, leveraging Kronecker and Khatri-Rao products.


@0x0SojalSec: Want to truly stand out in AI/ML not just use the tools, but understand and improve them? understand why gradient desce…

X AI KOLs Timeline ↗ · 2026-06-30 Cached

A tweet promoting a curated collection of math and deep learning resources for understanding the foundations behind models like Claude, including linear algebra, real analysis, optimization, and representation theory.


Reflecting to optimise

Hacker News Top ↗ · 2026-06-26 Cached

A blog post discussing optimization techniques for constrained categorical probability distributions, using softmax reparameterization and log barrier methods, applied to protein binder design.
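As a rough illustration of softmax reparameterization in general (not the blog post's own code or objective), the sketch below minimizes an expected cost over a categorical distribution by running gradient descent on unconstrained logits, so the probabilities stay on the simplex without any projection step. The cost vector, step size, and iteration count are assumptions chosen for the example.

```python
import math

def softmax(z):
    # Subtract the max logit for numerical stability.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def expected_cost_grad(logits, costs):
    # For f(z) = sum_i p_i * c_i with p = softmax(z),
    # df/dz_k = p_k * (c_k - sum_j p_j * c_j).
    p = softmax(logits)
    mean_cost = sum(p_i * c_i for p_i, c_i in zip(p, costs))
    return [p_i * (c_i - mean_cost) for p_i, c_i in zip(p, costs)]

costs = [3.0, 1.0, 2.0]   # hypothetical per-category costs
logits = [0.0, 0.0, 0.0]  # start from the uniform distribution
step_size = 0.5

for _ in range(500):
    grad = expected_cost_grad(logits, costs)
    logits = [z - step_size * g for z, g in zip(logits, grad)]

probs = softmax(logits)
print("optimized distribution:", probs)
```

Every iterate is a valid probability vector by construction, and the mass drifts toward the lowest-cost category. A log barrier, as mentioned in the summary, is a different way to handle the same constraint and is not shown here.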


I made a gradient descent visualization for different optimizers. [P]

Reddit r/ArtificialInteligence ↗ · 2026-06-17

A project that visualizes gradient descent for different optimization algorithms, useful for understanding how optimizers work in machine learning.


FastMix: Fast Data Mixture Optimization via Gradient Descent

arXiv cs.LG ↗ · 2026-06-16 Cached

FastMix is a novel framework that automates data mixture discovery for training large models using a single proxy model and bilevel optimization, achieving state-of-the-art performance with significant efficiency gains.


Uniform Stability and Generalization Error of GD and SGD on Fixed-Point Parameters

arXiv cs.LG ↗ · 2026-06-08 Cached

This paper analyzes generalization error, uniform stability, and uniform argument stability of gradient descent (GD) and stochastic gradient descent (SGD) over discrete parameter spaces with deterministic or stochastic rounding, showing that rounding degrades generalization for GD and introduces dimension-dependent errors for stochastic rounding.


Flatland: The Adventures of Gradient Descent with Large Step Sizes

arXiv cs.LG ↗ · 2026-06-08 Cached

This paper addresses the open question of maximum step size for gradient descent convergence on non-L-smooth objectives, introducing adaptive methods that operate at the edge of stability and can minimize sharpness globally.
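For context, the classical baseline that this question generalizes is the L-smooth case: on the quadratic f(x) = (L/2) x^2, gradient descent multiplies x by (1 - step_size * L) at each step, so it converges exactly when step_size < 2/L. The sketch below demonstrates only that textbook threshold under assumed values. It does not implement the paper's adaptive methods or its non-L-smooth setting.

```python
def run_gd(curvature, step_size, x0=1.0, steps=50):
    # Gradient descent on f(x) = 0.5 * curvature * x**2, whose gradient is curvature * x.
    x = x0
    for _ in range(steps):
        x = x - step_size * (curvature * x)
    return x

curvature = 4.0               # plays the role of L; the threshold is 2 / L = 0.5
below = run_gd(curvature, 0.45)  # per-step factor 1 - 1.8 = -0.8, so |x| shrinks
above = run_gd(curvature, 0.55)  # per-step factor 1 - 2.2 = -1.2, so |x| grows

print("step size 0.45 ->", below)
print("step size 0.55 ->", above)
```

In both runs the iterate flips sign every step. Only the magnitude of the per-step factor decides between convergence and divergence, which is why the boundary at 2/L is called the edge of stability.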


Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

arXiv cs.LG ↗ · 2026-06-05 Cached

This paper shows that discrete Gradient Descent with large step sizes restores symmetry in multi-pathway Deep Linear Networks, countering the symmetry-breaking predicted by Gradient Flow, and leads to signal re-balancing across pathways. The authors theoretically prove that balanced solutions are flatter (less sharp) than sparse ones, and large learning rates drive the network toward stable, balanced configurations.


Edge of Stability Selectively Shapes Learning Across the Data Distribution

arXiv cs.LG ↗ · 2026-06-04 Cached

MIT researchers show that the edge of stability (EoS) in neural network training is not merely a global optimization phenomenon but selectively redistributes learning across subsets of the training distribution, amplifying progress on some data groups while suppressing others. They identify two key conditions governing this allocation: gradient alignment with the top Hessian eigenvector and sustained non-vanishing gradient magnitude.


Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent

arXiv cs.LG ↗ · 2026-06-04 Cached

This paper develops a sharp pseudospectral theory for block-triangular Jacobians in coupled gradient descent, proving Kreiss-constant bounds and establishing iteration complexity results. The work exposes non-asymptotic, instance-dependent transient amplification phenomena relevant to bilevel optimization, two-time-scale stochastic approximation, and GAN training.


Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks

arXiv cs.LG ↗ · 2026-06-02 Cached

This paper derives exact closed-form expressions for gradients and test loss after one and two steps of gradient descent in two-layer and three-layer linear neural networks, characterizing optimal learning rate selection and revealing a distinct early-training regime where unequal layer-wise learning rates are initially optimal.


Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

arXiv cs.LG ↗ · 2026-05-27 Cached

This paper proposes Lie group embedded dynamical neural networks (LieEDNN) with learning algorithms based on gradient descent and metric projection on smooth manifolds, enabling stable dynamics on Lie groups like SO(3) and SE(3) for robotics and control applications.


In-Context Optimization for Retrieval-Augmented Generation: A Gradient-Descent Perspective

arXiv cs.CL ↗ · 2026-05-27 Cached

This paper studies retrieval-augmented generation as an in-context optimization process, showing that linear self-attention can implement gradient descent on a unified RAG objective. It proposes a lightweight method for frozen RAG LLMs that predicts context-conditioned updates, improving performance across multiple QA benchmarks.


@pallavishekhar_: Math Behind Gradient Descent Read here: https://outcomeschool.com/blog/math-behind-gradient-descent…

X AI KOLs Timeline ↗ · 2026-05-26 Cached

This blog post explains the math behind gradient descent, the fundamental optimization algorithm used to train machine learning models, with a step-by-step numeric example and intuition.


The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

arXiv cs.LG ↗ · 2026-05-25 Cached

This paper studies how depth alone induces an implicit low-rank bias in deep unconstrained feature models trained without regularization, shifting the optimal solution from neural collapse to softmax codes, and provides the first asymptotic and dynamic characterization of this bias under gradient descent with cross-entropy loss.
