Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

arXiv cs.LG Papers

Summary

This paper introduces AdaNAGED, a method that combines zero-order optimization, parameter-free adaptation, and non-Euclidean update geometry for memory-efficient fine-tuning of large language models, with theoretical convergence guarantees and validation on the OPT-1.3B model.

arXiv:2606.14970v1 Announce Type: new Abstract: Fine-tuning large language models (LLMs) has become a central application of modern optimization, enabling pretrained models to adapt to diverse downstream tasks and domain-specific data. A major obstacle in large-scale fine-tuning is the memory overhead of backpropagation, which requires storing activations, gradients, and optimizer states. Zeroth-order (ZO) optimization offers a memory-efficient alternative, but its performance is highly sensitive to the stepsize and smoothing parameter, often requiring costly task-specific tuning. Parameter-free (PF) optimization addresses this issue by adapting algorithmic parameters without prior knowledge of problem-dependent constants. Moreover, large-scale fine-tuning can benefit from geometry-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit linear minimization oracle (LMO). In this work, we study PF adaptation for LMO-based ZO optimization and introduce $\texttt{AdaNAGED}$, a method that unifies gradient-free training, adaptive tuning, and non-Euclidean update geometry. We establish convergence guarantees and validate the method on large-scale LLM fine-tuning task with $\texttt{OPT}-1.3\mathrm{B}$ model.
Original Article
View Cached Full Text

Cached at: 06/16/26, 11:36 AM

# 1 Introduction
Source: [https://arxiv.org/html/2606.14970](https://arxiv.org/html/2606.14970)
![[Uncaptioned image]](https://arxiv.org/html/2606.14970v1/logos/innopolis.png)![[Uncaptioned image]](https://arxiv.org/html/2606.14970v1/x1.png)![[Uncaptioned image]](https://arxiv.org/html/2606.14970v1/x2.png)Zero\-order Parameter\-free Optimization for LMO\-based Methods: Novel Approach for Efficient Fine\-tuningDmitriy Bystrov1,2, Daniil Medyakov1,2, Dmitry Bylinkin1,2, Aleksandr Beznosikov1,2,31Moscow Independent Research Institute of Artificial Intelligence2Basic Research of Artificial Intelligence Laboratory \(BRAIn Lab\)3Innopolis UniversityFine\-tuning large language models \(LLMs\) has become a central application of modern optimization, enabling pretrained models to adapt to diverse downstream tasks and domain\-specific data\. A major obstacle in large\-scale fine\-tuning is the memory overhead of backpropagation, which requires storing activations, gradients, and optimizer states\. Zeroth\-order \(ZO\) optimization offers a memory\-efficient alternative, but its performance is highly sensitive to the stepsize and smoothing parameter, often requiring costly task\-specific tuning\. Parameter\-free \(PF\) optimization addresses this issue by adapting algorithmic parameters without prior knowledge of problem\-dependent constants\. Moreover, large\-scale fine\-tuning can benefit from geometry\-aware updates that account for the heterogeneous structure of parameter blocks, which can be modeled through methods that exploit linear minimization oracle \(LMO\)\. In this work, we study PF adaptation for LMO\-based ZO optimization and introduceAdaNAGED, a method that unifies gradient\-free training, adaptive tuning, and non\-Euclidean update geometry\. We establish convergence guarantees and validate the method on large\-scale LLM fine\-tuning task withOPT−1\.3​B\\texttt\{OPT\}\-1\.3\\mathrm\{B\}model\.

Optimization is a fundamental component of modern machine learning, and its role is particularly prominent in the fine\-tuning of large language models \(LLMs\)\[[1](https://arxiv.org/html/2606.14970#bib.bib1),[2](https://arxiv.org/html/2606.14970#bib.bib2)\]\. Fine\-tuning offers an effective and flexible way to adapt pretrained models to diverse downstream tasks while leveraging the knowledge acquired during pretraining\[[3](https://arxiv.org/html/2606.14970#bib.bib3),[4](https://arxiv.org/html/2606.14970#bib.bib4),[5](https://arxiv.org/html/2606.14970#bib.bib5)\]\. Formally, we consider the following optimization problem:

minx∈ℝd⁡f​\(x\),\\min\_\{x\\in\\mathbb\{R\}^\{d\}\}f\(x\),\(1\)wherexxdenotes the model parameters, andffis the loss function\. Training is typically performed using gradient\-based optimization methods, such asSGD\[[6](https://arxiv.org/html/2606.14970#bib.bib6)\]orAdam\[[7](https://arxiv.org/html/2606.14970#bib.bib7)\]\. These methods rely on gradients computed via backpropagation\[[8](https://arxiv.org/html/2606.14970#bib.bib8)\], which provides the first\-order information required for parameter updates\. This combination has become the standard training pipeline in modern machine learning, offering strong empirical performance and reliable optimization behavior across a wide range of tasks\.

Despite its strong empirical performance, backpropagation becomes a major memory bottleneck in large\-scale LLM fine\-tuning, as it requires storing intermediate activations, gradients, and optimizer states throughout training\[[9](https://arxiv.org/html/2606.14970#bib.bib9),[10](https://arxiv.org/html/2606.14970#bib.bib10)\]\. Since fine\-tuning adapts an already pretrained model to a narrower target task or domain, this overhead motivates optimization methods that reduce or avoid the memory costs of first\-order training\. Existing approaches address this challenge through block\-wise descent methods\[[11](https://arxiv.org/html/2606.14970#bib.bib11),[12](https://arxiv.org/html/2606.14970#bib.bib12)\], quantization\[[13](https://arxiv.org/html/2606.14970#bib.bib13),[14](https://arxiv.org/html/2606.14970#bib.bib14)\], and parameter\-efficient fine\-tuning methods\[[15](https://arxiv.org/html/2606.14970#bib.bib15),[16](https://arxiv.org/html/2606.14970#bib.bib16)\]\. However, these methods still rely on backpropagation and thereforecannot fully remove the memory burdenassociated with gradient\-based optimization\. A fundamentally different optimization paradigm is provided by zeroth\-order optimization \(ZO\)\[[17](https://arxiv.org/html/2606.14970#bib.bib17)\]\. Instead of relying on gradients, ZO methods assume access only to function\-value evaluations of the objective\. Formally, ZO replaces gradients with finite\-difference estimates:

∇f​\(x\)≈f​\(x\+τ​e\)−f​\(x\)τ​e,\\nabla f\(x\)\\approx\\frac\{f\(x\+\\tau e\)\-f\(x\)\}\{\\tau\}e,\(2\)whereeeis a random sample vector andτ\\tauis a smoothing parameter\. Zeroth\-order optimization is a mature area of optimization, comprising a broad class of algorithms with established convergence guarantees\[[18](https://arxiv.org/html/2606.14970#bib.bib18),[19](https://arxiv.org/html/2606.14970#bib.bib19),[20](https://arxiv.org/html/2606.14970#bib.bib20)\]\. This paradigm directly addresses the memory bottleneck: instead of storing the intermediate quantities required for backpropagation, ZO methods rely on function\-value queries along sampled perturbation directions\. In particular, after the finite\-difference estimator is constructed, the vectoreeneed not be stored explicitly; followingMalladi et al\. \[[10](https://arxiv.org/html/2606.14970#bib.bib10)\], it can be represented by the seed of a random generator, introducing essentially no additional memory overhead\.

The ZO setting introduces several challenges, including the need for careful hyperparameter selection\. In particular, the performance of ZO methods can be highly sensitive to the choice of the stepsize and the smoothing parameter\. Although theoretically optimal choices exist, computing them typically requires prior access to problem\-dependent quantities, such as Lipschitz constant, or the distance to the optimal solution\[[6](https://arxiv.org/html/2606.14970#bib.bib6),[21](https://arxiv.org/html/2606.14970#bib.bib21),[22](https://arxiv.org/html/2606.14970#bib.bib22)\], which areunavailablein practical applications\. As a result, implementations often rely on time\-consuming grid search, which must be repeated for each new task\. A natural way to address this issue is through parameter\-free optimization \(PF\)\. Algorithms in this paradigm do not require prior knowledge of problem\-specific constants\. Instead, they estimate such quantities adaptively from the local optimization landscape and the statistics accumulated along the trajectory\. In first\-order optimization, this line of work is well developed and includes a variety of schemes for estimating problem\-dependent constants while preserving strong convergence guarantees\[[23](https://arxiv.org/html/2606.14970#bib.bib23),[24](https://arxiv.org/html/2606.14970#bib.bib24),[25](https://arxiv.org/html/2606.14970#bib.bib25)\]\.

Compared with the first\-order setting, parameter\-free zeroth\-order optimization remains considerably less explored\. Extending PF principles to ZO is challenging because hyperparameters must be adapted from finite\-difference information rather than exact gradients\. This is especially delicate for the smoothing parameterτ\\tau: large values increase the bias of the gradient approximation, while overly small values can lead to instability due to the division byτ\\tau\[[26](https://arxiv.org/html/2606.14970#bib.bib26)\]\. As a result, ZO\-PF methods must adapt not only the stepsize but also the smoothing parameterτ\\tauwithout access to problem\-dependent constants, making this direction challenging\.

Beyond hyperparameter choice, ZO fine\-tuning also raises the question of how to adapt updates to the geometry of large\-scale models\. Since different parameter groups may naturally correspond to different vector or matrix structures, Euclidean updates may not always capture the most suitable local geometry\. This motivates optimization methods defined with respect to non\-Euclidean norms\. A prominent example of this principle isMuon\[[27](https://arxiv.org/html/2606.14970#bib.bib27)\], which can be viewed as anSGD\-type method that exploits spectral\-norm geometry for matrix\-shaped parameters\. By approximately orthogonalizing matrix updates,Muonbalances update directions according to the structure of matrix\-valued weights\. This idea can be formulated more generally using linear minimization oracles \(LMOs\) over norm balls\[[28](https://arxiv.org/html/2606.14970#bib.bib28),[29](https://arxiv.org/html/2606.14970#bib.bib29)\]\. Given a norm\-induced domain𝒟\\mathcal\{D\}, an LMO is defined as

lmo𝒟⁡\(x\)=arg⁡minv∈𝒟⁡⟨x,v⟩\.\\operatorname\{lmo\}\_\{\\mathcal\{D\}\}\(x\)=\\arg\\min\_\{v\\in\\mathcal\{D\}\}\\langle x,v\\rangle\.\(3\)Specific choices of the underlying norm recover different update rules: vector norms yield methods related toNormalized SGDwith momentum\[[30](https://arxiv.org/html/2606.14970#bib.bib30)\]andSign\-SGDwith momentum\[[31](https://arxiv.org/html/2606.14970#bib.bib31)\], whereas matrix norms capture methods such asMuon\[[27](https://arxiv.org/html/2606.14970#bib.bib27)\]andGluon\[[32](https://arxiv.org/html/2606.14970#bib.bib32)\]\. Such geometry\-aware updates are especially attractive for large\-scale fine\-tuning, as they can improve the conditioning of parameter updates and provide more stable optimization across heterogeneous vector\- and matrix\-shaped blocks\.

Parameter\-free optimization and LMO\-based updates address two essential aspects of ZO fine\-tuning\. The former reduces the reliance on costly task\-specific hyperparameter tuning, while the latter enables optimization under structured non\-Euclidean geometries through linear minimization oracles\. Despite their individual relevance, it remains unclearhow parameter\-free ideas can be carried out in the zeroth\-order setting when the update direction is defined by an LMO\. In this work, we provide a comprehensive study of this question\.

## 2 Related Work and Contribution

### 2\.1Related Work

#### Zeroth\-order methods\.

Early foundations of gradient\-free optimization include simultaneous perturbation methods, which estimate the descent directions from function\-value evaluations along randomly perturbed points\[[33](https://arxiv.org/html/2606.14970#bib.bib33)\]\. A modern analysis of stochastic zeroth\-order methods for nonconvex optimization was developed in\[[17](https://arxiv.org/html/2606.14970#bib.bib17)\], where the gradient\-based updates are replaced by estimators constructed from zeroth\-order oracle queries\. Since then, a wide range of estimator designs and sampling strategies has been studied\[[19](https://arxiv.org/html/2606.14970#bib.bib19)\]\. Classic choices include random directions on the unit sphere\[[34](https://arxiv.org/html/2606.14970#bib.bib34)\], coordinate\-wise perturbations based on the canonical basis\[[35](https://arxiv.org/html/2606.14970#bib.bib35)\], and Gaussian smoothing estimators\[[17](https://arxiv.org/html/2606.14970#bib.bib17),[10](https://arxiv.org/html/2606.14970#bib.bib10)\]\. These estimators are typically designed to approximate the gradient of a smoothed objective using only function values\.

More recent work has explored more structured perturbation schemes and algorithmic improvements\. For example,HiZOOuses Hessian\-informed zeroth\-order updates to improve the convergence of LLM fine\-tuning\[[36](https://arxiv.org/html/2606.14970#bib.bib36)\]\. Another line of work connects randomized finite\-difference estimators with policy\-gradient methods, interpreting ZO updates from a reinforcement\-learning perspective and relating them toREINFORCE\-type estimators\[[37](https://arxiv.org/html/2606.14970#bib.bib37),[38](https://arxiv.org/html/2606.14970#bib.bib38),[39](https://arxiv.org/html/2606.14970#bib.bib39)\]\. Orthogonal advances include momentum\-based ZO methods\[[40](https://arxiv.org/html/2606.14970#bib.bib40)\], and variance\-reduced estimators\[[41](https://arxiv.org/html/2606.14970#bib.bib41),[42](https://arxiv.org/html/2606.14970#bib.bib42),[43](https://arxiv.org/html/2606.14970#bib.bib43)\]\.

ZO methods have been successfully applied to deep neural networks, demonstrating the practical relevance of this paradigm\[[44](https://arxiv.org/html/2606.14970#bib.bib44),[45](https://arxiv.org/html/2606.14970#bib.bib45)\]\. Moreover, they have emerged as a promising approach to alleviating the memory constraints of LLM fine\-tuning\[[10](https://arxiv.org/html/2606.14970#bib.bib10),[19](https://arxiv.org/html/2606.14970#bib.bib19)\]\.

#### Parameter\-free optimization\.

A large part of the theory of adaptive and parameter\-free optimization has been developed for non\-smooth convex objectives\. In this setting, the optimal stepsize for gradient\-based methods\[[46](https://arxiv.org/html/2606.14970#bib.bib46)\]is typically chosen as

γ=‖x0−x∗‖M​T\.\\gamma=\\frac\{\\\|x^\{0\}\-x^\{\*\}\\\|\}\{M\\sqrt\{T\}\}\.\(4\)whereMMis the Lipschitz constant of the objective\(\|f​\(x\)−f​\(y\)\|≤M​‖x−y‖​∀x,y∈ℝd\)\(\|f\(x\)\-f\(y\)\|\\leq M\\\|x\-y\\\|\\ \\forall x,y\\in\\mathbb\{R\}^\{d\}\)\. In practice, however, neitherMMnor the initial distance to the solution‖x0−x∗‖\\\|x^\{0\}\-x^\{\*\}\\\|is available\. The core idea of parameter\-free optimization is to estimate such quantities adaptively and recover convergence guarantees without requiring them as input\.

Early adaptive methods, includingAdaGrad\[[47](https://arxiv.org/html/2606.14970#bib.bib47)\],AdaDelta\[[48](https://arxiv.org/html/2606.14970#bib.bib48)\], andAdam\[[7](https://arxiv.org/html/2606.14970#bib.bib7)\], were among the first widely used approaches to reduce the dependence on prior knowledge of problem constants\. These methods exploit the statistics of already computed gradients to rescale the updates and construct an adaptive stepsize schedule throughout training\. However, some of them introduce additional memory overhead due to the stored statistics, and they generally do not explicitly adapt to the initial distance to the optimum\. Parameter\-free ideas have also been extensively studied in online learning, although under a different problem formulation\[[49](https://arxiv.org/html/2606.14970#bib.bib49)\]\. This line of work introduced coin\-betting, reward\-doubling, and related schemes\[[50](https://arxiv.org/html/2606.14970#bib.bib50),[51](https://arxiv.org/html/2606.14970#bib.bib51),[52](https://arxiv.org/html/2606.14970#bib.bib52),[53](https://arxiv.org/html/2606.14970#bib.bib53)\]\. These methods eliminate several forms of manual tuning, but their guarantees are usually tied to online\-learning assumptions, such as bounded stochastic oracles\.

A subsequent milestone in parameter\-free optimization was the development of methods that adapt to the initial distance‖x0−x∗‖\\\|x^\{0\}\-x^\{\*\}\\\|in \([4](https://arxiv.org/html/2606.14970#S2.E4)\)\. One of the first parameter\-free methods with such adaptation was proposed inCarmon and Hinder \[[23](https://arxiv.org/html/2606.14970#bib.bib23)\]\. However, the construction relies on an additional search procedure, which limits its practical efficiency\. TheDoGalgorithm\[[54](https://arxiv.org/html/2606.14970#bib.bib54)\]was later introduced as an iterative scheme that adaptively estimates both the Lipschitz constant and the distance to the solution, although its distance estimate can be unbounded\. Subsequent works\[[55](https://arxiv.org/html/2606.14970#bib.bib55),[56](https://arxiv.org/html/2606.14970#bib.bib56)\]addressed this issue by introducing more stable distance estimators\. However, some algorithms introduced in these papers still require prior knowledge of the Lipschitz constant and lack stochastic analysis\. Recent work has further extended the parameter\-free paradigm by incorporating momentum\[[57](https://arxiv.org/html/2606.14970#bib.bib57)\], weighting mechanisms\[[25](https://arxiv.org/html/2606.14970#bib.bib25)\], and sharper theoretical analyses\[[58](https://arxiv.org/html/2606.14970#bib.bib58),[59](https://arxiv.org/html/2606.14970#bib.bib59)\]\.

Parameter\-free methods have also been studied beyond the non\-smooth setting\. In the smooth case,DoWG\[[60](https://arxiv.org/html/2606.14970#bib.bib60)\]provides an analysis of anAdaGrad\-Normstepsize\. A complementary line of work builds on the adaptive stepsize idea fromMalitsky and Mishchenko \[[61](https://arxiv.org/html/2606.14970#bib.bib61)\]\. In this direction, the work\[[62](https://arxiv.org/html/2606.14970#bib.bib62)\]proposes a parameter\-free method based on the local smoothness estimates computed along the optimization trajectory\. Later,Medyakov et al\. \[[63](https://arxiv.org/html/2606.14970#bib.bib63)\]utilized this concept and applied it to the analysis ofSign\-SGDin the PF setting\.

#### LMO\-based methods\.

Linear minimization oracles \(LMOs\) first appeared in the context ofFrank\-Wolfe\-type methods for constrained optimization\. This line of work studies optimization over structured feasible sets, including convex polyhedra, nuclear\-norm balls, and flow polytopes\[[64](https://arxiv.org/html/2606.14970#bib.bib64),[65](https://arxiv.org/html/2606.14970#bib.bib65)\]\. The same technique was later recognized as useful for unconstrained problems formulated with respect to abstract norms\.

The foundation of non\-Euclidean analysis in this context can be traced to theSign\-SGDalgorithm\[[66](https://arxiv.org/html/2606.14970#bib.bib66),[67](https://arxiv.org/html/2606.14970#bib.bib67)\]\. This method has been shown to outperformSGDwhen analyzed through theℓ1\\ell\_\{1\}\-norm geometry\[[68](https://arxiv.org/html/2606.14970#bib.bib68),[69](https://arxiv.org/html/2606.14970#bib.bib69)\]\. Later,Muonwas introduced as a matrix\-aware algorithm and analyzed under the spectral norm\[[27](https://arxiv.org/html/2606.14970#bib.bib27)\]\. This sparked a line of related work on quantization\[[70](https://arxiv.org/html/2606.14970#bib.bib70)\], faster randomized SVD decompositions\[[71](https://arxiv.org/html/2606.14970#bib.bib71)\], and layer\-wise optimization\[[32](https://arxiv.org/html/2606.14970#bib.bib32)\], further establishing non\-Euclidean analysis as a useful perspective\.

Subsequently,\[[28](https://arxiv.org/html/2606.14970#bib.bib28)\]and\[[29](https://arxiv.org/html/2606.14970#bib.bib29)\]proposed analyzing optimizer convergence under abstract norms using LMOs\. This framework generalizes several existing methods, includingNormalized SGDwith momentum\[[30](https://arxiv.org/html/2606.14970#bib.bib30)\],Sign\-SGDwith momentum\[[31](https://arxiv.org/html/2606.14970#bib.bib31)\], andMuon\[[27](https://arxiv.org/html/2606.14970#bib.bib27)\]\.

#### Combined approaches\.

In recent years, parameter\-free ideas have also been extended to zeroth\-order optimization\. The work\[[72](https://arxiv.org/html/2606.14970#bib.bib72)\]initiated this direction with thePOEMalgorithm\. Its stepsize adaptation follows theDoG\-type rule\[[54](https://arxiv.org/html/2606.14970#bib.bib54)\], but it uses ZO gradient estimates and retains the Euclidean update geometry\. However, their analysis requires a prior estimate of the Lipschitz constant, which weakens the PF nature of the method\. Moreover, the scheme requires to store the original model weights, which undermines the memory efficiency\. This line was later extended to decentralized optimization byD\-POEM\[[73](https://arxiv.org/html/2606.14970#bib.bib73)\], and to variance\-reduced schemes\[[74](https://arxiv.org/html/2606.14970#bib.bib74)\]\. Nevertheless, these methods require𝒪​\(d\)\\mathcal\{O\}\(d\)function evaluations for ZO gradient estimator construction, which is infeasible for LLM fine\-tuning\. Moreover, their empirical validation is limited to dimensions below those relevant for large\-scale models\.

Recent work has also explored parameter\-free extensions of LMO\-based and non\-Euclidean methods\. In particular, several variants ofMuonincorporateAdam\-style normalization or adaptive scaling into matrix\-aware updates\[[75](https://arxiv.org/html/2606.14970#bib.bib75),[76](https://arxiv.org/html/2606.14970#bib.bib76)\]\. Parameter\-free concepts have also been studied in LMO\-based constrained optimization, mostly in theFrank–Wolfe\-based methods\[[77](https://arxiv.org/html/2606.14970#bib.bib77),[78](https://arxiv.org/html/2606.14970#bib.bib78),[79](https://arxiv.org/html/2606.14970#bib.bib79)\]\. However, these methods do not address the issue of memory\-efficient LLM fine\-tuning\.

LMO\-based ideas have also appeared in zeroth\-order optimization\. Recent works exploit the matrix structure in the optimization variables to design more geometry\-aware ZO methods\[[80](https://arxiv.org/html/2606.14970#bib.bib80),[81](https://arxiv.org/html/2606.14970#bib.bib81),[82](https://arxiv.org/html/2606.14970#bib.bib82)\]\. However, this line of work does not consider parameter\-free challenges\.

### 2\.2Contribution

In light of the related work discussed above, our main contributions are as follows\.

- •We propose a novel approach forconstructing parameter\-free stepsizein the zeroth\-order setting\. Our method estimates local smoothness using differences between ZO gradient approximations at neighboring points along the optimization trajectory\. Thispreserves the memory\-efficient nature of ZO optimization, as it only requires storing a generator seed corresponding to the perturbation direction of ZO gradient\.
- •Based on the proposed parameter\-free estimators, we introduceAdaNAGED, an algorithm that exploits LMO\-based updates\. As a result,our method combines three desirable properties: memory\-efficient zeroth\-order optimization, reduced dependence on hyperparameter tuning, and geometry\-aware updates\.
- •We provideconvergence guaranteesfor the proposed method considering Lipschitz continuous, smooth and convex objectives\. The resulting convergence rate is asymptotically optimal\.
- •We conduct experiments by fine\-tuning theOPT\-1\.3Bmodel on theSST\-2task\. The performance of our method matches the tunedSign\-SGD,MuonandAdaMM\.

## 3 Preliminaries

In this section, we present the preliminary concepts required for constructing and analyzing the desired method\. To start with, we define some notation used in the paper\. We denote∥⋅∥\\\|\\cdot\\\|and∥⋅∥∗\\\|\\cdot\\\|\_\{\*\}as an abstract norm and its dual conjugate, respectively,𝔼​\[⋅\]\\mathbb\{E\}\[\\cdot\]represents the mathematical expectation\. Additionally, for generality, we introduce the equivalence factor within the chosen norm and the22\-norm:c2​‖v‖≤‖v‖2≤C2​‖v‖,c2⁣∗​‖v‖∗≤‖v‖2≤C2⁣∗​‖v‖∗​∀v∈ℝdc\_\{2\}\\\|v\\\|\\leq\\\|v\\\|\_\{2\}\\leq C\_\{2\}\\\|v\\\|,c\_\{2\*\}\\\|v\\\|\_\{\*\}\\leq\\\|v\\\|\_\{2\}\\leq C\_\{2\*\}\\\|v\\\|\_\{\*\}\\ \\forall v\\in\\mathbb\{R\}^\{d\}\.

#### LMO\.

In this work, we use LMOs over norm balls of arbitrary radius, which allows the update geometry and the update scale to be controlled separately\[[29](https://arxiv.org/html/2606.14970#bib.bib29)\]\. For a given abstract norm∥⋅∥\\\|\\cdot\\\|and radiusρ\>0\\rho\>0, we define:

lmo⁡\(x\)=arg⁡min‖v‖≤ρ⁡⟨x,v⟩\.\\operatorname\{lmo\}\(x\)=\\arg\\min\_\{\\\|v\\\|\\leq\\rho\}\\langle x,v\\rangle\.This formulation covers a broad class of non\-Euclidean update rules by varying the underlying norm\. For example, for the Euclidean norm,lmo⁡\(x\)=−ρ​x‖x‖2\\operatorname\{lmo\}\(x\)=\-\\rho\\frac\{x\}\{\\\|x\\\|\_\{2\}\}, while for theℓ∞\\ell\_\{\\infty\}\-norm,lmo⁡\(x\)=−ρ⋅s​i​g​n​\(x\)\.\\operatorname\{lmo\}\(x\)=\-\\rho\\cdot sign\(x\)\.Thus, the choice of norm determines the direction of the update, whereas the radiusρ\\rhocontrols its magnitude\.

#### Zero\-order optimization\.

In the ZO setting, gradients are replaced by finite\-difference approximations computed from function\-value queries\. Given a perturbation directioneeand a smoothing parameterτ\>0\\tau\>0, we define the ZO gradient estimator as

gτ​\(x,e\)=f​\(x\+τ​e\)−f​\(x\)τ​e\.g\_\{\\tau\}\(x,e\)=\\frac\{f\(x\+\\tau e\)\-f\(x\)\}\{\\tau\}e\.For the analysis, it is standard to introduce the corresponding smoothed objectivefτ​\(x\)=𝔼​\[f​\(x\+τ​e\)\]f\_\{\\tau\}\(x\)=\\mathbb\{E\}\[f\(x\+\\tau e\)\], where the expectation is taken over the random perturbation directionee\[[34](https://arxiv.org/html/2606.14970#bib.bib34),[83](https://arxiv.org/html/2606.14970#bib.bib83)\]\. This allows the behavior of ZO methods to be studied in terms of the gradients of the smoothed function rather than the original objective directly\.

Next, we list the assumptions regarding the objective\.

###### Assumption 1\.

The objectiveffisLL\-smooth, i\.e\., for anyx,y∈ℝdx,y\\in\\mathbb\{R\}^\{d\}it implies

‖∇f​\(x\)−∇f​\(y\)‖2≤L​‖x−y‖2,where​L\>0,‖x‖2=∑i=1dxi2\.\\\|\\nabla f\(x\)\-\\nabla f\(y\)\\\|\_\{2\}\\leq L\\\|x\-y\\\|\_\{2\},\\text\{\\penalty 10000\\ \\penalty 10000\\ where\\penalty 10000\\ \\penalty 10000\\ \}L\>0,\\\|x\\\|\_\{2\}=\\sqrt\{\\sum\_\{i=1\}^\{d\}x\_\{i\}^\{2\}\}\.

Lipschitzness of the gradient is an inherent aspect of neural network optimization due to their almost\-everywhere smooth behavior\[[84](https://arxiv.org/html/2606.14970#bib.bib84)\]\. Moreover, smoothness is one of the foundational assumptions in optimization theory\[[6](https://arxiv.org/html/2606.14970#bib.bib6),[21](https://arxiv.org/html/2606.14970#bib.bib21)\]\.

###### Assumption 2\.

The objectiveffis convex, i\.e\., for anyx,y∈ℝdx,y\\in\\mathbb\{R\}^\{d\}it implies

f​\(y\)≥f​\(x\)\+⟨∇f​\(x\),y−x⟩\.f\(y\)\\geq f\(x\)\+\\langle\\nabla f\(x\),y\-x\\rangle\.

Although neural networks, and LLMs in particular, are non\-convex, they are shown to establish convex\-like properties\[[85](https://arxiv.org/html/2606.14970#bib.bib85),[86](https://arxiv.org/html/2606.14970#bib.bib86),[87](https://arxiv.org/html/2606.14970#bib.bib87)\]\. This motivates analyzing the convex setting as a principled baseline\. Moreover, convex optimization serves as a theoretical foundation for the design of optimization algorithms\. For example, momentum\[[88](https://arxiv.org/html/2606.14970#bib.bib88)\]andAdaGrad\[[47](https://arxiv.org/html/2606.14970#bib.bib47)\]were initially developed and analyzed for convex problems\.

###### Assumption 3\.

The objectiveffisMM\-Lipschitz, i\.e\., for anyx,y∈ℝdx,y\\in\\mathbb\{R\}^\{d\}it implies

\|f​\(x\)−f​\(y\)\|≤M​‖x−y‖,where​M\>0\.\|f\(x\)\-f\(y\)\|\\leq M\\\|x\-y\\\|,\\text\{\\penalty 10000\\ \\penalty 10000\\ where\\penalty 10000\\ \\penalty 10000\\ \}M\>0\.

ZO methods are often analyzed under Lipschitz continuity assumptions, since function\-value queries alone require control over how the objective changes under random perturbations\. This condition makes it possible to relate finite\-difference estimators to the gradient of a smoothed objective and to control the smoothing bias introduced by the perturbation radiusτ\\tau\[[34](https://arxiv.org/html/2606.14970#bib.bib34),[83](https://arxiv.org/html/2606.14970#bib.bib83),[35](https://arxiv.org/html/2606.14970#bib.bib35)\]\.

## 4 Main Results

#### Motivation\.

We begin by motivating how a parameter\-free stepsize should be constructed in our setting\. To this end, we first examine the stepsize that would be optimal for a ZO LMO\-based method if the relevant problem constants were known\. As discussed in the preliminaries \(Section[3](https://arxiv.org/html/2606.14970#S3)\), the standard ZO analysis is naturally carried out for the smoothed objectivefτf\_\{\\tau\}\. Applying the basic descent argument to an LMO\-based update with a fixed stepsizeγ\\gammaand smoothing parameterτ\\tauyields

1T​∑t=0T−1‖∇fτ​\(xt\)‖∗≤Δγ​T\+γ​L2\.\\frac\{1\}\{T\}\\sum\_\{t=0\}^\{T\-1\}\\\|\\nabla f\_\{\\tau\}\(x^\{t\}\)\\\|\_\{\*\}\\leq\\frac\{\\Delta\}\{\\gamma T\}\+\\frac\{\\gamma L\}\{2\}\.Here,Δ=fτ​\(x0\)−fτ​\(x∗\)\\Delta=f\_\{\\tau\}\(x^\{0\}\)\-f\_\{\\tau\}\(x^\{\*\}\)\. Balancing the two terms gives the optimal fixed stepsize

γ=ΔL​T,which leads to the convergence rate​𝒪​\(Δ​LT\)\.\\gamma=\\sqrt\{\\frac\{\{\\Delta\}\}\{\{LT\}\}\},\\text\{\\penalty 10000\\ \\penalty 10000\\ which leads to the convergence rate\\penalty 10000\\ \\penalty 10000\\ \}\\mathcal\{O\}\\left\(\\sqrt\{\\frac\{\\Delta L\}\{T\}\}\\right\)\.\(5\)However, bothΔ\\DeltaandLLare problem\-dependent quantities and are not available in practice\. A parameter\-free method should therefore avoid requiring them as inputs\. Following the standard principle behind parameter\-free algorithms, we construct iterative estimates of these constants along the optimization trajectory and use them directly to set the stepsize\. This allows the method to adapt within a single run without an external grid search\. The central question is thus how to build valid ZO\-compatible approximations ofΔ\\DeltaandLLthat preserve the desired convergence guarantees\.

#### Our estimators\.

We first construct an approximation of the smoothness constantLL\. Our parameter\-free objective is closely related to the approach ofMishkin et al\. \[[62](https://arxiv.org/html/2606.14970#bib.bib62)\], Medyakov et al\. \[[63](https://arxiv.org/html/2606.14970#bib.bib63)\], who use local smoothness estimates computed along the optimization trajectory\. In their setting, this estimate takes the formLt=‖∇ft​\(xt\+1\)−∇ft​\(xt\)‖∗/‖xt\+1−xt‖L^\{t\}=\\nicefrac\{\{\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\}\}\{\{\\\|x^\{t\+1\}\-x^\{t\}\\\|\}\}\. This quantity is natural because smoothness is only invoked between two adjacent iterates; thus, it is sufficient to estimate curvature along the realized update direction\. In our ZO setting, however, exact gradient oracles are unavailable\. We therefore replace the gradients in the local smoothness estimate with their ZO approximations\. Specifically, we define

Lt=‖gt​\(xt\+1,et\)−gt​\(xt,et\)‖∗‖xt\+1−xt‖,where​gt​\(x,e\)=gτt​\(x,e\)\.L^\{t\}=\\frac\{\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\}\{\\\|x^\{t\+1\}\-x^\{t\}\\\|\},\\text\{\\penalty 10000\\ \\penalty 10000\\ where\\penalty 10000\\ \\penalty 10000\\ \}g^\{t\}\(x,e\)=g\_\{\\tau^\{t\}\}\(x,e\)\.\(6\)Note that to improve the stability of this estimate, we use the same perturbation directionete^\{t\}at both points\. Our construction also preserves the memory\-efficient nature of ZO optimization: it only requires the corresponding loss differences and the perturbation direction, which can be stored using the seed of a random generator\[[10](https://arxiv.org/html/2606.14970#bib.bib10)\]\.

For the convergence analysis, and also for the numerical stability of the method, such local estimates must remain bounded\. In the approach ofMishkin et al\. \[[62](https://arxiv.org/html/2606.14970#bib.bib62)\], Medyakov et al\. \[[63](https://arxiv.org/html/2606.14970#bib.bib63)\], this follows directly from Assumption[1](https://arxiv.org/html/2606.14970#Thmassumption1)\. In contrast, boundedness is not obvious for our estimate \([6](https://arxiv.org/html/2606.14970#S4.E6)\)\. We therefore establish the following lemma, which shows that the proposed estimator remains controlled under our assumptions\.

###### Lemma 1\.

Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2), and[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then the estimator \([6](https://arxiv.org/html/2606.14970#S4.E6)\) is uniformly bounded:

Lt≤C2c2⁣∗​L,L^\{t\}\\leq\\frac\{C\_\{2\}\}\{c\_\{2\*\}\}L,wherec2⁣∗,C2c\_\{2\*\},C\_\{2\}are norm equivalence factors\.

Theoretically, we need to approximate the factorL​T\\sqrt\{LT\}appearing in the denominator of the optimal stepsize\. We therefore aggregate the local smoothness estimates\([6](https://arxiv.org/html/2606.14970#S4.E6)\)in the manner ofAdaGrad\-Norm\[[47](https://arxiv.org/html/2606.14970#bib.bib47)\], leading to the following denominator for the stepsize:

St=St−1\+Lt,γt∼1St\.S^\{t\}=S^\{t\-1\}\+L^\{t\},\\ \\gamma^\{t\}\\sim\\frac\{1\}\{\\sqrt\{S^\{t\}\}\}\.
For the approximation ofΔ\\Delta, a precise estimate is typically not necessary\. Instead, it is often sufficient to use a lower bound on the objective\. Namely, assume that there exists a known constantf~\\widetilde\{f\}such thatf​\(x\)≥f~\\ f\(x\)\\geq\\widetilde\{f\}\\holds for allx∈ℝdx\\in\\mathbb\{R\}^\{d\}\. Then

Δ=f​\(x0\)−f​\(x∗\)≤f​\(x0\)−f~\.\\Delta=f\(x\_\{0\}\)\-f\(x\_\{\*\}\)\\leq f\(x\_\{0\}\)\-\\widetilde\{f\}\.Such a bound is readily available in many practical settings: for empirical risk minimization with nonnegative losses, one can takef~=0\\widetilde\{f\}=0; similarly, feasibility\-type formulations often satisfyf​\(x∗\)=0f\(x^\{\*\}\)=0\[[89](https://arxiv.org/html/2606.14970#bib.bib89)\], making the same lower bound valid\. Since our analysis is carried out for the smoothed objective, we also note that the same lower bound is preserved forfτf\_\{\\tau\}\. Indeed,fτ​\(x\)=𝔼​\[f​\(x\+τ​e\)\]≥𝔼​\[f~\]=f~f\_\{\\tau\}\(x\)=\\mathbb\{E\}\[f\(x\+\\tau e\)\]\\geq\\mathbb\{E\}\[\\widetilde\{f\}\]=\\widetilde\{f\}\. Therefore, the quantityf​\(x0\)−f~f\(x^\{0\}\)\-\\widetilde\{f\}provides a valid upper estimate of the initial gap\. Also,ρ\\rhoserves as a scaling factor forlmo\\operatorname\{lmo\}operator\.

Consequently, we use the following stepsize for training:

γt=f​\(x0\)−f~ρ​∑i=0t−1Li\.\\gamma^\{t\}=\\frac\{\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}\.

Next, we turn to the choice of the smoothing parameterτ\\tau\. Rather than introducing an additional tuning rule, we chooseτt\\tau^\{t\}directly from the stability requirement that appears in the proof of Lemma[1](https://arxiv.org/html/2606.14970#Thmlemma1)\. In particular, the analysis requires controlling the relation between the stepsizeγt\\gamma^\{t\}and the smoothing radiusτt\\tau^\{t\}\. The key bound takes the form \(see Lemma[3](https://arxiv.org/html/2606.14970#Thmlemma3)for details\)

Lt≤L​\(ρ​C2​γt\)2\+\(τt\)2τt​γt,L^\{t\}\\leq L\\frac\{\(\\rho C\_\{2\}\\gamma^\{t\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\}\{\\tau^\{t\}\\gamma^\{t\}\},To make this bound uniform, the two terms in the numerator should be balanced, which naturally leads to a linear scaling betweenτt\\tau^\{t\}andγt\\gamma^\{t\}\. We therefore set

τt=ρ​C2​γt=C2​f​\(x0\)−f~∑i=0t−1Li\.\\tau^\{t\}=\\rho C\_\{2\}\\gamma^\{t\}=\\frac\{C\_\{2\}\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}\.

#### Algorithm and theoretical analysis\.

We present the main algorithmic contribution of the paper:AdaNAGED\(ADAptive Norm\-Agnostic Gradient\-frEe Descent\)\. Using the estimators constructed above, Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)samples a random directionete^\{t\}\(Line[3](https://arxiv.org/html/2606.14970#alg0.l3)\), computes the corresponding ZO gradient \(Line[4](https://arxiv.org/html/2606.14970#alg0.l4)\), provides a model update \(Lines[5](https://arxiv.org/html/2606.14970#alg0.l5),[6](https://arxiv.org/html/2606.14970#alg0.l6)\), and computes a ZO approximation in a new point \(Line[7](https://arxiv.org/html/2606.14970#alg0.l7)\) to updateγ,τ\\gamma,\\tauquantities \(Line[9](https://arxiv.org/html/2606.14970#alg0.l9)\)\. Now we present the main theoretical result\.

AdaNAGED

1:Input:

x0∈ℝdx^\{0\}\\in\\mathbb\{R\}^\{d\},

S−1=ξ,C2,γ0=f​\(x0\)−f~ρ​S−1,τ0=ρ​C2​γ0,ξS^\{\-1\}=\\xi,C\_\{2\},\\gamma^\{0\}=\\frac\{\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\\sqrt\{S^\{\-1\}\}\},\\tau^\{0\}=\\rho\{C\_\{2\}\}\\gamma^\{0\},\\xiis an arbitrary small value

2:for

t=0,…,T−1t=0,\\ldots,T\-1do

3:

et∼S∥⋅∥2d−1e^\{t\}\\sim S^\{d\-1\}\_\{\\\|\\cdot\\\|\_\{2\}\}//Sample random seed

4:

gt​\(xt,et\)=\(f​\(xt\+τt​et\)−f​\(xt\)\)​etτtg^\{t\}\(x^\{t\},e^\{t\}\)=\(f\(x^\{t\}\+\\tau^\{t\}e^\{t\}\)\-f\(x^\{t\}\)\)\\frac\{e^\{t\}\}\{\\tau^\{t\}\}//Compute ZO\-gradient

5:

vt=lmo⁡\(gt​\(xt,et\)\)v^\{t\}=\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)//Compute LMO

6:

xt\+1=xt\+γt​vtx^\{t\+1\}=x^\{t\}\+\\gamma^\{t\}v^\{t\}//Provide model update

7:

gt​\(xt\+1,et\)=\(f​\(xt\+1\+τt​et\)−f​\(xt\+1\)\)​etτtg^\{t\}\(x^\{t\+1\},e^\{t\}\)=\(f\(x^\{t\+1\}\+\\tau^\{t\}e^\{t\}\)\-f\(x^\{t\+1\}\)\)\\frac\{e^\{t\}\}\{\\tau^\{t\}\}//Compute ZO\-gradient in new point

8:

Lt=‖gt​\(xt\+1,et\)−gt​\(xt,et\)‖∗‖xt\+1−xt‖L^\{t\}=\\frac\{\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\}\{\\\|x^\{t\+1\}\-x^\{t\}\\\|\}//ApproximateLL

9:

St=St−1\+Lt,γt\+1=f​\(x0\)−f~ρ​St,τt\+1=ρ​C2​γt\+1S^\{t\}=S^\{t\-1\}\+L^\{t\},\\gamma^\{t\+1\}=\\frac\{\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\\sqrt\{S^\{t\}\}\},\\tau^\{t\+1\}=\\rho C\_\{2\}\\gamma^\{t\+1\}//Update parameters

10:endfor

11:return

xTx\_\{T\}

###### Theorem 1\.

Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2),[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)to achieveε\\varepsilon\-accuracy needs

T=𝒪​\(C23​Δ​L3c2⁣∗3​ε2​\(1\+M2​C22Δ​L0\)​𝔼​\[1L0\]2​𝔼​\[log⁡T​LL0\]2\+M​C2⁣∗\)​iterations,T=\\mathcal\{O\}\\left\(\\frac\{\{C\_\{2\}^\{3\}\\Delta L^\{3\}\}\}\{c\_\{2\*\}^\{3\}\\varepsilon^\{2\}\}\\left\(1\+\\frac\{M^\{2\}C\_\{2\}^\{2\}\}\{\\Delta\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\mathbb\{E\}\\left\[\\log\\frac\{TL\}\{L^\{0\}\}\\right\]^\{2\}\+MC\_\{2\*\}\\right\)\\text\{\\penalty 10000\\ \\penalty 10000\\ iterations,\}whereε=𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\\varepsilon=\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\], andΔ=f​\(x0\)−f~\\Delta=f\(x^\{0\}\)\-\\widetilde\{f\}\.

Discussion\.Although the objective is convex, we formulate the guarantee as convergence to anε\\varepsilon\-stationary point of the smoothed objective, which is the standard form of convergence for LMO\-based methods\[[29](https://arxiv.org/html/2606.14970#bib.bib29)\]\. The bound preserves the optimal asymptotic dependence𝒪​\(1/ε2\)\\mathcal\{O\}\(\\nicefrac\{\{1\}\}\{\{\\varepsilon^\{2\}\}\}\)\[[17](https://arxiv.org/html/2606.14970#bib.bib17),[83](https://arxiv.org/html/2606.14970#bib.bib83)\], while incurring an additional factor𝒪​\(\(M2​L2/Δ​\(L0\)3\)\)\\mathcal\{O\}\(\(\\nicefrac\{\{M^\{2\}L^\{2\}\}\}\{\{\\Delta\(L^\{0\}\)^\{3\}\}\}\)\)from using parameter\-free estimates instead of the optimal stepsize\. The remaining non\-vanishing term comes from the stochasticity of randomized finite\-difference directions, which is standard in LMO analysis\[[66](https://arxiv.org/html/2606.14970#bib.bib66),[80](https://arxiv.org/html/2606.14970#bib.bib80)\]\.

### 4\.1Results for different optimizers

Now we demonstrate how the choice of a particular norm leads to different optimizers\. By choosing∥⋅∥=∥⋅∥∞,∥⋅∥∗=∥⋅∥1\\\|\\cdot\\\|=\\\|\\cdot\\\|\_\{\\infty\},\\\|\\cdot\\\|\_\{\*\}=\\\|\\cdot\\\|\_\{1\}, Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)transforms into aSign\-SGD\-like method that follows both the ZO and PF paradigms\. In this case, we can derive exact values for equivalence factors:c2=1,C2=d,c2⁣∗=1d,C2⁣∗=1\.c\_\{2\}=1,C\_\{2\}=\\sqrt\{d\},c\_\{2\*\}=\\frac\{1\}\{\\sqrt\{d\}\},C\_\{2\*\}=1\.

###### Corollary 1\.

Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2),[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)to achieveε\\varepsilon\-accuracy needs

T=𝒪​\(Δ​L3​d3ε2​\(1\+M2​dΔ​L0\)​𝔼​\[1L0\]2​𝔼​\[log⁡T​L​dL0\]2\+M\)​iterations,T=\\mathcal\{O\}\\left\(\\frac\{\\Delta\{L^\{3\}d^\{3\}\}\}\{\\varepsilon^\{2\}\}\\left\(1\+\\frac\{M^\{2\}\{d\}\}\{\\Delta\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\mathbb\{E\}\\left\[\\log\\frac\{TLd\}\{L^\{0\}\}\\right\]^\{2\}\+M\\right\)\\text\{\\penalty 10000\\ \\penalty 10000\\ iterations,\}whereε=𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖1\],\\displaystyle\\varepsilon=\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{1\}\\right\],andΔ=f0​\(x0\)−f~\\Delta=f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\.

This rate is the same as in Theorem[1](https://arxiv.org/html/2606.14970#Thmtheorem1), but it explicitly clarifies the constants induced by the particular choice of norm\. In this case, the norm\-dependent factors yield ad4d^\{4\}dependence in the leading term\. Such dimensional dependence is intrinsic to ZO optimization\[[17](https://arxiv.org/html/2606.14970#bib.bib17)\]\. Moreover, a comparable dependence is observed in the ZO LMO analysis in\[[80](https://arxiv.org/html/2606.14970#bib.bib80)\], indicating that the dimensional factors in our bound are consistent with existing results\.

Next, we consider the matrix domain and select∥⋅∥=∥⋅∥s​p,∥⋅∥∗=∥⋅∥n​u​c\\\|\\cdot\\\|=\\\|\\cdot\\\|\_\{sp\},\\\|\\cdot\\\|\_\{\*\}=\\\|\\cdot\\\|\_\{nuc\}, the spectral matrix norm, and the nuclear matrix norm, respectively\. We aim to recover aMuon\-like method with this choice of norm\. However, the LMO in the chosen norm results in a polar operator derived from SVD \(withρ=1\\rho=1\):

lmo⁡\(X\)=−Polar​\(X\)=−U​VT,\\operatorname\{lmo\}\(X\)=\-\\texttt\{Polar\}\(X\)=\-UV^\{T\},whereX=U​Σ​VTX=U\\Sigma V^\{T\}is SVD decomposition\. This creates a practical implementation challenge due to the cost of computing the SVD\. To address this issue, Newton–Schulz steps \(N\-SS\) have been proposed as an efficient alternative\[[90](https://arxiv.org/html/2606.14970#bib.bib90)\]\. At their core, N\-SS apply a matrix polynomial transformation that converges to the polar operator\. Since this procedure avoids matrix inversion, it substantially reduces computational cost\. Moreover, the approximation error becomes negligible after only a few iterations due to its double\-exponential convergence\. Accordingly, we modify Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)by replacing exact LMO operator \(Line[5](https://arxiv.org/html/2606.14970#alg0.l5)\) withqqsteps of Newton\-Schultz algorithm\. Thus, we obtainAdaMuGED\(ADAptive MUon Gradient\-frEe Descent\) — Algorithm[4\.1](https://arxiv.org/html/2606.14970#S4.SS1)\. The full description of the algorithm can be found in the appendix[D](https://arxiv.org/html/2606.14970#A4)\.

AdaMuGED

1:Input:

X0∈ℝdX^\{0\}\\in\\mathbb\{R\}^\{d\},

S−1=ξ,C2,γ0=f​\(x0\)−f~ρ​S−1,τ0=C2​γ0,ξS^\{\-1\}=\\xi,C\_\{2\},\\gamma^\{0\}=\\frac\{\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\\sqrt\{S^\{\-1\}\}\},\\tau^\{0\}=\{C\_\{2\}\}\\gamma^\{0\},\\xiis an arbitrary small value

2:for

t=0,…,T−1t=0,\\ldots,T\-1do

3:

Et∼S∥⋅∥2d−1E^\{t\}\\sim S^\{d\-1\}\_\{\\\|\\cdot\\\|\_\{2\}\}//Sample random seed

4:

Gt​\(Xt,Et\)=\(f​\(Xt\+τt​Et\)−f​\(Xt\)\)​EtτtG^\{t\}\(X^\{t\},E^\{t\}\)=\(f\(X^\{t\}\+\\tau^\{t\}E^\{t\}\)\-f\(X^\{t\}\)\)\\frac\{E^\{t\}\}\{\\tau^\{t\}\}//Compute ZO\-gradient

5:

Vt←N\-SS​\(Gt​\(Xt,Et\),q\)V^\{t\}\\leftarrow\\texttt\{N\-SS\}\(G^\{t\}\(X^\{t\},E^\{t\}\),q\)//Compute LMO with N\-SS

6:

Xt\+1=Xt\+γt​VtX^\{t\+1\}=X^\{t\}\+\\gamma^\{t\}V^\{t\}//Provide model update

7:

Gt​\(xt\+1,Et\)=\(f​\(Xt\+1\+τt​et\)−f​\(Xt\+1\)\)​EtτtG^\{t\}\(x^\{t\+1\},E^\{t\}\)=\(f\(X^\{t\+1\}\+\\tau^\{t\}e^\{t\}\)\-f\(X^\{t\+1\}\)\)\\frac\{E^\{t\}\}\{\\tau^\{t\}\}//Compute ZO\-gradient in new point

8:

Lt=‖Gt​\(Xt\+1,Et\)−Gt​\(Xt,Et\)‖∗‖Xt\+1−Xt‖L^\{t\}=\\frac\{\\\|G^\{t\}\(X^\{t\+1\},E^\{t\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\}\{\\\|X^\{t\+1\}\-X^\{t\}\\\|\}//ApproximateLL

9:

St=St−1\+Lt,γt\+1=f​\(X0\)−f~ρ​St,τt\+1=C2​γt\+1S^\{t\}=S^\{t\-1\}\+L^\{t\},\\gamma^\{t\+1\}=\\frac\{\\sqrt\{f\(X^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\\sqrt\{S^\{t\}\}\},\\tau^\{t\+1\}=C\_\{2\}\\gamma^\{t\+1\}//Update parameters

10:endfor

11:return

xTx\_\{T\}

Here,N\-SS\(A,q\)\(A,q\)denotes applyingqqsteps of the Newton\-Schultz algorithm to the matrixA\.A\.

The algorithm[4\.1](https://arxiv.org/html/2606.14970#S4.SS1)converges according to the following Theorem\. The proof can be found in the appendix[4](https://arxiv.org/html/2606.14970#Thmtheorem4)\.

###### Theorem 2\.

Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2),[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then Algorithm[4\.1](https://arxiv.org/html/2606.14970#S4.SS1)to achieveε\\varepsilon\-accuracy needs

T=𝒪​\(χq5​\(C23​Δ​L3c2⁣∗3​ε2​\(1\+M2​C22Δ​L0\)​𝔼​\[1L0\]2​𝔼​\[log⁡T​LL0\]2\)\+χq2​M​C2⁣∗\),\\displaystyle T=\\mathcal\{O\}\\left\(\\chi\_\{q\}^\{5\}\\left\(\\frac\{C\_\{2\}^\{3\}\\Delta\{L\}^\{3\}\}\{c\_\{2\*\}^\{3\}\\varepsilon^\{2\}\}\\left\(1\+\\frac\{M^\{2\}C\_\{2\}^\{2\}\}\{\\Delta\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\mathbb\{E\}\\left\[\\log\\frac\{T\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)\+\\chi\_\{q\}^\{2\}MC\_\{2\*\}\\right\),iterations, whereε=𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\],Δ=f​\(x0\)−f~,\\displaystyle\\varepsilon=\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\],\\ \\Delta=f\(x^\{0\}\)\-\\widetilde\{f\},andχq=11−εq\\chi\_\{q\}=\\frac\{1\}\{1\-\\varepsilon\_\{q\}\}\.

The convergence guarantee mirrors the Theorem[1](https://arxiv.org/html/2606.14970#Thmtheorem1)only introducing extraχq5\\chi\_\{q\}^\{5\}factor\. As previously discussed, this factor is insignificant in practice\.

## 5 Experiments

We empirically evaluate the proposed parameter\-free ZO methods on large\-scale language\-model fine\-tuning\. Our goal is to assess whether the proposed adaptive rules can retain the performance of tuned ZO optimizers while reducing reliance on task\-specific hyperparameter search\.

We fine\-tune the pretrained OPT\-1\.3B model\[[91](https://arxiv.org/html/2606.14970#bib.bib91)\]on the SST\-2 sentiment classification task\[[92](https://arxiv.org/html/2606.14970#bib.bib92)\]\. This setting is representative of memory\-constrained LLM fine\-tuning: the model is large enough for backpropagation\-based training to incur substantial activation, gradient, and optimizer\-state memory costs \(∼25\\sim 25GB\), while SST\-2 is a standard benchmark for evaluating downstream adaptation of language models\. All methods are run for20,00020\{,\}000iterations, and we report the best accuracy achieved during training\.

We compare against three tuned zeroth\-order baselines:ZO\-AdaMM\[[40](https://arxiv.org/html/2606.14970#bib.bib40)\],ZO\-SignSGD\[[93](https://arxiv.org/html/2606.14970#bib.bib93)\], andZO\-Muon\[[82](https://arxiv.org/html/2606.14970#bib.bib82)\]\. For our method, we evaluate two instantiations\. First,AdaNAGEDuses theℓ∞\\ell\_\{\\infty\}geometry, yielding a SignSGD\-like LMO update\. Second,AdaMuGEDuses the matrix geometry described in Section[4\.1](https://arxiv.org/html/2606.14970#S4.SS1), resulting in a Muon\-like variant based on the Newton–Schulz approximation of the polar operator\. Additional implementation details and baseline hyperparameters are provided in Appendix[A](https://arxiv.org/html/2606.14970#A1)\.

Table 1:Results for fine\-tuning OPT\-1\.3B on SST\-2 task\.Table[1](https://arxiv.org/html/2606.14970#S5.T1)shows that the proposed parameter\-free methods remain competitive with tuned ZO optimizers\.AdaNAGEDreaches0\.9030\.903accuracy, which is within0\.70\.7percentage points of tunedZO\-SignSGDand within0\.90\.9percentage points of tunedZO\-AdaMM\. The Muon\-like variant further improves the result:AdaMuGEDobtains0\.9080\.908, reducing the gap toZO\-SignSGDto only0\.20\.2percentage points and improving overAdaNAGEDby0\.50\.5percentage points\.

The strongest baseline isZO\-Muon, which achieves0\.9210\.921\. This suggests that matrix\-aware update geometry is beneficial for LLM fine\-tuning, consistently with the motivation for LMO\-based methods\. At the same time, the relatively small gap betweenAdaMuGEDand the tuned baselines indicates that the proposed parameter\-free adaptation can recover much of the performance of manually tuned ZO methods\. Overall, these results support the main empirical claim of the paper:parameter\-free LMO\-based ZO optimization can provide competitive fine\-tuning performance while reducing dependence on expensive task\-specific hyperparameter tuning\.

## 6 Conclusion

In this paper, we propose the first method for LLM fine\-tuning in the intersection of three areas: Zeroth Order, Parameter\-Free and Linear Minimization Oracle optimization\. Our approach exhibits the benefits of all paradigms, namely, costly gradient computations are avoided, no knowledge about problem constants is assumed, and adaptation to the landscape can be achieved due to the flexible choice of norm\. Moreover, we present a modification of the algorithm with Newton\-Schultz steps as a more practical variation\.

We provide rigorous convergence guarantees for the original method and the modification\. Our experiments validate the practical performance of the proposed methods on the established benchmark, showing metrics consistent with the baselines\.

## References

- Howard and Ruder \[2018\]Jeremy Howard and Sebastian Ruder\.Universal language model fine\-tuning for text classification\.In*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 328–339, 2018\.
- Lester et al\. \[2021\]Brian Lester, Rami Al\-Rfou, and Noah Constant\.The power of scale for parameter\-efficient prompt tuning\.In*Proceedings of the 2021 conference on empirical methods in natural language processing*, pages 3045–3059, 2021\.
- Gururangan et al\. \[2020\]Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A Smith\.Don’t stop pretraining: Adapt language models to domains and tasks\.In*Proceedings of the 58th annual meeting of the association for computational linguistics*, pages 8342–8360, 2020\.
- Zhang et al\. \[2020\]Yizhe Zhang, Siqi Sun, Michel Galley, Yen\-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and William B Dolan\.Dialogpt: Large\-scale generative pre\-training for conversational response generation\.In*Proceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations*, pages 270–278, 2020\.
- Wu et al\. \[2025\]Xiao\-Kun Wu, Min Chen, Wanyi Li, Rui Wang, Limeng Lu, Jia Liu, Kai Hwang, Yixue Hao, Yanru Pan, Qingguo Meng, et al\.Llm fine\-tuning: Concepts, opportunities, and challenges\.*Big Data and Cognitive Computing*, 9\(4\):87, 2025\.
- Robbins and Monro \[1951\]Herbert Robbins and Sutton Monro\.A stochastic approximation method\.*The annals of mathematical statistics*, pages 400–407, 1951\.
- Kingma and Ba \[2014\]Diederik P Kingma and Jimmy Ba\.Adam: A method for stochastic optimization\.*arXiv preprint arXiv:1412\.6980*, 2014\.
- Rojas \[1996\]Raul Rojas\.The backpropagation algorithm\.In*Neural networks: a systematic introduction*, pages 149–182\. Springer, 1996\.
- Rajbhandari et al\. \[2020\]Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He\.Zero: Memory optimizations toward training trillion parameter models\.In*SC20: international conference for high performance computing, networking, storage and analysis*, pages 1–16\. IEEE, 2020\.
- Malladi et al\. \[2023\]Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D Lee, Danqi Chen, and Sanjeev Arora\.Fine\-tuning language models with just forward passes\.*Advances in Neural Information Processing Systems*, 36:53038–53075, 2023\.
- Richtárik and Takáč \[2014\]Peter Richtárik and Martin Takáč\.Iteration complexity of randomized block\-coordinate descent methods for minimizing a composite function\.*Mathematical Programming*, 144\(1\):1–38, 2014\.
- Luo et al\. \[2024\]Qijun Luo, Hengxu Yu, and Xiao Li\.Badam: A memory efficient full parameter optimization method for large language models\.*Advances in Neural Information Processing Systems*, 37:24926–24958, 2024\.
- Dettmers et al\. \[2021\]Tim Dettmers, Mike Lewis, Sam Shleifer, and Luke Zettlemoyer\.8\-bit optimizers via block\-wise quantization\.*arXiv preprint arXiv:2110\.02861*, 2021\.
- Novikov et al\. \[2023\]Georgii Sergeevich Novikov, Daniel Bershatsky, Julia Gusak, Alex Shonenkov, Denis Valerievich Dimitrov, and Ivan Oseledets\.Few\-bit backward: Quantized gradients of activation functions for memory footprint reduction\.In*International Conference on Machine Learning*, pages 26363–26381\. PMLR, 2023\.
- Sun et al\. \[2022\]Zehua Sun, Huanqi Yang, Kai Liu, Zhimeng Yin, Zhenjiang Li, and Weitao Xu\.Recent advances in lora: A comprehensive survey\.*ACM Transactions on Sensor Networks*, 18\(4\):1–44, 2022\.
- Wang et al\. \[2024\]Zhengbo Wang, Jian Liang, Ran He, Zilei Wang, and Tieniu Tan\.Lora\-pro: Are low\-rank adapters properly optimized?*arXiv preprint arXiv:2407\.18242*, 2024\.
- Ghadimi and Lan \[2013\]Saeed Ghadimi and Guanghui Lan\.Stochastic first\-and zeroth\-order methods for nonconvex stochastic programming\.*SIAM journal on optimization*, 23\(4\):2341–2368, 2013\.
- Zhang and Ying \[2024\]Qining Zhang and Lei Ying\.Zeroth\-order policy gradient for reinforcement learning from human feedback without reward inference\.*arXiv preprint arXiv:2409\.17401*, 2024\.
- Lin et al\. \[2025\]Liting Lin, Hansong Ma, Junxiao Wang, and Shiyu Yang\.A survey on zeroth\-order optimization for machine learning\.In*International Conference on Web Information Systems and Applications*, pages 481–497\. Springer, 2025\.
- Bar and Giryes \[2025\]Noga Bar and Raja Giryes\.Zoqo: Zero\-order quantized optimization\.In*ICASSP 2025\-2025 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)*, pages 1–5\. IEEE, 2025\.
- Stich \[2019\]Sebastian U Stich\.Unified optimal analysis of the \(stochastic\) gradient method\.*arXiv preprint arXiv:1907\.04232*, 2019\.
- Hendrikx et al\. \[2020\]Hadrien Hendrikx, Lin Xiao, Sebastien Bubeck, Francis Bach, and Laurent Massoulie\.Statistically preconditioned accelerated gradient method for distributed optimization\.In*International conference on machine learning*, pages 4203–4227\. PMLR, 2020\.
- Carmon and Hinder \[2022\]Yair Carmon and Oliver Hinder\.Making sgd parameter\-free\.In*Conference on learning theory*, pages 2360–2389\. PMLR, 2022\.
- Deng et al\. \[2024\]Qi Deng, Guanghui Lan, and Zhenwei Lin\.Uniformly optimal and parameter\-free first\-order methods for convex and function\-constrained optimization\.*arXiv preprint arXiv:2412\.06319*, 2024\.
- Kreisler et al\. \[2024\]Itai Kreisler, Maor Ivgi, Oliver Hinder, and Yair Carmon\.Accelerated parameter\-free stochastic optimization\.In*The Thirty Seventh Annual Conference on Learning Theory*, pages 3257–3324\. PMLR, 2024\.
- Ma and Huang \[2025\]Shaocong Ma and Heng Huang\.Revisiting zeroth\-order optimization: Minimum\-variance two\-point estimators and directionally aligned perturbations\.*arXiv preprint arXiv:2510\.19975*, 2025\.
- Jordan et al\. \[2024\]Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein\.Muon: An optimizer for hidden layers in neural networks, 2024\.*URL https://kellerjordan\. github\. io/posts/muon*, 6\(3\):4, 2024\.
- Bernstein and Newhouse \[2024\]Jeremy Bernstein and Laker Newhouse\.Old optimizer, new norm: An anthology\.*arXiv preprint arXiv:2409\.20325*, 2024\.
- Pethick et al\. \[2025\]Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti\-Falls, and Volkan Cevher\.Training deep learning models with norm\-constrained lmos\.*arXiv preprint arXiv:2502\.07529*, 2025\.
- Cutkosky and Mehta \[2020\]Ashok Cutkosky and Harsh Mehta\.Momentum improves normalized sgd\.In*International conference on machine learning*, pages 2260–2268\. PMLR, 2020\.
- Sun et al\. \[2023\]Tao Sun, Qingsong Wang, Dongsheng Li, and Bao Wang\.Momentum ensures convergence of signsgd under weaker assumptions\.In*International Conference on Machine Learning*, pages 33077–33099\. PMLR, 2023\.
- Riabinin et al\. \[2025\]Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik\.Gluon: Making muon & scion great again\!\(bridging theory and practice of lmo\-based optimizers for llms\)\.*arXiv preprint arXiv:2505\.13416*, 2025\.
- Spall \[1998\]James C Spall\.An overview of the simultaneous perturbation method for efficient optimization\.*Johns Hopkins apl technical digest*, 19\(4\):482–492, 1998\.
- Flaxman et al\. \[2004\]Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan\.Online convex optimization in the bandit setting: gradient descent without a gradient\.*arXiv preprint cs/0408007*, 2004\.
- Duchi et al\. \[2015\]John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono\.Optimal rates for zero\-order convex optimization: The power of two function evaluations\.*IEEE Transactions on Information Theory*, 61\(5\):2788–2806, 2015\.
- Zhao et al\. \[2024\]Yanjun Zhao, Sizhe Dang, Haishan Ye, Guang Dai, Yi Qian, and Ivor W Tsang\.Second\-order fine\-tuning without pain for llms: A hessian informed zeroth\-order optimizer\.*arXiv preprint arXiv:2402\.15173*, 2024\.
- Williams \[1992\]Ronald J Williams\.Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.*Machine learning*, 8\(3\):229–256, 1992\.
- Qiu et al\. \[2025\]Junbin Qiu, Zhengpeng Xie, Xiangda Yan, Yongjie Yang, and Yao Shu\.Zeroth\-order optimization is secretly single\-step policy optimization\.*arXiv preprint arXiv:2506\.14460*, 2025\.
- Seung et al\. \[2026\]Hyunseok Seung, Jaewoo Lee, and Hyunsuk Ko\.Low\-rank curvature for zeroth\-order optimization in llm fine\-tuning\.In*Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 25235–25242, 2026\.
- Chen et al\. \[2019\]Xiangyi Chen, Sijia Liu, Kaidi Xu, Xingguo Li, Xue Lin, Mingyi Hong, and David Cox\.Zo\-adamm: Zeroth\-order adaptive momentum method for black\-box optimization\.*Advances in neural information processing systems*, 32, 2019\.
- Liu et al\. \[2018\]Sijia Liu, Bhavya Kailkhura, Pin\-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini\.Zeroth\-order stochastic variance reduction for nonconvex optimization\.*Advances in neural information processing systems*, 31, 2018\.
- Ji et al\. \[2019\]Kaiyi Ji, Zhe Wang, Yi Zhou, and Yingbin Liang\.Improved zeroth\-order variance reduced algorithms and analysis for nonconvex optimization\.In*International conference on machine learning*, pages 3100–3109\. PMLR, 2019\.
- Gautam et al\. \[2024\]Tanmay Gautam, Youngsuk Park, Hao Zhou, Parameswaran Raman, and Wooseok Ha\.Variance\-reduced zeroth\-order methods for fine\-tuning language models\.*arXiv preprint arXiv:2404\.08080*, 2024\.
- Chen et al\. \[2017\]Pin\-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho\-Jui Hsieh\.Zoo: Zeroth order optimization based black\-box attacks to deep neural networks without training substitute models\.In*ACM AISec*, 2017\.
- Zhang et al\. \[2024\]Yihua Zhang, Pingzhi Li, Junyuan Hong, Jiaxiang Li, Yimeng Zhang, Wenqing Zheng, Pin\-Yu Chen, Jason D Lee, Wotao Yin, Mingyi Hong, et al\.Revisiting zeroth\-order optimization for memory\-efficient llm fine\-tuning: A benchmark\.*arXiv preprint arXiv:2402\.11592*, 2024\.
- Nemirovskij and Yudin \[1983\]Arkadij Semenovič Nemirovskij and David Borisovich Yudin\.Problem complexity and method efficiency in optimization\.1983\.
- Duchi et al\. \[2011\]John Duchi, Elad Hazan, and Yoram Singer\.Adaptive subgradient methods for online learning and stochastic optimization\.*Journal of machine learning research*, 12\(7\), 2011\.
- Zeiler \[2012\]Matthew D Zeiler\.Adadelta: an adaptive learning rate method\.*arXiv preprint arXiv:1212\.5701*, 2012\.
- Orabona \[2019\]Francesco Orabona\.A modern introduction to online learning\.*arXiv preprint arXiv:1912\.13213*, 2019\.
- Orabona \[2013\]Francesco Orabona\.Dimension\-free exponentiated gradient\.*Advances in Neural Information Processing Systems*, 26, 2013\.
- McMahan and Orabona \[2014\]H Brendan McMahan and Francesco Orabona\.Unconstrained online linear learning in hilbert spaces: Minimax algorithms and normal approximations\.In*Conference on Learning Theory*, pages 1020–1039\. PMLR, 2014\.
- Orabona and Pál \[2016\]Francesco Orabona and Dávid Pál\.Coin betting and parameter\-free online learning\.*Advances in Neural Information Processing Systems*, 29, 2016\.
- Cutkosky and Orabona \[2018\]Ashok Cutkosky and Francesco Orabona\.Black\-box reductions for parameter\-free online learning in banach spaces\.In*Conference On Learning Theory*, pages 1493–1529\. PMLR, 2018\.
- Ivgi et al\. \[2023\]Maor Ivgi, Oliver Hinder, and Yair Carmon\.Dog is sgd’s best friend: A parameter\-free dynamic step size schedule\.In*International conference on machine learning*, pages 14465–14499\. PMLR, 2023\.
- Defazio and Mishchenko \[2023\]Aaron Defazio and Konstantin Mishchenko\.Learning\-rate\-free learning by d\-adaptation\.In*International conference on machine learning*, pages 7449–7479\. PMLR, 2023\.
- Mishchenko and Defazio \[2023\]Konstantin Mishchenko and Aaron Defazio\.Prodigy: An expeditiously adaptive parameter\-free learner\.*arXiv preprint arXiv:2306\.06101*, 2023\.
- Schaipp et al\. \[2023\]Fabian Schaipp, Ruben Ohana, Michael Eickenberg, Aaron Defazio, and Robert M Gower\.Momo: Momentum models for adaptive learning rates\.*arXiv preprint arXiv:2305\.07583*, 2023\.
- Attia and Koren \[2023\]Amit Attia and Tomer Koren\.Sgd with adagrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance\.In*International Conference on Machine Learning*, pages 1147–1171\. PMLR, 2023\.
- Attia and Koren \[2024\]Amit Attia and Tomer Koren\.How free is parameter\-free stochastic optimization?*arXiv preprint arXiv:2402\.03126*, 2024\.
- Khaled et al\. \[2023\]Ahmed Khaled, Konstantin Mishchenko, and Chi Jin\.Dowg unleashed: An efficient universal parameter\-free gradient descent method\.*Advances in Neural Information Processing Systems*, 36:6748–6769, 2023\.
- Malitsky and Mishchenko \[2019\]Yura Malitsky and Konstantin Mishchenko\.Adaptive gradient descent without descent\.*arXiv preprint arXiv:1910\.09529*, 2019\.
- Mishkin et al\. \[2024\]Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, and Robert M Gower\.Directional smoothness and gradient methods: Convergence and adaptivity\.*Advances in Neural Information Processing Systems*, 37:14810–14848, 2024\.
- Medyakov et al\. \[2026\]Daniil Medyakov, Stanko Sergey, Gleb Molodtsov, Philip Zmushko, Evseev Grigoriy, Egor Petrov, and Aleksandr Beznosikov\.Sign\-sgd via parameter\-free optimization\.In*The Fourteenth International Conference on Learning Representations*, 2026\.
- Bomze et al\. \[2021\]Immanuel M Bomze, Francesco Rinaldi, and Damiano Zeffiro\.Frank–wolfe and friends: a journey into projection\-free first\-order optimization methods\.*4OR*, 19\(3\):313–345, 2021\.
- Braun et al\. \[2022\]Gábor Braun, Alejandro Carderera, Cyrille W Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta\.Conditional gradient methods\.*arXiv preprint arXiv:2211\.14103*, 2022\.
- Bernstein et al\. \[2018\]Jeremy Bernstein, Yu\-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar\.signsgd: Compressed optimisation for non\-convex problems\.In*International conference on machine learning*, pages 560–569\. PMLR, 2018\.
- Karimireddy et al\. \[2019\]Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi\.Error feedback fixes signsgd and other gradient compression schemes\.In*International conference on machine learning*, pages 3252–3261\. PMLR, 2019\.
- Balles and Hennig \[2018\]Lukas Balles and Philipp Hennig\.Dissecting adam: The sign, magnitude and variance of stochastic gradients\.In*International Conference on Machine Learning*, pages 404–413\. PMLR, 2018\.
- Balles et al\. \[2020\]Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux\.The geometry of sign gradient descent\.*arXiv preprint arXiv:2002\.08056*, 2020\.
- Gupta et al\. \[2025\]Aman Gupta, Rafael Celente, Abhishek Shivanna, DT Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and S Sathiya Keerthi\.Effective quantization of muon optimizer states\.*arXiv preprint arXiv:2509\.23106*, 2025\.
- Huang et al\. \[2025\]Feihu Huang, Yuning Luo, and Songcan Chen\.Limuon: Light and fast muon optimizer for large models\.*arXiv preprint arXiv:2509\.14562*, 2025\.
- Ren and Luo \[2025\]Kunjie Ren and Luo Luo\.A parameter\-free and near\-optimal zeroth\-order algorithm for stochastic convex optimization\.*arXiv preprint arXiv:2502\.05600*, 2025\.
- Chen and Rogozin \[2026\]Jiawei Chen and Alexander Rogozin\.A parameter\-free zeroth\-order algorithm for decentralized stochastic convex optimization\.*arXiv preprint arXiv:2603\.15219*, 2026\.
- \[74\]Yuxing Peng, Yuanyuan Liu, Fanhua Shang, and Hongying Liu\.Parameter\-free variance reduced zeroth\-order optimization for non\-convex problems\.
- Si et al\. \[2025\]Chongjie Si, Debing Zhang, and Wei Shen\.Adamuon: Adaptive muon optimizer\.*arXiv preprint arXiv:2507\.11005*, 2025\.
- Li et al\. \[2025\]Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao\.Normuon: Making muon more efficient and scalable\.*arXiv preprint arXiv:2510\.05491*, 2025\.
- Sahu et al\. \[2019\]Anit Kumar Sahu, Manzil Zaheer, and Soummya Kar\.Towards gradient free and projection free stochastic optimization\.In*The 22nd International Conference on Artificial Intelligence and Statistics*, pages 3468–3477\. PMLR, 2019\.
- Carderera et al\. \[2021\]Alejandro Carderera, Jelena Diakonikolas, Cheuk Yin Lin, and Sebastian Pokutta\.Parameter\-free locally accelerated conditional gradients\.In*International Conference on Machine Learning*, pages 1283–1293\. PMLR, 2021\.
- Ito et al\. \[2023\]Masaru Ito, Zhaosong Lu, and Chuan He\.A parameter\-free conditional gradient method for composite minimization under hölder condition\.*Journal of Machine Learning Research*, 24\(166\):1–34, 2023\.
- Veprikov et al\. \[2024\]Andrey Veprikov, Alexander Bogdanov, Vladislav Minashkin, and Aleksandr Beznosikov\.New aspects of black box conditional gradient: Variance reduction and one point feedback\.*Chaos, Solitons & Fractals*, 189:115654, 2024\.
- Chen et al\. \[2024\]Yiming Chen, Yuan Zhang, Liyuan Cao, Kun Yuan, and Zaiwen Wen\.Enhancing zeroth\-order fine\-tuning for language models with low\-rank structures\.*arXiv preprint arXiv:2410\.07698*, 2024\.
- Petrov et al\. \[2025\]Egor Petrov, Grigoriy Evseev, Aleksey Antonov, Andrey Veprikov, Nikolay Bushkov, Stanislav Moiseev, and Aleksandr Beznosikov\.Leveraging coordinate momentum in signsgd and muon: Memory\-optimized zero\-order\.*arXiv preprint arXiv:2506\.04430*, 2025\.
- Nesterov and Spokoiny \[2017\]Yurii Nesterov and Vladimir Spokoiny\.Random gradient\-free minimization of convex functions\.*Foundations of Computational Mathematics*, 17\(2\):527–566, 2017\.
- Li et al\. \[2018\]Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein\.Visualizing the loss landscape of neural nets\.*Advances in neural information processing systems*, 31, 2018\.
- Kleinberg et al\. \[2018\]Bobby Kleinberg, Yuanzhi Li, and Yang Yuan\.An alternative view: When does sgd escape local minima?In*International conference on machine learning*, pages 2698–2707\. PMLR, 2018\.
- Zhou et al\. \[2019\]Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh\.Sgd converges to global minimum in deep learning via star\-convex path\.*arXiv preprint arXiv:1901\.00451*, 2019\.
- Liu et al\. \[2022\]Chaoyue Liu, Libin Zhu, and Mikhail Belkin\.Loss landscapes and optimization in over\-parameterized non\-linear systems and neural networks\.*Applied and Computational Harmonic Analysis*, 59:85–116, 2022\.
- Nesterov et al\. \[2018\]Yurii Nesterov et al\.*Lectures on convex optimization*, volume 137\.Springer, 2018\.
- Boyd et al\. \[2003\]Stephen Boyd, Lin Xiao, and Almir Mutapcic\.Subgradient methods\.*lecture notes of EE392o, Stanford University, Autumn Quarter*, 2004\(01\):175, 2003\.
- Kim and Oh \[2026\]Gyu Yeol Kim and Min\-hwan Oh\.Convergence of muon with newton\-schulz\.*arXiv preprint arXiv:2601\.19156*, 2026\.
- Zhang et al\. \[2022\]Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al\.Opt: Open pre\-trained transformer language models\.*arXiv preprint arXiv:2205\.01068*, 2022\.
- Socher et al\. \[2013\]Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts\.Recursive deep models for semantic compositionality over a sentiment treebank\.In*Proceedings of the 2013 conference on empirical methods in natural language processing*, pages 1631–1642, 2013\.
- Liu et al\. \[2019\]Sijia Liu, Pin\-Yu Chen, Xiangyi Chen, and Mingyi Hong\.signsgd via zeroth\-order oracle\.In*International conference on learning representations*, 2019\.
- Paszke et al\. \[2019\]Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al\.Pytorch: An imperative style, high\-performance deep learning library\.*Advances in neural information processing systems*, 32, 2019\.

AppendixSupplementary Materials forZero\-order Parameter\-free Optimization for LMO\-based Methods: Novel Approach for Efficient Fine\-tuning

## Appendix AExperimental Details

The experiments are implemented in Python using the PyTorch library\[[94](https://arxiv.org/html/2606.14970#bib.bib94)\], leveraging a single CPU \(Intel Xeon 2\.20 GHz\) and a single GPU \(NVIDIA A100\) for computation\. The total runtime for all experiments is approximately 100 hours\.

Then, we report the parameters for the tuned baselines\. For all methods we implemented momentumβ=0\.9\\beta=0\.9and ZO\-smoothingτ=0\.001\\tau=0\.001, and cosine learning\-rate scheduler\. For the baselines random sampling is performed from the normal distributionN​\(0,I\)\.\{N\}\(0,I\)\.

- •ZO\-AdaMM\.learning rateγ=6\.0⋅10−5\\gamma=6\.0\\cdot 10^\{\-5\}, first and second momentum\(β1,β2\)=\(0\.9,0\.999\)\(\\beta\_\{1\},\\beta\_\{2\}\)=\(0\.9,0\.999\);
- •ZO\-SignSGD\.learning rateγ=1\.0⋅10−5\\gamma=1\.0\\cdot 10^\{\-5\};
- •ZO\-Muon\.learning rateγ=6\.0⋅10−5\\gamma=6\.0\\cdot 10^\{\-5\}, the number of Newton\-Schultz stepsq=5q=5\.

## Appendix BNotation and Useful Equations

In this section we rigorously introduce the notation\.∥⋅∥\\\|\\cdot\\\|and∥⋅∥∗\\\|\\cdot\\\|\_\{\*\}denote an abstract norm and its dual conjugate, respectively\. The equivalence factors of the chosen norm and the22\-norm:

c2​‖v‖≤‖v‖2≤C2​‖v‖,c2⁣∗​‖v‖∗≤‖v‖2≤C2⁣∗​‖v‖∗​∀v∈ℝd\.\\displaystyle c\_\{2\}\\\|v\\\|\\leq\\\|v\\\|\_\{2\}\\leq C\_\{2\}\\\|v\\\|,c\_\{2\*\}\\\|v\\\|\_\{\*\}\\leq\\\|v\\\|\_\{2\}\\leq C\_\{2\*\}\\\|v\\\|\_\{\*\}\\ \\forall v\\in\\mathbb\{R\}^\{d\}\.\(7\)
The LMO operator is defined as follows,

lmo⁡\(x\)=arg⁡minv∈ℝd,‖v‖≤ρ⁡⟨x,v⟩\.\\operatorname\{lmo\}\(x\)=\\arg\\min\_\{v\\in\\mathbb\{R\}^\{d\},\\\|v\\\|\\leq\\rho\}\\langle x,v\\rangle\.
We definefτ​\(x\)=𝔼​\[f​\(x\+τ​e\)\]f\_\{\\tau\}\(x\)=\\mathbb\{E\}\[f\(x\+\\tau e\)\]as aτ\\tau\-smoothed objective\. For the simplicity of notation,ft​\(x\)f^\{t\}\(x\)is theτt\\tau^\{t\}\-smoothed function at iterationtt\. The ZO\-gradient along the directioneeis denoted as

gτ​\(x,e\)=f​\(x\+τ​e\)−f​\(x\)τ​e\.g\_\{\\tau\}\(x,e\)=\\frac\{f\(x\+\\tau e\)\-f\(x\)\}\{\\tau\}e\.Notably,𝔼​\[gτ​\(x,e\)\]=∇fτ​\(x\),\\mathbb\{E\}\[g\_\{\\tau\}\(x,e\)\]=\\nabla f\_\{\\tau\}\(x\),where the expectation is taken over the random directionee\.

Now we additionally introduce some useful equations\. Letx,y,\{ai\}i=1n∈ℝd,ψ,ξ∈ℝ\+x,y,\\\{a\_\{i\}\\\}\_\{i=1\}^\{n\}\\in\\mathbb\{R\}^\{d\},\\ \\psi,\\xi\\in\\mathbb\{R\}\_\{\+\}are non\-negative random variables,φ\\varphisatisfies Assumption[2](https://arxiv.org/html/2606.14970#Thmassumption2)\. Then

ϕ​\(y\)\\displaystyle\\displaystyle\\phi\(y\)≥φ​\(x\)\+⟨∇φ​\(x\),y−x⟩,\\displaystyle\\penalty 10000\\ \\geq\\varphi\(x\)\+\\left\\langle\\nabla\\varphi\(x\),y\-x\\right\\rangle,\(Conv\)‖∑i=1nai‖2\\displaystyle\\left\\\|\\sum\_\{i=1\}^\{n\}a\_\{i\}\\right\\\|^\{2\}≤n​∑i=1n‖ai‖2,\\displaystyle\\penalty 10000\\ \\leq\\penalty 10000\\ n\\sum\_\{i=1\}^\{n\}\\\|a\_\{i\}\\\|^\{2\},\(CS\)⟨x,y⟩\\displaystyle\\langle x,y\\rangle≤‖x‖⋅‖y‖∗,\\displaystyle\\penalty 10000\\ \\leq\\\|x\\\|\\cdot\\\|y\\\|\_\{\*\},\(Conj\)‖x\+y‖\\displaystyle\\\|x\+y\\\|≤‖x‖\+‖y‖,\\displaystyle\\penalty 10000\\ \\leq\\\|x\\\|\+\\\|y\\\|,\(Norm\)𝔼​\[ψ​ξ\]\\displaystyle\\mathbb\{E\}\\left\[\\psi\\xi\\right\]≤\(𝔼​\[ψp\]\)1/p​\(𝔼​\[ξq\]\)1/q,1p\+1q=1,\\displaystyle\\penalty 10000\\ \\leq\\left\(\\mathbb\{E\}\\left\[\\psi^\{p\}\\right\]\\right\)^\{1/p\}\\left\(\\mathbb\{E\}\\left\[\\xi^\{q\}\\right\]\\right\)^\{1/q\},\\ \\frac\{1\}\{p\}\+\\frac\{1\}\{q\}=1,\(Hol\)

## Appendix CProofs for Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)

First, we need to prove a Numerical Lemma that is used further\.

###### Lemma 2\.

Let\{at\}t=0T\\\{a^\{t\}\\\}\_\{t=0\}^\{T\}be a bounded sequence of positive values∀t=1,…​T:at∈\(0,A\)\.\\forall t=1,\\dots T:\\ a^\{t\}\\in\(0,A\)\.Then,

∑t=0T−1at\+1∑i=0tai≤2​Aa0\+2​log⁡A​Ta0\.\\displaystyle\\sum\_\{t=0\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}\\leq\\frac\{2A\}\{a^\{0\}\}\+2\\log\\frac\{AT\}\{a^\{0\}\}\.

###### Proof\.

Since the sequence is bounded, there exists such indexrrthat∑i=0r−2ai≤ar−1,\\sum\_\{i=0\}^\{r\-2\}a^\{i\}\\leq a^\{r\-1\},and∑i=0tai≥at−1\\sum\_\{i=0\}^\{t\}a^\{i\}\\geq a^\{t\-1\}for anyt≥r−1\.t\\geq r\-1\.We refer to that property as\(⋆\)\(\\star\)\. First, we divide the sum over therrindex:

∑t=0T−1at\+1∑i=0tai\\displaystyle\\sum\_\{t=0\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}=\\displaystyle=∑t=0r−1at\+1∑i=0tai\+∑t=rT−1at\+1∑i=0tai\.\\displaystyle\\sum\_\{t=0\}^\{r\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}\+\\sum\_\{t=r\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}\.\(8\)Now, we estimate two sums separately, starting with the first one\. We trivially bound∑i=0tai≥a0\\sum\_\{i=0\}^\{t\}a^\{i\}\\geq a^\{0\}and proceed with

∑t=0r−1at\+1∑i=0tai≤∑t=0r−1at\+1a0=1a0​\(∑t=0r−2at\+1\+ar−1\)​≤\(⋆\)​1a0⋅2​ar−1≤2​Aa0\.\\displaystyle\\sum\_\{t=0\}^\{r\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}\\leq\\sum\_\{t=0\}^\{r\-1\}\\frac\{a^\{t\+1\}\}\{a^\{0\}\}=\\frac\{1\}\{a^\{0\}\}\\left\(\\sum\_\{t=0\}^\{r\-2\}a^\{t\+1\}\+a^\{r\-1\}\\right\)\\overset\{\(\\star\)\}\{\\leq\}\\frac\{1\}\{a^\{0\}\}\\cdot 2a^\{r\-1\}\\leq\\frac\{2A\}\{a^\{0\}\}\.Now, we turn our attention to the second sum in[8](https://arxiv.org/html/2606.14970#A3.E8)\.

∑t=rT−1at\+1∑i=0tai\\displaystyle\\sum\_\{t=r\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}=\\displaystyle=∑t=rT−1at\+112​∑i=0tai\+12​∑i=0tai\\displaystyle\\sum\_\{t=r\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\frac\{1\}\{2\}\\sum\_\{i=0\}^\{t\}a^\{i\}\+\\frac\{1\}\{2\}\\sum\_\{i=0\}^\{t\}a^\{i\}\}≤\(⋆\)\\displaystyle\\overset\{\(\\star\)\}\{\\leq\}∑t=rT−1at\+112​∑i=0tai\+12​at\+1\.\\displaystyle\\sum\_\{t=r\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\frac\{1\}\{2\}\\sum\_\{i=0\}^\{t\}a^\{i\}\+\\frac\{1\}\{2\}a^\{t\+1\}\}\.=\\displaystyle=∑t=rT−1at\+112​∑i=0t\+1ai\\displaystyle\\sum\_\{t=r\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\frac\{1\}\{2\}\\sum\_\{i=0\}^\{t\+1\}a^\{i\}\}≤\\displaystyle\\leq∑t=0T−12​at\+1∑i=0t\+1ai\.\\displaystyle\\sum\_\{t=0\}^\{T\-1\}\\frac\{2a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\+1\}a^\{i\}\}\.Next, we introducest=∑i=0tais^\{t\}=\\sum\_\{i=0\}^\{t\}a^\{i\}and use it to rewrite terms of the sum:

2​at\+1∑i=0t\+1ai=2​st\+1−stst\+1=2​∫stst\+1d​xst\+1​≤\(i\)​2​∫stst\+1d​xx\.\\displaystyle\\frac\{2a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\+1\}a^\{i\}\}=2\\frac\{s^\{t\+1\}\-s^\{t\}\}\{s^\{t\+1\}\}=2\\int\_\{s^\{t\}\}^\{s^\{t\+1\}\}\\frac\{dx\}\{s^\{t\+1\}\}\\overset\{\(i\)\}\{\\leq\}2\\int\_\{s^\{t\}\}^\{s^\{t\+1\}\}\\frac\{dx\}\{x\}\.Here, in\(i\)\(i\), we usedx≤st\+1x\\leq s^\{t\+1\}in the\(st;st\+1\)\(s^\{t\};s^\{t\+1\}\)interval\. Then, the integrals compose a telescopic sum:

∑t=0T−12​at\+1∑i=0t\+1ai≤∑t=0T−12​∫stst\+1d​xx=2​∫s0sTd​xx=2​log⁡\(sTs0\)≤2​log⁡\(A​Ta0\)\.\\displaystyle\\sum\_\{t=0\}^\{T\-1\}\\frac\{2a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\+1\}a^\{i\}\}\\leq\\sum\_\{t=0\}^\{T\-1\}2\\int\_\{s^\{t\}\}^\{s^\{t\+1\}\}\\frac\{dx\}\{x\}=2\\int\_\{s^\{0\}\}^\{s^\{T\}\}\\frac\{dx\}\{x\}=2\\log\\left\(\\frac\{s^\{T\}\}\{s^\{0\}\}\\right\)\\leq 2\\log\\left\(\\frac\{AT\}\{a^\{0\}\}\\right\)\.Now, plugging the acquired estimates into[8](https://arxiv.org/html/2606.14970#A3.E8), we achieve

∑t=0T−1at\+1∑i=0tai\\displaystyle\\sum\_\{t=0\}^\{T\-1\}\\frac\{a^\{t\+1\}\}\{\\sum\_\{i=0\}^\{t\}a^\{i\}\}=\\displaystyle=2​Aa0\+2​log⁡\(A​Ta0\)\.\\displaystyle\\frac\{2A\}\{a^\{0\}\}\+2\\log\\left\(\\frac\{AT\}\{a^\{0\}\}\\right\)\.\(9\)∎

###### Lemma 3\.

\[Lemma[1](https://arxiv.org/html/2606.14970#Thmlemma1)\] Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2), and[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then the estimator \([6](https://arxiv.org/html/2606.14970#S4.E6)\) is uniformly bounded:

Lt≤C2c2⁣∗​L,L^\{t\}\\leq\\frac\{C\_\{2\}\}\{c\_\{2\*\}\}L,wherec2⁣∗,C2c\_\{2\*\},C\_\{2\}are norm equivalence factors\.

###### Proof\.

We begin be using the definition of ZO\-gradients\.

∥gt\(xt\+1,et\)−\\displaystyle\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-gt​\(xt,et\)∥∗=\\displaystyle g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}==\\displaystyle=‖1τt​\(f​\(xt\+1\+τt​et\)−f​\(xt\+1\)\)​et−1τt​\(f​\(xt\+τt​et\)−f​\(xt\)\)​et‖∗\\displaystyle\\\|\\frac\{1\}\{\\tau^\{t\}\}\(f\(x^\{t\+1\}\+\\tau^\{t\}e^\{t\}\)\-f\(x^\{t\+1\}\)\)e^\{t\}\-\\frac\{1\}\{\\tau^\{t\}\}\(f\(x^\{t\}\+\\tau^\{t\}e^\{t\}\)\-f\(x^\{t\}\)\)e^\{t\}\\\|\_\{\*\}=\\displaystyle=1τt​\|f​\(xt\+1\+τt​et\)−f​\(xt\+1\)−f​\(xt\+τt​et\)−f​\(xt\)\|⋅‖et‖∗\.\\displaystyle\\frac\{1\}\{\\tau^\{t\}\}\|f\(x^\{t\+1\}\+\\tau^\{t\}e^\{t\}\)\-f\(x^\{t\+1\}\)\-f\(x^\{t\}\+\\tau^\{t\}e^\{t\}\)\-f\(x^\{t\}\)\|\\cdot\\\|e^\{t\}\\\|\_\{\*\}\.The core of the lemma is to estimate the loss difference\. For the simplicity of the notation, we definez=12​\(xt\+1\+xt\+τt​et\),u=12​\(xt\+1−xt\),v=τt2​et\.z=\\frac\{1\}\{2\}\(x^\{t\+1\}\+x^\{t\}\+\\tau^\{t\}e^\{t\}\),\\ u=\\frac\{1\}\{2\}\(x^\{t\+1\}\-x^\{t\}\),\\ v=\\frac\{\\tau^\{t\}\}\{2\}e^\{t\}\. Thus,

\|f​\(xt\+1\+τ​et\)−f​\(xt\+1\)−f​\(xt\+τ​et\)\+f​\(xt\)\|=\|f\(x^\{t\+1\}\+\\tau e^\{t\}\)\-f\(x^\{t\+1\}\)\-f\(x^\{t\}\+\\tau e^\{t\}\)\+f\(x^\{t\}\)\|==\|f​\(z\+u\+v\)−f​\(z\+u−v\)−f​\(z−u\+v\)\+f​\(z−u−v\)\|\.=\|f\(z\+u\+v\)\-f\(z\+u\-v\)\-f\(z\-u\+v\)\+f\(z\-u\-v\)\|\.Now, we introduce four auxiliary functionsϕ±±:ℝ→ℝ,ϕ±±​\(s\)=f​\(z\+s​\(±u±v\)\)\.\\phi\_\{\\pm\\pm\}:\\mathbb\{R\}\\rightarrow\\mathbb\{R\},\\ \\phi\_\{\\pm\\pm\}\(s\)=f\(z\+s\(\\pm u\\pm v\)\)\. Forϕ\+\+​\(s\)=f​\(z\+s​\(u\+v\)\)\\phi\_\{\+\+\}\(s\)=f\(z\+s\(u\+v\)\)we write the Tailor series at the point0to the second order with the Lagrange remainder \(ξ\+\+∈\(0;1\)\\xi\_\{\+\+\}\\in\(0;1\)\):

ϕ\+\+​\(1\)\\displaystyle\\phi\_\{\+\+\}\(1\)=\\displaystyle=ϕ\+\+​\(0\)\+\(1−0\)​ϕ\+\+′​\(0\)\+12​\(1−0\)2​ϕ\+\+′′​\(ξ\+\+\)\\displaystyle\\phi\_\{\+\+\}\(0\)\+\(1\-0\)\\phi\_\{\+\+\}^\{\\prime\}\(0\)\+\\frac\{1\}\{2\}\(1\-0\)^\{2\}\\phi\_\{\+\+\}^\{\\prime\\prime\}\(\\xi\_\{\+\+\}\)=\\displaystyle=f​\(z\)\+⟨∇f​\(z\),u\+v⟩\+12​\(u\+v\)T​∇2f​\(z\+ξ\+\+​\(u\+v\)\)​\(u\+v\)\\displaystyle f\(z\)\+\\langle\\nabla f\(z\),u\+v\\rangle\+\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\+\\xi\_\{\+\+\}\(u\+v\)\)\(u\+v\)=\\displaystyle=f​\(z\)\+⟨∇f​\(z\),u\+v⟩\+12​\(u\+v\)T​∇2f​\(z\+\+\)​\(u\+v\),\\displaystyle f\(z\)\+\\langle\\nabla f\(z\),u\+v\\rangle\+\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\+\}\)\(u\+v\),wherez±±=z\+ξ±±​\(±u±v\)\.z\_\{\\pm\\pm\}=z\+\\xi\_\{\\pm\\pm\}\(\\pm u\\pm v\)\.

Analogously, we derive the same series for all functionsϕ±±​\(s\)=f​\(z\+s​\(±u±v\)\)\.\\phi\_\{\\pm\\pm\}\(s\)=f\(z\+s\(\\pm u\\pm v\)\)\.Then, we put the acquired equations into the loss difference\.

\|f​\(z\+u\+v\)−f​\(z\+u−v\)−f​\(z−u\+v\)\+f​\(z−u−v\)\|\\displaystyle\|f\(z\+u\+v\)\-f\(z\+u\-v\)\-f\(z\-u\+v\)\+f\(z\-u\-v\)\|=\\displaystyle=\|ϕ\+\+​\(1\)−ϕ\+−​\(1\)−ϕ−\+​\(1\)\+ϕ−−​\(1\)\|\\displaystyle\|\\phi\_\{\+\+\}\(1\)\-\\phi\_\{\+\-\}\(1\)\-\\phi\_\{\-\+\}\(1\)\+\\phi\_\{\-\-\}\(1\)\|=\\displaystyle=\|\(f\(z\)\+⟨∇f\(z\),u\+v⟩\+12\(u\+v\)T∇2f\(z\+\+\)\(u\+v\)\)\\displaystyle\\biggl\|\\left\(f\(z\)\+\\langle\\nabla f\(z\),u\+v\\rangle\+\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\+\}\)\(u\+v\)\\right\)−\(f​\(z\)\+⟨∇f​\(z\),u−v⟩\+12​\(u−v\)T​∇2f​\(z\+−\)​\(u−v\)\)\\displaystyle\-\\left\(f\(z\)\+\\langle\\nabla f\(z\),u\-v\\rangle\+\\frac\{1\}\{2\}\(u\-v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\-\}\)\(u\-v\)\\right\)−\(f​\(z\)\+⟨∇f​\(z\),−u\+v⟩\+12​\(−u\+v\)T​∇2f​\(z−\+\)​\(−u\+v\)\)\\displaystyle\-\\left\(f\(z\)\+\\langle\\nabla f\(z\),\-u\+v\\rangle\+\\frac\{1\}\{2\}\(\-u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\-\+\}\)\(\-u\+v\)\\right\)\+\(f\(z\)\+⟨∇f\(z\),−u−v⟩\+12\(−u−v\)T∇2f\(z\+−\)\(−u−v\)\)\|\\displaystyle\+\\left\(f\(z\)\+\\langle\\nabla f\(z\),\-u\-v\\rangle\+\\frac\{1\}\{2\}\(\-u\-v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\-\}\)\(\-u\-v\)\\right\)\\biggr\|=\\displaystyle=\|f\(z\)−f\(z\)−f\(z\)\+f\(z\)\\displaystyle\\biggl\|f\(z\)\-f\(z\)\-f\(z\)\+f\(z\)\+⟨∇f​\(z\),\(u\+v\)−\(u−v\)−\(v−u\)−\(u\+v\)⟩\\displaystyle\+\\langle\\nabla f\(z\),\(u\+v\)\-\(u\-v\)\-\(v\-u\)\-\(u\+v\)\\rangle\+12​\(u\+v\)T​∇2f​\(z\+\+\)​\(u\+v\)\\displaystyle\+\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\+\}\)\(u\+v\)−12​\(u−v\)T​∇2f​\(z\+−\)​\(u−v\)\\displaystyle\-\\frac\{1\}\{2\}\(u\-v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\-\}\)\(u\-v\)−12​\(−u\+v\)T​∇2f​\(z−\+\)​\(−u\+v\)\\displaystyle\-\\frac\{1\}\{2\}\(\-u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\-\+\}\)\(\-u\+v\)\+12\(u\+v\)T∇2f\(z−−\)\(u\+v\)\|\\displaystyle\+\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\-\-\}\)\(u\+v\)\\biggr\|=\\displaystyle=\|12\(u\+v\)T∇2f\(z\+\+\)\(u\+v\)\\displaystyle\\biggl\|\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\+\}\)\(u\+v\)−12​\(u−v\)T​∇2f​\(z\+−\)​\(u−v\)\\displaystyle\-\\frac\{1\}\{2\}\(u\-v\)^\{T\}\\nabla^\{2\}f\(z\_\{\+\-\}\)\(u\-v\)−12​\(−u\+v\)T​∇2f​\(z−\+\)​\(−u\+v\)\\displaystyle\-\\frac\{1\}\{2\}\(\-u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\-\+\}\)\(\-u\+v\)\+12\(u\+v\)T∇2f\(z−−\)\(u\+v\)\|\.\\displaystyle\+\\frac\{1\}\{2\}\(u\+v\)^\{T\}\\nabla^\{2\}f\(z\_\{\-\-\}\)\(u\+v\)\\biggr\|\.Due to the symmetry, zeroth and first order terms vanish\. Now, we estimate each term withxT​A​y≤‖x‖2​‖A​y‖2≤‖x‖2​‖A‖2​‖y‖2\.x^\{T\}Ay\{\\leq\}\\\|x\\\|\_\{2\}\\\|Ay\\\|\_\{2\}\\leq\\\|x\\\|\_\{2\}\\\|A\\\|\_\{2\}\\\|y\\\|\_\{2\}\.The Ass\.[1](https://arxiv.org/html/2606.14970#Thmassumption1)implies‖∇2f​\(x\)‖2≤L\\\|\\nabla^\{2\}f\(x\)\\\|\_\{2\}\\leq Lfor anyxx, therefore

\|f​\(z\+u\+v\)−f​\(z\+u−v\)−f​\(z−u\+v\)\+f​\(z−u−v\)\|\\displaystyle\|f\(z\+u\+v\)\-f\(z\+u\-v\)\-f\(z\-u\+v\)\+f\(z\-u\-v\)\|≤\\displaystyle\\leq4⋅12​max⁡\{‖u\+v‖22,‖u−v‖22\}​maxx⁡‖∇2f​\(x\)‖2\\displaystyle 4\\cdot\\frac\{1\}\{2\}\\max\\\{\\\|u\+v\\\|\_\{2\}^\{2\},\\\|u\-v\\\|\_\{2\}^\{2\}\\\}\\max\_\{x\}\\\|\\nabla^\{2\}f\(x\)\\\|\_\{2\}≤\\displaystyle\\leq2​\(‖u‖2\+‖v‖2\)2​L\\displaystyle 2\(\\\|u\\\|\_\{2\}\+\\\|v\\\|\_\{2\}\)^\{2\}L≤\([CS](https://arxiv.org/html/2606.14970#A2.Ex20)\)\\displaystyle\\overset\{\\eqref\{CS\}\}\{\\leq\}4​\(‖γt​lmo⁡\(gt​\(xt,et\)\)2‖22\+‖τt​e2‖22\)​L\\displaystyle 4\\left\(\\bigg\\\|\\frac\{\\gamma^\{t\}\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\}\{2\}\\bigg\\\|\_\{2\}^\{2\}\+\\bigg\\\|\\frac\{\\tau^\{t\}e\}\{2\}\\bigg\\\|\_\{2\}^\{2\}\\right\)L≤\\displaystyle\{\\leq\}\(\(γt​‖lmo⁡\(gt​\(xt,et\)\)‖2\)2\+\(τt​‖e‖2\)2\)​L\\displaystyle\\left\(\\left\(\{\\gamma^\{t\}\}\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\\\|\_\{2\}\\right\)^\{2\}\+\\left\(\{\\tau^\{t\}\}\\\|e\\\|\_\{2\}\\right\)^\{2\}\\right\)L≤\\displaystyle\\leq\(C22​ρ2​\(γt\)2\+\(τt\)2​‖e‖22\)​L\.\\displaystyle\(C\_\{2\}^\{2\}\\rho^\{2\}\(\\gamma^\{t\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\\\|e\\\|\_\{2\}^\{2\}\)L\.Here, we estimated‖lmo⁡\(gt​\(xt,et\)\)‖2≤C2​‖lmo⁡\(gt​\(xt,et\)\)‖=ρ​C2\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\\\|\_\{2\}\{\\leq\}C\_\{2\}\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\\\|=\\rho C\_\{2\}, whereC2C\_\{2\}is the norm equivalence factor[7](https://arxiv.org/html/2606.14970#A2.E7)\. Now, we derive the final bound for theLtL^\{t\}\.

Lt\\displaystyle L^\{t\}=\\displaystyle=‖gt​\(xt\+1,et\)−gt​\(xt,et\)‖∗‖xt\+1−xt‖\\displaystyle\\frac\{\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\}\{\\\|x^\{t\+1\}\-x^\{t\}\\\|\}=\\displaystyle=\|f​\(z\+u\+v\)−f​\(z\+u−v\)−f​\(z−u\+v\)\+f​\(z−u−v\)\|⋅‖et‖∗γt​τt​‖lmo⁡\(gt​\(xt,et\)\)‖\\displaystyle\\frac\{\|f\(z\+u\+v\)\-f\(z\+u\-v\)\-f\(z\-u\+v\)\+f\(z\-u\-v\)\|\\cdot\\\|e^\{t\}\\\|\_\{\*\}\}\{\\gamma^\{t\}\\tau^\{t\}\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\\\|\}≤\\displaystyle\\leq‖e‖∗γtτt∥lmo\(gt\(xt,et\)∥⋅\(C22​ρ2​\(γt\)2\+\(τt\)2​‖e‖22\)​L\.\\displaystyle\\frac\{\\\|e\\\|\_\{\*\}\}\{\\gamma^\{t\}\\tau^\{t\}\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\}\\cdot\(C\_\{2\}^\{2\}\\rho^\{2\}\(\\gamma^\{t\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\\\|e\\\|\_\{2\}^\{2\}\)L\.In order to acquire a uniform bound,γt\\gamma^\{t\}andτt\\tau^\{t\}should be linearly related\. Pluggingτt=k​γt\\tau^\{t\}=k\\gamma^\{t\}, we arrive at

Lt\\displaystyle L^\{t\}≤\\displaystyle\\leq‖e‖∗γtτt∥lmo\(gt\(xt,et\)∥⋅\(C22​ρ2​\(γt\)2\+\(τt\)2​‖e‖22\)​L\.\\displaystyle\\frac\{\\\|e\\\|\_\{\*\}\}\{\\gamma^\{t\}\\tau^\{t\}\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\}\\cdot\(C\_\{2\}^\{2\}\\rho^\{2\}\(\\gamma^\{t\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\\\|e\\\|\_\{2\}^\{2\}\)L\.≤[7](https://arxiv.org/html/2606.14970#A2.E7)\\displaystyle\\overset\{\\ref\{app:eq:norms\}\}\{\\leq\}‖et‖2k​c2⁣∗​ρ​\(C22​ρ2\+k2​‖et‖22\)​L\.\\displaystyle\\frac\{\\\|e^\{t\}\\\|\_\{2\}\}\{kc\_\{2\*\}\\rho\}\(C\_\{2\}^\{2\}\\rho^\{2\}\+k^\{2\}\\\|e^\{t\}\\\|\_\{2\}^\{2\}\)L\.Ifeeis sampled from the22\-norm sphere andkkis set to the minimizerC2​ρC\_\{2\}\\rho, we end up with the bound denoted asL^\.\\hat\{L\}\.

Lt\\displaystyle L^\{t\}≤\\displaystyle\\leq2​C2c2⁣∗L=:L^\.\\displaystyle\\frac\{2C\_\{2\}\}\{c\_\{2\*\}\}L=:\\hat\{L\}\.∎

###### Theorem 3\.

\[Theorem[1](https://arxiv.org/html/2606.14970#Thmtheorem1)\] Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2),[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then Algorithm[4](https://arxiv.org/html/2606.14970#S4.SS0.SSS0.Px3)to achieveε\\varepsilon\-accuracy needs

T=𝒪​\(C23​Δ​L3c2⁣∗3​ε2​\(1\+M2​C22Δ​L0\)​𝔼​\[1L0\]2​𝔼​\[log⁡T​LL0\]2\+M​C2⁣∗\)​iterations,T=\\mathcal\{O\}\\left\(\\frac\{\{C\_\{2\}^\{3\}\\Delta L^\{3\}\}\}\{c\_\{2\*\}^\{3\}\\varepsilon^\{2\}\}\\left\(1\+\\frac\{M^\{2\}C\_\{2\}^\{2\}\}\{\\Delta\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\mathbb\{E\}\\left\[\\log\\frac\{TL\}\{L^\{0\}\}\\right\]^\{2\}\+MC\_\{2\*\}\\right\)\\text\{\\penalty 10000\\ \\penalty 10000\\ iterations,\}whereε=𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\\varepsilon=\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\], andΔ=f​\(x0\)−f~\\Delta=f\(x^\{0\}\)\-\\widetilde\{f\}\.

###### Proof\.

We begin with the descent split in two terms: the smoothing difference and the point difference, which we analyze separately:

ft\+1​\(xt\+1\)−ft​\(xt\)\\displaystyle f^\{t\+1\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\}\)=\\displaystyle=\(ft\+1​\(xt\+1\)−ft​\(xt\+1\)\)\+\(ft​\(xt\+1\)−ft​\(xt\)\)\.\\displaystyle\(f^\{t\+1\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\+1\}\)\)\+\(f^\{t\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\}\)\)\.First, we estimate the function difference\. Here, we introduce𝔼\\mathbb\{E\}as the expectation over the sample vectoreefrom the sphere in the22\-norm, as in the definition of smoothed function\.

ft\+1​\(xt\+1\)−ft​\(xt\+1\)\\displaystyle f^\{t\+1\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\+1\}\)=\\displaystyle=𝔼e​\[f​\(xt\+1\+τt\+1​e\)\]−𝔼e​\[f​\(xt\+1\+τt​e\)\]\\displaystyle\\mathbb\{E\}\_\{e\}\[f\(x^\{t\+1\}\+\\tau^\{t\+1\}e\)\]\-\\mathbb\{E\}\_\{e\}\[f\(x^\{t\+1\}\+\\tau^\{t\}e\)\]≤\([Conv](https://arxiv.org/html/2606.14970#A2.Ex19)\)\\displaystyle\\overset\{\\eqref\{Conv\}\}\{\\leq\}𝔼e​⟨∇f​\(xt\+1\+τt\+1​e\),\(xt\+1\+τt\+1​e\)−\(xt\+1\+τt​e\)⟩\\displaystyle\\mathbb\{E\}\_\{e\}\\big\\langle\\nabla f\(x^\{t\+1\}\+\\tau^\{t\+1\}e\),\(x^\{t\+1\}\+\\tau^\{t\+1\}e\)\-\(x^\{t\+1\}\+\\tau^\{t\}e\)\\big\\rangle≤\([Conj](https://arxiv.org/html/2606.14970#A2.Ex21)\)\\displaystyle\\overset\{\\eqref\{Conj\}\}\{\\leq\}𝔼e​\[‖∇f​\(xt\+1\+τt\+1​e\)‖2​‖\(τt\+1−τt\)​e‖2\]\\displaystyle\\mathbb\{E\}\_\{e\}\\left\[\\\|\\nabla f\(x^\{t\+1\}\+\\tau^\{t\+1\}e\)\\\|\_\{2\}\\\|\(\\tau^\{t\+1\}\-\\tau^\{t\}\)e\\\|\_\{2\}\\right\]≤\\displaystyle\\leq𝔼e​\[M⋅\|τt\+1−τt\|⋅‖e‖2\]\\displaystyle\\mathbb\{E\}\_\{e\}\[M\\cdot\|\\tau^\{t\+1\}\-\\tau^\{t\}\|\\cdot\\\|e\\\|\_\{2\}\]≤\\displaystyle\\leq𝔼e​\[M​C2​ρ​\(γt−γt\+1\)\]\.\\displaystyle\\mathbb\{E\}\_\{e\}\[MC\_\{2\}\\rho\(\\gamma^\{t\}\-\\gamma^\{t\+1\}\)\]\.We utilized the definition ofτt=ρ​C2​γt\\tau^\{t\}=\\rho C\_\{2\}\\gamma^\{t\}and the Ass\.[3](https://arxiv.org/html/2606.14970#Thmassumption3)which implies‖∇f​\(x\)‖2≤M\.\\\|\\nabla f\(x\)\\\|\_\{2\}\\leq M\.Next, we write the definition for step size:

ft\+1​\(xt\+1\)−ft​\(xt\+1\)\\displaystyle f^\{t\+1\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\+1\}\)≤\\displaystyle\\leq𝔼e​\[M​C2​ρ​\(γt−γt\+1\)\]\\displaystyle\\mathbb\{E\}\_\{e\}\[MC\_\{2\}\\rho\(\\gamma^\{t\}\-\\gamma^\{t\+1\}\)\]=\\displaystyle=M​C2​\(A∑i=0t−1Li−A∑i=0tLi\)\\displaystyle MC\_\{2\}\\left\(\\frac\{A\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}\-\\frac\{A\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\}\\right\)=\\displaystyle=M​C2​A​∑i=0tLi−∑i=0t−1Li∑i=0tLi​∑i=0t−1Li\\displaystyle MC\_\{2\}A\\frac\{\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\-\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}=\\displaystyle=M​C2​A​\(∑i=0tLi−∑i=0t−1Li\)∑i=0tLi​∑i=0t−1Li​\(∑i=0tLi\+∑i=0t−1Li\)\\displaystyle\\frac\{MC\_\{2\}A\\left\(\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\-\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\\right\)\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\(\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\+\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\)\}=\\displaystyle=M​C2​ρA⋅γt​γt\+1​Lt∑i=0tLi\+∑i=0t−1Li\.\\displaystyle\\frac\{MC\_\{2\}\\rho\}\{A\}\\cdot\\frac\{\\gamma^\{t\}\\gamma^\{t\+1\}L^\{t\}\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\+\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}\.After simple arithmetic, we extract the step size back\. Now, we trivially estimate∑i=0tLi≥∑i=0t−1Li,\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\\geq\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\},andγt\+1≤γ1=Aρ​L0:\\gamma^\{t\+1\}\\leq\\gamma^\{1\}=\\frac\{A\}\{\\rho\\sqrt\{L^\{0\}\}\}:

ft\+1​\(xt\+1\)−ft​\(xt\+1\)\\displaystyle f^\{t\+1\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\+1\}\)≤\\displaystyle\\leqM​C2​ρ2A⋅γt​γt\+1​Lt∑i=0tLi\+∑i=0t−1Li\\displaystyle\\frac\{MC\_\{2\}\\rho^\{2\}\}\{A\}\\cdot\\frac\{\\gamma^\{t\}\\gamma^\{t\+1\}L^\{t\}\}\{\\sqrt\{\\sum\_\{i=0\}^\{t\}L^\{i\}\}\+\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}≤\\displaystyle\\leqM​C2​ρ​γt​γt\+1​Lt2​A​∑i=0t−1Li\\displaystyle\\frac\{MC\_\{2\}\\rho\\gamma^\{t\}\\gamma^\{t\+1\}L^\{t\}\}\{2A\\sqrt\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\}=\\displaystyle=M​C2​ρ2​\(γt\)2​γt\+1​Lt2​A2\\displaystyle\\frac\{MC\_\{2\}\\rho^\{2\}\(\\gamma^\{t\}\)^\{2\}\\gamma^\{t\+1\}L^\{t\}\}\{2A^\{2\}\}≤\\displaystyle\\leqM​C2​ρ​\(γt\)2​γ1​Lt2​A2\\displaystyle\\frac\{MC\_\{2\}\\rho\(\\gamma^\{t\}\)^\{2\}\\gamma^\{1\}L^\{t\}\}\{2A^\{2\}\}=\\displaystyle=M​C2​ρ​\(γt\)2​Lt2​A​L0\.\\displaystyle\\frac\{MC\_\{2\}\\rho\(\\gamma^\{t\}\)^\{2\}L^\{t\}\}\{2A\\sqrt\{L^\{0\}\}\}\.Next, we analyze the point difference term\.

ft​\(xt\+1\)−ft​\(xt\)\\displaystyle f^\{t\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\}\)≤\([Conv](https://arxiv.org/html/2606.14970#A2.Ex19)\)\\displaystyle\\overset\{\\eqref\{Conv\}\}\{\\leq\}⟨∇ft​\(xt\+1\),xt\+1−xt⟩\\displaystyle\\langle\\nabla f^\{t\}\(x^\{t\+1\}\),x^\{t\+1\}\-x^\{t\}\\rangle=\\displaystyle=⟨gt​\(xt,et\),xt\+1−xt⟩\\displaystyle\\langle g^\{t\}\(x^\{t\},e^\{t\}\),x^\{t\+1\}\-x^\{t\}\\rangle\+⟨∇ft​\(xt\+1\)−gt​\(xt,et\),xt\+1−xt⟩\\displaystyle\+\\langle\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\),x^\{t\+1\}\-x^\{t\}\\rangle≤\([Conj](https://arxiv.org/html/2606.14970#A2.Ex21)\)\\displaystyle\\overset\{\\eqref\{Conj\}\}\{\\leq\}γt​⟨gt​\(xt,et\),lmo⁡\(gt​\(xt,et\)\)⟩\\displaystyle\\gamma^\{t\}\\langle g^\{t\}\(x^\{t\},e^\{t\}\),\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\\rangle\+‖∇ft​\(xt\+1\)−gt​\(xt,et\)‖∗​‖xt\+1−xt‖\.\\displaystyle\+\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\\\|x^\{t\+1\}\-x^\{t\}\\\|\.Next, we use the main LMO property:

‖v‖∗=max‖u‖≤1⁡⟨u,v⟩=−1ρ​⟨v,lmo⁡\(v\)⟩,\\displaystyle\\\|v\\\|\_\{\*\}=\\max\_\{\\\|u\\\|\\leq 1\}\\langle u,v\\rangle=\-\\frac\{1\}\{\\rho\}\\langle v,\\operatorname\{lmo\}\(v\)\\rangle,applying it to the first term:

ft​\(xt\+1\)−ft​\(xt\)\\displaystyle f^\{t\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\}\)≤\\displaystyle\\leq−γt​ρ​‖gt​\(xt,et\)‖∗\\displaystyle\-\\gamma^\{t\}\\rho\\\|g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+‖∇ft​\(xt\+1\)−gt​\(xt,et\)‖∗​‖xt\+1−xt‖\\displaystyle\+\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\\\|x^\{t\+1\}\-x^\{t\}\\\|≤\([Norm](https://arxiv.org/html/2606.14970#A2.Ex22)\)\\displaystyle\\overset\{\\eqref\{Norm\}\}\{\\leq\}−γt​ρ​‖gt​\(xt,et\)‖∗\\displaystyle\-\\gamma^\{t\}\\rho\\\|g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+γt​‖∇ft​\(xt\+1\)−gt​\(xt\+1,et\)‖∗​‖lmo⁡\(gt​\(xt,et\)\)‖\\displaystyle\+\\gamma^\{t\}\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{\*\}\\\|\\operatorname\{lmo\}\(g^\{t\}\(x^\{t\},e^\{t\}\)\)\\\|\+‖gt​\(xt\+1,et\)−gt​\(xt,et\)‖∗​‖xt\+1−xt‖\\displaystyle\+\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\\\|x^\{t\+1\}\-x^\{t\}\\\|≤\\displaystyle\\leq−γt​ρ​‖∇ft​\(xt\)‖∗\\displaystyle\-\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\)−gt​\(xt,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\+1\)−gt​\(xt\+1,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{\*\}\+\(‖gt​\(xt\+1,et\)−gt​\(xt,et\)‖∗‖xt\+1−xt‖\)​‖xt\+1−xt‖2\\displaystyle\+\\left\(\\frac\{\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\}\{\\\|x^\{t\+1\}\-x^\{t\}\\\|\}\\right\)\\\|x^\{t\+1\}\-x^\{t\}\\\|^\{2\}=\\displaystyle=−γt​ρ​‖∇ft​\(xt\)‖∗\\displaystyle\-\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\)−gt​\(xt,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\+1\)−gt​\(xt\+1,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{\*\}\+Lt​\(γt\)2​ρ2\.\\displaystyle\+L^\{t\}\(\\gamma^\{t\}\)^\{2\}\\rho^\{2\}\.Then, we add together two acquired estimates:

ft\+1​\(xt\+1\)−ft​\(xt\)\\displaystyle f^\{t\+1\}\(x^\{t\+1\}\)\-f^\{t\}\(x^\{t\}\)≤\\displaystyle\\leqM​C2​ρ​\(γt\)2​Lt2​A​L0\\displaystyle\\frac\{MC\_\{2\}\\rho\(\\gamma^\{t\}\)^\{2\}L^\{t\}\}\{2A\\sqrt\{L^\{0\}\}\}−γt​ρ​‖∇ft​\(xt,et\)‖∗\+γt​ρ​‖∇ft​\(xt\)−gt​\(xt,et\)‖∗\\displaystyle\-\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\+1\)−gt​\(xt\+1,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{\*\}\+Lt​\(γt\)2​ρ2\\displaystyle\+L^\{t\}\(\\gamma^\{t\}\)^\{2\}\\rho^\{2\}=\\displaystyle=−γt​ρ​‖∇ft​\(xt,et\)‖∗\\displaystyle\-\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\)−gt​\(xt,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\}\)\-g^\{t\}\(x^\{t\},e^\{t\}\)\\\|\_\{\*\}\+γt​ρ​‖∇ft​\(xt\+1\)−gt​\(xt\+1,et\)‖∗\\displaystyle\+\\gamma^\{t\}\\rho\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\-g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{\*\}\+Lt​\(γt\)2​\(ρ2\+M​C2​ρ2​A​L0\)\.\\displaystyle\+L^\{t\}\(\\gamma^\{t\}\)^\{2\}\(\\rho^\{2\}\+\\frac\{MC\_\{2\}\\rho\}\{2A\\sqrt\{L^\{0\}\}\}\)\.Summing overt=0,…​T−1,t=0,\\dots T\-1,we arrive at

fT​\(xT\)−f0​\(x0\)\\displaystyle f^\{T\}\(x^\{T\}\)\-f^\{0\}\(x^\{0\}\)≤\\displaystyle\\leq−ρ​∑t=0T−1γt​‖∇ft​\(xt\)‖∗\\displaystyle\-\\rho\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\+\(ρ2\+M​C2​ρ2​A​L0\)​∑t=0T−1Lt​\(γt\)2\\displaystyle\+\(\\rho^\{2\}\+\\frac\{MC\_\{2\}\\rho\}\{2A\\sqrt\{L^\{0\}\}\}\)\\sum\_\{t=0\}^\{T\-1\}L^\{t\}\(\\gamma^\{t\}\)^\{2\}\+ρ​∑t=0T−1γt​‖gt​\(xt\+1,et\)−∇ft​\(xt\+1\)‖∗\\displaystyle\+\\rho\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-\\nabla f^\{t\}\(x^\{t\+1\}\)\\\|\_\{\*\}\+ρ​∑t=0T−1γt​‖gt​\(xt,et\)−∇ft​\(xt\)‖∗\.\\displaystyle\+\\rho\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\\|g^\{t\}\(x^\{t\},e^\{t\}\)\-\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\.Now, we take the expectation and rearrange the terms\.

ρ​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\\rho\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leq\(f0​\(x0\)−f~\)​𝔼​\[1∑i=0T−1γi\]\\displaystyle\(f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\)\\mathbb\{E\}\\left\[\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\+ρ​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖gt​\(xt\+1,et\)−∇ft​\(xt\+1\)‖∗\]\\displaystyle\+\\rho\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-\\nabla f^\{t\}\(x^\{t\+1\}\)\\\|\_\{\*\}\\right\]\+ρ​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖gt​\(xt,et\)−∇ft​\(xt\)‖∗\]\\displaystyle\+\\rho\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|g^\{t\}\(x^\{t\},e^\{t\}\)\-\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\]\+\(ρ2\+M​C2​ρ2​A​L0\)​𝔼​\[∑t=0T−1Lt​\(γt\)2∑i=0T−1γi\]\.\\displaystyle\+\\left\(\\rho^\{2\}\+\\frac\{MC\_\{2\}\\rho\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{L^\{t\}\(\\gamma^\{t\}\)^\{2\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\.Here,f0​\(x0\)−fT​\(xT\)≤f0​\(x0\)−f~f^\{0\}\(x^\{0\}\)\-f^\{T\}\(x^\{T\}\)\\leq f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}as discussed in[4](https://arxiv.org/html/2606.14970#S4)\. Now, let us analyze the second term\.

‖gt​\(xt\+1,et\)−∇ft​\(xt\+1\)‖∗\\displaystyle\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-\\nabla f^\{t\}\(x^\{t\+1\}\)\\\|\_\{\*\}≤\([Norm](https://arxiv.org/html/2606.14970#A2.Ex22)\)\\displaystyle\\overset\{\\eqref\{Norm\}\}\{\\leq\}‖gt​\(xt\+1,et\)‖∗\+‖∇ft​\(xt\+1\)‖∗\\displaystyle\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{\*\}\+\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\\\|\_\{\*\}≤[7](https://arxiv.org/html/2606.14970#A2.E7)\\displaystyle\\overset\{\\ref\{app:eq:norms\}\}\{\\leq\}C2⁣∗​\(‖gt​\(xt\+1,et\)‖2\+‖∇ft​\(xt\+1\)‖2\)\\displaystyle C\_\{2\*\}\(\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\\\|\_\{2\}\+\\\|\\nabla f^\{t\}\(x^\{t\+1\}\)\\\|\_\{2\}\)≤\\displaystyle\\leq2​M​C2⁣∗\.\\displaystyle 2MC\_\{2\*\}\.This bound is uniform and can be applied under the expectation, yielding

ρ​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖gt​\(xt\+1,et\)−∇ft​\(xt\+1\)‖∗\]\\displaystyle\\rho\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|g^\{t\}\(x^\{t\+1\},e^\{t\}\)\-\\nabla f^\{t\}\(x^\{t\+1\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leqρ​𝔼​\[∑t=0T−1γt∑i=0T−1γi\]⋅2​M​C2⁣∗\\displaystyle\\rho\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\\cdot 2MC\_\{2\*\}=ρ​𝔼​\[1\]⋅2​M​C2⁣∗\\displaystyle=\\rho\\mathbb\{E\}\[1\]\\cdot 2MC\_\{2\*\}=2​ρ​M​C2⁣∗\.\\displaystyle=2\\rho MC\_\{2\*\}\.Next, we apply Holder inequality \([Hol](https://arxiv.org/html/2606.14970#A2.Ex23)\) to the last term withp=q=2p=q=2:

ρ​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖1\]\\displaystyle\\rho\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{1\}\\right\]≤\\displaystyle\\leqΔ​𝔼​\[1∑i=0T−1γi\]\\displaystyle\\Delta\\mathbb\{E\}\\left\[\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\+4​ρ​M​C2⁣∗\\displaystyle\+4\\rho MC\_\{2\*\}\+\(ρ2\+M​C2​ρ2​A​L0\)​\(𝔼​\[1∑i=0T−1γi\]2\)1/2\\displaystyle\+\\left\(\\rho^\{2\}\+\\frac\{MC\_\{2\}\\rho\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]^\{2\}\\right\)^\{1/2\}⋅\(𝔼​\[∑t=0T−1Lt​\(γt\)2\]2\)1/2\.\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}L^\{t\}\(\\gamma^\{t\}\)^\{2\}\\right\]^\{2\}\\right\)^\{1/2\}\.To estimate the sum withLt​\(γt\)2L^\{t\}\(\\gamma^\{t\}\)^\{2\}, we apply the Numerical Lemma[2](https://arxiv.org/html/2606.14970#Thmlemma2)\. It is justified since the boundL^\\hat\{L\}is derived forLtL^\{t\}in Lemma[3](https://arxiv.org/html/2606.14970#Thmlemma3)

∑t=0T−1Lt​\(γt\)2=∑t=0T−1A2​Lt∑i=0t−1Li​≤\(L​e​m​[2](https://arxiv.org/html/2606.14970#Thmlemma2)\)​2​A2​L^L0\+4​A2​log⁡T​L^L0\.\\displaystyle\\sum\_\{t=0\}^\{T\-1\}L^\{t\}\(\\gamma^\{t\}\)^\{2\}=\\sum\_\{t=0\}^\{T\-1\}\\frac\{A^\{2\}L^\{t\}\}\{\\sum\_\{i=0\}^\{t\-1\}L^\{i\}\}\\overset\{\(Lem\\ref\{app:num\_lem\}\)\}\{\\leq\}2A^\{2\}\\frac\{\\hat\{L\}\}\{L^\{0\}\}\+4A^\{2\}\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\.Next, considering∑i=1t1i≥t2,\\sum\_\{i=1\}^\{t\}\\frac\{1\}\{\\sqrt\{i\}\}\\geq\\frac\{\\sqrt\{t\}\}\{2\},we estimate

1∑i=0T−1γi=1∑i=0T−1A∑j=0i−1Lj​≤\(L​e​m\.[3](https://arxiv.org/html/2606.14970#Thmlemma3)\)​1∑i=0T−1AL^​i\+1≤2​L^A​T\.\\displaystyle\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}=\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\sqrt\{\\frac\{A\}\{\\sum\_\{j=0\}^\{i\-1\}L^\{j\}\}\}\}\\overset\{\(Lem\.\\ref\{app:lem\_L\}\)\}\{\\leq\}\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\frac\{A\}\{\\sqrt\{\\hat\{L\}\}\\sqrt\{i\+1\}\}\}\\leq\\frac\{2\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}\.Now, we combine all of the above to derive the following:

𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leq\(f0​\(x0\)−f~\)​L^ρ​A​T\+2​M​C2⁣∗\\displaystyle\\frac\{\(f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{\\rho A\\sqrt\{T\}\}\+2MC\_\{2\*\}\+2​L^A​T​\(ρ\+M​C22​A​L0\)\\displaystyle\+\\frac\{2\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}\\left\(\\rho\+\\frac\{MC\_\{2\}\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼​\[2​A2​L^L0\+2​A2​log⁡T​L^L0\]2\)1/2\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[2A^\{2\}\\frac\{\\hat\{L\}\}\{L^\{0\}\}\+2A^\{2\}\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}≤\\displaystyle\\leq\(f0​\(x0\)−f~\)​L^ρ​A​T\+2​M​C2⁣∗\\displaystyle\\frac\{\(f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{\\rho A\\sqrt\{T\}\}\+2MC\_\{2\*\}\+4​L^T​\(A​ρ\+M​C22​L0\)​\(𝔼​\[L^L0\]2\)1/2\\displaystyle\+\\frac\{4\\sqrt\{\\hat\{L\}\}\}\{\\sqrt\{T\}\}\\left\(A\\rho\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)\\left\(\\mathbb\{E\}\\left\[\\frac\{\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\+4​L^T​\(A​ρ\+M​C22​L0\)​\(𝔼​\[log⁡T​L^L0\]2\)1/2\\displaystyle\+\\frac\{4\\sqrt\{\\hat\{L\}\}\}\{\\sqrt\{T\}\}\\left\(A\\rho\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}≤\\displaystyle\\leq\(f0​\(x0\)−f~\)​L^ρ​A​T\+2​M​C2⁣∗\\displaystyle\\frac\{\(f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{\\rho A\\sqrt\{T\}\}\+2MC\_\{2\*\}\+8​L^32T​\(A​ρ\+M​C22​L0\)\\displaystyle\+\\frac\{8\\hat\{L\}^\{\\frac\{3\}\{2\}\}\}\{\\sqrt\{T\}\}\\left\(A\\rho\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼​\[1L0\]2\)1/2⋅\(𝔼​\[log⁡T​L^L0\]2\)1/2\.\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\cdot\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\.SettingA=f0​\(x0\)−f~ρA=\\frac\{\\sqrt\{f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\}would lead to ending of the proof\. However,f0​\(x0\)f^\{0\}\(x^\{0\}\)is inaccessible for the method\. Thus, we setA=f​\(x0\)−f~ρA=\\frac\{\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}\}\{\\rho\}and account for the difference:

f0​\(x0\)−f~\\displaystyle f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=𝔼e​\[f​\(x0\+τ0​e\)\]−f~\\displaystyle\\mathbb\{E\}\_\{e\}\[f\(x^\{0\}\+\\tau^\{0\}e\)\]\-\\widetilde\{f\}=\\displaystyle=𝔼e​\[f​\(x0\+τ0​e\)−f​\(x0\)\]\+f​\(x0\)−f~\\displaystyle\\mathbb\{E\}\_\{e\}\[f\(x^\{0\}\+\\tau^\{0\}e\)\-f\(x^\{0\}\)\]\+f\(x^\{0\}\)\-\\widetilde\{f\}≤\(A​s​s\.[3](https://arxiv.org/html/2606.14970#Thmassumption3)\)\\displaystyle\\overset\{\(Ass\.\\ref\{ass:lip\}\)\}\{\\leq\}M​𝔼e​\[‖x0\+τ0​e−x0‖\]\+f​\(x0\)−f~\\displaystyle M\\mathbb\{E\}\_\{e\}\[\\\|x^\{0\}\+\\tau^\{0\}e\-x^\{0\}\\\|\]\+f\(x^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=τ0​M​𝔼e​\[‖e‖\]\+f​\(x0\)−f~\\displaystyle\\tau^\{0\}M\\mathbb\{E\}\_\{e\}\[\\\|e\\\|\]\+f\(x^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=ρ​C2​γ0​M\+f​\(x0\)−f~\\displaystyle\\rho C\_\{2\}\\gamma^\{0\}M\+f\(x^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=ρ​C2​M​AL0\+f​\(x0\)−f~\.\\displaystyle\\frac\{\\rho C\_\{2\}MA\}\{\\sqrt\{L^\{0\}\}\}\+f\(x^\{0\}\)\-\\widetilde\{f\}\.We plug the value forAAinto the equation:

𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leq\(f0​\(x0\)−f~\)​L^ρ​A​T\+2​M​C2⁣∗\\displaystyle\\frac\{\(f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{\\rho A\\sqrt\{T\}\}\+2MC\_\{2\*\}\+8​L^32T​\(A​ρ\+M​C22​L0\)\\displaystyle\+\\frac\{8\\hat\{L\}^\{\\frac\{3\}\{2\}\}\}\{\\sqrt\{T\}\}\\left\(A\\rho\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼​\[1L0\]2\)1/2⋅\(𝔼​\[log⁡T​L^L0\]2\)1/2\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\cdot\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}≤\\displaystyle\\leq\(ρ​C2​M​AL0\+f​\(x0\)−f~\)​L^ρ​A​T\+2​M​C2⁣∗\\displaystyle\\frac\{\(\\frac\{\\rho C\_\{2\}MA\}\{\\sqrt\{L^\{0\}\}\}\+f\(x^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{\\rho A\\sqrt\{T\}\}\+2MC\_\{2\*\}\+8​L^32T​\(A​ρ\+M​C22​L0\)\\displaystyle\+\\frac\{8\\hat\{L\}^\{\\frac\{3\}\{2\}\}\}\{\\sqrt\{T\}\}\\left\(A\\rho\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼​\[1L0\]2\)1/2⋅\(𝔼​\[log⁡T​L^L0\]2\)1/2\.\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\cdot\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\.≤\\displaystyle\\leq10​\(f​\(x0\)−f~\)​L^T​\(1\+M​C22​L0\)\\displaystyle\\frac\{10\\sqrt\{\(f\(x^\{0\}\)\-\\widetilde\{f\}\)\\hat\{L\}\}\}\{\\sqrt\{T\}\}\\left\(1\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼​\[1L0\]2\)1/2⋅\(𝔼​\[log⁡T​L^L0\]2\)1/2\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\cdot\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\+2​M​C2⁣∗\.\\displaystyle\+2MC\_\{2\*\}\.The regrouping the terms withε=𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\\varepsilon=\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\]concludes the proof\. ∎

## Appendix DProofs for Algorithm[4\.1](https://arxiv.org/html/2606.14970#S4.SS1)

Now we provide main convergence result\.

###### Theorem 4\.

\[Theorem[2](https://arxiv.org/html/2606.14970#Thmtheorem2)\] Suppose Assumptions[1](https://arxiv.org/html/2606.14970#Thmassumption1),[2](https://arxiv.org/html/2606.14970#Thmassumption2),[3](https://arxiv.org/html/2606.14970#Thmassumption3)hold\. Then Algorithm[4\.1](https://arxiv.org/html/2606.14970#S4.SS1)to achieveε\\varepsilon\-accuracy needs

T=𝒪​\(χq5​\(C23​Δ​L3c2⁣∗3​ε2​\(1\+M2​C22Δ​L0\)​𝔼​\[1L0\]2​𝔼​\[log⁡T​LL0\]2\)\+χq2​M​C2⁣∗\),\\displaystyle T=\\mathcal\{O\}\\left\(\\chi\_\{q\}^\{5\}\\left\(\\frac\{C\_\{2\}^\{3\}\\Delta\{L\}^\{3\}\}\{c\_\{2\*\}^\{3\}\\varepsilon^\{2\}\}\\left\(1\+\\frac\{M^\{2\}C\_\{2\}^\{2\}\}\{\\Delta\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\mathbb\{E\}\\left\[\\log\\frac\{T\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)\+\\chi\_\{q\}^\{2\}MC\_\{2\*\}\\right\),iterations, whereε=𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\],Δ=f​\(x0\)−f~,\\displaystyle\\varepsilon=\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\],\\ \\Delta=f\(x^\{0\}\)\-\\widetilde\{f\},andχq=11−εq\\chi\_\{q\}=\\frac\{1\}\{1\-\\varepsilon\_\{q\}\}\.

###### Proof\.

The fundamentally, the proof stays the same, we only need to account for extra error brought be N\-S steps\. FollowingKim and Oh \[[90](https://arxiv.org/html/2606.14970#bib.bib90)\], we denote

εq=maxt=1​…​T⁡‖Vt−Polar​\(Gt\)‖=maxt=1​…​T⁡‖N\-SS​\(Gt,q\)−Polar​\(Gt\)‖\.\\varepsilon\_\{q\}=\\max\_\{t=1\\dots T\}\\\|V^\{t\}\-\\texttt\{Polar\}\(G^\{t\}\)\\\|=\\max\_\{t=1\\dots T\}\\\|\\texttt\{N\-SS\}\(G^\{t\},q\)\-\\texttt\{Polar\}\(G^\{t\}\)\\\|\.The authors showed thatεq≤1\\varepsilon\_\{q\}\\leq 1and converges to0at a double exponential rate overqq\.

Firstly, we recover the bound forLtL^\{t\}\. The proof of Lemma[3](https://arxiv.org/html/2606.14970#Thmlemma3)can be followed directly to gain the following:

\|f​\(Xt\+1\+τt​Et\)−f​\(Xt\+1\)−f​\(Xt\+τt​Et\)\+f​\(Xt\)\|\\displaystyle\|f\(X^\{t\+1\}\+\\tau^\{t\}E^\{t\}\)\-f\(X^\{t\+1\}\)\-f\(X^\{t\}\+\\tau^\{t\}E^\{t\}\)\+f\(X^\{t\}\)\|=\\displaystyle=\|f​\(Z\+U\+V\)−f​\(Z\+U−V\)−f​\(Z−U\+V\)\+f​\(Z−U−V\)\|\\displaystyle\|f\(Z\+U\+V\)\-f\(Z\+U\-V\)\-f\(Z\-U\+V\)\+f\(Z\-U\-V\)\|≤\\displaystyle\\leq4​\(‖U‖F2\+‖V‖F2\)​L,\\displaystyle 4\\left\(\\\|U\\\|\_\{F\}^\{2\}\+\\\|V\\\|\_\{F\}^\{2\}\\right\)L,where, similarly, we defineZ=12​\(Xt\+1\+Xt\+τt​Et\),U=12​\(Xt\+1−Xt\),V=τt2​Et\.Z=\\frac\{1\}\{2\}\(X^\{t\+1\}\+X^\{t\}\+\\tau^\{t\}E^\{t\}\),\\ U=\\frac\{1\}\{2\}\(X^\{t\+1\}\-X^\{t\}\),\\ V=\\frac\{\\tau^\{t\}\}\{2\}E^\{t\}\.

The error emerges in the‖U‖F\\\|U\\\|\_\{F\}estimate: ‖U‖F=γt2​‖Vt‖F≤γt​C22​‖Vt‖≤γt​C2​\(1\+εq\)2,‖V‖F=τt2​‖Et‖F=τt2\.\\\|U\\\|\_\{F\}=\\frac\{\\gamma^\{t\}\}\{2\}\\\|V^\{t\}\\\|\_\{F\}\\leq\\frac\{\\gamma^\{t\}C\_\{2\}\}\{2\}\\\|V^\{t\}\\\|\\leq\\frac\{\\gamma^\{t\}C\_\{2\}\(1\+\\varepsilon\_\{q\}\)\}\{2\},\\ \\\|V\\\|\_\{F\}=\\frac\{\\tau^\{t\}\}\{2\}\\\|E^\{t\}\\\|\_\{F\}=\\frac\{\\tau^\{t\}\}\{2\}\.Thus, we derive

Lt\\displaystyle L^\{t\}=\\displaystyle=‖Gt​\(Xt\+1,Et\)−Gt​\(Xt,Et\)‖∗‖Xt\+1−Xt‖\\displaystyle\\frac\{\\\|G^\{t\}\(X^\{t\+1\},E^\{t\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\}\{\\\|X^\{t\+1\}\-X^\{t\}\\\|\}=\\displaystyle=\|f​\(Z\+U\+V\)−f​\(Z\+U−V\)−f​\(Z−U\+V\)\+f​\(Z−U−V\)\|⋅‖Et‖∗γt​τt​‖Vt‖\\displaystyle\\frac\{\|f\(Z\+U\+V\)\-f\(Z\+U\-V\)\-f\(Z\-U\+V\)\+f\(Z\-U\-V\)\|\\cdot\\\|E^\{t\}\\\|\_\{\*\}\}\{\\gamma^\{t\}\\tau^\{t\}\\\|V^\{t\}\\\|\}≤\\displaystyle\\leq‖E‖Fc2⁣∗​γt​τt​‖Vt‖⋅\(C22​\(γt\)2​\(1\+εq\)2\+\(τt\)2\)​L,\\displaystyle\\frac\{\\\|E\\\|\_\{F\}\}\{c\_\{2\*\}\\gamma^\{t\}\\tau^\{t\}\\\|V^\{t\}\\\|\}\\cdot\(C\_\{2\}^\{2\}\(\\gamma^\{t\}\)^\{2\}\(1\+\\varepsilon\_\{q\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\)L,≤[7](https://arxiv.org/html/2606.14970#A2.E7)\\displaystyle\\overset\{\\ref\{app:eq:norms\}\}\{\\leq\}1c2⁣∗​γt​τt​\(‖Polar​\(Gt\)−Vt‖\+‖Polar​\(Gt\)‖\)⋅\(C22​\(γt\)2​\(1\+εq\)2\+\(τt\)2\)​L\\displaystyle\\frac\{1\}\{c\_\{2\*\}\\gamma^\{t\}\\tau^\{t\}\(\\\|\\texttt\{Polar\}\(G^\{t\}\)\-V^\{t\}\\\|\+\\\|\\texttt\{Polar\}\(G^\{t\}\)\\\|\)\}\\cdot\(C\_\{2\}^\{2\}\(\\gamma^\{t\}\)^\{2\}\(1\+\\varepsilon\_\{q\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\)L≤\\displaystyle\{\\leq\}4c2⁣∗​γt​τt​\(1−εq\)⋅\(C22​\(γt\)2\+\(τt\)2\)​L\.\\displaystyle\\frac\{4\}\{c\_\{2\*\}\\gamma^\{t\}\\tau^\{t\}\(1\-\\varepsilon\_\{q\}\)\}\\cdot\(C\_\{2\}^\{2\}\(\\gamma^\{t\}\)^\{2\}\+\(\\tau^\{t\}\)^\{2\}\)L\.Again,γt\\gamma^\{t\}andτt\\tau^\{t\}should be linearly related\. We chooseτt=C2​γt\\tau^\{t\}=C\_\{2\}\\gamma^\{t\}, resulting in

Lt\\displaystyle L^\{t\}≤\\displaystyle\\leq4​C2c2⁣∗​\(1−εq\)=:L^\.\\displaystyle\\frac\{4C\_\{2\}\}\{c\_\{2\*\}\(1\-\\varepsilon\_\{q\}\)\}=:\\hat\{L\}\.Now, we continue with the proof of the Theorem[4](https://arxiv.org/html/2606.14970#Thmtheorem4)\. The first estimates are exactly the same as in Theorem[3](https://arxiv.org/html/2606.14970#Thmtheorem3):

ft\+1​\(Xt\+1\)−ft​\(Xt\)=\\displaystyle f^\{t\+1\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\}\)=\(ft\+1​\(Xt\+1\)−ft​\(Xt\+1\)\)\+\(ft​\(Xt\+1\)−ft​\(Xt\)\)\.\\displaystyle\(f^\{t\+1\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\+1\}\)\)\+\(f^\{t\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\}\)\)\.ft\+1​\(Xt\+1\)−ft​\(Xt\+1\)≤M​C2​\(γt\)2​Lt2​A​L0\.\\displaystyle f^\{t\+1\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\+1\}\)\\leq\\frac\{MC\_\{2\}\(\\gamma^\{t\}\)^\{2\}L^\{t\}\}\{2A\\sqrt\{L^\{0\}\}\}\.Now, we analyze the other term\.

ft​\(Xt\+1\)−ft​\(Xt\)\\displaystyle f^\{t\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\}\)≤\([Conv](https://arxiv.org/html/2606.14970#A2.Ex19)\)\\displaystyle\\overset\{\\eqref\{Conv\}\}\{\\leq\}⟨∇ft​\(Xt\+1\),Xt\+1−Xt⟩\\displaystyle\\langle\\nabla f^\{t\}\(X^\{t\+1\}\),X^\{t\+1\}\-X^\{t\}\\rangle=\\displaystyle=⟨Gt​\(Xt,Et\),Xt\+1−Xt⟩\\displaystyle\\langle G^\{t\}\(X^\{t\},E^\{t\}\),X^\{t\+1\}\-X^\{t\}\\rangle\+⟨∇ft​\(Xt\+1\)−Gt​\(Xt,Et\),Xt\+1−Xt⟩\\displaystyle\+\\langle\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\),X^\{t\+1\}\-X^\{t\}\\rangle≤\([Conj](https://arxiv.org/html/2606.14970#A2.Ex21)\)\\displaystyle\\overset\{\\eqref\{Conj\}\}\{\\leq\}−γt​⟨Gt​\(Xt,Et\),Vt⟩\\displaystyle\-\\gamma^\{t\}\\langle G^\{t\}\(X^\{t\},E^\{t\}\),V^\{t\}\\rangle\+‖∇ft​\(Xt\+1\)−Gt​\(Xt,Et\)‖∗​‖Xt\+1−Xt‖\\displaystyle\+\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\\\|X^\{t\+1\}\-X^\{t\}\\\|≤\([Conj](https://arxiv.org/html/2606.14970#A2.Ex21)\)\\displaystyle\\overset\{\\eqref\{Conj\}\}\{\\leq\}−γt​⟨Gt​\(Xt,Et\),Polar​\(Gt\)⟩\\displaystyle\-\\gamma^\{t\}\\langle G^\{t\}\(X^\{t\},E^\{t\}\),\\texttt\{Polar\}\(G^\{t\}\)\\rangle\+γt​‖Gt​\(Xt,Et\)‖∗​‖Vt−Polar​\(Gt\)‖\\displaystyle\+\\gamma^\{t\}\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\\\|V^\{t\}\-\\texttt\{Polar\}\(G^\{t\}\)\\\|\+‖∇ft​\(Xt\+1\)−Gt​\(Xt,Et\)‖∗​‖Xt\+1−Xt‖\.\\displaystyle\+\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\\\|X^\{t\+1\}\-X^\{t\}\\\|\.Here, we utilize the notation ofεq\\varepsilon\_\{q\}to derive

ft​\(Xt\+1\)−ft​\(Xt\)\\displaystyle f^\{t\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\}\)≤\\displaystyle\\leq−γt​‖Gt​\(Xt,Et\)‖∗\\displaystyle\-\\gamma^\{t\}\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\+γt​‖Gt​\(Xt,Et\)‖∗​εq\\displaystyle\+\\gamma^\{t\}\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\\varepsilon\_\{q\}\+‖∇ft​\(Xt\+1\)−Gt​\(Xt,Et\)‖∗​‖Xt\+1−Xt‖\\displaystyle\+\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\\\|X^\{t\+1\}\-X^\{t\}\\\|≤\\displaystyle\\leq−γt​\(1−εq\)​‖Gt​\(Xt,Et\)‖∗\\displaystyle\-\\gamma^\{t\}\(1\-\\varepsilon\_\{q\}\)\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\+‖∇ft​\(Xt\+1\)−Gt​\(Xt\+1,Et\)‖∗​‖Xt\+1−Xt‖\\displaystyle\+\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\+1\},E^\{t\}\)\\\|\_\{\*\}\\\|X^\{t\+1\}\-X^\{t\}\\\|\+\(‖Gt​\(Xt\+1,Et\)−Gt​\(Xt,Et\)‖∗‖Xt\+1−Xt‖\)​‖Xt\+1−Xt‖2\\displaystyle\+\\left\(\\frac\{\\\|G^\{t\}\(X^\{t\+1\},E^\{t\}\)\-G^\{t\}\(X^\{t\},E^\{t\}\)\\\|\_\{\*\}\}\{\\\|X^\{t\+1\}\-X^\{t\}\\\|\}\\right\)\\\|X^\{t\+1\}\-X^\{t\}\\\|^\{2\}≤\([Norm](https://arxiv.org/html/2606.14970#A2.Ex22)\)\\displaystyle\\overset\{\\eqref\{Norm\}\}\{\\leq\}−γt​\(1−εq\)​‖∇ft​\(Xt\)‖∗\\displaystyle\-\\gamma^\{t\}\(1\-\\varepsilon\_\{q\}\)\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+γt​\(1−εq\)​‖Gt​\(Xt,Et\)−∇ft​\(Xt\)‖∗\\displaystyle\+\\gamma^\{t\}\(1\-\\varepsilon\_\{q\}\)\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+2​γt​‖∇ft​\(Xt\+1\)−Gt​\(Xt\+1,Et\)‖∗\\displaystyle\+2\\gamma^\{t\}\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\+1\},E^\{t\}\)\\\|\_\{\*\}\+4​Lt​\(γt\)2\.\\displaystyle\+4L^\{t\}\(\\gamma^\{t\}\)^\{2\}\.Here, in the last transition we estimated

‖Xt\+1−Xt‖=γt​‖Vt‖​≤\([Norm](https://arxiv.org/html/2606.14970#A2.Ex22)\)​γt​‖Polar​\(Gt\)‖\+γt​‖Vt−Polar​\(Gt\)‖≤γt​\(1\+εq\)≤2​γt\.\\\|X^\{t\+1\}\-X^\{t\}\\\|=\\gamma^\{t\}\\\|V^\{t\}\\\|\\overset\{\\eqref\{Norm\}\}\{\\leq\}\\gamma^\{t\}\\\|\\texttt\{Polar\}\(G^\{t\}\)\\\|\+\\gamma^\{t\}\\\|V^\{t\}\-\\texttt\{Polar\}\(G^\{t\}\)\\\|\\leq\\gamma^\{t\}\(1\+\\varepsilon\_\{q\}\)\\leq 2\\gamma^\{t\}\.Next, we combine the estimates together\.

ft\+1​\(Xt\+1\)−ft​\(Xt\)\\displaystyle f^\{t\+1\}\(X^\{t\+1\}\)\-f^\{t\}\(X^\{t\}\)≤\\displaystyle\\leqM​C2​\(γt\)2​Lt2​A​L0\\displaystyle\\frac\{MC\_\{2\}\(\\gamma^\{t\}\)^\{2\}L^\{t\}\}\{2A\\sqrt\{L^\{0\}\}\}−γt​\(1−εq\)​‖∇ft​\(Xt\)‖∗\\displaystyle\-\\gamma^\{t\}\(1\-\\varepsilon\_\{q\}\)\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+γt​‖Gt​\(Xt,Et\)−∇ft​\(Xt\)‖∗\\displaystyle\+\\gamma^\{t\}\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+2​γt​‖∇ft​\(Xt\+1\)−Gt​\(Xt\+1,Et\)‖∗\\displaystyle\+2\\gamma^\{t\}\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\+1\},E^\{t\}\)\\\|\_\{\*\}\+4​Lt​\(γt\)2\\displaystyle\+4L^\{t\}\(\\gamma^\{t\}\)^\{2\}≤\\displaystyle\\leq−γt​\(1−εq\)​‖∇ft​\(Xt\)‖∗\\displaystyle\-\\gamma^\{t\}\(1\-\\varepsilon\_\{q\}\)\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+γt​\(1−εq\)​‖Gt​\(Xt,Et\)−∇ft​\(Xt\)‖∗\\displaystyle\+\\gamma^\{t\}\(1\-\\varepsilon\_\{q\}\)\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+2​γt​‖∇ft​\(Xt\+1\)−Gt​\(Xt\+1,Et\)‖∗\\displaystyle\+2\\gamma^\{t\}\\\|\\nabla f^\{t\}\(X^\{t\+1\}\)\-G^\{t\}\(X^\{t\+1\},E^\{t\}\)\\\|\_\{\*\}\+Lt​\(γt\)2​\(4\+M​C22​A​L0\)\.\\displaystyle\+L^\{t\}\(\\gamma^\{t\}\)^\{2\}\\left\(4\+\\frac\{MC\_\{2\}\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)\.Summing overt=1,…​Tt=1,\\dots Twe derive

fT​\(XT\)−f0​\(X0\)\\displaystyle f^\{T\}\(X^\{T\}\)\-f^\{0\}\(X^\{0\}\)≤\\displaystyle\\leq−\(1−εq\)​∑t=0T−1γt​‖∇ft​\(Xt\)‖∗\\displaystyle\-\(1\-\\varepsilon\_\{q\}\)\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\+\(4\+M​C22​A​L0\)​∑t=0T−1Lt​\(γt\)2\\displaystyle\+\\left\(4\+\\frac\{MC\_\{2\}\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)\\sum\_\{t=0\}^\{T\-1\}L^\{t\}\(\\gamma^\{t\}\)^\{2\}\+∑t=0T−1γt​‖Gt​\(Xt\+1,Et\)−∇ft​\(Xt\+1\)‖∗\\displaystyle\+\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\\|G^\{t\}\(X^\{t\+1\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\+1\}\)\\\|\_\{\*\}\+2​∑t=0T−1γt​‖Gt​\(Xt,Et\)−∇ft​\(Xt\)‖∗\.\\displaystyle\+2\\sum\_\{t=0\}^\{T\-1\}\\gamma^\{t\}\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\.After rearranging the terms,

\(1−εq\)​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(xt\)‖∗\]\\displaystyle\(1\-\\varepsilon\_\{q\}\)\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(x^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leq\(f0​\(x0\)−f~\)​𝔼​\[1∑i=0T−1γi\]\\displaystyle\(f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}\)\\mathbb\{E\}\\left\[\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\+𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖Gt​\(Xt\+1,Et\)−∇ft​\(Xt\+1\)‖∗\]\\displaystyle\+\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|G^\{t\}\(X^\{t\+1\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\+1\}\)\\\|\_\{\*\}\\right\]\+2​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖Gt​\(Xt,Et\)−∇ft​\(Xt\)‖∗\]\\displaystyle\+2\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|G^\{t\}\(X^\{t\},E^\{t\}\)\-\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\\right\]\+\(4\+M​C22​A​L0\)​𝔼​\[∑t=0T−1Lt​\(γt\)2∑i=0T−1γi\]\.\\displaystyle\+\\left\(4\+\\frac\{MC\_\{2\}\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{L^\{t\}\(\\gamma^\{t\}\)^\{2\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\.We then implement the same estimate for the second and third terms, bounding those with 4​M​C2⁣∗\.4MC\_\{2\*\}\.Next, \([Hol](https://arxiv.org/html/2606.14970#A2.Ex23)\) is applied to the last term:

\(1−εq\)​𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(Xt\)‖∗\]\\displaystyle\(1\-\\varepsilon\_\{q\}\)\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leq\(f0​\(X0\)−f~\)​𝔼​\[1∑i=0T−1γi\]\\displaystyle\(f^\{0\}\(X^\{0\}\)\-\\widetilde\{f\}\)\\mathbb\{E\}\\left\[\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]\+8​M​C2⁣∗\\displaystyle\+8MC\_\{2\*\}\+\(4\+M​C22​A​L0\)\\displaystyle\+\\left\(4\+\\frac\{MC\_\{2\}\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼​\[1∑i=0T−1γi\]2\)1/2\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\right\]^\{2\}\\right\)^\{1/2\}⋅\(𝔼​\[∑t=0T−1Lt​\(γt\)2\]2\)1/2\.\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}L^\{t\}\(\\gamma^\{t\}\)^\{2\}\\right\]^\{2\}\\right\)^\{1/2\}\.The following estimates are repeated from Theorem[3](https://arxiv.org/html/2606.14970#Thmtheorem3)

∑t=0T−1L∞t​\(γt\)2=A2​∑t=0T−1Lt∑i=0t−1L∞i\\displaystyle\\sum\_\{t=0\}^\{T\-1\}L^\{t\}\_\{\\infty\}\(\\gamma^\{t\}\)^\{2\}=A^\{2\}\\sum\_\{t=0\}^\{T\-1\}\\frac\{L^\{t\}\}\{\\sum\_\{i=0\}^\{t\-1\}L\_\{\\infty\}^\{i\}\}≤\\displaystyle\\leq2​A2​L^L0\+2​A2​log⁡T​L^L0,\\displaystyle 2A^\{2\}\\frac\{\\hat\{L\}\}\{L^\{0\}\}\+2A^\{2\}\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\},1∑i=0T−1γi\\displaystyle\\frac\{1\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}≤\\displaystyle\\leqL^A​T\.\\displaystyle\\frac\{\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}\.Next, we plug those into the equation, introducingχq=11−εq\\chi\_\{q\}=\\frac\{1\}\{1\-\\varepsilon\_\{q\}\}\.

𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(Xt\)‖∗\]\\displaystyle\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leqχq​\(f0​\(X0\)−f~\)​L^A​T\+χq​M​C2\\displaystyle\\frac\{\\chi\_\{q\}\(f^\{0\}\(X^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}\+\\chi\_\{q\}\{MC\_\{2\}\}\+χq​\(1\+M​C22​A​L0\)⋅L^A​T\\displaystyle\+\\chi\_\{q\}\\left\(1\+\\frac\{MC\_\{2\}\}\{2A\\sqrt\{L^\{0\}\}\}\\right\)\\cdot\\frac\{\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}⋅\(𝔼​\[2​A2​L^L0\+2​A2​log⁡T​L^L0\]2\)1/2\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[2A^\{2\}\\frac\{\\hat\{L\}\}\{L^\{0\}\}\+2A^\{2\}\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}≤\\displaystyle\\leqχq\(\(f0​\(X0\)−f~\)​L^A​T\+8M\\displaystyle\\chi\_\{q\}\\Bigg\(\\frac\{\(f^\{0\}\(X^\{0\}\)\-\\widetilde\{f\}\)\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}\+8M\+8​L^3/2T​\(4​A\+M​C22​L0\)\\displaystyle\+\\frac\{8\{\\hat\{L\}^\{3/2\}\}\}\{\\sqrt\{T\}\}\\left\(4A\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼\[1L0\]2\)1/2⋅\(𝔼\[logT​L^L0\]2\)1/2\)\.\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\cdot\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\Bigg\)\.Same way as in the previous theorem, we need estimate the initial function discrepancyf​\(x0\)−f0​\(x0\)f\(x^\{0\}\)\-f^\{0\}\(x^\{0\}\)due to inability to computef0f^\{0\}directly:

f0​\(x0\)−f~\\displaystyle f^\{0\}\(x^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=𝔼e​\[f​\(x0\+τ0​e\)\]−f~\\displaystyle\\mathbb\{E\}\_\{e\}\[f\(x^\{0\}\+\\tau^\{0\}e\)\]\-\\widetilde\{f\}=\\displaystyle=𝔼E​\[f​\(X0\+τ0​E\)−f​\(X0\)\]\+f​\(X0\)−f~\\displaystyle\\mathbb\{E\}\_\{E\}\[f\(X^\{0\}\+\\tau^\{0\}E\)\-f\(X^\{0\}\)\]\+f\(X^\{0\}\)\-\\widetilde\{f\}≤\(A​s​s\.[3](https://arxiv.org/html/2606.14970#Thmassumption3)\)\\displaystyle\\overset\{\(Ass\.\\ref\{ass:lip\}\)\}\{\\leq\}M​𝔼E​\[‖X0\+τ0​E−X0‖\]\+f​\(X0\)−f~\\displaystyle M\\mathbb\{E\}\_\{E\}\[\\\|X^\{0\}\+\\tau^\{0\}E\-X^\{0\}\\\|\]\+f\(X^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=τ0​M​𝔼e​\[‖e‖\]\+f​\(X0\)−f~\\displaystyle\\tau^\{0\}M\\mathbb\{E\}\_\{e\}\[\\\|e\\\|\]\+f\(X^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=C2​γ0​M\+f​\(X0\)−f~\\displaystyle C\_\{2\}\\gamma^\{0\}M\+f\(X^\{0\}\)\-\\widetilde\{f\}=\\displaystyle=C2​M​AL0\+f​\(X0\)−f~\.\\displaystyle\\frac\{C\_\{2\}MA\}\{\\sqrt\{L^\{0\}\}\}\+f\(X^\{0\}\)\-\\widetilde\{f\}\.Thus, we derive

𝔼​\[∑t=0T−1γt∑i=0T−1γi​‖∇ft​\(Xt\)‖∗\]\\displaystyle\\mathbb\{E\}\\left\[\\sum\_\{t=0\}^\{T\-1\}\\frac\{\\gamma^\{t\}\}\{\\sum\_\{i=0\}^\{T\-1\}\\gamma^\{i\}\}\\\|\\nabla f^\{t\}\(X^\{t\}\)\\\|\_\{\*\}\\right\]≤\\displaystyle\\leqχq\(\(f​\(X0\)−f~\+C2​M​AL0\)​L^A​T\+8M\\displaystyle\\chi\_\{q\}\\Bigg\(\\frac\{\(f\(X^\{0\}\)\-\\widetilde\{f\}\+\\frac\{C\_\{2\}MA\}\{\\sqrt\{L^\{0\}\}\}\)\\sqrt\{\\hat\{L\}\}\}\{A\\sqrt\{T\}\}\+8M\+8​L^3/2T​\(4​A\+M​C22​L0\)\\displaystyle\+\\frac\{8\{\\hat\{L\}^\{3/2\}\}\}\{\\sqrt\{T\}\}\\left\(4A\+\\frac\{MC\_\{2\}\}\{2\\sqrt\{L^\{0\}\}\}\\right\)⋅\(𝔼\[1L0\]2\)1/2⋅\(𝔼\[logT​L^L0\]2\)1/2\)\.\\displaystyle\\cdot\\left\(\\mathbb\{E\}\\left\[\\frac\{1\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\cdot\\left\(\\mathbb\{E\}\\left\[\\log\\frac\{T\\hat\{L\}\}\{L^\{0\}\}\\right\]^\{2\}\\right\)^\{1/2\}\\Bigg\)\.After substitutingA=f​\(x0\)−f~A=\\sqrt\{f\(x^\{0\}\)\-\\widetilde\{f\}\}andL^\\hat\{L\}, we arrive at the Theorem statement\. ∎

Similar Articles

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

arXiv cs.CL

This paper proposes a Mixture of LoRA and Full (MoLF) fine-tuning framework that uses gradient-guided optimizer routing to adaptively switch between LoRA and full fine-tuning. It aims to overcome the structural limitations of relying solely on static adaptation methods by combining the plasticity of full tuning with the regularization of LoRA.

Can Muon Fine-tune Adam-Pretrained Models?

Hugging Face Daily Papers

Research paper investigating performance degradation when using the Muon optimizer instead of Adam for fine-tuning pretrained models, demonstrating that parameter-efficient methods like LoRA effectively mitigate this optimizer mismatch across language and vision tasks.

Accelerating LMO-Based Optimization via Implicit Gradient Transport

arXiv cs.LG

This paper proposes LMO-IGT, a new class of stochastic optimization methods that accelerates convergence using implicit gradient transport while maintaining a single-gradient-per-iteration structure. It introduces a unified theoretical framework and demonstrates improved performance over existing LMO-based optimizers like Muon.

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

Hugging Face Daily Papers

This paper investigates the quantitative limits of parametric memory in LLMs using LoRA as a probe, establishing a power law relationship and introducing a threshold-guided optimization method called MemFT for improved memory performance.