# Accelerating LMO-Based Optimization via Implicit Gradient Transport
Source: [https://arxiv.org/html/2605.05577](https://arxiv.org/html/2605.05577)
Won-Jun Jang, Si-Hyeon Lee
School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
{wonjun_jang, sihyeon}@kaist.ac.kr
###### Abstract
Recent optimizers such as Lion and Muon have demonstrated strong empirical performance by normalizing gradient momentum via linear minimization oracles (LMOs). While variance reduction has been explored to accelerate LMO-based methods, it typically incurs substantial computational overhead due to additional gradient evaluations. At the same time, the theoretical understanding of LMO-based methods remains fragmented across unconstrained and constrained formulations. Motivated by these limitations, we propose *LMO-IGT*, a new class of stochastic LMO-based methods leveraging implicit gradient transport (IGT). We further introduce a unified framework for stochastic LMO-based optimization together with a new stationarity measure, the *regularized support function* (RSF), which bridges gradient-norm and Frank–Wolfe-gap notions within a common framework. By evaluating stochastic gradients at transported points, LMO-IGT accelerates convergence while retaining the single-gradient-per-iteration structure of standard stochastic LMO. Our analysis establishes that stochastic LMO achieves an iteration complexity of $\mathcal{O}(\varepsilon^{-4})$, variance-reduced LMO achieves $\mathcal{O}(\varepsilon^{-3})$ at the cost of additional gradient evaluations, and LMO-IGT achieves $\mathcal{O}(\varepsilon^{-3.5})$ using only a single stochastic gradient per iteration. Empirically, LMO-IGT consistently improves over stochastic LMO counterparts with negligible overhead. Among its instantiations, Muon-IGT achieves the strongest overall performance across the evaluated settings, demonstrating that IGT provides an effective and practical acceleration mechanism for modern LMO-based optimization.
## 1 Introduction
Deep learning has achieved remarkable success in recent years, increasing the need for reliable optimization methods for training large-scale neural networks. Training such models involves high-dimensional and non-convex optimization problems, making efficient and stable algorithms essential. Consequently, efficient stochastic gradient-based optimizers such as Adam (Kingma and Ba, [2017](https://arxiv.org/html/2605.05577#bib.bib41)) and AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.05577#bib.bib42)) have long been the standard choice.

Recently, new optimizers such as Lion (Chen et al., [2023](https://arxiv.org/html/2605.05577#bib.bib8)) and Muon (Jordan et al., [2024](https://arxiv.org/html/2605.05577#bib.bib7)) have attracted significant attention due to their strong empirical performance. A common feature of these methods is that they *normalize* the gradient momentum during optimization, a property closely related to the linear minimization oracle (LMO). Following this, a line of work has focused on improving the convergence of LMO-based methods, where techniques such as variance reduction achieve faster theoretical rates (Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10); Yuan et al., [2025](https://arxiv.org/html/2605.05577#bib.bib22); Chang et al., [2025](https://arxiv.org/html/2605.05577#bib.bib2); Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1)). However, these approaches typically require additional gradient evaluations, which can be computationally expensive in large-scale settings. This raises a natural question: can LMO-based methods achieve accelerated convergence while avoiding the substantial computational overhead associated with variance reduction?

From a theoretical perspective, the LMO originates from the Frank–Wolfe method for constrained optimization (Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10); Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1); Pethick et al., [2025](https://arxiv.org/html/2605.05577#bib.bib39)). However, the role of LMO-type updates in modern stochastic optimization is not yet fully understood. In particular, existing analyses often rely on different stationarity measures for unconstrained and constrained problems, preventing a unified treatment across settings. This leads to a second question: can stochastic LMO-based optimization be understood within a unified framework?

We address the above questions through the following contributions.
**Low-overhead acceleration for LMO-based optimization.** To accelerate LMO-based methods while avoiding the substantial computational overhead of variance reduction, we draw inspiration from implicit gradient transport (IGT), which constructs momentum using gradients evaluated at a lookahead point (Arnold et al., [2019](https://arxiv.org/html/2605.05577#bib.bib24); Cutkosky and Mehta, [2020](https://arxiv.org/html/2605.05577#bib.bib16)). While IGT has been studied in normalized gradient methods, we extend it to LMO-based optimization. We observe that LMO updates inherently perform a form of momentum normalization, making them well suited for IGT-style acceleration. Based on this insight, we propose a new class of methods, termed LMO-IGT, that achieves accelerated convergence without additional gradient evaluations.

**Unified framework and analysis.** We develop a unified framework for stochastic LMO-based optimization, encompassing three classes: stochastic LMO, variance-reduced LMO (LMO-VR), and the proposed LMO-IGT. To enable a unified analysis across both unconstrained and constrained settings, we introduce a new stationarity measure, the *regularized support function (RSF)*, which bridges gradient-based and Frank–Wolfe-type measures. Under this framework, we establish unified convergence results: stochastic LMO achieves an iteration complexity of $O(\epsilon^{-4})$, LMO-VR achieves $O(\epsilon^{-3})$, and LMO-IGT attains $O(\epsilon^{-3.5})$, while incurring only modest computational overhead.

**Empirical validation.** We instantiate LMO-IGT with different LMO sets, yielding practical algorithms such as Lion-IGT and Muon-IGT. In particular, Muon-IGT achieves the strongest empirical performance without requiring extra gradient evaluations, demonstrating the effectiveness of the proposed approach in large-scale settings.

Table [1](https://arxiv.org/html/2605.05577#S1.T1) summarizes the three classes in our unified framework, together with representative algorithms, the associated LMO sets, and their corresponding iteration complexities.
Table 1: Unified framework for stochastic LMO-based optimization and the corresponding convergence results. For a gradient momentum matrix $g$, we write its singular value decomposition (SVD) as $g=U\operatorname{diag}(\sigma)V^{\top}$. See Section 2 for the definition of the LMO and the set $\mathcal{C}$.

| Class | Algorithm | LMO set $\mathcal{C}$ | $\operatorname{LMO}_{\mathcal{C}}(g)$ | Iteration complexity |
| --- | --- | --- | --- | --- |
| Stochastic LMO (1 gradient eval.) | Normalized SGD (Hazan et al., [2015](https://arxiv.org/html/2605.05577#bib.bib40)) | $\ell_{2}$-ball | $-g/\lVert g\rVert$ | $O(\epsilon^{-4})$ |
| | signSGD (Bernstein et al., [2018](https://arxiv.org/html/2605.05577#bib.bib29)), Signum (Bernstein et al., [2018](https://arxiv.org/html/2605.05577#bib.bib29)), Lion (Chen et al., [2023](https://arxiv.org/html/2605.05577#bib.bib8)) | $\ell_{\infty}$-ball | $-\operatorname{sign}(g)$ | |
| | Muon (Jordan et al., [2024](https://arxiv.org/html/2605.05577#bib.bib7)) | operator-norm ball* | $-UV^{\top}$ | |
| LMO-VR (2 gradient eval.) | Lion-VR (Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10)) | $\ell_{\infty}$-ball | $-\operatorname{sign}(g)$ | $O(\epsilon^{-3})$ |
| | MARS-Shampoo (Yuan et al., [2025](https://arxiv.org/html/2605.05577#bib.bib22)), Muon-VR (Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1)) | operator-norm ball* | $-UV^{\top}$ | |
| LMO-IGT (1 gradient eval.) | NIGT (Cutkosky and Mehta, [2020](https://arxiv.org/html/2605.05577#bib.bib16)) | $\ell_{2}$-ball | $-g/\lVert g\rVert$ | $O(\epsilon^{-3.5})$ |
| | Lion-IGT (this work) | $\ell_{\infty}$-ball | $-\operatorname{sign}(g)$ | |
| | Muon-IGT (this work) | operator-norm ball* | $-UV^{\top}$ | |
*Operator-norm ball for matrices: $\mathcal{C}=\{X \mid \sigma_{1}(X)\leq 1\}$.
## 2 Problem Setting and Preliminaries
Let $\xi$ denote a random variable capturing the stochasticity of the oracle, and let $f(\cdot;\xi)$ denote the corresponding sample-wise loss function. We consider the stochastic first-order minimization problem

$$\min_{w\in\mathcal{P}} F(w), \qquad F(w):=\mathbb{E}_{\xi}[f(w;\xi)], \tag{1}$$

where $F:\mathbb{R}^{d}\to\mathbb{R}$ and $f$ are non-convex, differentiable functions. We assume that the feasible set $\mathcal{P}\subseteq\mathbb{R}^{d}$ is convex and compact.
We make the following assumptions:
###### Assumption 1 ($L$-smoothness).
The function $F$ is $L$-smooth:

$$\|\nabla F(x)-\nabla F(y)\|\leq L\|x-y\|,\qquad\forall x,y\in\mathbb{R}^{d}.$$
###### Assumption 2 (Bounded variance).
The stochastic gradients have bounded variance:

$$\mathbb{E}_{\xi}\!\left[\|\nabla f(w;\xi)-\nabla F(w)\|^{2}\right]\leq\sigma^{2},\qquad\forall w\in\mathbb{R}^{d}.$$
###### Assumption 3 (Averaged smoothness).
There exists $L>0$ such that

$$\mathbb{E}_{\xi}\!\left[\|\nabla f(x;\xi)-\nabla f(y;\xi)\|^{2}\right]\leq L^{2}\|x-y\|^{2},\qquad\forall x,y\in\mathbb{R}^{d}.$$
###### Assumption 4 (Second-order smoothness).
The Hessian of $F$ is $\rho$-Lipschitz:

$$\|\nabla^{2}F(x)-\nabla^{2}F(y)\|_{\mathrm{op}}\leq\rho\|x-y\|,\qquad\forall x,y\in\mathbb{R}^{d}.$$
Assumptions [1](https://arxiv.org/html/2605.05577#Thmasmp1) and [2](https://arxiv.org/html/2605.05577#Thmasmp2) are standard in stochastic optimization. Assumption [3](https://arxiv.org/html/2605.05577#Thmasmp3) is commonly used in variance-reduced methods, while Assumption [4](https://arxiv.org/html/2605.05577#Thmasmp4) is a standard second-order regularity condition in analyses of IGT (Cutkosky and Mehta, [2020](https://arxiv.org/html/2605.05577#bib.bib16)).

In the rest of this section, we introduce the necessary preliminaries. We begin by describing the update rule of LMO-based optimization methods, followed by their Frank–Wolfe interpretation and the associated stationarity measure, the Frank–Wolfe gap. We then summarize the convergence behavior of stochastic LMO methods and discuss variance reduction as an acceleration technique. Finally, we present IGT as an alternative acceleration mechanism. A more detailed discussion of related work is provided in Appendix [A](https://arxiv.org/html/2605.05577#A1).
### 2.1 LMO-Based Updates
To solve ([1](https://arxiv.org/html/2605.05577#S2.E1)), both normalized SGD and LMO-based optimization methods (e.g., Lion and Muon) compute, given a gradient or momentum estimate $g$, the linear minimization oracle (LMO)

$$\operatorname{LMO}_{\mathcal{C}}(g)\in\arg\min_{v\in\mathcal{C}}\langle g,v\rangle.$$

Here, $\mathcal{C}\subseteq\mathbb{R}^{d}$ is a compact convex set containing the origin. Equivalently, the oracle returns the point in $\mathcal{C}$ most aligned with the descent direction $-g$. The specific form of $\operatorname{LMO}_{\mathcal{C}}(\cdot)$ depends on $\mathcal{C}$. For example, if $\mathcal{C}$ is the $\ell_{2}$ norm ball, then $\operatorname{LMO}_{\mathcal{C}}(g)=-g/\|g\|$, which recovers normalized SGD. For matrix-valued $g$, choosing $\mathcal{C}$ as the spectral norm ball, i.e., $\mathcal{C}=\{X:\|X\|_{\mathrm{op}}\leq 1\}$, yields updates corresponding to Muon.
At each iteration $t=0,1,\dots,T-1$, the parameter $w_{t}$ is updated via

$$w_{t+1}=(1-\lambda\eta_{t})w_{t}+\eta_{t}v_{t},\qquad v_{t}=\operatorname{LMO}_{\mathcal{C}}(g_{t}), \tag{2}$$

where $\lambda\geq 0$ is the weight decay parameter and $\eta_{t}>0$ is the learning rate. When $\lambda=0$, the problem reduces to the unconstrained case with $\mathcal{P}=\mathbb{R}^{d}$. When $\lambda>0$, the update admits a natural constrained optimization interpretation, which we discuss next.
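To make these updates concrete, the NumPy sketch below (illustrative, not from the paper's code) implements the three LMO instances listed in Table 1 together with one step of update (2); the names `lmo_l2`, `lmo_linf`, `lmo_spectral`, and `lmo_step` are ours.

```python
import numpy as np

def lmo_l2(g):
    # LMO over the unit l2 ball: -g / ||g|| (normalized SGD direction)
    return -g / (np.linalg.norm(g) + 1e-12)

def lmo_linf(g):
    # LMO over the unit l-infinity ball: -sign(g) (signSGD / Lion direction)
    return -np.sign(g)

def lmo_spectral(g):
    # LMO over the unit spectral-norm ball: -U V^T from the SVD of g (Muon direction)
    U, _, Vt = np.linalg.svd(g, full_matrices=False)
    return -U @ Vt

def lmo_step(w, g, lmo, eta, lam=0.0):
    # One LMO-based update, Eq. (2): w_{t+1} = (1 - lam*eta) * w_t + eta * LMO_C(g_t)
    return (1.0 - lam * eta) * w + eta * lmo(g)

# toy usage on a matrix-shaped parameter
rng = np.random.default_rng(0)
w, g = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))
w_next = lmo_step(w, g, lmo_spectral, eta=0.1, lam=0.01)
```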
### 2.2 Frank–Wolfe Interpretation
The Frank–Wolfe (conditional gradient) method is designed for constrained optimization over a convex and compact set $\mathcal{P}$. At iteration $t$, given $w_{t}$, stochastic gradient $g_{t}$, and stepsize $\gamma_{t}\in(0,1)$, the update is

$$w_{t+1}=(1-\gamma_{t})w_{t}+\gamma_{t}\operatorname{LMO}_{\mathcal{P}}(g_{t}),$$

which preserves feasibility by forming a convex combination in $\mathcal{P}$.
When $\lambda>0$, the LMO-based update in ([2](https://arxiv.org/html/2605.05577#S2.E2)) can be interpreted as a Frank–Wolfe update (Chen et al., [2024](https://arxiv.org/html/2605.05577#bib.bib11); Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1)). Since $\mathcal{C}$ is compact and convex, the scaled set $\mathcal{P}:=\lambda^{-1}\mathcal{C}$ is also compact and convex. Defining $s_{t}:=\lambda^{-1}v_{t}$, where $v_{t}=\operatorname{LMO}_{\mathcal{C}}(g_{t})$, yields $s_{t}=\operatorname{LMO}_{\mathcal{P}}(g_{t})\in\mathcal{P}$. The update can thus be rewritten as

$$w_{t+1}=(1-\lambda\eta_{t})w_{t}+\eta_{t}v_{t}=(1-\gamma_{t})w_{t}+\gamma_{t}s_{t},$$

with $\gamma_{t}=\lambda\eta_{t}$, which matches the Frank–Wolfe update.
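As a quick numerical sanity check of this equivalence, the snippet below (a sketch under the assumption that $\mathcal{C}$ is the unit $\ell_{2}$ ball) verifies that the weight-decay form of update (2) coincides with a Frank–Wolfe step on $\mathcal{P}=\lambda^{-1}\mathcal{C}$ with $\gamma_{t}=\lambda\eta_{t}$.

```python
import numpy as np

rng = np.random.default_rng(1)
w, g = rng.standard_normal(5), rng.standard_normal(5)
lam, eta = 0.1, 0.05

v = -g / np.linalg.norm(g)              # LMO over the unit l2 ball C
s = v / lam                             # LMO over P = C / lambda (same direction, larger radius)
gamma = lam * eta                       # Frank-Wolfe stepsize

w_lmo = (1 - lam * eta) * w + eta * v   # LMO update (2) with weight decay
w_fw = (1 - gamma) * w + gamma * s      # Frank-Wolfe step over P

assert np.allclose(w_lmo, w_fw)         # the two updates agree up to floating point
```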
A standard stationarity measure for constrained optimization over $\mathcal{P}$ is the Frank–Wolfe gap (Jaggi, [2013](https://arxiv.org/html/2605.05577#bib.bib13)):

$$G_{\mathcal{P}}(w):=\sup_{u\in\mathcal{P}}\langle-\nabla F(w),\,u-w\rangle.$$

The gap satisfies $G_{\mathcal{P}}(w)=0$ if and only if $w$ is a first-order stationary point over $\mathcal{P}$ (Lacoste-Julien, [2016](https://arxiv.org/html/2605.05577#bib.bib15)). Moreover, its convergence to zero is equivalent to satisfying the KKT conditions (Xie and Li, [2024](https://arxiv.org/html/2605.05577#bib.bib32); Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1)).
### 2.3 Convergence Analysis
From the perspective of stochastic nonconvex optimization in the unconstrained setting, several recent works have analyzed the convergence of Lion/Muon-type methods under standard smoothness assumptions (Dong et al., [2024](https://arxiv.org/html/2605.05577#bib.bib9); Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10); Shen et al., [2025](https://arxiv.org/html/2605.05577#bib.bib3); Chang et al., [2025](https://arxiv.org/html/2605.05577#bib.bib2); Kim and Oh, [2026](https://arxiv.org/html/2605.05577#bib.bib4); Nagashima and Iiduka, [2026](https://arxiv.org/html/2605.05577#bib.bib6); Sato et al., [2025](https://arxiv.org/html/2605.05577#bib.bib5)) and established the rate

$$\frac{1}{T}\sum_{t=0}^{T-1}\|\nabla F(w_{t})\|\leq O(T^{-1/4}). \tag{3}$$

This implies an iteration complexity of $O(\epsilon^{-4})$ to achieve $\epsilon$-stationarity.
An analogous result holds for constrained optimization from the Frank–Wolfe perspective (Chen et al., [2024](https://arxiv.org/html/2605.05577#bib.bib11); Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1); Pethick et al., [2025](https://arxiv.org/html/2605.05577#bib.bib39)). Using the Frank–Wolfe gap as the stationarity measure, one obtains

$$\frac{1}{T}\sum_{t=0}^{T-1}G_{\mathcal{P}}(w_{t})\leq O(T^{-1/4}), \tag{4}$$

which again yields $O(\epsilon^{-4})$ iteration complexity. Thus, stochastic LMO-based methods exhibit the same convergence rate under both unconstrained and constrained formulations.
### 2.4 Variance Reduction for LMO-Based Methods
Variance reduction is a powerful tool for accelerating stochastic optimization (Gower et al., [2020](https://arxiv.org/html/2605.05577#bib.bib23); Zhou et al., [2020](https://arxiv.org/html/2605.05577#bib.bib21); Cutkosky and Orabona, [2019](https://arxiv.org/html/2605.05577#bib.bib18); Yuan et al., [2025](https://arxiv.org/html/2605.05577#bib.bib22); Fang et al., [2018](https://arxiv.org/html/2605.05577#bib.bib19)). When applied to Lion/Muon-type methods, it improves the iteration complexity to $O(\epsilon^{-3})$ (Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10); Chang et al., [2025](https://arxiv.org/html/2605.05577#bib.bib2); Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1)). The key idea is to incorporate a STORM-like correction term (Cutkosky and Orabona, [2019](https://arxiv.org/html/2605.05577#bib.bib18)) into the momentum update,

$$\nabla f(w_{t};\xi_{t})-\nabla f(w_{t-1};\xi_{t}),$$

which reduces the variance of the stochastic gradient estimator while preserving first-order information. However, this approach incurs additional overhead. It requires storing the previous iterate $w_{t-1}$, which increases memory usage. Moreover, the correction term involves computing gradients at both $w_{t}$ and $w_{t-1}$ using the same data batch, effectively doubling the per-iteration gradient cost. As a result, despite improved theoretical convergence rates, variance reduction may degrade practical efficiency in large-scale settings.
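A hedged sketch of such a STORM-like momentum update is given below; the least-squares toy loss and the function names are placeholders, and the point is only to show that the correction requires two gradient evaluations on the same batch and access to the previous iterate.

```python
import numpy as np

def grad_f(w, batch):
    # placeholder stochastic gradient oracle for a least-squares toy loss
    X, y = batch
    return X.T @ (X @ w - y) / len(y)

def vr_momentum(m_prev, w, w_prev, batch, beta, alpha):
    # STORM-like momentum: EMA of gradients plus a correction evaluated on the SAME batch xi_t.
    # Note the two gradient evaluations (at w_t and w_{t-1}) and the need to store w_{t-1}.
    g_curr = grad_f(w, batch)
    g_prev = grad_f(w_prev, batch)
    return beta * m_prev + (1 - beta) * g_curr + alpha * (g_curr - g_prev)
```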
### 2.5 Implicit Gradient Transport
A separate line of work accelerates normalized methods without variance reduction by evaluating stochastic gradients at an extrapolated point (Arnold et al., [2019](https://arxiv.org/html/2605.05577#bib.bib24); Cutkosky and Mehta, [2020](https://arxiv.org/html/2605.05577#bib.bib16)). The key idea is to implicitly transport the momentum along the optimization trajectory by using a lookahead point instead of the current iterate. For momentum parameter $\beta_{t}$, implicit gradient transport (IGT) evaluates the gradient at

$$x_{t+1}=w_{t+1}+\frac{\beta_{t}}{1-\beta_{t}}(w_{t+1}-w_{t}),$$

and updates the momentum as

$$m_{t}=\beta_{t}m_{t-1}+(1-\beta_{t})\nabla f(x_{t};\xi_{t}),\qquad w_{t+1}=w_{t}-\eta_{t}m_{t}.$$

This mechanism preserves single-sample gradient evaluation while implicitly transporting the momentum along the optimization trajectory.
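The sketch below spells out one such IGT iteration (our own minimal rendering of the recursion above, with a generic `grad_fn` oracle supplied by the caller): the gradient is taken at the transported point, the momentum is an ordinary EMA, and the next lookahead point is extrapolated from the new iterate.

```python
def igt_step(w, x, m_prev, grad_fn, batch, beta, eta):
    # single stochastic gradient, evaluated at the transported point x_t rather than at w_t
    m = beta * m_prev + (1.0 - beta) * grad_fn(x, batch)
    w_next = w - eta * m                                   # momentum step on the iterate
    x_next = w_next + beta / (1.0 - beta) * (w_next - w)   # next lookahead point x_{t+1}
    return w_next, x_next, m
```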
Earlier work analyzed this mechanism under stronger structural assumptions (Arnold et al., [2019](https://arxiv.org/html/2605.05577#bib.bib24)). In particular, under second-order smoothness (Assumption [4](https://arxiv.org/html/2605.05577#Thmasmp4)), Cutkosky and Mehta ([2020](https://arxiv.org/html/2605.05577#bib.bib16)) established an improved $O(\epsilon^{-3.5})$ iteration complexity for $\ell_{2}$-normalized stochastic methods. However, existing IGT analyses are restricted to Euclidean normalized updates and do not extend to general LMO-based methods or the constrained optimization setting.
## 3 Main Results
In this section, we develop a unified framework for stochastic LMO-based optimization and present our main theoretical results. We begin by introducing a new class of LMO-based methods, LMO-IGT, which extends IGT to general LMO-based updates. We then present a unified framework equipped with a new stationarity measure, enabling a common treatment of existing LMO-based approaches. Finally, we establish unified convergence guarantees for three classes of methods within this framework.
### 3.1 LMO-IGT
**Algorithm 1** LMO-IGT

1: initial point $x_{0}=w_{0}$, momentum buffer $m_{-1}=\nabla f(w_{0};\xi_{0})$, parameters $\{\beta_{1,t}\}_{t\geq 0}$, $\{\beta_{2,t}\}_{t\geq 0}$, $\lambda\geq 0$, step sizes $\{\eta_{1,t}\}_{t\geq 0}$, $\{\eta_{2,t}\}_{t\geq 0}$
2: **for** $t=0,\ldots,T-1$ **do**
3: &nbsp;&nbsp; $g_{t}\leftarrow\beta_{1,t}m_{t-1}+(1-\beta_{1,t})\nabla f(x_{t};\xi_{t})$
4: &nbsp;&nbsp; $m_{t}\leftarrow\beta_{2,t}m_{t-1}+(1-\beta_{2,t})\nabla f(x_{t};\xi_{t})$
5: &nbsp;&nbsp; $v_{t}\leftarrow\operatorname{LMO}_{\mathcal{C}}(g_{t})$
6: &nbsp;&nbsp; $x_{t+1}\leftarrow(1-\lambda\eta_{1,t})w_{t}+\eta_{1,t}v_{t}$
7: &nbsp;&nbsp; $w_{t+1}\leftarrow(1-\lambda\eta_{2,t})w_{t}+\eta_{2,t}v_{t}$
8: **end for**
Algorithm [1](https://arxiv.org/html/2605.05577#alg1) presents LMO-IGT. The key idea is to evaluate the stochastic gradient at a transported point $x_{t}$, which serves as a lookahead approximation of the optimization trajectory, while updating the actual iterate $w_{t}$ through an LMO-based step.

At each iteration, the method computes a stochastic gradient at $x_{t}$ and uses it to update two momentum variables with different time scales, controlled by $\beta_{1,t}$ and $\beta_{2,t}$. Throughout the paper, we consider the regime $0\leq\beta_{1,t}\leq\beta_{2,t}<1$, which corresponds to a Lion-style double-momentum design. Under this condition, the gradient momentum $g_{t}$, controlled by $\beta_{1,t}$, places more weight on the current stochastic gradient and is therefore more responsive to the local geometry, while the momentum buffer $m_{t}$, controlled by $\beta_{2,t}$, evolves more slowly and provides a more stable estimate for subsequent iterations. The gradient momentum $g_{t}$ is then used to determine the update direction via the linear minimization oracle. The next transported point $x_{t+1}$ and the iterate $w_{t+1}$ are updated using the same LMO direction but potentially different step sizes. This design enables implicit gradient transport without increasing the number of gradient evaluations, thereby preserving the single-gradient-per-iteration structure of stochastic LMO methods.
For the parameter choice used in the convergence analysis, we set

$$\eta_{1,t}=\frac{1}{1-\beta_{2,t}}\eta_{2,t},\qquad t=0,1,\ldots,T-1.$$

Then the auxiliary point satisfies

$$x_{t+1}=w_{t+1}+\frac{\beta_{2,t}}{1-\beta_{2,t}}(w_{t+1}-w_{t})=w_{t}+\frac{1}{1-\beta_{2,t}}(w_{t+1}-w_{t}),$$

which is the natural LMO analogue of the IGT extrapolation in Section [2.5](https://arxiv.org/html/2605.05577#S2.SS5).
Different choices of $\mathcal{C}$ yield concrete instances of the method. In particular, choosing an $\ell_{\infty}$-ball yields Lion-IGT, while choosing an operator-norm ball yields Muon-IGT, as summarized in Table [1](https://arxiv.org/html/2605.05577#S1.T1).
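A minimal NumPy sketch of Algorithm 1 for a single matrix parameter is given below, assuming a user-supplied stochastic-gradient oracle; for illustration we instantiate the Muon-style spectral LMO via an SVD, whereas practical Muon implementations typically use a Newton–Schulz iteration instead.

```python
import numpy as np

def lmo_spectral(g):
    # LMO over the unit spectral-norm ball: -U V^T from the SVD of g
    U, _, Vt = np.linalg.svd(g, full_matrices=False)
    return -U @ Vt

def lmo_igt(w0, grad_fn, batches, beta1, beta2, eta2, lam=0.0, lmo=lmo_spectral):
    # Algorithm 1 with constant parameters and the analysis choice eta1 = eta2 / (1 - beta2)
    eta1 = eta2 / (1.0 - beta2)
    x, w = w0.copy(), w0.copy()                  # x_0 = w_0
    m = grad_fn(w, batches[0])                   # m_{-1} = grad f(w_0; xi_0)
    for batch in batches:
        grad_x = grad_fn(x, batch)               # single stochastic gradient per iteration, at x_t
        g = beta1 * m + (1.0 - beta1) * grad_x   # fast momentum: LMO query g_t
        m = beta2 * m + (1.0 - beta2) * grad_x   # slow momentum buffer m_t
        v = lmo(g)
        x = (1.0 - lam * eta1) * w + eta1 * v    # transported point x_{t+1}
        w = (1.0 - lam * eta2) * w + eta2 * v    # iterate w_{t+1}
    return w
```

Swapping `lmo_spectral` for a sign-based oracle gives the corresponding Lion-IGT-style variant in this sketch.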
To compare LMO-IGT with existing stochastic LMO methods under a common stationarity measure, we introduce a unified framework in the next subsection. Within this framework, LMO-IGT achieves a faster convergence rate than stochastic LMO. Figure [1](https://arxiv.org/html/2605.05577#S3.F1) provides intuition for this improvement. In stochastic LMO, the momentum buffer $m_{t}=\beta_{2,t}m_{t-1}+(1-\beta_{2,t})\nabla f(w_{t};\xi_{t})$ is an exponential moving average and can therefore lag behind the true gradient $\nabla F(w_{t})$. As a result, the query constructed from $m_{t}$ may be misaligned with the current optimization trajectory. LMO-IGT mitigates this lag by evaluating the stochastic gradient at a transported point $x_{t}$, chosen along the same LMO direction as the iterate update. Through the same momentum recursion, this leads $m_{t}$ to better approximate $\nabla F(w_{t})$, thereby producing a more informative query $g_{t}$ for the LMO, without requiring an additional gradient evaluation.
Figure 1: Comparison of stochastic momentum (a) and IGT with a transported gradient (b). In stochastic LMO, the momentum buffer is formed from gradients evaluated at the current iterate, which introduces a lagging effect. In LMO-IGT, the transported point $x_{t}$ yields a momentum buffer that is better aligned with the optimization trajectory.
### 3.2 Unified Framework
To compare LMO-IGT with existing stochastic LMO methods, we introduce a unified framework along with a new stationarity measure. Let

$$R:=\sup_{u,v\in\mathcal{C}}\|u-v\|$$

denote the diameter of $\mathcal{C}$. In the constrained case with $\lambda>0$, the feasible region induced by the LMO update is $\mathcal{P}=\lambda^{-1}\mathcal{C}$, and the Frank–Wolfe gap $G_{\mathcal{P}}(w)$ serves as a natural stationarity measure. However, this quantity scales as $\lambda^{-1}R\|\nabla F(w)\|$ and thus diverges as $\lambda\to 0$. To treat the constrained and unconstrained regimes in a unified manner, we introduce the following regularized support function.
###### Definition 3.1 (Regularized support function).
For $w\in\mathbb{R}^{d}$ and $\lambda\geq 0$, define

$$\Psi_{\mathcal{C},\lambda}(w):=\sup_{v\in\mathcal{C}}\langle-\nabla F(w),\,v-\lambda w\rangle.$$
When $\lambda>0$ and $\mathcal{P}=\lambda^{-1}\mathcal{C}$, this reduces to a scaled Frank–Wolfe gap:

$$\Psi_{\mathcal{C},\lambda}(w)=\lambda G_{\mathcal{P}}(w).$$

When $\lambda=0$, it becomes the support function (Fenchel and Bonnesen, [1934](https://arxiv.org/html/2605.05577#bib.bib33); Gardner, [1995](https://arxiv.org/html/2605.05577#bib.bib35); Schneider, [2013](https://arxiv.org/html/2605.05577#bib.bib34)) of $\mathcal{C}$ evaluated at $-\nabla F(w)$:

$$\Psi_{\mathcal{C},0}(w)=h_{\mathcal{C}}(-\nabla F(w)):=\sup_{v\in\mathcal{C}}\langle-\nabla F(w),v\rangle.$$

For centrally symmetric sets $\mathcal{C}$, including all norm balls considered in this paper, this coincides with the dual norm induced by $\mathcal{C}$. In particular, if $\mathcal{C}$ is a Euclidean ball with diameter $R$, then

$$\Psi_{\mathcal{C},0}(w)=\frac{R}{2}\|\nabla F(w)\|.$$

More generally, $\Psi_{\mathcal{C},\lambda}(w)\leq R\|\nabla F(w)\|$. Thus, the RSF interpolates naturally between gradient-norm stationarity and Frank–Wolfe-gap stationarity. Moreover, $\Psi_{\mathcal{C},\lambda}(w)=0$ characterizes stationarity in both regimes: it is equivalent to the KKT conditions when $\lambda>0$, and to $\nabla F(w)=0$ when $\lambda=0$.
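To illustrate, the following sketch evaluates $\Psi_{\mathcal{C},\lambda}$ for the unit $\ell_{2}$ and $\ell_{\infty}$ balls using the decomposition $\Psi_{\mathcal{C},\lambda}(w)=h_{\mathcal{C}}(-\nabla F(w))+\lambda\langle\nabla F(w),w\rangle$; the helper names are ours and the gradient vector is supplied directly rather than computed from a model.

```python
import numpy as np

def rsf(grad_Fw, w, lam, support_fn):
    # Psi_{C,lambda}(w) = h_C(-grad F(w)) + lambda * <grad F(w), w>
    return support_fn(-grad_Fw) + lam * float(np.dot(grad_Fw, w))

support_l2 = lambda z: float(np.linalg.norm(z))       # h_C for the unit l2 ball (dual norm ||.||_2)
support_linf = lambda z: float(np.linalg.norm(z, 1))  # h_C for the unit l-infinity ball (dual norm ||.||_1)

grad_Fw = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.2])
print(rsf(grad_Fw, w, lam=0.0, support_fn=support_l2))    # lambda = 0: dual-norm (gradient) stationarity
print(rsf(grad_Fw, w, lam=0.1, support_fn=support_linf))  # lambda > 0: equals lambda * FW gap over P = C / lambda
```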
To analyze existing LMO-based algorithms in a unified manner, we consider the following generalized update framework.
**Algorithm 2** Unified Framework for Stochastic LMO-Based Optimization

1: initial points $x_{-1}=x_{0}=w_{0}$, initial momentum $m_{-1}=\nabla f(w_{0};\xi_{0})$, parameters $\{\alpha_{1,t}\}_{t\geq 0}$, $\{\alpha_{2,t}\}_{t\geq 0}$, $\{\beta_{1,t}\}_{t\geq 0}$, $\{\beta_{2,t}\}_{t\geq 0}$, $\lambda\geq 0$, and step sizes $\{\eta_{1,t}\}_{t\geq 0}$, $\{\eta_{2,t}\}_{t\geq 0}$
2: **for** $t=0,\ldots,T-1$ **do**
3: &nbsp;&nbsp; $g_{t}\leftarrow\beta_{1,t}m_{t-1}+(1-\beta_{1,t})\nabla f(x_{t};\xi_{t})+\alpha_{1,t}\bigl(\nabla f(x_{t};\xi_{t})-\nabla f(x_{t-1};\xi_{t})\bigr)$
4: &nbsp;&nbsp; $m_{t}\leftarrow\beta_{2,t}m_{t-1}+(1-\beta_{2,t})\nabla f(x_{t};\xi_{t})+\alpha_{2,t}\bigl(\nabla f(x_{t};\xi_{t})-\nabla f(x_{t-1};\xi_{t})\bigr)$
5: &nbsp;&nbsp; $v_{t}\leftarrow\operatorname{LMO}_{\mathcal{C}}(g_{t})$
6: &nbsp;&nbsp; $x_{t+1}\leftarrow(1-\lambda\eta_{1,t})w_{t}+\eta_{1,t}v_{t}$
7: &nbsp;&nbsp; $w_{t+1}\leftarrow(1-\lambda\eta_{2,t})w_{t}+\eta_{2,t}v_{t}$
8: **end for**
Algorithm [2](https://arxiv.org/html/2605.05577#alg2) presents a unified framework for stochastic LMO-based optimization, written in terms of stochastic gradients evaluated at the current point and explicitly exposing the shared LMO update structure. The parameters $\alpha_{1,t}$ and $\alpha_{2,t}$ capture variance-reduction effects through gradient differences, while $\beta_{1,t}$ and $\beta_{2,t}$ control the momentum dynamics.

Stochastic LMO, LMO-VR, and the proposed LMO-IGT arise as special cases of this framework. In particular, setting $\alpha_{1,t}=\alpha_{2,t}=0$ and $\eta_{1,t}=\eta_{2,t}$ recovers stochastic LMO; allowing nonzero $\alpha_{1,t}$ and/or $\alpha_{2,t}$ with $\eta_{1,t}=\eta_{2,t}$ recovers LMO-VR; and setting $\alpha_{1,t}=\alpha_{2,t}=0$ with $\eta_{1,t}\neq\eta_{2,t}$ recovers LMO-IGT.
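The following sketch writes out one iteration of Algorithm 2 and notes, in the comments, which parameter settings recover each class; as before, the gradient oracle and names are illustrative. The second gradient evaluation is only needed when a correction coefficient is nonzero.

```python
def unified_step(w, x, x_prev, m, grad_fn, batch, lmo,
                 alpha1, alpha2, beta1, beta2, eta1, eta2, lam=0.0):
    # One iteration of the unified framework (Algorithm 2).
    gx = grad_fn(x, batch)
    # extra evaluation at x_{t-1} on the same batch is needed only when a correction is active (LMO-VR)
    corr = (gx - grad_fn(x_prev, batch)) if (alpha1 != 0 or alpha2 != 0) else 0.0
    g = beta1 * m + (1 - beta1) * gx + alpha1 * corr        # LMO query g_t
    m_next = beta2 * m + (1 - beta2) * gx + alpha2 * corr   # momentum buffer m_t
    v = lmo(g)
    x_next = (1 - lam * eta1) * w + eta1 * v
    w_next = (1 - lam * eta2) * w + eta2 * v
    return w_next, x_next, x, m_next

# Special cases:
#   stochastic LMO: alpha1 = alpha2 = 0, eta1 = eta2             (so x_{t+1} = w_{t+1})
#   LMO-VR:         alpha1 and/or alpha2 nonzero, eta1 = eta2    (STORM-like correction)
#   LMO-IGT:        alpha1 = alpha2 = 0, eta1 = eta2/(1 - beta2) (transported gradient point)
```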
### 3.3 Unified Convergence Analysis
#### 3.3.1 Stochastic LMO
We begin with the stochastic LMO class. In Algorithm [2](https://arxiv.org/html/2605.05577#alg2), this corresponds to setting $\alpha_{1,t}=\alpha_{2,t}=0$ and $\eta_{1,t}=\eta_{2,t}$ for all $t=0,1,\ldots,T-1$, so that $x_{t+1}=w_{t+1}$. The convergence result for stochastic LMO is given below.
###### Theorem 3.2 (Stochastic LMO).
Suppose Assumptions [1](https://arxiv.org/html/2605.05577#Thmasmp1)–[2](https://arxiv.org/html/2605.05577#Thmasmp2) hold. Assume that, for each $t=0,1,\ldots,T-1$, we set $\beta_{1,t}=\beta_{1}$, $\beta_{2,t}=\beta_{2}$, and $\eta_{t}=\eta$, where $0\leq\beta_{1}\leq\beta_{2}<1$ and $0\leq\lambda\eta\leq 1$, and let $\Delta_{F}:=F(w_{0})-F^{*}$ with $F^{*}:=\inf_{w}F(w)>-\infty$. Then

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\Psi_{\mathcal{C},\lambda}(w_{t})\right]\leq\frac{\Delta_{F}}{T\eta}+R\sigma\!\left((1-\beta_{1})+\beta_{1}\sqrt{1-\beta_{2}}+\frac{\beta_{1}}{T(1-\beta_{2})}+\frac{1}{T}\right)+LR^{2}\eta\left(\frac{\beta_{1}}{1-\beta_{2}}+\frac{1}{2}\right). \tag{5}$$

In particular, choosing

$$\eta=\frac{1}{RT^{3/4}},\qquad\beta_{2}=1-\frac{1}{T^{1/2}},\qquad 1-\beta_{1}\in\left[\frac{1}{T^{1/2}},\frac{1}{T^{1/4}}\right],$$

which balances the terms on the right-hand side of ([5](https://arxiv.org/html/2605.05577#S3.E5)), yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\Psi_{\mathcal{C},\lambda}(w_{t})\right]\leq O\!\left(\frac{R(\Delta_{F}+\sigma+L)}{T^{1/4}}\right)=O(T^{-1/4}).$$
The above result establishes the baseline RSF convergence guarantee for the stochastic LMO class. The proof of Theorem [3.2](https://arxiv.org/html/2605.05577#S3.Thmthm2) is deferred to Appendix [D](https://arxiv.org/html/2605.05577#A4).
#### 3.3.2 LMO-VR
We next consider the variance-reduced LMO class. This corresponds to a specialization of Algorithm [2](https://arxiv.org/html/2605.05577#alg2) that sets $\eta_{1,t}=\eta_{2,t}$ for all $t$, yielding $x_{t+1}=w_{t+1}$, while activating the correction parameters $\alpha_{1,t}$ and/or $\alpha_{2,t}$. The convergence result for this class is given below.
###### Theorem 3.3 (LMO-VR).
Suppose Assumptions [1](https://arxiv.org/html/2605.05577#Thmasmp1)–[3](https://arxiv.org/html/2605.05577#Thmasmp3) hold. Assume that, for each $t=0,1,\ldots,T-1$, we set $\alpha_{1,t}=\alpha_{1}$, $\alpha_{2,t}=\alpha_{2}$, $\beta_{1,t}=\beta_{1}$, $\beta_{2,t}=\beta_{2}$, and $\eta_{t}=\eta$, where $0\leq\beta_{1}\leq\beta_{2}<1$ and $0\leq\lambda\eta\leq 1$. Let $F^{*}$ be any finite lower bound on $F$, and $\Delta_{F}:=F(w_{0})-F^{*}$. Then

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_{t})\big]\leq\frac{\Delta_{F}}{T\eta}+R\sigma\left((1-\beta_{1})+\beta_{1}\sqrt{1-\beta_{2}}+\frac{\beta_{1}}{T(1-\beta_{2})}+\frac{1}{T}\right)+LR^{2}\eta\left(|\beta_{1}-\alpha_{1}|+\frac{\beta_{1}|\beta_{2}-\alpha_{2}|}{1-\beta_{2}}+|\alpha_{1}|+\frac{\beta_{1}|\alpha_{2}|}{\sqrt{1-\beta_{2}}}+\frac{1}{2}\right). \tag{6}$$
###### Corollary 3.4.
Under the assumptions of Theorem [3.3](https://arxiv.org/html/2605.05577#S3.Thmthm3), suppose $0\leq\alpha_{1}\leq\beta_{1}$ and $\alpha_{2}=\beta_{2}$. Choose

$$\eta=\frac{1}{RT^{2/3}},\qquad\beta_{2}=1-\frac{1}{T^{2/3}},\qquad 1-\beta_{1}\in\left[\frac{1}{T^{2/3}},\frac{1}{T^{1/3}}\right],$$

which balances the terms on the right-hand side of ([6](https://arxiv.org/html/2605.05577#S3.E6)). Then

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_{t})\big]\leq O\left(\frac{R(\Delta_{F}+\sigma+L)}{T^{1/3}}\right)=O\!\left(T^{-1/3}\right).$$
The proofs of Theorem [3.3](https://arxiv.org/html/2605.05577#S3.Thmthm3) and Corollary [3.4](https://arxiv.org/html/2605.05577#S3.Thmthm4) are deferred to Appendix [E](https://arxiv.org/html/2605.05577#A5). The result shows that variance reduction improves the convergence rate from $O(T^{-1/4})$ to $O(T^{-1/3})$. This improvement arises from explicitly controlling the noise in the gradient momentum through the correction terms. However, this approach requires an additional stochastic gradient evaluation per iteration, trading increased per-iteration cost for a faster convergence rate.
#### 3.3.3 LMO-IGT
The LMO-IGT method in Algorithm [1](https://arxiv.org/html/2605.05577#alg1) corresponds to a specialization of Algorithm [2](https://arxiv.org/html/2605.05577#alg2) with $\alpha_{1,t}=\alpha_{2,t}=0$ for all $t=0,1,\ldots,T-1$. Compared with LMO-VR, the main advantage of LMO-IGT is that it preserves the single-gradient-per-iteration structure of stochastic LMO, while exploiting the second-order regularity in Assumption [4](https://arxiv.org/html/2605.05577#Thmasmp4) through the transported point $x_{t}$. The convergence result for LMO-IGT is given below.
###### Theorem 3.5 (LMO-IGT).
Suppose Assumptions [1](https://arxiv.org/html/2605.05577#Thmasmp1), [2](https://arxiv.org/html/2605.05577#Thmasmp2), and [4](https://arxiv.org/html/2605.05577#Thmasmp4) hold. Consider Algorithm [1](https://arxiv.org/html/2605.05577#alg1) and assume that, for each $t=0,1,\ldots,T-1$, we set $\beta_{1,t}=\beta_{1}$, $\beta_{2,t}=\beta_{2}$, $\eta_{2,t}=\eta$, and $\eta_{1,t}=\eta/(1-\beta_{2})$, where $0\leq\beta_{1}\leq\beta_{2}<1$, $0\leq\lambda\eta\leq 1$, and $\Delta_{F}:=F(w_{0})-F^{*}$. Then

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\Psi_{\mathcal{C},\lambda}(w_{t})\right]\leq\frac{\Delta_{F}}{T\eta}+R\sigma\left((1-\beta_{1})+\beta_{1}\sqrt{1-\beta_{2}}+\frac{\beta_{1}}{T(1-\beta_{2})}+\frac{1}{T}\right)+LR^{2}\eta\left(\frac{\beta_{2}-\beta_{1}}{1-\beta_{2}}+\frac{1}{2}\right)+\rho R^{3}\eta^{2}\left(\frac{\beta_{1}}{1-\beta_{2}}+\frac{\beta_{2}^{2}}{(1-\beta_{2})^{2}}\right). \tag{7}$$
###### Corollary 3.6 (Stochastic regime).
Under the assumptions of Theorem [3.5](https://arxiv.org/html/2605.05577#S3.Thmthm5), suppose $\sigma>0$. By choosing

$$\eta=\frac{1}{RT^{5/7}},\qquad\beta_{2}=1-\frac{1}{T^{4/7}},\qquad 1-\beta_{1}\in\left[\frac{1}{T^{4/7}},\frac{1}{T^{2/7}}\right],$$

which balances the terms on the right-hand side of ([7](https://arxiv.org/html/2605.05577#S3.E7)), Algorithm [1](https://arxiv.org/html/2605.05577#alg1) yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\Psi_{\mathcal{C},\lambda}(w_{t})\right]\leq O\!\left(\frac{R(\Delta_{F}+\sigma+\rho)}{T^{2/7}}+\frac{RL}{T^{5/7}}\right)=O(T^{-2/7}).$$
###### Corollary 3.7 (Deterministic regime).
Under the assumptions of Theorem [3.5](https://arxiv.org/html/2605.05577#S3.Thmthm5), suppose $\sigma=0$. By choosing

$$\eta=\frac{1}{RT^{1/2}},\qquad\beta_{1}=\beta_{2}=1-\frac{1}{T^{1/4}},$$

which balances the terms on the right-hand side of ([7](https://arxiv.org/html/2605.05577#S3.E7)), Algorithm [1](https://arxiv.org/html/2605.05577#alg1) yields

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\!\left[\Psi_{\mathcal{C},\lambda}(w_{t})\right]\leq O\!\left(\frac{R(\Delta_{F}+L+\rho)}{T^{1/2}}\right)=O(T^{-1/2}).$$
The proofs of Theorem [3.5](https://arxiv.org/html/2605.05577#S3.Thmthm5) and Corollaries [3.6](https://arxiv.org/html/2605.05577#S3.Thmthm6) and [3.7](https://arxiv.org/html/2605.05577#S3.Thmthm7) are deferred to Appendix [C](https://arxiv.org/html/2605.05577#A3).
Taken together, the three classes exhibit the hierarchy

$$O(T^{-1/4})\ \text{(stochastic LMO)},\qquad O(T^{-1/3})\ \text{(LMO-VR)},\qquad O(T^{-2/7})\ \text{(LMO-IGT)}.$$

Equivalently, setting each rate equal to $\epsilon$ and solving for $T$, their iteration complexities are $O(\epsilon^{-4})$, $O(\epsilon^{-3})$, and $O(\epsilon^{-3.5})$, respectively. The key computational distinction is that LMO-VR achieves its improvement through a correction term with substantial per-iteration cost, whereas LMO-IGT attains faster convergence through a low-overhead extrapolation.
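For reference, the small helper below collects the constant hyperparameter schedules suggested by Theorem 3.2, Corollary 3.4, and Corollary 3.6 for a given horizon $T$ (for $\beta_{1}$ we simply take one admissible endpoint of each stated interval); it is a bookkeeping sketch of the theory, not a tuning recommendation.

```python
def theory_schedules(T, R=1.0):
    # Constant parameter choices from the analysis, together with the resulting rates.
    return {
        "stochastic_lmo": {  # Theorem 3.2
            "eta": 1 / (R * T ** 0.75), "beta2": 1 - T ** -0.5, "beta1": 1 - T ** -0.5,
            "rate": "O(T^{-1/4})  =>  O(eps^{-4}) iterations"},
        "lmo_vr": {          # Corollary 3.4 (with alpha2 = beta2)
            "eta": 1 / (R * T ** (2 / 3)), "beta2": 1 - T ** (-2 / 3), "beta1": 1 - T ** (-2 / 3),
            "rate": "O(T^{-1/3})  =>  O(eps^{-3}) iterations"},
        "lmo_igt": {         # Corollary 3.6 (stochastic regime)
            "eta": 1 / (R * T ** (5 / 7)), "beta2": 1 - T ** (-4 / 7), "beta1": 1 - T ** (-4 / 7),
            "rate": "O(T^{-2/7})  =>  O(eps^{-3.5}) iterations"},
    }

print(theory_schedules(10_000)["lmo_igt"])
```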
## 4 Experiments
We evaluate the proposed IGT method on CIFAR-10 (Krizhevsky et al., [2009](https://arxiv.org/html/2605.05577#bib.bib43)) (MIT License) for image classification using ResNet-18 (He et al., [2016](https://arxiv.org/html/2605.05577#bib.bib44)). We compare eight optimizers: AdamW, NIGT, Lion, Muon, Lion-VR, Muon-VR, Lion-IGT, and Muon-IGT. AdamW serves as a standard adaptive optimization baseline, while NIGT serves as an existing Euclidean normalized IGT baseline. Lion and Muon belong to the stochastic LMO class, Lion-VR and Muon-VR correspond to the LMO-VR class, and Lion-IGT and Muon-IGT instantiate the proposed LMO-IGT class. Experimental details are given in Appendix [F](https://arxiv.org/html/2605.05577#A6).
Figure 2: Test accuracy on CIFAR-10 with ResNet-18: (a) test accuracy over epochs; (b) test accuracy over training time. Each curve is averaged over five independent runs, and the shaded region denotes one standard deviation. All methods are trained for 200 epochs. In Figure [2(b)](https://arxiv.org/html/2605.05577#S4.F2.sf2), curves terminate at different time points as faster methods complete training earlier.

In Figure [2](https://arxiv.org/html/2605.05577#S4.F2), each curve reports the mean test accuracy over five independent runs and the shaded region indicates one standard deviation. Figure [2(a)](https://arxiv.org/html/2605.05577#S4.F2.sf1) shows that Muon-IGT achieves the best final accuracy among all tested methods. Muon and NIGT form the strongest baselines, indicating that both operator-norm LMO geometry and Euclidean IGT are effective; nevertheless, Muon-IGT improves upon both by combining implicit gradient transport with Muon-style LMO updates. AdamW remains competitive but is outperformed by the strongest LMO-based methods.
The comparison with LMO-VR highlights the practical advantage of IGT. Although variance reduction improves the theoretical convergence rate, its correction term requires an additional gradient evaluation at the previous iterate, substantially increasing wall-clock time, as shown in Figure [2(b)](https://arxiv.org/html/2605.05577#S4.F2.sf2). In contrast, IGT uses only one stochastic gradient evaluation per iteration and incurs minimal computation and memory overhead. Consequently, Lion-IGT and Muon-IGT run at nearly the same speed as their plain counterparts while achieving higher accuracy. We also observe that Lion-VR performs poorly in this setting, suggesting that variance-reduction corrections are not uniformly beneficial for sign-based LMO updates.

Additional results, including ablation studies on the components of Muon-IGT, robustness tests over learning rate and weight decay, and experiments on other tasks such as language modeling and large-scale language modeling, are provided in Appendix [G](https://arxiv.org/html/2605.05577#A7). These results demonstrate the robustness of Muon-IGT to hyperparameter choices and its strong performance across different tasks.
## 5 Conclusion
We studied stochastic LMO-based optimization and proposed LMO-IGT, a new class of methods that extends IGT to general LMO-based updates. We also introduced a unified framework equipped with a new stationarity measure, enabling a common analysis of stochastic LMO, LMO-VR, and LMO-IGT methods across both constrained and unconstrained settings.

Our analysis shows that LMO-IGT improves the convergence rate from $O(\epsilon^{-4})$ to $O(\epsilon^{-3.5})$ while preserving the single-gradient-per-iteration structure of stochastic LMO, thereby bridging the gap between standard stochastic LMO and variance-reduced methods. Empirically, LMO-IGT consistently outperforms its counterparts while maintaining comparable wall-clock efficiency, highlighting its practical advantage. These results demonstrate that IGT provides an effective mechanism for accelerating LMO-based optimization without sacrificing computational efficiency.

Limitations and broader impacts are provided in Appendix [H](https://arxiv.org/html/2605.05577#A8) and Appendix [I](https://arxiv.org/html/2605.05577#A9), respectively.
## References
- S. M. R. Arnold, P.-A. Manzagol, R. Babanezhad, I. Mitliagkas, and N. Le Roux (2019) Reducing the variance in online optimization by transporting past gradients. In Advances in Neural Information Processing Systems, Vol. 32.
- J. Bernstein and L. Newhouse (2024) Old optimizer, new norm: an anthology. In OPT 2024: Optimization for Machine Learning. [Link](https://openreview.net/forum?id=ux18f5nOpD)
- J. Bernstein and L. Newhouse (2025) Modular duality in deep learning. In International Conference on Machine Learning, pp. 3920–3930.
- J. Bernstein, Y. Wang, K. Azizzadenesheli, and A. Anandkumar (2018) signSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 560–569.
- A. Beznosikov, D. Dobre, and G. Gidel (2024) Sarah Frank-Wolfe: methods for constrained optimization with best rates and practical features. In Proceedings of the 41st International Conference on Machine Learning, pp. 3725–3750.
- D. Chang, Y. Liu, and G. Yuan (2025) On the convergence of Muon and beyond. arXiv:2509.15816.
- L. Chen, J. Li, and Q. Liu (2025) Muon optimizes under spectral norm constraints. In OPT 2025: Optimization for Machine Learning. [Link](https://openreview.net/forum?id=bBSq533vFH)
- L. Chen, B. Liu, K. Liang, and Q. Liu (2024) Lion secretly solves a constrained optimization: as Lyapunov predicts. In The Twelfth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=e4xS9ZarDr)
- X. Chen, M. Hong, S. Liu, and R. Sun (2019) On the convergence of a class of Adam-type algorithms for non-convex optimization. In 7th International Conference on Learning Representations, ICLR 2019.
- X. Chen, C. Liang, D. Huang, E. Real, K. Wang, H. Pham, X. Dong, T. Luong, C. Hsieh, Y. Lu, and Q. V. Le (2023) Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems, Vol. 36, pp. 49205–49233.
- A. Cutkosky and H. Mehta (2020) Momentum improves normalized SGD. In International Conference on Machine Learning, pp. 2260–2268.
- A. Cutkosky and F. Orabona (2019) Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, Vol. 32.
- Y. Dong, H. Li, and Z. Lin (2024) Convergence rate analysis of Lion. arXiv:2411.07724.
- J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12(7).
- C. Fang, C. J. Li, Z. Lin, and T. Zhang (2018) SPIDER: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, Vol. 31.
- W. Fenchel and T. Bonnesen (1934) Theorie der konvexen Körper. J. Springer.
- R. J. Gardner (1995) Geometric tomography. Vol. 58, Cambridge University Press.
- A. Gokaslan and V. Cohen (2019) OpenWebText corpus. [http://Skylion007.github.io/OpenWebTextCorpus](http://skylion007.github.io/OpenWebTextCorpus)
- R. M. Gower, M. Schmidt, F. Bach, and P. Richtarik (2020) Variance-reduced methods for machine learning. arXiv:2010.00892.
- V. Gupta, T. Koren, and Y. Singer (2018) Shampoo: preconditioned stochastic tensor optimization. In International Conference on Machine Learning, pp. 1842–1850.
- H. Hassani, A. Karbasi, A. Mokhtari, and Z. Shen (2020) Stochastic conditional gradient++. arXiv:1902.06992.
- E. Hazan, K. Y. Levy, and S. Shalev-Shwartz (2015) Beyond convexity: stochastic quasi-convex optimization. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Vol. 1, pp. 1594–1602.
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- M. Jaggi (2013) Revisiting Frank-Wolfe: projection-free sparse convex optimization. In International Conference on Machine Learning, pp. 427–435.
- W. Jiang and L. Zhang (2025) Convergence analysis of the Lion optimizer in centralized and distributed settings. arXiv:2508.12327.
- K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein (2024) Muon: an optimizer for hidden layers in neural networks. [Link](https://kellerjordan.github.io/posts/muon/)
- A. Karpathy (2015) Shakespeare dataset. GitHub. [https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)
- A. Karpathy (2022) NanoGPT. GitHub. [https://github.com/karpathy/nanoGPT](https://github.com/karpathy/nanoGPT)
- G. Y. Kim and M. Oh (2026) Convergence of Muon with Newton-Schulz. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=lJSfxtLpLm)
- D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. arXiv:1412.6980.
- A. Krizhevsky et al. (2009) Learning multiple layers of features from tiny images.
- S. Lacoste-Julien (2016) Convergence rate of Frank-Wolfe for non-convex objectives. arXiv:1607.00345.
- I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. arXiv:1711.05101.
- S. Nagashima and H. Iiduka (2026) Improved convergence rates of Muon optimizer for nonconvex optimization. arXiv:2601.19400.
- Y. Nesterov (2013) Introductory lectures on convex optimization: a basic course. Vol. 87, Springer Science & Business Media.
- T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher (2025) Training deep learning models with norm-constrained LMOs. In Proceedings of the 42nd International Conference on Machine Learning, PMLR Vol. 267, pp. 49069–49104.
- S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of Adam and beyond. In International Conference on Learning Representations.
- S. J. Reddi, S. Sra, B. Póczos, and A. Smola (2016) Stochastic Frank-Wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1244–1251.
- N. Sato, H. Naganuma, and H. Iiduka (2025) Convergence bound and critical batch size of Muon optimizer. arXiv:2507.01598.
- R. Schneider (2013) Convex bodies: the Brunn–Minkowski theory. Vol. 151, Cambridge University Press.
- M. Sfyraki and J. Wang (2025) Lions and Muons: optimization via stochastic Frank-Wolfe. arXiv:2506.04192.
- W. Shen, R. Huang, M. Huang, C. Shen, and J. Zhang (2025) On the convergence analysis of Muon. arXiv:2505.23737.
- C. Si, D. Zhang, and W. Shen (2025) AdaMuon: adaptive Muon optimizer. arXiv:2507.11005.
- T. Tieleman and G. Hinton (2012) Lecture 6.5—RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
- S. Xie and Z. Li (2024) Implicit bias of AdamW: $\ell_{\infty}$-norm constrained optimization. In Proceedings of the 41st International Conference on Machine Learning, Vol. 235, pp. 54488–54510.
- X. Xie, P. Zhou, H. Li, Z. Lin, and S. Yan (2024) Adan: adaptive Nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence 46(12), pp. 9508–9520.
- H. Yuan, Y. Liu, S. Wu, Z. Xun, and Q. Gu (2025) MARS: unleashing the power of variance reduction for training large models. In Proceedings of the 42nd International Conference on Machine Learning, PMLR Vol. 267, pp. 73553–73587.
- M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar (2018) Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, Vol. 31.
- B. Zhang and R. Sennrich (2019) Root mean square layer normalization. In Advances in Neural Information Processing Systems, Vol. 32.
- M. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, Vol. 32.
- M\. Zhang, Z\. Shen, A\. Mokhtari, H\. Hassani, and A\. Karbasi \(2020\)One sample stochastic frank\-wolfe\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 4012–4023\.Cited by:[Appendix A](https://arxiv.org/html/2605.05577#A1.SS0.SSS0.Px4.p1.1)\.
- D\. Zhou, P\. Xu, and Q\. Gu \(2020\)Stochastic nested variance reduction for nonconvex optimization\.Journal of machine learning research21\(103\),pp\. 1–63\.Cited by:[§2\.4](https://arxiv.org/html/2605.05577#S2.SS4.p1.1)\.
## Appendix A Related Works
Viewed historically, the literature leading to our setting has evolved along three largely independent directions: (i) improved local scaling via adaptive moments; (ii) improved geometry via normalized, sign-based, or matrix-aware updates; and (iii) improved stochastic efficiency via recursive estimators or extrapolation-based mechanisms, including lookahead and transported gradients. A central challenge in comparing these lines is that they adopt different notions of stationarity. Some works measure convergence using (averaged) gradient norms, others rely on dual norms induced by normalized updates, while projection-free methods naturally employ the Frank–Wolfe gap. As a result, even algorithmically similar methods are often analyzed under incompatible criteria. Given this mismatch, the literature is best understood from a conceptual perspective before drawing comparisons based on formal convergence guarantees.
##### Adaptive methods and coordinate\-wise preconditioning\.
Coordinate-wise adaptive preconditioning has a long history in stochastic optimization. AdaGrad adapts per-coordinate learning rates using accumulated gradient geometry (Duchi et al., [2011](https://arxiv.org/html/2605.05577#bib.bib65)), while RMSProp replaces this cumulative scaling with an exponential moving average of recent squared gradients (Tieleman and Hinton, [2012](https://arxiv.org/html/2605.05577#bib.bib66)). The modern baseline in deep learning is Adam, which combines momentum with coordinate-wise second-moment normalization through exponential moving averages of first and second moments (Kingma and Ba, [2017](https://arxiv.org/html/2605.05577#bib.bib41)). Subsequent work has clarified that adaptive methods are not merely drop-in improvements over SGD. In particular, Reddi et al. ([2018](https://arxiv.org/html/2605.05577#bib.bib45)) identified non-convergence issues of Adam and proposed AMSGrad, while later studies analyzed broader Adam-type recursions and Yogi-style stabilizations in nonconvex stochastic settings (Chen et al., [2019](https://arxiv.org/html/2605.05577#bib.bib46); Zaheer et al., [2018](https://arxiv.org/html/2605.05577#bib.bib47)). AdamW is especially relevant in our context, as decoupled weight decay separates shrinkage from the gradient update (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.05577#bib.bib42)). Recent work further shows that this separation alters the effective optimization geometry and admits an implicit constrained interpretation connected to normalized steepest descent and Frank–Wolfe-type methods (Xie and Li, [2024](https://arxiv.org/html/2605.05577#bib.bib32)). Adan extends this line by incorporating a Nesterov-style momentum mechanism within an adaptive optimizer, while retaining the overall coordinate-wise adaptive structure (Xie et al., [2024](https://arxiv.org/html/2605.05577#bib.bib48)). Overall, this family serves as the primary coordinate-wise baseline against which LMO-based methods are compared.
##### From normalized gradients to sign\-based updates\.
A second line of work replaces magnitude-sensitive steps with normalized or sign-based directions. Stochastic normalized gradient descent already emphasized that the gradient direction, rather than its magnitude, can be the key object of analysis (Hazan et al., [2015](https://arxiv.org/html/2605.05577#bib.bib40)). signSGD and Signum push this idea further by using sign-based updates, retaining strong optimization performance while significantly simplifying the update rule (Bernstein et al., [2018](https://arxiv.org/html/2605.05577#bib.bib29)). Lion continues this direction by combining sign-based updates with momentum while discarding second-moment adaptation (Chen et al., [2023](https://arxiv.org/html/2605.05577#bib.bib8)). As a result, its behavior is more closely aligned with normalized or dual-norm steepest descent than with classical adaptive methods. This interpretation has been reinforced by recent analyses: some view Lion with decoupled weight decay as solving an $\ell_\infty$-constrained problem (Chen et al., [2024](https://arxiv.org/html/2605.05577#bib.bib11)), while others establish stochastic convergence rates, showing the standard $\mathcal{O}(\varepsilon^{-4})$ complexity under gradient- or $\ell_1$-type stationarity, with variance-reduced variants improving to $\mathcal{O}(\varepsilon^{-3})$ (Dong et al., [2024](https://arxiv.org/html/2605.05577#bib.bib9); Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10)). Overall, this line of work demonstrates that sign-based methods are not merely empirical heuristics, but instances of a broader norm-aware optimization principle.
##### Matrix\-structured and geometry\-aware optimization\.
A third line of work emphasizes that the relevant geometry in deep learning is often neither purely Euclidean nor purely coordinate-wise. Early structure-aware preconditioners such as Shampoo exploit tensor structure rather than flattening parameters into vectors (Gupta et al., [2018](https://arxiv.org/html/2605.05577#bib.bib51)). More recent work makes this perspective explicit at the level of optimization geometry. For example, Bernstein and Newhouse ([2024](https://arxiv.org/html/2605.05577#bib.bib52)) reinterpret several methods as steepest descent under non-Euclidean norms, while Bernstein and Newhouse ([2025](https://arxiv.org/html/2605.05577#bib.bib53)) develop layerwise duality maps based on operator norms for general neural architectures. Muon fits naturally into this line, as its orthogonalized updates operate on matrix-shaped parameters and are closely tied to spectral, nuclear, or operator-norm geometry (Jordan et al., [2024](https://arxiv.org/html/2605.05577#bib.bib7)). Recent analyses establish its convergence behavior, identify regimes where it can outperform gradient descent, and interpret Muon with decoupled weight decay as solving a spectrally constrained problem (Shen et al., [2025](https://arxiv.org/html/2605.05577#bib.bib3); Chang et al., [2025](https://arxiv.org/html/2605.05577#bib.bib2); Chen et al., [2025](https://arxiv.org/html/2605.05577#bib.bib49)). AdaMuon further combines this geometry with element-wise adaptivity, illustrating that adaptive and matrix-aware designs can be unified (Si et al., [2025](https://arxiv.org/html/2605.05577#bib.bib50)). From a comparative perspective, this line is particularly relevant because the underlying norm geometry becomes part of the algorithm design, leading to stationarity measures that range from gradient-based criteria to constrained Frank–Wolfe gap interpretations.
##### Frank–Wolfe, conditional gradients, and LMO\-based viewpoints\.
The Frank–Wolfe literature provides the natural constrained counterpart to normalized and LMO-based updates. Classical Frank–Wolfe theory relies on duality-gap certificates for projection-free convex optimization (Jaggi, [2013](https://arxiv.org/html/2605.05577#bib.bib13)), and its nonconvex extension shows that the Frank–Wolfe gap remains a meaningful stationarity measure beyond convexity (Lacoste-Julien, [2016](https://arxiv.org/html/2605.05577#bib.bib15)). Building on this foundation, stochastic and variance-reduced conditional-gradient methods have progressively improved the oracle complexity of nonconvex projection-free optimization, including stochastic Frank–Wolfe, one-sample variants, SFW++, and SARAH-based methods (Reddi et al., [2016](https://arxiv.org/html/2605.05577#bib.bib58); Zhang et al., [2020](https://arxiv.org/html/2605.05577#bib.bib55); Hassani et al., [2020](https://arxiv.org/html/2605.05577#bib.bib56); Beznosikov et al., [2024](https://arxiv.org/html/2605.05577#bib.bib57)). In parallel, recent deep-learning work reinterprets modern optimizers through this lens. Norm-constrained LMO methods unify a range of update rules within a stochastic framework and demonstrate that LMO-type updates can be applied even in unconstrained settings (Pethick et al., [2025](https://arxiv.org/html/2605.05577#bib.bib39)). Practical instantiations often employ layer-wise norm constraints and structured updates, enabling flexible, layer-dependent behavior at the cost of additional tuning. Closely related work further shows that Lion and Muon with decoupled weight decay can be interpreted as instances of stochastic Frank–Wolfe over $\ell_\infty$- or operator-norm-type feasible sets (Sfyraki and Wang, [2025](https://arxiv.org/html/2605.05577#bib.bib1)). This line of work is central to our setting, as it provides the constrained-optimization language that connects normalized updates, modern LMO-based optimizers, and projection-free notions of stationarity within a unified perspective.
##### Variance reduction and accelerated stochastic first\-order methods\.
From a rate perspective, variance reduction remains the canonical approach for improving stochastic nonconvex first-order complexity from the SGD-like $\mathcal{O}(\varepsilon^{-4})$ regime to the widely studied $\mathcal{O}(\varepsilon^{-3})$ regime. SPIDER is a central reference, and its normalized-gradient variant is particularly relevant as it already connects recursive estimators with normalized updates (Fang et al., [2018](https://arxiv.org/html/2605.05577#bib.bib19)). STORM further shows that momentum-like recursions can achieve similar improvements without the large batch sizes often required by variance-reduced methods (Cutkosky and Orabona, [2019](https://arxiv.org/html/2605.05577#bib.bib18)). This idea has recently reappeared in large-model optimization through MARS, which unifies preconditioned updates with stochastic recursive momentum and recovers AdamW-, Lion-, and Shampoo-style methods within a common framework (Yuan et al., [2025](https://arxiv.org/html/2605.05577#bib.bib22)). A similar pattern also appears in recent analyses of Lion and Muon, where STORM-type correction terms are incorporated into sign-based or orthogonalized momentum updates to improve the stochastic rate (Jiang and Zhang, [2025](https://arxiv.org/html/2605.05577#bib.bib10); Chang et al., [2025](https://arxiv.org/html/2605.05577#bib.bib2)). The key takeaway is that improved iteration complexity is already achievable for modern optimizers, but typically comes at the cost of additional estimator state, gradient-difference corrections, or multiple gradient evaluations per iteration.
##### Extrapolation, lookahead, and transported gradients\.
A complementary route to acceleration is to change where or how first-order information is evaluated along the optimization trajectory. Classical Nesterov acceleration forms updates from an extrapolated sequence rather than from the current iterate alone, thereby introducing a trajectory-aware form of momentum (Nesterov, [2013](https://arxiv.org/html/2605.05577#bib.bib37)). In modern deep learning, this intuition appears in several practical forms. One is approximate Nesterov momentum, where lookahead behavior is encoded algebraically without requiring a separate gradient evaluation at a distinct extrapolated point; adaptive examples in this direction include Adan (Xie et al., [2024](https://arxiv.org/html/2605.05577#bib.bib48)). Another is Lookahead, which maintains fast and slow weights and periodically updates the slow sequence toward the trajectory generated by an inner optimizer (Zhang et al., [2019](https://arxiv.org/html/2605.05577#bib.bib59)).
The transported-gradient line makes this trajectory-aware viewpoint even more explicit. The original IGT work introduced the idea of correcting gradient staleness by transporting previously computed gradient information along the optimization trajectory without explicit Hessian computations (Arnold et al., [2019](https://arxiv.org/html/2605.05577#bib.bib24)). Building on this idea, Cutkosky and Mehta ([2020](https://arxiv.org/html/2605.05577#bib.bib16)) showed that, under bounded second-derivative assumptions, a transported-gradient modification of normalized SGD attains an improved $\mathcal{O}(\varepsilon^{-3.5})$ stochastic complexity while preserving one stochastic gradient evaluation per iteration. Conceptually, these methods are related through the use of extrapolation or synchronization along the optimization path, but they differ in how auxiliary points are used: Nesterov-style methods modify the momentum/query construction, Lookahead explicitly couples fast and slow iterate sequences, whereas IGT evaluates stochastic gradients at a transported point to reduce momentum lag. Existing analyses in this broader extrapolation family, however, remain largely restricted to Euclidean or normalized settings and do not directly address general LMO queries, operator-norm geometries, or constrained gap-based stationarity.
##### Positioning of our work\.
Against this background, our contribution goes beyond a marginal rate improvement within a fixed optimizer family. Algorithmically, we extend implicit gradient transport from Euclidean normalized updates to general LMO-based methods, thereby unifying normalized, sign-based, and matrix-structured optimizers under a common mechanism. Analytically, we compare plain stochastic LMO, variance-reduced LMO, and LMO-IGT under a single stationarity measure, rather than switching between gradient norms in unconstrained settings and Frank–Wolfe gaps in constrained ones. This distinction is crucial: prior work typically improves either the update geometry or the iteration complexity within a fixed measure, whereas our framework enables comparisons across both geometry and stationarity notions on a common scale. As a result, the $\mathcal{O}(\varepsilon^{-4})$, $\mathcal{O}(\varepsilon^{-3})$, and $\mathcal{O}(\varepsilon^{-3.5})$ guarantees can be placed on a single comparison axis.
## Appendix B Useful Lemmas
When $\lambda>0$, we write $\mathcal{P}:=\lambda^{-1}\mathcal{C}$. Recall also that
$$\Psi_{\mathcal{C},\lambda}(w):=\sup_{v\in\mathcal{C}}\langle -\nabla F(w),\, v-\lambda w\rangle, \qquad R:=\operatorname{diam}(\mathcal{C}).$$
###### Lemma B.1 (Geometry of a single LMO step).
Let
$$w_{t+1}=(1-\lambda\eta_t)w_t+\eta_t v_t,\qquad v_t\in\mathcal{C},$$
where either $\lambda=0$, or $\lambda>0$ and $w_t\in\mathcal{P}=\lambda^{-1}\mathcal{C}$. Then
$$\|v_t-\lambda w_t\|\le R,\qquad \|w_{t+1}-w_t\|\le\eta_t R.$$
###### Proof.
If $\lambda>0$, then $\lambda w_t\in\mathcal{C}$, and since $v_t\in\mathcal{C}$,
$$\|v_t-\lambda w_t\|\le\operatorname{diam}(\mathcal{C})=R.$$
If $\lambda=0$, then $\|v_t-\lambda w_t\|=\|v_t\|\le R$ because $0\in\mathcal{C}$.
Finally,
$$w_{t+1}-w_t=\eta_t(v_t-\lambda w_t),$$
hence
$$\|w_{t+1}-w_t\|\le\eta_t R.$$
∎
###### Lemma B.2 (Descent rule).
Suppose Assumption 1 holds. Let $g_t\in\mathbb{R}^d$, let
$$v_t\in\arg\min_{v\in\mathcal{C}}\langle g_t,v\rangle,\qquad w_{t+1}=(1-\lambda\eta_t)w_t+\eta_t v_t,$$
and define
$$\hat{\epsilon}_t:=g_t-\nabla F(w_t).$$
Then
$$F(w_{t+1})\le F(w_t)-\eta_t\Psi_{\mathcal{C},\lambda}(w_t)+\eta_t R\|\hat{\epsilon}_t\|+\frac{L}{2}\eta_t^2R^2.\tag{A1}$$
Consequently, if $\eta_t=\eta$ for every $t=0,1,\ldots,T-1$, then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+\frac{R}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|+\frac{L}{2}R^2\eta,\tag{A2}$$
where $\Delta_F:=F(w_0)-F^*$.
###### Proof.
Let
$$\hat{v}_t\in\arg\max_{v\in\mathcal{C}}\langle-\nabla F(w_t),v\rangle.$$
By $L$-smoothness of $F$,
$$\begin{aligned}
F(w_{t+1})&\le F(w_t)+\eta_t\langle\nabla F(w_t),\,v_t-\lambda w_t\rangle+\frac{L}{2}\eta_t^2\|v_t-\lambda w_t\|^2\\
&\le F(w_t)+\eta_t\langle g_t,\,v_t-\lambda w_t\rangle+\eta_t\langle\nabla F(w_t)-g_t,\,v_t-\lambda w_t\rangle+\frac{L}{2}\eta_t^2R^2\\
&\le F(w_t)+\eta_t\langle g_t,\,\hat{v}_t-\lambda w_t\rangle+\eta_t\langle\nabla F(w_t)-g_t,\,v_t-\lambda w_t\rangle+\frac{L}{2}\eta_t^2R^2\\
&=F(w_t)+\eta_t\langle\nabla F(w_t),\,\hat{v}_t-\lambda w_t\rangle+\eta_t\langle\nabla F(w_t)-g_t,\,v_t-\hat{v}_t\rangle+\frac{L}{2}\eta_t^2R^2\\
&=F(w_t)-\eta_t\Psi_{\mathcal{C},\lambda}(w_t)+\eta_t\langle\nabla F(w_t)-g_t,\,v_t-\hat{v}_t\rangle+\frac{L}{2}\eta_t^2R^2\\
&\le F(w_t)-\eta_t\Psi_{\mathcal{C},\lambda}(w_t)+\eta_t R\|\hat{\epsilon}_t\|+\frac{L}{2}\eta_t^2R^2,
\end{aligned}$$
where we used Lemma B.1 for $\|v_t-\lambda w_t\|\le R$, the LMO optimality $\langle g_t,v_t\rangle\le\langle g_t,\hat{v}_t\rangle$, and $\|v_t-\hat{v}_t\|\le R$ since both points lie in $\mathcal{C}$.
Summing over $t=0,\dots,T-1$, taking expectations, and dividing by $T\eta$ gives (A2). ∎
###### Lemma B.3 (IGT extrapolation geometry).
Suppose
$$x_{t+1}=(1-\lambda\eta_{1,t})w_t+\eta_{1,t}v_t,\qquad w_{t+1}=(1-\lambda\eta_{2,t})w_t+\eta_{2,t}v_t.$$
Then
$$x_{t+1}-w_{t+1}=(\eta_{1,t}-\eta_{2,t})(v_t-\lambda w_t).$$
If $\eta_{1,t}=\eta_1$ and $\eta_{2,t}=\eta_2$ for every $t=0,1,\ldots,T-1$, then for every $t\ge 1$,
$$w_t-w_{t-1}=\eta_2(v_{t-1}-\lambda w_{t-1}),\qquad x_t-w_t=(\eta_1-\eta_2)(v_{t-1}-\lambda w_{t-1}),$$
and hence
$$x_t=w_t+C(w_t-w_{t-1}),\qquad C:=\frac{\eta_1-\eta_2}{\eta_2}.$$
Moreover,
$$\|w_t-w_{t-1}\|\le\eta_2 R,\qquad\|x_t-w_t\|\le|\eta_1-\eta_2|R.$$
###### Proof.
The identities follow directly from the definitions of $x_{t+1}$ and $w_{t+1}$. The norm bounds follow from Lemma B.1. ∎
###### Lemma B.4 (Second-order remainder).
Assume Assumption 4. Define
$$Z(x,y):=\int_0^1\bigl(\nabla^2F(y+\tau(x-y))-\nabla^2F(y)\bigr)(x-y)\,d\tau.$$
Then
$$\nabla F(x)=\nabla F(y)+\nabla^2F(y)(x-y)+Z(x,y),$$
and
$$\|Z(x,y)\|\le\rho\|x-y\|^2.$$
###### Proof.
The identity is the integral form of Taylor's theorem. By Assumption 4,
$$\begin{aligned}
\|Z(x,y)\|&\le\int_0^1\|\nabla^2F(y+\tau(x-y))-\nabla^2F(y)\|_{\mathrm{op}}\,\|x-y\|\,d\tau\\
&\le\int_0^1\rho\tau\|x-y\|^2\,d\tau\le\rho\|x-y\|^2.
\end{aligned}$$
∎
###### Lemma B.5 (Momentum-weighted martingale bound).
Let $\mathcal{F}_t:=\sigma(\xi_0,\dots,\xi_t)$ be the natural filtration. Suppose $\{\epsilon_t\}_{t\ge 0}$ is adapted to $\{\mathcal{F}_t\}_{t\ge 0}$ and satisfies
$$\mathbb{E}[\epsilon_t\mid\mathcal{F}_{t-1}]=0,\qquad\mathbb{E}[\|\epsilon_t\|^2\mid\mathcal{F}_{t-1}]\le\sigma^2.$$
Fix $0\le\beta_1\le\beta_2<1$. For $t\ge 1$, define
$$N_t:=\beta_1\beta_2^{t-1}\epsilon_0+\beta_1(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\epsilon_s+(1-\beta_1)\epsilon_t.$$
Then
$$\mathbb{E}\|N_t\|^2\le\beta_1^2\beta_2^{2t-2}\sigma^2+\beta_1^2(1-\beta_2)\sigma^2+(1-\beta_1)^2\sigma^2,\tag{A3}$$
$$\mathbb{E}\|N_t\|\le\beta_1\beta_2^{t-1}\sigma+\beta_1\sqrt{1-\beta_2}\,\sigma+(1-\beta_1)\sigma.\tag{A4}$$
If additionally $N_0:=\epsilon_0$, then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|N_t\|\le\frac{\sigma}{T}+\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}\right).\tag{A5}$$
###### Proof.
Because $\{\epsilon_t\}$ is a martingale difference sequence, the cross terms vanish:
$$\mathbb{E}\langle\epsilon_r,\epsilon_s\rangle=0,\qquad r\neq s.$$
Hence
$$\begin{aligned}
\mathbb{E}\|N_t\|^2&\le\beta_1^2\beta_2^{2t-2}\sigma^2+\beta_1^2(1-\beta_2)^2\sum_{s=1}^{t-1}\beta_2^{2t-2-2s}\sigma^2+(1-\beta_1)^2\sigma^2\\
&\le\beta_1^2\beta_2^{2t-2}\sigma^2+\beta_1^2(1-\beta_2)\sigma^2+(1-\beta_1)^2\sigma^2,
\end{aligned}$$
where we used
$$(1-\beta_2)^2\sum_{s=1}^{t-1}\beta_2^{2t-2-2s}\le(1-\beta_2)^2\sum_{k=0}^\infty\beta_2^{2k}=\frac{1-\beta_2}{1+\beta_2}\le 1-\beta_2.$$
This proves (A3).
Applying Jensen's inequality and $\sqrt{a+b+c}\le\sqrt{a}+\sqrt{b}+\sqrt{c}$ gives (A4). Finally, averaging over $t=1,\dots,T-1$, using
$$\sum_{t=1}^{T-1}\beta_2^{t-1}\le\frac{1}{1-\beta_2},$$
and $\mathbb{E}\|N_0\|\le\sigma$, yields (A5). ∎
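As a quick illustration of Lemma B.5 (our own, not part of the paper's experiments), the following Monte Carlo check instantiates the second-moment bound (A3) with i.i.d. Gaussian noise; the dimensions, horizon, and coefficients are arbitrary.

```python
import numpy as np

# Monte Carlo sanity check of (A3), assuming i.i.d. Gaussian noise with
# E||eps_t||^2 = sigma^2 (an illustrative special case of the lemma).
rng = np.random.default_rng(0)
d, t, trials = 20, 50, 20000
beta1, beta2, sigma = 0.8, 0.95, 1.0

eps = rng.standard_normal((trials, t + 1, d)) * (sigma / np.sqrt(d))  # E||eps_s||^2 = sigma^2
weights = np.zeros(t + 1)
weights[0] = beta1 * beta2 ** (t - 1)                                   # coefficient of eps_0
weights[1:t] = beta1 * (1 - beta2) * beta2 ** (t - 1 - np.arange(1, t)) # eps_1, ..., eps_{t-1}
weights[t] = 1 - beta1                                                  # coefficient of eps_t

N_t = np.einsum("s,nsd->nd", weights, eps)           # N_t for each trial
empirical = np.mean(np.sum(N_t ** 2, axis=1))        # estimate of E||N_t||^2
bound = (beta1**2 * beta2**(2 * t - 2) + beta1**2 * (1 - beta2) + (1 - beta1)**2) * sigma**2
print(f"E||N_t||^2 estimate {empirical:.4f} <= bound {bound:.4f}")
```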
###### Lemma B.6 (Centered gradient-difference bound).
Under Assumption 3, for all $x,y$ in the region of interest,
$$\mathbb{E}_\xi\left[\left\|\bigl(\nabla f(x;\xi)-\nabla f(y;\xi)\bigr)-\bigl(\nabla F(x)-\nabla F(y)\bigr)\right\|^2\right]\le L^2\|x-y\|^2.$$
###### Proof.
Let
$$Z:=\nabla f(x;\xi)-\nabla f(y;\xi).$$
Then $\mathbb{E}[Z]=\nabla F(x)-\nabla F(y)$, and therefore
$$\mathbb{E}\|Z-\mathbb{E}Z\|^2=\mathbb{E}\|Z\|^2-\|\mathbb{E}Z\|^2\le\mathbb{E}\|Z\|^2\le L^2\|x-y\|^2.$$
∎
##### Approximate Nesterov momentum.
Before using the term Nesterov momentum, we briefly specify the convention adopted in this appendix. In the classical exact Nesterov scheme, the gradient is evaluated at a lookahead point rather than at the current iterate. A generic form is
$$y_t=w_t+\gamma_t(w_t-w_{t-1}),\qquad m_t=\beta_t m_{t-1}+(1-\beta_t)\nabla f(y_t;\xi_t).$$
In practice, many deep-learning implementations use an approximate Nesterov form that does not require a separate gradient evaluation at a distinct lookahead point. Let $z_t$ denote the point at which the stochastic gradient is evaluated. We then define
$$m_t=\beta_{2,t}m_{t-1}+(1-\beta_{2,t})\nabla f(z_t;\xi_t),$$
and
$$g_t=\bar{\beta}_{1,t}m_t+(1-\bar{\beta}_{1,t})\nabla f(z_t;\xi_t).$$
Substituting the recursion for $m_t$ into the definition of $g_t$ gives
$$g_t=\bar{\beta}_{1,t}\beta_{2,t}m_{t-1}+\bigl(1-\bar{\beta}_{1,t}\beta_{2,t}\bigr)\nabla f(z_t;\xi_t).$$
Therefore the approximate Nesterov query is exactly the same as the two-momentum query
$$g_t=\beta_{1,t}m_{t-1}+(1-\beta_{1,t})\nabla f(z_t;\xi_t)$$
with
$$\beta_{1,t}=\bar{\beta}_{1,t}\beta_{2,t}.$$
In particular, if $0\le\bar{\beta}_{1,t}\le 1$ and $0\le\beta_{2,t}<1$, then
$$0\le\beta_{1,t}\le\beta_{2,t}<1.$$
When $z_t=w_t$, this recovers the current-point stochastic-LMO specialization. When $z_t=x_t$, it recovers the transported-point LMO-IGT specialization. For this reason, we do not state separate Nesterov-specific convergence theorems in the appendix.
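As a quick sanity check of the algebraic equivalence above, the following snippet (our own illustration; array sizes and coefficients are arbitrary) verifies numerically that the approximate Nesterov query coincides with the two-momentum query when $\beta_{1,t}=\bar{\beta}_{1,t}\beta_{2,t}$.

```python
import numpy as np

# Numeric check: approximate Nesterov query == two-momentum query
# with beta1 = bar_beta1 * beta2, for an arbitrary momentum buffer and gradient.
rng = np.random.default_rng(0)
m_prev = rng.standard_normal(5)   # m_{t-1}
grad = rng.standard_normal(5)     # stochastic gradient at the query point z_t
bar_beta1, beta2 = 0.9, 0.95

m_t = beta2 * m_prev + (1 - beta2) * grad                 # momentum update
g_nesterov = bar_beta1 * m_t + (1 - bar_beta1) * grad     # approximate Nesterov query

beta1 = bar_beta1 * beta2
g_two_momentum = beta1 * m_prev + (1 - beta1) * grad      # two-momentum query

assert np.allclose(g_nesterov, g_two_momentum)
print("max abs difference:", np.abs(g_nesterov - g_two_momentum).max())
```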
## Appendix C Theoretical Analysis of the LMO-IGT Class
Consider Algorithm 1. For every $t=0,1,\ldots,T-1$, let
$$\beta_{1,t}=\beta_1,\qquad\beta_{2,t}=\beta_2,\qquad\eta_{2,t}=\eta,\qquad\eta_{1,t}=\frac{\eta}{1-\beta_2}.$$
Applying Lemma B.2 with stepsize $\eta$ gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+\frac{R}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|+\frac{L}{2}R^2\eta,\tag{A6}$$
where
$$\hat{\epsilon}_t=g_t-\nabla F(w_t).$$
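Algorithm 1 itself is stated in the main text; as a reading aid, the following minimal sketch reconstructs the LMO-IGT iteration from the recursions used in this appendix (Lemma B.3 and the proof of Lemma C.1), with an $\ell_\infty$-ball LMO chosen purely for illustration. The function and variable names are ours, and the gradient oracle is a toy stand-in rather than the paper's implementation.

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # LMO over the l_inf ball: argmin_{||v||_inf <= radius} <g, v> = -radius * sign(g).
    return -radius * np.sign(g)

def lmo_igt(grad_fn, w0, T, eta, beta1, beta2, lam=0.0, radius=1.0, seed=0):
    """Sketch of the LMO-IGT iteration as used in this appendix: one stochastic
    gradient per step, evaluated at the transported point x_t, with the x- and
    w-sequences driven by step sizes eta/(1-beta2) and eta, respectively."""
    rng = np.random.default_rng(seed)
    eta1 = eta / (1.0 - beta2)   # transported step size (Appendix C parameter choice)
    w = w0.astype(float).copy()
    x = w.copy()                 # x_0 = w_0
    m = grad_fn(x, rng)          # m_{-1}; the analysis reuses this sample at t = 0, we redraw
    for _ in range(T):
        grad_x = grad_fn(x, rng)                  # stochastic gradient at the transported point
        g = beta1 * m + (1.0 - beta1) * grad_x    # query g_t
        m = beta2 * m + (1.0 - beta2) * grad_x    # momentum buffer m_t
        v = lmo_linf(g, radius)                   # v_t = LMO_C(g_t)
        x = (1.0 - lam * eta1) * w + eta1 * v     # transported point for the next gradient
        w = (1.0 - lam * eta) * w + eta * v       # iterate w_{t+1}
    return w

# Toy usage on a noisy quadratic F(w) = 0.5 * ||w||^2 (illustrative only).
grad_fn = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
print(np.linalg.norm(lmo_igt(grad_fn, np.ones(10), T=500, eta=0.01, beta1=0.9, beta2=0.95)))
```

The stochastic-LMO baseline of Appendix D differs only in that both step sizes coincide and the gradient is taken at the current iterate $w_t$.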
###### Lemma C.1.
Suppose Assumptions 1, 2, and 4 hold. For every $t=0,1,\ldots,T-1$, let
$$\beta_{1,t}=\beta_1,\qquad\beta_{2,t}=\beta_2,\qquad\eta_{2,t}=\eta,\qquad\eta_{1,t}=\frac{\eta}{1-\beta_2}.$$
Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|\le\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}+\frac{1}{T}\right)+\frac{\beta_2-\beta_1}{1-\beta_2}LR\eta+\left(\frac{\beta_1}{1-\beta_2}+\frac{\beta_2^2}{(1-\beta_2)^2}\right)\rho R^2\eta^2.\tag{A7}$$
###### Proof.
Let $\mathcal{F}_t=\sigma(\xi_0,\dots,\xi_t)$ be the natural filtration, and define
$$\epsilon_t=\nabla f(x_t;\xi_t)-\nabla F(x_t).$$
Since $x_t$ is $\mathcal{F}_{t-1}$-measurable, Assumption 2 implies
$$\mathbb{E}[\epsilon_t\mid\mathcal{F}_{t-1}]=0,\qquad\mathbb{E}[\|\epsilon_t\|^2\mid\mathcal{F}_{t-1}]\le\sigma^2.$$
We track
$$\tilde{\epsilon}_t=m_t-\nabla F(w_t),\qquad\hat{\epsilon}_t=g_t-\nabla F(w_t).$$
For $t\ge 1$, define
$$A_t=\nabla^2F(w_t)(w_t-w_{t-1}),\qquad Z_t^w=Z(w_{t-1},w_t),\qquad Z_t^x=Z(x_t,w_t),$$
where $Z(\cdot,\cdot)$ is from Lemma B.4. By Lemma B.3,
$$x_t=w_t+C(w_t-w_{t-1}),\qquad C=\frac{\eta_1-\eta}{\eta}.$$
Hence Lemma B.4 gives
$$\nabla F(x_t)=\nabla F(w_t)+CA_t+Z_t^x,\qquad\nabla F(w_{t-1})=\nabla F(w_t)-A_t+Z_t^w.$$
Substituting into
$$m_t=\beta_2m_{t-1}+(1-\beta_2)\bigl(\nabla F(x_t)+\epsilon_t\bigr)$$
yields
$$\tilde{\epsilon}_t=\beta_2\tilde{\epsilon}_{t-1}+\bigl(C(1-\beta_2)-\beta_2\bigr)A_t+\beta_2Z_t^w+(1-\beta_2)(\epsilon_t+Z_t^x).\tag{A8}$$
Similarly,
$$g_t=\beta_1m_{t-1}+(1-\beta_1)\bigl(\nabla F(x_t)+\epsilon_t\bigr)$$
gives
$$\hat{\epsilon}_t=\beta_1\tilde{\epsilon}_{t-1}+\bigl(C(1-\beta_1)-\beta_1\bigr)A_t+\beta_1Z_t^w+(1-\beta_1)(\epsilon_t+Z_t^x).\tag{A9}$$
Now choose
$$\eta_1=\frac{\eta}{1-\beta_2},\qquad C=\frac{\eta_1-\eta}{\eta}=\frac{\beta_2}{1-\beta_2}.$$
Then
$$C(1-\beta_2)-\beta_2=0,$$
so the first-order drift disappears from (A8):
$$\tilde{\epsilon}_t=\beta_2\tilde{\epsilon}_{t-1}+\beta_2Z_t^w+(1-\beta_2)(\epsilon_t+Z_t^x).\tag{A10}$$
Also
$$C(1-\beta_1)-\beta_1=\frac{\beta_2-\beta_1}{1-\beta_2},$$
hence
$$\hat{\epsilon}_t=\beta_1\tilde{\epsilon}_{t-1}+\frac{\beta_2-\beta_1}{1-\beta_2}A_t+\beta_1Z_t^w+(1-\beta_1)(\epsilon_t+Z_t^x).\tag{A11}$$
Because $x_0=w_0$ and $m_{-1}=\nabla f(w_0;\xi_0)$, we have
$$\tilde{\epsilon}_0=\epsilon_0.$$
Unrolling (A10), for $t\ge 1$,
$$\tilde{\epsilon}_{t-1}=\beta_2^{t-1}\epsilon_0+(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\epsilon_s+\sum_{s=1}^{t-1}\beta_2^{t-s}Z_s^w+(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}Z_s^x.$$
Substituting into (A11), we decompose
$$\hat{\epsilon}_t=N_t+D_t,$$
where
$$N_t=\beta_1\beta_2^{t-1}\epsilon_0+\beta_1(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\epsilon_s+(1-\beta_1)\epsilon_t,$$
and
$$D_t=\frac{\beta_2-\beta_1}{1-\beta_2}A_t+\beta_1\sum_{s=1}^{t}\beta_2^{t-s}Z_s^w+\beta_1(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}Z_s^x+(1-\beta_1)Z_t^x.$$
For $t=0$, we set
$$N_0=\hat{\epsilon}_0=\epsilon_0,\qquad D_0=0.$$
By Assumption 1 and Lemma B.3,
$$\|A_t\|\le\|\nabla^2F(w_t)\|_{\mathrm{op}}\|w_t-w_{t-1}\|\le L\eta R.$$
By Lemmas B.3 and B.4,
$$\|Z_t^w\|\le\rho\|w_t-w_{t-1}\|^2\le\rho\eta^2R^2,$$
and
$$\|Z_t^x\|\le\rho\|x_t-w_t\|^2\le\rho(\eta_1-\eta)^2R^2=\rho\left(\frac{\beta_2}{1-\beta_2}\right)^2\eta^2R^2.$$
Using these bounds together with
$$\sum_{s=1}^{t}\beta_2^{t-s}\le\frac{1}{1-\beta_2},\qquad\beta_1(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}+(1-\beta_1)\le 1,$$
we obtain
$$\|D_t\|\le\frac{\beta_2-\beta_1}{1-\beta_2}L\eta R+\frac{\beta_1}{1-\beta_2}\rho\eta^2R^2+\rho(\eta_1-\eta)^2R^2=\frac{\beta_2-\beta_1}{1-\beta_2}L\eta R+\left(\frac{\beta_1}{1-\beta_2}+\frac{\beta_2^2}{(1-\beta_2)^2}\right)\rho\eta^2R^2.\tag{A12}$$
Lemma B.5 gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|N_t\|\le\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}+\frac{1}{T}\right).\tag{A13}$$
Finally,
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|\le\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|N_t\|+\sup_{0\le t\le T-1}\|D_t\|.$$
Combining (A12) and (A13) proves (A7). ∎
Substituting Lemma C.1 into (A6) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+R\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}+\frac{1}{T}\right)+LR^2\eta\left(\frac{\beta_2-\beta_1}{1-\beta_2}+\frac{1}{2}\right)+\rho R^3\eta^2\left(\frac{\beta_1}{1-\beta_2}+\frac{\beta_2^2}{(1-\beta_2)^2}\right),\tag{A14}$$
which proves Theorem 3.5.
### C.1 Proof of Corollary 3.6
###### Proof.
Choose
$$\eta=\frac{1}{RT^{5/7}},\qquad\beta_2=1-\frac{1}{T^{4/7}},\qquad 1-\beta_1\in\left[\frac{1}{T^{4/7}},\frac{1}{T^{2/7}}\right].$$
Then
$$\frac{\Delta_F}{T\eta}=\frac{R\Delta_F}{T^{2/7}},\qquad 1-\beta_1\le\frac{1}{T^{2/7}},\qquad\sqrt{1-\beta_2}=\frac{1}{T^{2/7}},\qquad\frac{1}{T(1-\beta_2)}=\frac{1}{T^{3/7}},\qquad\frac{1}{T}\le\frac{1}{T^{2/7}}.$$
Also,
$$\frac{\beta_2-\beta_1}{1-\beta_2}=\frac{(1-\beta_1)-(1-\beta_2)}{1-\beta_2}\le\frac{1/T^{2/7}}{1/T^{4/7}}=T^{2/7},$$
so
$$LR^2\eta\left(\frac{\beta_2-\beta_1}{1-\beta_2}+\frac{1}{2}\right)=O\!\left(\frac{RL}{T^{3/7}}\right).$$
Finally,
$$\frac{\beta_1}{1-\beta_2}\le T^{4/7},\qquad\frac{\beta_2^2}{(1-\beta_2)^2}\le T^{8/7},\qquad R^3\eta^2=\frac{R}{T^{10/7}},$$
and hence
$$\rho R^3\eta^2\left(\frac{\beta_1}{1-\beta_2}+\frac{\beta_2^2}{(1-\beta_2)^2}\right)\le\rho R\left(\frac{1}{T^{6/7}}+\frac{1}{T^{2/7}}\right).$$
Combining the above bounds yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le O\!\left(\frac{R(\Delta_F+\sigma+\rho)}{T^{2/7}}+\frac{RL}{T^{3/7}}\right)=O(T^{-2/7}).$$
∎
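For completeness, the $T^{-2/7}$ rate translates into the $\mathcal{O}(\varepsilon^{-3.5})$ iteration complexity stated for LMO-IGT; a short worked conversion, with $C$ denoting the constant absorbed in the $O(\cdot)$ above:

```latex
\[
  C\,T^{-2/7}\le\varepsilon
  \quad\Longleftrightarrow\quad
  T\ge\Bigl(\frac{C}{\varepsilon}\Bigr)^{7/2}
  =\mathcal{O}\bigl(\varepsilon^{-3.5}\bigr).
\]
```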
### C.2 Proof of Corollary 3.7
###### Proof.
When $\sigma=0$, applying (A14) with
$$\eta=\frac{1}{RT^{1/2}},\qquad\beta_1=\beta_2=1-\frac{1}{T^{1/4}}$$
gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+\frac{L}{2}R^2\eta+\rho R^3\eta^2\left(\frac{\beta_1}{1-\beta_2}+\frac{\beta_2^2}{(1-\beta_2)^2}\right).$$
Now
$$\frac{\Delta_F}{T\eta}=\frac{R\Delta_F}{T^{1/2}},\qquad\frac{L}{2}R^2\eta=\frac{RL}{2T^{1/2}}.$$
Also,
$$\frac{\beta_1}{1-\beta_2}\le T^{1/4},\qquad\frac{\beta_2^2}{(1-\beta_2)^2}\le T^{1/2},\qquad R^3\eta^2=\frac{R}{T},$$
hence
$$\rho R^3\eta^2\left(\frac{\beta_1}{1-\beta_2}+\frac{\beta_2^2}{(1-\beta_2)^2}\right)\le\rho R\left(\frac{1}{T^{3/4}}+\frac{1}{T^{1/2}}\right).$$
Therefore
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le O\!\left(\frac{R(\Delta_F+L+\rho)}{T^{1/2}}\right)=O(T^{-1/2}),$$
which proves Corollary 3.7. ∎
## Appendix D Theoretical Analysis of the Stochastic LMO Class
Consider the following algorithm.

Algorithm 3: Stochastic LMO Optimization
1: initial point $w_0$, momentum buffer $m_{-1}=\nabla f(w_0;\xi_0)$, parameters $\{\beta_{1,t}\}_{t\ge 0}$, $\{\beta_{2,t}\}_{t\ge 0}$, $\lambda\ge 0$, and step sizes $\{\eta_t\}_{t\ge 0}$
2: for $t=0,\ldots,T-1$ do
3: $\quad g_t\leftarrow\beta_{1,t}m_{t-1}+(1-\beta_{1,t})\nabla f(w_t;\xi_t)$
4: $\quad m_t\leftarrow\beta_{2,t}m_{t-1}+(1-\beta_{2,t})\nabla f(w_t;\xi_t)$
5: $\quad v_t\leftarrow\operatorname{LMO}_{\mathcal{C}}(g_t)$
6: $\quad w_{t+1}\leftarrow(1-\lambda\eta_t)w_t+\eta_t v_t$
7: end for
Algorithm 3 is the stochastic-LMO specialization of Algorithm 2. In particular, it corresponds to setting
$$\alpha_{1,t}=0,\qquad\alpha_{2,t}=0,\qquad\eta_{1,t}=\eta_{2,t}=\eta_t$$
for every $t=0,1,\ldots,T-1$.
We analyze Algorithm 3 under the parameter choice
$$\beta_{1,t}=\beta_1,\qquad\beta_{2,t}=\beta_2,\qquad\eta_t=\eta,\qquad t=0,1,\ldots,T-1.$$
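To make the loop concrete, here is a minimal executable sketch of Algorithm 3 with constant parameters, using an $\ell_\infty$-ball LMO (a sign-based update in the spirit of Lion) as one illustrative choice of $\mathcal{C}$; the gradient oracle, names, and hyperparameters are ours, not the paper's code.

```python
import numpy as np

def lmo_linf(g, radius=1.0):
    # LMO over the l_inf ball: argmin_{||v||_inf <= radius} <g, v> = -radius * sign(g).
    return -radius * np.sign(g)

def stochastic_lmo(grad_fn, w0, T, eta, beta1, beta2, lam=0.0, radius=1.0, seed=0):
    # Direct transcription of Algorithm 3 with constant beta_1, beta_2, eta.
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    m = grad_fn(w, rng)                          # m_{-1} = grad f(w_0; xi_0)
    for _ in range(T):
        grad_w = grad_fn(w, rng)                 # single stochastic gradient at w_t
        g = beta1 * m + (1.0 - beta1) * grad_w   # step 3: query g_t
        m = beta2 * m + (1.0 - beta2) * grad_w   # step 4: momentum buffer m_t
        v = lmo_linf(g, radius)                  # step 5: v_t = LMO_C(g_t)
        w = (1.0 - lam * eta) * w + eta * v      # step 6: w_{t+1}
    return w

# Toy usage on a noisy quadratic F(w) = 0.5 * ||w||^2 (illustrative only).
grad_fn = lambda w, rng: w + 0.1 * rng.standard_normal(w.shape)
print(np.linalg.norm(stochastic_lmo(grad_fn, np.ones(10), T=500, eta=0.01, beta1=0.9, beta2=0.95)))
```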
###### Theorem D.1 (Stochastic LMO method).
Suppose Assumptions 1–2 hold. Consider Algorithm 3 and assume that, for every $t=0,1,\ldots,T-1$,
$$\beta_{1,t}=\beta_1,\qquad\beta_{2,t}=\beta_2,\qquad\eta_t=\eta,$$
where $0\le\beta_1\le\beta_2<1$ and $0\le\lambda\eta\le 1$. When $\lambda>0$, assume $w_0\in\mathcal{P}=\lambda^{-1}\mathcal{C}$. Let $F^*$ be any finite lower bound on $F$ over a region containing the iterates, and define
$$\Delta_F=F(w_0)-F^*.$$
Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+R\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}+\frac{1}{T}\right)+LR^2\eta\left(\frac{\beta_1}{1-\beta_2}+\frac{1}{2}\right).\tag{A15}$$
###### Corollary D.2 (Equal momentum parameters).
Under the assumptions of Theorem D.1, suppose
$$\beta_1=\beta_2=\beta.$$
Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+R\sigma\left((1+\beta)\sqrt{1-\beta}+\frac{\beta}{T(1-\beta)}+\frac{1}{T}\right)+LR^2\eta\left(\frac{\beta}{1-\beta}+\frac{1}{2}\right),\tag{A16}$$
where we used $1-\beta\le\sqrt{1-\beta}$.
###### Corollary D.3 (Stochastic regime for stochastic LMO).
Under the assumptions of Corollary D.2, suppose $\sigma>0$. For sufficiently large $T$, choose
$$\eta=\frac{1}{RT^{3/4}},\qquad\beta=1-\frac{1}{T^{1/2}}.$$
Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le O\!\left(\frac{R(\Delta_F+\sigma+L)}{T^{1/4}}\right)=O(T^{-1/4}).$$
###### Corollary D.4 (Deterministic regime for stochastic LMO).
Under the assumptions of Corollary D.2, suppose $\sigma=0$. Let $\beta\in[0,1)$ be any fixed constant independent of $T$, and choose
$$\eta=\frac{1}{RT^{1/2}}.$$
Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le O\!\left(\frac{R\bigl(\Delta_F+L/(1-\beta)\bigr)}{T^{1/2}}\right)=O(T^{-1/2}).$$
Applying Lemma B.2 with $\eta_t=\eta$ for every $t=0,1,\ldots,T-1$ yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+\frac{R}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|+\frac{L}{2}R^2\eta,\tag{A17}$$
where
$$\hat{\epsilon}_t=g_t-\nabla F(w_t).$$
###### Lemma D.1.
Suppose Assumptions 1–2 hold. Assume that, for every $t=0,1,\ldots,T-1$,
$$\beta_{1,t}=\beta_1,\qquad\beta_{2,t}=\beta_2,\qquad\eta_t=\eta.$$
Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|\le\frac{\sigma}{T}+\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}\right)+\frac{\beta_1}{1-\beta_2}LR\eta.\tag{A18}$$
Substituting Lemma D.1 into (A17) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[\Psi_{\mathcal{C},\lambda}(w_t)\big]\le\frac{\Delta_F}{T\eta}+R\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}+\frac{1}{T}\right)+LR^2\eta\left(\frac{\beta_1}{1-\beta_2}+\frac{1}{2}\right).\tag{A19}$$
Since (A19) is exactly (A15), this proves Theorem D.1.
### D.1 Proof of Lemma D.1
###### Proof.
Let $\mathcal{F}_t=\sigma(\xi_0,\dots,\xi_t)$ be the natural filtration, and define
$$\epsilon_t=\nabla f(w_t;\xi_t)-\nabla F(w_t).$$
Since $w_t$ is $\mathcal{F}_{t-1}$-measurable, Assumption 2 implies
$$\mathbb{E}[\epsilon_t\mid\mathcal{F}_{t-1}]=0,\qquad\mathbb{E}[\|\epsilon_t\|^2\mid\mathcal{F}_{t-1}]\le\sigma^2.$$
Define
$$\tilde{\epsilon}_t=m_t-\nabla F(w_t),\qquad\hat{\epsilon}_t=g_t-\nabla F(w_t).$$
For $t\ge 1$, let
$$\delta_t=\nabla F(w_{t-1})-\nabla F(w_t).$$
By Assumption 1 and Lemma B.1,
$$\|\delta_t\|\le L\|w_t-w_{t-1}\|\le L\eta R.$$
From
$$m_t=\beta_2m_{t-1}+(1-\beta_2)(\nabla F(w_t)+\epsilon_t),$$
we obtain
$$\tilde{\epsilon}_t=\beta_2\tilde{\epsilon}_{t-1}+\beta_2\delta_t+(1-\beta_2)\epsilon_t.\tag{A20}$$
Similarly, from
$$g_t=\beta_1m_{t-1}+(1-\beta_1)(\nabla F(w_t)+\epsilon_t),$$
we get
$$\hat{\epsilon}_t=\beta_1\tilde{\epsilon}_{t-1}+\beta_1\delta_t+(1-\beta_1)\epsilon_t.\tag{A21}$$
Because $m_{-1}=\nabla f(w_0;\xi_0)$, the first update gives
$$m_0=\nabla f(w_0;\xi_0),\qquad\tilde{\epsilon}_0=\epsilon_0.$$
Unrolling (A20), for $t\ge 1$,
$$\tilde{\epsilon}_{t-1}=\beta_2^{t-1}\epsilon_0+\sum_{s=1}^{t-1}\beta_2^{t-s}\delta_s+(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\epsilon_s.$$
Substituting this into (A21), we decompose
$$\hat{\epsilon}_t=N_t+B_t,$$
where
$$N_t=\beta_1\beta_2^{t-1}\epsilon_0+\beta_1(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\epsilon_s+(1-\beta_1)\epsilon_t,$$
and
$$B_t=\beta_1\sum_{s=1}^{t}\beta_2^{t-s}\delta_s.$$
For $t=0$, we set
$$N_0=\hat{\epsilon}_0=\epsilon_0,\qquad B_0=0.$$
Using $\|\delta_s\|\le L\eta R$,
$$\|B_t\|\le\beta_1\sum_{s=1}^{t}\beta_2^{t-s}\|\delta_s\|\le\frac{\beta_1}{1-\beta_2}L\eta R.$$
Lemma B.5 gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|N_t\|\le\frac{\sigma}{T}+\sigma\left((1-\beta_1)+\beta_1\sqrt{1-\beta_2}+\frac{\beta_1}{T(1-\beta_2)}\right).$$
Therefore
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\|\le\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|N_t\|+\sup_{0\le t\le T-1}\|B_t\|,$$
which proves (A18). ∎
### D.2 Proof of Corollary [D.3](https://arxiv.org/html/2605.05577#A4.Thmthm3)
###### Proof.
Applying Corollary [D.2](https://arxiv.org/html/2605.05577#A4.Thmthm2) with
$$\eta = \frac{1}{RT^{3/4}}, \qquad \beta = 1 - \frac{1}{T^{1/2}},$$
gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le \frac{\Delta_F}{T\eta} + R\sigma\left((1+\beta)\sqrt{1-\beta} + \frac{\beta}{T(1-\beta)} + \frac{1}{T}\right) + LR^2\eta\left(\frac{\beta}{1-\beta} + \frac{1}{2}\right).$$
Now
$$\frac{\Delta_F}{T\eta} = \frac{R\Delta_F}{T^{1/4}}, \qquad (1+\beta)\sqrt{1-\beta} \le \frac{2}{T^{1/4}}, \qquad \frac{\beta}{T(1-\beta)} \le \frac{1}{T^{1/2}}, \qquad \frac{1}{T} \le \frac{1}{T^{1/4}}.$$
Also,
$$LR^2\eta\left(\frac{\beta}{1-\beta} + \frac{1}{2}\right) \le LR^2\cdot\frac{1}{RT^{3/4}}\left(T^{1/2} + \frac{1}{2}\right) = O\!\left(\frac{RL}{T^{1/4}}\right).$$
Combining the bounds yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le O\!\left(\frac{R(\Delta_F + \sigma + L)}{T^{1/4}}\right),$$
which proves Corollary [D.3](https://arxiv.org/html/2605.05577#A4.Thmthm3). ∎
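For a concrete horizon, the elementary term-wise inequalities used above can be checked numerically; the sketch below is purely illustrative and uses an arbitrary $T$.

```python
# Check the three term-wise bounds from the proof for a concrete T.
T = 10_000
beta = 1 - T ** -0.5                       # beta = 1 - 1/sqrt(T)
assert (1 + beta) * (1 - beta) ** 0.5 <= 2 / T ** 0.25
assert beta / (T * (1 - beta)) <= 1 / T ** 0.5
assert 1 / T <= 1 / T ** 0.25
```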
### D.3 Proof of Corollary [D.4](https://arxiv.org/html/2605.05577#A4.Thmthm4)
###### Proof.
When $\sigma = 0$, Corollary [D.2](https://arxiv.org/html/2605.05577#A4.Thmthm2) becomes
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le \frac{\Delta_F}{T\eta} + LR^2\eta\left(\frac{\beta}{1-\beta} + \frac{1}{2}\right).$$
Choosing
$$\eta = \frac{1}{RT^{1/2}}$$
gives
$$\frac{\Delta_F}{T\eta} = \frac{R\Delta_F}{T^{1/2}}, \qquad LR^2\eta\left(\frac{\beta}{1-\beta} + \frac{1}{2}\right) = \frac{RL}{T^{1/2}}\left(\frac{\beta}{1-\beta} + \frac{1}{2}\right).$$
Since $\beta \in [0,1)$ is fixed independently of $T$, this yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le O\!\left(\frac{R\bigl(\Delta_F + L/(1-\beta)\bigr)}{T^{1/2}}\right),$$
which proves Corollary [D.4](https://arxiv.org/html/2605.05577#A4.Thmthm4). ∎
## Appendix E Theoretical Analysis of the Variance-Reduced Stochastic LMO Class
We analyze the LMO-VR specialization of Algorithm [2](https://arxiv.org/html/2605.05577#alg2), obtained by setting $\eta_{1,t} = \eta_{2,t} = \eta_t$ for every $t = 0,1,\ldots,T-1$. Since $x_0 = w_0$, this implies $x_t = w_t$ for every $t = 0,1,\ldots,T$. For notational convenience, we write $w_{-1} = w_0$.
For every $t = 0,1,\ldots,T-1$, let
$$\alpha_{1,t} = \alpha_1, \qquad \alpha_{2,t} = \alpha_2, \qquad \beta_{1,t} = \beta_1, \qquad \beta_{2,t} = \beta_2, \qquad \eta_t = \eta.$$
In this specialization, the query and momentum recursions are
$$g_t = \beta_1 m_{t-1} + (1-\beta_1)\nabla f(w_t;\xi_t) + \alpha_1\bigl(\nabla f(w_t;\xi_t) - \nabla f(w_{t-1};\xi_t)\bigr),$$
and
$$m_t = \beta_2 m_{t-1} + (1-\beta_2)\nabla f(w_t;\xi_t) + \alpha_2\bigl(\nabla f(w_t;\xi_t) - \nabla f(w_{t-1};\xi_t)\bigr).$$
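For concreteness, a minimal sketch of one such query/momentum update is given below; `grad_fn` is a hypothetical closure that evaluates the stochastic gradient at a given point for the current sample $\xi_t$ (the same minibatch on both calls). This is only an illustration of the recursions, not the implementation used in the experiments.

```python
import torch

def lmo_vr_step(w, w_prev, m_prev, grad_fn, beta1, beta2, alpha1, alpha2):
    """One LMO-VR query/momentum update following the recursions above."""
    g_cur = grad_fn(w)        # gradient of f(.; xi_t) at w_t
    g_old = grad_fn(w_prev)   # gradient of f(.; xi_t) at w_{t-1}, same sample xi_t
    correction = g_cur - g_old
    query = beta1 * m_prev + (1 - beta1) * g_cur + alpha1 * correction  # g_t in the text
    m_t = beta2 * m_prev + (1 - beta2) * g_cur + alpha2 * correction    # momentum m_t
    return query, m_t

# Toy usage: a quadratic loss with additive noise standing in for a minibatch gradient.
w, w_prev = torch.ones(4), torch.zeros(4)
m_prev = torch.zeros(4)
noise = torch.randn(4)                   # fixed noise plays the role of xi_t
grad_fn = lambda x: 2.0 * x + noise      # same sample used for both evaluations
g_t, m_t = lmo_vr_step(w, w_prev, m_prev, grad_fn, 0.9, 0.95, 0.5, 0.95)
```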
Applying Lemma [B.2](https://arxiv.org/html/2605.05577#A2.Thmlemma2) gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le \frac{\Delta_F}{T\eta} + \frac{R}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\| + \frac{L}{2}R^2\eta, \tag{A22}$$
where
$$\hat{\epsilon}_t = g_t - \nabla F(w_t).$$
###### Lemma E.1 (Tracking-error bound).
Suppose Assumptions [1](https://arxiv.org/html/2605.05577#Thmasmp1)–[3](https://arxiv.org/html/2605.05577#Thmasmp3) hold. Then
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|\hat{\epsilon}_t\| \le \frac{\sigma}{T} + \sigma\left((1-\beta_1) + \beta_1\sqrt{1-\beta_2} + \frac{\beta_1}{T(1-\beta_2)}\right) + LR\eta\left(|\beta_1-\alpha_1| + \frac{\beta_1|\beta_2-\alpha_2|}{1-\beta_2} + |\alpha_1| + \frac{\beta_1|\alpha_2|}{\sqrt{1-\beta_2}}\right). \tag{A23}$$
###### Proof.
Let
$$\mathcal{F}_t = \sigma(\xi_0,\dots,\xi_t)$$
be the natural filtration, and define
$$h_t = \nabla f(w_t;\xi_t), \qquad \epsilon_t = h_t - \nabla F(w_t).$$
For $t \ge 1$, define
$$h_t^- = \nabla f(w_{t-1};\xi_t), \qquad \delta_t = \nabla F(w_{t-1}) - \nabla F(w_t),$$
and
$$\zeta_t = \bigl(h_t - h_t^-\bigr) - \bigl(\nabla F(w_t) - \nabla F(w_{t-1})\bigr).$$
By Assumption [2](https://arxiv.org/html/2605.05577#Thmasmp2),
$$\mathbb{E}[\epsilon_t \mid \mathcal{F}_{t-1}] = 0, \qquad \mathbb{E}[\|\epsilon_t\|^2 \mid \mathcal{F}_{t-1}] \le \sigma^2.$$
By Lemma [B.6](https://arxiv.org/html/2605.05577#A2.Thmlemma6),
$$\mathbb{E}[\zeta_t \mid \mathcal{F}_{t-1}] = 0, \qquad \mathbb{E}[\|\zeta_t\|^2 \mid \mathcal{F}_{t-1}] \le L^2\|w_t - w_{t-1}\|^2.$$
Moreover, by Assumption [1](https://arxiv.org/html/2605.05577#Thmasmp1) and Lemma [B.1](https://arxiv.org/html/2605.05577#A2.Thmlemma1),
$$\|\delta_t\| \le L\|w_t - w_{t-1}\| \le L\eta R,$$
and
$$\mathbb{E}[\|\zeta_t\|^2 \mid \mathcal{F}_{t-1}] \le L^2\eta^2 R^2.$$
Since $w_{-1} = w_0$, $m_{-1} = \nabla f(w_0;\xi_0)$, and
$$\nabla f(w_0;\xi_0) - \nabla f(w_{-1};\xi_0) = 0,$$
the first step satisfies
$$m_0 = \nabla f(w_0;\xi_0), \qquad g_0 = \nabla f(w_0;\xi_0).$$
Therefore
$$\tilde{\epsilon}_0 = \hat{\epsilon}_0 = \epsilon_0, \qquad \tilde{\epsilon}_t = m_t - \nabla F(w_t).$$
For $t \ge 1$, using
$$h_t - h_t^- = \zeta_t - \delta_t,$$
the recursion for $m_t$ gives
$$\tilde{\epsilon}_t = \beta_2\tilde{\epsilon}_{t-1} + (\beta_2-\alpha_2)\delta_t + (1-\beta_2)\epsilon_t + \alpha_2\zeta_t. \tag{A24}$$
Similarly, the recursion for $g_t$ gives
$$\hat{\epsilon}_t = \beta_1\tilde{\epsilon}_{t-1} + (\beta_1-\alpha_1)\delta_t + (1-\beta_1)\epsilon_t + \alpha_1\zeta_t. \tag{A25}$$
Unrolling ([A24](https://arxiv.org/html/2605.05577#A5.E24)) and substituting into ([A25](https://arxiv.org/html/2605.05577#A5.E25)), we decompose
$$\hat{\epsilon}_t = N_t + Z_t + B_t,$$
where
$$N_t = \beta_1\beta_2^{t-1}\epsilon_0 + \beta_1(1-\beta_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\epsilon_s + (1-\beta_1)\epsilon_t,$$
$$Z_t = \beta_1\alpha_2\sum_{s=1}^{t-1}\beta_2^{t-1-s}\zeta_s + \alpha_1\zeta_t,$$
$$B_t = \beta_1(\beta_2-\alpha_2)\sum_{s=1}^{t-1}\beta_2^{t-1-s}\delta_s + (\beta_1-\alpha_1)\delta_t.$$
For $t = 0$, we set
$$N_0 = \hat{\epsilon}_0 = \epsilon_0, \qquad Z_0 = 0, \qquad B_0 = 0.$$
Lemma [B.5](https://arxiv.org/html/2605.05577#A2.Thmlemma5) gives
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\|N_t\| \le \frac{\sigma}{T} + \sigma\left((1-\beta_1) + \beta_1\sqrt{1-\beta_2} + \frac{\beta_1}{T(1-\beta_2)}\right). \tag{A26}$$
Using $\|\delta_s\| \le L\eta R$,
$$\|B_t\| \le \beta_1|\beta_2-\alpha_2|\sum_{s=1}^{t-1}\beta_2^{t-1-s}\|\delta_s\| + |\beta_1-\alpha_1|\,\|\delta_t\| \le LR\eta\left(\frac{\beta_1|\beta_2-\alpha_2|}{1-\beta_2} + |\beta_1-\alpha_1|\right). \tag{A27}$$
Since $\{\zeta_t\}$ is a martingale-difference sequence and
$$\mathbb{E}[\|\zeta_t\|^2 \mid \mathcal{F}_{t-1}] \le L^2\eta^2 R^2,$$
the cross terms vanish and
$$\mathbb{E}\|Z_t\|^2 \le \beta_1^2\alpha_2^2\sum_{s=1}^{t-1}\beta_2^{2t-2-2s}L^2\eta^2R^2 + \alpha_1^2L^2\eta^2R^2 \le \left(\frac{\beta_1^2\alpha_2^2}{1-\beta_2} + \alpha_1^2\right)L^2\eta^2R^2.$$
Hence, by Jensen's inequality,
$$\mathbb{E}\|Z_t\| \le LR\eta\left(\frac{\beta_1|\alpha_2|}{\sqrt{1-\beta_2}} + |\alpha_1|\right). \tag{A28}$$
Combining ([A26](https://arxiv.org/html/2605.05577#A5.E26)), ([A27](https://arxiv.org/html/2605.05577#A5.E27)), and ([A28](https://arxiv.org/html/2605.05577#A5.E28)), and using
$$\|\hat{\epsilon}_t\| \le \|N_t\| + \|Z_t\| + \|B_t\|,$$
proves ([A23](https://arxiv.org/html/2605.05577#A5.E23)). ∎
Substituting Lemma [E.1](https://arxiv.org/html/2605.05577#A5.Thmlemma1) into ([A22](https://arxiv.org/html/2605.05577#A5.E22)) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le \frac{\Delta_F}{T\eta} + R\sigma\left((1-\beta_1) + \beta_1\sqrt{1-\beta_2} + \frac{\beta_1}{T(1-\beta_2)} + \frac{1}{T}\right) + LR^2\eta\left(|\beta_1-\alpha_1| + \frac{\beta_1|\beta_2-\alpha_2|}{1-\beta_2} + |\alpha_1| + \frac{\beta_1|\alpha_2|}{\sqrt{1-\beta_2}} + \frac{1}{2}\right), \tag{A29}$$
which proves Theorem [3.3](https://arxiv.org/html/2605.05577#S3.Thmthm3).
### E.1 Proof of Corollary [3.4](https://arxiv.org/html/2605.05577#S3.Thmthm4)
###### Proof.
Under the assumptions of Theorem [3.3](https://arxiv.org/html/2605.05577#S3.Thmthm3), suppose
$$0 \le \alpha_1 \le \beta_1, \qquad \alpha_2 = \beta_2.$$
Then the term involving $|\beta_2-\alpha_2|$ vanishes, and
$$|\beta_1-\alpha_1| + |\alpha_1| = (\beta_1-\alpha_1) + \alpha_1 = \beta_1.$$
Therefore Theorem [3.3](https://arxiv.org/html/2605.05577#S3.Thmthm3) reduces to
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le \frac{\Delta_F}{T\eta} + R\sigma\left((1-\beta_1) + \beta_1\sqrt{1-\beta_2} + \frac{\beta_1}{T(1-\beta_2)} + \frac{1}{T}\right) + LR^2\eta\left(\beta_1 + \frac{\beta_1\beta_2}{\sqrt{1-\beta_2}} + \frac{1}{2}\right). \tag{A30}$$
Now choose
$$\eta = \frac{1}{RT^{2/3}}, \qquad \beta_2 = 1 - \frac{1}{T^{2/3}}, \qquad 1-\beta_1 \in \left[\frac{1}{T^{2/3}}, \frac{1}{T^{1/3}}\right].$$
Then
$$\frac{\Delta_F}{T\eta} = \frac{R\Delta_F}{T^{1/3}}, \quad 1-\beta_1 = O(T^{-1/3}), \quad \sqrt{1-\beta_2} = \frac{1}{T^{1/3}}, \quad \frac{1}{T(1-\beta_2)} = \frac{1}{T^{1/3}}, \quad \frac{1}{T} \le \frac{1}{T^{1/3}}.$$
Moreover,
$$LR^2\eta\left(\beta_1 + \frac{\beta_1\beta_2}{\sqrt{1-\beta_2}} + \frac{1}{2}\right) \le LR^2\cdot\frac{1}{RT^{2/3}}\left(1 + T^{1/3} + \frac{1}{2}\right) = O\!\left(\frac{RL}{T^{1/3}}\right).$$
Substituting these bounds into ([A30](https://arxiv.org/html/2605.05577#A5.E30)) yields
$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\bigl[\Psi_{\mathcal{C},\lambda}(w_t)\bigr] \le O\!\left(\frac{R(\Delta_F + \sigma + L)}{T^{1/3}}\right) = O\!\left(T^{-1/3}\right),$$
which proves Corollary [3.4](https://arxiv.org/html/2605.05577#S3.Thmthm4). ∎
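The parameter choices in the corollary translate directly into a schedule as a function of the horizon $T$. The sketch below is illustrative only: it picks $1-\beta_1 = T^{-1/2}$, one admissible point of the stated interval, and `R` denotes the LMO radius; it is the theoretical schedule, not a tuned practical recipe.

```python
def lmo_vr_schedule(T: int, R: float):
    """Step size and momentum parameters from Corollary 3.4 (illustrative)."""
    eta = 1.0 / (R * T ** (2 / 3))
    beta2 = 1.0 - T ** (-2 / 3)
    beta1 = 1.0 - T ** (-1 / 2)  # any 1 - beta1 in [T^{-2/3}, T^{-1/3}] is admissible
    return eta, beta1, beta2
```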
## Appendix F Detailed Experimental Settings
All experiments were conducted in a Python 3.11.9 environment on a machine with a 64-core Intel Xeon Gold 6226R CPU at 2.90 GHz, 512 GB of memory, and an RTX 3090 GPU. All algorithms were implemented in PyTorch 2.4.1. Hyperparameters were tuned separately for each optimizer. We do not use learning-rate decay, so as to isolate the effect of the optimizer update.
##### Implementation of Muon.
The momentum update rule of Muon proposed by Jordan et al. ([2024](https://arxiv.org/html/2605.05577#bib.bib7)) is given by
$$m_t = \beta_2 m_{t-1} + \nabla f(w_t;\xi_t).$$
Compared to Algorithm [2](https://arxiv.org/html/2605.05577#alg2), this is equivalent to scaling the loss function by a factor of $1/(1-\beta_2)$. For a fair comparison with baselines, we adopt the momentum formulation used in Algorithm [2](https://arxiv.org/html/2605.05577#alg2). Empirically, we observe no significant difference in performance between the two implementations.
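The equivalence between the two momentum conventions can be seen with a short numerical check: with the same gradient sequence and zero initialization, the original Muon accumulator equals the Algorithm 2 accumulator up to the constant factor $1/(1-\beta_2)$, so the LMO-normalized update direction is unchanged. The sketch below uses random gradients purely for illustration.

```python
import torch

beta2 = 0.95
grads = [torch.randn(3, 3) for _ in range(100)]

m_muon = torch.zeros(3, 3)  # original rule: m_t = beta2 * m_{t-1} + g_t
m_ema = torch.zeros(3, 3)   # Algorithm 2:   m_t = beta2 * m_{t-1} + (1 - beta2) * g_t
for g in grads:
    m_muon = beta2 * m_muon + g
    m_ema = beta2 * m_ema + (1 - beta2) * g

# The two accumulators differ only by the constant factor 1/(1 - beta2).
print(torch.allclose((1 - beta2) * m_muon, m_ema, atol=1e-6))  # True
```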
### F.1 Multiclass Image Classification
For the image classification experiments in Section [4](https://arxiv.org/html/2605.05577#S4) and Appendix [G.1](https://arxiv.org/html/2605.05577#A7.SS1), we follow the experimental setup of Sfyraki and Wang ([2025](https://arxiv.org/html/2605.05577#bib.bib1)) and train ResNet-18 on CIFAR-10 for 200 epochs. A single training run takes 1 to 2 hours on one RTX 3090 GPU.
We first search over the learning-rate and weight-decay grids in Table [2](https://arxiv.org/html/2605.05577#A6.T2).

Table 2: Hyperparameter search space for multiclass image classification.

| Optimizer | Learning rate | Weight decay |
| --- | --- | --- |
| Muon, Muon-VR | 1e-1, 5e-2, 1e-2, 5e-3, 1e-3, 5e-4, 1e-4 | 0, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1, 5e-1, 1.0 |
| Muon-IGT, NIGT, Lion, Lion-VR, AdamW | 1e-3, 5e-4, 1e-4, 5e-5, 1e-5, 5e-6, 1e-6 | same as above |
| Lion-IGT | 1e-5, 5e-6, 1e-6, 5e-7, 1e-7, 5e-8 | same as above |

For the VR methods, we fix $\alpha_2$ equal to $\beta_2$ and search $\alpha_1$ over {0, 0.1, 0.5, 0.9}, subject to $\alpha_1$ being no larger than $\beta_1$. The final hyperparameter choices used in the experiments are summarized in Table [3](https://arxiv.org/html/2605.05577#A6.T3).
Table 3: Final hyperparameter settings for multiclass image classification.
### F.2 Language Modeling
For the language modeling experiments in Appendix [G.2](https://arxiv.org/html/2605.05577#A7.SS2), we follow the setup of Sfyraki and Wang ([2025](https://arxiv.org/html/2605.05577#bib.bib1)) and train nanoGPT (Karpathy, [2022](https://arxiv.org/html/2605.05577#bib.bib61)) (MIT License) on the Shakespeare dataset (Karpathy, [2015](https://arxiv.org/html/2605.05577#bib.bib62)) from scratch for 5000 optimization steps. A single Shakespeare/nanoGPT run takes approximately 1 hour on one RTX 3090 GPU.
We search over the hyperparameter grid in Table [4](https://arxiv.org/html/2605.05577#A6.T4) and select the final hyperparameter settings in Table [5](https://arxiv.org/html/2605.05577#A6.T5).

Table 4: Hyperparameter search space for language modeling on Shakespeare.

Table 5: Final hyperparameter settings for language modeling on Shakespeare.
### F.3 Large-scale Language Modeling
For the large-scale language modeling experiments in Appendix [G.3](https://arxiv.org/html/2605.05577#A7.SS3), we adopt the model architecture of Pethick et al. ([2025](https://arxiv.org/html/2605.05577#bib.bib39)) and train a 124M-parameter nanoGPT model on the OpenWebText dataset (Gokaslan and Cohen, [2019](https://arxiv.org/html/2605.05577#bib.bib60)) from scratch. Concretely, we follow the modded-nanoGPT implementation, which incorporates several modern architectural choices: rotary positional embeddings instead of learned positional embeddings, RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.05577#bib.bib63)) in place of LayerNorm, and the scaled ReLU activation function instead of GELU. A single OpenWebText/124M nanoGPT run takes approximately 8 hours on one RTX 3090 GPU. Since the model does not exhibit clear signs of overfitting in this setting, we omit dropout and weight decay.
We search over the learning-rate grid in Table [6](https://arxiv.org/html/2605.05577#A6.T6) and select the final hyperparameter settings reported in Table [7](https://arxiv.org/html/2605.05577#A6.T7).

Table 6: Learning-rate search space for large-scale language modeling.

Table 7: Final hyperparameter settings for large-scale language modeling on OpenWebText.
## Appendix G Additional Experimental Results
We report additional results on multiclass image classification, language modeling, and large-scale language modeling. Hyperparameter choices and implementation details are provided in Appendix [F](https://arxiv.org/html/2605.05577#A6).
### G.1 Additional Results on Multiclass Image Classification
For the CIFAR-10 task considered in the main text, we conduct additional experiments. For each optimizer, hyperparameters are selected based on the final test accuracy.
#### G.1.1 Ablation Study
To understand the gains of Muon-IGT, we compare the three methods in Table [8](https://arxiv.org/html/2605.05577#A7.T8), which differ in the use of double momentum and IGT. Figure [3](https://arxiv.org/html/2605.05577#A7.F3) illustrates the contribution of each component. Double momentum improves test accuracy over Muon, while further incorporating IGT yields additional gains in both test accuracy and test loss, resulting in the best overall performance.
Table 8: Ablation variants for Muon-IGT.

Figure 3: Ablation study for Muon-IGT on CIFAR-10 with ResNet-18. (a) Test accuracy over epochs. (b) Test loss over epochs.
#### G.1.2 Robustness to Learning Rate and Weight Decay
To assess robustness to hyperparameter choice, we measure the final test accuracy of Muon-IGT over multiple learning-rate and weight-decay combinations. For weight decay, we use the values listed in Table [2](https://arxiv.org/html/2605.05577#A6.T2). For the learning rate, after identifying the best-performing value in the initial hyperparameter search, we construct a local grid from nearby powers of two, resulting in $\{2^{-15}, 2^{-14}, \ldots, 2^{-7}\}$. We consider two representative values of $\beta_1$, namely 0.9 and 0.95, while keeping the remaining settings fixed. Each entry in Figure [4](https://arxiv.org/html/2605.05577#A7.F4) reports the final test accuracy obtained by a single pair of learning rate and weight decay.
Figure 4: Final test accuracy of Muon-IGT on CIFAR-10 with ResNet-18 as a function of learning rate and weight decay under two choices of $\beta_1$: (a) $\beta_1 = 0.9$; (b) $\beta_1 = 0.95$.

Figure [4](https://arxiv.org/html/2605.05577#A7.F4) shows broad high-accuracy regions for both $\beta_1 = 0.9$ and $\beta_1 = 0.95$, indicating that Muon-IGT performs well over a wide range of learning rates and weight decays. This suggests that the empirical gain of Muon-IGT is not confined to a narrowly tuned hyperparameter configuration, but remains stable across a broad portion of the search space.
### G.2 Language Modeling
We next evaluate the methods on language modeling using the Shakespeare dataset and nanoGPT. For each optimizer, we select hyperparameters based on the final test loss. Compared with the image-classification setting in Appendix [G](https://arxiv.org/html/2605.05577#A7), some baselines are omitted because their performance is substantially worse.
As shown in Figure [5](https://arxiv.org/html/2605.05577#A7.F5), Muon-IGT outperforms the baselines in both training loss and test loss from an early stage of optimization. This indicates that the advantage of Muon-IGT persists even under a smaller momentum parameter.
Figure 5: Language modeling results on Shakespeare with nanoGPT (10M). (a) Training loss over steps. (b) Test loss over steps.
### G.3 Large-scale Language Modeling
To examine whether Muon-IGT scales to larger language-modeling tasks, we further evaluate it on OpenWebText using a 124M-parameter nanoGPT model, rather than the 10M model used in Appendix [G.2](https://arxiv.org/html/2605.05577#A7.SS2).
Figure 6: Large-scale language modeling results on OpenWebText with nanoGPT (124M). (a) Training loss over steps. (b) Test loss over steps.

We observe the same qualitative trend as in the smaller-scale language-modeling experiment. Muon-IGT outperforms the baselines in both training loss and test loss from the early stage of training, suggesting that the method scales favorably to larger language-modeling problems.
## Appendix H Limitations
Our results should be interpreted within the scope of the present theoretical setting. The convergence guarantees for stochastic LMO and LMO-VR rely on standard smoothness and stochastic-noise assumptions, while the improved guarantee for LMO-IGT further leverages second-order smoothness through the transported-gradient construction. Although these assumptions are standard in stochastic optimization, they remain idealized relative to modern large-scale deep learning practice. Accordingly, our theory is best viewed as a principled rate comparison within a canonical analytical regime, rather than a complete characterization of optimizer behavior in practical training.
Another limitation is that the analysis focuses on expected first-order stationarity measured by the regularized support function. We do not establish stronger guarantees such as high-probability bounds, general last-iterate convergence, or a unified treatment of approximate or biased LMOs. Extending the framework in these directions would further clarify the robustness and scope of the RSF perspective.
These limitations do not diminish the main contribution of the paper, namely that transported gradients extend beyond Euclidean normalization to general LMO-based updates under a unified stationarity framework. Rather, they delineate the scope of our theoretical claims and highlight the settings in which our unified perspective is most informative.
## Appendix I Broader Impact
This work studies general-purpose optimization methods for training machine learning models. Its primary impact is methodological: a better understanding of LMO-based optimizers may lead to more compute- and memory-efficient training, and to more consistent comparisons across constrained and unconstrained formulations. If such gains persist at scale, they could reduce experimental cost and broaden access to model training. However, our results do not support strong claims about net energy savings in realistic deployments.
As with most advances in optimization, improved training efficiency may lower the cost of building more capable models, potentially amplifying both beneficial applications and undesirable uses. Since our contribution is foundational, its broader societal impact depends on how downstream models are developed and deployed.
Finally, by emphasizing unified stationarity measures together with explicit consideration of computational overhead, our work encourages more transparent and consistent evaluation of optimization methods.