Optimistic Dual Averaging Unifies Modern Optimizers

arXiv cs.LG 05/13/26, 04:00 AM Papers
Summary
This paper introduces SODA, a generalization of Optimistic Dual Averaging that unifies various modern optimizers like Muon and Lion. It proposes a practical wrapper that improves performance across different scales without requiring additional hyperparameter tuning for weight decay.
arXiv:2605.11172v1 Announce Type: new Abstract: We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
Cached at: 05/13/26, 06:33 AM
# Optimistic Dual Averaging Unifies Modern Optimizers
Source: [https://arxiv.org/html/2605.11172](https://arxiv.org/html/2605.11172)
Thomas Pethick, Independent Researcher, tmpethick@gmail.com; Wanyun Xie, EPFL (LIONS), wanyun.xie@epfl.ch; Roman Machacek, University of Bern, roman.machacek@unibe.ch; Volkan Cevher, EPFL (LIONS), volkan.cevher@epfl.ch

###### Abstract

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.

## 1 Introduction

Deep learning optimization has developed along two largely complementary axes. The first axis is *geometry and adaptation*: designing updates whose implicit norm, preconditioner, or constraint set reflects the structure of modern models, potentially improving scaling in high dimensions. Adam (Kingma and Ba, [2014](https://arxiv.org/html/2605.11172#bib.bib28)) is a central example, coupling gradient averaging with an $\ell_\infty$-type geometry through elementwise normalization. In parallel, *spectral* geometry and stochastic dualization were developed in Stochastic Spectral Descent (SSD) (Carlson et al., [2015a](https://arxiv.org/html/2605.11172#bib.bib4), [2016](https://arxiv.org/html/2605.11172#bib.bib5)), and are now resurfacing in recent methods such as Muon and Scion (Jordan et al., [2024b](https://arxiv.org/html/2605.11172#bib.bib24); Pethick et al., [2025a](https://arxiv.org/html/2605.11172#bib.bib41)).

The second axis is *composition of ingredients*: how gradient feedback and iterates are combined through momentum, averaging, and schedules to obtain a stable and performant training recipe. Algorithms such as Lion (Chen et al., [2023](https://arxiv.org/html/2605.11172#bib.bib8)) and practical recipes such as the schedule-free wrapper (Defazio et al., [2024](https://arxiv.org/html/2605.11172#bib.bib12)) highlight that the assembly of these ingredients can be as important as the ingredients themselves. A striking pattern over the last decade is that many seemingly disparate new optimizers can be explained as different points in this two-dimensional design space.

A useful lens for both axes is the classical dualization framework, where the next iterate is obtained by minimizing a linear surrogate regularized by a geometry\-inducing term:

$$x^{k+1}\in\operatorname*{arg\,min}_{x\in\mathcal{X}}\ \gamma_k\langle d^k,x\rangle+h_k(x). \tag{1}$$

Here, $\gamma_k>0$ is a stepsize schedule, $d^k$ is a (possibly averaged) gradient feedback, and $h_k$ is a (possibly time-varying) regularizer/mirror map that determines the geometry.

For instance, choosing $h_k(x)=\tfrac{1}{2}\|x-x^k\|_2^2$ recovers gradient descent, while non-Euclidean choices of $h_k$ yield normalized and geometry-aware updates that can depend more favorably on the dimensionality of the problem. In parallel, *how* we form $d^k$ (momentum, averaging, extrapolation) and *how* we average iterates (primal averaging / schedule-free) have a strong effect on the properties of the resulting method, including noise robustness, acceleration, and anytime guarantees.
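As a quick check of the gradient descent claim, the Euclidean case can be worked out directly from the template (1). The derivation below assumes the unconstrained setting $\mathcal{X}=\mathbb{R}^n$, which is an assumption made here for illustration:

```latex
\begin{aligned}
x^{k+1} &= \operatorname*{arg\,min}_{x\in\mathbb{R}^n}\ \gamma_k\langle d^k, x\rangle + \tfrac{1}{2}\|x-x^k\|_2^2 \\
0 &= \gamma_k d^k + \bigl(x^{k+1}-x^k\bigr) \qquad \text{(first-order optimality)} \\
x^{k+1} &= x^k - \gamma_k d^k .
\end{aligned}
```

With $d^k=\nabla f(x^k,\xi_k)$ this is the usual stochastic gradient step with stepsize $\gamma_k$.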

One major challenge is how to set the hyperparameters of these optimizers. This becomes particularly important as models grow so large that tuning is prohibitively expensive and we instead seek predictable scaling rules.

One particularly challenging parameter is weight decay. In the context of multi-epoch training, weight decay has a precise characterization, since it constrains the norm of the iterates, thus acting as a regularizer (Xie and Li, [2024](https://arxiv.org/html/2605.11172#bib.bib49); Pethick et al., [2025a](https://arxiv.org/html/2605.11172#bib.bib41)). However, even in single-epoch training, where overfitting is not a concern, weight decay can surprisingly still be beneficial.

Very recently a scaling rule was developed choosing weight decay as $1/d$ with model dimension $d$ (Xiao, [2024](https://arxiv.org/html/2605.11172#bib.bib48); Qiu et al., [2025](https://arxiv.org/html/2605.11172#bib.bib43)). However, since many large-scale experiments follow compute-optimal training regimes where $d$ and the training horizon $n$ are coupled, it is difficult to disentangle whether the effective dependence is on $d$ or on time. In addition, weight decay typically needs to be tuned on a per-optimizer basis, as observed for Lion, which requires a significantly larger weight decay than AdamW (Chen et al., [2023](https://arxiv.org/html/2605.11172#bib.bib8)). This raises the following question:

*Is it possible to ground these two axes of recent progress in classical methods and provide guidance on hyperparameters?*

Table 1: Instances of [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) for $\bar{\lambda}_k=0$ (cf. [Section 3](https://arxiv.org/html/2605.11172#S3)). Footnote 1: Rediscovered as Simplified-AdEMAMix (Morwani et al., [2025](https://arxiv.org/html/2605.11172#bib.bib34)).

In this paper, we contend that the two axes of algorithmic development are not separate threads. To this end, we introduce [SODA](https://arxiv.org/html/2605.11172#S3.Ex6), a generalization of Optimistic Dual Averaging (Rakhlin and Sridharan, [2013](https://arxiv.org/html/2605.11172#bib.bib44)) that explicitly couples *dual processing* (gradient averaging + optimism) and *primal processing* (iterate averaging + primal extrapolation). The resulting framework provides a single template that recovers several widely used optimizers as special cases and makes it straightforward to inject non-Euclidean geometry, such as $\ell_\infty$ or spectral geometry, into modern training recipes.

In particular, when one of the averaging parameters is simplified (our "modernized" regime), SODA yields a practical wrapper around any base optimizer that eliminates weight decay tuning via a theoretically grounded $1/k$ decay.

#### Contributions\.

Our contributions are as follows:

1. Unification via optimism and dualization: We show that [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) provides a unified perspective on several state-of-the-art optimizers. Most notably, it captures Muon, Lion-$\mathcal{K}$, and NAdam as *optimistic* instances within this broad framework (cf. [Table 1](https://arxiv.org/html/2605.11172#S1.T1)).
2. Theoretical guarantees: The [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) framing allows us to theoretically derive hyperparameter choices that work remarkably well in practice. Our formulation provides a new perspective on weight decay, not only as a regularizer, but as a form of primal averaging. This results in a new update rule that anchors each update at the initialization and induces a $1/k$ weight decay schedule.
3. A simple wrapper: We propose a practical wrapper around any base optimizer (e.g., Adam, Lion, Muon, Scion) that introduces no new hyperparameters and removes the need to tune a weight decay parameter.
4. Empirical results without additional tuning: We demonstrate consistent performance improvements when wrapping Adam, Muon, and Scion across various model sizes and training horizons, even outperforming baselines with tuned weight decay. Our proposed instantiation of our framework, SODA†, removes the need to tune weight decay while leading to better performance at scale.

#### Limitations

This work focuses on the single\-epoch setting\. For multi\-epoch training it might be necessary to have a hyperparameter controlling the regularization strength, due to overfitting\.

## 2 Preliminaries

We are interested in the following minimization problem

$$\min_{x\in\mathcal{X}} f(x):=\mathbb{E}_{\xi}[f(x,\xi)], \tag{2}$$

which covers a host of machine learning problems, where $\xi$ represents a data sample, $x$ is the parameter of the model being optimized, and $f(x,\xi)$ is a loss function.

#### Dual averaging and Fenchel conjugate

The Dual Averaging framework (Nesterov, [2009](https://arxiv.org/html/2605.11172#bib.bib38)) and its variants typically rely on the Fenchel conjugate to map gradient information back to the primal space. The Fenchel conjugate of a function $h:\mathcal{X}\to\mathbb{R}\cup\{\infty\}$ is defined as:

$$h^*(d)=\sup_{x\in\mathcal{X}}\left\{\langle d,x\rangle-h(x)\right\},$$

where $d\in\mathcal{X}^*$. The subdifferential $\partial h^*$ is equivalent to the set of maximizers of this conjugate operation:

$$\partial h^*(d)=\operatorname*{arg\,max}_{x\in\mathcal{X}}\left\{\langle d,x\rangle-h(x)\right\}.$$

This identity is a direct consequence of the Fenchel-Young inequality (Bauschke and Lucet, [2012](https://arxiv.org/html/2605.11172#bib.bib1)). The general optimization template ([1](https://arxiv.org/html/2605.11172#S1.E1)) can then be written compactly as:

$$x^{k+1}\in\partial h_k^*(-\gamma_k d^k), \tag{3}$$

where $d^k$ is an average (or momentum) of past stochastic gradients, $\gamma_k$ is a stepsize schedule, and $h_k$ is a sequence of geometry-defining regularizers (or mirror maps).

#### Standard regularizers

Different choices of the regularizer $h(x)$ recover well-known optimization steps. We highlight three instances that are particularly relevant for our work, especially when the constraint set is the norm ball $\mathcal{D}=\{x\in\mathcal{X}\mid\|x\|\leq 1\}$ for some arbitrary norm $\|\cdot\|$:

1. Unconstrained: Let $h(x)=\frac{1}{2}\|x\|^2$. Then $\partial h^*$ is the sharp operator (Nesterov, [2012](https://arxiv.org/html/2605.11172#bib.bib36)): $$\partial h^*(d)=[d]^{\sharp}:=\operatorname*{arg\,max}_{x}\left\{\langle d,x\rangle-\tfrac{1}{2}\|x\|^2\right\}.$$
2. Linear minimization oracle ($\operatorname{lmo}$): Let $h(x)=\iota_{\mathcal{D}}(x)$ be the indicator function of a set $\mathcal{D}$ (i.e., $0$ if $x\in\mathcal{D}$ and $\infty$ otherwise). Then $\partial h^*$ is the $\operatorname{lmo}$: $$\partial h^*(-d)=\operatorname{lmo}_{\mathcal{D}}(d):=\operatorname*{arg\,min}_{x\in\mathcal{D}}\langle d,x\rangle.$$ This is the core operation in Frank-Wolfe (conditional gradient) methods (Frank and Wolfe, [1956](https://arxiv.org/html/2605.11172#bib.bib17); Clarkson, [2010](https://arxiv.org/html/2605.11172#bib.bib9); Jaggi, [2013](https://arxiv.org/html/2605.11172#bib.bib21)), which are projection-free but typically require $f$ to be smooth.
3. Clipping: By combining the above, we can handle a non-smooth $f$ via "clipping." Let $h(x)=\frac{1}{2}\|x\|^2+\iota_{\mathcal{D}}(x)$. Then, we have $$\partial h^*(-d)=\operatorname{clip}_{\mathcal{D}}(d):=\operatorname*{arg\,min}_{x\in\mathcal{D}}\left\{\langle d,x\rangle+\tfrac{1}{2}\|x\|^2\right\}.$$ This operation clips the gradient step into the feasibility set $\mathcal{D}$, essentially performing a projection of the negative gradient. This was used in Pethick et al. ([2025b](https://arxiv.org/html/2605.11172#bib.bib42)); Crawshaw et al. ([2025](https://arxiv.org/html/2605.11172#bib.bib10)) for training neural networks.

These three choices of $\partial h^*$ can all be related through the $\operatorname{lmo}$, since $[d]^{\sharp}=-\|d\|_*\operatorname{lmo}_{\mathcal{D}}(d)$ and $\operatorname{clip}_{\mathcal{D}}(d)=\min\{1,\|d\|_*\}\operatorname{lmo}_{\mathcal{D}}(d)$ (see LABEL:lem:equiv).
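The two identities relating the sharp operator and clipping to the $\operatorname{lmo}$ can be sketched numerically. The helper names (`lmo_l2`, `lmo_linf`, `sharp`, `clip`) are hypothetical and chosen here for illustration, and the input `d` is assumed to be nonzero:

```python
import numpy as np

def lmo_l2(d):
    """lmo over the unit Euclidean ball: argmin_{||x||_2 <= 1} <d, x>."""
    return -d / np.linalg.norm(d)

def lmo_linf(d):
    """lmo over the unit l_inf ball: argmin_{||x||_inf <= 1} <d, x>."""
    return -np.sign(d)

def sharp(d, lmo, dual_norm):
    """Sharp operator via the identity [d]^sharp = -||d||_* lmo(d)."""
    return -dual_norm(d) * lmo(d)

def clip(d, lmo, dual_norm):
    """Clipping via the identity clip_D(d) = min{1, ||d||_*} lmo(d)."""
    return min(1.0, dual_norm(d)) * lmo(d)

l2_norm = np.linalg.norm                 # the l2 norm is its own dual
l1_norm = lambda d: np.abs(d).sum()      # the dual norm of l_inf is l1

d = np.array([3.0, -4.0])
sharp_l2 = sharp(d, lmo_l2, l2_norm)       # Euclidean case: recovers d itself
clip_l2 = clip(d, lmo_l2, l2_norm)         # ||d||_2 = 5 > 1, so -d / ||d||_2
sharp_linf = sharp(d, lmo_linf, l1_norm)   # ||d||_1 * sign(d)
small = np.array([0.3, 0.4])
clip_small = clip(small, lmo_l2, l2_norm)  # ||small||_2 < 1, so just -small
```

In the Euclidean case the clipped output is the projection of $-d$ onto the unit ball, which matches the description of clipping as a projection of the negative gradient.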

#### Norm choices

Important special cases arise by taking $h=\iota_{\mathcal{D}}$ with $\mathcal{D}$ the unit ball of a norm. For the $\ell_\infty$ ball, $\operatorname{lmo}_{\mathcal{D}}(d)=-\operatorname{sign}(d)$ as used in SignSGD and Lion (Bernstein et al., [2018](https://arxiv.org/html/2605.11172#bib.bib3); Chen et al., [2023](https://arxiv.org/html/2605.11172#bib.bib8)). For the spectral norm ball, $\operatorname{lmo}_{\mathcal{D}}(G)=-\operatorname{msign}(G)$ with $\operatorname{msign}(G):=UV^{\top}$, where $U\Sigma V^{\top}$ is the singular value decomposition of $G$ as used in Carlson et al. ([2016](https://arxiv.org/html/2605.11172#bib.bib5)); Jordan et al. ([2024b](https://arxiv.org/html/2605.11172#bib.bib24)); Pethick et al. ([2025a](https://arxiv.org/html/2605.11172#bib.bib41)).
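A minimal NumPy sketch of these two oracles follows. It computes $\operatorname{msign}$ with an explicit reduced SVD purely for clarity; practical implementations typically approximate it iteratively instead. The function names are illustrative and not taken from any library:

```python
import numpy as np

def msign(G):
    """Matrix sign U V^T, from the reduced SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def lmo_spectral(G):
    """lmo over the unit spectral-norm ball."""
    return -msign(G)

def lmo_linf(d):
    """lmo over the unit l_inf ball."""
    return -np.sign(d)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
X = lmo_spectral(G)

nuclear = np.linalg.svd(G, compute_uv=False).sum()  # nuclear norm of G
inner = np.sum(G * X)                # <G, X>, which should equal -nuclear
spectral = np.linalg.norm(X, 2)      # largest singular value of X
```

The inner product attains $-\|G\|_{\text{nuclear}}$, the smallest value a matrix of spectral norm at most one can achieve, which is what makes $-\operatorname{msign}(G)$ the minimizer.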

## 3 Method

We propose the following algorithm which generalizes Optimistic Dual Averaging (ODA) by introducing a primal extrapolation sequence ($y^k$) from Tseng ([2008](https://arxiv.org/html/2605.11172#bib.bib47)); Lan ([2012](https://arxiv.org/html/2605.11172#bib.bib30)); Defazio et al. ([2024](https://arxiv.org/html/2605.11172#bib.bib12)):

$$\begin{aligned}
m^{k+1}&=(1-\alpha_k)m^k+\alpha_k\nabla f(y^k,\xi_k)\\
\bar{m}^{k+1}&=(1-\bar{\alpha}_k)m^{k+1}+\bar{\alpha}_k\nabla f(y^k,\xi_k)\quad\text{(optimism)}\\
z^{k+1}&\in\partial h_k^*(-\gamma_k\bar{m}^{k+1})=\operatorname*{arg\,min}_{x\in\mathcal{X}}\ \gamma_k\langle\bar{m}^{k+1},x\rangle+h_k(x)\\
x^{k+1}&=(1-\lambda_k)x^k+\lambda_k z^{k+1}\\
y^{k+1}&=(1-\bar{\lambda}_k)x^{k+1}+\bar{\lambda}_k z^{k+1}\quad\text{(primal extrapolation)}
\end{aligned}\tag{SODA}$$

with $\alpha_k,\bar{\alpha}_k,\lambda_k,\bar{\lambda}_k\in[0,1]$ and $\gamma_k>0$. We initialize $m^0=0$ and $x^0=y^0=z^0\in\partial h_0^*(0)$. Observe the elegant symmetry between how the dual (gradients) and primal (iterates) are being processed. Our analysis in [Section 4](https://arxiv.org/html/2605.11172#S4) builds on the Schedule-Free framework of Defazio et al. ([2024](https://arxiv.org/html/2605.11172#bib.bib12)) and ODA (Rakhlin and Sridharan, [2013](https://arxiv.org/html/2605.11172#bib.bib44)), so we refer to the method as SODA to pay homage.

Note that typically in machine learning libraries such as PyTorch the momentum parameters are instead defined as $\beta_k=1-\alpha_k$, $\bar{\beta}_k=1-\bar{\alpha}_k$, $\tau_k=1-\lambda_k$, and $\bar{\tau}_k=1-\bar{\lambda}_k$.
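To make the five SODA lines concrete, here is a minimal sketch of a single step. It assumes the Euclidean regularizer $h_k(x)=\tfrac{1}{2}\|x\|_2^2$ (an assumption made here for simplicity), for which $\partial h_k^*(-u)=\{-u\}$ and the initialization $\partial h_0^*(0)$ is the zero vector. The names `soda_step` and `init_state` are hypothetical:

```python
import numpy as np

def init_state(dim):
    """m^0 = 0 and x^0 = y^0 = z^0 = 0, the unique point in dh*(0) here."""
    return {name: np.zeros(dim) for name in ("m", "x", "y", "z")}

def soda_step(state, grad, alpha, alpha_bar, lam, lam_bar, gamma):
    """One SODA step, where grad is the stochastic gradient evaluated at y^k."""
    m = (1 - alpha) * state["m"] + alpha * grad      # gradient averaging
    m_bar = (1 - alpha_bar) * m + alpha_bar * grad   # optimism
    z = -gamma * m_bar                               # dualization (Euclidean)
    x = (1 - lam) * state["x"] + lam * z             # primal averaging
    y = (1 - lam_bar) * x + lam_bar * z              # primal extrapolation
    return {"m": m, "x": x, "y": y, "z": z}

g = np.array([1.0, 1.0])
# lam_bar = 1: the next gradient is queried at z.
s_oda = soda_step(init_state(2), g, alpha=0.5, alpha_bar=0.5,
                  lam=0.5, lam_bar=1.0, gamma=2.0)
# lam_bar = 0: the next gradient is queried at x.
s_moda = soda_step(init_state(2), g, alpha=0.5, alpha_bar=0.5,
                   lam=0.5, lam_bar=0.0, gamma=2.0)
```

The two calls differ only in `lam_bar`, which decides whether the next gradient query point `y` coincides with `z` or with `x`.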

### 3.1 Special Cases

There are two important extremes depending on the choice of the primal extrapolation parameter $\bar{\lambda}_k$.

#### Optimistic Dual Averaging

For $\bar{\lambda}_k=1$:

$$\begin{aligned}
m^{k+1}&=(1-\alpha_k)m^k+\alpha_k\nabla f(z^k,\xi_k)\\
\bar{m}^{k+1}&=(1-\bar{\alpha}_k)m^{k+1}+\bar{\alpha}_k\nabla f(z^k,\xi_k)\\
z^{k+1}&\in\partial h^*(-\gamma_k\bar{m}^{k+1})\\
x^{k+1}&=(1-\lambda_k)x^k+\lambda_k z^{k+1}
\end{aligned}\tag{ODA}$$

This is an optimistic version of the celebrated Dual Averaging scheme (Nesterov, [2005](https://arxiv.org/html/2605.11172#bib.bib35)), also known as optimistic follow-the-regularized-leader (FTRL) in online learning (Rakhlin and Sridharan, [2013](https://arxiv.org/html/2605.11172#bib.bib44)). Notice that the output of the algorithm ($x^k$) can be different from where the gradient is evaluated ($z^k$).

#### Modernized Optimistic Dual Averaging

For $\bar{\lambda}_k=0$:

$$\begin{aligned}
m^{k+1}&=(1-\alpha_k)m^k+\alpha_k\nabla f(x^k,\xi_k)\\
\bar{m}^{k+1}&=(1-\bar{\alpha}_k)m^{k+1}+\bar{\alpha}_k\nabla f(x^k,\xi_k)\\
x^{k+1}&=(1-\lambda_k)x^k+\lambda_k\,\partial h_k^*(-\gamma_k\bar{m}^{k+1})
\end{aligned}\tag{MODA}$$

This can be seen as an optimistic version of Double Averaging of Nesterov and Shikhman ([2015](https://arxiv.org/html/2605.11172#bib.bib37)). We refer to this as "modernized" following Jelassi and Defazio ([2020](https://arxiv.org/html/2605.11172#bib.bib22)), which applied Double Averaging to deep learning. The output of the algorithm ($x^k$) is the same as where the gradient is evaluated ($x^k$).

This recovers Stochastic Frank-Wolfe and Scion (Mokhtari et al., [2020](https://arxiv.org/html/2605.11172#bib.bib33); Pethick et al., [2025a](https://arxiv.org/html/2605.11172#bib.bib41)) with $\bar{\alpha}_k=0$ and $h_k(x)=\iota_{\mathcal{D}}(x)$, since then $\partial h_k^*(-\gamma_k d)=\rho\operatorname{lmo}(\gamma_k d)=\rho\operatorname{lmo}(d)$ for some constraint radius $\rho>0$, where the second equality follows from scale invariance of the $\operatorname{lmo}$.

More importantly, for $\bar{\alpha}_k\neq 0$, [MODA](https://arxiv.org/html/2605.11172#S3.Ex8) captures NAdam (Dozat, [2016](https://arxiv.org/html/2605.11172#bib.bib15)), Lion (Chen et al., [2023](https://arxiv.org/html/2605.11172#bib.bib8)) and Muon (Jordan et al., [2024b](https://arxiv.org/html/2605.11172#bib.bib24)) with weight decay through the choice of geometry $h_k$:

- Lion-$\mathcal{K}$: choose $h_k=\mathcal{K}^*$, so that $\partial h_k^*=\partial\mathcal{K}$. For Lion specifically, $\partial h_k^*(-u)=-\operatorname{sign}(u)$.
- Muon: choose a spectral mirror map $h_k$ such that $\partial h_k^*(-u)=-\operatorname{msign}(u)$. The so-called Nesterov momentum corresponds to choosing the optimistic parameter as $\bar{\alpha}_k=\alpha_k$.
- NAdam: choose $h_k(x)=\tfrac{1}{2}\langle x,\operatorname{Diag}(\sqrt{v^k}+\varepsilon)x\rangle$, where $v^k=\tau v^{k-1}+(1-\tau)\nabla f(x^k,\xi_k)\odot\nabla f(x^k,\xi_k)$. Then $\nabla h_k^*(-\gamma_k\bar{m}^{k+1})=-\gamma_k\bar{m}^{k+1}\oslash(\sqrt{v^k}+\varepsilon)$.
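The Lion case can be checked numerically. The sketch below compares the Lion update with decoupled weight decay, as it is commonly stated, against a MODA step with a sign oracle. The parameter mapping ($\alpha=1-\beta_2$, $\bar{\alpha}=1-\beta_1/\beta_2$, $\lambda=\eta\cdot\mathrm{wd}$, and an $\ell_\infty$ ball of radius $\rho=1/\mathrm{wd}$) is worked out here for illustration and is not quoted from the paper; the function names are hypothetical:

```python
import numpy as np

def lion_step(x, m, g, lr, beta1, beta2, wd):
    """Lion with decoupled weight decay."""
    c = beta1 * m + (1 - beta1) * g           # interpolation used for the update
    x_new = x - lr * (np.sign(c) + wd * x)
    m_new = beta2 * m + (1 - beta2) * g       # momentum buffer
    return x_new, m_new

def moda_sign_step(x, m, g, alpha, alpha_bar, lam, rho):
    """MODA with dh*(-u) = -rho * sign(u), i.e. an l_inf ball of radius rho."""
    m_new = (1 - alpha) * m + alpha * g
    m_bar = (1 - alpha_bar) * m_new + alpha_bar * g   # optimism
    x_new = (1 - lam) * x + lam * (-rho * np.sign(m_bar))
    return x_new, m_new

lr, beta1, beta2, wd = 1e-2, 0.9, 0.99, 0.5
# Assumed mapping between the two parameterizations (requires beta1 <= beta2).
alpha, alpha_bar = 1 - beta2, 1 - beta1 / beta2
lam, rho = lr * wd, 1.0 / wd

rng = np.random.default_rng(0)
x_lion = x_moda = rng.standard_normal(5)
m_lion = m_moda = np.zeros(5)
for _ in range(20):
    g = rng.standard_normal(5)
    x_lion, m_lion = lion_step(x_lion, m_lion, g, lr, beta1, beta2, wd)
    x_moda, m_moda = moda_sign_step(x_moda, m_moda, g, alpha, alpha_bar, lam, rho)

max_gap = np.max(np.abs(x_lion - x_moda))   # zero up to rounding error
```

Under this mapping the weight decay term of Lion shows up as the averaging weight $\lambda$ in MODA, which is the sense in which weight decay acts as primal averaging.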

In this light, all of the above methods can be interpreted as optimistic versions of Dual Averaging. We note that NAdam was also rediscovered as a simplification of AdEMAMix (Pagliardini et al., [2024](https://arxiv.org/html/2605.11172#bib.bib40)) named Simplified-AdEMAMix (Morwani et al., [2025](https://arxiv.org/html/2605.11172#bib.bib34)). See [Table 1](https://arxiv.org/html/2605.11172#S1.T1) for an overview.

Notice that $\lambda_k$ plays the role of the stepsize in this case. This is in contrast with the following, where we will instead let $\lambda_k=1/(k+2)$ and let $\eta_k$ be the stepsize of the base optimizer (typically using a linear or cosine schedule).

Algorithm 1: SODA Wrapper

Input: Horizon $n$, initialization $z^0=x^0\in\mathcal{X}$

1: for $k=0,\dots,n-1$ do

2: Sample $\xi_k\sim\mathcal{P}$ and compute the gradient $g^k=\nabla f(x^k,\xi_k)$

3: $z^{k+1}=z^0+(k+2)\operatorname{BaseUpdate}(g^k)$

4: $x^{k+1}=(1-\tfrac{1}{k+2})x^k+\tfrac{1}{k+2}z^{k+1}$

5: return $x^n$

$\operatorname{BaseUpdate}$ refers to the update delta of the base optimizer *without weight decay*, i.e., $u^{k+1}=u^k+\operatorname{BaseUpdate}(g^k)$.
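A minimal Python sketch of Algorithm 1 is given below, under a few assumptions that are not part of the paper: `base_update(g, k)` is a hypothetical callable returning the base optimizer's delta (stepsize included, weight decay excluded), the base optimizer is plain SGD with a constant stepsize, and the objective is a toy quadratic. The second loop writes the same recursion in the combined form $x^{k+1}=\tfrac{1}{k+2}z^0+(1-\tfrac{1}{k+2})x^k+\operatorname{BaseUpdate}(g^k)$ to confirm that the two agree:

```python
import numpy as np

def soda_wrapper(grad_fn, base_update, x0, n):
    """Algorithm 1: wrap a base optimizer's update delta (no weight decay)."""
    z0 = x0.copy()
    x = x0.copy()
    for k in range(n):
        g = grad_fn(x)                              # step 2: gradient at x^k
        z = z0 + (k + 2) * base_update(g, k)        # step 3: anchored at z^0
        x = (1 - 1 / (k + 2)) * x + z / (k + 2)     # step 4: uniform averaging
    return x

def sgd_delta(lr):
    """Hypothetical base optimizer: plain SGD delta, -lr * g."""
    return lambda g, k: -lr * g

target = np.array([1.0, -2.0])
grad_fn = lambda x: x - target       # gradient of 0.5 * ||x - target||^2
x0 = np.array([0.5, 0.5])
n = 100

x_alg = soda_wrapper(grad_fn, sgd_delta(0.1), x0, n)

# The same recursion with steps 3 and 4 merged into one line.
x_simple = x0.copy()
for k in range(n):
    delta = sgd_delta(0.1)(grad_fn(x_simple), k)
    x_simple = x0 / (k + 2) + (1 - 1 / (k + 2)) * x_simple + delta
```

A stateful base optimizer (for example one with momentum buffers) would need `base_update` to carry its own state, for instance as a closure or a small class; that bookkeeping is omitted here.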

#### SODA Wrapper

In [Algorithm 1](https://arxiv.org/html/2605.11172#alg1), we provide a particularly practical instantiation of [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) that wraps an existing base optimizer, based on hyperparameter choices from [Corollary 4.6](https://arxiv.org/html/2605.11172#S4.Thmthm6). Following the theory, this approach uses $\bar{\lambda}_k=0$ (aka [MODA](https://arxiv.org/html/2605.11172#S3.Ex8)), uniform iterate averaging ($\lambda_k=1/(k+2)$), a stepsize $\gamma_k=(k+2)\eta_k$ and defines the regularizer relative to a reference iterate $z^0$ as $h_k(x)=\psi_k(x-z^0)$ with $\psi_k(x):=\tfrac{1}{2}\|x\|^2_{H_k}$ where $H_k$ is a positive-definite matrix. Since $\partial\psi_k^*$ is positively homogeneous, we have $\partial h_k^*(-\gamma_k\bar{m}^{k+1})=z^0+\partial\psi_k^*(-\gamma_k\bar{m}^{k+1})=z^0+\gamma_k\partial\psi_k^*(-\bar{m}^{k+1})$.

Any optimizer based on (optimistic) gradient momentum and dualization, such as Adam, NAdam, Scion, Muon, Signum, or Lion, can serve as the base optimizer. Notice that the $\operatorname{BaseUpdate}$ involves the stepsize $\eta_k$, while the wrapper introduces no new hyperparameters.

The resulting [Algorithm 1](https://arxiv.org/html/2605.11172#alg1) has a clear interpretation when simplifying the expression:

$$x^{k+1}=\underbrace{\tfrac{1}{k+2}z^0}_{\text{centering}}+\underbrace{(1-\tfrac{1}{k+2})x^k}_{\text{scheduled weight decay}}+\underbrace{\operatorname{BaseUpdate}(g^k)}_{\text{includes a stepsize schedule}}\tag{4}$$

Let us compare ([4](https://arxiv.org/html/2605.11172#S3.E4)) with the original (*independent*) weight decay formulation of Hanson and Pratt ([1988](https://arxiv.org/html/2605.11172#bib.bib19)):

$$x^{k+1}=(1-\mu_k)x^k-\gamma_k\nabla f(x^k,\xi_k)\tag{5}$$

where $\mu_k\geq 0$. First, the iterates in ([4](https://arxiv.org/html/2605.11172#S3.E4)) are anchored at the initialization $z^0$. Second, the weight decay decays as $\mu_k=1/(k+2)$ and is *not* multiplied by the stepsize schedule, as is otherwise standard in, e.g., decoupled weight decay (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.11172#bib.bib31)), whereas the base optimizer is multiplied with the stepsize schedule (absorbed into $\operatorname{BaseUpdate}$).
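The averaging interpretation of the $1/(k+2)$ schedule can be made explicit by unrolling the wrapper recursion $x^{k+1}=\tfrac{1}{k+2}z^0+(1-\tfrac{1}{k+2})x^k+\operatorname{BaseUpdate}(g^k)$. The short derivation below uses only this recursion and $x^0=z^0$; the shorthand $\Delta^k:=\operatorname{BaseUpdate}(g^k)$ is introduced here for brevity:

```latex
\begin{aligned}
(k+2)\,x^{k+1} &= z^0 + (k+1)\,x^k + (k+2)\,\Delta^k
  && \text{(multiply the recursion by } k+2\text{)} \\
(n+1)\,x^{n} &= (n+1)\,z^0 + \sum_{k=0}^{n-1}(k+2)\,\Delta^k
  && \text{(telescope from } k=0 \text{ to } n-1\text{, using } x^0=z^0\text{)} \\
x^{n} &= z^0 + \frac{1}{n+1}\sum_{k=0}^{n-1}(k+2)\,\Delta^k
  \;=\; \frac{1}{n+1}\sum_{j=0}^{n} z^{j},
  && \text{with } z^{j+1}=z^0+(j+2)\,\Delta^{j}.
\end{aligned}
```

So the returned iterate is the uniform average of $z^0,\dots,z^n$, and the apparent weight decay factor $(1-\tfrac{1}{k+2})$ is exactly the running-average weight.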

The weight decay plays a different role in [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) than it typically does. It does not act as a regularizer, but is rather related to the averaging of iterates, which improves optimization. Remarkably, as we shall see in [Section 5](https://arxiv.org/html/2605.11172#S5), the SODA wrapper ([Algorithm 1](https://arxiv.org/html/2605.11172#alg1)) consistently improves upon the wrapped base optimizer, even outperforming the baseline with tuned weight decay while the SODA wrapper requires no such tuning.

### 3.2 Related Work

We organize related work through the lens of [SODA](https://arxiv.org/html/2605.11172#S3.Ex6), which separates (i) dual averaging and optimism, (ii) primal averaging and primal extrapolation, and (iii) dualization through geometry (mirror maps). This decomposition clarifies how several modern optimizers arise as special cases.

#### Gradient averaging\.

Several works in deep learning have highlighted the benefit of maintaining multiple dual sequences. AdEMAMix (Pagliardini et al., [2024](https://arxiv.org/html/2605.11172#bib.bib40)) showed that keeping an additional dual averaging sequence on top of Adam-style momentum can significantly improve performance. Simplified-AdEMAMix (Morwani et al., [2025](https://arxiv.org/html/2605.11172#bib.bib34)) later demonstrated that explicitly adding back the current gradient is sufficient, yielding an update equivalent to NAdam (Dozat, [2016](https://arxiv.org/html/2605.11172#bib.bib15)). Similarly, Muon (Jordan et al., [2024b](https://arxiv.org/html/2605.11172#bib.bib24)) uses the PyTorch implementation of Nesterov momentum, which can be captured by [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) with $\bar{\alpha}_k=\alpha_k$. The sign-based optimizer Lion (Chen et al., [2024](https://arxiv.org/html/2605.11172#bib.bib7)) also moves beyond simple averaging. From the perspective of [SODA](https://arxiv.org/html/2605.11172#S3.Ex6), these gradient averaging schemes correspond precisely to forming the optimistic dual variable $\bar{m}^{k+1}$.

#### Iterate averaging\.

A complementary line of work in the deep learning literature focuses on averaging and extrapolation in the *primal* space. Lookahead (Zhang et al., [2019](https://arxiv.org/html/2605.11172#bib.bib52)) performs iterate averaging by maintaining a slow sequence ($x^k$) and generating extrapolated points ($z^{k+1}$) via an inner optimization loop, corresponding to the special case $\bar{\lambda}=0$. DiLoCo (Douillard et al., [2023](https://arxiv.org/html/2605.11172#bib.bib14)) replaces simple averaging in the outer optimizer with PyTorch-style Nesterov momentum, demonstrating improvements in distributed training and, subsequently, even in single-node settings (Kallusky et al., [2025](https://arxiv.org/html/2605.11172#bib.bib26)).

Generalized Primal Averaging (GPA) (Defazio et al., [2025](https://arxiv.org/html/2605.11172#bib.bib13)) further simplifies DiLoCo by showing that an explicit inner loop is unnecessary: it suffices to average the iterates ($x^{k+1}$) and extrapolate ($y^{k+1}$). Specifically, GPA can be written as

$$\begin{aligned}
z^{k+1}&=(1-\eta_k\mu)z^k+\eta_k\operatorname{BaseUpdate}(\nabla f(y^k,\xi_k)),\\
x^{k+1}&=(1-\lambda)x^k+\lambda z^{k+1},\\
y^{k+1}&=(1-\bar{\lambda})x^{k+1}+\bar{\lambda}z^{k+1},
\end{aligned}\tag{GPA}$$

where $\mu\geq 0$ and $\lambda,\bar{\lambda}\in[0,1]$. GPA uses a stepsize schedule for the base optimizer's $\eta_k$ and fixes the iterate-averaging parameters $(\lambda,\bar{\lambda})$. From the SODA viewpoint, GPA corresponds to constant primal averaging parameters $(\lambda,\bar{\lambda})$. Typical choices are $1-\bar{\lambda}=0.9$ and $1-\lambda=(1-\bar{\lambda})^{1/H}\approx 0.9967$ with $H=32$, implying $\lambda<\bar{\lambda}$. This contrasts with the smooth SODA regime, which requires the sufficient condition $\bar{\lambda}_k\leq\lambda_k/10$ ([Corollaries 4.6](https://arxiv.org/html/2605.11172#S4.Thmthm6) and LABEL:cor:soda-acc). Moreover, whereas GPA relies on a base optimizer *with* a tuned weight decay parameter $\mu$, the SODA wrapper ([Algorithm 1](https://arxiv.org/html/2605.11172#alg1)) wraps a base optimizer *without* weight decay and eliminates the need for any additional hyperparameters.
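The arithmetic behind the typical GPA choices is easy to reproduce. The short sketch below recomputes $\lambda$ from $1-\bar{\lambda}=0.9$ and $H=32$ and compares the pair against the condition $\bar{\lambda}\leq\lambda/10$; the variable names are illustrative:

```python
H = 32
lam_bar = 1 - 0.9                          # from 1 - lam_bar = 0.9
one_minus_lam = (1 - lam_bar) ** (1 / H)   # approximately 0.9967
lam = 1 - one_minus_lam                    # approximately 0.0033

# The typical GPA pair has lam < lam_bar, so it falls outside the
# sufficient condition lam_bar <= lam / 10 of the smooth SODA regime.
satisfies_soda_condition = lam_bar <= lam / 10
```

Here `satisfies_soda_condition` evaluates to `False`, which is the contrast the text draws between the two regimes.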

Weight decay can also be viewed as a form of primal averaging. Xiao ([2024](https://arxiv.org/html/2605.11172#bib.bib48)) observed a $1/d$ scaling with model size $d$ and used it for hyperparameter transfer, while Qiu et al. ([2025](https://arxiv.org/html/2605.11172#bib.bib43)) adopted this rule for spectral methods such as Muon. In contrast, we separate model size and horizon effects: if the horizon is fixed and only model size varies, our framework does not predict a necessary $1/d$ scaling. Concurrently, Ferbach et al. ([2026](https://arxiv.org/html/2605.11172#bib.bib16)) considered the time-decaying rule $\lambda_k=\lambda\eta_k/k$. In contrast, the SODA wrapper uses the parameter-free schedule $\lambda_k=1/(k+2)$. Our work focuses on removing hyperparameters, while Ferbach et al. ([2026](https://arxiv.org/html/2605.11172#bib.bib16)) introduces new hyperparameters.

#### Acceleration and universality.

Primal extrapolation, corresponding to $\bar{\lambda}_k>0$ in [SODA](https://arxiv.org/html/2605.11172#S3.Ex6), originates in accelerated gradient and proximal-gradient methods (Tseng, [2008](https://arxiv.org/html/2605.11172#bib.bib47); Lan, [2012](https://arxiv.org/html/2605.11172#bib.bib30)). Querying gradients at the averaged point, corresponding to $\bar{\lambda}_k=0$, was later used concurrently by Cutkosky ([2019](https://arxiv.org/html/2605.11172#bib.bib11)) and Kavis et al. ([2019](https://arxiv.org/html/2605.11172#bib.bib27)) to obtain universal methods, i.e., a single algorithm simultaneously attaining both the optimal smooth stochastic convex rate $O(L/n^{2}+\sigma/\sqrt{n})$ and the nonsmooth rate $O(1/\sqrt{n})$. Joulani et al. ([2020](https://arxiv.org/html/2605.11172#bib.bib25)) combined this averaging mechanism with adaptive Optimistic Dual Averaging/FTRL (Rakhlin and Sridharan, [2013](https://arxiv.org/html/2605.11172#bib.bib44); Mohri and Yang, [2016](https://arxiv.org/html/2605.11172#bib.bib32)). Defazio et al. ([2024](https://arxiv.org/html/2605.11172#bib.bib12)) later extended this analysis to allow for the larger range $\bar{\lambda}_k\leq\lambda_k/10$ and used it to develop an optimization wrapper for deep learning with unknown training horizon.

#### Dualization and geometry.

The choice of mirror map $h$ determines the geometry of the update and plays a central role in modern deep learning. Recent work has emphasized that many deep learning optimizers are best understood through their induced norm (Bernstein and Newhouse, [2024](https://arxiv.org/html/2605.11172#bib.bib2)). Elementwise sign methods are the simplest example: the $\ell_\infty$ geometry gives rise to SignSGD (Bernstein et al., [2018](https://arxiv.org/html/2605.11172#bib.bib3)) and Lion (Chen et al., [2023](https://arxiv.org/html/2605.11172#bib.bib8)), and has been used to partially explain the effectiveness of the popular Adam optimizer (Kunstner et al., [2023](https://arxiv.org/html/2605.11172#bib.bib29)). For matrix parameters, spectral descent methods (Carlson et al., [2015a](https://arxiv.org/html/2605.11172#bib.bib4), [2016](https://arxiv.org/html/2605.11172#bib.bib5), [b](https://arxiv.org/html/2605.11172#bib.bib6)) and their modern variants, such as Muon (Jordan et al., [2024b](https://arxiv.org/html/2605.11172#bib.bib24)) and Scion (Pethick et al., [2025a](https://arxiv.org/html/2605.11172#bib.bib41)), arise from spectral geometries. Beyond single-norm geometries, multi-norm constructions enforcing both row- and column-normalization (doubly stochastic structure) have also been explored (Scetbon et al., [2025](https://arxiv.org/html/2605.11172#bib.bib45); Xie et al., [2025](https://arxiv.org/html/2605.11172#bib.bib50)). Whenever the mirror map admits a tractable Fenchel conjugate, such geometries naturally fit within the SODA framework.

## 4 Analysis

We now derive convergence guarantees for [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) in order to set the hyperparameters. The proof is based on an online regret argument, so we first let $g^k$ denote an arbitrary gradient-feedback sequence. In the stochastic optimization setting of [SODA](https://arxiv.org/html/2605.11172#S3.Ex6), we take $g^k:=\nabla f(y^k,\xi_k)$. We use the following assumptions, which are standard except for the gradient-variation condition in [Assumption 4.4](https://arxiv.org/html/2605.11172#S4).

###### Assumption 4.1 (Convex).

For every sample $\xi$, the function $f(\cdot,\xi)$ is convex.

###### Assumption 4.2 ($L$-smooth).

The function $f$ is $L$-smooth with respect to $\|\cdot\|$.

###### Assumption 4.3 (Unbiased).

Let $\mathcal{F}_k$ be the natural filtration. The gradients satisfy

$$
\mathbb{E}[g^k\mid\mathcal{F}_{k-1}]\in\partial f(y^k).
$$

###### Assumption 4.4 (Gradient variation).

The gradients satisfy, for $k\geq 0$ and $\rho>0$,

$$
\mathbb{E}\!\left[\|g^k-g^{k-1}\|_*^2\right]\leq\rho\,\mathbb{E}\!\left[\|\nabla f(y^k)-\nabla f(y^{k-1})\|_*^2\right]+\sigma^2.
$$

Our proof primarily builds on Defazio et al. ([2024](https://arxiv.org/html/2605.11172#bib.bib12)); however, rather than combining primal extrapolation with an adaptive version of Optimistic Mirror Descent, we use Optimistic Dual Averaging (Rakhlin and Sridharan, [2013](https://arxiv.org/html/2605.11172#bib.bib44)) as the underlying no-regret algorithm.

###### Corollary 4.6 (Convergence under $L$-smoothness).

Let $x^\star\in\operatorname*{arg\,min}_{x\in\mathcal{X}}f(x)$ and let $R_\star:=h(x^\star)-\inf h$. Consider [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) with a fixed regularizer $h_k\equiv h$. For every $k=0,\dots,n-1$, choose

$$
\alpha_k=\tfrac{1}{k+1},\qquad\bar{\alpha}_k=\lambda_k=\tfrac{1}{k+2},\qquad\bar{\lambda}_k\leq\tfrac{\lambda_k}{10},\qquad\gamma_k=\eta(k+2),\qquad\eta=\min\left\{\tfrac{\mu}{6\rho L},\tfrac{\sqrt{\mu R_\star}}{\sigma\sqrt{n}}\right\}.
$$

Suppose [Assumptions 4.1](https://arxiv.org/html/2605.11172#S4), [4.2](https://arxiv.org/html/2605.11172#S4), [4.3](https://arxiv.org/html/2605.11172#S4) and [4.4](https://arxiv.org/html/2605.11172#S4) hold and that $h$ is $\mu$-strongly convex with respect to $\|\cdot\|$. Then, for every $n\geq 1$,

$$
\mathbb{E}[f(x^{n-1})-f(x^\star)]=O\!\left(\tfrac{(\rho+1)LR_\star}{\mu n}+\tfrac{\sigma\sqrt{R_\star}}{\sqrt{\mu n}}\right).
$$
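The parameter choices in Corollary 4.6 can be tabulated directly. The sketch below is only a convenience for evaluating the stated formulas: the function names and the dictionary layout are hypothetical, `mu` denotes the strong convexity modulus of $h$ (not a weight decay coefficient), and `rate_bound` evaluates the expression inside $O(\cdot)$ without the hidden constant. Treating $\sigma=0$ as "no noise term" is an assumption made to avoid a division by zero.

```python
import math


def soda_step_size(n, L, sigma, rho, mu, R_star):
    """eta = min{mu / (6 rho L), sqrt(mu R_star) / (sigma sqrt(n))}."""
    smooth_term = mu / (6.0 * rho * L)
    if sigma == 0.0:
        # Assumption: with no noise, only the smooth term constrains eta.
        return smooth_term
    noise_term = math.sqrt(mu * R_star) / (sigma * math.sqrt(n))
    return min(smooth_term, noise_term)


def soda_schedule(k, eta):
    """Per-iteration parameters from the corollary for step k >= 0."""
    lam = 1.0 / (k + 2)
    return {
        "alpha": 1.0 / (k + 1),
        "alpha_bar": lam,
        "lam": lam,
        "lam_bar_max": lam / 10.0,  # any lam_bar in [0, lam / 10] is allowed
        "gamma": eta * (k + 2),
    }


def rate_bound(n, L, sigma, rho, mu, R_star):
    """Expression inside O(.) in the corollary, without the hidden constant."""
    smooth = (rho + 1.0) * L * R_star / (mu * n)
    noise = sigma * math.sqrt(R_star) / math.sqrt(mu * n)
    return smooth + noise


if __name__ == "__main__":
    eta = soda_step_size(n=10_000, L=1.0, sigma=1.0, rho=1.0, mu=1.0, R_star=1.0)
    print(eta, soda_schedule(0, eta))
    print(rate_bound(n=10_000, L=1.0, sigma=1.0, rho=1.0, mu=1.0, R_star=1.0))
```

Note that $\eta$ depends on quantities such as $L$, $\sigma$, and $R_\star$ that are rarely known in deep learning, so in practice only the shapes of the schedules, not these constants, are likely to carry over.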

#### Consequences for practice

The SODA wrapper in [Algorithm 1](https://arxiv.org/html/2605.11172#alg1) directly uses the theoretically suggested choices of $\lambda_k$ and $\gamma_k$. Taking the primal extrapolation constant $\bar{\lambda}_k$ small lets the analysis exploit smoothness to cancel the gradient-variation term from [Assumption 4.4](https://arxiv.org/html/2605.11172#S4). Since $\bar{\lambda}_k$ has no effect on the rate once it is small enough, we can conveniently set $\bar{\lambda}_k=0$ in practice, matching the “modernized” parameterization in [Section 3](https://arxiv.org/html/2605.11172#S3) and recovering Muon, Lion, and NAdam. Under the bounded-gradient analysis of LABEL:cor:soda-nonacc, the larger range $\bar{\lambda}_k\in[0,1]$ is also admissible.

The smooth analysis also requires the optimistic averaging parameter $\bar{\alpha}_k$ to be slightly smaller than $\alpha_k$. This is consistent with common choices: Lion often uses $\alpha_k=0.1$ with $\bar{\alpha}_k\in\{0.05,0.01\}$, while Muon takes $\alpha_k=\bar{\alpha}_k$. Finally, the rate depends on the initial regularizer gap $R_\star$. For $h(x)=\|x-z^0\|^2$, we have $R_\star=\|x^\star-z^0\|^2$, so choosing $z^0=0$ can be a poor anchor when the solution is far from the origin. This motivates the centered choice in [Section 3](https://arxiv.org/html/2605.11172#S3).

#### Acceleration

LABEL:cor:soda-acc shows that SODA also admits an accelerated parameterization following Defazio et al. ([2024](https://arxiv.org/html/2605.11172#bib.bib12), Cor. 1). This is obtained by choosing $\alpha_k=2/(k+2)$, $\bar{\alpha}_k=\lambda_k=2/(k+3)$, and $\bar{\lambda}_k\leq\lambda_k/10$. Equivalently, this accelerated regime corresponds to using increasing weights $a_k=k+1$ for $k\geq 0$ and taking $\alpha_k=a_k/\sum_{i=0}^{k}a_i$ and $\lambda_k=a_{k+1}/\sum_{i=0}^{k+1}a_i$. We comment further in LABEL:app:acceleration.
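The equivalence between the two descriptions of the accelerated regime follows from $\sum_{i=0}^{k}(i+1)=(k+1)(k+2)/2$, and it can be checked mechanically. The short sketch below does so with exact fractions; the function name is hypothetical and the check covers only the identities stated above, not the convergence claim.

```python
from fractions import Fraction


def weights_to_parameters(k):
    """Return (alpha_k, lam_k) induced by the weights a_i = i + 1."""
    a = lambda i: i + 1
    alpha_k = Fraction(a(k), sum(a(i) for i in range(k + 1)))
    lam_k = Fraction(a(k + 1), sum(a(i) for i in range(k + 2)))
    return alpha_k, lam_k


if __name__ == "__main__":
    for k in range(5):
        alpha_k, lam_k = weights_to_parameters(k)
        # Closed forms quoted in the text: 2 / (k + 2) and 2 / (k + 3).
        print(k, alpha_k == Fraction(2, k + 2), lam_k == Fraction(2, k + 3))
```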

#### Limitations

Our analysis is convex. Although convex theory often remains empirically informative for deep learning, as observed by Defazio et al. ([2024](https://arxiv.org/html/2605.11172#bib.bib12)) and Schaipp et al. ([2025](https://arxiv.org/html/2605.11172#bib.bib46)) and in our experiments, extending the guarantees to less restrictive assumptions is an important direction. A second limitation is the strong convexity requirement on the regularizer. For a regularizer $h$ that is not strongly convex, one can instead use $h_\tau(x):=h(x)+\tau\psi(x)$, where $\psi$ is $1$-strongly convex with respect to the chosen norm and $\tau>0$. Substituting $h_\tau$ for $h$ and $\tau$ for $\mu$ preserves the same $O(n^{-1/2})$ dependence on $n$, but at the cost of a $1/\sqrt{\tau}$ factor in the constant.

## 5 Experiments

![Refer to caption](https://arxiv.org/html/2605.11172v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.11172v1/x2.png)

Figure 1: Muon with swept weight decay is outperformed by SODA(Muon), without any additional tuning, on 124M models trained for both $1\times$ Chinchilla steps (left) and $4\times$ Chinchilla steps (right).

![Refer to caption](https://arxiv.org/html/2605.11172v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.11172v1/x4.png)

Figure 2: The SODA wrapper yields consistent improvement across various base optimizers without any additional tuning, as illustrated on a 124M model trained for both $1\times$ Chinchilla steps (left) and $4\times$ Chinchilla steps (right).

![Refer to caption](https://arxiv.org/html/2605.11172v1/x5.png)

![Refer to caption](https://arxiv.org/html/2605.11172v1/x6.png)

Figure 3: SODA with optimism (referred to as SODA†) is competitive with the best wrapped optimizer. In comparison with SODA(Muon), the configuration simplifies the method by replacing Adam with Lion and reusing the same hyperparameters for the momentum across all layers.

Throughout the experiments, Muon refers to the official implementation, which uses Adam for the first and last layer and Nesterov momentum for hidden layers (corresponding to optimism in [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) with $\bar{\alpha}_k=\alpha_k$). The Scion optimizer, on the other hand, disables the Nesterov momentum ($\bar{\alpha}_k=0$) and uses Signum instead of Adam for the first and last layer. Experiments are conducted on nanoGPT on FineWeb100 (see LABEL:tbl:hyperparams:nanoGPT for full details). In the experiments, the token budget is expressed in units of Chinchilla (Hoffmann et al., [2022](https://arxiv.org/html/2605.11172#bib.bib20)), where $1\times$ Chinchilla corresponds to $20\times\#(\text{parameters})$.
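For orientation, the Chinchilla unit above converts to a token count as follows. This is plain arithmetic on the stated $20\times\#(\text{parameters})$ rule; the helper name is hypothetical, and using exactly 124,000,000 parameters for the "124M" model is an assumption. Converting tokens to optimizer steps would additionally require the batch size and sequence length, which are not given in this section.

```python
def chinchilla_tokens(num_parameters, multiple=1, tokens_per_parameter=20):
    """Token budget for `multiple` x Chinchilla at 20 tokens per parameter."""
    return round(multiple * tokens_per_parameter * num_parameters)


if __name__ == "__main__":
    n_params = 124_000_000  # assumption: nominal size of the 124M model
    for multiple in (1, 4):
        print(multiple, chinchilla_tokens(n_params, multiple))
```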

#### Transfer across horizon

Zero-shot hyperparameter transfer results, most notably $\mu$P, show that learning rates and related optimization hyperparameters can transfer reliably across width (Yang et al., [2021](https://arxiv.org/html/2605.11172#bib.bib51)). However, these transfer results are typically demonstrated at a fixed training horizon. This leaves open the more practically important question of how weight decay should transfer across horizon, since for longer runs the optimal weight decay typically changes.

To isolate the effect of the horizon, we keep the model width fixed and vary only the training horizon. We find that the optimal weight decay *decreases* with the horizon for Muon (cf. [Figure 1](https://arxiv.org/html/2605.11172#S5.F1)), thus demonstrating that the $1/\texttt{model\_size}$ choice made in Xiao ([2024](https://arxiv.org/html/2605.11172#bib.bib48)) and Qiu et al. ([2025](https://arxiv.org/html/2605.11172#bib.bib43)), which is constant in the horizon, would be suboptimal outside the Chinchilla scaling rule, where the model size and horizon are scaled proportionally (Hoffmann et al., [2022](https://arxiv.org/html/2605.11172#bib.bib20)).

Furthermore, we find that the theoretically motivated $1/(k+2)$ weight decay scaling used by the SODA wrapper ([Algorithm 1](https://arxiv.org/html/2605.11172#alg1)) consistently leads to further improvement without requiring any additional tuning. In [Figure 1](https://arxiv.org/html/2605.11172#S5.F1), SODA outperforms the base optimizer (Muon) even when the latter is given its best-tuned weight decay, showing that the gain is not simply due to a favorable retuning of that hyperparameter. Taken together, these results suggest that SODA provides a principled mechanism for transferring weight decay across horizon, without the need to tune the weight decay even on the smaller proxy model. The SODA wrapper relies on a non-zero center iterate $z^0$, whose importance we ablate in LABEL:fig:val_loss_z0.

#### SODA Wrapper

The benefit of the SODA Wrapper ([Algorithm 1](https://arxiv.org/html/2605.11172#alg1)) is not exclusive to Muon. We additionally apply the wrapper to Adam and Scion and observe consistent improvements across training horizons in [Figure 2](https://arxiv.org/html/2605.11172#S5.F2), notably without introducing any additional tuning.

![Refer to caption](https://arxiv.org/html/2605.11172v1/x7.png)

Figure 4: SODA is effective under $1\times$ Chinchilla scaling, and the benefit increases with scale.

#### Optimism and SODA†

Considering the benefit of optimism (Muon) in the overtrained regime of [Figure 2](https://arxiv.org/html/2605.11172#S5.F2), we systematically investigate the impact of optimism in SODA, reported in LABEL:tab:optimism of LABEL:app:experiments. We use the $\ell_\infty$-norm for the input and output layer and the spectral norm for hidden layers, following Pethick et al. ([2025a](https://arxiv.org/html/2605.11172#bib.bib41)). SODA with optimism, which borrows the same optimized configuration as the Muon baseline ($\alpha_k=\bar{\alpha}_k=0.05$), shows the best performance under both $1\times$ and $4\times$ Chinchilla. Notably, this SODA configuration removes the use of Adam in Muon and, in contrast to Muon, uses the same momentum hyperparameter across all layers, greatly simplifying tuning. We mark SODA with this specific optimistic setting as SODA†. [Figure 3](https://arxiv.org/html/2605.11172#S5.F3) shows validation loss compared with the two best settings found in [Figure 2](https://arxiv.org/html/2605.11172#S5.F2).

#### Transfer across horizon & width

In [Figure 4](https://arxiv.org/html/2605.11172#S5.F4), we test SODA† on nanoGPT with between 64M and 1B parameters following Chinchilla-style scaling, where width and horizon grow proportionally. We choose (u)Scion as the base optimizer since its stepsize can be transferred along the width and horizon (Pethick et al., [2025a](https://arxiv.org/html/2605.11172#bib.bib41)). SODA† consistently outperforms Scion/uScion across model sizes, and the gap becomes more apparent as the model grows larger. A similar conclusion holds for SODA(uScion) (cf. LABEL:fig:scale_width_horizon_supp).

## 6 Conclusion

This work provides a new perspective on weight decay: beyond acting as a regularizer, it can be understood as a form of primal averaging. We show that carefully scheduling this averaging parameter ($\lambda_k$ in [SODA](https://arxiv.org/html/2605.11172#S3.Ex6)) yields acceleration in a precise theoretical sense and can lead to practical speedups without hyperparameter tuning.

Several directions open up. One is applying [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) to finetuning, where standard weight decay is often suboptimal. Since [SODA](https://arxiv.org/html/2605.11172#S3.Ex6) regularizes with respect to the pretrained model $z^0$ rather than the origin, it may be better suited for this setting. Another promising direction is to develop practical accelerated instantiations of [SODA](https://arxiv.org/html/2605.11172#S3.Ex6).

## 7 Acknowledgments

This work was funded by the Swiss National Science Foundation (SNSF) under grant number 2000-1-240094. This work was supported by the Swiss AI Initiative (2025 Fellowship Program). This work was supported with project ID #37 as part of the Swiss AI Initiative, through a grant from the ETH Domain and computational resources provided by the Swiss National Supercomputing Centre (CSCS) under the Alps infrastructure.

## References

- Bauschke and Lucet [2012] H Bauschke and Yves Lucet. What is a Fenchel conjugate. *Notices of the AMS*, 59(1):44–46, 2012.
- Bernstein and Newhouse [2024] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. *arXiv:2409.20325*, 2024.
- Bernstein et al. [2018] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In *International Conference on Machine Learning*, pages 560–569. PMLR, 2018.
- Carlson et al. [2015a] David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted boltzmann machines. In *Artificial Intelligence and Statistics*, 2015a.
- Carlson et al. [2016] David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models. *IEEE Journal of Selected Topics in Signal Processing*, 2016.
- Carlson et al. [2015b] David E Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spectral descent for deep learning. In *Proceedings of the 28th International Conference on Neural Information Processing Systems*, pages 2971–2979, 2015b.
- Chen et al. [2024] Lizhang Chen, Bo Liu, Kaizhao Liang, and Qiang Liu. Lion secretly solves a constrained optimization: As lyapunov predicts. In *International Conference on Learning Representations*, 2024. URL [https://openreview.net/forum?id=e4xS9ZarDr](https://openreview.net/forum?id=e4xS9ZarDr).
- Chen et al. [2023] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V Le. Symbolic discovery of optimization algorithms. In *Thirty-seventh Conference on Neural Information Processing Systems*, 2023. URL [https://openreview.net/forum?id=ne6zeqLFCZ](https://openreview.net/forum?id=ne6zeqLFCZ).
- Clarkson [2010] Kenneth L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. *ACM Trans. Algorithms*, 2010.
- Crawshaw et al. [2025] Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M Gower. An exploration of non-euclidean gradient descent: Muon and its many variants. *arXiv preprint arXiv:2510.09827*, 2025.
- Cutkosky [2019] Ashok Cutkosky. Anytime online-to-batch, optimism and acceleration. In *International conference on machine learning*, pages 1446–1454. PMLR, 2019.
- Defazio et al. [2024] Aaron Defazio, Xingyu Yang, Harsh Mehta, Konstantin Mishchenko, Ahmed Khaled, and Ashok Cutkosky. The road less scheduled. In *Advances in Neural Information Processing Systems*, volume 37, pages 9974–10007, 2024. doi:10.52202/079017-0320. URL [https://proceedings.neurips.cc/paper_files/paper/2024/file/136b9a13861308c8948cd308ccd02658-Paper-Conference.pdf](https://proceedings.neurips.cc/paper_files/paper/2024/file/136b9a13861308c8948cd308ccd02658-Paper-Conference.pdf).
- Defazio et al. [2025] Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, and Lin Xiao. Smoothing DiLoCo with primal averaging for faster training of LLMs. *arXiv preprint arXiv:2512.17131*, 2025.
- Douillard et al. [2023] Arthur Douillard, Qixuan Feng, Andrei A Rusu, Rachita Chhaparia, Yani Donchev, Adhiguna Kuncoro, Marc’Aurelio Ranzato, Arthur Szlam, and Jiajun Shen. Diloco: Distributed low-communication training of language models. *arXiv preprint arXiv:2311.08105*, 2023.
- Dozat [2016] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
- Ferbach et al. [2026] Damien Ferbach, Courtney Paquette, Gauthier Gidel, Katie Everett, and Elliot Paquette. Logarithmic-time schedules for scaling language models with momentum. *arXiv preprint arXiv:2602.05298*, 2026.
- Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. *Naval research logistics quarterly*, 1956.
- Gupta et al. [2018] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In *International Conference on Machine Learning*, 2018.
- Hanson and Pratt [1988] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. *Advances in neural information processing systems*, 1, 1988.
- Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In *Advances in Neural Information Processing Systems*, 2022.
- Jaggi [2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In *International Conference on Machine Learning*, 2013.
- Jelassi and Defazio [2020] Samy Jelassi and Aaron Defazio. Dual averaging is surprisingly effective for deep learning optimization. *arXiv preprint arXiv:2010.10502*, 2020.
- Jordan et al. [2024a] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024a. URL [https://github.com/KellerJordan/modded-nanogpt](https://github.com/KellerJordan/modded-nanogpt).
- Jordan et al. [2024b] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024b.
- Joulani et al. [2020] Pooria Joulani, Anant Raj, Andras Gyorgy, and Csaba Szepesvári. A simpler approach to accelerated optimization: iterative averaging meets optimism. In *International conference on machine learning*, pages 4984–4993. PMLR, 2020.
- Kallusky et al. [2025] Dominik Kallusky, Vinay Rao, Vishal Nandavanam, and Hao-Jun Michael Shi. SNOO: Step-k Nesterov outer optimizer-the surprising effectiveness of Nesterov momentum applied to pseudo-gradients. *arXiv preprint arXiv:2510.15830*, 2025.
- Kavis et al. [2019] Ali Kavis, Kfir Y Levy, Francis Bach, and Volkan Cevher. Unixgrad: A universal, adaptive algorithm with optimal guarantees for constrained optimization. *Advances in Neural Information Processing Systems*, 32, 2019.
- Kingma and Ba [2014] DP Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In *International Conference on Learning Representations*, 2014.
- Kunstner et al. [2023] Frederik Kunstner, Jacques Chen, Jonathan Wilder Lavington, and Mark Schmidt. Noise is not the main factor behind the gap between sgd and adam on transformers, but sign descent might be. *arXiv preprint arXiv:2304.13960*, 2023.
- Lan [2012] Guanghui Lan. An optimal method for stochastic composite optimization. *Mathematical Programming*, 133(1):365–397, 2012.
- Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In *International Conference on Learning Representations*, 2019.
- Mohri and Yang [2016] Mehryar Mohri and Scott Yang. Accelerating online convex optimization via adaptive prediction. In *Artificial Intelligence and Statistics*, pages 848–856. PMLR, 2016.
- Mokhtari et al. [2020] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. *Journal of Machine Learning Research*, 2020.
- Morwani et al. [2025] Depen Morwani, Nikhil Vyas, Hanlin Zhang, and Sham Kakade. Connections between schedule-free optimizers, AdEMAMix, and accelerated sgd variants. *arXiv preprint arXiv:2502.02431*, 2025.
- Nesterov [2005] Yu Nesterov. Smooth minimization of non-smooth functions. *Mathematical programming*, 103:127–152, 2005.
- Nesterov [2012] Yu Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. *SIAM Journal on Optimization*, 22(2):341–362, 2012.
- Nesterov and Shikhman [2015] Yu Nesterov and Vladimir Shikhman. Quasi-monotone subgradient methods for nonsmooth convex minimization. *Journal of Optimization Theory and Applications*, 165(3):917–940, 2015.
- Nesterov [2009] Yurii Nesterov. Primal-dual subgradient methods for convex problems. *Mathematical programming*, 120(1):221–259, 2009.
- Orabona [2019] Francesco Orabona. A modern introduction to online learning. *CoRR*, abs/1912.13213, 2019. URL [http://arxiv.org/abs/1912.13213](http://arxiv.org/abs/1912.13213).
- Pagliardini et al. [2024] Matteo Pagliardini, Pierre Ablin, and David Grangier. The AdEMAMix optimizer: Better, faster, older. *arXiv preprint arXiv:2409.03137*, 2024.
- Pethick et al. [2025a] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained LMOs. In *International Conference on Machine Learning*, 2025a.
- Pethick et al. [2025b] Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, and Volkan Cevher. Generalized gradient norm clipping & non-euclidean $(L_0,L_1)$-smoothness. *arXiv preprint arXiv:2506.01913*, 2025b.
- Qiu et al. [2025] Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, and Andrew Gordon Wilson. Hyperparameter transfer enables consistent gains of matrix-preconditioned optimizers across scales. *arXiv preprint arXiv:2512.05620*, 2025.
- Rakhlin and Sridharan [2013] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In *Conference on Learning Theory*, pages 993–1019. PMLR, 2013.
- Scetbon et al. [2025] Meyer Scetbon, Chao Ma, Wenbo Gong, and Edward Meeds. Gradient multi-normalization for stateless and scalable LLM training. *arXiv preprint arXiv:2502.06742*, 2025.
- Schaipp et al. [2025] Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli, and Francis Bach. The surprising agreement between convex optimization theory and learning-rate scheduling for large model training. *arXiv preprint arXiv:2501.18965*, 2025.
- Tseng [2008] Paul Tseng. On accelerated proximal gradient methods for convex-concave optimization. *submitted to SIAM Journal on Optimization*, 2(3), 2008.
- Xiao [2024] Lechao Xiao. Rethinking conventional wisdom in machine learning: From generalization to scaling. *arXiv preprint arXiv:2409.15156*, 2024.
- Xie and Li [2024] Shuo Xie and Zhiyuan Li. Implicit bias of AdamW: $\ell_\infty$ norm constrained optimization. *arXiv preprint arXiv:2404.04454*, 2024.
- Xie et al. [2025] Zhenda Xie, Yixuan Wei, Huanqi Cao, Chenggang Zhao, Chengqi Deng, Jiashi Li, Damai Dai, Huazuo Gao, Jiang Chang, Liang Zhao, et al. mHC: Manifold-constrained hyper-connections. *arXiv preprint arXiv:2512.24880*, 2025.
- Yang et al. [2021] Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. In *Advances in Neural Information Processing Systems*, 2021.
- Zhang et al. [2019] Michael R. Zhang, James Lucas, Geoffrey Hinton, and Jimmy Ba. Lookahead Optimizer: K steps forward, 1 step back. *arXiv*, December 2019.

Appendix
