A Rod Flow Model for Adam at the Edge of Stability
Summary
This paper introduces a 'rod flow' model for Adam and other adaptive optimizers to better analyze their behavior at the edge of stability. It extends continuous-time modeling to momentum methods, showing improved accuracy in tracking discrete iterates compared to stable flow models.
# A Rod Flow Model for Adam at the Edge of Stability
Source: [https://arxiv.org/html/2605.06821](https://arxiv.org/html/2605.06821)
###### Abstract
Cohen et al. [[2022](https://arxiv.org/html/2605.06821#bib.bib10)] observed that adaptive gradient methods such as Adam operate at the edge of stability. While there has been significant work on continuous-time modeling of gradient descent at the edge of stability, the extension of these models to momentum methods remains underdeveloped. In the gradient descent setting, Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)] introduced *rod flow*, which models consecutive iterates as an extended one-dimensional object, a “rod.” Here we extend rod flow to Adam by working in the joint phase space of parameters and first moment $(w,m)$ and treating the second moment $\nu$ as a smooth auxiliary variable. We also develop rod flows for heavy ball momentum, Nesterov momentum, and scalar and per-component versions of RMSProp, Adam, and NAdam. For all eight optimizers, we empirically evaluate rod flow on representative machine learning architectures, where it tracks the discrete iterates through the edge-of-stability regime significantly more accurately than the corresponding stable flow.
## 1 Introduction
Neural networks are trained by minimizing loss functions with gradient-based optimizers. Cohen et al. [[2021](https://arxiv.org/html/2605.06821#bib.bib2)] observed that full-batch gradient descent operates at the *edge of stability* (EoS): the largest eigenvalue of the Hessian, called the *sharpness*, first rises (a phase called *progressive sharpening*) and then hovers at the stability threshold $2/\eta$, where $\eta$ is the learning rate. Cohen et al. [[2022](https://arxiv.org/html/2605.06821#bib.bib10)] extended this picture to momentum methods and adaptive gradient methods, showing that each optimizer exhibits its own edge of stability. Rather than hovering at $2/\eta$, the relevant quantity, the *preconditioned sharpness*, hovers at a threshold that depends on the optimizer and its hyperparameters ([Table 2](https://arxiv.org/html/2605.06821#A5.T2)).
In practice, the dominant optimizer in machine learning is Adam [Kingma and Ba, [2015](https://arxiv.org/html/2605.06821#bib.bib26)], which differs from gradient descent in two respects. First, it is a momentum method: rather than updating parameters with the gradient at the current point, it maintains an exponential moving average $m$ of past gradients and updates based on this momentum. Second, it is an adaptive method: when updating the parameters, Adam divides each component of the momentum by $\sqrt{\nu}$, where $\nu$ is an exponential moving average of the squared gradients. The effect is to normalize each parameter’s update by the typical magnitude of its gradient.
Continuous-time models have long served as a valuable theoretical tool for analyzing discrete-time optimizers [Li et al., [2017](https://arxiv.org/html/2605.06821#bib.bib20), Barrett and Dherin, [2021](https://arxiv.org/html/2605.06821#bib.bib8), Shi et al., [2021](https://arxiv.org/html/2605.06821#bib.bib18)], clarifying their implicit biases and asymptotic behavior. Such a model would be especially valuable for Adam, the workhorse of modern deep learning.
However, developing such a model that remains valid at the edge of stability is non-trivial. At EoS, the discrete iterates oscillate along the sharpest directions of the Hessian. But the naïve continuous-time limit, called the “stable flow,” admits no oscillations.
While there is a rich and growing body of continuous-time theory for gradient descent at the edge of stability [Arora et al., [2022](https://arxiv.org/html/2605.06821#bib.bib5), Damian et al., [2023](https://arxiv.org/html/2605.06821#bib.bib4), Cohen et al., [2025](https://arxiv.org/html/2605.06821#bib.bib3), Regis and Chewi, [2026](https://arxiv.org/html/2605.06821#bib.bib1)], the corresponding theory for Adam and its close relatives (heavy ball momentum, Nesterov momentum, NAdam) remains underdeveloped. Existing continuous-time analyses of Adam either take a vanishing-step-size limit that discards oscillatory dynamics by construction [Belotto da Silva and Gazeau, [2020](https://arxiv.org/html/2605.06821#bib.bib39), Barakat and Bianchi, [2021](https://arxiv.org/html/2605.06821#bib.bib38)] or study finite-step-size effects that stop short of the oscillatory EoS regime [Malladi et al., [2022](https://arxiv.org/html/2605.06821#bib.bib33), Cattaneo et al., [2024](https://arxiv.org/html/2605.06821#bib.bib31)]. The regime that practitioners actually use, finite-step-size Adam operating at the edge of stability, currently lacks a principled continuous-time model.
### 1.1 Our Approach
Recently, Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)] introduced *rod flow*: a continuous-time model of gradient descent at the edge of stability. Rod flow tracks a smoothly evolving center $\bar{w}$, the average of consecutive iterates, and an extent tensor $\Sigma$, the outer product of the half-displacement between consecutive iterates with itself. The physical picture is intuitive: rather than following a point that oscillates back and forth along the sharp direction of the Hessian, one follows an extended one-dimensional object, a “rod,” whose length and orientation encode the oscillation.
By extending the rod flow framework to momentum and adaptive gradient methods, we develop a continuous-time model for Adam at the edge of stability.
Extending rod flow to Adam requires two modifications. First, the rod must be lifted from weight space $w\in\mathbb{R}^{d}$ to phase space $z=(w,m)\in\mathbb{R}^{2d}$. For momentum methods operating at the edge of stability, the first moment $m$ oscillates in lockstep with $w$, so the rod’s center and half-difference must be defined over both coordinates. This extension handles both heavy-ball and Nesterov momentum.
Second, the preconditioner must be tracked as a smooth auxiliary variable. The second moment $\nu$ in RMSProp and Adam is an exponential moving average of the *squared* gradient; the squaring kills the sign flips, so $\nu$ varies smoothly even when $w$ oscillates at the edge of stability. It therefore needs no rod structure of its own.
Adam is thus modeled as phase-space rod flow in the adaptive metric defined by $\nu$.
### 1.2 Contributions
- A derivation of rod flows for heavy-ball and Nesterov momentum, extending the original weight-space model to phase space ([Section 4](https://arxiv.org/html/2605.06821#S4)).
- A smooth-auxiliary treatment of the preconditioner, yielding rod flows for scalar and per-component RMSProp ([Section 5](https://arxiv.org/html/2605.06821#S5)).
- A rod flow for Adam and NAdam that combines the phase-space and preconditioner extensions ([Section 6](https://arxiv.org/html/2605.06821#S6)).
- An empirical evaluation for Adam showing that the rod flow tracks the discrete iterates accurately and exhibits self-stabilization of the preconditioned sharpness at the theoretically predicted thresholds ([Section 7](https://arxiv.org/html/2605.06821#S7)).
## 2 Related Work
Edge of stability. For full-batch gradient descent, Cohen et al. [[2021](https://arxiv.org/html/2605.06821#bib.bib2)] documented progressive sharpening and edge of stability across architectures, with sharpness rising to $2/\eta$ and then hovering. Theoretical explanations include self-stabilization [Damian et al., [2023](https://arxiv.org/html/2605.06821#bib.bib4)], two-step analysis [Ahn et al., [2023](https://arxiv.org/html/2605.06821#bib.bib6), Chen and Bruna, [2023](https://arxiv.org/html/2605.06821#bib.bib9)], convergence in the unstable regime [Ahn et al., [2022](https://arxiv.org/html/2605.06821#bib.bib15), Arora et al., [2022](https://arxiv.org/html/2605.06821#bib.bib5), Zhu et al., [2023](https://arxiv.org/html/2605.06821#bib.bib14)], progressive sharpening in quadratic regression [Agarwala et al., [2023](https://arxiv.org/html/2605.06821#bib.bib7)], the catapult mechanism [Lewkowycz et al., [2020](https://arxiv.org/html/2605.06821#bib.bib11), Ghosh et al., [2023](https://arxiv.org/html/2605.06821#bib.bib34)] and its momentum-induced amplification [Phunyaphibarn et al., [2024](https://arxiv.org/html/2605.06821#bib.bib47)], loss landscape analysis [Ma et al., [2022](https://arxiv.org/html/2605.06821#bib.bib21)], bifurcation theory [Song and Yun, [2023](https://arxiv.org/html/2605.06821#bib.bib19)], chaotic GD dynamics [Kong and Tao, [2020](https://arxiv.org/html/2605.06821#bib.bib36), Chen et al., [2024](https://arxiv.org/html/2605.06821#bib.bib22)], regularity-induced implicit biases [Wang et al., [2025](https://arxiv.org/html/2605.06821#bib.bib30)], and NTK evolution at EoS [Jiang et al., [2025](https://arxiv.org/html/2605.06821#bib.bib23)]. Cohen et al. [[2022](https://arxiv.org/html/2605.06821#bib.bib10)] empirically identify edge of stability for adaptive gradient methods.
Adam. The original Adam method is due to Kingma and Ba [[2015](https://arxiv.org/html/2605.06821#bib.bib26)]; variants include NAdam [Dozat, [2016](https://arxiv.org/html/2605.06821#bib.bib27)], AMSGrad [Reddi et al., [2018](https://arxiv.org/html/2605.06821#bib.bib28)], and AdamW [Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.06821#bib.bib29)]. We analyze Adam and NAdam in their standard EMA form. Continuous-time analyses of Adam have all operated in small-step or near-minimum regimes rather than through EoS: foundational ODE limits [Barakat and Bianchi, [2021](https://arxiv.org/html/2605.06821#bib.bib38), Dereich et al., [2025](https://arxiv.org/html/2605.06821#bib.bib42)], general adaptive-ODE frameworks [Belotto da Silva and Gazeau, [2020](https://arxiv.org/html/2605.06821#bib.bib39)], SDE approximations with scaling rules [Malladi et al., [2022](https://arxiv.org/html/2605.06821#bib.bib33)], IMEX time-stepping [Bhattacharjee et al., [2024](https://arxiv.org/html/2605.06821#bib.bib43)] and integro-differential [Heredia, [2024](https://arxiv.org/html/2605.06821#bib.bib44)] views, control-theoretic framings [Chakrabarti and Chopra, [2024](https://arxiv.org/html/2605.06821#bib.bib45)], stability in the hyperparameter-stable regime [Gould and Tanaka, [2024](https://arxiv.org/html/2605.06821#bib.bib32)], backward error analysis at small learning rates [Cattaneo et al., [2024](https://arxiv.org/html/2605.06821#bib.bib31)], slow-SDE sharpness reduction near minimizers [Li et al., [2025](https://arxiv.org/html/2605.06821#bib.bib40)], Adam-flow implicit bias on homogeneous networks [Wang et al., [2021](https://arxiv.org/html/2605.06821#bib.bib41)], and mechanistic accounts of EoS-induced loss spikes [Bai et al., [2025](https://arxiv.org/html/2605.06821#bib.bib46)]. None of these captures finite-step-size Adam at EoS.
Continuous-time models. Gradient flow is the classical continuous limit of gradient descent but fails at the edge of stability since it cannot capture oscillations. Rosca et al. [[2023](https://arxiv.org/html/2605.06821#bib.bib12)] study instabilities in gradient flow models. Shi et al. [[2021](https://arxiv.org/html/2605.06821#bib.bib18)] derive high-resolution ODEs for momentum methods, while Li et al. [[2017](https://arxiv.org/html/2605.06821#bib.bib20)] develop stochastic modified equations for SGD. Modified equations from backward error analysis are standard in numerical analysis [Wilkinson, [1963](https://arxiv.org/html/2605.06821#bib.bib24), [1965](https://arxiv.org/html/2605.06821#bib.bib25), Hairer et al., [2006](https://arxiv.org/html/2605.06821#bib.bib16)]; Barrett and Dherin [[2021](https://arxiv.org/html/2605.06821#bib.bib8)] and Smith et al. [[2021](https://arxiv.org/html/2605.06821#bib.bib13)] use this framework to understand implicit regularization in gradient descent, and Di Giovacchino et al. [[2024](https://arxiv.org/html/2605.06821#bib.bib17)] provide methodological validation. The closest related work is Central Flow [Cohen et al., [2025](https://arxiv.org/html/2605.06821#bib.bib3)], which derives continuous-time models for gradient descent, scalar RMSProp, and per-component RMSProp at the edge of stability. The direct predecessor of this work is Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)], which introduces a rod flow for gradient descent at EoS. We extend the rod flow framework to phase space and to adaptive methods.
## 3 Rod Flow for Gradient Descent
We briefly recap the rod flow for gradient descent, introduced in Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)]. A full rederivation appears in [Appendix B](https://arxiv.org/html/2605.06821#A2).
Consider gradient descent with learning rate $\eta$:
$$w_{t+1} = w_{t} - \eta\,\nabla L(w_{t}). \tag{1}$$
Define the center and half-difference of consecutive iterates:
$$\bar{w}_{t} = \tfrac{1}{2}(w_{t+1} + w_{t}), \tag{2}$$
$$\delta_{t} = \tfrac{1}{2}(w_{t+1} - w_{t}). \tag{3}$$
The rod flow framework exploits the fact that, even when individual iterates $w_{t}$ oscillate rapidly at the edge of stability, the center $\bar{w}_{t}$ and extent tensor $\delta_{t}\otimes\delta_{t}$ vary smoothly, which makes them amenable to continuous-time ODE approximation.
For brevity, we write $L_{\pm} \coloneqq L(\bar{w}\pm\delta)$, with $\nabla L_{\pm}$ and $\nabla^{2}L_{\pm}$ denoting the corresponding gradients and Hessians. The discrete difference equations for $\bar{w}_{t}$ and $\delta_{t}\otimes\delta_{t}$ are:
$$\bar{w}_{t+1} - \bar{w}_{t} = -\frac{\eta}{2}\bigl(\nabla L_{+} + \nabla L_{-}\bigr), \tag{4}$$
$$\delta_{t+1}\otimes\delta_{t+1} - \delta_{t}\otimes\delta_{t} = \frac{\eta^{2}}{4}\bigl(\nabla L_{+}\otimes\nabla L_{+} + \nabla L_{-}\otimes\nabla L_{-}\bigr) - 2\,\delta_{t}\otimes\delta_{t}. \tag{5}$$
Let $\Sigma(t)$ denote the continuous-time analogue of the extent tensor $\delta_{t}\otimes\delta_{t}$. In continuous time, $\delta$ is identified with the principal eigenvector of $\Sigma$ scaled by the square root of the principal eigenvalue. Promoting [Equation 4](https://arxiv.org/html/2605.06821#S3.E4) and [Equation 5](https://arxiv.org/html/2605.06821#S3.E5) to ODEs yields the rod flow ODEs for gradient descent:
$$\frac{d\bar{w}}{dt} = -\eta\,\bar{g}, \tag{6}$$
$$\frac{d\Sigma}{dt} = \frac{\eta^{2}}{4}\bigl(\nabla L_{+}\otimes\nabla L_{+} + \nabla L_{-}\otimes\nabla L_{-}\bigr) - 2\Sigma, \tag{7}$$
where $\bar{g} = (\nabla L_{+} + \nabla L_{-})/2$ is the average of the gradients at the endpoints of the rod.
Note that the discrete difference equations for $\bar{w}_{t}$ and $\delta_{t}\otimes\delta_{t}$ are *exact*. The only approximation in the rod flow ODEs is replacing the discrete differences with continuous-time derivatives. Traditionally, such interpolation would require backward error analysis. The original derivation in Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)] includes an $O(\eta^{2})$ backward-error-analysis correction in [Equation 6](https://arxiv.org/html/2605.06821#S3.E6), which we drop throughout for simplicity as it does not qualitatively affect the dynamics. See [Appendix F](https://arxiv.org/html/2605.06821#A6) for further discussion of backward error analysis.
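The exactness of the difference equations (4) and (5) can be checked numerically. The following is a minimal sketch, not the authors' code: the toy loss, the function names, and the chosen step size are all illustrative assumptions. It takes three consecutive gradient descent iterates and confirms that both identities hold to floating-point precision.

```python
import numpy as np

def grad_L(w):
    """Gradient of a toy loss L(w) = sum(w**4)/4 + sum(w**2)/2 (illustrative assumption)."""
    return w**3 + w

def gd_step(w, eta):
    """One gradient descent step, Eq. (1)."""
    return w - eta * grad_L(w)

def rod_difference_residuals(w0, eta):
    """Residuals of Eq. (4) and Eq. (5) built from three consecutive GD iterates."""
    w1 = gd_step(w0, eta)
    w2 = gd_step(w1, eta)
    # Centers and half-differences of the rods (w0, w1) and (w1, w2).
    c0, d0 = 0.5 * (w1 + w0), 0.5 * (w1 - w0)
    c1, d1 = 0.5 * (w2 + w1), 0.5 * (w2 - w1)
    # Rod endpoints: c0 + d0 is w1 and c0 - d0 is w0.
    g_plus, g_minus = grad_L(c0 + d0), grad_L(c0 - d0)
    res_center = (c1 - c0) + (eta / 2) * (g_plus + g_minus)
    rhs_extent = (eta**2 / 4) * (np.outer(g_plus, g_plus) + np.outer(g_minus, g_minus))
    rhs_extent -= 2 * np.outer(d0, d0)
    res_extent = (np.outer(d1, d1) - np.outer(d0, d0)) - rhs_extent
    return res_center, res_extent

res_center, res_extent = rod_difference_residuals(np.array([0.7, -0.3, 1.1]), eta=0.1)
print(np.abs(res_center).max(), np.abs(res_extent).max())
```

Both residuals vanish up to rounding for any differentiable loss, because the identities use only the update rule and not any property of $L$.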
## 4 Rod Flow for Momentum Methods
Figure 1: Phase-Space Rod Flow. Left: Depiction of a phase-space rod. The endpoints represent consecutive phase-space iterates. Right: Sharpness trajectories from training a 3-layer MLP on CIFAR, showing the differing sharpness thresholds for gradient descent, heavy ball, and Nesterov momentum with the same step size.
We will now extend rod flow to heavy ball and Nesterov momentum. To do so, we will lift the rod from weight space to phase space.
### 4.1 Setup
We will work with heavy ball momentum, using the exponential moving average (EMA) parameterization throughout. The heavy ball momentum update equations are:
$$m_{t+1} = \beta\,m_{t} + (1-\beta)\,\nabla L(w_{t}), \tag{8}$$
$$w_{t+1} = w_{t} - \eta\,m_{t+1}, \tag{9}$$
where $\beta\in[0,1)$ is the momentum coefficient. This differs from the form in Polyak [[1964](https://arxiv.org/html/2605.06821#bib.bib35)], but we adopt this convention as it is the form in which Adam is typically written.
Just like gradient descent, heavy ball momentum on a quadratic has a sharpness threshold beyond which the iterates no longer converge. However, it is no longer the $2/\eta$ threshold: it is modified by the momentum. For heavy ball momentum with step size $\eta$ and momentum coefficient $\beta$, the sharpness threshold $S^{*}$ is:
$$S^{*} = \frac{2}{\eta}\cdot\frac{1+\beta}{1-\beta}. \tag{10}$$
Formally, one can write heavy ball momentum as a two-step recurrence relation. For the quadratic loss, this recurrence relation is linear, and $S^{*}$ corresponds to the threshold at which the eigenvalues of the transition operator leave the unit disk. More details are provided in [Appendix C](https://arxiv.org/html/2605.06821#A3).
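As a quick sanity check of the threshold in Equation (10), the sketch below writes heavy ball on a one-dimensional quadratic $L(w)=\tfrac{1}{2}\lambda w^{2}$ as a linear map on the state $(w_t, m_t)$ and computes its spectral radius on either side of $S^{*}$. Writing the recurrence as a one-step map on $(w, m)$ rather than a two-step recurrence in $w$ is a choice made here for illustration, as are the function names and the values of $\eta$ and $\beta$.

```python
import numpy as np

def heavy_ball_threshold(eta, beta):
    """Sharpness threshold S* from Eq. (10)."""
    return (2.0 / eta) * (1.0 + beta) / (1.0 - beta)

def transition_matrix(lam, eta, beta):
    """Linear map (w_t, m_t) -> (w_{t+1}, m_{t+1}) for L(w) = lam * w**2 / 2.

    From Eq. (8): m' = beta * m + (1 - beta) * lam * w.
    From Eq. (9): w' = w - eta * m'.
    """
    return np.array([
        [1.0 - eta * (1.0 - beta) * lam, -eta * beta],
        [(1.0 - beta) * lam, beta],
    ])

def spectral_radius(lam, eta, beta):
    """Largest eigenvalue modulus of the transition matrix."""
    return np.max(np.abs(np.linalg.eigvals(transition_matrix(lam, eta, beta))))

eta, beta = 0.01, 0.9
s_star = heavy_ball_threshold(eta, beta)
print(s_star)
print(spectral_radius(0.99 * s_star, eta, beta))  # below 1: iterates contract
print(spectral_radius(1.01 * s_star, eta, beta))  # above 1: iterates diverge
```

The determinant of this matrix is always $\beta$, and at $\lambda = S^{*}$ its characteristic polynomial factors as $(\mu+1)(\mu+\beta)$, so the instability sets in through an eigenvalue at $-1$, which is the sign-flipping oscillation described in the text.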
When heavy ball crosses into EoS, $w$ and $m$ oscillate in phase: both flip sign about the center each iteration. Because the position and momentum oscillate together, it is natural to concatenate them into a single phase-space vector $z$:
$$z_{t} = \begin{pmatrix} w_{t} \\ m_{t} \end{pmatrix} \in \mathbb{R}^{2d}. \tag{11}$$
In analogy with the rod flow for gradient descent, we can define the average of two consecutive phase-space iterates $\bar{z}$ and the half-displacement between consecutive phase-space iterates $\Delta$:
$$\bar{z}_{t} = \tfrac{1}{2}(z_{t+1} + z_{t}) = \begin{pmatrix} \bar{w}_{t} \\ \bar{m}_{t} \end{pmatrix}, \tag{12}$$
$$\Delta_{t} = \tfrac{1}{2}(z_{t+1} - z_{t}) = \begin{pmatrix} \delta_{t} \\ \gamma_{t} \end{pmatrix}, \tag{13}$$
where $\bar{m}_{t} = (m_{t+1} + m_{t})/2$ and $\gamma_{t} = (m_{t+1} - m_{t})/2$. We emphasize that $(\delta,\gamma)$ forms a *single* rod in phase space, not two separate rods in position and momentum space: the formalism is symmetric under the simultaneous flip $(\delta,\gamma)\mapsto(-\delta,-\gamma)$ but not under flipping just one of the coordinates.
### 4.2 Exact Difference Equations
Write the phase-space update as:
$$z_{t+1} = z_{t} + \Phi(z_{t}), \tag{14}$$
where
$$\Phi(z) = \begin{pmatrix} -\eta\bigl[\beta m + (1-\beta)\nabla L(w)\bigr] \\ (1-\beta)\bigl[\nabla L(w) - m\bigr] \end{pmatrix}. \tag{15}$$
Following the analogous derivation for gradient descent, the exact difference equations for the phase-space center and extent are:
$$\bar{z}_{t+1} - \bar{z}_{t} = \tfrac{1}{2}\bigl[\Phi_{+} + \Phi_{-}\bigr], \tag{16}$$
$$\Delta_{t+1}\otimes\Delta_{t+1} - \Delta_{t}\otimes\Delta_{t} = \tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr] - 2\,\Delta_{t}\otimes\Delta_{t}, \tag{17}$$
where $\Phi_{\pm} \coloneqq \Phi(\bar{z}_{t}\pm\Delta_{t})$. As in the GD case, both discrete difference equations are exact.
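The phase-space map $\Phi$ of Equation (15) and the identities (16) and (17) can be exercised in a few lines. This is a sketch under stated assumptions: the toy loss and all function names are hypothetical and chosen only for illustration. It first confirms that $z + \Phi(z)$ reproduces the heavy ball updates (8) and (9), then checks the two difference equations on three consecutive phase-space iterates.

```python
import numpy as np

def grad_L(w):
    """Gradient of a toy loss L(w) = sum(w**4)/4 + sum(w**2)/2 (illustrative assumption)."""
    return w**3 + w

def phi(z, eta, beta):
    """Phase-space increment of Eq. (15); z stacks (w, m) into one vector of length 2d."""
    d = z.size // 2
    w, m = z[:d], z[d:]
    g = grad_L(w)
    return np.concatenate([-eta * (beta * m + (1 - beta) * g), (1 - beta) * (g - m)])

def phase_space_residuals(z0, eta, beta):
    """Residuals of Eq. (16) and Eq. (17) for three consecutive phase-space iterates."""
    z1 = z0 + phi(z0, eta, beta)
    z2 = z1 + phi(z1, eta, beta)
    zbar0, D0 = 0.5 * (z1 + z0), 0.5 * (z1 - z0)
    zbar1, D1 = 0.5 * (z2 + z1), 0.5 * (z2 - z1)
    phi_p = phi(zbar0 + D0, eta, beta)  # evaluated at z1
    phi_m = phi(zbar0 - D0, eta, beta)  # evaluated at z0
    res_center = (zbar1 - zbar0) - 0.5 * (phi_p + phi_m)
    rhs = 0.25 * (np.outer(phi_p, phi_p) + np.outer(phi_m, phi_m)) - 2 * np.outer(D0, D0)
    res_extent = (np.outer(D1, D1) - np.outer(D0, D0)) - rhs
    return res_center, res_extent

eta, beta = 0.05, 0.9
w0, m0 = np.array([0.7, -0.3]), np.array([0.1, 0.2])
z0 = np.concatenate([w0, m0])

# Direct heavy ball step, Eqs. (8)-(9), for comparison with z0 + phi(z0).
m1 = beta * m0 + (1 - beta) * grad_L(w0)
w1 = w0 - eta * m1
z1_direct = np.concatenate([w1, m1])
z1_via_phi = z0 + phi(z0, eta, beta)

res_center, res_extent = phase_space_residuals(z0, eta, beta)
print(np.abs(z1_direct - z1_via_phi).max(), np.abs(res_center).max(), np.abs(res_extent).max())
```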
### 4.3 Phase-Space Rod Flow
Promoting the discrete difference equations to ODEs, we obtain the rod flow ODEs for heavy ball momentum:
$$\frac{d\bar{w}}{dt} = -\eta\bigl[\beta\bar{m} + (1-\beta)\bar{g}\bigr], \tag{18}$$
$$\frac{d\bar{m}}{dt} = (1-\beta)\bigl[\bar{g} - \bar{m}\bigr], \tag{19}$$
$$\frac{d\Sigma}{dt} = \tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr] - 2\,\Sigma. \tag{20}$$
Note that the extent tensor $\Sigma\in\mathbb{R}^{2d\times 2d}$ decomposes into four $d\times d$ blocks:
$$\Sigma = \begin{pmatrix} \Sigma_{\delta\delta} & \Sigma_{\delta\gamma} \\ \Sigma_{\gamma\delta} & \Sigma_{\gamma\gamma} \end{pmatrix}. \tag{21}$$
We again identify $\Delta = (\delta,\gamma)$ with the principal eigenvector of $\Sigma$ scaled by the square root of the principal eigenvalue.
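The identification of $\Delta$ with the scaled principal eigenvector of $\Sigma$ can be sketched as follows. The function name is hypothetical, and clamping the top eigenvalue at zero is an added numerical safeguard rather than something stated in the text. The recovered vector is defined only up to an overall sign, which is harmless because the formalism is symmetric under $(\delta,\gamma)\mapsto(-\delta,-\gamma)$.

```python
import numpy as np

def rod_from_extent(Sigma):
    """Recover the half-displacement Delta from a symmetric extent tensor Sigma.

    Returns sqrt(lambda_max) * v_max, where (lambda_max, v_max) is the principal
    eigenpair of Sigma. The overall sign of the result is arbitrary.
    """
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    lam_max = max(eigvals[-1], 0.0)  # clamp tiny negative values from rounding
    return np.sqrt(lam_max) * eigvecs[:, -1]

# Example with d = 2: Delta stacks delta (first two entries) and gamma (last two).
Delta_true = np.array([0.3, -0.1, 0.05, 0.2])
Sigma = np.outer(Delta_true, Delta_true)
Delta_rec = rod_from_extent(Sigma)
delta_rec, gamma_rec = Delta_rec[:2], Delta_rec[2:]
print(np.allclose(np.outer(Delta_rec, Delta_rec), Sigma))
```

When $\Sigma$ is exactly rank one, this recovers $\pm\Delta$. Along the flow $\Sigma$ need not stay rank one, in which case the principal eigenpair acts as a rank-one summary of the rod.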
### 4.4 Nesterov Momentum
Nesterov momentum is very similar to heavy ball momentum. The only difference is that the gradient in the momentum update ([Equation 8](https://arxiv.org/html/2605.06821#S4.E8)) is replaced by $\nabla L(w_{t} - \eta\beta m_{t})$: a look-ahead along the current momentum. For the Nesterov phase-space update, we have:
$$\Phi(z) = \begin{pmatrix} -\eta\bigl[\beta m + (1-\beta)\nabla L(w - \eta\beta m)\bigr] \\ (1-\beta)\bigl[\nabla L(w - \eta\beta m) - m\bigr] \end{pmatrix}. \tag{22}$$
The corresponding change in the rod flow is that $\nabla L_{\pm}$ are now evaluated at the look-ahead points:
$$\nabla L_{\pm} = \nabla L\bigl(\bar{w} - \eta\beta\bar{m} \pm [\delta - \eta\beta\gamma]\bigr). \tag{23}$$
Otherwise, the Nesterov rod flow has the same functional form as the heavy ball rod flow.
## 5 Rod Flow for Preconditioned Methods
We now turn to preconditioned methods. For simplicity, we will consider Scalar RMSProp and RMSProp. For these methods, $\nu$ acts as an estimator of the second moment of the gradient along the trajectory. The key observation is that, even at the edge of stability, $\nu$ does not oscillate because squaring kills sign flips.
### 5.1 Scalar RMSProp
Scalar RMSProp maintains a single scalar $\nu\in\mathbb{R}_{+}$, which follows the update rule:
$$\nu_{t+1} = \beta_{2}\,\nu_{t} + (1-\beta_{2})\,\|\nabla L(w_{t})\|^{2}, \tag{24}$$
where $\beta_{2}\in[0,1)$, so that $\nu$ is an exponential moving average of the squared gradient norm. Define the preconditioner:
$$P(\nu) = (\sqrt{\nu} + \varepsilon)\,I_{d}, \tag{25}$$
where $\varepsilon$ is a small constant for numerical stability. The position update becomes:
$$w_{t+1} = w_{t} - \eta\,P^{-1}_{t+1}\,\nabla L(w_{t}), \tag{26}$$
where $P_{t} = P(\nu_{t})$.
Even when $w$ oscillates at EoS, $\|\nabla L\|^{2}$ is non-negative and so does not change sign; $\nu$ therefore evolves smoothly. Because $\nu$ does not oscillate, we derive an ODE only for its midpoint $\bar{\nu}_{t} = (\nu_{t+1} + \nu_{t})/2$, not for the half-displacement.
For the discrete difference equation for $\bar{\nu}$, we have:
$$\bar{\nu}_{t+1} - \bar{\nu}_{t} = (1-\beta_{2})\left[\frac{\|\nabla L_{+}\|^{2} + \|\nabla L_{-}\|^{2}}{2} - \bar{\nu}\right], \tag{27}$$
which we can then promote to an ODE:
$$\frac{d\bar{\nu}}{dt} = (1-\beta_{2})\left[\frac{\|\nabla L_{+}\|^{2} + \|\nabla L_{-}\|^{2}}{2} - \bar{\nu}\right]. \tag{28}$$
We then obtain the following rod flow ODEs:
$$\frac{d\bar{w}}{dt} = -\eta\,P^{-1}(\bar{\nu})\,\bar{g}, \tag{29}$$
$$\frac{d\Sigma}{dt} = \Bigl(\frac{\eta}{2}P^{-1}(\bar{\nu})\Bigr)^{2}\bigl(\nabla L_{+}\otimes\nabla L_{+} + \nabla L_{-}\otimes\nabla L_{-}\bigr) - 2\Sigma. \tag{30}$$
Note that if we define the preconditioned step size
$$\tilde{\eta} = \eta\,P^{-1}(\bar{\nu}), \tag{31}$$
then the rod flow ODEs for $\bar{w}$ and $\Sigma$ are identical to those for gradient descent, except that $\tilde{\eta}$ replaces $\eta$. More details are provided in [Appendix D](https://arxiv.org/html/2605.06821#A4).
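A minimal sketch of Scalar RMSProp, following Equations (24) to (26), together with a numerical check that the midpoint difference equation (27) is exact. The toy loss, function names, and hyperparameter values are assumptions made for illustration. Here $\nabla L_{+}$ and $\nabla L_{-}$ are the gradients at the rod endpoints $w_{t+1}$ and $w_{t}$.

```python
import numpy as np

def grad_L(w):
    """Gradient of a toy loss L(w) = sum(w**4)/4 + sum(w**2)/2 (illustrative assumption)."""
    return w**3 + w

def scalar_rmsprop_step(w, nu, eta, beta2, eps=1e-8):
    """One Scalar RMSProp step: Eq. (24) for nu, then Eq. (26) using P(nu_{t+1})."""
    g = grad_L(w)
    nu_next = beta2 * nu + (1 - beta2) * np.dot(g, g)
    w_next = w - eta * g / (np.sqrt(nu_next) + eps)
    return w_next, nu_next

def nu_midpoint_residual(w0, nu0, eta, beta2):
    """Residual of Eq. (27) computed from two consecutive steps."""
    w1, nu1 = scalar_rmsprop_step(w0, nu0, eta, beta2)
    w2, nu2 = scalar_rmsprop_step(w1, nu1, eta, beta2)
    nu_bar0, nu_bar1 = 0.5 * (nu1 + nu0), 0.5 * (nu2 + nu1)
    g_plus, g_minus = grad_L(w1), grad_L(w0)  # gradients at the rod endpoints
    mean_sq = 0.5 * (np.dot(g_plus, g_plus) + np.dot(g_minus, g_minus))
    return (nu_bar1 - nu_bar0) - (1 - beta2) * (mean_sq - nu_bar0)

residual = nu_midpoint_residual(np.array([0.7, -0.3]), nu0=0.5, eta=0.01, beta2=0.9)
print(abs(residual))
```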
### 5.2 RMSProp
Figure 2: Edge of Stability for RMSProp. Left: Preconditioned sharpness $\lambda_{\max}(P^{-1/2}HP^{-1/2})$ hovers around the stability threshold $2/\eta$. Middle: The raw Hessian sharpness $\lambda_{\max}(H)$ continues to rise steadily. Right: After an initial transient, the norm of the second moment $\|\nu\|$ rises smoothly.
RMSProp differs from Scalar RMSProp in that instead of tracking a single scalar, it tracks a vector $\nu\in\mathbb{R}^{d}$ of per-component second-moment estimates:
$$\nu_{t+1} = \beta_{2}\,\nu_{t} + (1-\beta_{2})\,\nabla L(w_{t})^{\odot 2}, \tag{32}$$
where $v^{\odot 2}$ denotes the elementwise square of a vector. The ODE for $\bar{\nu}$ is:
$$\frac{d\bar{\nu}}{dt} = (1-\beta_{2})\biggl[\frac{\nabla L_{+}^{\odot 2} + \nabla L_{-}^{\odot 2}}{2} - \bar{\nu}\biggr]. \tag{33}$$
The preconditioner is now defined as:
$$P(\nu) = \operatorname{diag}(\sqrt{\nu}) + \varepsilon I_{d}, \tag{34}$$
where the square root is applied component-wise. Otherwise, the functional form for RMSProp is identical to that of Scalar RMSProp.
### 5.3 Acceleration via Regularization
A major difference between preconditioned methods and gradient descent is the precise definition of the edge-of-stability condition. The relevant sharpness measure is no longer $\lambda_{\max}(H)$ but $\lambda_{\max}(P^{-1/2}HP^{-1/2})$: it is the *preconditioned* sharpness that hovers at $2/\eta$.
This leads to a feedback loop that gradient descent does not exhibit. In gradient descent, the only mechanism for relieving excess sharpness (sharpness above $2/\eta$) is for the oscillation amplitude $\Sigma$ to grow until it self-stabilizes due to higher-order terms in the loss. In RMSProp, there is a second channel: as oscillations grow, so do the gradients along the sharp direction, which drives $\nu$ upward. Since the preconditioned sharpness threshold remains at $2/\eta$ while the corresponding actual sharpness threshold is $2\sqrt{\nu}/\eta$, growing $\nu$ absorbs excess sharpness without requiring the oscillation amplitude to grow as large. Central Flow [Cohen et al., [2025](https://arxiv.org/html/2605.06821#bib.bib3)] calls this “acceleration via regularization”: by inflating $\nu$, the preconditioner raises the sharpness it can tolerate at the cost of shrinking the effective step size $\eta/\sqrt{\nu}$, a trade-off that in practice leads to faster progress in the long run.
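For the scalar preconditioner of Equation (25), the raw sharpness threshold $2\sqrt{\nu}/\eta$ quoted above follows in one line. The short derivation below is a worked step added for clarity; it assumes the scalar case and treats $\varepsilon$ as negligible relative to $\sqrt{\nu}$.

```latex
\begin{gathered}
P(\nu) = (\sqrt{\nu}+\varepsilon)\,I_d
  \quad\Longrightarrow\quad
  P^{-1/2} H P^{-1/2} = \frac{H}{\sqrt{\nu}+\varepsilon}, \\[4pt]
\lambda_{\max}\bigl(P^{-1/2} H P^{-1/2}\bigr) = \frac{2}{\eta}
  \quad\Longleftrightarrow\quad
  \lambda_{\max}(H) = \frac{2(\sqrt{\nu}+\varepsilon)}{\eta}
  \approx \frac{2\sqrt{\nu}}{\eta}
  \qquad (\varepsilon \ll \sqrt{\nu}).
\end{gathered}
```

In the per-component case the preconditioner is diagonal rather than a multiple of the identity, so the raw threshold no longer reduces to a single scalar expression.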
## 6 Rod Flow for Adam
We are now ready to obtain the rod flow for Adam. It is a straightforward combination of the phase-space rod of [Section 4](https://arxiv.org/html/2605.06821#S4) and the smooth auxiliary $\bar{\nu}$ of [Section 5](https://arxiv.org/html/2605.06821#S5).
### 6.1 Setup
The update equations for Adam are:
$$m_{t+1} = \beta_{1}\,m_{t} + (1-\beta_{1})\,\nabla L(w_{t}), \tag{35}$$
$$\nu_{t+1} = \beta_{2}\,\nu_{t} + (1-\beta_{2})\,\nabla L(w_{t})^{\odot 2}, \tag{36}$$
$$w_{t+1} = w_{t} - \eta\,P_{t+1}^{-1}\,m_{t+1}. \tag{37}$$
Adam combines momentum with a preconditioner. The momentum $m$ is an exponential moving average of the gradient, the second moment $\nu$ is an exponential moving average of the component-wise squared gradient, and the position is updated using the momentum rescaled by the preconditioner.
One technical detail is bias correction. The equations above define the raw moments, but because $m$ and $\nu$ are initialized at zero, these estimates are biased downward during early training. To compensate, Adam defines bias-corrected first and second moments:
$$\hat{m}_{t} = \frac{m_{t}}{1-\beta_{1}^{t}}, \tag{38}$$
$$\hat{\nu}_{t} = \frac{\nu_{t}}{1-\beta_{2}^{t}}, \tag{39}$$
and the position update uses the corrected versions:
$$w_{t+1} = w_{t} - \eta\,P^{-1}(\hat{\nu}_{t+1})\,\hat{m}_{t+1}. \tag{40}$$
The bias correction matters more when the EMA coefficient is close to 1, as it takes longer for the estimates to reach their steady-state values. RMSProp typically does not use a bias correction because $\beta_{2}$ is usually smaller (a typical value is 0.9, compared to 0.999 for Adam). More details on the computational implementation of the bias correction can be found in [Appendix G](https://arxiv.org/html/2605.06821#A7).
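The bias-corrected update of Equations (35) to (40) can be sketched as a single function. The function name, argument names, and default hyperparameters are assumptions made for illustration. The argument `step` is the index of the moments being produced, so `step = 1` for the first update from zero-initialized moments.

```python
import numpy as np

def adam_step(w, m, nu, grad, step, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction.

    `step` is the index t+1 of the new moments (1 for the first update).
    """
    m_next = beta1 * m + (1 - beta1) * grad              # Eq. (35)
    nu_next = beta2 * nu + (1 - beta2) * grad**2         # Eq. (36)
    m_hat = m_next / (1 - beta1**step)                   # Eq. (38)
    nu_hat = nu_next / (1 - beta2**step)                 # Eq. (39)
    w_next = w - eta * m_hat / (np.sqrt(nu_hat) + eps)   # Eq. (40), diagonal P
    return w_next, m_next, nu_next

# First step from zero moments: the corrected moments equal g and g**2,
# so the update is approximately -eta * sign(g) in every coordinate.
w0 = np.array([1.0, -2.0])
g0 = np.array([2.0, -0.5])
w1, m1, nu1 = adam_step(w0, np.zeros(2), np.zeros(2), g0, step=1)
print(w1 - w0)
```

Without the correction, the same first step would be scaled by $(1-\beta_1)/\sqrt{1-\beta_2}$, about 3.16 times larger at these default values, which illustrates how large the bias is early in training.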
For notational simplicity, we omit the bias correction below.
### 6.2 Adam Rod Flow ODEs
As with momentum methods, we define the phase-space update:
$$z_{t+1} = z_{t} + \Phi_{t}, \tag{41}$$
where
$$\Phi_{t} = \begin{pmatrix} -\eta\,P_{t+1}^{-1}\bigl[\beta_{1}m_{t} + (1-\beta_{1})\nabla L(w_{t})\bigr] \\ (1-\beta_{1})\bigl[\nabla L(w_{t}) - m_{t}\bigr] \end{pmatrix}. \tag{42}$$
The rod flow ODEs for Adam are:
$$\frac{d\bar{w}}{dt} = -\eta\,P^{-1}(\bar{\nu})\bigl[\beta_{1}\bar{m} + (1-\beta_{1})\bar{g}\bigr], \tag{43}$$
$$\frac{d\bar{m}}{dt} = (1-\beta_{1})\bigl[\bar{g} - \bar{m}\bigr], \tag{44}$$
$$\frac{d\bar{\nu}}{dt} = (1-\beta_{2})\biggl[\frac{\nabla L_{+}^{\odot 2} + \nabla L_{-}^{\odot 2}}{2} - \bar{\nu}\biggr], \tag{45}$$
$$\frac{d\Sigma}{dt} = \tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr] - 2\,\Sigma. \tag{46}$$
Analogous to the relationship between gradient descent and RMSProp, the rod flow ODEs for $\bar{w}$, $\bar{m}$, and $\Sigma$ are identical to their heavy-ball counterparts, except that $\eta$ is replaced by the preconditioned step size $\tilde{\eta} = \eta P^{-1}(\bar{\nu})$. The full derivation is given in [Appendix E](https://arxiv.org/html/2605.06821#A5).
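The right-hand side of Equations (43) to (46) could be assembled roughly as follows. This is a hedged sketch, not the authors' implementation: the toy loss and all names are hypothetical, bias correction is omitted as in the text, and evaluating $\Phi_{\pm}$ with the preconditioner $P(\bar{\nu})$ is an assumption based on the statement that $\eta$ is replaced by $\tilde{\eta} = \eta P^{-1}(\bar{\nu})$.

```python
import numpy as np

def grad_L(w):
    """Gradient of a toy loss L(w) = sum(w**4)/4 + sum(w**2)/2 (illustrative assumption)."""
    return w**3 + w

def rod_from_extent(Sigma):
    """Principal eigenvector of Sigma scaled by the square root of its eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    return np.sqrt(max(eigvals[-1], 0.0)) * eigvecs[:, -1]

def adam_rod_flow_rhs(w_bar, m_bar, nu_bar, Sigma, eta, beta1, beta2, eps=1e-8):
    """Time derivatives (dw, dm, dnu, dSigma) of the Adam rod flow, Eqs. (43)-(46)."""
    d = w_bar.size
    Delta = rod_from_extent(Sigma)
    delta, gamma = Delta[:d], Delta[d:]
    p_inv = 1.0 / (np.sqrt(nu_bar) + eps)  # diagonal of P^{-1}(nu_bar), Eq. (34)

    def phi(w, m):
        # Phase-space increment of Eq. (42), with P_{t+1} replaced by P(nu_bar).
        g = grad_L(w)
        top = -eta * p_inv * (beta1 * m + (1 - beta1) * g)
        bottom = (1 - beta1) * (g - m)
        return np.concatenate([top, bottom]), g

    phi_p, g_p = phi(w_bar + delta, m_bar + gamma)
    phi_m, g_m = phi(w_bar - delta, m_bar - gamma)
    g_bar = 0.5 * (g_p + g_m)

    dw = -eta * p_inv * (beta1 * m_bar + (1 - beta1) * g_bar)                 # Eq. (43)
    dm = (1 - beta1) * (g_bar - m_bar)                                        # Eq. (44)
    dnu = (1 - beta2) * (0.5 * (g_p**2 + g_m**2) - nu_bar)                    # Eq. (45)
    dSigma = 0.25 * (np.outer(phi_p, phi_p) + np.outer(phi_m, phi_m)) - 2 * Sigma  # Eq. (46)
    return dw, dm, dnu, dSigma

w_bar = np.array([0.5, -0.2])
m_bar = np.array([0.1, 0.3])
nu_bar = np.array([0.04, 0.09])
dw, dm, dnu, dSigma = adam_rod_flow_rhs(
    w_bar, m_bar, nu_bar, np.zeros((4, 4)), eta=1e-3, beta1=0.9, beta2=0.999
)
print(dw, dm, dnu)
```

These derivatives can be passed to any standard ODE integrator with unit time corresponding to one optimizer step. With a zero-length rod, as in the example, $\Phi_{+} = \Phi_{-}$ and the extent equation reduces to $d\Sigma/dt = \tfrac{1}{2}\Phi\otimes\Phi$.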
### 6.3 Adam at the Edge of Stability
Cohen et al. [[2022](https://arxiv.org/html/2605.06821#bib.bib10)] showed that Adam at EoS equilibrates at the preconditioned sharpness threshold

$$\lambda_{\max}\bigl(P^{-1/2} H P^{-1/2}\bigr) = \frac{2}{\eta}\cdot\frac{1+\beta_1}{1-\beta_1}, \tag{47}$$

which is the heavy-ball threshold in the preconditioned metric set by $\nu$.
One possible explanation for why Adam is so “forgiving” as an optimizer is that momentum and preconditioning combine to give it higher stability thresholds than gradient descent. The momentum raises the effective sharpness threshold by a factor of $(1+\beta_1)/(1-\beta_1)$ relative to the non-momentum case. The preconditioner, meanwhile, means that it is the *preconditioned* sharpness that must hover at the edge of stability. Through the acceleration-via-regularization mechanism, the actual (unpreconditioned) sharpness is free to continue rising during training.
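For a sense of scale, the threshold in Eq. (47) can be evaluated for the hyperparameters listed in the Figure 3 caption. The function name below is a hypothetical helper, and the snippet is plain arithmetic on Eq. (47) rather than anything from the authors' code.

```python
def adam_eos_threshold(eta: float, beta1: float) -> float:
    """Preconditioned-sharpness threshold of Eq. (47): (2/eta) * (1+beta1)/(1-beta1)."""
    return (2.0 / eta) * (1.0 + beta1) / (1.0 - beta1)

# (architecture, eta, beta1) as quoted in the Figure 3 caption.
settings = [("MLP", 1e-4, 0.8), ("CNN", 1e-4, 0.5), ("ViT", 7.5e-5, 0.4)]

for name, eta, beta1 in settings:
    gd_threshold = 2.0 / eta
    threshold = adam_eos_threshold(eta, beta1)
    ratio = threshold / gd_threshold
    print(f"{name}: 2/eta = {gd_threshold:.0f}, Eq. (47) = {threshold:.0f} ({ratio:.2f}x)")
```

The momentum factor alone is 9 for $\beta_1 = 0.8$ and 3 for $\beta_1 = 0.5$, so even moderate momentum coefficients move the threshold well above $2/\eta$.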
### 6.4 NAdam
NAdam is a variant of Adam that mimics the effect of Nesterov momentum without an explicit look-ahead gradient. The momentum update is identical to Adam’s, but the position update uses a modified momentum

$$\widetilde{m}_{t+1} = \beta_1 m_{t+1} + (1-\beta_1)\, g_t \tag{48}$$

in place of $m_{t+1}$. This biases the position update towards the most recent gradient, reproducing the effect of a look-ahead step while evaluating the gradient only at the current iterate. Further details on NAdam are given in [Appendix E](https://arxiv.org/html/2605.06821#A5).
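The difference between the two step directions is easiest to see in scalar form. The sketch below (function names are illustrative, and the preconditioner and bias correction are left out) computes the quantity that multiplies the step size for Adam and for NAdam via Eq. (48).

```python
def adam_direction(m, g, beta1):
    """Adam: update the EMA and step along it. Returns (m_next, direction)."""
    m_next = beta1 * m + (1 - beta1) * g
    return m_next, m_next

def nadam_direction(m, g, beta1):
    """NAdam: same EMA update, but step along the modified momentum of Eq. (48)."""
    m_next = beta1 * m + (1 - beta1) * g
    m_tilde = beta1 * m_next + (1 - beta1) * g
    return m_next, m_tilde

beta1 = 0.9
# Weight each method places on the current gradient (m = 0, g = 1) ...
print(adam_direction(0.0, 1.0, beta1)[1], nadam_direction(0.0, 1.0, beta1)[1])
# ... and on the previous momentum (m = 1, g = 0).
print(adam_direction(1.0, 0.0, beta1)[1], nadam_direction(1.0, 0.0, beta1)[1])
```

Expanding Eq. (48) gives weight $1-\beta_1^2$ on the current gradient and $\beta_1^2$ on the old momentum, compared with $1-\beta_1$ and $\beta_1$ for Adam.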
## 7 Experiments

Figure 3: Experimental Results. Left: MLP ($\eta = 10^{-4}$, $\beta_1 = 0.8$, $\beta_2 = 0.999$). Center: CNN ($\eta = 10^{-4}$, $\beta_1 = 0.5$, $\beta_2 = 0.999$). Right: ViT ($\eta = 7.5 \times 10^{-5}$, $\beta_1 = 0.4$, $\beta_2 = 0.999$). Within each column, the top panel shows the loss over time, the middle panel shows the effective sharpness over time, and the bottom panel shows the distance between the centers of the stable flow and rod flow relative to the discrete Adam trajectory.

We evaluate the Adam rod flow on three representative neural network architectures (MLP, CNN, and ViT [Dosovitskiy et al., [2021](https://arxiv.org/html/2605.06821#bib.bib49)]). We compare the rod flow to the discrete update and the “stable flow”, the naïve continuous-time limit of the discrete optimizer.
### 7.1 Setup
All three models are trained full-batch on the same subset of $5{,}000$ examples from CIFAR-10 [Krizhevsky, [2009](https://arxiv.org/html/2605.06821#bib.bib48)]. The MLP is a 3-layer fully-connected network, the CNN is a 2-layer convolutional network, and the ViT is a 3-layer vision transformer.
For each experiment, we tracked the loss, sharpness, distance between centers, and other relevant comparison quantities. As in Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)], we used a warm-up phase so that each flow was initialized in the steady-state edge-of-stability regime. Full architectural and hyperparameter details, including the warm-up procedure, can be found in [Appendix H](https://arxiv.org/html/2605.06821#A8).
### 7.2 Results
As shown in [Figure 3](https://arxiv.org/html/2605.06821#S7.F3), the Adam rod flow tracks the center of the discrete iterates through the edge-of-stability regime on all three architectures, matching it several orders of magnitude more closely than the stable flow does. It also stabilizes at the correct value of the preconditioned sharpness. The rod flows for the other optimizers perform similarly ([Appendix I](https://arxiv.org/html/2605.06821#A9)).
One limitation is that rod flow struggles to accurately track the iterates for larger values of the momentum coefficient $\beta$. This may be due to the prevalence of period-doubling bifurcations: if the iterates oscillate in $4$-cycles rather than $2$-cycles, the assumptions underlying rod flow break down.
## 8 Conclusion
##### Limitations.
Rod flow is a full-batch model and does not apply to the mini-batch setting. It also assumes the steady-state edge-of-stability regime, and does not model the transient early-training phase before oscillations equilibrate. Furthermore, it remains too costly to serve as a practical substitute for running the discrete optimizer. Finally, rod flow says nothing about *progressive sharpening*, the empirical phenomenon by which sharpness drifts upward before reaching the EoS threshold, which existing theory does not yet fully explain.
##### Future work.
The most natural extensions are to the mini-batch setting, where stochastic noise may require augmenting the rod flow equations with diffusion terms, and to optimizers beyond the Adam family that share a momentum-plus-preconditioner structure, such as Muon [Jordan et al., [2024](https://arxiv.org/html/2605.06821#bib.bib52)], Lion [Chen et al., [2023](https://arxiv.org/html/2605.06821#bib.bib51)], and Shampoo [Gupta et al., [2018](https://arxiv.org/html/2605.06821#bib.bib50)]. Finally, a formal characterization of when the rod flow approximation breaks down remains open.
## Acknowledgements
## References
- A. Agarwala, F. Pedregosa, and J. Pennington (2023). Second-order regression models exhibit progressive sharpening to the edge of stability. In Proceedings of the 40th International Conference on Machine Learning, ICML'23.
- K. Ahn, S. Bubeck, S. Chewi, Y. T. Lee, F. Suarez, and Y. Zhang (2023). Learning threshold neurons via edge of stability. Advances in Neural Information Processing Systems 36, pp. 19540–19569.
- K. Ahn, J. Zhang, and S. Sra (2022). Understanding the unstable convergence of gradient descent. In International Conference on Machine Learning, pp. 247–257.
- S. Arora, Z. Li, and A. Panigrahi (2022). Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, pp. 948–1024.
- Z. Bai, Z. Zhou, J. Zhao, X. Li, Z. Li, F. Xiong, H. Yang, Y. Zhang, and Z. J. Xu (2025). Adaptive preconditioners trigger loss spikes in Adam. arXiv preprint arXiv:2506.04805.
- A. Barakat and P. Bianchi (2021). Convergence and dynamical behavior of the ADAM algorithm for nonconvex stochastic optimization. SIAM Journal on Optimization 31(1), pp. 244–274.
- D. Barrett and B. Dherin (2021). Implicit gradient regularization. In International Conference on Learning Representations.
- A. Belotto da Silva and M. Gazeau (2020). A general system of differential equations to model first-order adaptive algorithms. Journal of Machine Learning Research 21(129), pp. 1–42.
- A. Bhattacharjee, A. A. Popov, A. Sarshar, and A. Sandu (2024). Improving the adaptive moment estimation (ADAM) stochastic optimizer through an implicit-explicit (IMEX) time-stepping approach. Journal of Machine Learning for Modeling and Computing 5(3), pp. 47–68.
- M. D. Cattaneo, J. M. Klusowski, and B. Shigida (2024). On the implicit bias of Adam. In Proceedings of the 41st International Conference on Machine Learning, R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp (Eds.), Proceedings of Machine Learning Research, Vol. 235, pp. 5862–5906.
- K. Chakrabarti and N. Chopra (2024). A control theoretic framework for adaptive gradient optimizers. Automatica 160, pp. 111466.
- L. Chen and J. Bruna (2023). Beyond the edge of stability via two-step gradient updates. In International Conference on Machine Learning, pp. 4330–4391.
- X. Chen, C. Liang, D. Huang, E. Real, K. Wang, H. Pham, X. Dong, T. Luong, C. Hsieh, Y. Lu, and Q. V. Le (2023). Symbolic discovery of optimization algorithms. In Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36, pp. 49205–49233.
- X. Chen, K. Balasubramanian, P. Ghosal, and B. K. Agrawalla (2024). From stability to chaos: analyzing gradient descent dynamics in quadratic regression. Transactions on Machine Learning Research.
- J. Cohen, A. Damian, A. Talwalkar, J. Z. Kolter, and J. D. Lee (2025). Understanding optimization in deep learning with central flows. In The Thirteenth International Conference on Learning Representations.
- J. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar (2021). Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations.
- J. M. Cohen, B. Ghorbani, S. Krishnan, N. Agarwal, S. Medapati, M. Badura, D. Suo, D. Cardoze, Z. Nado, G. E. Dahl, and J. Gilmer (2022). Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484.
- A. Damian, E. Nichani, and J. D. Lee (2023). Self-stabilization: the implicit bias of gradient descent at the edge of stability. In The Eleventh International Conference on Learning Representations.
- S. Dereich, A. Jentzen, and S. Kassing (2025). ODE approximation for the Adam algorithm: general and overparametrized setting. arXiv preprint arXiv:2511.04622.
- S. Di Giovacchino, D. J. Higham, and K. C. Zygalakis (2024). Backward error analysis and the qualitative behaviour of stochastic optimization algorithms: application to stochastic coordinate descent. Journal of Computational Dynamics.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021). An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
- T. Dozat (2016). Incorporating Nesterov momentum into Adam. In International Conference on Learning Representations Workshop.
- A. Ghosh, H. Lyu, X. Zhang, and R. Wang (2023). Implicit regularization in heavy-ball momentum accelerated stochastic gradient descent. In International Conference on Learning Representations.
- R. Gould and H. Tanaka (2024). Continuous-time analysis of adaptive optimization and normalization. arXiv preprint arXiv:2411.05746.
- V. Gupta, T. Koren, and Y. Singer (2018). Shampoo: preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1842–1850.
- E. Hairer, C. Lubich, and G. Wanner (2006). Geometric numerical integration: structure-preserving algorithms for ordinary differential equations. Springer.
- C. Heredia (2024). Modeling AdaGrad, RMSProp, and Adam with integro-differential equations. arXiv preprint arXiv:2411.09734.
- K. Jiang, J. Cohen, and Y. Li (2025). Understanding the evolution of the neural tangent kernel at the edge of stability. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems.
- K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein (2024). Muon: an optimizer for hidden layers in neural networks. [Link](https://kellerjordan.github.io/posts/muon/).
- D. P. Kingma and J. Ba (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- L. Kong and M. Tao (2020). Stochasticity of deterministic gradient descent: large learning rate for multiscale objective function. In Advances in Neural Information Processing Systems 33.
- A. Krizhevsky (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
- A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari (2020). The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218.
- Q. Li, C. Tai, and W. E (2017). Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110.
- X. Li, H. Wen, and K. Lyu (2025). Adam reduces a unique form of sharpness: theoretical insights near the minimizer manifold. In Advances in Neural Information Processing Systems.
- I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In International Conference on Learning Representations.
- C. Ma, D. Kunin, L. Wu, and L. Ying (2022). Beyond the quadratic approximation: the multiscale structure of neural network loss landscapes. Journal of Machine Learning 1(3), pp. 247–267.
- S. Malladi, K. Lyu, A. Panigrahi, and S. Arora (2022). On the SDEs and scaling rules for adaptive gradient algorithms. In Advances in Neural Information Processing Systems.
- Y. E. Nesterov (1983). A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Proceedings of the USSR Academy of Sciences 269, pp. 543–547.
- P. Phunyaphibarn, J. Lee, B. Wang, H. Zhang, and C. Yun (2024). Gradient descent with Polyak’s momentum finds flatter minima via large catapults. arXiv preprint arXiv:2311.15051.
- B. T. Polyak (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), pp. 1–17.
- S. J. Reddi, S. Kale, and S. Kumar (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations.
- E. Regis and S. Chewi (2026). Rod flow: a continuous-time model for gradient descent at the edge of stability. arXiv preprint arXiv:2602.01480.
- M. Rosca, Y. Wu, C. Qin, and B. Dherin (2023). On a continuous time model of gradient descent dynamics and instability in deep learning. Transactions on Machine Learning Research.
- B. Shi, S. S. Du, W. Su, and M. I. Jordan (2021). A continuous-time analysis of momentum methods. Journal of Machine Learning Research 22(1), pp. 5915–5957.
- S. L. Smith, B. Dherin, D. Barrett, and S. De (2021). On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations.
- M. Song and C. Yun (2023). Trajectory alignment: understanding the edge of stability phenomenon via bifurcation theory. In Advances in Neural Information Processing Systems, Vol. 36.
- B. Wang, Q. Meng, W. Chen, and T. Liu (2021). The implicit bias for adaptive optimization algorithms on homogeneous neural networks. In International Conference on Machine Learning, pp. 10849–10858.
- Y. Wang, Z. Xu, T. Zhao, and M. Tao (2025). Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult. Journal of Machine Learning Research 26(273), pp. 1–68.
- J. H. Wilkinson (1963). Rounding errors in algebraic processes. Prentice-Hall.
- J. H. Wilkinson (1965). The algebraic eigenvalue problem. Clarendon Press, Oxford.
- X. Zhu, Z. Wang, X. Wang, M. Zhou, and R. Ge (2023). Understanding edge-of-stability training dynamics with a minimalist example. In International Conference on Learning Representations.
## Appendix A Notation
Table 1: Unified notation used throughout the paper.
## Appendix B Rod Flow for Gradient Descent
We briefly rederive the gradient descent rod flow, first introduced in Regis and Chewi [[2026](https://arxiv.org/html/2605.06821#bib.bib1)].
Recall the update equation for gradient descent:

$$w_{t+1} = w_t - \eta \nabla L(w_t). \tag{49}$$

Define the midpoint and half-difference of consecutive iterates:

$$\begin{aligned}
\bar{w}_t &= \tfrac{1}{2}(w_{t+1} + w_t), && \text{(50)}\\
\delta_t &= \tfrac{1}{2}(w_{t+1} - w_t). && \text{(51)}
\end{aligned}$$

At the edge of stability, the GD iterates oscillate every step about their running center. However, even while the individual iterates oscillate, the midpoint $\bar{w}_t$ and the outer product $\delta_t \otimes \delta_t$ vary smoothly. So we can hope to model these two quantities in continuous time.
We seek recurrence equations for $\bar{w}_t$ and $\delta_t \otimes \delta_t$ satisfying two conditions, each motivated by the eventual promotion to an ODE.
1. The right-hand sides should be expressible purely in terms of $\bar{w}_t$ and $\delta_t \otimes \delta_t$, without reference to the raw iterates $w_t$. Otherwise, the system is not closed in the variables we want to pass to continuous time.
2. The recurrences should be invariant under $\delta \mapsto -\delta$. In continuous time, the sign of $\delta$ will no longer be a meaningful distinction.
Note that the original iterates can be recovered from the transformed variables via:

$$\begin{aligned}
w_t &= \bar{w}_t - \delta_t, && \text{(52)}\\
w_{t+1} &= \bar{w}_t + \delta_t, && \text{(53)}
\end{aligned}$$

and that the half-displacement can be expressed in terms of the gradient as:

$$\delta_t = -\frac{\eta}{2}\nabla L(w_t). \tag{54}$$

The difference equation for the midpoint is:

$$\begin{aligned}
\bar{w}_{t+1} - \bar{w}_t &= \frac{w_{t+2} + w_{t+1}}{2} - \frac{w_{t+1} + w_t}{2}\\
&= \frac{w_{t+2} - w_{t+1}}{2} + \frac{w_{t+1} - w_t}{2}\\
&= -\frac{\eta}{2}\bigl[\nabla L(w_{t+1}) + \nabla L(w_t)\bigr]\\
&= -\frac{\eta}{2}\bigl[\nabla L(\bar{w}_t + \delta_t) + \nabla L(\bar{w}_t - \delta_t)\bigr].
\end{aligned}$$

And the difference equation for the outer product of the half-difference is:

$$\begin{aligned}
\delta_{t+1} \otimes \delta_{t+1} - \delta_t \otimes \delta_t &= \delta_{t+1} \otimes \delta_{t+1} - \delta_t \otimes \delta_t + (\delta_t \otimes \delta_t - \delta_t \otimes \delta_t)\\
&= \delta_{t+1} \otimes \delta_{t+1} + \delta_t \otimes \delta_t - 2\,\delta_t \otimes \delta_t\\
&= \frac{\eta^2}{4}\bigl[\nabla L_+ \otimes \nabla L_+ + \nabla L_- \otimes \nabla L_-\bigr] - 2\,\delta_t \otimes \delta_t.
\end{aligned}$$

Here $\nabla L_\pm = \nabla L(\bar{w}_t \pm \delta_t)$, and the last line applies Eq. (54) at steps $t+1$ and $t$. It should be emphasized that both difference equations are *exact*. The only approximation in the rod flow framework is the step where we promote these difference equations to ODEs.
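The exactness of the two recurrences can be checked numerically. The sketch below runs gradient descent on a one-dimensional toy loss (the loss, step size, and starting point are arbitrary illustrative choices) and evaluates the residual of both difference equations; in one dimension the outer product reduces to a square.

```python
def grad(w):
    """Gradient of the illustrative loss L(w) = w**4 / 4."""
    return w ** 3

eta = 0.5
w = [1.0]
for _ in range(6):
    w.append(w[-1] - eta * grad(w[-1]))

max_residual = 0.0
for t in range(len(w) - 2):
    w_bar = 0.5 * (w[t + 1] + w[t])
    delta = 0.5 * (w[t + 1] - w[t])
    w_bar_next = 0.5 * (w[t + 2] + w[t + 1])
    delta_next = 0.5 * (w[t + 2] - w[t + 1])
    g_plus, g_minus = grad(w_bar + delta), grad(w_bar - delta)
    # Residual of the midpoint recurrence.
    r_mid = (w_bar_next - w_bar) + 0.5 * eta * (g_plus + g_minus)
    # Residual of the outer-product recurrence (scalar case).
    rhs = 0.25 * eta ** 2 * (g_plus ** 2 + g_minus ** 2) - 2 * delta ** 2
    r_outer = (delta_next ** 2 - delta ** 2) - rhs
    max_residual = max(max_residual, abs(r_mid), abs(r_outer))

print(max_residual)
```

Both residuals vanish up to floating-point rounding, with no assumption that the iterates are at the edge of stability.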
Let $\Sigma(t)$ denote the continuous-time analog of $\delta \otimes \delta$. We obtain the following rod flow ODEs for gradient descent:

$$\begin{aligned}
\frac{d\bar{w}}{dt} &= -\eta\,\bar{g}, && \text{(55)}\\
\frac{d\Sigma}{dt} &= \frac{\eta^2}{4}\,\bigl(\nabla L_+ \otimes \nabla L_+ + \nabla L_- \otimes \nabla L_-\bigr) - 2\Sigma, && \text{(56)}
\end{aligned}$$

where we identify $\delta$ with the principal eigenvector of $\Sigma$ scaled by the square root of the principal eigenvalue.
We introduce the $\Sigma$ notation for two reasons. First, in continuous time, $\Sigma$ will no longer be exactly rank-one, so denoting the extent as $\delta \otimes \delta$ would be misleading. One should note though that $\Sigma$ does remain *approximately* rank-one: during experiments, its largest eigenvalue is consistently several orders of magnitude larger than the second-largest eigenvalue. Second, this notation aligns with Central Flow [Cohen et al., [2025](https://arxiv.org/html/2605.06821#bib.bib3)], where $\Sigma$ plays the role of the covariance matrix of the oscillations.
While the extent tensor of rod flow and the covariance matrix of Central Flow play similar roles, there are important conceptual differences. Rod flow’s extent tensor is always approximately rank-one, even when there are multiple sharp directions, or when there are no oscillations at all (as is the case when the iterates are *not* at the edge of stability). The covariance matrix of Central Flow, by contrast, has rank equal to the number of sharp directions.
## Appendix C Rod Flow for Heavy Ball and Nesterov Momentum
### C.1 Derivation of Rod Flows
#### C.1.1 Heavy Ball
The heavy ball momentum update equations are:

$$\begin{aligned}
m_{t+1} &= \beta\, m_t + (1-\beta)\,\nabla L(w_t), && \text{(57)}\\
w_{t+1} &= w_t - \eta\, m_{t+1}, && \text{(58)}
\end{aligned}$$

where $\beta \in [0,1)$ is the momentum coefficient.
When heavy ball crosses into EoS, $w$ and $m$ oscillate in phase: both flip sign about the center each iteration. Because the position and momentum oscillate together, it is natural to concatenate them into a single phase-space vector $z$:

$$z_t = \begin{pmatrix} w_t \\ m_t \end{pmatrix} \in \mathbb{R}^{2d}. \tag{59}$$

In analogy with the rod flow for gradient descent, define the average of two consecutive phase-space iterates $\bar{z}$ and the half-displacement between them $\Delta$:

$$\begin{aligned}
\bar{z}_t &= \tfrac{1}{2}(z_{t+1} + z_t) = \begin{pmatrix} \bar{w}_t \\ \bar{m}_t \end{pmatrix}, && \text{(60)}\\
\Delta_t &= \tfrac{1}{2}(z_{t+1} - z_t) = \begin{pmatrix} \delta_t \\ \gamma_t \end{pmatrix}, && \text{(61)}
\end{aligned}$$

where

$$\begin{aligned}
\bar{m}_t &= \frac{m_{t+1} + m_t}{2}, && \text{(62)}\\
\gamma_t &= \frac{m_{t+1} - m_t}{2} && \text{(63)}
\end{aligned}$$

are the momentum midpoint and the momentum half-difference, respectively.
We emphasize that $(\delta, \gamma)$ forms a *single* rod in phase space, not two separate rods in position and momentum space: the formalism is symmetric under the simultaneous flip $(\delta, \gamma) \mapsto (-\delta, -\gamma)$, but not under flipping just one of the coordinates. This is why the phase-space structure is necessary: when we pass to continuous time, we will extract $\delta$ and $\gamma$ jointly from the principal eigenvector of $\Sigma$, and the phase-space structure ensures their signs are consistent.
Write the phase-space update as:

$$z_{t+1} = z_t + \Phi(z_t), \qquad \Phi(z) = \begin{pmatrix} -\eta\bigl[\beta m + (1-\beta)\nabla L(w)\bigr] \\ (1-\beta)\bigl[\nabla L(w) - m\bigr] \end{pmatrix}. \tag{64}$$

As in the gradient descent case, the phase-space midpoint $\bar{z}_t$ and the outer product $\Delta_t \otimes \Delta_t$ both evolve smoothly at EoS, so these are the quantities we model in continuous time. Writing $\Phi_t := \Phi(z_t)$, the difference equation for $\bar{z}$ is

$$\begin{aligned}
\bar{z}_{t+1} - \bar{z}_t &= \tfrac{1}{2}(z_{t+2} + z_{t+1}) - \tfrac{1}{2}(z_{t+1} + z_t)\\
&= \tfrac{1}{2}(\Phi_{t+1} + \Phi_t)\\
&= \tfrac{1}{2}\bigl[\Phi(\bar{z}_t + \Delta_t) + \Phi(\bar{z}_t - \Delta_t)\bigr],
\end{aligned}$$

where the last line uses $z_{t+1} = \bar{z}_t + \Delta_t$ and $z_t = \bar{z}_t - \Delta_t$.
For the difference equation for $\Delta \otimes \Delta$:

$$\begin{aligned}
\Delta_{t+1} \otimes \Delta_{t+1} - \Delta_t \otimes \Delta_t &= \Delta_{t+1} \otimes \Delta_{t+1} - \Delta_t \otimes \Delta_t + (\Delta_t \otimes \Delta_t - \Delta_t \otimes \Delta_t)\\
&= \Delta_{t+1} \otimes \Delta_{t+1} + \Delta_t \otimes \Delta_t - 2\,\Delta_t \otimes \Delta_t\\
&= \tfrac{1}{4}\bigl(\Phi_+ \otimes \Phi_+ + \Phi_- \otimes \Phi_-\bigr) - 2\,\Delta_t \otimes \Delta_t,
\end{aligned}$$

where $\Phi_\pm = \Phi(\bar{z}_t \pm \Delta_t)$ and the last line uses $\Delta_{t+1} = \tfrac{1}{2}\Phi_{t+1}$ and $\Delta_t = \tfrac{1}{2}\Phi_t$. As in the gradient descent case, both difference equations are exact.
Promoting the difference equations to ODEs gives the rod flow for heavy ball momentum:

$$\begin{aligned}
\frac{d\bar{w}}{dt} &= -\eta\bigl[\beta\,\bar{m} + (1-\beta)\,\bar{g}\bigr], && \text{(65)}\\
\frac{d\bar{m}}{dt} &= (1-\beta)\bigl[\bar{g} - \bar{m}\bigr], && \text{(66)}\\
\frac{d\Sigma}{dt} &= \tfrac{1}{4}\bigl[\Phi_+ \otimes \Phi_+ + \Phi_- \otimes \Phi_-\bigr] - 2\,\Sigma, && \text{(67)}
\end{aligned}$$

with $\bar{g} = (\nabla L_+ + \nabla L_-)/2$. Note that the extent tensor $\Sigma \in \mathbb{R}^{2d \times 2d}$ decomposes into four $d \times d$ blocks:

$$\Sigma = \begin{pmatrix} \Sigma_{\delta\delta} & \Sigma_{\delta\gamma} \\ \Sigma_{\gamma\delta} & \Sigma_{\gamma\gamma} \end{pmatrix}. \tag{68}$$

As before, we identify $\Delta = (\delta, \gamma)$ with the principal eigenvector of $\Sigma$ scaled by the square root of the principal eigenvalue.
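The extraction of $\Delta$ from $\Sigma$ can be sketched with plain power iteration. This is an illustrative stand-in, not the authors' implementation: the function name is hypothetical, the fixed starting vector is assumed not to be orthogonal to the principal eigenvector, and the example uses $d = 1$ so that $\Sigma$ is $2 \times 2$.

```python
import math

def principal_half_displacement(sigma, iters=100):
    """Principal eigenvector of a symmetric PSD matrix, scaled by sqrt(eigenvalue).

    sigma is a nested list. The overall sign of the result is arbitrary,
    reflecting the (delta, gamma) -> (-delta, -gamma) symmetry.
    """
    n = len(sigma)
    v = [1.0 / math.sqrt(n)] * n  # assumed not orthogonal to the top eigenvector
    for _ in range(iters):
        u = [sum(sigma[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    # Rayleigh quotient gives the principal eigenvalue.
    lam = sum(v[i] * sigma[i][j] * v[j] for i in range(n) for j in range(n))
    return [math.sqrt(lam) * x for x in v]

# Exactly rank-one test case: Sigma = Delta (x) Delta with Delta = (delta, gamma).
Delta = [0.3, -0.1]
Sigma = [[a * b for b in Delta] for a in Delta]
recovered = principal_half_displacement(Sigma)
print(recovered)
```

Because both components come from one eigenvector, the relative sign of $\delta$ and $\gamma$ is preserved even though the overall sign is not determined.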
#### C.1.2 Nesterov
Nesterov momentum is very similar to heavy ball momentum. The only difference between the two momentum schemes is where the gradient is evaluated. Rather than evaluating the gradient at $w_t$, Nesterov momentum evaluates the gradient at a *look-ahead point* $\theta_t$, obtained by extrapolating $w_t$ along the current momentum:

$$\begin{aligned}
\theta_t &= w_t - \eta\beta\, m_t, && \text{(69)}\\
m_{t+1} &= \beta\, m_t + (1-\beta)\,\nabla L(\theta_t), && \text{(70)}\\
w_{t+1} &= w_t - \eta\, m_{t+1}. && \text{(71)}
\end{aligned}$$

Nesterov originally introduced this look-ahead to obtain an accelerated convergence rate of $O(1/T^2)$ on smooth convex objectives [Nesterov, [1983](https://arxiv.org/html/2605.06821#bib.bib37)], improving on the $O(1/T)$ rate of vanilla gradient descent. Heuristically, the look-ahead acts as a one-step prediction–correction: we use the current value of the momentum to predict where the iterate is heading, and then query the gradient there for a course correction.
For the Nesterov rod flow, the only change relative to heavy ball is that $\nabla L(w)$ is replaced by $\nabla L(w - \eta\beta\, m)$ inside $\Phi$. The phase-space update becomes:

$$\Phi(z) = \begin{pmatrix} -\eta\bigl[\beta m + (1-\beta)\nabla L(w - \eta\beta m)\bigr] \\ (1-\beta)\bigl[\nabla L(w - \eta\beta m) - m\bigr] \end{pmatrix}. \tag{72}$$

The phase-space midpoint $\bar{z}_t$ and phase-space half-displacement $\Delta_t = (\delta_t, \gamma_t)$ are defined exactly as in the heavy ball case. And the difference equations for $\bar{z}$ and $\Delta \otimes \Delta$ remain identical in form:

$$\begin{aligned}
\bar{z}_{t+1} - \bar{z}_t &= \tfrac{1}{2}\bigl[\Phi(\bar{z}_t + \Delta_t) + \Phi(\bar{z}_t - \Delta_t)\bigr],\\
\Delta_{t+1} \otimes \Delta_{t+1} - \Delta_t \otimes \Delta_t &= \tfrac{1}{4}\bigl[\Phi_+ \otimes \Phi_+ + \Phi_- \otimes \Phi_-\bigr] - 2\,\Delta_t \otimes \Delta_t.
\end{aligned}$$

Expanding $\Phi(\bar{z}_t \pm \Delta_t)$, the gradient evaluation point becomes:

$$(\bar{w} \pm \delta) - \eta\beta\,(\bar{m} \pm \gamma) \;=\; \bigl(\bar{w} - \eta\beta\,\bar{m}\bigr) \;\pm\; \bigl(\delta - \eta\beta\,\gamma\bigr) \;=\; \bar{\theta} \;\pm\; \varphi,$$

where $\bar{\theta} = \bar{w} - \eta\beta\,\bar{m}$ is the look-ahead midpoint and $\varphi = \delta - \eta\beta\,\gamma$ is the look-ahead-shifted half-displacement.
Promoting the discrete difference equations to ODEs yields the rod flow for Nesterov momentum:
dw¯dt\\displaystyle\\frac\{d\\bar\{w\}\}\{dt\}=−η\[βm¯\+\(1−β\)g¯\],\\displaystyle=\-\\eta\\bigl\[\\beta\\,\\bar\{m\}\+\(1\-\\beta\)\\,\\bar\{g\}\\bigr\],\(73\)dm¯dt\\displaystyle\\frac\{d\\bar\{m\}\}\{dt\}=\(1−β\)\[g¯−m¯\],\\displaystyle=\(1\-\\beta\)\\bigl\[\\bar\{g\}\-\\bar\{m\}\\bigr\],\(74\)dΣdt\\displaystyle\\frac\{d\\Sigma\}\{dt\}=14\[Φ\+⊗Φ\+\+Φ−⊗Φ−\]−2Σ,\\displaystyle=\\tfrac\{1\}\{4\}\\bigl\[\\Phi\_\{\+\}\\otimes\\Phi\_\{\+\}\+\\Phi\_\{\-\}\\otimes\\Phi\_\{\-\}\\bigr\]\-2\\,\\Sigma,\(75\)identical in form to the heavy ball rod flow, with the sole modification that:
g¯=12\[∇L\(θ¯\+φ\)\+∇L\(θ¯−φ\)\]\\bar\{g\}=\\tfrac\{1\}\{2\}\\bigl\[\\nabla L\(\\bar\{\\theta\}\+\\varphi\)\+\\nabla L\(\\bar\{\\theta\}\-\\varphi\)\\bigr\]
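As a quick sanity check on Equations (73) and (74), the following minimal sketch compares the averaged phase-space update $\tfrac{1}{2}[\Phi(\bar{z}+\Delta)+\Phi(\bar{z}-\Delta)]$ against the closed-form right-hand sides written in terms of $\bar{\theta}$, $\varphi$, and $\bar{g}$. It works in one dimension; the function names, the quartic test gradient, and the numerical values are illustrative assumptions rather than anything taken from the paper's code.

```python
def grad_quartic(w, S=4.0, Q=1.0):
    """Gradient of an illustrative loss L(w) = S/2 w^2 - Q/4 w^4 (assumed example)."""
    return S * w - Q * w ** 3


def phi_nesterov(w, m, grad, eta, beta):
    """Scalar version of the phase-space update Phi(z) in Eq. (72)."""
    g = grad(w - eta * beta * m)  # gradient at the look-ahead point
    return (-eta * (beta * m + (1 - beta) * g), (1 - beta) * (g - m))


def center_rhs(w_bar, m_bar, delta, gamma, grad, eta, beta):
    """Right-hand sides of Eqs. (73)-(74), using the averaged gradient g_bar."""
    theta_bar = w_bar - eta * beta * m_bar  # look-ahead midpoint
    varphi = delta - eta * beta * gamma     # look-ahead-shifted half-displacement
    g_bar = 0.5 * (grad(theta_bar + varphi) + grad(theta_bar - varphi))
    return (-eta * (beta * m_bar + (1 - beta) * g_bar), (1 - beta) * (g_bar - m_bar))


# Hypothetical hyperparameters and rod state, chosen only for illustration.
eta, beta = 0.1, 0.9
w_bar, m_bar, delta, gamma = 0.3, -0.2, 0.05, 0.4

plus = phi_nesterov(w_bar + delta, m_bar + gamma, grad_quartic, eta, beta)
minus = phi_nesterov(w_bar - delta, m_bar - gamma, grad_quartic, eta, beta)
averaged = tuple(0.5 * (p + q) for p, q in zip(plus, minus))
rhs = center_rhs(w_bar, m_bar, delta, gamma, grad_quartic, eta, beta)
print(averaged, rhs)  # the two pairs should agree up to rounding
```

The agreement relies only on the identity $(\bar{w} \pm \delta) - \eta\beta(\bar{m} \pm \gamma) = \bar{\theta} \pm \varphi$, so any differentiable scalar loss could be substituted for the quartic.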
### C.2 Derivation of the Sharpness Threshold on Quadratic Loss

#### C.2.1 Heavy Ball

Just as with gradient descent, heavy ball momentum on a quadratic loss has a sharpness threshold beyond which the iterates no longer converge. However, this threshold is no longer $2/\eta$: it is modified by the momentum.

To find this critical value of the sharpness, we first express the momentum update as a two-step recurrence in $w$ alone. Recall the heavy ball update equations:

$$
\begin{aligned}
m_{t+1} &= \beta\, m_{t} + (1-\beta)\,\nabla L(w_{t}),\\
w_{t+1} &= w_{t} - \eta\, m_{t+1}.
\end{aligned}
$$

From the position update, we can isolate $m_{t+1}$:

$$
m_{t+1} = -\frac{1}{\eta}(w_{t+1} - w_{t})\,. \tag{76}
$$

Shifting the index back by one likewise gives $m_{t} = -\tfrac{1}{\eta}(w_{t} - w_{t-1})$. Substituting both into the momentum update yields:

$$
-\frac{1}{\eta}(w_{t+1} - w_{t}) = -\frac{\beta}{\eta}(w_{t} - w_{t-1}) + (1-\beta)\,\nabla L(w_{t})\,. \tag{77}
$$

Now consider the quadratic loss $L(w) = \tfrac{1}{2}Sw^{2}$, for which $\nabla L(w) = Sw$. Substituting and rearranging gives the two-step recurrence:

$$
w_{t+1} = \bigl(1 + \beta - \eta(1-\beta)S\bigr)\,w_{t} - \beta\,w_{t-1}\,. \tag{78}
$$

We can express this recurrence as a matrix equation. Let $x_{t} = \begin{bmatrix} w_{t} \\ w_{t-1} \end{bmatrix}$ be the state vector. Then we have that:

$$
x_{t+1} = M x_{t}, \qquad M = \begin{bmatrix} 1 + \beta - \eta(1-\beta)S & -\beta \\ 1 & 0 \end{bmatrix}. \tag{79}
$$

The dynamics of the system are governed by the eigenvalues $\lambda$ of the transition matrix $M$. For the iterates to converge, both eigenvalues must lie strictly inside the unit circle ($|\lambda| < 1$); the stability threshold is reached when an eigenvalue first touches the unit circle. We can solve for the eigenvalues using the characteristic equation $\operatorname{Det}(M - \lambda I) = 0$:

$$
\lambda^{2} - \bigl(1 + \beta - \eta(1-\beta)S\bigr)\lambda + \beta = 0\,. \tag{80}
$$

The coefficients of this quadratic are real, so its roots are either both real or a complex-conjugate pair. In the complex-conjugate case, the two roots have equal magnitude. And because their product satisfies $\lambda_{1}\lambda_{2} = \beta$, that common magnitude is $\sqrt{\beta} < 1$ (using $\beta \in [0,1)$). So complex roots cannot leave the unit circle.

The eigenvalues can only cross the boundary along the real axis: at $\lambda = 1$ or $\lambda = -1$. The $\lambda = 1$ case is degenerate, corresponding to stationary dynamics. The $\lambda = -1$ case corresponds to a 2-period cycle, which is precisely what we are looking for. Substituting $\lambda = -1$ into the characteristic equation gives the condition for the sharpness threshold $S^{\ast}$:

$$
2 + 2\beta - \eta(1-\beta)S^{\ast} = 0 \implies S^{\ast} = \frac{2}{\eta} \cdot \frac{1+\beta}{1-\beta}\,. \tag{81}
$$

At the critical sharpness $S^{\ast}$, we know that $\lambda_{1} = -1$ is an eigenvalue. As the product of the eigenvalues equals $\beta$, it immediately follows that the second eigenvalue is:

$$
\lambda_{2} = -\beta\,. \tag{82}
$$

Because the momentum coefficient $\beta \in [0,1)$, this second eigenvalue has magnitude strictly less than 1. This implies that, at the sharpness threshold, the system does not globally diverge. Rather, there is a one-dimensional family of 2-period orbits, and the iterates decay toward this family at a rate of $\beta$.
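The threshold in Equation (81) can be checked numerically by evaluating the spectral radius of $M$ just below, at, and just above $S^{\ast}$. The sketch below does this from the trace and determinant of $M$; the helper name `heavy_ball_spectral_radius` and the values of $\eta$ and $\beta$ are assumptions made for illustration.

```python
import cmath


def heavy_ball_spectral_radius(S, eta, beta):
    """Largest eigenvalue magnitude of the matrix M in Eq. (79)."""
    trace = 1 + beta - eta * (1 - beta) * S   # top-left entry of M
    disc = cmath.sqrt(trace ** 2 - 4 * beta)  # det(M) = beta
    return max(abs((trace + disc) / 2), abs((trace - disc) / 2))


# Illustrative hyperparameters (not taken from the paper's experiments).
eta, beta = 0.01, 0.9
S_star = (2 / eta) * (1 + beta) / (1 - beta)  # Eq. (81)

below = heavy_ball_spectral_radius(0.99 * S_star, eta, beta)
at = heavy_ball_spectral_radius(S_star, eta, beta)
above = heavy_ball_spectral_radius(1.01 * S_star, eta, beta)
print(S_star, below, at, above)
```

Under these assumed values the spectral radius is below one just under the threshold, equal to one (up to rounding) at $S^{\ast}$, and above one just past it, consistent with an eigenvalue leaving the unit circle through $\lambda = -1$.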
To see this explicitly, we can decompose the system's initial conditions into the eigenvectors of $M$. At $S = S^{\ast}$, the transition matrix becomes:

$$
M = \begin{bmatrix} -(1+\beta) & -\beta \\ 1 & 0 \end{bmatrix}\,. \tag{83}
$$

Solving $(M - \lambda I)v = 0$ for each eigenvalue yields the corresponding eigenvectors:

$$
v_{1} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \quad (\text{for } \lambda_{1} = -1), \qquad v_{2} = \begin{bmatrix} -\beta \\ 1 \end{bmatrix} \quad (\text{for } \lambda_{2} = -\beta)\,. \tag{84}
$$

The vector $v_{1}$ represents the persistent 2-period oscillation, while $v_{2}$ represents the decaying transient.

Let $w_{0}$ and $m_{0}$ be our initial position and momentum, respectively. We can express the initial state $x_{0} = \begin{bmatrix} w_{0} & w_{-1} \end{bmatrix}^{\top}$ in terms of $w_{0}$ and $m_{0}$. Rearranging the position update equation, one can see that $w_{-1} = w_{0} + \eta m_{0}$. We then have that:

$$
x_{0} = \begin{bmatrix} w_{0} \\ w_{0} + \eta m_{0} \end{bmatrix}\,. \tag{85}
$$

We want to decompose this initial state into the eigenbasis of $M$:

$$
x_{0} = c_{1}v_{1} + c_{2}v_{2}, \tag{86}
$$

which yields the following linear system of equations for the coefficients:

$$
\begin{align}
w_{0} &= c_{1} - \beta c_{2}\,, \tag{87}\\
w_{0} + \eta m_{0} &= -c_{1} + c_{2}\,. \tag{88}
\end{align}
$$

Solving for the coefficients gives:

$$
c_{1} = \frac{(1+\beta)w_{0} + \beta\eta m_{0}}{1-\beta}, \qquad c_{2} = \frac{2w_{0} + \eta m_{0}}{1-\beta}\,. \tag{89}
$$

The dynamics over time are given by:

$$
x_{t} = M^{t}x_{0} = c_{1}(-1)^{t}v_{1} + c_{2}(-\beta)^{t}v_{2}. \tag{90}
$$

As $t \to \infty$ the transient term vanishes, so that, up to the alternating sign $(-1)^{t}$:

$$
x_{\infty} = c_{1}v_{1} \implies |w_{\infty}| = |c_{1}|. \tag{91}
$$

Specifically, in the case that $m_{0} = 0$, we have that:

$$
|w_{\infty}| = \frac{1+\beta}{1-\beta}|w_{0}|. \tag{92}
$$

Figure 4: Heavy Ball at the EoS Threshold. Dynamics of Heavy Ball momentum on a 1D quadratic loss, illustrating convergence to a persistent 2-period orbit at the sharpness threshold.
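The limiting amplitude in Equation (92) can be reproduced by iterating the heavy ball update directly at $S = S^{\ast}$ with $m_{0} = 0$. The following is a minimal sketch; the function name `heavy_ball_trajectory` and the choices $\eta = 0.1$, $\beta = 0.5$, $w_{0} = 1$ are illustrative assumptions.

```python
def heavy_ball_trajectory(w0, m0, S, eta, beta, steps):
    """Iterate heavy ball on L(w) = S/2 w^2 and return the list of positions."""
    w, m, ws = w0, m0, [w0]
    for _ in range(steps):
        m = beta * m + (1 - beta) * S * w  # momentum update with grad L(w) = S w
        w = w - eta * m                    # position update
        ws.append(w)
    return ws


# Illustrative values, not taken from the paper's experiments.
eta, beta, w0 = 0.1, 0.5, 1.0
S_star = (2 / eta) * (1 + beta) / (1 - beta)   # Eq. (81)
ws = heavy_ball_trajectory(w0, 0.0, S_star, eta, beta, steps=200)

predicted = (1 + beta) / (1 - beta) * abs(w0)  # Eq. (92)
print(ws[-2], ws[-1], predicted)
```

After the transient has decayed, consecutive iterates should be equal in magnitude and opposite in sign, with magnitude close to `predicted`.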
#### C.2.2 Nesterov

The same approach yields the sharpness threshold for Nesterov momentum on a quadratic.

Recall the update equations for Nesterov momentum:

$$
\begin{aligned}
\theta_{t} &= w_{t} - \eta\beta\, m_{t},\\
m_{t+1} &= \beta\, m_{t} + (1-\beta)\,\nabla L(\theta_{t}),\\
w_{t+1} &= w_{t} - \eta\, m_{t+1}.
\end{aligned}
$$

Once again, we aim to eliminate $m$ to obtain a two-step recurrence in $w$ alone. From the position update, we have:

$$
m_{t+1} = -\frac{1}{\eta}(w_{t+1} - w_{t}),
$$

from which shifting the index back by one gives $m_{t} = -\tfrac{1}{\eta}(w_{t} - w_{t-1})$. Substituting this into the look-ahead point yields

$$
\begin{aligned}
\theta_{t} &= w_{t} - \eta\beta\, m_{t}\\
&= w_{t} + \beta(w_{t} - w_{t-1})\\
&= (1+\beta)\,w_{t} - \beta\,w_{t-1}\,.
\end{aligned} \tag{93}
$$

Consider the quadratic loss $L(w) = \tfrac{1}{2}Sw^{2}$. One can see that $\nabla L(w) = Sw$. Substituting the above expressions for the momentum, the look-ahead, and the gradient into the update equation for the momentum yields:

$$
-\frac{1}{\eta}(w_{t+1} - w_{t}) = -\frac{\beta}{\eta}(w_{t} - w_{t-1}) + (1-\beta)\,S\bigl[(1+\beta)w_{t} - \beta\,w_{t-1}\bigr]\,. \tag{94}
$$

Rearranging to isolate $w_{t+1}$ yields:

$$
w_{t+1} = \bigl(1 + \beta - \eta(1-\beta^{2})\,S\bigr)\,w_{t} \;-\; \beta\bigl(1 - \eta(1-\beta)\,S\bigr)\,w_{t-1}\,. \tag{95}
$$

Let $x_{t} = \begin{bmatrix} w_{t} \\ w_{t-1} \end{bmatrix}$. We have the following two-step recurrence:

$$
x_{t+1} = M x_{t}, \qquad M = \begin{bmatrix} 1 + \beta - \eta(1-\beta^{2})\,S & -\beta\bigl(1 - \eta(1-\beta)S\bigr) \\ 1 & 0 \end{bmatrix}. \tag{96}
$$

The characteristic equation for the transition matrix $M$ is:

$$
\lambda^{2} - \bigl(1 + \beta - \eta(1-\beta^{2})S\bigr)\,\lambda + \beta\bigl(1 - \eta(1-\beta)S\bigr) = 0\,. \tag{97}
$$

The characteristic equation for Nesterov momentum is somewhat more complicated than the one for heavy ball. For example, in heavy ball the constant term was simply $\beta$, whereas here it involves both the step size and the sharpness of the quadratic.

As before, we find the sharpness threshold $S^{\ast}$ by substituting $\lambda = -1$ into the characteristic equation:

$$
2(1+\beta) - (1-\beta)(1+2\beta)\,\eta S^{\ast} = 0\,. \tag{98}
$$

Isolating $S^{\ast}$ gives:

$$
S^{\ast} = \frac{2}{\eta} \cdot \frac{1+\beta}{(1-\beta)(1+2\beta)}\,. \tag{99}
$$

Compared to the heavy ball threshold $S^{\ast}_{\text{HB}} = \tfrac{2}{\eta} \cdot \tfrac{1+\beta}{1-\beta}$, the look-ahead tightens the threshold by an extra factor of $1/(1+2\beta)$: at the same $\eta$ and $\beta$, Nesterov destabilizes at a strictly smaller sharpness threshold than heavy ball.
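To make the size of this gap concrete, here is a worked evaluation of both thresholds. The values $\eta = 0.01$ and $\beta = 0.9$ are assumed purely for illustration and are not taken from the paper's experiments; the subscript NAG for the Nesterov threshold is likewise introduced here only for this example.

$$
S^{\ast}_{\text{HB}} = \frac{2}{0.01} \cdot \frac{1.9}{0.1} = 3800, \qquad S^{\ast}_{\text{NAG}} = \frac{S^{\ast}_{\text{HB}}}{1 + 2(0.9)} = \frac{3800}{2.8} \approx 1357.1\,.
$$

So at these settings the Nesterov threshold is smaller than the heavy ball threshold by the factor $1 + 2\beta = 2.8$.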
We want to find the other eigenvalue of $M$ at $S = S^{\ast}$. To do so, we can use the fact that the product of the eigenvalues equals the constant term of the characteristic equation. Substituting $S^{\ast}$ into our expression for the constant term yields:

$$
\begin{aligned}
\lambda_{1}\lambda_{2} &= \beta\bigl(1 - \eta(1-\beta)\,S^{\ast}\bigr)\\
&= \beta\left(1 - \eta(1-\beta) \cdot \frac{2}{\eta} \cdot \frac{1+\beta}{(1-\beta)(1+2\beta)}\right)\\
&= \beta\left(1 - \frac{2(1+\beta)}{1+2\beta}\right)\\
&= \beta \cdot \frac{(1+2\beta) - 2(1+\beta)}{1+2\beta}\\
&= -\frac{\beta}{1+2\beta}\,.
\end{aligned}
$$

Recall that, because we are at the sharpness threshold, $\lambda_{1} = -1$. It then follows that:

$$
\lambda_{2} = \frac{\beta}{1+2\beta}\,. \tag{100}
$$

Since $\beta \in [0,1)$, we have that $\lambda_{2} \in [0, \tfrac{1}{3})$. So once again, the iterates decay onto a one-dimensional family of 2-period orbits. Notably, the transient decays more quickly than was the case for heavy ball, where the analogous decay rate was $\beta$ rather than $\beta/(1+2\beta)$.

At $S = S^{\ast}$, the transition matrix simplifies to:

$$
M = \begin{bmatrix} -(1+\beta)/(1+2\beta) & \beta/(1+2\beta) \\ 1 & 0 \end{bmatrix}\,. \tag{101}
$$

Solving $(M - \lambda I)\,v = 0$ for each eigenvalue gives the corresponding eigenvector:

$$
v_{1} = \begin{bmatrix} 1 \\ -1 \end{bmatrix} \;\; (\text{for } \lambda_{1} = -1), \qquad v_{2} = \begin{bmatrix} \beta \\ 1+2\beta \end{bmatrix} \;\; \bigl(\text{for } \lambda_{2} = \tfrac{\beta}{1+2\beta}\bigr)\,. \tag{102}
$$

The persistent oscillation direction $v_{1}$ is identical to the heavy-ball case and again corresponds to symmetric bouncing about the minimum. However, the transient direction $v_{2}$ has changed.

Let $w_{0}$ and $m_{0}$ be our initial position and momentum, respectively. Using the position update equation, we have that our initial state is:

$$
x_{0} = \begin{bmatrix} w_{0} \\ w_{0} + \eta m_{0} \end{bmatrix}\,. \tag{103}
$$

We can express our initial state as a linear combination of the eigenvectors of $M$:

$$
x_{0} = c_{1}v_{1} + c_{2}v_{2}. \tag{104}
$$

Eigenvectors are defined only up to scale, so for convenience we rescale the transient eigenvector to $v_{2} = \begin{bmatrix} \beta/(1+2\beta) & 1 \end{bmatrix}^{\top}$; this changes $c_{2}$ but not $c_{1}$. We then obtain a linear system of equations for the coefficients $c_{1}$ and $c_{2}$:

$$
\begin{align}
w_{0} &= c_{1} + \frac{\beta}{1+2\beta}\,c_{2}\,, \tag{105}\\
w_{0} + \eta m_{0} &= -c_{1} + c_{2}\,. \tag{106}
\end{align}
$$

Solving for the coefficients:

$$
c_{1} = \frac{(1+\beta)\,w_{0} - \beta\,\eta\,m_{0}}{1+3\beta}\,, \qquad c_{2} = \frac{(2w_{0} + \eta m_{0})(1+2\beta)}{1+3\beta}\,. \tag{107}
$$

So our state evolves over time according to:

$$
x_{t} = M^{t}x_{0} = c_{1}(-1)^{t}v_{1} + c_{2}\Bigl(\tfrac{\beta}{1+2\beta}\Bigr)^{t}v_{2}\,. \tag{108}
$$

The asymptotic 2-period oscillation amplitude is $|w_{\infty}| = |c_{1}|$. In particular, when $m_{0} = 0$, we have:

$$
|w_{\infty}| = \frac{1+\beta}{1+3\beta}\,|w_{0}|\,. \tag{109}
$$

Note that

$$
\frac{1+\beta}{1+3\beta} < \frac{1+\beta}{1-\beta} \tag{110}
$$

for any $\beta \in (0,1)$. Hence, for the same initial position and momentum coefficient, Nesterov momentum settles into a strictly smaller-amplitude 2-period orbit than heavy ball.
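The Nesterov amplitude in Equation (109) can be checked the same way, by iterating the Nesterov update at its own threshold from Equation (99) with $m_{0} = 0$. This is a minimal sketch; the function name `nesterov_trajectory` and the values $\eta = 0.1$, $\beta = 0.5$, $w_{0} = 1$ are illustrative assumptions.

```python
def nesterov_trajectory(w0, m0, S, eta, beta, steps):
    """Iterate Nesterov momentum on L(w) = S/2 w^2 and return the positions."""
    w, m, ws = w0, m0, [w0]
    for _ in range(steps):
        theta = w - eta * beta * m             # look-ahead point
        m = beta * m + (1 - beta) * S * theta  # momentum update, grad L = S * theta
        w = w - eta * m                        # position update
        ws.append(w)
    return ws


# Illustrative values, not taken from the paper's experiments.
eta, beta, w0 = 0.1, 0.5, 1.0
S_star = (2 / eta) * (1 + beta) / ((1 - beta) * (1 + 2 * beta))  # Eq. (99)
ws = nesterov_trajectory(w0, 0.0, S_star, eta, beta, steps=200)

predicted = (1 + beta) / (1 + 3 * beta) * abs(w0)         # Eq. (109)
heavy_ball_amplitude = (1 + beta) / (1 - beta) * abs(w0)  # Eq. (92), for comparison
print(ws[-2], ws[-1], predicted, heavy_ball_amplitude)
```

The late iterates should alternate in sign with magnitude close to `predicted`, which is smaller than the corresponding heavy ball amplitude, in line with Equation (110).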
### C.3 Theoretical Analysis

As was the case with the gradient descent rod flow, it's instructive to analyze the behavior of the momentum rod flows on three one-dimensional loss functions: the linear loss, the quadratic loss, and the quartic loss.

Let $\Phi^{w}$ and $\Phi^{m}$ denote the position and momentum components of the phase-space update, respectively. For heavy ball:

$$
\begin{align}
\Phi^{w}_{\text{HB}} &= -\eta\bigl[\beta\,m + (1-\beta)\,\nabla L(w)\bigr], \tag{111}\\
\Phi^{m}_{\text{HB}} &= (1-\beta)\bigl[\nabla L(w) - m\bigr]. \tag{112}
\end{align}
$$

And for Nesterov:

$$
\begin{align}
\Phi^{w}_{\text{NAG}} &= -\eta\bigl[\beta\,m + (1-\beta)\,\nabla L(w - \eta\beta\,m)\bigr], \tag{113}\\
\Phi^{m}_{\text{NAG}} &= (1-\beta)\bigl[\nabla L(w - \eta\beta\,m) - m\bigr]. \tag{114}
\end{align}
$$

Throughout, we drop the subscripts from $\Phi$ when it is clear from context whether we are working with heavy ball or Nesterov momentum.

When our loss landscape is one-dimensional, momentum rod flow reduces down to just six scalar ODEs:

$$
\begin{align}
\frac{d\bar{w}}{dt} &= -\eta[\beta\bar{m} + (1-\beta)\bar{g}] \tag{115}\\
\frac{d\bar{m}}{dt} &= (1-\beta)(\bar{g} - \bar{m}) \tag{116}\\
\frac{d\Sigma_{\delta\delta}}{dt} &= \frac{1}{4}\bigl((\Phi^{w}_{+})^{2} + (\Phi^{w}_{-})^{2}\bigr) - 2\Sigma_{\delta\delta} \tag{117}\\
\frac{d\Sigma_{\gamma\gamma}}{dt} &= \frac{1}{4}\bigl((\Phi^{m}_{+})^{2} + (\Phi^{m}_{-})^{2}\bigr) - 2\Sigma_{\gamma\gamma} \tag{118}\\
\frac{d\Sigma_{\delta\gamma}}{dt} &= \frac{1}{4}\bigl((\Phi^{w}_{+})(\Phi^{m}_{+}) + (\Phi^{w}_{-})(\Phi^{m}_{-})\bigr) - 2\Sigma_{\delta\gamma} \tag{119}\\
\frac{d\Sigma_{\gamma\delta}}{dt} &= \frac{1}{4}\bigl((\Phi^{m}_{+})(\Phi^{w}_{+}) + (\Phi^{m}_{-})(\Phi^{w}_{-})\bigr) - 2\Sigma_{\gamma\delta} \tag{120}
\end{align}
$$

Because we constrain $\Sigma$ to be symmetric, $\Sigma_{\delta\gamma} = \Sigma_{\gamma\delta}$. Without loss of generality, we will only consider the dynamics of $\Sigma_{\delta\gamma}$.
#### C.3.1 Linear Loss

Consider the linear loss $L(w) = bw$. The gradient is constant: $\nabla L = b$. Because the gradient is independent of the evaluation point, heavy ball and Nesterov momentum behave identically on the linear loss.

The $\bar{m}$ ODE is:

$$
\frac{d\bar{m}}{dt} = (1-\beta)(b - \bar{m})\,. \tag{121}
$$

The dynamics of $\bar{m}$ are independent of the other variables. At steady state, $\bar{m}^{\ast} = b$. This makes sense: the momentum is an exponential moving average of the gradient, and the gradient is constant. The $\bar{w}$ ODE is then:

$$
\frac{d\bar{w}}{dt} = -\eta b\,. \tag{122}
$$

So the midpoint of the rod in position space moves at constant speed. At steady state, we have that:

$$
\begin{aligned}
\Phi^{w}_{\pm} &= -\eta b,\\
\Phi^{m}_{\pm} &= 0.
\end{aligned}
$$

The ODEs for the components of $\Sigma$ are then:

$$
\begin{align}
\frac{d\Sigma_{\delta\delta}}{dt} &= \tfrac{1}{2}(\eta b)^{2} - 2\,\Sigma_{\delta\delta}, \tag{123}\\
\frac{d\Sigma_{\gamma\gamma}}{dt} &= -2\,\Sigma_{\gamma\gamma}, \tag{124}\\
\frac{d\Sigma_{\delta\gamma}}{dt} &= -2\,\Sigma_{\delta\gamma}. \tag{125}
\end{align}
$$

By inspection, the steady-state value of $\Sigma$ is:

$$
\Sigma^{\ast} = \begin{pmatrix} \left(\tfrac{\eta b}{2}\right)^{2} & 0 \\ 0 & 0 \end{pmatrix}, \tag{126}
$$

and the corresponding steady-state value of $\Delta$ is:

$$
\Delta^{\ast} = \begin{pmatrix} -\eta b/2 \\ 0 \end{pmatrix}. \tag{127}
$$
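The steady state in Equation (126) can be confirmed with a simple forward-Euler integration of Equation (121) together with the steady-state $\Sigma$ equations (123) to (125). This is only a sketch: the integrator, step size, initial conditions, and the values of $b$, $\eta$, and $\beta$ are assumptions chosen for illustration, and the $\Sigma$ equations are integrated in the steady-state form given above rather than in their full transient form.

```python
def integrate_linear_loss(b, eta, beta, dt=0.01, steps=6000):
    """Forward-Euler integration of Eq. (121) and Eqs. (123)-(125)."""
    m_bar = 0.0
    s_dd, s_gg, s_dg = 1.0, 1.0, 1.0  # arbitrary initial entries of Sigma
    for _ in range(steps):
        m_bar += dt * (1 - beta) * (b - m_bar)           # Eq. (121)
        s_dd += dt * (0.5 * (eta * b) ** 2 - 2 * s_dd)   # Eq. (123)
        s_gg += dt * (-2 * s_gg)                         # Eq. (124)
        s_dg += dt * (-2 * s_dg)                         # Eq. (125)
    return m_bar, s_dd, s_gg, s_dg


# Illustrative values, not taken from the paper's experiments.
b, eta, beta = 2.0, 0.1, 0.5
m_bar, s_dd, s_gg, s_dg = integrate_linear_loss(b, eta, beta)
print(m_bar, s_dd, s_gg, s_dg, (eta * b / 2) ** 2)
```

With these assumed values, `m_bar` should approach $b$, `s_dd` should approach $(\eta b/2)^{2}$, and the remaining two entries should decay to zero.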
#### C.3.2 Quadratic Loss

Consider the quadratic loss $L(w) = \frac{1}{2}Sw^{2}$ with $S > 0$. For the gradient, we have that $\nabla L(w) = Sw$.

We will first consider heavy ball momentum. For the center ODEs, we have that:

$$
\begin{aligned}
\frac{d\bar{w}}{dt} &= -\eta\bigl[\beta\,\bar{m} + (1-\beta)\,S\bar{w}\bigr],\\
\frac{d\bar{m}}{dt} &= (1-\beta)\bigl[S\bar{w} - \bar{m}\bigr].
\end{aligned}
$$

Unlike the linear-loss case, the dynamics of $\bar{m}$ are now coupled to those of $\bar{w}$. However, both $\bar{w}$ and $\bar{m}$ still evolve independently of $\Sigma$. And because the gradient is linear in $w$, they form a linear system:

$$
\frac{d}{dt}\begin{pmatrix} \bar{w} \\ \bar{m} \end{pmatrix} = A\begin{pmatrix} \bar{w} \\ \bar{m} \end{pmatrix}, \qquad A = \begin{pmatrix} -\eta(1-\beta)S & -\eta\beta \\ (1-\beta)S & -(1-\beta) \end{pmatrix}\,. \tag{128}
$$

The eigenvalues of the transition matrix $A$ govern the dynamics of our system. The eigenvalues are the roots of the characteristic equation:

$$
\lambda^{2} - \operatorname{Tr}(A)\lambda + \operatorname{Det}(A) = 0. \tag{129}
$$

Evaluating these terms for $A$:

$$
\begin{aligned}
\operatorname{Tr}(A) &= -(1-\beta)(1 + \eta S),\\
\operatorname{Det}(A) &= (1-\beta)\eta S.
\end{aligned}
$$

To determine the stability of the system, we examine the real parts of the eigenvalues of the transition matrix $A$. The fixed point $(0,0)$ is asymptotically stable if and only if both eigenvalues satisfy $\operatorname{Re}(\lambda) < 0$.

Under the assumptions $\eta > 0$, $S > 0$, and $\beta \in [0,1)$, we observe that $\operatorname{Tr}(A) < 0$ and $\operatorname{Det}(A) > 0$. Since the trace is the sum of the eigenvalues and the determinant is their product, it follows that:

$$
\lambda_{1} + \lambda_{2} < 0 \quad \text{and} \quad \lambda_{1}\lambda_{2} > 0\,. \tag{130}
$$

By the Routh–Hurwitz stability criterion, the system is stable and decays to zero. Hence the steady state is $(\bar{w}^{\ast}, \bar{m}^{\ast}) = (0,0)$.
Now we will consider the dynamics for $\Sigma$. Let $(\bar{w}, \bar{m}) = (0,0)$. Assume that $\Sigma$ is rank-one and positive semi-definite, so that it can be written as the outer product of $\Delta = (\delta, \gamma)$ with itself:

$$
\Sigma = \begin{pmatrix} \Sigma_{\delta\delta} & \Sigma_{\delta\gamma} \\ \Sigma_{\gamma\delta} & \Sigma_{\gamma\gamma} \end{pmatrix} = \begin{pmatrix} \delta \\ \gamma \end{pmatrix} \otimes \begin{pmatrix} \delta \\ \gamma \end{pmatrix} = \begin{pmatrix} \delta^{2} & \delta\gamma \\ \gamma\delta & \gamma^{2} \end{pmatrix}.
$$

For our phase-space update, we have:

$$
\Phi_{\pm} = \begin{pmatrix} \Phi_{\pm}^{w} \\ \Phi_{\pm}^{m} \end{pmatrix} = \pm\begin{pmatrix} -\eta\bigl[\beta\,\gamma + (1-\beta)\,S\delta\bigr] \\ (1-\beta)\bigl[S\delta - \gamma\bigr] \end{pmatrix} = \pm A\begin{pmatrix} \delta \\ \gamma \end{pmatrix},
$$

where $A$ is the transition matrix introduced above.

For the $\Sigma$ ODE:

$$
\begin{aligned}
\frac{d\Sigma}{dt} &= \tfrac{1}{4}\bigl(\Phi_{+} \otimes \Phi_{+} + \Phi_{-} \otimes \Phi_{-}\bigr) - 2\Sigma\\
&= \frac{1}{2}\,\biggl[A\begin{pmatrix} \delta \\ \gamma \end{pmatrix}\biggr] \otimes \biggl[A\begin{pmatrix} \delta \\ \gamma \end{pmatrix}\biggr] - 2\Sigma\\
&= \tfrac{1}{2}\,A\,\Sigma\,A^{\top} - 2\Sigma.
\end{aligned}
$$

Consider the fixed point $\Sigma^{\ast} = 0$. We want to determine for which values of the sharpness $S$ this fixed point is stable. The right-hand side of this ODE is linear in $\Sigma$, so the stability of $\Sigma^{\ast} = 0$ is controlled by the eigenvalues of the linear operator:

$$
\mathcal{L} \colon \Sigma \;\longmapsto\; \tfrac{1}{2}\,A\,\Sigma\,A^{\top} - 2\Sigma, \tag{131}
$$

acting on the three-dimensional space of symmetric $2 \times 2$ matrices. The fixed point $\Sigma^{\ast} = 0$ is asymptotically stable if and only if all three eigenvalues of $\mathcal{L}$ have negative real parts.

Suppose that $A$ has eigenvalues $\alpha_{1}, \alpha_{2}$ with corresponding eigenvectors $v_{1}, v_{2}$. Acting on the rank-one matrix $v_{i}v_{j}^{\top}$, the operator $\Sigma \mapsto A\,\Sigma\,A^{\top}$ multiplies by $\alpha_{i}\alpha_{j}$:

$$
A\bigl(v_{i}v_{j}^{\top}\bigr)A^{\top} = (Av_{i})(Av_{j})^{\top} = \alpha_{i}\alpha_{j}\,v_{i}v_{j}^{\top}\,.
$$

Restricting to symmetric matrices, the three eigendirections of $\mathcal{L}$ are:

$$
\begin{aligned}
v_{1}v_{1}^{\top} \quad &\text{with eigenvalue} \quad \lambda_{11} = \tfrac{1}{2}\alpha_{1}^{2} - 2\,,\\
v_{2}v_{2}^{\top} \quad &\text{with eigenvalue} \quad \lambda_{22} = \tfrac{1}{2}\alpha_{2}^{2} - 2\,,\\
v_{1}v_{2}^{\top} + v_{2}v_{1}^{\top} \quad &\text{with eigenvalue} \quad \lambda_{12} = \tfrac{1}{2}\alpha_{1}\alpha_{2} - 2\,.
\end{aligned}
$$

We start with the cross eigenvalue $\lambda_{12}$. Using $\alpha_{1}\alpha_{2} = \operatorname{Det}(A)$, we have that:
$$
\begin{aligned}
\lambda_{12} &= \tfrac{1}{2}\alpha_{1}\alpha_{2} - 2\\
&= \tfrac{1}{2}\operatorname{Det}(A) - 2\\
&= \tfrac{1}{2}(1-\beta)\eta S - 2.
\end{aligned}
$$

In order for $\lambda_{12}$ to be negative, we require that:

$$
\frac{(1-\beta)}{2}\eta S - 2 < 0 \implies S < \frac{4}{\eta(1-\beta)}. \tag{132}
$$

Anticipating our final result, it's helpful to notice that because $\beta \in [0,1)$, we have that:

$$
S > \frac{4}{\eta(1-\beta)} \implies S > \frac{2}{\eta} \cdot \frac{1+\beta}{1-\beta}. \tag{133}
$$

Let $\lambda_{ii}$ generically denote either of the two diagonal eigenvalues of $\mathcal{L}$. For $\lambda_{ii}$ to have a negative real part, we require:

$$
\operatorname{Re}\{\lambda_{ii}\} = \tfrac{1}{2}\operatorname{Re}\{\alpha_{i}^{2}\} - 2 < 0 \;\;\implies\;\; \operatorname{Re}\{\alpha_{i}^{2}\} < 4\,. \tag{134}
$$

Recall from the analysis of the centers that $\operatorname{Tr}(A) < 0$ and $\operatorname{Det}(A) > 0$, from which it followed that the eigenvalues of $A$ are either a complex-conjugate pair or both real and negative. We can handle the two cases separately.

Consider the case that $\alpha_{1}$ and $\alpha_{2}$ form a complex-conjugate pair. For any complex number $\alpha$, we have $\operatorname{Re}\{\alpha^{2}\} \leq |\alpha|^{2}$, with equality only if $\alpha$ is real. Combined with $|\alpha_{1}|^{2} = |\alpha_{2}|^{2} = \operatorname{Det}(A)$, this gives $\operatorname{Re}\{\alpha_{i}^{2}\} \leq \operatorname{Det}(A)$. One can then see that:

$$
\begin{aligned}
\operatorname{Re}\{\lambda_{ii}\} &= \tfrac{1}{2}\operatorname{Re}\{\alpha_{i}^{2}\} - 2\\
&\leq \tfrac{1}{2}|\alpha_{i}|^{2} - 2\\
&= \tfrac{1}{2}\alpha_{1}\alpha_{2} - 2\\
&= \lambda_{12}.
\end{aligned}
$$

In that case, $\lambda_{12} < 0$ would imply that $\operatorname{Re}\{\lambda_{ii}\} < 0$.

Now consider the case that both $\alpha_{i}$ are real and negative. We have that $\operatorname{Re}\{\alpha_{i}^{2}\} = \alpha_{i}^{2}$, so our stability condition becomes $|\alpha_{i}| < 2$. Since $\alpha_{i} < 0$, the stability threshold corresponds to when the more negative eigenvalue of $A$ reaches $-2$. Substituting $\lambda = -2$ into the characteristic equation gives:

$$
4 + 2\operatorname{Tr}(A) + \operatorname{Det}(A) = 0\,.
$$

Plugging in the expressions for $\operatorname{Tr}(A)$ and $\operatorname{Det}(A)$ and simplifying:

$$
2 + 2\beta - \eta(1-\beta)\,S^{\ast} = 0\,.
$$

Solving for $S^{\ast}$ yields the heavy ball edge-of-stability threshold:

$$
S^{\ast} = \frac{2}{\eta} \cdot \frac{1+\beta}{1-\beta}\,. \tag{135}
$$

Now consider Nesterov momentum. For the quadratic loss $L(w) = \tfrac{1}{2}Sw^{2}$, we have $\nabla L(\theta_{t}) = S(w_{t} - \eta\beta\,m_{t})$. As was the case with heavy ball momentum, the phase-space update is linear in $z$:
$$
\Phi(z) = A\,z\,, \qquad A = \begin{pmatrix} -\eta(1-\beta)S & -\eta\beta\bigl[1 - \eta(1-\beta)S\bigr] \\ (1-\beta)S & -(1-\beta)\bigl[1 + \eta\beta S\bigr] \end{pmatrix}\,.
$$

For the trace and determinant, we have that:

$$
\begin{aligned}
\operatorname{Tr}(A) &= -(1-\beta)\bigl[1 + \eta(1+\beta)S\bigr]\,,\\
\operatorname{Det}(A) &= \eta(1-\beta)S\,.
\end{aligned}
$$

Once again, we have that $\operatorname{Tr}(A) < 0$ and $\operatorname{Det}(A) > 0$. So the fixed point $(\bar{w}^{\ast}, \bar{m}^{\ast}) = (0,0)$ is stable.

Though the transition matrix $A$ differs between heavy ball and Nesterov momentum, the $\Sigma$ ODE retains the same functional form:

$$
\frac{d\Sigma}{dt} = \tfrac{1}{2}\,A\,\Sigma\,A^{\top} - 2\Sigma\,.
$$

To find the sharpness threshold, we again substitute $\lambda = -2$ into the characteristic equation:

$$
2(1+\beta) - \eta(1-\beta)(1+2\beta)\,S^{\ast} = 0\,. \tag{136}
$$

Solving for $S^{\ast}$ yields the sharpness threshold for Nesterov momentum:

$$
S^{\ast} = \frac{2}{\eta} \cdot \frac{1+\beta}{(1-\beta)(1+2\beta)}\,. \tag{137}
$$
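Both rod-flow thresholds can be verified numerically by building the matrix $A$ for each optimizer at its own $S^{\ast}$ and confirming that the more negative eigenvalue equals $-2$. The sketch below uses only the trace and determinant of a $2 \times 2$ matrix; the helper names and the values $\eta = 0.01$, $\beta = 0.9$ are illustrative assumptions.

```python
import cmath


def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from its trace and determinant."""
    tr, det = a + d, a * d - b * c
    root = cmath.sqrt(tr * tr - 4 * det)
    return (tr + root) / 2, (tr - root) / 2


def heavy_ball_A(S, eta, beta):
    """Entries (row-major) of the heavy ball matrix A in Eq. (128)."""
    return (-eta * (1 - beta) * S, -eta * beta, (1 - beta) * S, -(1 - beta))


def nesterov_A(S, eta, beta):
    """Entries (row-major) of the Nesterov matrix A given above."""
    k = eta * (1 - beta) * S
    return (-k, -eta * beta * (1 - k), (1 - beta) * S, -(1 - beta) * (1 + eta * beta * S))


# Illustrative hyperparameters, not taken from the paper's experiments.
eta, beta = 0.01, 0.9
S_hb = (2 / eta) * (1 + beta) / (1 - beta)                       # Eq. (135)
S_nag = (2 / eta) * (1 + beta) / ((1 - beta) * (1 + 2 * beta))   # Eq. (137)

min_hb = min(ev.real for ev in eigenvalues_2x2(*heavy_ball_A(S_hb, eta, beta)))
min_nag = min(ev.real for ev in eigenvalues_2x2(*nesterov_A(S_nag, eta, beta)))
print(min_hb, min_nag)
```

At these assumed values both eigenvalue pairs are real, so taking the smallest real part picks out the more negative eigenvalue in each case.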
#### C\.3\.3Quartic Loss
Consider the quartic potential
L\(w\)=S2w2−Q4w4,L\(w\)=\\frac\{S\}\{2\}w^\{2\}\-\\frac\{Q\}\{4\}w^\{4\},\(138\)whereS,Q\>0S,Q\>0\. The gradient is∇L\(w\)=Sw−Qw3\\nabla L\(w\)=Sw\-Qw^\{3\}\. As was the case with gradient descent, this potential is intended as a minimal model of a loss landscape that is sharpest near the origin and flattens out farther away\. We will see that the momentum algorithms settle into finite\-amplitude oscillations precisely when the sharpness at the minimum exceeds the respective sharpness thresholds derived in the quadratic analysis\.
Consider heavy ball momentum\. Because the loss function is even \(L\(−w\)=L\(w\)L\(\-w\)=L\(w\)\), the gradient is odd \(∇L\(−w\)=−∇L\(w\)\\nabla L\(\-w\)=\-\\nabla L\(w\)\)\. This means that at the origin\(w¯∗,m¯∗\)=\(0,0\)\(\\bar\{w\}^\{\\ast\},\\bar\{m\}^\{\\ast\}\)=\(0,0\), we have:
g¯=12\(∇L\(δ\)\+∇L\(−δ\)\)=0\\bar\{g\}=\\tfrac\{1\}\{2\}\(\\nabla L\(\\delta\)\+\\nabla L\(\-\\delta\)\)=0\(139\)from which it follows that:
dw¯dt=−η\[βm¯\+\(1−β\)g¯\]=0\\frac\{d\\bar\{w\}\}\{dt\}=\-\\eta\[\\beta\\bar\{m\}\+\(1\-\\beta\)\\bar\{g\}\]=0\(140\)and:
dm¯dt=\(1−β\)\(g¯−m¯\)=0\.\\frac\{d\\bar\{m\}\}\{dt\}=\(1\-\\beta\)\(\\bar\{g\}\-\\bar\{m\}\)=0\.\(141\)Thus\(w¯,m¯\)=\(0,0\)\(\\bar\{w\},\\bar\{m\}\)=\(0,0\)is a fixed point of the center ODEs\.
We restrict our analysis to this case throughout. Because
$$\frac{d\bar{z}}{dt}=\tfrac{1}{2}(\Phi_{+}+\Phi_{-})=0,$$
it immediately follows that $\Phi_{-}=-\Phi_{+}$. For our $\Sigma$ ODE, we then have that:
$$\frac{d\Sigma}{dt}=\tfrac{1}{2}\,\Phi_{+}\otimes\Phi_{+}-2\Sigma\,.\tag{142}$$
We want to analyze the fixed points of the $\Sigma$ ODE. Setting $\frac{d\Sigma}{dt}=0$, we have that fixed points $\Sigma^{\ast}$ satisfy:
$$\Sigma^{\ast}=\tfrac{1}{4}\,\Phi_{+}(\Delta^{\ast})\otimes\Phi_{+}(\Delta^{\ast})\tag{143}$$
In terms of the components of $\Sigma^{\ast}$, we have that:
$$\Sigma^{\ast}_{\delta\delta}=\tfrac{1}{4}\eta^{2}\bigl[\beta\gamma^{\ast}+(1-\beta)g(\delta^{\ast})\bigr]^{2}\tag{144}$$
$$\Sigma^{\ast}_{\gamma\gamma}=\tfrac{1}{4}(1-\beta)^{2}\bigl[g(\delta^{\ast})-\gamma^{\ast}\bigr]^{2}\tag{145}$$
$$\Sigma^{\ast}_{\delta\gamma}=-\tfrac{1}{4}\eta(1-\beta)\bigl[\beta\gamma^{\ast}+(1-\beta)g(\delta^{\ast})\bigr]\bigl[g(\delta^{\ast})-\gamma^{\ast}\bigr]\tag{146}$$
where $g(\delta)=S\delta-Q\delta^{3}$.
The right-hand side of [Equation 143](https://arxiv.org/html/2605.06821#A3.E143) is an outer product, so any non-trivial fixed point $\Sigma^{\ast}$ must be rank-one. Writing $\Sigma^{\ast}=\Delta^{\ast}\otimes\Delta^{\ast}$, it then follows that:
$$\Phi_{+}(\Delta^{\ast})=\pm 2\,\Delta^{\ast}\,.\tag{147}$$
For the quiescent fixed point $\Delta^{\ast}=0$, both sign choices are satisfied trivially. For non-trivial fixed points, we want the negative solution, since the minus sign encodes the period-2 condition $z_{t+1}=-z_{t}$ of the underlying discrete heavy ball map:
$$z_{t+1}=z_{t}+\Phi(z_{t})\quad\text{and}\quad z_{t+1}=-z_{t}\quad\implies\quad\Phi(z_{t})=-2z_{t}\,.\tag{148}$$
We therefore take:
$$\Phi_{+}(\Delta^{\ast})=-2\,\Delta^{\ast}\,.\tag{149}$$
We can express [Equation 149](https://arxiv.org/html/2605.06821#A3.E149) as two scalar equations:
$$-2\delta^{\ast}=-\eta\bigl[\beta\gamma^{\ast}+(1-\beta)g(\delta^{\ast})\bigr]\,,\tag{150}$$
$$-2\gamma^{\ast}=(1-\beta)\bigl[g(\delta^{\ast})-\gamma^{\ast}\bigr]\,.\tag{151}$$
The second equation can be solved for $\gamma^{\ast}$ in terms of $\delta^{\ast}$:
$$\gamma^{\ast}=-\frac{1-\beta}{1+\beta}\,g(\delta^{\ast})\,.\tag{152}$$
Substituting this into the first equation, we have that:
$$\begin{aligned}-2\delta^{\ast}&=-\eta\left[\beta\left(-\frac{1-\beta}{1+\beta}g(\delta^{\ast})\right)+(1-\beta)g(\delta^{\ast})\right]\\&=-\eta\,g(\delta^{\ast})\left[\frac{-\beta(1-\beta)+(1+\beta)(1-\beta)}{1+\beta}\right]\\&=-\eta\,g(\delta^{\ast})\left[\frac{1-\beta}{1+\beta}\right]\end{aligned}$$
Rearranging, we obtain the following equation:
$$\frac{g(\delta^{\ast})}{\delta^{\ast}}=\frac{2(1+\beta)}{\eta(1-\beta)}=S^{\ast}.\tag{153}$$
The left-hand side of [Equation 153](https://arxiv.org/html/2605.06821#A3.E153) is the gradient divided by the distance from the center of the oscillations. Because the change in gradient between the rod endpoints is $2g(\delta)$ while the displacement is $2\delta$, the left-hand side gives the *average* curvature along the line segment connecting the two endpoints of the rod. The right-hand side is the sharpness threshold for heavy ball momentum. This generalizes the result from the parabola: stable two-point cycles correspond to orbits whose average curvature over the rod equals the sharpness threshold. What is special about the parabola is that, because the curvature is constant, there is an entire manifold of two-point orbits. For generic loss functions, only a special amplitude satisfies the average-curvature condition.
Using the form of the gradient for our quartic loss, we have:
$$\frac{g(\delta^{\ast})}{\delta^{\ast}}=S-Q(\delta^{\ast})^{2}\,.\tag{154}$$
Recalling that $\Sigma^{\ast}_{\delta\delta}=(\delta^{\ast})^{2}$, the average-curvature condition gives:
$$S-Q(\delta^{\ast})^{2}=S^{\ast}\quad\implies\quad\Sigma^{\ast}_{\delta\delta}=\frac{S-S^{\ast}}{Q}\,.\tag{155}$$
We require $\Sigma^{\ast}_{\delta\delta}>0$ for the fixed point to be physically meaningful. Since $Q>0$, we have:
$$\Sigma^{\ast}_{\delta\delta}>0\iff S-S^{\ast}>0\,.\tag{156}$$
So, the oscillation fixed point becomes physically meaningful precisely when the sharpness at the minimum exceeds the sharpness threshold.
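The period-2 prediction can be checked against the discrete map $z_{t+1}=z_{t}+\Phi(z_{t})$ directly. The sketch below is a minimal illustration, not the authors' experiment code: the parameter values ($\eta=0.1$, $\beta=0.5$, $S=70$, $Q=1$, so that $S^{\ast}=60$) and the choice to start 1% away from the predicted cycle are assumptions made only for this example, and all function names are hypothetical.

```python
import math


def grad(w, S, Q):
    """Gradient of the quartic loss L(w) = S/2 w^2 - Q/4 w^4."""
    return S * w - Q * w**3


def heavy_ball_step(w, m, eta, beta, S, Q):
    """One step z_{t+1} = z_t + Phi(z_t) of the heavy ball map."""
    g = grad(w, S, Q)
    w_next = w - eta * (beta * m + (1.0 - beta) * g)  # Phi^w
    m_next = m + (1.0 - beta) * (g - m)               # Phi^m
    return w_next, m_next


def predicted_cycle(eta, beta, S, Q):
    """(delta*, gamma*) from Equations 155 and 152; assumes S > S*."""
    s_star = 2.0 * (1.0 + beta) / (eta * (1.0 - beta))
    delta = math.sqrt((S - s_star) / Q)
    gamma = -(1.0 - beta) / (1.0 + beta) * grad(delta, S, Q)
    return delta, gamma


def simulate(eta=0.1, beta=0.5, S=70.0, Q=1.0, steps=400, perturb=1.01):
    """Start slightly off the predicted cycle and iterate the discrete map."""
    delta, gamma = predicted_cycle(eta, beta, S, Q)
    w, m = perturb * delta, perturb * gamma
    for _ in range(steps):
        w, m = heavy_ball_step(w, m, eta, beta, S, Q)
    return w, m, delta, gamma
```

For these values the iterates settle onto $w_t=\pm\delta^{\ast}$ with $\delta^{\ast}=\sqrt{(S-S^{\ast})/Q}=\sqrt{10}$, and the ratio $m_t/w_t$ approaches $-2/\eta$. Other parameter choices may behave differently for the discrete map, since the stability analysis in this section concerns the rod flow ODE.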
For $\gamma^{\ast}$, substituting $g(\delta^{\ast})=S^{\ast}\delta^{\ast}$ into [Equation 152](https://arxiv.org/html/2605.06821#A3.E152) gives:
$$\begin{aligned}\gamma^{\ast}&=-\frac{1-\beta}{1+\beta}\,g(\delta^{\ast})\\&=-\frac{1-\beta}{1+\beta}\left(\frac{2(1+\beta)}{\eta(1-\beta)}\,\delta^{\ast}\right)\\&=-\frac{2}{\eta}\,\delta^{\ast}\,,\end{aligned}$$
from which it follows:
$$\frac{\gamma^{\ast}}{\delta^{\ast}}=-\frac{2}{\eta}\,.\tag{157}$$
Recalling that $\Sigma^{\ast}_{\gamma\gamma}=(\gamma^{\ast})^{2}$ and $\Sigma^{\ast}_{\delta\gamma}=\delta^{\ast}\gamma^{\ast}$, we then have that:
$$\Sigma^{\ast}_{\gamma\gamma}=\frac{4(S-S^{\ast})}{\eta^{2}Q}\,,\qquad\Sigma^{\ast}_{\delta\gamma}=-\frac{2(S-S^{\ast})}{\eta\,Q}\,.\tag{158}$$
Let $\Sigma^{\ast}$ denote the non-zero fixed point. We want to determine when $\Sigma^{\ast}$ is a stable fixed point. Let $f(\Sigma)$ denote the right-hand side of the ODE for $\Sigma$:
$$f(\Sigma)=\tfrac{1}{2}\,\Phi_{+}(\Sigma)\otimes\Phi_{+}(\Sigma)-2\Sigma.$$
A fixed point $\Sigma^{\ast}$ is asymptotically stable if all eigenvalues of the linear operator $\nabla_{\Sigma}f(\Sigma^{\ast})$ have strictly negative real parts. Let $\widetilde{\Sigma}$ be a symmetric perturbation matrix. The action of $\nabla_{\Sigma}f(\Sigma^{\ast})$ on $\widetilde{\Sigma}$ is given by:
$$\nabla_{\Sigma}f(\Sigma^{\ast})[\widetilde{\Sigma}]=\tfrac{1}{2}\Bigl(\nabla_{\Sigma}\Phi_{+}[\widetilde{\Sigma}]\otimes\Phi_{+}+\Phi_{+}\otimes\nabla_{\Sigma}\Phi_{+}[\widetilde{\Sigma}]\Bigr)-2\widetilde{\Sigma}.$$
We now need an expression for $\nabla_{\Sigma}\Phi_{+}$. Since $\Phi_{+}$ depends on $\Sigma$ entirely through its principal eigencomponent $\Delta$, we can apply the chain rule. The derivative of $\Phi_{+}$ with respect to $\Sigma$ decomposes into the product of the Jacobian of $\Phi_{+}$ with respect to the principal vector $\Delta$ and the derivative of $\Delta$ with respect to $\Sigma$:
$$\nabla_{\Sigma}\Phi_{+}=\frac{d\Phi_{+}}{d\Delta}\,\frac{d\Delta}{d\Sigma}.$$
We begin by computing the Jacobian of $\Phi_{+}$ with respect to $\Delta$. Recall that:
$$\Phi_{+}=\begin{pmatrix}\Phi_{+}^{w}\\ \Phi_{+}^{m}\end{pmatrix}=\begin{pmatrix}-\eta\,[\beta\gamma+(1-\beta)\,g(\delta)]\\ (1-\beta)\,[g(\delta)-\gamma]\end{pmatrix}.$$
Let $J=d\Phi_{+}/d\Delta$. Then we have that:
$$J=\begin{pmatrix}\frac{d\Phi_{+}^{w}}{d\delta}&\frac{d\Phi_{+}^{w}}{d\gamma}\\ \frac{d\Phi_{+}^{m}}{d\delta}&\frac{d\Phi_{+}^{m}}{d\gamma}\end{pmatrix}=\begin{pmatrix}-\eta(1-\beta)H(\delta)&-\eta\beta\\ (1-\beta)H(\delta)&-(1-\beta)\end{pmatrix},\tag{159}$$
where $H$ is the Hessian of our loss function:
$$H(\delta)=g^{\prime}(\delta)=S-3Q\delta^{2}.\tag{160}$$
We now need to find $d\Delta/d\Sigma$. A general $2\times 2$ matrix has four independent entries, but $\Sigma$ is symmetric, so we can restrict our attention to the three-dimensional space of symmetric $2\times 2$ matrices. When considering perturbations $\widetilde{\Sigma}$, it helps to work in a basis aligned with the rod of the fixed point. Define:
$$e_{1}=\frac{\Delta^{\ast}}{\|\Delta^{\ast}\|}\qquad\text{and}\qquad e_{2}\perp e_{1}.\tag{161}$$
We then have that:
$$\widetilde{\Sigma}\in\operatorname{span}\left\{e_{1}\otimes e_{1},\;e_{2}\otimes e_{2},\;\tfrac{1}{2}(e_{1}\otimes e_{2}+e_{2}\otimes e_{1})\right\}.$$
Recall that a *cone* is a set that is closed under multiplication by positive scalars. Because $\Sigma^{\ast}=\Delta^{\ast}\otimes\Delta^{\ast}$, it lies on the cone of rank-1 symmetric $2\times 2$ matrices. The space of symmetric $2\times 2$ matrices is three-dimensional, whereas the cone of rank-1 symmetric matrices is two-dimensional. We can therefore distinguish two types of perturbations. *Off-cone* perturbations $\widetilde{\Sigma}_{\perp}$ increase the rank. *On-cone* perturbations $\widetilde{\Sigma}_{\parallel}$ rotate and scale the principal eigenvector but remain on the cone of rank-1 symmetric matrices.
Consider an off-cone perturbation:
$$\widetilde{\Sigma}_{\perp}=c\,e_{2}\otimes e_{2}\tag{162}$$
for small $c>0$. We then have that:
$$\Sigma^{\ast}+\widetilde{\Sigma}_{\perp}=\Delta^{\ast}\otimes\Delta^{\ast}+c\,e_{2}\otimes e_{2}\tag{163}$$
Because $e_{2}$ is orthogonal to $e_{1}$ by construction, the eigenvectors of $\Sigma^{\ast}+\widetilde{\Sigma}_{\perp}$ remain $e_{1}$ and $e_{2}$. And because $c$ is assumed small, we have that $\|\Delta^{\ast}\|^{2}>c$, so the principal eigenvalue remains $\|\Delta^{\ast}\|^{2}$. As the principal eigenvector ($e_{1}$) and principal eigenvalue ($\|\Delta^{\ast}\|^{2}$) remain unchanged under off-cone perturbations, it follows that:
$$\frac{d\Delta}{d\Sigma}[\widetilde{\Sigma}_{\perp}]=0\tag{164}$$
We can then show that for off-cone perturbations:
$$\begin{aligned}\nabla_{\Sigma}f(\Sigma^{\ast})[\widetilde{\Sigma}_{\perp}]&=\tfrac{1}{2}\Bigl(\nabla_{\Sigma}\Phi_{+}[\widetilde{\Sigma}_{\perp}]\otimes\Phi_{+}+\Phi_{+}\otimes\nabla_{\Sigma}\Phi_{+}[\widetilde{\Sigma}_{\perp}]\Bigr)-2\widetilde{\Sigma}_{\perp}\\&=\tfrac{1}{2}\Bigl(\frac{d\Phi_{+}}{d\Delta}\frac{d\Delta}{d\Sigma}[\widetilde{\Sigma}_{\perp}]\otimes\Phi_{+}+\Phi_{+}\otimes\frac{d\Phi_{+}}{d\Delta}\frac{d\Delta}{d\Sigma}[\widetilde{\Sigma}_{\perp}]\Bigr)-2\widetilde{\Sigma}_{\perp}\\&=-2\widetilde{\Sigma}_{\perp}\end{aligned}$$
Therefore, off-cone perturbations are eigenvectors of $\nabla_{\Sigma}f$ with eigenvalue $-2$.
Now consider an on-cone perturbation:
$$\widetilde{\Sigma}_{\parallel}=u\otimes\Delta^{\ast}+\Delta^{\ast}\otimes u,\qquad u=ae_{1}+be_{2}\tag{165}$$
One can see that:
$$\begin{aligned}\Sigma^{\ast}+\widetilde{\Sigma}_{\parallel}&=\Delta^{\ast}\otimes\Delta^{\ast}+u\otimes\Delta^{\ast}+\Delta^{\ast}\otimes u\\&=(\Delta^{\ast}+u)\otimes(\Delta^{\ast}+u)-u\otimes u\\&=(\Delta^{\ast}+u)\otimes(\Delta^{\ast}+u)+O(u^{2})\end{aligned}$$
To first order, $\widetilde{\Sigma}_{\parallel}$ shifts the principal eigenvector by $u$ (assumed small). We therefore deduce that:
$$\frac{d\Delta}{d\Sigma}\bigl[\widetilde{\Sigma}_{\parallel}\bigr]=u.\tag{166}$$
Recall that $\Phi_{+}=-2\Delta^{\ast}$. For on-cone perturbations,
$$\begin{aligned}\nabla_{\Sigma}f(\Sigma^{\ast})[\widetilde{\Sigma}_{\parallel}]&=\tfrac{1}{2}\Bigl(\nabla_{\Sigma}\Phi_{+}[\widetilde{\Sigma}_{\parallel}]\otimes\Phi_{+}+\Phi_{+}\otimes\nabla_{\Sigma}\Phi_{+}[\widetilde{\Sigma}_{\parallel}]\Bigr)-2\widetilde{\Sigma}_{\parallel}\\&=\tfrac{1}{2}\Bigl(\tfrac{d\Phi_{+}}{d\Delta}\,\tfrac{d\Delta}{d\Sigma}[\widetilde{\Sigma}_{\parallel}]\otimes\Phi_{+}+\Phi_{+}\otimes\tfrac{d\Phi_{+}}{d\Delta}\,\tfrac{d\Delta}{d\Sigma}[\widetilde{\Sigma}_{\parallel}]\Bigr)-2\widetilde{\Sigma}_{\parallel}\\&=\tfrac{1}{2}\Bigl(Ju\otimes(-2\Delta^{\ast})+(-2\Delta^{\ast})\otimes Ju\Bigr)-2\widetilde{\Sigma}_{\parallel}\\&=-\bigl(Ju\otimes\Delta^{\ast}+\Delta^{\ast}\otimes Ju\bigr)-2\widetilde{\Sigma}_{\parallel}\\&=-\bigl[(J+2I)u\otimes\Delta^{\ast}+\Delta^{\ast}\otimes(J+2I)u\bigr].\end{aligned}$$
Factoring out the basis structure yields a 2-dimensional ODE for the perturbation vector $u$:
$$\frac{du}{dt}=-(J+2I)u.\tag{167}$$
In order for the fixed point $\Sigma^{\ast}$ to be stable, the real part of both eigenvalues of $-(J+2I)$ must be negative. Let $\lambda$ denote an eigenvalue of $J$. We have that $\lambda$ satisfies the characteristic polynomial:
$$\lambda^{2}-\operatorname{Tr}(J)\lambda+\operatorname{Det}(J)=0.$$
Let $\nu$ denote an eigenvalue of $-(J+2I)$ (in this subsection $\nu$ is an eigenvalue, not the second moment). If $v$ is an eigenvector of $J$, we then have that:
$$-(J+2I)v=-(\lambda+2)v\implies\nu=-(\lambda+2)\tag{168}$$
Substituting $\lambda=-\nu-2$ into the characteristic polynomial of $J$ yields the characteristic polynomial of $-(J+2I)$:
$$(-\nu-2)^{2}-\operatorname{Tr}(J)(-\nu-2)+\operatorname{Det}(J)=0\tag{169}$$
Using $\operatorname{Tr}(J)=-(1-\beta)(1+\eta H(\delta^{\ast}))$ and $\operatorname{Det}(J)=\eta(1-\beta)H(\delta^{\ast})$, we have that:
$$\nu^{2}+\bigl[3+\beta-\eta(1-\beta)H(\delta^{\ast})\bigr]\,\nu+\bigl[2(1+\beta)-\eta(1-\beta)H(\delta^{\ast})\bigr]=0.$$
The characteristic polynomial above is expressed in terms of the Hessian $H$ at the fixed point. We want to rewrite it in terms of the sharpness threshold $S^{\ast}$ and the sharpness $S$ at the minimum.
First, substituting $(\delta^{\ast})^{2}=(S-S^{\ast})/Q$ from Equation 155 into $H(\delta^{\ast})=S-3Q(\delta^{\ast})^{2}$, one can show that:
$$\begin{aligned}H&=S-3Q\cdot\left(\frac{S-S^{\ast}}{Q}\right)\\&=S-3(S-S^{\ast})\\&=3S^{\ast}-2S.\end{aligned}$$
One can also see that:
$$S^{\ast}=\frac{2}{\eta}\cdot\frac{1+\beta}{1-\beta}\quad\Longleftrightarrow\quad\eta(1-\beta)S^{\ast}=2(1+\beta).\tag{170}$$
Combining the two above equations, one can see that:
$$\begin{aligned}\eta(1-\beta)H&=3\eta(1-\beta)S^{\ast}-2\eta(1-\beta)S\\&=6(1+\beta)-2\eta(1-\beta)S\end{aligned}\tag{171}$$
Substituting ([171](https://arxiv.org/html/2605.06821#A3.E171)) into the linear coefficient:
$$\begin{aligned}3+\beta-\eta(1-\beta)H&=3+\beta-\bigl[6(1+\beta)-2\eta(1-\beta)S\bigr]\\&=2\eta(1-\beta)S-3-5\beta.\end{aligned}$$
Substituting ([171](https://arxiv.org/html/2605.06821#A3.E171)) into the constant term:
$$\begin{aligned}2(1+\beta)-\eta(1-\beta)H&=2(1+\beta)-\bigl[6(1+\beta)-2\eta(1-\beta)S\bigr]\\&=2\eta(1-\beta)S-4(1+\beta)\\&=2\eta(1-\beta)S-2\eta(1-\beta)S^{\ast}\\&=2\eta(1-\beta)(S-S^{\ast})\end{aligned}$$
The characteristic polynomial therefore simplifies to:
$$\nu^{2}+\bigl[2\eta(1-\beta)S-3-5\beta\bigr]\,\nu+2\eta(1-\beta)(S-S^{\ast})=0.$$
By the Routh–Hurwitz criterion, both eigenvalues have negative real parts if and only if both coefficients are strictly positive:
- Constant term: $2\eta(1-\beta)(S-S^{\ast})>0\iff S>S^{\ast}$.
- $\nu$-coefficient: $2\eta(1-\beta)S-3-5\beta>0\iff S>\dfrac{3+5\beta}{2\eta(1-\beta)}$.

The second condition is automatically implied by the first. Since $S^{\ast}=\dfrac{4(1+\beta)}{2\eta(1-\beta)}$, comparing the two thresholds:
$$\frac{3+5\beta}{2\eta(1-\beta)}<\frac{4+4\beta}{2\eta(1-\beta)}\iff 3+5\beta<4+4\beta\iff\beta<1,$$
which holds since the momentum parameter satisfies $\beta<1$. Hence the $\nu$-coefficient threshold is strictly weaker than $S^{\ast}$, and both on-cone eigenvalues have negative real parts if and only if $S>S^{\ast}$.
Combined with the off-cone eigenvalue of $-2$, we conclude that the fixed point $\Sigma^{\ast}$ is stable if and only if $S>S^{\ast}$.
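The algebra behind the simplified characteristic polynomial can be spot-checked numerically. The sketch below builds the coefficients once from $\operatorname{Tr}(J)$ and $\operatorname{Det}(J)$ with $H=3S^{\ast}-2S$, and once from the simplified form, then solves the quadratic for the on-cone eigenvalues. The function names and the sample values are assumptions chosen only for illustration.

```python
import cmath


def on_cone_eigenvalues(eta, beta, S, Q=1.0):
    """Eigenvalues nu of -(J + 2I) at the oscillating fixed point (assumes S > S*)."""
    s_star = 2.0 * (1.0 + beta) / (eta * (1.0 - beta))
    H = S - 3.0 * Q * ((S - s_star) / Q)   # Hessian at delta*, equals 3 S* - 2 S
    tr = -(1.0 - beta) * (1.0 + eta * H)   # Tr(J)
    det = eta * (1.0 - beta) * H           # Det(J)
    # Expanding (-nu - 2)^2 - Tr(J)(-nu - 2) + Det(J) = 0 gives
    # nu^2 + (4 + Tr) nu + (4 + 2 Tr + Det) = 0.
    b, c = 4.0 + tr, 4.0 + 2.0 * tr + det
    root = cmath.sqrt(b * b - 4.0 * c)
    return (-b + root) / 2.0, (-b - root) / 2.0, b, c


def simplified_coefficients(eta, beta, S):
    """Linear and constant coefficients of the simplified polynomial."""
    s_star = 2.0 * (1.0 + beta) / (eta * (1.0 - beta))
    linear = 2.0 * eta * (1.0 - beta) * S - 3.0 - 5.0 * beta
    constant = 2.0 * eta * (1.0 - beta) * (S - s_star)
    return linear, constant
```

For example, with $\eta=0.1$, $\beta=0.5$, $S=70$ (so $S^{\ast}=60$), both routes give the polynomial $\nu^{2}+1.5\,\nu+1=0$, whose roots have real part $-0.75$.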
## Appendix D Rod Flow for RMSProp
### D.1 Derivation of Rod Flows
#### D.1.1 Scalar RMSProp
For the Scalar RMSProp update equations, we have:
$$\begin{aligned}\nu_{t+1}&=\beta_{2}\nu_{t}+(1-\beta_{2})\|\nabla L(w_{t})\|^{2}\\ w_{t+1}&=w_{t}-\eta\,P_{t+1}^{-1}\,\nabla L(w_{t})\end{aligned}$$
where $P_{t+1}=P(\nu_{t+1})$ is the preconditioner:
$$P(\nu)=(\sqrt{\nu}+\varepsilon)I_{d}$$
As was the case with gradient descent, we can define the midpoint and half-displacement of the position:
$$\begin{aligned}\bar{w}_{t}&=\frac{w_{t+1}+w_{t}}{2}\\ \delta_{t}&=\frac{w_{t+1}-w_{t}}{2}\end{aligned}$$
With Scalar RMSProp, we will only track the midpoint $\bar{\nu}$ of the second moment:
$$\bar{\nu}_{t}=\frac{\nu_{t+1}+\nu_{t}}{2}$$
There is no need to track the half-difference of $\nu$ because, even at the edge of stability, there are no oscillations in the gradient norm. Because we do not track the half-difference in $\nu$, the difference equations are not exact as they were for gradient descent. However, the discrepancy should be negligible as long as the norm of the gradient is smoothly varying, which is a necessary precondition for rod flow to work as an effective theory in the first place.
For the difference equation of the midpoint, we have the following:
$$\begin{aligned}\bar{w}_{t+1}-\bar{w}_{t}&=\frac{w_{t+2}+w_{t+1}}{2}-\frac{w_{t+1}+w_{t}}{2}\\&=\frac{w_{t+2}-w_{t+1}}{2}+\frac{w_{t+1}-w_{t}}{2}\\&=-\frac{\eta}{2(\sqrt{\nu_{t+2}}+\varepsilon)}\nabla L(w_{t+1})-\frac{\eta}{2(\sqrt{\nu_{t+1}}+\varepsilon)}\nabla L(w_{t})\\&\approx-\frac{\eta}{2(\sqrt{\bar{\nu}_{t}}+\varepsilon)}\,\bigl(\nabla L_{+}+\nabla L_{-}\bigr)\end{aligned}$$
And for the difference equation of the extent:
$$\begin{aligned}\delta_{t+1}\otimes\delta_{t+1}-\delta_{t}\otimes\delta_{t}&=(\delta_{t+1}\otimes\delta_{t+1}+\delta_{t}\otimes\delta_{t})-2\delta_{t}\otimes\delta_{t}\\&=\left(\frac{\eta}{2(\sqrt{\nu_{t+2}}+\varepsilon)}\right)^{2}\nabla L(w_{t+1})\otimes\nabla L(w_{t+1})\\&\qquad{}+\left(\frac{\eta}{2(\sqrt{\nu_{t+1}}+\varepsilon)}\right)^{2}\nabla L(w_{t})\otimes\nabla L(w_{t})-2\delta_{t}\otimes\delta_{t}\\&\approx\left(\frac{\eta}{2(\sqrt{\bar{\nu}_{t}}+\varepsilon)}\right)^{2}\,\bigl(\nabla L_{+}\otimes\nabla L_{+}+\nabla L_{-}\otimes\nabla L_{-}\bigr)-2\delta_{t}\otimes\delta_{t}\end{aligned}$$
For the difference equation of the second moment midpoint:
$$\begin{aligned}\bar{\nu}_{t+1}-\bar{\nu}_{t}&=\frac{\nu_{t+2}+\nu_{t+1}}{2}-\frac{\nu_{t+1}+\nu_{t}}{2}\\&=\frac{\nu_{t+2}-\nu_{t+1}}{2}+\frac{\nu_{t+1}-\nu_{t}}{2}\\&=\frac{(1-\beta_{2})\bigl[\|\nabla L(w_{t+1})\|^{2}-\nu_{t+1}\bigr]}{2}+\frac{(1-\beta_{2})\bigl[\|\nabla L(w_{t})\|^{2}-\nu_{t}\bigr]}{2}\\&=(1-\beta_{2})\left[\frac{\|\nabla L_{+}\|^{2}+\|\nabla L_{-}\|^{2}}{2}-\bar{\nu}_{t}\right]\end{aligned}$$
We can then promote the above difference equations to obtain the rod flow for Scalar RMSProp:
$$\begin{aligned}\frac{d\bar{w}}{dt}&=-\eta\,P^{-1}(\bar{\nu})\,\bar{g},\\ \frac{d\Sigma}{dt}&=\left(\frac{\eta}{2}P^{-1}(\bar{\nu})\right)^{2}\,\bigl(\nabla L_{+}\otimes\nabla L_{+}+\nabla L_{-}\otimes\nabla L_{-}\bigr)-2\Sigma\\ \frac{d\bar{\nu}}{dt}&=(1-\beta_{2})\left[\frac{\|\nabla L_{+}\|^{2}+\|\nabla L_{-}\|^{2}}{2}-\bar{\nu}\right].\end{aligned}$$
Note that if we define the preconditioned step size
$$\tilde{\eta}=\eta\,P^{-1}(\bar{\nu})\tag{172}$$
then the rod flow ODEs for $\bar{w}$ and $\Sigma$ are identical to those for gradient descent, except that $\tilde{\eta}$ replaces $\eta$.
#### D.1.2 RMSProp
For standard RMSProp, the second moment estimate $\nu\in\mathbb{R}^{d}$ is promoted to a vector of the same dimension as $w$, tracking per-component second-moment estimates. The update equations are:
$$\begin{aligned}\nu_{t+1}&=\beta_{2}\nu_{t}+(1-\beta_{2})\nabla L(w_{t})^{\odot 2}\\ w_{t+1}&=w_{t}-\eta\,P^{-1}_{t+1}\,\nabla L(w_{t})\end{aligned}$$
The preconditioner $P$ is given as:
$$P(\nu)=\operatorname{diag}(\sqrt{\nu})+\varepsilon I_{d}$$
The midpoints $\bar{w}_{t}$, $\bar{\nu}_{t}$, and the half-displacement $\delta_{t}$ are defined exactly as they were in Scalar RMSProp. For the difference equation of the $\nu$ midpoint, we replace the squared norm from the scalar case with the component-wise square:
$$\begin{aligned}\bar{\nu}_{t+1}-\bar{\nu}_{t}&=\frac{\nu_{t+2}+\nu_{t+1}}{2}-\frac{\nu_{t+1}+\nu_{t}}{2}\\&=\frac{\nu_{t+2}-\nu_{t+1}}{2}+\frac{\nu_{t+1}-\nu_{t}}{2}\\&=\frac{(1-\beta_{2})\bigl[\nabla L(w_{t+1})^{\odot 2}-\nu_{t+1}\bigr]}{2}+\frac{(1-\beta_{2})\bigl[\nabla L(w_{t})^{\odot 2}-\nu_{t}\bigr]}{2}\\&=(1-\beta_{2})\left[\frac{\nabla L_{+}^{\odot 2}+\nabla L_{-}^{\odot 2}}{2}-\bar{\nu}_{t}\right]\end{aligned}$$
Promoting these difference equations to continuous time, we obtain the rod flow for standard RMSProp:
$$\begin{aligned}\frac{d\bar{w}}{dt}&=-\eta\,P^{-1}(\bar{\nu})\,\bar{g},\\ \frac{d\Sigma}{dt}&=\left(\frac{\eta}{2}P^{-1}(\bar{\nu})\right)^{2}\bigg(\nabla L_{+}\otimes\nabla L_{+}+\nabla L_{-}\otimes\nabla L_{-}\bigg)-2\Sigma\\ \frac{d\bar{\nu}}{dt}&=(1-\beta_{2})\left[\frac{\nabla L_{+}^{\odot 2}+\nabla L_{-}^{\odot 2}}{2}-\bar{\nu}\right].\end{aligned}$$
Since $\delta_{t}=-\tfrac{\eta}{2}P_{t+1}^{-1}\nabla L(w_{t})$ and $P$ is now a diagonal matrix rather than a multiple of the identity, the squared preconditioner factor in the $\Sigma$ equation is to be read as acting on both sides of each outer product, i.e. as $\tfrac{\eta^{2}}{4}\,P^{-1}(\bar{\nu})\,(\nabla L_{\pm}\otimes\nabla L_{\pm})\,P^{-1}(\bar{\nu})$.
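To make the difference between the two discrete optimizers concrete, here is a minimal sketch of one update step for each, written with plain Python lists. The function names and the callable `grad` are illustrative assumptions; the update rules are those stated above, with the preconditioner evaluated at $\nu_{t+1}$.

```python
import math


def scalar_rmsprop_step(w, nu, grad, eta, beta2, eps):
    """One Scalar RMSProp step: nu is a single number shared by all coordinates."""
    g = grad(w)
    nu_next = beta2 * nu + (1.0 - beta2) * sum(gi * gi for gi in g)
    scale = eta / (math.sqrt(nu_next) + eps)  # eta * P(nu_{t+1})^{-1}
    w_next = [wi - scale * gi for wi, gi in zip(w, g)]
    return w_next, nu_next


def rmsprop_step(w, nu, grad, eta, beta2, eps):
    """One per-component RMSProp step: nu has one entry per coordinate."""
    g = grad(w)
    nu_next = [beta2 * ni + (1.0 - beta2) * gi * gi for ni, gi in zip(nu, g)]
    w_next = [
        wi - eta * gi / (math.sqrt(ni) + eps)
        for wi, gi, ni in zip(w, g, nu_next)
    ]
    return w_next, nu_next
```

In one dimension the two steps coincide, because the squared norm and the component-wise square are the same number. In higher dimensions they generally differ, since the scalar version rescales every coordinate by the same factor.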
### D.2 Theoretical Analysis
#### D.2.1 Linear Loss
Consider the linear loss $L(w)=-b\cdot w$. Because we are working in one dimension, there is no difference between Scalar RMSProp and standard RMSProp. For the theoretical analysis section, we will drop the regularization parameter $\varepsilon$ (that is, set $\varepsilon=0$).
Since $\nabla L_{+}=\nabla L_{-}=-b$, the rod flow equations become:
$$\frac{d\bar{\nu}}{dt}=(1-\beta_{2})(b^{2}-\bar{\nu})\,,\tag{173}$$
$$\frac{d\bar{w}}{dt}=\frac{\eta b}{\sqrt{\bar{\nu}}}\,,\tag{174}$$
$$\frac{d\Sigma}{dt}=\frac{\eta^{2}b^{2}}{2\bar{\nu}}-2\Sigma\,.\tag{175}$$
Setting $\frac{d\bar{\nu}}{dt}=0$ gives the steady-state second moment:
$$\bar{\nu}^{*}=b^{2}\,.\tag{176}$$
Substituting our expression for $\bar{\nu}^{*}$ into the $\bar{w}$ equation yields the steady-state equation of motion for the midpoint of the position:
$$\frac{d\bar{w}}{dt}=\eta\operatorname{sign}(b)\,.\tag{177}$$
Setting $\frac{d\Sigma}{dt}=0$ gives the steady-state extent:
$$\Sigma^{*}=\frac{\eta^{2}}{4}\,.\tag{178}$$
In a landscape with no curvature, RMSProp behaves like Sign-GD: it takes a step down the landscape with fixed step size $\eta$.
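The Sign-GD behaviour is easy to confirm on the discrete optimizer itself. The following sketch runs one-dimensional RMSProp with $\varepsilon=0$ on $L(w)=-bw$; the default values of `b`, `eta`, and `beta2` and the function name are arbitrary choices for illustration.

```python
import math


def rmsprop_linear(b=-3.0, eta=0.05, beta2=0.9, steps=500, nu0=0.0):
    """Run 1-D RMSProp (epsilon = 0) on L(w) = -b*w; return the last step and nu."""
    w, nu = 0.0, nu0
    grad = -b  # the gradient of the linear loss is constant
    step = 0.0
    for _ in range(steps):
        nu = beta2 * nu + (1.0 - beta2) * grad * grad
        w_next = w - eta * grad / math.sqrt(nu)
        step = w_next - w
        w = w_next
    return step, nu
```

After the second moment has relaxed to $b^{2}$, each step equals $\eta\operatorname{sign}(b)$, so the half-displacement is $\eta/2$ and its square matches $\Sigma^{*}=\eta^{2}/4$.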
#### D.2.2 Quadratic Loss
Consider the quadratic $L(w)=\frac{1}{2}Sw^{2}$ with $S>0$. In one dimension, the extent $\Sigma$ is a scalar, so the half-displacement is $\delta=\sqrt{\Sigma}$. The gradient is $\nabla L(w)=Sw$.
Evaluating the gradients at the endpoints of the rod yields:
$$\nabla L_{+}=S(\bar{w}+\delta)\tag{179}$$
$$\nabla L_{-}=S(\bar{w}-\delta)\tag{180}$$
Summing these terms and summing their squares gives:
$$\nabla L_{+}+\nabla L_{-}=2S\bar{w}\tag{181}$$
$$\nabla L_{+}^{2}+\nabla L_{-}^{2}=S^{2}(\bar{w}+\delta)^{2}+S^{2}(\bar{w}-\delta)^{2}=2S^{2}(\bar{w}^{2}+\Sigma)\tag{182}$$
Plugging these into our rod flow equations for RMSProp:
$$\frac{d\bar{w}}{dt}=-\frac{\eta S}{\sqrt{\bar{\nu}}}\bar{w}\tag{183}$$
$$\frac{d\Sigma}{dt}=\left(\frac{\eta^{2}S^{2}}{2\bar{\nu}}\right)(\bar{w}^{2}+\Sigma)-2\Sigma\tag{184}$$
$$\frac{d\bar{\nu}}{dt}=(1-\beta_{2})\left(S^{2}(\bar{w}^{2}+\Sigma)-\bar{\nu}\right)\tag{185}$$
By inspection, the center $\bar{w}$ experiences a restoring force and decays to the origin. Because of this, it suffices to set $\bar{w}^{\ast}=0$. We can then study the coupled dynamics of the extent and the second moment at steady-state:
$$\frac{d\Sigma}{dt}=\left(\frac{\eta^{2}S^{2}}{2\bar{\nu}}-2\right)\Sigma\tag{186}$$
$$\frac{d\bar{\nu}}{dt}=(1-\beta_{2})(S^{2}\Sigma-\bar{\nu})\tag{187}$$
We want to find the fixed point $(\Sigma^{*},\bar{\nu}^{*})$. From the second moment equation, we have that:
$$\bar{\nu}^{*}=S^{2}\Sigma^{*}\tag{188}$$
Substituting this relationship into the extent equation and setting $\frac{d\Sigma}{dt}=0$ yields:
$$\frac{\eta^{2}S^{2}}{2(S^{2}\Sigma^{*})}-2=0\implies\frac{\eta^{2}}{2\Sigma^{*}}=2\tag{189}$$
Solving for $\Sigma^{*}$ yields:
$$\Sigma^{*}=\frac{\eta^{2}}{4}\tag{190}$$
To determine the stability of the fixed point $(\Sigma^{*},\bar{\nu}^{*})=\left(\frac{\eta^{2}}{4},\frac{S^{2}\eta^{2}}{4}\right)$, we perform a linear stability analysis by evaluating the Jacobian matrix of the system at this point.
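Before linearizing, the fixed point can be located numerically by integrating Equations 186 and 187 directly. The sketch below uses forward Euler; the step size `dt`, the number of steps, the starting point (20% above the predicted extent, with $\bar{\nu}$ at its predicted value), and the function name are all assumptions made for this illustration and are not part of the paper's method.

```python
def integrate_reduced_rod_flow(S=50.0, eta=0.1, beta2=0.9, dt=0.05, steps=20000):
    """Forward-Euler integration of Equations 186-187 (center fixed at w_bar = 0)."""
    sigma = 1.2 * eta**2 / 4.0    # start 20% above the predicted extent
    nu = S**2 * eta**2 / 4.0      # start nu_bar at its predicted fixed-point value
    for _ in range(steps):
        d_sigma = (eta**2 * S**2 / (2.0 * nu) - 2.0) * sigma
        d_nu = (1.0 - beta2) * (S**2 * sigma - nu)
        sigma, nu = sigma + dt * d_sigma, nu + dt * d_nu
    return sigma, nu
```

With these defaults the trajectory spirals into $(\Sigma^{*},\bar{\nu}^{*})=(\eta^{2}/4,\,S^{2}\eta^{2}/4)=(0.0025,\,6.25)$, in line with Equations 188 and 190.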
Let our system of ODEs be defined as:

$$
\begin{aligned}
f(\Sigma, \bar{\nu}) &= \left(\frac{\eta^{2}S^{2}}{2\bar{\nu}} - 2\right)\Sigma, \\
g(\Sigma, \bar{\nu}) &= (1-\beta_{2})(S^{2}\Sigma - \bar{\nu}).
\end{aligned}
$$

The Jacobian $J$ is given by the partial derivatives of $f$ and $g$:

$$
J = \begin{pmatrix} \frac{\partial f}{\partial\Sigma} & \frac{\partial f}{\partial\bar{\nu}} \\[4pt] \frac{\partial g}{\partial\Sigma} & \frac{\partial g}{\partial\bar{\nu}} \end{pmatrix}
$$

First, we compute the partial derivatives:

$$
\begin{aligned}
\frac{\partial f}{\partial\Sigma} &= \frac{\eta^{2}S^{2}}{2\bar{\nu}} - 2, &
\frac{\partial f}{\partial\bar{\nu}} &= -\frac{\eta^{2}S^{2}\Sigma}{2\bar{\nu}^{2}}, \\
\frac{\partial g}{\partial\Sigma} &= (1-\beta_{2})S^{2}, &
\frac{\partial g}{\partial\bar{\nu}} &= -(1-\beta_{2}).
\end{aligned}
$$

Next, we evaluate these derivatives at the fixed point $\Sigma^{*} = \frac{\eta^{2}}{4}$ and $\bar{\nu}^{*} = \frac{S^{2}\eta^{2}}{4}$.

For $J_{11}$:

$$
\frac{\partial f}{\partial\Sigma}\Bigg|_{*} = \frac{\eta^{2}S^{2}}{2\left(\frac{S^{2}\eta^{2}}{4}\right)} - 2 = 0
$$

For $J_{12}$:

$$
\frac{\partial f}{\partial\bar{\nu}}\Bigg|_{*} = -\frac{\eta^{2}S^{2}\left(\frac{\eta^{2}}{4}\right)}{2\left(\frac{S^{2}\eta^{2}}{4}\right)^{2}} = -\frac{2}{S^{2}}
$$

For $J_{21}$ and $J_{22}$ (which are independent of $\Sigma$ and $\bar{\nu}$):

$$
\frac{\partial g}{\partial\Sigma}\Bigg|_{*} = (1-\beta_{2})S^{2}, \qquad
\frac{\partial g}{\partial\bar{\nu}}\Bigg|_{*} = -(1-\beta_{2})
$$

The Jacobian at the fixed point is therefore:

$$
J^{*} = \begin{pmatrix} 0 & -\frac{2}{S^{2}} \\ (1-\beta_{2})S^{2} & -(1-\beta_{2}) \end{pmatrix}
$$

To determine stability, we compute the trace and the determinant of $J^{*}$:

$$
\operatorname{Tr}(J^{*}) = -(1-\beta_{2}), \qquad \operatorname{Det}(J^{*}) = 2(1-\beta_{2})
$$

As $\beta_{2} < 1$, we have $\operatorname{Tr}(J^{*}) < 0$ and $\operatorname{Det}(J^{*}) > 0$, so the fixed point is stable.
This highlights an important difference between gradient descent and RMSProp. Gradient descent exhibits a strict sharpness threshold: on a quadratic loss, the iterates diverge to infinity if $S > 2/\eta$. For RMSProp, the preconditioner adapts to the local sharpness, so on a quadratic the iterates do not diverge for any value of $S$. Instead of diverging, RMSProp locks into a stable oscillation with half-displacement $\delta^{*} = \eta/2$.
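As a quick numerical sanity check of the linear stability analysis above, the following sketch builds $J^{*}$ for illustrative values of $S$ and $\beta_2$ (both chosen here for the example, not taken from the paper) and computes its eigenvalues with the quadratic formula. The helper names are hypothetical.

```python
import cmath

def jacobian_at_fixed_point(S, beta2):
    """Jacobian of (dSigma/dt, dnu/dt) at Sigma* = eta^2/4, nu* = S^2 eta^2/4.

    Note that eta cancels out of every entry.
    """
    return [[0.0, -2.0 / S**2],
            [(1.0 - beta2) * S**2, -(1.0 - beta2)]]

def eigenvalues_2x2(J):
    """Roots of lambda^2 - Tr(J) * lambda + Det(J) = 0."""
    tr = J[0][0] + J[1][1]
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

S, beta2 = 10.0, 0.999  # illustrative values (assumption)
lam1, lam2 = eigenvalues_2x2(jacobian_at_fixed_point(S, beta2))
print(lam1, lam2)  # both real parts should be negative
```

For these values the eigenvalues form a complex-conjugate pair with negative real part, which corresponds to a decaying spiral toward the fixed point in the linearized system.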
## Appendix E Rod Flow for Adam

Table 2: Summary of rod flow models. Each row lists the optimizer, its preconditioner type, and the predicted self-stabilization threshold for the preconditioned sharpness.

### E.1 Derivation of Rod Flows

#### E.1.1 Adam
The discrete update equations for Adam are:

$$
\begin{align}
\nu_{t+1} &= \beta_{2}\nu_{t} + (1-\beta_{2})\nabla L(w_{t})^{\odot 2}, \tag{191} \\
m_{t+1} &= \beta_{1}m_{t} + (1-\beta_{1})\nabla L(w_{t}), \tag{192} \\
w_{t+1} &= w_{t} - \eta\,P_{t+1}^{-1}\,m_{t+1}, \tag{193}
\end{align}
$$

where $\beta_{1}$ is the momentum coefficient and $\beta_{2}$ controls the decay of the second moment estimate. Here $P_{t+1} = P(\nu_{t+1})$ is the preconditioner; the derivation below writes $P^{-1}(\nu)$ out as division by $\sqrt{\nu} + \varepsilon$.

To construct the rod flow for Adam, we must carefully treat the distinct behaviors of these three variables at the edge of stability.

As was the case with heavy ball momentum, the position and momentum oscillate in lockstep. It is therefore natural to concatenate them into a single phase-space vector $z_{t} \in \mathbb{R}^{2d}$:

$$
z_{t} = \begin{pmatrix} w_{t} \\ m_{t} \end{pmatrix}. \tag{194}
$$

We define the phase-space midpoint $\bar{z}_{t}$ and the phase-space half-displacement $\Delta_{t}$ exactly as before:

$$
\begin{align}
\bar{z}_{t} &= \tfrac{1}{2}(z_{t+1} + z_{t}) = \begin{pmatrix} \bar{w}_{t} \\ \bar{m}_{t} \end{pmatrix}, \tag{195} \\
\Delta_{t} &= \tfrac{1}{2}(z_{t+1} - z_{t}) = \begin{pmatrix} \delta_{t} \\ \gamma_{t} \end{pmatrix}. \tag{196}
\end{align}
$$

By substituting the momentum update (Equation [192](https://arxiv.org/html/2605.06821#A5.E192)) directly into the position update (Equation [193](https://arxiv.org/html/2605.06821#A5.E193)), we can define the phase-space update function $\Phi$:

$$
z_{t+1} = z_{t} + \Phi(z_{t};\nu_{t+1}), \qquad
\Phi(z;\nu) = \begin{pmatrix} -\eta\,P^{-1}(\nu)\bigl[\beta_{1}m + (1-\beta_{1})\nabla L(w)\bigr] \\ (1-\beta_{1})\bigl[\nabla L(w) - m\bigr] \end{pmatrix}. \tag{197}
$$

For the difference equation of $\bar{w}$, we have:

$$
\begin{aligned}
\bar{w}_{t+1} - \bar{w}_{t}
&= \frac{w_{t+2} + w_{t+1}}{2} - \frac{w_{t+1} + w_{t}}{2} \\
&= \frac{w_{t+2} - w_{t+1}}{2} + \frac{w_{t+1} - w_{t}}{2} \\
&= -\frac{\eta}{2(\sqrt{\nu_{t+2}} + \varepsilon)}\,m_{t+2} - \frac{\eta}{2(\sqrt{\nu_{t+1}} + \varepsilon)}\,m_{t+1} \\
&= -\frac{\eta}{2(\sqrt{\nu_{t+2}} + \varepsilon)}\bigl[\beta_{1}m_{t+1} + (1-\beta_{1})\nabla L(w_{t+1})\bigr] \\
&\qquad - \frac{\eta}{2(\sqrt{\nu_{t+1}} + \varepsilon)}\bigl[\beta_{1}m_{t} + (1-\beta_{1})\nabla L(w_{t})\bigr] \\
&\approx -\frac{\eta}{\sqrt{\bar{\nu}} + \varepsilon}\left[\beta_{1}\frac{m_{t+1} + m_{t}}{2} + (1-\beta_{1})\frac{\nabla L(w_{t+1}) + \nabla L(w_{t})}{2}\right] \\
&= -\frac{\eta}{\sqrt{\bar{\nu}} + \varepsilon}\bigl[\beta_{1}\bar{m} + (1-\beta_{1})\bar{g}\bigr].
\end{aligned}
$$

For the difference equation of $\bar{m}$:

$$
\begin{aligned}
\bar{m}_{t+1} - \bar{m}_{t}
&= \frac{m_{t+2} + m_{t+1}}{2} - \frac{m_{t+1} + m_{t}}{2} \\
&= \frac{m_{t+2} - m_{t+1}}{2} + \frac{m_{t+1} - m_{t}}{2} \\
&= \frac{1-\beta_{1}}{2}\bigl[\nabla L(w_{t+1}) - m_{t+1}\bigr] + \frac{1-\beta_{1}}{2}\bigl[\nabla L(w_{t}) - m_{t}\bigr] \\
&= (1-\beta_{1})\left[\frac{\nabla L(w_{t+1}) + \nabla L(w_{t})}{2} - \frac{m_{t+1} + m_{t}}{2}\right] \\
&= (1-\beta_{1})\,[\bar{g} - \bar{m}].
\end{aligned}
$$

As was the case with RMSProp, the difference equation for $\bar{w}$ is not exact (unlike in the gradient descent case). Because we do not track the half-difference of the second moment $\nu$, both $\nu_{t+1}$ and $\nu_{t+2}$ are replaced by the midpoint $\bar{\nu}$. The resulting error should be small, especially for $\beta_{2}$ close to 1, since $\nu$ then changes slowly between consecutive steps.
Let $\Phi_{\pm}$ denote $\Phi(\bar{z} \pm \Delta;\bar{\nu})$. Then all together, the phase-space difference equation can be expressed as:

$$
\bar{z}_{t+1} - \bar{z}_{t} = \tfrac{1}{2}\bigl[\Phi_{+} + \Phi_{-}\bigr].
$$

For the difference equation of the extent, we have:

$$
\begin{aligned}
\Delta_{t+1}\otimes\Delta_{t+1} - \Delta_{t}\otimes\Delta_{t}
&= \Delta_{t+1}\otimes\Delta_{t+1} - \Delta_{t}\otimes\Delta_{t} + (\Delta_{t}\otimes\Delta_{t} - \Delta_{t}\otimes\Delta_{t}) \\
&= \Delta_{t+1}\otimes\Delta_{t+1} + \Delta_{t}\otimes\Delta_{t} - 2\,\Delta_{t}\otimes\Delta_{t} \\
&= \tfrac{1}{4}\bigl(\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr) - 2\,\Delta_{t}\otimes\Delta_{t}.
\end{aligned}
$$

And for the difference equation of the midpoint of $\nu$:

$$
\begin{aligned}
\bar{\nu}_{t+1} - \bar{\nu}_{t}
&= \frac{\nu_{t+2} + \nu_{t+1}}{2} - \frac{\nu_{t+1} + \nu_{t}}{2} \\
&= \frac{\nu_{t+2} - \nu_{t+1}}{2} + \frac{\nu_{t+1} - \nu_{t}}{2} \\
&= \frac{1-\beta_{2}}{2}\bigl[\nabla L(w_{t+1})^{\odot 2} - \nu_{t+1}\bigr] + \frac{1-\beta_{2}}{2}\bigl[\nabla L(w_{t})^{\odot 2} - \nu_{t}\bigr] \\
&= (1-\beta_{2})\left[\frac{\nabla L(w_{t+1})^{\odot 2} + \nabla L(w_{t})^{\odot 2}}{2} - \frac{\nu_{t+1} + \nu_{t}}{2}\right] \\
&= (1-\beta_{2})\left[\frac{\nabla L(w_{t+1})^{\odot 2} + \nabla L(w_{t})^{\odot 2}}{2} - \bar{\nu}\right].
\end{aligned}
$$

Promoting these discrete difference equations to continuous-time ODEs yields the rod flow for Adam:

$$
\begin{align}
\frac{d\bar{w}}{dt} &= -\eta\,P^{-1}(\bar{\nu})\bigl[\beta_{1}\,\bar{m} + (1-\beta_{1})\,\bar{g}\bigr], \tag{198} \\
\frac{d\bar{m}}{dt} &= (1-\beta_{1})\bigl[\bar{g} - \bar{m}\bigr], \tag{199} \\
\frac{d\bar{\nu}}{dt} &= (1-\beta_{2})\biggl(\frac{\nabla L_{+}^{\odot 2} + \nabla L_{-}^{\odot 2}}{2} - \bar{\nu}\biggr), \tag{200} \\
\frac{d\Sigma}{dt} &= \tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr] - 2\,\Sigma. \tag{201}
\end{align}
$$
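The phase-space form of the update in Equation 197 can be checked against the three-line Adam update directly. The sketch below is a scalar illustration under stated assumptions: the quadratic test loss, the hyperparameter values, and the function names `grad`, `adam_step`, and `phi` are all hypothetical choices for this example, and bias correction is omitted as in Equations 191 to 193.

```python
import math

def grad(w, S=2.0):
    """Gradient of the illustrative quadratic L(w) = 0.5 * S * w**2 (assumption)."""
    return S * w

def adam_step(w, m, nu, eta, beta1, beta2, eps):
    """One scalar Adam step following Equations 191-193 (no bias correction)."""
    g = grad(w)
    nu_next = beta2 * nu + (1.0 - beta2) * g * g
    m_next = beta1 * m + (1.0 - beta1) * g
    w_next = w - eta * m_next / (math.sqrt(nu_next) + eps)
    return w_next, m_next, nu_next

def phi(w, m, nu, eta, beta1, eps):
    """Phase-space update Phi(z; nu) from Equation 197, with z = (w, m)."""
    g = grad(w)
    dw = -eta * (beta1 * m + (1.0 - beta1) * g) / (math.sqrt(nu) + eps)
    dm = (1.0 - beta1) * (g - m)
    return dw, dm

eta, beta1, beta2, eps = 1e-2, 0.9, 0.999, 1e-8
w, m, nu = 1.0, 0.0, 0.0
max_gap = 0.0
for _ in range(5):
    w_next, m_next, nu_next = adam_step(w, m, nu, eta, beta1, beta2, eps)
    dw, dm = phi(w, m, nu_next, eta, beta1, eps)  # Phi uses the updated nu_{t+1}
    max_gap = max(max_gap, abs(w + dw - w_next), abs(m + dm - m_next))
    w, m, nu = w_next, m_next, nu_next
print(max_gap)  # discrepancy between the two forms, at rounding level
```

Note that $\Phi$ is evaluated with the already-updated second moment $\nu_{t+1}$, which is why the second moment is treated as an auxiliary argument rather than as part of $z$.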
#### E.1.2 NAdam

The update equations for NAdam are:

$$
\begin{align}
\nu_{t+1} &= \beta_{2}\nu_{t} + (1-\beta_{2})\nabla L(w_{t})^{\odot 2}, \tag{202} \\
m_{t+1} &= \beta_{1}m_{t} + (1-\beta_{1})\nabla L(w_{t}), \tag{203} \\
w_{t+1} &= w_{t} - \eta\,P^{-1}_{t+1}\,\widetilde{m}_{t+1}, \tag{204}
\end{align}
$$

where $\widetilde{m}$ is the modified momentum:

$$
\widetilde{m}_{t+1} = \beta_{1}m_{t+1} + (1-\beta_{1})\nabla L(w_{t}). \tag{205}
$$

Relative to raw momentum, the modified momentum more heavily weights the most recent gradient. We can express the modified momentum in terms of $m_{t}$ and $\nabla L(w_{t})$:

$$
\begin{aligned}
\widetilde{m}_{t+1}
&= \beta_{1}m_{t+1} + (1-\beta_{1})\nabla L(w_{t}) \\
&= \beta_{1}\bigl[\beta_{1}m_{t} + (1-\beta_{1})\nabla L(w_{t})\bigr] + (1-\beta_{1})\nabla L(w_{t}) \\
&= \beta_{1}^{2}m_{t} + \bigl[\beta_{1}(1-\beta_{1}) + (1-\beta_{1})\bigr]\nabla L(w_{t}) \\
&= \beta_{1}^{2}m_{t} + (1-\beta_{1}^{2})\nabla L(w_{t}).
\end{aligned}
$$

By substituting this modified momentum directly into the position update (Equation [204](https://arxiv.org/html/2605.06821#A5.E204)), we define the phase-space update function $\Phi$ for NAdam:

$$
\Phi(z;\nu) = \begin{pmatrix} -\eta\,P^{-1}(\nu)\bigl[\beta_{1}^{2}m + (1-\beta_{1}^{2})\nabla L(w)\bigr] \\ (1-\beta_{1})\bigl[\nabla L(w) - m\bigr] \end{pmatrix}. \tag{206}
$$

For the difference equation of $\bar{w}$, we have:

$$
\begin{aligned}
\bar{w}_{t+1} - \bar{w}_{t}
&= \frac{w_{t+2} + w_{t+1}}{2} - \frac{w_{t+1} + w_{t}}{2} \\
&= \frac{w_{t+2} - w_{t+1}}{2} + \frac{w_{t+1} - w_{t}}{2} \\
&= -\frac{\eta}{2(\sqrt{\nu_{t+2}} + \varepsilon)}\,\widetilde{m}_{t+2} - \frac{\eta}{2(\sqrt{\nu_{t+1}} + \varepsilon)}\,\widetilde{m}_{t+1} \\
&= -\frac{\eta}{2(\sqrt{\nu_{t+2}} + \varepsilon)}\bigl[\beta_{1}^{2}m_{t+1} + (1-\beta_{1}^{2})\nabla L(w_{t+1})\bigr] \\
&\qquad - \frac{\eta}{2(\sqrt{\nu_{t+1}} + \varepsilon)}\bigl[\beta_{1}^{2}m_{t} + (1-\beta_{1}^{2})\nabla L(w_{t})\bigr] \\
&\approx -\frac{\eta}{\sqrt{\bar{\nu}} + \varepsilon}\,\biggl[\beta_{1}^{2}\frac{m_{t+1} + m_{t}}{2} + (1-\beta_{1}^{2})\frac{\nabla L(w_{t+1}) + \nabla L(w_{t})}{2}\biggr] \\
&= -\frac{\eta}{\sqrt{\bar{\nu}} + \varepsilon}\,\bigl[\beta_{1}^{2}\bar{m} + (1-\beta_{1}^{2})\bar{g}\bigr].
\end{aligned}
$$

Because the updates for $m_{t}$ and $\nu_{t}$ remain structurally identical to standard Adam, the difference equations for $\bar{m}$ and $\bar{\nu}$ are unchanged:

$$
\begin{aligned}
\bar{m}_{t+1} - \bar{m}_{t} &= (1-\beta_{1})\,[\bar{g} - \bar{m}], \\
\bar{\nu}_{t+1} - \bar{\nu}_{t} &= (1-\beta_{2})\,\biggl[\frac{\nabla L(w_{t+1})^{\odot 2} + \nabla L(w_{t})^{\odot 2}}{2} - \bar{\nu}\biggr].
\end{aligned}
$$

Similarly, the phase-space difference equation and the extent difference equation take the exact same form as for Adam, substituting our new $\Phi_{\pm}$:

$$
\begin{aligned}
\bar{z}_{t+1} - \bar{z}_{t} &= \tfrac{1}{2}\bigl[\Phi_{+} + \Phi_{-}\bigr], \\
\Delta_{t+1}\otimes\Delta_{t+1} - \Delta_{t}\otimes\Delta_{t} &= \tfrac{1}{4}\bigl(\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr) - 2\,\Delta_{t}\otimes\Delta_{t}.
\end{aligned}
$$

Promoting these discrete difference equations to continuous-time ODEs yields the rod flow for NAdam:

$$
\begin{align}
\frac{d\bar{w}}{dt} &= -\eta\,P^{-1}(\bar{\nu})\bigl[\beta_{1}^{2}\,\bar{m} + (1-\beta_{1}^{2})\,\bar{g}\bigr], \tag{207} \\
\frac{d\bar{m}}{dt} &= (1-\beta_{1})\bigl[\bar{g} - \bar{m}\bigr], \tag{208} \\
\frac{d\bar{\nu}}{dt} &= (1-\beta_{2})\biggl(\frac{\nabla L_{+}^{\odot 2} + \nabla L_{-}^{\odot 2}}{2} - \bar{\nu}\biggr), \tag{209} \\
\frac{d\Sigma}{dt} &= \tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr] - 2\,\Sigma. \tag{210}
\end{align}
$$
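The only place NAdam differs from Adam in the equations above is the weighting $\beta_{1}^{2}$ and $1-\beta_{1}^{2}$ in the position update. A minimal check of that algebra, with hypothetical function names and arbitrary test values, might look like this:

```python
def nadam_direction_two_stage(m, g, beta1):
    """Modified momentum built in two stages (Equations 203 and 205)."""
    m_next = beta1 * m + (1.0 - beta1) * g
    return beta1 * m_next + (1.0 - beta1) * g

def nadam_direction_closed_form(m, g, beta1):
    """Closed form: beta1^2 * m + (1 - beta1^2) * g."""
    return beta1**2 * m + (1.0 - beta1**2) * g

# Arbitrary (m, g, beta1) triples chosen for illustration.
cases = [(0.3, -1.2, 0.9), (-2.0, 0.5, 0.99), (0.0, 1.0, 0.5)]
gaps = [abs(nadam_direction_two_stage(m, g, b) - nadam_direction_closed_form(m, g, b))
        for m, g, b in cases]
print(max(gaps))  # the two forms agree up to rounding
```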
### E.2 NAdam versus Nesterov Momentum

Despite its name, NAdam is not *quite* Adam with Nesterov momentum replacing heavy-ball momentum. True Nesterov momentum requires a look-ahead gradient, evaluated at the shifted point $w_{t} - \eta\beta m_{t}$ rather than at the current iterate, which is awkward to slot into standard training loops. The look-ahead gradient requires temporarily perturbing the parameters before each forward/backward pass, and it does not compose cleanly with Adam’s bias correction or a time-varying $\beta$. Instead, NAdam adopts Dozat’s formulation (building on Sutskever’s change of variables), which evaluates the gradient only at the current position. With constant $\beta$, it is mathematically equivalent to Nesterov momentum, only expressed in a shifted coordinate system.

Recall the Nesterov momentum update equations:

$$
\begin{aligned}
m_{t+1} &= \beta m_{t} + (1-\beta)\nabla L(w_{t} - \eta\beta m_{t}), \\
w_{t+1} &= w_{t} - \eta\bigl[\beta m_{t} + (1-\beta)\nabla L(w_{t} - \eta\beta m_{t})\bigr].
\end{aligned}
$$

Dozat’s formulation (equivalent to NAdam without the preconditioner) replaces these with:

$$
\begin{aligned}
m_{t+1} &= \beta m_{t} + (1-\beta)\nabla L(w_{t}), \\
w_{t+1} &= w_{t} - \eta\bigl[\beta^{2}m_{t} + (1-\beta^{2})\nabla L(w_{t})\bigr].
\end{aligned}
$$

To see the relationship between the two, introduce the change of variables $\theta_{t} := w_{t} - \eta\beta m_{t}$. One can see that $\theta_{t}$ is precisely the “look-ahead” point at which Nesterov evaluates the gradient. Substituting into the Nesterov update for $w_{t+1}$ gives:

$$
\theta_{t+1} + \eta\beta m_{t+1} = \theta_{t} + \eta\beta m_{t} - \eta\bigl[\beta m_{t} + (1-\beta)\nabla L(\theta_{t})\bigr].
$$

Isolating $\theta_{t+1}$ yields:

$$
\theta_{t+1} = \theta_{t} - \eta\bigl[\beta m_{t+1} + (1-\beta)\nabla L(\theta_{t})\bigr].
$$

Expanding $m_{t+1} = \beta m_{t} + (1-\beta)\nabla L(\theta_{t})$ in the last expression recovers Dozat’s position update rule.

Hence, Nesterov momentum and Dozat’s formulation are equivalent up to a change of variables: given an initial condition $(w_{0}, m_{0})$ for Nesterov, the shifted iterate $\theta_{0} = w_{0} - \eta\beta m_{0}$ run under Dozat’s update produces the same trajectory of look-ahead points, and shifting back recovers the original Nesterov iterates.
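The change of variables above can be verified numerically. The following sketch runs both update rules side by side on an illustrative quadratic and checks that the Dozat iterate coincides with the Nesterov look-ahead point $w_{t} - \eta\beta m_{t}$ at every step. The loss, the hyperparameters, the initial condition, and the function names are assumptions made for this example.

```python
def grad(w, S=3.0):
    """Gradient of the illustrative loss L(w) = 0.5 * S * w**2 (assumption)."""
    return S * w

def nesterov_step(w, m, eta, beta):
    """Nesterov momentum with an explicit look-ahead gradient."""
    g = grad(w - eta * beta * m)
    m_next = beta * m + (1.0 - beta) * g
    return w - eta * m_next, m_next

def dozat_step(theta, m, eta, beta):
    """Dozat's formulation: gradient evaluated at the current position only."""
    g = grad(theta)
    m_next = beta * m + (1.0 - beta) * g
    return theta - eta * (beta**2 * m + (1.0 - beta**2) * g), m_next

eta, beta = 0.1, 0.9
w, m_n = 1.0, 0.5                  # Nesterov initial condition (w_0, m_0)
theta, m_d = w - eta * beta * m_n, m_n   # shifted initial condition for Dozat
max_gap = 0.0
for _ in range(20):
    w, m_n = nesterov_step(w, m_n, eta, beta)
    theta, m_d = dozat_step(theta, m_d, eta, beta)
    max_gap = max(max_gap, abs(theta - (w - eta * beta * m_n)), abs(m_d - m_n))
print(max_gap)  # stays at rounding level
```

The same check should pass for any differentiable loss plugged into `grad`, since the derivation never uses the quadratic form.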
We can show that Dozat’s formulation has the same sharpness threshold $S^{\ast}$ on a quadratic objective as Nesterov momentum. Consider the quadratic loss $L(w) = \tfrac{1}{2}Sw^{2}$, for which $\nabla L = Sw$. Substituting into Dozat’s formulation gives:

$$
\begin{align}
m_{t+1} &= \beta m_{t} + (1-\beta)Sw_{t}, \tag{211} \\
w_{t+1} &= w_{t} - \eta\bigl[\beta^{2}m_{t} + (1-\beta^{2})Sw_{t}\bigr]. \tag{212}
\end{align}
$$

This is a linear system, which we can express in matrix form as:

$$
\begin{pmatrix} w_{t+1} \\ m_{t+1} \end{pmatrix} = M\begin{pmatrix} w_{t} \\ m_{t} \end{pmatrix}, \qquad
M = \begin{pmatrix} 1 - \eta(1-\beta^{2})S & -\eta\beta^{2} \\ (1-\beta)S & \beta \end{pmatrix}. \tag{213}
$$

The system is stable when the eigenvalues of $M$ lie within the unit circle. The characteristic polynomial is determined by the trace and determinant:

$$
\begin{align}
\operatorname{Tr}(M) &= 1 + \beta - \eta(1-\beta^{2})S, \tag{214} \\
\operatorname{Det}(M) &= \beta - \eta\beta(1-\beta)S. \tag{215}
\end{align}
$$

This gives the characteristic polynomial:

$$
\lambda^{2} - \bigl[1 + \beta - \eta(1-\beta^{2})S\bigr]\lambda + \bigl[\beta - \eta\beta(1-\beta)S\bigr] = 0. \tag{216}
$$

We can find the sharpness threshold $S^{\ast}$ by plugging in $\lambda = -1$, which corresponds to period-2 oscillations:

$$
1 + \bigl[1 + \beta - \eta(1-\beta^{2})S^{\ast}\bigr] + \bigl[\beta - \eta\beta(1-\beta)S^{\ast}\bigr] = 0. \tag{217}
$$

Collecting terms:

$$
2(1+\beta) - \eta(1-\beta)S^{\ast}\bigl[1 + 2\beta\bigr] = 0. \tag{218}
$$

Solving for $S^{\ast}$ yields:

$$
S^{\ast} = \frac{2(1+\beta)}{\eta(1-\beta)(1+2\beta)}. \tag{219}
$$

This matches the sharpness threshold for Nesterov momentum on a quadratic.
However, in the presence of bias correction, Dozat’s formulation of NAdam does not coincide with a simple application of Nesterov momentum to Adam\.
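The threshold in Equation 219 can be checked by computing the eigenvalues of the matrix $M$ from Equation 213 directly. The sketch below uses illustrative values of $\eta$ and $\beta$ (assumptions for this example) and hypothetical helper names; it evaluates the spectral radius slightly below, at, and slightly above $S^{\ast}$.

```python
import cmath

def dozat_matrix(S, eta, beta):
    """Update matrix M from Equation 213 for L(w) = 0.5 * S * w**2."""
    return [[1.0 - eta * (1.0 - beta**2) * S, -eta * beta**2],
            [(1.0 - beta) * S, beta]]

def eigenvalues_2x2(M):
    """Roots of lambda^2 - Tr(M) * lambda + Det(M) = 0."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + disc) / 2.0, (tr - disc) / 2.0

def spectral_radius(S, eta, beta):
    return max(abs(lam) for lam in eigenvalues_2x2(dozat_matrix(S, eta, beta)))

def sharpness_threshold(eta, beta):
    """S* from Equation 219."""
    return 2.0 * (1.0 + beta) / (eta * (1.0 - beta) * (1.0 + 2.0 * beta))

eta, beta = 0.01, 0.9  # illustrative values (assumption)
S_star = sharpness_threshold(eta, beta)
for scale in (0.99, 1.0, 1.01):
    print(scale, spectral_radius(scale * S_star, eta, beta))
```

At $S = S^{\ast}$ one eigenvalue sits at $-1$, and the spectral radius crosses 1 as $S$ passes through the threshold, consistent with the onset of period-2 oscillations.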
### E.3 Theoretical Analysis

#### E.3.1 Linear Loss

As was the case with RMSProp, for the theoretical analysis section, we will drop the regularization parameter $\varepsilon$.

Consider the linear loss $L(w) = b\cdot w$. The gradient is constant everywhere: $\nabla L(w) = b$. Since $\nabla L_{+} = \nabla L_{-} = b$, the average gradient is $\bar{g} = b$. The rod flow equations become:

$$
\begin{align}
\frac{d\bar{\nu}}{dt} &= (1-\beta_{2})(b^{2} - \bar{\nu}), \tag{220} \\
\frac{d\bar{m}}{dt} &= (1-\beta_{1})(b - \bar{m}), \tag{221} \\
\frac{d\bar{w}}{dt} &= -\frac{\eta}{\sqrt{\bar{\nu}}}\,\bigl[\beta_{1}\bar{m} + (1-\beta_{1})b\bigr], \tag{222} \\
\frac{d\Sigma}{dt} &= \tfrac{1}{2}\Phi\otimes\Phi - 2\Sigma, \tag{223}
\end{align}
$$

where, since the gradient is constant, $\Phi_{+} = \Phi_{-} = \Phi$ with:

$$
\Phi = \begin{pmatrix} -\frac{\eta}{\sqrt{\bar{\nu}}}\bigl[\beta_{1}\bar{m} + (1-\beta_{1})b\bigr] \\ (1-\beta_{1})(b - \bar{m}) \end{pmatrix}
$$

Setting $\frac{d\bar{\nu}}{dt} = 0$ and $\frac{d\bar{m}}{dt} = 0$ gives the steady-state momentum and second moment:

$$
\begin{align}
\bar{m}^{*} &= b, \tag{224} \\
\bar{\nu}^{*} &= b^{2}. \tag{225}
\end{align}
$$

Substituting our expressions for $\bar{\nu}^{*}$ and $\bar{m}^{*}$ into the $\bar{w}$ equation yields the steady-state equation of motion for the midpoint of the position:

$$
\frac{d\bar{w}}{dt} = -\eta\operatorname{sign}(b). \tag{226}
$$

Substituting the steady-state values into the phase-space update yields:

$$
\Phi^{*} = \begin{pmatrix} -\eta\operatorname{sign}(b) \\ 0 \end{pmatrix}.
$$

Setting $\frac{d\Sigma}{dt} = 0$ gives the steady-state extent for the position coordinate:

$$
\Sigma_{\delta\delta}^{*} = \frac{\eta^{2}}{4}. \tag{227}
$$

Along flat directions, Adam behaves similarly to RMSProp. Because the gradient does not change, at steady state the exponential moving average of the gradient is indistinguishable from the instantaneous gradient. Thus, Adam recovers the behavior of RMSProp, acting as Sign-GD with step size $\eta$.
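The Sign-GD limit can be observed on the discrete iterates as well. The sketch below runs scalar Adam without bias correction and with $\varepsilon = 0$ (matching the simplifications used in this section) on a linear loss and returns the final step. The function name, the hyperparameters, and the number of steps are assumptions for this example.

```python
import math

def adam_linear_loss(b, eta=1e-3, beta1=0.9, beta2=0.99, steps=5000):
    """Run scalar Adam (no bias correction, eps = 0) on L(w) = b * w.

    Returns the final parameter step w_{t+1} - w_t. Requires b != 0.
    """
    w, m, nu = 0.0, 0.0, 0.0
    step = 0.0
    for _ in range(steps):
        g = b  # the gradient of a linear loss is constant
        nu = beta2 * nu + (1.0 - beta2) * g * g
        m = beta1 * m + (1.0 - beta1) * g
        step = -eta * m / math.sqrt(nu)
        w += step
    return step

for b in (-3.0, 0.5):
    print(b, adam_linear_loss(b))  # step approaches -eta * sign(b)
```

Once the moving averages have converged, the step has magnitude $\eta$ regardless of $|b|$, so the half-displacement is $\eta/2$ and its square matches $\Sigma_{\delta\delta}^{*} = \eta^{2}/4$.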
#### E.3.2 Quadratic Loss

Consider the quadratic $L(w) = \frac{1}{2}Sw^{2}$ with $S > 0$. We will work in one dimension, meaning the phase-space half-displacement is a 2D vector:

$$
\Delta = \begin{pmatrix} \delta \\ \gamma \end{pmatrix}
$$

where $\delta$ represents the position half-displacement and $\gamma$ represents the momentum half-displacement. The phase-space extent is $\Sigma = \Delta\otimes\Delta$.

Evaluating the gradients at the endpoints of the rod yields:

$$
\begin{align}
\nabla L_{+} &= S(\bar{w} + \delta), \tag{228} \\
\nabla L_{-} &= S(\bar{w} - \delta). \tag{229}
\end{align}
$$

Summing these terms and summing their squares gives:

$$
\begin{align}
\nabla L_{+} + \nabla L_{-} &= 2S\bar{w}, \tag{230} \\
\nabla L_{+}^{2} + \nabla L_{-}^{2} &= S^{2}(\bar{w} + \delta)^{2} + S^{2}(\bar{w} - \delta)^{2} = 2S^{2}(\bar{w}^{2} + \delta^{2}). \tag{231}
\end{align}
$$

Plugging these into our rod flow equations for Adam:

$$
\begin{align}
\frac{d\bar{w}}{dt} &= -\frac{\eta}{\sqrt{\bar{\nu}}}\bigl[\beta_{1}\,\bar{m} + (1-\beta_{1})S\bar{w}\bigr], \tag{232} \\
\frac{d\bar{m}}{dt} &= (1-\beta_{1})\bigl[S\bar{w} - \bar{m}\bigr], \tag{233} \\
\frac{d\bar{\nu}}{dt} &= (1-\beta_{2})\left(S^{2}(\bar{w}^{2} + \delta^{2}) - \bar{\nu}\right), \tag{234} \\
\frac{d\Sigma}{dt} &= \tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+} + \Phi_{-}\otimes\Phi_{-}\bigr] - 2\Sigma. \tag{235}
\end{align}
$$

Consider the $(\bar{w}, \bar{m})$ subsystem. For a fixed value of $\bar{\nu}$, it is a linear system:

$$
\frac{d}{dt}\begin{pmatrix} \bar{w} \\ \bar{m} \end{pmatrix} = A\begin{pmatrix} \bar{w} \\ \bar{m} \end{pmatrix}, \quad
A = \begin{pmatrix} -\frac{\eta(1-\beta_{1})S}{\sqrt{\bar{\nu}}} & -\frac{\eta\beta_{1}}{\sqrt{\bar{\nu}}} \\ (1-\beta_{1})S & -(1-\beta_{1}) \end{pmatrix}. \tag{236}
$$

To determine whether the system decays to the origin, we need to evaluate the trace and the determinant of $A$:

$$
\operatorname{Tr}(A) = -(1-\beta_{1})\left(\frac{\eta S}{\sqrt{\bar{\nu}}} + 1\right), \qquad
\operatorname{Det}(A) = \frac{\eta S(1-\beta_{1})}{\sqrt{\bar{\nu}}}.
$$

Since $\operatorname{Tr}(A) < 0$ and $\operatorname{Det}(A) > 0$ for every $\bar{\nu} > 0$, the midpoint $(\bar{w}, \bar{m})$ decays to the origin.
We will now analyze the coupled dynamics of the extent and the second moment at steady state. With $\bar{w} = 0$, the rod flow equation for the second moment becomes:

$$
\frac{d\bar{\nu}}{dt} = (1-\beta_{2})\bigl(S^{2}\delta^{2} - \bar{\nu}\bigr). \tag{237}
$$

Setting $\frac{d\bar{\nu}}{dt} = 0$ yields:

$$
\bar{\nu}^{*} = S^{2}(\delta^{\ast})^{2}. \tag{238}
$$

We will now examine the phase-space extent. At the steady-state edge of stability the midpoint sits at the origin, so the iterate at $+\Delta^{\ast}$ must map to $-\Delta^{\ast}$, that is, $\Phi_{+} = -2\Delta^{\ast}$. Writing out the two components of this equation explicitly:

$$
\begin{align}
-2\delta^{\ast} &= -\frac{\eta}{\sqrt{\bar{\nu}^{*}}}\bigl[\beta_{1}\gamma^{\ast} + (1-\beta_{1})S\delta^{\ast}\bigr], \tag{239} \\
-2\gamma^{\ast} &= (1-\beta_{1})(S\delta^{\ast} - \gamma^{\ast}). \tag{240}
\end{align}
$$

We can solve this system to find the fixed point. First, rearrange the momentum equation ([240](https://arxiv.org/html/2605.06821#A5.E240)) to find the relationship between the momentum half-displacement $\gamma^{\ast}$ and the position half-displacement $\delta^{\ast}$:

$$
\gamma^{\ast} = -\left(\frac{1-\beta_{1}}{1+\beta_{1}}\right)S\delta^{\ast}. \tag{241}
$$

Substituting the expression for $\gamma^{\ast}$ back into the position equation ([239](https://arxiv.org/html/2605.06821#A5.E239)) and using the steady-state second-moment relation $\sqrt{\bar{\nu}^{\ast}} = S\delta^{\ast}$ (taking $\delta^{\ast} > 0$), we obtain a single equation for $\delta^{\ast}$:

$$
\begin{aligned}
-2\delta^{\ast}
&= -\frac{\eta}{S\delta^{\ast}}\left[-\beta_{1}\left(\frac{1-\beta_{1}}{1+\beta_{1}}\right)S\delta^{\ast} + (1-\beta_{1})S\delta^{\ast}\right] \\
&= -\frac{\eta}{S\delta^{\ast}}\,(1-\beta_{1})\,S\delta^{\ast}\left[1 - \frac{\beta_{1}}{1+\beta_{1}}\right] \\
&= -\eta\left(\frac{1-\beta_{1}}{1+\beta_{1}}\right).
\end{aligned}
\tag{242}
$$

The factors of $S\delta^{\ast}$ cancel, leaving an explicit expression for the steady-state half-displacement:

$$
\delta^{\ast} = \frac{\eta}{2}\left(\frac{1-\beta_{1}}{1+\beta_{1}}\right). \tag{243}
$$

It is instructive to compare these results with those of RMSProp.
Broadly, an optimizer wants to maximize progress in flat directions while minimizing the bouncing amplitude in the sharp directions: large oscillations in the sharp directions correspond to less stable gradients\.
Adam behaves similarly to RMSProp in flat directions, where both act like Sign-GD with step size $\eta$. Where their behavior differs is in the sharp directions: for the same step size, Adam bounces far less. By incorporating heavy-ball momentum, Adam reduces the amplitude of edge-of-stability bouncing, suppressing the steady-state half-displacement from RMSProp’s $\eta/2$ by the factor $(1-\beta_{1})/(1+\beta_{1})$. For a typical value $\beta_{1} = 0.9$, this factor is $0.1/1.9 \approx 0.053$, so the bouncing amplitude is roughly $5\%$ of what it would be under RMSProp.
This comparison also suggests that Adam is more naturally viewed as RMSProp with momentum rather than momentum with an adaptive step size: its behavior in flat directions is essentially that of RMSProp, with momentum playing a more prominent role in its behavior in the sharp directions\.
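The steady-state relations in Equations 238, 241, and 243 can be checked against the discrete Adam update itself. The sketch below places a scalar Adam iterate (no bias correction, $\varepsilon = 0$) exactly on the predicted period-2 orbit and takes one step; the position and momentum should flip sign while the second moment stays fixed. The values of $S$, $\eta$, $\beta_{1}$, $\beta_{2}$ and the function name are assumptions for this example, and the check confirms only that the orbit exists, not that nearby trajectories are attracted to it.

```python
import math

def adam_step_quadratic(w, m, nu, S, eta, beta1, beta2):
    """One scalar Adam step on L(w) = 0.5 * S * w**2 (eps = 0, no bias correction)."""
    g = S * w
    nu_next = beta2 * nu + (1.0 - beta2) * g * g
    m_next = beta1 * m + (1.0 - beta1) * g
    w_next = w - eta * m_next / math.sqrt(nu_next)
    return w_next, m_next, nu_next

S, eta, beta1, beta2 = 50.0, 1e-2, 0.9, 0.999  # illustrative values (assumption)
ratio = (1.0 - beta1) / (1.0 + beta1)  # suppression factor relative to RMSProp
delta = 0.5 * eta * ratio              # Equation 243
gamma = -ratio * S * delta             # Equation 241
nu_star = (S * delta) ** 2             # Equation 238

w1, m1, nu1 = adam_step_quadratic(delta, gamma, nu_star, S, eta, beta1, beta2)
print(w1 / delta, m1 / gamma, nu1 / nu_star)  # close to -1, -1, 1
print(ratio)                                   # about 0.053 for beta1 = 0.9
```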
## Appendix F Backward Error Analysis

### F.1 Overview

Backward error analysis (BEA) is a technique for understanding the continuous dynamics underlying a discrete update rule. Rather than viewing a discrete-time system as an approximation to some continuous flow, BEA adopts a converse perspective: we ask which continuous dynamics the discrete iterates solve *exactly*. Given a discrete update rule:

$$
x_{t+1} - x_{t} = D(x_{t}), \tag{244}
$$

we seek a modified vector field $\tilde{V}(x)$ such that the solution to

$$
\frac{dx}{dt} = \tilde{V}(x) \tag{245}
$$

exactly interpolates the discrete iterates: if $x(t) = x_{t}$, then $x(t+1) = x_{t+1}$.

Assuming $x(t)$ is smooth, we can express the discrete forward step using a Taylor series, with the derivatives evaluated at time $t$:

$$
x_{t+1} = x_{t} + \sum_{n=1}^{\infty}\frac{1}{n!}\frac{d^{n}x}{dt^{n}}. \tag{246}
$$

Because $x(t)$ is governed by the time-independent ODE $\dot{x} = \tilde{V}(x)$, all higher-order time derivatives can be expressed purely in terms of spatial derivatives via the chain rule. This is elegantly captured by the Lie derivative associated with $\tilde{V}$, which we denote $\tilde{\mathcal{L}} = \tilde{V}\cdot\nabla$.

Applying this operator repeatedly gives the time derivatives:

$$
\begin{aligned}
\dot{x} &= \tilde{\mathcal{L}}x = \tilde{V}, \\
\ddot{x} &= \tilde{\mathcal{L}}^{2}x = \tilde{\mathcal{L}}\tilde{V} = \nabla\tilde{V}\cdot\tilde{V}, \\
\dddot{x} &= \tilde{\mathcal{L}}^{3}x = \tilde{\mathcal{L}}(\nabla\tilde{V}\cdot\tilde{V}) = \nabla(\nabla\tilde{V}\cdot\tilde{V})\cdot\tilde{V}.
\end{aligned}
$$

Using this operator, the Taylor series becomes an operator exponential:

$$
x_{t+1} = e^{\tilde{\mathcal{L}}}x_{t}. \tag{247}
$$

Equating this to our discrete update $x_{t+1} = x_{t} + D(x_{t})$, we obtain the fundamental operator equation for the modified vector field:

$$
D(x) = (e^{\tilde{\mathcal{L}}} - I)x = \sum_{n=1}^{\infty}\frac{1}{n!}\tilde{\mathcal{L}}^{n}x. \tag{248}
$$

Expanding this explicitly in terms of $\tilde{V}$:

$$
D = \tilde{V} + \tfrac{1}{2}\nabla\tilde{V}\cdot\tilde{V} + \tfrac{1}{6}\nabla(\nabla\tilde{V}\cdot\tilde{V})\cdot\tilde{V} + \mathcal{O}(\tilde{V}^{4}). \tag{249}
$$
To solve for $\tilde{V}$ in terms of the known discrete displacement $D$, we perform a perturbative inversion. We assume $D\sim\mathcal{O}(\eta)$ and expand the modified vector field as an infinite series graded by powers of $\eta$:

$$\tilde{V}=\tilde{V}_{1}+\tilde{V}_{2}+\tilde{V}_{3}+\cdots, \tag{250}$$

where $\tilde{V}_{k}\sim\mathcal{O}(\eta^{k})$. We substitute this series into our expanded fundamental equation:

$$\begin{aligned}
D&=\left(\tilde{V}_{1}+\tilde{V}_{2}+\tilde{V}_{3}+\cdots\right)\\
&\quad+\tfrac{1}{2}\nabla\left(\tilde{V}_{1}+\tilde{V}_{2}+\cdots\right)\cdot\left(\tilde{V}_{1}+\tilde{V}_{2}+\cdots\right)\\
&\quad+\tfrac{1}{6}\nabla\Big(\nabla\left(\tilde{V}_{1}+\cdots\right)\cdot\left(\tilde{V}_{1}+\cdots\right)\Big)\cdot\left(\tilde{V}_{1}+\cdots\right)+\cdots
\end{aligned} \tag{251}$$

We now match terms of the same order in $\eta$ to systematically build up the modified vector field.

Retaining only the $\mathcal{O}(\eta)$ terms, we immediately find:

$$\tilde{V}_{1}=D. \tag{252}$$

Next, we collect all $\mathcal{O}(\eta^{2})$ terms. There are two such contributions: the second-order field $\tilde{V}_{2}$ itself, and the leading term of the $\tfrac{1}{2}\tilde{\mathcal{L}}^{2}x$ expansion. Through second order, the matching condition reads

$$D=\tilde{V}_{1}+\tilde{V}_{2}+\tfrac{1}{2}\nabla\tilde{V}_{1}\cdot\tilde{V}_{1}. \tag{253}$$

Substituting $\tilde{V}_{1}=D$ and solving for $\tilde{V}_{2}$:

$$\tilde{V}_{2}=-\tfrac{1}{2}\nabla D\cdot D. \tag{254}$$

Keeping terms through $\mathcal{O}(\eta^{3})$:

$$D=\tilde{V}_{1}+\tilde{V}_{2}+\tilde{V}_{3}+\tfrac{1}{2}\Big(\nabla\tilde{V}_{1}\cdot\tilde{V}_{2}+\nabla\tilde{V}_{2}\cdot\tilde{V}_{1}\Big)+\tfrac{1}{6}\nabla(\nabla\tilde{V}_{1}\cdot\tilde{V}_{1})\cdot\tilde{V}_{1}. \tag{255}$$

The final term can be rewritten using $\tilde{V}_{2}=-\tfrac{1}{2}\nabla\tilde{V}_{1}\cdot\tilde{V}_{1}$, which gives $\tfrac{1}{6}\nabla(\nabla\tilde{V}_{1}\cdot\tilde{V}_{1})\cdot\tilde{V}_{1}=-\tfrac{1}{3}\nabla\tilde{V}_{2}\cdot\tilde{V}_{1}$. Since the lower-order terms already cancel, the $\mathcal{O}(\eta^{3})$ contributions must satisfy $\tilde{V}_{3}+\tfrac{1}{2}\nabla\tilde{V}_{1}\cdot\tilde{V}_{2}+\tfrac{1}{6}\nabla\tilde{V}_{2}\cdot\tilde{V}_{1}=0$. Using $\tilde{V}_{1}=D$ yields:

$$\tilde{V}_{3}=-\tfrac{1}{2}\nabla D\cdot\tilde{V}_{2}-\tfrac{1}{6}\nabla\tilde{V}_{2}\cdot D. \tag{256}$$

Note that $\tilde{V}_{3}$ can be computed using Jacobian-vector products (JVPs) of previously computed lower-order vector fields, allowing for efficient algorithmic implementation. This generalizes to arbitrary $\tilde{V}_{k}$.
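As a sanity check on the formulas for $\tilde{V}_{1}$, $\tilde{V}_{2}$, and $\tilde{V}_{3}$, the following minimal sketch evaluates them for a scalar linear displacement $D(x)=-\eta\lambda x$ (gradient descent on a quadratic, a test problem chosen here for illustration and not taken from the paper). For this map the exact modified field is $\log(1-\eta\lambda)\,x$, so the truncated series can be compared against it directly. Central finite differences stand in for the JVPs mentioned above; the function name `bea_fields` is hypothetical.

```python
import math

def bea_fields(D, x, h=1e-4):
    """Return (V1, V2, V3) at the scalar point x for the displacement D.

    Derivatives use central finite differences, a simple stand-in for the
    Jacobian-vector products one would use in higher dimensions.
    """
    def ddx(f):
        return lambda y: (f(y + h) - f(y - h)) / (2.0 * h)

    V1 = D                                           # V1 = D
    V2 = lambda y: -0.5 * ddx(D)(y) * D(y)           # V2 = -1/2 (grad D) D
    V3 = lambda y: -0.5 * ddx(D)(y) * V2(y) - ddx(V2)(y) * D(y) / 6.0
    return V1(x), V2(x), V3(x)

# Assumed test problem: gradient descent on L(x) = 0.5 * lam * x**2.
eta, lam = 0.05, 2.0
D = lambda x: -eta * lam * x

x0 = 1.0
V1, V2, V3 = bea_fields(D, x0)
truncated = V1 + V2 + V3
# For a linear map x -> (1 - eta*lam) x, the interpolating ODE is
# dx/dt = log(1 - eta*lam) * x, valid while 1 - eta*lam > 0.
exact = math.log(1.0 - eta * lam) * x0
```

With $\eta\lambda=0.1$, the three terms reproduce the first three terms of the series $\log(1-a)=-a-a^{2}/2-a^{3}/3-\cdots$, and each added order shrinks the gap to the exact field.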
Figure 5: Backward Error Analysis for Adam. Comparison of Adam against stable flow, rod flow, and rod flow with the BEA correction, evaluated on the loss, preconditioned sharpness, and distance between iterates. The BEA correction yields no noticeable improvement.
### F\.2Backward Error Analysis for Rod Flow
We apply the second-order backward error analysis (BEA) correction to the Adam rod flow center dynamics. The modified vector field $\tilde{V}$ is approximated by

$$\tilde{V}\approx D-\frac{1}{2}\nabla D\cdot D, \tag{257}$$

where $D$ represents the discrete difference updates. The discrete displacement for the rod centers $\bar{z}=(\bar{w},\bar{m})$ is given by:

$$D_{\bar{w}}=-\eta\,P^{-1}(\bar{\nu})\bigl[\beta_{1}\bar{m}+(1-\beta_{1})\bar{g}\bigr], \tag{258}$$

$$D_{\bar{m}}=(1-\beta_{1})(\bar{g}-\bar{m}). \tag{259}$$

We restrict the correction to the center variables due to timescale separation: the extent $\Sigma$ equilibrates on an $\mathcal{O}(1)$ timescale and the second-moment estimate $\bar{\nu}$ changes slowly relative to the center dynamics.
The blocks of the Jacobian of $D$ with respect to $\bar{z}$ are:

$$\frac{\partial D_{\bar{w}}}{\partial\bar{w}}=-\eta\,P^{-1}(\bar{\nu})(1-\beta_{1})\bar{H},\qquad\frac{\partial D_{\bar{w}}}{\partial\bar{m}}=-\eta\,P^{-1}(\bar{\nu})\beta_{1}I_{d}, \tag{260}$$

$$\frac{\partial D_{\bar{m}}}{\partial\bar{w}}=(1-\beta_{1})\bar{H},\qquad\frac{\partial D_{\bar{m}}}{\partial\bar{m}}=-(1-\beta_{1})I_{d}, \tag{261}$$

where $\bar{H}=\frac{1}{2}(\nabla^{2}L_{+}+\nabla^{2}L_{-})$ is the average Hessian at the rod endpoints and $I_{d}$ is the $d\times d$ identity matrix.
Computing $\nabla D\cdot D$ blockwise gives:

$$(\nabla D\cdot D)_{\bar{w}}=-\eta\,P^{-1}(\bar{\nu})(1-\beta_{1})\bar{H}D_{\bar{w}}-\eta\,P^{-1}(\bar{\nu})\beta_{1}D_{\bar{m}}, \tag{262}$$

$$(\nabla D\cdot D)_{\bar{m}}=(1-\beta_{1})\bar{H}D_{\bar{w}}-(1-\beta_{1})D_{\bar{m}}. \tag{263}$$

Subtracting half of each from the discrete step $D$ yields the BEA-corrected center dynamics for the continuous variables:

$$\frac{d\bar{w}}{dt}=D_{\bar{w}}+\frac{\eta}{2}\,P^{-1}(\bar{\nu})(1-\beta_{1})\bar{H}D_{\bar{w}}+\frac{\eta}{2}\,P^{-1}(\bar{\nu})\beta_{1}D_{\bar{m}}, \tag{264}$$

$$\frac{d\bar{m}}{dt}=D_{\bar{m}}+\frac{(1-\beta_{1})}{2}D_{\bar{m}}-\frac{(1-\beta_{1})}{2}\bar{H}D_{\bar{w}}. \tag{265}$$
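The blockwise computation can be checked numerically. The sketch below specializes to a single parameter ($d=1$), so that $P^{-1}(\bar{\nu})$, $\bar{H}$, $\bar{g}$, and $\bar{m}$ are all scalars passed in directly; how the preconditioner is built from $\bar{\nu}$ is not assumed here. It forms $D-\tfrac{1}{2}\nabla D\cdot D$ from the Jacobian blocks, and the result can be compared with the closed forms for $d\bar{w}/dt$ and $d\bar{m}/dt$. The function name and the sample values are illustrative only.

```python
def bea_corrected_center_field(m_bar, g_bar, h_bar, p_inv, eta, beta1):
    """Second-order BEA-corrected (dw/dt, dm/dt) for a scalar parameter."""
    # Discrete displacements of the rod center.
    D_w = -eta * p_inv * (beta1 * m_bar + (1.0 - beta1) * g_bar)
    D_m = (1.0 - beta1) * (g_bar - m_bar)

    # Jacobian blocks of D with respect to (w_bar, m_bar).
    J_ww = -eta * p_inv * (1.0 - beta1) * h_bar
    J_wm = -eta * p_inv * beta1
    J_mw = (1.0 - beta1) * h_bar
    J_mm = -(1.0 - beta1)

    # Modified field: D - 1/2 (grad D . D), evaluated block by block.
    dw = D_w - 0.5 * (J_ww * D_w + J_wm * D_m)
    dm = D_m - 0.5 * (J_mw * D_w + J_mm * D_m)
    return (D_w, D_m), (dw, dm)

# Illustrative values (not taken from the experiments).
args = dict(m_bar=0.2, g_bar=0.5, h_bar=3.0, p_inv=10.0, eta=1e-2, beta1=0.8)
(D_w, D_m), (dw, dm) = bea_corrected_center_field(**args)
```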
[Figure 5](https://arxiv.org/html/2605.06821#A6.F5) compares the BEA-corrected rod flow to the uncorrected version. The two are essentially indistinguishable, which suggests that the second-order truncation error between the discrete map and its modified equation is not what drives the discrepancy between rod flow and the discrete iterates. Two likely candidates for that discrepancy are (1) violations of the period-2 oscillation assumption and (2) the presence of multiple sharp directions in the Hessian. In any case, since the BEA correction does not improve accuracy in the deep learning setting, the simpler uncorrected form of rod flow is preferable.
## Appendix GComputational Implementation
Implementing rod flow on realistically\-sized neural networks introduces two challenges\. First, the extent tensorΣ\\Sigmais too large to store explicitly: it isd×dd\\times dfor non\-momentum optimizers and2d×2d2d\\times 2dfor momentum\-based optimizers\. We address this with a low\-rank representation ofΣ\\Sigma\. Second, Adam’s bias\-corrected second\-moment estimate introduces an explicit dependence on the iterate counter that the continuous flow must track\. We address this by incorporating the bias correction directly into the flow\.
### G\.1Low\-Rank Representation ofΣ\\Sigma
Consider the rod flow ODE for $\Sigma$:

$$\frac{d\Sigma}{dt}=\tfrac{1}{4}(\Phi_{+}\otimes\Phi_{+}+\Phi_{-}\otimes\Phi_{-})-2\Sigma.$$

Because of the separation of time scales between the center dynamics and the extent dynamics, $\Sigma$ is always in quasi-equilibrium:

$$\Sigma\approx\tfrac{1}{8}\bigl[\Phi_{+}(\Delta)\otimes\Phi_{+}(\Delta)+\Phi_{-}(\Delta)\otimes\Phi_{-}(\Delta)\bigr]. \tag{266}$$

This expression is at most rank two. In practice, the phase-space displacements at the rod endpoints point in nearly identical directions, which causes $\Sigma$ to be effectively rank one. This justifies a low-rank representation of $\Sigma$: rather than storing the full matrix, we work in the low-dimensional subspace spanned by the current direction of $\Sigma$. We therefore represent:

$$\Sigma=V\Lambda V^{\top},\qquad V\in\mathbb{R}^{2d\times r},\qquad\Lambda\in\mathbb{R}^{r\times r}. \tag{267}$$

Here $V$ is a matrix with orthonormal columns, which represent the directions along which the rod extends in phase space, and $\Lambda$ is a symmetric matrix whose entries give the squared magnitudes of the half-displacements along those directions. We set $r=3$ in all experiments, which reduces the storage cost from $\mathcal{O}(d^{2})$ to $\mathcal{O}(dr)$.
The update of\(V,Λ\)\(V,\\Lambda\)proceeds in five steps:
##### Step 1: Decay\.
Apply the exponential decay from the $-2\Sigma$ term in the ODE:

$$\Lambda\leftarrow(1-2\,dt)\,\Lambda. \tag{268}$$

Since $\Lambda$ is diagonal at the start of each step, this is simply element-wise multiplication.
##### Step 2: Project\.
Decompose the phase-space displacements at the rod's endpoints, $\Phi_{+}$ and $\Phi_{-}$, into components parallel and perpendicular to the current eigenspace:

$$\Phi_{+}^{\parallel}=V^{\top}\Phi_{+},\qquad\Phi_{+}^{\perp}=\Phi_{+}-V\Phi_{+}^{\parallel}, \tag{269}$$

$$\Phi_{-}^{\parallel}=V^{\top}\Phi_{-},\qquad\Phi_{-}^{\perp}=\Phi_{-}-V\Phi_{-}^{\parallel}. \tag{270}$$

The parallel components are $r$-dimensional vectors, while the perpendicular components live in $\mathbb{R}^{2d}$ but are orthogonal to the current basis.
##### Step 3: Augment basis\.
If the perpendicular components have significant norm (above a threshold $\varepsilon_{\rm tol}=10^{-10}$), add them as new basis directions:

$$V_{\text{aug}}=\begin{bmatrix}V&\dfrac{\Phi_{+}^{\perp}}{\|\Phi_{+}^{\perp}\|}&\dfrac{\Phi_{-}^{\perp}}{\|\Phi_{-}^{\perp}\|}\end{bmatrix}\in\mathbb{R}^{2d\times(r+k)}, \tag{271}$$

where $k\in\{0,1,2\}$ counts how many perpendicular components exceed the threshold. We orthogonalize the new directions against each other via Gram–Schmidt.
##### Step 4: Add outer products\.
Extend $\Lambda$ to the augmented basis by padding with zeros for the new directions, then add the rank-2 outer product contributions:

$$\Lambda_{\text{aug}}\leftarrow\Lambda_{\text{aug}}+\tfrac{dt}{4}\left[(V_{\text{aug}}^{\top}\Phi_{+})(V_{\text{aug}}^{\top}\Phi_{+})^{\top}+(V_{\text{aug}}^{\top}\Phi_{-})(V_{\text{aug}}^{\top}\Phi_{-})^{\top}\right]. \tag{272}$$

This produces a symmetric matrix $\Lambda_{\text{aug}}\in\mathbb{R}^{(r+k)\times(r+k)}$ that is generally no longer diagonal.
##### Step 5: Truncate and reorthogonalize\.
Compute the eigendecomposition $\Lambda_{\text{aug}}=U\tilde{\Lambda}U^{\top}$, where the columns of $U$ are eigenvectors and $\tilde{\Lambda}$ is diagonal with eigenvalues in decreasing order. Retaining only the top $r$ eigenpairs, we rotate the basis into the new eigenvector coordinates:

$$V\leftarrow V_{\text{aug}}U_{:,\,:r},\qquad\Lambda\leftarrow\mathrm{diag}(\tilde{\lambda}_{1},\ldots,\tilde{\lambda}_{r}), \tag{273}$$

where $U_{:,\,:r}$ denotes the first $r$ columns of $U$. The new $\Lambda$ is diagonal by construction. We finish with a QR re-orthogonalization of $V$ to correct for numerical drift.
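The five steps above can be sketched in NumPy as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function name `update_low_rank_sigma` is hypothetical, $\Lambda$ is stored as a vector of diagonal entries, and the Gram–Schmidt step is folded into the projection by projecting each displacement against all directions kept so far. Here `n` plays the role of the phase-space dimension $2d$.

```python
import numpy as np

def update_low_rank_sigma(V, lam, phi_plus, phi_minus, dt, r, tol=1e-10):
    """One forward-Euler step of the extent ODE with Sigma = V diag(lam) V^T.

    V has shape (n, r) with orthonormal columns; lam has shape (r,).
    """
    # Step 1: decay from the -2 * Sigma term.
    lam = (1.0 - 2.0 * dt) * lam

    # Steps 2-3: remove the component inside the current basis and keep any
    # significant remainder as a new, normalized basis direction.
    basis = V
    for phi in (phi_plus, phi_minus):
        perp = phi - basis @ (basis.T @ phi)
        norm = np.linalg.norm(perp)
        if norm > tol:
            basis = np.column_stack([basis, perp / norm])

    # Step 4: pad Lambda with zeros, then add the two outer products.
    k = basis.shape[1] - V.shape[1]
    Lam_aug = np.diag(np.concatenate([lam, np.zeros(k)]))
    for phi in (phi_plus, phi_minus):
        coords = basis.T @ phi
        Lam_aug += 0.25 * dt * np.outer(coords, coords)

    # Step 5: keep the top-r eigenpairs, rotate the basis, re-orthogonalize.
    evals, U = np.linalg.eigh(Lam_aug)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:r]
    V_new, _ = np.linalg.qr(basis @ U[:, order])
    return V_new, evals[order]

# Small synthetic example (random data, for illustration only).
rng = np.random.default_rng(0)
n, r, dt = 8, 3, 0.1
V0, _ = np.linalg.qr(rng.normal(size=(n, r)))
lam0 = np.array([3.0, 2.0, 1.0])
phi_p, phi_m = rng.normal(size=n), rng.normal(size=n)
V_next, lam_next = update_low_rank_sigma(V0, lam0, phi_p, phi_m, dt, r)
```

Because the augmented basis is orthonormal, the eigenvalues of the small matrix coincide with the nonzero eigenvalues of the dense update, so the retained values are the top $r$ eigenvalues of the full $\Sigma$ after one Euler step. The QR step may flip the sign of a column, which leaves $V\Lambda V^{\top}$ unchanged.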
### G\.2Bias Correction
Adam utilizes bias\-corrected moment estimates to counteract the fact that its exponential moving averages \(EMAs\) are initialized at zero\.
To understand why this is necessary, consider the linear loss $L=bw$. The gradient is constant for all time: $g_{t}=b$. If we initialize the moments at zero ($m_{0}=0$ and $\nu_{0}=0$), we have that the raw EMAs evolve as:

$$m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}=(1-\beta_{1}^{t})b, \tag{274}$$

$$\nu_{t}=\beta_{2}\nu_{t-1}+(1-\beta_{2})g_{t}^{2}=(1-\beta_{2}^{t})b^{2}. \tag{275}$$

Early in training, these estimates are drastically biased toward zero. With the standard $\beta_{2}=0.999$, the raw second moment $\nu_{t}$ requires $\mathcal{O}(10^{3})$ iterations to properly equilibrate.
To resolve this, Adam defines the bias-corrected moments:

$$\hat{m}_{t}=\frac{m_{t}}{1-\beta_{1}^{t}}, \tag{276}$$

$$\hat{\nu}_{t}=\frac{\nu_{t}}{1-\beta_{2}^{t}}, \tag{277}$$

which, in our linear loss example, perfectly recover the true moments $b$ and $b^{2}$ from the very first step.
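The linear-loss example is easy to reproduce numerically. The sketch below runs both EMAs on a constant gradient and records the raw and bias-corrected values; the function name and the values $b=2$, $\beta_{1}=0.9$ are arbitrary choices for illustration.

```python
def raw_and_corrected_moments(b, beta1, beta2, steps):
    """Run Adam's EMAs on a constant gradient b, starting from m_0 = nu_0 = 0."""
    m, nu = 0.0, 0.0
    history = []
    for t in range(1, steps + 1):
        m = beta1 * m + (1.0 - beta1) * b            # raw first moment
        nu = beta2 * nu + (1.0 - beta2) * b * b      # raw second moment
        m_hat = m / (1.0 - beta1 ** t)               # bias-corrected first moment
        nu_hat = nu / (1.0 - beta2 ** t)             # bias-corrected second moment
        history.append((t, m, nu, m_hat, nu_hat))
    return history

history = raw_and_corrected_moments(b=2.0, beta1=0.9, beta2=0.999, steps=1000)
```

At $t=1$ the raw second moment is only $(1-\beta_{2})b^{2}$, a factor of $1000$ too small, while the corrected value already equals $b^{2}$. Even after $1000$ steps the raw $\nu_{t}$ has reached only about $63\%$ of $b^{2}$, since $1-0.999^{1000}\approx 0.63$.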
Our continuous-time rod flow implements this by inheriting a synthetic step counter from the outer discrete training loop. At each substep of the rod flow, we pass the current discrete iteration index $t$ and compute the bias-correction factors:

$$\mathrm{bc}_{1}=1-\beta_{1}^{\,t+1},\qquad\mathrm{bc}_{2}=1-\beta_{2}^{\,t+1}. \tag{278}$$

Note that the index is $t+1$ rather than $t$. Although the bias-correction factor is computed at time $t$, it is $m_{t+1}$ that actually drives the position update, and this is the moment estimate to which the correction must apply.
In full, the bias-corrected rod flow ODEs for Adam are:

$$\frac{d\bar{w}}{dt}=-\eta\,P^{-1}(\,\widehat{\bar{\nu}}\,)\,\bigl[\beta_{1}\,\widehat{\bar{m}}+(1-\beta_{1})\,\widehat{\bar{g}}\bigr], \tag{279}$$

$$\frac{d\bar{m}}{dt}=(1-\beta_{1})\bigl[\bar{g}-\bar{m}\bigr], \tag{280}$$

$$\frac{d\bar{\nu}}{dt}=(1-\beta_{2})\,\biggl(\frac{\nabla L_{+}^{\odot 2}+\nabla L_{-}^{\odot 2}}{2}-\bar{\nu}\biggr), \tag{281}$$

$$\frac{d\Sigma}{dt}=\tfrac{1}{4}\bigl[\Phi_{+}\otimes\Phi_{+}+\Phi_{-}\otimes\Phi_{-}\bigr]-2\,\Sigma, \tag{282}$$

where the bias-corrected moments and gradient are

$$\widehat{\bar{m}}=\frac{\bar{m}}{\mathrm{bc}_{1}},\qquad\widehat{\bar{\nu}}=\frac{\bar{\nu}}{\mathrm{bc}_{2}},\qquad\widehat{\bar{g}}=\frac{\bar{g}}{\mathrm{bc}_{1}}, \tag{283}$$

so that the bracketed term in the position update,

$$\beta_{1}\,\widehat{\bar{m}}+(1-\beta_{1})\,\widehat{\bar{g}}\;=\;\frac{\beta_{1}\,\bar{m}+(1-\beta_{1})\,\bar{g}}{\mathrm{bc}_{1}}, \tag{284}$$

is the bias-corrected version of $m_{t+1}$.
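The choice of exponent $t+1$ can be checked on the same constant-gradient example: with $g=b$ and $m_{t}=(1-\beta_{1}^{t})b$, the numerator $\beta_{1}m_{t}+(1-\beta_{1})b$ equals $(1-\beta_{1}^{t+1})b$, so dividing by $1-\beta_{1}^{t+1}$ recovers $b$ exactly, whereas dividing by $1-\beta_{1}^{t}$ does not. The sketch below works with scalars; the helper names and the `shift` argument are hypothetical and exist only to contrast the two exponents.

```python
def ema_after(t, g, beta1):
    """First-moment EMA after t updates on a constant gradient g, from m_0 = 0."""
    m = 0.0
    for _ in range(t):
        m = beta1 * m + (1.0 - beta1) * g
    return m

def corrected_drive(m_t, g, beta1, t, shift=1):
    """Bracketed term of the position update divided by 1 - beta1**(t + shift)."""
    bc1 = 1.0 - beta1 ** (t + shift)
    return (beta1 * m_t + (1.0 - beta1) * g) / bc1

b, beta1, t = 2.0, 0.9, 1
m_t = ema_after(t, b, beta1)
with_t_plus_1 = corrected_drive(m_t, b, beta1, t, shift=1)   # exponent t + 1
with_t = corrected_drive(m_t, b, beta1, t, shift=0)          # exponent t (mismatched)
```

In this example the mismatched exponent overshoots by a factor of $(1-\beta_{1}^{2})/(1-\beta_{1})=1+\beta_{1}$.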
## Appendix HExperimental Details
### H\.1Stable Flows
The *stable flow* of a discrete-time optimizer is its naïve continuous-time limit. For gradient descent, the stable flow is the familiar gradient flow:

$$\frac{dw}{dt}=-\eta\,\nabla L(w). \tag{285}$$
Stable flows track the discrete optimizer well when the sharpness \(or preconditioned sharpness\) is below the EoS threshold\. During EoS, they fail to capture the oscillatory dynamics that the discrete optimizer exhibits about its center trajectory\. We use the stable flow as the natural baseline against which to benchmark rod flow\.
Below are the stable flows for the eight optimizers that we studied\.
Heavy Ball

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\bigl[\beta m+(1-\beta)\nabla L(w)\bigr]\\
\frac{dm}{dt}&=(1-\beta)\bigl[\nabla L(w)-m\bigr]
\end{aligned}$$

Nesterov

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\bigl[\beta m+(1-\beta)\nabla L(w-\eta\beta m)\bigr]\\
\frac{dm}{dt}&=(1-\beta)\bigl[\nabla L(w-\eta\beta m)-m\bigr]
\end{aligned}$$

Scalar RMSProp

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\,P^{-1}(\nu)\nabla L(w)\\
\frac{d\nu}{dt}&=(1-\beta_{2})\bigl[\|\nabla L(w)\|^{2}-\nu\bigr]
\end{aligned}$$

RMSProp

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\,P^{-1}(\nu)\nabla L(w)\\
\frac{d\nu}{dt}&=(1-\beta_{2})\bigl[\nabla L(w)^{\odot 2}-\nu\bigr]
\end{aligned}$$

Scalar Adam

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\,P^{-1}(\hat{\nu})\bigl[\beta_{1}\hat{m}+(1-\beta_{1})\nabla\widehat{L}(w)\bigr]\\
\frac{dm}{dt}&=(1-\beta_{1})\bigl[\nabla L(w)-m\bigr]\\
\frac{d\nu}{dt}&=(1-\beta_{2})\bigl[\|\nabla L(w)\|^{2}-\nu\bigr]
\end{aligned}$$

Scalar NAdam

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\,P^{-1}(\hat{\nu})\bigl(\beta_{1}^{2}\hat{m}+(1-\beta_{1}^{2})\nabla L(w)\bigr)\\
\frac{dm}{dt}&=(1-\beta_{1})\bigl[\nabla L(w)-m\bigr]\\
\frac{d\nu}{dt}&=(1-\beta_{2})\bigl[\|\nabla L(w)\|^{2}-\nu\bigr]
\end{aligned}$$

Adam

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\,P^{-1}(\hat{\nu})\bigl[\beta_{1}\hat{m}+(1-\beta_{1})\nabla\widehat{L}(w)\bigr]\\
\frac{dm}{dt}&=(1-\beta_{1})\bigl[\nabla L(w)-m\bigr]\\
\frac{d\nu}{dt}&=(1-\beta_{2})\bigl[\nabla L(w)^{\odot 2}-\nu\bigr]
\end{aligned}$$

NAdam

$$\begin{aligned}
\frac{dw}{dt}&=-\eta\,P^{-1}(\hat{\nu})\bigl(\beta_{1}^{2}\hat{m}+(1-\beta_{1}^{2})\nabla L(w)\bigr)\\
\frac{dm}{dt}&=(1-\beta_{1})\bigl[\nabla L(w)-m\bigr]\\
\frac{d\nu}{dt}&=(1-\beta_{2})\bigl[\nabla L(w)^{\odot 2}-\nu\bigr]
\end{aligned}$$
### H\.2Experimental Procedure
Each experiment begins with a warm\-start phase in which the discrete optimizer is run on its own\. The warm\-start length is chosen based on prior testing to be long enough for the iterates to settle into the steady\-state edge of stability\. At its conclusion, both the stable flow and the rod flow are initialized from the current value of the discrete iterates\.
##### Setup\.
For each architecture–optimizer pair, we train on a fixed subset of 5,000 CIFAR-10 examples with MSE loss against one-hot targets. Gradients are computed full-batch over all 5,000 examples. For the ViT, where activation memory dominates, we microbatch the per-example gradient. Inputs are channel-normalized using the standard CIFAR-10 statistics: $\mu=(0.4914,0.4822,0.4465)$, $\sigma=(0.2470,0.2435,0.2616)$. We use no data augmentation, weight decay, or dropout.
##### Flow initialization\.
At iteration $t=\texttt{warmup\_iterations}-1$, the discrete state is used to seed both flows. The flow centers are set to the average of the last two iterates:

$$\bar{w}_{0}^{\text{flow}}=\tfrac{1}{2}(w_{t}+w_{t+1}), \tag{286}$$

$$\bar{m}_{0}^{\text{flow}}=\tfrac{1}{2}(m_{t}+m_{t+1})\quad\text{(momentum-based optimizers)}, \tag{287}$$

$$\bar{\nu}_{0}^{\text{flow}}=\tfrac{1}{2}(\nu_{t}+\nu_{t+1})\quad\text{(adaptive optimizers)}. \tag{288}$$

For rod flow, the extent tensor $\Sigma$ is initialized using the phase-space half-difference:

$$\delta_{0}=\tfrac{1}{2}(w_{t+1}-w_{t}), \tag{289}$$

$$\gamma_{0}=\tfrac{1}{2}(m_{t+1}-m_{t}), \tag{290}$$

$$\Sigma_{0}\propto(\delta_{0},\gamma_{0})\otimes(\delta_{0},\gamma_{0}). \tag{291}$$
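The initialization can be sketched with plain Python lists for a momentum-based optimizer. The function name is hypothetical, and because the text gives $\Sigma_{0}$ only up to proportionality, the sketch takes the constant to be $1$ purely as an assumption.

```python
def init_flow_state(w_t, w_next, m_t, m_next):
    """Seed the flow centers and rod extent from two consecutive iterates."""
    # Centers: average of the last two iterates.
    w_bar = [0.5 * (a + b) for a, b in zip(w_t, w_next)]
    m_bar = [0.5 * (a + b) for a, b in zip(m_t, m_next)]
    # Half-differences in position and momentum.
    delta0 = [0.5 * (b - a) for a, b in zip(w_t, w_next)]
    gamma0 = [0.5 * (b - a) for a, b in zip(m_t, m_next)]
    # Stack into one phase-space vector and take its outer product.
    # Assumption: proportionality constant set to 1.
    z0 = delta0 + gamma0
    sigma0 = [[zi * zj for zj in z0] for zi in z0]
    return w_bar, m_bar, delta0, gamma0, sigma0

w_bar, m_bar, delta0, gamma0, sigma0 = init_flow_state(
    w_t=[1.0, 2.0], w_next=[3.0, 0.0], m_t=[0.0, 1.0], m_next=[2.0, 1.0]
)
```

For a $d$-dimensional parameter the resulting $\Sigma_{0}$ is a rank-one $2d\times 2d$ matrix, which is consistent with the low-rank representation described in Appendix G.1.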
##### Lockstep loop\.
For $t\geq\texttt{warmup\_iterations}$, each iteration of the outer loop executes the following three updates in sequence:

1. Discrete optimizer. Take one step at $w_{t}$ to produce $w_{t+1},m_{t+1},\nu_{t+1}$.
2. Stable flow. Advance $(\bar{w}_{\text{sf}},\bar{m}_{\text{sf}},\bar{\nu}_{\text{sf}})$ by $n_{\text{sf}}$ forward-Euler substeps of size $\Delta t_{\text{sf}}$.
3. Rod flow. Advance $(\bar{w}_{\text{rod}},\bar{m}_{\text{rod}},\bar{\nu}_{\text{rod}},\Sigma_{\text{rod}})$ by $n_{\text{rod}}$ forward-Euler substeps of size $\Delta t_{\text{rod}}$.

Total simulated time per outer-loop iteration is $n_{\text{sf}}\cdot\Delta t_{\text{sf}}=n_{\text{rod}}\cdot\Delta t_{\text{rod}}=1$, so one continuous-time unit corresponds to one discrete optimizer step. Across all architectures we use $n=10$ substeps with $\Delta t=0.1$.
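A stripped-down version of the lockstep loop, covering only the discrete optimizer and its stable flow, can be sketched on a one-dimensional quadratic $L(w)=\tfrac{1}{2}\lambda w^{2}$ with plain gradient descent. This toy problem is an assumption made for illustration; the rod flow update is omitted because its gradient-descent form is not given in this appendix. Note that on a pure quadratic the discrete iterates diverge once $\eta\lambda>2$ rather than settling at the edge of stability, so the example only shows how the two trajectories separate past the threshold.

```python
def lockstep(w0, eta, lam, outer_steps, n_sub=10, dt=0.1):
    """Advance discrete GD and its stable flow side by side on 0.5*lam*w^2."""
    assert abs(n_sub * dt - 1.0) < 1e-12      # one time unit per discrete step
    w_disc, w_flow = w0, w0
    trace = []
    for _ in range(outer_steps):
        # 1. Discrete optimizer: one gradient-descent step.
        w_disc = w_disc - eta * lam * w_disc
        # 2. Stable flow: forward-Euler substeps of dw/dt = -eta * grad L(w).
        for _ in range(n_sub):
            w_flow = w_flow + dt * (-eta * lam * w_flow)
        trace.append((w_disc, w_flow))
    return trace

below = lockstep(w0=1.0, eta=0.1, lam=5.0, outer_steps=20)    # eta*lam = 0.5
above = lockstep(w0=1.0, eta=0.1, lam=25.0, outer_steps=20)   # eta*lam = 2.5
```

Below the threshold both trajectories decay toward the minimum. Above it, the discrete iterate flips sign every step with growing magnitude while the stable flow keeps decaying, which is the oscillatory behavior the stable flow cannot represent.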
Table 3:Hyperparameters used in the MLP\-on\-CIFAR\-10 runs\.
##### Eigenvalue computation\.
Both the raw and effective Hessian eigenvalues are computed using a warm-started Lanczos solver. Between calls, the solver caches its previous eigenbasis and reuses it as the starting iterate for the next call, exploiting the slow drift of the top eigenspace over the course of training. Eigenvalues are sampled every 200 discrete steps, a cadence that prevents the eigensolver from dominating wall-clock time while still resolving the dynamics around the EoS threshold.
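The benefit of warm starting can be illustrated with a much simpler solver. The sketch below uses plain power iteration in place of Lanczos (a deliberate simplification, not the solver described above) on two nearby $2\times 2$ symmetric matrices that stand in for the Hessian at consecutive eigenvalue samples. All names and matrices are made up for the example.

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_eigenpair(A, v0, tol=1e-10, max_iter=10_000):
    """Power iteration; returns (eigenvalue, eigenvector, iterations used)."""
    v = normalize(v0)
    lam = 0.0
    for it in range(1, max_iter + 1):
        Av = matvec(A, v)
        lam_new = sum(a * b for a, b in zip(v, Av))   # Rayleigh quotient
        v = normalize(Av)
        if abs(lam_new - lam) < tol:
            return lam_new, v, it
        lam = lam_new
    return lam, v, max_iter

H_prev = [[4.0, 0.0], [0.0, 1.0]]      # stand-in Hessian at the previous sample
H_next = [[4.0, 0.05], [0.05, 1.0]]    # slightly drifted Hessian
cold_start = [1.0, 1.0]

_, v_prev, _ = top_eigenpair(H_prev, cold_start)
lam_cold, _, iters_cold = top_eigenpair(H_next, cold_start)
lam_warm, _, iters_warm = top_eigenpair(H_next, v_prev)   # reuse cached vector
```

Both calls converge to the same top eigenvalue, but the warm-started call begins close to the answer and needs fewer iterations. The same reasoning motivates reusing the cached eigenbasis when the top eigenspace drifts slowly.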
### H\.3Quantities Tracked
We track two main classes of quantities: intrinsic observables \(computed independently for the discrete iterates, the stable flow, and the rod flow\) and cross\-trajectory comparison metrics\.
##### Intrinsic observables\.
- Losses (`disc_loss_w`, `disc_loss_wbar`, `sf_loss`, `rod_loss_wbar`): Evaluated at the raw discrete iterate $w_{t}$ and the respective trajectory centers $\bar{w}$.
- Oscillation amplitudes (`disc_delta_norms`, `disc_mu_norms`, `delta_norm_rod`, `mu_norm_rod`): The norms of the position and momentum half-differences $\delta,\gamma$. For the discrete trajectory, these are explicit finite differences. For rod flow, they are extracted from the principal eigencomponents of $\Sigma$.
- Sharpness (`disc_pre_sharpness_wbar`, `sf_pre_sharpness`, `rod_pre_sharpness_wbar`): The top eigenvalue of the preconditioned Hessian at the respective centers.
- Preconditioner state (`nu_norm_disc`, `nu_norm_sf`, `nu_norm_rod`, `disc_mean_ess`): For adaptive optimizers, we track the norm of the second-moment vector $\nu$ across all trajectories.
##### Cross\-trajectory comparison\.
- Center distances (`dist_wbar_disc_to_sf`, `dist_wbar_disc_to_rod`): Euclidean distances between the discrete center $\bar{w}_{t}$ and each continuous flow center.
- Directional alignment (`cos_delta_alignment`): Cosine similarity between $\delta_{\text{disc}}$ and $\delta_{\text{rod}}$.

All scalar quantities are recorded every step. Eigenvalue-based metrics are recorded every 200 steps.
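The two cross-trajectory metrics are standard quantities and can be sketched as below. The function names are hypothetical, and whether the recorded alignment is the signed cosine or its absolute value is not stated in the text; the sketch returns the signed value.

```python
import math

def center_distance(w_a, w_b):
    """Euclidean distance between two trajectory centers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_a, w_b)))

def cosine_alignment(delta_a, delta_b, eps=1e-12):
    """Signed cosine similarity between two half-difference vectors."""
    dot = sum(a * b for a, b in zip(delta_a, delta_b))
    norm_a = math.sqrt(sum(a * a for a in delta_a))
    norm_b = math.sqrt(sum(b * b for b in delta_b))
    return dot / max(norm_a * norm_b, eps)   # eps guards against zero vectors

dist = center_distance([0.0, 0.0], [3.0, 4.0])
aligned = cosine_alignment([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_alignment([1.0, 0.0], [0.0, 1.0])
opposed = cosine_alignment([1.0, 0.0], [-2.0, 0.0])
```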
### H\.4Architectures
##### MLP\.
A three-layer fully-connected network with hidden width 200 and $\tanh$ activations. CIFAR-10 inputs are flattened to $\mathbb{R}^{3072}$, passed through two hidden layers of width 200, and projected to a 10-dimensional output. All linear layers include biases, giving a total of 656,810 parameters.
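The stated parameter count follows directly from the layer shapes, as this short check shows (the helper name is illustrative):

```python
def linear_params(n_in, n_out, bias=True):
    """Number of parameters in a fully-connected layer."""
    return n_in * n_out + (n_out if bias else 0)

# Layer shapes as described: 3072 -> 200 -> 200 -> 10, all with biases.
mlp_layers = [(3 * 32 * 32, 200), (200, 200), (200, 10)]
mlp_counts = [linear_params(n_in, n_out) for n_in, n_out in mlp_layers]
mlp_total = sum(mlp_counts)
```

The three layers contribute 614,600, 40,200, and 2,010 parameters respectively, which sum to 656,810.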
##### CNN\.
A small two-block convolutional network. Each block applies a $3\times 3$ convolution with padding 1 (with bias), followed by $\tanh$ and $2\times 2$ average pooling; the first block has $3\to 32$ channels and the second $32\to 32$. After two pooling stages, the $32\times 32$ inputs are reduced to $8\times 8$ feature maps, which are flattened and passed through a linear readout to dimension 10. Total parameter count: 30,634.
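The CNN count can be verified the same way. The text does not say whether the linear readout has a bias; the sketch assumes it does, since that is the choice that reproduces the stated total.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Number of parameters in a square-kernel 2-D convolution."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)

conv1 = conv2d_params(3, 32, 3)       # first block, 3 -> 32 channels
conv2 = conv2d_params(32, 32, 3)      # second block, 32 -> 32 channels
side = 32 // 2 // 2                   # two 2x2 poolings: 32 -> 16 -> 8
flat = 32 * side * side               # flattened feature size
readout = flat * 10 + 10              # assumption: readout includes a bias
cnn_total = conv1 + conv2 + readout
```

The blocks contribute 896 and 9,248 parameters and the readout 20,490, for a total of 30,634. Without a readout bias the total would be 30,624.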
##### ViT\.
A small Vision Transformer based on the `SimpleViT` variant from `vit_pytorch`. We use patch size 4 (so each $32\times 32$ image yields 64 patches), embedding dimension 64, 3 transformer blocks, 8 attention heads, and per-block MLP hidden size 256. Position information is supplied by a fixed 2-D sin-cos positional encoding rather than learned embeddings, following the SimpleViT design. The linear classification head is initialized at half its default magnitude (`init_scale` = 0.5) to reduce the initial sharpness, which would otherwise place the iterates immediately in a divergent regime. To ensure that the second and third derivatives required by the Lanczos eigensolvers are well-defined along the trajectory, we replace every `LayerNorm` module with a `TrainableLayerNorm` (`src/architectures/utils.py`) that matches the standard forward behavior and learnable scale/shift parameters but computes its mean and variance with smooth operations rather than cached running statistics. Total parameter count is approximately 165,000, dominated by the three transformer blocks.
## Appendix IExperimental Results
Across all eight optimizers, the rod flow tracks the discrete\-time optimizers significantly more accurately than the stable flow does\.
Because of the finite step size, the discrete-time optimizers cannot follow the direction of steepest descent exactly, so the loss decreases more slowly for the discrete-time optimizer than for the stable flow. Rod flow matches this slower loss trajectory.

Across the different optimizers, rod flow hovers at the correct preconditioned sharpness threshold. Stable flow, by contrast, is meant to proxy the step size going to zero. It has no preconditioned sharpness limit and continues to sharpen well past the stability threshold.

While there is some discrepancy between the center of rod flow and that of the discrete iterates, it is much smaller than the corresponding discrepancy for stable flow. A consistent feature is that the rod flow discrepancy levels off: after an initial phase of minor divergence, the discrepancy quickly plateaus. Rod flow also accurately tracks the magnitudes of the position half-displacement and the momentum half-displacement of the discrete iterates.

For preconditioned methods, there is strong agreement between the second moment $\nu$ of rod flow and the second moment of the discrete-time optimizer. An advantage of rod flow is that it not only tracks the center of the discrete iterates, but also models the points the iterates actually visit, namely the endpoints of the rod. Because it has access to gradient evaluations at these endpoints, $\nu$ is tracked accurately.

One interesting observation is the behavior of the $\delta$ alignment. A consistent feature across optimizers is that, rather than hovering near 1, the $\delta$ alignment fluctuates over the course of training. This could be due to the existence of multiple sharp directions in the Hessian.

(Note that in the experimental figures below, the momentum oscillation is mistakenly denoted as $\mu$ instead of $\gamma$.)
Heavy Ball

Figure 6: Experimental Results for Heavy Ball Momentum. MLP: $\eta=0.05$, $\beta=0.4$. CNN: $\eta=0.125$, $\beta=0.4$. ViT: $\eta=0.04$, $\beta=0.3$.

Nesterov

Figure 7: Experimental Results for Nesterov Momentum. MLP: $\eta=0.08$, $\beta=0.8$. CNN: $\eta=0.125$, $\beta=0.4$. ViT: $\eta=0.02$, $\beta=0.4$.

SRMSProp

Figure 8: Experimental Results for Scalar RMSProp (no bias correction). MLP: $\eta=10^{-4}$, $\beta_{2}=0.99$. CNN: $\eta=10^{-4}$, $\beta_{2}=0.99$. ViT: $\eta=7.5\times 10^{-5}$, $\beta_{2}=0.99$.

RMSProp

Figure 9: Experimental Results for RMSProp (no bias correction). MLP: $\eta=10^{-4}$, $\beta_{2}=0.99$. CNN: $\eta=10^{-4}$, $\beta_{2}=0.99$. ViT: $\eta=7.5\times 10^{-5}$, $\beta_{2}=0.99$.

SAdam

Figure 10: Experimental Results for Scalar Adam (with bias correction). MLP: $\eta=10^{-3}$, $\beta_{1}=0.8$, $\beta_{2}=0.99$. CNN: $\eta=10^{-3}$, $\beta_{1}=0.6$, $\beta_{2}=0.99$. ViT: $\eta=10^{-4}$, $\beta_{1}=0.4$, $\beta_{2}=0.99$.

SNAdam

Figure 11: Experimental Results for Scalar NAdam (with bias correction). MLP: $\eta=5\times 10^{-4}$, $\beta_{1}=0.5$, $\beta_{2}=0.99$. CNN: $\eta=10^{-3}$, $\beta_{1}=0.5$, $\beta_{2}=0.99$. ViT: $\eta=10^{-4}$, $\beta_{1}=0.4$, $\beta_{2}=0.99$.

Adam

Figure 12: Experimental Results for Adam (with bias correction). MLP: $\eta=10^{-4}$, $\beta_{1}=0.8$, $\beta_{2}=0.999$. CNN: $\eta=10^{-4}$, $\beta_{1}=0.5$, $\beta_{2}=0.999$. ViT: $\eta=7.5\times 10^{-5}$, $\beta_{1}=0.4$, $\beta_{2}=0.999$.

NAdam
Figure 13: Experimental Results for NAdam (with bias correction). MLP: $\eta=10^{-4}$, $\beta_{1}=0.6$, $\beta_{2}=0.999$. CNN: $\eta=10^{-4}$, $\beta_{1}=0.5$, $\beta_{2}=0.999$. ViT: $\eta=7.5\times 10^{-5}$, $\beta_{1}=0.4$, $\beta_{2}=0.999$.