Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization

arXiv cs.LG Papers

Summary

This paper proves sharp dimension-free first-order lower bounds for finding epsilon-stationary points in higher-order smooth nonconvex optimization, resolving open problems for Hessian-Lipschitz and third-order smooth cases.

arXiv:2606.05438v1 Announce Type: new Abstract: We study the deterministic first-order oracle complexity of finding \(\epsilon\)-stationary points in smooth nonconvex optimization when the objective satisfies higher-order smoothness assumptions. While the classical \(\epsilon^{-2}\) rate is optimal under only Lipschitz gradients, higher-order smoothness leads to accelerated first-order upper bounds, most notably the \(\epsilon^{-7/4}\) rate under Lipschitz Hessians and the \(\epsilon^{-5/3}\) rate under Lipschitz third derivatives. The matching lower bounds, however, have remained open. We resolve this gap by proving a new dimension-free first-order lower bound for higher-order smooth nonconvex functions, valid for every finite smoothness order. In particular, our construction gives a matching \(\Omega(\epsilon^{-7/4})\) lower bound in the Hessian-Lipschitz case and a matching \(\Omega(\epsilon^{-5/3})\) lower bound in the third-order-smooth regime. The hard instance is based on a \emph{block-chain} mechanism that enforces blockwise oracle revelation while preserving the smoothness structure needed for the scalar hard instance. The lower-bound construction was discovered with the assistance of ChatGPT 5.5 Pro and subsequently verified by the authors.
Original Article
View Cached Full Text

Cached at: 06/05/26, 08:10 AM

# Sharp First-Order Lower Bounds for Higher-Order Smooth Nonconvex Optimization
Source: [https://arxiv.org/html/2606.05438](https://arxiv.org/html/2606.05438)
Dongruo Zhou

###### Abstract

We study the deterministic first\-order oracle complexity of findingϵ\\epsilon\-stationary points in smooth nonconvex optimization when the objective satisfies higher\-order smoothness assumptions\. While the classicalϵ−2\\epsilon^\{\-2\}rate is optimal under only Lipschitz gradients, higher\-order smoothness leads to accelerated first\-order upper bounds, most notably theϵ−7/4\\epsilon^\{\-7/4\}rate under Lipschitz Hessians and theϵ−5/3\\epsilon^\{\-5/3\}rate under Lipschitz third derivatives\. The matching lower bounds, however, have remained open\. We resolve this gap by proving a new dimension\-free first\-order lower bound for higher\-order smooth nonconvex functions, valid for every finite smoothness order\. In particular, our construction gives a matchingΩ​\(ϵ−7/4\)\\Omega\(\\epsilon^\{\-7/4\}\)lower bound in the Hessian\-Lipschitz case and a matchingΩ​\(ϵ−5/3\)\\Omega\(\\epsilon^\{\-5/3\}\)lower bound in the third\-order\-smooth regime\. The hard instance is based on a*block\-chain*mechanism that enforces blockwise oracle revelation while preserving the smoothness structure needed for the scalar hard instance\. The lower\-bound construction was discovered with the assistance of ChatGPT 5\.5 Pro and subsequently verified by the authors\.

## 1Introduction

Finding approximate stationary points is one of the basic tasks in smooth nonconvex optimization\. Given a differentiable objectiveff, the goal is to find a pointxxwith

‖∇f​\(x\)‖≤ϵ\.\\displaystyle\\\|\\nabla f\(x\)\\\|\\leq\\epsilon\.We focus on first\-order methods, which only query function values and gradients, because they are the most widely used methods in large\-scale nonconvex optimization\. For functions withL1L\_\{1\}\-Lipschitz gradients and initial gapΔ\\Delta, gradient descent finds such a point inO​\(Δ​L1​ϵ−2\)O\(\\Delta L\_\{1\}\\epsilon^\{\-2\}\)first\-order oracle calls\. This rate is worst\-case optimal under only gradient Lipschitzness, so in the general smooth function class there is no room for acceleration beyond the classicalϵ−2\\epsilon^\{\-2\}dependence\.

Table 1:Known dimension\-free deterministic first\-order oracle complexity under higher\-order smoothness\. Only the dependence onϵ\\epsilonis shown\.p=1p=1p=2p=2p≥3p\\geq 3Upper boundϵ−2\\epsilon^\{\-2\}\(gradient descent\)ϵ−7/4\\epsilon^\{\-7/4\}\[[2](https://arxiv.org/html/2606.05438#bib.bib22),[8](https://arxiv.org/html/2606.05438#bib.bib23),[23](https://arxiv.org/html/2606.05438#bib.bib11),[33](https://arxiv.org/html/2606.05438#bib.bib12),[7](https://arxiv.org/html/2606.05438#bib.bib3),[24](https://arxiv.org/html/2606.05438#bib.bib10),[27](https://arxiv.org/html/2606.05438#bib.bib13)\]ϵ−5/3\\epsilon^\{\-5/3\}\[[7](https://arxiv.org/html/2606.05438#bib.bib3)\]Lower boundϵ−2\\epsilon^\{\-2\}\[[9](https://arxiv.org/html/2606.05438#bib.bib2)\]ϵ−12/7\\epsilon^\{\-12/7\}\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]ϵ−8/5\\epsilon^\{\-8/5\}\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]Our resultϵ−𝟕/𝟒\\bm\{\\epsilon^\{\-7/4\}\}\(Theorem[5\.2](https://arxiv.org/html/2606.05438#S5.Thmtheorem2)\)ϵ−𝟓/𝟑\\bm\{\\epsilon^\{\-5/3\}\}\(Theorem[5\.2](https://arxiv.org/html/2606.05438#S5.Thmtheorem2)\)A natural way to seek faster rates is to impose higher\-order smoothness on the objective\. The first and most important case is second\-order smoothness, namely Lipschitz continuity of the Hessian: it is the weakest standard assumption beyond gradient Lipschitzness that controls how curvature changes, and it gives algorithms enough structure to distinguish convex\-like regions from genuinely nonconvex ones\. Algorithmic evidence for acceleration under this assumption comes from several related oracle models and target guarantees\. One line of work uses Hessian\-vector products to obtain curvature\-aware guarantees, often with the stronger goal of finding approximate local minima\. For example,Agarwalet al\.\[[2](https://arxiv.org/html/2606.05438#bib.bib22)\]approximately solve cubic\-regularized subproblems and find approximate local minima withinO​\(ϵ−7/4\)O\(\\epsilon^\{\-7/4\}\)Hessian\-vector\-product complexity\. In a related Hessian\-vector\-product\-based accelerated framework,Carmonet al\.\[[8](https://arxiv.org/html/2606.05438#bib.bib23)\]obtainO​\(ϵ−7/4\)O\(\\epsilon^\{\-7/4\}\)complexity for finding first\-order stationary points\. Within the first\-order saddle\-escaping literature, accelerated and negative\-curvature\-based gradient methods yieldO​\(ϵ−7/4\)O\(\\epsilon^\{\-7/4\}\)complexity bounds in the Hessian\-Lipschitz setting\[[23](https://arxiv.org/html/2606.05438#bib.bib11),[33](https://arxiv.org/html/2606.05438#bib.bib12)\]\.

The line closest to our setting keeps both the oracle and the target purely first\-order\. The accelerated “convex until proven guilty” framework ofCarmonet al\.\[[7](https://arxiv.org/html/2606.05438#bib.bib3)\]runs accelerated gradient in regions that behave convexly and uses failure of convexity as a certificate for progress in the nonconvex landscape, givingO​\(ϵ−7/4​log⁡\(1/ϵ\)\)O\(\\epsilon^\{\-7/4\}\\log\(1/\\epsilon\)\)first\-order complexity under Lipschitz Hessians andO​\(ϵ−5/3​log⁡\(1/ϵ\)\)O\(\\epsilon^\{\-5/3\}\\log\(1/\\epsilon\)\)under Lipschitz third derivatives\. Subsequent work refined this picture in two directions: restarted accelerated gradient removes the logarithmic factor in the Hessian\-Lipschitz case\[[24](https://arxiv.org/html/2606.05438#bib.bib10)\], while parameter\-free accelerated methods recover theO​\(ϵ−7/4\)O\(\\epsilon^\{\-7/4\}\)complexity without requiring the same smoothness\-parameter tuning\[[27](https://arxiv.org/html/2606.05438#bib.bib13)\]\.Jianget al\.\[[20](https://arxiv.org/html/2606.05438#bib.bib14)\]give a gradient\-only quasi\-Newton method withO​\(d1/4​ϵ−13/8\)O\(d^\{1/4\}\\epsilon^\{\-13/8\}\)complexity, improving theϵ\\epsilon\-dependence in low dimension but with explicit dimension dependence rather than a dimension\-free guarantee\. Thus higher\-order smoothness does allow acceleration beyondϵ−2\\epsilon^\{\-2\}, even when the algorithm only observes function values and gradients\. The absence of matching dimension\-free lower bounds has been explicitly noted in later work that sharpened or reinterpreted the upper\-bound landscape: even after the logarithmic factor was removed, and even after online\-learning viewpoints recovered parts of the upper\-bound picture, the dimension\-free deterministic first\-order lower bound remained unchanged\[[24](https://arxiv.org/html/2606.05438#bib.bib10),[15](https://arxiv.org/html/2606.05438#bib.bib16)\]\.

Despite these upper\-bound developments, whether the accelerated rates were optimal remained unclear\. On the lower\-bound side, the best previous result is due toCarmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\], who proved anΩ​\(ϵ−12/7\)\\Omega\(\\epsilon^\{\-12/7\}\)lower bound in the Hessian\-Lipschitz setting and anΩ​\(ϵ−8/5\)\\Omega\(\\epsilon^\{\-8/5\}\)lower bound for arbitrarily smooth functions\. This left an exponent gap of1/281/28in the Hessian\-Lipschitz case, between12/712/7and7/47/4, and an exponent gap of1/151/15relative to the5/35/3first\-order rate under Lipschitz third derivatives\. Whether these gaps reflected a limitation of the known lower\-bound constructions or a genuine separation has remained a long\-standing open problem\. This motivates the following question\.

*What is the optimal first\-order oracle complexity of findingϵ\\epsilon\-stationary points when higher\-order smoothness is available?*

In this work, we answer this question by proving a new lower bound, which resolves a long\-standing open gap in deterministic first\-order oracle complexity\. Our contributions are as follows\.

- •As summarized in Table[1](https://arxiv.org/html/2606.05438#S1.T1), we give a new dimension\-free first\-order lower\-bound construction for higher\-order smooth nonconvex functions, valid for every finite smoothness orderpp\. In the Hessian\-Lipschitz case, the construction gives the matchingΩ​\(ϵ−7/4\)\\Omega\(\\epsilon^\{\-7/4\}\)dependence, closing the gap between the previousϵ−12/7\\epsilon^\{\-12/7\}lower bound and the best known first\-order upper bound\. Under Lipschitz third derivatives, it gives the matchingΩ​\(ϵ−5/3\)\\Omega\(\\epsilon^\{\-5/3\}\)dependence in the third\-order\-smooth regime\. The full theorem tracks the dependence onΔ\\Delta,L1,…,LpL\_\{1\},\\ldots,L\_\{p\}, and the active smoothness constraints\.
- •Technically, we introduce a block\-chain hard instance that augments the scalar transition problem with a*linear operator*for observation\. The nonconvex potential is applied after a linear readout, whose operator norm provides an additional knob for calibrating higher\-order smoothness while preserving the scalar residual certificate\. To keep this pullback compatible with first\-order revelation, each scalar coordinate is replaced by a chain block: the previous scalar state enters through an entry coordinate, whereas the current scalar state is read through a suffix projection\. This separation lets a zero\-respecting method reveal the construction only layer by layer, yielding a block\-layer tail\-hiding property and a gradient lower bound before the hidden tail is reached\.

#### Notation\.

Forn∈ℕn\\in\\mathbb\{N\}, write\[n\]=\{1,…,n\}\[n\]=\\\{1,\\ldots,n\\\}\. We use⟨⋅,⋅⟩\\langle\\cdot,\\cdot\\ranglefor the Euclidean inner product,∥⋅∥\\\|\\cdot\\\|for the Euclidean norm of a vector and the induced operator norm of a matrix, andsupp⁡\(x\)\\operatorname\{supp\}\(x\)for the support of a vectorxx\. The vectorse1,e2,…e\_\{1\},e\_\{2\},\\ldotsdenote the standard basis vectors in the ambient Euclidean space; when the dimension matters, we writeej\(m\)∈ℝme\_\{j\}^\{\(m\)\}\\in\\mathbb\{R\}^\{m\}\. For aCqC^\{q\}functionff,Dq​f​\(x\)D^\{q\}f\(x\)denotes itsqq\-th Fréchet derivative atxx, viewed as a symmetricqq\-linear form\. For aqq\-linear formAA, we write

‖A‖op:=sup‖v1‖,…,‖vq‖≤1\|A​\[v1,…,vq\]\|\.\\\|A\\\|\_\{\\rm op\}:=\\sup\_\{\\\|v\_\{1\}\\\|,\\ldots,\\\|v\_\{q\}\\\|\\leq 1\}\|A\[v\_\{1\},\\ldots,v\_\{q\}\]\|\.For a mapGGbetween Euclidean spaces,Lip⁡\(G\)\\operatorname\{Lip\}\(G\)denotes its global Lipschitz constant with respect to the induced Euclidean norms; in particular,Lip⁡\(Dq​f\)\\operatorname\{Lip\}\(D^\{q\}f\)uses the above operator norm onqq\-linear forms\. The notationa≲ba\\lesssim bmeansa≤C​ba\\leq Cbfor a positive absolute constantCC,a≳ba\\gtrsim bmeansb≲ab\\lesssim a, anda≍ba\\asymp bmeans botha≲ba\\lesssim bandb≲ab\\lesssim a\. We useO​\(⋅\)O\(\\cdot\),Ω​\(⋅\)\\Omega\(\\cdot\), andΘ​\(⋅\)\\Theta\(\\cdot\)with the same convention that hidden constants are positive and absolute unless explicitly stated otherwise\. Named numerical constants introduced below, such as the constantsℓq\\ell\_\{q\}in the scalar primitive bounds, are displayed explicitly in quantitative rates rather than absorbed into this asymptotic notation\.

## 2Related Work

The literature surrounding first\-order nonconvex optimization and higher\-order smoothness is broad, so we discuss only the strands most directly related to our lower\-bound result; the references below are necessarily selective\.

#### Higher\-order oracle models\.

A separate line of work studies algorithms that can query higher\-order derivative information, rather than only function values and gradients\. Classical trust\-region methods and regularized Newton methods form the basic second\-order toolkit\[[14](https://arxiv.org/html/2606.05438#bib.bib9),[11](https://arxiv.org/html/2606.05438#bib.bib6)\]\. In the Hessian\-Lipschitz setting, cubic regularization builds a local quadratic model with a cubic penalty and achieves theO​\(ϵ−3/2\)O\(\\epsilon^\{\-3/2\}\)complexity for finding first\-order stationary points\[[28](https://arxiv.org/html/2606.05438#bib.bib4),[11](https://arxiv.org/html/2606.05438#bib.bib6)\]; related work also characterizes the complexity of reaching second\-order stationarity\[[12](https://arxiv.org/html/2606.05438#bib.bib7)\]\. More generally, regularizedppth\-order Taylor methods achieveO​\(ϵ−\(p\+1\)/p\)O\(\\epsilon^\{\-\(p\+1\)/p\}\)under Lipschitzppth derivatives\[[6](https://arxiv.org/html/2606.05438#bib.bib5)\], and worst\-case analyses for second\-order methods further clarify the optimality of this oracle model\[[13](https://arxiv.org/html/2606.05438#bib.bib8)\]\. These rates are matched by lower bounds for algorithms with access to derivatives up to orderpp\[[9](https://arxiv.org/html/2606.05438#bib.bib2)\]\. Recent mixed\-oracle work further studies tradeoffs between gradient and Hessian queries; in particular, finite\-difference Hessian approximations yield improved low\-dimensional gradient complexity, and even a small number of Hessian queries can change the gradient\-query exponent\[[1](https://arxiv.org/html/2606.05438#bib.bib21)\]\. This higher\-order\-oracle theory is complementary to our setting: we impose higher\-order smoothness on the objective, but the algorithm remains first\-order\.

#### Stochastic, finite\-sum, and local\-minimum variants\.

There is also a substantial stochastic, finite\-sum, and approximate\-local\-minimum counterpart to the oracle\-complexity literature\. For second\-order stationarity, early stochastic saddle\-escaping results show that noisy or perturbed gradient methods can avoid strict saddles and reach approximate local minima\[[19](https://arxiv.org/html/2606.05438#bib.bib26),[21](https://arxiv.org/html/2606.05438#bib.bib27),[22](https://arxiv.org/html/2606.05438#bib.bib28)\]\. Under Hessian\-Lipschitz structure, Hessian\-vector and gradient\-only algorithms obtain faster approximate\-local\-minimum guarantees through cubic regularization, accelerated saddle escaping, and negative\-curvature extraction\[[2](https://arxiv.org/html/2606.05438#bib.bib22),[8](https://arxiv.org/html/2606.05438#bib.bib23),[23](https://arxiv.org/html/2606.05438#bib.bib11),[33](https://arxiv.org/html/2606.05438#bib.bib12),[3](https://arxiv.org/html/2606.05438#bib.bib25),[26](https://arxiv.org/html/2606.05438#bib.bib29),[30](https://arxiv.org/html/2606.05438#bib.bib30)\]\. Stochastic and finite\-sum refinements combine perturbed SGD, variance reduction, stochastic cubic regularization, and third\-order smoothness to improve sample or gradient complexity for local\-minimum finding\[[31](https://arxiv.org/html/2606.05438#bib.bib24),[17](https://arxiv.org/html/2606.05438#bib.bib31),[18](https://arxiv.org/html/2606.05438#bib.bib32),[25](https://arxiv.org/html/2606.05438#bib.bib33),[29](https://arxiv.org/html/2606.05438#bib.bib34),[32](https://arxiv.org/html/2606.05438#bib.bib35),[36](https://arxiv.org/html/2606.05438#bib.bib36),[37](https://arxiv.org/html/2606.05438#bib.bib37),[34](https://arxiv.org/html/2606.05438#bib.bib38),[38](https://arxiv.org/html/2606.05438#bib.bib39)\]\. More recently, an online\-to\-nonconvex conversion viewpoint gives deterministic and stochastic upper bounds through a common framework\[[15](https://arxiv.org/html/2606.05438#bib.bib16)\]\. On the lower\-bound side,Arjevaniet al\.\[[4](https://arxiv.org/html/2606.05438#bib.bib18)\]characterize the power and limitations of stochastic Hessian\-vector and higher\-order oracles, showing that second\-order information does not remove the intrinsic stochastic barrier in the same way it can in deterministic optimization\. In finite\-sum models, lower bounds for higher\-order smooth nonconvex objectives were developed byEmmeneggeret al\.\[[16](https://arxiv.org/html/2606.05438#bib.bib19)\]and then sharpened in a dimension\-free form byZhou and Gu \[[35](https://arxiv.org/html/2606.05438#bib.bib20)\]\. In the bounded\-variance stochastic\-gradient model,Arjevaniet al\.\[[5](https://arxiv.org/html/2606.05438#bib.bib17)\]prove tight lower bounds showing that SGD is minimax optimal for finding first\-order stationary points, while stronger mean\-squared smoothness assumptions recover the variance\-reduction rate\. These works address sample, noise, component\-oracle, or saddle\-escaping complexity; they are therefore complementary to the deterministic, dimension\-free full\-gradient lower bound studied here\.

## 3Preliminaries

We follow the oracle\-complexity framework ofCarmonet al\.\[[9](https://arxiv.org/html/2606.05438#bib.bib2),[10](https://arxiv.org/html/2606.05438#bib.bib1)\]\.

#### Function classes\.

Following Def\. 1 ofCarmonet al\.\[[9](https://arxiv.org/html/2606.05438#bib.bib2)\], letp≥1p\\geq 1andLp\>0L\_\{p\}\>0\. We say thatf:ℝd→ℝf:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}hasLpL\_\{p\}\-Lipschitzppth\-order derivatives if it ispptimes continuously differentiable and, for everyx∈ℝdx\\in\\mathbb\{R\}^\{d\}and everyv∈ℝdv\\in\\mathbb\{R\}^\{d\}with‖v‖=1\\\|v\\\|=1, the directional projectionfx,v​\(t\):=f​\(x\+t​v\)f\_\{x,v\}\(t\):=f\(x\+tv\)satisfies

\|fx,v\(p\)​\(t\)−fx,v\(p\)​\(t′\)\|≤Lp​\|t−t′\|for all​t,t′∈ℝ\.\\left\|f\_\{x,v\}^\{\(p\)\}\(t\)\-f\_\{x,v\}^\{\(p\)\}\(t^\{\\prime\}\)\\right\|\\leq L\_\{p\}\|t\-t^\{\\prime\}\|\\qquad\\text\{for all \}t,t^\{\\prime\}\\in\\mathbb\{R\}\.LetΔ\>0\\Delta\>0\. The classℱp​\(Δ,Lp\)\\mathcal\{F\}\_\{p\}\(\\Delta,L\_\{p\}\)denotes the union, overd∈ℕd\\in\\mathbb\{N\}, of allC∞C^\{\\infty\}functionsf:ℝd→ℝf:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}withLpL\_\{p\}\-Lipschitzppth\-order derivatives andf​\(0\)−infxf​\(x\)≤Δf\(0\)\-\\inf\_\{x\}f\(x\)\\leq\\Delta\. For positiveL1,…,LpL\_\{1\},\\ldots,L\_\{p\}, define

ℱ1:p​\(Δ,L1,…,Lp\):=⋂q=1pℱq​\(Δ,Lq\)\.\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\):=\\bigcap\_\{q=1\}^\{p\}\\mathcal\{F\}\_\{q\}\(\\Delta,L\_\{q\}\)\.

#### First\-order oracle complexity\.

As in Eq\. \(7\) ofCarmonet al\.\[[9](https://arxiv.org/html/2606.05438#bib.bib2)\], a deterministic first\-order method𝖠\\mathsf\{A\}starts fromx\(0\)=0x^\{\(0\)\}=0and, when applied tof:ℝd→ℝf:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}, generates iteratesx\(t\)∈ℝdx^\{\(t\)\}\\in\\mathbb\{R\}^\{d\}such thatx\(t\+1\)x^\{\(t\+1\)\}is a deterministic function of the previous oracle answers

\(f​\(x\(0\)\),∇f​\(x\(0\)\),…,f​\(x\(t\)\),∇f​\(x\(t\)\)\)\.\\left\(f\(x^\{\(0\)\}\),\\nabla f\(x^\{\(0\)\}\),\\ldots,f\(x^\{\(t\)\}\),\\nabla f\(x^\{\(t\)\}\)\\right\)\.Define

Tϵ​\(𝖠,f\):=inf\{t≥0:‖∇f​\(x\(t\)\)‖≤ϵ\}\.T\_\{\\epsilon\}\(\\mathsf\{A\},f\):=\\inf\\\{t\\geq 0:\\\|\\nabla f\(x^\{\(t\)\}\)\\\|\\leq\\epsilon\\\}\.We use the conventioninf∅=∞\\inf\\emptyset=\\infty\. For a class of deterministic first\-order methods𝒜\\mathcal\{A\}and a function classℱ\\mathcal\{F\}, write

Tϵ​\(𝒜,ℱ\):=inf𝖠∈𝒜supf∈ℱTϵ​\(𝖠,f\),T\_\{\\epsilon\}\(\\mathcal\{A\},\\mathcal\{F\}\):=\\inf\_\{\\mathsf\{A\}\\in\\mathcal\{A\}\}\\sup\_\{f\\in\\mathcal\{F\}\}T\_\{\\epsilon\}\(\\mathsf\{A\},f\),up to the immaterial additive constant coming from whether the initial point is counted as a query\.

#### Zero\-respecting methods\.

Following Eq\. \(5\) ofCarmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\], a sequence\{x\(t\)\}t≥0\\\{x^\{\(t\)\}\\\}\_\{t\\geq 0\}is first\-order zero\-respecting with respect toffif

supp⁡\(x\(t\)\)⊆⋃s<tsupp⁡\(∇f​\(x\(s\)\)\)for all​t≥0\.\\operatorname\{supp\}\(x^\{\(t\)\}\)\\subseteq\\bigcup\_\{s<t\}\\operatorname\{supp\}\(\\nabla f\(x^\{\(s\)\}\)\)\\qquad\\text\{for all \}t\\geq 0\.A first\-order method is zero\-respecting if, for everyff, its generated iterate sequence is zero\-respecting with respect toff\. By contrast, a functionf:ℝd→ℝf:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}is a classical first\-order zero\-chain\[[10](https://arxiv.org/html/2606.05438#bib.bib1), Def\. 3\]if, for everyi∈\[d\]i\\in\[d\]andx∈ℝdx\\in\\mathbb\{R\}^\{d\},

supp⁡\(x\)⊆\{1,…,i−1\}⟹supp⁡\(∇f​\(x\)\)⊆\{1,…,i\}\.\\operatorname\{supp\}\(x\)\\subseteq\\\{1,\\ldots,i\-1\\\}\\quad\\Longrightarrow\\quad\\operatorname\{supp\}\(\\nabla f\(x\)\)\\subseteq\\\{1,\\ldots,i\\\}\.

#### Primitive functions\.

We also fix two elementary primitives used in the construction\. Form∈ℕm\\in\\mathbb\{N\}, let𝖯m∈ℝm×m\\mathsf\{P\}\_\{m\}\\in\\mathbb\{R\}^\{m\\times m\}be the path Laplacian in tridiagonal form, and letHγ,L\(m\)H\_\{\\gamma,L\}^\{\(m\)\}be the corresponding massive path operator:

𝖯1:=0,𝖯m:=\(1−1−12−1⋱⋱⋱−12−1−11\)\(m≥2\),Hγ,L\(m\):=γ​Im\+L​𝖯m\.\\mathsf\{P\}\_\{1\}:=0,\\qquad\\mathsf\{P\}\_\{m\}:=\\begin\{pmatrix\}1&\-1&&&\\\\ \-1&2&\-1&&\\\\ &\\ddots&\\ddots&\\ddots&\\\\ &&\-1&2&\-1\\\\ &&&\-1&1\\end\{pmatrix\}\\quad\(m\\geq 2\),\\qquad H\_\{\\gamma,L\}^\{\(m\)\}:=\\gamma I\_\{m\}\+L\\mathsf\{P\}\_\{m\}\.\(3\.1\)Forr≥1r\\geq 1, letΥr\\Upsilon\_\{r\}be the scalar potential from Eq\. \(10\) ofCarmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]:

Υr​\(x\):=120​∫1xt2​\(t−1\)1\+\(t/r\)2​𝑑t\.\\Upsilon\_\{r\}\(x\):=120\\int\_\{1\}^\{x\}\\frac\{t^\{2\}\(t\-1\)\}\{1\+\(t/r\)^\{2\}\}\\,dt\.\(3\.2\)
We collect the elementary bounds on these two primitives for later use\.

###### Proposition 3\.1\(Primitive bounds\)\.

The primitives in \([3\.1](https://arxiv.org/html/2606.05438#S3.E1)\) and \([3\.2](https://arxiv.org/html/2606.05438#S3.E2)\) satisfy the following bounds\.

- •For everym∈ℕm\\in\\mathbb\{N\}andγ,L\>0\\gamma,L\>0,0⪯𝖯m⪯4​I0\\preceq\\mathsf\{P\}\_\{m\}\\preceq 4Iandγ​I⪯Hγ,L\(m\)⪯\(γ\+4​L\)​I\\gamma I\\preceq H\_\{\\gamma,L\}^\{\(m\)\}\\preceq\(\\gamma\+4L\)I\.
- •Lemma 2 ofCarmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]gives constantsℓq\>0\\ell\_\{q\}\>0,q≥1q\\geq 1, which may be chosen so thatℓq≤exp⁡\(32​q​log⁡q\+Cℓ​q\)\\ell\_\{q\}\\leq\\exp\(\\frac\{3\}\{2\}q\\log q\+C\_\{\\ell\}q\)for a numerical constantCℓ<∞C\_\{\\ell\}<\\infty\. For everyr≥1r\\geq 1, we haveΥr∈C∞​\(ℝ\)\\Upsilon\_\{r\}\\in C^\{\\infty\}\(\\mathbb\{R\}\),Υr′​\(0\)=Υr′​\(1\)=0\\Upsilon\_\{r\}^\{\\prime\}\(0\)=\\Upsilon\_\{r\}^\{\\prime\}\(1\)=0,mΥr:=infxΥr​\(x\)=Υr​\(1\)=0m\_\{\\Upsilon\_\{r\}\}:=\\inf\_\{x\}\\Upsilon\_\{r\}\(x\)=\\Upsilon\_\{r\}\(1\)=0, andΥr​\(0\)≤10\\Upsilon\_\{r\}\(0\)\\leq 10\. Moreover, for everyq≥1q\\geq 1, Lip⁡\(Υr\(q\)\)≤ℓq​r3−q,‖Υr\(q\+1\)‖∞≤ℓq​r3−q\.\\operatorname\{Lip\}\\bigl\(\\Upsilon\_\{r\}^\{\(q\)\}\\bigr\)\\leq\\ell\_\{q\}r^\{3\-q\},\\qquad\\left\\\|\\Upsilon\_\{r\}^\{\(q\+1\)\}\\right\\\|\_\{\\infty\}\\leq\\ell\_\{q\}r^\{3\-q\}\.

## 4Block\-Chain Hard Instance

We now build the hard instance, starting from the limitation of the scalar\-chain template used in prior first\-order lower bounds\.

#### Existing lower bound revisited\.

We begin by recalling the scalar chain underlying the first\-order lower bound ofCarmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]\. For a fixedr≥1r\\geq 1, its original unscaled form has a multiplier0<μ≤10<\\mu\\leq 1on the scalar nonconvex potential:

ϕμ​\(u\):=μ2​\(u1−1\)2\+12​∑s=1N\(us\+1−us\)2\+μ​∑s=1NΥr​\(us\),u∈ℝN\+1\.\\phi\_\{\\mu\}\(u\):=\\frac\{\\sqrt\{\\mu\}\}\{2\}\(u\_\{1\}\-1\)^\{2\}\+\\frac\{1\}\{2\}\\sum\_\{s=1\}^\{N\}\(u\_\{s\+1\}\-u\_\{s\}\)^\{2\}\+\\mu\\sum\_\{s=1\}^\{N\}\\Upsilon\_\{r\}\(u\_\{s\}\),\\qquad u\\in\\mathbb\{R\}^\{N\+1\}\.
Fors≥2s\\geq 2, the cross term−us−1​us\-u\_\{s\-1\}u\_\{s\}is the only transmission from phases−1s\-1to phasess, while the separable potentialμ​Υr​\(us\)\\mu\\Upsilon\_\{r\}\(u\_\{s\}\)supplies the on\-site nonconvex forcing\. The following residual certificate is the key point used by the lower\-bound argument\.

###### Proposition 4\.1\(Scalar transition lower bound, Lemma 3,Carmonet al\.[10](https://arxiv.org/html/2606.05438#bib.bib1)\)\.

There exists a numerical constantc0:=1/4c\_\{0\}:=1/4such that, for everyN≥1N\\geq 1, everyr≥1r\\geq 1, every0<μ≤10<\\mu\\leq 1, and everyu∈ℝN\+1u\\in\\mathbb\{R\}^\{N\+1\}withuN=uN\+1=0u\_\{N\}=u\_\{N\+1\}=0, the scalar chainϕμ\\phi\_\{\\mu\}defined above satisfies

‖∇ϕμ​\(u\)‖2≥c0​μ3/4\.\\\|\\nabla\\phi\_\{\\mu\}\(u\)\\\|\_\{2\}\\geq c\_\{0\}\\mu^\{3/4\}\.

Thus, as long as a zero\-respecting method has not reached the tail coordinate, we haveuN=0u\_\{N\}=0, and the scalar transition lower bound certifies a large gradient norm\. To turn this residual certificate into anϵ\\epsilon\-stationarity obstruction under prescribed smoothness budgets,Carmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]consider the scaled instancef​\(x\):=λ​σ2​ϕμ​\(x/σ\)f\(x\):=\\lambda\\sigma^\{2\}\\phi\_\{\\mu\}\(x/\\sigma\), withN=TN=T\. The parametersλ,σ,μ,r\\lambda,\\sigma,\\mu,rare then calibrated so that the scaled chain belongs to the target smoothness class, while the chain lengthTTis set by the available function\-value budget; with these choices, all points whose tail coordinates remain hidden still have gradient norm larger thanϵ\\epsilon\.

#### *Block\-chain*hard instance\.

One key parameter in the scalar hard instanceϕμ\\phi\_\{\\mu\}isμ\\mu, which sets the strength of the nonconvex onsite potential relative to the quadratic chain\. This parameter must be tuned carefully in order to satisfy higher\-order smoothness constraints while retaining a nontrivial gradient certificate\. By contrast,λ\\lambdaandσ\\sigmaare global amplitude and length\-scale parameters: they rescale all derivative orders in a coupled way, and therefore cannot by themselves provide the selective balance needed for higher\-order calibration\.Carmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]also extend the construction beyond quadratic activations and obtain analogous gradient lower bounds, but the resulting scalar balancing still leaves a gap between the best known first\-order upper boundϵ−7/4\\epsilon^\{\-7/4\}forp=2p=2and their calibrated lower boundϵ−12/7\\epsilon^\{\-12/7\}\.

Our construction closes this gap through two observations\.

- •Tune smoothness through a linear pullback\.Instead of relying only on the scalar coefficientμ\\mu, we introduce an additional balancing mechanism by applying the nonconvex potential after a linear observation map\. At the level of intuition, this amounts to replacing the coordinatewise potential by ∑s=1NΥr​\(us\)→∑s=1NΥr​\(as⊤​u\)\\displaystyle\\sum\_\{s=1\}^\{N\}\\Upsilon\_\{r\}\(u\_\{s\}\)\\rightarrow\\sum\_\{s=1\}^\{N\}\\Upsilon\_\{r\}\(a\_\{s\}^\{\\top\}u\)\(4\.1\)whereu↦\(a1⊤​u,…,aN⊤​u\)u\\mapsto\(a\_\{1\}^\{\\top\}u,\\ldots,a\_\{N\}^\{\\top\}u\)is a*linear operator*\. Smoothness constraints are stable under such pullbacks: the relevant derivative operator norms are multiplied only by powers of the operator norm of this linear map\. Thus the geometry of the observation map gives a new way to tune higher\-order smoothness without changing the scalar residual certificate\.
- •Separate revelation from observation\.An arbitrary linear observation map would destroy the lower\-bound argument: the gradient ofΥr​\(as⊤​u\)\\Upsilon\_\{r\}\(a\_\{s\}^\{\\top\}u\)points in the directionasa\_\{s\}, which could reveal information outside the zero\-chain order\. The remedy is to separate the coordinates that drive the oracle revelation from the coordinates observed by the nonconvex potential\. Informally, the construction uses a map of the form u→\(u,as⊤​u\),\\displaystyle u\\rightarrow\(u,a\_\{s\}^\{\\top\}u\),\(4\.2\)where the first component keeps the original chain\-like revelation structure, while the second component feeds the nonconvex potential and supplies the high\-order smoothness scaling\. The formal block construction below implements this separation by making the nonconvex readout compatible with the reveal order\.

With this intuition in mind, we now define the block chain\.

###### Definition 4\.2\(Block chain\)\.

LetM,N∈ℕM,N\\in\\mathbb\{N\}, letγ,L,λ,κ,τ\>0\\gamma,L,\\lambda,\\kappa,\\tau\>0, and leta∈ℝMa\\in\\mathbb\{R\}^\{M\}be a nonzero suffix vector, meaning that for somej⋆∈\[M\]j\_\{\\star\}\\in\[M\],supp⁡\(a\)⊆\{j⋆,…,M\}\\operatorname\{supp\}\(a\)\\subseteq\\\{j\_\{\\star\},\\ldots,M\\\}\. Setd:=M​Nd:=MN, and identifyℝd\\mathbb\{R\}^\{d\}with\(ℝM\)N\(\\mathbb\{R\}^\{M\}\)^\{N\}\. For a block vectorz=\(z1,…,zN\)∈\(ℝM\)Nz=\(z\_\{1\},\\ldots,z\_\{N\}\)\\in\(\\mathbb\{R\}^\{M\}\)^\{N\}, the scalar chain variable in phasessis the suffix readoutus:=a⊤​zsu\_\{s\}:=a^\{\\top\}z\_\{s\}, withu0:=1u\_\{0\}:=1\. The entry coordinate of phasessisxs:=e1⊤​zsx\_\{s\}:=e\_\{1\}^\{\\top\}z\_\{s\}\. The previous phase enters the next block throughxsx\_\{s\}, while the current phase is read only through the suffix readoutusu\_\{s\}\. Forr≥1r\\geq 1, define

fM,N,γ,L,a,λ,κ,τ,r​\(z\)\\displaystyle f\_\{M,N,\\gamma,L,a,\\lambda,\\kappa,\\tau,r\}\(z\):=∑s=1N\[12​zs⊤​Hγ,L\(M\)​zs−λ​us−1​xs\+κ2​us2\+τ​Υr​\(us\)\]\.\\displaystyle:=\\sum\_\{s=1\}^\{N\}\\left\[\\frac\{1\}\{2\}z\_\{s\}^\{\\top\}H\_\{\\gamma,L\}^\{\(M\)\}z\_\{s\}\-\\lambda u\_\{s\-1\}x\_\{s\}\+\\frac\{\\kappa\}\{2\}u\_\{s\}^\{2\}\+\\tau\\Upsilon\_\{r\}\(u\_\{s\}\)\\right\]\.\(4\.3\)
For the normalized member of this family, set

e:=e1,ρ:=e⊤​\(Hγ,L\(M\)\)−1​e,α:=a⊤​\(Hγ,L\(M\)\)−1​e,β:=a⊤​\(Hγ,L\(M\)\)−1​a\.e:=e\_\{1\},\\qquad\\rho:=e^\{\\top\}\\\!\\left\(H\_\{\\gamma,L\}^\{\(M\)\}\\right\)^\{\-1\}e,\\qquad\\alpha:=a^\{\\top\}\\\!\\left\(H\_\{\\gamma,L\}^\{\(M\)\}\\right\)^\{\-1\}e,\\qquad\\beta:=a^\{\\top\}\\\!\\left\(H\_\{\\gamma,L\}^\{\(M\)\}\\right\)^\{\-1\}a\.\(4\.4\)Whenα\>0\\alpha\>0, choose

λ:=1α,τ:=1β,κ:=λ2​ρ\.\\lambda:=\\frac\{1\}\{\\alpha\},\\qquad\\tau:=\\frac\{1\}\{\\beta\},\\qquad\\kappa:=\\lambda^\{2\}\\rho\.\(4\.5\)Forr≥1r\\geq 1, define

f¯M,N,γ,L,a,r:=fM,N,γ,L,a,λ,κ,τ,r,\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}:=f\_\{M,N,\\gamma,L,a,\\lambda,\\kappa,\\tau,r\},\(4\.6\)whereλ,τ,κ\\lambda,\\tau,\\kappaare given by \([4\.5](https://arxiv.org/html/2606.05438#S4.E5)\)\.

We now collect the basic estimates for the block chain that will be used later in the lower\-bound calibration\.

###### Lemma 4\.5\(Boundedness and smoothness of normalized hard instances\)\.

Fixp≥2p\\geq 2,M,N∈ℕM,N\\in\\mathbb\{N\},γ,L\>0\\gamma,L\>0,r≥1r\\geq 1, and a nonzero vectora∈ℝMa\\in\\mathbb\{R\}^\{M\}\. Assumeα\>0\\alpha\>0\. Then the normalized hard\-instance family from \([4\.6](https://arxiv.org/html/2606.05438#S4.E6)\) satisfies

Lip⁡\(∇f¯M,N,γ,L,a,r\)≤γ\+4​L\+2​‖a‖α\+\(ρα2\+ℓ1​r2β\)​‖a‖2\.\\operatorname\{Lip\}\\\!\\left\(\\nabla\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\\right\)\\leq\\gamma\+4L\+\\frac\{2\\\|a\\\|\}\{\\alpha\}\+\\left\(\\frac\{\\rho\}\{\\alpha^\{2\}\}\+\\frac\{\\ell\_\{1\}r^\{2\}\}\{\\beta\}\\right\)\\\|a\\\|^\{2\}\.Forq=2,…,pq=2,\\ldots,p,

Lip⁡\(Dq​f¯M,N,γ,L,a,r\)≤ℓq​r3−q​‖a‖q\+1β\.\\operatorname\{Lip\}\\\!\\left\(D^\{q\}\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\\right\)\\leq\\frac\{\\ell\_\{q\}r^\{3\-q\}\\\|a\\\|^\{q\+1\}\}\{\\beta\}\.Moreover,

f¯M,N,γ,L,a,r​\(0\)−inff¯M,N,γ,L,a,r≤10​Nβ\+ρ2​α2\.\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\(0\)\-\\inf\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\\leq\\frac\{10N\}\{\\beta\}\+\\frac\{\\rho\}\{2\\alpha^\{2\}\}\.

###### Proof\.

See Appendix[A\.1](https://arxiv.org/html/2606.05438#A1.SS1)\. ∎

We next record the block\-layer consequence needed in the lower\-bound argument: the last scalar chain variable remains hidden until the suffix of the last block has been reached\.

###### Lemma 4\.7\(Tail hiding for zero\-respecting sequences\)\.

LetF=fM,N,γ,L,a,λ,κ,τ,rF=f\_\{M,N,\\gamma,L,a,\\lambda,\\kappa,\\tau,r\}be the block\-chain function in Definition[4\.2](https://arxiv.org/html/2606.05438#S4.Thmtheorem2), and letj⋆∈\[M\]j\_\{\\star\}\\in\[M\]satisfy

supp⁡\(a\)⊆S:=\{j⋆,…,M\}\.\\operatorname\{supp\}\(a\)\\subseteq S:=\\\{j\_\{\\star\},\\ldots,M\\\}\.Forℓ≥0\\ell\\geq 0, define the block\-layer coordinate set

Iℓ:=\{\(s,j\)∈\[N\]×\[M\]:\(s−1\)​j⋆\+min⁡\{j,j⋆\}≤ℓ\}\.I\_\{\\ell\}:=\\left\\\{\(s,j\)\\in\[N\]\\times\[M\]:\(s\-1\)j\_\{\\star\}\+\\min\\\{j,j\_\{\\star\}\\\}\\leq\\ell\\right\\\}\.Let\{z\(t\)\}t≥0\\\{z^\{\(t\)\}\\\}\_\{t\\geq 0\},z\(t\)=\(z1\(t\),…,zN\(t\)\)z^\{\(t\)\}=\(z\_\{1\}^\{\(t\)\},\\ldots,z\_\{N\}^\{\(t\)\}\), be first\-order zero\-respecting with respect toFF, in the standard coordinate\-support sense recalled in Section[3](https://arxiv.org/html/2606.05438#S3), withz\(0\)=0z^\{\(0\)\}=0\. Then, identifying supports with subsets of\[N\]×\[M\]\[N\]\\times\[M\],

supp⁡\(z\(t\)\)⊆Itfor all​t≥0\.\\operatorname\{supp\}\(z^\{\(t\)\}\)\\subseteq I\_\{t\}\\qquad\\text\{for all \}t\\geq 0\.Consequently, ift<N​j⋆t<Nj\_\{\\star\}, then

uN\(t\):=a⊤​zN\(t\)=0\.u\_\{N\}^\{\(t\)\}:=a^\{\\top\}z\_\{N\}^\{\(t\)\}=0\.

###### Proof\.

See Appendix[A\.2](https://arxiv.org/html/2606.05438#A1.SS2)\. ∎

###### Lemma 4\.8\(Gradient lower bound for normalized hard instances\)\.

FixM,N∈ℕM,N\\in\\mathbb\{N\},γ,L\>0\\gamma,L\>0,r≥1r\\geq 1, and a nonzero vectora∈ℝMa\\in\\mathbb\{R\}^\{M\}\. Assumeα\>0\\alpha\>0\. For anyz=\(z1,…,zN\)∈\(ℝM\)Nz=\(z\_\{1\},\\ldots,z\_\{N\}\)\\in\(\\mathbb\{R\}^\{M\}\)^\{N\}satisfying

the normalized hard\-instance family from \([4\.6](https://arxiv.org/html/2606.05438#S4.E6)\) satisfies

‖∇f¯M,N,γ,L,a,r​\(z\)‖≳γβ\+β​ρα\.\\left\\\|\\nabla\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\(z\)\\right\\\|\\gtrsim\\frac\{\\sqrt\{\\gamma\}\}\{\\sqrt\{\\beta\}\+\\frac\{\\beta\\sqrt\{\\rho\}\}\{\\alpha\}\}\.

###### Proof\.

See Appendix[A\.3](https://arxiv.org/html/2606.05438#A1.SS3)\. ∎

Lemmas[4\.5](https://arxiv.org/html/2606.05438#S4.Thmtheorem5),[4\.7](https://arxiv.org/html/2606.05438#S4.Thmtheorem7), and[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)are the uncalibrated interface of the construction: the block geometry enters only throughρ,α,β\\rho,\\alpha,\\beta,‖a‖\\\|a\\\|, and the suffix layerj⋆j\_\{\\star\}\. It remains to choose these quantities so that the smoothness budgets, initial gap, reveal depth, and gradient certificate all have the desired scales\.

## 5Main Results

We now carry out the calibration left open by the block\-chain construction\. Throughout this section, letp≥2p\\geq 2, letL1,…,Lp,ϵ,Δ\>0L\_\{1\},\\ldots,L\_\{p\},\\epsilon,\\Delta\>0, and letℓ1,…,ℓp\\ell\_\{1\},\\ldots,\\ell\_\{p\}be the constants from Proposition[3\.1](https://arxiv.org/html/2606.05438#S3.Thmtheorem1)\.

The following scale summarizes the active smoothness budget\. For a fixed potential parameterrr,Γp​\(r\)\\Gamma\_\{p\}\(r\)is the largest mass scale compatible with all derivative Lipschitz constraints, up to the numerical constants already absorbed into the construction\. We then optimize overrr\. Define

Γp​\(r\)\\displaystyle\\Gamma\_\{p\}\(r\):=min\{L1,L1ℓ1​r2,\(L2​ϵℓ2​r\)1/2,min3≤q≤p\(Lq​ϵq−1​rq−3ℓq\)1/q\},r≥1,\\displaystyle=\\min\\left\\\{L\_\{1\},\\,\\frac\{L\_\{1\}\}\{\\ell\_\{1\}r^\{2\}\},\\,\\left\(\\frac\{L\_\{2\}\\epsilon\}\{\\ell\_\{2\}r\}\\right\)^\{1/2\},\\,\\min\_\{3\\leq q\\leq p\}\\left\(\\frac\{L\_\{q\}\\epsilon^\{q\-1\}r^\{q\-3\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}\\right\\\},\\qquad r\\geq 1,Γp\\displaystyle\\Gamma\_\{p\}:=supr≥1Γp​\(r\),\\displaystyle=\\sup\_\{r\\geq 1\}\\Gamma\_\{p\}\(r\),where the last minimum is omitted whenp=2p=2\. Letr⋆≥1r\_\{\\star\}\\geq 1be such thatΓp​\(r⋆\)≍Γp\\Gamma\_\{p\}\(r\_\{\\star\}\)\\asymp\\Gamma\_\{p\}\.

The parameterΓp\\Gamma\_\{p\}is the only implicit quantity in the final rate\. The next proposition unpacks itsϵ\\epsilon\-dependence in the high\-accuracy regime\.

###### Proposition 5\.1\(Asymptotic form ofΓp\\Gamma\_\{p\}\)\.

For all sufficiently smallϵ\\epsilon, the following simplifications ofΓp\\Gamma\_\{p\}hold\. Ifp∈\{2,3\}p\\in\\\{2,3\\\}, then

Γp=\(Lp​ϵp−1ℓp\)1/p\.\\Gamma\_\{p\}=\\left\(\\frac\{L\_\{p\}\\epsilon^\{p\-1\}\}\{\\ell\_\{p\}\}\\right\)^\{1/p\}\.Ifp≥4p\\geq 4, then

Γp=ϵ2/3supR\>0min\{L1ℓ1​R2,\(L2ℓ2​R\)1/2,min3≤q≤p\(Lq​Rq−3ℓq\)1/q\}\.\\Gamma\_\{p\}=\\epsilon^\{2/3\}\\sup\_\{R\>0\}\\min\\left\\\{\\frac\{L\_\{1\}\}\{\\ell\_\{1\}R^\{2\}\},\\,\\left\(\\frac\{L\_\{2\}\}\{\\ell\_\{2\}R\}\\right\)^\{1/2\},\\,\\min\_\{3\\leq q\\leq p\}\\left\(\\frac\{L\_\{q\}R^\{q\-3\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}\\right\\\}\.

###### Proof\.

See Appendix[B\.1](https://arxiv.org/html/2606.05438#A2.SS1)\. ∎

We can now state the deterministic first\-order lower bound\.

###### Theorem 5\.2\(Main deterministic first\-order lower bound\)\.

Let𝒜det\\mathcal\{A\}\_\{\\rm det\}denote the class of deterministic first\-order methods\. Assume thatϵ≤cϵ​min1≤q≤p⁡Δq/\(q\+1\)​\(Lq/ℓq\)1/\(q\+1\)\\epsilon\\leq c\_\{\\epsilon\}\\min\_\{1\\leq q\\leq p\}\\Delta^\{q/\(q\+1\)\}\(L\_\{q\}/\\ell\_\{q\}\)^\{1/\(q\+1\)\}, wherecϵ\>0c\_\{\\epsilon\}\>0is a sufficiently small absolute constant\. Then

Tϵ​\(𝒜det,ℱ1:p​\(Δ,L1,…,Lp\)\)≳Δ​L11/2​Γp1/2​ϵ−2\.T\_\{\\epsilon\}\\bigl\(\\mathcal\{A\}\_\{\\rm det\},\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\)\\bigr\)\\gtrsim\\Delta L\_\{1\}^\{1/2\}\\Gamma\_\{p\}^\{1/2\}\\epsilon^\{\-2\}\.In particular, for sufficiently smallϵ\\epsilon, this implies

Tϵ​\(𝒜det,ℱ1:p​\(Δ,L1,…,Lp\)\)≳\{Δ​L11/2​\(L2/ℓ2\)1/4​ϵ−7/4,p=2,Δ​L11/2​\(L3/ℓ3\)1/6​ϵ−5/3,p=3,ΔL11/2min1≤q≤p\(Lq/ℓq\)1/\(2​q\)ϵ−5/3,p≥4\.T\_\{\\epsilon\}\\bigl\(\\mathcal\{A\}\_\{\\rm det\},\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\)\\bigr\)\\gtrsim\\begin\{cases\}\\Delta L\_\{1\}^\{1/2\}\(L\_\{2\}/\\ell\_\{2\}\)^\{1/4\}\\epsilon^\{\-7/4\},&p=2,\\\\ \\Delta L\_\{1\}^\{1/2\}\(L\_\{3\}/\\ell\_\{3\}\)^\{1/6\}\\epsilon^\{\-5/3\},&p=3,\\\\ \\Delta L\_\{1\}^\{1/2\}\\min\_\{1\\leq q\\leq p\}\(L\_\{q\}/\\ell\_\{q\}\)^\{1/\(2q\)\}\\epsilon^\{\-5/3\},&p\\geq 4\.\\end\{cases\}

We record three consequences of the theorem\.

It remains to specify the calibrated instance used in the proof of Theorem[5\.2](https://arxiv.org/html/2606.05438#S5.Thmtheorem2)\. Lemmas[4\.5](https://arxiv.org/html/2606.05438#S4.Thmtheorem5),[4\.7](https://arxiv.org/html/2606.05438#S4.Thmtheorem7), and[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)reduce this to choosing the block geometry: the block length creates reveal depth, the number of phases uses the available gap, and the suffix vectoraasets the low\-rank smoothness and gradient certificate scales\.

###### Condition 5\.6\(Calibrated parameters\)\.

Use the scaleΓp\\Gamma\_\{p\}andr⋆r\_\{\\star\}defined above\. Set

γ:=chp​Γp\.\\gamma:=c\_\{\\rm hp\}\\Gamma\_\{p\}\.With this value ofγ\\gamma, choose the free parameters off¯M,N,γ,L,a,r\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}in the following order:

M\\displaystyle M:=⌈CM​L1γ⌉,N:=⌊cN​Δ​γϵ2⌋,L:=L1,\\displaystyle=\\left\\lceil C\_\{M\}\\sqrt\{\\frac\{L\_\{1\}\}\{\\gamma\}\}\\right\\rceil,\\qquad N=\\left\\lfloor c\_\{N\}\\frac\{\\Delta\\gamma\}\{\\epsilon^\{2\}\}\\right\\rfloor,\\qquad L=L\_\{1\},a\\displaystyle a:=γ5/4L11/4​ϵ​\(0,…,0⏟⌊M/2⌋−1,1,…,1\)⊤∈ℝM,r:=r⋆\.\\displaystyle=\\frac\{\\gamma^\{5/4\}\}\{L\_\{1\}^\{1/4\}\\epsilon\}\\bigl\(\\underbrace\{0,\\ldots,0\}\_\{\\lfloor M/2\\rfloor\-1\},1,\\ldots,1\\bigr\)^\{\\top\}\\in\\mathbb\{R\}^\{M\},\\qquad r=r\_\{\\star\}\.Herechp\>0c\_\{\\rm hp\}\>0is a sufficiently small absolute constant andCM\>0,cN\>0C\_\{M\}\>0,c\_\{N\}\>0are sufficiently large and sufficiently small absolute constants, respectively\. LetH:=Hγ,L\(M\)H:=H\_\{\\gamma,L\}^\{\(M\)\}\. For this choice ofaa, the positivity condition required by the scalar\-reduction normalization holds; defineλ,τ,κ\\lambda,\\tau,\\kappaby \([4\.5](https://arxiv.org/html/2606.05438#S4.E5)\)\. We assume the nontrivial gap regime in whichN≥1N\\geq 1\. Setd:=M​Nd:=MNand, under the block identificationℝd≅\(ℝM\)N\\mathbb\{R\}^\{d\}\\cong\(\\mathbb\{R\}^\{M\}\)^\{N\}, define the calibrated hard instance

fΔ,L1:p,ϵ:=f¯M,N,γ,L,a,r\.f\_\{\\Delta,L\_\{1:p\},\\epsilon\}:=\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\.

The following lemma verifies that this choice realizes the desired scales and therefore produces a hard instance in the target smoothness class\.

###### Lemma 5\.7\(Zero\-respecting lower bound for the calibrated instance\)\.

Let the parameters be chosen as in Condition[5\.6](https://arxiv.org/html/2606.05438#S5.Thmtheorem6)\. Then the quantities from the scalar\-reduction normalization satisfy

ρ≍\(γ​L1\)−1/2,α≍γ1/4L11/4​ϵ,β≍γϵ2,‖a‖≍γϵ\.\\rho\\asymp\(\\gamma L\_\{1\}\)^\{\-1/2\},\\qquad\\alpha\\asymp\\frac\{\\gamma^\{1/4\}\}\{L\_\{1\}^\{1/4\}\\epsilon\},\\qquad\\beta\\asymp\\frac\{\\gamma\}\{\\epsilon^\{2\}\},\\qquad\\\|a\\\|\\asymp\\frac\{\\gamma\}\{\\epsilon\}\.Ifϵ≤cϵ​min1≤q≤p⁡Δq/\(q\+1\)​\(Lq/ℓq\)1/\(q\+1\)\\epsilon\\leq c\_\{\\epsilon\}\\min\_\{1\\leq q\\leq p\}\\Delta^\{q/\(q\+1\)\}\(L\_\{q\}/\\ell\_\{q\}\)^\{1/\(q\+1\)\}for a sufficiently small absolute constantcϵ\>0c\_\{\\epsilon\}\>0, then the calibrated instancefΔ,L1:p,ϵf\_\{\\Delta,L\_\{1:p\},\\epsilon\}belongs toℱ1:p​\(Δ,L1,…,Lp\)\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\)\. Moreover, any first\-order zero\-respecting method𝖠\\mathsf\{A\}applied tofΔ,L1:p,ϵf\_\{\\Delta,L\_\{1:p\},\\epsilon\}, starting from the origin, satisfies

Tϵ​\(𝖠,fΔ,L1:p,ϵ\)≳Δ​L11/2​Γp1/2​ϵ−2T\_\{\\epsilon\}\\bigl\(\\mathsf\{A\},f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\\bigr\)\\gtrsim\\Delta L\_\{1\}^\{1/2\}\\Gamma\_\{p\}^\{1/2\}\\epsilon^\{\-2\}

###### Proof\.

See Appendix[B\.2](https://arxiv.org/html/2606.05438#A2.SS2)\. ∎

###### Proof of Theorem[5\.2](https://arxiv.org/html/2606.05438#S5.Thmtheorem2)\.

Choose the absolute constantcϵc\_\{\\epsilon\}in the theorem no larger than the corresponding constant in Lemma[5\.7](https://arxiv.org/html/2606.05438#S5.Thmtheorem7)\. Lemma[5\.7](https://arxiv.org/html/2606.05438#S5.Thmtheorem7)gives the stated lower bound for zero\-respecting methods on an instance inℱ1:p​\(Δ,L1,…,Lp\)\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\)\. The classℱ1:p​\(Δ,L1,…,Lp\)\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\)is closed under orthogonal changes of variables, so Proposition 1 ofCarmonet al\.\[[10](https://arxiv.org/html/2606.05438#bib.bib1)\]transfers the zero\-respecting lower bound to arbitrary deterministic first\-order methods\. Taking the infimum over𝖠∈𝒜det\\mathsf\{A\}\\in\\mathcal\{A\}\_\{\\rm det\}gives the minimax claim\. The explicit display follows by substituting Proposition[5\.1](https://arxiv.org/html/2606.05438#S5.Thmtheorem1); in thep≥4p\\geq 4regime we use its supremum atR=1R=1, which givesΓp≥ϵ2/3min1≤q≤p\(Lq/ℓq\)1/q\\Gamma\_\{p\}\\geq\\epsilon^\{2/3\}\\min\_\{1\\leq q\\leq p\}\(L\_\{q\}/\\ell\_\{q\}\)^\{1/q\}\. ∎

## 6Conclusion

We established a dimension\-free deterministic first\-order lower bound for finding approximate stationary points under higher\-order smoothness\. In the Hessian\-Lipschitz case the bound gives the matchingΩ​\(ϵ−7/4\)\\Omega\(\\epsilon^\{\-7/4\}\)dependence, and under Lipschitz third derivatives it gives the matchingΩ​\(ϵ−5/3\)\\Omega\(\\epsilon^\{\-5/3\}\)dependence\. Thus the accelerated rates achieved by first\-order methods under higher\-order smoothness are optimal, up to the standard logarithmic factors present in some upper bounds\. The main technical device is the block\-chain hard instance\. It separates the scalar transition obstruction from the blockwise oracle\-discovery process: the scalar state is carried by a suffix readout inside each block, while the path quadratic forces zero\-respecting methods to expose the block one coordinate layer at a time\.

## Acknowledgements

DZ first began thinking about this problem in 2019, but did not make substantial progress at that time\. Motivated by the recent advances in the mathematical reasoning capabilities of AI models, DZ revisited the problem in May 2026 with AI assistance\. DZ supplied the relevant prior literature and designed prompts asking ChatGPT 5\.5 Pro to search for a first\-order lower\-bound construction\. Through several rounds of interaction, ChatGPT 5\.5 Pro identified the block\-chain construction as the right direction and began validating its correctness\. DZ then took primary responsibility for checking the proofs, simplifying several proof paths, improving the presentation, and formulating extension directions, again with assistance from ChatGPT 5\.5 Pro\.

Appendix

## Appendix AProofs for the Block\-Chain Construction

### A\.1Proof of Lemma[4\.5](https://arxiv.org/html/2606.05438#S4.Thmtheorem5)

###### Proof of Lemma[4\.5](https://arxiv.org/html/2606.05438#S4.Thmtheorem5)\.

LetΥ:=Υr\\Upsilon:=\\Upsilon\_\{r\}andF:=f¯M,N,γ,L,a,rF:=\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}, and lete,ρ,α,β,λ,τ,κe,\\rho,\\alpha,\\beta,\\lambda,\\tau,\\kappabe the quantities in \([4\.4](https://arxiv.org/html/2606.05438#S4.E4)\) and \([4\.5](https://arxiv.org/html/2606.05438#S4.E5)\) for the present\(M,γ,L,a\)\(M,\\gamma,L,a\), and setH:=Hγ,L\(M\)H:=H\_\{\\gamma,L\}^\{\(M\)\}\. In particular,κ=ρ/α2\\kappa=\\rho/\\alpha^\{2\}\. Writez=\(z1,…,zN\)∈\(ℝM\)Nz=\(z\_\{1\},\\ldots,z\_\{N\}\)\\in\(\\mathbb\{R\}^\{M\}\)^\{N\},us:=a⊤​zsu\_\{s\}:=a^\{\\top\}z\_\{s\}, andu0:=1u\_\{0\}:=1\. We use the uniform propertiesm:=infxΥ​\(x\)=Υ​\(1\)=0m:=\\inf\_\{x\}\\Upsilon\(x\)=\\Upsilon\(1\)=0,Υ​\(0\)≤10\\Upsilon\(0\)\\leq 10, and, forq=1,…,pq=1,\\ldots,p,‖Υ\(q\+1\)‖∞≤ℓq​r3−q\\\|\\Upsilon^\{\(q\+1\)\}\\\|\_\{\\infty\}\\leq\\ell\_\{q\}r^\{3\-q\}\.

#### Smoothness\.

We first prove theq=1q=1smoothness bound\. DecomposeF=Q\+RF=Q\+R, where

Q​\(z\):=∑s=1N\[12​zs⊤​H​zs−λ​us−1​e⊤​zs\+κ2​us2\],Q\(z\):=\\sum\_\{s=1\}^\{N\}\\left\[\\frac\{1\}\{2\}z\_\{s\}^\{\\top\}Hz\_\{s\}\-\\lambda u\_\{s\-1\}e^\{\\top\}z\_\{s\}\+\\frac\{\\kappa\}\{2\}u\_\{s\}^\{2\}\\right\],andR​\(z\):=τ​∑s=1NΥ​\(us\)R\(z\):=\\tau\\sum\_\{s=1\}^\{N\}\\Upsilon\(u\_\{s\}\)\. For any block vectorv=\(v1,…,vN\)v=\(v\_\{1\},\\ldots,v\_\{N\}\), the quadratic part satisfies

D2​Q​\(z\)​\[v,v\]\\displaystyle D^\{2\}Q\(z\)\[v,v\]=∑s=1Nvs⊤​H​vs\+κ​∑s=1N\(a⊤​vs\)2−2​λ​∑s=2N\(a⊤​vs−1\)​\(e⊤​vs\)\.\\displaystyle=\\sum\_\{s=1\}^\{N\}v\_\{s\}^\{\\top\}Hv\_\{s\}\+\\kappa\\sum\_\{s=1\}^\{N\}\(a^\{\\top\}v\_\{s\}\)^\{2\}\-2\\lambda\\sum\_\{s=2\}^\{N\}\(a^\{\\top\}v\_\{s\-1\}\)\(e^\{\\top\}v\_\{s\}\)\.Hence, using‖e‖=1\\\|e\\\|=1,

\|D2​Q​\(z\)​\[v,v\]\|\\displaystyle\|D^\{2\}Q\(z\)\[v,v\]\|≤‖H‖​∑s=1N‖vs‖2\+κ​‖a‖2​∑s=1N‖vs‖2\+2​λ​‖a‖​∑s=2N‖vs−1‖​‖vs‖\\displaystyle\\leq\\\|H\\\|\\sum\_\{s=1\}^\{N\}\\\|v\_\{s\}\\\|^\{2\}\+\\kappa\\\|a\\\|^\{2\}\\sum\_\{s=1\}^\{N\}\\\|v\_\{s\}\\\|^\{2\}\+2\\lambda\\\|a\\\|\\sum\_\{s=2\}^\{N\}\\\|v\_\{s\-1\}\\\|\\\|v\_\{s\}\\\|≤\(‖H‖\+κ​‖a‖2\+2​λ​‖a‖\)​‖v‖2\.\\displaystyle\\leq\\left\(\\\|H\\\|\+\\kappa\\\|a\\\|^\{2\}\+2\\lambda\\\|a\\\|\\right\)\\\|v\\\|^\{2\}\.Therefore

‖D2​Q​\(z\)‖op≤γ\+4​L\+ρα2​‖a‖2\+2​‖a‖α\.\\\|D^\{2\}Q\(z\)\\\|\_\{\\rm op\}\\leq\\gamma\+4L\+\\frac\{\\rho\}\{\\alpha^\{2\}\}\\\|a\\\|^\{2\}\+\\frac\{2\\\|a\\\|\}\{\\alpha\}\.
For the nonlinear part,

D2​R​\(z\)​\[v,v\]=τ​∑s=1NΥ′′​\(us\)​\(a⊤​vs\)2\.D^\{2\}R\(z\)\[v,v\]=\\tau\\sum\_\{s=1\}^\{N\}\\Upsilon^\{\\prime\\prime\}\(u\_\{s\}\)\(a^\{\\top\}v\_\{s\}\)^\{2\}\.Thus

\|D2​R​\(z\)​\[v,v\]\|\\displaystyle\|D^\{2\}R\(z\)\[v,v\]\|≤τ​‖Υ′′‖∞​∑s=1N\|a⊤​vs\|2\\displaystyle\\leq\\tau\\\|\\Upsilon^\{\\prime\\prime\}\\\|\_\{\\infty\}\\sum\_\{s=1\}^\{N\}\|a^\{\\top\}v\_\{s\}\|^\{2\}≤ℓ1​r2β​‖a‖2​‖v‖2\.\\displaystyle\\leq\\frac\{\\ell\_\{1\}r^\{2\}\}\{\\beta\}\\\|a\\\|^\{2\}\\\|v\\\|^\{2\}\.Consequently,

‖D2​F​\(z\)‖op≤γ\+4​L\+2​‖a‖α\+\(ρα2\+ℓ1​r2β\)​‖a‖2for every​z,\\\|D^\{2\}F\(z\)\\\|\_\{\\rm op\}\\leq\\gamma\+4L\+\\frac\{2\\\|a\\\|\}\{\\alpha\}\+\\left\(\\frac\{\\rho\}\{\\alpha^\{2\}\}\+\\frac\{\\ell\_\{1\}r^\{2\}\}\{\\beta\}\\right\)\\\|a\\\|^\{2\}\\qquad\\text\{for every \}z,and hence

Lip⁡\(∇F\)≤γ\+4​L\+2​‖a‖α\+\(ρα2\+ℓ1​r2β\)​‖a‖2\.\\operatorname\{Lip\}\(\\nabla F\)\\leq\\gamma\+4L\+\\frac\{2\\\|a\\\|\}\{\\alpha\}\+\\left\(\\frac\{\\rho\}\{\\alpha^\{2\}\}\+\\frac\{\\ell\_\{1\}r^\{2\}\}\{\\beta\}\\right\)\\\|a\\\|^\{2\}\.
We now prove the higher\-order smoothness bounds\. Fixq∈\{2,…,p\}q\\in\\\{2,\\ldots,p\\\}\. All terms inQQare quadratic, soDq\+1​Q≡0D^\{q\+1\}Q\\equiv 0\. Thus onlyRRcontributes\. For arbitrary block vectors

v\(0\),v\(1\),…,v\(q\)∈\(ℝM\)N,v\(ℓ\)=\(v1\(ℓ\),…,vN\(ℓ\)\),v^\{\(0\)\},v^\{\(1\)\},\\ldots,v^\{\(q\)\}\\in\(\\mathbb\{R\}^\{M\}\)^\{N\},\\qquad v^\{\(\\ell\)\}=\(v^\{\(\\ell\)\}\_\{1\},\\ldots,v^\{\(\\ell\)\}\_\{N\}\),we have

Dq\+1​R​\(z\)​\[v\(0\),v\(1\),…,v\(q\)\]\\displaystyle D^\{q\+1\}R\(z\)\[v^\{\(0\)\},v^\{\(1\)\},\\ldots,v^\{\(q\)\}\]=τ​∑s=1NΥ\(q\+1\)​\(us\)​∏ℓ=0q\(a⊤​vs\(ℓ\)\)\.\\displaystyle=\\tau\\sum\_\{s=1\}^\{N\}\\Upsilon^\{\(q\+1\)\}\(u\_\{s\}\)\\prod\_\{\\ell=0\}^\{q\}\\bigl\(a^\{\\top\}v\_\{s\}^\{\(\\ell\)\}\\bigr\)\.Therefore

\|Dq\+1​R​\(z\)​\[v\(0\),v\(1\),…,v\(q\)\]\|\\displaystyle\\left\|D^\{q\+1\}R\(z\)\[v^\{\(0\)\},v^\{\(1\)\},\\ldots,v^\{\(q\)\}\]\\right\|≤ℓq​r3−qβ​∑s=1N∏ℓ=0q\|a⊤​vs\(ℓ\)\|\.\\displaystyle\\leq\\frac\{\\ell\_\{q\}r^\{3\-q\}\}\{\\beta\}\\sum\_\{s=1\}^\{N\}\\prod\_\{\\ell=0\}^\{q\}\\left\|a^\{\\top\}v\_\{s\}^\{\(\\ell\)\}\\right\|\.We use the elementary inequality that for nonnegative sequencesb\(0\),…,b\(q\)b^\{\(0\)\},\\ldots,b^\{\(q\)\},

∑s∏ℓ=0qbs\(ℓ\)≤∏ℓ=0q\(∑s\(bs\(ℓ\)\)2\)1/2\.\\sum\_\{s\}\\prod\_\{\\ell=0\}^\{q\}b\_\{s\}^\{\(\\ell\)\}\\leq\\prod\_\{\\ell=0\}^\{q\}\\left\(\\sum\_\{s\}\(b\_\{s\}^\{\(\\ell\)\}\)^\{2\}\\right\)^\{1/2\}\.Applying this withbs\(ℓ\):=\|a⊤​vs\(ℓ\)\|b\_\{s\}^\{\(\\ell\)\}:=\|a^\{\\top\}v\_\{s\}^\{\(\\ell\)\}\|gives

∑s=1N∏ℓ=0q\|a⊤​vs\(ℓ\)\|\\displaystyle\\sum\_\{s=1\}^\{N\}\\prod\_\{\\ell=0\}^\{q\}\|a^\{\\top\}v\_\{s\}^\{\(\\ell\)\}\|≤∏ℓ=0q\(∑s=1N\|a⊤​vs\(ℓ\)\|2\)1/2\\displaystyle\\leq\\prod\_\{\\ell=0\}^\{q\}\\left\(\\sum\_\{s=1\}^\{N\}\|a^\{\\top\}v\_\{s\}^\{\(\\ell\)\}\|^\{2\}\\right\)^\{1/2\}≤‖a‖q\+1​∏ℓ=0q‖v\(ℓ\)‖\.\\displaystyle\\leq\\\|a\\\|^\{q\+1\}\\prod\_\{\\ell=0\}^\{q\}\\\|v^\{\(\\ell\)\}\\\|\.Hence

‖Dq\+1​F​\(z\)‖op=‖Dq\+1​R​\(z\)‖op≤ℓq​r3−q​‖a‖q\+1β\.\\\|D^\{q\+1\}F\(z\)\\\|\_\{\\rm op\}=\\\|D^\{q\+1\}R\(z\)\\\|\_\{\\rm op\}\\leq\\frac\{\\ell\_\{q\}r^\{3\-q\}\\\|a\\\|^\{q\+1\}\}\{\\beta\}\.By the fundamental theorem of calculus along line segments,

‖Dq​F​\(x\)−Dq​F​\(y\)‖op≤sup0≤t≤1‖Dq\+1​F​\(y\+t​\(x−y\)\)‖op​‖x−y‖,\\\|D^\{q\}F\(x\)\-D^\{q\}F\(y\)\\\|\_\{\\rm op\}\\leq\\sup\_\{0\\leq t\\leq 1\}\\\|D^\{q\+1\}F\(y\+t\(x\-y\)\)\\\|\_\{\\rm op\}\\\|x\-y\\\|,so

Lip⁡\(Dq​F\)≤ℓq​r3−q​‖a‖q\+1β,q=2,…,p\.\\operatorname\{Lip\}\(D^\{q\}F\)\\leq\\frac\{\\ell\_\{q\}r^\{3\-q\}\\\|a\\\|^\{q\+1\}\}\{\\beta\},\\qquad q=2,\\ldots,p\.

#### Gap bound\.

For each phasess, writev:=us−1v:=u\_\{s\-1\}\. SinceH≻0H\\succ 0,

12​zs⊤​H​zs−λ​v​e⊤​zs\\displaystyle\\frac\{1\}\{2\}z\_\{s\}^\{\\top\}Hz\_\{s\}\-\\lambda ve^\{\\top\}z\_\{s\}=12​‖zs−λ​v​H−1​e‖H2−λ2​v22​e⊤​H−1​e\\displaystyle=\\frac\{1\}\{2\}\\left\\\|z\_\{s\}\-\\lambda vH^\{\-1\}e\\right\\\|\_\{H\}^\{2\}\-\\frac\{\\lambda^\{2\}v^\{2\}\}\{2\}e^\{\\top\}H^\{\-1\}e≥−κ2​v2,\\displaystyle\\geq\-\\frac\{\\kappa\}\{2\}v^\{2\},where‖w‖H2:=w⊤​H​w\\\|w\\\|\_\{H\}^\{2\}:=w^\{\\top\}Hw,κ=λ2​ρ\\kappa=\\lambda^\{2\}\\rho, andρ=e⊤​H−1​e\\rho=e^\{\\top\}H^\{\-1\}e\. Hence

12​zs⊤​H​zs−λ​us−1​e⊤​zs\+κ2​us2\+τ​Υ​\(us\)\\displaystyle\\frac\{1\}\{2\}z\_\{s\}^\{\\top\}Hz\_\{s\}\-\\lambda u\_\{s\-1\}e^\{\\top\}z\_\{s\}\+\\frac\{\\kappa\}\{2\}u\_\{s\}^\{2\}\+\\tau\\Upsilon\(u\_\{s\}\)≥κ2​\(us2−us−12\)\+τ​Υ​\(us\)\.\\displaystyle\\qquad\\geq\\frac\{\\kappa\}\{2\}\(u\_\{s\}^\{2\}\-u\_\{s\-1\}^\{2\}\)\+\\tau\\Upsilon\(u\_\{s\}\)\.Summing overs=1,…,Ns=1,\\ldots,N, usingu0=1u\_\{0\}=1,uN2≥0u\_\{N\}^\{2\}\\geq 0, andΥ≥0\\Upsilon\\geq 0, givesF​\(z\)≥κ2​\(uN2−u02\)≥−κ2F\(z\)\\geq\\frac\{\\kappa\}\{2\}\(u\_\{N\}^\{2\}\-u\_\{0\}^\{2\}\)\\geq\-\\frac\{\\kappa\}\{2\}\. At the origin,us=0u\_\{s\}=0for everys≥1s\\geq 1, and all quadratic, linear, and stabilization terms vanish\. SinceΥ​\(0\)≤10\\Upsilon\(0\)\\leq 10,F​\(0\)≤10​N​τF\(0\)\\leq 10N\\tau\. Therefore

F​\(0\)−infzF​\(z\)≤10​N​τ\+κ2=10​Nβ\+ρ2​α2\.F\(0\)\-\\inf\_\{z\}F\(z\)\\leq 10N\\tau\+\\frac\{\\kappa\}\{2\}=\\frac\{10N\}\{\\beta\}\+\\frac\{\\rho\}\{2\\alpha^\{2\}\}\.∎

### A\.2Proof of Lemma[4\.7](https://arxiv.org/html/2606.05438#S4.Thmtheorem7)

###### Proof of Lemma[4\.7](https://arxiv.org/html/2606.05438#S4.Thmtheorem7)\.

For each phasess, all coordinates in\{s\}×S\\\{s\\\}\\times Shave layer indexs​j⋆sj\_\{\\star\}, since forj∈Sj\\in S,\(s−1\)​j⋆\+min⁡\{j,j⋆\}=s​j⋆\(s\-1\)j\_\{\\star\}\+\\min\\\{j,j\_\{\\star\}\\\}=sj\_\{\\star\}\. We prove by induction thatsupp⁡\(z\(t\)\)⊆It\\operatorname\{supp\}\(z^\{\(t\)\}\)\\subseteq I\_\{t\}for everyt≥0t\\geq 0, where supports are understood in block\-coordinate notation\.

The key step is the following one\-step support implication\. Supposesupp⁡\(z\)⊆It\\operatorname\{supp\}\(z\)\\subseteq I\_\{t\}\. WriteH=Hγ,L\(M\)H=H\_\{\\gamma,L\}^\{\(M\)\},us=a⊤​zsu\_\{s\}=a^\{\\top\}z\_\{s\},xs=e1⊤​zsx\_\{s\}=e\_\{1\}^\{\\top\}z\_\{s\}, and use the conventionxN\+1=0x\_\{N\+1\}=0\. Thess\-th gradient block is

gs=H​zs−λ​us−1​e1\+\(κ​us\+τ​Υr′​\(us\)\)​a−λ​xs\+1​a\.g\_\{s\}=Hz\_\{s\}\-\\lambda u\_\{s\-1\}e\_\{1\}\+\(\\kappa u\_\{s\}\+\\tau\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\)a\-\\lambda x\_\{s\+1\}a\.We check that each term is supported inIt\+1I\_\{t\+1\}\.

SinceHHis tridiagonal,H​zsHz\_\{s\}can only move support inside phasessby one neighboring coordinate\. In terms ofItI\_\{t\}, this can increasemin⁡\{j,j⋆\}\\min\\\{j,j\_\{\\star\}\\\}by at most one, soH​zsHz\_\{s\}is supported inIt\+1I\_\{t\+1\}\.

For the entry term−λ​us−1​e1\-\\lambda u\_\{s\-1\}e\_\{1\}, ifs=1s=1, then the coordinate\(1,1\)\(1,1\)belongs toIt\+1I\_\{t\+1\}\. Ifs≥2s\\geq 2andus−1≠0u\_\{s\-1\}\\neq 0, thenzs−1z\_\{s\-1\}has some support in the suffixSS, so\(s−1\)​j⋆≤t\(s\-1\)j\_\{\\star\}\\leq t\. Hence the coordinate\(s,1\)\(s,1\), whose layer index is\(s−1\)​j⋆\+1\(s\-1\)j\_\{\\star\}\+1, belongs toIt\+1I\_\{t\+1\}\.

For the term\(κ​us\+τ​Υr′​\(us\)\)​a\(\\kappa u\_\{s\}\+\\tau\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\)a, ifzsz\_\{s\}has no support inSS, thenus=0u\_\{s\}=0, and Proposition[3\.1](https://arxiv.org/html/2606.05438#S3.Thmtheorem1)givesΥr′​\(0\)=0\\Upsilon\_\{r\}^\{\\prime\}\(0\)=0, so the coefficient vanishes\. Otherwise\{s\}×S\\\{s\\\}\\times Sis already contained inItI\_\{t\}, and the term is supported inIt⊆It\+1I\_\{t\}\\subseteq I\_\{t\+1\}\.

Finally, consider−λ​xs\+1​a\-\\lambda x\_\{s\+1\}a\. Ifxs\+1=0x\_\{s\+1\}=0, the term vanishes\. Ifxs\+1≠0x\_\{s\+1\}\\neq 0, then coordinate\(s\+1,1\)\(s\+1,1\)belongs to the support ofzz, sos​j⋆\+1≤tsj\_\{\\star\}\+1\\leq t\. Thus\{s\}×S\\\{s\\\}\\times S, whose coordinates all have layer indexs​j⋆sj\_\{\\star\}, is contained inItI\_\{t\}\. Hence this term is also supported inIt⊆It\+1I\_\{t\}\\subseteq I\_\{t\+1\}\. Thereforesupp⁡\(∇F​\(z\)\)⊆It\+1\\operatorname\{supp\}\(\\nabla F\(z\)\)\\subseteq I\_\{t\+1\}wheneversupp⁡\(z\)⊆It\\operatorname\{supp\}\(z\)\\subseteq I\_\{t\}\.

Now use the zero\-respecting property\. Sincez\(0\)=0z^\{\(0\)\}=0, the induction starts withsupp⁡\(z\(0\)\)⊆I0\\operatorname\{supp\}\(z^\{\(0\)\}\)\\subseteq I\_\{0\}\. Ifsupp⁡\(z\(t′\)\)⊆It′\\operatorname\{supp\}\(z^\{\(t^\{\\prime\}\)\}\)\\subseteq I\_\{t^\{\\prime\}\}for allt′<tt^\{\\prime\}<t, then the one\-step implication givessupp⁡\(∇F​\(z\(t′\)\)\)⊆It′\+1⊆It\\operatorname\{supp\}\(\\nabla F\(z^\{\(t^\{\\prime\}\)\}\)\)\\subseteq I\_\{t^\{\\prime\}\+1\}\\subseteq I\_\{t\}fort′<tt^\{\\prime\}<t\. By zero\-respectingness,

supp⁡\(z\(t\)\)⊆⋃t′<tsupp⁡\(∇F​\(z\(t′\)\)\)⊆It\.\\operatorname\{supp\}\(z^\{\(t\)\}\)\\subseteq\\bigcup\_\{t^\{\\prime\}<t\}\\operatorname\{supp\}\(\\nabla F\(z^\{\(t^\{\\prime\}\)\}\)\)\\subseteq I\_\{t\}\.This proves the induction claim\.

Ift<N​j⋆t<Nj\_\{\\star\}, then no coordinate\(N,j\)\(N,j\)withj∈Sj\\in Slies inItI\_\{t\}\. Sincesupp⁡\(a\)⊆S\\operatorname\{supp\}\(a\)\\subseteq S, we concludeuN\(t\)=a⊤​zN\(t\)=0u\_\{N\}^\{\(t\)\}=a^\{\\top\}z\_\{N\}^\{\(t\)\}=0\. ∎

### A\.3Proof of Lemma[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)

###### Lemma A\.1\(Gradient projection identity\)\.

ForM,N∈ℕM,N\\in\\mathbb\{N\},γ,L,r\>0\\gamma,L,r\>0, and any nonzero vectora∈ℝMa\\in\\mathbb\{R\}^\{M\}, lete,ρ,α,β,λ,τ,κe,\\rho,\\alpha,\\beta,\\lambda,\\tau,\\kappabe the quantities in \([4\.4](https://arxiv.org/html/2606.05438#S4.E4)\) and \([4\.5](https://arxiv.org/html/2606.05438#S4.E5)\) for the present\(M,γ,L,a\)\(M,\\gamma,L,a\), and setH:=Hγ,L\(M\)H:=H\_\{\\gamma,L\}^\{\(M\)\}\. Letgsg\_\{s\}denote thess\-th block of∇f¯M,N,γ,L,a,r​\(z\)\\nabla\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\(z\), namely

gs=H​zs−λ​us−1​e\+\(κ​us\+τ​Υr′​\(us\)\)​a−λ​\(e⊤​zs\+1\)​a,g\_\{s\}=Hz\_\{s\}\-\\lambda u\_\{s\-1\}e\+\(\\kappa u\_\{s\}\+\\tau\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\)a\-\\lambda\(e^\{\\top\}z\_\{s\+1\}\)a,whereus=a⊤​zsu\_\{s\}=a^\{\\top\}z\_\{s\}\. With

wout=H−1​aβ,win=H−1​\(e−αβ​a\),w\_\{\\mathrm\{out\}\}=\\frac\{H^\{\-1\}a\}\{\\beta\},\\qquad w\_\{\\mathrm\{in\}\}=H^\{\-1\}\\left\(e\-\\frac\{\\alpha\}\{\\beta\}a\\right\),one has, fors=1,…,N−1s=1,\\ldots,N\-1,

wout⊤​gs\+λ​win⊤​gs\+1=τ​\[2​us−us−1−us\+1\+Υr′​\(us\)\]\.w\_\{\\mathrm\{out\}\}^\{\\top\}g\_\{s\}\+\\lambda w\_\{\\mathrm\{in\}\}^\{\\top\}g\_\{s\+1\}=\\tau\\left\[2u\_\{s\}\-u\_\{s\-1\}\-u\_\{s\+1\}\+\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\\right\]\.

###### Proof\.

Writexs=e⊤​zsx\_\{s\}=e^\{\\top\}z\_\{s\}within this proof\. We first expand the two projections without substituting the parameter identitiesλ=1/α\\lambda=1/\\alpha,τ=1/β\\tau=1/\\beta, andκ=λ2​ρ\\kappa=\\lambda^\{2\}\\rho\. Since

wout=H−1​aβ,wout⊤​H​zs=usβ,wout⊤​e=αβ,wout⊤​a=1,w\_\{\\mathrm\{out\}\}=\\frac\{H^\{\-1\}a\}\{\\beta\},\\qquad w\_\{\\mathrm\{out\}\}^\{\\top\}Hz\_\{s\}=\\frac\{u\_\{s\}\}\{\\beta\},\\qquad w\_\{\\mathrm\{out\}\}^\{\\top\}e=\\frac\{\\alpha\}\{\\beta\},\\qquad w\_\{\\mathrm\{out\}\}^\{\\top\}a=1,we have

wout⊤​gs=usβ−λ​αβ​us−1\+κ​us\+τ​Υr′​\(us\)−λ​xs\+1\.w\_\{\\mathrm\{out\}\}^\{\\top\}g\_\{s\}=\\frac\{u\_\{s\}\}\{\\beta\}\-\\lambda\\frac\{\\alpha\}\{\\beta\}u\_\{s\-1\}\+\\kappa u\_\{s\}\+\\tau\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\-\\lambda x\_\{s\+1\}\.For the second projection, use onlya⊤​win=0a^\{\\top\}w\_\{\\mathrm\{in\}\}=0to remove the terms parallel toaa:

win⊤​gs\+1\\displaystyle w\_\{\\mathrm\{in\}\}^\{\\top\}g\_\{s\+1\}=win⊤​H​zs\+1−λ​us​win⊤​e\+\(κ​us\+1\+τ​Υr′​\(us\+1\)\)​a⊤​win−λ​xs\+2​a⊤​win\\displaystyle=w\_\{\\mathrm\{in\}\}^\{\\top\}Hz\_\{s\+1\}\-\\lambda u\_\{s\}w\_\{\\mathrm\{in\}\}^\{\\top\}e\+\(\\kappa u\_\{s\+1\}\+\\tau\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\+1\}\)\)a^\{\\top\}w\_\{\\mathrm\{in\}\}\-\\lambda x\_\{s\+2\}a^\{\\top\}w\_\{\\mathrm\{in\}\}=xs\+1−αβ​us\+1−λ​\(win⊤​e\)​us\.\\displaystyle=x\_\{s\+1\}\-\\frac\{\\alpha\}\{\\beta\}u\_\{s\+1\}\-\\lambda\(w\_\{\\mathrm\{in\}\}^\{\\top\}e\)u\_\{s\}\.Thusλ​win⊤​gs\+1=λ​xs\+1−λ​αβ​us\+1−λ2​\(win⊤​e\)​us\\lambda w\_\{\\mathrm\{in\}\}^\{\\top\}g\_\{s\+1\}=\\lambda x\_\{s\+1\}\-\\lambda\\frac\{\\alpha\}\{\\beta\}u\_\{s\+1\}\-\\lambda^\{2\}\(w\_\{\\mathrm\{in\}\}^\{\\top\}e\)u\_\{s\}\. Adding the two expansions cancels thexs\+1x\_\{s\+1\}terms and gives

wout⊤​gs\+λ​win⊤​gs\+1=\(1β\+κ−λ2​win⊤​e\)​us−λ​αβ​us−1−λ​αβ​us\+1\+τ​Υr′​\(us\)\.w\_\{\\mathrm\{out\}\}^\{\\top\}g\_\{s\}\+\\lambda w\_\{\\mathrm\{in\}\}^\{\\top\}g\_\{s\+1\}=\\left\(\\frac\{1\}\{\\beta\}\+\\kappa\-\\lambda^\{2\}w\_\{\\mathrm\{in\}\}^\{\\top\}e\\right\)u\_\{s\}\-\\lambda\\frac\{\\alpha\}\{\\beta\}u\_\{s\-1\}\-\\lambda\\frac\{\\alpha\}\{\\beta\}u\_\{s\+1\}\+\\tau\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\.Finally,win⊤​e=ρ−α2βw\_\{\\mathrm\{in\}\}^\{\\top\}e=\\rho\-\\frac\{\\alpha^\{2\}\}\{\\beta\}\. Using the parameter choicesλ=1/α\\lambda=1/\\alpha,τ=1/β\\tau=1/\\beta, andκ=λ2​ρ\\kappa=\\lambda^\{2\}\\rho, the coefficient ofusu\_\{s\}becomes

1β\+λ2​ρ−λ2​\(ρ−α2β\)=2β=2​τ,\\frac\{1\}\{\\beta\}\+\\lambda^\{2\}\\rho\-\\lambda^\{2\}\\left\(\\rho\-\\frac\{\\alpha^\{2\}\}\{\\beta\}\\right\)=\\frac\{2\}\{\\beta\}=2\\tau,whileλ​αβ=1β=τ\\lambda\\frac\{\\alpha\}\{\\beta\}=\\frac\{1\}\{\\beta\}=\\tau\. Therefore

wout⊤​gs\+λ​win⊤​gs\+1=τ​\[2​us−us−1−us\+1\+Υr′​\(us\)\]\.w\_\{\\mathrm\{out\}\}^\{\\top\}g\_\{s\}\+\\lambda w\_\{\\mathrm\{in\}\}^\{\\top\}g\_\{s\+1\}=\\tau\\left\[2u\_\{s\}\-u\_\{s\-1\}\-u\_\{s\+1\}\+\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\)\\right\]\.∎

We now prove Lemma[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)\.

###### Proof of Lemma[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)\.

Writeus:=a⊤​zsu\_\{s\}:=a^\{\\top\}z\_\{s\},u0:=1u\_\{0\}:=1, anduN\+1:=0u\_\{N\+1\}:=0, and letgsg\_\{s\}denote thess\-th block of∇f¯M,N,γ,L,a,r​\(z\)\\nabla\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\(z\)\. Lete,ρ,α,β,λ,τ,κe,\\rho,\\alpha,\\beta,\\lambda,\\tau,\\kappabe the quantities in \([4\.4](https://arxiv.org/html/2606.05438#S4.E4)\) and \([4\.5](https://arxiv.org/html/2606.05438#S4.E5)\) for the present\(M,γ,L,a\)\(M,\\gamma,L,a\), setH:=Hγ,L\(M\)H:=H\_\{\\gamma,L\}^\{\(M\)\}, and define

wout:=H−1​aβ,win:=H−1​\(e−αβ​a\)\.w\_\{\\mathrm\{out\}\}:=\\frac\{H^\{\-1\}a\}\{\\beta\},\\qquad w\_\{\\mathrm\{in\}\}:=H^\{\-1\}\\left\(e\-\\frac\{\\alpha\}\{\\beta\}a\\right\)\.Setrs:=2​us−us−1−us\+1\+Υr′​\(us\)r\_\{s\}:=2u\_\{s\}\-u\_\{s\-1\}\-u\_\{s\+1\}\+\\Upsilon\_\{r\}^\{\\prime\}\(u\_\{s\}\),s=1,…,Ns=1,\\ldots,N\. SinceuN=uN\+1=0u\_\{N\}=u\_\{N\+1\}=0, Proposition[4\.1](https://arxiv.org/html/2606.05438#S4.Thmtheorem1)withμ=1\\mu=1gives‖∇ϕ1​\(u\)‖2≥c0\\\|\\nabla\\phi\_\{1\}\(u\)\\\|\_\{2\}\\geq c\_\{0\}\. The last gradient coordinate ofϕ1\\phi\_\{1\}isuN\+1−uN=0u\_\{N\+1\}\-u\_\{N\}=0, so‖r‖2≥c0\\\|r\\\|\_\{2\}\\geq c\_\{0\}\. By Lemma[A\.1](https://arxiv.org/html/2606.05438#A1.Thmtheorem1), fors=1,…,N−1s=1,\\ldots,N\-1,

τ​rs=wout⊤​gs\+λ​win⊤​gs\+1\.\\tau r\_\{s\}=w\_\{\\mathrm\{out\}\}^\{\\top\}g\_\{s\}\+\\lambda w\_\{\\mathrm\{in\}\}^\{\\top\}g\_\{s\+1\}\.For the last phase,uN=0u\_\{N\}=0,uN\+1=0u\_\{N\+1\}=0, andΥr′​\(0\)=0\\Upsilon\_\{r\}^\{\\prime\}\(0\)=0, so the same projection expansion givesτ​rN=wout⊤​gN\\tau r\_\{N\}=w\_\{\\mathrm\{out\}\}^\{\\top\}g\_\{N\}\. LetA:=‖wout‖\+λ​‖win‖A:=\\\|w\_\{\\mathrm\{out\}\}\\\|\+\\lambda\\\|w\_\{\\mathrm\{in\}\}\\\|\. Then

τ​\|rs\|≤A​\(‖gs‖\+‖gs\+1‖\),s=1,…,N−1,\\tau\|r\_\{s\}\|\\leq A\(\\\|g\_\{s\}\\\|\+\\\|g\_\{s\+1\}\\\|\),\\qquad s=1,\\ldots,N\-1,andτ​\|rN\|≤A​‖gN‖\\tau\|r\_\{N\}\|\\leq A\\\|g\_\{N\}\\\|\. Squaring and summing yields

τ2​‖r‖22≤C​A2​‖∇f¯M,N,γ,L,a,r​\(z\)‖2\\tau^\{2\}\\\|r\\\|\_\{2\}^\{2\}\\leq CA^\{2\}\\left\\\|\\nabla\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\(z\)\\right\\\|^\{2\}for an absolute constantCC\. Since‖r‖2≥c0\\\|r\\\|\_\{2\}\\geq c\_\{0\},

‖∇f¯M,N,γ,L,a,r​\(z\)‖≳τA\.\\left\\\|\\nabla\\bar\{f\}\_\{M,N,\\gamma,L,a,r\}\(z\)\\right\\\|\\gtrsim\\frac\{\\tau\}\{A\}\.Finally, sinceH⪰γ​IH\\succeq\\gamma I,

‖H−1​a‖2=a⊤​H−2​a≤1γ​a⊤​H−1​a=βγ,\\\|H^\{\-1\}a\\\|^\{2\}=a^\{\\top\}H^\{\-2\}a\\leq\\frac\{1\}\{\\gamma\}a^\{\\top\}H^\{\-1\}a=\\frac\{\\beta\}\{\\gamma\},and

‖wout‖=‖H−1​a‖β≤1γ​β,\\\|w\_\{\\mathrm\{out\}\}\\\|=\\frac\{\\\|H^\{\-1\}a\\\|\}\{\\beta\}\\leq\\frac\{1\}\{\\sqrt\{\\gamma\\beta\}\},and

‖win‖2≤1γ​\(e−αβ​a\)⊤​H−1​\(e−αβ​a\)≤ργ\.\\\|w\_\{\\mathrm\{in\}\}\\\|^\{2\}\\leq\\frac\{1\}\{\\gamma\}\\left\(e\-\\frac\{\\alpha\}\{\\beta\}a\\right\)^\{\\top\}H^\{\-1\}\\left\(e\-\\frac\{\\alpha\}\{\\beta\}a\\right\)\\leq\\frac\{\\rho\}\{\\gamma\}\.The last inequality follows by expanding the quadratic form and using the definitions ofα,β,ρ\\alpha,\\beta,\\rho\. Thusλ​‖win‖≤1α​γ​ρ\\lambda\\\|w\_\{\\mathrm\{in\}\}\\\|\\leq\\frac\{1\}\{\\alpha\\sqrt\{\\gamma\}\}\\sqrt\{\\rho\}\. Therefore

τA≳γβ\+β​ρα\.\\frac\{\\tau\}\{A\}\\gtrsim\\frac\{\\sqrt\{\\gamma\}\}\{\\sqrt\{\\beta\}\+\\frac\{\\beta\\sqrt\{\\rho\}\}\{\\alpha\}\}\.This proves the claim\. ∎

## Appendix BProofs of the Main Results

### B\.1Proof of Proposition[5\.1](https://arxiv.org/html/2606.05438#S5.Thmtheorem1)

###### Proof of Proposition[5\.1](https://arxiv.org/html/2606.05438#S5.Thmtheorem1)\.

First considerp∈\{2,3\}p\\in\\\{2,3\\\}\. In this case all terms inΓp​\(r\)\\Gamma\_\{p\}\(r\)are nonincreasing forr≥1r\\geq 1, so the supremum is attained atr=1r=1\. Thus

Γp=min\{L1,L1ℓ1,min2≤q≤p\(Lq​ϵq−1ℓq\)1/q\}\.\\Gamma\_\{p\}=\\min\\left\\\{L\_\{1\},\\,\\frac\{L\_\{1\}\}\{\\ell\_\{1\}\},\\,\\min\_\{2\\leq q\\leq p\}\\left\(\\frac\{L\_\{q\}\\epsilon^\{q\-1\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}\\right\\\}\.For sufficiently smallϵ\\epsilon, theq=pq=pterm is the minimum, proving the claim forp∈\{2,3\}p\\in\\\{2,3\\\}\.

Now supposep≥4p\\geq 4\. ForR\>0R\>0, define

𝔊p\(R\):=min\{L1ℓ1​R2,\(L2ℓ2​R\)1/2,min3≤q≤p\(Lq​Rq−3ℓq\)1/q\}\.\\mathfrak\{G\}\_\{p\}\(R\):=\\min\\left\\\{\\frac\{L\_\{1\}\}\{\\ell\_\{1\}R^\{2\}\},\\,\\left\(\\frac\{L\_\{2\}\}\{\\ell\_\{2\}R\}\\right\)^\{1/2\},\\,\\min\_\{3\\leq q\\leq p\}\\left\(\\frac\{L\_\{q\}R^\{q\-3\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}\\right\\\}\.The function𝔊p​\(R\)\\mathfrak\{G\}\_\{p\}\(R\)is continuous and positive on\(0,∞\)\(0,\\infty\), and it is bounded above by theq=3q=3term\(L3/ℓ3\)1/3\(L\_\{3\}/\\ell\_\{3\}\)^\{1/3\}\. Moreover,𝔊p​\(R\)→0\\mathfrak\{G\}\_\{p\}\(R\)\\to 0asR→∞R\\to\\infty, and𝔊p​\(R\)→0\\mathfrak\{G\}\_\{p\}\(R\)\\to 0asR↓0R\\downarrow 0, because one of the terms withq≥4q\\geq 4is present\. Hence the supremum𝔊p=supR\>0𝔊p​\(R\)\\mathfrak\{G\}\_\{p\}=\\sup\_\{R\>0\}\\mathfrak\{G\}\_\{p\}\(R\)is finite, positive, and attained at someR⋆\>0R\_\{\\star\}\>0\.

Forr≥1r\\geq 1, setR:=r​ϵ1/3R:=r\\epsilon^\{1/3\}\. Equivalently,r=R​ϵ−1/3r=R\\epsilon^\{\-1/3\}, and the constraintr≥1r\\geq 1becomesR≥ϵ1/3R\\geq\\epsilon^\{1/3\}\. With this change of variables, each nonconstant term inΓp​\(r\)\\Gamma\_\{p\}\(r\)has the same factorϵ2/3\\epsilon^\{2/3\}:

L1ℓ1​r2=ϵ2/3​L1ℓ1​R2,\(L2​ϵℓ2​r\)1/2=ϵ2/3​\(L2ℓ2​R\)1/2,\\frac\{L\_\{1\}\}\{\\ell\_\{1\}r^\{2\}\}=\\epsilon^\{2/3\}\\frac\{L\_\{1\}\}\{\\ell\_\{1\}R^\{2\}\},\\qquad\\left\(\\frac\{L\_\{2\}\\epsilon\}\{\\ell\_\{2\}r\}\\right\)^\{1/2\}=\\epsilon^\{2/3\}\\left\(\\frac\{L\_\{2\}\}\{\\ell\_\{2\}R\}\\right\)^\{1/2\},and, forq=3,…,pq=3,\\ldots,p,

\(Lq​ϵq−1​rq−3ℓq\)1/q=ϵ2/3​\(Lq​Rq−3ℓq\)1/q\.\\left\(\\frac\{L\_\{q\}\\epsilon^\{q\-1\}r^\{q\-3\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}=\\epsilon^\{2/3\}\\left\(\\frac\{L\_\{q\}R^\{q\-3\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}\.Therefore

Γp=supR≥ϵ1/3min⁡\{L1,ϵ2/3​𝔊p​\(R\)\}≤ϵ2/3​𝔊p\.\\Gamma\_\{p\}=\\sup\_\{R\\geq\\epsilon^\{1/3\}\}\\min\\left\\\{L\_\{1\},\\,\\epsilon^\{2/3\}\\mathfrak\{G\}\_\{p\}\(R\)\\right\\\}\\leq\\epsilon^\{2/3\}\\mathfrak\{G\}\_\{p\}\.
It remains to prove the matching lower bound\. Takingr⋆=R⋆​ϵ−1/3r\_\{\\star\}=R\_\{\\star\}\\epsilon^\{\-1/3\}, we haver⋆≥1r\_\{\\star\}\\geq 1for sufficiently smallϵ\\epsilon\. Also, for sufficiently smallϵ\\epsilon,ϵ2/3​𝔊p≤L1\\epsilon^\{2/3\}\\mathfrak\{G\}\_\{p\}\\leq L\_\{1\}\. Therefore

Γp≥Γp​\(r⋆\)=min⁡\{L1,ϵ2/3​𝔊p​\(R⋆\)\}=ϵ2/3​𝔊p\\Gamma\_\{p\}\\geq\\Gamma\_\{p\}\(r\_\{\\star\}\)=\\min\\left\\\{L\_\{1\},\\,\\epsilon^\{2/3\}\\mathfrak\{G\}\_\{p\}\(R\_\{\\star\}\)\\right\\\}=\\epsilon^\{2/3\}\\mathfrak\{G\}\_\{p\}for sufficiently smallϵ\\epsilon\. Together with the upper bound, this provesΓp=ϵ2/3​𝔊p\\Gamma\_\{p\}=\\epsilon^\{2/3\}\\mathfrak\{G\}\_\{p\}\. ∎

### B\.2Proof of Lemma[5\.7](https://arxiv.org/html/2606.05438#S5.Thmtheorem7)

###### Lemma B\.1\(Path Green\-kernel estimate\)\.

Let0<h≤10<h\\leq 1, letM∈ℕM\\in\\mathbb\{N\}satisfycM​h−1≤M≤CM​h−1c\_\{M\}h^\{\-1\}\\leq M\\leq C\_\{M\}h^\{\-1\}for fixed positive absolute constantscM,CMc\_\{M\},C\_\{M\}, and let

Ah:=h2​IM\+𝖯M,H:=L1​Ah=γ​IM\+L1​𝖯M,h=γ/L1\.A\_\{h\}:=h^\{2\}I\_\{M\}\+\\mathsf\{P\}\_\{M\},\\qquad H:=L\_\{1\}A\_\{h\}=\\gamma I\_\{M\}\+L\_\{1\}\\mathsf\{P\}\_\{M\},\\qquad h=\\sqrt\{\\gamma/L\_\{1\}\}\.Then there exist positive constantsc,Cc,C, depending only oncM,CMc\_\{M\},C\_\{M\}, such that for alli,j∈\[M\]i,j\\in\[M\],

cL1​h​exp⁡\(−C​h​\|i−j\|\)≤\(H−1\)i​j≤CL1​h​exp⁡\(−c​h​\|i−j\|\)\.\\frac\{c\}\{L\_\{1\}h\}\\exp\(\-Ch\|i\-j\|\)\\leq\(H^\{\-1\}\)\_\{ij\}\\leq\\frac\{C\}\{L\_\{1\}h\}\\exp\(\-ch\|i\-j\|\)\.Equivalently,

\(H−1\)i​j=Θ​\(1γ​L1​exp⁡\(−Θ​\(h​\|i−j\|\)\)\)\.\(H^\{\-1\}\)\_\{ij\}=\\Theta\\\!\\left\(\\frac\{1\}\{\\sqrt\{\\gamma L\_\{1\}\}\}\\exp\(\-\\Theta\(h\|i\-j\|\)\)\\right\)\.

###### Proof\.

The caseM=1M=1is immediate, since thenAh=h2​IA\_\{h\}=h^\{2\}Iand the assumptionM≍h−1M\\asymp h^\{\-1\}forcesh≍1h\\asymp 1\. AssumeM≥2M\\geq 2, and writeA=AhA=A\_\{h\}\. In coordinates,

A=\(1\+h2−1−12\+h2−1⋱⋱⋱−12\+h2−1−11\+h2\)\.A=\\begin\{pmatrix\}1\+h^\{2\}&\-1\\\\ \-1&2\+h^\{2\}&\-1\\\\ &\\ddots&\\ddots&\\ddots\\\\ &&\-1&2\+h^\{2\}&\-1\\\\ &&&\-1&1\+h^\{2\}\\end\{pmatrix\}\.
We use the standard inverse formula for tridiagonal matrices in the following form\. For a tridiagonal matrix with diagonal entriesα1,…,αn\\alpha\_\{1\},\\ldots,\\alpha\_\{n\}and off\-diagonal entries−β1,…,−βn−1\-\\beta\_\{1\},\\ldots,\-\\beta\_\{n\-1\}, defineδ1=α1\\delta\_\{1\}=\\alpha\_\{1\},δk\+1=αk\+1−βk2δk\\delta\_\{k\+1\}=\\alpha\_\{k\+1\}\-\\frac\{\\beta\_\{k\}^\{2\}\}\{\\delta\_\{k\}\},dn=αnd\_\{n\}=\\alpha\_\{n\}, anddk−1=αk−1−βk−12dkd\_\{k\-1\}=\\alpha\_\{k\-1\}\-\\frac\{\\beta\_\{k\-1\}^\{2\}\}\{d\_\{k\}\}\. Then, with

bi=β1​⋯​βi−1d1​⋯​di,ai=βi​⋯​βn−1δi​⋯​δn​bn,b\_\{i\}=\\frac\{\\beta\_\{1\}\\cdots\\beta\_\{i\-1\}\}\{d\_\{1\}\\cdots d\_\{i\}\},\\qquad a\_\{i\}=\\frac\{\\beta\_\{i\}\\cdots\\beta\_\{n\-1\}\}\{\\delta\_\{i\}\\cdots\\delta\_\{n\}\\,b\_\{n\}\},the inverse entries areai∧j​bi∨ja\_\{i\\wedge j\}b\_\{i\\vee j\}\. In our case,

α1=αM=1\+h2,αk=2\+h2\(2≤k≤M−1\),βk=1\.\\alpha\_\{1\}=\\alpha\_\{M\}=1\+h^\{2\},\\qquad\\alpha\_\{k\}=2\+h^\{2\}\\quad\(2\\leq k\\leq M\-1\),\\qquad\\beta\_\{k\}=1\.Thus fori≤ji\\leq j, using the convention that empty products equal11,

\(A−1\)i​j=\(δ1​⋯​δi−1\)​\(dj\+1​⋯​dM\)δ1​⋯​δM\.\(A^\{\-1\}\)\_\{ij\}=\\frac\{\(\\delta\_\{1\}\\cdots\\delta\_\{i\-1\}\)\(d\_\{j\+1\}\\cdots d\_\{M\}\)\}\{\\delta\_\{1\}\\cdots\\delta\_\{M\}\}\.\(\*\)
Letη∈\(0,1\)\\eta\\in\(0,1\)be the smaller root ofs2−\(2\+h2\)​s\+1=0s^\{2\}\-\(2\+h^\{2\}\)s\+1=0, namelyη=2\+h2−h​h2\+42\\eta=\\frac\{2\+h^\{2\}\-h\\sqrt\{h^\{2\}\+4\}\}\{2\}, so thatη\+η−1=2\+h2\\eta\+\\eta^\{\-1\}=2\+h^\{2\}\. SetΔk:=δ1​⋯​δk\\Delta\_\{k\}:=\\delta\_\{1\}\\cdots\\delta\_\{k\}, withΔ0:=1\\Delta\_\{0\}:=1\. For0≤k≤M−10\\leq k\\leq M\-1,

Δk=η−k​1\+η2​k\+11\+η,\\Delta\_\{k\}=\\eta^\{\-k\}\\frac\{1\+\\eta^\{2k\+1\}\}\{1\+\\eta\},and the final endpoint gives

ΔM=η−M​\(1−η\)​\(1−η2​M\)1\+η\.\\Delta\_\{M\}=\\eta^\{\-M\}\\frac\{\(1\-\\eta\)\(1\-\\eta^\{2M\}\)\}\{1\+\\eta\}\.Similarly, by reversing the path,

dj\+1​⋯​dM=η−\(M−j\)​1\+η2​\(M−j\)\+11\+η\.d\_\{j\+1\}\\cdots d\_\{M\}=\\eta^\{\-\(M\-j\)\}\\frac\{1\+\\eta^\{2\(M\-j\)\+1\}\}\{1\+\\eta\}\.Substituting these formulas into\(∗\)\(\*\), and then using symmetry, yields

\(A−1\)i​j=η\|i−j\|\+1​\(1\+η2​\(i∧j\)−1\)​\(1\+η2​\(M−i∨j\)\+1\)\(1−η2\)​\(1−η2​M\)\.\(A^\{\-1\}\)\_\{ij\}=\\frac\{\\eta^\{\|i\-j\|\+1\}\(1\+\\eta^\{2\(i\\wedge j\)\-1\}\)\(1\+\\eta^\{2\(M\-i\\vee j\)\+1\}\)\}\{\(1\-\\eta^\{2\}\)\(1\-\\eta^\{2M\}\)\}\.Since1−η=h2​\(h2\+4−h\)1\-\\eta=\\frac\{h\}\{2\}\\left\(\\sqrt\{h^\{2\}\+4\}\-h\\right\), we have1−η≍h1\-\\eta\\asymp h,1−η2≍h1\-\\eta^\{2\}\\asymp h, and−log⁡η≍h\-\\log\\eta\\asymp hfor0<h≤10<h\\leq 1\. BecauseM≍h−1M\\asymp h^\{\-1\},1−η2​M≍11\-\\eta^\{2M\}\\asymp 1\. The remaining factors are bounded between positive absolute constants, and therefore

c​h−1​e−C​h​\|i−j\|≤\(A−1\)i​j≤C​h−1​e−c​h​\|i−j\|\.ch^\{\-1\}e^\{\-Ch\|i\-j\|\}\\leq\(A^\{\-1\}\)\_\{ij\}\\leq Ch^\{\-1\}e^\{\-ch\|i\-j\|\}\.Finally,H−1=L1−1​Ah−1H^\{\-1\}=L\_\{1\}^\{\-1\}A\_\{h\}^\{\-1\}, so

\(H−1\)i​j≍1L1​h​e−Θ​\(h​\|i−j\|\)=1γ​L1​e−Θ​\(h​\|i−j\|\)\.\(H^\{\-1\}\)\_\{ij\}\\asymp\\frac\{1\}\{L\_\{1\}h\}e^\{\-\\Theta\(h\|i\-j\|\)\}=\\frac\{1\}\{\\sqrt\{\\gamma L\_\{1\}\}\}e^\{\-\\Theta\(h\|i\-j\|\)\}\.∎

We now prove Lemma[5\.7](https://arxiv.org/html/2606.05438#S5.Thmtheorem7)\.

###### Proof of Lemma[5\.7](https://arxiv.org/html/2606.05438#S5.Thmtheorem7)\.

Let the parameters be chosen as in Condition[5\.6](https://arxiv.org/html/2606.05438#S5.Thmtheorem6)and seth:=γ/L1h:=\\sqrt\{\\gamma/L\_\{1\}\},W:=L11/4​ϵ​γ−5/4W:=L\_\{1\}^\{1/4\}\\epsilon\\gamma^\{\-5/4\},j⋆:=⌊M/2⌋j\_\{\\star\}:=\\lfloor M/2\\rfloor, andS:=\{j⋆,…,M\}S:=\\\{j\_\{\\star\},\\ldots,M\\\}\. Thusa=𝟏S/Wa=\\mathbf\{1\}\_\{S\}/W,M≍h−1M\\asymp h^\{\-1\},γ≍Γp\\gamma\\asymp\\Gamma\_\{p\},M≍L11/2​γ−1/2M\\asymp L\_\{1\}^\{1/2\}\\gamma^\{\-1/2\}, andN≍Δ​γ/ϵ2N\\asymp\\Delta\\gamma/\\epsilon^\{2\}\. Lete,ρ,α,βe,\\rho,\\alpha,\\betabe the quantities in \([4\.4](https://arxiv.org/html/2606.05438#S4.E4)\) for the present\(M,γ,L1,a\)\(M,\\gamma,L\_\{1\},a\), and setH:=Hγ,L1\(M\)H:=H\_\{\\gamma,L\_\{1\}\}^\{\(M\)\}\.

#### Calibrated orders and feasibility\.

We first check the quantities that enter Lemmas[4\.5](https://arxiv.org/html/2606.05438#S4.Thmtheorem5)and[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)\. Lemma[B\.1](https://arxiv.org/html/2606.05438#A2.Thmtheorem1)gives

\(H−1\)i​j≍\(γ​L1\)−1/2​exp⁡\(−Θ​\(h​\|i−j\|\)\)\.\(H^\{\-1\}\)\_\{ij\}\\asymp\(\\gamma L\_\{1\}\)^\{\-1/2\}\\exp\(\-\\Theta\(h\|i\-j\|\)\)\.Takingi=j=1i=j=1givesρ≍\(γ​L1\)−1/2\\rho\\asymp\(\\gamma L\_\{1\}\)^\{\-1/2\}\. For the suffix quantities,\|S\|≍h−1\|S\|\\asymp h^\{\-1\}andh​\|j−1\|≍1h\|j\-1\|\\asymp 1forj∈Sj\\in S, so

‖a‖=\|S\|W≍γϵ,\\\|a\\\|=\\frac\{\\sqrt\{\|S\|\}\}\{W\}\\asymp\\frac\{\\gamma\}\{\\epsilon\},and

α=1W​∑j∈S\(H−1\)j​1≍1W​h−1​\(γ​L1\)−1/2=γ1/4L11/4​ϵ\.\\alpha=\\frac\{1\}\{W\}\\sum\_\{j\\in S\}\(H^\{\-1\}\)\_\{j1\}\\asymp\\frac\{1\}\{W\}h^\{\-1\}\(\\gamma L\_\{1\}\)^\{\-1/2\}=\\frac\{\\gamma^\{1/4\}\}\{L\_\{1\}^\{1/4\}\\epsilon\}\.Also, for eachi∈Si\\in S,

∑j∈S\(H−1\)i​j≍h−1​\(γ​L1\)−1/2=γ−1,\\sum\_\{j\\in S\}\(H^\{\-1\}\)\_\{ij\}\\asymp h^\{\-1\}\(\\gamma L\_\{1\}\)^\{\-1/2\}=\\gamma^\{\-1\},and therefore

β=1W2​∑i,j∈S\(H−1\)i​j≍1W2​h−1​γ−1=γϵ2\.\\beta=\\frac\{1\}\{W^\{2\}\}\\sum\_\{i,j\\in S\}\(H^\{\-1\}\)\_\{ij\}\\asymp\\frac\{1\}\{W^\{2\}\}h^\{\-1\}\\gamma^\{\-1\}=\\frac\{\\gamma\}\{\\epsilon^\{2\}\}\.In particular,τ=1β≍ϵ2γ\\tau=\\frac\{1\}\{\\beta\}\\asymp\\frac\{\\epsilon^\{2\}\}\{\\gamma\}andρα2≍ϵ2γ\\frac\{\\rho\}\{\\alpha^\{2\}\}\\asymp\\frac\{\\epsilon^\{2\}\}\{\\gamma\}\. The choicer=r⋆r=r\_\{\\star\}andγ=chp​Γp\\gamma=c\_\{\\rm hp\}\\Gamma\_\{p\}, withchpc\_\{\\rm hp\}sufficiently small, implyℓq​r3−q​γq​ϵ1−q≲Lq\\ell\_\{q\}r^\{3\-q\}\\gamma^\{q\}\\epsilon^\{1\-q\}\\lesssim L\_\{q\},q=1,…,pq=1,\\ldots,p\. Substituting the calibrated orders into Lemma[4\.5](https://arxiv.org/html/2606.05438#S4.Thmtheorem5)gives

Lip⁡\(∇fΔ,L1:p,ϵ\)\\displaystyle\\operatorname\{Lip\}\\\!\\left\(\\nabla f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\\right\)≲L1\+‖a‖α\+\(ρα2\+ℓ1​r2β\)​‖a‖2\\displaystyle\\lesssim L\_\{1\}\+\\frac\{\\\|a\\\|\}\{\\alpha\}\+\\left\(\\frac\{\\rho\}\{\\alpha^\{2\}\}\+\\frac\{\\ell\_\{1\}r^\{2\}\}\{\\beta\}\\right\)\\\|a\\\|^\{2\}≍L1\+L11/4​γ3/4\+γ\+ℓ1​r2​γ≲L1\.\\displaystyle\\asymp L\_\{1\}\+L\_\{1\}^\{1/4\}\\gamma^\{3/4\}\+\\gamma\+\\ell\_\{1\}r^\{2\}\\gamma\\lesssim L\_\{1\}\.Forq=2,…,pq=2,\\ldots,p,

Lip⁡\(Dq​fΔ,L1:p,ϵ\)≲ℓq​r3−q​‖a‖q\+1β≍ℓq​r3−q​γq​ϵ1−q≲Lq\.\\operatorname\{Lip\}\\\!\\left\(D^\{q\}f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\\right\)\\lesssim\\frac\{\\ell\_\{q\}r^\{3\-q\}\\\|a\\\|^\{q\+1\}\}\{\\beta\}\\asymp\\ell\_\{q\}r^\{3\-q\}\\gamma^\{q\}\\epsilon^\{1\-q\}\\lesssim L\_\{q\}\.It remains to ensure that the calibrated initial gap is at mostΔ\\Delta\. The tolerance assumption supplies the needed lower bound onΓp\\Gamma\_\{p\}\. SinceΓp≥Γp​\(1\)\\Gamma\_\{p\}\\geq\\Gamma\_\{p\}\(1\), it is enough to lower bound every term inΓp​\(1\)\\Gamma\_\{p\}\(1\)by a constant multiple ofϵ2/Δ\\epsilon^\{2\}/\\Delta\. Theq=1q=1part of the tolerance assumption givesL1ℓ1≳ϵ2Δ\\frac\{L\_\{1\}\}\{\\ell\_\{1\}\}\\gtrsim\\frac\{\\epsilon^\{2\}\}\{\\Delta\}, which controls theL1/ℓ1L\_\{1\}/\\ell\_\{1\}term inΓp​\(1\)\\Gamma\_\{p\}\(1\)\. It also controls theL1L\_\{1\}term, sinceL1=ℓ1​\(L1/ℓ1\)L\_\{1\}=\\ell\_\{1\}\(L\_\{1\}/\\ell\_\{1\}\)andℓ1\\ell\_\{1\}is a fixed numerical constant:L1≳ϵ2ΔL\_\{1\}\\gtrsim\\frac\{\\epsilon^\{2\}\}\{\\Delta\}\. Forq=2,…,pq=2,\\ldots,p, the same assumption gives

Lqℓq≳ϵq\+1Δq,\\frac\{L\_\{q\}\}\{\\ell\_\{q\}\}\\gtrsim\\frac\{\\epsilon^\{q\+1\}\}\{\\Delta^\{q\}\},and hence

\(Lq​ϵq−1ℓq\)1/q≳ϵ2Δ\.\\left\(\\frac\{L\_\{q\}\\epsilon^\{q\-1\}\}\{\\ell\_\{q\}\}\\right\)^\{1/q\}\\gtrsim\\frac\{\\epsilon^\{2\}\}\{\\Delta\}\.ThusΓp​\(1\)≳ϵ2/Δ\\Gamma\_\{p\}\(1\)\\gtrsim\\epsilon^\{2\}/\\Delta\. After decreasingcϵc\_\{\\epsilon\}by an absolute factor if necessary, this givesϵ2/γ≲Δ\\epsilon^\{2\}/\\gamma\\lesssim\\Delta\. The initial gap bound from the same lemma gives

fΔ,L1:p,ϵ​\(0\)−inffΔ,L1:p,ϵ≲N​τ\+ρα2≲cN​Δ\+ϵ2γ≤Δ,f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\(0\)\-\\inf f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\\lesssim N\\tau\+\\frac\{\\rho\}\{\\alpha^\{2\}\}\\lesssim c\_\{N\}\\Delta\+\\frac\{\\epsilon^\{2\}\}\{\\gamma\}\\leq\\Delta,after choosingcNc\_\{N\}sufficiently small\. ThusfΔ,L1:p,ϵ∈ℱ1:p​\(Δ,L1,…,Lp\)f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\\in\\mathcal\{F\}\_\{1:p\}\(\\Delta,L\_\{1\},\\ldots,L\_\{p\}\), after only absolute\-constant adjustments in the calibration\.

#### Zero\-respecting revelation and complexity\.

It remains to turn the calibrated gradient certificate into an iteration lower bound\. The denominator in Lemma[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)has scaleβ≍γϵ\\sqrt\{\\beta\}\\asymp\\frac\{\\sqrt\{\\gamma\}\}\{\\epsilon\}andβ​ρα≍γϵ\\frac\{\\beta\\sqrt\{\\rho\}\}\{\\alpha\}\\asymp\\frac\{\\sqrt\{\\gamma\}\}\{\\epsilon\}, so whenevera⊤​zN=0a^\{\\top\}z\_\{N\}=0, Lemma[4\.8](https://arxiv.org/html/2606.05438#S4.Thmtheorem8)gives‖∇fΔ,L1:p,ϵ​\(z\)‖≳ϵ\\left\\\|\\nabla f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\(z\)\\right\\\|\\gtrsim\\epsilon\. If the hidden absolute constant is smaller than one, we run the same construction with a fixed larger multiple of the target tolerance; this changes all displayed rates only by absolute factors\. Let𝖠\\mathsf\{A\}be any first\-order zero\-respecting method and letz\(t\)z^\{\(t\)\}be its iterates onfΔ,L1:p,ϵf\_\{\\Delta,L\_\{1:p\},\\epsilon\}\. Sincesupp⁡\(a\)⊆\{j⋆,…,M\}\\operatorname\{supp\}\(a\)\\subseteq\\\{j\_\{\\star\},\\ldots,M\\\}, Lemma[4\.7](https://arxiv.org/html/2606.05438#S4.Thmtheorem7)gives the hiding conditiona⊤​zN\(t\)=0a^\{\\top\}z\_\{N\}^\{\(t\)\}=0for everyt<N​j⋆t<Nj\_\{\\star\}\. Consequently no such iterate isϵ\\epsilon\-stationary\. Sincej⋆≍Mj\_\{\\star\}\\asymp M,

Tϵ​\(𝖠,fΔ,L1:p,ϵ\)≥N​j⋆≍N​M≍Δ​γϵ2​L1γ≍Δ​L11/2​Γp1/2​ϵ−2\.T\_\{\\epsilon\}\\bigl\(\\mathsf\{A\},f\_\{\\Delta,L\_\{1:p\},\\epsilon\}\\bigr\)\\geq Nj\_\{\\star\}\\asymp NM\\asymp\\frac\{\\Delta\\gamma\}\{\\epsilon^\{2\}\}\\sqrt\{\\frac\{L\_\{1\}\}\{\\gamma\}\}\\asymp\\Delta L\_\{1\}^\{1/2\}\\Gamma\_\{p\}^\{1/2\}\\epsilon^\{\-2\}\.∎

## References

- \[1\]D\. Adil, B\. Bullins, A\. Sidford, and C\. Zhang\(2025\)Balancing gradient and hessian queries in non\-convex optimization\.InAdvances in Neural Information Processing Systems,External Links:2510\.20786Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[2\]N\. Agarwal, Z\. Allen\-Zhu, B\. Bullins, E\. Hazan, and T\. Ma\(2017\)Finding approximate local minima faster than gradient descent\.InProceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing,pp\. 1195–1199\.External Links:[Document](https://dx.doi.org/10.1145/3055399.3055464)Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p2.3),[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8)\.
- \[3\]Z\. Allen\-Zhu and Y\. Li\(2018\)NEON2: finding local minima via first\-order oracles\.InAdvances in Neural Information Processing Systems,Vol\.31,pp\. 3716–3726\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[4\]Y\. Arjevani, Y\. Carmon, J\. C\. Duchi, D\. J\. Foster, A\. Sekhari, and K\. Sridharan\(2020\)Second\-order information in non\-convex stochastic optimization: power and limitations\.InProceedings of Thirty Third Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.125,pp\. 242–299\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[5\]Y\. Arjevani, Y\. Carmon, J\. C\. Duchi, D\. J\. Foster, N\. Srebro, and B\. Woodworth\(2023\)Lower bounds for non\-convex stochastic optimization\.Mathematical Programming199\(1\),pp\. 165–214\.External Links:[Document](https://dx.doi.org/10.1007/s10107-022-01822-7)Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\]E\. G\. Birgin, J\. L\. Gardenghi, J\. M\. Martínez, S\. A\. Santos, and P\. L\. Toint\(2017\)Worst\-case evaluation complexity for unconstrained nonlinear optimization using high\-order regularized models\.Mathematical Programming163\(1–2\),pp\. 359–368\.External Links:[Document](https://dx.doi.org/10.1007/s10107-016-1065-8)Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[7\]Y\. Carmon, J\. C\. Duchi, O\. Hinder, and A\. Sidford\(2017\)“convex until proven guilty”: dimension\-free acceleration of gradient descent on non\-convex functions\.InProceedings of the 34th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.70,pp\. 654–663\.Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[Table 1](https://arxiv.org/html/2606.05438#S1.T1.8.6.3.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p3.6),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8),[Remark 5\.5](https://arxiv.org/html/2606.05438#S5.Thmtheorem5.p1.5)\.
- \[8\]Y\. Carmon, J\. C\. Duchi, O\. Hinder, and A\. Sidford\(2018\)Accelerated methods for nonconvex optimization\.SIAM Journal on Optimization28\(2\),pp\. 1751–1772\.External Links:[Document](https://dx.doi.org/10.1137/17M1114296)Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p2.3),[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8)\.
- \[9\]Y\. Carmon, J\. C\. Duchi, O\. Hinder, and A\. Sidford\(2020\)Lower bounds for finding stationary points i\.Mathematical Programming184\(1–2\),pp\. 71–120\.External Links:[Document](https://dx.doi.org/10.1007/s10107-019-01406-y)Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.9.7.1.1.1.2.1),[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5),[§3](https://arxiv.org/html/2606.05438#S3.SS0.SSS0.Px1.p1.10),[§3](https://arxiv.org/html/2606.05438#S3.SS0.SSS0.Px2.p1.5),[§3](https://arxiv.org/html/2606.05438#S3.p1.1)\.
- \[10\]Y\. Carmon, J\. C\. Duchi, O\. Hinder, and A\. Sidford\(2021\)Lower bounds for finding stationary points ii: first\-order methods\.Mathematical Programming185\(1–2\),pp\. 315–355\.External Links:[Document](https://dx.doi.org/10.1007/s10107-019-01431-x)Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.10.8.2.1.1.2.1),[Table 1](https://arxiv.org/html/2606.05438#S1.T1.11.9.3.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p4.7),[2nd item](https://arxiv.org/html/2606.05438#S3.I1.i2.p1.10),[§3](https://arxiv.org/html/2606.05438#S3.SS0.SSS0.Px3.p1.2),[§3](https://arxiv.org/html/2606.05438#S3.SS0.SSS0.Px3.p1.7),[§3](https://arxiv.org/html/2606.05438#S3.SS0.SSS0.Px4.p1.5),[§3](https://arxiv.org/html/2606.05438#S3.p1.1),[§4](https://arxiv.org/html/2606.05438#S4.SS0.SSS0.Px1.p1.2),[§4](https://arxiv.org/html/2606.05438#S4.SS0.SSS0.Px1.p3.7),[§4](https://arxiv.org/html/2606.05438#S4.SS0.SSS0.Px2.p1.7),[Proposition 4\.1](https://arxiv.org/html/2606.05438#S4.Thmtheorem1),[§5](https://arxiv.org/html/2606.05438#S5.3.p1.7)\.
- \[11\]C\. Cartis, N\. I\. M\. Gould, and P\. L\. Toint\(2010\)On the complexity of steepest descent, newton’s and regularized newton’s methods for nonconvex unconstrained optimization problems\.SIAM Journal on Optimization20\(6\),pp\. 2833–2852\.External Links:[Document](https://dx.doi.org/10.1137/090774100)Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[12\]C\. Cartis, N\. I\. M\. Gould, and P\. L\. Toint\(2012\)Complexity bounds for second\-order optimality in unconstrained optimization\.Journal of Complexity28\(1\),pp\. 93–108\.External Links:[Document](https://dx.doi.org/10.1016/j.jco.2011.06.001)Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[13\]C\. Cartis, N\. I\. M\. Gould, and P\. L\. Toint\(2017\)Worst\-case evaluation complexity and optimality of second\-order methods for nonconvex smooth optimization\.arXiv preprint arXiv:1709\.07180\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[14\]A\. R\. Conn, N\. I\. M\. Gould, and P\. L\. Toint\(2000\)Trust region methods\.MPS\-SIAM Series on Optimization,SIAM\.External Links:[Document](https://dx.doi.org/10.1137/1.9780898719857)Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[15\]A\. Cutkosky, H\. Mehta, and F\. Orabona\(2023\)Optimal stochastic non\-smooth non\-convex optimization through online\-to\-non\-convex conversion\.InProceedings of the 40th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.202,pp\. 6643–6670\.Cited by:[§1](https://arxiv.org/html/2606.05438#S1.p3.6),[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[16\]N\. Emmenegger, R\. Kyng, and A\. N\. Zehmakan\(2022\)On the oracle complexity of higher\-order smooth non\-convex finite\-sum optimization\.InProceedings of the 25th International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.151,pp\. 10718–10752\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[17\]C\. Fang, C\. J\. Li, Z\. Lin, and T\. Zhang\(2018\)SPIDER: near\-optimal non\-convex optimization via stochastic path\-integrated differential estimator\.InAdvances in Neural Information Processing Systems,Vol\.31,pp\. 689–699\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[18\]C\. Fang, Z\. Lin, and T\. Zhang\(2019\)Sharp analysis for nonconvex SGD escaping from saddle points\.InProceedings of the 32nd Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.99,pp\. 1192–1234\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[19\]R\. Ge, F\. Huang, C\. Jin, and Y\. Yuan\(2015\)Escaping from saddle points – online stochastic gradient for tensor decomposition\.InProceedings of the 28th Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.40,pp\. 797–842\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[20\]R\. Jiang, A\. Mokhtari, and F\. Patitucci\(2025\)Improved complexity for smooth nonconvex optimization: a two\-level online learning approach with quasi\-newton methods\.InProceedings of the 57th Annual ACM Symposium on Theory of Computing,pp\. 2225–2236\.External Links:[Document](https://dx.doi.org/10.1145/3717823.3718308)Cited by:[§1](https://arxiv.org/html/2606.05438#S1.p3.6)\.
- \[21\]C\. Jin, R\. Ge, P\. Netrapalli, S\. M\. Kakade, and M\. I\. Jordan\(2017\)How to escape saddle points efficiently\.InProceedings of the 34th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.70,pp\. 1724–1732\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[22\]C\. Jin, P\. Netrapalli, R\. Ge, S\. M\. Kakade, and M\. I\. Jordan\(2019\)On nonconvex optimization for machine learning: gradients, stochasticity, and saddle points\.arXiv preprint arXiv:1902\.04811\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[23\]C\. Jin, P\. Netrapalli, and M\. I\. Jordan\(2018\)Accelerated gradient descent escapes saddle points faster than gradient descent\.InProceedings of the 31st Conference on Learning Theory,Proceedings of Machine Learning Research, Vol\.75,pp\. 1042–1085\.Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p2.3),[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8)\.
- \[24\]H\. Li and Z\. Lin\(2023\)Restarted nonconvex accelerated gradient descent: no more polylogarithmic factor in theO​\(ϵ−7/4\)O\(\\epsilon^\{\-7/4\}\)complexity\.Journal of Machine Learning Research24\(157\),pp\. 1–37\.Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p3.6),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8)\.
- \[25\]Z\. Li\(2019\)SSRGD: simple stochastic recursive gradient descent for escaping saddle points\.InAdvances in Neural Information Processing Systems,Vol\.32,pp\. 1523–1533\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[26\]M\. Liu, Z\. Li, X\. Wang, J\. Yi, and T\. Yang\(2018\)Adaptive negative curvature descent with applications in non\-convex optimization\.InAdvances in Neural Information Processing Systems,Vol\.31,pp\. 4853–4862\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[27\]N\. Marumo and A\. Takeda\(2024\)Parameter\-free accelerated gradient descent for nonconvex minimization\.SIAM Journal on Optimization34\(2\),pp\. 2093–2120\.External Links:[Document](https://dx.doi.org/10.1137/22M1540934)Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p3.6),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8)\.
- \[28\]Y\. Nesterov and B\. T\. Polyak\(2006\)Cubic regularization of newton method and its global performance\.Mathematical Programming108\(1\),pp\. 177–205\.External Links:[Document](https://dx.doi.org/10.1007/s10107-006-0706-8)Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px1.p1.5)\.
- \[29\]N\. Tripuraneni, M\. Stern, C\. Jin, J\. Regier, and M\. I\. Jordan\(2018\)Stochastic cubic regularization for fast nonconvex optimization\.InAdvances in Neural Information Processing Systems,Vol\.31,pp\. 2899–2908\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[30\]Y\. Xu, R\. Jin, and T\. Yang\(2017\)NEON\+: accelerated gradient methods for extracting negative curvature for non\-convex optimization\.arXiv preprint arXiv:1712\.01033\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[31\]Y\. Xu, R\. Jin, and T\. Yang\(2018\)First\-order stochastic algorithms for escaping from saddle points in almost linear time\.InAdvances in Neural Information Processing Systems,Vol\.31,pp\. 5530–5540\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[32\]Y\. Yu, P\. Xu, and Q\. Gu\(2018\)Third\-order smoothness helps: faster stochastic optimization algorithms for finding local minima\.InAdvances in Neural Information Processing Systems,Vol\.31,pp\. 4530–4540\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[33\]C\. Zhang and T\. Li\(2021\)Escape saddle points by a simple gradient\-descent based algorithm\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 8545–8556\.External Links:[Link](https://openreview.net/forum?id=lEf52hTHq0Q)Cited by:[Table 1](https://arxiv.org/html/2606.05438#S1.T1.7.5.2.1.1.2.1),[§1](https://arxiv.org/html/2606.05438#S1.p2.3),[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1),[Remark 5\.4](https://arxiv.org/html/2606.05438#S5.Thmtheorem4.p1.8)\.
- \[34\]D\. Zhou and Q\. Gu\(2020\)Stochastic recursive variance\-reduced cubic regularization methods\.InProceedings of the 23rd International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.108,pp\. 3980–3990\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[35\]D\. Zhou and Q\. Gu\(2022\)Dimension\-free complexity bounds for high\-order nonconvex finite\-sum optimization\.InProceedings of the 39th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.162,pp\. 27143–27158\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[36\]D\. Zhou, P\. Xu, and Q\. Gu\(2018\)Finding local minima via stochastic nested variance reduction\.arXiv preprint arXiv:1806\.08782\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[37\]D\. Zhou, P\. Xu, and Q\. Gu\(2018\)Stochastic variance\-reduced cubic regularized newton methods\.InProceedings of the 35th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.80,pp\. 5990–5999\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.
- \[38\]D\. Zhou, P\. Xu, and Q\. Gu\(2020\)Stochastic nested variance reduction for nonconvex optimization\.Journal of Machine Learning Research21\(103\),pp\. 1–63\.Cited by:[§2](https://arxiv.org/html/2606.05438#S2.SS0.SSS0.Px2.p1.1)\.

Similar Articles

Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

arXiv cs.LG

This paper generalizes non-uniform smoothness assumptions to objectives whose curvature is affine in the objective value, proving convergence rates for steepest descent and diagonal variants of RMSProp and Adam, with applications to logistic regression and neural networks.