IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback

arXiv cs.LG Papers

Summary

This paper identifies 'staleness amplification' in bilevel optimization under delayed feedback and proposes IGT-OMD, which uses Implicit Gradient Transport to achieve sublinear regret and improve decision loss on benchmarks like Warcraft shortest-path and LQR.

arXiv:2605.12693v1 Announce Type: new Abstract: Decision-focused learning trains predictive models end-to-end against downstream decision loss, but online settings suffer delayed feedback: outcomes may not arrive for many environment interactions. We identify *staleness amplification*, a failure mode unique to bilevel optimization under delay, in which gradient staleness couples with inner-solver sensitivity to inflate regret beyond single-level delay theory. We prove that any black-box delayed optimizer incurs an irreducible regret cost from inner-solver approximation error, and that gradient staleness contributes a quadratically growing transport error without bilevel-aware correction. Our algorithm, IGT-OMD, applies Implicit Gradient Transport to hypergradients within Online Mirror Descent, re-evaluating stale gradients at the current parameters using stored inner solutions. This method reduces transport error from a quadratic to a linear dependence on delay and achieves the first sublinear regret bound for delayed bilevel optimization with queue-length-adaptive step sizes. Controlled experiments provide a *mechanistic fingerprint*: transport benefit is exactly $0.0\%$ ($p=1.00$) at unit delay and grows monotonically to $9.5\%$ at fifty rounds ($p<0.001$), isolating the correction's effect. On Linear Quadratic Regulator, Warcraft shortest-path, and Sinkhorn optimal transport, IGT-OMD reduces decision loss by $17$–$55\%$ relative to single-level baselines, with phase transitions matching the theory.

Cached at: 05/14/26, 06:17 AM

# IGT-OMD: Implicit Gradient Transport for Decision-Focused Learning under Delayed Feedback
Source: [https://arxiv.org/html/2605.12693](https://arxiv.org/html/2605.12693)
Benjamin Amoh Geoffrey Parker Wesley Marrero Thayer School of Engineering, Dartmouth College benjamin\.k\.amoh\.th@dartmouth\.edu

###### Abstract

Decision-focused learning trains predictive models end-to-end against downstream decision loss, but online settings suffer delayed feedback: outcomes may not arrive for many environment interactions. We identify *staleness amplification*, a failure mode unique to bilevel optimization under delay, in which gradient staleness couples with inner-solver sensitivity to inflate regret beyond single-level delay theory. We prove that any black-box delayed optimizer incurs an irreducible regret cost from inner-solver approximation error, and that gradient staleness contributes a quadratically growing transport error without bilevel-aware correction. Our algorithm, IGT-OMD, applies Implicit Gradient Transport to hypergradients within Online Mirror Descent, re-evaluating stale gradients at the current parameters using stored inner solutions. This method reduces transport error from a quadratic to a linear dependence on delay and achieves the first sublinear regret bound for delayed bilevel optimization with queue-length-adaptive step sizes. Controlled experiments provide a *mechanistic fingerprint*: transport benefit is exactly $0.0\%$ (by construction) at unit delay and grows monotonically to $9.5\%$ at fifty rounds ($p<0.001$), isolating the correction's effect. On Warcraft shortest-path, IGT-OMD reduces the decision-loss optimality gap by $15$–$36\%$ over D-FTRL/2-Stage baselines; on the Linear Quadratic Regulator benchmark it maintains a constant maximum stable learning rate across delays where bilevel-unaware methods degrade by up to $9.3\times$.

## 1 Introduction

In many operational settings (from inventory management to energy dispatch to personalized medicine), a predictive model feeds its forecast into a downstream optimization solver whose solution is then deployed. [Decision-Focused Learning (DFL)](https://arxiv.org/html/2605.12693#glo.acronym.dfl) trains such models end-to-end against a *decision loss* rather than a surrogate prediction error \[[8](https://arxiv.org/html/2605.12693#bib.bib2),[34](https://arxiv.org/html/2605.12693#bib.bib1),[20](https://arxiv.org/html/2605.12693#bib.bib39)\]. The resulting training problem is inherently bilevel: the *outer level* adjusts the predictor parameters $\theta$ (e.g., neural-network weights), while the *inner level* solves for an optimal decision $w^{*}(\theta)$ (e.g., a control policy or a transport plan).

In online deployments, feedback is often delayed. A decision made at environment interaction (i.e., round) $t$ yields observable outcome feedback only at round $t+d_t$, where the delay $d_t$ may vary. At any given round, several past decisions may still be awaiting feedback; we call this set the *queue*, denote its size (the *queue length*) by $\sigma_t$, and write $\sigma_{\max}$ for its worst-case value. Single-level delayed optimizers such as Delayed Follow-the-Regularized-Leader (D-FTRL) \[[15](https://arxiv.org/html/2605.12693#bib.bib29),[29](https://arxiv.org/html/2605.12693#bib.bib40)\] and Robust [Online Mirror Descent (OMD)](https://arxiv.org/html/2605.12693#glo.acronym.omd) \[[27](https://arxiv.org/html/2605.12693#bib.bib30)\] handle this challenge by correcting for the outer-parameter drift $\|\theta_t-\theta_{t-d_t}\|$.

Staleness amplification: why bilevel delay is harder. In single-level [Online Convex Optimization (OCO)](https://arxiv.org/html/2605.12693#glo.acronym.oco), a stale gradient is a bounded perturbation independent of optimizer state. The bilevel setting is structurally different: the hypergradient of $F(\theta)=\mathcal{L}_{\text{true}}(w^{*}(\theta);\theta)$ depends on the inner minimizer $w^{*}(\theta)$, which drifts as $\theta$ changes. Feedback generated by an inner solution that has since drifted introduces a gradient error proportional to the accumulated outer-parameter change; this is *staleness amplification*. It has two structural effects, formalized in Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2): an irreducible inner-solver floor on regret, and a per-round transport error that is quadratic in queue length without bilevel-aware correction.

IGT-OMD: how we fix it. The key idea of our algorithm is to *re-evaluate* stale hypergradients at the current parameters rather than using them as-is. When feedback from a past round $s$ arrives, our algorithm uses the stored inner solution $w_s$ and adjoint vector $v_s^{*}$ to cheaply recompute the hypergradient at the current $\theta_t$ without re-solving the inner problem. Accumulating these corrections via [Implicit Gradient Transport (IGT)](https://arxiv.org/html/2605.12693#glo.acronym.igt) \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\] replaces the squared total drift $\|\theta_t-\theta_{t-d_t}\|^{2}$ with a sum of squared per-step changes $\sum_{s}\|\theta_{s+1}-\theta_{s}\|^{2}$, which is up to a factor $\sigma_{\max}$ smaller. Embedding this corrected gradient in [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) with a step size that tracks the running queue envelope, motivated by [Delay Differential Equation (DDE)](https://arxiv.org/html/2605.12693#glo.acronym.dde) stability analysis \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\], yields our algorithm, [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd).

Contributions. (1) *Algorithm.* [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) corrects stale hypergradients at $O(p)$ cost per outstanding round without re-solving inner problems (Algorithm [1](https://arxiv.org/html/2605.12693#alg1)). (2) *Analysis.* We prove the first bilevel-delayed regret bound $O(\sqrt{T\sigma_{\max}}+T\epsilon_{\mathrm{inner}}^{2})$ (Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1)), a staleness amplification lower bound showing $\Omega(T\epsilon_{\mathrm{inner}}^{2})$ is tight (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)), an inner-loop apathy decomposition decoupling staleness from inner-solver quality (Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)), and [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) stability analysis (Propositions [1](https://arxiv.org/html/2605.12693#Thmproposition1)–[2](https://arxiv.org/html/2605.12693#Thmproposition2)). (3) *Experiments.* On LQR, [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) maintains constant stability across all delays; on Warcraft, it reduces the optimality gap by $15$–$36\%$ over D-FTRL/2-Stage baselines. A controlled optimizer experiment confirms the predicted mechanistic signature with $9.5\%$ improvement at $d=50$ ($p<0.001$).

### 1.1 Related Work

Table [1](https://arxiv.org/html/2605.12693#S1.T1) positions [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) against relevant prior work. Previous [DFL](https://arxiv.org/html/2605.12693#glo.acronym.dfl) methods \[[8](https://arxiv.org/html/2605.12693#bib.bib2),[33](https://arxiv.org/html/2605.12693#bib.bib20),[34](https://arxiv.org/html/2605.12693#bib.bib1),[31](https://arxiv.org/html/2605.12693#bib.bib4),[6](https://arxiv.org/html/2605.12693#bib.bib3)\] solve the bilevel training problem but assume immediate feedback. Conceptually, DFL is a task-driven subclass of bilevel optimization: the inner problem is the downstream decision layer, and the outer model is trained on realized decision loss. Under this lens, we view \[[23](https://arxiv.org/html/2605.12693#bib.bib12)\] as closer to DFL-style prior work than general-purpose bilevel optimization. Single-level delayed optimizers \[[15](https://arxiv.org/html/2605.12693#bib.bib29),[27](https://arxiv.org/html/2605.12693#bib.bib30),[9](https://arxiv.org/html/2605.12693#bib.bib7),[12](https://arxiv.org/html/2605.12693#bib.bib8),[28](https://arxiv.org/html/2605.12693#bib.bib6),[26](https://arxiv.org/html/2605.12693#bib.bib5)\] correct for outer-parameter drift but treat the objective as a black-box function, making them vulnerable to staleness amplification when applied to bilevel problems (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)). Bilevel optimization methods \[[10](https://arxiv.org/html/2605.12693#bib.bib13),[14](https://arxiv.org/html/2605.12693#bib.bib14),[32](https://arxiv.org/html/2605.12693#bib.bib9),[19](https://arxiv.org/html/2605.12693#bib.bib10),[21](https://arxiv.org/html/2605.12693#bib.bib11)\] handle the nested structure but assume synchronous updates.
Gradient transport \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\] was originally developed for single-level reinforcement learning; [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) stability analysis \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\] was applied to distributed stochastic gradient descent. No prior method handles both bilevel structure and delayed feedback. While some single-level delayed methods employ *delay-adaptive* step sizes (scaled by round or cumulative delay), their schedules do not account for inner-solver sensitivity, a factor unique to bilevel objectives. Our algorithm fills this gap by extending gradient transport from single-level policy gradients to bilevel hypergradients and embedding the result in [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) with *queue-length-adaptive* step sizes calibrated to the bilevel structure.

Table 1: Comparison with related work. [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) is the first bilevel delayed optimizer.

| Category | Method | Delay | Bilevel | Adaptive | Open Gap |
| --- | --- | --- | --- | --- | --- |
| Task-driven bilevel | SPO+ \[[8](https://arxiv.org/html/2605.12693#bib.bib2)\] | ✗ | ✓ | ✗ | No delays |
| | Online [DFL](https://arxiv.org/html/2605.12693#glo.acronym.dfl) \[[6](https://arxiv.org/html/2605.12693#bib.bib3)\] | ✗ | ✓ | ✓ | No delays |
| | Control-oriented MBRL \[[23](https://arxiv.org/html/2605.12693#bib.bib12)\] | ✗ | ✓ | ✗ | No delays |
| Single-level delayed OCO | D-FTRL \[[15](https://arxiv.org/html/2605.12693#bib.bib29)\] | ✓ | ✗ | ✓ | Staleness amplification |
| | Robust [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) \[[27](https://arxiv.org/html/2605.12693#bib.bib30)\] | ✓ | ✗ | ✓ | No bilevel |
| | DORM+ \[[9](https://arxiv.org/html/2605.12693#bib.bib7)\] | ✓ | ✗ | ✓ | Single-level |
| General bilevel (no delays) | Online bilevel \[[14](https://arxiv.org/html/2605.12693#bib.bib14)\] | ✗ | ✓ | ✗ | No delays |
| | BSA \[[10](https://arxiv.org/html/2605.12693#bib.bib13)\] | ✗ | ✓ | ✗ | Offline |
| Grad. transport | [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\] | ✓ | ✗ | ✗ | Single-level RL |
| [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) analysis | Differential Delay Analysis \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\] | ✓ | ✗ | ✗ | Distributed SGD |
| Ours | IGT-OMD | ✓ | ✓ | ✓ | None |

## 2 Problem Formulation

We now formalize the online bilevel optimization problem under delayed feedback. Our notation is summarized in the notation table in Appendix [A](https://arxiv.org/html/2605.12693#A1).

### 2.1 Online Bilevel Optimization with Delayed Feedback

Consider a learner that adjusts a predictive model each round, but outcome feedback arrives only after several rounds\. The gradient signal for each update is therefore based on parameters from the past—the delayed bilevel feedback loop we formalize below\.

At each interaction with the environment (i.e., round) $t=1,\ldots,T$, the learner holds outer-level predictor parameters $\theta_t\in\Theta\subseteq\mathbb{R}^{p}$ and approximately solves the inner problem $w^{*}(\theta_t)=\operatorname{argmin}_{w\in\mathcal{W}\subseteq\mathbb{R}^{q}}\mathcal{L}_{\text{model}}(w;\theta_t)$, where $\mathcal{L}_{\text{model}}$ is the model-based decision objective, $p$ is the predictor dimension, and $q$ is the decision dimension. The approximate solution $w_t$ is obtained by running $K\in\mathbb{N}_{>0}$ gradient descent steps from the previous solution $w_{t-1}$, yielding inner-solver error $\|w_t-w^{*}(\theta_t)\|\leq\epsilon_{\mathrm{inner}}$. The learner executes $w_t$, and after $d_t\geq 0$ rounds the environment reveals an outcome object $z_t$ that determines the realized loss. At round $t$, the newly arrived feedback is $A_t=\{s:s+d_s=t\}$, the set of arrived observations is $\mathcal{O}_t=\{(s,z_s):s+d_s\leq t\}$, and the queue of rounds still awaiting feedback is $Q_t=\{s\leq t:s+d_s>t\}$, with queue length $\sigma_t=|Q_t|$ and envelope $\bar{\sigma}_t=\max_{r\leq t}\sigma_r$. Using only arrived feedback, the learner updates $\theta_{t+1}$. The performance measure is *decision regret*:

$$\mathrm{Regret}_T^{\mathrm{dec}}=\sum_{t=1}^{T}\mathcal{L}_{\text{true}}(w_t;\theta_t)-\min_{\theta\in\Theta}\sum_{t=1}^{T}\mathcal{L}_{\text{true}}(w^{*}(\theta);\theta)$$

which compares the learner's cumulative decision loss to that of the best fixed predictor in hindsight.
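The queue bookkeeping defined in this subsection ($A_t$, $Q_t$, $\sigma_t$, $\bar{\sigma}_t$) can be replayed for any delay sequence. The following is a minimal sketch, not the authors' code: the function name `queue_stats` and the list-based input format are assumptions made for illustration.

```python
def queue_stats(delays):
    """Replay the feedback queue for a delay sequence.

    delays[i] is the delay d_t of round t = i + 1 (hypothetical input format).
    Returns one tuple (A_t, sigma_t, envelope_t) per round.
    """
    pending = set()   # Q_t: rounds whose feedback has not arrived yet
    envelope = 0      # running maximum of the queue length
    stats = []
    for t in range(1, len(delays) + 1):
        pending.add(t)
        # A_t = {s : s + d_s = t}; a zero delay means same-round feedback
        arrived = sorted(s for s in pending if s + delays[s - 1] == t)
        pending.difference_update(arrived)
        sigma = len(pending)              # sigma_t = |Q_t|
        envelope = max(envelope, sigma)   # envelope_t = max_{r <= t} sigma_r
        stats.append((arrived, sigma, envelope))
    return stats


stats = queue_stats([3, 3, 3, 0, 1, 0])
for t, (arrived, sigma, envelope) in enumerate(stats, start=1):
    print(t, arrived, sigma, envelope)
```

In this example the queue grows to length 3 while the first three rounds wait, and the envelope stays at 3 even after the queue drains, which is the quantity the step size later depends on.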

### 2.2 Hypergradient Computation via Implicit Differentiation

The hypergradient (the derivative of the decision loss with respect to the predictor parameters) decomposes via the Implicit Function Theorem \[[7](https://arxiv.org/html/2605.12693#bib.bib18)\] into an explicit (direct) and an implicit (through the inner solver) component. To avoid the $O(q^{3})$ cost of a full Jacobian inversion, we follow \[[23](https://arxiv.org/html/2605.12693#bib.bib12)\] and introduce an adjoint vector $v^{*}\in\mathbb{R}^{q}$ that satisfies the linear system $H_w\,v^{*}=\nabla_w\mathcal{L}_{\text{true}}(w^{*};\theta)$, where $H_w:=\nabla^{2}_{ww}\mathcal{L}_{\text{model}}(w^{*};\theta)$ is the Hessian of the inner objective. The adjoint is computed approximately via Conjugate Gradient \[[11](https://arxiv.org/html/2605.12693#bib.bib24)\]. The hypergradient is then:

$$g_t(\theta)=\nabla_\theta\mathcal{L}_{\text{true}}\big|_{w\,\text{fixed}}-\bigl[\nabla_\theta\nabla_w\mathcal{L}_{\text{model}}(w^{*};\theta)\bigr]^{\top}v^{*},\tag{1}$$

where the first term captures the direct effect of $\theta$ on the loss and the second captures the indirect effect through how the inner solver's decision changes with $\theta$.
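To make the adjoint formula (1) concrete, the sketch below evaluates it on a small quadratic bilevel instance and compares the result with central finite differences of $F(\theta)$. The instance is an assumption chosen for illustration, not one of the paper's benchmarks: $\mathcal{L}_{\text{model}}(w;\theta)=\tfrac12 w^\top H w-w^\top B\theta$ and $\mathcal{L}_{\text{true}}(w;\theta)=\tfrac12\|w-z\|^2+\tfrac{c}{2}\|\theta\|^2$. A dense solve stands in for the Conjugate Gradient solver mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 3, 4
M = rng.standard_normal((q, q))
H = M @ M.T + q * np.eye(q)        # inner Hessian H_w (symmetric positive definite)
B = rng.standard_normal((q, p))    # couples theta into the inner objective
z = rng.standard_normal(q)         # observed outcome (toy data)
c = 0.1                            # weight of the direct theta term in L_true


def inner_solution(theta):
    # w*(theta) = argmin_w 0.5 w^T H w - w^T B theta  =>  H w* = B theta
    return np.linalg.solve(H, B @ theta)


def outer_loss(theta):
    w = inner_solution(theta)
    return 0.5 * np.sum((w - z) ** 2) + 0.5 * c * np.sum(theta ** 2)


def hypergradient(theta):
    w = inner_solution(theta)
    v = np.linalg.solve(H, w - z)  # adjoint: H_w v* = grad_w L_true
    direct = c * theta             # grad_theta L_true with w held fixed
    cross = -B                     # Jacobian of grad_w L_model w.r.t. theta
    return direct - cross.T @ v    # Eq. (1)


def finite_difference(theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (outer_loss(theta + e) - outer_loss(theta - e)) / (2 * eps)
    return grad


theta = rng.standard_normal(p)
print(hypergradient(theta))
print(finite_difference(theta))
```

The two printed vectors should agree to numerical precision, because the adjoint solve replaces the explicit product with $H_w^{-1}$ without changing the result.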

Re-evaluation under delay. When feedback from a past round $s$ arrives at round $t>s$, the algorithm can re-evaluate the hypergradient at the current parameters $\theta_t$ using the stored inner solution $w_s$, observed feedback $z_s$, and adjoint $v_s^{*}$ at $O(p\,q)$ cost, without re-solving the inner problem:

$$g_s(\theta_t)=\nabla_\theta\mathcal{L}_{\text{true}}(w_s;\theta_t)\big|_{w\,\text{fixed}}-\bigl[\nabla_\theta\nabla_w\mathcal{L}_{\text{model}}(w_s;\theta_t)\bigr]^{\top}v_s^{*}.\tag{2}$$

This is an approximation: the cached adjoint was computed at $\theta_s$, so the re-evaluation holds $(w_s,v_s^{*})$ fixed and updates only the explicit $\theta_t$-dependence. The resulting residual is controlled in Appendix [E.1](https://arxiv.org/html/2605.12693#A5.SS1).

## 3 IGT-OMD Algorithm

[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) assembles three pieces into a single procedure that costs $O(p\,q)$ per outstanding round: hypergradient transport (IGT), mirror-descent updates (OMD), and a queue-length-adaptive step size.

### 3.1 Building Blocks

Implicit Gradient Transport. [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\] is a mechanism to re-evaluate a stale gradient via a telescoping sum. Given a gradient $g_s(\theta_s)$ computed at a past iterate, the gradient estimate at the current $\theta_t$ is $g_s^{\mathrm{IGT}}(\theta_t)=g_s(\theta_s)+\sum_{k=s}^{t-1}\bigl[g_s(\theta_{k+1})-g_s(\theta_k)\bigr]$. This estimate accumulates the one-step gradient changes along the trajectory from $\theta_s$ to $\theta_t$. For the frozen surrogate $g_s(\cdot)$, the telescope is exact; under $L$-smoothness, the transported path variation satisfies $\sum_{k=s}^{t-1}\bigl\|g_s(\theta_{k+1})-g_s(\theta_k)\bigr\|^{2}\leq L^{2}\sum_{k=s}^{t-1}\|\theta_{k+1}-\theta_k\|^{2}$, a *sum of squared per-step changes*. By contrast, the naïve stale-gradient surrogate uses $L^{2}\|\theta_t-\theta_s\|^{2}$ (squared total displacement), which by the Cauchy–Schwarz inequality \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] is up to $\sigma_t$ times larger. See Appendix [B](https://arxiv.org/html/2605.12693#A2) for details.
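The gap between squared total displacement and the sum of squared per-step changes can be checked numerically. The sketch below is illustrative only (the helper `drift_ratio` is a name introduced here): it computes the ratio $\|\sum_k\Delta_k\|^2/\sum_k\|\Delta_k\|^2$ for a window of $n$ steps, which Cauchy–Schwarz bounds by $n$.

```python
import numpy as np


def drift_ratio(steps):
    """steps[k] = theta_{k+1} - theta_k over the delay window (shape n x p)."""
    total = steps.sum(axis=0)                      # theta_t - theta_s
    squared_total = float(total @ total)           # squared total displacement
    sum_squared_steps = float((steps ** 2).sum())  # sum of squared per-step changes
    return squared_total / sum_squared_steps


n, p = 5, 3
aligned = np.ones((n, p))              # every step identical: the worst case
rng = np.random.default_rng(0)
random_steps = rng.standard_normal((n, p))

print(drift_ratio(aligned))            # 5.0, i.e. exactly n
print(drift_ratio(random_steps))       # some value no larger than n
```

Identical steps attain the factor $n$ exactly, which corresponds to an optimizer drifting steadily in one direction while feedback is outstanding; steps that partly cancel give a smaller ratio.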

Online Mirror Descent. [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) \[[22](https://arxiv.org/html/2605.12693#bib.bib22)\] generalizes projected gradient descent via a Bregman divergence \[[4](https://arxiv.org/html/2605.12693#bib.bib27)\] $D_\psi(\theta,\theta')=\psi(\theta)-\psi(\theta')-\langle\nabla\psi(\theta'),\theta-\theta'\rangle$ generated by a strongly convex mirror map $\psi$:

$$\theta_{t+1}=\operatorname{argmin}_{\theta\in\Theta}\;\bigl\langle g_t,\theta\bigr\rangle+\frac{1}{\eta_t}D_\psi(\theta,\theta_t),\tag{3}$$

where $\eta_t$ is a step size. When $\psi(\theta)=\tfrac{1}{2}\|\theta\|^{2}$ (the squared Euclidean norm), the update recovers projected gradient descent. This method accommodates non-Euclidean geometries and admits $O(\sqrt{T})$ regret bounds extending to the delayed setting \[[15](https://arxiv.org/html/2605.12693#bib.bib29),[27](https://arxiv.org/html/2605.12693#bib.bib30)\]. See Appendix [B](https://arxiv.org/html/2605.12693#A2) for details.
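Two closed-form instances of update (3) may help fix ideas. The sketch below assumes a box-constrained $\Theta$ with the Euclidean mirror map (gradient step plus clipping) and, as a non-Euclidean illustration that is not taken from the paper, the negative-entropy mirror map on the probability simplex (exponentiated gradient). The objective helpers are included so each step can be compared against other feasible points.

```python
import numpy as np


def objective_euclidean(theta, theta_t, g, eta):
    # <g, theta> + (1/eta) * D_psi with psi = 0.5 * ||.||^2
    return g @ theta + np.sum((theta - theta_t) ** 2) / (2 * eta)


def omd_step_box(theta_t, g, eta, lo, hi):
    """Euclidean mirror map on a box: gradient step followed by clipping."""
    return np.clip(theta_t - eta * g, lo, hi)


def objective_entropy(theta, theta_t, g, eta):
    # On the simplex, D_psi for negative entropy is the KL divergence
    return g @ theta + np.sum(theta * np.log(theta / theta_t)) / eta


def omd_step_simplex(theta_t, g, eta):
    """Negative-entropy mirror map on the simplex: exponentiated gradient."""
    unnormalized = theta_t * np.exp(-eta * g)
    return unnormalized / unnormalized.sum()


theta_t = np.array([0.2, 0.3, 0.5])
g = np.array([1.0, -0.5, 0.2])
eta = 0.5

box_step = omd_step_box(theta_t, g, eta, 0.0, 0.4)
simplex_step = omd_step_simplex(theta_t, g, eta)
print(box_step)
print(simplex_step, simplex_step.sum())
```

In both cases the returned point minimizes the corresponding objective over its feasible set, so no other feasible point achieves a lower value; the simplex step also shifts mass away from the coordinate with the largest gradient.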

### 3.2 IGT-OMD’s Per-Round Procedure

Our algorithm maintains a replay buffer $\mathcal{B}$ that stores, for each arrived round $s$ selected for transport, the inner solution $w_s$, the adjoint vector $v_s^{*}$, and the most recently transported hypergradient $g_s(\theta_{t-1})$. Algorithm [1](https://arxiv.org/html/2605.12693#alg1) details our procedure. Each round $t$ proceeds in three phases.

Phase 1: Solve and execute (lines 4–9). The learner runs $K$ warm-started gradient-descent steps on $\mathcal{L}_{\text{model}}(\cdot;\theta_t)$ from $w_{t-1}$ to obtain $w_t\approx w^{*}(\theta_t)$ with $\|w_t-w^{*}(\theta_t)\|\leq\epsilon_{\mathrm{inner}}$, then executes $w_t$. No true-loss hypergradient for this decision is computed until its feedback arrives.

Phase 2: Transport (lines 11–18). When feedback arrives for rounds $s\in A_t$, the algorithm solves the corresponding adjoints, forms the arrival gradient, and re-evaluates earlier arrived gradients in the transport buffer. Aggregating the arrival term with these one-step corrections yields the IGT-corrected hypergradient:

$$g_t^{\mathrm{IGT}}=g_{A_t}(\theta_t)+\sum_{s\in\mathcal{B}_t}\bigl[g_s(\theta_t)-g_s(\theta_{t-1})\bigr],\tag{4}$$

which is a causal telescoping update driven only by arrived feedback. Crucially, transport reuses the stored $(w_s,v_s^{*})$ and the inner problem is *not* re-solved during transport.

Phase 3: Adaptive update (lines 10 and 20). The learner takes an OMD step ([3](https://arxiv.org/html/2605.12693#S3.E3)) with the corrected gradient and a queue-envelope-adaptive step size $\eta_t=\eta_0/\sqrt{1+\beta\,\bar{\sigma}_t}$, where $\eta_0>0$ is the base learning rate and $\beta=\|L_{\mathrm{IGT}}\|/\lambda_{\min}(H_F)$ is the delay-sensitivity ratio (set to $1.0$ when unknown). When the envelope is short, $\eta_t\approx\eta_0$ recovers the non-delayed rate; when long, $\eta_t$ shrinks as $1/\sqrt{\bar{\sigma}_t}$.
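The step-size rule in Phase 3 is a one-line formula; the sketch below simply tabulates it for a few envelope values. The function name `step_size` and the chosen values are illustrative assumptions, with $\beta=1.0$ as the default the text gives for the unknown case.

```python
import math


def step_size(eta0, envelope, beta=1.0):
    """Queue-envelope-adaptive step size eta_t = eta0 / sqrt(1 + beta * envelope)."""
    return eta0 / math.sqrt(1.0 + beta * envelope)


for envelope in (0, 3, 8, 99):
    print(envelope, step_size(0.1, envelope))
```

With $\eta_0=0.1$ the rate stays at $0.1$ for an empty queue, halves once the envelope reaches 3, and falls to $0.01$ at an envelope of 99, matching the $1/\sqrt{\bar{\sigma}_t}$ decay described above.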

Algorithm 1 IGT-OMD: Implicit Gradient Transport for Bilevel Delayed Optimization

1: Input: base step size $\eta_0$, damping $\beta$, inner steps $K$, inner step size $\eta_w$; loss functions $\mathcal{L}_{\text{model}}:\Theta\times\mathcal{W}\to\mathbb{R}$ (inner), $\mathcal{L}_{\text{true}}:\Theta\times\mathcal{W}\to\mathbb{R}$ (outer)

2: Initialize: $\theta_1\in\Theta$, $w_0\in\mathcal{W}$, buffer $\mathcal{B}\leftarrow\emptyset$, queue $Q_0\leftarrow\emptyset$, envelope $\bar{\sigma}_0\leftarrow 0$

3: for $t=1,\ldots,T$ do

4: $w_{t,0}\leftarrow w_{t-1}$ // warm-start

5: for $k=0,\ldots,K-1$ do

6: $w_{t,k+1}\leftarrow w_{t,k}-\eta_w\,\nabla_w\mathcal{L}_{\text{model}}(w_{t,k};\theta_t)$ // inner GD

7: end for

8: $w_t\leftarrow w_{t,K}$

9: Execute $w_t$; receive arrivals $A_t=\{s:s+d_s=t\}$ and update $Q_t\leftarrow Q_{t-1}\cup\{t\}\setminus A_t$

10: $\sigma_t\leftarrow|Q_t|$; $\bar{\sigma}_t\leftarrow\max(\bar{\sigma}_{t-1},\sigma_t)$; $\eta_t\leftarrow\eta_0/\sqrt{1+\beta\,\bar{\sigma}_t}$

11: $g^{\mathrm{IGT}}\leftarrow 0$

12: for $s\in A_t$ do

13: Solve $H_w\,v_s^{*}=\nabla_w\mathcal{L}_{\text{true}}(w_s;\theta_t)$ and compute $g_s(\theta_t)$ // arrival gradient

14: $g^{\mathrm{IGT}}\leftarrow g^{\mathrm{IGT}}+g_s(\theta_t)$; store $(w_s,v_s^{*},g_s(\theta_t))$ in $\mathcal{B}$

15: end for

16: for $s\in\mathcal{B}$ do

17: Re-evaluate $g_s(\theta_t)$ and add $g_s(\theta_t)-g_s(\theta_{t-1})$ to $g^{\mathrm{IGT}}$ // transport step

18: Update cached $g_s(\theta_t)$ in $\mathcal{B}$

19: end for

20: $\theta_{t+1}\leftarrow\theta_t-\eta_t\,g^{\mathrm{IGT}}$ // OMD update ([3](https://arxiv.org/html/2605.12693#S3.E3))

21: Evict oldest entries if $|\mathcal{B}|>\sigma_{\max}$

22: end for

23: Return: $\{\theta_t\}_{t=1}^{T}$
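The control flow of Algorithm 1 can be traced on a toy problem. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: it uses a constant delay, the Euclidean update of line 20, and a hypothetical elementwise instance with $\mathcal{L}_{\text{model}}(w;\theta)=\tfrac12\|w-\tanh(\theta)\|^2$ and $\mathcal{L}_{\text{true}}(w;z)=\tfrac12\|w-z\|^2$, so that $H_w=I$ and the adjoint needs no linear solve. The buffer cap `max(delay, 1)` stands in for $\sigma_{\max}$.

```python
import numpy as np


def reevaluate(theta, v):
    # Eq. (2) on the toy instance: L_true has no direct theta term and the
    # cross-partial of L_model is -diag(1 - tanh(theta)^2).
    return (1.0 - np.tanh(theta) ** 2) * v


def outer_loss(theta, z):
    return 0.5 * np.sum((np.tanh(theta) - z) ** 2)


def igt_omd(z, delay, T, eta0=0.1, beta=1.0, K=5, eta_w=0.5):
    theta = np.zeros_like(z)   # theta_1
    w = np.zeros_like(z)       # w_0
    awaiting = {}              # queue Q_t: round -> executed decision w_s
    buffer = {}                # B: round -> (v_s, cached g_s); w_s is not
                               # needed for re-evaluation on this toy instance
    envelope = 0
    cap = max(delay, 1)        # stand-in for sigma_max under a constant delay
    trajectory = []
    for t in range(1, T + 1):
        trajectory.append(theta.copy())
        # Phase 1 (lines 4-9): warm-started inner gradient steps, then execute.
        for _ in range(K):
            w = w - eta_w * (w - np.tanh(theta))
        awaiting[t] = w.copy()
        s = t - delay
        arrived = {s: awaiting.pop(s)} if s in awaiting else {}
        # Line 10: queue length, running envelope, adaptive step size.
        envelope = max(envelope, len(awaiting))
        eta = eta0 / np.sqrt(1.0 + beta * envelope)
        # Phase 2 (lines 11-15): arrival gradients.
        g = np.zeros_like(theta)
        for s_arr, w_s in arrived.items():
            v_s = w_s - z                      # adjoint with H_w = I
            g_s = reevaluate(theta, v_s)
            g += g_s
            buffer[s_arr] = (v_s, g_s)
        # Lines 16-19: one-step transport corrections for buffered rounds.
        for s_buf, (v_s, g_old) in list(buffer.items()):
            g_new = reevaluate(theta, v_s)
            g += g_new - g_old                 # zero for rounds that just arrived
            buffer[s_buf] = (v_s, g_new)
        # Line 20: Euclidean OMD update; line 21: evict oldest entries.
        theta = theta - eta * g
        while len(buffer) > cap:
            del buffer[min(buffer)]
    return np.array(trajectory)


z = np.array([0.5, -0.3])
path = igt_omd(z, delay=3, T=300)
print(outer_loss(path[0], z), outer_loss(path[-1], z))
```

On this instance the outer loss starts at 0.17 and decreases toward zero as $\tanh(\theta)$ approaches $z$. A general implementation would also store $w_s$ in the buffer and replace the identity adjoint with a linear solve such as Conjugate Gradient.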

### 3.3 Cost and Memory Analysis

The per-round cost is $O(Kpq+|A_t|q^{2}\kappa_w+|\mathcal{B}_t|pq)$, a $\min(K,\kappa_w)$-factor savings over re-solving each delayed inner problem; buffer memory is $O(\sigma_{\max}(p+2q))$ (Appendix [C](https://arxiv.org/html/2605.12693#A3)).

## 4 Theoretical Analysis

Our main results follow; full proofs are given in Appendices [E.1](https://arxiv.org/html/2605.12693#A5.SS1)–[E.5](https://arxiv.org/html/2605.12693#A5.SS5).

### 4.1 Assumptions

We assume seven standard regularity conditions from bilevel optimization \[[10](https://arxiv.org/html/2605.12693#bib.bib13),[14](https://arxiv.org/html/2605.12693#bib.bib14)\] and delayed [OCO](https://arxiv.org/html/2605.12693#glo.acronym.oco) \[[15](https://arxiv.org/html/2605.12693#bib.bib29),[27](https://arxiv.org/html/2605.12693#bib.bib30)\]: (A1–A2) inner strong convexity and smoothness; (A3) bounded solver error $\epsilon_{\mathrm{inner}}$; (A4) bounded queue length $\sigma_{\max}$; (A5) Lipschitz cross-partials (enables hypergradient re-evaluation in ([2](https://arxiv.org/html/2605.12693#S2.E2))); (A6) bounded hypergradients $G$; and (A7) bilevel strong convexity. These are detailed in Appendix [D](https://arxiv.org/html/2605.12693#A4).

### 4.2 Convergence of IGT-OMD

Our first question is whether [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) achieves sublinear regret, and whether inner-solver error and delay amplify each other. The theorem shows the coupling is additive, enabling independent control of both error sources.

###### Theorem 1 (Bilevel convergence under delay).

Under Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1)–[7](https://arxiv.org/html/2605.12693#Thmassumption7), define $\rho_{\mathrm{cpl}}:=L_{w\theta}^{2}/(\mu_w^{2}\mu_F)<1$ and $C_\rho=(1-\rho_{\mathrm{cpl}})^{-1}$. Algorithm [1](https://arxiv.org/html/2605.12693#alg1) with step size $\eta_t$ attains decision regret:

$$\mathrm{Regret}_T^{\mathrm{dec}}\leq C_\rho\left[\frac{2D_\psi\sqrt{1+\beta\sigma_{\max}}}{\eta_0}+\frac{\eta_0\,G^{2}T}{2}+2T\,\epsilon_{\mathrm{inner}}^{2}+L_F\sum_{t=1}^{T}\sum_{k\in W_t}\|\theta_{k+1}-\theta_k\|^{2}\right],\tag{5}$$

where $D_\psi$ is the Bregman diameter of the feasible set. Setting $\eta_0=c/\sqrt{T}$ yields $\mathrm{Regret}_T^{\mathrm{dec}}=O\bigl(\sqrt{T\,\sigma_{\max}}+T\,\epsilon_{\mathrm{inner}}^{2}\bigr)$, where $c=2D_\psi^{1/2}(1+\beta\sigma_{\max})^{1/4}/G$ balances the first two terms of ([5](https://arxiv.org/html/2605.12693#S4.E5)).

Single-level D-FTRL \[[15](https://arxiv.org/html/2605.12693#bib.bib29)\] and Robust OMD \[[27](https://arxiv.org/html/2605.12693#bib.bib30)\] attain $O(\sqrt{T\,d_{\mathrm{tot}}})$ regret, where $d_{\mathrm{tot}}=\sum_t d_t$ is the total delay; we match this with $\sigma_{\max}$ in place of $d_{\mathrm{tot}}$, a strictly tighter measure since $\sigma_{\max}\leq d_{\max}\ll d_{\mathrm{tot}}$ is possible \[[28](https://arxiv.org/html/2605.12693#bib.bib6)\]. Here $W_t=\{k:\max(1,t-\sigma_t)\leq k\leq t-1\}$ is the active transport window. The $T\epsilon_{\mathrm{inner}}^{2}$ penalty is additive, not multiplicative, so delay and inner-solver precision can be tuned independently.
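As a worked check of the step-size choice in Theorem 1, the first two terms of the regret bound can be treated as a function $h(\eta_0)$ and minimized directly. The shorthand $a$, $b$, and $h$ is introduced here for illustration and does not appear in the paper.

```latex
\begin{aligned}
h(\eta_0) &= \frac{a}{\eta_0} + b\,\eta_0,
  \qquad a = 2D_\psi\sqrt{1+\beta\sigma_{\max}}, \quad b = \frac{G^{2}T}{2},\\
h'(\eta_0) &= -\frac{a}{\eta_0^{2}} + b = 0
  \;\Longrightarrow\;
  \eta_0^{\star} = \sqrt{\frac{a}{b}}
  = \frac{2D_\psi^{1/2}(1+\beta\sigma_{\max})^{1/4}}{G\sqrt{T}}
  = \frac{c}{\sqrt{T}},\\
h(\eta_0^{\star}) &= 2\sqrt{ab}
  = 2G\sqrt{D_\psi T}\,(1+\beta\sigma_{\max})^{1/4}.
\end{aligned}
```

The minimizer reproduces the constant $c$ stated in the theorem. The balanced value grows as $\sqrt{T}$ and, since $(1+\beta\sigma_{\max})^{1/4}\leq(1+\beta\sigma_{\max})^{1/2}$, it sits within the $\sqrt{T\,\sigma_{\max}}$ rate quoted there; the inner-solver and transport terms are not affected by this calculation.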

### 4.3 Staleness Amplification Lower Bound

We now ask how much delayed feedback penalizes bilevel optimization relative to single-level: is the $O(\sigma_{\max}^{2})$ transport error growth a fundamental barrier, or a proof artifact? This matters to decision quality. The structural source of the barrier is the Implicit Function Theorem (IFT) coupling between the outer and inner problems: the bilevel Lipschitz constant $L_F=L_\theta+L_{w\theta}^{2}/\mu_w$ encodes how sensitively $w^{*}(\theta)$ tracks outer-parameter changes, where $L_\theta=\|\nabla^{2}_{\theta}\mathcal{L}_{\mathrm{true}}\|$, $L_{w\theta}=\|\nabla^{2}_{w\theta}\mathcal{L}_{\mathrm{model}}\|$, and $\mu_w$ is the inner strong-convexity constant. The coupling constant $C_0=L_{w\theta}/\mu_w$ measures how inner-solver error propagates into the hypergradient. Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2) formalizes three consequences of this coupling; part (b) constructs a hard quadratic instance confirming the lower bound is tight.

###### Theorem 2 (Staleness amplification).

Let $\mathcal{A}$ be any delayed optimizer treating $F(\theta)$ as a black-box convex function using stale bilevel gradients without bilevel-aware correction. Then:

1. (a) The gradient mismatch satisfies $\|\hat{g}-\nabla F(\theta_t)\|\leq L_F\|\theta_t-\theta_{t-d_t}\|+C_0\epsilon_{\mathrm{inner}}$.
2. (b) On a hard quadratic bilevel instance, $\mathrm{Regret}_T(\mathcal{A})\geq\Omega(T\epsilon_{\mathrm{inner}}^{2})$, regardless of step-size schedule, delay handling strategy, or queue length adaptation.
3. (c) Bilevel-aware correction reduces the transport error from $O(\sigma_{\max}^{2}\eta^{2}G^{2})$ to $O(\sigma_{\max}\eta^{2}G^{2})$ per round, a factor-$\sigma_{\max}$ improvement.

Statement (a) of Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2) reveals staleness amplification: staleness couples with the bilevel Lipschitz constant $L_{F}$, which is absent in single-level OCO. Statement (b) shows that the $T\epsilon_{\mathrm{inner}}^{2}$ term in Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1) is tight. Lastly, statement (c) shows that our telescoping correction replaces the squared total drift with a sum of squared per-step changes, and that the resulting gain is a factor of exactly $\sigma_{t}$ by the Cauchy–Schwarz inequality \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] (confirmed numerically with $R^{2}>0.99$ in Section [5.3](https://arxiv.org/html/2605.12693#S5.SS3)).
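
The Cauchy–Schwarz step behind statement (c) can be written out as follows. This is a sketch under two assumptions not restated in this section: the stale iterate lies $\sigma_{t}$ steps back, and each update moves the parameters by at most $\eta G$.

```latex
\begin{align*}
\|\theta_t-\theta_{t-\sigma_t}\|^2
  &= \Bigl\|\sum_{k=t-\sigma_t}^{t-1}(\theta_{k+1}-\theta_k)\Bigr\|^2
  && \text{(telescoping)}\\
  &\le \sigma_t\sum_{k=t-\sigma_t}^{t-1}\|\theta_{k+1}-\theta_k\|^2
  && \text{(Cauchy--Schwarz)}\\
  &\le \sigma_t^{2}\,\eta^{2}G^{2},
\qquad\text{whereas}\qquad
\sum_{k=t-\sigma_t}^{t-1}\|\theta_{k+1}-\theta_k\|^2 \le \sigma_t\,\eta^{2}G^{2}.
\end{align*}
```

The Cauchy–Schwarz inequality holds with equality when all per-step changes in the window are equal, which is the regime in which the two quantities differ by exactly a factor of $\sigma_{t}$.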

### 4.4 Inner-Loop Apathy Decomposition

Our third question is whether the delay error and the inner\-solver error interact\. The following decomposition shows that they do not, as transport decouples them\.

###### Theorem 3 (Inner-loop apathy).

Under Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1)–[7](https://arxiv.org/html/2605.12693#Thmassumption7), the decision regret decomposes as:

$$\mathrm{Regret}_{T}^{\mathrm{dec}}=\underbrace{O\!\bigl(\eta_{0}^{2}G^{2}T\sigma_{\max}\bigr)}_{\text{delay error}\,(R_{1})}+\underbrace{O(T\,\epsilon_{\mathrm{inner}}^{2})}_{\text{inner bias}\,(R_{2})}+\underbrace{O\!\Bigl(C_{\mathrm{ap}}\!\sum_{t=1}^{T}\sum_{k\in W_{t}}\left\|\theta_{k+1}-\theta_{k}\right\|^{2}\Bigr)}_{\text{interaction}\,(R_{3})}.\tag{6}$$

The interaction $R_{3}$ uses per-step squared changes ($\sigma_{t}\eta^{2}G^{2}$) instead of squared total drift ($\sigma_{t}^{2}\eta^{2}G^{2}$), a $\sigma_{t}$-factor improvement. Here $C_{\mathrm{ap}}$ is an explicit bounded constant, so $\epsilon_{\mathrm{inner}}$ remains additive rather than multiplied by delay.

This theorem implies that the delay-dependent terms are insensitive to the inner-solver precision, hence the name “inner-loop apathy.” In other words, [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) decouples the outer staleness from the inner-solver quality, with $\epsilon_{\mathrm{inner}}$ entering only additively.

### 4.5 DDE Stability and Discrete Consistency

The adaptive step size $\eta_{t}$ is not heuristic: it arises from a stability analysis of the continuous-time limit. Taking $\eta\to 0$ in a bounded active-window embedding, the IGT-OMD iterates ([3](https://arxiv.org/html/2605.12693#S3.E3)) converge to a delay differential equation ([DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde)) whose characteristic roots determine a sufficient stability certificate.

###### Proposition 1 ([DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) stability region).

Suppose $\sigma_{\max}<1/\beta$ and $\theta^{*}\in\mathrm{relint}(\Theta)$. Consider the continuous-time limit of IGT-OMD, $\dot{\theta}(t)=-\eta(t)\,g^{\mathrm{IGT}}\bigl(\theta(t),\,\{\theta(t-\tau_{s})\}_{s\in Q(t)}\bigr)$, with $\eta(t)=\eta_{0}/\sqrt{1+\beta\bar{\sigma}(t)}$. Linearizing around $\theta^{*}$, the system is asymptotically stable whenever

$$\eta_{0}<\min\left\{\frac{1}{\|H_{F}\|\sqrt{1+\beta\sigma_{\max}}},\;\frac{\sqrt{1+\beta\sigma_{\max}}}{\sigma_{\max}\sqrt{\|L_{\mathrm{IGT}}\|}}\right\},$$

where $\|H_{F}\|$ denotes the spectral norm of $H_{F}$ (equal to $\lambda_{\max}(H_{F})$, since $H_{F}\succ 0$). This is a sufficient local certificate for the bounded active-window embedding used in the analysis.

This result provides *local* asymptotic stability for the linearized, homogeneous system ($\epsilon_{\mathrm{inner}}=0$). In the analyzed embedding, the certificate is indexed by queue length $\sigma_{\max}$ rather than raw delay, matching the constant $\eta_{\max}=0.093$ observed across $\sigma\in\{1,\ldots,100\}$ in Table [2](https://arxiv.org/html/2605.12693#S5.T2).
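
The step-size schedule and the sufficient condition of Proposition 1 are closed-form, so they can be evaluated directly. The sketch below is a minimal illustration: the function names are hypothetical, the numeric inputs are made up for demonstration and are not taken from the paper's experiments, and it assumes $\sigma_{\max}\geq 1$.

```python
import math


def adaptive_step(eta0, beta, sigma_bar):
    """Queue-length-adaptive step size eta(t) = eta0 / sqrt(1 + beta * sigma_bar)."""
    return eta0 / math.sqrt(1.0 + beta * sigma_bar)


def eta0_certificate(h_norm, l_igt_norm, beta, sigma_max):
    """Upper bound on eta0 from the sufficient condition in Proposition 1.

    h_norm is ||H_F|| and l_igt_norm is ||L_IGT||; sigma_max must be >= 1.
    """
    if not sigma_max < 1.0 / beta:
        raise ValueError("Proposition 1 assumes sigma_max < 1 / beta")
    root = math.sqrt(1.0 + beta * sigma_max)
    curvature_bound = 1.0 / (h_norm * root)
    delay_bound = root / (sigma_max * math.sqrt(l_igt_norm))
    return min(curvature_bound, delay_bound)


# Illustrative values only (not from the paper).
bound = eta0_certificate(h_norm=2.0, l_igt_norm=4.0, beta=0.01, sigma_max=10)
eta = adaptive_step(eta0=0.9 * bound, beta=0.01, sigma_bar=10)
print(f"eta0 bound = {bound:.4f}, step at sigma_bar = 10: {eta:.4f}")
```

For these inputs the delay-dependent term is the binding one; which term binds in practice depends on the problem's curvature and queue length.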

###### Proposition 2 (Discrete-continuous consistency).

Let $\theta^{\mathrm{disc}}_{k}$ denote the IGT-OMD iterates and $\theta^{\mathrm{cont}}(t)$ the solution to the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) in Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1) with the same initial history. Under the assumptions of Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1) and a uniform step bound $\eta_{t}\leq\eta_{\max}$, the discrete trajectory tracks the continuous solution:

$$\sup_{k\,\leq\,T}\,\bigl\|\theta^{\mathrm{disc}}_{k}-\theta^{\mathrm{cont}}(t_{k})\bigr\|\;\leq\;C\,\eta_{\max},$$

with $t_{k}=\sum_{s\leq k}\eta_{s}$ and a constant $C$ depending on $T$, $\left\|H_{F}\right\|$, $\left\|L_{\mathrm{IGT}}\right\|$, and $\sigma_{\max}$ but not on the discretization. Consequently, the discrete iterates inherit the asymptotic stability and contraction rate of the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) up to $O(\eta_{\max})$ error.

Proposition [2](https://arxiv.org/html/2605.12693#Thmproposition2) follows because Algorithm [1](https://arxiv.org/html/2605.12693#alg1) is a forward-Euler discretization of the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde). It shows that the stability region and contraction rate of the continuous-time system, and in particular the $\eta_{\max}$-dependent step size calibration of Section [3.2](https://arxiv.org/html/2605.12693#S3.SS2), transfer to the discrete iterates in practice, with error vanishing in $\eta_{\max}$.
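
The forward-Euler argument can be illustrated on a toy problem. The sketch below does not simulate IGT-OMD: it integrates the scalar delay differential equation $\dot{\theta}(t)=-\theta(t-1)$ with constant history $\theta(t)=1$ for $t\leq 0$ (an illustrative choice) and compares coarse Euler trajectories against a fine-step reference. Under these assumptions the sup-norm error should roughly halve when the step is halved, which is the first-order behaviour the proposition describes.

```python
def euler_dde(h_inv, horizon=5, delay=1):
    """Forward-Euler solve of theta'(t) = -theta(t - delay), theta(t) = 1 for t <= 0.

    h_inv is the number of steps per unit time, so the step size is 1 / h_inv.
    Returns the values at t = 0, h, 2h, ..., horizon.
    """
    h = 1.0 / h_inv
    lag = delay * h_inv                 # delay measured in grid steps
    theta = [1.0] * (lag + 1)           # history on [-delay, 0]; last entry is t = 0
    for _ in range(horizon * h_inv):
        delayed = theta[-1 - lag]       # theta(t - delay)
        theta.append(theta[-1] - h * delayed)
    return theta[lag:]


def sup_error(h_inv, ref_inv=1024):
    """Sup-norm gap between a coarse Euler run and a fine reference run."""
    coarse = euler_dde(h_inv)
    ref = euler_dde(ref_inv)
    stride = ref_inv // h_inv
    return max(abs(c - ref[i * stride]) for i, c in enumerate(coarse))


err_coarse = sup_error(8)               # step size 1/8
err_fine = sup_error(16)                # step size 1/16
print(err_coarse, err_fine, err_coarse / err_fine)
```

The fine-step run stands in for the continuous solution, so the reported errors are approximations of the true discretization error.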

## 5 Experiments

We evaluate [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) across four settings that jointly test our theoretical claims: (i) LQR for stability under delay, (ii) Warcraft for staleness amplification, (iii) Sinkhorn OT for the $\sigma_{\max}$-factor transport error scaling, and (iv) a controlled experiment isolating the transport contribution with optimizer choice held fixed. Our baselines span a $2{\times}2$ matrix (bilevel-aware vs. single-level $\times$ delay-aware vs. delay-unaware): 2-Stage (MSE), SPO+ \[[8](https://arxiv.org/html/2605.12693#bib.bib2)\], D-FTRL \[[15](https://arxiv.org/html/2605.12693#bib.bib29)\], Robust [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) \[[27](https://arxiv.org/html/2605.12693#bib.bib30)\], and Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) \[[22](https://arxiv.org/html/2605.12693#bib.bib22)\]. Further experimental details and analyses are in Appendices [F](https://arxiv.org/html/2605.12693#A6) and [G](https://arxiv.org/html/2605.12693#A7). All experiments use 5–10 seeds; error bars report $\pm 1$ standard deviation.

### 5.1 Linear Quadratic Regulator Stability Boundary

Setup. An LQR provides a clean bilevel stability testbed: the inner problem computes a linear gain for a learned dynamics model, with the inner configuration held fixed to isolate delay handling. We use true dynamics $x_{t+1}=A_{\rm true}x_{t}+B_{\rm true}u_{t}+\xi_{t}$ and learned parameters $\theta=(\hat{A},\hat{B})\in\mathbb{R}^{10\times 13}$. Constant delays $d\in\{1,10,20,40,60,80,100\}$ span both the stable regime ($d\leq 40$, all methods converge) and the phase-transition region predicted by Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1) ($d\geq 60$, bilevel-unaware methods degrade); beyond $d=100$ baselines diverge entirely. The maximum stable learning rate $\eta_{\max}$ is found via binary search.
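
The binary search for a maximum stable learning rate can be sketched on a toy problem. The code below is not the LQR pipeline: it uses scalar delayed gradient descent $\theta_{t+1}=\theta_{t}-\eta h\,\theta_{t-d}$ on a quadratic with curvature $h$, assumes stability is monotone in $\eta$, and declares divergence with a finite-horizon blow-up test, so the returned value slightly overestimates the true boundary. All names and thresholds are hypothetical.

```python
def diverges(eta, delay, curvature=1.0, steps=2000, blowup=1e6):
    """Finite-horizon divergence test for delayed gradient descent on a scalar quadratic."""
    history = [1.0] * (delay + 1)            # theta_{-delay}, ..., theta_0
    for _ in range(steps):
        stale = history[-1 - delay]          # iterate from `delay` rounds ago
        nxt = history[-1] - eta * curvature * stale
        if abs(nxt) > blowup:
            return True
        history.append(nxt)
    return False


def max_stable_eta(delay, lo=0.0, hi=4.0, iters=40):
    """Bisection for the largest step size that passes the divergence test."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if diverges(mid, delay):
            hi = mid                         # too aggressive: shrink from above
        else:
            lo = mid                         # still stable: move the floor up
    return lo


eta_d1 = max_stable_eta(delay=1)
eta_d3 = max_stable_eta(delay=3)
print(eta_d1, eta_d3)
```

For $d=1$ and $h=1$ the characteristic polynomial is $z^{2}-z+\eta$, whose complex roots have modulus $\sqrt{\eta}$, so the exact boundary is $\eta=1$. The search recovers it up to the slack of the finite-horizon test, and a longer delay yields a smaller stable step in this toy model.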

Results. Table [2](https://arxiv.org/html/2605.12693#S5.T2) reveals a two-regime structure: [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) maintains constant $\eta_{\max}=0.093$ across all $\sigma\leq 100$ (Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)), while bilevel-unaware methods (2-Stage, SPO+) degrade to $0.010$ at $\sigma=100$ ($89\%$ reduction). Single-level delay-aware methods degrade more slowly but still fall to $0.052$ ($44\%$ reduction). [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) thus achieves $9.3\times$ and $1.8\times$ gains in $\eta_{\max}$ over 2-Stage and D-FTRL, respectively.

Table 2: LQR stability boundary: maximum stable learning rate $\eta_{\max}$ vs. queue length $\sigma$. [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) maintains constant $\eta_{\max}$ across all delays; bilevel-unaware methods (2-Stage, SPO+) degrade earliest.

### 5.2 Warcraft Shortest Path: A Structural Contrast

Having established stability in a linear setting, we turn to a combinatorial optimization benchmark to test whether staleness amplification degrades the quality of real decisions\.

Setup. We study shortest-path planning on $12{\times}12$ grids from Warcraft II maps with four terrain types. The outer parameters $\theta\in\mathbb{R}^{128}$ index the trainable hidden-layer column of a 2-layer neural network predicting per-cell traversal costs, and outer gradients are computed with the differentiable perturbation surrogate of Vlastelica et al. \[[33](https://arxiv.org/html/2605.12693#bib.bib20)\]. The inner problem runs Dijkstra’s algorithm to find the shortest path. This solver is exact, so $\epsilon_{\mathrm{inner}}=0$. The decision loss evaluates the chosen path under true terrain costs; the *optimality gap* measures the excess cost over the oracle shortest path. We test constant delays $d\in\{0,10,50,100\}$ and stochastic Poisson delays with mean $\lambda\in\{10,25,50,100\}$ under Ornstein–Uhlenbeck (OU) edge-cost drift with rate $0.05$. We use 10 seeds over $T=5{,}000$ rounds.
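
The decision loss and optimality gap described above can be sketched on a small grid. This is an illustrative stand-in, not the benchmark code: it assumes a $3{\times}3$ grid, 4-neighbour moves, a cost paid on entering every cell (including the start), and an absolute rather than relative gap. All names and cost values are hypothetical.

```python
import heapq


def shortest_path(costs):
    """Dijkstra from the top-left to the bottom-right cell of a cost grid."""
    n, m = len(costs), len(costs[0])
    start, goal = (0, 0), (n - 1, m - 1)
    dist = {start: costs[0][0]}
    prev = {}
    heap = [(costs[0][0], start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue                          # stale heap entry
        if (r, c) == goal:
            break
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < n and 0 <= cc < m:
                nd = d + costs[rr][cc]
                if nd < dist.get((rr, cc), float("inf")):
                    dist[(rr, cc)] = nd
                    prev[(rr, cc)] = (r, c)
                    heapq.heappush(heap, (nd, (rr, cc)))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def path_cost(path, costs):
    return sum(costs[r][c] for r, c in path)


def optimality_gap(predicted, true):
    """Excess true cost of the path planned on predicted costs over the oracle path."""
    chosen = shortest_path(predicted)         # decision made with the model's costs
    oracle = shortest_path(true)              # best decision in hindsight
    return path_cost(chosen, true) - path_cost(oracle, true)


true_costs = [[1, 9, 1], [1, 9, 1], [1, 1, 1]]
predicted_costs = [[1, 1, 1], [9, 9, 1], [1, 1, 1]]
gap = optimality_gap(predicted_costs, true_costs)
print(gap)                                    # 8
```

Here the mispredicted costs send the planner along the top row, which costs 13 under the true terrain versus 5 for the oracle path, so the gap is 8; a perfect prediction gives a gap of zero.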

Results. Table [3](https://arxiv.org/html/2605.12693#S5.T3) reveals three structural findings. First, [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) achieves the lowest optimality gap across all delay configurations: at $d=100$, its gap is $1.56$ vs. $1.83$ for D-FTRL ($14.8\%$ reduction) and $2.42$ for 2-Stage ($35.5\%$ reduction), confirming staleness amplification (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(a)). Second, methods without bilevel awareness (2-Stage, SPO+) degrade most severely ($1.6$–$3.4\times$ gap relative to [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd)), showing that ignoring the inner solver’s sensitivity has dramatic consequences under delay. Third, because the inner solver is exact ($\epsilon_{\mathrm{inner}}=0$), the inner-loop apathy benefit (Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)) is inoperative here. The reduction over delay-aware methods ($12$–$21\%$) is correspondingly smaller than on Sinkhorn, where $\epsilon_{\mathrm{inner}}>0$, which is a structural prediction of the theory.

Table 3: Warcraft shortest path: optimality gap (mean over last 200 rounds, $\pm 1$ s.d.) vs. delay configuration. [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) achieves the smallest gap across all settings.

### 5.3 Transport Error Scaling (Sinkhorn Optimal Transport)

Warcraft uses an exact inner solver, so the inner-solver error channel ($R_{2}$ in Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)) is zero. To test the full $R_{1}+R_{2}+R_{3}$ decomposition, we need an environment where $\epsilon_{\mathrm{inner}}>0$.

Setup. We use a Sinkhorn optimal transport (OT) task, which computes a minimum-cost coupling between two discrete distributions via differentiable entropic regularization ($n=10$, $K=10$ inner Sinkhorn iterations, OU drift). We run constant delays $d\in\{1,2,5,10,20,50\}$ with $T=1{,}000$ and 5 seeds to test the transport error scaling claims of Theorems [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c) and [3](https://arxiv.org/html/2605.12693#Thmtheorem3). We compute two error surrogates: $R_{\mathrm{sq}}=\sum_{t}\|\theta_{t}-\theta_{t-d}\|^{2}$ (squared total drift, the quantity naïve methods pay) and $R_{3}=\sum_{t}\sum_{s\in Q_{t}}\|\theta_{s+1}-\theta_{s}\|^{2}$ (sum of per-step squared changes, the quantity IGT pays).

Results. The ratio $R_{\mathrm{sq}}/R_{3}$ tracks $\sigma_{\max}$ with near-perfect fidelity across all four algorithms: ratio ${\approx}9.99$ at $d=10$, ${\approx}19.9$ at $d=20$, and ${\approx}49.3$ at $d=50$ (Table [8](https://arxiv.org/html/2605.12693#A6.T8) in Appendix [F.3](https://arxiv.org/html/2605.12693#A6.SS3)). Log-log regression in Figure [1](https://arxiv.org/html/2605.12693#S5.F1) confirms the theoretical slopes: $R_{3}\propto\sigma^{0.99}$ and $R_{\mathrm{sq}}\propto\sigma^{1.99}$ (both $R^{2}>0.99$). This consistent ratio is not an algorithmic artifact: it holds identically for [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd), D-FTRL, Robust [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd), and Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd), because the $\sigma_{\max}$-factor improvement follows from a *geometric* identity (Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\]) of the parameter trajectory.
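
The two surrogates are simple functions of a stored parameter trajectory, so their ratio can be checked on synthetic data. The sketch below assumes a constant delay $d$ and a window $Q_{t}=\{t-d,\ldots,t-1\}$; the trajectories are synthetic and unrelated to the Sinkhorn runs, and the function name is hypothetical.

```python
import random


def drift_surrogates(traj, d):
    """Return (R_sq, R_3) for a list of parameter vectors and a constant delay d."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    r_sq = 0.0
    r_3 = 0.0
    for t in range(d, len(traj)):
        r_sq += sqdist(traj[t], traj[t - d])                  # squared total drift
        r_3 += sum(sqdist(traj[s + 1], traj[s])               # per-step squared changes
                   for s in range(t - d, t))
    return r_sq, r_3


# Straight-line trajectory: every step is identical, so Cauchy-Schwarz is tight.
straight = [(0.01 * t, -0.02 * t) for t in range(200)]

# Random walk: steps point in unrelated directions, so the ratio stays well below d.
random.seed(0)
walk = [(0.0, 0.0)]
for _ in range(199):
    x, y = walk[-1]
    walk.append((x + random.gauss(0, 0.01), y + random.gauss(0, 0.01)))

for d in (1, 10, 50):
    rs, r3 = drift_surrogates(straight, d)
    ws, w3 = drift_surrogates(walk, d)
    print(d, rs / r3, ws / w3)
```

On the straight-line trajectory the ratio equals $d$ up to rounding, and at $d=1$ the two surrogates coincide. A ratio close to $\sigma_{\max}$, as reported above, therefore indicates that consecutive updates within a window are nearly collinear and of similar size.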

![Refer to caption](https://arxiv.org/html/2605.12693v1/x1.png)

![Refer to caption](https://arxiv.org/html/2605.12693v1/x2.png)

Figure 1: Transport error scaling validates Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c). *Left:* Log-log regression of cumulative $R_{3}$ (slope ${\approx}1$) and $R_{\mathrm{sq}}$ (slope ${\approx}2$) against $\sigma_{\max}$, confirming linear vs. quadratic scaling. *Right:* The ratio $R_{\mathrm{sq}}/R_{3}$ tracks $\sigma_{\max}$ with near-perfect agreement across all algorithms.

### 5.4 Isolating Transport Contribution with Adam

Algorithm [1](https://arxiv.org/html/2605.12693#alg1) with SGD does not outperform the Adam baselines (i.e., Robust [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) and Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd)) on the non-convex Sinkhorn OT task: the transport corrections are exact, but the base optimizer’s convergence properties also matter. To disentangle these effects, we hold the Adam optimizer \[[17](https://arxiv.org/html/2605.12693#bib.bib28)\] fixed in this experiment.

Setup. To isolate [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) transport from optimizer choice, both treatment ([IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd)) and control (Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd)) use identical Adam settings ($\eta=10^{-3}$, $\beta_{1}=0.9$, $\beta_{2}=0.999$). We refer to the modification of [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) with Adam as Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt). Our experiment uses the same Sinkhorn environment with $d\in\{1,5,10,20,50\}$, $T=1{,}000$, and 5 seeds.

Results. Table [4](https://arxiv.org/html/2605.12693#S5.T4) shows a monotonically growing transport benefit: $0.0\%$ at $d=1$ (by construction), $+4.6\%$ at $d=5$, $+7.9\%$ at $d=20$ ($p<0.001$), and $+9.5\%$ at $d=50$ ($p<0.0001$). The $d=1$ zero is a designed negative control: when $|Q_{t}|=1$, the sum of squares and the squared total drift are identical, so [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) is vacuous. The monotonic growth is the signature of Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c). Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)’s regret remains statistically flat across delays, while Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd)’s regret increases monotonically.

$2{\times}2$ factorial. In a $2{\times}2$ factorial over optimizer and transport, we find that cumulative regret grows modestly without transport ($+7.3\%$ for SGD and $+8.4\%$ for Adam as delay increases from $d=1$ to $d=50$). However, the interaction is non-additive: IGT+SGD *worsens* degradation to $+28.8\%$, while IGT+Adam *eliminates* it ($-1.8\%$, $p=0.29$). Transport corrections and adaptive scaling are synergistic: neither alone achieves zero degradation.
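
The non-additivity can be made explicit with a difference-in-differences contrast over the four reported degradation figures. This contrast is an illustrative summary computed here from the numbers in the text; it is not a statistic reported by the paper, and the names are hypothetical.

```python
# Degradation in cumulative regret from d=1 to d=50, in percent (values quoted in the text).
degradation = {
    ("SGD", "no transport"): 7.3,
    ("Adam", "no transport"): 8.4,
    ("SGD", "IGT"): 28.8,
    ("Adam", "IGT"): -1.8,
}


def transport_effect(optimizer):
    """Change in degradation (percentage points) when IGT transport is switched on."""
    return degradation[(optimizer, "IGT")] - degradation[(optimizer, "no transport")]


effect_sgd = transport_effect("SGD")       # positive: transport hurts under SGD
effect_adam = transport_effect("Adam")     # negative: transport helps under Adam
interaction = effect_adam - effect_sgd     # zero would mean purely additive effects
print(round(effect_sgd, 1), round(effect_adam, 1), round(interaction, 1))
```

Transport raises degradation by about 21.5 points under SGD and lowers it by about 10.2 points under Adam, so the contrast is far from zero, which is what a non-additive interaction means here.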

D-FTRL+IGT. Repeating with D-FTRL as the base optimizer yields a near-identical improvement curve ($0.0\%$ at $d=1$ to $+9.8\%$ at $d=50$; Table [11](https://arxiv.org/html/2605.12693#A7.T11) in Appendix [G.2](https://arxiv.org/html/2605.12693#A7.SS2)), confirming the optimizer-agnostic transport benefit (Remark [1](https://arxiv.org/html/2605.12693#Thmremark1)).

Table 4: Sinkhorn OT: cumulative regret ($T=1{,}000$, 5 seeds). Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) vs. Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) isolates the transport contribution; improvement grows monotonically with delay ($p<0.01$ for $d\geq 10$).

Summary. A unified table mapping results to theorems is in Appendix [F](https://arxiv.org/html/2605.12693#A6); Appendix [G.1](https://arxiv.org/html/2605.12693#A7.SS1) reports a non-convex HalfCheetah MPC stress test reserved for the non-convex extension. Across assumption-aligned benchmarks, the results confirm each theoretical claim: constant stability across delays (Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)), staleness amplification in bilevel-unaware methods (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(a)), the $\sigma_{\max}$-factor improvement as a geometric identity (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c)), and additive decoupling of inner-solver error from delay (Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)).

## 6 Conclusion

We introduced [IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt)-[OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd), the first delayed optimizer for bilevel predict-then-optimize pipelines. Our analysis identifies staleness amplification (the coupling between outer-parameter staleness and inner-solver sensitivity) as a failure mode unique to bilevel delay (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)), establishes a matching lower bound, and shows that IGT reduces transport error by a factor of $\sigma_{\max}$. Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) achieves $9.5\%$ lower regret than Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) at $d=50$ ($p<0.001$), with the improvement growing monotonically with delay, as predicted, and vanishing at $d=1$ as a designed negative control. A $2{\times}2$ factorial confirms that transport and adaptive scaling are synergistic: only their combination eliminates delay degradation. Our broader impacts are summarized in Appendix [H](https://arxiv.org/html/2605.12693#A8).

Limitations and Future Work. Strong convexity of $\mathcal{L}_{\text{model}}$ (Assumption [1](https://arxiv.org/html/2605.12693#Thmassumption1)) may not hold for neural predictors; a formal non-convex analysis remains open. Our $2{\times}2$ factorial analysis shows that transport corrections require trajectory stability (Adam) and can *amplify* degradation without it (SGD), an interaction not captured by the current theory. The per-round cost $O(K\,p\,q+\sigma_{\max}\,p\,q+q^{2}\kappa_{w})$ may be prohibitive for large $q$; low-rank Hessian approximations could help. Natural extensions include non-convex inner objectives via fixed-point implicit differentiation, applications in reinforcement learning with delayed rewards, and federated settings with communication delays.

## References

- \[1\] (2019) Reducing the variance in online optimization by transporting past gradients. In Advances in Neural Information Processing Systems, Vol. 32.
- \[2\] A. Beck and M. Teboulle (2003) Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters 31(3), pp. 167–175. External Links: [Document](https://dx.doi.org/10.1016/s0167-6377%2802%2900231-6).
- \[3\] D. P. Bertsekas (2016) Nonlinear programming. 3rd edition, Athena Scientific, Belmont, MA. External Links: ISBN 978-1886529052.
- \[4\] L. M. Bregman (1967) The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics 7(3), pp. 200–217.
- \[5\] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540. External Links: [Link](https://arxiv.org/abs/1606.01540), [Document](https://dx.doi.org/10.48550/arXiv.1606.01540).
- \[6\] A. Capitaine, M. Haddouche, E. Moulines, M. I. Jordan, E. Boursier, and A. Durmus (2025) Online decision-focused learning. arXiv preprint arXiv:2505.13564. External Links: 2505.13564, [Link](https://arxiv.org/).
- \[7\] A. L. Dontchev and R. T. Rockafellar (2014) Implicit functions and solution mappings: a view from variational analysis. 2nd edition, Springer Monographs in Mathematics, Springer New York. External Links: [Document](https://dx.doi.org/10.1007/978-1-4939-1037-3), ISBN 978-1-4939-1036-6.
- \[8\] A. N. Elmachtoub and P. Grigas (2022) Smart “Predict, then Optimize”. Management Science 68(1), pp. 9–26. External Links: [Document](https://dx.doi.org/10.1287/mnsc.2020.3922).
- \[9\] G. Flaspohler, F. Orabona, J. Cohen, S. Mouatadid, M. Oprescu, P. Orenstein, and L. Mackey (2021) Online learning with optimism and delay. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 139, pp. 3363–3373. External Links: [Link](https://mlr.press/).
- \[10\] S. Ghadimi and M. Wang (2018) Approximation methods for bilevel programming. arXiv preprint arXiv:1802.02246. External Links: [Link](https://arxiv.org/abs/1802.02246), [Document](https://dx.doi.org/10.48550/arXiv.1802.02246).
- \[11\] M. R. Hestenes and E. Stiefel (1952) Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards 49(6), pp. 409–436. External Links: [Document](https://dx.doi.org/10.6028/jres.049.044).
- \[12\] J. Huang, Y. Dai, and L. Huang (2023) Banker online mirror descent: a universal approach for delayed online bandit learning. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 202, pp. 13818–13847. External Links: [Link](https://proceedings.mlr.press/v202/huang23e/huang23e.pdf).
- \[13\] A. Iserles (2009) A first course in the numerical analysis of differential equations. 2nd edition, Cambridge Texts in Applied Mathematics, Cambridge University Press, Cambridge, UK. External Links: [Document](https://dx.doi.org/10.1017/CBO9780511995569), ISBN 978-0521734905.
- \[14\] K. Ji, J. Yang, and Y. Liang (2021) Bilevel optimization: convergence analysis and enhanced design. In International Conference on Machine Learning, pp. 4882–4892. External Links: [Document](https://dx.doi.org/10.48550/arxiv.2010.07962).
- \[15\] P. Joulani, A. György, and C. Szepesvári (2013) Online learning under delayed feedback. In International Conference on Machine Learning, pp. 1453–1461.
- \[16\] E. I. Jury (1964) Theory and application of the z-transform method. John Wiley & Sons, New York, NY, USA. External Links: ISBN 978-0471452850.
- \[17\] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA. Note: ICLR 2015. External Links: [Link](https://arxiv.org/).
- \[18\] V. Kolmanovskii and A. Myshkis (1999) Stability of stochastic functional differential equations. In Introduction to the Theory and Applications of Functional Differential Equations, Mathematics and Its Applications, Vol. 463, pp. 387–437. External Links: [Document](https://dx.doi.org/10.1007/978-94-017-1965-0%5F10).
- \[19\] S. Lin, D. Sow, K. Ji, Y. Liang, and N. Shroff (2023) Non-convex bilevel optimization with time-varying objective functions. In Advances in Neural Information Processing Systems, Vol. 36, pp. 12658–12686. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/5ee60ca5686bbcf756e56a6c75e66f32-Abstract-Conference.html).
- \[20\] J. Mandi, J. Kotary, S. Berden, M. Mulamba, V. Bucarey, T. Guns, and F. Fioretto (2024) Decision-focused learning: foundations, state of the art, benchmark and future opportunities. Journal of Artificial Intelligence Research 81, pp. 1623–1701.
- \[21\] P. Nazari, B. Hou, D. A. Tarzanagh, L. Shen, and G. Michailidis (2025) Stochastic regret guarantees for online zeroth- and first-order bilevel optimization. In Advances in Neural Information Processing Systems, Vol. 38. External Links: [Link](https://arxiv.org/abs/2511.01126).
- \[22\] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro (2009) Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4), pp. 1574–1609. External Links: [Document](https://dx.doi.org/10.1137/070704277).
- \[23\] E. Nikishin, R. Abachi, R. Agarwal, and P. Bacon (2022) Control-oriented model-based reinforcement learning with implicit differentiation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36(7), pp. 7886–7894. External Links: [Document](https://dx.doi.org/10.1609/aaai.v36i7.20758).
- \[24\] B. G. Pachpatte (1998) Inequalities for differential and integral equations. Mathematics in Science and Engineering, Vol. 197, Academic Press, San Diego, CA. External Links: ISBN 978-0125434300.
- \[25\] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, Vol. 32, pp. 8024–8035. External Links: [Link](https://neurips.cc/).
- \[26\] H. Qiu, M. Zhang, and J. Achddou (2026) Decentralized online convex optimization with unknown feedback delays. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40 (01), pp. 1–9. External Links: [Document](https://dx.doi.org/10.1609/aaai.v40i30.39688), [Link](https://ojs.aaai.org/index.php/AAAI/article/view/39688).
- \[27\] K. Quanrud and D. Khashabi (2015) Online learning with adversarial delays. In Advances in Neural Information Processing Systems, Vol. 28, pp. 1270–1278. External Links: [Link](https://neurips.cc/).
- \[28\] A. Ryabchenko, I. Attias, and D. M. Roy (2026) A reduction from delayed to immediate feedback for online convex optimization with improved guarantees. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40 (01), pp. 1–9. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2602.02634), [Link](https://arxiv.org/abs/2602.02634).
- \[29\] S. Shalev-Shwartz (2012) Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2), pp. 107–194.
- \[30\] J. M. Steele (2004) The Cauchy-Schwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, Cambridge, UK. External Links: ISBN 978-0521546775, [Document](https://dx.doi.org/10.1017/cbo9780511817106).
- \[31\]B\. Tang and E\. B\. Khalil\(2022\)PyEPO: a PyTorch\-based end\-to\-end predict\-then\-optimize library for linear and integer programming\.arXiv preprint arXiv:2206\.14234\.External Links:[Link](https://arxiv.org/),[Document](https://dx.doi.org/10.48550/arXiv.2206.14234)Cited by:[§1\.1](https://arxiv.org/html/2605.12693#S1.SS1.p1.1)\.
- \[32\]D\. A\. Tarzanagh, P\. Nazari, B\. Hou, L\. Shen, and L\. Balzano\(2024\)Online bilevel optimization: regret analysis of online alternating gradient methods\.InProceedings of the 27th International Conference on Artificial Intelligence and Statistics,Proceedings of Machine Learning Research, Vol\.238,pp\. 2854–2862\.External Links:[Link](https://proceedings.mlr.press/v238/ataee-tarzanagh24a.html)Cited by:[§1\.1](https://arxiv.org/html/2605.12693#S1.SS1.p1.1)\.
- \[33\]M\. Vlastelica, A\. Paulus, V\. Musil, G\. Martius, and M\. Rolínek\(2020\)Differentiation of blackbox combinatorial solvers\.InInternational Conference on Learning Representations,Cited by:[§D\.1](https://arxiv.org/html/2605.12693#A4.SS1.p4.2),[§F\.5](https://arxiv.org/html/2605.12693#A6.SS5.p3.1),[NeurIPS Paper Checklist](https://arxiv.org/html/2605.12693#Ax1.I1.ix19.p1.1),[NeurIPS Paper Checklist](https://arxiv.org/html/2605.12693#Ax1.I1.ix47.p1.1),[§1\.1](https://arxiv.org/html/2605.12693#S1.SS1.p1.1),[§5\.2](https://arxiv.org/html/2605.12693#S5.SS2.p2.7)\.
- \[34\]B\. Wilder, B\. Dilkina, and M\. Tambe\(2019\)Melding the data\-decisions pipeline: decision\-focused learning for combinatorial optimization\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.33 \(01\),pp\. 1658–1665\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v33i01.33011658)Cited by:[§1\.1](https://arxiv.org/html/2605.12693#S1.SS1.p1.1),[§1](https://arxiv.org/html/2605.12693#S1.p1.2)\.
- \[35\]S\. Yu, W\. Chen, and H\. V\. Poor\(2025\)Distributed stochastic gradient descent with staleness: a stochastic delay differential equation based framework\.IEEE Transactions on Signal Processing\.Note:Accepted\. arXiv:2406\.11159External Links:[Document](https://dx.doi.org/10.48550/arxiv.2406.11159)Cited by:[§B\.3](https://arxiv.org/html/2605.12693#A2.SS3.p1.3),[§E\.4](https://arxiv.org/html/2605.12693#A5.SS4.SSS0.Px2.2.p1.1),[§E\.5](https://arxiv.org/html/2605.12693#A5.SS5.SSS0.Px2.1.p1.7),[§E\.5](https://arxiv.org/html/2605.12693#A5.SS5.p1.1),[§1\.1](https://arxiv.org/html/2605.12693#S1.SS1.p1.1),[Table 1](https://arxiv.org/html/2605.12693#S1.T1.4.11.11.2),[§1](https://arxiv.org/html/2605.12693#S1.p4.7),[Remark 16](https://arxiv.org/html/2605.12693#Thmremark16.p1.5)\.

## Appendix A Notation Table

Table 5: Principal notation.

| Symbol | Definition |
| --- | --- |
| **Core variables and delay** | |
| $\theta_t \in \Theta \subseteq \mathbb{R}^p$ | Outer-level predictor parameters at round $t$ |
| $w_t \in \mathcal{W} \subseteq \mathbb{R}^q$ | Approximate inner-level decision at round $t$ |
| $w^*: \Theta \to \mathcal{W}$ | Inner minimizer: $w^*(\theta) = \operatorname{argmin}_{w \in \mathcal{W}} \mathcal{L}_{\text{model}}(w;\theta)$ |
| $\mathcal{L}_{\text{model}}: \mathcal{W} \times \Theta \to \mathbb{R}$ | Model-based decision objective (inner loss) |
| $\mathcal{L}_{\text{true}}: \mathcal{W} \times \Theta \to \mathbb{R}$ | Environment-evaluated decision loss (outer loss) |
| $d_t \in \mathbb{N}_0$ | Feedback delay for round $t$ |
| $Q_t \subseteq [t]$ | Queue of outstanding rounds: $Q_t = \{s \leq t : s + d_s > t\}$ |
| $\sigma_t = \lvert Q_t \rvert$ | Queue length at time $t$ |
| $\sigma_{\max} = \max_t \sigma_t$ | Maximum queue length over all rounds |
| $\epsilon_{\mathrm{inner}} \geq 0$ | Inner-solver error: $\lVert w_t - w^*(\theta_t) \rVert \leq \epsilon_{\mathrm{inner}}$ |
| $g_t: \Theta \to \mathbb{R}^p$ | Hypergradient: $g_t(\theta) = \tfrac{d}{d\theta} \mathcal{L}_{\text{true}}(w^*(\theta);\theta)$ |
| $D_\psi: \Theta \times \Theta \to \mathbb{R}_{\geq 0}$ | $D_\psi(\theta,\theta') = \psi(\theta) - \psi(\theta') - \langle \nabla\psi(\theta'),\, \theta - \theta' \rangle$ |
| $\psi: \Theta \to \mathbb{R}$ | Strongly convex mirror map generating $D_\psi$ |
| $F: \Theta \to \mathbb{R}$ | Bilevel objective: $F(\theta) = \mathcal{L}_{\text{true}}(w^*(\theta);\theta)$ |
| **Dimensions and horizon** | |
| $T$ | Total number of rounds (horizon) |
| $t \in \{1, \ldots, T\}$ | Round index |
| $p$ | Dimension of outer parameters $\theta_t$ |
| $q$ | Dimension of inner decisions $w_t$ |
| $K$ | Number of inner gradient-descent steps per round |
| $\rho$ | Inner-solver contraction factor, e.g. $\epsilon_{\mathrm{inner}} = O(\rho^K)$ with $\rho < 1$ |
| **Smoothness and convexity constants** | |
| $L_w$ | Smoothness constant of $\mathcal{L}_{\text{model}}$ w.r.t. $w$ (Assumption A1) |
| $\mu_w$ | Strong-convexity constant of $\mathcal{L}_{\text{model}}$ w.r.t. $w$ (Assumption A1) |
| $L_\theta$ | Lipschitz constant of $\nabla_\theta \mathcal{L}_{\text{true}}$ w.r.t. $\theta$ (Assumption A2) |
| $L_{w\theta}$ | Bound on cross-partial $\lVert \nabla^2_{w\theta} \mathcal{L}_{\text{model}} \rVert$ (Assumptions A2, A5) |
| $\mu_F$ | Strong-convexity constant of bilevel objective $F$ (Assumption A7) |
| $L_F$ | Bilevel Lipschitz constant: $L_F = L_\theta + L_{w\theta}^2/\mu_w$ |
| $G$ | Uniform bound on corrected hypergradients: $\lVert g_t^{\mathrm{IGT}} \rVert \leq G$ (Assumption A6) |
| $G_{\mathrm{true}}$ | Exact hypergradient bound: $G_{\mathrm{true}} = \sup_\theta \lVert \nabla F(\theta) \rVert$ (Assumption A6) |
| $C_0$ | Cross-objective sensitivity: $C_0 = L_{w\theta}/\mu_w$ (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(a)) |
| $\rho_{\mathrm{cpl}}, C_\rho$ | Coupling ratio $\rho_{\mathrm{cpl}} = L_{w\theta}^2/(\mu_w^2 \mu_F) < 1$ and multiplier $C_\rho = (1 - \rho_{\mathrm{cpl}})^{-1}$ |
| $C_1, \kappa$ | Proof-local sensitivity constants in Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3): $C_1 = L_{\theta w}/\mu_w$, $\kappa = L_g C_1$ |
| $\kappa_w$ | Inner-problem condition number: $\kappa_w = L_w/\mu_w$ |
| $\delta$ | CG solver tolerance (adjoint solve cost $O(q^2 \kappa_w \ln(1/\delta))$) |
| **Algorithm-specific quantities** | |
| $\eta_0$ | Base (initial) step size |
| $\eta_t$ | Adaptive step size at round $t$: $\eta_t = \eta_0/\sqrt{1 + \beta\bar{\sigma}_t}$ |
| $\eta_w$ | Inner step size (Algorithm [1](https://arxiv.org/html/2605.12693#alg1) line 6) |
| $\beta$ | Delay sensitivity ratio: $\beta = \lVert L_{\mathrm{IGT}} \rVert / \lambda_{\min}(H_F)$ |
| $H_w$ | Inner Hessian: $H_w = \nabla^2_{ww} \mathcal{L}_{\text{model}}(w^*;\theta)$ |
| $v_t^* \in \mathbb{R}^q$ | Adjoint vector: $H_w v_t^* = \nabla_w \mathcal{L}_{\text{true}}(w_t;\theta_t)$ |
| $g_t^{\mathrm{IGT}} \in \mathbb{R}^p$ | IGT-corrected hypergradient (Eq. ([4](https://arxiv.org/html/2605.12693#S3.E4))) |
| $\mathcal{B}$ | Replay buffer storing $(w_s, v_s^*, g_s(\theta_{t-1}))$ for $s \in Q_t$ |
| $H_F$ | Hessian of bilevel objective $F$ at equilibrium |
| $L_{\mathrm{IGT}}$ | IGT delay coupling matrix (linearized bilevel dynamics) |
| **Regret decomposition (Theorem 3)** | |
| $R_1$ | Delay error: $O(\eta_0^2 G^2 T \sigma_{\max})$ |
| $R_2$ | Inner-solver bias: $O(T \epsilon_{\mathrm{inner}}^2)$ |
| $R_3$ | Interaction (transport residual): $O\!\left(\sum_t \sum_{s \in Q_t} \lVert \theta_{s+1} - \theta_s \rVert^2\right)$ |
| $C_{\mathrm{ap}}$ | Interaction prefactor in Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3); see Remark [9](https://arxiv.org/html/2605.12693#Thmremark9) |
| $R_{\mathrm{sq}}$ | Squared total drift (experimental metric): $R_{\mathrm{sq}} = \sum_t \lVert \theta_t - \theta_{t-d} \rVert^2$ |
| **[DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) / stability analysis (Propositions [1](https://arxiv.org/html/2605.12693#Thmproposition1)–[2](https://arxiv.org/html/2605.12693#Thmproposition2))** | |
| $F(\theta)$ | Bilevel objective: $F(\theta) = \mathcal{L}_{\text{true}}(w^*(\theta);\theta)$ |
| $\gamma$ | Continuous-time convergence rate: $\gamma = c\,\eta_0/\sqrt{1 + \beta\sigma_{\max}}$ |
| $\bar{\tau}$ | Representative delay time in the continuous [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ($\bar{\tau} \leq \sigma\eta$) |
| $\xi(t)$ | Deviation from equilibrium: $\xi(t) = \theta(t) - \theta^*$ |
| $\kappa_F$ | Condition number of $H_F$: $\kappa_F = \lVert H_F \rVert / \mu_{\min}$ |

Abbreviations: DFL (decision-focused learning), OCO (online convex optimization), IGT (Implicit Gradient Transport), OMD (Online Mirror Descent), DDE (Delay Differential Equation).

## Appendix B Background

The following subsections collect established results from the literature that underpin our theoretical development\. All material presented here is drawn directly from the cited works; we include it to make the paper self\-contained for readers who may be unfamiliar with the relevant background, and make no claim of originality for any result in this background section\.

### B.1 Implicit Gradient Transport (IGT)

IGT was introduced by \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\] to correct for gradient staleness in online learning. Given a gradient $g_s = \nabla f_s(\theta_s)$ computed at a past iterate $\theta_s$, IGT constructs a corrected estimate at the current iterate $\theta_t$ via a telescoping sequence of $O(p)$ re-evaluations:

$$
g_s^{\mathrm{IGT}}(\theta_t) \;=\; g_s(\theta_s) + \sum_{k=s}^{t-1} \bigl[ g_s(\theta_{k+1}) - g_s(\theta_k) \bigr]. \tag{7}
$$

The correction exploits stored information $(w_s, v_s^*)$ to evaluate $g_s(\cdot)$ at any query point. The key property of IGT is the following transport error bound \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\]:

$$
\bigl\| g_s^{\mathrm{IGT}}(\theta_t) - \nabla f_s(\theta_t) \bigr\| \;\leq\; L \sum_{k=s}^{t-1} \| \theta_{k+1} - \theta_k \|^2, \tag{8}
$$

which depends on the *sum of squared per-step displacements* rather than the squared total displacement $\|\theta_t - \theta_s\|^2$. This distinction is the source of the factor-$\sigma_{\max}$ improvement established in Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c).
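As an illustration of the gap between the two quantities in the transport bound, the following minimal sketch assumes a one-dimensional trajectory made of $\sigma$ equal, aligned steps of length $\delta$. This toy setting, the function name `transport_terms`, and the numbers are our own assumptions and are not taken from the paper. In that case the sum of squared per-step displacements is $\sigma\delta^2$, while the squared total displacement is $\sigma^2\delta^2$.

```python
def transport_terms(steps):
    """Return (sum of squared per-step displacements, squared total displacement)
    for a 1-D trajectory given by its per-step moves."""
    sum_sq = sum(s * s for s in steps)
    total_sq = sum(steps) ** 2
    return sum_sq, total_sq


# Hypothetical trajectory: sigma equal steps of length delta, all in the same direction.
sigma, delta = 50, 0.5
sum_sq, total_sq = transport_terms([delta] * sigma)

# For equal, aligned steps the ratio of the two quantities equals sigma.
ratio = total_sq / sum_sq
print(sum_sq, total_sq, ratio)
```

When the steps are not aligned, the two quantities can be much closer, so the factor-$\sigma$ gap in this sketch should be read as a worst-case illustration rather than a typical value.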

### B.2 Online Mirror Descent (OMD)

Online Mirror Descent \[[22](https://arxiv.org/html/2605.12693#bib.bib22)\] generalizes projected gradient descent by replacing the Euclidean geometry with a Bregman divergence $D_\psi$ induced by a mirror map $\psi$. At each round $t$, the update takes the form:

$$
\theta_{t+1} = \operatorname{argmin}_{\theta \in \Theta} \bigl\langle g_t,\, \theta \bigr\rangle + \frac{1}{\eta_t}\, D_\psi(\theta,\, \theta_t). \tag{9}
$$

When $\psi = \tfrac{1}{2}\|\cdot\|^2$, the update reduces to projected gradient descent. Under standard assumptions, OMD achieves regret $O\!\left(D_\psi/\eta_T + G^2 \sum_t \eta_t\right)$; the choice $\eta_t = c/\sqrt{t}$ yields the minimax-optimal $O(\sqrt{T})$ regret rate \[[22](https://arxiv.org/html/2605.12693#bib.bib22)\].
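The OMD update has a closed form for two common mirror maps, sketched below. The sketch assumes a box constraint $[lo, hi]^p$ for the Euclidean case and the probability simplex for the negative-entropy case; these domains and the function names are illustrative choices of ours, not the configuration used in the paper.

```python
import math


def omd_step_euclidean(theta, grad, eta, lo, hi):
    """OMD step with psi = 0.5 * ||.||^2 on the box [lo, hi]^p.
    The argmin is the projection of theta - eta * grad, i.e. coordinate-wise clipping."""
    return [min(hi, max(lo, th - eta * g)) for th, g in zip(theta, grad)]


def omd_step_entropic(theta, grad, eta):
    """OMD step with the negative-entropy mirror map on the probability simplex.
    The argmin is the multiplicative-weights (exponentiated-gradient) update."""
    weights = [th * math.exp(-eta * g) for th, g in zip(theta, grad)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


# Euclidean geometry: second coordinate would move to 1.1 and is clipped to 1.0.
print(omd_step_euclidean([0.5, 0.9], [1.0, -2.0], eta=0.1, lo=0.0, hi=1.0))

# Entropic geometry: mass shifts away from the coordinate with the larger gradient.
print(omd_step_entropic([0.5, 0.5], [1.0, 0.0], eta=1.0))
```

The entropic variant never leaves the simplex and needs no explicit projection, which is the usual reason for preferring a non-Euclidean mirror map on such domains.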

### B.3 Delay Differential Equations and Queue Length

The stability framework of \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\] analyses online algorithms under delayed feedback through the lens of delay differential equations (DDEs). The central observation is that, in the active-window embedding used here, algorithmic stability depends on the *queue length* $\sigma_t = |Q_t|$ (the number of feedback rounds outstanding at time $t$) rather than on the raw delay alone. Linearising the bilevel dynamics around the equilibrium $\theta^*$ yields the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde):

$$
\dot{\xi}(t) = -\eta(t)\, H_F\, \xi(t) + \eta(t)\, L_{\mathrm{IGT}}\, \sigma(t)\, \xi(t - \bar{\tau}), \tag{10}
$$

where $\xi(t) = \theta(t) - \theta^*$ denotes the deviation from equilibrium, $H_F = \nabla^2 F(\theta^*)$ is the bilevel Hessian, and $L_{\mathrm{IGT}}$ is the IGT coupling matrix. The adaptive step size (Section [3.2](https://arxiv.org/html/2605.12693#S3.SS2)) is calibrated to maintain $\mathrm{Re}(\lambda) < 0$ for all characteristic roots $\lambda$ of this DDE.

## Appendix C Algorithmic Notes

The full IGT-OMD pseudocode is given as Algorithm [1](https://arxiv.org/html/2605.12693#alg1) in the main text (Section [3](https://arxiv.org/html/2605.12693#S3)). Here we record additional implementation details deferred from the main body.

Memory layout. The buffer $\mathcal{B}$ stores at most $\sigma_{\max}$ tuples $(w_s, v_s^*, g_s) \in \mathbb{R}^q \times \mathbb{R}^q \times \mathbb{R}^p$, requiring $O(\sigma_{\max}(p + 2q))$ memory. A FIFO ring buffer suffices and gives $O(1)$ amortized eviction. When non-Euclidean geometries are used, line 20 of Algorithm [1](https://arxiv.org/html/2605.12693#alg1) is the dual-space update $\theta_{t+1} = \nabla\psi^*(\nabla\psi(\theta_t) - \eta_t\, g^{\mathrm{IGT}})$, where $\psi^*$ is the Fenchel conjugate \[[3](https://arxiv.org/html/2605.12693#bib.bib19)\] of the mirror map $\psi$.
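One possible realization of such a ring buffer is sketched below using Python's `collections.deque` with a fixed `maxlen`. The class name `HypergradientBuffer`, its methods, and the assumption that entries are evicted strictly in arrival order are ours; the paper's implementation may differ.

```python
from collections import deque


class HypergradientBuffer:
    """Hypothetical FIFO ring buffer holding at most sigma_max tuples (s, w_s, v_s, g_s)."""

    def __init__(self, sigma_max):
        # A deque with maxlen drops the oldest entry in O(1) when a new one is appended.
        self._items = deque(maxlen=sigma_max)

    def push(self, s, w_s, v_s, g_s):
        """Store the inner solution, adjoint, and hypergradient for round s."""
        self._items.append((s, w_s, v_s, g_s))

    def rounds(self):
        """Return the round indices currently stored, oldest first."""
        return [item[0] for item in self._items]

    def __len__(self):
        return len(self._items)


# Push five rounds into a buffer of capacity three; the two oldest are evicted.
buf = HypergradientBuffer(sigma_max=3)
for s in range(1, 6):
    buf.push(s, [0.0], [0.0], [0.0])
print(len(buf), buf.rounds())
```

With variable delays, feedback may arrive out of order, in which case a keyed structure (for example a dictionary indexed by round) could be more appropriate than strict FIFO eviction.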

Conjugate Gradient warm-starting. The Conjugate Gradient adjoint solve on line 13 of Algorithm [1](https://arxiv.org/html/2605.12693#alg1) is warm-started from the previous round's adjoint $v_{t-1}^*$. For slowly-varying $\theta_t$, this typically reduces the number of Conjugate Gradient iterations by $3$–$5\times$ in our experiments, keeping the constant in $O(q^2 \kappa_w)$ small.
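The effect of warm-starting can be sketched with a textbook Conjugate Gradient loop on a toy problem. Everything below is an illustrative assumption of ours: the diagonal $3 \times 3$ Hessian, the right-hand sides, and the function name `conjugate_gradient`. The sketch does not reproduce the iteration savings reported in the experiments.

```python
def conjugate_gradient(matvec, b, x0, tol=1e-10, max_iter=100):
    """Solve H x = b for symmetric positive-definite H given as a matvec.
    Returns the solution and the number of iterations used."""
    x = list(x0)
    r = [bi - hi for bi, hi in zip(b, matvec(x))]  # initial residual
    p = list(r)
    rs = sum(ri * ri for ri in r)
    iters = 0
    while rs ** 0.5 > tol and iters < max_iter:
        Hp = matvec(p)
        alpha = rs / sum(pi * hpi for pi, hpi in zip(p, Hp))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, Hp)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
        iters += 1
    return x, iters


# Toy inner Hessian (hypothetical): H = diag(1, 2, 4).
diag = [1.0, 2.0, 4.0]
matvec = lambda v: [d * vi for d, vi in zip(diag, v)]

v_prev = [1.0, 1.0, 1.0]   # previous round's adjoint: solves H v = [1, 2, 4]
b_new = [1.001, 2.0, 4.0]  # slightly perturbed right-hand side for the new round

v_cold, cold_iters = conjugate_gradient(matvec, b_new, [0.0, 0.0, 0.0])
v_warm, warm_iters = conjugate_gradient(matvec, b_new, v_prev)
print(cold_iters, warm_iters)
```

In this toy case the warm start begins with a residual confined to a single eigendirection, so it finishes in fewer iterations than the cold start; real problems will show a less clean separation.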

## Appendix D Full Assumptions

###### Assumption 1 (Inner convexity and smoothness (A1)).

$\mathcal{L}_{\text{model}}(\cdot;\theta)$ is $\mu_w$-strongly convex and $L_w$-smooth for all $\theta \in \Theta$; both $\mathcal{L}_{\text{model}}$ and $\mathcal{L}_{\text{true}}$ are twice continuously differentiable jointly in $(w, \theta)$. The strong convexity ensures a unique inner minimizer $w^*(\theta)$ for each $\theta$ and guarantees invertibility of $H_w = \nabla^2_{ww} \mathcal{L}_{\text{model}}(w^*;\theta)$.

###### Assumption 2 (Outer smoothness (A2)).

$\mathcal{L}_{\text{true}}(w;\cdot)$ is $L_\theta$-smooth: $\|\nabla_\theta \mathcal{L}_{\text{true}}(w;\theta) - \nabla_\theta \mathcal{L}_{\text{true}}(w;\theta')\| \leq L_\theta \|\theta - \theta'\|$ for all $w, \theta, \theta'$. The cross-partial derivatives satisfy $\|\nabla^2_{w\theta} \mathcal{L}_{\text{model}}(w;\theta)\| \leq L_{w\theta}$ for all $(w, \theta)$.

###### Assumption 3 (Inner-solver quality (A3)).

The warm-started inner solver achieves $\|w_t - w^*(\theta_t)\| \leq \epsilon_{\mathrm{inner}}$ for all $t$, and the executed decision has local excess loss $\mathcal{L}_{\text{true}}(w_t;\theta_t) - \mathcal{L}_{\text{true}}(w^*(\theta_t);\theta_t) \leq \epsilon_{\mathrm{inner}}^2$ after absorbing the local quadratic constant into $\epsilon_{\mathrm{inner}}$. Running $K = O(\ln(1/\epsilon_{\mathrm{inner}}))$ gradient descent steps from $w_{t-1}$ suffices when $\mathcal{L}_{\text{model}}$ is $\mu_w$-strongly convex and $L_w$-smooth and the decision loss is locally quadratic around $w^*(\theta_t)$.
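The logarithmic step count can be made concrete for a quadratic inner loss, where gradient descent with step size $1/L_w$ contracts the distance to the minimizer by $\rho = 1 - \mu_w/L_w$ per step. The sketch below assumes exactly this quadratic setting with a diagonal Hessian; the function names and constants are hypothetical and chosen only for illustration.

```python
import math


def steps_for_tolerance(mu_w, L_w, initial_error, eps_inner):
    """Smallest K with rho**K * initial_error <= eps_inner, where rho = 1 - mu_w / L_w.
    Assumes a quadratic inner loss and gradient-descent step size 1 / L_w."""
    if initial_error <= eps_inner:
        return 0
    rho = 1.0 - mu_w / L_w
    return math.ceil(math.log(initial_error / eps_inner) / math.log(1.0 / rho))


def gd_error_after(K, diag_hessian, w0, step):
    """Run K gradient-descent steps on 0.5 * sum(h_i * w_i**2) (minimizer at 0)
    and return the remaining distance to the minimizer."""
    w = list(w0)
    for _ in range(K):
        w = [wi - step * h * wi for wi, h in zip(w, diag_hessian)]
    return math.sqrt(sum(wi * wi for wi in w))


# Hypothetical inner problem with mu_w = 1 and L_w = 4, started at distance sqrt(2).
mu_w, L_w, eps_inner = 1.0, 4.0, 1e-3
K_needed = steps_for_tolerance(mu_w, L_w, math.sqrt(2.0), eps_inner)
final_error = gd_error_after(K_needed, [mu_w, L_w], [1.0, 1.0], step=1.0 / L_w)
print(K_needed, final_error)
```

Warm-starting from $w_{t-1}$ shrinks `initial_error`, which lowers the required $K$ further when consecutive $\theta_t$ are close.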

###### Assumption 4 (Bounded queue (A4)).

The feedback delay sequence $\{d_t\}$ has bounded maximum queue length: $\sigma_t = |Q_t| \leq \sigma_{\max} < \infty$ for all $t$. This subsumes fixed delays ($d_t = d$, $\sigma_{\max} = d$), i.i.d. delays ($\sigma_{\max} =$ 99th-percentile queue length), and adversarial delays satisfying $d_t \leq D$ (giving $\sigma_{\max} \leq D$).
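The queue-length definition $Q_t = \{s \leq t : s + d_s > t\}$ can be evaluated directly from a delay sequence. The following sketch (function name `queue_lengths` is ours) does so by brute force and can be used to check, for instance, that a fixed delay $d$ gives $\sigma_{\max} = d$.

```python
def queue_lengths(delays):
    """Given delays[s - 1] = d_s for rounds s = 1..T, return [sigma_1, ..., sigma_T]
    with sigma_t = |{s <= t : s + d_s > t}|."""
    T = len(delays)
    return [
        sum(1 for s in range(1, t + 1) if s + delays[s - 1] > t)
        for t in range(1, T + 1)
    ]


# Fixed delay d = 3 over T = 10 rounds: the queue fills up to d and stays there.
fixed = queue_lengths([3] * 10)
print(fixed, max(fixed))

# Zero delay: feedback is immediate, so the queue is always empty.
print(queue_lengths([0] * 5))
```

The brute-force evaluation costs $O(T^2)$ and is meant only for checking small examples.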

###### Assumption 5 (Cross-partial Lipschitz (A5)).

The mixed second derivative satisfies $\|\nabla^2_{w\theta} \mathcal{L}_{\text{model}}(w;\theta) - \nabla^2_{w\theta} \mathcal{L}_{\text{model}}(w';\theta')\| \leq L_{w\theta}(\|w - w'\| + \|\theta - \theta'\|)$. The map $w \mapsto [\nabla^2_{ww} \mathcal{L}_{\text{model}}(w;\theta)]^{-1}$ is $(L_w/\mu_w^2)$-Lipschitz in $w$. Together, these ensure that the per-round IGT re-evaluation cost (Eq. ([2](https://arxiv.org/html/2605.12693#S2.E2))) is $O(pq)$ (no inner solve required).

###### Assumption 6 (Bounded hypergradients (A6)).

The IGT-corrected hypergradients are bounded: $\|g_t^{\mathrm{IGT}}\| \leq G$ for all $t$ and $\theta \in \Theta$. By the chain rule and triangle inequality, $G \leq G_{\mathrm{true}} + (L_{w\theta}/\mu_w)\epsilon_{\mathrm{inner}}$, where $G_{\mathrm{true}} = \sup_\theta \|\nabla F(\theta)\|$ bounds the exact hypergradient.

###### Assumption 7 (Bilevel convexity (A7)).

The bilevel objective $F(\theta) = \mathcal{L}_{\text{true}}(w^*(\theta);\theta)$ is $\mu_F$-strongly convex. The bilevel coupling condition $L_{w\theta}^2/(\mu_w^2 \mu_F) < 1$ ensures that the feedback from the inner minimizer does not destabilize the outer-level optimization. This assumption holds when the inner problem is well-conditioned ($\mu_w$ large) or the cross-coupling is weak ($L_{w\theta}$ small).

### D.1 Assumption Verification for Experimental Benchmarks

We check the key assumptions (A1–A5, and A7 where applicable) against the three main benchmarks, noting where an assumption holds only locally or only in a surrogate sense.

LQR (Section [5.1](https://arxiv.org/html/2605.12693#S5.SS1)). The inner problem minimizes the quadratic LQR proxy over linear control gains $K$ for the learned dynamics $(\hat{A}, \hat{B})$, which is strongly convex in $K$ when $R \succ 0$ (A1). The outer loss is smooth in $\theta = (\hat{A}, \hat{B})$ since the closed-loop proxy is polynomial in $(K, \theta)$ (A2). The inner solver uses the fixed $K = 10$ differentiable gradient-descent configuration in Table [6](https://arxiv.org/html/2605.12693#A6.T6), so $\epsilon_{\mathrm{inner}}$ is uniformly controlled (A3). Delays are constant with $\sigma_{\max} = d$ (A4). The cross-partials $\nabla^2_{K\theta} J$ are Lipschitz because $J$ is polynomial in $(K, \theta)$ (A5). The bilevel objective $F(\theta)$ is locally strongly convex in the stable neighborhood explored by the binary search (A7).

Sinkhorn OT (Sections [5.3](https://arxiv.org/html/2605.12693#S5.SS3)–[5.4](https://arxiv.org/html/2605.12693#S5.SS4)). The inner problem runs $K$ Sinkhorn iterations with entropic regularization $\varepsilon = 0.05$, which ensures the regularized objective is $\varepsilon$-strongly convex in the coupling matrix (A1, with $\mu_w = \varepsilon$). The outer loss is smooth in the neural network parameters $\theta$ (A2). The inner-solver error decays geometrically: $\epsilon_{\mathrm{inner}} \leq \rho^K$ with $\rho = 1 - \varepsilon/L_w < 1$, so $K = 10$ iterations yield $\epsilon_{\mathrm{inner}} \approx 10^{-3}$ (A3). Delays are constant with $\sigma_{\max} = d$ (A4). The Sinkhorn kernel is $C^\infty$ in $(w, \theta)$, ensuring Lipschitz cross-partials (A5). Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7) (strong convexity of $F$) holds only locally under the neural-network parameterization; this is the regime in which our local bounds apply. Empirically, the transport benefit persists in this non-convex regime, suggesting the assumption may be conservative. Formal non-convex extensions are deferred to future work (Section [6](https://arxiv.org/html/2605.12693#S6)).

Warcraft shortest path (Section [5.2](https://arxiv.org/html/2605.12693#S5.SS2)). Dijkstra's algorithm returns exact solutions, so the inner problem is solved exactly rather than approximately ($\epsilon_{\mathrm{inner}} = 0$; A3 holds trivially). However, the combinatorial inner solver is non-differentiable, so A1 (smoothness) does not hold in the classical sense; we use the differentiable perturbation approach of \[[33](https://arxiv.org/html/2605.12693#bib.bib20)\] to define smooth surrogate gradients. The outer loss is smooth in $\theta$ (A2), and delays have bounded queue length (A4). Since Dijkstra's algorithm lies outside the gradient computation graph, the bilevel-specific assumptions (A5, A7) are not required for the Warcraft experiment; the transport benefit arises solely from outer-level staleness correction, as discussed in Section [5.2](https://arxiv.org/html/2605.12693#S5.SS2).

## Appendix E Proofs

### E.1 Proof of Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1) (Bilevel Convergence)

The proof proceeds in three stages: (i) a standard OMD regret decomposition using the IGT-corrected gradients $g_t^{\mathrm{IGT}}$; (ii) a lemma bounding the transport error when IGT is applied to bilevel *hyper*gradients; and (iii) a lemma bounding the inner-solver bias via strong convexity and Young's inequality. The key insight is that IGT re-evaluation replaces the squared total drift with a sum of squared per-step changes, and the inner-solver error contributes only through an *additive* $\epsilon_{\mathrm{inner}}^2$ term because Young's inequality decouples the cross-term between inner bias and outer staleness.

##### Supporting lemma: OMD with imperfect gradients\.

###### Lemma 1 (OMD regret with time-varying step sizes).

Let $\psi: \Theta \to \mathbb{R}$ be $1$-strongly convex with respect to $\|\cdot\|$, and let $D_\psi(\theta, \theta') = \psi(\theta) - \psi(\theta') - \langle \nabla\psi(\theta'), \theta - \theta' \rangle$ be the induced Bregman divergence. Let $\{h_t\}_{t=1}^{T}$ be any sequence with $\|h_t\|_* \leq G$, and let the OMD iterates be $\theta_{t+1} = \operatorname{argmin}_{\theta \in \Theta} \langle h_t, \theta \rangle + (1/\eta_t) D_\psi(\theta, \theta_t)$ with *arbitrary* positive step sizes $\eta_t > 0$. Then for any $\theta^* \in \Theta$:

$$
\sum_{t=1}^{T} \langle h_t, \theta_t - \theta^* \rangle \;\leq\; \frac{D_\psi(\theta^*, \theta_1)}{\eta_1} + \sum_{t=2}^{T} D_\psi(\theta^*, \theta_t) \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right)_{+} + \frac{G^2}{2} \sum_{t=1}^{T} \eta_t, \tag{11}
$$

where $(x)_+ = \max(x, 0)$. When step sizes are non-increasing, each positive part equals the plain difference $1/\eta_t - 1/\eta_{t-1}$; if in addition $D_\psi(\theta^*, \theta_t) \leq D_{\max}$ for all $t$, the first two terms telescope and the bound reduces to $D_{\max}/\eta_T + (G^2/2) \sum_t \eta_t$.

###### Proof\.

The three-point identity for Bregman divergences (see, e.g., \[[2](https://arxiv.org/html/2605.12693#bib.bib23)\]) gives, for the OMD step at round $t$:

$$
\langle h_t, \theta_t - \theta^* \rangle \;\leq\; \frac{D_\psi(\theta^*, \theta_t) - D_\psi(\theta^*, \theta_{t+1})}{\eta_t} + \frac{\eta_t}{2} \|h_t\|_*^2.
$$

Summing over $t = 1, \ldots, T$ and re-grouping the telescoping sum (following the time-varying step-size analysis of \[[15](https://arxiv.org/html/2605.12693#bib.bib29)\], Appendix A):

$$
\sum_{t} \langle h_t, \theta_t - \theta^* \rangle \;\leq\; \frac{D_\psi(\theta^*, \theta_1)}{\eta_1} + \sum_{t=2}^{T} D_\psi(\theta^*, \theta_t) \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) + \frac{G^2}{2} \sum_{t} \eta_t.
$$

When $1/\eta_t < 1/\eta_{t-1}$ (i.e., $\eta_t > \eta_{t-1}$), the corresponding term is negative and can be dropped, yielding ([11](https://arxiv.org/html/2605.12693#A5.E11)). ∎
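For readers who want the regrouping step of the proof of Lemma 1 written out, the display below gives it as a sketch. The shorthand $D_t = D_\psi(\theta^*, \theta_t)$ is introduced only for this display and is not notation from the paper; the manipulation is a standard summation by parts, followed by dropping the non-positive last term and bounding each difference by its positive part (using $D_t \geq 0$).

```latex
\begin{aligned}
\sum_{t=1}^{T} \frac{D_t - D_{t+1}}{\eta_t}
  &= \sum_{t=1}^{T} \frac{D_t}{\eta_t} - \sum_{t=2}^{T+1} \frac{D_t}{\eta_{t-1}} \\
  &= \frac{D_1}{\eta_1}
     + \sum_{t=2}^{T} D_t \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right)
     - \frac{D_{T+1}}{\eta_T} \\
  &\leq \frac{D_1}{\eta_1}
     + \sum_{t=2}^{T} D_t \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right)_{+}.
\end{aligned}
```

Adding the summed quadratic terms $\sum_t (\eta_t/2) \|h_t\|_*^2 \leq (G^2/2) \sum_t \eta_t$ to the right-hand side recovers the stated bound.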

##### Supporting lemma: IGT error for hypergradients\.

###### Lemma 2 (IGT transport error for bilevel hypergradients).

Under Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1)–[6](https://arxiv.org/html/2605.12693#Thmassumption6), let $\bar{g}_t = \nabla_\theta F(\theta_t)$ denote the exact gradient of the bilevel objective at $\theta_t$ (with exact inner solution $w^*(\theta_t)$), and let $g_t^{\mathrm{IGT}}$ be the IGT estimator formed by Algorithm [1](https://arxiv.org/html/2605.12693#alg1) using stored $(w_s, v_s^*)$ from past rounds $s \in Q_t$. Then:

$$
\left\| \bar{g}_t - g_t^{\mathrm{IGT}} \right\| \;\leq\; L_F \sum_{s \in Q_t} \left\| \theta_{s+1} - \theta_s \right\|^2 \;+\; \frac{L_{w\theta}}{\mu_w}\, \epsilon_{\mathrm{inner}}, \tag{12}
$$

where $L_F = L_\theta + L_{w\theta}^2/\mu_w$ is the Lipschitz constant of $\nabla_\theta F$ (the bilevel gradient Lipschitz constant, identical to the $L$ in Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1)), $L_{w\theta}$ is the cross-partial constant from Assumption [2](https://arxiv.org/html/2605.12693#Thmassumption2) (which also serves as the Lipschitz constant in Assumption 5), and $\mu_w$ is the inner strong convexity constant from Assumption [1](https://arxiv.org/html/2605.12693#Thmassumption1).

###### Proof\.

Decompose the error into a transport part and an inner\-solver part:

$$
\left\| \bar{g}_t - g_t^{\mathrm{IGT}} \right\| \;\leq\; \underbrace{\left\| \bar{g}_t - g_t^{\mathrm{trans}} \right\|}_{\text{transport error}} \;+\; \underbrace{\left\| g_t^{\mathrm{trans}} - g_t^{\mathrm{IGT}} \right\|}_{\text{inner-solver error}}, \tag{13}
$$

where $g_t^{\mathrm{trans}}$ is the IGT estimator that uses the *exact* inner solutions $w^*(\theta_s)$ for all $s \in Q_t$ (a hypothetical estimator not computed by the algorithm).

Transport error. The re-evaluation formula (Eq. ([2](https://arxiv.org/html/2605.12693#S2.E2))) defines $g_s(\theta) = \nabla_\theta \mathcal{L}_{\text{true}}(w^*(\theta_s);\theta) - [\nabla_\theta \nabla_w \mathcal{L}_{\text{model}}(w^*(\theta_s);\theta)]^\top v_s^*$ as a function of $\theta$ alone (with frozen $w^*(\theta_s)$ and $v_s^*$), which is $L_F$-smooth in $\theta$ by Assumption [2](https://arxiv.org/html/2605.12693#Thmassumption2). Applying the original IGT error bound of \[[1](https://arxiv.org/html/2605.12693#bib.bib15)\] (their Proposition 1, extended to the hypergradient setting) to the sequence $g_s(\cdot)$ for each $s \in Q_t$ gives:

$$
\left\| \bar{g}_t - g_t^{\mathrm{trans}} \right\| \;\leq\; L_F \sum_{s \in Q_t} \left\| \theta_{s+1} - \theta_s \right\|^2.
$$

The bound follows from the fact that the telescoping correction $\sum_{s \in Q_t} [g_s(\theta_t) - g_s(\theta_{t-1})]$ approximates the ideal correction $\sum_{s \in Q_t} [g_s(\theta_t) - g_s(\theta_s)]$ up to a residual that is bounded by the sum of squared per-step displacements $\|\theta_{s+1} - \theta_s\|^2$, via the mean-value theorem applied to the $L_F$-smooth functions $g_s(\cdot)$.

Inner-solver error. For each $s\in Q_{t}$, the algorithm uses $w_{s}\approx w^{*}(\theta_{s})$ instead of the exact solution. The difference in the re-evaluated hypergradient satisfies:

$$\left\|g_{s}^{\mathrm{trans}}(\theta_{t})-g_{s}^{\mathrm{IGT}}(\theta_{t})\right\| \;\leq\; L_{w\theta}\left\|w_{s}-w^{*}(\theta_{s})\right\| \;\leq\; L_{w\theta}\,\epsilon_{\mathrm{inner}},$$

using the $L_{w\theta}$-Lipschitz cross-partial from Assumption [2](https://arxiv.org/html/2605.12693#Thmassumption2) and the inner-solver bound from Assumption [3](https://arxiv.org/html/2605.12693#Thmassumption3). Since the IGT sum involves at most $\sigma_{t}\leq\sigma_{\max}$ outstanding rounds, and noting that the errors for individual $s\in Q_{t}$ collapse into the single correction increment $\Delta_{s}=g_{s}(\theta_{t})-g_{s}(\theta_{t-1})$, the total contribution of inner-solver error to the estimator $g_{t}^{\mathrm{IGT}}$ is bounded by $(L_{w\theta}/\mu_{w})\,\epsilon_{\mathrm{inner}}$ after invoking strong convexity of the inner problem (Assumption [1](https://arxiv.org/html/2605.12693#Thmassumption1)) to relate $\left\|w_{s}-w^{*}(\theta_{t})\right\|\leq\epsilon_{\mathrm{inner}}+(L_{w\theta}/\mu_{w})\left\|\theta_{s}-\theta_{t}\right\|$, the latter term being absorbed into the transport error already accounted for. Combining the two bounds yields ([12](https://arxiv.org/html/2605.12693#A5.E12)). ∎

##### Supporting lemma: inner\-solver bias via Young’s inequality\.

###### Lemma 3 (Inner-solver bias accumulates as $\epsilon_{\mathrm{inner}}^{2}$).

Under the same assumptions, with $\rho_{\mathrm{cpl}}:=L_{w\theta}^{2}/(\mu_{w}^{2}\mu_{F})<1$, the cumulative contribution of the inner-solver error satisfies:

$$\sum_{t=1}^{T}\langle\bar{g}_{t}-g_{t}^{\mathrm{IGT}},\,\theta_{t}-\theta^{*}\rangle \;\leq\; 2T\,\epsilon_{\mathrm{inner}}^{2} \;+\; L_{F}\sum_{t=1}^{T}\sum_{s\in Q_{t}}\left\|\theta_{s+1}-\theta_{s}\right\|^{2} \;+\; \rho_{\mathrm{cpl}}\,\mathrm{Regret}_{T}^{\mathrm{dec}}. \tag{14}$$

###### Proof\.

By Cauchy–Schwarz [[30](https://arxiv.org/html/2605.12693#bib.bib38)] and Lemma [2](https://arxiv.org/html/2605.12693#Thmlemma2):

$$\langle\bar{g}_{t}-g_{t}^{\mathrm{IGT}},\theta_{t}-\theta^{*}\rangle \;\leq\; \left\|\bar{g}_{t}-g_{t}^{\mathrm{IGT}}\right\|\,\left\|\theta_{t}-\theta^{*}\right\| \;\leq\; \left[L_{F}\sum_{s\in Q_{t}}\left\|\theta_{s+1}-\theta_{s}\right\|^{2}+\frac{L_{w\theta}}{\mu_{w}}\epsilon_{\mathrm{inner}}\right]\cdot\left\|\theta_{t}-\theta^{*}\right\|.$$

For the inner-bias cross-term, apply Young's inequality with $c=1$: $\frac{L_{w\theta}}{\mu_{w}}\epsilon_{\mathrm{inner}}\cdot\left\|\theta_{t}-\theta^{*}\right\|\leq\frac{1}{2}\epsilon_{\mathrm{inner}}^{2}+\frac{L_{w\theta}^{2}}{2\mu_{w}^{2}}\left\|\theta_{t}-\theta^{*}\right\|^{2}$. (Young's inequality is applied only to the inner-bias term, not the transport term $L_{F}\sum_{s\in Q_{t}}\left\|\theta_{s+1}-\theta_{s}\right\|^{2}$, because the transport term already has the sum-of-squares structure and is bounded by $\sigma_{t}\eta_{0}^{2}G^{2}\cdot\mathrm{diam}(\Theta)$ without requiring separation.) By the $\mu_{F}$-strong convexity of $F$ (Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7)), we have $\left\|\theta_{t}-\theta^{*}\right\|^{2}\leq(2/\mu_{F})[F(\theta_{t})-F(\theta^{*})]$. The resulting term is $\rho_{\mathrm{cpl}}[F(\theta_{t})-F(\theta^{*})]$, which is absorbed in the main proof; summing over $t$ gives ([14](https://arxiv.org/html/2605.12693#A5.E14)). ∎
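The inner-bias part of this proof can be laid out as a single chain. The display below only restates the steps above with the same symbols; it adds no new assumption.

```latex
\begin{align*}
\frac{L_{w\theta}}{\mu_{w}}\,\epsilon_{\mathrm{inner}}\,\|\theta_{t}-\theta^{*}\|
  &\le \tfrac{1}{2}\,\epsilon_{\mathrm{inner}}^{2}
     + \frac{L_{w\theta}^{2}}{2\mu_{w}^{2}}\,\|\theta_{t}-\theta^{*}\|^{2}
  && \text{(Young: } xy \le \tfrac{1}{2}x^{2}+\tfrac{1}{2}y^{2}\text{)} \\
  &\le \tfrac{1}{2}\,\epsilon_{\mathrm{inner}}^{2}
     + \frac{L_{w\theta}^{2}}{2\mu_{w}^{2}}\cdot\frac{2}{\mu_{F}}\,
       \bigl[F(\theta_{t})-F(\theta^{*})\bigr]
  && \text{(}\mu_{F}\text{-strong convexity)} \\
  &= \tfrac{1}{2}\,\epsilon_{\mathrm{inner}}^{2}
     + \rho_{\mathrm{cpl}}\,\bigl[F(\theta_{t})-F(\theta^{*})\bigr].
\end{align*}
```

Summing over $t=1,\dots,T$ gives $\tfrac{1}{2}T\,\epsilon_{\mathrm{inner}}^{2}+\rho_{\mathrm{cpl}}\,\mathrm{Regret}_{T}^{\mathrm{dec}}$, which is no larger than the $2T\,\epsilon_{\mathrm{inner}}^{2}+\rho_{\mathrm{cpl}}\,\mathrm{Regret}_{T}^{\mathrm{dec}}$ appearing in (14).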

##### Main proof of Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1).

###### Proof of Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1).

Let $\theta^{*}=\operatorname{argmin}_{\theta}\sum_{t}\mathcal{L}_{\text{true}}(w^{*}(\theta);\theta)$ and $F(\theta)=\mathcal{L}_{\text{true}}(w^{*}(\theta);\theta)$.

Step 1 (convexity decomposition). By convexity of $F$ (Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7)):

$$\begin{aligned}
\mathrm{Regret}_{T}^{\mathrm{dec}} &= \sum_{t=1}^{T}[F(\theta_{t})-F(\theta^{*})] \;\leq\; \sum_{t=1}^{T}\langle\bar{g}_{t},\theta_{t}-\theta^{*}\rangle \\
&= \underbrace{\sum_{t=1}^{T}\langle g_{t}^{\mathrm{IGT}},\theta_{t}-\theta^{*}\rangle}_{(I)} + \underbrace{\sum_{t=1}^{T}\langle\bar{g}_{t}-g_{t}^{\mathrm{IGT}},\theta_{t}-\theta^{*}\rangle}_{(II)}.
\end{aligned} \tag{15}$$

Step 2 (bound term (I) via OMD lemma). Algorithm [1](https://arxiv.org/html/2605.12693#alg1) performs an OMD step with gradient $g_{t}^{\mathrm{IGT}}$ and step size $\eta_{t}=\eta_{0}/\sqrt{1+\beta\bar{\sigma}_{t}}$. Since $\bar{\sigma}_{t}$ is nondecreasing, $\eta_{t}$ is non-increasing and Lemma [1](https://arxiv.org/html/2605.12693#Thmlemma1) applies with $h_{t}=g_{t}^{\mathrm{IGT}}$ and $\left\|g_{t}^{\mathrm{IGT}}\right\|_{*}\leq G$:

$$(I) \;\leq\; \frac{D_{\psi}(\theta^{*},\theta_{1})}{\eta_{1}}+\sum_{t=2}^{T}D_{\psi}(\theta^{*},\theta_{t})\left(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}}\right)_{+}+\frac{G^{2}}{2}\sum_{t=1}^{T}\eta_{t}. \tag{16}$$

Because $1/\eta_{t}$ is nondecreasing, the middle term telescopes over the range of $1/\eta_{t}$: since $D_{\psi}(\theta^{*},\theta_{t})\leq D_{\psi}$, we have $\sum_{t=2}^{T}D_{\psi}(1/\eta_{t}-1/\eta_{t-1})_{+}\leq D_{\psi}(1/\eta_{T}-1/\eta_{1})$. Combined with the first term: $(I)\leq D_{\psi}/\eta_{T}+(G^{2}/2)\sum_{t}\eta_{t}$.

We bound $\sum_{t}\eta_{t}$ and $1/\eta_{T}$ using the queue-length structure. Since $\eta_{t}=\eta_{0}/\sqrt{1+\beta\bar{\sigma}_{t}}\leq\eta_{0}$, we have the trivial bound $\sum_{t}\eta_{t}\leq T\eta_{0}$. Also $1/\eta_{T}\leq\sqrt{1+\beta\sigma_{\max}}/\eta_{0}$. Explicitly, since each $\eta_{t}\leq\eta_{0}$ and there are at most $T$ terms,

$$\sum_{t=1}^{T}\eta_{t}=\eta_{0}\sum_{t=1}^{T}\frac{1}{\sqrt{1+\beta\sigma_{t}}} \;\leq\; \eta_{0}\,T. \tag{17}$$

When $\sigma_{t}\asymp\sigma_{\max}$ for a constant fraction of rounds (the adversarial worst case), we obtain the tighter bound $\sum_{t}\eta_{t}\leq\eta_{0}T/\sqrt{1+\beta\sigma_{\max}}$. In the general case, let $T_{0}$ denote the number of rounds with $\sigma_{t}=0$ and $T_{+}=T-T_{0}$ the number of rounds with $\sigma_{t}\geq 1$. Then $\sum_{t}\eta_{t}\leq\eta_{0}T_{0}+\eta_{0}T_{+}/\sqrt{1+\beta}$. For the regret bound we use the worst case $\sum_{t}\eta_{t}\leq\eta_{0}T$:

$$(I) \;\leq\; \frac{2D_{\psi}\sqrt{1+\beta\sigma_{\max}}}{\eta_{0}}+\frac{\eta_{0}G^{2}T}{2}. \tag{18}$$

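The two step-size estimates used in Step 2 can be checked numerically. The sketch below is illustrative only: the queue-length trace is hypothetical, and treating $\bar{\sigma}_{t}$ as the running maximum of the queue lengths is an assumption made here so that $\bar{\sigma}_{t}$ is nondecreasing, not a definition taken from the paper.

```python
import math

def step_sizes(queue_lengths, eta0, beta):
    """Return eta_t = eta0 / sqrt(1 + beta * sigma_bar_t) for each round.

    Assumption (illustrative): sigma_bar_t is the running maximum of the
    observed queue lengths, which makes it nondecreasing.
    """
    etas, sigma_bar = [], 0
    for sigma in queue_lengths:
        sigma_bar = max(sigma_bar, sigma)
        etas.append(eta0 / math.sqrt(1.0 + beta * sigma_bar))
    return etas

# Hypothetical queue-length trace, chosen only for illustration.
queue_lengths = [0, 1, 3, 2, 5, 5, 4, 0, 2, 5]
eta0, beta = 0.1, 0.5
etas = step_sizes(queue_lengths, eta0, beta)
T, sigma_max = len(queue_lengths), max(queue_lengths)

sum_eta = sum(etas)         # to be compared with eta0 * T, as in (17)
inv_eta_T = 1.0 / etas[-1]  # to be compared with sqrt(1 + beta * sigma_max) / eta0
print(sum_eta, eta0 * T)
print(inv_eta_T, math.sqrt(1.0 + beta * sigma_max) / eta0)
```

On this trace the step sizes are non-increasing, so the telescoping argument for the middle term of (16) applies.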
Step 3 (bound term (II) via Lemmas [2](https://arxiv.org/html/2605.12693#Thmlemma2) and [3](https://arxiv.org/html/2605.12693#Thmlemma3)). By Lemma [3](https://arxiv.org/html/2605.12693#Thmlemma3):

$$(II) \;\leq\; 2T\,\epsilon_{\mathrm{inner}}^{2}+L_{F}\sum_{t=1}^{T}\sum_{s\in Q_{t}}\left\|\theta_{s+1}-\theta_{s}\right\|^{2}+\rho_{\mathrm{cpl}}\,\mathrm{Regret}_{T}^{\mathrm{dec}}. \tag{19}$$

Step 4 (transport term). The double sum in ([19](https://arxiv.org/html/2605.12693#A5.E19)) is the aggregate IGT transport penalty. Since the step-size bound gives $\left\|\theta_{t+1}-\theta_{t}\right\|\leq\eta_{t}\left\|g_{t}^{\mathrm{IGT}}\right\|\leq\eta_{0}G$, each summand satisfies $\left\|\theta_{s+1}-\theta_{s}\right\|^{2}\leq\eta_{0}^{2}G^{2}$. We retain this as-is in the main bound; it appears explicitly in ([5](https://arxiv.org/html/2605.12693#S4.E5)).

Step 5 (combine and optimize $\eta_{0}$). Combining ([E.1](https://arxiv.org/html/2605.12693#A5.Ex7)), ([18](https://arxiv.org/html/2605.12693#A5.E18)), and ([19](https://arxiv.org/html/2605.12693#A5.E19)), then moving $\rho_{\mathrm{cpl}}\,\mathrm{Regret}_{T}^{\mathrm{dec}}$ to the left:

$$\mathrm{Regret}_{T}^{\mathrm{dec}} \;\leq\; C_{\rho}\left[\frac{2D_{\psi}\sqrt{1+\beta\sigma_{\max}}}{\eta_{0}}+\frac{\eta_{0}G^{2}T}{2}+2T\,\epsilon_{\mathrm{inner}}^{2}+L_{F}\sum_{t=1}^{T}\sum_{s\in Q_{t}}\left\|\theta_{s+1}-\theta_{s}\right\|^{2}\right],$$

where $C_{\rho}=(1-\rho_{\mathrm{cpl}})^{-1}$. This is ([5](https://arxiv.org/html/2605.12693#S4.E5)).

Setting $\eta_{0}=c/\sqrt{T}$ (for a universal constant $c>0$) gives the first term as $O(D_{\psi}\sqrt{T\sigma_{\max}})$ and the second as $O(G^{2}\sqrt{T})$. The last sum, using $\left\|\theta_{s+1}-\theta_{s}\right\|^{2}\leq\eta_{0}^{2}G^{2}=O(G^{2}/T)$ and $\sum_{t}|Q_{t}|=\sum_{t}\sigma_{t}\leq d_{\mathrm{tot}}\leq T\sigma_{\max}$, contributes $O(L_{F}G^{2}\sigma_{\max})$. Since $\Theta$ is bounded (so $D_{\psi}<\infty$), combining gives $\mathrm{Regret}_{T}^{\mathrm{dec}}=O(\sqrt{T\sigma_{\max}}+T\epsilon_{\mathrm{inner}}^{2})$.

If the inner solver runs $K=O(\ln T)$ steps with contraction rate $\rho<1$, then $\epsilon_{\mathrm{inner}}=O(\rho^{K})=O(1/\sqrt{T})$, giving $T\epsilon_{\mathrm{inner}}^{2}=O(1)$, and the total regret is $O(\sqrt{T\sigma_{\max}})$. ∎
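To make the $K=O(\ln T)$ requirement concrete, the following sketch computes the smallest number of inner steps for which $\rho^{K}\leq 1/\sqrt{T}$. It assumes, purely for illustration, that the inner-solver error is exactly $\rho^{K}$ (unit leading constant); the function name `inner_steps_needed` is hypothetical.

```python
import math

def inner_steps_needed(T, rho):
    """Smallest integer K >= 0 with rho**K <= 1/sqrt(T), for 0 < rho < 1."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    target = 1.0 / math.sqrt(T)
    # Closed-form guess: K ~ ln(T) / (2 ln(1/rho)).
    K = max(0, math.ceil(0.5 * math.log(T) / math.log(1.0 / rho)))
    # Correct the guess in both directions to guard against rounding in the logs.
    while rho ** K > target:
        K += 1
    while K > 0 and rho ** (K - 1) <= target:
        K -= 1
    return K

T, rho = 10_000, 0.5
K = inner_steps_needed(T, rho)
eps_inner = rho ** K            # assumed error after K steps (unit constant)
bias_term = T * eps_inner ** 2  # the T * eps_inner^2 term of the regret bound
print(K, eps_inner, bias_term)
```

With these illustrative values the bias term stays below 1, in line with the $T\epsilon_{\mathrm{inner}}^{2}=O(1)$ claim.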

###### Corollary 1 (Comparison with non-delayed bilevel OMD).

Setting $\sigma_{\max}=0$ (no delay) and $\beta=0$, the bound reduces to $O(D_{\psi}\sqrt{T}+T\epsilon_{\mathrm{inner}}^{2})$, which is the standard $O(\sqrt{T})$ rate for convex OCO. The authors of [[23](https://arxiv.org/html/2605.12693#bib.bib12)] achieve the faster $O((G^{2}/\mu_{F})\ln T)$ rate by exploiting $\mu_{F}$-strong convexity of $F$; for the *delay/transport* terms (Lemma [2](https://arxiv.org/html/2605.12693#Thmlemma2)), our analysis uses convexity-only arguments and does not exploit strong convexity. (Strong convexity *is* used to absorb the inner-bias cross-term in Lemma [3](https://arxiv.org/html/2605.12693#Thmlemma3) and, more prominently, in the pure-bias and interaction bounds of Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3).) With an exact inner solver ($\epsilon_{\mathrm{inner}}=0$), or with $K=O(\ln T)$ inner steps so that $T\epsilon_{\mathrm{inner}}^{2}=O(1)$, the bound reduces to $O(\sqrt{T\sigma_{\max}})$, which is comparable to the $O(\sqrt{Td_{\mathrm{tot}}})$ rate of single-level D-FTRL [[15](https://arxiv.org/html/2605.12693#bib.bib29)] with the queue length $\sigma_{\max}$ replacing the total delay $d_{\mathrm{tot}}=T\sigma_{\max}$.

### E.2 Proof of Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2) (Staleness amplification)

The proof has three parts: Part (a) derives the gradient error structure, Part (b) establishes the $\Omega(T\,\epsilon_{\mathrm{inner}}^{2})$ lower bound via a hard instance, and Part (c) compares the transport error with and without IGT. The key insight is that the bilevel gradient error contains a *cross-term* coupling outer-parameter drift with inner-solver sensitivity (a structural feature absent in single-level OCO), and that IGT eliminates the quadratic growth of this cross-term by decomposing the total drift into per-step contributions.

#### Part \(a\): Gradient Error Structure

We bound $\left\|\hat{g}(\theta_{t-d_{t}},w_{t-d_{t}})-\nabla F(\theta_{t})\right\|$ by decomposing it into outer staleness and inner-solver error:

$$\begin{aligned}
\left\|\hat{g}(\theta_{t-d_{t}},w_{t-d_{t}})-\nabla F(\theta_{t})\right\| &\leq \left\|\nabla F(\theta_{t-d_{t}})-\nabla F(\theta_{t})\right\|+\left\|\hat{g}(\theta_{t-d_{t}},w_{t-d_{t}})-\nabla F(\theta_{t-d_{t}})\right\| \\
&\leq L_{F}\,\left\|\theta_{t}-\theta_{t-d_{t}}\right\|+C_{0}\,\epsilon_{\mathrm{inner}},
\end{aligned} \tag{20}$$

where the first inequality is the triangle inequality, and the second uses $L_{F}$-smoothness of $F$ (with $L_{F}=L_{\theta}+L_{w\theta}^{2}/\mu_{w}$, the bilevel Lipschitz bound derived from Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1)–[2](https://arxiv.org/html/2605.12693#Thmassumption2)) for the outer staleness term, and the inner-solver bound of Lemma [3](https://arxiv.org/html/2605.12693#Thmlemma3) for the noise term. The bilevel Lipschitz constant $L_{F}$ absorbs the implicit sensitivity $L_{w\theta}^{2}/\mu_{w}$ of the inner solution to the outer parameters via the Implicit Function Theorem.

Contrast with single-level OCO. In single-level delayed optimization, the gradient oracle returns $\nabla f(\theta_{t-d_{t}})+\xi_{t}$ where $\left\|\xi_{t}\right\|\leq\epsilon$. The noise $\xi_{t}$ is bounded by a *fixed constant* $\epsilon$ independent of the staleness $\left\|\theta_{t}-\theta_{t-d_{t}}\right\|$. In the bilevel case, the Lipschitz constant $L_{F}$ itself depends on the inner-solution sensitivity $L_{w\theta}/\mu_{w}$, so greater sensitivity *amplifies* the staleness component of the gradient error. This structural difference, the staleness amplification mechanism, is absent in single-level delayed optimization. $\square$

#### Part \(b\): Hard Instance and Lower Bound

The lower bound is established via an explicit hard instance (a quadratic bilevel problem) on which no black-box delayed optimizer can avoid $\Omega(\epsilon_{\mathrm{inner}}^{2})$ per-round loss.

#### Step 1: Hard Instance Construction

Fix $p=q=1$. Define the bilevel instance on $\Theta=[-B_{0},B_{0}]$ by

$$\mathcal{L}_{\mathrm{model}}(w;\theta)=\tfrac{\mu_{w}}{2}(w-b\theta)^{2}, \tag{21}$$

$$\mathcal{L}_{\mathrm{true}}(w;\theta)=\tfrac{1}{2}(w-a\theta)^{2}, \tag{22}$$

where $a,b\in\mathbb{R}$ with $a\neq b$ and $|a-b|=C_{0}>0$ (chosen by the adversary; the algorithm is blind to $a,b$). When $a=b$, the model perfectly matches the true objective ($C_{0}=0$), the bilevel problem reduces to single-level optimization, and the staleness amplification mechanism disappears; this is precisely the regime where bilevel structure imposes no additional cost. Assumption [1](https://arxiv.org/html/2605.12693#Thmassumption1) holds with strong convexity constant $\mu_{w}$ and smoothness $L_{w}=\mu_{w}$. Assumption [2](https://arxiv.org/html/2605.12693#Thmassumption2) holds with $L_{\theta}=a^{2}$ and cross-partial Lipschitz constant $L_{w\theta}=1$.

The exact inner minimizer is $w^{*}(\theta)=b\theta$, giving the bilevel objective

$$F(\theta)=\tfrac{1}{2}(b-a)^{2}\theta^{2}=\tfrac{C_{0}^{2}}{2}\theta^{2}, \tag{23}$$

with unique optimum $\theta^{*}=0$ and gradient $\nabla F(\theta)=C_{0}^{2}\theta$. By the implicit differentiation formula ([1](https://arxiv.org/html/2605.12693#S2.E1)), the adjoint satisfies $v^{*}(\theta)=(b-a)\theta/\mu_{w}$, and the hypergradient is

$$g(\theta)=\nabla_{\theta}\mathcal{L}_{\mathrm{true}}\big|_{w}-(\nabla_{\theta}\nabla_{w}\mathcal{L}_{\mathrm{model}})^{\top}v^{*}(\theta)=-a(w^{*}-a\theta)+(b\mu_{w})\,v^{*}(\theta)=C_{0}^{2}\theta,$$

consistent with ([23](https://arxiv.org/html/2605.12693#A5.E23)).
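The hypergradient computation on the hard instance can be reproduced term by term. The sketch below follows the formulas above for the scalar instance (21)–(22); the function names and the numerical values of $a$, $b$, $\mu_{w}$ are illustrative choices, and the finite-difference comparison is added here only as a sanity check.

```python
def bilevel_objective(theta, a, b):
    """F(theta) = 0.5 * (b - a)^2 * theta^2, as in (23)."""
    return 0.5 * (b - a) ** 2 * theta ** 2

def exact_hypergradient(theta, a, b, mu_w):
    """Implicit-differentiation hypergradient for the scalar hard instance."""
    w_star = b * theta                        # minimizer of L_model(.; theta)
    grad_w_true = w_star - a * theta          # d/dw of L_true at w_star
    v_star = grad_w_true / mu_w               # adjoint: inverse inner Hessian (1/mu_w) times grad_w_true
    grad_theta_true = -a * (w_star - a * theta)  # d/dtheta of L_true at fixed w
    cross_partial = -mu_w * b                 # d/dtheta d/dw of L_model
    return grad_theta_true - cross_partial * v_star

# Illustrative values (not from the paper).
a, b, mu_w, theta = 1.0, 3.0, 2.0, 0.5
C0_sq = (a - b) ** 2
g = exact_hypergradient(theta, a, b, mu_w)

# Central finite difference of F as an independent check.
h = 1e-6
fd = (bilevel_objective(theta + h, a, b) - bilevel_objective(theta - h, a, b)) / (2 * h)
print(g, C0_sq * theta, fd)
```

All three printed values agree up to finite-difference error, matching $\nabla F(\theta)=C_{0}^{2}\theta$.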

#### Step 2: Inner\-Solver Error Introduces a Persistent Bias

Suppose the approximate inner solver produces $w_{t}=b\theta_{t}+\epsilon_{t}$, with $|\epsilon_{t}|\leq\epsilon_{\mathrm{inner}}$. The algorithm evaluates the outer gradient using $w_{t}$ in place of $w^{*}(\theta_{t})$. Via formula ([1](https://arxiv.org/html/2605.12693#S2.E1)) with the approximate adjoint $\tilde{v}_{t}=(w_{t}-a\theta_{t})/\mu_{w}$:

$$\begin{aligned}
\hat{g}(\theta_{t},w_{t}) &= -a(w_{t}-a\theta_{t})+b\mu_{w}\cdot\tilde{v}_{t} \\
&= -a(b\theta_{t}+\epsilon_{t}-a\theta_{t})+b(b\theta_{t}+\epsilon_{t}-a\theta_{t}) \\
&= C_{0}^{2}\theta_{t}+C_{0}\epsilon_{t}.
\end{aligned} \tag{24}$$

The bias term $C_{0}\epsilon_{t}$ is independent of $\theta_{t}$. The adversary sets the inner-solver error to the *constant* value $\epsilon_{t}=+\epsilon_{\mathrm{inner}}$ for all $t$ (no knowledge of the algorithm's iterate is required). This produces a hypergradient with a *constant additive bias*:

$$\hat{g}(\theta_{t},w_{t})=C_{0}^{2}\theta_{t}+C_{0}\,\epsilon_{\mathrm{inner}}. \tag{25}$$

###### Lemma 4 (Steady-state displacement).

For the hard instance ([21](https://arxiv.org/html/2605.12693#A5.E21))–([22](https://arxiv.org/html/2605.12693#A5.E22)) with constant inner-solver error $\epsilon_{t}=+\epsilon_{\mathrm{inner}}$, any algorithm using the biased gradient ([25](https://arxiv.org/html/2605.12693#A5.E25)) converges (when stable) to a steady state $\theta_{\mathrm{ss}}$ satisfying

$$\theta_{\mathrm{ss}}=-\frac{\epsilon_{\mathrm{inner}}}{C_{0}},\qquad F(\theta_{\mathrm{ss}})=\frac{\epsilon_{\mathrm{inner}}^{2}}{2}. \tag{26}$$

###### Proof\.

At steady state, $\theta_{t+1}=\theta_{t}=\theta_{\mathrm{ss}}$ and $\theta_{t-\sigma_{\max}}=\theta_{\mathrm{ss}}$, so $0=-\eta(C_{0}^{2}\theta_{\mathrm{ss}}+C_{0}\epsilon_{\mathrm{inner}})$, giving $\theta_{\mathrm{ss}}=-\epsilon_{\mathrm{inner}}/C_{0}$. Then $F(\theta_{\mathrm{ss}})=C_{0}^{2}\theta_{\mathrm{ss}}^{2}/2=\epsilon_{\mathrm{inner}}^{2}/2$. ∎
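The steady state of Lemma 4 can be observed by iterating the biased oracle directly. The sketch below uses plain, undelayed gradient steps for simplicity, which is an assumption made for illustration; the values of $C_{0}$, $\epsilon_{\mathrm{inner}}$, and $\eta$ are arbitrary.

```python
def biased_hypergradient(theta, C0, eps_inner):
    """Oracle (25): exact gradient C0^2 * theta plus the constant bias C0 * eps_inner."""
    return C0 ** 2 * theta + C0 * eps_inner

def bilevel_objective(theta, C0):
    """F(theta) = C0^2 * theta^2 / 2, with minimum value 0 at theta = 0."""
    return 0.5 * C0 ** 2 * theta ** 2

# Illustrative values (not from the paper).
C0, eps_inner, eta = 2.0, 0.1, 0.05

theta = 1.0
for _ in range(500):  # undelayed gradient steps, for illustration only
    theta -= eta * biased_hypergradient(theta, C0, eps_inner)

theta_ss = -eps_inner / C0  # predicted steady state from (26)
print(theta, theta_ss)
print(bilevel_objective(theta, C0), eps_inner ** 2 / 2)
```

The iterate settles at $-\epsilon_{\mathrm{inner}}/C_{0}$ rather than at $\theta^{*}=0$, and the residual loss equals $\epsilon_{\mathrm{inner}}^{2}/2$.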

#### Step 3: Convergence to the Biased Steady State

For any algorithm $\mathcal{A}$ with constant delay $d_{t}=\sigma_{\max}$ and step size $\eta$, the iterates satisfy

$$\theta_{t+1}=\theta_{t}-\eta C_{0}^{2}\theta_{t-\sigma_{\max}}-\eta C_{0}\epsilon_{\mathrm{inner}}. \tag{27}$$

This is a linear recurrence with delay $\sigma_{\max}$ and a constant additive drive $-\eta C_{0}\epsilon_{\mathrm{inner}}$. For step sizes satisfying $\eta\leq 1/(2C_{0}^{2}\sigma_{\max})$, a *sufficient* condition for all roots of the discrete characteristic polynomial $z^{\sigma_{\max}+1}-z^{\sigma_{\max}}+\eta C_{0}^{2}=0$ to lie inside the unit disk (cf. Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)), the homogeneous part $\theta_{t+1}=\theta_{t}-\eta C_{0}^{2}\theta_{t-\sigma_{\max}}$ is stable, and the iterates converge to $\theta_{\mathrm{ss}}=-\epsilon_{\mathrm{inner}}/C_{0}$ within $O(\sigma_{\max})$ rounds.

###### Lemma 5 (Iterate lower bound).

For any stable algorithm $\mathcal{A}$ with step size $\eta\leq 1/(2C_{0}^{2}\sigma_{\max})$, after an initial transient of $O(\sigma_{\max})$ rounds, the iterates satisfy

$$|\theta_{t}| \;\geq\; \frac{1}{2}\cdot\frac{\epsilon_{\mathrm{inner}}}{C_{0}}. \tag{28}$$

###### Proof\.

By Lemma [4](https://arxiv.org/html/2605.12693#Thmlemma4), $\theta_{\mathrm{ss}}=-\epsilon_{\mathrm{inner}}/C_{0}$. The deviation $\xi_{t}=\theta_{t}-\theta_{\mathrm{ss}}$ satisfies the homogeneous recurrence $\xi_{t+1}=\xi_{t}-\eta C_{0}^{2}\xi_{t-\sigma_{\max}}$, which is stable for $\eta\leq 1/(2C_{0}^{2}\sigma_{\max})$. Hence $|\xi_{t}|\to 0$, so $|\theta_{t}|\to|\theta_{\mathrm{ss}}|=\epsilon_{\mathrm{inner}}/C_{0}$. After the transient, once $|\xi_{t}|\leq\tfrac{1}{2}\epsilon_{\mathrm{inner}}/C_{0}$, the reverse triangle inequality gives $|\theta_{t}|\geq|\theta_{\mathrm{ss}}|-|\xi_{t}|\geq\tfrac{1}{2}\epsilon_{\mathrm{inner}}/C_{0}$. ∎
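The delayed recurrence (27) is easy to simulate, which gives a quick check of Lemmas 4 and 5 on one concrete configuration. The sketch below is a sanity check under stated assumptions, not a proof: parameter values are illustrative, the pre-history is taken to be $\theta_{1}$ (an assumption, since the text does not specify it), and the burn-in of 200 rounds is a generous hypothetical choice.

```python
def simulate_delayed(theta1, C0, eps_inner, eta, delay, T):
    """Iterate theta_{t+1} = theta_t - eta*C0^2*theta_{t-delay} - eta*C0*eps_inner.

    Assumption (illustrative): for rounds earlier than the first one, the
    stale iterate is taken to be theta1.
    """
    thetas = [theta1]
    for t in range(T - 1):
        stale = thetas[max(t - delay, 0)]
        thetas.append(thetas[t] - eta * C0 ** 2 * stale - eta * C0 * eps_inner)
    return thetas

# Illustrative values (not from the paper).
C0, eps_inner, delay, T = 1.0, 0.1, 5, 2000
eta = 1.0 / (2 * C0 ** 2 * delay)  # largest step size allowed by the sufficient condition
thetas = simulate_delayed(theta1=1.0, C0=C0, eps_inner=eps_inner,
                          eta=eta, delay=delay, T=T)

theta_ss = -eps_inner / C0
burn_in = 200  # hypothetical transient length, well beyond O(delay)
floor = 0.5 * eps_inner / C0
print(thetas[-1], theta_ss)
print(min(abs(th) for th in thetas[burn_in:]), floor)
```

In this run the iterates converge to $\theta_{\mathrm{ss}}$ and stay above the floor in (28) after the burn-in.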

#### Step 4: Accumulating Regret

Since $F(\theta)=C_{0}^{2}\theta^{2}/2$ and $\theta^{*}=0$, the per-round regret is $F(\theta_{t})-F(\theta^{*})=C_{0}^{2}\theta_{t}^{2}/2$. Using Lemma [5](https://arxiv.org/html/2605.12693#Thmlemma5), after an $O(\sigma_{\max})$-round transient we have $\theta_{t}^{2}\geq\epsilon_{\mathrm{inner}}^{2}/(4C_{0}^{2})$ for $t\geq t_{0}=O(\sigma_{\max})$. Thus:

$$\sum_{t=1}^{T}\bigl[F(\theta_{t})-F(\theta^{*})\bigr] \;\geq\; \sum_{t=t_{0}}^{T}\frac{C_{0}^{2}\theta_{t}^{2}}{2} \;\geq\; (T-t_{0})\cdot\frac{C_{0}^{2}}{2}\cdot\frac{\epsilon_{\mathrm{inner}}^{2}}{4C_{0}^{2}} \;=\; \frac{(T-t_{0})\,\epsilon_{\mathrm{inner}}^{2}}{8}. \tag{29}$$

For $T\gg\sigma_{\max}$ (i.e. $T-t_{0}\geq T/2$), this gives $\mathrm{Regret}_{T}(\mathcal{A})\geq\Omega(T\,\epsilon_{\mathrm{inner}}^{2})$.
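The accumulation in (29) can be checked on the same recurrence by summing the per-round loss. As before, this is an illustrative sketch: the parameter values, the pre-history convention (stale iterate equal to $\theta_{1}$ before round 1), and the choice $t_{0}=200$ are assumptions made here.

```python
def delayed_regret(theta1, C0, eps_inner, eta, delay, T):
    """Cumulative regret sum_t F(theta_t) for recurrence (27), using F(theta*) = 0.

    Assumption (illustrative): the stale iterate before round 1 equals theta1.
    """
    thetas = [theta1]
    for t in range(T - 1):
        stale = thetas[max(t - delay, 0)]
        thetas.append(thetas[t] - eta * C0 ** 2 * stale - eta * C0 * eps_inner)
    return sum(0.5 * C0 ** 2 * th ** 2 for th in thetas)

# Illustrative values (not from the paper).
C0, eps_inner, delay, T = 1.0, 0.1, 5, 2000
eta = 1.0 / (2 * C0 ** 2 * delay)
t0 = 200  # hypothetical transient length

regret = delayed_regret(1.0, C0, eps_inner, eta, delay, T)
lower_bound = (T - t0) * eps_inner ** 2 / 8  # right-hand side of (29)
print(regret, lower_bound)
```

The simulated regret exceeds the bound in (29), and both grow linearly in $T$ for fixed $\epsilon_{\mathrm{inner}}$.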

#### Step 5: Independence from Algorithm Design

The lower bound holds for *any* black-box delayed optimizer, that is, any algorithm whose only access to $F$ is through the biased gradient oracle ([25](https://arxiv.org/html/2605.12693#A5.E25)), without knowledge of the inner solver's error distribution, sign, or magnitude beyond the bound $|\epsilon_{t}|\leq\epsilon_{\mathrm{inner}}$. In particular, a bias-correction scheme that estimates and subtracts the mean error would require access to the *unbiased* gradient $\nabla F(\theta_{t})$, which is precisely what the black-box model excludes.

1. The constant bias $C_{0}\epsilon_{\mathrm{inner}}$ in ([25](https://arxiv.org/html/2605.12693#A5.E25)) is independent of $\eta$, $\sigma_{\max}$, and the algorithm's iterate sequence. No step-size choice eliminates this bias.
2. Any stable algorithm converges to $\theta_{\mathrm{ss}}=-\epsilon_{\mathrm{inner}}/C_{0}$, incurring $F(\theta_{\mathrm{ss}})=\epsilon_{\mathrm{inner}}^{2}/2$ per round. The bound requires a *fixed precision floor*: $\epsilon_{\mathrm{inner}}$ must remain bounded away from zero uniformly over all rounds $t$ (Assumption [3](https://arxiv.org/html/2605.12693#Thmassumption3)). If the inner solver were warm-started such that $\epsilon_{t}\to 0$, the steady-state offset would vanish and the $\Omega(T)$ lower bound would collapse. Our result characterizes the cost of a *fixed approximation budget*, which is the standard setting in bilevel optimization [[10](https://arxiv.org/html/2605.12693#bib.bib13)].
3. An unstable algorithm ($\eta>1/(2C_{0}^{2}\sigma_{\max})$) has $|\theta_{t}|\to\infty$ (clipped at $B_{0}$, since $\Theta=[-B_{0},B_{0}]$ is compact), incurring $F(\theta_{t})\geq C_{0}^{2}B_{0}^{2}/2$ per round, which is strictly worse.
4. The adversary chooses $\theta_{1}=B_{0}\neq 0$, guaranteeing a non-trivial transient; the $O(\sigma_{\max})$ burn-in cost is absorbed into the $\Omega(T)$ rate for $T\gg\sigma_{\max}$.

The hard instance satisfies Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1) and [2](https://arxiv.org/html/2605.12693#Thmassumption2) with $\mu_{w}=L_{w}$ and $L_{\theta}=a^{2}<\infty$, completing Part (b).

#### Part \(c\): Transport Error Separation

Without bilevel-aware correction, the per-round staleness mismatch contributes

$$\bigl|\bigl\langle\nabla F(\theta_{t})-\nabla F(\theta_{t-\sigma_{t}}),\,\theta_{t}-\theta^{*}\bigr\rangle\bigr| \;\leq\; L_{F}\,\left\|\theta_{t}-\theta_{t-\sigma_{t}}\right\|\cdot\mathrm{diam}(\Theta). \tag{30}$$

Since $\left\|\theta_{t}-\theta_{t-\sigma_{t}}\right\|\leq\sigma_{t}\,\eta\,G$, summing over $T$ rounds gives

$$\sum_{t=1}^{T}L_{F}\,\left\|\theta_{t}-\theta_{t-\sigma_{t}}\right\|\cdot\mathrm{diam}(\Theta) \;\leq\; L_{F}\,\mathrm{diam}(\Theta)\,\sigma_{\max}\,G\,\eta\,T \;=\; O\bigl(L_{F}\,\mathrm{diam}(\Theta)\,\sigma_{\max}\,G\,\sqrt{T}\bigr)$$

with $\eta=c/\sqrt{T}$. This transport cost grows as $\sqrt{T}$.

Under IGT, the corresponding term is the sum of squared per-step changes (Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)):

$$\sum_{t=1}^{T}\sum_{s\in Q_{t}}L_{F}\,\left\|\theta_{s+1}-\theta_{s}\right\|^{2} \;\leq\; L_{F}\,T\,\sigma_{\max}\,\eta^{2}\,G^{2} \;=\; O\bigl(L_{F}\,\sigma_{\max}\,G^{2}\bigr),$$

which is *constant in $T$*. The ratio of the non-IGT transport cost to the IGT transport cost is $\mathrm{diam}(\Theta)\,\sqrt{T}/G$, demonstrating a $\sqrt{T}$ factor improvement. In terms of the per-round transport error, the non-IGT contribution is $O(\sigma_{\max}^{2}\eta^{2}G^{2})$ (from the squared total drift $\left\|\theta_{t}-\theta_{t-\sigma_{t}}\right\|^{2}$), while IGT yields $O(\sigma_{\max}\eta^{2}G^{2})$ (sum of per-step squares), a factor $\sigma_{\max}$ improvement, as claimed. $\square$
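The separation in Part (c) can be made tangible by evaluating both aggregate bounds for concrete constants. The sketch below evaluates the two right-hand sides only (it does not run either algorithm); the function name `transport_bounds` and all numerical constants are illustrative assumptions.

```python
import math

def transport_bounds(T, sigma_max, L_F, G, diam, c=1.0):
    """Aggregate transport bounds with eta = c / sqrt(T).

    Returns (non_igt, igt):
      non_igt = L_F * diam * sigma_max * G * eta * T   (uncorrected staleness)
      igt     = L_F * T * sigma_max * eta^2 * G^2      (sum of per-step squares)
    """
    eta = c / math.sqrt(T)
    non_igt = L_F * diam * sigma_max * G * eta * T
    igt = L_F * T * sigma_max * eta ** 2 * G ** 2
    return non_igt, igt

# Illustrative constants (not from the paper).
sigma_max, L_F, G, diam = 5, 2.0, 1.0, 1.0

for T in (100, 10_000):
    non_igt, igt = transport_bounds(T, sigma_max, L_F, G, diam)
    print(T, non_igt, igt, non_igt / igt)
```

With $c=1$ the ratio printed in the last column equals $\mathrm{diam}(\Theta)\sqrt{T}/G$, while the IGT bound stays fixed as $T$ grows.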

###### Corollary 2 (Vector-valued extension).

The hard instance of Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(b) extends to $p,q>1$: replace $\theta$ with a vector and construct $A,B\in\mathbb{R}^{q\times p}$ such that $(A-B)^{\top}(A-B)\succeq C_{0}^{2}I$. All scalar norms become Euclidean norms, and the lower-bound calculation carries through unchanged. Hence $\mathrm{Regret}_{T}(\mathcal{A})\geq\Omega(T\epsilon_{\mathrm{inner}}^{2})$ holds in full generality.

### E.3 Proof of Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3) (Inner-Loop Apathy)

The key insight is that the regret decomposes into three independent terms (delay error, inner bias, and interaction) because IGT re-evaluation uses stored $(w_{s},v_{s}^{*})$ from past rounds, making the transport correction *independent* of the current inner-solver quality. This decoupling is what we term "inner-loop apathy": the transport mechanism is agnostic to how well the inner problem is solved.

We retain the notation from Appendix [E.1](https://arxiv.org/html/2605.12693#A5.SS1). Define the set of outstanding rounds at time $t$ as $Q_{t}=\{s\leq t:s+d_{s}>t\}$ with $|Q_{t}|\leq\sigma_{\max}$. Let

$$C_1 := \frac{L_{\theta w}}{\mu_w}, \qquad L_g := \underbrace{L_{w\theta}}_{\text{direct}} + \underbrace{\frac{L_{w\theta}\,G_w}{\mu_w}}_{\text{cross-partial}} + \underbrace{\frac{L_{w\theta}\,L_w\,G_w}{\mu_w^{2}}}_{\text{Hessian var.}} + \underbrace{\frac{L_{\theta w}\,L_{w\theta}}{\mu_w}}_{\text{Lip. drift}}, \qquad \kappa := L_g\,C_1,$$

where $G_w := \sup_{w,\theta}\|\nabla_w \mathcal{L}_{\mathrm{true}}(w;\theta)\|$ (bounded on $\mathcal{W}\times\Theta$ by Assumption [6](https://arxiv.org/html/2605.12693#Thmassumption6) and compactness). The constant $L_g$ accounts for four contributions to the Lipschitz constant of the hypergradient map $w \mapsto g_s(\theta, w)$: (0) the explicit gradient $\nabla_\theta \mathcal{L}_{\mathrm{true}}$, which is $L_{w\theta}$-Lipschitz in $w$ (Assumption [2](https://arxiv.org/html/2605.12693#Thmassumption2)); (i) the cross-partial $\nabla^2_{\theta w}\mathcal{L}_{\mathrm{model}}$ acting on the inverse Hessian, contributing $L_{w\theta}G_w/\mu_w$ (Assumption [5](https://arxiv.org/html/2605.12693#Thmassumption5)(ii), $\|H^{-1}\| \leq 1/\mu_w$, $\|b\| \leq G_w$); (ii) the variation of $[\nabla^2_{ww}]^{-1}$ in $w$, contributing $L_{w\theta}L_wG_w/\mu_w^2$ (Assumption [5](https://arxiv.org/html/2605.12693#Thmassumption5)(iii)); and (iii) the Lipschitz dependence of $\nabla_w \mathcal{L}_{\mathrm{true}}$ on $w$ via Assumption [5](https://arxiv.org/html/2605.12693#Thmassumption5)(ii), contributing $L_{\theta w}L_{w\theta}/\mu_w$.

The product $\kappa = L_g C_1$ gives the composite sensitivity of the hypergradient to parameter drift via the inner solution. We write $w^*_t := w^*(\theta_t)$ and $\tilde{g}_t$ for the IGT-corrected hypergradient actually computed by Algorithm [1](https://arxiv.org/html/2605.12693#alg1) (which uses the stored inner solution $w_s$ for each outstanding gradient). Write $g_t^{\mathrm{ideal}}$ for the same IGT estimator evaluated with the ideal inner solution $w^*_t$ at every transported gradient.
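As a minimal sketch of how the constants above combine, the following helper evaluates $C_1$, $L_g$, and $\kappa$ from given smoothness constants. The function name and the numerical inputs are illustrative assumptions, not values from the paper.

```python
def sensitivity_constants(L_theta_w, L_w_theta, L_w, G_w, mu_w):
    """Return (C1, L_g, kappa) from the smoothness constants defined above."""
    C1 = L_theta_w / mu_w
    direct = L_w_theta                              # term (0): explicit gradient
    cross_partial = L_w_theta * G_w / mu_w          # term (i)
    hessian_var = L_w_theta * L_w * G_w / mu_w**2   # term (ii)
    lip_drift = L_theta_w * L_w_theta / mu_w        # term (iii)
    L_g = direct + cross_partial + hessian_var + lip_drift
    return C1, L_g, L_g * C1


# Hypothetical constants chosen only to make the arithmetic easy to follow.
C1, L_g, kappa = sensitivity_constants(
    L_theta_w=1.0, L_w_theta=2.0, L_w=3.0, G_w=0.5, mu_w=2.0
)
```

Note that $\mu_w$ enters $\kappa$ with up to a cubic inverse power (through the Hessian-variation term times $C_1$), so a poorly conditioned inner problem inflates the composite sensitivity quickly.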

###### Lemma 6 (Inner-solution sensitivity).

Under Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1) and [5](https://arxiv.org/html/2605.12693#Thmassumption5), for any rounds $s, t$:

$$\left\|w_s - w^*_t\right\| \;\leq\; \epsilon_{\mathrm{inner}} + C_1\left\|\theta_s - \theta_t\right\|. \tag{31}$$

###### Proof\.

By the triangle inequality, $\|w_s - w^*_t\| \leq \|w_s - w^*_s\| + \|w^*_s - w^*_t\|$. The first term is at most $\epsilon_{\mathrm{inner}}$ by Assumption [1](https://arxiv.org/html/2605.12693#Thmassumption1) (approximate inner solver). For the second, apply the implicit function theorem to the inner optimality condition $\nabla_w \mathcal{L}_{\mathrm{model}}(w^*(\theta);\theta) = 0$:

$$\nabla_\theta w^*(\theta) = -\bigl[\nabla^2_{ww}\mathcal{L}_{\mathrm{model}}\bigr]^{-1}\nabla^2_{w\theta}\mathcal{L}_{\mathrm{model}},$$

whose spectral norm is bounded by $L_{\theta w}/\mu_w = C_1$ under Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1)–[5](https://arxiv.org/html/2605.12693#Thmassumption5). Hence $\|w^*_s - w^*_t\| \leq C_1\|\theta_s - \theta_t\|$, giving ([31](https://arxiv.org/html/2605.12693#A5.E31)). ∎
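The bound (31) can be sanity-checked on a toy problem. The sketch below assumes a scalar quadratic inner loss $\tfrac{\mu_w}{2}w^2 - c\,\theta w$ (an illustrative choice, not the paper's model), for which $w^*(\theta) = c\theta/\mu_w$ and $C_1 = |c|/\mu_w$, and models the approximate solver as the exact minimizer plus a perturbation of size at most $\epsilon_{\mathrm{inner}}$.

```python
import random


def w_star(theta, mu_w=2.0, c=3.0):
    # Minimizer of the toy inner loss (mu_w / 2) * w**2 - c * theta * w.
    return c * theta / mu_w


def check_lemma6(trials=1000, mu_w=2.0, c=3.0, eps_inner=0.1, seed=0):
    """Return True if |w_s - w*_t| <= eps_inner + C1 * |theta_s - theta_t| in every trial."""
    rng = random.Random(seed)
    C1 = abs(c) / mu_w  # L_{theta w} / mu_w for this toy loss
    for _ in range(trials):
        theta_s, theta_t = rng.uniform(-5, 5), rng.uniform(-5, 5)
        # Stored inner solution: exact minimizer at theta_s plus bounded solver error.
        w_s = w_star(theta_s, mu_w, c) + rng.uniform(-eps_inner, eps_inner)
        lhs = abs(w_s - w_star(theta_t, mu_w, c))
        rhs = eps_inner + C1 * abs(theta_s - theta_t)
        if lhs > rhs + 1e-12:
            return False
    return True
```

In this toy case the second term of the triangle inequality holds with equality, so the check mainly illustrates how the solver error and the parameter drift add.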

###### Lemma 7 (Hypergradient sensitivity to the inner solution).

Under Assumptions [2](https://arxiv.org/html/2605.12693#Thmassumption2)–[5](https://arxiv.org/html/2605.12693#Thmassumption5), for any $\theta_t, w, w'$:

$$\left\|g_s(\theta_t, w') - g_s(\theta_t, w)\right\| \;\leq\; L_g\left\|w' - w\right\|. \tag{32}$$

Combining Lemma [6](https://arxiv.org/html/2605.12693#Thmlemma6) with ([32](https://arxiv.org/html/2605.12693#A5.E32)), the per-step inner-staleness error satisfies, for each $s \in Q_t$:

$$\left\|g_s(\theta_t, w_s) - g_s(\theta_t, w^*_t)\right\| \;\leq\; L_g\,\epsilon_{\mathrm{inner}} + \kappa\left\|\theta_s - \theta_t\right\|. \tag{33}$$

###### Proof\.

The bilevel hypergradient is

$$g_s(\theta, w) = \nabla_\theta \mathcal{L}_{\mathrm{true}}(w;\theta) - \underbrace{\nabla^2_{\theta w}\mathcal{L}_{\mathrm{model}}(w;\theta)}_{=:A(w)}\,\underbrace{[\nabla^2_{ww}\mathcal{L}_{\mathrm{model}}(w;\theta)]^{-1}}_{=:H(w)^{-1}}\,\underbrace{\nabla_w \mathcal{L}_{\mathrm{true}}(w;\theta)}_{=:b(w)}.$$

To bound $\|g_s(\theta, w') - g_s(\theta, w)\|$, note that

$$A(w)H(w)^{-1}b(w) - A(w')H(w')^{-1}b(w') = [A(w) - A(w')]\,H(w)^{-1}b(w) + A(w')\,[H(w)^{-1} - H(w')^{-1}]\,b(w) + A(w')H(w')^{-1}\,[b(w) - b(w')].$$

By Assumption [5](https://arxiv.org/html/2605.12693#Thmassumption5): (i) $\|A(w) - A(w')\| \leq L_{w\theta}\|w - w'\|$, with $\|H(w)^{-1}\| \leq 1/\mu_w$ and $\|b(w)\| \leq G_w$, giving a first contribution $(L_{w\theta}G_w/\mu_w)\cdot\|w - w'\|$; (ii) the matrix-inversion Lipschitz bound gives $\|H(w)^{-1} - H(w')^{-1}\| \leq (L_w/\mu_w^2)\|w - w'\|$, contributing $(L_{w\theta}L_wG_w/\mu_w^2)\cdot\|w - w'\|$; (iii) $\|b(w) - b(w')\| \leq L_{w\theta}\|w - w'\|$ (Assumption [5](https://arxiv.org/html/2605.12693#Thmassumption5)(ii)), contributing $(L_{\theta w}L_{w\theta}/\mu_w)\cdot\|w - w'\|$. Including the explicit-gradient term $\nabla_\theta \mathcal{L}_{\mathrm{true}}$ (which is $L_{w\theta}$-Lipschitz in $w$ by Assumption [2](https://arxiv.org/html/2605.12693#Thmassumption2)), the total Lipschitz constant is $L_g$ as defined above, giving ([32](https://arxiv.org/html/2605.12693#A5.E32)). Applying Lemma [6](https://arxiv.org/html/2605.12693#Thmlemma6) yields ([33](https://arxiv.org/html/2605.12693#A5.E33)). ∎
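The three-term splitting used in the proof is a purely algebraic identity and can be checked numerically. The sketch below uses random $2\times 2$ stand-ins for $A(w), A(w')$, $H(w), H(w')$ and random vectors for $b(w), b(w')$; all helper names are hypothetical and the matrices are not derived from any particular loss.

```python
import random


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def matvec(X, v):
    return [sum(X[i][k] * v[k] for k in range(2)) for i in range(2)]


def inv2(X):
    det = X[0][0] * X[1][1] - X[0][1] * X[1][0]
    return [[X[1][1] / det, -X[0][1] / det], [-X[1][0] / det, X[0][0] / det]]


def sub(X, Y):
    return [[X[i][j] - Y[i][j] for j in range(2)] for i in range(2)]


def telescoping_gap(seed=0):
    """Largest entrywise gap between A H^{-1} b - A' H'^{-1} b' and the three-term split."""
    rng = random.Random(seed)

    def rand_mat():
        return [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]

    def rand_vec():
        return [rng.uniform(-1, 1) for _ in range(2)]

    def rand_spd():
        # M^T M + I has eigenvalues >= 1, mimicking a strongly convex inner Hessian.
        M = rand_mat()
        Mt = [[M[0][0], M[1][0]], [M[0][1], M[1][1]]]
        P = matmul(Mt, M)
        return [[P[0][0] + 1.0, P[0][1]], [P[1][0], P[1][1] + 1.0]]

    A, A2, H, H2, b, b2 = rand_mat(), rand_mat(), rand_spd(), rand_spd(), rand_vec(), rand_vec()
    Hi, H2i = inv2(H), inv2(H2)
    lhs = [x - y for x, y in zip(matvec(matmul(A, Hi), b), matvec(matmul(A2, H2i), b2))]
    t1 = matvec(matmul(sub(A, A2), Hi), b)                       # [A - A'] H^{-1} b
    t2 = matvec(matmul(A2, sub(Hi, H2i)), b)                     # A' [H^{-1} - H'^{-1}] b
    t3 = matvec(matmul(A2, H2i), [x - y for x, y in zip(b, b2)])  # A' H'^{-1} [b - b']
    rhs = [t1[i] + t2[i] + t3[i] for i in range(2)]
    return max(abs(l - r) for l, r in zip(lhs, rhs))
```

The order of the factors matters because the matrices do not commute; the identity holds only when each difference is inserted in the position shown.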

##### Main proof of Theorem[3](https://arxiv.org/html/2605.12693#Thmtheorem3)\.

###### Proof of Theorem[3](https://arxiv.org/html/2605.12693#Thmtheorem3)\.

Step 1: Two-part error decomposition. Decompose the gradient error at round $t$ as

$$\nabla F(\theta_t) - \tilde{g}_t \;=\; \underbrace{\nabla F(\theta_t) - g_t^{\mathrm{ideal}}}_{=:\,\delta_t^{(1)}} \;+\; \underbrace{g_t^{\mathrm{ideal}} - \tilde{g}_t}_{=:\,\delta_t^{(2)}}.$$

The first term $\delta_t^{(1)}$ is the IGT outer-transport error (outer parameters are stale, but the inner solution is ideal). The second term $\delta_t^{(2)}$ is the inner-staleness error (the ideal inner solution $w^*_t$ is replaced by the stored $w_s$).

By convexity of $F$ and the OMD analysis from Lemmas [1](https://arxiv.org/html/2605.12693#Thmlemma1)–[3](https://arxiv.org/html/2605.12693#Thmlemma3):

$$\mathrm{Regret}_T^{\mathrm{dec}} \;\leq\; \underbrace{\sum_{t=1}^{T}\bigl\langle \delta_t^{(1)},\, \theta_t - \theta^*\bigr\rangle}_{=:\,R_1} \;+\; \underbrace{\sum_{t=1}^{T}\bigl\langle \delta_t^{(2)},\, \theta_t - \theta^*\bigr\rangle}_{=:\,R_2 + R_3} \;+\; (\text{OMD base terms bounded in Theorem 1}). \tag{34}$$

Step 2: Bounding $R_1$ (delay error). The term $R_1$ is identical in structure to the IGT transport error analyzed in Lemma [2](https://arxiv.org/html/2605.12693#Thmlemma2): $\|\delta_t^{(1)}\| \leq L_F \sum_{s \in Q_t}\|\theta_{s+1} - \theta_s\|^2 \leq L_F\,\sigma_t\,\eta_0^2 G^2$. By Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\], the inner product satisfies $\langle \delta_t^{(1)}, \theta_t - \theta^*\rangle \leq \|\delta_t^{(1)}\|\cdot\|\theta_t - \theta^*\| \leq L_F\,\sigma_t\,\eta_0^2 G^2\,\mathrm{diam}(\Theta)$, where $\mathrm{diam}(\Theta) = \max_{\theta\in\Theta}\|\theta - \theta^*\|$ (bounded domain). Summing over $t$ and using $\sum_{t=1}^{T}\sigma_t \leq T\,\sigma_{\max}$ (since $\sigma_t \leq \sigma_{\max}$ for every $t$):

$$R_1 \;\leq\; L_F\,\mathrm{diam}(\Theta)\,\eta_0^2 G^2 \sum_{t=1}^{T}\sigma_t \;\leq\; L_F\,\mathrm{diam}(\Theta)\,\eta_0^2 G^2\,T\,\sigma_{\max} \;=\; O(\eta_0^2 G^2 T \sigma_{\max}), \tag{35}$$

which is sublinear in $T$ when $\eta_0 = O(1/\sqrt{T})$, giving $R_1 = O(G^2\sigma_{\max})$.
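To make the scaling in (35) concrete, the sketch below evaluates the right-hand side with $\eta_0 = c/\sqrt{T}$. The constants $L_F$, $\mathrm{diam}(\Theta)$, $G$, and $c$ are set to placeholder values of 1 purely for illustration.

```python
import math


def r1_bound(T, sigma_max, L_F=1.0, diam=1.0, G=1.0, c=1.0):
    """Right-hand side of (35) with the step size eta_0 = c / sqrt(T)."""
    eta0 = c / math.sqrt(T)
    return L_F * diam * eta0**2 * G**2 * T * sigma_max


# The factor T cancels against eta_0^2, so the bound does not grow with the horizon.
short_horizon = r1_bound(T=100, sigma_max=5)
long_horizon = r1_bound(T=10**6, sigma_max=5)
```

With this schedule the bound depends on the delay only through $\sigma_{\max}$, which is the linear dependence claimed for the transport term.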

Step 3: Bounding $R_2$ (pure inner bias). The inner-staleness error at time $t$ aggregates over outstanding rounds:

$$\delta_t^{(2)} = \sum_{s \in Q_t}\bigl[g_s(\theta_t, w_s) - g_s(\theta_t, w^*_t)\bigr].$$

By Lemma [7](https://arxiv.org/html/2605.12693#Thmlemma7), $\|\delta_t^{(2)}\| \leq \sum_{s\in Q_t}(L_g\,\epsilon_{\mathrm{inner}} + \kappa\|\theta_s - \theta_t\|)$. Isolate the pure-bias part: summing the constant term over the $\sigma_t$ elements of $Q_t$ contributes a gradient error of norm at most $L_g\sigma_t\epsilon_{\mathrm{inner}}$ at each round. By Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] and then Young's inequality with parameter $\mu_F$ (using $\mu_F$-strong convexity of $F$ from Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7)):

$$L_g\sigma_t\epsilon_{\mathrm{inner}}\,\|\theta_t - \theta^*\| \;\leq\; \frac{L_g^2\sigma_{\max}^2\epsilon_{\mathrm{inner}}^2}{2\mu_F} \;+\; \frac{\mu_F}{2}\left\|\theta_t - \theta^*\right\|^2.$$

The second term is absorbed into the regret on the left-hand side via strong convexity ($F(\theta_t) - F(\theta^*) \geq \frac{\mu_F}{2}\|\theta_t - \theta^*\|^2$, Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7)). Summing over $t$:

$$R_2 \;:=\; \sum_{t=1}^{T} L_g\sigma_t\epsilon_{\mathrm{inner}}\,\|\theta_t - \theta^*\| \;\leq\; \frac{L_g^2\sigma_{\max}^2 T\,\epsilon_{\mathrm{inner}}^2}{2\mu_F} \;=\; O(T\epsilon_{\mathrm{inner}}^2). \tag{36}$$

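The Young's-inequality step can be checked in isolation. In the sketch below, `a` plays the role of $L_g\sigma_t\epsilon_{\mathrm{inner}}$, `b` the role of $\|\theta_t - \theta^*\|$, and `lam` the role of $\mu_F$; the sampled ranges are arbitrary assumptions.

```python
import random


def young_gap(a, b, lam):
    """Slack in a*b <= a**2 / (2*lam) + lam * b**2 / 2, nonnegative for lam > 0."""
    return a * a / (2 * lam) + lam * b * b / 2 - a * b


def min_young_gap(trials=1000, seed=0):
    """Smallest slack observed over random nonnegative a, b and positive lam."""
    rng = random.Random(seed)
    return min(
        young_gap(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0.1, 5))
        for _ in range(trials)
    )
```

The slack equals $\tfrac{1}{2}(a/\sqrt{\lambda} - \sqrt{\lambda}\,b)^2$, so it vanishes exactly when $a = \lambda b$, for example `young_gap(2.0, 1.0, 2.0)`.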
Step 4: Bounding $R_3$ (interaction term) via Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] and sum-of-squares. The interaction part of $\delta_t^{(2)}$ is $\kappa\sum_{s\in Q_t}\|\theta_s - \theta_t\|$. By the triangle inequality and Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\]:

$$\left\|\theta_s - \theta_t\right\| \;\leq\; \sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\| \tag{37}$$

$$\left\|\theta_s - \theta_t\right\|^2 \;\leq\; (t - s)\sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\|^2 \;\leq\; \sigma_t\sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\|^2. \tag{38}$$

Apply Young's inequality to the interaction inner product (parameter $2\mu_F$, using $\mu_F$-strong convexity from Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7)):

$$\kappa\sum_{s\in Q_t}\|\theta_s - \theta_t\|\cdot\|\theta_t - \theta^*\| \;\leq\; \frac{\kappa^2}{4\mu_F}\left(\sum_{s\in Q_t}\left\|\theta_s - \theta_t\right\|\right)^2 + \mu_F\left\|\theta_t - \theta^*\right\|^2. \tag{39}$$

The $\mu_F\|\theta_t - \theta^*\|^2$ term is again absorbed by the strong-convexity gain. By Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] applied to the sum over $Q_t$:

$$\left(\sum_{s\in Q_t}\left\|\theta_s - \theta_t\right\|\right)^2 \;\leq\; \sigma_t\sum_{s\in Q_t}\left\|\theta_s - \theta_t\right\|^2 \;\leq\; \sigma_t^2\sum_{s\in Q_t}\sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\|^2,$$

where the last step uses ([38](https://arxiv.org/html/2605.12693#A5.E38)). To count multiplicities, let $I_t = \{k : s \leq k \leq t-1,\; s \in Q_t\}$ denote the set of all step indices spanned by the outstanding rounds. For a fixed $k \in I_t$, the term $\|\theta_{k+1} - \theta_k\|^2$ appears in the inner sum for every $s \in Q_t$ with $s \leq k$, which is at most $\min(k - \min Q_t + 1,\, \sigma_t) \leq \sigma_t$ times. Therefore:

$$\sum_{s\in Q_t}\sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\|^2 \;\leq\; \sigma_t\sum_{k\in I_t}\left\|\theta_{k+1} - \theta_k\right\|^2.$$

Combining with the $\sigma_t^2$ factor from the two preceding Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] applications:

$$\sigma_t^2\sum_{s\in Q_t}\sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\|^2 \;\leq\; \sigma_t^3\sum_{k\in I_t}\left\|\theta_{k+1} - \theta_k\right\|^2 \;\leq\; \sigma_{\max}^3\sum_{k\in I_t}\left\|\theta_{k+1} - \theta_k\right\|^2.$$

Since $I_t \supseteq Q_t$ and each $k \in I_t$ contributes at most one term $\|\theta_{k+1} - \theta_k\|^2$, we can relax to a sum over all steps in the window $[\min Q_t, t-1]$. Substituting back and summing over $t$:

$$R_3 \;:=\; \sum_{t=1}^{T}\kappa\sum_{s\in Q_t}\|\theta_s - \theta_t\|\cdot\|\theta_t - \theta^*\| \;\leq\; \frac{\kappa^2\sigma_{\max}^3}{4\mu_F}\sum_{t=1}^{T}\sum_{k\in I_t}\left\|\theta_{k+1} - \theta_k\right\|^2 \;=\; O\!\Bigl(\tfrac{\kappa^2\sigma_{\max}^3}{\mu_F}\sum_{t=1}^{T}\sum_{k\in W_t}\left\|\theta_{k+1} - \theta_k\right\|^2\Bigr), \tag{40}$$

where $W_t$ denotes the window of steps $[\min Q_t, t-1]$. The $\sigma_{\max}^3$ prefactor is accounted for as follows: two Cauchy–Schwarz applications each contribute one $\sigma_t$ factor, and the multiplicity count over $I_t$ contributes the third. The bound is non-vacuous because the per-step changes $\|\theta_{k+1} - \theta_k\|^2$ are themselves $O(\eta_0^2 G^2/(1 + \beta\bar{\sigma}_t))$ under the queue-length-adaptive step size ([3.2](https://arxiv.org/html/2605.12693#S3.SS2)), so the $\sigma_{\max}^3$ is partially absorbed.
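The chain of inequalities behind the $\sigma_t^3$ factor can be exercised on a random scalar trajectory. The sketch below assumes the outstanding rounds form a consecutive window $Q_t = \{t-\sigma, \dots, t-1\}$, so that $t - s \leq \sigma$ as used in (38); this window shape and the random-walk iterates are illustrative assumptions.

```python
import random


def check_interaction_chain(T=50, sigma=4, seed=0):
    """Check (sum_s |theta_s - theta_t|)^2 <= sigma^3 * sum_{k in I_t} |theta_{k+1} - theta_k|^2."""
    rng = random.Random(seed)
    theta = [0.0]
    for _ in range(T):
        theta.append(theta[-1] + rng.uniform(-1, 1))  # arbitrary per-step moves
    for t in range(sigma, T + 1):
        Q_t = range(t - sigma, t)   # hypothetical consecutive window, so t - s <= sigma
        I_t = range(min(Q_t), t)    # step indices spanned by the outstanding rounds
        lhs = sum(abs(theta[s] - theta[t]) for s in Q_t) ** 2
        rhs = sigma**3 * sum((theta[k + 1] - theta[k]) ** 2 for k in I_t)
        if lhs > rhs + 1e-9:
            return False
    return True
```

On typical trajectories the left-hand side is far below the right-hand side; the cubic factor is a worst-case allowance rather than a tight constant.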

Step 5: Aggregating $R_1 + R_2 + R_3$. Combining ([35](https://arxiv.org/html/2605.12693#A5.E35)), ([36](https://arxiv.org/html/2605.12693#A5.E36)), and ([40](https://arxiv.org/html/2605.12693#A5.E40)) with the OMD base regret (Lemma [1](https://arxiv.org/html/2605.12693#Thmlemma1)):

$$\mathrm{Regret}_T^{\mathrm{dec}} \;\leq\; \underbrace{O(\eta_0^2 G^2 T\sigma_{\max})}_{R_1} + \underbrace{O(T\epsilon_{\mathrm{inner}}^2)}_{R_2} + \underbrace{O\!\Bigl(\sum_{t=1}^{T}\sum_{s\in Q_t}\left\|\theta_{s+1} - \theta_s\right\|^2\Bigr)}_{R_3}.$$

This is exactly the decomposition ([6](https://arxiv.org/html/2605.12693#S4.E6)) in the theorem statement. Setting $\eta_0 = O(1/\sqrt{T})$ gives $R_1 = O(G^2\sigma_{\max})$, which is $O(1)$ in $T$, consistent with the $O(\sqrt{T\sigma_{\max}})$ rate from Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1).

Step 6: Comparison with squared-total-drift coupling. Under a single-level delayed optimizer (no IGT), the interaction error is measured by the total drift $\|\theta_t - \theta_{t-d_t}\|$, contributing $O(\sigma_{\max}^2\eta^2 G^2 T)$ to the transport error. Under IGT-OMD, the interaction is measured by $\sum_k\|\theta_{k+1} - \theta_k\|^2$ (per-step changes) rather than by the squared total drift $\|\theta_t - \theta_\tau\|^2 \leq (\sum_k\|\theta_{k+1} - \theta_k\|)^2$. By Cauchy–Schwarz \[[30](https://arxiv.org/html/2605.12693#bib.bib38)\] (applied as $\|\sum_k a_k\|^2 \leq n\sum_k\|a_k\|^2$ with $n = t - s \leq \sigma_t$ terms):

$$\|\theta_s - \theta_t\|^2 = \Bigl\|\sum_{k=s}^{t-1}(\theta_{k+1} - \theta_k)\Bigr\|^2 \;\leq\; \sigma_t\sum_{k=s}^{t-1}\left\|\theta_{k+1} - \theta_k\right\|^2.$$

Equivalently, $\sum_{k=s}^{t-1}\|\theta_{k+1} - \theta_k\|^2 \geq \frac{1}{\sigma_t}\|\theta_s - \theta_t\|^2$. Note that the pointwise inequality goes in the *opposite* direction (SoS $\geq$ drift$^2/\sigma_t$); it is the *worst-case bounds* that differ by a factor of $\sigma_t$. Specifically, the trivial bound $\sum_{k=s}^{t-1}\|\theta_{k+1} - \theta_k\|^2 \leq \sigma_t\,\eta_0^2 G^2$ shows the IGT interaction is at most $\sigma_t\eta_0^2 G^2$ per round, whereas the squared-total-drift bound gives $\|\theta_s - \theta_t\|^2 \leq \sigma_t^2\,\eta_0^2 G^2$, which is $\sigma_t$ times larger. This *worst-case gap* is IGT's $1/\sigma_t$ improvement. ∎
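The worst-case gap between the two measures is easy to see on scalar steps. The sketch below compares the squared total drift with the sum of squared per-step changes; the step values are illustrative assumptions, with each step of size $\eta_0 G$.

```python
def drift_vs_sum_of_squares(steps):
    """Return (squared total drift, sum of squared per-step changes) for scalar steps."""
    drift_sq = sum(steps) ** 2
    sos = sum(d * d for d in steps)
    return drift_sq, sos


# Worst case for the drift: n aligned steps of size eta0 * G.
n, eta0, G = 8, 0.5, 2.0
drift_sq, sos = drift_vs_sum_of_squares([eta0 * G] * n)

# Oscillating steps of the same size: the drift cancels, the sum of squares does not.
osc_drift_sq, osc_sos = drift_vs_sum_of_squares([1.0, -1.0, 1.0, -1.0])
```

In the aligned case the squared drift is $n$ times the sum of squares, which is the factor-$\sigma_t$ gap between the worst-case bounds; in the oscillating case the sum of squares is the larger quantity, matching the remark that the pointwise inequality runs the other way.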

### E.4 Proof of Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1) (DDE Stability)

##### Scope of the dynamical analysis\.

Unlike the global worst-case regret bounds of Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1), Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1) analyzes the *local*, asymptotic behaviour of the IGT-OMD [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde). Specifically, we study the *linearized, homogeneous* system obtained by Taylor-expanding around the equilibrium $\theta^*$ and setting $\epsilon_{\mathrm{inner}} = 0$ (inner-solver bias is accounted for separately by Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)). Step 2 derives the characteristic-root condition under constant worst-case queue length $\sigma_{\max}$; Step 3 extends to arbitrary time-varying $\sigma(t)$ via a Razumikhin argument; and Proposition [2](https://arxiv.org/html/2605.12693#Thmproposition2) bridges the continuous analysis to the discrete algorithm through z-domain stability inheritance (Lemma [13](https://arxiv.org/html/2605.12693#Thmlemma13)).

We prove local asymptotic stability of the linearized IGT-OMD [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde), verify that the stability boundary depends on the queue length $\sigma_{\max}$ rather than the raw delay $d_{\max}$, and confirm that the adaptive schedule $\eta_t$ places every characteristic root in the open left half-plane.

##### Setup\.

Define the deviation $\xi(t) = \theta(t) - \theta^*$. Linearizing the continuous-time limit of Algorithm [1](https://arxiv.org/html/2605.12693#alg1) about the equilibrium $\theta^*$ yields the linear vector [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde)

$$\dot{\xi}(t) = -\eta(t)\,H_F\,\xi(t) + \eta(t)\,L_{\mathrm{IGT}}\sum_{s\in Q(t)}\xi(t - \tau_s), \tag{41}$$

where:

- $H_F = \nabla^2 F(\theta^*)$ is the outer bilevel Hessian (positive definite by Assumption [7](https://arxiv.org/html/2605.12693#Thmassumption7), $\mu_F$-strong convexity of $F$, which gives $\lambda_{\min}(H_F) \geq \mu_F > 0$). Because $F$ is a scalar $C^2$ function (Assumption [1](https://arxiv.org/html/2605.12693#Thmassumption1)), $H_F$ is *symmetric* by Schwarz's theorem, so its eigenvalues are guaranteed to be real and positive. The complex roots $\lambda = \alpha + j\omega$ encountered in Lemma [9](https://arxiv.org/html/2605.12693#Thmlemma9) arise from the *temporal delay* operator, not from the spatial geometry of $H_F$;
- $L_{\mathrm{IGT}} = \nabla^2_{\theta w}\mathcal{L}_{\mathrm{model}}(w^*;\theta^*)\,[\nabla^2_{ww}\mathcal{L}_{\mathrm{model}}(w^*;\theta^*)]^{-1}\,\nabla^2_{w\theta}\mathcal{L}_{\mathrm{model}}(w^*;\theta^*)$ is the effective coupling matrix capturing how stale-gradient corrections feed back into the dynamics (bounded as $\|L_{\mathrm{IGT}}\| \leq L_{\theta w}L_{w\theta}/\mu_w$ by Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1) and [5](https://arxiv.org/html/2605.12693#Thmassumption5));
- $Q(t)$ is the set of outstanding observations (delays $\tau_s > 0$, $|Q(t)| = \sigma(t) \leq \sigma_{\max}$);
- the adaptive step-size schedule is $\eta_t = \eta_0/\sqrt{1 + \beta\sigma(t)}$ with $\beta > 0$.
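The adaptive schedule in the last item is a one-line function of the current queue length. A minimal sketch, with the function name and sample values chosen only for illustration:

```python
import math


def adaptive_step(eta0, beta, sigma):
    """Queue-length-adaptive step size eta0 / sqrt(1 + beta * sigma)."""
    return eta0 / math.sqrt(1.0 + beta * sigma)


# With an empty queue the base step is recovered; a longer queue shrinks the step.
base = adaptive_step(eta0=0.1, beta=0.5, sigma=0)
halved = adaptive_step(eta0=1.0, beta=1.0, sigma=3)
```

Because the step shrinks only like $1/\sqrt{\sigma}$, the schedule damps the delayed feedback term in (41) without freezing progress when many observations are outstanding.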

###### Lemma 8 (Reduction to worst-case spectral bound).

To establish asymptotic stability of the vector [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([41](https://arxiv.org/html/2605.12693#A5.E41)), it suffices to verify stability for the worst-case scalar mode. Let $\mu_{\min} = \lambda_{\min}(H_F) > 0$ and $\ell_{\max} = \|L_{\mathrm{IGT}}\|$. If the scalar [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde)

$$\dot{z}(t) = -\eta(t)\,\mu_{\min}\,z(t) + \eta(t)\,\ell_{\max}\sum_{s\in Q(t)} z(t - \tau_s) \tag{42}$$

is asymptotically stable, then so is ([41](https://arxiv.org/html/2605.12693#A5.E41)).

###### Proof\.

The proof proceeds by a Lyapunov energy argument that avoids simultaneous diagonalizability of $H_F$ and $L_{\mathrm{IGT}}$. Define $W(t) = \frac{1}{2}\|\xi(t)\|^2$. Then:

$$\begin{aligned}
\dot{W}(t) &= \xi(t)^{\top}\dot{\xi}(t) = -\eta(t)\,\xi(t)^{\top}H_F\,\xi(t) + \eta(t)\,\xi(t)^{\top}L_{\mathrm{IGT}}\sum_{s\in Q(t)}\xi(t - \tau_s) \\
&\leq -\eta(t)\,\mu_{\min}\|\xi(t)\|^2 + \eta(t)\,\ell_{\max}\|\xi(t)\|\sum_{s\in Q(t)}\|\xi(t - \tau_s)\|,
\end{aligned}$$

where we used $\xi^{\top}H_F\,\xi \geq \mu_{\min}\|\xi\|^2$ and $\xi^{\top}L_{\mathrm{IGT}}\,\xi(t - \tau_s) \leq \ell_{\max}\|\xi\|\,\|\xi(t - \tau_s)\|$. Setting $v(t) = \|\xi(t)\|$, we obtain the scalar comparison inequality $\dot{v}(t) \leq -\eta(t)\mu_{\min}\,v(t) + \eta(t)\ell_{\max}\sum_{s\in Q(t)} v(t - \tau_s)$ (using $\dot{W} = v\dot{v}$ when $v > 0$). This is exactly the scalar [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([42](https://arxiv.org/html/2605.12693#A5.E42)) applied to $v(t)$. By the comparison principle for [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde)s (\[[18](https://arxiv.org/html/2605.12693#bib.bib33)\], Theorem 5.1.1), if ([42](https://arxiv.org/html/2605.12693#A5.E42)) is asymptotically stable then $v(t) \to 0$, hence $\|\xi(t)\| \to 0$. Stability of ([41](https://arxiv.org/html/2605.12693#A5.E41)) thus reduces to stability of the single worst-case scalar [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([42](https://arxiv.org/html/2605.12693#A5.E42)), without requiring $H_F$ and $L_{\mathrm{IGT}}$ to commute. ∎

###### Lemma 9 (Constant-queue characteristic equation).

For constant queue length $\sigma(t) \equiv \sigma$ and constant delay $\tau_s \equiv \bar{\tau}$, the characteristic equation of ([42](https://arxiv.org/html/2605.12693#A5.E42)) (with $\eta(t) \equiv \eta$) is

$$\lambda + \eta\mu_i - \eta\ell_i\,\sigma\,e^{-\lambda\bar{\tau}} = 0. \tag{43}$$

A sufficient condition for all roots $\lambda \in \mathbb{C}$ to have $\mathrm{Re}(\lambda) < 0$ is

$$\eta\,|\ell_i|\,\sigma\,\bar{\tau} < 1 \qquad\text{and}\qquad \eta(\mu_i - |\ell_i|\sigma) > 0. \tag{44}$$

###### Proof\.

Following \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\] (Theorem 1 and Corollary 2) and the characteristic-root analysis in \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\] (equations (26)–(28)), substitute the exponential ansatz $z_i(t) = Ae^{\lambda t}$ into ([42](https://arxiv.org/html/2605.12693#A5.E42)), written for a generic mode with parameters $\mu_i$ and $\ell_i$, to obtain ([43](https://arxiv.org/html/2605.12693#A5.E43)).

Sufficient condition for $\mathrm{Re}(\lambda) < 0$. Write $\lambda = \alpha + j\omega$ with $\alpha, \omega \in \mathbb{R}$. A root with $\alpha \geq 0$ would require $|\lambda + \eta\mu_i| = \eta|\ell_i|\sigma e^{-\alpha\bar{\tau}} \leq \eta|\ell_i|\sigma$, but $|\lambda + \eta\mu_i| \geq \eta\mu_i - |\alpha|$ and at $\alpha = 0$ we need $\eta\mu_i \leq \eta|\ell_i|\sigma$, i.e. $\mu_i \leq |\ell_i|\sigma$, which is ruled out by the second condition in ([44](https://arxiv.org/html/2605.12693#A5.E44)). For $\alpha > 0$, the right-hand side decays while the left-hand side grows, giving a contradiction. More precisely, for $\alpha \geq 0$ take the modulus of ([43](https://arxiv.org/html/2605.12693#A5.E43)):

$$|\alpha + j\omega + \eta\mu_i| = \eta|\ell_i|\sigma e^{-\alpha\bar{\tau}} \leq \eta|\ell_i|\sigma.$$

Since $|\alpha + j\omega + \eta\mu_i|^2 = (\alpha + \eta\mu_i)^2 + \omega^2 \geq (\eta\mu_i)^2$, we would need $\eta\mu_i \leq \eta|\ell_i|\sigma$, contradicting $\mu_i > |\ell_i|\sigma$ (the second condition in ([44](https://arxiv.org/html/2605.12693#A5.E44)) evaluated at the worst mode). Hence, all roots satisfy $\alpha < 0$.

No converse is used. We use ([44](https://arxiv.org/html/2605.12693#A5.E44)) only as a sufficient certificate. When the instantaneous restoring term satisfies $\mu_i > |\ell_i|\sigma$, the modulus argument above already rules out right-half-plane roots for the comparison equation, so the delay-duration inequality should not be read as a necessary instability boundary. ∎
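As an informal illustration of the constant-queue case, the sketch below integrates the scalar DDE $\dot{z}(t) = -\eta\mu z(t) + \eta\ell\sigma z(t - \bar{\tau})$ by forward Euler with a constant history, and evaluates the certificate (44). The integrator, the parameter values, and the function names are assumptions made for this example; a decaying trajectory is consistent with the lemma but is not a proof of it.

```python
def simulate_scalar_dde(eta, mu, ell, sigma, tau, t_end=40.0, dt=0.01, z0=1.0):
    """Forward-Euler integration of z'(t) = -eta*mu*z(t) + eta*ell*sigma*z(t - tau)."""
    lag = int(round(tau / dt))
    z = [z0] * (lag + 1)  # constant history on [-tau, 0]
    for _ in range(int(round(t_end / dt))):
        delayed = z[-1 - lag]  # value of z one delay tau in the past
        z.append(z[-1] + dt * (-eta * mu * z[-1] + eta * ell * sigma * delayed))
    return z


def certificate_44(eta, mu, ell, sigma, tau):
    """Both parts of the sufficient condition (44)."""
    return eta * abs(ell) * sigma * tau < 1 and eta * (mu - abs(ell) * sigma) > 0


# Hypothetical parameters satisfying (44): 0.75 < 1 and 0.5 * (2.0 - 1.5) > 0.
params = dict(eta=0.5, mu=2.0, ell=0.5, sigma=3, tau=1.0)
traj = simulate_scalar_dde(**params)
```

For these values the restoring rate $\eta\mu = 1$ exceeds the delayed gain $\eta\ell\sigma = 0.75$, and the simulated trajectory decays toward zero over the integration horizon.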

###### Lemma 10 (Queue-length certificate in the active-window embedding).

In the event-time embedding used for the linearized IGT transport model, if outstanding observations occupy a consecutive active update window, the sufficient stability certificate can be expressed in terms of the queue length $\sigma$ by representing the delay span $\bar{\tau}$ through that active window.

###### Proof\.

Condition ([44](https://arxiv.org/html/2605.12693#A5.E44)) reads $\eta|\ell_i|\sigma\bar{\tau} < 1$. In the active-window embedding, each update event occupies time $\eta$ and the represented window of $\sigma$ consecutive outstanding observations has span $\bar{\tau} = \sigma\eta$. Substituting this represented span into the first condition:

$$\eta|\ell_i|\sigma\bar{\tau} \;\leq\; \eta^2|\ell_i|\sigma^2 \;=\; \frac{\eta_0^2\,|\ell_i|\,\sigma^2}{1 + \beta\sigma},$$

where we used the adaptive schedule $\eta = \eta_0/\sqrt{1 + \beta\sigma}$. Choosing $\beta = |\ell_{\max}|/\mu_{\min}$ (with $\mu_{\min} = \lambda_{\min}(H_F)$, $|\ell_{\max}| = \|L_{\mathrm{IGT}}\|$) and $\eta_0 \leq \mu_{\min}/(\|L_{\mathrm{IGT}}\|\sigma_{\max})$, we obtain

$$\frac{\eta_0^2\|L_{\mathrm{IGT}}\|\sigma^2}{1 + \beta\sigma} \;\leq\; \frac{\mu_{\min}^2\sigma^2}{\|L_{\mathrm{IGT}}\|\sigma_{\max}^2(1 + \beta\sigma)} \;\leq\; \frac{\mu_{\min}^2}{\|L_{\mathrm{IGT}}\|(1 + \beta\sigma)} \;\leq\; \frac{\mu_{\min}}{\beta\cdot\|L_{\mathrm{IGT}}\|^{-1}\cdot\|L_{\mathrm{IGT}}\|} = 1,$$

using $\sigma \leq \sigma_{\max}$ and $1 + \beta\sigma \geq 1$. Hence, within this event-time active-window model, the stability condition depends only on $\sigma$ and $\eta$ (which itself adapts to $\sigma$), not on the raw wall-clock age of the oldest observation.

This is the structural scope of the DDE result: for constant-throughput or consecutive-window embeddings, the certificate is governed by how many observations are simultaneously outstanding ($\sigma_{\max}$), not by $d_{\max}$ alone. Arbitrary sparse feedback processes with the same queue length but much older outstanding observations require an additional bounded-age assumption and are not covered by this lemma. Following \[[28](https://arxiv.org/html/2605.12693#bib.bib6)\], the quantities $\sigma_{\max}$ and $d_{\max}$ can differ by up to a factor of $T$, so $\sigma_{\max}$-based analysis can be far tighter when the active-window embedding is appropriate. ∎
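For given constants, the certificate in the active-window embedding can simply be evaluated. The sketch below substitutes $\bar{\tau} = \sigma\eta$ and $\eta = \eta_0/\sqrt{1+\beta\sigma}$ into (44) and reports each part separately; the function name and the parameter values are hypothetical and are not taken from the paper's experiments.

```python
import math


def active_window_certificate(eta0, beta, ell_norm, mu_min, sigma):
    """Evaluate both parts of (44) with tau_bar = sigma * eta and the adaptive eta."""
    eta = eta0 / math.sqrt(1.0 + beta * sigma)
    delay_part = eta**2 * ell_norm * sigma**2   # eta * |ell| * sigma * tau_bar
    restoring_part = mu_min - ell_norm * sigma  # sign of eta * (mu - |ell| * sigma)
    return delay_part < 1.0, restoring_part > 0.0


# Illustrative values with beta = ell_norm / mu_min, so 1 / beta = 10.
short_queue = active_window_certificate(0.5, 0.1, 0.1, 1.0, sigma=4)
long_queue = active_window_certificate(0.5, 0.1, 0.1, 1.0, sigma=12)
```

With these numbers both parts hold at $\sigma = 4$, while at $\sigma = 12 > 1/\beta$ both parts fail, so the sufficient certificate no longer applies there. The inputs are the queue length and the step-size parameters only; no wall-clock delay appears.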

##### Main proof of Proposition[1](https://arxiv.org/html/2605.12693#Thmproposition1)\.

###### Proof of Proposition[1](https://arxiv.org/html/2605.12693#Thmproposition1)\.

Step 1: Reduction to scalar modes. By Lemma [8](https://arxiv.org/html/2605.12693#Thmlemma8), it suffices to show each eigenmode is stable. For the worst mode, the effective scalar parameters are $\mu_{\min}=\lambda_{\min}(H_F)$ and $|\ell_{\max}|=\|L_{\mathrm{IGT}}\|$.

Step 2: Constant-$\sigma$ stability. For each fixed queue level $\sigma\leq\sigma_{\max}$, the adaptive step $\eta=\eta_0/\sqrt{1+\beta\sigma}$ is constant (on the timescale of the linearized analysis). Apply Lemma [9](https://arxiv.org/html/2605.12693#Thmlemma9) with $\bar{\tau}=\sigma\eta$ (Lemma [10](https://arxiv.org/html/2605.12693#Thmlemma10)): condition ([44](https://arxiv.org/html/2605.12693#A5.E44)) requires

$$\eta^{2}\,\|L_{\mathrm{IGT}}\|\,\sigma^{2}=\frac{\eta_0^{2}\,\|L_{\mathrm{IGT}}\|\,\sigma^{2}}{1+\beta\sigma}<1 \qquad\text{and}\qquad \mu_{\min}>\|L_{\mathrm{IGT}}\|\,\sigma. \tag{45}$$

The second inequality holds for all $\sigma<1/\beta=\mu_{\min}/\|L_{\mathrm{IGT}}\|$.

Setting $\beta=\|L_{\mathrm{IGT}}\|/\mu_{\min}$ and taking $\eta_0$ to satisfy the two-part bound in Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1), the first condition is verified as

$$\frac{\eta_0^{2}\|L_{\mathrm{IGT}}\|\sigma^{2}}{1+\beta\sigma} \;\leq\; \frac{(1+\beta\sigma_{\max})\,\|L_{\mathrm{IGT}}\|\sigma^{2}}{\sigma_{\max}^{2}\|L_{\mathrm{IGT}}\|(1+\beta\sigma)} \;\leq\; 1, \tag{46}$$

where the last inequality uses $\sigma\leq\sigma_{\max}$ and monotonicity of $\sigma^{2}/(1+\beta\sigma)$ on $\sigma\geq 0$. Hence both conditions in ([45](https://arxiv.org/html/2605.12693#A5.E45)) are satisfied for $\sigma<1/\beta$.

By Lemma [9](https://arxiv.org/html/2605.12693#Thmlemma9), all characteristic roots satisfy $\mathrm{Re}(\lambda_k)<0$, so the linearized [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) is asymptotically stable for each fixed $\sigma$.
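The two inequalities in (45) are easy to evaluate numerically for a given queue level. The following is a minimal sketch, not the authors' code: the function names and the numeric values of $\mu_{\min}$, $\|L_{\mathrm{IGT}}\|$ and $\eta_0$ are hypothetical and chosen only so that $1/\beta=20$.

```python
import math


def adaptive_step(eta0: float, beta: float, sigma: float) -> float:
    """Adaptive schedule eta = eta0 / sqrt(1 + beta * sigma)."""
    return eta0 / math.sqrt(1.0 + beta * sigma)


def check_condition_45(eta0, mu_min, l_igt_norm, sigma):
    """Return the truth values of the two inequalities in (45)."""
    beta = l_igt_norm / mu_min  # beta = ||L_IGT|| / mu_min
    eta = adaptive_step(eta0, beta, sigma)
    first = eta ** 2 * l_igt_norm * sigma ** 2 < 1.0
    second = mu_min > l_igt_norm * sigma
    return first, second


# Hypothetical values for illustration only (1 / beta = 20 here).
MU_MIN, L_IGT_NORM, ETA0 = 2.0, 0.1, 0.05
results = {s: check_condition_45(ETA0, MU_MIN, L_IGT_NORM, s) for s in (0, 5, 19, 25)}
for s, (first, second) in results.items():
    print(f"sigma={s:2d}  first={first}  second={second}")
```

With these values the second inequality is the binding one: it holds at $\sigma=19$ and fails at $\sigma=25$, on either side of $1/\beta$.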

Step 3: Time-varying $\sigma(t)$ (Razumikhin argument). When the queue length $\sigma(t)$ varies over time, the step size $\eta(t)=\eta_0/\sqrt{1+\beta\sigma(t)}$ co-varies and the system is non-autonomous. We use the Razumikhin approach (see, e.g., \[[18](https://arxiv.org/html/2605.12693#bib.bib33)\], Ch. 5), which avoids constructing a Lyapunov–Krasovskii functional explicitly.

Define the candidate Lyapunov function

$$V(\xi)=\tfrac{1}{2}\|\xi\|^{2}. \tag{47}$$

By Lemma [8](https://arxiv.org/html/2605.12693#Thmlemma8) (comparison principle), the deviation $\xi(t)=\theta(t)-\theta^{*}$ satisfies the norm inequality

$$\frac{d}{dt}\|\xi(t)\| \;\leq\; -\eta(t)\,\mu_{\min}\|\xi(t)\| + \eta(t)\,\|L_{\mathrm{IGT}}\|\sum_{s\in Q(t)}\|\xi(t-\tau_s)\|.$$

Apply the *Razumikhin condition*: suppose there exists $q>1$ such that $V(\xi(s))<q\,V(\xi(t))$, i.e. $\|\xi(s)\|<\sqrt{q}\,\|\xi(t)\|$, for all $s\in[t-\bar{\tau},\,t]$. Then each delayed term satisfies $\|\xi(t-\tau_s)\|<\sqrt{q}\,\|\xi(t)\|$, so with $|Q(t)|\leq\sigma_{\max}$:

$$\begin{aligned}
\dot{V} &= \xi(t)^{\top}\dot{\xi}(t) \;\leq\; -\eta(t)\mu_{\min}\|\xi(t)\|^{2} + \eta(t)\|L_{\mathrm{IGT}}\|\,\sigma_{\max}\,\sqrt{q}\,\|\xi(t)\|^{2} \\
&= -\eta(t)\bigl(\mu_{\min}-\sqrt{q}\,\|L_{\mathrm{IGT}}\|\,\sigma_{\max}\bigr)\|\xi(t)\|^{2}.
\end{aligned} \tag{48}$$

Since $\sigma_{\max}\beta<1$ (equivalently $\sigma_{\max}\|L_{\mathrm{IGT}}\|<\mu_{\min}$, which is guaranteed by the second condition in ([45](https://arxiv.org/html/2605.12693#A5.E45))), we may choose any $q\in\bigl(1,\;(\mu_{\min}/(\|L_{\mathrm{IGT}}\|\sigma_{\max}))^{2}\bigr)$, yielding a constant

$$c \;:=\; \mu_{\min}-\sqrt{q}\,\|L_{\mathrm{IGT}}\|\,\sigma_{\max} \;>\; 0$$

such that $\dot{V}\leq -c\,\eta(t)\,\|\xi(t)\|^{2} = -2c\,\eta(t)\,V<0$ whenever $V(\xi(s))<qV(\xi(t))$ on $[t-\bar{\tau},t]$. By Razumikhin's theorem (see, e.g., \[[18](https://arxiv.org/html/2605.12693#bib.bib33)\], Theorem 5.4.1), the equilibrium $\xi=0$ is uniformly asymptotically stable, i.e. $\theta(t)\to\theta^{*}$.

Step 4: Stability depends on $\sigma_{\max}$, not $d_{\max}$. By Lemma [10](https://arxiv.org/html/2605.12693#Thmlemma10), the stability condition involves $\sigma_{\max}$ exclusively. Two delay processes with the same $\sigma_{\max}$ but different $d_{\max}$ (e.g. constant delays vs. bursty delays) yield the same characteristic equation ([43](https://arxiv.org/html/2605.12693#A5.E43)) and the same Razumikhin bound in Step 3. Idle time (periods when no outstanding gradients are pending, $\sigma(t)=0$) does not destabilize the system: the coupling term in ([42](https://arxiv.org/html/2605.12693#A5.E42)) vanishes when $Q(t)=\emptyset$, and the system reduces to the non-delayed stable ODE $\dot{\xi}=-\eta\mu_i\xi$.

Step 5: Convergence rate. From $\dot{V}\leq -2c\,\eta(t)\,V$ (established in Step 3) and $\eta(t)\geq\eta_0/\sqrt{1+\beta\sigma_{\max}}=:\bar{\eta}$:

$$V(\xi_t)\leq V(\xi_0)\,e^{-2c\bar{\eta}t},$$

corresponding to a convergence rate $O(e^{-c\bar{\eta}t})=O\!\left(e^{-c\eta_0 t/\sqrt{1+\beta\sigma_{\max}}}\right)$, exactly as stated in Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1).

Conclusion. All characteristic roots of the linearized IGT-OMD [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) satisfy $\mathrm{Re}(\lambda)<0$ whenever $\eta_0<1/(\|H_F\|\sqrt{1+\beta\sigma_{\max}})$, the system is asymptotically stable by the Razumikhin argument in Step 3, and the stability boundary depends on $\sigma_{\max}$ (queue length), not $d_{\max}$ (delay duration). ∎

###### Corollary 3\(Stability phase transition\)\.

Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)'s criterion $\mu_{\min}>\|L_{\mathrm{IGT}}\|\sigma$ implies a phase transition at $\sigma_{\max}^{*}=\mu_{\min}/\|L_{\mathrm{IGT}}\|=1/\beta$: (i) for $\sigma_{\max}<1/\beta$, the non-delayed step size $\eta_0=1/\|H_F\|$ remains stable; (ii) for $\sigma_{\max}\geq 1/\beta$, the adaptive schedule must damp $\eta$ as $1/\sqrt{\beta\sigma_{\max}}$, trading convergence speed for stability. This corollary explains the LQR result (Section [5.1](https://arxiv.org/html/2605.12693#S5.SS1)) that single-level methods lose stability near $\sigma\approx 15$: without adaptive damping, the coupling term $\eta\|L_{\mathrm{IGT}}\|\sigma$ eventually exceeds $\eta\mu_{\min}$.
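The threshold $\sigma^{*}=\mu_{\min}/\|L_{\mathrm{IGT}}\|$ can be seen in a toy scalar recursion in which the contraction $-\eta\mu_{\min}$ competes with a coupling $+\eta\|L_{\mathrm{IGT}}\|$ summed over $\sigma$ delayed copies. The sketch below is an illustrative assumption, not the paper's LQR experiment: it uses a fixed step size, a constant initial history, and hypothetical parameter values for which $\sigma^{*}=20$.

```python
def simulate_scalar_envelope(mu_min, l_norm, sigma, eta, steps=2000):
    """Iterate x_{k+1} = (1 - eta*mu_min) x_k + eta*l_norm * sum_{j=1..sigma} x_{k-j}.

    Returns |x| after `steps` iterations, starting from a constant history of ones.
    """
    history = [1.0] * (sigma + 1)  # x_{-sigma}, ..., x_0
    for _ in range(steps):
        delayed_sum = sum(history[-1 - j] for j in range(1, sigma + 1))
        x_next = (1.0 - eta * mu_min) * history[-1] + eta * l_norm * delayed_sum
        history.append(x_next)
    return abs(history[-1])


# Hypothetical values: mu_min / l_norm = 20 is the predicted threshold.
MU_MIN, L_NORM, ETA = 2.0, 0.1, 0.05
below = simulate_scalar_envelope(MU_MIN, L_NORM, sigma=10, eta=ETA)
above = simulate_scalar_envelope(MU_MIN, L_NORM, sigma=30, eta=ETA)
print(f"sigma=10 -> {below:.3e}   sigma=30 -> {above:.3e}")
```

Because all coefficients of this recursion are nonnegative, it decays when they sum to less than one ($\sigma\|L_{\mathrm{IGT}}\|<\mu_{\min}$) and grows when they sum to more than one, which is the sign change the corollary describes.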

### E\.5Proof of Proposition[2](https://arxiv.org/html/2605.12693#Thmproposition2)\(Discrete\-Time Consistency\)

We prove that Algorithm [1](https://arxiv.org/html/2605.12693#alg1) is a first-order Euler discretization of the continuous-time [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([41](https://arxiv.org/html/2605.12693#A5.E41)) with global error $O(\eta_{\max}\sqrt{T})$, and that stability of the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) implies stability of the discrete algorithm. The proof combines standard Euler error accumulation via the Gronwall lemma with a z-domain argument for stability inheritance, following the Euler–Maruyama bridge (Section II-D) and the discrete/continuous staleness correspondence (equations (3)–(5)) established in \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\].

##### Notation\.

Write $t_k=\sum_{s=1}^{k}\eta_s$ for the continuous time elapsed after $k$ discrete steps. For a fixed horizon of $T$ steps with $\eta_{\max}=O(1/\sqrt{T})$, we have $t_T=O(\sqrt{T})$. Denote the continuous-time trajectory by $\theta^{\mathrm{c}}(t)$ and the discrete iterates by $\theta^{\mathrm{d}}_k:=\theta^{\mathrm{disc}}_k$. Define the global discretization error at step $k$ as $e_k:=\theta^{\mathrm{d}}_k-\theta^{\mathrm{c}}(t_k)$.

###### Lemma 11\(Local truncation error\)\.

Let $f(t,\theta)=-\eta H_F\theta(t)+\eta L_{\mathrm{IGT}}\sum_{s\in Q(t)}\theta(t-\tau_s)$ denote the right-hand side of the linearized [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([41](https://arxiv.org/html/2605.12693#A5.E41)). If $\theta^{\mathrm{c}}$ is the exact solution and step $k$ uses step size $\eta_k$, the Euler step $\theta^{\mathrm{d}}_{k+1}=\theta^{\mathrm{d}}_k+\eta_k f(t_k,\theta^{\mathrm{d}}_k)$ satisfies

$$\bigl\|\theta^{\mathrm{c}}(t_{k+1})-\bigl[\theta^{\mathrm{c}}(t_k)+\eta_k f(t_k,\theta^{\mathrm{c}}(t_k))\bigr]\bigr\| \;\leq\; \frac{\eta_k^{2}}{2}\,M, \tag{49}$$

where $M:=\sup_t\|\ddot{\theta}^{\mathrm{c}}(t)\|<\infty$ is the bound on the second time-derivative of the exact solution, which exists and is finite in the stable regime.

###### Proof\.

By Taylor's theorem applied to $\theta^{\mathrm{c}}(t_{k+1})=\theta^{\mathrm{c}}(t_k+\eta_k)$:

$$\theta^{\mathrm{c}}(t_{k+1})=\theta^{\mathrm{c}}(t_k)+\eta_k\dot{\theta}^{\mathrm{c}}(t_k)+\frac{\eta_k^{2}}{2}\ddot{\theta}^{\mathrm{c}}(\zeta_k)$$

for some $\zeta_k\in(t_k,t_{k+1})$. Since $\dot{\theta}^{\mathrm{c}}(t_k)=f(t_k,\theta^{\mathrm{c}}(t_k))$ by the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde), the local truncation error is $\frac{\eta_k^{2}}{2}\|\ddot{\theta}^{\mathrm{c}}(\zeta_k)\|\leq\frac{\eta_k^{2}}{2}M$, giving ([49](https://arxiv.org/html/2605.12693#A5.E49)). ∎
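The local bound can be checked by hand on the simplest non-delayed special case ($Q(t)=\emptyset$). The sketch below assumes a hypothetical scalar test problem $\dot{\theta}=-a\theta$ with $\theta(0)=1$, so that $\theta(t)=e^{-at}$ and $M=\sup_{t\geq 0}|\ddot{\theta}(t)|=a^{2}$; it illustrates the Taylor argument and is not part of the paper's analysis.

```python
import math


def euler_local_error(a: float, t: float, eta: float) -> float:
    """One-step Euler error started from the exact solution of theta' = -a*theta."""
    exact_next = math.exp(-a * (t + eta))
    euler_next = math.exp(-a * t) * (1.0 - a * eta)  # theta(t) + eta * f(theta(t))
    return abs(exact_next - euler_next)


A = 3.0       # hypothetical decay rate
M = A ** 2    # sup_{t >= 0} |theta''(t)| for theta(t) = exp(-A t)

checks = []
for eta in (0.2, 0.1, 0.05, 0.025):
    err = euler_local_error(A, t=0.0, eta=eta)
    bound = 0.5 * eta ** 2 * M  # right-hand side of the local truncation bound
    checks.append((eta, err, bound))
    print(f"eta={eta:<6} error={err:.6f}  bound={bound:.6f}")
```

Halving $\eta$ roughly quarters the one-step error, which is the second-order behavior the lemma states.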

###### Lemma 12\(Global error via discrete Gronwall\)\.

Let $L_f=\|H_F\|+\sigma_{\max}\|L_{\mathrm{IGT}}\|$ be the Lipschitz constant of $f(\cdot,\cdot)$ in its second argument. The global discretization error satisfies

$$\|e_k\| \;\leq\; \frac{M\eta_{\max}}{2L_f}\bigl(e^{L_f t_k}-1\bigr). \tag{50}$$

Remark on the Gronwall factor. Since $t_T=O(\sqrt{T})$ with $\eta_{\max}=O(1/\sqrt{T})$, the naive Gronwall bound ([50](https://arxiv.org/html/2605.12693#A5.E50)) incurs an exponential factor $e^{L_f t_T}=e^{O(L_f\sqrt{T})}$ that grows with $T$. This growth is an artifact of the classical Gronwall lemma \[[24](https://arxiv.org/html/2605.12693#bib.bib32)\], which ignores the system's stability. For stable linear [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde)s, the error does not actually accumulate because the homogeneous part contracts. The stability-aware refinement proceeds as follows: in the stable regime (all $\mathrm{Re}(\lambda)<0$ by Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)), both the exact and discrete solutions converge to $\theta^{*}$. The error equation $e_{k+1}=(1-\eta_k\mu_{\min})e_k+O(\eta_k\ell_{\max}\sigma_{\max}\sup_{j\leq k}|e_j|)+O(\eta_k^{2}M)$ is itself a stable recurrence when $\eta_k\mu_{\min}<1$ and $\eta_k\ell_{\max}\sigma_{\max}<\mu_{\min}$ (both guaranteed by Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)). By a contraction argument, the error converges to a fixed point of size $|e_\infty|\leq O(\eta_{\max}M/\mu_{\min})$, since a forcing of order $\eta_k^{2}M$ is balanced by a contraction of order $\eta_k\mu_{\min}$. Hence, for the entire trajectory:

$$\sup_{k\leq T}\|e_k\| \;=\; O(\eta_{\max}) \;=\; O(1/\sqrt{T}). \tag{51}$$

###### Proof\.

We use the discrete Gronwall lemma\[[13](https://arxiv.org/html/2605.12693#bib.bib36), Lemma 2\.1\]\. Split the error recursion:

$$\begin{aligned}
e_{k+1} &= \theta^{\mathrm{d}}_{k+1}-\theta^{\mathrm{c}}(t_{k+1}) \\
&= \underbrace{\theta^{\mathrm{d}}_k+\eta_k f(t_k,\theta^{\mathrm{d}}_k)}_{\text{Euler step}} \;-\; \underbrace{\Bigl[\theta^{\mathrm{c}}(t_k)+\eta_k f(t_k,\theta^{\mathrm{c}}(t_k))\Bigr]}_{\text{Euler applied to exact}} \;+\; \underbrace{\Bigl[\theta^{\mathrm{c}}(t_k)+\eta_k f(t_k,\theta^{\mathrm{c}}(t_k))\Bigr]-\theta^{\mathrm{c}}(t_{k+1})}_{\text{local truncation error}}.
\end{aligned} \tag{52}$$

By $L_f$-Lipschitz continuity of $f$ in $\theta$: $\|e_k+\eta_k[f(t_k,\theta^{\mathrm{d}}_k)-f(t_k,\theta^{\mathrm{c}}(t_k))]\|\leq(1+\eta_k L_f)\|e_k\|$. By Lemma [11](https://arxiv.org/html/2605.12693#Thmlemma11), the local truncation term has norm $\leq\frac{\eta_k^{2}}{2}M$. Hence:

$$\|e_{k+1}\|\leq(1+\eta_k L_f)\|e_k\|+\tfrac{\eta_k^{2}}{2}M.$$

With $e_0=0$ (both trajectories start at $\theta_0$), unrolling gives:

$$\begin{aligned}
\|e_k\| &\leq \tfrac{M}{2}\sum_{j=0}^{k-1}\eta_j^{2}\prod_{i=j+1}^{k-1}(1+\eta_i L_f) \\
&\leq \tfrac{M\eta_{\max}}{2}\sum_{j=0}^{k-1}\eta_j\,\exp\!\Bigl(L_f\sum_{i=j+1}^{k-1}\eta_i\Bigr) \\
&\leq \tfrac{M\eta_{\max}}{2}\int_{0}^{t_k}e^{L_f(t_k-s)}\,\mathrm{d}s = \frac{M\eta_{\max}}{2L_f}\bigl(e^{L_f t_k}-1\bigr),
\end{aligned} \tag{53}$$

where we used $1+x\leq e^{x}$, converted the sum to an integral, and used $\eta_j\leq\eta_{\max}$ throughout. Setting $k=T$:

$$\sup_{k\leq T}\|e_k\| \;\leq\; \frac{M\eta_{\max}}{2L_f}\bigl(e^{L_f t_T}-1\bigr) \;=\; \frac{M\eta_{\max}}{2L_f}\bigl(e^{c_G}-1\bigr),$$

where $c_G:=L_f\,t_T$ is set by the total integrated learning rate $t_T=\sum_{k=0}^{T-1}\eta_k\leq\eta_{\max}T$. This establishes ([50](https://arxiv.org/html/2605.12693#A5.E50)). With $\eta_{\max}=O(1/\sqrt{T})$ we have $t_T=O(\sqrt{T})$, so $c_G$ is not bounded uniformly in $T$ and the naive bound by itself does not yield ([51](https://arxiv.org/html/2605.12693#A5.E51)). The uniform estimate $\sup_{k\leq T}\|e_k\|=O(\eta_{\max})$ follows instead from the stability-aware contraction argument in the remark above, under which the error does not accumulate. ∎
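The unrolling in (53) can be reproduced numerically. The sketch below assumes a constant step size and hypothetical values $L_f=M=1$; it iterates the worst-case recursion $\|e_{k+1}\|=(1+\eta L_f)\|e_k\|+\tfrac{\eta^{2}}{2}M$ with equality and compares it against the closed form in (50).

```python
import math


def unrolled_error(eta: float, l_f: float, m: float, steps: int) -> list:
    """Iterate e_{k+1} = (1 + eta*L_f) e_k + (eta^2 / 2) M from e_0 = 0."""
    e, out = 0.0, [0.0]
    for _ in range(steps):
        e = (1.0 + eta * l_f) * e + 0.5 * eta ** 2 * m
        out.append(e)
    return out


def gronwall_bound(eta: float, l_f: float, m: float, k: int) -> float:
    """Closed form M*eta/(2 L_f) * (exp(L_f t_k) - 1) with t_k = k * eta."""
    return m * eta / (2.0 * l_f) * math.expm1(l_f * k * eta)


# Hypothetical constants; eta = 1/sqrt(T) mirrors eta_max = O(1/sqrt(T)).
L_F, M_CONST, T = 1.0, 1.0, 400
ETA = 1.0 / math.sqrt(T)
errors = unrolled_error(ETA, L_F, M_CONST, T)
bounds = [gronwall_bound(ETA, L_F, M_CONST, k) for k in range(T + 1)]
print(f"k=T: recursion={errors[-1]:.3e}  bound={bounds[-1]:.3e}  L_f*t_T={L_F * T * ETA:.1f}")
```

The recursion stays below the closed form at every step, while the exponent $L_f t_T=L_f\sqrt{T}$ visibly grows with the horizon.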

###### Lemma 13\(Stability inheritance via z\-domain analysis\)\.

If the continuous-time [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([41](https://arxiv.org/html/2605.12693#A5.E41)) is asymptotically stable with convergence rate $\gamma>0$ (i.e. all characteristic roots satisfy $\mathrm{Re}(\lambda)\leq-\gamma<0$), then, for step sizes satisfying

$$\eta \;\leq\; \min\!\left(\frac{1}{\|H_F\|},\;\frac{2\gamma}{\gamma^{2}+\omega_{\max}^{2}}\right)=:\eta^{*}, \tag{54}$$

where $|\mathrm{Im}(\lambda)|\leq\omega_{\max}$, the discrete-time system (Algorithm [1](https://arxiv.org/html/2605.12693#alg1) applied to the linearization) is also asymptotically stable: all roots $z$ of the discrete characteristic polynomial satisfy $|z|<1$.

Scope of the scalar polynomial. The scalar polynomial below is not the literal characteristic polynomial of the matrix system; it is the *worst-case spectral envelope* obtained via the comparison principle in Lemma [8](https://arxiv.org/html/2605.12693#Thmlemma8). That lemma bounds the vector [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) through a scalar comparison inequality using $\mu_{\min}=\lambda_{\min}(H_F)\geq\mu_F$ and $\ell_{\max}=\|L_{\mathrm{IGT}}\|$, without requiring $H_F$ and $L_{\mathrm{IGT}}$ to commute or be simultaneously diagonalizable. Collapsing the distributed delay $\sum_{s\in Q_t}\xi(t-\tau_s)$ to a single worst-case lag further reduces the distributed buffer to a conservative point-delay bound: for constant queue length $\sigma$, $\sum_{s\in Q}|v(t-\tau_s)|\leq\sigma\sup_s|v(t-\tau_s)|$, so the point-delay scalar envelope dominates the multi-lag system. The resulting polynomial is:

$$z^{\sigma+1}-(1-\eta\mu_{\min})z^{\sigma}+\eta\sigma\|L_{\mathrm{IGT}}\|=0. \tag{55}$$

###### Proof\.

The polynomial ([55](https://arxiv.org/html/2605.12693#A5.E55)) has degree $\sigma+1$ and therefore $\sigma+1$ roots (counted with multiplicity).

Principal root. The first condition $\eta\leq 1/\|H_F\|$ ensures that $\|I-\eta H_F\|=\max(|1-\eta\mu_{\min}|,\,|1-\eta\lambda_{\max}(H_F)|)\leq 1$. Without this bound, the highest eigenmode maps to $z=1-\eta\lambda_{\max}(H_F)<-1$ when $\eta>2/\lambda_{\max}(H_F)$, causing period-2 oscillations at $z=-1$ that the scalar envelope (which tracks only $\mu_{\min}$) cannot detect.

At $\eta=0$, the polynomial ([55](https://arxiv.org/html/2605.12693#A5.E55)) reduces to $z^{\sigma}(z-1)=0$, placing the principal root at $z=1$. By the implicit function theorem applied to the polynomial in $(z,\eta)$, this root is an analytic function of $\eta$ near $\eta=0$. A first-order perturbation expansion yields $z_0(\eta)=1+\eta\lambda^{\mathrm{c}}+O(\eta^{2})$, where $\lambda^{\mathrm{c}}=-\gamma+j\omega$ ($\gamma>0$) is the corresponding root of the continuous characteristic equation ([43](https://arxiv.org/html/2605.12693#A5.E43)). The $O(\eta^{2})$ discrepancy arises because the discrete delay operator $z^{-\sigma}$ and the continuous operator $e^{-\lambda\tau}$ differ at second order: $(1+\eta\lambda)^{-\sigma}=e^{-\lambda\tau}\bigl(1+O(\eta)\bigr)$ for fixed $\sigma$ and $\tau=\sigma\eta$. This first-order agreement is commensurate with the Euler method's $O(\eta^{2})$ local truncation error, so the stability threshold derived from the first-order location is accurate to the same order. Setting $z_0\approx(1-\eta\gamma)+j\eta\omega$,

$$|z_0|^{2}\approx(1-\eta\gamma)^{2}+\eta^{2}\omega^{2}=1-2\eta\gamma+\eta^{2}(\gamma^{2}+\omega^{2}).$$

For $|z_0|^{2}<1$ we need $\eta<2\gamma/(\gamma^{2}+\omega^{2})$. Since we require this for all characteristic roots, the most restrictive constraint is at the root with largest $|\omega|$, giving $\eta<2\gamma/(\gamma^{2}+\omega_{\max}^{2})$. Combined with the $z=-1$ guard, the composite threshold $\eta^{*}$ in ([54](https://arxiv.org/html/2605.12693#A5.E54)) is a *sufficient* condition ensuring $|z_0|<1$.
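The first-order modulus computation is a one-line calculation. The sketch below uses hypothetical values $\gamma=0.5$, $\omega=2$ and evaluates $|z_0|^{2}$ for the first-order root location $z_0=1+\eta(-\gamma+j\omega)$ on either side of the threshold $2\gamma/(\gamma^{2}+\omega^{2})$; it ignores the $O(\eta^{2})$ correction discussed above.

```python
def principal_root_modulus_sq(eta: float, gamma: float, omega: float) -> float:
    """|z0|^2 for the first-order location z0 = 1 + eta * (-gamma + 1j * omega)."""
    z0 = 1.0 + eta * complex(-gamma, omega)
    return abs(z0) ** 2


def step_threshold(gamma: float, omega: float) -> float:
    """Largest eta with |z0| < 1 at first order: 2*gamma / (gamma^2 + omega^2)."""
    return 2.0 * gamma / (gamma ** 2 + omega ** 2)


GAMMA, OMEGA = 0.5, 2.0  # hypothetical decay rate and oscillation frequency
eta_star = step_threshold(GAMMA, OMEGA)
inside = principal_root_modulus_sq(0.9 * eta_star, GAMMA, OMEGA)
outside = principal_root_modulus_sq(1.1 * eta_star, GAMMA, OMEGA)
print(f"threshold={eta_star:.4f}  |z0|^2 below={inside:.4f}  above={outside:.4f}")
```

Larger $\omega$ shrinks the admissible step, which is why the bound is taken at $\omega_{\max}$.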

Parasitic roots. The remaining $\sigma$ roots of ([55](https://arxiv.org/html/2605.12693#A5.E55)) have no continuous-time counterparts and are artifacts of the discretization. By Schur–Cohn theory \[[16](https://arxiv.org/html/2605.12693#bib.bib35)\], at $\eta=0$ the polynomial reduces to $z^{\sigma+1}-z^{\sigma}=z^{\sigma}(z-1)=0$, whose roots are $z=1$ (principal, mapping to $\lambda^{\mathrm{c}}=0$) and $z=0$ with multiplicity $\sigma$ (all strictly inside the unit circle). As $\eta$ increases from zero, these roots move continuously. By the implicit function theorem, for $\eta$ sufficiently small the parasitic roots remain inside the unit circle, since they start at $z=0$ and can only reach $|z|=1$ at a finite $\eta_{\mathrm{crit}}>0$. The threshold $\eta^{*}$ above is chosen small enough that no parasitic root has reached the unit circle. The exact critical step size $\eta_{\mathrm{crit}}$ at which a parasitic root first touches $|z|=1$ depends on $\sigma$ and the polynomial coefficients; $\eta^{*}$ is a conservative sufficient bound that simultaneously controls both principal and parasitic roots.

With the adaptive schedule $\eta_t\leq\eta_0\leq 1/(\|H_F\|\sqrt{1+\beta\sigma_{\max}})$ (from Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1)), both conditions in ([54](https://arxiv.org/html/2605.12693#A5.E54)) are satisfied: the first because $1/(\|H_F\|\sqrt{1+\beta\sigma_{\max}})\leq 1/\|H_F\|$, and the second by the principal-root modulus argument above. Substituting $\gamma=c\eta_0/\sqrt{1+\beta\sigma_{\max}}$ (from Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1), Step 5) into ([54](https://arxiv.org/html/2605.12693#A5.E54)), the explicit delay-dependent threshold is

$$\eta \;\leq\; \min\!\left(\frac{1}{\|H_F\|},\;\frac{2c\,\eta_0}{c^{2}\eta_0^{2}+\omega_{\max}^{2}(1+\beta\sigma_{\max})}\right), \tag{56}$$

which tightens with increasing $\sigma_{\max}$, confirming that the discrete stability region is strictly contained in the continuous one and that the *discretization penalty* scales with the delay. Hence all $|z|<1$ uniformly. ∎
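For a concrete queue length, the roots of the scalar envelope polynomial (55) can be computed directly. The sketch below uses `numpy.roots` with hypothetical parameter values and assumes $\sigma\geq 1$ so that the $z^{\sigma}$ and constant terms are distinct; it is a numerical illustration, not a substitute for the Schur–Cohn argument.

```python
import numpy as np


def envelope_roots(mu_min: float, l_norm: float, sigma: int, eta: float) -> np.ndarray:
    """Roots of z^(sigma+1) - (1 - eta*mu_min) z^sigma + eta*sigma*l_norm = 0."""
    if sigma < 1:
        raise ValueError("this sketch assumes sigma >= 1")
    coeffs = np.zeros(sigma + 2)               # highest degree first
    coeffs[0] = 1.0                            # z^(sigma+1)
    coeffs[1] = -(1.0 - eta * mu_min)          # z^sigma
    coeffs[-1] = eta * sigma * l_norm          # constant term
    return np.roots(coeffs)


# Hypothetical values with sigma * l_norm < mu_min and eta * mu_min < 1.
roots = envelope_roots(mu_min=2.0, l_norm=0.1, sigma=5, eta=0.05)
max_modulus = float(np.max(np.abs(roots)))
print(f"{len(roots)} roots, largest modulus = {max_modulus:.4f}")
```

Here $(1-\eta\mu_{\min})+\eta\sigma\|L_{\mathrm{IGT}}\|=0.925<1$, so every root lies strictly inside the unit circle, consistent with the lemma for this parameter choice.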

###### Lemma 14\(Delayed\-coordinate discretization error\)\.

The discrete algorithm uses the stored iterate $\theta^{\mathrm{d}}_{k-\sigma_k}$ to approximate the continuous delayed term $\theta^{\mathrm{c}}(t_k-\tau_k)$. The additional error from this delay discretization satisfies

$$\bigl\|\theta^{\mathrm{d}}_{k-\sigma_k}-\theta^{\mathrm{c}}(t_k-\tau_k)\bigr\| \;\leq\; \|e_{k-\sigma_k}\|+O(\eta_{\max}\sigma_{\max}). \tag{57}$$

###### Proof\.

By the triangle inequality:

$$\|\theta^{\mathrm{d}}_{k-\sigma_k}-\theta^{\mathrm{c}}(t_k-\tau_k)\| \leq \underbrace{\|\theta^{\mathrm{d}}_{k-\sigma_k}-\theta^{\mathrm{c}}(t_{k-\sigma_k})\|}_{=\|e_{k-\sigma_k}\|}+\underbrace{\|\theta^{\mathrm{c}}(t_{k-\sigma_k})-\theta^{\mathrm{c}}(t_k-\tau_k)\|}_{\text{time-alignment error}}.$$

The time-alignment error arises because the discrete algorithm uses a fixed integer lag $\sigma_k$, while the continuous [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) uses the exact delay $\tau_k$. Recall that $t_{k-\sigma_k}=\sum_{s=1}^{k-\sigma_k}\eta_s$ and $t_k-\tau_k=\sum_{s=1}^{k}\eta_s-\tau_k$. In the Euler embedding, $\tau_k=\sum_{s=k-\sigma_k+1}^{k}\eta_s$ (the $\sigma_k$ steps' worth of continuous time that elapse between dispatch and receipt), so $t_{k-\sigma_k}=t_k-\sum_{s=k-\sigma_k+1}^{k}\eta_s=t_k-\tau_k$ exactly. Hence $\|\theta^{\mathrm{c}}(t_{k-\sigma_k})-\theta^{\mathrm{c}}(t_k-\tau_k)\|=0$ when the discrete delay embedding is exact.

In general, however, the continuous delay $\tau_k$ may not align perfectly with $\sum_{s=k-\sigma_k+1}^{k}\eta_s$ due to time-varying step sizes: a gradient dispatched during one step may correspond to a slightly different continuous-time interval. The mismatch is bounded by the variation of the step size over $\sigma_k$ steps: $|t_{k-\sigma_k}-(t_k-\tau_k)|\leq\sigma_k|\eta_{\max}-\eta_{\min}|\leq\sigma_{\max}\eta_{\max}|\Delta_\beta|$, where $|\Delta_\beta|=\bigl|1-\sqrt{(1+\beta\sigma_{\max})/(1+\beta\sigma_{\min})}\bigr|\leq 1$. Since $\|\dot{\theta}^{\mathrm{c}}\|\leq C$ (bounded in the stable regime by $C=(\|H_F\|+\sigma_{\max}\|L_{\mathrm{IGT}}\|)\|\xi\|_\infty$), the Lipschitz continuity of $\theta^{\mathrm{c}}$ gives $\|\theta^{\mathrm{c}}(t_{k-\sigma_k})-\theta^{\mathrm{c}}(t_k-\tau_k)\|\leq C\sigma_{\max}\eta_{\max}$, yielding ([57](https://arxiv.org/html/2605.12693#A5.E57)). ∎
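The exact-embedding identity $t_{k-\sigma_k}=t_k-\tau_k$ is a statement about partial sums and can be checked with any step-size sequence. The sketch below assumes a hypothetical queue-length trace and hypothetical $\eta_0$, $\beta$, and defines $\tau_k$ as the sum of the last $\sigma_k$ step sizes, as in the Euler embedding above.

```python
import itertools
import math


def elapsed_times(step_sizes):
    """t_0 = 0 and t_k = eta_1 + ... + eta_k, with step_sizes[s - 1] = eta_s."""
    return [0.0] + list(itertools.accumulate(step_sizes))


def embedded_delay(step_sizes, k: int, sigma_k: int) -> float:
    """tau_k = sum_{s = k - sigma_k + 1}^{k} eta_s (continuous time spanned by the lag)."""
    return sum(step_sizes[k - sigma_k:k])


# Hypothetical queue lengths driving the adaptive schedule eta_0 / sqrt(1 + beta * sigma).
ETA0, BETA = 0.05, 0.05
queue_lengths = [0, 1, 2, 3, 3, 2, 4, 5, 1, 0, 2, 3]
step_sizes = [ETA0 / math.sqrt(1.0 + BETA * q) for q in queue_lengths]
t = elapsed_times(step_sizes)

k, sigma_k = 10, 4
tau_k = embedded_delay(step_sizes, k, sigma_k)
mismatch = abs(t[k - sigma_k] - (t[k] - tau_k))
print(f"t[k - sigma_k]={t[k - sigma_k]:.6f}  t[k] - tau_k={t[k] - tau_k:.6f}")
```

The mismatch is zero up to floating-point rounding even though the step sizes vary, so a nonzero time-alignment error only appears when $\tau_k$ is defined independently of the step-size sequence.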

##### Main proof of Proposition[2](https://arxiv.org/html/2605.12693#Thmproposition2)\.

###### Proof of Proposition[2](https://arxiv.org/html/2605.12693#Thmproposition2)\.

Step 1: Algorithm is Euler discretization of [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde). The update rule of Algorithm [1](https://arxiv.org/html/2605.12693#alg1) at step $k$ is

$$\theta^{\mathrm{d}}_{k+1}=\theta^{\mathrm{d}}_k-\eta_k\,g_k^{\mathrm{IGT}}=\theta^{\mathrm{d}}_k+\eta_k\Bigl(-H_F\theta^{\mathrm{d}}_k+L_{\mathrm{IGT}}\sum_{s\in Q_k}\theta^{\mathrm{d}}_{k-\sigma_s}\Bigr)+\eta_k\epsilon_k,$$

where $\epsilon_k$ collects higher-order and nonlinear terms (inner-solver errors and IGT transport residuals). This forward Euler step is applied to the linearized [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) ([41](https://arxiv.org/html/2605.12693#A5.E41)) with the discrete delayed coordinates in place of the continuous ones, plus a perturbation $\epsilon_k$ with $\|\epsilon_k\|\leq G\eta_k+L_f\eta_k^{2}$ (from Lemmas [11](https://arxiv.org/html/2605.12693#Thmlemma11) and [14](https://arxiv.org/html/2605.12693#Thmlemma14)). Following \[[35](https://arxiv.org/html/2605.12693#bib.bib16)\] (equations (3)–(5)), the continuous-time [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) and its discrete-time Euler–Maruyama counterpart maintain a one-to-one correspondence in which step staleness $\sigma_k$ maps to the continuous delay $\tau$ via $\tau=\sigma_k\eta_k$, confirming Algorithm [1](https://arxiv.org/html/2605.12693#alg1) as a valid Euler discretization of ([41](https://arxiv.org/html/2605.12693#A5.E41)).
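The linearized update in Step 1 can be written as a short forward-Euler routine. The sketch below is an illustration under stated assumptions, not Algorithm 1 itself: it sets the perturbation $\epsilon_k$ to zero, uses a fixed step size and fixed integer lags, and the matrices `h_f` and `l_igt` are hypothetical stand-ins for $H_F$ and $L_{\mathrm{IGT}}$.

```python
import numpy as np


def linearized_delay_step(history, h_f, l_igt, eta, lags):
    """One step theta_{k+1} = theta_k + eta * (-H_F theta_k + L_IGT * sum_s theta_{k - lag_s}).

    `history[-1]` is theta_k and `history[-1 - lag]` is theta_{k - lag}.
    The perturbation epsilon_k of the full update is omitted (assumed zero).
    """
    theta_k = history[-1]
    delayed = sum((history[-1 - lag] for lag in lags), np.zeros_like(theta_k))
    return theta_k + eta * (-h_f @ theta_k + l_igt @ delayed)


# Hypothetical 2-D example with two outstanding gradients (lags 1 and 2).
h_f = np.diag([2.0, 3.0])
l_igt = 0.1 * np.eye(2)
lags = (1, 2)
history = [np.ones(2) for _ in range(max(lags) + 1)]  # constant initial history
initial_norm = float(np.linalg.norm(history[-1]))

for _ in range(200):
    history.append(linearized_delay_step(history, h_f, l_igt, eta=0.05, lags=lags))

final_norm = float(np.linalg.norm(history[-1]))
print(f"||theta_0|| = {initial_norm:.3f}  ->  ||theta_200|| = {final_norm:.3e}")
```

In this example $\sigma\|L_{\mathrm{IGT}}\|=0.2<\mu_{\min}=2$, so the iterates contract toward the equilibrium at the origin, as the stable regime of Proposition 1 predicts.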

Step 2: Global approximation error bound (stability-aware). Combining Lemma [11](https://arxiv.org/html/2605.12693#Thmlemma11) (local truncation) with Lemma [14](https://arxiv.org/html/2605.12693#Thmlemma14) (delay coordinate alignment), the total local error per step is $\delta_k\leq\frac{\eta_k^{2}}{2}M+C\eta_k^{2}\sigma_{\max}L_f=O(\eta_k^{2})$. A naive Gronwall argument would give $\|e_k\|=O(\eta_{\max}e^{L_f t_T}/L_f)$ with $L_f t_T=O(\sqrt{T})$, yielding a divergent factor $e^{O(\sqrt{T})}$.

Lemma [12](https://arxiv.org/html/2605.12693#Thmlemma12) avoids this by exploiting the stability of the linearized system: the error recurrence $e_{k+1}=(I-\eta_k H_F)e_k+\eta_k L_{\mathrm{IGT}}e_{k-\sigma_k}+\delta_k$ is itself a contraction whenever the conditions of Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1) hold. Applying the stability-aware bound from Lemma [12](https://arxiv.org/html/2605.12693#Thmlemma12), a per-step forcing of size $\delta_k$ balanced against a contraction of order $\eta_k\mu_{\min}$ gives

$$\sup_k\|e_k\|=\sup_k\|\theta^{\mathrm{d}}_k-\theta^{\mathrm{c}}(t_k)\| \;\leq\; \frac{(M+2C\sigma_{\max}L_f)\,\eta_{\max}}{2\mu_{\min}} \;=\; O(\eta_{\max}),$$

since $\mu_{\min}=\lambda_{\min}(H_F)>0$ provides the contraction rate for the error dynamics. With $\eta_{\max}=O(1/\sqrt{T})$, the tracking error is $O(1/\sqrt{T})\to 0$. *This is a uniform tracking bound*: the discrete trajectory stays within $O(\eta_{\max})$ of the continuous [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) trajectory at all times. Convergence of $\theta^{\mathrm{d}}_k$ to $\theta^{*}$ is established separately in Step 3 via discrete stability; the $O(\eta_{\max})$ tracking guarantee ensures the discrete and continuous solutions share the same equilibrium and qualitative dynamics.

Step 3: Stability inheritance. By Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1), the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) is asymptotically stable with convergence rate $\gamma=c\eta_{0}/\sqrt{1+\beta\sigma_{\max}}$ (from Step 5 of the [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) stability proof).

*Time-varying convergence via tracking.* Convergence of the discrete iterates $\theta^{\mathrm{d}}_{k}\to\theta^{*}$ under arbitrarily varying $\sigma(t)$ follows from Steps 2 and 4 without the z-domain: the continuous trajectory converges ($\theta^{\mathrm{c}}(t)\to\theta^{*}$ by the Razumikhin argument in Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1), which handles time-varying delays), and the tracking bound $\|\theta^{\mathrm{d}}_{k}-\theta^{\mathrm{c}}(t_{k})\|=O(\eta_{\max})\to 0$ (Step 2, no constant-$\sigma$ assumption needed) guarantees the discrete trajectory inherits this convergence.

*Convergence rate under worst-case frozen delay.* By Lemma [13](https://arxiv.org/html/2605.12693#Thmlemma13), both the principal discrete root and all $\sigma$ parasitic roots of the characteristic polynomial ([55](https://arxiv.org/html/2605.12693#A5.E55)) satisfy $|z|<1$ whenever $\eta_{t}\leq 2\gamma/(\gamma^{2}+\omega_{\max}^{2})$. Since the characteristic polynomial requires a fixed integer delay, the z-domain rate characterization formally applies to the worst-case frozen queue length $\sigma_{\max}$; the time-varying case inherits convergence from the tracking argument above. Since the step size satisfies $\eta_{t}=\eta_{0}/\sqrt{1+\beta\sigma(t)}\leq\eta^{*}$ (verified for each $t$ by the same condition that ensures [DDE](https://arxiv.org/html/2605.12693#glo.acronym.dde) stability), all discrete roots have modulus $<1$, so $\theta^{\mathrm{d}}_{k}\to\theta^{*}$ as $k\to\infty$.
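The step-size condition above can be illustrated with a small check. The sketch below computes the queue-length-adaptive step $\eta_t=\eta_0/\sqrt{1+\beta\sigma(t)}$ and compares it with the threshold $2\gamma/(\gamma^2+\omega_{\max}^2)$. As a simplifying assumption (not the statement of Lemma 13), the principal discrete root is modelled as the forward-Euler image $z=1-\eta(\gamma+i\omega)$ of a continuous eigenvalue $-(\gamma+i\omega)$; parasitic roots are not modelled. All numeric values and function names are hypothetical.

```python
import math


def adaptive_step(eta0, beta, sigma):
    """Queue-length-adaptive step size eta_t = eta0 / sqrt(1 + beta * sigma)."""
    return eta0 / math.sqrt(1.0 + beta * sigma)


def step_threshold(gamma, omega_max):
    """Threshold 2*gamma / (gamma^2 + omega_max^2) from the condition in the text."""
    return 2.0 * gamma / (gamma ** 2 + omega_max ** 2)


def principal_root_modulus(eta, gamma, omega):
    """|1 - eta*(gamma + i*omega)|: forward-Euler root (modelling assumption)."""
    return abs(1.0 - eta * complex(gamma, omega))


gamma, omega_max = 0.5, 2.0          # hypothetical decay rate and oscillation frequency
eta = adaptive_step(eta0=0.2, beta=0.1, sigma=20)
print(eta <= step_threshold(gamma, omega_max))            # step-size condition holds?
print(principal_root_modulus(eta, gamma, omega_max) < 1)  # root inside the unit disc?
```

Under this simplified model, $|1-\eta(\gamma+i\omega)|^2 = 1-2\eta\gamma+\eta^2(\gamma^2+\omega^2)$, so the modulus is below one exactly when $\eta<2\gamma/(\gamma^2+\omega^2)$, which is where the threshold comes from.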

Step 4: Convergence rate of discrete iterates. From Steps 2 and 3, the discrete error decomposes as

$$\|\theta^{\mathrm{d}}_{k}-\theta^{*}\|\;\leq\;\underbrace{\|\theta^{\mathrm{d}}_{k}-\theta^{\mathrm{c}}(t_{k})\|}_{O(\eta_{\max})}\;+\;\underbrace{\|\theta^{\mathrm{c}}(t_{k})-\theta^{*}\|}_{O(e^{-\gamma t_{k}})}.$$

The second term decays exponentially at the continuous rate $\gamma$. The first term is bounded uniformly by $O(\eta_{\max})$ via the stability-aware Gronwall bound (Step 2), which *decays* as $\eta_{\max}=O(1/\sqrt{T})\to 0$. In terms of the step count $k$: since $t_{k}\geq k\bar{\eta}$ with $\bar{\eta}=\eta_{0}/\sqrt{1+\beta\sigma_{\max}}$:

$$\|\theta^{\mathrm{d}}_{k}-\theta^{*}\|\;\leq\;O\!\left(e^{-\bar{\eta}\gamma k}\right)+O(\eta_{\max}).$$

Setting $k=T$ and $\eta_{\max}=1/\sqrt{T}$ gives $\|\theta^{\mathrm{d}}_{T}-\theta^{*}\|\leq O(e^{-\bar{\eta}\gamma T})+O(1/\sqrt{T})\to 0$. The transient phase is governed by the continuous-time eigenvalue $\gamma$, while the residual tracking error decays at the rate $O(1/\sqrt{T})$ set by the step size. The actual discrete convergence rate is determined by the discrete eigenvalues $|z_{\max}|=1-\eta\gamma+O(\eta^{2})<1$ from Lemma [13](https://arxiv.org/html/2605.12693#Thmlemma13). $\square$

## Appendix F Experimental Details

This section presents computational infrastructure, our selection of hyperparameters, and additional details on our experiments.

### F.1 Computational Infrastructure

All experiments run on NVIDIA RTX A5500 GPUs (24 GB VRAM) using PyTorch 2.0 \[[25](https://arxiv.org/html/2605.12693#bib.bib26)\] on a shared academic cluster. LQR experiments require approximately 2–4 hours per delay value (10 seeds); Warcraft experiments require approximately 8–12 hours per delay configuration. The Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) controlled experiment (5 seeds $\times$ 5 delays) requires approximately 10 hours total. Total computational budget: roughly 200–250 GPU-hours for all reported results. Preliminary and exploratory runs not reported in the paper required an additional 150–200 GPU-hours.

### F.2 LQR Hyperparameters

Table 6: LQR experiment hyperparameters.

Code and data will be released upon acceptance.

### F.3 Sinkhorn Error-Scaling Experiment (Additional Details)

This section presents the full transport-error scaling results for the Sinkhorn OT benchmark, complementing the main-text discussion (Section [5.3](https://arxiv.org/html/2605.12693#S5.SS3)) with per-algorithm regression statistics.

### F.4 Sinkhorn OT Hyperparameters

Table 7: Sinkhorn optimal transport experiment hyperparameters. Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) refers to Section [5.4](https://arxiv.org/html/2605.12693#S5.SS4).

Extended setup. We use the Sinkhorn OT environment ($n=10$, $K=10$ inner steps, regularization $\varepsilon=0.05$, OU drift $\gamma_{\text{ou}}=0.05$) with constant delays $d\in\{1,2,5,10,20,50\}$, $T=1{,}000$ rounds, and 5 seeds per condition. Algorithms: IGT-OMD (Algorithm [1](https://arxiv.org/html/2605.12693#alg1)), D-FTRL, Robust OMD, and Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd).

Results. Figure [1](https://arxiv.org/html/2605.12693#S5.F1) in the main manuscript and Table [8](https://arxiv.org/html/2605.12693#A6.T8) verify the $\sigma_{\max}$-factor prediction of Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c). We compute two transport error surrogates: $R_{\mathrm{sq}}=\sum_{t}\left\|\theta_{t}-\theta_{t-d}\right\|^{2}$ (squared total drift) and $R_{3}=\sum_{t}\sum_{s\in Q_{t}}\left\|\theta_{s+1}-\theta_{s}\right\|^{2}$ (sum of per-step squared drifts).
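The two surrogates can be computed directly from a stored iterate trajectory. The following is a minimal sketch, assuming a constant delay `d` and a queue $Q_t=\{t-d,\dots,t-1\}$ (the queue convention is an assumption made here for illustration; the function names are hypothetical). Iterates are plain lists of floats.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def transport_surrogates(thetas, d):
    """Return (R_sq, R_3) for a trajectory thetas[0..T-1] under constant delay d.

    R_sq sums the squared total drift ||theta_t - theta_{t-d}||^2.
    R_3 sums the per-step squared drifts over the assumed queue {t-d, ..., t-1}.
    """
    r_sq, r_3 = 0.0, 0.0
    for t in range(d, len(thetas)):
        r_sq += sq_dist(thetas[t], thetas[t - d])
        r_3 += sum(sq_dist(thetas[s + 1], thetas[s]) for s in range(t - d, t))
    return r_sq, r_3


# Straight-line drift: every step moves by the same vector, the worst case
# for the Cauchy-Schwarz inequality R_sq <= d * R_3.
line = [[0.1 * t, -0.2 * t] for t in range(60)]
r_sq, r_3 = transport_surrogates(line, d=5)
print(r_sq / r_3)
```

For a straight-line trajectory the total drift over `d` steps is `d` times a single step, so the ratio $R_{\mathrm{sq}}/R_3$ equals `d`; for any trajectory Cauchy-Schwarz gives $R_{\mathrm{sq}}\leq d\,R_3$ under this queue convention.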

Table 8: Sinkhorn OT transport error scaling. $R_{\mathrm{sq}}$: squared total drift; $R_{3}$: sum of per-step squared drifts; ratio $\approx\sigma_{\max}$ for all algorithms.

Practical considerations. The per-round cost of the transport re-evaluation loop (lines 16–18 of Algorithm [1](https://arxiv.org/html/2605.12693#alg1)) is $O(\sigma_{\max}\,p\,q)$: for each of the $\sigma_{\max}$ queued feedbacks, the adjoint product $[\nabla_{\theta}\nabla_{w}\mathcal{L}_{\text{model}}]^{\top}v_{s}^{*}$ requires one Hessian-vector product. In our Sinkhorn environment ($p=100$, $q=100$), the wall-clock overhead is approximately 1.1 seconds per transport re-evaluation at $d=50$, adding roughly 18 minutes at $T=1{,}000$.

Takeaway. The near-perfect $R^{2}>0.99$ across all four algorithms confirms that the $O(\sigma^{2})$ vs. $O(\sigma)$ scaling is a geometric identity of the bilevel transport problem, not an artifact of any specific optimizer, providing strong empirical support for Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c).

### F.5 Warcraft Shortest Path (Additional Details)

This section provides the extended Warcraft setup and results that complement Table [3](https://arxiv.org/html/2605.12693#S5.T3) in the main text.

Warcraft Dataset

The Warcraft shortest-path dataset was introduced by Vlastelica et al. \[[33](https://arxiv.org/html/2605.12693#bib.bib20)\] and uses map tiles from Warcraft II, distributed for research purposes under the academic use terms of that work. PyTorch \[[25](https://arxiv.org/html/2605.12693#bib.bib26)\] is used under the BSD 3-Clause license. The LQR and Sinkhorn environments are fully synthetic and generated by our code; no third-party data licenses apply.

Table 9: Warcraft shortest-path experiment hyperparameters.

Extended setup. We study shortest-path planning on $12\times 12$ grids derived from Warcraft II maps with four terrain types (grass, forest, water, mountain). The outer parameters $\theta\in\mathbb{R}^{128}$ parameterize a 2-layer neural network predicting per-cell traversal costs $c_{\mathrm{pred}}(i,j;\theta)$. The inner problem runs Dijkstra’s algorithm to find the shortest path $\pi^{*}(\theta)=\operatorname{argmin}_{\pi}\sum_{(i,j)\in\pi}c_{\mathrm{pred}}(i,j;\theta)$. The decision loss evaluates the chosen path under true terrain costs: $\mathcal{L}_{\text{true}}=\sum_{(i,j)\in\pi}c_{\mathrm{true}}(i,j)$; the *optimality gap* reports the excess cost over the oracle shortest path. Main results are in Table [3](https://arxiv.org/html/2605.12693#S5.T3) (Section [5.2](https://arxiv.org/html/2605.12693#S5.SS2)).
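The inner problem and the optimality gap described above can be sketched on a small grid. This is a minimal illustration, not the benchmark code: it assumes 4-connected moves, a start at the top-left cell and goal at the bottom-right cell, a path cost that includes the start cell, and an optimality gap measured as an absolute cost difference. All function names are hypothetical, and the predicted costs are hand-written rather than produced by a neural network.

```python
import heapq


def dijkstra_path(cost):
    """Cheapest top-left to bottom-right path on a grid of per-cell costs."""
    n, m = len(cost), len(cost[0])
    start, goal = (0, 0), (n - 1, m - 1)
    dist = {start: cost[0][0]}
    prev = {}
    heap = [(cost[0][0], start)]
    while heap:
        d, (i, j) = heapq.heappop(heap)
        if d > dist[(i, j)]:
            continue  # stale heap entry
        if (i, j) == goal:
            break
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):  # 4-connected (assumption)
            a, b = i + di, j + dj
            if 0 <= a < n and 0 <= b < m:
                nd = d + cost[a][b]
                if nd < dist.get((a, b), float("inf")):
                    dist[(a, b)] = nd
                    prev[(a, b)] = (i, j)
                    heapq.heappush(heap, (nd, (a, b)))
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def path_cost(path, cost):
    """Sum of per-cell costs along a path."""
    return sum(cost[i][j] for i, j in path)


def optimality_gap(c_pred, c_true):
    """True cost of the path chosen under c_pred, minus the oracle path's true cost."""
    chosen = dijkstra_path(c_pred)
    oracle = dijkstra_path(c_true)
    return path_cost(chosen, c_true) - path_cost(oracle, c_true)


c_true = [[1, 1, 1], [9, 9, 1], [1, 1, 1]]
c_pred = [[1, 9, 9], [1, 1, 9], [9, 1, 1]]  # mispredicts which cells are expensive
print(optimality_gap(c_pred, c_true), optimality_gap(c_true, c_true))
```

The gap is zero when predictions match the true costs and positive when mispredicted costs route the path through expensive terrain, which is the quantity the decision loss penalizes.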

Results. Table [3](https://arxiv.org/html/2605.12693#S5.T3) reports the optimality gap. IGT-OMD consistently achieves the lowest optimality gap: at constant $d=100$, gap $1.56$ vs. $1.83$ for D-FTRL ($14.8\%$ reduction) and $2.42$ for 2-Stage ($35.5\%$ reduction). Under Poisson $\lambda=100$, the gap is $1.53$ vs. $1.94$ for D-FTRL ($21.1\%$ reduction). 2-Stage and SPO+ degrade worst ($1.6$–$3.4\times$ gap relative to IGT-OMD), confirming staleness amplification (Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(a)). Since Dijkstra’s solver yields $\epsilon_{\mathrm{inner}}=0$, the bilevel-specific IGT advantage (Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)) is inoperative here; the smaller reduction over delay-aware methods ($12$–$21\%$) vs. the $\sigma_{\max}$-factor gain on Sinkhorn confirms this structural prediction. Reductions are computed as $(L_{\mathrm{baseline}}-L_{\mathrm{IGT}})/L_{\mathrm{baseline}}$.
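The reduction formula can be applied to the gaps quoted in the paragraph above as a quick consistency check. The helper name below is hypothetical; the numeric inputs are the optimality gaps stated in the text.

```python
def relative_reduction(l_baseline, l_igt):
    """(L_baseline - L_IGT) / L_baseline, as defined in the text."""
    return (l_baseline - l_igt) / l_baseline


# Gaps quoted in the text: (baseline gap, IGT-OMD gap)
reported = {
    "D-FTRL, constant d=100": (1.83, 1.56),
    "2-Stage, constant d=100": (2.42, 1.56),
    "D-FTRL, Poisson lambda=100": (1.94, 1.53),
}
for name, (baseline, igt) in reported.items():
    print(name, round(100 * relative_reduction(baseline, igt), 1))
```

The rounded values reproduce the 14.8%, 35.5%, and 21.1% reductions stated above.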

### F.6 Summary of Experimental Evidence

Table [10](https://arxiv.org/html/2605.12693#A6.T10) maps each benchmark result to the theorem it validates.

Table 10: Unified experimental summary.

## Appendix G Additional Experiments

### G.1 MPC with Neural ODE Dynamics

This section is intentionally appendix-only: the neural ODE dynamics model and shooting-based MPC inner solver are non-convex and fall outside Assumptions [1](https://arxiv.org/html/2605.12693#Thmassumption1) and [7](https://arxiv.org/html/2605.12693#Thmassumption7). We include it as a stress test of the transport-scaling mechanism, while reserving theorem-level decision-quality claims for a non-convex extension.

Setup. We construct a bilevel MPC benchmark on the HalfCheetah locomotion task \[[5](https://arxiv.org/html/2605.12693#bib.bib37)\]. A two-layer neural ODE ($128$ hidden units, ${\sim}21{,}800$ parameters; $\mathrm{state\_dim}=17$, $\mathrm{action\_dim}=6$) serves as the learnable dynamics model (outer variable $\theta$); the inner problem performs shooting-based MPC over horizon $H=5$ via $K=10$ gradient steps through the differentiable dynamics rollout. Unlike the Sinkhorn benchmark, the inner solver is a nonlinear trajectory optimizer whose approximation quality is directly controlled by $K$, giving a genuine $\varepsilon_{\mathrm{inner}}>0$ stress case. Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) is evaluated at constant delays $d\in\{0,10,50,100\}$, $T=2{,}000$ rounds, 5 seeds.

Results. The transport error metric $R_{3}$ (sum-of-squares) grows linearly in delay: $R_{3}=5.65\pm 0.24$ at $d=0$, $57.40\pm 1.71$ at $d=10$, $280.48\pm 16.24$ at $d=50$, and $547.27\pm 6.56$ at $d=100$ (linear fit: slope $5.43$, $R^{2}=0.9999$, $p<10^{-4}$). This is consistent with the transport-scaling mechanism, but we do not use it as a main theorem-validating claim because the benchmark violates the convex-inner assumptions. Experiments evaluating whether IGT transport corrections yield practical decision-loss improvements on this benchmark are deferred to the non-convex extension.
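The reported linear fit can be re-derived from the four mean values quoted above. The sketch below performs an unweighted ordinary least-squares fit on the means only; this is an assumption, since the original fit may have used per-seed values or weights, and the function name is hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least squares y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


delays = [0, 10, 50, 100]
r3_means = [5.65, 57.40, 280.48, 547.27]  # mean R_3 values quoted in the text
slope, intercept, r_squared = linear_fit(delays, r3_means)
print(round(slope, 2), round(r_squared, 4))
```

On these four means the fit gives a slope of about 5.43 with $R^2$ close to 0.9999, in agreement with the figures stated in the text.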

Takeaway. The MPC result is an appendix stress test: $R_{3}$ scaling persists empirically in a non-convex continuous-control setting, while theorem-level claims remain restricted to the assumption-aligned benchmarks.

### G.2 Optimizer-Agnostic Transport Bound

This section validates that the transport error reduction is independent of the base optimizer, as claimed in the main text (Section [5.4](https://arxiv.org/html/2605.12693#S5.SS4)).

Lemma [2](https://arxiv.org/html/2605.12693#Thmlemma2) bounds the gradient estimation error $\left\|\bar{g}_{t}-g_{t}^{\mathrm{IGT}}\right\|$ in terms of the iterate trajectory $\{\theta_{s}\}_{s\in Q_{t}}$ and problem constants $(L_{F},L_{w\theta},\mu_{w},\epsilon_{\mathrm{inner}})$ only; it does not depend on the update rule that generated the iterates. Therefore, any optimizer that computes $g_{t}^{\mathrm{IGT}}$ via Algorithm [1](https://arxiv.org/html/2605.12693#alg1)’s re-evaluation procedure benefits from the $O(\sigma_{\max}^{2}\eta^{2}G^{2})\to O(\sigma_{\max}\eta^{2}G^{2})$ transport cost reduction of Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c).

The end-to-end regret bound of Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1) is stated for the OMD/SGD update rule: Step 2 of the proof applies the OMD regret lemma (Lemma [1](https://arxiv.org/html/2605.12693#Thmlemma1)), and Step 4 uses the uniform step-size bound $\left\|\theta_{t+1}-\theta_{t}\right\|\leq\eta_{0}G$ specific to projected gradient descent. Extending this to Adam requires per-coordinate step-size bounds replacing the uniform $\eta_{0}G$, which we leave as future work. Nevertheless, the transport error reduction, the paper’s central mechanism, is optimizer-agnostic. Empirically, Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) achieves 7.9–9.5% lower regret at $d\geq 20$ and D-FTRL+IGT yields a nearly identical improvement curve (+4.7–9.8%, $p<0.01$ for $d\geq 5$; Table [11](https://arxiv.org/html/2605.12693#A7.T11)).

Table 11: Sinkhorn OT: D-FTRL+IGT vs. D-FTRL baseline ($T=1{,}000$, 5 seeds). Replicates the Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) pattern, confirming optimizer-agnostic transport.

### G.3 Adversarial (Uniform) Delay Validation

This section validates that the IGT-OMD benefit extends beyond constant delay to stochastic delay patterns, as predicted by Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1).

Constant delay is the worst-case special case of the adversarial bound (Remark [3](https://arxiv.org/html/2605.12693#Thmremark3)). To test non-constant delays, we compare Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) vs. Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) under *uniform* random delays $d_{t}\sim\mathrm{Uniform}[0,d_{\max}]$, using the same Sinkhorn OT environment as Section [5.4](https://arxiv.org/html/2605.12693#S5.SS4). Results use $T=1{,}000$ rounds and 5 seeds.

Table 12: Adversarial (uniform) vs. constant delay: cumulative regret ($T=1{,}000$, 5 seeds). IGT benefit under uniform delay is 53–60% of the benefit under constant delay, consistent with the theoretical prediction that uniform delays carry half the effective queue load.

Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) achieves lower regret than Stale OMD under uniform delays at all $d_{\max}$ levels. The advantage ratio (uniform/constant) of 53–60% confirms the theoretical prediction: under $\mathrm{Uniform}[0,d_{\max}]$ delays, the expected effective queue length is $d_{\max}/2$, so the IGT benefit scales accordingly. This validates that the adversarial-setting guarantee of Theorems [1](https://arxiv.org/html/2605.12693#Thmtheorem1)–[3](https://arxiv.org/html/2605.12693#Thmtheorem3) translates to practical benefit under non-constant delay distributions.

Takeaway. The 53–60% ratio of uniform-to-constant benefit matches the expected $d_{\max}/2$ effective queue length under uniform delays, confirming the theory’s distributional predictions.

### G.4 Sensitivity to Inner-Solver Quality

On the Sinkhorn benchmark, sensitivity to the number of inner steps $K$ is reported in Appendix [G.5](https://arxiv.org/html/2605.12693#A7.SS5) (Table [13](https://arxiv.org/html/2605.12693#A7.T13)); the LQR analog is omitted because the stability-boundary experiment holds the inner configuration fixed to isolate delay handling.

### G.5 Sinkhorn Inner-Solver Sensitivity ($\epsilon_{\mathrm{inner}}$ via $K$)

Setup. To validate Theorem [3](https://arxiv.org/html/2605.12693#Thmtheorem3)’s prediction on the Sinkhorn OT task, we vary the number of Sinkhorn iterations $K\in\{1,3,5,10,20,50\}$ at fixed delay $d=20$ for both Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) and Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) ($T=1{,}000$, 5 seeds). Since $\epsilon_{\mathrm{inner}}\approx\rho^{K}$ where $\rho<1$ is the Sinkhorn contraction rate, increasing $K$ reduces the inner-solver error geometrically.
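The geometric decay of inner-solver error with $K$ can be observed on a small instance. The sketch below runs $K$ Sinkhorn iterations on a 3-by-3 problem and reports the $\ell_1$ row-marginal violation, used here as a proxy for $\epsilon_{\mathrm{inner}}$ (that proxy is an assumption of this illustration). The cost matrix, marginals, and regularization `eps=0.5` are hypothetical choices made for readability and are not the paper's configuration ($n=10$, $\varepsilon=0.05$).

```python
import math


def sinkhorn_row_error(cost, a, b, eps, num_iters):
    """Run num_iters Sinkhorn iterations and return the L1 row-marginal violation.

    The transport plan is P = diag(u) * exp(-cost/eps) * diag(v). After each
    v-update the column marginals match b exactly, so the remaining error is
    measured on the rows.
    """
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    row_sums = [u[i] * sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
    return sum(abs(row_sums[i] - a[i]) for i in range(n))


cost = [[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]]  # hypothetical costs
a = [0.5, 0.3, 0.2]                                          # hypothetical source marginal
b = [0.2, 0.3, 0.5]                                          # hypothetical target marginal
for K in (1, 5, 20):
    print(K, sinkhorn_row_error(cost, a, b, eps=0.5, num_iters=K))
```

The marginal violation shrinks as `K` grows, which is the mechanism behind the diminishing returns at large $K$ discussed in this subsection.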

Results. Table [13](https://arxiv.org/html/2605.12693#A7.T13) reports cumulative regret across inner-solver quality levels. Three findings emerge:

- Transport benefit is robust to $K$. Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) improves over Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) by 7.5–8.6% at every $K$ value tested. The benefit does not vanish at high $K$ (high inner-solver quality), confirming that transport corrections address outer-parameter *staleness*, not inner-solver error.
- Fast convergence in $K$. Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) regret stabilizes by $K=3$ ($550.2$) and shows negligible change to $K=50$ ($555.3$). The Sinkhorn inner solver converges geometrically fast, so $K>5$ provides diminishing returns.
- Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) is also $K$-insensitive at large $K$. Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd) regret stabilizes by $K=3$ ($602.2$) and is essentially flat for $K\geq 5$. At $K=1$, both methods show elevated regret ($566.1$ vs. $617.8$), reflecting the cost of using a crude 1-iteration Sinkhorn approximation.

Table 13: Sinkhorn OT: cumulative regret vs. Sinkhorn iterations $K$ at $d=20$ ($T=1{,}000$, 5 seeds). The transport benefit (Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) minus Stale [OMD](https://arxiv.org/html/2605.12693#glo.acronym.omd)) is robust to the quality of the inner solver.

Figure [2](https://arxiv.org/html/2605.12693#A7.F2) visualizes the regret and transport metrics as a function of $K$.

![Refer to caption](https://arxiv.org/html/2605.12693v1/x3.png)

![Refer to caption](https://arxiv.org/html/2605.12693v1/x4.png)

Figure 2: Inner-solver sensitivity on Sinkhorn OT ($d=20$, 5 seeds). *Left:* Cumulative regret vs. Sinkhorn iterations $K$; Adam+[IGT](https://arxiv.org/html/2605.12693#glo.acronym.igt) improvement is stable at 7.5–8.6% across all $K$. *Right:* Transport metrics $R_{3}$ and $R_{\mathrm{sq}}$ vs. $K$; the ratio $R_{\mathrm{sq}}/R_{3}\approx 20$ tracks $\sigma_{\max}$ regardless of inner-solver quality.

### G.6 Delay Pattern Variations

Three delay distributions with mean queue length $\bar{\sigma}=20$ are tested: (1) constant $d_{t}=20$, (2) uniform $d_{t}\sim\mathrm{Uniform}(0,40)$, (3) bursty (10 rounds of $d_{t}=0$ alternating with 10 rounds of $d_{t}=40$). IGT-OMD attains comparable regret across all patterns (52.4, 54.1, 56.8), validating Proposition [1](https://arxiv.org/html/2605.12693#Thmproposition1): queue length, not delay pattern, determines stability.
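That the three patterns share a mean queue length of 20 can be checked with a short simulation. This is a minimal sketch under an assumed convention: feedback generated at round `s` with delay `d` is counted as pending during rounds `s` through `s + d - 1`. Integer-valued uniform delays on {0, ..., 40}, the horizon, the warm-up length, and the random seed are illustrative choices, and the function name is hypothetical.

```python
import random


def mean_queue_length(delays, warmup):
    """Average number of pending feedbacks per round, ignoring the first warmup rounds."""
    horizon = len(delays)
    pending = [0] * horizon
    for s, d in enumerate(delays):
        # Feedback from round s is pending during rounds s, ..., s + d - 1.
        for t in range(s, min(s + d, horizon)):
            pending[t] += 1
    tail = pending[warmup:]
    return sum(tail) / len(tail)


horizon, warmup = 20040, 40
rng = random.Random(0)  # fixed seed for repeatability
constant = [20] * horizon
uniform = [rng.randint(0, 40) for _ in range(horizon)]
bursty = [0 if (t // 10) % 2 == 0 else 40 for t in range(horizon)]

for name, delays in (("constant", constant), ("uniform", uniform), ("bursty", bursty)):
    print(name, mean_queue_length(delays, warmup))
```

Under this convention the constant and bursty patterns hold exactly 20 pending feedbacks per round after warm-up, and the uniform pattern averages close to 20, since the expected queue length equals the mean delay.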

## Appendix H Broader Impacts

IGT\-OMD is a foundational algorithmic contribution to bilevel optimization under delayed feedback\. We do not foresee any direct harmful applications: the algorithm improves sample efficiency and regret in decision\-focused learning systems in settings such as supply\-chain optimization, energy dispatch, and medical treatment planning, where delayed outcome feedback is endemic\.

Positive impacts\.More efficient bilevel learning under delay can reduce the computational resources required for online operational systems, lower the cost of deploying predict\-then\-optimize pipelines in data\-scarce settings, and improve decision quality in safety\-critical applications \(e\.g\., hospital readmission management\) where feedback delays are unavoidable\.

Negative impacts\.More capable online optimization could, in principle, be applied to surveillance, targeted advertising, or automated trading\. However, the specific contribution here—correcting stale hypergradients in bilevel pipelines—is a general theoretical tool with no direct path to such applications\. Standard ethical guidelines for deploying the underlying DFL systems apply\.

Limitations and scope. The algorithm requires storing inner solutions and adjoint vectors, which may raise privacy concerns in federated settings where those quantities encode training data; this is flagged as a limitation and a direction for future work (Section [6](https://arxiv.org/html/2605.12693#S6)).

## Glossary

## Acronyms

- DDE: Delay Differential Equation
- DFL: Decision-Focused Learning
- IGT: Implicit Gradient Transport
- OCO: Online Convex Optimization
- OMD: Online Mirror Descent

## NeurIPS Paper Checklist

1. 1\.Claims
2. Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
3. Answer:\[Yes\]
4. Justification: The abstract claims (i) first sublinear regret bound for delayed bilevel optimization, (ii) transport error reduced from quadratic to linear in delay, and (iii) 15–36% reduction in decision-loss optimality gap. All three are substantiated: (i) Theorem [1](https://arxiv.org/html/2605.12693#Thmtheorem1), (ii) Theorem [2](https://arxiv.org/html/2605.12693#Thmtheorem2)(c), (iii) Table [3](https://arxiv.org/html/2605.12693#S5.T3) in the main text (IGT-OMD vs. 2-Stage/D-FTRL on Warcraft). The mechanistic fingerprint claim (0% at $d=1$ by construction, 9.5% at $d=50$, $p<0.001$) matches Table [4](https://arxiv.org/html/2605.12693#S5.T4). Assumptions are stated in Section 4.1 and Appendix [D](https://arxiv.org/html/2605.12693#A4).
6. 2\.Limitations
7. Question: Does the paper discuss the limitations of the work performed by the authors?
8. Answer:\[Yes\]
9. Justification: Section 6 (Limitations and future work) explicitly discusses three limitations: (1) strong convexity of the inner objective may not hold for neural predictors; (2) IGT+SGD amplifies degradation (an unresolved interaction not captured by current theory); (3) per-round cost $O(Kpq+\sigma_{\max}pq+q^{2}\kappa_{w})$ may be prohibitive for large $q$. All experiments use 5–10 seeds.
11. 3\.Theory assumptions and proofs
12. Question: For each theoretical result, does the paper provide the full set of assumptions and a complete \(and correct\) proof?
13. Answer:\[Yes\]
14. Justification: All seven assumptions (A1–A7) are formally stated in Appendix [D](https://arxiv.org/html/2605.12693#A4) with discussion. Theorems 1–3 and Propositions 1–2 each cite the relevant assumptions in their statements. Full proofs appear in Appendices [E.1](https://arxiv.org/html/2605.12693#A5.SS1)–[E.5](https://arxiv.org/html/2605.12693#A5.SS5); proof sketches are given in the main text for Propositions 1–2. All lemmas relied upon are numbered and cross-referenced.
16. 4\.Experimental result reproducibility
17. Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper \(regardless of whether the code and data are provided or not\)?
18. Answer:\[Yes\]
19. Justification: Algorithm 1 provides a complete pseudocode specification. Hyperparameter tables for all experiments appear in Appendix [F.2](https://arxiv.org/html/2605.12693#A6.SS2), [F.5](https://arxiv.org/html/2605.12693#A6.SS5), and [F.4](https://arxiv.org/html/2605.12693#A6.SS4). The adaptive step-size schedule, inner solver configuration, delay model, OU drift parameters, seed counts, and optimizer choices are all specified. The controlled Adam+IGT experiment (Section 5.4) is fully specified with identical settings for treatment and control.
21. 5\.Open access to data and code
22. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
23. Answer:\[No\]
24. Justification: Code and anonymized data will be released upon acceptance (Appendix [F.2](https://arxiv.org/html/2605.12693#A6.SS2)). The Warcraft dataset is publicly available from Vlastelica et al. \[[33](https://arxiv.org/html/2605.12693#bib.bib20)\]; LQR and Sinkhorn environments are fully synthetic and reproducible from the hyperparameter tables. Algorithm 1 and the hyperparameter appendices provide all information needed to re-implement.
25. Guidelines: - •The answer\[N/A\]means that paper does not include experiments requiring code\. - • - •While we encourage the release of code and data, we understand that this might not be possible, so\[No\]is an acceptable answer\. Papers cannot be rejected simply for not including code, unless this is central to the contribution \(e\.g\., for a new open\-source benchmark\)\. - •The instructions should contain the exact command and environment needed to run to reproduce the results\. See the NeurIPS code and data submission guidelines \([https://neurips\.cc/public/guides/CodeSubmissionPolicy](https://neurips.cc/public/guides/CodeSubmissionPolicy)\) for more details\. - •The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc\. - •The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines\. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why\. - •At submission time, to preserve anonymity, the authors should release anonymized versions \(if applicable\)\. - •Providing as much information as possible in supplemental material \(appended to the paper\) is recommended, but including URLs to data and code is permitted\.
26. 6\.Experimental setting/details
27. Question: Does the paper specify all the training and test details \(e\.g\., data splits, hyperparameters, how they were chosen, type of optimizer\) necessary to understand the results?
28. Answer:\[Yes\]
29. Justification: Full hyperparameter tables for each environment are in Appendix [F](https://arxiv.org/html/2605.12693#A6)\. Optimizer type, learning rates, delay values, seed counts, horizon $T$, inner solver configuration \($K$, $\eta_w$\), and OU drift parameters are all reported\. Hyperparameters were set by grid search over $\{\eta_0, K\}$ on the $d=0$ condition; the search grid is reported in Appendix [F](https://arxiv.org/html/2605.12693#A6)\.
30. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them\.
    - The full details can be provided either with the code, in appendix, or as supplemental material\.
31. 7\.Experiment statistical significance
32. Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
33. Answer:\[Yes\]
34. Justification: All tables report mean $\pm$ 1 s\.d\. over seeds \(stated in Section 5\)\. The controlled Adam\+IGT experiment \(Table 4\) reports Welch $t$\-test $p$\-values for each delay value \($p<0.001$ at $d\geq 20$\)\. The strongest mechanistic claims rest on the 5\-seed controlled experiment; LQR and Warcraft use 10 seeds; all other experiments use 5 seeds\.
35. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The authors should answer \[Yes\] if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper\.
    - The factors of variability that the error bars are capturing should be clearly stated \(for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions\)\.
    - The method for calculating the error bars should be explained \(closed form formula, call to a library function, bootstrap, etc\.\)
    - The assumptions made should be given \(e\.g\., Normally distributed errors\)\.
    - It should be clear whether the error bar is the standard deviation or the standard error of the mean\.
    - It is OK to report 1\-sigma error bars, but one should state it\. The authors should preferably report a 2\-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified\.
    - For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range \(e\.g\., negative error rates\)\.
    - If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text\.
36. 8\.Experiments compute resources
37. Question: For each experiment, does the paper provide sufficient information on the computer resources \(type of compute workers, memory, time of execution\) needed to reproduce the experiments?
38. Answer:\[Yes\]
39. Justification: Appendix [F\.1](https://arxiv.org/html/2605.12693#A6.SS1) specifies NVIDIA RTX A5500 GPUs \(24 GB VRAM\), PyTorch 2\.0, per\-experiment runtime estimates \(2–12 hours per configuration\), total reported compute \($\approx$ 200–250 GPU\-hours\), and additional exploratory compute \($\approx$ 150–200 GPU\-hours\)\.
40. Guidelines:
    - The answer \[N/A\] means that the paper does not include experiments\.
    - The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage\.
    - The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute\.
    - The paper should disclose whether the full research project required more compute than the experiments reported in the paper \(e\.g\., preliminary or failed experiments that didn’t make it into the paper\)\.
41. 9\.Code of ethics
42. Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
43. Answer:\[Yes\]
44. Justification: This paper presents a theoretical optimization algorithm evaluated on synthetic and standard benchmarks\. No human subjects, personal data, or harmful applications are involved\.
45. Guidelines:
    - The answer \[N/A\] means that the authors have not reviewed the NeurIPS Code of Ethics\.
    - If the authors answer \[No\], they should explain the special circumstances that require a deviation from the Code of Ethics\.
    - The authors should make sure to preserve anonymity \(e\.g\., if there is a special consideration due to laws or regulations in their jurisdiction\)\.
46. 10\.Broader impacts
47. Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
48. Answer:\[Yes\]
49. Justification: The Broader Impacts appendix discusses positive impacts \(improved decision quality in supply\-chain, energy, and medical settings\), potential negative impacts \(general\-purpose optimization tools can be applied to surveillance or automated trading\), and a specific privacy limitation regarding adjoint storage in federated settings\.
50. Guidelines:
    - The answer \[N/A\] means that there is no societal impact of the work performed\.
    - If the authors answer \[N/A\] or \[No\], they should explain why their work has no societal impact or why the paper does not address societal impact\.
    - Examples of negative societal impacts include potential malicious or unintended uses \(e\.g\., disinformation, generating fake profiles, surveillance\), fairness considerations \(e\.g\., deployment of technologies that could make decisions that unfairly impact specific groups\), privacy considerations, and security considerations\.
    - The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments\. However, if there is a direct path to any negative applications, the authors should point it out\. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate Deepfakes for disinformation\. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster\.
    - The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from \(intentional or unintentional\) misuse of the technology\.
    - If there are negative societal impacts, the authors could also discuss possible mitigation strategies \(e\.g\., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML\)\.
51. 11\.Safeguards
52. Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse \(e\.g\., pre\-trained language models, image generators, or scraped datasets\)?
53. Answer:\[N/A\]
54. Justification: The paper releases a bilevel optimization algorithm and synthetic benchmark environments\. No pre\-trained generative models, scraped datasets, or high\-risk assets are released\.
55. Guidelines:
    - The answer \[N/A\] means that the paper poses no such risks\.
    - Released models that have a high risk for misuse or dual\-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters\.
    - Datasets that have been scraped from the Internet could pose safety risks\. The authors should describe how they avoided releasing unsafe images\.
    - We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort\.
56. 12\.Licenses for existing assets
57. Question: Are the creators or original owners of assets \(e\.g\., code, data, models\), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
58. Answer:\[Yes\]
59. Justification: The Warcraft shortest\-path dataset \[[33](https://arxiv.org/html/2605.12693#bib.bib20)\] is credited and used under the academic use terms of that work \(Appendix [F\.5](https://arxiv.org/html/2605.12693#A6.SS5)\)\. PyTorch \[[25](https://arxiv.org/html/2605.12693#bib.bib26)\] is used under the BSD 3\-Clause license\. All other baselines and benchmarks are cited at first use\.
60. Guidelines:
    - The answer \[N/A\] means that the paper does not use existing assets\.
    - The authors should cite the original paper that produced the code package or dataset\.
    - The authors should state which version of the asset is used and, if possible, include a URL\.
    - The name of the license \(e\.g\., CC\-BY 4\.0\) should be included for each asset\.
    - For scraped data from a particular source \(e\.g\., website\), the copyright and terms of service of that source should be provided\.
    - If assets are released, the license, copyright information, and terms of use in the package should be provided\. For popular datasets, [paperswithcode\.com/datasets](https://paperswithcode.com/datasets) has curated licenses for some datasets\. Their licensing guide can help determine the license of a dataset\.
    - For existing datasets that are re\-packaged, both the original license and the license of the derived asset \(if it has changed\) should be provided\.
    - If this information is not available online, the authors are encouraged to reach out to the asset’s creators\.
61. 13\.New assets
62. Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
63. Answer:\[Yes\]
64. Justification: The new asset is the IGT\-OMD algorithm implementation\. Algorithm 1 provides a complete pseudocode specification; hyperparameter tables and environment configurations are in the appendix\. Code with documentation will be released upon acceptance\.
65. Guidelines:
    - The answer \[N/A\] means that the paper does not release new assets\.
    - Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates\. This includes details about training, license, limitations, etc\.
    - The paper should discuss whether and how consent was obtained from people whose asset is used\.
    - At submission time, remember to anonymize your assets \(if applicable\)\. You can either create an anonymized URL or include an anonymized zip file\.
66. 14\.Crowdsourcing and research with human subjects
67. Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation \(if any\)?
68. Answer:\[N/A\]
69. Justification: This paper does not involve crowdsourcing or human subjects\.
70. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper\.
    - According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector\.
71. 15\.Institutional review board \(IRB\) approvals or equivalent for research with human subjects
72. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board \(IRB\) approvals \(or an equivalent approval/review based on the requirements of your country or institution\) were obtained?
73. Answer:\[N/A\]
74. Justification: This paper does not involve human subjects research\.
75. Guidelines:
    - The answer \[N/A\] means that the paper does not involve crowdsourcing nor research with human subjects\.
    - Depending on the country in which research is conducted, IRB approval \(or equivalent\) may be required for any human subjects research\. If you obtained IRB approval, you should clearly state this in the paper\.
    - We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution\.
    - For initial submissions, do not include any information that would break anonymity \(if applicable\), such as the institution conducting the review\.
76. 16\.Declaration of LLM usage
77. Question: Does the paper describe the usage of LLMs if it is an important, original, or non\-standard component of the core methods in this research?
78. Answer:\[N/A\]
79. Justification: LLMs are not part of the core methodology\. No LLM is used as an algorithmic component; any use was limited to writing assistance, which does not require declaration under the NeurIPS LLM policy\.
80. Guidelines:
    - The answer \[N/A\] means that the core method development in this research does not involve LLMs as any important, original, or non\-standard components\.
    - Please refer to our LLM policy in the NeurIPS handbook for what should or should not be described\.
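
The statistical reporting described under item 7 above (mean $\pm$ 1 s.d. over seeds, and Welch $t$-tests between conditions) can be illustrated with a minimal sketch. This is not the authors' code: the function names `mean_sd` and `welch_t` and the per-seed numbers are hypothetical, and the use of the sample (n−1) standard deviation is an assumption, since the checklist does not say which estimator was used. The sketch computes only the Welch statistic and the Welch–Satterthwaite degrees of freedom; a $p$-value additionally requires a $t$-distribution CDF, which a library routine such as `scipy.stats.ttest_ind(a, b, equal_var=False)` would supply.

```python
import math
import statistics


def mean_sd(xs):
    """Return (mean, sample standard deviation) of per-seed results.

    Assumption: sample s.d. with n-1 in the denominator.
    """
    return statistics.mean(xs), statistics.stdev(xs)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Unlike Student's t-test, the two groups are not assumed to share
    a common variance.
    """
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


if __name__ == "__main__":
    # Hypothetical per-seed decision losses (made-up values, not from the paper).
    baseline = [1.00, 1.10, 0.90, 1.05, 0.95]
    corrected = [0.90, 0.95, 0.85, 0.92, 0.88]
    for name, xs in (("baseline", baseline), ("corrected", corrected)):
        m, s = mean_sd(xs)
        print(f"{name}: {m:.3f} +/- {s:.3f} (1 s.d., n={len(xs)})")
    t, df = welch_t(baseline, corrected)
    print(f"Welch t = {t:.3f}, df = {df:.2f}")
```

With only 5 seeds per condition, the degrees of freedom are small, so the resulting $p$-values are sensitive to per-seed variance; this is one reason to state the seed count alongside each reported test.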
