Distributionally Robust Linear Regression With Block Lewis Weights

arXiv cs.LG Papers

Summary

This paper presents an algorithm for group distributionally robust least squares regression using block Lewis weights, achieving improved complexity over interior point methods. It also provides interpolating algorithms between average and robust losses.

arXiv:2607.00252v1 Announce Type: new

# Distributionally Robust Linear Regression With Block Lewis Weights
Source: [https://arxiv.org/html/2607.00252](https://arxiv.org/html/2607.00252)
Kumar Kshitij Patel, Institute for Foundations of Data Science, Yale University. Email: kumarkshitij.patel@yale.edu. This work was partly done while the author was a Fellow at the Simons Institute for the Theory of Computing.

###### Abstract

We present an algorithm for the group distributionally robust (GDR) least squares problem. Given $m$ groups, a parameter vector in $\mathbb{R}^d$, and stacked design matrices and responses $\mathbf{A}$ and $\mathbf{b}$, our algorithm obtains a $(1+\varepsilon)$-multiplicative optimal solution using $\widetilde{O}(\min\{\mathsf{rank}(\mathbf{A}),m\}^{1/3}\varepsilon^{-2/3})$ linear-system-solves of matrices of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ for block-diagonal $\mathbf{B}$. Our technical methods follow from a recent geometric construction, block Lewis weights, that relates the empirical GDR problem to a carefully chosen least squares problem and an application of accelerated proximal methods. Our algorithm improves over known interior point methods for moderate accuracy regimes and matches the state-of-the-art guarantees for the special case of $\ell_{\infty}$ regression. We also give algorithms that smoothly interpolate between minimizing the average least squares loss and the distributionally robust loss.

###### Contents

1. [1 Introduction](https://arxiv.org/html/2607.00252#S1)
   1. [1.1 Our Results](https://arxiv.org/html/2607.00252#S1.SS1)
   2. [1.2 Prior Results, Connections, and Open Problems](https://arxiv.org/html/2607.00252#S1.SS2)
   3. [1.3 Paper Outline](https://arxiv.org/html/2607.00252#S1.SS3)
2. [2 Technical Overview](https://arxiv.org/html/2607.00252#S2)
   1. [2.1 Solving Proximal Subproblems](https://arxiv.org/html/2607.00252#S2.SS1)
      1. [2.1.1 The Robust Case ($p=\infty$).](https://arxiv.org/html/2607.00252#S2.SS1.SSS1)
      2. [2.1.2 The Interpolating Case ($2\leq p < \infty$).](https://arxiv.org/html/2607.00252#S2.SS1.SSS2)
   2. [2.2 Iterating Proximal Calls](https://arxiv.org/html/2607.00252#S2.SS2)
   3. [2.3 The Geometry of the Proximal Subproblems and Block Lewis Weights](https://arxiv.org/html/2607.00252#S2.SS3)
   4. [2.4 Algorithm for Distributionally Robust Regression](https://arxiv.org/html/2607.00252#S2.SS4)
3. [3 Block Lewis Weights and their Properties](https://arxiv.org/html/2607.00252#S3)
4. [4 Mirror Descent with Inexact Updates](https://arxiv.org/html/2607.00252#S4)
5. [5 Optimal MS Acceleration under Custom Euclidean Geometry](https://arxiv.org/html/2607.00252#S5)
6. [6 Minimizing the Distributionally Robust Loss](https://arxiv.org/html/2607.00252#S6)
   1. [6.1 Smoothly Approximating the Objective](https://arxiv.org/html/2607.00252#S6.SS1)
   2. [6.2 Calculus for LogSumExp](https://arxiv.org/html/2607.00252#S6.SS2)
   3. [6.3 Smoothness and Quasi-self-concordance of the Modified Objective](https://arxiv.org/html/2607.00252#S6.SS3)
   4. [6.4 Analysis of Algorithm 1](https://arxiv.org/html/2607.00252#S6.SS4)
7. [7 Interpolating Between Average and Robust Losses](https://arxiv.org/html/2607.00252#S7)
   1. [7.1 Calculus for the objective](https://arxiv.org/html/2607.00252#S7.SS1)
      1. [7.1.1 Strong Convexity of the Objective](https://arxiv.org/html/2607.00252#S7.SS1.SSS1)
      2. [7.1.2 Smoothness of the Objective](https://arxiv.org/html/2607.00252#S7.SS1.SSS2)
   2. [7.2 Facts about the Iterates](https://arxiv.org/html/2607.00252#S7.SS2)
   3. [7.3 Proximal Subproblems – Calculus, Algorithms, Proofs](https://arxiv.org/html/2607.00252#S7.SS3)
      1. [7.3.1 Hessian Stability](https://arxiv.org/html/2607.00252#S7.SS3.SSS1)
      2. [7.3.2 Strong Convexity of the Proximal Objective and Friends](https://arxiv.org/html/2607.00252#S7.SS3.SSS2)
      3. [7.3.3 Smoothness of the Proximal Objective](https://arxiv.org/html/2607.00252#S7.SS3.SSS3)
      4. [7.3.4 Solving the Proximal Subproblems](https://arxiv.org/html/2607.00252#S7.SS3.SSS4)
   4. [7.4 The Algorithm](https://arxiv.org/html/2607.00252#S7.SS4)
8. [8 Empirical Evaluation](https://arxiv.org/html/2607.00252#S8)
   1. [8.1 Synthetic Heterogeneous Regression Construction](https://arxiv.org/html/2607.00252#S8.SS1)
      1. [8.1.1 Computing the Robust Optimum via Convex Programming](https://arxiv.org/html/2607.00252#S8.SS1.SSS1)
      2. [8.1.2 Baselines](https://arxiv.org/html/2607.00252#S8.SS1.SSS2)
      3. [8.1.3 Hyperparameter Tuning](https://arxiv.org/html/2607.00252#S8.SS1.SSS3)
      4. [8.1.4 Empirical Behavior](https://arxiv.org/html/2607.00252#S8.SS1.SSS4)
   2. [8.2 Real-world Experiment: ACS Income](https://arxiv.org/html/2607.00252#S8.SS2)
9. [References](https://arxiv.org/html/2607.00252#bib)

## 1 Introduction

Machine learning algorithms and their training datasets have grown substantially in both size and complexity over the past decade. This increased model complexity has made it challenging to interpret and predict their behavior in unobserved scenarios. Hence, many applications that involve societal decisions still rely on simple, interpretable models like linear regression, often after feature engineering. Examples of such applications include predicting national housing prices, estimating wages across industries, forecasting loan amounts across banks, predicting life insurance premiums across groups, and projecting energy consumption across communities [cohen2024fairness].

A shared safety and sometimes legal concern across the above applications is the potential for wildly different model qualities for different distributions, i.e., outputting a notably worse model for some source data distributions [data2014seizing, barocas2016big, hardt2016equality, veale2018fairness, selbst2019fairness, berk2021fairness, corbett2023measure, chouldechova2016fair, kleinberg2018algorithmic, agarwal2019fair, cohen2024fairness, svwz24]. Specifically, consider fitting a linear model $\mathbf{x}\in\mathbb{R}^{d}$ to make real predictions on some task over $m$ groups, where group $i$'s dataset consists of $n_i$ entries and is denoted by $S_i=\{(\mathbf{a}_i^j,b_i^j)\}_{j\in[n_i]}$. The *utilitarian* or total-cost-minimizing objective minimizes the average squared prediction error across groups, i.e.,

$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\frac{1}{m}\sum_{i\in[m]}\frac{1}{n_i}\left\lVert\mathbf{A}_{S_i}\mathbf{x}-\mathbf{b}_{S_i}\right\rVert_2^2, \tag{1.1}$$

where $\mathbf{A}_{S_i} := [\mathbf{a}_i^1 \dots \mathbf{a}_i^{n_i}]^{\top}\in\mathbb{R}^{n_i\times d}$ is the feature matrix and $\mathbf{b}_{S_i} := [b_i^1 \dots b_i^{n_i}]^{\top}\in\mathbb{R}^{n_i}$ is the label vector for group $i\in[m]$.

Due to the inherent heterogeneity of the datasets, the model derived by optimizing the objective ([1.1](https://arxiv.org/html/2607.00252#S1.E1)) may be particularly detrimental to some groups, as the prediction error may be disproportionately higher for these groups. To overcome these limitations, the following *egalitarian* or group Distributionally Robust Optimization (DRO) objective has been considered in several recent works [ben2013robust, duchi2016statistics, sagawa2019distributionally, levy2020large, soma2022optimal, aakmrz22, svwz24],

$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\max_{i\in[m]}\frac{1}{n_i}\left\lVert\mathbf{A}_{S_i}\mathbf{x}-\mathbf{b}_{S_i}\right\rVert_2^2. \tag{1.2}$$

Objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) is the “fairest” objective among all objectives that balance utility and distributional robustness [kleinberg2018algorithmic, chouldechova2018frontiers, asadpour2022sequential, chen2022fair, rahmattalabi2019exploring, golrezaei2024online]. Since objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) is a convex problem, it is natural to apply standard black-box optimization techniques to solve it. However, we identify several challenges in applying existing methods:

##### Efficient first-order algorithms have geometry-dependent rates.

To our knowledge, using an efficient first-order method (such as subgradient descent) will incur a geometry-dependent runtime. In particular, if the matrices $\mathbf{A}_{S_i}$ or the stacked matrix $\mathbf{A} := [\mathbf{A}_{S_1}^{\top} \dots \mathbf{A}_{S_m}^{\top}]^{\top}$ are poorly conditioned, then this will be reflected accordingly in the convergence rates. This is a drawback of the existing results by [aakmrz22] and [svwz24].

##### Objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) is not smooth.

Since the objective is the pointwise maximum of several smooth functions, its gradient is not well-defined at the points where the maximizing group changes. Thus, applying subgradient descent to this objective without a tailored analysis will yield a rather unimpressive $1/\varepsilon^{2}$ dependence in the iteration complexity.
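To make the kink concrete, here is a minimal NumPy sketch (not the paper's code) that evaluates objectives (1.1) and (1.2) on per-group arrays and returns one subgradient of (1.2), namely the gradient of a maximizing group's loss. The function names and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def group_losses(A_groups, b_groups, x):
    """Per-group losses (1/n_i) * ||A_i x - b_i||_2^2."""
    return np.array([np.mean((A @ x - b) ** 2) for A, b in zip(A_groups, b_groups)])

def average_loss(A_groups, b_groups, x):
    """Utilitarian objective (1.1): average of the group losses."""
    return float(group_losses(A_groups, b_groups, x).mean())

def robust_loss(A_groups, b_groups, x):
    """Egalitarian objective (1.2): loss of the worst-off group."""
    return float(group_losses(A_groups, b_groups, x).max())

def robust_subgradient(A_groups, b_groups, x):
    """One subgradient of (1.2): the gradient of a maximizing group's loss."""
    worst = int(np.argmax(group_losses(A_groups, b_groups, x)))
    A, b = A_groups[worst], b_groups[worst]
    return (2.0 / A.shape[0]) * (A.T @ (A @ x - b))

# Hypothetical synthetic data: three groups of different sizes in d = 3.
rng = np.random.default_rng(0)
d, sizes = 3, [5, 8, 4]
A_groups = [rng.standard_normal((n, d)) for n in sizes]
b_groups = [rng.standard_normal(n) for n in sizes]
x = rng.standard_normal(d)

avg = average_loss(A_groups, b_groups, x)
rob = robust_loss(A_groups, b_groups, x)
g = robust_subgradient(A_groups, b_groups, x)
print(f"average loss = {avg:.4f}, worst-group loss = {rob:.4f}")
```

The subgradient changes abruptly whenever a different group becomes the maximizer, which is the non-smoothness discussed above.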

##### Min-max optimization/regret minimization approaches have a $1/\varepsilon^{2}$ dependence on iteration complexity.

Since problem ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) is a min-max optimization objective, it is also natural to try to use game theory-inspired approaches that use some oracle (such as gradients) for each group as a black box. For instance, we can cast objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) as a repeated game between a min player (equipped with a no-regret algorithm) and a max player (equipped with the best response oracle). The main shortcoming of this approach is that even though the function for each group is smooth, the iteration complexity (to get $\varepsilon$ average regret) for smooth online convex optimization still has an unimpressive $1/\varepsilon^{2}$ dependence (as opposed to $1/\varepsilon$ for smooth convex optimization) [soma2022optimal, zhang2024stochastic]. Thus, this approach is no better than optimizing ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) using subgradient descent.
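The game view rests on the standard identity that a maximum over finitely many values equals the maximum over their convex combinations. The shorthand $\ell_i$ and the simplex $\Delta_m$ below are introduced only for this illustration and are not notation from the paper.

```latex
% Per-group loss (illustrative shorthand):
%   \ell_i(x) = (1/n_i) \| A_{S_i} x - b_{S_i} \|_2^2
\[
  \min_{\mathbf{x}\in\mathbb{R}^d}\ \max_{i\in[m]}\ \ell_i(\mathbf{x})
  \;=\;
  \min_{\mathbf{x}\in\mathbb{R}^d}\ \max_{\boldsymbol{\lambda}\in\Delta_m}\
  \sum_{i=1}^{m}\lambda_i\,\ell_i(\mathbf{x}),
  \qquad
  \Delta_m = \Bigl\{\boldsymbol{\lambda}\in\mathbb{R}^m_{\geq 0} : \sum_{i=1}^{m}\lambda_i = 1\Bigr\}.
\]
% The max player's best response puts all weight on a worst-off group,
% while the min player updates x against the weighted loss.
```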

##### Interior point methods have a poor iteration complexity for large $m$.

Another natural approach (that can partially address the previous two issues), following the discussion by [boyd2004convex, Section 6.4], is to rewrite problem ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) in its epigraph form and use an interior point method (IPM) to solve the resulting problem (which, in this case, is a quadratically constrained linear program). Unfortunately, this will give an algorithm whose analysis is only known to yield an iteration complexity of $O(\sqrt{m})$, where each iteration solves a linear system in matrices of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ for a block-diagonal $\mathbf{B}$ (see [Remark 1.1](https://arxiv.org/html/2607.00252#S1.Thmtheorem1)). A naïve implementation of this algorithm will thus have a superlinear runtime in the number of groups, which is undesirable when the number of groups is large. Alternatively, consider an example in which we copy each group $k$ times in the objective. The new objective value does not change from the original objective value, but the iteration complexity from the IPM now blows up to $\sqrt{mk}$. This also signals that we should search for an algorithm whose iteration complexity is mostly independent of $m$.
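For reference, the epigraph form mentioned above can be written out as follows. This is the textbook reformulation rather than a formula quoted from the paper, and the auxiliary variable $t$ is introduced here for illustration.

```latex
\[
  \begin{aligned}
    \min_{\mathbf{x}\in\mathbb{R}^d,\ t\in\mathbb{R}} \quad & t \\
    \text{subject to} \quad &
      \frac{1}{n_i}\left\lVert \mathbf{A}_{S_i}\mathbf{x}-\mathbf{b}_{S_i}\right\rVert_2^2 \le t,
      \qquad i = 1,\dots,m .
  \end{aligned}
\]
% The objective is linear in (x, t) and there is one convex quadratic
% constraint per group. Copying every group k times leaves the feasible set
% and the optimum unchanged but multiplies the number of constraints by k,
% which is the quantity the log-barrier analysis depends on.
```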

Hence, designing an algorithm without these shortcomings requires novel ideas.

### 1.1 Our Results

In this paper, we present a new algorithm ([Algorithm 1](https://arxiv.org/html/2607.00252#alg1)) to approximately optimize objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)), which addresses the aforementioned difficulties. We state the iteration complexity of our algorithm in the following theorem.

###### Theorem 1 (Robust regression).

Let $\mathbf{A}_{S_i}\in\mathbb{R}^{n_i\times d}$ and $\mathbf{b}_{S_i}\in\mathbb{R}^{n_i}$ for all $i\in[m]$. Denote their concatenations by $\mathbf{A} := [\mathbf{A}_{S_1}^{\top} \dots \mathbf{A}_{S_m}^{\top}]^{\top}\in\mathbb{R}^{n\times d}$ and $\mathbf{b} := [\mathbf{b}_{S_1}^{\top} \dots \mathbf{b}_{S_m}^{\top}]^{\top}\in\mathbb{R}^{n}$, where $n := \sum_{i\in[m]} n_i$. Let $\varepsilon > 0$. Then [Algorithm 1](https://arxiv.org/html/2607.00252#alg1) returns $\widehat{\mathbf{x}}$ such that

$$\max_{i\in[m]}\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\widehat{\mathbf{x}}-\mathbf{b}_{S_i}\right\rVert_2 \leq (1+\varepsilon)\cdot\min_{\mathbf{x}\in\mathbb{R}^{d}}\max_{i\in[m]}\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\mathbf{x}-\mathbf{b}_{S_i}\right\rVert_2, \tag{1.3}$$

and it runs in

$$O\left(\frac{\min\left\{\mathsf{rank}(\mathbf{A}),m\right\}^{1/3}\left(\log\left(\frac{n\log m}{\varepsilon}\right)^{14/3}+\log\left(m\right)\right)}{\varepsilon^{2/3}}\right)$$

linear-system-solves in matrices of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$, where $\mathbf{B}$ is a block-diagonal matrix for which block $i$ has size $n_i\times n_i$.
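As a rough illustration of the primitive being counted, the sketch below solves a system in $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ by accumulating $\sum_i \mathbf{A}_{S_i}^{\top}\mathbf{B}_i\mathbf{A}_{S_i}$ block by block, so the $n\times n$ matrix $\mathbf{B}$ is never formed. The helper name `solve_AtBA`, the random positive definite blocks, and the assumption that the $d\times d$ system is invertible are all illustrative choices, not the paper's implementation.

```python
import numpy as np

def solve_AtBA(A_groups, B_blocks, rhs):
    """Solve (A^T B A) z = rhs for B = blockdiag(B_1, ..., B_m).

    Assumes the d x d matrix A^T B A is invertible (illustrative assumption).
    """
    d = A_groups[0].shape[1]
    M = np.zeros((d, d))
    for A_i, B_i in zip(A_groups, B_blocks):
        M += A_i.T @ B_i @ A_i  # contribution of group i
    return np.linalg.solve(M, rhs)

# Hypothetical data: three groups, each with an n_i x n_i positive definite block.
rng = np.random.default_rng(1)
d, sizes = 3, [4, 6, 5]
A_groups = [rng.standard_normal((n, d)) for n in sizes]
B_blocks = []
for n in sizes:
    G = rng.standard_normal((n, n))
    B_blocks.append(G @ G.T + np.eye(n))
rhs = rng.standard_normal(d)
z = solve_AtBA(A_groups, B_blocks, rhs)

# Dense reference: stack A and assemble B explicitly, only to check the answer.
A = np.vstack(A_groups)
B = np.zeros((A.shape[0], A.shape[0]))
start = 0
for n, B_i in zip(sizes, B_blocks):
    B[start:start + n, start:start + n] = B_i
    start += n
residual = np.linalg.norm(A.T @ B @ A @ z - rhs)
print(f"residual = {residual:.2e}")
```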

We prove [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) in [Section 6](https://arxiv.org/html/2607.00252#S6). We compare the guarantee of [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) against the other baselines in [Table 1](https://arxiv.org/html/2607.00252#S1.T1).

| Algorithm | Iteration Complexity | Each Iteration |
| --- | --- | --- |
| Subgradient descent | $\frac{\left\lVert\mathbf{x}^{\star}\right\rVert_2 \max_{1\leq i\leq m}\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\right\rVert_{\mathrm{op}}}{\varepsilon^{2}}$ | Evaluate $\nabla f(\mathbf{x})$ |
| Nesterov acceleration on smoothened objective | $\frac{\left\lVert\mathbf{x}^{\star}\right\rVert_2 \left(\max_{1\leq i\leq m}\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\right\rVert_{\mathrm{op}}\right)^{1/2}}{\varepsilon}$ | Evaluate $\nabla\widetilde{f}_{\beta,\delta}(\mathbf{x})$ |
| [aakmrz22] | $\frac{\left\lVert\mathbf{x}^{\star}\right\rVert_2 \max_{1\leq i\leq m}\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\right\rVert_{\mathrm{op}}}{\varepsilon}$ | Evaluate $\nabla\widetilde{f}_{\beta,\delta}(\mathbf{x})$ |
| Interior point with log barrier [boyd2004convex] | $m^{1/2}\log\left(\frac{1}{\varepsilon}\right)$ | Linear-system-solve in $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ |
| This paper (naïve geometry) | $\frac{m^{1/3}}{\varepsilon^{2/3}}$ | Linear-system-solve in $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ |
| $\ell_{\infty}$ regression with Lewis weights [jls21] | $\frac{\mathsf{rank}(\mathbf{A})^{1/3}}{\varepsilon^{2/3}}$ | Linear-system-solve in $\mathbf{A}^{\top}\mathbf{D}\mathbf{A}$ |
| $\ell_{\infty}$ regression with IPM [ls19] | $\mathsf{rank}(\mathbf{A})^{1/2}\log\left(\frac{1}{\varepsilon}\right)$ | Linear-system-solve in $\mathbf{A}^{\top}\mathbf{D}\mathbf{A}$ |
| This paper ([Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1)) | $\frac{\min\left\{\mathsf{rank}(\mathbf{A}),m\right\}^{1/3}}{\varepsilon^{2/3}}$ | Linear-system-solve in $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ |

Table 1: The complexities of algorithms for optimizing ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) or for the special case of $\ell_{\infty}$ regression, assuming $\mathsf{OPT}=1$ (the first three guarantees are additive approximations) and ignoring $\mathrm{polylog}(n,m)$ terms. We write $\mathbf{D}$ to be a diagonal matrix and $\mathbf{B}$ to be a block-diagonal matrix where each block has size $(n_i+o(1))\times(n_i+o(1))$. We remark that in the special case where $n_i=1$, our algorithm exactly recovers the guarantees of [jls21]. We stress that we include the references to $\ell_{\infty}$ regression only to show that our algorithm is no worse than that of [jls21] in this special case of $n_i=1$ for all $i$, and none of their algorithms apply to our general setting.

Unlike the aforementioned first-order methods, our algorithm has no geometry-dependent terms. Additionally, our algorithm improves over the standard log-barrier IPM when the desired accuracy $\varepsilon\geq m^{-1/4}$; this improvement is more pronounced when $m\gg\mathsf{rank}(\mathbf{A})$, i.e., when the number of data sources is much larger than the dimension of the parameter vector $\mathbf{x}$. Additionally, for $\varepsilon\geq\mathsf{rank}(\mathbf{A})^{-1/4}$, our guarantee matches the best known guarantee for $\ell_{\infty}$ regression [ls19, jls21].
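The accuracy thresholds quoted above can be checked with a short back-of-the-envelope comparison of the Table 1 rates, ignoring logarithmic factors. The numerical instance $m = 10^4$ is an illustrative choice, not an example from the paper.

```latex
\[
  \frac{m^{1/3}}{\varepsilon^{2/3}} \;\le\; m^{1/2}
  \iff \varepsilon^{-2/3} \le m^{1/6}
  \iff \varepsilon \ge m^{-1/4}.
\]
% Example: for m = 10^4 groups the threshold is m^{-1/4} = 0.1, so the
% m^{1/3} / eps^{2/3} rate is no larger than m^{1/2} whenever eps >= 0.1.
% Replacing m by rank(A) throughout gives the analogous threshold
% eps >= rank(A)^{-1/4} for the comparison with the ell_infty IPM rate.
```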

##### Interpolating between robust and nonrobust optimization.

We also study the following family of objectives that interpolate between objectives ([1.1](https://arxiv.org/html/2607.00252#S1.E1)) and ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) for different values of $p\geq 2$,

$$\min_{\mathbf{x}\in\mathbb{R}^{d}}\frac{1}{m}\sum_{i\in[m]}\left(\frac{1}{n_i}\left\lVert\mathbf{A}_{S_i}\mathbf{x}-\mathbf{b}_{S_i}\right\rVert_2^2\right)^{p/2}. \tag{1.4}$$

In particular, note that choosing $p=2$ in the above objective gives us the average least-squares problem in objective ([1.1](https://arxiv.org/html/2607.00252#S1.E1)), while $p\rightarrow\infty$ recovers objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)). Varying $p$ from $2$ to $\infty$ and minimizing gives solutions that interpolate between utilitarian and egalitarian approaches, allowing for a smooth trade-off between utility and robustness. To this end, we give [Algorithm 5](https://arxiv.org/html/2607.00252#alg5) to approximately optimize objective ([1.4](https://arxiv.org/html/2607.00252#S1.E4)) and prove the following guarantee about its iteration complexity.

###### Theorem 2 (Trading off utility and robustness).

Let $\mathbf{A}_{S_i}\in\mathbb{R}^{n_i\times d}$ and $\mathbf{b}_{S_i}\in\mathbb{R}^{n_i}$ for all $i\in[m]$. Denote their concatenations by $\mathbf{A} := [\mathbf{A}_{S_1}^{\top} \dots \mathbf{A}_{S_m}^{\top}]^{\top}\in\mathbb{R}^{n\times d}$ and $\mathbf{b} := [\mathbf{b}_{S_1}^{\top} \dots \mathbf{b}_{S_m}^{\top}]^{\top}\in\mathbb{R}^{n}$, where $n := \sum_{i\in[m]} n_i$. Let $p\geq 2$ and $\varepsilon > 0$. Then [Algorithm 5](https://arxiv.org/html/2607.00252#alg5) returns $\widehat{\mathbf{x}}$ such that

$$\left(\sum_{i=1}^{m}\left(\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\widehat{\mathbf{x}}-\mathbf{b}_{S_i}\right\rVert_2\right)^{p}\right)^{1/p} \leq (1+\varepsilon)\cdot\min_{\mathbf{x}\in\mathbb{R}^{d}}\left(\sum_{i=1}^{m}\left(\frac{1}{\sqrt{n_i}}\left\lVert\mathbf{A}_{S_i}\mathbf{x}-\mathbf{b}_{S_i}\right\rVert_2\right)^{p}\right)^{1/p} \tag{1.5}$$

and runs in

$$O\left(p^{O(1)}\min\left\{\mathsf{rank}(\mathbf{A}),m\right\}^{\frac{p-2}{3p-2}}\log\left(\frac{pd}{\varepsilon}\right)^{3}\right)$$

linear-system-solves in matrices of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$, where $\mathbf{B}$ is a block-diagonal matrix for which block $i$ has size $n_i\times n_i$.

We prove [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2) in [Section 7](https://arxiv.org/html/2607.00252#S7).
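As a sanity check on the exponent of $\min\{\mathsf{rank}(\mathbf{A}),m\}$ in Theorem 2, the following short calculation (an illustration, not taken from the paper) shows how it moves between the two endpoints of the family.

```latex
\[
  \left.\frac{p-2}{3p-2}\right|_{p=2} = 0,
  \qquad
  \left.\frac{p-2}{3p-2}\right|_{p=4} = \frac{1}{5},
  \qquad
  \lim_{p\to\infty}\frac{p-2}{3p-2} = \frac{1}{3},
  \qquad
  \frac{d}{dp}\,\frac{p-2}{3p-2} = \frac{4}{(3p-2)^2} > 0 .
\]
% At p = 2 (plain least squares) the dimension-dependent factor disappears,
% and as p grows the exponent increases monotonically toward 1/3, the
% exponent appearing in Theorem 1 for the robust objective.
```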

In the special case where $n_i=1$ for all $i$ (and therefore the problem is $\ell_p$ regression for $p\geq 2$), the complexity promised by [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2) is comparable to that promised by [jls21] for $\ell_p$ regression. The main difference is that our iteration complexity is unconditionally polynomial in $p$. In contrast, the comparable result from [jls21] seems to require mild assumptions on the problem parameters (see the “Discussion on numerical stability” by [jls21, Section 4]).
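The interpolation in objective (1.4) can be seen numerically with a small NumPy sketch: at $p=2$ the value equals the average group loss, and the $(2/p)$-th root of the objective approaches the worst-group loss as $p$ grows. The function names, the rescaling by the $(2/p)$-th root, and the synthetic data are assumptions made for illustration; this only evaluates the objective at a fixed point and is not the paper's Algorithm 5.

```python
import numpy as np

def group_losses(A_groups, b_groups, x):
    """Per-group losses (1/n_i) * ||A_i x - b_i||_2^2."""
    return np.array([np.mean((A @ x - b) ** 2) for A, b in zip(A_groups, b_groups)])

def interpolated_objective(losses, p):
    """Objective (1.4): (1/m) * sum_i losses_i ** (p / 2), for p >= 2."""
    return float(np.mean(losses ** (p / 2.0)))

def rescaled_value(losses, p):
    """(Objective (1.4)) ** (2 / p), which is on the scale of one group loss.

    The largest loss is factored out so that large p does not overflow.
    """
    top = losses.max()
    if top == 0.0:
        return 0.0
    return float(top * np.mean((losses / top) ** (p / 2.0)) ** (2.0 / p))

# Hypothetical synthetic data: three groups in d = 3.
rng = np.random.default_rng(2)
d, sizes = 3, [5, 8, 4]
A_groups = [rng.standard_normal((n, d)) for n in sizes]
b_groups = [rng.standard_normal(n) for n in sizes]
x = rng.standard_normal(d)

losses = group_losses(A_groups, b_groups, x)
ps = (2, 4, 16, 2000)
values = [rescaled_value(losses, p) for p in ps]
for p, v in zip(ps, values):
    print(f"p = {p:5d}: rescaled objective = {v:.4f}")
print(f"average = {losses.mean():.4f}, max = {losses.max():.4f}")
```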

### 1.2 Prior Results, Connections, and Open Problems

Here, we discuss prior work that conceptually and technically relates to ours. We then suggest natural directions for future work.

##### Multi-distribution learning.

Many learning problems involve multiple data sources, for instance, when multiple agents generate their data independently. One can formulate these multi-distribution problems as standard learning/optimization problems by considering a mixture of their distributions, as in objective ([1.1](https://arxiv.org/html/2607.00252#S1.E1)). However, this approach often biases solutions toward dominant data sources, leading to poor performance on outliers, an issue stemming from statistical heterogeneity. This limitation motivates the study of multi-objective optimization problems [miettinen1999nonlinear, ehrgott2005multicriteria], where each agent $m$ has a distribution $\mathcal{D}_m$ that defines its objective as $\mathbb{E}_{z\sim\mathcal{D}_m}[f(\mathbf{x}_m;z)]$, and where models $\mathbf{x}_m$ can vary across agents, a framework known as personalization.

One of the earliest algorithms for such problems was introduced by [blum2017collaborative], where each agent's objective must be minimized to a pre-specified threshold $\epsilon$ with high probability, framed within a PAC learning framework [valiant1984theory, vapnik2013nature]. Subsequent research has refined these algorithms, achieving optimal sample complexity guarantees for learning from multiple distributions [chen2018tight, nguyen2018improved, hanneke2019value, haghtalab2022demand, zhang2024optimal]. Our objectives ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) and ([1.4](https://arxiv.org/html/2607.00252#S1.E4)) offer different approaches to multi-distribution learning, where data distributions correspond to empirical agent distributions. In particular, [mohri2019agnostic] analyzed objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) to establish generalization bounds for unknown mixtures of agents' distributions.

Beyond sample efficiency, researchers have also examined other challenges, such as communication costs in large-scale distributed optimization [mcmahan2016s]. A particularly relevant study is that of [bullins2021stochastic], which employs an efficient distributed quadratic sub-solver [woodworth2020local, patel2024limits] to implement an inexact Newton method for optimizing quasi-self-concordant functions (see [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1)).

##### Group fairness.

Recently, interest in algorithmic fairness has intensified [barocas2016big, abebe2020roles, kasy2021fairness], with researchers exploring fairness across various domains, including supervised learning [calders2009building, dwork2012fairness, hardt2016equality, kusner2017counterfactual, goel2018non, ustun2019fairness], resource allocation [bertsimas2011price, bertsimas2012efficiency, hooker2012combining, donahue2020fairness, manshadi2021fair], scheduling [mulvany2021fair], online matching [chierichetti2019matroids, ma2023fairness], assortment planning [singh2018fairness, biega2018equity, singh2019policy, chen2022fair], and facility location [gupta2022socially]. The extensive literature on algorithmic fairness falls into three main categories: (1) individual fairness, which ensures that similar individuals receive comparable predictions [dwork2012fairness, loi2019philosophical, chen2022fair], (2) group fairness, which aims for equal treatment of different demographic groups, often in terms of resource allocation or performance parity [singh2018fairness, balseiro2021regularized], and (3) subgroup fairness, which blends aspects of both individual and group fairness [kearns2018preventing, kearns2019empirical].

This paper focuses on a well-studied group fairness notion in the machine learning literature: the group DRO problem [ben2013robust, duchi2016statistics, sagawa2019distributionally]. The idea of interpolating between robustness and utility is also common [golrezaei2024online] and closely related to multi-objective optimization, where scalarization [miettinen1999nonlinear, ehrgott2005multicriteria] helps recover desired solutions along the Pareto frontier.

##### Linear programming and $\ell_p$ regression.

In the last several years, there has been a surge of work in obtaining second-order, condition-free algorithms for linear programming and $\ell_p$ regression [bcll18, ls19, akps19, jls21]. Observe that $\ell_p$ regression is a special case of the problem we study in objective ([1.4](https://arxiv.org/html/2607.00252#S1.E4)), which is recovered when all $n_i=1$, and $\ell_{\infty}$ regression is captured by linear programming. Note that neither of these problem families is expressive enough to capture the objectives we study. In general, to achieve iteration complexities in the smaller of the two dimensions for these problems, it appears that a geometric understanding of the solution space is required; these ideas were central to the improvements obtained by [ls19, jls21] as well as our work.

##### Open problems.

Our work raises several open questions. One limitation of [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) is that its iteration complexity is not high-accuracy, meaning its dependence on $\varepsilon$ is not $\mathrm{polylog}(1/\varepsilon)$. Designing a high-accuracy solver under the same conditions as [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2) with iteration complexity $\widetilde{O}\left(\mathsf{poly}\left(\min\left\{\mathsf{rank}(\mathbf{A}),m\right\},\log\left(\frac{1}{\varepsilon}\right)\right)\right)$ remains an open problem.

A more ambitious general goal is to design algorithms for convex quadratic programs with the aforementioned iteration complexity. This would generalize analogous results for linear programming [ls19]. We view the current work as a first step towards this goal, as the objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) is a structured convex quadratic program for which we get an iteration complexity independent of $m$. It would also be interesting to consider other complexity measures beyond $\mathsf{rank}(\mathbf{A})$, for instance, assumptions about the ground-truth labeling vector $\mathbf{x}_i^{\star}$ for each group's data $S_i$.

Finally, our results suggest that optimizing for “$\ell_p$-interpolants” between non-robust and robust objectives may be computationally easier than optimizing for the robust objective alone. A more precise statistical characterization of how robustness and utility trade off as $p$ varies in collaborative, fair, or multi-distributional learning settings would be valuable. Additionally, exploring interpolations or solution concepts along the Pareto frontier of the $m$-dimensional multi-objective optimization problem or other DRO notions (e.g., Wasserstein DRO [bkm19, cpo20]) could yield further insights.

### 1.3 Paper Outline

In the remainder of this paper, we will outline the key details of our approach and provide a proof outline for our theoretical results. In [Section 2](https://arxiv.org/html/2607.00252#S2), we give proof sketches of our main results. In [Section 3](https://arxiv.org/html/2607.00252#S3), we prove some background results that appear in the main body, particularly about block Lewis weights. In [Section 4](https://arxiv.org/html/2607.00252#S4), we give an analysis of mirror descent under inexact subproblem solves; we will need this in the proof of [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2). In [Section 5](https://arxiv.org/html/2607.00252#S5), we modify an acceleration scheme due to [chjjs22], which we will use to iterate calls to the proximal subproblem solver ([2.3](https://arxiv.org/html/2607.00252#S2.E3)) for the proof of [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2). In [Section 6](https://arxiv.org/html/2607.00252#S6), we prove [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1). In [Section 7](https://arxiv.org/html/2607.00252#S7), we prove [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2). Finally, in [Section 8](https://arxiv.org/html/2607.00252#S8), we include an empirical comparison of our proposed algorithms against the aforementioned baselines.

## 2 Technical Overview

In this section, we sketch our proofs for [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) and [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2).

##### Notation\.

Here and in the rest of the paper, we ignore the dataset size normalization factors $1/\sqrt{n_i}$, as we can fold these into $\mathbf{A}_{S_i}$ and $\bm{b}_{S_i}$. Additionally, let $f(\bm{x}) \coloneqq \sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^p$ if $2 \leq p < \infty$ and let $f(\bm{x}) \coloneqq \max_{1 \leq i \leq m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2$ if $p = \infty$. Note that in the $2 \leq p < \infty$ case, we let $f(\bm{x})$ be the $p$th power of the objective written in [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2); this is to make future calculations easier and makes a difference of only polynomial factors in $p$ in the iteration complexity. Without loss of generality (by rescaling), let $\mathsf{OPT} = 1$, where $\mathsf{OPT} \coloneqq f(\bm{x}^{\star})$. So, it is enough to get an $\varepsilon$-additive optimal solution $\widehat{\bm{x}}$. Also without loss of generality, let $\mathbf{A}$ be such that $\mathsf{rank}(\mathbf{A}) = d$. For a positive semidefinite $\mathbf{M} \in \mathbb{R}^{d \times d}$, denote $\lVert \bm{x} \rVert_{\mathbf{M}} \coloneqq \sqrt{\bm{x}^{\top}\mathbf{M}\bm{x}}$.
As shorthand, for $\bm{y} \in \mathbb{R}^{n}$, we will often refer to the norm $\lVert \bm{y} \rVert_{\mathcal{G}_p} \coloneqq \left(\sum_{i=1}^{m} \lVert \bm{y}_{S_i} \rVert_2^p\right)^{1/p}$ for $p \geq 1$, where with a slight abuse of notation $\bm{y}_{S_i}$ denotes the coordinates of $\bm{y}$ indexed by $S_i$. Finally, in an abuse of notation, for symmetric matrices $\mathbf{M}$, let $\mathbf{M}^{-1}$ denote the pseudoinverse of $\mathbf{M}$.
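As a concrete reading of this notation, the following NumPy sketch evaluates the group norm and the objective $f$. The function names and the representation of the groups $S_1, \dots, S_m$ as a list of row-index arrays are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def group_norm(y, groups, p):
    """||y||_{G_p}: the l_p norm of the per-group Euclidean norms."""
    block_norms = np.array([np.linalg.norm(y[idx]) for idx in groups])
    if np.isinf(p):
        return block_norms.max()
    return (block_norms ** p).sum() ** (1.0 / p)

def objective(A, b, x, groups, p):
    """f(x): p-th power of the group norm for finite p, max block residual for p = inf."""
    value = group_norm(A @ x - b, groups, p)
    return value if np.isinf(p) else value ** p

# Hypothetical toy instance: m = 3 groups of 4 rows each, d = 2.
rng = np.random.default_rng(0)
A = rng.standard_normal((12, 2))
b = rng.standard_normal(12)
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
x = rng.standard_normal(2)

f_inf = objective(A, b, x, groups, np.inf)  # robust loss
f_2 = objective(A, b, x, groups, 2)         # equals ||A x - b||_2^2
```

For $p = 2$ the objective reduces to the ordinary squared least squares loss, which is one way to see that the family interpolates between the average and the robust objective.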

Recall that many iterative methods for convex optimization can be seen as decomposing a complex problem into a series of simpler subproblems [nw06]. Our algorithms for distributionally robust linear regression follow this pattern, where the simple subproblem resembles

$$\mathcal{O}(\bm{q}) \coloneqq \min_{\lVert \bm{x} - \bm{q} \rVert_{\mathbf{M}} \leq r_{\bm{q}}} f(\bm{x}), \tag{2.1}$$

for some positive semidefinite $\mathbf{M}$ and for some ball radius $r_{\bm{q}}$ which may depend on the query $\bm{q}$. Subroutines like ([2.1](https://arxiv.org/html/2607.00252#S2.E1)) are central to many trust-region methods [conn2000trust,nw06] and, importantly when $f$ is the sum of a linear function and a self-concordant barrier, to interior point methods derived from the self-concordant barrier framework [nn94]. (In the latter case, the matrix $\mathbf{M}$ is given by the Hessian of the barrier function evaluated at the subproblem's solution.)

With such a subproblem structure in hand, three questions arise. (1) How do we solve the subproblems efficiently? (2) How do we combine our subproblem solutions to arrive at our final answer? (3) How do we choose the “local geometry” $\mathbf{M}$ to optimize the iteration complexity we get from the previous two parts? We address these concerns in order in the following discussion.

### 2.1 Solving Proximal Subproblems

For this discussion, let $\mathbf{M}$ be any positive semidefinite matrix, as the arguments apply for any geometry $\mathbf{M}$. It will be helpful to assume that $\lVert \cdot \rVert_{\mathbf{M}}$ is a good approximation to our objective function in the sense that for some distortion $\triangle$ that is as close to $1$ as possible, we have

$$\text{for all } \bm{x} \in \mathbb{R}^{d}: \quad \lVert \bm{x} - \bm{b} \rVert_{\mathbf{M}} \leq \left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^p\right)^{\frac{1}{p}} \leq \triangle \lVert \bm{x} - \bm{b} \rVert_{\mathbf{M}}.$$

Here, we discuss how to solve problems of the form ([2.1](https://arxiv.org/html/2607.00252#S2.E1)) for a fixed query $\bm{q}$. Our strategy follows two general steps. First, we establish some form of local stability for $\nabla^2 f(\bm{x})$ within the ball we are solving in, i.e., we want $\nabla^2 f(\bm{x})$ to not change too much inside the ball $\left\{\bm{x} \in \mathbb{R}^{d} : \lVert \bm{x} - \bm{q} \rVert_{\mathbf{M}} \leq r_{\bm{q}}\right\}$. Second, we use this to demonstrate that an appropriate second-order algorithm exhibits a favorable convergence rate to an approximate solution for our subproblem. We handle the $p = \infty$ and $2 \leq p < \infty$ cases separately below.

#### 2.1.1 The Robust Case ($p = \infty$).

Unfortunately, since $f$ is not even differentiable (it is the pointwise maximum of Euclidean norms, each of which is also not differentiable), we cannot directly argue about the stability of $\nabla^2 f(\bm{x})$. We therefore first need to find some surrogate objective $\widetilde{f}$ so that:

1. The approximation error $\lVert \widetilde{f} - f \rVert_{\infty}$ is small;
2. The surrogate objective $\widetilde{f}$ is smooth in $\lVert \cdot \rVert_{\mathbf{M}}$ in such a way that we can solve the proximal subproblems fast.

To smooth $f(\bm{x})$, we use the family of objectives parameterized by $\beta, \delta$

$$\widetilde{f}_{\beta,\delta}(\bm{x}) \coloneqq \beta \log\left(\sum_{i=1}^{m} \exp\left(\frac{\sqrt{\delta^{2} + \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^2} - \delta}{\beta}\right)\right). \tag{2.2}$$

This can be seen as composing the softmax function with temperature $\beta$ with “inner functions” $\sqrt{\delta^{2} + \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^2} - \delta$. It is straightforward to show that for all $\bm{x} \in \mathbb{R}^{d}$, $\lvert \widetilde{f}_{\beta,\delta}(\bm{x}) - f(\bm{x}) \rvert \leq \beta \log m + \delta$. So, setting $\beta = \varepsilon/(4\log m)$ and $\delta = \varepsilon/4$, it is sufficient to optimize $\widetilde{f}_{\beta,\delta}$ up to $\varepsilon/2$ additive error to get an $\varepsilon$-additive suboptimal solution to our original objective. Furthermore, we prove that $\widetilde{f}_{\beta,\delta}$ is $O(1/\beta + 1/\delta)$-smooth in the norm $\lVert \mathbf{A}\bm{x} \rVert_{\mathcal{G}_{\infty}} \coloneqq \max_{1 \leq i \leq m} \lVert \mathbf{A}_{S_i}\bm{x} \rVert_2$. Thus, if $\lVert \cdot \rVert_{\mathbf{M}}$ is a good approximation to $\lVert \mathbf{A}\bm{x} \rVert_{\mathcal{G}_{\infty}}$, we will get that $\widetilde{f}_{\beta,\delta}$ is also smooth in the norm $\lVert \bm{x} \rVert_{\mathbf{M}}$.
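The approximation bound $\lvert \widetilde{f}_{\beta,\delta}(\bm{x}) - f(\bm{x}) \rvert \leq \beta \log m + \delta$ can be checked numerically. The sketch below is a sanity check on random data, not part of the paper's algorithm; the function names, the toy dimensions, and the list-of-index-arrays encoding of the groups are assumptions for illustration.

```python
import numpy as np

def robust_objective(A, b, x, groups):
    """f(x) = max_i ||A_{S_i} x - b_{S_i}||_2."""
    r = A @ x - b
    return max(np.linalg.norm(r[idx]) for idx in groups)

def smoothed_objective(A, b, x, groups, beta, delta):
    """Softmax at temperature beta of the delta-smoothed block norms, as in (2.2)."""
    r = A @ x - b
    h = np.array([np.sqrt(delta**2 + r[idx] @ r[idx]) - delta for idx in groups])
    z = h / beta
    z_max = z.max()  # shift before exponentiating for numerical stability
    return beta * (z_max + np.log(np.exp(z - z_max).sum()))

rng = np.random.default_rng(1)
m, rows_per_group, d, eps = 5, 3, 4, 0.1
A = rng.standard_normal((m * rows_per_group, d))
b = rng.standard_normal(m * rows_per_group)
groups = [np.arange(i * rows_per_group, (i + 1) * rows_per_group) for i in range(m)]
beta, delta = eps / (4 * np.log(m)), eps / 4

# Largest observed gap between the surrogate and the true objective.
max_gap = 0.0
for _ in range(100):
    x = rng.standard_normal(d)
    gap = smoothed_objective(A, b, x, groups, beta, delta) - robust_objective(A, b, x, groups)
    max_gap = max(max_gap, abs(gap))
```

With these parameter choices the bound $\beta \log m + \delta$ equals $\varepsilon/2$, so `max_gap` should never exceed `eps / 2`.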

Next, [msbacon] show that if $\widetilde{f}_{\beta,\delta}$ satisfies a higher-order smoothness condition called quasi-self-concordance with respect to the norm $\lVert \cdot \rVert_{\mathbf{M}}$, then we can get the required Hessian stability for a fixed $r_{\bm{q}} = \Theta(1/\varepsilon)$ (in particular, $r_{\bm{q}}$ does not depend on $\bm{q}$ here). To clarify, we define quasi-self-concordance as follows.

###### Definition 2.1 (Quasi-self-concordance, adapted from [ksj18, Appendix A]).

Let $f \colon \mathbb{R}^{k} \rightarrow \mathbb{R}$. We say that $f$ is $\nu$-quasi-self-concordant in the norm $\lVert \cdot \rVert$ if for all vectors $\bm{y} \in \mathbb{R}^{k}$, directions $\bm{d} \in \mathbb{R}^{k}$, and $t \in \mathbb{R}$, we have

$$\left\lvert \left(\frac{d}{dt}\right)^{3} f(\bm{y} + t\bm{d}) \right\rvert \leq \nu \lVert \bm{d} \rVert \left(\frac{d}{dt}\right)^{2} f(\bm{y} + t\bm{d}).$$

Then, [msbacon] shows how to leverage this Hessian stability to implement ([2.1](https://arxiv.org/html/2607.00252#S2.E1)) with low linear-system-solve iteration complexity. However, previously, it was only shown that the composition of the softmax function with linear functions is quasi-self-concordant. So, it was unknown whether composing softmax with other functions could also be quasi-self-concordant.

To resolve this, we prove a much more general composition result, which to the best of our knowledge was not known prior to this work and may be of independent interest. It essentially states that if we compose the softmax function with any combination of “inner” functions that are quasi-self-concordant, the resulting function is also quasi-self-concordant. The formal statement is the following lemma.

###### Lemma (Composing softmax with quasi-self-concordant functions).

Let $\lVert \cdot \rVert$ be an arbitrary norm and $h_1, \dots, h_m$ be such that $h_i \colon \mathbb{R}^{d} \rightarrow \mathbb{R}$. Let $h$ be the vector formed by concatenating the results of $h_1, \dots, h_m$, and write $\mathsf{lse}_{\beta}(h) \coloneqq \beta \log\left(\sum_{i=1}^{m} \exp(h_i/\beta)\right)$. Additionally, let $h_1, \dots, h_m$ be such that for all $1 \leq i \leq m$ and for all $\bm{y}, \bm{d} \in \mathbb{R}^{d}$ and $t \in \mathbb{R}$,

$$\left(\frac{d}{dt}\right) h_i(\bm{y} + t\bm{d}) \leq \lVert \bm{d} \rVert \quad \text{(Lipschitzness)},$$

$$\left\lvert \left(\frac{d}{dt}\right)^{3} h_i(\bm{y} + t\bm{d}) \right\rvert \leq \nu \lVert \bm{d} \rVert \left(\frac{d}{dt}\right)^{2} h_i(\bm{y} + t\bm{d}) \quad \text{(quasi-self-concordance)}.$$

Then, for all $\bm{y}, \bm{d} \in \mathbb{R}^{d}$ and all $t \in \mathbb{R}$, we have

$$\left\lvert \left(\frac{d}{dt}\right)^{3} \beta \log\left(\sum_{i=1}^{m} \exp\left(\frac{h_i(\bm{y} + t\bm{d})}{\beta}\right)\right) \right\rvert \leq \left(\frac{16}{\beta} + \nu\right) \lVert \bm{d} \rVert \left(\frac{d}{dt}\right)^{2} \mathsf{lse}_{\beta}(h(\bm{y} + t\bm{d})).$$

Hence, to show the requisite Hessian stability, we use the following steps. We show that the “inner” functions for ([2.2](https://arxiv.org/html/2607.00252#S2.E2)), $\sqrt{\delta^{2} + \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^2} - \delta$, are each $O(1/\delta)$-quasi-self-concordant in the norm $\lVert \mathbf{A}_{S_i}\bm{x} \rVert_2$. So, we can apply our composition result (the lemma above) to prove that $\widetilde{f}_{\beta,\delta}$ is $O(1/\beta + 1/\delta)$-quasi-self-concordant in the norm $\max_{i \in [m]} \lVert \mathbf{A}_{S_i}\bm{x} \rVert_2$. Again, assuming that $\lVert \cdot \rVert_{\mathbf{M}}$ is a good approximation to $\lVert \cdot \rVert_{\mathcal{G}_{\infty}}$, we will get that $\widetilde{f}_{\beta,\delta}$ is quasi-self-concordant in $\lVert \bm{x} \rVert_{\mathbf{M}}$ as well.
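To make the $O(1/\delta)$ claim concrete, here is a one-dimensional version of the calculation. This is a simplified sketch and not the paper's proof: it assumes a scalar residual $s$ in place of the block residual, and the name $g$ for the inner function is introduced here for illustration.

$$g(s) = \sqrt{\delta^{2} + s^{2}} - \delta, \qquad g'(s) = \frac{s}{\sqrt{\delta^{2} + s^{2}}}, \qquad g''(s) = \frac{\delta^{2}}{(\delta^{2} + s^{2})^{3/2}}, \qquad g'''(s) = -\frac{3\delta^{2} s}{(\delta^{2} + s^{2})^{5/2}}.$$

Dividing the third derivative by the second and using $\delta^{2} + s^{2} \geq 2\delta \lvert s \rvert$ gives

$$\frac{\lvert g'''(s) \rvert}{g''(s)} = \frac{3 \lvert s \rvert}{\delta^{2} + s^{2}} \leq \frac{3}{2\delta}, \qquad \lvert g'(s) \rvert \leq 1,$$

so in this scalar setting $g$ is $1$-Lipschitz and $\tfrac{3}{2\delta}$-quasi-self-concordant, which are the two hypotheses the composition result asks of each inner function.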

With these analytic inequalities in hand, we can finally apply the recipe given in [msbacon] and get our subproblem solver for the $p = \infty$ case.

#### 2.1.2 The Interpolating Case ($2 \leq p < \infty$).

Instead of explicitly constraining $r_{\bm{q}}$ like in the $p = \infty$ case, we regularize our movement from $\bm{q}$ in the norm $\lVert \cdot \rVert_{\mathbf{M}}$. Specifically, the subproblem we solve for any query $\bm{q}$ is

$$\underset{\bm{x} \in \mathbb{R}^{d}}{\mathrm{argmin}}\ f(\bm{x}) + e p^{p} \lVert \bm{x} - \bm{q} \rVert_{\mathbf{M}}^{p}. \tag{2.3}$$

This is the natural generalization of the proximal problem that [jls21] use to get their results for $\ell_p$ regression, and the outline of our solver for these subproblems is similar to what [jls21] use for this special case (see their Section 4).

However, we go a step further and show how to obtain approximate stationary points to ([2.3](https://arxiv.org/html/2607.00252#S2.E3)) instead of just getting a small objective value. This is because the acceleration scheme we use to iterate subproblem solutions to get our final answer $\widehat{\bm{x}}$ requires us to obtain an approximate stationary point for ([2.3](https://arxiv.org/html/2607.00252#S2.E3)). The main new technical tool we develop for this purpose is a form of strong convexity for functions of the form $\lVert \bm{y} \rVert_2^p$ for $\bm{y} \in \mathbb{R}^{k}$ for any $k \geq 1$, stated in the following lemma.

###### Lemma (Strong convexity of $\lVert \bm{y} \rVert_2^p$).

Let $\bm{v} \in \mathbb{R}^{k}$ for $k \geq 1$. For any $\triangle \in \mathbb{R}^{k}$, we have

$$\lVert \bm{v} + \triangle \rVert_2^p \geq \lVert \bm{v} \rVert_2^p + p \lVert \bm{v} \rVert_2^{p-2} \langle \bm{v}, \triangle \rangle + \frac{4}{2^{p}} \lVert \triangle \rVert_2^p.$$
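The strong convexity inequality is easy to probe numerically for $p \geq 2$. The following NumPy sketch is a sanity check on random inputs and not a proof; the function name `strong_convexity_gap`, the sampling ranges, and the variable `step` (standing in for the perturbation vector) are assumptions chosen for illustration.

```python
import numpy as np

def strong_convexity_gap(v, step, p):
    """Left-hand side minus right-hand side of the lemma; expected to be >= 0 for p >= 2."""
    lhs = np.linalg.norm(v + step) ** p
    rhs = (np.linalg.norm(v) ** p
           + p * np.linalg.norm(v) ** (p - 2) * (v @ step)
           + 4.0 / 2 ** p * np.linalg.norm(step) ** p)
    return lhs - rhs

rng = np.random.default_rng(2)
worst_gap = np.inf
for _ in range(2000):
    k = int(rng.integers(1, 6))       # dimension between 1 and 5
    p = float(rng.uniform(2.0, 8.0))  # exponent in [2, 8)
    v = rng.standard_normal(k)
    step = rng.standard_normal(k)
    # Normalize so that floating-point rounding is judged on a relative scale.
    scale = max(np.linalg.norm(v + step) ** p, 1.0)
    worst_gap = min(worst_gap, strong_convexity_gap(v, step, p) / scale)
```

At $p = 2$ the inequality holds with equality, since $\lVert \bm{v} + \triangle \rVert_2^2 = \lVert \bm{v} \rVert_2^2 + 2\langle \bm{v}, \triangle \rangle + \lVert \triangle \rVert_2^2$, so any check of this kind should allow a small tolerance for rounding.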
With this strong convexity lemma, we can argue about the strong convexity of $\lVert \bm{x} - \bm{q} \rVert_{\mathbf{M}}^{p}$, which means that we can convert an approximately optimal solution to ([2.3](https://arxiv.org/html/2607.00252#S2.E3)) in function value to one that is approximately optimal in parameter space as well. We combine this with a local gradient Lipschitzness property of the objective ([2.3](https://arxiv.org/html/2607.00252#S2.E3)) to get our approximate stationary point, which is enough for our purposes. The local gradient Lipschitzness property itself follows from a form of Hessian stability that we show for the objective ([2.3](https://arxiv.org/html/2607.00252#S2.E3)). See [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8).

Finally, to obtain an approximately optimal solution to ([2.3](https://arxiv.org/html/2607.00252#S2.E3)) in function value, we again apply the Hessian stability property to conclude that ([2.3](https://arxiv.org/html/2607.00252#S2.E3)) is relatively smooth and relatively strongly convex in a simpler reference function. We show how to solve optimization problems in this reference function up to an approximate optimality that is sufficient for the rest of our applications; this requires a mild modification of the standard mirror descent analysis, and we do this in [Section 4](https://arxiv.org/html/2607.00252#S4). Combining all of these building blocks gives us our subproblem solver for the $2 \leq p < \infty$ case.

### 2.2 Iterating Proximal Calls

We now discuss the second item. Recall that we think of $\mathcal{O}(\bm{q})$ as answering a proximal problem for the query $\bm{q}$. It is not hard to show that under reasonable conditions on $f$ and on the structure of the subproblems, we can iterate calls to $\mathcal{O}(\bm{q})$ to optimize $f$ (see, e.g., [msbacon, Appendix A]). This conceptually simple approach will already give us guarantees of the form $\lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}} / \varepsilon$ for the problems we study.

But we can do better. An acceleration framework originally due to [ms13] and generalized/refined in subsequent works [bjlls19,msbacon,chjjs22] gives a recipe to iterate calls of $\mathcal{O}(\bm{q})$ to optimize the original function $f$. From these, the iteration complexity we need for an $\varepsilon$-additive solution with an initialization $\bm{x}_0$ and optimum $\bm{x}^{\star}$ is roughly $\left(\lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}} / \varepsilon\right)^{2/3}$ (see [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3) for a more formal statement). This cosmetically resembles the rate we get in [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1). To get something that looks like our rate for [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2), we use our new strong convexity lemma (stated in [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2)). With this, we can demonstrate that after a sufficient number of iterations, we have $\lVert \bm{x}_t - \bm{x}^{\star} \rVert_{\mathbf{M}} \leq 0.5 \lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}}$. Therefore, repeating this argument yields a high-accuracy solution, as required.

Interestingly, our algorithm for the $2 \leq p < \infty$ case employs a form of the accelerated scheme developed in [chjjs22], which does not require solving an implicit equation for the query point, thereby improving upon the results from [jls21] for $\ell_p$ regression. It would be practically relevant to obtain this for the $p = \infty$ case (in [Section 5](https://arxiv.org/html/2607.00252#S5), we discuss a technical challenge in obtaining this).

### 2.3 The Geometry of the Proximal Subproblems and Block Lewis Weights

At this point, we have the tools we need to get rates of the form $O\left(\left(\lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}} / \varepsilon\right)^{2/3}\right)$ for the robust objective ([Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1)) and of the form $O\left(\lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}}^{(p-2)/(3p-2)}\right)$ for the interpolating objective ([Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2)). From this, we see that the rates depend on the geometry $\mathbf{M}$ that we impose on our problem. Our goal in this section is to choose this geometry $\mathbf{M}$.

Observe that when we solve ([2.1](https://arxiv.org/html/2607.00252#S2.E1)), we are solving an optimization problem over the sublevel sets $\left\{\bm{x} : \lVert \bm{x} \rVert_{\mathbf{M}} \leq r_{\bm{q}}\right\}$; these are ellipsoids. Now, consider choosing the $\ell_2$ geometry that best approximates our loss function. Specifically, recall that earlier in the section, we stated that for some distortion $\triangle \geq 1$ that is as close to $1$ as possible, we want

$$\text{for all } \bm{x} \in \mathbb{R}^{d}: \quad \lVert \bm{x} - \bm{b} \rVert_{\mathbf{M}} \leq \left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^p\right)^{\frac{1}{p}} \leq \triangle \lVert \bm{x} - \bm{b} \rVert_{\mathbf{M}}.$$

To see what kinds of distortion guarantees we can hope for, let us see what happens when we choose the most “obvious” geometry. By relating $\ell_2^m$ to $\ell_p^m$, we get

$$m^{-\left(\frac{1}{2} - \frac{1}{p}\right)} \left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^2\right)^{\frac{1}{2}} \leq \left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^p\right)^{\frac{1}{p}} \leq \left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^2\right)^{\frac{1}{2}},$$

and notice that $\left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - \bm{b}_{S_i} \rVert_2^2\right)^{1/2} = \lVert \mathbf{A}\bm{x} - \bm{b} \rVert_2$. The two sides differ by a factor of $m^{1/2 - 1/p}$, so taking $\mathbf{M} = \mathbf{A}^{\top}\mathbf{A}$, up to the scalar normalization $m^{-(1 - 2/p)}$ (this is what we call the naïve geometry in [Table 1](https://arxiv.org/html/2607.00252#S1.T1)), achieves distortion $m^{1/2 - 1/p}$. This gives us our basic rate of $m^{1/3}\varepsilon^{-2/3}$ in the setting of [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) and $m^{(p-2)/(3p-2)}$ in the setting of [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2).
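The comparison between $\ell_2^m$ and $\ell_p^m$ behind the naïve geometry can be checked directly: for $p \geq 2$ and any $\bm{z} \in \mathbb{R}^{m}$, $\lVert \bm{z} \rVert_p \leq \lVert \bm{z} \rVert_2 \leq m^{1/2 - 1/p} \lVert \bm{z} \rVert_p$, so the two aggregations differ by at most a factor $m^{1/2 - 1/p}$. In the sketch below, the vector `z` is a stand-in for the $m$ block residual norms, and the values of `m` and `p` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, p = 50, 6.0

min_ratio, max_ratio = np.inf, 0.0
for _ in range(1000):
    z = np.abs(rng.standard_normal(m))  # stand-in for the m block residual norms
    ratio = np.linalg.norm(z, 2) / np.linalg.norm(z, p)
    min_ratio = min(min_ratio, ratio)
    max_ratio = max(max_ratio, ratio)

distortion = m ** (0.5 - 1.0 / p)

# The all-ones vector (every group equally bad) attains the worst case exactly.
flat = np.ones(m)
flat_ratio = np.linalg.norm(flat, 2) / np.linalg.norm(flat, p)
```

The extreme cases are informative: a vector supported on a single group has ratio $1$, while a vector with all groups equally large has ratio exactly $m^{1/2 - 1/p}$.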

However, we can improve on the naïve geometry. Note that our loss function is a norm on $\mathbb{R}^{d}$; in particular, we can check that for $\bm{y} \in \mathbb{R}^{n}$, the functions $\lVert \bm{y} \rVert_{\mathcal{G}_p} = \left(\sum_{i=1}^{m} \lVert \bm{y}_{S_i} \rVert_2^p\right)^{1/p}$ for $1 \leq p \leq \infty$ are norms. Now, recall John's theorem, a fundamental result in high-dimensional convex geometry.

###### Theorem 2.2 (John's theorem, [john1948]).

For any symmetric convex body $K \subset \mathbb{R}^{d}$, let $\mathcal{E}(K)$ denote the ellipsoid of maximum volume contained within $K$. Then, we have

$$\mathcal{E}(K) \subseteq K \subseteq \sqrt{d} \cdot \mathcal{E}(K).$$

Moreover, the $\sqrt{d}$ is worst-case optimal (e.g., let $K$ be the unit $\ell_{\infty}$ ball).

It is easy to see that sublevel sets of norms, i.e., sets of the form $\left\{\bm{x} \in \mathbb{R}^{d} : \lVert \bm{x} \rVert \leq 1\right\}$, are symmetric convex bodies. Hence, using John's theorem, we see that for our normed losses, there exists $\mathbf{M}$ that achieves distortion $\triangle \leq \sqrt{d}$. From this, it is easy to see that there exists $\mathbf{M}$ for which we can guarantee $\lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}} \lesssim \sqrt{d}$. Plugging this into the guarantees from the previous subsections, we get that if we choose the $\mathbf{M}$ from John's theorem, and then switch based on whether $m \leq d$, we get exactly the rates quoted in [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) and [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2).
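As a worked instance of this substitution for the robust case, the following sketch drops constants and logarithmic factors and uses the normalizations $\mathsf{rank}(\mathbf{A}) = d$ and $\mathsf{OPT} = 1$ from the notation paragraph:

$$\left(\frac{\lVert \bm{x}_0 - \bm{x}^{\star} \rVert_{\mathbf{M}}}{\varepsilon}\right)^{2/3} \lesssim \left(\frac{\sqrt{d}}{\varepsilon}\right)^{2/3} = d^{1/3}\varepsilon^{-2/3} \quad \text{(John-type geometry)}, \qquad \left(\frac{\sqrt{m}}{\varepsilon}\right)^{2/3} = m^{1/3}\varepsilon^{-2/3} \quad \text{(naïve geometry)}.$$

Choosing whichever geometry gives the smaller bound yields $\min\{d, m\}^{1/3}\varepsilon^{-2/3}$, which has the same shape as the $\widetilde{O}(\min\{\mathsf{rank}(\mathbf{A}), m\}^{1/3}\varepsilon^{-2/3})$ bound stated in the abstract.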

However, as written, this is only an existence result. To make this useful for us and actually find $\mathbf{M}$, we need an algorithm to calculate John's ellipsoid for the level sets of our losses (or some other ellipsoid that gets an even better approximation factor). To this end, a result of [mo23] gives us an efficient algorithm to find this $\ell_2$ geometry for the loss families we consider.

###### Theorem 2.3 (Combining Lemmas 5.6, 5.8, Equation (1.8) from [mo23]).

Let $p \geq 2$. There exists an algorithm that finds a positive diagonal matrix $\mathbf{W} \in \mathbb{R}^{n \times n}$ such that for all $\bm{x} \in \mathbb{R}^{d}$ and all $c \in \mathbb{R}$, we have

$$\frac{\lVert \mathbf{W}^{\frac{1}{2} - \frac{1}{p}}(\mathbf{A}\bm{x} - c\bm{b}) \rVert_2}{\left(2(\mathsf{rank}(\mathbf{A}) + 1)\right)^{\frac{1}{2} - \frac{1}{p}}} \leq \left(\sum_{i=1}^{m} \lVert \mathbf{A}_{S_i}\bm{x} - c\bm{b}_{S_i} \rVert_2^p\right)^{\frac{1}{p}} \leq \lVert \mathbf{W}^{\frac{1}{2} - \frac{1}{p}}(\mathbf{A}\bm{x} - c\bm{b}) \rVert_2.$$

The algorithm runs in $O(\log m)$ linear-system-solves in matrices of the form $\mathbf{A}^{\top}\mathbf{D}\mathbf{A}$ for positive diagonal matrices $\mathbf{D}$.

The diagonal entries of the matrix $\mathbf{W}$ are called block Lewis weights. This is a generalization of Lewis weights, and both objects have been used previously for various matrix approximation problems [blm89,mmwy21,jls22,jlls23,mo23]. Furthermore, Lewis weights are central to improvements in the iteration complexities for linear programming and vanilla $\ell_p$ regression [ls19,jls21]. We go into greater detail about block Lewis weights in [Section 3](https://arxiv.org/html/2607.00252#S3).

Additionally, notice that the distortion of $O(\mathsf{rank}(\mathbf{A})^{1/2 - 1/p})$ guaranteed by [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3) is optimal. To see this, let $\mathbf{A} \in \mathbb{R}^{n \times d}$ be such that for $i \in [d]$, row $\bm{a}_i = \bm{e}_i$, where $\bm{e}_i$ is the $i$th standard basis vector. Then, for all $d + 1 \leq i \leq n$, let $\bm{a}_i = 0$. In words, $\mathbf{A}$ is the $d$-dimensional identity matrix atop a large matrix of all $0$s. It is easy to see that for any $p$, we have $\lVert \mathbf{A}\bm{x} \rVert_p = \lVert \bm{x} \rVert_p$, and the best distortion we can get for relating $\lVert \bm{x} \rVert_p$ to any $d$-dimensional $\ell_2$ norm is $d^{\lvert 1/2 - 1/p \rvert}$.

With [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3) and its near optimality in hand, we choose $\mathbf{M} = \mathbf{A}^{\top}\mathbf{W}^{1 - \frac{2}{p}}\mathbf{A}$ if $\mathsf{rank}(\mathbf{A}) \leq m$ and $\mathbf{M} = \mathbf{A}^{\top}\mathbf{A}$ if $\mathsf{rank}(\mathbf{A}) \geq m$ (recall that in the latter case, we get a $\sqrt{m}$ distortion for free from relating $\ell_2^m$ to $\ell_{\infty}^m$). Combining this with the results from the previous two subsections gives us [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) and [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2).

### 2.4 Algorithm for Distributionally Robust Regression

In [Algorithm 1](https://arxiv.org/html/2607.00252#alg1), we present pseudocode for the algorithm that yields the guarantee in [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1). We compare the empirical performance of this algorithm against other baselines mentioned in [Section 1](https://arxiv.org/html/2607.00252#S1) and examine the effect of Lewis weights in [Section 8](https://arxiv.org/html/2607.00252#S8).

Algorithm 1MinMaxRegression: optimizes \([1\.2](https://arxiv.org/html/2607.00252#S1.E2)\) to\(1\+ε\)\(1\+\\varepsilon\)\-multiplicative error1:Regression problems\(𝐀S1,𝒃S1\),…,\(𝐀Sm,𝒃Sm\)\(\\mathbf\{A\}\_\{S\_\{1\}\},\\bm\{b\}\_\{S\_\{1\}\}\),\\dots,\(\\mathbf\{A\}\_\{S\_\{m\}\},\\bm\{b\}\_\{S\_\{m\}\}\), accuracyε\>0\\varepsilon\>0

2: Using [mo23, Algorithm 2] with input $[\mathbf{A}|\bm{b}]$, find nonnegative diagonal $\mathbf{W}$ and weights $w_1,\dots,w_m$ such that for all $j\in S_i$, $\mathbf{W}[j][j]=w_i$ and for all $\bm{x}\in\mathbb{R}^d$ and $c\in\mathbb{R}$, $$\lVert\mathbf{A}\bm{x}-c\bm{b}\rVert_{\mathcal{G}_\infty}\leq\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}-c\mathbf{W}^{1/2}\bm{b}\rVert_2\leq\sqrt{2(\mathsf{rank}(\mathbf{A})+1)}\,\lVert\mathbf{A}\bm{x}-c\bm{b}\rVert_{\mathcal{G}_\infty}.$$

3: if $\sum_{i=1}^m w_i\geq m$ then $\triangleright$ $\mathsf{rank}(\mathbf{A})+1\leq\sum_{i=1}^m w_i\leq 2(\mathsf{rank}(\mathbf{A})+1)$

4: Reset $\mathbf{W}=\mathbf{I}_n$.

5: Let $\bm{x}_0=\left(\mathbf{A}^\top\mathbf{W}\mathbf{A}\right)^{-1}\mathbf{A}^\top\mathbf{W}\bm{b}$. $\triangleright$ $\bm{x}_0\coloneqq\underset{\bm{x}\in\mathbb{R}^d}{\mathrm{argmin}}\ \lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}-\mathbf{W}^{1/2}\bm{b}\rVert_2$.

6: Let $$\widetilde{f}_{\beta,\delta}(\bm{x})\coloneqq\beta\log\left(\sum_{i=1}^m\exp\left(\frac{\sqrt{\delta^2+\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\rVert_2^2}-\delta}{\beta}\right)\right)$$ where $\beta=\frac{\varepsilon}{4\log m}$ and $\delta=\frac{\varepsilon}{4}$. $\triangleright$ A family of smoothenings of the objective.

7: Let $\widehat{f}(\bm{x})\coloneqq\widetilde{f}_{\varepsilon/(4\log m),\,\varepsilon/4}(\bm{x})+\frac{\varepsilon}{1000\min\{\mathsf{rank}(\mathbf{A}),m\}}\lVert\mathbf{W}^{1/2}\mathbf{A}(\bm{x}-\bm{x}_0)\rVert_2^2$.

8: Using [msbacon, Algorithm 3], implement a $\left(\frac{C}{\min\{\mathsf{rank}(\mathbf{A}),m\}},\frac{C}{\varepsilon}\right)$-ball optimization oracle for $\widehat{f}$, where $C$ is a universal constant. $\triangleright$ Iteration complexity guaranteed by [Lemma 6.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4)

9: Using [msbacon, Algorithm 2], implement a $\frac{1}{2}$-MS oracle for $\widehat{f}$.

10: Run [msbacon, Algorithm 1] for $\widetilde{O}\left(\frac{\min\{\mathsf{rank}(\mathbf{A}),m\}^{1/3}\log\left(\frac{d}{\varepsilon}\right)}{\varepsilon^{2/3}}\right)$ iterations using the MS oracle from the previous line and with initial point $\bm{x}_0$ and final point $\widehat{\bm{x}}$.

11: return $\widehat{\bm{x}}$
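As an illustration of step 6, the following is a minimal NumPy sketch of the smoothed objective $\widetilde{f}_{\beta,\delta}$, compared against the unsmoothed group norm $\max_i\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\rVert_2$. The function names, the random data, and the equal-size contiguous groups are assumptions made only for this example. By the standard log-sum-exp bounds, the two values should differ by at most $\varepsilon/4$.

```python
import numpy as np

def group_max_loss(A, b, groups, x):
    """Unsmoothed objective: max_i ||A_{S_i} x - b_{S_i}||_2."""
    r = A @ x - b
    return max(np.linalg.norm(r[S]) for S in groups)

def smoothed_loss(A, b, groups, x, eps):
    """f~_{beta,delta}(x) with beta = eps / (4 log m) and delta = eps / 4."""
    m = len(groups)
    beta, delta = eps / (4 * np.log(m)), eps / 4
    r = A @ x - b
    # Smooth each group norm: sqrt(delta^2 + ||r_S||^2) - delta.
    z = np.array([np.sqrt(delta**2 + r[S] @ r[S]) - delta for S in groups])
    # Numerically stable beta * log(sum_i exp(z_i / beta)).
    zmax = z.max()
    return zmax + beta * np.log(np.exp((z - zmax) / beta).sum())

# Hypothetical toy instance: n rows split into m contiguous groups.
rng = np.random.default_rng(0)
n, d, m, eps = 12, 3, 4, 0.1
A, b, x = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=d)
groups = np.array_split(np.arange(n), m)
gap = abs(smoothed_loss(A, b, groups, x, eps) - group_max_loss(A, b, groups, x))
```

Subtracting `zmax` before exponentiating avoids overflow when $\beta$ is small, which is the regime of interest as $\varepsilon\to 0$.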

## 3 Block Lewis Weights and their Properties

In this section, we introduce block Lewis weights and explore some of their properties. Several of these statements can be found in [jlls23, mo23], but we include definitions and proofs here for completeness.

We first need to define leverage scores.

###### Definition 3.1 (Leverage scores).

For a matrix $\mathbf{A}\in\mathbb{R}^{n\times d}$ with rows $\bm{a}_1,\dots,\bm{a}_n$, let $\tau_j$ denote the $j$th leverage score of $\mathbf{A}$, which we define to be

$$\tau_j(\mathbf{A})\coloneqq\max_{\bm{x}\in\mathbb{R}^d\setminus\{0\}}\frac{\langle\bm{a}_j,\bm{x}\rangle^2}{\lVert\mathbf{A}\bm{x}\rVert_2^2}=\bm{a}_j^\top\left(\mathbf{A}^\top\mathbf{A}\right)^{-1}\bm{a}_j.$$
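The two expressions in Definition 3.1 can be compared numerically. The sketch below assumes a random full-column-rank matrix (so that $\mathbf{A}^\top\mathbf{A}$ is invertible). The helper name `leverage_scores` is hypothetical. The maximizer used for the variational form is $\bm{x}=(\mathbf{A}^\top\mathbf{A})^{-1}\bm{a}_j$.

```python
import numpy as np

def leverage_scores(A):
    """tau_j(A) = a_j^T (A^T A)^{-1} a_j for a full-column-rank A."""
    G_inv = np.linalg.inv(A.T @ A)
    # Row-wise quadratic form: sum_{k,l} A[j,k] * G_inv[k,l] * A[j,l].
    return np.einsum("ij,jk,ik->i", A, G_inv, A)

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
tau = leverage_scores(A)

# Variational form: <a_j, x>^2 / ||A x||_2^2 evaluated at x = (A^T A)^{-1} a_j.
j = 2
x_star = np.linalg.solve(A.T @ A, A[j])
ratio = (A[j] @ x_star) ** 2 / np.linalg.norm(A @ x_star) ** 2
```

For a full-column-rank matrix the leverage scores lie in $[0,1]$ and sum to $d$, which gives a quick sanity check on any implementation.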

We now introduce the main object of interest in this section, [Definition 3.2](https://arxiv.org/html/2607.00252#S3.Thmtheorem2). Our version of the definition is adapted from [mo23, Definition 1.2] (there, we set $p_1=\dots=p_m=2$, let their $\mathbf{W}=\mathbf{I}$, replace $\bm{\lambda}$ with $\bm{w}/\lVert\bm{w}\rVert_1$, and rescale $F^\star$ appropriately).

###### Definition 3.2 (Adapted from [mo23, Definition 1.2]).

Let $\bm{w}\in\mathbb{R}^m_{\geq 0}$ and let $\mathbf{W}\in\mathbb{R}^{n\times n}_{\geq 0}$ be a diagonal matrix such that for all $j\in S_i$, we have $\mathbf{W}_{jj}=w_i$. Let $p>0$. We say that $\bm{w}$ is a block Lewis overestimate if for all $i\in[m]$, we have

$$\frac{\sum_{j\in S_i}\tau_j\left(\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\right)}{w_i}\leq 1.$$
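The condition in Definition 3.2 is straightforward to test for a candidate $\bm{w}$. The sketch below is illustrative only: the helper names are hypothetical, and the uniform weights $w_i=c$ with $c=\max_i\sum_{j\in S_i}\tau_j(\mathbf{A})$ are a deliberately crude choice rather than the output of [mo23, Algorithm 2]. They qualify because rescaling a matrix by a constant leaves its leverage scores unchanged, but their sum $mc$ can be far larger than $\mathsf{rank}(\mathbf{A})$.

```python
import numpy as np

def leverage_scores(A):
    return np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)

def is_block_lewis_overestimate(A, groups, w, p, tol=1e-9):
    """Check sum_{j in S_i} tau_j(W^{1/2 - 1/p} A) <= w_i for every group i."""
    diag = np.empty(A.shape[0])
    for S, wi in zip(groups, w):
        diag[S] = wi  # W_jj = w_i for all j in S_i
    tau = leverage_scores(diag[:, None] ** (0.5 - 1.0 / p) * A)
    return all(tau[S].sum() <= wi + tol for S, wi in zip(groups, w))

rng = np.random.default_rng(1)
A = rng.normal(size=(12, 3))
groups = np.array_split(np.arange(12), 4)

# Uniform weights: W is a multiple of the identity, so tau_j is that of A.
c = max(leverage_scores(A)[S].sum() for S in groups)
w_uniform = np.full(4, c)
```

Halving the uniform weights breaks the condition for the group that attains the maximum, which the check detects.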

The main reason that [Definition 3.2](https://arxiv.org/html/2607.00252#S3.Thmtheorem2) is interesting is that it gives us a formula with which we can relate the level sets of the group norm $\lVert\cdot\rVert_{\mathcal{G}_p}$ to $\ell_2$. See [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3).

###### Theorem 3.3 (Block Lewis weights give us ellipsoidal approximations to $\lVert\cdot\rVert_{\mathcal{G}_p}$).

Let $p\geq 2$. If $\bm{w}$ is a block Lewis overestimate, then for all $\bm{x}\in\mathbb{R}^d$, we have

$$\frac{\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2}{\lVert\bm{w}\rVert_1^{\frac{1}{2}-\frac{1}{p}}}\leq\lVert\mathbf{A}\bm{x}\rVert_{\mathcal{G}_p}\leq\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2.$$

We prove [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3) later in [Section 3](https://arxiv.org/html/2607.00252#S3). An analogous statement can also be shown for $p\leq 2$, but since we do not use it in this paper, we do not write it here.
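The two-sided bound of Theorem 3.3 can be spot-checked on random vectors. This sketch assumes the group norm $\lVert\bm{y}\rVert_{\mathcal{G}_p}=\left(\sum_i\lVert\bm{y}_{S_i}\rVert_2^p\right)^{1/p}$ and, for simplicity, uses the crude uniform overestimate $w_i=\max_k\sum_{j\in S_k}\tau_j(\mathbf{A})$ rather than weights from [mo23, Algorithm 2]. All names and the toy data are hypothetical.

```python
import numpy as np

def leverage_scores(A):
    return np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)

def group_norm(y, groups, p):
    """||y||_{G_p} = (sum_i ||y_{S_i}||_2^p)^{1/p}; p = inf gives the max."""
    norms = np.array([np.linalg.norm(y[S]) for S in groups])
    return norms.max() if np.isinf(p) else (norms ** p).sum() ** (1.0 / p)

rng = np.random.default_rng(2)
n, d, m, p = 12, 3, 4, 4.0
A = rng.normal(size=(n, d))
groups = np.array_split(np.arange(n), m)

# Uniform block Lewis overestimate, so W^{1/2 - 1/p} = scale * I.
c = max(leverage_scores(A)[S].sum() for S in groups)
w = np.full(m, c)
scale = c ** (0.5 - 1.0 / p)

holds = True
for _ in range(200):
    x = rng.normal(size=d)
    mid = group_norm(A @ x, groups, p)
    upper = scale * np.linalg.norm(A @ x)           # ||W^{1/2-1/p} A x||_2
    lower = upper / w.sum() ** (0.5 - 1.0 / p)      # divided by ||w||_1^{1/2-1/p}
    holds &= bool(lower <= mid + 1e-9 and mid <= upper + 1e-9)
```

With weights of smaller total mass, the lower bound tightens, which is why the size of $\lVert\bm{w}\rVert_1$ governs the distortion.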

Observe that if we can get $\bm{w}$ that satisfies [Definition 3.2](https://arxiv.org/html/2607.00252#S3.Thmtheorem2) and for which $\lVert\bm{w}\rVert_1=\mathsf{rank}(\mathbf{A})$, then [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3) gives us the optimal relationship between $\ell_2$ and $\lVert\cdot\rVert_{\mathcal{G}_p}$ whenever $\mathsf{rank}(\mathbf{A})\leq m$. Furthermore, for intuition, suppose $p=\infty$. By John's theorem, we know that for any symmetric convex body, there exists an ellipsoid that approximates the convex body up to a $\sqrt{d}$ distortion. Moreover, this is worst-case tight (e.g., the best distortion we can get when we approximate $\ell_1^d$ with $\ell_2$ is $\sqrt{d}$). Thus, assuming we can find $\bm{w}$ with $\lVert\bm{w}\rVert_1\approx\mathsf{rank}(\mathbf{A})$, in this case, we get a guarantee that is similar to what John's theorem tells us.

Now, assuming we can find a low-distortion ellipsoidal approximation to the level sets of our loss, we get that the "effective" diameter of our problem is $\sim\sqrt{d}$. Combining this and the discussion in [Section 2.3](https://arxiv.org/html/2607.00252#S2.SS3) (or, more formally, [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3)), we can see why we should expect an iteration complexity of $\sim d^{1/3}$ (or better, if we can find a better ellipsoid).

What is left is whether weights $\bm{w}$ satisfying [Definition 3.2](https://arxiv.org/html/2607.00252#S3.Thmtheorem2) with small sum can be found. To this end, we invoke [mo23, Algorithm 2].

###### Theorem 3.4 ([mo23, Algorithm 2 and Lemma 5.6]).

There exists an algorithm that returns a block Lewis overestimate $\bm{w}$ for which $\lVert\bm{w}\rVert_1\leq 2\,\mathsf{rank}(\mathbf{A})$. The algorithm runs in $O(\log m)$ linear system solves with matrices of the form $\mathbf{A}^\top\mathbf{D}\mathbf{A}$ for nonnegative diagonal $\mathbf{D}$.

Thus, by applying [Theorem 3.4](https://arxiv.org/html/2607.00252#S3.Thmtheorem4) as a preprocessing step, we get an $\ell_2$ geometry under which we can run the accelerated proximal algorithms. As an example of the power of this, observe the following.

###### Lemma 3\.5\.

Consider the matrix $\widehat{\mathbf{A}}\coloneqq[\mathbf{A}|\bm{b}]\in\mathbb{R}^{n\times(d+1)}$ that is formed by appending the column vector $\bm{b}$ to the right of the matrix $\mathbf{A}$. If we have a vector $\bm{w}$ of block Lewis overestimates for the matrix $\widehat{\mathbf{A}}$, then there exists an algorithm that finds an initialization $\bm{x}_0$ for which

$$\begin{aligned}\lVert\bm{x}_0-\bm{x}^\star\rVert_{\mathbf{A}^\top\mathbf{W}^{1-\frac{2}{p}}\mathbf{A}}&\leq 2\left(2\,\mathsf{rank}(\mathbf{A})\right)^{\frac{1}{2}-\frac{1}{p}}\lVert\mathbf{A}\bm{x}^\star-\bm{b}\rVert_{\mathcal{G}_p}\\ \lVert\mathbf{A}\bm{x}_0-\bm{b}\rVert_{\mathcal{G}_p}&\leq\left(2\,\mathsf{rank}(\mathbf{A})\right)^{\frac{1}{2}-\frac{1}{p}}\lVert\mathbf{A}\bm{x}^\star-\bm{b}\rVert_{\mathcal{G}_p}.\end{aligned}$$

The algorithm runs in $1$ linear system solve in $\widehat{\mathbf{A}}^\top\mathbf{D}\widehat{\mathbf{A}}$.

###### Proof of [Lemma 3.5](https://arxiv.org/html/2607.00252#S3.Thmtheorem5).

By [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3), our weights $\bm{w}$ are such that for all $\bm{x}\in\mathbb{R}^d$ and reals $c\in\mathbb{R}$,

$$\frac{\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}-c\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\rVert_2}{(2(d+1))^{\frac{1}{2}-\frac{1}{p}}}\leq\lVert\mathbf{A}\bm{x}-c\bm{b}\rVert_{\mathcal{G}_p}\leq\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}-c\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\rVert_2.$$

Let $\bm{x}_0$ be the solution to the least squares regression problem

$$\bm{x}_0\coloneqq\underset{\bm{x}\in\mathbb{R}^d}{\mathrm{argmin}}\ \lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\rVert_2=\left(\mathbf{A}^\top\mathbf{W}^{1-\frac{2}{p}}\mathbf{A}\right)^{-1}\mathbf{A}^\top\mathbf{W}^{1-\frac{2}{p}}\bm{b}.$$

It is easy to see that computing $\bm{x}_0$ amounts to $1$ linear system solve in $\mathbf{A}^\top\mathbf{D}\mathbf{A}$.

Next, let $\mathbf{M}\coloneqq\mathbf{A}^\top\mathbf{W}^{1-\frac{2}{p}}\mathbf{A}$ and observe that

$$\begin{aligned}\lVert\bm{x}_0-\bm{x}^\star\rVert_{\mathbf{M}}&=\lVert(\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}_0-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b})-(\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}^\star-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b})\rVert_2\\ &\leq 2\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}^\star-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\rVert_2\leq 2\left(2d\right)^{\frac{1}{2}-\frac{1}{p}}\lVert\mathbf{A}\bm{x}^\star-\bm{b}\rVert_{\mathcal{G}_p},\end{aligned}$$

where the first inequality uses the triangle inequality together with the optimality of $\bm{x}_0$ for the weighted least squares problem. Finally, write

$$\begin{aligned}\lVert\mathbf{A}\bm{x}_0-\bm{b}\rVert_{\mathcal{G}_p}&\leq\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}_0-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\rVert_2\\ &\leq\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}^\star-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\rVert_2\leq\left(2d\right)^{\frac{1}{2}-\frac{1}{p}}\lVert\mathbf{A}\bm{x}^\star-\bm{b}\rVert_{\mathcal{G}_p},\end{aligned}$$

giving us the conclusion of [Lemma 3.5](https://arxiv.org/html/2607.00252#S3.Thmtheorem5). ∎
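The initialization in the proof is a weighted least squares solve with diagonal $\mathbf{D}=\mathbf{W}^{1-\frac{2}{p}}$. A minimal sketch follows. The block weights are random placeholders, not block Lewis overestimates, since only the closed form for $\bm{x}_0$ is being illustrated, and the function name is hypothetical.

```python
import numpy as np

def weighted_lsq_init(A, b, w_rows, p):
    """x0 = argmin_x ||W^{1/2-1/p} (A x - b)||_2, via one solve in A^T D A."""
    D = w_rows ** (1.0 - 2.0 / p)          # diagonal of D = W^{1 - 2/p}
    return np.linalg.solve(A.T @ (D[:, None] * A), A.T @ (D * b))

rng = np.random.default_rng(3)
n, d, m = 12, 3, 4
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=m)          # hypothetical block weights w_i
w_rows = np.repeat(w, n // m)              # W_jj = w_i for contiguous groups S_i
p = np.inf
x0 = weighted_lsq_init(A, b, w_rows, p)

# First-order optimality: A^T W^{1-2/p} (A x0 - b) should vanish.
D = w_rows ** (1.0 - 2.0 / p)
residual_grad = A.T @ (D * (A @ x0 - b))
```

For $p=\infty$ this reduces to $\bm{x}_0=(\mathbf{A}^\top\mathbf{W}\mathbf{A})^{-1}\mathbf{A}^\top\mathbf{W}\bm{b}$, the form used in step 5 of the main algorithm.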

###### Proof of [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3).

Let $\bm{\lambda}\coloneqq\bm{w}/\lVert\bm{w}\rVert_1$ and $\mathbf{\Lambda}\coloneqq\mathbf{W}/\lVert\bm{w}\rVert_1$. It is easy to check that $\bm{\lambda}$ is a probability measure on $[m]$. When $p\geq 2$, using monotonicity of $L_p$ norms taken under probability measures, we get

$$\left(\sum_{i=1}^m\lVert\mathbf{A}_{S_i}\bm{x}\rVert_2^p\right)^{\frac{1}{p}}=\left(\sum_{i=1}^m\lambda_i\lVert\lambda_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^p\right)^{\frac{1}{p}}\geq\left(\sum_{i=1}^m\lambda_i\lVert\lambda_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^2\right)^{1/2}.$$

Expanding the RHS and substituting $\lambda_i=w_i/\lVert\bm{w}\rVert_1$ gives

$$\lVert\mathbf{A}\bm{x}\rVert_{\mathcal{G}_p}\geq\frac{\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2}{\lVert\bm{w}\rVert_1^{\frac{1}{2}-\frac{1}{p}}}.$$

For the "hard" direction, we will use [Definition 3.2](https://arxiv.org/html/2607.00252#S3.Thmtheorem2) in a nontrivial way. Notice that

$$\begin{aligned}\left(\sum_{i=1}^m w_i\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^p\right)^{\frac{1}{p}}&=\left(\sum_{i=1}^m w_i\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^2\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^{p-2}\right)^{\frac{1}{p}}\\ &\leq\left(\sum_{i=1}^m w_i\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^2\cdot\max_{\bm{y}\in\mathbb{R}^d\setminus\{0\}}\frac{\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{y}\rVert_2^{p-2}}{\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{y}\rVert_2^{p-2}}\cdot\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2^{p-2}\right)^{\frac{1}{p}}\\ &=\left(\sum_{i=1}^m w_i\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^2\cdot\left(\max_{\bm{y}\in\mathbb{R}^d\setminus\{0\}}\frac{\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{y}\rVert_2^2}{\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{y}\rVert_2^2}\right)^{\frac{p}{2}-1}\cdot\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2^{p-2}\right)^{\frac{1}{p}}\\ &\leq\left(\sum_{i=1}^m w_i\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^2\cdot\left(\frac{\sum_{j\in S_i}\tau_j\left(\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\right)}{w_i}\right)^{\frac{p}{2}-1}\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2^{p-2}\right)^{\frac{1}{p}}\\ &\leq\left(\sum_{i=1}^m w_i\lVert w_i^{-\frac{1}{p}}\mathbf{A}_{S_i}\bm{x}\rVert_2^2\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2^{p-2}\right)^{\frac{1}{p}}=\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}\rVert_2,\end{aligned}$$

where the last inequality uses [Definition 3.2](https://arxiv.org/html/2607.00252#S3.Thmtheorem2). The left-hand side equals $\lVert\mathbf{A}\bm{x}\rVert_{\mathcal{G}_p}$, so combining our upper and lower bounds gives the conclusion of [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3). ∎

## 4 Mirror Descent with Inexact Updates

##### Notation warning\.

This section is meant to be a self\-contained, standalone analysis of mirror descent under inexact updates\. The notation is chosen to be consistent with most material we could find on mirror descent and therefore conflicts with the notation used in the rest of the paper\.

In this section, we give an analysis of unconstrained mirror descent when each Bregman proximal problem is solved only approximately ([Algorithm 2](https://arxiv.org/html/2607.00252#alg2)). Although we expect that this is a standard fact about mirror descent, we could not find an appropriate reference, so we provide the analysis here.

Algorithm 2 ApproximateMirrorDescent: Implements mirror descent to optimize convex and differentiable $f$ given $L$-relative smoothness and $\mu$-relative strong convexity in the reference $h$ when we may not be able to solve each proximal problem exactly.

1: Initial point $\bm{x}_0$, iteration count $T$.

2: Define $$\begin{aligned}D_h(\bm{x},\bm{y})&\coloneqq h(\bm{x})-h(\bm{y})-\langle\nabla h(\bm{y}),\bm{x}-\bm{y}\rangle\\ \bm{x}^\star&\coloneqq\underset{\bm{x}\in\mathbb{R}^d}{\mathrm{argmin}}\ f(\bm{x}).\end{aligned}$$

3: for $i=1,\dots,T$ do

4: $\bm{x}^\star_i=\underset{\widetilde{\bm{x}}\in\mathbb{R}^d}{\mathrm{argmin}}\ f(\bm{x}_{i-1})+\langle\nabla f(\bm{x}_{i-1}),\widetilde{\bm{x}}-\bm{x}_{i-1}\rangle+LD_h(\widetilde{\bm{x}},\bm{x}_{i-1})$ $\triangleright$ We may only be able to approximate $\bm{x}^\star_i$; see the next line.

5: Let $\bm{x}_i$ be an approximate stationary point for the above objective.

return $\underset{0\leq i\leq T}{\mathrm{argmin}}\ f(\bm{x}_i)$

In [Algorithm 2](https://arxiv.org/html/2607.00252#alg2), we assume that the function $f$ is $\mu$-relatively strongly convex and $L$-relatively smooth with respect to a reference function $h$. This means that for all $\bm{x},\bm{y}\in\mathbb{R}^d$, we have

$$\mu D_h(\bm{x},\bm{y})\leq f(\bm{x})-f(\bm{y})-\langle\nabla f(\bm{y}),\bm{x}-\bm{y}\rangle\leq LD_h(\bm{x},\bm{y}).$$

Using [lfn18, Proposition 1.1], when $f$ is twice-differentiable, this condition is equivalent to asking that for all $\bm{x}\in\mathbb{R}^d$,

$$\mu\nabla^2h(\bm{x})\preceq\nabla^2f(\bm{x})\preceq L\nabla^2h(\bm{x}).$$

We are now ready to state the performance guarantee of [Algorithm 2](https://arxiv.org/html/2607.00252#alg2). See [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1).

###### Theorem 4\.1\.

Let $j$ be the index output by [Algorithm 2](https://arxiv.org/html/2607.00252#alg2). Let $\triangle_i$ be defined such that

$$\triangle_i\coloneqq\nabla f(\bm{x}_{i-1})+L\left(\nabla h(\bm{x}_i)-\nabla h(\bm{x}_{i-1})\right).$$

Then, we have

$$f(\bm{x}_j)-f(\bm{x}^\star)\leq L\left(1-\frac{\mu}{L}\right)^TD_h(\bm{x}^\star,\bm{x}_0)+\max_{1\leq i\leq T}\langle\triangle_i,\bm{x}_i-\bm{x}^\star\rangle.$$
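To make the statement concrete, here is a minimal sketch of Algorithm 2 in the simplest setting, which is an assumption of this example and not the setting used later in the paper: a strongly convex quadratic $f$ and the Euclidean reference $h(\bm{x})=\frac{1}{2}\lVert\bm{x}\rVert_2^2$, so that $D_h(\bm{x},\bm{y})=\frac{1}{2}\lVert\bm{x}-\bm{y}\rVert_2^2$ and the exact proximal step is a gradient step of size $1/L$. Inexactness is simulated by adding small random noise to each step. The script then evaluates both sides of the bound in Theorem 4.1.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 5, 30
B = rng.normal(size=(d, d))
H = B.T @ B + np.eye(d)                # f(x) = 0.5 x^T H x - g^T x
g = rng.normal(size=d)
f = lambda x: 0.5 * x @ H @ x - g @ x
grad_f = lambda x: H @ x - g
x_opt = np.linalg.solve(H, g)

# With h(x) = 0.5 ||x||^2, relative constants are the extreme eigenvalues of H.
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]

xs = [np.zeros(d)]
errors = []                            # <triangle_i, x_i - x*> for each i
for i in range(1, T + 1):
    exact = xs[-1] - grad_f(xs[-1]) / L            # exact minimizer x_i^*
    x_new = exact + 1e-4 * rng.normal(size=d)      # hypothetical inexact solve
    tri = grad_f(xs[-1]) + L * (x_new - xs[-1])    # triangle_i (grad h(x) = x)
    errors.append(tri @ (x_new - x_opt))
    xs.append(x_new)

best = min(f(x) for x in xs)           # f at the returned index j
lhs = best - f(x_opt)
rhs = L * (1 - mu / L) ** T * 0.5 * np.linalg.norm(x_opt - xs[0]) ** 2 + max(errors)
```

In this Euclidean case $\triangle_i$ is exactly $L$ times the error of the approximate solve, which makes the role of the additive term in the bound easy to see.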

To prove [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1), we begin with a few standard facts about the mirror descent iterations.

###### Lemma 4.2.

Let $\bm{y}\in\mathbb{R}^d$ be arbitrary. We have

$$\langle\nabla f(\bm{x}_{i-1}),\bm{x}_i-\bm{y}\rangle=L\left(D_h(\bm{y},\bm{x}_{i-1})-D_h(\bm{y},\bm{x}_i)-D_h(\bm{x}_i,\bm{x}_{i-1})\right)+\langle\triangle_i,\bm{x}_i-\bm{y}\rangle.$$

###### Proof of [Lemma 4.2](https://arxiv.org/html/2607.00252#S4.Thmtheorem2).

By the three point identity (see, e.g., [syls16, Equation (A.9)]), we have

$$\begin{aligned}D_h(\bm{y},\bm{x}_{i-1})-D_h(\bm{y},\bm{x}_i)-D_h(\bm{x}_i,\bm{x}_{i-1})&=-\langle\nabla h(\bm{x}_i)-\nabla h(\bm{x}_{i-1}),\bm{x}_i-\bm{y}\rangle\\ &=\frac{1}{L}\langle\nabla f(\bm{x}_{i-1})-\triangle_i,\bm{x}_i-\bm{y}\rangle,\end{aligned}$$

completing the proof of [Lemma 4.2](https://arxiv.org/html/2607.00252#S4.Thmtheorem2). ∎

###### Lemma 4.3 (Mirror descent lemma under approximate stationary point updates).

Let $\bm{y}\in\mathbb{R}^d$ be arbitrary. For every iteration $i$, we have

$$f(\bm{x}_i)-f(\bm{y})\leq(L-\mu)D_h(\bm{y},\bm{x}_{i-1})-LD_h(\bm{y},\bm{x}_i)+\langle\triangle_i,\bm{x}_i-\bm{y}\rangle.$$

###### Proof of [Lemma 4.3](https://arxiv.org/html/2607.00252#S4.Thmtheorem3).

The definition of $\mu$-relative strong convexity tells us that

$$f(\bm{x}_{i-1})-f(\bm{y})\leq\langle\nabla f(\bm{x}_{i-1}),\bm{x}_{i-1}-\bm{y}\rangle-\mu D_h(\bm{y},\bm{x}_{i-1}).$$

We now write

$$\begin{aligned}f(\bm{x}_i)-f(\bm{y})&\leq f(\bm{x}_{i-1})-f(\bm{y})+\langle\nabla f(\bm{x}_{i-1}),\bm{x}_i-\bm{x}_{i-1}\rangle+LD_h(\bm{x}_i,\bm{x}_{i-1})&&\text{($L$-RS)}\\ &\leq\langle\nabla f(\bm{x}_{i-1}),\bm{x}_i-\bm{y}\rangle-\mu D_h(\bm{y},\bm{x}_{i-1})+LD_h(\bm{x}_i,\bm{x}_{i-1})&&\text{($\mu$-RSC)}\\ &\leq(L-\mu)D_h(\bm{y},\bm{x}_{i-1})-LD_h(\bm{y},\bm{x}_i)+\langle\triangle_i,\bm{x}_i-\bm{y}\rangle,&&\text{(Lemma 4.2)}\end{aligned}$$

completing the proof of [Lemma 4.3](https://arxiv.org/html/2607.00252#S4.Thmtheorem3). ∎

We now have the tools to complete the proof of [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1).

###### Proof of [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1).

Let $E_i\coloneqq f(\bm{x}_i)-f(\bm{x}^\star)-\langle\triangle_i,\bm{x}_i-\bm{x}^\star\rangle$. Substituting $\bm{y}=\bm{x}^\star$ and rearranging the conclusion of [Lemma 4.3](https://arxiv.org/html/2607.00252#S4.Thmtheorem3) gives

$$E_i\leq(L-\mu)D_h(\bm{x}^\star,\bm{x}_{i-1})-LD_h(\bm{x}^\star,\bm{x}_i).$$

We multiply both sides by $\left(\frac{L}{L-\mu}\right)^i$ and write

$$\left(\frac{L}{L-\mu}\right)^iE_i\leq\frac{L^i}{(L-\mu)^{i-1}}D_h(\bm{x}^\star,\bm{x}_{i-1})-\frac{L^{i+1}}{(L-\mu)^i}D_h(\bm{x}^\star,\bm{x}_i).$$

The right-hand side telescopes, so adding over all $T$ iterations yields

$$\sum_{i=1}^T\left(\frac{L}{L-\mu}\right)^iE_i\leq LD_h(\bm{x}^\star,\bm{x}_0)-\left(\frac{L}{L-\mu}\right)^TLD_h(\bm{x}^\star,\bm{x}_T)\leq LD_h(\bm{x}^\star,\bm{x}_0).$$

Expanding out the definition of $E_i$ and rearranging gives

$$\sum_{i=1}^T\left(\frac{L}{L-\mu}\right)^i(f(\bm{x}_i)-f(\bm{x}^\star))\leq LD_h(\bm{x}^\star,\bm{x}_0)+\sum_{i=1}^T\left(\frac{L}{L-\mu}\right)^i\langle\triangle_i,\bm{x}_i-\bm{x}^\star\rangle.$$

By the geometric series summation formula, we have

$$C_T\coloneqq\sum_{i=1}^T\left(\frac{L}{L-\mu}\right)^i=\frac{L}{\mu}\left(\left(1+\frac{\mu}{L-\mu}\right)^T-1\right).$$

Let $j$ be the index that [Algorithm 2](https://arxiv.org/html/2607.00252#alg2) returns. Since $f(\bm{x}_j)\leq f(\bm{x}_i)$ for every $i$, it is easy to check that

$$\sum_{i=1}^T\left(\frac{L}{L-\mu}\right)^i(f(\bm{x}_i)-f(\bm{x}^\star))\geq C_T\left(f(\bm{x}_j)-f(\bm{x}^\star)\right)$$

and

$$\sum_{i=1}^T\left(\frac{L}{L-\mu}\right)^i\langle\triangle_i,\bm{x}_i-\bm{x}^\star\rangle\leq C_T\max_{1\leq i\leq T}\langle\triangle_i,\bm{x}_i-\bm{x}^\star\rangle.$$

This gives us

$$f(\bm{x}_j)-f(\bm{x}^\star)\leq\frac{L}{C_T}D_h(\bm{x}^\star,\bm{x}_0)+\max_{1\leq i\leq T}\langle\triangle_i,\bm{x}_i-\bm{x}^\star\rangle.$$

Finally, notice that

$$\frac{L}{C_T}=\frac{\mu}{\left(1+\frac{\mu}{L-\mu}\right)^T-1}\leq L\left(1-\frac{\mu}{L}\right)^T.$$

Combining everything completes the proof of [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1). ∎
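The closed form for $C_T$ and the final inequality $L/C_T\leq L(1-\mu/L)^T$ can be checked directly for sample values. The values of $L$, $\mu$, and $T$ below are arbitrary choices for illustration.

```python
import math

def C_T(L, mu, T):
    """Closed form of sum_{i=1}^T (L / (L - mu))^i."""
    return (L / mu) * ((1 + mu / (L - mu)) ** T - 1)

L, mu, T = 10.0, 1.0, 25
direct_sum = sum((L / (L - mu)) ** i for i in range(1, T + 1))
rate = L / C_T(L, mu, T)            # coefficient on D_h(x*, x_0)
bound = L * (1 - mu / L) ** T       # simplified upper bound from the proof
```

Writing $r=1-\mu/L$, the coefficient equals $\mu r^T/(1-r^T)$, and the inequality reduces to $r^T\leq r$, which holds for every $T\geq 1$.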

Finally, we add another useful lemma that quantifies the descent, if any, in the objective value between iterations\.

###### Lemma 4\.4\.

For every iteration $i$, we have

$$f(\bm{x}_i)-f(\bm{x}_{i-1})\leq-LD_h(\bm{x}_{i-1},\bm{x}_i)+\langle\triangle_i,\bm{x}_i-\bm{x}_{i-1}\rangle.$$

In particular, if $\langle\triangle_i,\bm{x}_i-\bm{x}_{i-1}\rangle\leq LD_h(\bm{x}_{i-1},\bm{x}_i)$, then iteration $i$ is a descent step.

###### Proof of [Lemma 4.4](https://arxiv.org/html/2607.00252#S4.Thmtheorem4).

We substitute $\bm{y}=\bm{x}_{i-1}$ in the conclusion of [Lemma 4.3](https://arxiv.org/html/2607.00252#S4.Thmtheorem3). Since $D_h(\bm{x}_{i-1},\bm{x}_{i-1})=0$, this gives

$$f(\bm{x}_i)-f(\bm{x}_{i-1})\leq-LD_h(\bm{x}_{i-1},\bm{x}_i)+\langle\triangle_i,\bm{x}_i-\bm{x}_{i-1}\rangle,$$

completing the proof of [Lemma 4.4](https://arxiv.org/html/2607.00252#S4.Thmtheorem4). ∎

## 5 Optimal MS Acceleration under Custom Euclidean Geometry

In this section, we adapt the bisection-free Monteiro-Svaiter acceleration framework developed in [chjjs22] to handle custom Euclidean geometries. The object of interest here is [Algorithm 3](https://arxiv.org/html/2607.00252#alg3), which we will call with different choices of the oracle $\mathcal{O}_{\mathsf{MS}}$ for our algorithms.

**Algorithm 3** OptimalMSAcceleration: optimizes function $f$ given MS oracle $\mathcal{O}_{\mathsf{MS}}$.

1: Initial $\bm{x}_0$, function $f$, oracle $\mathcal{O}_{\mathsf{MS}}$, initial $\lambda_0'$, multiplicative adjustment factor $\alpha > 1$, iteration count $T$

2: Set $\bm{v}_0 = \bm{x}_0$, $A_0 = 0$, $A_0' = 0$.

3: Set $\widetilde{\bm{x}}_1, \lambda_1 = \mathcal{O}_{\mathsf{MS}}(\bm{x}_0; \lambda_0')$ and $\lambda_1' = \lambda_1$.

4: **for** $t = 0, \dots, T$ **do**

5: $a_{t+1}' = \frac{1}{2\lambda_{t+1}'}\left(1 + \sqrt{1 + 4\lambda_{t+1}' A_t}\right)$

6: $A_{t+1}' = A_t + a_{t+1}'$

7: $\bm{q}_t = \frac{A_t}{A_{t+1}'}\bm{x}_t + \frac{a_{t+1}'}{A_{t+1}'}\bm{v}_t$

8: **if** $t > 0$ **then** $\widetilde{\bm{x}}_{t+1}, \lambda_{t+1} = \mathcal{O}_{\mathsf{MS}}(\bm{q}_t; \lambda_{t+1}')$

9: $\gamma_{t+1} = \min\left\{1, \frac{\lambda_{t+1}'}{\lambda_{t+1}}\right\}$

10: $a_{t+1} = \gamma_{t+1} a_{t+1}'$ and $A_{t+1} = A_t + a_{t+1}$ $\quad\triangleright$ $A_{t+1} = A_{t+1}' - (1 - \gamma_{t+1}) a_{t+1}'$

11: $\bm{x}_{t+1} = \frac{(1 - \gamma_{t+1}) A_t}{A_{t+1}}\bm{x}_t + \frac{\gamma_{t+1} A_{t+1}'}{A_{t+1}}\widetilde{\bm{x}}_{t+1}$

12: **if** $\gamma_{t+1} = 1$ **then**

13: $\lambda_{t+2}' = \frac{1}{\alpha}\lambda_{t+1}'$

14: **else**

15: $\lambda_{t+2}' = \alpha\lambda_{t+1}'$

16: $\bm{v}_{t+1} = \bm{v}_t - a_{t+1}\mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1})$
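The recursion in Algorithm 3 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `optimal_ms_acceleration`, the quadratic test objective, the matrices `H`, `b`, `M`, and the toy oracle `exact_prox_oracle` (an exact proximal step at a fixed, assumed regularization level `LAM_FIXED`) are all hypothetical. The sketch runs `T` loop iterations and returns $\bm{x}_T$ together with $A_T$; in the damped branch it updates $\lambda_{t+2}'$, which is the reading of step 15 that keeps the next guess defined.

```python
import numpy as np


def optimal_ms_acceleration(x0, grad_f, oracle, M, lam0, alpha, T):
    """Run T iterations of the Algorithm 3 recursion; return (x_T, A_T)."""
    x = np.asarray(x0, dtype=float)
    v = x.copy()
    A = 0.0
    # Step 3: first oracle call at x_0, then set lambda'_1 = lambda_1.
    x_tilde, lam = oracle(x, lam0)
    lam_guess = lam
    for t in range(T):
        a_guess = (1.0 + np.sqrt(1.0 + 4.0 * lam_guess * A)) / (2.0 * lam_guess)
        A_guess = A + a_guess
        q = (A / A_guess) * x + (a_guess / A_guess) * v
        if t > 0:
            x_tilde, lam = oracle(q, lam_guess)
        gamma = min(1.0, lam_guess / lam)
        a = gamma * a_guess
        A_next = A + a
        x = ((1.0 - gamma) * A / A_next) * x + (gamma * A_guess / A_next) * x_tilde
        # Shrink the guess after a full step (gamma = 1), grow it otherwise.
        lam_guess = lam_guess / alpha if gamma == 1.0 else alpha * lam_guess
        v = v - a * np.linalg.solve(M, grad_f(x_tilde))
        A = A_next
    return x, A


# Hypothetical test problem: f(x) = 0.5 x^T H x - b^T x with geometry M.
H = np.array([[3.0, 1.0], [1.0, 4.0]])
b = np.array([1.0, -2.0])
M = np.diag([1.0, 2.0])
LAM_FIXED = 1.0  # assumed regularization level used by the toy oracle


def f(x):
    return 0.5 * x @ H @ x - b @ x


def grad_f(x):
    return H @ x - b


def exact_prox_oracle(q, lam_guess):
    """Exact proximal step in the M-norm; ignores lam_guess (so sigma = 0)."""
    x = np.linalg.solve(H + LAM_FIXED * M, b + LAM_FIXED * (M @ q))
    return x, LAM_FIXED


x_star = np.linalg.solve(H, b)
x0 = np.zeros(2)
x_T, A_T = optimal_ms_acceleration(x0, grad_f, exact_prox_oracle, M,
                                   lam0=1.0, alpha=np.exp(2.0), T=50)
gap = f(x_T) - f(x_star)
D0 = 0.5 * (x0 - x_star) @ M @ (x0 - x_star)
print(gap, D0 / A_T)  # the gap should not exceed D0 / A_T
```

Because the toy oracle always reports the same $\lambda$, the guess $\lambda'$ moves up and down around it, so both branches of the adjustment rule are exercised. The printed pair compares the optimality gap with $D_0 / A_T$, the bound stated in Lemma 5.4.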

In order to state the performance guarantee of [Algorithm 3](https://arxiv.org/html/2607.00252#alg3), we require the notions of an *MS oracle* and a *movement bound*. See [Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1) and [Definition 5.2](https://arxiv.org/html/2607.00252#S5.Thmtheorem2).

###### Definition 5.1 (MS oracle, generalization of [chjjs22, Definition 1]).

Let $\mathbf{M} \in \mathbb{R}^{d \times d}$ be a positive semidefinite matrix. An oracle $\mathcal{O}\colon \mathbb{R}^d \times \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}^d \times \mathbb{R}_{\geq 0}$ is a $\sigma$-MS oracle for function $f\colon \mathbb{R}^d \rightarrow \mathbb{R}$ if for every $\bm{q} \in \mathbb{R}^d$ and $\lambda' > 0$, the points $(\bm{x}, \lambda) = \mathcal{O}(\bm{q}; \lambda')$ satisfy

$$\left\lVert \bm{x} - \bm{q} + \frac{1}{\lambda}\mathbf{M}^{-1}\nabla f(\bm{x}) \right\rVert_{\mathbf{M}} \leq \sigma \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}.$$
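The condition in Definition 5.1 is easy to test numerically for a candidate oracle output. The following is a minimal sketch, assuming a positive definite `M` so that the solve is well defined; the helper names `m_norm` and `satisfies_ms_condition`, the tolerance `tol`, and the quadratic example are hypothetical and not part of the source.

```python
import numpy as np


def m_norm(u, M):
    """Norm induced by a positive definite matrix M."""
    return float(np.sqrt(u @ M @ u))


def satisfies_ms_condition(x, q, lam, grad_x, M, sigma, tol=1e-12):
    """Check ||x - q + M^{-1} grad_x / lam||_M <= sigma ||x - q||_M (+ tol)."""
    residual = x - q + np.linalg.solve(M, grad_x) / lam
    return m_norm(residual, M) <= sigma * m_norm(x - q, M) + tol


# Hypothetical quadratic f(x) = 0.5 x^T H x - b^T x, so grad f(x) = H x - b.
H = np.array([[3.0, 1.0], [1.0, 4.0]])
b = np.array([1.0, -2.0])
M = np.diag([1.0, 2.0])
q = np.array([0.5, 0.5])
lam = 2.0

# Exact proximal point: minimizes f(x) + (lam / 2) ||x - q||_M^2.
x_prox = np.linalg.solve(H + lam * M, b + lam * (M @ q))
prox_ok = satisfies_ms_condition(x_prox, q, lam, H @ x_prox - b, M, sigma=0.0)

# Returning q itself fails, since the gradient at q is nonzero.
stale_ok = satisfies_ms_condition(q, q, lam, H @ q - b, M, sigma=0.5)
```

The exact proximal point makes the left-hand side vanish, so it passes even with $\sigma = 0$ (up to the floating-point tolerance). Returning $\bm{q}$ unchanged makes the right-hand side zero while the left-hand side stays positive, so it fails for every $\sigma$.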

###### Definition 5.2 (Movement bound [chjjs22, Definition 2]).

For a norm $\left\lVert \cdot \right\rVert_{\mathbf{M}}$ induced by positive semidefinite $\mathbf{M} \in \mathbb{R}^{d \times d}$, numbers $s \geq 1$, $c, \lambda > 0$, and $\bm{x}, \bm{y} \in \mathbb{R}^d$, we say that $(\bm{x}, \bm{y}, \lambda)$ satisfies an $(s, c)$-movement bound if

$$\left\lVert \bm{x} - \bm{y} \right\rVert_{\mathbf{M}} \geq \begin{cases} \left(\frac{\lambda}{c^s}\right)^{\frac{1}{s-1}} & \text{ if } s < \infty \\ \frac{1}{c} & \text{ if } s = \infty \end{cases}.$$
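The threshold in Definition 5.2 can be evaluated directly. The sketch below uses hypothetical helper names (`movement_threshold`, `satisfies_movement_bound`) and, as an assumption of this sketch, handles only $s > 1$ in the finite branch, because the exponent $1/(s-1)$ as written is undefined at $s = 1$.

```python
import math


def movement_threshold(lam, s, c):
    """Right-hand side of the (s, c)-movement bound."""
    if math.isinf(s):
        return 1.0 / c
    if s <= 1:
        # Assumption of this sketch: the finite branch is evaluated for s > 1 only.
        raise ValueError("finite branch requires s > 1")
    return (lam / c ** s) ** (1.0 / (s - 1.0))


def satisfies_movement_bound(dist_M, lam, s, c):
    """dist_M is the M-norm distance ||x - y||_M, computed by the caller."""
    return dist_M >= movement_threshold(lam, s, c)


finite_case = movement_threshold(8.0, 2, 2.0)       # (8 / 2^2)^(1 / 1)
ball_case = movement_threshold(1.0, math.inf, 4.0)  # 1 / c, independent of lam
```

In the $s = \infty$ case the threshold does not depend on $\lambda$: the triple satisfies the bound exactly when the two points are at least $1/c$ apart in the $\mathbf{M}$-norm.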

With these in hand, we are ready to state the convergence guarantee we get with [Algorithm 3](https://arxiv.org/html/2607.00252#alg3). See [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3).

###### Theorem 5.3 (Modification of [chjjs22, Theorem 1]).

Let $f\colon \mathbb{R}^d \rightarrow \mathbb{R}$ be convex and differentiable. Consider running [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) with parameters $\alpha = \exp\left(3 - \frac{2}{s+1}\right)$ and a $\sigma$-MS oracle with $0 \leq \sigma < 0.99$ ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)). Let $s \geq 1$ and $c > 0$ and suppose that for all $t$ such that $\lambda_t > \lambda_t'$ or $t = 1$, the iterates $(\widetilde{\bm{x}}_t, \bm{q}_{t-1}, \lambda_t)$ satisfy an $(s, c)$-movement bound ([Definition 5.2](https://arxiv.org/html/2607.00252#S5.Thmtheorem2)). Let $C$ be a universal constant. For any iteration count $T$ satisfying

$$T \geq C \begin{cases} s\left(\frac{c^s \left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}^{s+1}}{\varepsilon}\right)^{\frac{2}{3s+1}} & \text{ if } s < \infty \\ \left(c \left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}\right)^{2/3} \log\left(\frac{\lambda_1 \left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}^2}{\varepsilon}\right) & \text{ if } s = \infty \end{cases},$$

we have

$$f(\bm{x}_T) - f(\bm{x}^\star) \leq \varepsilon.$$
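As a sanity check on the exponents (a worked specialization added here, not taken from the source), write $R \coloneqq \left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}$, a shorthand used only in this remark. Plugging small values of $s$ into the finite case gives

$$\begin{aligned}
s = 1:&\quad T \geq C\left(\frac{c R^2}{\varepsilon}\right)^{1/2}, \\
s = 2:&\quad T \geq 2C\left(\frac{c^2 R^3}{\varepsilon}\right)^{2/7}.
\end{aligned}$$

The $s = 1$ case has the familiar $\varepsilon^{-1/2}$ dependence of accelerated first-order methods. For general finite $s$, the exponent on $c$ is $\frac{2s}{3s+1}$ and the exponent on $R$ is $\frac{2(s+1)}{3s+1}$; both tend to $\frac{2}{3}$ as $s \to \infty$, which is consistent with the $(cR)^{2/3}$ factor in the $s = \infty$ case.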

The proof of [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3) follows the same recipe as the proof of [chjjs22, Theorem 1]. The only modification needed is that stated in [Lemma 5.4](https://arxiv.org/html/2607.00252#S5.Thmtheorem4).

###### Lemma 5.4 (Replaces [chjjs22, Proposition 1]).

In the context of [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3), let $E_t \coloneqq f(\bm{x}_t) - f(\bm{x}^\star)$, $D_t \coloneqq \frac{1}{2}\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}}^2$, and $N_{t+1} \coloneqq \frac{1}{2}\left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}}^2$. Then, for all $t \geq 0$, we have

$$A_{t+1} E_{t+1} + D_{t+1} + (1 - \sigma^2) A_{t+1}' \min\left\{\lambda_{t+1}, \lambda_{t+1}'\right\} N_{t+1} \leq A_t E_t + D_t.$$

Consequently, for all $T \geq 1$, $\sqrt{A_T} \geq \frac{1}{2}\sum_{t \in \mathcal{S}_T^{\leq}} \frac{1}{\sqrt{\lambda_t'}}$,

$$E_T \leq \frac{D_0}{A_T} \quad \text{ and } \quad (1 - \sigma^2) \sum_{t \in \mathcal{S}_T^{\geq}} A_t \lambda_t' N_t \leq D_0 - A_T E_T.$$

###### Proof of [Lemma 5.4](https://arxiv.org/html/2607.00252#S5.Thmtheorem4).

This proof is a straightforward modification of [chjjs22, Proposition 1]. We have

$$\begin{aligned}
D_{t+1} &= \frac{1}{2}\left\lVert \bm{v}_{t+1} - \bm{x}^\star \right\rVert_{\mathbf{M}}^2 = \frac{1}{2}\left\lVert \bm{v}_t - a_{t+1}\mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) - \bm{x}^\star \right\rVert_{\mathbf{M}}^2 \\
&= D_t + a_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{x}^\star - \bm{v}_t \right\rangle_{\mathbf{M}} + \frac{a_{t+1}^2}{2}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2.
\end{aligned}$$

By definition of $\bm{q}_t$ and $A_{t+1}' = A_t + a_{t+1}'$, we have

$$a_{t+1}'\bm{v}_t = A_{t+1}'\bm{q}_t - A_t\bm{x}_t = a_{t+1}'\widetilde{\bm{x}}_{t+1} + A_{t+1}'\left(\bm{q}_t - \widetilde{\bm{x}}_{t+1}\right) - A_t\left(\bm{x}_t - \widetilde{\bm{x}}_{t+1}\right).$$

Subtracting $a_{t+1}'\bm{x}^\star$ and taking the inner product with $\mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1})$ gives

$$\begin{aligned}
&a_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{x}^\star - \bm{v}_t \right\rangle_{\mathbf{M}} \\
&\quad= \left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), a_{t+1}'(\bm{x}^\star - \widetilde{\bm{x}}_{t+1}) + A_{t+1}'\left(\widetilde{\bm{x}}_{t+1} - \bm{q}_t\right) + A_t\left(\bm{x}_t - \widetilde{\bm{x}}_{t+1}\right) \right\rangle_{\mathbf{M}} \\
&\quad\leq a_{t+1}'\left(f(\bm{x}^\star) - f(\widetilde{\bm{x}}_{t+1})\right) + A_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rangle_{\mathbf{M}} + A_t\left(f(\bm{x}_t) - f(\widetilde{\bm{x}}_{t+1})\right) \\
&\quad\leq A_t E_t - A_{t+1}'\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right) + A_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rangle_{\mathbf{M}}.
\end{aligned}$$

Rearranging gives

$$\begin{aligned}
A_{t+1}'\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right) &\leq A_t E_t + a_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} \\
&\quad + A_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rangle_{\mathbf{M}}.
\end{aligned}$$

Next, recall that by [Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1), we have

$$\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) + \lambda_{t+1}\left(\widetilde{\bm{x}}_{t+1} - \bm{q}_t\right) \right\rVert_{\mathbf{M}}^2 \leq \lambda_{t+1}^2 \sigma^2 \left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}}^2.$$

We use this to write

$$\begin{aligned}
&\lambda_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rangle_{\mathbf{M}} \\
&\quad= \frac{1}{2}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) + \lambda_{t+1}(\widetilde{\bm{x}}_{t+1} - \bm{q}_t) \right\rVert_{\mathbf{M}}^2 - \frac{1}{2}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2 - \frac{\lambda_{t+1}^2}{2}\left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}}^2 \\
&\quad\leq -\lambda_{t+1}^2 (1 - \sigma^2) N_{t+1} - \frac{1}{2}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2,
\end{aligned}$$

from which we conclude

$$\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rangle_{\mathbf{M}} \leq -\lambda_{t+1}(1 - \sigma^2) N_{t+1} - \frac{1}{2\lambda_{t+1}}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2.$$

Substituting back gives

$$\begin{aligned}
A_{t+1}'\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right) &\leq A_t E_t + a_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} \\
&\quad + A_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rangle_{\mathbf{M}} \\
&\leq A_t E_t + a_{t+1}'\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} \\
&\quad - A_{t+1}'\lambda_{t+1}(1 - \sigma^2) N_{t+1} - \frac{A_{t+1}'}{2\lambda_{t+1}}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2.
\end{aligned}$$

Next, recall that $\gamma_{t+1} a_{t+1}' = a_{t+1}$ and $\gamma_{t+1}\lambda_{t+1} = \min\left\{\lambda_{t+1}, \lambda_{t+1}'\right\}$ by construction. Let $\widehat{\lambda}_{t+1} \coloneqq \min\left\{\lambda_{t+1}, \lambda_{t+1}'\right\}$. We multiply both sides by $\gamma_{t+1}$ and conclude

$$\begin{aligned}
\gamma_{t+1} A_{t+1}'\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right) &\leq \gamma_{t+1} A_t E_t + a_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} \\
&\quad - A_{t+1}'\widehat{\lambda}_{t+1}(1 - \sigma^2) N_{t+1} - \frac{\gamma_{t+1} A_{t+1}'}{2\lambda_{t+1}}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2.
\end{aligned}$$

Now, by convexity of $f$ and from the definition of $\bm{x}_{t+1}$, we have

$$f(\bm{x}_{t+1}) - f(\bm{x}^\star) \leq \frac{(1 - \gamma_{t+1}) A_t}{A_{t+1}}\left(f(\bm{x}_t) - f(\bm{x}^\star)\right) + \frac{\gamma_{t+1} A_{t+1}'}{A_{t+1}}\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right).$$

Recall the definition of $E_t$, multiply both sides by $A_{t+1}$, apply our bound on $\gamma_{t+1} A_{t+1}'\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right)$, and we get

$$\begin{aligned}
A_{t+1} E_{t+1} &\leq (1 - \gamma_{t+1}) A_t E_t + \gamma_{t+1} A_{t+1}'\left(f(\widetilde{\bm{x}}_{t+1}) - f(\bm{x}^\star)\right) \\
&\leq A_t E_t + a_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} \\
&\quad - A_{t+1}'\widehat{\lambda}_{t+1}(1 - \sigma^2) N_{t+1} - \frac{\gamma_{t+1} A_{t+1}'}{2\lambda_{t+1}}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2.
\end{aligned}$$

After shifting terms around, we see that it remains to show

$$a_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} - \frac{\gamma_{t+1} A_{t+1}'}{2\lambda_{t+1}}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2 \overset{?}{\leq} D_t - D_{t+1}.$$

In fact, by the choice of $a_{t+1}'$ and the definition of $A_{t+1}'$, we have

$$\lambda_{t+1}'(a_{t+1}')^2 = a_{t+1}' + A_t = A_{t+1}'.$$

Multiply both sides by $\gamma_{t+1}^2 / (2\lambda_{t+1}')$ and we get

$$\frac{a_{t+1}^2}{2} = \frac{\gamma_{t+1}^2 A_{t+1}'}{2\lambda_{t+1}'} = \frac{\min\left\{1, \frac{\lambda_{t+1}'}{\lambda_{t+1}}\right\}\gamma_{t+1} A_{t+1}'}{2\lambda_{t+1}'} \leq \frac{\gamma_{t+1} A_{t+1}'}{2\lambda_{t+1}}.$$

We recycle an earlier computation and know that

$$\begin{aligned}
D_t - D_{t+1} &= a_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} - \frac{a_{t+1}^2}{2}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2 \\
&\geq a_{t+1}\left\langle \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}), \bm{v}_t - \bm{x}^\star \right\rangle_{\mathbf{M}} - \frac{\gamma_{t+1} A_{t+1}'}{2\lambda_{t+1}}\left\lVert \mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1}) \right\rVert_{\mathbf{M}}^2,
\end{aligned}$$

which completes the proof of the potential decrease.
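For completeness, the identity $\lambda_{t+1}'(a_{t+1}')^2 = a_{t+1}' + A_t$ used in this step can be checked directly from the update for $a_{t+1}'$ in the algorithm (a short verification added here). The quadratic equation in the unknown $a$,

$$\lambda_{t+1}' a^2 - a - A_t = 0 \quad\Longleftrightarrow\quad a = \frac{1 \pm \sqrt{1 + 4\lambda_{t+1}' A_t}}{2\lambda_{t+1}'},$$

has the algorithm's choice $a_{t+1}' = \frac{1}{2\lambda_{t+1}'}\left(1 + \sqrt{1 + 4\lambda_{t+1}' A_t}\right)$ as its positive root. Hence $\lambda_{t+1}'(a_{t+1}')^2 = a_{t+1}' + A_t$, and the right-hand side equals $A_{t+1}'$ by definition.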

The remaining statements follow as written in [chjjs22, Proof of Proposition 1], and we conclude the proof of [Lemma 5.4](https://arxiv.org/html/2607.00252#S5.Thmtheorem4). ∎

Now that we have shown [Lemma 5.4](https://arxiv.org/html/2607.00252#S5.Thmtheorem4), we refer the reader to [chjjs22, Appendix A] for the proof of [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3), as it now follows exactly as written there.

We also give additional bounds on the movement of the iterates in $\left\lVert \cdot \right\rVert_{\mathbf{M}}$, which are a straightforward adaptation of [msbacon, Lemma 31] to the improved framework from [chjjs22].

###### Lemma 5.5.

For all $t \geq 1$, we have both

$$\begin{aligned}
\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} &\leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} &\leq \left(\sqrt{2} + \max_{1 \leq i \leq t}\frac{\lambda_i'}{\lambda_i} \cdot \sqrt{\frac{2}{1 - \sigma^2}}\right)\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}
\end{aligned}.$$

In the statement of [Lemma 5.5](https://arxiv.org/html/2607.00252#S5.Thmtheorem5), the cost of overshooting the guess $\lambda_i'$ becomes evident: without an additional strong convexity guarantee, it is challenging to ensure that each iterate remains in a small ball around $\bm{x}^\star$. This is the main reason we are unable to apply the framework of [chjjs22] to the $p = \infty$ case.

###### Proof of [Lemma 5.5](https://arxiv.org/html/2607.00252#S5.Thmtheorem5).

Using the same notation as in [Lemma 5.4](https://arxiv.org/html/2607.00252#S5.Thmtheorem4) and in that proof, we define

$$P_t \coloneqq A_t E_t + D_t, \qquad \widehat{\lambda}_t \coloneqq \min\left\{\lambda_t, \lambda_t'\right\}.$$

By induction on the conclusion of [Lemma 5.4](https://arxiv.org/html/2607.00252#S5.Thmtheorem4), for $t \geq 1$ we have

$$\frac{1}{2}\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}}^2 = D_t \leq P_t + (1 - \sigma^2)\sum_{k=1}^{t} A_k'\widehat{\lambda}_k N_k \leq P_0 = \frac{1}{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}^2 \leq \left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}^2,$$

where $P_0 = D_0$ because $A_0 = 0$ and $\bm{v}_0 = \bm{x}_0$. Thus,

$$\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} \leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}}.$$

For the second conclusion, we introduce the following notation.

$$\begin{aligned}
\alpha_{t+1} &\coloneqq \frac{(1 - \gamma_{t+1}) A_t}{A_{t+1}} \\
\beta_{t+1} &\coloneqq \frac{A_t}{A_{t+1}'} \\
\delta_{t+1} &\coloneqq 1 - (1 - \alpha_{t+1})(1 - \beta_{t+1}) = 1 - \frac{\gamma_{t+1} A_{t+1}'}{A_{t+1}} \cdot \frac{a_{t+1}'}{A_{t+1}'} = \frac{A_t}{A_{t+1}}
\end{aligned}$$

We also establish for any $i$,

$$\frac{\gamma_i A_i'}{\lambda_i a_i^2} = \frac{A_i'}{\lambda_i \gamma_i (a_i')^2} = \frac{1}{\gamma_i} \cdot \frac{\lambda_i'}{\lambda_i} = \max\left\{\frac{\lambda_i'}{\lambda_i}, 1\right\},$$

which implies

$$\frac{\gamma_i A_i'}{\lambda_i} = a_i^2 \max\left\{\frac{\lambda_i'}{\lambda_i}, 1\right\}.$$

Notice that

$$\begin{aligned}
\left\lVert \bm{x}_{t+1} - \bm{x}^\star \right\rVert_{\mathbf{M}} &\leq \alpha_{t+1}\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + (1 - \alpha_{t+1})\left\lVert \widetilde{\bm{x}}_{t+1} - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&\leq \alpha_{t+1}\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + (1 - \alpha_{t+1})\left(\left\lVert \bm{q}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + \left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}}\right) \\
&\leq \alpha_{t+1}\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&\quad + (1 - \alpha_{t+1})\left(\beta_{t+1}\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + (1 - \beta_{t+1})\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + \left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}}\right) \\
&= \left(\beta_{t+1} + \alpha_{t+1} - \alpha_{t+1}\beta_{t+1}\right)\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&\quad + \left(1 - \alpha_{t+1}\right)\left(1 - \beta_{t+1}\right)\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + \left(1 - \alpha_{t+1}\right)\left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}} \\
&= \delta_{t+1}\left\lVert \bm{x}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + (1 - \delta_{t+1})\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} + \left(1 - \alpha_{t+1}\right)\left\lVert \widetilde{\bm{x}}_{t+1} - \bm{q}_t \right\rVert_{\mathbf{M}} \\
&\leq \prod_{i=0}^{t}\delta_{i+1}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \left(1 - \prod_{i=0}^{t}\delta_{i+1}\right)\left\lVert \bm{v}_t - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&\quad + \sum_{i=1}^{t+1}\prod_{j=i+1}^{t+1}\delta_j (1 - \alpha_i)\left\lVert \widetilde{\bm{x}}_i - \bm{q}_{i-1} \right\rVert_{\mathbf{M}} \\
&\leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \sum_{i=1}^{t+1}\prod_{j=i+1}^{t+1}\delta_j (1 - \alpha_i)\left\lVert \widetilde{\bm{x}}_i - \bm{q}_{i-1} \right\rVert_{\mathbf{M}} \\
&= \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \sum_{i=1}^{t+1}\frac{A_i}{A_{t+1}}(1 - \alpha_i)\left\lVert \widetilde{\bm{x}}_i - \bm{q}_{i-1} \right\rVert_{\mathbf{M}} \\
&= \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \sum_{i=1}^{t+1}\frac{A_i}{A_{t+1}} \cdot \frac{\gamma_i A_i'}{A_i}\left\lVert \widetilde{\bm{x}}_i - \bm{q}_{i-1} \right\rVert_{\mathbf{M}} \\
&= \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \frac{1}{A_{t+1}}\sum_{i=1}^{t+1}\sqrt{\frac{\gamma_i A_i'}{\lambda_i}} \cdot \sqrt{\lambda_i \gamma_i A_i'}\left\lVert \widetilde{\bm{x}}_i - \bm{q}_{i-1} \right\rVert_{\mathbf{M}} \\
&\leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \frac{\left(\sum_{i=1}^{t+1}\frac{\gamma_i A_i'}{\lambda_i}\right)^{1/2}}{A_{t+1}} \cdot \left(\sum_{i=1}^{t+1}\lambda_i \gamma_i A_i'\left\lVert \widetilde{\bm{x}}_i - \bm{q}_{i-1} \right\rVert_{\mathbf{M}}^2\right)^{1/2} \\
&\leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \frac{\left(\sum_{i=1}^{t+1}\frac{\gamma_i A_i'}{\lambda_i}\right)^{1/2}}{A_{t+1}} \cdot \sqrt{\frac{2}{1 - \sigma^2}}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&\leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \frac{\sum_{i=1}^{t+1} a_i \max\left\{1, \frac{\lambda_i'}{\lambda_i}\right\}}{A_{t+1}} \cdot \sqrt{\frac{2}{1 - \sigma^2}}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&\leq \sqrt{2}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} + \max_{1 \leq i \leq t+1}\frac{\lambda_i'}{\lambda_i} \cdot \sqrt{\frac{2}{1 - \sigma^2}}\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}} \\
&= \left(\sqrt{2} + \max_{1 \leq i \leq t+1}\frac{\lambda_i'}{\lambda_i} \cdot \sqrt{\frac{2}{1 - \sigma^2}}\right)\left\lVert \bm{x}_0 - \bm{x}^\star \right\rVert_{\mathbf{M}},
\end{aligned}$$

completing the proof of [Lemma 5.5](https://arxiv.org/html/2607.00252#S5.Thmtheorem5). ∎

## 6 Minimizing the Distributionally Robust Loss

The goal of this section is to prove [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1). We break up the proof into parts as described in [Section 2](https://arxiv.org/html/2607.00252#S2). We structure the section as follows. In the remainder of this subsection, we present [Algorithm 1](https://arxiv.org/html/2607.00252#alg1), our algorithm for minimizing the distributionally robust loss. In [Section 6.1](https://arxiv.org/html/2607.00252#S6.SS1), we introduce our smooth approximation for the objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)) and show that it is a good additive approximation (this is a standard argument, but we include it as it provides crucial intuition).

As the main difficulty in the proof of [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1) is to establish Hessian stability for our surrogate loss, we devote the bulk of this section to proving this. Recall that in [Section 2.1.1](https://arxiv.org/html/2607.00252#S2.SS1.SSS1), we claimed that a higher-order smoothness condition called *quasi-self-concordance* gives us the needed Hessian stability; in fact, this follows from [msbacon, Lemma 11]. In light of this, it suffices to demonstrate that our surrogate loss is quasi-self-concordant.

In [Section 6.2](https://arxiv.org/html/2607.00252#S6.SS2), we work out some calculus facts related to the softmax function. In particular, it is in [Section 6.2](https://arxiv.org/html/2607.00252#S6.SS2) that we prove the general composition result [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1), which states that if we take the softmax of several quasi-self-concordant functions, then the resulting function is also quasi-self-concordant. In [Section 6.3](https://arxiv.org/html/2607.00252#S6.SS3), we apply this composition fact to prove that our surrogate objective is quasi-self-concordant. Finally, in [Section 6.4](https://arxiv.org/html/2607.00252#S6.SS4), we combine these building blocks with the acceleration framework in [msbacon] and complete the proof of [Theorem 1](https://arxiv.org/html/2607.00252#Thmmainthm1).

### 6\.1 Smoothly Approximating the Objective

Recall that for $\bm{y}\in\mathbb{R}^{n}$, we define $\left\lVert\bm{y}\right\rVert_{\mathcal{G}_{\infty}}\coloneqq\max_{1\leq i\leq m}\left\lVert\bm{y}_{S_{i}}\right\rVert_{2}$, where $\bm{y}_{S_{i}}$ refers to the vector in $\mathbb{R}^{n_{i}}$ indexed by the indices in $S_{i}$. Also, for $\bm{y}\in\mathbb{R}^{m}$, let $\mathsf{lse}_{\beta}(\bm{y})$ refer to the function

$$\mathsf{lse}_{\beta}(\bm{y})\coloneqq\beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{y_{i}}{\beta}\right)\right).$$

At a high level, our algorithm will minimize the function

$$\widetilde{f}_{\beta,\delta}(\bm{x})\coloneqq\beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{\sqrt{\delta^{2}+\left\lVert\mathbf{A}_{S_{i}}\bm{x}-\bm{b}_{S_{i}}\right\rVert_{2}^{2}}-\delta}{\beta}\right)\right)$$

for appropriate choices of the parameters $\beta$ and $\delta$. This choice of smoothing is natural because of the following approximation statement; see [Lemma 6.1](https://arxiv.org/html/2607.00252#S6.Thmtheorem1).

###### Lemma 6\.1\.

For all $\bm{x}\in\mathbb{R}^{d}$, we have

$$\left\lvert\widetilde{f}_{\beta,\delta}(\bm{x})-\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}\right\rvert\leq\beta\log m+\delta.$$

###### Proof of[Lemma˜6\.1](https://arxiv.org/html/2607.00252#S6.Thmtheorem1)\.

These guarantees are well-known, but we prove them anyway for the sake of self-containment. We first prove that for any $\bm{v}\in\mathbb{R}^{m}$, we have

$$\max_{1\leq i\leq m}v_{i}\leq\mathsf{lse}_{\beta}(\bm{v})\leq\max_{1\leq i\leq m}v_{i}+\beta\log m.$$

In one direction, we have

$$\mathsf{lse}_{\beta}(\bm{v})\leq\beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{\max_{1\leq i\leq m}v_{i}}{\beta}\right)\right)=\beta\log m+\max_{1\leq i\leq m}v_{i},$$

and in the other, we have

$$\mathsf{lse}_{\beta}(\bm{v})\geq\beta\log\left(\exp\left(\frac{\max_{1\leq i\leq m}v_{i}}{\beta}\right)\right)=\max_{1\leq i\leq m}v_{i}.$$

Next, for $\bm{v}\in\mathbb{R}^{m}$, we will show that

$$\left\lVert\bm{v}\right\rVert_{2}-\delta\leq\sqrt{\delta^{2}+\left\lVert\bm{v}\right\rVert_{2}^{2}}-\delta\leq\left\lVert\bm{v}\right\rVert_{2}.$$

Indeed, we have

$$\sqrt{\delta^{2}+\left\lVert\bm{v}\right\rVert_{2}^{2}}-\delta\leq\sqrt{\delta^{2}}+\sqrt{\left\lVert\bm{v}\right\rVert_{2}^{2}}-\delta=\left\lVert\bm{v}\right\rVert_{2},$$

and

$$\sqrt{\delta^{2}+\left\lVert\bm{v}\right\rVert_{2}^{2}}-\delta\geq\sqrt{\left\lVert\bm{v}\right\rVert_{2}^{2}}-\delta=\left\lVert\bm{v}\right\rVert_{2}-\delta.$$

From this, we get

$$\widetilde{f}_{\beta,\delta}(\bm{x})\leq\max_{1\leq i\leq m}\left(\sqrt{\delta^{2}+\left\lVert\mathbf{A}_{S_{i}}\bm{x}-\bm{b}_{S_{i}}\right\rVert_{2}^{2}}-\delta\right)+\beta\log m\leq\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}+\beta\log m$$

and

$$\widetilde{f}_{\beta,\delta}(\bm{x})\geq\beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{\left\lVert\mathbf{A}_{S_{i}}\bm{x}-\bm{b}_{S_{i}}\right\rVert_{2}-\delta}{\beta}\right)\right)\geq\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}-\delta.$$

Putting these together gives

$$\left\lvert\widetilde{f}_{\beta,\delta}(\bm{x})-\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}\right\rvert\leq\max\left(\beta\log m,\delta\right)\leq\beta\log m+\delta,$$

completing the proof of [Lemma 6.1](https://arxiv.org/html/2607.00252#S6.Thmtheorem1). ∎

Eventually, we will choose $\beta=\varepsilon/(4\log m)$ and $\delta=\varepsilon/4$ and then minimize $\widetilde{f}_{\beta,\delta}$ to $\varepsilon/2$ additive error. With these choices, [Lemma 6.1](https://arxiv.org/html/2607.00252#S6.Thmtheorem1) bounds the approximation error by $\beta\log m+\delta=\varepsilon/2$, so this will be enough to get an $\varepsilon$-additive approximation to the optimum for $\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}$.
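As a sanity check on the approximation guarantee, the following minimal sketch evaluates the surrogate and the group-infinity norm on random data and compares their gap against $\beta\log m+\delta$. The problem sizes, the random instance, and the helper names `group_inf_norm` and `smoothed_loss` are illustrative assumptions, not part of the paper.

```python
import numpy as np

def group_inf_norm(r, groups):
    """Group-infinity norm: the largest Euclidean norm over the blocks S_i."""
    return max(np.linalg.norm(r[S]) for S in groups)

def smoothed_loss(r, groups, beta, delta):
    """Surrogate f~_{beta,delta}, evaluated at the residual r = A x - b."""
    v = np.array([np.sqrt(delta**2 + r[S] @ r[S]) - delta for S in groups])
    vmax = v.max()
    # Shift by the max before exponentiating to avoid overflow for small beta.
    return vmax + beta * np.log(np.sum(np.exp((v - vmax) / beta)))

rng = np.random.default_rng(0)
m, n_i, d = 5, 4, 3                       # hypothetical sizes
groups = [np.arange(i * n_i, (i + 1) * n_i) for i in range(m)]
A = rng.standard_normal((m * n_i, d))
b = rng.standard_normal(m * n_i)

eps = 0.1
beta, delta = eps / (4 * np.log(m)), eps / 4

max_gap = 0.0
for _ in range(100):
    x = rng.standard_normal(d)
    r = A @ x - b
    gap = abs(smoothed_loss(r, groups, beta, delta) - group_inf_norm(r, groups))
    max_gap = max(max_gap, gap)

bound = beta * np.log(m) + delta          # equals eps / 2 for these parameters
print(max_gap <= bound)
```

The log-sum-exp is computed with the usual max-shift, which matters here because $\beta$ shrinks with $\varepsilon$ and the unshifted exponentials would overflow quickly.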

### 6\.2 Calculus for LogSumExp

We investigate certain properties of $\mathsf{lse}_{\beta}(\bm{y})$ when each entry $[\bm{y}]_{i}$ is a function $h_{i}(t)$ of $t\in\mathbb{R}$, for all $i\in[m]$. Let $h(t)\in\mathbb{R}^{m}$ denote the vector whose $i$th entry is given by $h_{i}(t)$. We treat each $h_{i}$ as a one-dimensional restriction of a function $g_{i}\colon\mathbb{R}^{m}\rightarrow\mathbb{R}$, so $h_{i}(t)=g_{i}(\bm{y}+t\bm{d})$ for center $\bm{y}$ and direction $\bm{d}$ (we omit the parameters $\bm{y},\bm{d}$ in the notation $h_{i}$, as they will be clear from context). Finally, recall the definition of quasi-self-concordance ([Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1)).

We begin by calculating the first two derivatives of $\mathsf{lse}_{\beta}(h(t))$ with respect to $t$ in [Lemma 6.2](https://arxiv.org/html/2607.00252#S6.Thmtheorem2).

###### Lemma 6\.2\.

Let $\lambda_{i}(t)\coloneqq\exp\left(h_{i}(t)/\beta\right)$. Then, we have

$$\begin{aligned}
\left(\frac{d}{dt}\right)\mathsf{lse}_{\beta}(h(t))&=\frac{\sum_{i=1}^{m}\left(\lambda_{i}(t)\cdot h_{i}^{\prime}(t)\right)}{\sum_{i=1}^{m}\lambda_{i}(t)},\\
\left(\frac{d}{dt}\right)^{2}\mathsf{lse}_{\beta}(h(t))&=\frac{1}{\beta}\left(\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}}{\sum_{i=1}^{m}\lambda_{i}(t)}-\left(\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}\right)^{2}\right)+\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}.
\end{aligned}$$

###### Proof of[Lemma˜6\.2](https://arxiv.org/html/2607.00252#S6.Thmtheorem2)\.

The first derivative follows from the chain rule. Indeed, we have

$$\mathsf{lse}_{\beta}^{\prime}(h(t))=\beta\cdot\frac{\sum_{i=1}^{m}\lambda_{i}^{\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}=\beta\cdot\frac{\sum_{i=1}^{m}\left(\lambda_{i}(t)\cdot\frac{h_{i}^{\prime}(t)}{\beta}\right)}{\sum_{i=1}^{m}\lambda_{i}(t)}=\frac{\sum_{i=1}^{m}\left(\lambda_{i}(t)\cdot h_{i}^{\prime}(t)\right)}{\sum_{i=1}^{m}\lambda_{i}(t)}\leq\max_{i}h_{i}^{\prime}(t).$$

For the second derivative, we use the differentiation rule for multiplication and division and the chain rule, giving

$$\begin{aligned}
\mathsf{lse}_{\beta}^{\prime\prime}(h(t))&=\frac{\left[\left(\sum_{i=1}^{m}\lambda_{i}^{\prime}(t)h_{i}^{\prime}(t)+\lambda_{i}(t)h_{i}^{\prime\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)\right]-\frac{1}{\beta}\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)^{2}}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\\
&=\frac{\left[\frac{1}{\beta}\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}+\beta\lambda_{i}(t)h_{i}^{\prime\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)\right]-\frac{1}{\beta}\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)^{2}}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\\
&=\frac{1}{\beta}\left(\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}}{\sum_{i=1}^{m}\lambda_{i}(t)}-\frac{\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)^{2}}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\right)+\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}.
\end{aligned}$$

This completes the proof of [Lemma 6.2](https://arxiv.org/html/2607.00252#S6.Thmtheorem2). ∎
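The two derivative formulas can be checked against central finite differences. The sketch below does this for a hypothetical family $h_{i}(t)=a_{i}\sin t+c_{i}t^{2}$; the coefficients, the value of $\beta$, and the evaluation point are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

beta = 0.3
a = np.array([0.5, -1.0, 2.0])            # hypothetical coefficients
c = np.array([1.0, 0.7, -0.4])            # h_i(t) = a_i * sin(t) + c_i * t**2

def h(t):
    return a * np.sin(t) + c * t**2

def dh(t):
    return a * np.cos(t) + 2 * c * t

def d2h(t):
    return -a * np.sin(t) + 2 * c

def lse(t):
    v = h(t)
    return v.max() + beta * np.log(np.sum(np.exp((v - v.max()) / beta)))

def softmax_weights(t):
    """Normalized weights lambda_i(t) / sum_j lambda_j(t), computed stably."""
    v = h(t)
    w = np.exp((v - v.max()) / beta)
    return w / w.sum()

def lse_d1(t):
    # First derivative: softmax-weighted average of h_i'(t).
    return softmax_weights(t) @ dh(t)

def lse_d2(t):
    # Second derivative: (1/beta) * weighted variance of h' + weighted mean of h''.
    p = softmax_weights(t)
    mean = p @ dh(t)
    var = p @ dh(t)**2 - mean**2
    return var / beta + p @ d2h(t)

t0, step = 0.4, 1e-4
fd1 = (lse(t0 + step) - lse(t0 - step)) / (2 * step)
fd2 = (lse(t0 + step) - 2 * lse(t0) + lse(t0 - step)) / step**2
err1 = abs(fd1 - lse_d1(t0))
err2 = abs(fd2 - lse_d2(t0))
print(err1, err2)
```

Both errors should be at the level of finite-difference truncation error; the first-derivative value is also a weighted average of the $h_{i}^{\prime}(t)$, so it cannot exceed their maximum.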

Next, we prove a general fact regarding composing $\mathsf{lse}$ with a vector formed by functions that are themselves quasi-self-concordant; see [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1). As far as we are aware, this type of composition result was not previously known and may be of independent interest.

To prove[Definition˜2\.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1), we need[Lemma˜6\.3](https://arxiv.org/html/2607.00252#S6.Thmtheorem3)\.

###### Lemma 6\.3\.

For any two random variables $X,Y$, we have

$$\mathsf{Var}\left[XY\right]\leq 2\left\lVert Y\right\rVert_{\infty}^{2}\mathsf{Var}\left[X\right]+2\left\lVert X\right\rVert_{\infty}^{2}\mathsf{Var}\left[Y\right].$$

###### Proof of[Lemma˜6\.3](https://arxiv.org/html/2607.00252#S6.Thmtheorem3)\.

The proof follows that of \[var\_stackexchange\], but we reproduce it here for completeness. First, notice that for random variables $U,V$, we have

$$2\mathsf{Var}\left[U\right]+2\mathsf{Var}\left[V\right]-\mathsf{Var}\left[U+V\right]=\mathsf{Var}\left[U\right]+\mathsf{Var}\left[V\right]-2\mathsf{Cov}\left[U,V\right]=\mathsf{Var}\left[U-V\right]\geq 0.$$

Let $U=(X-\mathbb{E}\left[X\right])Y$ and $V=\mathbb{E}\left[X\right]Y$. Then, $U+V=XY$, and we have

$$\mathsf{Var}\left[XY\right]\leq 2\mathsf{Var}\left[(X-\mathbb{E}\left[X\right])Y\right]+2\mathsf{Var}\left[\mathbb{E}\left[X\right]Y\right]=2\mathsf{Var}\left[(X-\mathbb{E}\left[X\right])Y\right]+2\mathbb{E}\left[X\right]^{2}\mathsf{Var}\left[Y\right].$$

It remains to bound $\mathsf{Var}\left[(X-\mathbb{E}\left[X\right])Y\right]$. By Hölder's inequality, we have

$$\mathsf{Var}\left[(X-\mathbb{E}\left[X\right])Y\right]\leq\mathbb{E}\left[((X-\mathbb{E}\left[X\right])Y)^{2}\right]\leq\mathbb{E}\left[(X-\mathbb{E}\left[X\right])^{2}\right]\left\lVert Y\right\rVert_{\infty}^{2}=\mathsf{Var}\left[X\right]\left\lVert Y\right\rVert_{\infty}^{2}.$$

Combining everything, together with $\mathbb{E}\left[X\right]^{2}\leq\left\lVert X\right\rVert_{\infty}^{2}$, gives us the conclusion of [Lemma 6.3](https://arxiv.org/html/2607.00252#S6.Thmtheorem3). ∎
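The variance inequality can be spot-checked on finite probability spaces, where $X$ and $Y$ are arbitrary (possibly dependent) functions of the same outcome. This is a minimal sketch; the support size, the Dirichlet-sampled probabilities, and the helper `var` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def var(p, z):
    """Variance of the values z under the probability vector p."""
    mean = p @ z
    return p @ (z - mean)**2

k = 6                                      # hypothetical support size
worst_slack = np.inf
for _ in range(1000):
    p = rng.dirichlet(np.ones(k))          # random distribution on k outcomes
    X = rng.standard_normal(k)             # X and Y share the same outcome space,
    Y = rng.standard_normal(k)             # so they may be arbitrarily dependent
    lhs = var(p, X * Y)
    rhs = (2 * np.max(np.abs(Y))**2 * var(p, X)
           + 2 * np.max(np.abs(X))**2 * var(p, Y))
    worst_slack = min(worst_slack, rhs - lhs)

print(worst_slack >= 0)
```

In the composition proof the lemma is applied with $X=Y=h^{\prime}(t)$ under the softmax distribution, which is exactly this finite-support setting.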

We are now ready to prove[Definition˜2\.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1)\.

###### Proof of[Definition˜2\.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1)\.

Let $\lambda_{i}(t)\coloneqq\exp\left(h_{i}(t)/\beta\right)$.

In this proof, we will encounter many weighted averages of vectors $\bm{z}\in\mathbb{R}^{m}$ of the form

$$\frac{\sum_{i=1}^{m}\lambda_{i}(t)z_{i}}{\sum_{i=1}^{m}\lambda_{i}(t)}.$$

Let $\mathcal{D}$ be the distribution over $[m]$ whose entries are given by $\mathcal{D}_{j}=\lambda_{j}(t)/\sum_{i=1}^{m}\lambda_{i}(t)$. In the rest of this proof, all expected values, variances, and covariances will be taken with respect to this distribution. In an abuse of notation, let $h(t)$ denote the "random" variable that is $h_{i}(t)$ with probability $\mathcal{D}_{i}$. Define $h^{\prime}(t),h^{\prime\prime}(t),h^{\prime\prime\prime}(t)$ analogously.

To find the third derivative of $\mathsf{lse}_{\beta}(h(t))$, we start with its second derivative. By [Lemma 6.2](https://arxiv.org/html/2607.00252#S6.Thmtheorem2), it is given by

$$\begin{aligned}
\mathsf{lse}_{\beta}^{\prime\prime}(h(t))&=\underbrace{\frac{1}{\beta}\left(\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}}{\sum_{i=1}^{m}\lambda_{i}(t)}-\left(\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}\right)^{2}\right)}_{T_{1}}+\underbrace{\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}}_{T_{2}}\\
&=\frac{1}{\beta}\mathsf{Var}\left[h^{\prime}(t)\right]+\mathbb{E}\left[h^{\prime\prime}(t)\right].
\end{aligned}$$

We now differentiate the above term by term. First, we have

$$\begin{aligned}
T_{2}^{\prime}(t)&=\frac{\sum_{i=1}^{m}\lambda_{i}(t)\left(\left(\frac{h_{i}^{\prime}(t)h_{i}^{\prime\prime}(t)}{\beta}\right)+h_{i}^{\prime\prime\prime}(t)\right)}{\sum_{i=1}^{m}\lambda_{i}(t)}-\frac{1}{\beta}\cdot\frac{\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime\prime}(t)\right)}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\\
&=\frac{1}{\beta}\left(\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)h_{i}^{\prime\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}-\frac{\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime\prime}(t)\right)}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\right)+\frac{\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime\prime\prime}(t)}{\sum_{i=1}^{m}\lambda_{i}(t)}\\
&=\frac{1}{\beta}\mathsf{Cov}\left[h^{\prime}(t),h^{\prime\prime}(t)\right]+\mathbb{E}\left[h^{\prime\prime\prime}(t)\right].
\end{aligned}$$

Next, we have

$$\frac{d}{dt}\mathbb{E}\left[h^{\prime}(t)\right]^{2}=2\mathbb{E}\left[h^{\prime}(t)\right]\cdot\frac{d}{dt}\mathbb{E}\left[h^{\prime}(t)\right]=2\mathbb{E}\left[h^{\prime}(t)\right]\left(\frac{1}{\beta}\mathsf{Var}\left[h^{\prime}(t)\right]+\mathbb{E}\left[h^{\prime\prime}(t)\right]\right)$$

and

$$\begin{aligned}
\frac{d}{dt}\mathbb{E}\left[h^{\prime}(t)^{2}\right]&=\frac{\left(\sum_{i=1}^{m}\lambda_{i}^{\prime}(t)h_{i}^{\prime}(t)^{2}+2h_{i}^{\prime}(t)h_{i}^{\prime\prime}(t)\lambda_{i}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)-\frac{1}{\beta}\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}\right)}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\\
&=\frac{\left(\sum_{i=1}^{m}\lambda_{i}^{\prime}(t)h_{i}^{\prime}(t)^{2}+2h_{i}^{\prime}(t)h_{i}^{\prime\prime}(t)\lambda_{i}(t)\right)}{\sum_{i=1}^{m}\lambda_{i}(t)}-\frac{1}{\beta}\cdot\frac{\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}\right)}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\\
&=\frac{\sum_{i=1}^{m}\lambda_{i}(t)\left(\frac{h_{i}^{\prime}(t)^{3}}{\beta}+2h_{i}^{\prime}(t)h_{i}^{\prime\prime}(t)\right)}{\sum_{i=1}^{m}\lambda_{i}(t)}-\frac{1}{\beta}\cdot\frac{\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)\right)\left(\sum_{i=1}^{m}\lambda_{i}(t)h_{i}^{\prime}(t)^{2}\right)}{\left(\sum_{i=1}^{m}\lambda_{i}(t)\right)^{2}}\\
&=\frac{1}{\beta}\mathsf{Cov}\left[h^{\prime}(t),h^{\prime}(t)^{2}\right]+2\mathbb{E}\left[h^{\prime}(t)h^{\prime\prime}(t)\right].
\end{aligned}$$

Combining everything gives us

$$\begin{aligned}
\mathsf{lse}_{\beta}^{\prime\prime\prime}(h(t))&=\frac{1}{\beta}\left(\frac{1}{\beta}\mathsf{Cov}\left[h^{\prime}(t),h^{\prime}(t)^{2}\right]+2\mathbb{E}\left[h^{\prime}(t)h^{\prime\prime}(t)\right]-2\mathbb{E}\left[h^{\prime}(t)\right]\left(\frac{1}{\beta}\mathsf{Var}\left[h^{\prime}(t)\right]+\mathbb{E}\left[h^{\prime\prime}(t)\right]\right)\right)\\
&\quad+\frac{1}{\beta}\mathsf{Cov}\left[h^{\prime}(t),h^{\prime\prime}(t)\right]+\mathbb{E}\left[h^{\prime\prime\prime}(t)\right]\\
&=\frac{1}{\beta^{2}}\mathsf{Cov}\left[h^{\prime}(t),h^{\prime}(t)^{2}\right]-\frac{2}{\beta^{2}}\mathbb{E}\left[h^{\prime}(t)\right]\mathsf{Var}\left[h^{\prime}(t)\right]+\frac{3}{\beta}\mathsf{Cov}\left[h^{\prime}(t),h^{\prime\prime}(t)\right]+\mathbb{E}\left[h^{\prime\prime\prime}(t)\right].
\end{aligned}$$

We first analyze the terms that only depend on $h^{\prime}(t)$. To do so, we use [Lemma 6.3](https://arxiv.org/html/2607.00252#S6.Thmtheorem3) to write

$$\left\lvert\mathsf{Cov}\left[h^{\prime}(t),h^{\prime}(t)^{2}\right]\right\rvert\leq\sqrt{\mathsf{Var}\left[h^{\prime}(t)\right]}\sqrt{\mathsf{Var}\left[h^{\prime}(t)^{2}\right]}\leq 2\left\lVert\bm{d}\right\rVert\mathsf{Var}\left[h^{\prime}(t)\right].$$

Now, we have

$$\begin{aligned}
\frac{1}{\beta^{2}}\left\lvert\mathsf{Cov}\left[h^{\prime}(t),h^{\prime}(t)^{2}\right]-2\mathbb{E}\left[h^{\prime}(t)\right]\mathsf{Var}\left[h^{\prime}(t)\right]\right\rvert&\leq\frac{1}{\beta^{2}}\left\lvert\mathsf{Cov}\left[h^{\prime}(t),h^{\prime}(t)^{2}\right]\right\rvert+\frac{2}{\beta^{2}}\left\lvert\mathbb{E}\left[h^{\prime}(t)\right]\mathsf{Var}\left[h^{\prime}(t)\right]\right\rvert\\
&\leq\frac{4}{\beta^{2}}\left\lVert\bm{d}\right\rVert\mathsf{Var}\left[h^{\prime}(t)\right]\leq\frac{4}{\beta}\left\lVert\bm{d}\right\rVert\mathsf{lse}_{\beta}^{\prime\prime}(h(t)).
\end{aligned}$$

Next, we take care of the remaining terms. We have

$$\begin{aligned}
\frac{3}{\beta}\left\lvert\mathsf{Cov}\left[h^{\prime}(t),h^{\prime\prime}(t)\right]\right\rvert+\left\lvert\mathbb{E}\left[h^{\prime\prime\prime}(t)\right]\right\rvert&\leq\frac{6}{\beta}\left(\max_{i}h_{i}^{\prime}(t)\right)\mathbb{E}\left[\left\lvert h^{\prime\prime}(t)-\mathbb{E}\left[h^{\prime\prime}(t)\right]\right\rvert\right]+\left\lvert\mathbb{E}\left[h^{\prime\prime\prime}(t)\right]\right\rvert\\
&\leq\frac{12}{\beta}\left\lVert\bm{d}\right\rVert\mathsf{lse}_{\beta}^{\prime\prime}(h(t))+\mathbb{E}\left[\left\lvert h^{\prime\prime\prime}(t)\right\rvert\right]\\
&\leq\frac{12}{\beta}\left\lVert\bm{d}\right\rVert\mathsf{lse}_{\beta}^{\prime\prime}(h(t))+\nu\left\lVert\bm{d}\right\rVert\mathbb{E}\left[h^{\prime\prime}(t)\right]\\
&\leq\left(\frac{12}{\beta}+\nu\right)\left\lVert\bm{d}\right\rVert\mathsf{lse}_{\beta}^{\prime\prime}(h(t)),
\end{aligned}$$

where the penultimate line follows from [Lemma 6.6](https://arxiv.org/html/2607.00252#S6.Thmtheorem6). Combining these conclusions yields

$$\left\lvert\mathsf{lse}_{\beta}^{\prime\prime\prime}(h(t))\right\rvert\leq\left(\frac{16}{\beta}+\nu\right)\left\lVert\bm{d}\right\rVert\mathsf{lse}_{\beta}^{\prime\prime}(h(t)),$$

completing the proof of [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1). ∎
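The closed form for the third derivative derived in this proof (covariance and variance terms under the softmax distribution $\mathcal{D}$) can be compared with a central difference of the second-derivative formula from Lemma 6.2. The sketch below uses a hypothetical family $h_{i}(t)=a_{i}\sin t+c_{i}t^{2}$ with arbitrary coefficients; it checks only the identity, not the quasi-self-concordance bound.

```python
import numpy as np

beta = 0.3
a = np.array([0.5, -1.0, 2.0])            # hypothetical coefficients
c = np.array([1.0, 0.7, -0.4])

def derivs(t):
    """Return h, h', h'', h''' for the test functions h_i(t) = a_i sin(t) + c_i t^2."""
    return (a * np.sin(t) + c * t**2,
            a * np.cos(t) + 2 * c * t,
            -a * np.sin(t) + 2 * c,
            -a * np.cos(t))

def weights(h0):
    """Softmax distribution D_j = lambda_j / sum_i lambda_i, computed stably."""
    w = np.exp((h0 - h0.max()) / beta)
    return w / w.sum()

def cov(p, u, v):
    """Covariance of u and v under the probability vector p."""
    return p @ (u * v) - (p @ u) * (p @ v)

def lse_d2(t):
    # Second derivative: (1/beta) Var[h'] + E[h''].
    h0, h1, h2, _ = derivs(t)
    p = weights(h0)
    return cov(p, h1, h1) / beta + p @ h2

def lse_d3(t):
    # Third derivative, term by term as in the combined expression of the proof.
    h0, h1, h2, h3 = derivs(t)
    p = weights(h0)
    return (cov(p, h1, h1**2) / beta**2
            - 2 * (p @ h1) * cov(p, h1, h1) / beta**2
            + 3 * cov(p, h1, h2) / beta
            + p @ h3)

t0, step = 0.4, 1e-5
fd3 = (lse_d2(t0 + step) - lse_d2(t0 - step)) / (2 * step)
err3 = abs(fd3 - lse_d3(t0))
print(err3)
```

Differencing the analytic second derivative, rather than taking a third-order difference of the function itself, keeps the floating-point cancellation error small.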

### 6\.3 Smoothness and Quasi\-self\-concordance of the Modified Objective

The main result of this subsection is[Lemma˜6\.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4)\.

###### Lemma 6\.4\.

Let $\mathbf{W}$ be such that for all $\bm{z}\in\mathbb{R}^{d}$, we have $\left\lVert\mathbf{A}\bm{z}\right\rVert_{\mathcal{G}_{\infty}}\leq\left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{z}\right\rVert_{2}$. For all $\bm{x},\bm{z}\in\mathbb{R}^{d}$ and $t\in\mathbb{R}$, we have

$$\begin{aligned}
\left(\frac{d}{dt}\right)^{2}\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z})&\leq\left(\frac{1}{\delta}+\frac{1}{\beta}\right)\left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{z}\right\rVert_{2}^{2}&&\text{(smoothness)},\\
\left\lvert\left(\frac{d}{dt}\right)^{3}\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z})\right\rvert&\leq\left(\frac{16}{\delta}+\frac{3}{\beta}\right)\left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{z}\right\rVert_{2}\left(\frac{d}{dt}\right)^{2}\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z})&&\text{(quasi-self-concordance)}.
\end{aligned}$$

Our goal in the rest of this section is to prove[Lemma˜6\.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4)\.

We begin by defining $h_{i}(t)$ as follows, absorbing the parameters $\delta,\bm{y},\bm{d}$ into the definition of $h_{i}$:

$$h_{i}(t)\coloneqq\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_{i}}+t\bm{d}_{S_{i}}\right\rVert_{2}^{2}}.$$

Let $h(t)$ denote the vector whose $i$th entry is $h_{i}(t)$. Then, observe that

$$\mathsf{lse}_{\beta}(h(t))=\beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{h_{i}(t)}{\beta}\right)\right)=\beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_{i}}+t\bm{d}_{S_{i}}\right\rVert_{2}^{2}}}{\beta}\right)\right).$$

It is easy to see that every one-dimensional restriction of $\widetilde{f}_{\beta,\delta}$ can be obtained by an affine transformation of $\mathsf{lse}_{\beta}(h(t))$ after appropriate choices of $\bm{y},\bm{d}\in\mathbb{R}^{n}$: taking $\bm{y}=\mathbf{A}\bm{x}-\bm{b}$ and $\bm{d}=\mathbf{A}\bm{z}$ gives $\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z})=\mathsf{lse}_{\beta}(h(t))-\delta$. Hence, we first analyze $\mathsf{lse}_{\beta}(h(t))$ for all $\bm{y},\bm{d}\in\mathbb{R}^{n}$.

We begin by proving the smoothness of $\mathsf{lse}_{\beta}(h(t))$ with respect to $\left\lVert\cdot\right\rVert_{\mathcal{G}_{\infty}}$.

###### Lemma 6\.5\.

For all $\bm{y},\bm{d}\in\mathbb{R}^{m}$ and all $t\in\mathbb{R}$, we have

$$\left(\frac{d}{dt}\right)^{2}\mathsf{lse}_{\beta}(h(t)) \leq \left(\frac{1}{\delta}+\frac{1}{\beta}\right)\left\lVert\bm{d}\right\rVert_{\mathcal{G}_{\infty}}^{2}.$$

###### Proof of [Lemma 6.5](https://arxiv.org/html/2607.00252#S6.Thmtheorem5).

By direct calculation, it is easy to see that

$$h_i'(t) = \frac{\left\langle\bm{y}_{S_i}+t\bm{d}_{S_i},\bm{d}_{S_i}\right\rangle}{h_i(t)}, \tag{6.1}$$

$$h_i''(t) = \frac{\left\lVert\bm{d}_{S_i}\right\rVert_{2}^{2}h_i(t)-h_i'(t)^{2}h_i(t)}{h_i(t)^{2}} = \frac{\left\lVert\bm{d}_{S_i}\right\rVert_{2}^{2}-h_i'(t)^{2}}{h_i(t)}.$$

We plug this into the result of [Lemma 6.2](https://arxiv.org/html/2607.00252#S6.Thmtheorem2) and get

$$\begin{aligned}
\mathsf{lse}_{\beta}''(h(t)) &\leq \frac{1}{\beta}\max_{i}h_i'(t)^{2}+\max_{i}h_i''(t) \\
&= \frac{1}{\beta}\max_{i}\left(\frac{\left\langle\bm{y}_{S_i}+t\bm{d}_{S_i},\bm{d}_{S_i}\right\rangle}{\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}+t\bm{d}_{S_i}\right\rVert_{2}^{2}}}\right)^{2}+\max_{i}\frac{\left\lVert\bm{d}_{S_i}\right\rVert_{2}^{2}-h_i'(t)^{2}}{\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}+t\bm{d}_{S_i}\right\rVert_{2}^{2}}} \\
&\leq \frac{1}{\beta}\max_{i}\left\lVert\bm{d}_{S_i}\right\rVert_{2}^{2}+\frac{1}{\delta}\max_{i}\left\lVert\bm{d}_{S_i}\right\rVert_{2}^{2} = \left(\frac{1}{\beta}+\frac{1}{\delta}\right)\left\lVert\bm{d}\right\rVert_{\mathcal{G}_{\infty}}^{2},
\end{aligned}$$

completing the proof of [Lemma 6.5](https://arxiv.org/html/2607.00252#S6.Thmtheorem5). ∎
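The smoothness bound of Lemma 6.5 can be sanity-checked numerically. The following is a minimal sketch, not part of the paper's algorithm: it estimates the second derivative of $\mathsf{lse}_{\beta}(h(t))$ by central finite differences and compares it with $(1/\delta+1/\beta)\lVert\bm{d}\rVert_{\mathcal{G}_{\infty}}^{2}$. The group partition, the random data, the values of `beta` and `delta`, and all function names are illustrative assumptions.

```python
import math
import random

def h_vec(t, y, d, groups, delta):
    """h_i(t) = sqrt(delta^2 + ||y_{S_i} + t d_{S_i}||_2^2) for each group S_i."""
    return [math.sqrt(delta**2 + sum((y[j] + t * d[j]) ** 2 for j in S)) for S in groups]

def lse(values, beta):
    """Numerically stable beta * log(sum_i exp(values_i / beta))."""
    top = max(values)
    return top + beta * math.log(sum(math.exp((v - top) / beta) for v in values))

def phi(t, y, d, groups, beta, delta):
    """The one-dimensional function t -> lse_beta(h(t))."""
    return lse(h_vec(t, y, d, groups, delta), beta)

def second_difference(fn, t, step=1e-4):
    """Central finite-difference estimate of fn''(t)."""
    return (fn(t + step) - 2.0 * fn(t) + fn(t - step)) / step**2

rng = random.Random(0)
groups = [range(0, 3), range(3, 5), range(5, 9)]  # hypothetical partition S_1, S_2, S_3
y = [rng.gauss(0, 1) for _ in range(9)]
d = [rng.gauss(0, 1) for _ in range(9)]
beta, delta = 0.5, 0.3                            # arbitrary illustrative values

# ||d||_{G_inf}^2 is the largest squared Euclidean norm over the groups.
d_norm_sq = max(sum(d[j] ** 2 for j in S) for S in groups)
bound = (1.0 / delta + 1.0 / beta) * d_norm_sq

# Largest estimated second derivative over a grid of t in [-3, 3].
worst = max(
    second_difference(lambda s: phi(s, y, d, groups, beta, delta), -3.0 + 0.1 * k)
    for k in range(61)
)
print(worst <= bound)
```

A grid check like this cannot prove the lemma; it only guards against transcription errors in the constants.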

Our next task is to show that $\mathsf{lse}_{\beta}(h(t))$ is $O(1/\beta+1/\delta)$-quasi-self-concordant in $\left\lVert\cdot\right\rVert_{\mathcal{G}_{\infty}}$. To do so, we will appeal to [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1), which first requires the quasi-self-concordance of each component function $h_i$ in $\mathsf{lse}_{\beta}(h(t))$.

###### Lemma 6\.6\.

For all $\bm{y},\bm{d}\in\mathbb{R}^{m}$ and all $t\in\mathbb{R}$, we have

$$\left\lvert\left(\frac{d}{dt}\right)^{3}\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}+t\bm{d}_{S_i}\right\rVert_{2}^{2}}\right\rvert \leq \frac{3}{\delta}\left\lVert\bm{d}_{S_i}\right\rVert_{2}\left(\left(\frac{d}{dt}\right)^{2}\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}+t\bm{d}_{S_i}\right\rVert_{2}^{2}}\right).$$

###### Proof of [Lemma 6.6](https://arxiv.org/html/2607.00252#S6.Thmtheorem6).

Although a similar fact appears in [ob18, Section 2.1.2], it is not in the exact form we need. So, we prove the required statement here.

Recycling the computation from ([6.1](https://arxiv.org/html/2607.00252#S6.E1)), recall

$$h_i''(t) = \frac{\left\lVert\bm{d}_{S_i}\right\rVert_{2}^{2}-h_i'(t)^{2}}{h_i(t)},$$

which gives

$$h_i'''(t) = \frac{-2h_i'(t)h_i''(t)h_i(t)-h_i'(t)\left(h_i(t)h_i''(t)\right)}{h_i(t)^{2}} = -\frac{3h_i'(t)h_i''(t)}{h_i(t)}.$$

Finally, again recalling ([6.1](https://arxiv.org/html/2607.00252#S6.E1)), notice that

$$\left\lvert\frac{h_i'(t)}{h_i(t)}\right\rvert = \left\lvert\frac{\left\langle\bm{y}_{S_i}+t\bm{d}_{S_i},\bm{d}_{S_i}\right\rangle}{h_i(t)^{2}}\right\rvert = \left\lvert\left\langle\frac{\bm{y}_{S_i}+t\bm{d}_{S_i}}{\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}+t\bm{d}_{S_i}\right\rVert_{2}^{2}}},\frac{\bm{d}_{S_i}}{\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}+t\bm{d}_{S_i}\right\rVert_{2}^{2}}}\right\rangle\right\rvert \leq \frac{\left\lVert\bm{d}_{S_i}\right\rVert_{2}}{\delta}.$$

Since $h_i''(t)\geq 0$ by Cauchy-Schwarz, combining the last two displays gives $\left\lvert h_i'''(t)\right\rvert = 3\left\lvert h_i'(t)/h_i(t)\right\rvert h_i''(t) \leq \frac{3}{\delta}\left\lVert\bm{d}_{S_i}\right\rVert_{2}h_i''(t)$, which completes the proof of [Lemma 6.6](https://arxiv.org/html/2607.00252#S6.Thmtheorem6). ∎
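As a quick check of the calculus in this proof, the sketch below evaluates the closed forms for $h_i$, $h_i'$, $h_i''$ and $h_i''' = -3h_i'h_i''/h_i$ for a single group, compares $h_i'''$ against a finite difference of $h_i''$, and tests the quasi-self-concordance inequality of Lemma 6.6 on a grid. The data, `delta`, and the helper name `h_derivatives` are assumptions made for illustration.

```python
import math
import random

def h_derivatives(t, y, d, delta):
    """Closed forms for h, h', h'', h''' with h(t) = sqrt(delta^2 + ||y + t d||_2^2)."""
    u = [yj + t * dj for yj, dj in zip(y, d)]
    h = math.sqrt(delta**2 + sum(uj * uj for uj in u))
    h1 = sum(uj * dj for uj, dj in zip(u, d)) / h   # first derivative, as in (6.1)
    h2 = (sum(dj * dj for dj in d) - h1**2) / h     # second derivative
    h3 = -3.0 * h1 * h2 / h                         # third derivative
    return h, h1, h2, h3

rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(4)]  # one group's coordinates (illustrative)
d = [rng.gauss(0, 1) for _ in range(4)]
delta = 0.5
d_norm = math.sqrt(sum(dj * dj for dj in d))

step = 1e-5
max_gap = 0.0      # largest mismatch between closed-form h''' and its finite difference
qsc_holds = True   # whether |h'''| <= (3/delta) ||d||_2 h'' held at every grid point
for k in range(41):
    t = -2.0 + 0.1 * k
    _, _, h2, h3 = h_derivatives(t, y, d, delta)
    h2_plus = h_derivatives(t + step, y, d, delta)[2]
    h2_minus = h_derivatives(t - step, y, d, delta)[2]
    h3_fd = (h2_plus - h2_minus) / (2.0 * step)
    max_gap = max(max_gap, abs(h3 - h3_fd))
    qsc_holds = qsc_holds and abs(h3) <= (3.0 / delta) * d_norm * h2 + 1e-12

print(max_gap, qsc_holds)
```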

We are now ready to prove the quasi-self-concordance of $\mathsf{lse}_{\beta}(h(t))$ in $\left\lVert\cdot\right\rVert_{\mathcal{G}_{\infty}}$.

###### Lemma 6\.7\.

For all $\bm{y},\bm{d}\in\mathbb{R}^{m}$ and $t\in\mathbb{R}$, we have

$$\left\lvert\left(\frac{d}{dt}\right)^{3}\mathsf{lse}_{\beta}(h(t))\right\rvert \leq \left(\frac{16}{\beta}+\frac{3}{\delta}\right)\left\lVert\bm{d}\right\rVert_{\mathcal{G}_{\infty}}\left(\frac{d}{dt}\right)^{2}\mathsf{lse}_{\beta}(h(t)).$$

###### Proof of [Lemma 6.7](https://arxiv.org/html/2607.00252#S6.Thmtheorem7).

In the statement of [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1), let $\left\lVert\cdot\right\rVert = \left\lVert\cdot\right\rVert_{\mathcal{G}_{\infty}}$. By the definition of $\left\lVert\cdot\right\rVert_{\mathcal{G}_{\infty}}$ and $h_i$, we have for all $i$ and $t$ that $\left\lvert h_i'(t)\right\rvert \leq \left\lVert\bm{d}\right\rVert_{\mathcal{G}_{\infty}}$. Additionally, from [Lemma 6.6](https://arxiv.org/html/2607.00252#S6.Thmtheorem6), the $h_i(t)$ are $3/\delta$-quasi-self-concordant with respect to $\left\lVert\cdot\right\rVert_{\mathcal{G}_{\infty}}$ for all $i$. [Lemma 6.7](https://arxiv.org/html/2607.00252#S6.Thmtheorem7) now follows immediately from [Definition 2.1](https://arxiv.org/html/2607.00252#S2.Thmtheorem1). ∎

Finally, we can prove [Lemma 6.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4).

###### Proof of [Lemma 6.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4).

By the conclusion of [Lemma 6.5](https://arxiv.org/html/2607.00252#S6.Thmtheorem5), we know that for all $\bm{y},\bm{d}\in\mathbb{R}^{m}$ and $t\in\mathbb{R}$,

$$\left(\frac{d}{dt}\right)^{2}\mathsf{lse}_{\beta}(h(t)) \leq \left(\frac{1}{\delta}+\frac{1}{\beta}\right)\left\lVert\bm{d}\right\rVert_{\mathcal{G}_{\infty}}^{2}.$$

Let $\bm{y}=\mathbf{A}\bm{x}-\bm{b}$ for some $\bm{x}$ and $\bm{d}=\mathbf{A}\bm{z}$ for some $\bm{z}$. Let

$$g(\bm{y}) \coloneqq \beta\log\left(\sum_{i=1}^{m}\exp\left(\frac{\sqrt{\delta^{2}+\left\lVert\bm{y}_{S_i}\right\rVert_{2}^{2}}-\delta}{\beta}\right)\right).$$

Then,

$$\left(\frac{d}{dt}\right)^{2}\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z}) = \left(\frac{d}{dt}\right)^{2}g(\mathbf{A}\bm{x}-\bm{b}+t\mathbf{A}\bm{z}) \leq \left(\frac{1}{\delta}+\frac{1}{\beta}\right)\left\lVert\mathbf{A}\bm{z}\right\rVert_{\mathcal{G}_{\infty}}^{2}.$$

With the exact same reasoning applied to the conclusion of [Lemma 6.7](https://arxiv.org/html/2607.00252#S6.Thmtheorem7), we also see that

$$\left\lvert\left(\frac{d}{dt}\right)^{3}\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z})\right\rvert \leq \left(\frac{16}{\delta}+\frac{3}{\beta}\right)\left\lVert\mathbf{A}\bm{z}\right\rVert_{\mathcal{G}_{\infty}}\left(\frac{d}{dt}\right)^{2}\widetilde{f}_{\beta,\delta}(\bm{x}+t\bm{z}).$$

The conclusion of [Lemma 6.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4) then follows by recalling that $\mathbf{W}$ satisfies $\left\lVert\mathbf{A}\bm{z}\right\rVert_{\mathcal{G}_{\infty}} \leq \left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{z}\right\rVert_{2}$ for all $\bm{z}\in\mathbb{R}^{d}$ (following from [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3)). ∎

### 6.4 Analysis of [Algorithm 1](https://arxiv.org/html/2607.00252#alg1)

In this subsection, we use the calculus facts from the previous two subsections to analyze [Algorithm 1](https://arxiv.org/html/2607.00252#alg1). The outline of this proof follows that of [jls21, Theorem 2], which in turn builds on the proof of [msbacon, Corollary 12]. The main idea is to define the algorithm based on the norm induced by a good choice of positive semidefinite $\mathbf{M}$, given by [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3).

In the rest of this section, let $\mathbf{W}$ be factor-$2$ block Lewis weight overestimates for $\left[\mathbf{A}\,|\,\bm{b}\right]$. As in Line [2](https://arxiv.org/html/2607.00252#S8.EGx21) of [Algorithm 1](https://arxiv.org/html/2607.00252#alg1) and from the corresponding guarantee given in [mo23, Lemmas 5.6, 5.8], this means that within $2\log m$ linear-system-solves in $\mathbf{A}^{\top}\mathbf{D}\mathbf{A}$ for diagonal $\mathbf{D}$, we can find $\mathbf{W}$ such that for all $\bm{x}\in\mathbb{R}^{d}$ and $c\in\mathbb{R}$ we have

$$\left\lVert\mathbf{A}\bm{x}-c\bm{b}\right\rVert_{\mathcal{G}_{\infty}} \leq \left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}-c\mathbf{W}^{1/2}\bm{b}\right\rVert_{2} \leq \sqrt{2(\mathsf{rank}(\mathbf{A})+1)}\left\lVert\mathbf{A}\bm{x}-c\bm{b}\right\rVert_{\mathcal{G}_{\infty}}.$$

Note that choosing $c=1$ yields our original objective on either side of the above inequality. Motivated by the above, it is natural to use the norm given by $\mathbf{M}\coloneqq\mathbf{A}^{\top}\mathbf{W}\mathbf{A}$ to give the geometry for the ball optimization oracle and for the analysis. Additionally, without loss of generality and for the sake of the analysis, let us rescale the problem so that

$$1 = \mathsf{OPT} \coloneqq \left\lVert\mathbf{A}\bm{x}^{\star}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}.$$

Also, as mentioned earlier, assume without loss of generality that $\mathsf{rank}(\mathbf{A})=d$.

We begin with [Lemma 6.8](https://arxiv.org/html/2607.00252#S6.Thmtheorem8), which bounds the initial suboptimality in $\widetilde{f}$ and the initial distance to the minimizer in $\left\lVert\cdot\right\rVert_{\mathbf{M}}$.

###### Lemma 6\.8\.

Let $\widetilde{\bm{x}}_{\beta,\delta} \coloneqq \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \widetilde{f}_{\beta,\delta}(\bm{x})$. Then,

$$\begin{aligned}
\left\lVert\widetilde{\bm{x}}_{\beta,\delta}-\bm{x}_{0}\right\rVert_{\mathbf{M}} &\leq (2+2(\beta\log m+\delta))\sqrt{2(d+1)}, \\
\widetilde{f}_{\beta,\delta}(\bm{x}_{0})-\widetilde{f}_{\beta,\delta}(\widetilde{\bm{x}}_{\beta,\delta}) &\leq \sqrt{2(d+1)}-1+2(\beta\log m+\delta).
\end{aligned}$$

###### Proof of [Lemma 6.8](https://arxiv.org/html/2607.00252#S6.Thmtheorem8).

It is easy to check that

$$\bm{x}_{0} \coloneqq \left(\mathbf{A}^{\top}\mathbf{W}\mathbf{A}\right)^{-1}\mathbf{A}^{\top}\mathbf{W}\bm{b} = \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}.$$

By [Lemma 6.1](https://arxiv.org/html/2607.00252#S6.Thmtheorem1), for all $\bm{x}\in\mathbb{R}^{d}$,

$$\left\lvert\widetilde{f}_{\beta,\delta}(\bm{x})-\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}\right\rvert \leq \beta\log m+\delta,$$

implying

$$\left\lvert\left\lVert\mathbf{A}\bm{x}^{\star}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}-\widetilde{f}_{\beta,\delta}(\widetilde{\bm{x}}_{\beta,\delta})\right\rvert \leq \beta\log m+\delta.$$

Combining this with [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3), we get

$$1 \leq \left\lVert\mathbf{A}\bm{x}^{\star}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}} \leq \left\lVert\mathbf{A}\bm{x}_{0}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}} \leq \left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}_{0}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}$$

and

$$\frac{\left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}_{0}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}}{\sqrt{2(d+1)}} \leq \frac{\left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}^{\star}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}}{\sqrt{2(d+1)}} \leq \left\lVert\mathbf{A}\bm{x}^{\star}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}} = 1.$$

Combining these gives

$$1 \leq \left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}_{0}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2} \leq \sqrt{2(d+1)}.$$

Additionally,

$$\begin{aligned}
\left\lVert\mathbf{W}^{1/2}\mathbf{A}\widetilde{\bm{x}}_{\beta,\delta}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2} &\leq \sqrt{2(d+1)}\left\lVert\mathbf{A}\widetilde{\bm{x}}_{\beta,\delta}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}} \\
&\leq \sqrt{2(d+1)}\left(\widetilde{f}_{\beta,\delta}(\widetilde{\bm{x}}_{\beta,\delta})+\beta\log m+\delta\right) \\
&\leq \sqrt{2(d+1)}\left(\left\lVert\mathbf{A}\bm{x}^{\star}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}+2(\beta\log m+\delta)\right) \\
&= \sqrt{2(d+1)}\left(1+2(\beta\log m+\delta)\right).
\end{aligned}$$

Then,

$$\begin{aligned}
\left\lVert\widetilde{\bm{x}}_{\beta,\delta}-\bm{x}_{0}\right\rVert_{\mathbf{M}} &= \left\lVert\left(\mathbf{W}^{1/2}\mathbf{A}\widetilde{\bm{x}}_{\beta,\delta}-\mathbf{W}^{1/2}\bm{b}\right)-\left(\mathbf{W}^{1/2}\mathbf{A}\bm{x}_{0}-\mathbf{W}^{1/2}\bm{b}\right)\right\rVert_{2} \\
&\leq \left\lVert\mathbf{W}^{1/2}\mathbf{A}\widetilde{\bm{x}}_{\beta,\delta}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}+\left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}_{0}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2} \\
&\leq (2+2(\beta\log m+\delta))\sqrt{2(d+1)},
\end{aligned}$$

and

$$\begin{aligned}
\widetilde{f}_{\beta,\delta}(\bm{x}_{0})-\widetilde{f}_{\beta,\delta}(\widetilde{\bm{x}}_{\beta,\delta}) &\leq \left\lVert\mathbf{A}\bm{x}_{0}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}-\left\lVert\mathbf{A}\bm{x}^{\star}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}}+2(\beta\log m+\delta) \\
&\leq \left\lVert\mathbf{W}^{1/2}\mathbf{A}\bm{x}_{0}-\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}-\mathsf{OPT}+2(\beta\log m+\delta) \\
&\leq \sqrt{2(d+1)}-1+2(\beta\log m+\delta).
\end{aligned}$$

This completes the proof of [Lemma 6.8](https://arxiv.org/html/2607.00252#S6.Thmtheorem8). ∎

We are now ready to prove[˜1](https://arxiv.org/html/2607.00252#Thmmainthm1)\.

###### Proof of[˜1](https://arxiv.org/html/2607.00252#Thmmainthm1)\.

[Algorithm 1](https://arxiv.org/html/2607.00252#alg1) optimizes the regularization of $\widetilde{f}$ given by

$$\widehat{f}(\bm{x}) \coloneqq \widetilde{f}_{\beta,\delta}(\bm{x})+\frac{\varepsilon}{110R^{2}}\left\lVert\mathbf{W}^{1/2}\mathbf{A}(\bm{x}-\bm{x}_{0})\right\rVert_{2}^{2},$$

where $R$ is such that $\left\lVert\bm{x}_{0}-\widetilde{\bm{x}}_{\beta,\delta}\right\rVert_{\mathbf{M}} \leq R$. Let $\widehat{\bm{x}} \coloneqq \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \widehat{f}(\bm{x})$. Using [msbacon, Proof of Corollary 12], we know that for every iterate $\bm{x}$ of [Algorithm 1](https://arxiv.org/html/2607.00252#alg1),

$$\left\lvert\widehat{f}(\bm{x})-\widetilde{f}_{\beta,\delta}(\bm{x})\right\rvert \leq \frac{\varepsilon}{4}.$$

We now choose $\beta=\varepsilon/(4\log m)$ and $\delta=\varepsilon/4$, so that $\beta\log m+\delta=\varepsilon/2$ and hence $\widetilde{f}_{\beta,\delta}$ approximates $f$ up to error $\varepsilon/2$ at every point. Using [Lemma 6.8](https://arxiv.org/html/2607.00252#S6.Thmtheorem8), this gives $R=(2+\varepsilon)\sqrt{2(d+1)}$. It is therefore sufficient to optimize $\widehat{f}$ up to $\varepsilon/4$ additive error.
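The effect of this parameter choice can be illustrated numerically. The sketch below assumes that the smoothed objective is $g(\mathbf{A}\bm{x}-\bm{b})$ with $g$ as defined in the proof of Lemma 6.4, and works directly with a residual vector `y` in place of $\mathbf{A}\bm{x}-\bm{b}$. The group partition, the random residuals, and the function names are illustrative assumptions.

```python
import math
import random

def group_inf_norm(y, groups):
    """||y||_{G_inf} = max_i ||y_{S_i}||_2."""
    return max(math.sqrt(sum(y[j] ** 2 for j in S)) for S in groups)

def smoothed(y, groups, beta, delta):
    """g(y) = beta * log sum_i exp((sqrt(delta^2 + ||y_{S_i}||_2^2) - delta) / beta)."""
    vals = [math.sqrt(delta**2 + sum(y[j] ** 2 for j in S)) - delta for S in groups]
    top = max(vals)  # subtract the max before exponentiating for numerical stability
    return top + beta * math.log(sum(math.exp((v - top) / beta) for v in vals))

eps = 0.1
groups = [range(0, 2), range(2, 5), range(5, 6), range(6, 10)]  # hypothetical groups
m = len(groups)
beta = eps / (4.0 * math.log(m))  # beta = eps / (4 log m)
delta = eps / 4.0                 # delta = eps / 4

rng = random.Random(2)
max_err = 0.0
for _ in range(200):
    y = [rng.gauss(0, 1) for _ in range(10)]
    err = abs(smoothed(y, groups, beta, delta) - group_inf_norm(y, groups))
    max_err = max(max_err, err)

# The smoothing error should never exceed beta * log(m) + delta = eps / 2.
print(max_err, eps / 2.0)
```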

Next, using [Lemma 6.4](https://arxiv.org/html/2607.00252#S6.Thmtheorem4) and [msbacon, Lemmas 11, 43], we have that $\widehat{f}$ is $(1/\nu,e)$-Hessian stable in $\left\lVert\cdot\right\rVert_{\mathbf{M}}$ for $\nu=\Omega(1/(\varepsilon\log m))$. We now invoke [msbacon, Theorem 9], which tells us that we can implement a $(C/\sqrt{d},C/\varepsilon)$-ball optimization oracle for $f$ with $O\left(\log\left(\frac{d}{\varepsilon}\right)^{2}\right)$ linear-system-solves.

The next step is to turn the ball optimization oracle into a $\frac{1}{2}$-MS oracle ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)). Using [msbacon, Proposition 5], we get a ball oracle complexity of $O\left(\log\left(\frac{d}{\varepsilon}\right)\right)$ to implement the MS oracle. In total, our linear-system-solve complexity for implementing the MS oracle for iteration $t$ is $O\left(\log\left(\frac{d}{\varepsilon}\right)^{3}\right)$.

Finally, using [msbacon, Theorem 6], we get that [Algorithm 1](https://arxiv.org/html/2607.00252#alg1) has a Newton iteration complexity of

$$\begin{aligned}
&O\left(\left(\frac{(1+\varepsilon)\sqrt{d}\log m}{\varepsilon}\right)^{2/3}\log\left(\frac{\sqrt{d}+\varepsilon}{\varepsilon}\right)\left(\log\left(\frac{(\log m/\varepsilon)\,d\,(1+(1+\varepsilon)\sqrt{d}\log m/\varepsilon)}{\varepsilon}\right)\right)^{3}\right) \\
&\quad= O\left(\frac{d^{1/3}}{\varepsilon^{2/3}}\log\left(\frac{d\log m}{\varepsilon}\right)^{14/3}\right),
\end{aligned}$$

as promised.

Next, we analyze what happens if we fall in the case where $\mathbf{W}=\mathbf{I}_{m}$. Here, by using the $\sqrt{m}$ distortion from approximating $\ell_{\infty}^{m}$ with $\ell_{2}^{m}$, we have for all $\bm{x}\in\mathbb{R}^{d}$,

$$\frac{\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{2}}{\sqrt{m}} \leq \left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{\infty}} \leq \left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{2}.$$

Using this and repeating the previous analysis with this choice of $\mathbf{M}$ gives us a rate of

$$O\left(\frac{m^{1/3}}{\varepsilon^{2/3}}\log\left(\frac{m\log m}{\varepsilon}\right)^{14/3}\right),$$

as required.
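The $\sqrt{m}$ distortion used in this case is easy to confirm on random vectors. The sketch below checks the sandwich $\lVert\bm{y}\rVert_{2}/\sqrt{m} \leq \lVert\bm{y}\rVert_{\mathcal{G}_{\infty}} \leq \lVert\bm{y}\rVert_{2}$ for a residual vector `y` standing in for $\mathbf{A}\bm{x}-\bm{b}$, and shows one vector that attains the lower bound. The partition and the data are illustrative assumptions.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def group_inf_norm(y, groups):
    """||y||_{G_inf}: the largest Euclidean norm among the groups."""
    return max(l2_norm([y[j] for j in S]) for S in groups)

rng = random.Random(3)
groups = [range(0, 4), range(4, 6), range(6, 11)]  # hypothetical partition, m = 3
m = len(groups)

sandwich_holds = True
for _ in range(500):
    y = [rng.gauss(0, 1) for _ in range(11)]
    lower = l2_norm(y) / math.sqrt(m)
    middle = group_inf_norm(y, groups)
    upper = l2_norm(y)
    sandwich_holds = sandwich_holds and lower <= middle + 1e-12 and middle <= upper + 1e-12

# Equal group norms make the lower bound tight: one unit entry in each group.
tight = [0.0] * 11
tight[0] = tight[4] = tight[6] = 1.0
tight_gap = abs(l2_norm(tight) / math.sqrt(m) - group_inf_norm(tight, groups))

print(sandwich_holds, tight_gap)
```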

It remains to determine the form of the Newton steps. For this, it is sufficient to understand the Hessian of $\widehat{f}$. A straightforward calculation shows that it is of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$, where $\mathbf{B}$ is a block-diagonal matrix in which each block has size $\left\lvert S_i\right\rvert\times\left\lvert S_i\right\rvert$. Thus, each Newton step solves a linear system of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}\bm{z}=\bm{v}$.

Combining this with the iteration complexity guarantee to find $\mathbf{W}$ (see [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3)) completes the proof of [˜1](https://arxiv.org/html/2607.00252#Thmmainthm1). ∎

## 7 Interpolating Between Average and Robust Losses

In this section, we prove [˜2](https://arxiv.org/html/2607.00252#Thmmainthm2). As before, our proof follows the outline in [Section 2](https://arxiv.org/html/2607.00252#S2). The main technical challenges are to establish a form of strong convexity for our objective $f$ and then to build a solver for the proximal problem ([2.3](https://arxiv.org/html/2607.00252#S2.E3)).

The rest of this section is organized as follows. In [Section 7.1](https://arxiv.org/html/2607.00252#S7.SS1), we derive calculus facts about our objective $f$, including bounds on its Hessian and the promised strong convexity (particularly [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2) and the more general result it builds on, [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2)). In [Section 7.2](https://arxiv.org/html/2607.00252#S7.SS2), we prove some facts about the iterates of [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) when applied to our setting. In [Section 7.3](https://arxiv.org/html/2607.00252#S7.SS3), we more precisely define and analyze our solver for proximal sub-problems. This section is fairly technical and we give a more detailed outline there. Finally, in [Section 7.4](https://arxiv.org/html/2607.00252#S7.SS4), we assemble all these components and analyze [Algorithm 5](https://arxiv.org/html/2607.00252#alg5), thereby proving [˜2](https://arxiv.org/html/2607.00252#Thmmainthm2).

Throughout this analysis, we rescale the problem so that $f(\bm{x}^{\star})=1$. It is now sufficient to solve for an $\varepsilon$-additive error solution.

### 7.1 Calculus for the objective

In this section, we work out some calculus facts related to our objective $\left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{p}}^{p}$. Throughout this discussion, let $f(\bm{x}) \coloneqq \left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_{p}}^{p}$.

###### Lemma 7\.1\.

For any $\bm{z}\in\mathbb{R}^{d}$, we have

$$p\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2} \leq \bm{z}^{\top}\left(\nabla^{2}f(\bm{x})\right)\bm{z} \leq p(p-1)\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2}.$$

###### Proof of [Lemma 7.1](https://arxiv.org/html/2607.00252#S7.Thmtheorem1).

Let us first calculate the gradient and Hessian of $f(\cdot)$ using the chain rule and the usual matrix differentiation rules:

$$f(\bm{x}) = \sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p},$$

$$\nabla f(\bm{x}) = p\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\mathbf{A}_{S_i}^{\top}(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}), \tag{7.1}$$

$$\begin{aligned}
\nabla^{2}f(\bm{x}) &= p\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\mathbf{A}_{S_i}^{\top}\mathbf{A}_{S_i} \\
&\quad+p(p-2)\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-4}\left(\mathbf{A}_{S_i}^{\top}(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i})(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i})^{\top}\mathbf{A}_{S_i}\right).
\end{aligned} \tag{7.2}$$

Using this formula, we take the quadratic form with respect to a vector $\bm{z}$. By Cauchy-Schwarz, notice that

$$\begin{aligned}
&\bm{z}^{\top}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-4}\left(\mathbf{A}_{S_i}^{\top}(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i})(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i})^{\top}\mathbf{A}_{S_i}\right)\bm{z} \\
&\quad= \left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-4}\left\langle\mathbf{A}_{S_i}\bm{z},\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rangle^{2} \leq \left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2}.
\end{aligned}$$

With that, we have

$$\begin{aligned}
\bm{z}^{\top}\left(\nabla^{2}f(\bm{x})\right)\bm{z} &\leq p\sum_{i=1}^{m}\left(\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2}+(p-2)\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2}\right) \\
&= p(p-1)\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2}.
\end{aligned} \tag{7.3}$$

For the lower bound, we use our calculation for $\nabla^{2}f(\bm{x})$ and drop the second sum in (7.2), whose quadratic form is nonnegative when $p\geq 2$, to write

$$\bm{z}^{\top}\left(\nabla^{2}f(\bm{x})\right)\bm{z} \geq p\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_i}\bm{z}\right\rVert_{2}^{2},$$

completing the proof of [Lemma 7.1](https://arxiv.org/html/2607.00252#S7.Thmtheorem1). ∎
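The two-sided Hessian bound of Lemma 7.1 can be checked on small random instances. The sketch below evaluates the quadratic form $\bm{z}^{\top}\nabla^{2}f(\bm{x})\bm{z}$ directly from formula (7.2) and compares it with $p$ and $p(p-1)$ times the weighted sum in the lemma. The matrix, the groups, the choice $p=3$, and the helper names are illustrative assumptions; the check presumes $p\geq 2$.

```python
import math
import random

def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def hessian_form_and_weight(A, b, groups, x, z, p):
    """Return (z^T hess f(x) z, sum_i ||r_i||^(p-2) ||A_i z||^2) for f = sum_i ||r_i||_2^p."""
    r = [ax - bi for ax, bi in zip(matvec(A, x), b)]  # residual A x - b
    Az = matvec(A, z)
    form, weight = 0.0, 0.0
    for S in groups:
        res = math.sqrt(sum(r[j] ** 2 for j in S))    # ||A_{S_i} x - b_{S_i}||_2
        az_sq = sum(Az[j] ** 2 for j in S)            # ||A_{S_i} z||_2^2
        inner = sum(Az[j] * r[j] for j in S)          # <A_{S_i} z, r_{S_i}>
        weight += res ** (p - 2) * az_sq
        form += p * res ** (p - 2) * az_sq + p * (p - 2) * res ** (p - 4) * inner**2
    return form, weight

rng = random.Random(4)
n, dim, p = 8, 3, 3.0
groups = [range(0, 3), range(3, 5), range(5, 8)]      # hypothetical partition
A = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]

bounds_hold = True
for _ in range(200):
    x = [rng.gauss(0, 1) for _ in range(dim)]
    z = [rng.gauss(0, 1) for _ in range(dim)]
    form, weight = hessian_form_and_weight(A, b, groups, x, z, p)
    lower_ok = p * weight <= form * (1 + 1e-12) + 1e-12
    upper_ok = form <= p * (p - 1) * weight * (1 + 1e-12) + 1e-12
    bounds_hold = bounds_hold and lower_ok and upper_ok

print(bounds_hold)
```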

#### 7.1.1 Strong Convexity of the Objective

The two main results of this subsection are [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2) and the strong convexity lemma for $\lVert\cdot\rVert_2^p$ stated in [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2). We can think of [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2) as a form of strong convexity for our objective.

###### Lemma 7.2 (Strong convexity of $f$).

Let $f(\bm{x}) \coloneqq \left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_p}^p$. For all $\bm{d}\in\mathbb{R}^d$, we have

$$
f(\bm{x}+\bm{d}) \geq f(\bm{x}) + \left\langle\nabla f(\bm{x}),\bm{d}\right\rangle + \frac{4}{2^p}\left\lVert\mathbf{A}\bm{d}\right\rVert_{\mathcal{G}_p}^p,
$$

and therefore

$$
\left\lVert\bm{x}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} \leq 2^{3/2-3/p}\,d^{1/2-1/p}\left(f(\bm{x})-f(\bm{x}^{\star})\right)^{1/p}.
$$

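For reference, the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2), in the form proved at the end of this subsection, states that for all $\bm{v},\triangle\in\mathbb{R}^k$ and $p\geq 2$,

$$
\left\lVert\bm{v}+\triangle\right\rVert_2^p \geq \left\lVert\bm{v}\right\rVert_2^p + p\left\lVert\bm{v}\right\rVert_2^{p-2}\left\langle\bm{v},\triangle\right\rangle + \frac{4}{2^p}\left\lVert\triangle\right\rVert_2^p.
$$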

To motivate the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2), let us see how it implies [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2).

###### Proof of [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2).

Note that

$$
\nabla f(\bm{x}) = \sum_{i=1}^{m} p\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_2^{p-2}\mathbf{A}_{S_i}^{\top}(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}).
$$

This implies

$$
\sum_{i=1}^{m} p\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_2^{p-2}\left\langle\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i},\mathbf{A}_{S_i}\bm{d}\right\rangle = \left\langle\nabla f(\bm{x}),\bm{d}\right\rangle.
$$

Combining this with the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2) (a strong convexity lemma for $\lVert\cdot\rVert_2^p$ that we prove later in this subsection), applied to each group, we get

$$
\begin{aligned}
f(\bm{x}+\bm{d}) &= \left\lVert\mathbf{A}(\bm{x}+\bm{d})-\bm{b}\right\rVert_{\mathcal{G}_p}^p = \left\lVert\mathbf{A}\bm{d}+(\mathbf{A}\bm{x}-\bm{b})\right\rVert_{\mathcal{G}_p}^p \\
&= \sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{d}+(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i})\right\rVert_2^p \\
&\overset{\text{(Section 2.1.2)}}{\geq} \sum_{i=1}^{m}\left(\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_2^p + p\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_2^{p-2}\left\langle\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i},\mathbf{A}_{S_i}\bm{d}\right\rangle + \frac{4}{2^p}\left\lVert\mathbf{A}_{S_i}\bm{d}\right\rVert_2^p\right) \\
&= \sum_{i=1}^{m}\left(\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_2^p + \left\langle p\left\lVert\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}\right\rVert_2^{p-2}\mathbf{A}_{S_i}^{\top}(\mathbf{A}_{S_i}\bm{x}-\bm{b}_{S_i}),\bm{d}\right\rangle + \frac{4}{2^p}\left\lVert\mathbf{A}_{S_i}\bm{d}\right\rVert_2^p\right) \\
&\overset{(7.1)}{=} \left\lVert\mathbf{A}\bm{x}-\bm{b}\right\rVert_{\mathcal{G}_p}^p + \left\langle\nabla f(\bm{x}),\bm{d}\right\rangle + \frac{4}{2^p}\left\lVert\mathbf{A}\bm{d}\right\rVert_{\mathcal{G}_p}^p \\
&= f(\bm{x}) + \left\langle\nabla f(\bm{x}),\bm{d}\right\rangle + \frac{4}{2^p}\left\lVert\mathbf{A}\bm{d}\right\rVert_{\mathcal{G}_p}^p.
\end{aligned}
$$

We now take care of the second statement. Observe that at optimality, we have $\nabla f(\bm{x}^{\star})=0$. Plugging this in (replace $\bm{x}$ by $\bm{x}^{\star}$ and $\bm{d}$ by $\bm{x}-\bm{x}^{\star}$ above), rearranging, and taking $p$th roots gives

$$
\left\lVert\mathbf{A}(\bm{x}-\bm{x}^{\star})\right\rVert_{\mathcal{G}_p} \leq \left(\frac{4}{2^p}\right)^{-1/p}\left(f(\bm{x})-f(\bm{x}^{\star})\right)^{1/p} = \frac{2}{4^{1/p}}\left(f(\bm{x})-f(\bm{x}^{\star})\right)^{1/p}.
$$

Next, recall that by [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3),

$$
\left\lVert\bm{x}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} = \left\lVert\mathbf{W}^{1/2-1/p}\mathbf{A}(\bm{x}-\bm{x}^{\star})\right\rVert_2 \leq (2d)^{1/2-1/p}\left\lVert\mathbf{A}(\bm{x}-\bm{x}^{\star})\right\rVert_{\mathcal{G}_p}.
$$

Stitching the inequalities together completes the proof of [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2). ∎
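The first statement of the lemma can be probed numerically. The sketch below, under the assumption of small random groups and with hypothetical helper names (`f`, `grad_f`, `strong_convexity_gap`), evaluates the gap between the two sides of the strong convexity inequality at random points and records its minimum, which should be nonnegative up to rounding.

```python
import numpy as np

def f(blocks, x, p):
    """f(x) = sum_i ||A_i x - b_i||_2^p, i.e. the p-th power of the G_p norm of the residual."""
    return sum(np.linalg.norm(A_i @ x - b_i) ** p for A_i, b_i in blocks)

def grad_f(blocks, x, p):
    """Gradient sum_i p ||r_i||^{p-2} A_i^T r_i with r_i = A_i x - b_i."""
    g = np.zeros_like(x)
    for A_i, b_i in blocks:
        r = A_i @ x - b_i
        g += p * np.linalg.norm(r) ** (p - 2) * (A_i.T @ r)
    return g

def strong_convexity_gap(blocks, x, d, p):
    """f(x + d) - [f(x) + <grad f(x), d> + (4 / 2^p) ||A d||_{G_p}^p]."""
    penalty = (4 / 2 ** p) * sum(np.linalg.norm(A_i @ d) ** p for A_i, _ in blocks)
    return f(blocks, x + d, p) - (f(blocks, x, p) + grad_f(blocks, x, p) @ d + penalty)

rng = np.random.default_rng(1)
dim = 4
# Hypothetical instance: three groups with 2, 3 and 5 rows.
blocks = [(rng.standard_normal((k, dim)), rng.standard_normal(k)) for k in (2, 3, 5)]

min_gap = min(
    strong_convexity_gap(blocks, rng.standard_normal(dim), rng.standard_normal(dim), p)
    for p in (2.0, 3.0, 5.0)
    for _ in range(200)
)
print(min_gap)
```

For $p = 2$ the inequality is an identity, so the gap there is zero up to floating-point error; for larger $p$ it is typically strictly positive.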

In the rest of this subsection, we prove the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2). We begin with a few numerical inequalities.

###### Lemma 7.3.

For $\alpha < -1/2$ and $p \geq 2$, the function $g(\alpha) \coloneqq \frac{1+p\alpha}{(-(2\alpha+1))^{p/2}}$ is nonincreasing in $\alpha$.

###### Proof of [Lemma 7.3](https://arxiv.org/html/2607.00252#S7.Thmtheorem3).

We first take the derivative of $g$ with respect to $\alpha$,

$$
\begin{aligned}
g'(\alpha) &= \frac{p(-(2\alpha+1))^{p/2} - \left((-2)\frac{p}{2}(-(2\alpha+1))^{p/2-1}\right)(1+p\alpha)}{(-(2\alpha+1))^{p}} \\
&= \frac{p(-(2\alpha+1))^{p/2} + p(-(2\alpha+1))^{p/2-1}(1+p\alpha)}{(-(2\alpha+1))^{p}} \\
&= p\cdot\frac{(-(2\alpha+1)) + (1+p\alpha)}{(-(2\alpha+1))^{p/2+1}} \\
&= p\cdot\frac{(p-2)\alpha}{(-(2\alpha+1))^{p/2+1}} \leq 0,
\end{aligned}
$$

where in the final inequality we used that $p \geq 2$ and $\alpha < -1/2$, so that $(p-2)\alpha \leq 0$ and $-(2\alpha+1) > 0$. This completes the proof of the lemma. ∎

We also need the following lemma, which is similar to a result of [akps19, Lemma 4.5]. It amounts to proving the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2) when the dimension $k=1$.

###### Lemma 7.4 (Case A of [Lemma 7.5](https://arxiv.org/html/2607.00252#S7.Thmtheorem5)).

For any $\alpha\in\mathbb{R}$ and $p\geq 2$,

$$
\left\lvert 1+\alpha\right\rvert^p \geq 1 + p\alpha + \frac{4}{2^p}\left\lvert\alpha\right\rvert^p.
$$

###### Proof of [Lemma 7.4](https://arxiv.org/html/2607.00252#S7.Thmtheorem4).

Note that when $p=2$ the inequality holds with equality. We consider the case $p>2$ and use $h(\alpha)$ to denote the error function,

$$
h(\alpha) \coloneqq \left\lvert 1+\alpha\right\rvert^p - \left(1 + p\alpha + \frac{4}{2^p}\left\lvert\alpha\right\rvert^p\right).
$$

We aim to show $h(\alpha)\geq 0$ for all $\alpha\in\mathbb{R}$. Let us first write the derivatives of $h$:

$$
\begin{aligned}
h'(\alpha) &= p\left(\left\lvert 1+\alpha\right\rvert^{p-2}(1+\alpha) - \left(1 + \frac{4}{2^p}\left\lvert\alpha\right\rvert^{p-2}\alpha\right)\right), \\
h''(\alpha) &= p(p-1)\left(\left\lvert 1+\alpha\right\rvert^{p-2} - \frac{4}{2^p}\left\lvert\alpha\right\rvert^{p-2}\right) = p(p-1)\left(\left\lvert 1+\alpha\right\rvert^{p-2} - \left\lvert\frac{\alpha}{2}\right\rvert^{p-2}\right).
\end{aligned}
$$

It is now easy to verify the following statements about $h$:

- I. $h'(-2)=h''(-2)=0$ and $h''(\alpha)>0$ for $\alpha<-2$ $\Rightarrow$ within the range $(-\infty,-2]$ the function $h$ is minimized at $-2$;
- II. $h'(-2)=0$ and $h''(\alpha)\leq 0$ for $\alpha\in(-2,-2/3]$ $\Rightarrow$ $h'(\alpha)<0$ in the range $(-2,-2/3]$, i.e., in that range the function $h$ is minimized at $-2/3$;
- III. $h'(-2/3)<0=h'(0)$ and $h''(\alpha)>0$ for $\alpha>-2/3$ $\Rightarrow$ the function $h$ is decreasing in $(-2/3,0)$ and increasing in $[0,\infty)$, i.e., within the range $(-2/3,\infty)$ the function $h$ is minimized at $0$.

As a result of the above observations, it is enough to check the inequality at the inputs $\alpha\in\{-2,-2/3,0\}$. We have for $p>2$,

$$
\begin{aligned}
h(-2) &= 1 - (1 - 2p + 4) = 2p - 4 > 0, \\
h\left(-\frac{2}{3}\right) &= \frac{1}{3^p} - \left(1 - \frac{2p}{3} + \frac{4}{2^p}\left\lvert\frac{2}{3}\right\rvert^p\right) = \frac{1}{3^p} - 1 + \frac{2p}{3} - \frac{4}{3^p} = -1 + \frac{2p}{3} - \frac{3}{3^p} > 0, \\
h(0) &= 1 - 1 = 0.
\end{aligned}
$$

This implies that $h(\alpha)\geq 0$ for all values of $\alpha$, concluding the proof of [Lemma 7.4](https://arxiv.org/html/2607.00252#S7.Thmtheorem4). ∎
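The scalar inequality is easy to spot-check on a grid. The following sketch is only a numerical illustration (the grid range, the tested exponents, and the names `grid_min` and `critical` are arbitrary choices made here): it evaluates the error function $h$ on a fine grid and at the three points $-2$, $-2/3$, $0$ singled out in the proof.

```python
import numpy as np

def h(alpha, p):
    """Error function h(alpha) = |1 + alpha|^p - (1 + p*alpha + (4 / 2^p) |alpha|^p)."""
    return np.abs(1 + alpha) ** p - (1 + p * alpha + (4 / 2 ** p) * np.abs(alpha) ** p)

alphas = np.linspace(-10.0, 10.0, 20001)   # grid with step 0.001
exponents = (2.0, 2.5, 3.0, 6.0)

# Smallest value of h over the grid for each exponent (should be >= 0 up to rounding).
grid_min = {p: float(np.min(h(alphas, p))) for p in exponents}

# Values at the three candidate minimizers used in the proof.
critical = {
    p: (float(h(-2.0, p)), float(h(-2.0 / 3.0, p)), float(h(0.0, p)))
    for p in exponents
}

for p in exponents:
    print(p, grid_min[p], critical[p])
```

The value at $\alpha=-2$ should agree with the closed form $2p-4$, and the value at $\alpha=0$ is exactly zero.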

Next, we prove a special case of the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2).

###### Lemma 7.5.

For any $\alpha\in\mathbb{R}$, $\beta\geq 0$, and $p\geq 2$, we have

$$
\left((1+\alpha)^2+\beta^2\right)^{p/2} \geq 1 + p\alpha + \frac{4}{2^p}\left(\alpha^2+\beta^2\right)^{p/2}.
$$

###### Proof of [Lemma 7.5](https://arxiv.org/html/2607.00252#S7.Thmtheorem5).

Let us study the difference of both sides of the inequality using the following function,

$$
h(\alpha,\beta) \coloneqq \left((1+\alpha)^2+\beta^2\right)^{p/2} - \left(1 + p\alpha + \frac{4}{2^p}\left(\alpha^2+\beta^2\right)^{p/2}\right).
$$

We want to show that for $\alpha\in\mathbb{R}$, $\beta\geq 0$, and $p\geq 2$, we have $h(\alpha,\beta)\geq 0$. We break the proof into three cases: A. $\alpha\in\mathbb{R}$ and $\beta=0$; B. $\alpha\in(-\infty,-2]\cup[-2/3,\infty)$ and $\beta>0$; and C. $\alpha\in(-2,-2/3)$ and $\beta>0$. Together, these cases cover the entire range of $\alpha\in\mathbb{R}$ and $\beta\geq 0$.

##### Case A.

When $\beta=0$, the proof follows directly from the statement of [Lemma 7.4](https://arxiv.org/html/2607.00252#S7.Thmtheorem4) by noting $|\alpha|^p = (\sqrt{\alpha^2})^p = (\alpha^2)^{p/2}$.

In the remaining two cases we will show that for any $\alpha\in\mathbb{R}$, increasing the value of $\beta$ still maintains $h(\alpha,\beta)\geq 0$. To see this, we first note that the derivative of $h(\alpha,\beta)$ with respect to $\beta$ is given by

$$
\nabla_{\beta}h(\alpha,\beta) = p\beta\left(\left((1+\alpha)^2+\beta^2\right)^{p/2-1} - \frac{4}{2^p}\left(\alpha^2+\beta^2\right)^{p/2-1}\right).
$$

For $\beta>0$, ensuring this derivative is positive is equivalent to the following,

$$
\begin{aligned}
\nabla_{\beta}h(\alpha,\beta)>0 &\equiv p\beta\left((1+\alpha)^2+\beta^2\right)^{p/2-1} > p\beta\cdot\frac{4}{2^p}\left(\alpha^2+\beta^2\right)^{p/2-1} \\
&\overset{(p\beta>0)}{\equiv} (1+\alpha)^2+\beta^2 > \left(\frac{1}{2^{p-2}}\right)^{2/(p-2)}\cdot\left(\alpha^2+\beta^2\right) \\
&\equiv (1+\alpha)^2+\beta^2 > \frac{1}{4}\cdot\left(\alpha^2+\beta^2\right) \\
&\equiv (3\alpha^2+8\alpha+4)+3\beta^2 > 0 \\
&\equiv \beta^2 > -\left(\alpha^2+\frac{8}{3}\alpha+\frac{4}{3}\right).
\end{aligned}
\tag{7.4}
$$

##### Case B.

Note that the roots of the quadratic function $3\alpha^2+8\alpha+4$ are given by $\alpha_1=-2$ and $\alpha_2=-2/3$. This means that for $\alpha\in(-\infty,-2]\cup[-2/3,\infty)$ we have $3\alpha^2+8\alpha+4\geq 0$, which by ([7.4](https://arxiv.org/html/2607.00252#S7.E4)) is sufficient to ensure that $\nabla_{\beta}h(\alpha,\beta)>0$ for all $\beta>0$, and hence $h(\alpha,\beta)\geq h(\alpha,0)\geq 0$ by Case A. This takes care of Case B.

##### Case C.

Now we only need to consider the range $\alpha\in(-2,-2/3)$ with $\beta>0$. In this range, recall the equivalence ([7.4](https://arxiv.org/html/2607.00252#S7.E4)),

$$
\nabla_{\beta}h(\alpha,\beta)>0 \equiv \beta > \sqrt{-\left(\alpha^2+\frac{8}{3}\alpha+\frac{4}{3}\right)} \eqqcolon \beta_0(\alpha).
$$

Thus for all $\beta>\beta_0(\alpha)$ we know that $h(\alpha,\beta)$ is increasing in $\beta$, and for all $\beta<\beta_0(\alpha)$ it is decreasing in $\beta$. This allows us, for any given $\alpha\in(-2,-2/3)$, to further break Case C into two sub-cases:

##### Case C.I

For $\beta\in[0,\beta_0)$, since $h(\alpha,\beta)$ is decreasing in $\beta$ on this interval, we have $h(\alpha,\beta)\geq h(\alpha,\beta_0(\alpha))$, so it suffices to verify that $h(\alpha,\beta_0(\alpha))\geq 0$, which is done in Case C.II. At the endpoint $\beta=0$, the bound $h(\alpha,0)\geq 0$ also follows directly from [Lemma 7.4](https://arxiv.org/html/2607.00252#S7.Thmtheorem4).

##### Case C.II

For $\beta\in[\beta_0,\infty)$, since $h(\alpha,\beta)$ is increasing in $\beta$, its lowest value is attained at $\beta=\beta_0$ and we only need to verify that $h(\alpha,\beta_0(\alpha))\geq 0$. We first simplify the expression for $h(\alpha,\beta_0(\alpha))$,

$$
\begin{aligned}
h(\alpha,\beta_0(\alpha)) &= \left((1+\alpha)^2+\beta_0^2\right)^{p/2} - \left(1 + p\alpha + \frac{4}{2^p}\left(\alpha^2+\beta_0^2\right)^{p/2}\right) \\
&= \left(-\frac{1}{3}-\frac{2}{3}\alpha\right)^{p/2} - \left(1 + p\alpha + \frac{4}{2^p}\left(-\frac{8}{3}\alpha-\frac{4}{3}\right)^{p/2}\right) \\
&= \left(-\frac{1}{3}-\frac{2}{3}\alpha\right)^{p/2} - \left(1 + p\alpha + 4\left(-\frac{2}{3}\alpha-\frac{1}{3}\right)^{p/2}\right) \\
&= -1 - p\alpha - 3\left(-\frac{2}{3}\alpha-\frac{1}{3}\right)^{p/2} \\
&= -1 - p\alpha - \frac{1}{3^{p/2-1}}(-2\alpha-1)^{p/2} \\
&= -(-2\alpha-1)^{p/2}\left(\frac{1+p\alpha}{(-2\alpha-1)^{p/2}} + \frac{1}{3^{p/2-1}}\right).
\end{aligned}
$$

Now, since every $\alpha\in(-2,-2/3)$ satisfies $\alpha<-1/2$, we can use [Lemma 7.3](https://arxiv.org/html/2607.00252#S7.Thmtheorem3) to note that the first term in the parentheses, $g(\alpha)$, is nonincreasing in $\alpha$. On this range it is therefore at most its value at $\alpha=-2$, namely $g(-2)=\frac{1-2p}{3^{p/2}}$. Because the prefactor $-(-2\alpha-1)^{p/2}$ is negative, this gives, for $\alpha\in(-2,-2/3)$,

$$
\begin{aligned}
h(\alpha,\beta_0(\alpha)) &\geq -(-2\alpha-1)^{p/2}\left(\frac{1-2p}{3^{p/2}} + \frac{1}{3^{p/2-1}}\right) \\
&= \left(\frac{-2\alpha-1}{3}\right)^{p/2}(2p-1-3) = 2(p-2)\left(\frac{-2\alpha-1}{3}\right)^{p/2} \geq 0,
\end{aligned}
$$

which finishes the proof of Case C.II and also Case C. Together Cases A, B and C complete the proof of [Lemma 7.5](https://arxiv.org/html/2607.00252#S7.Thmtheorem5). ∎
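The two-variable inequality and the simplification along the critical curve $\beta_0(\alpha)$ can be checked numerically. The sketch below is an illustration under arbitrary choices made here ($p=4$, the grid ranges, and the helper names `h2`, `beta0`, `closed_form`): it evaluates $h(\alpha,\beta)$ on a grid and compares $h(\alpha,\beta_0(\alpha))$ with the simplified expression $-1-p\alpha-3^{1-p/2}(-2\alpha-1)^{p/2}$.

```python
import numpy as np

def h2(alpha, beta, p):
    """h(alpha, beta): difference of the two sides of the inequality in Lemma 7.5."""
    lhs = ((1 + alpha) ** 2 + beta ** 2) ** (p / 2)
    rhs = 1 + p * alpha + (4 / 2 ** p) * (alpha ** 2 + beta ** 2) ** (p / 2)
    return lhs - rhs

def beta0(alpha):
    """Critical value beta_0(alpha) on (-2, -2/3), where the beta-derivative changes sign."""
    return np.sqrt(-(alpha ** 2 + 8 * alpha / 3 + 4 / 3))

def closed_form(alpha, p):
    """Simplified expression for h(alpha, beta_0(alpha))."""
    return -1 - p * alpha - (-2 * alpha - 1) ** (p / 2) / 3 ** (p / 2 - 1)

p = 4.0
A, B = np.meshgrid(np.linspace(-6, 6, 601), np.linspace(0, 6, 301))
grid_min = float(h2(A, B, p).min())          # should be >= 0 up to rounding

curve = np.linspace(-1.99, -0.67, 500)       # stay strictly inside (-2, -2/3)
on_curve = h2(curve, beta0(curve), p)
max_mismatch = float(np.max(np.abs(on_curve - closed_form(curve, p))))
curve_min = float(on_curve.min())

print(grid_min, max_mismatch, curve_min)
```

A grid check is of course not a proof; it only confirms that the algebraic simplification and the sign of $h$ behave as the case analysis predicts for the sampled values.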

We are now ready to prove the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2).

###### Proof of the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2).

First, assume that $\left\lVert\bm{v}\right\rVert_2=1$. We will later extend the result to all $\bm{v}$.

Since $\left\lVert\bm{v}\right\rVert_2=1$, we can write $\triangle=\alpha\bm{v}+\beta\bm{w}$ where $\left\langle\bm{v},\bm{w}\right\rangle=0$ and $\left\lVert\bm{w}\right\rVert_2=1$, so that we have $\left\lVert\triangle\right\rVert_2^2=\alpha^2+\beta^2$. Without loss of generality, we have $\beta\geq 0$. Fixing $\bm{w}$ and $\alpha$ for now, it is enough to show that for all $\beta\geq 0$, we have

$$
\left\lVert(1+\alpha)\bm{v}+\beta\bm{w}\right\rVert_2^p = \left((1+\alpha)^2+\beta^2\right)^{p/2} \overset{?}{\geq} 1 + p\alpha + \frac{4}{2^p}\left\lVert\triangle\right\rVert_2^p = 1 + p\alpha + \frac{4}{2^p}\left(\alpha^2+\beta^2\right)^{p/2}.
$$

This follows immediately from [Lemma 7.5](https://arxiv.org/html/2607.00252#S7.Thmtheorem5).

We now extend the result to all $\bm{v}\neq 0$ (for $\bm{v}=0$ the claim reduces to $1\geq 4/2^p$, which holds for $p\geq 2$). Let $\overline{\bm{v}}\coloneqq\bm{v}/\left\lVert\bm{v}\right\rVert_2$ and note that

$$
\begin{aligned}
\left\lVert\bm{v}+\triangle\right\rVert_2^p = \left\lVert\bm{v}\right\rVert_2^p\left\lVert\overline{\bm{v}}+\frac{\triangle}{\left\lVert\bm{v}\right\rVert_2}\right\rVert_2^p &\geq \left\lVert\bm{v}\right\rVert_2^p\left(1 + p\left\langle\overline{\bm{v}},\frac{\triangle}{\left\lVert\bm{v}\right\rVert_2}\right\rangle + \frac{4}{2^p}\left\lVert\frac{\triangle}{\left\lVert\bm{v}\right\rVert_2}\right\rVert_2^p\right) \\
&= \left\lVert\bm{v}\right\rVert_2^p + p\left\lVert\bm{v}\right\rVert_2^{p-2}\left\langle\bm{v},\triangle\right\rangle + \frac{4}{2^p}\left\lVert\triangle\right\rVert_2^p,
\end{aligned}
$$

completing the proof of the lemma from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2). ∎

#### 7.1.2 Smoothness of the Objective

The main result of this subsection is [Lemma 7.6](https://arxiv.org/html/2607.00252#S7.Thmtheorem6).

###### Lemma 7.6.

For all $\bm{x}\in\mathbb{R}^d$, we have

$$
f(\bm{x})-f(\bm{x}^{\star}) \leq \frac{p(p-1)}{2}f(\bm{x})^{1-\frac{2}{p}}\left\lVert\mathbf{A}(\bm{x}-\bm{x}^{\star})\right\rVert_{\mathcal{G}_p}^2.
$$

###### Proof of [Lemma 7.6](https://arxiv.org/html/2607.00252#S7.Thmtheorem6).

By Taylor's theorem (mean-value form), we can write, for some $\bm{y}$ on the line segment connecting $\bm{x}^{\star}$ and $\bm{x}$,

$$
\begin{aligned}
f(\bm{x}) &= f(\bm{x}^{\star}) + \left\langle\nabla f(\bm{x}^{\star}),\bm{x}-\bm{x}^{\star}\right\rangle + \frac{1}{2}(\bm{x}-\bm{x}^{\star})^{\top}\nabla^2 f(\bm{y})(\bm{x}-\bm{x}^{\star}) \\
&\overset{(7.3)}{\leq} f(\bm{x}^{\star}) + \frac{p(p-1)}{2}\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{y}-\bm{b}_{S_i}\right\rVert_2^{p-2}\left\lVert\mathbf{A}_{S_i}(\bm{x}-\bm{x}^{\star})\right\rVert_2^2 \\
&\leq f(\bm{x}^{\star}) + \frac{p(p-1)}{2}\left(\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}\bm{y}-\bm{b}_{S_i}\right\rVert_2^p\right)^{\frac{p-2}{p}}\left(\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_i}(\bm{x}-\bm{x}^{\star})\right\rVert_2^p\right)^{\frac{2}{p}} \\
&\leq f(\bm{x}^{\star}) + \frac{p(p-1)}{2}f(\bm{x})^{1-\frac{2}{p}}\left\lVert\mathbf{A}(\bm{x}-\bm{x}^{\star})\right\rVert_{\mathcal{G}_p}^2,
\end{aligned}
$$

completing the proof of [Lemma 7.6](https://arxiv.org/html/2607.00252#S7.Thmtheorem6). ∎
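For readers who want the last two inequalities of the chain spelled out, the following is a sketch of the intermediate steps for $p>2$ (for $p=2$ they are immediate). The shorthand $a_i$, $b_i$ and the interpolation parameter $\theta$ are introduced here only for illustration; the first-order term vanishes because $\nabla f(\bm{x}^{\star})=0$.

```latex
% Shorthand (not in the original):
%   a_i = || A_{S_i} y - b_{S_i} ||_2,   b_i = || A_{S_i} (x - x^*) ||_2.
% Step 1: Hoelder's inequality with conjugate exponents p/(p-2) and p/2.
\sum_{i=1}^{m} a_i^{p-2} b_i^{2}
  \le \Big(\sum_{i=1}^{m} a_i^{p}\Big)^{\frac{p-2}{p}}
      \Big(\sum_{i=1}^{m} b_i^{p}\Big)^{\frac{2}{p}}
  = f(\bm{y})^{1-\frac{2}{p}}
    \,\lVert \mathbf{A}(\bm{x}-\bm{x}^{\star}) \rVert_{\mathcal{G}_p}^{2}.
% Step 2: convexity of f, with y = (1-\theta) x^* + \theta x for some \theta in [0,1],
% together with f(x^*) <= f(x).
f(\bm{y}) \le (1-\theta)\, f(\bm{x}^{\star}) + \theta\, f(\bm{x}) \le f(\bm{x}).
```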

### 7.2 Facts about the Iterates

The main result of this section is [Lemma 7.7](https://arxiv.org/html/2607.00252#S7.Thmtheorem7). In words, [Lemma 7.7](https://arxiv.org/html/2607.00252#S7.Thmtheorem7) tells us that each proximal query we make in [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) (see Line [8](https://arxiv.org/html/2607.00252#alg3.l8) of [Algorithm 3](https://arxiv.org/html/2607.00252#alg3)) has bounded objective value. We will need this later when we argue about the convergence rates of the algorithms used to solve the proximal subproblems.

###### Lemma 7.7.

For all queries $\bm{q}_t$, we have

$$
f(\bm{q}_t) \leq f(\bm{x}_t) + \left(9p(p-1)\right)^{\frac{p}{2}}d^{\frac{p}{2}-1}.
$$

###### Proof of [Lemma 7.7](https://arxiv.org/html/2607.00252#S7.Thmtheorem7).

We establish the following upper bound on $f(\bm{v}_t)-f(\bm{x}^{\star})$ using the ingredients developed so far:

$$
\begin{aligned}
f(\bm{v}_t)-f(\bm{x}^{\star}) &\leq \frac{p(p-1)}{2}f(\bm{v}_t)^{1-\frac{2}{p}}\left\lVert\mathbf{A}(\bm{v}_t-\bm{x}^{\star})\right\rVert_{\mathcal{G}_p}^2 && \text{(Lemma 7.6)} \\
&\leq \frac{p(p-1)}{2}f(\bm{v}_t)^{1-\frac{2}{p}}\left\lVert\bm{v}_t-\bm{x}^{\star}\right\rVert_{\mathbf{M}}^2 && \text{(Theorem 2.3)} \\
&\leq p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}}\left\lVert\bm{x}_0-\bm{x}^{\star}\right\rVert_{\mathbf{M}}^2 && \text{(Lemma 5.5)} \\
&\leq p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}}\,2^2(2d)^{1-\frac{2}{p}} && \text{(Lemma 3.5)} \\
&\leq 8d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}}.
\end{aligned}
$$

Now, recall that we assume by rescaling that $f(\bm{x}^{\star})=1$. From this, it trivially follows that $1\leq d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}}$. Combining these and rearranging the above inequality leads to the following polynomial inequality in $f(\bm{v}_t)$,

$$
\begin{aligned}
0 &\geq f(\bm{v}_t) - 8d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}} - 1 \\
&= f(\bm{v}_t) - 9d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}} + d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}} - 1 \\
&\geq f(\bm{v}_t) - 9d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}},
\end{aligned}
\tag{7.5}
$$

where in the last inequality we used the fact that the optimal value is $f(\bm{x}^{\star})=1$ (due to our rescaling), so that $f(\bm{v}_t)\geq 1$ and hence, for $p\geq 2$ (where $d^{1-\frac{2}{p}}\geq 1$ and $p(p-1)\geq 2$),

$$
1 \leq d^{1-\frac{2}{p}}p(p-1)f(\bm{v}_t)^{1-\frac{2}{p}}.
$$

Solving for $f(\bm{v}_t)$ in ([7.5](https://arxiv.org/html/2607.00252#S7.E5)), we get

$$
f(\bm{v}_t) \leq \left(9p(p-1)\right)^{\frac{p}{2}}d^{\frac{p}{2}-1}.
$$

Using the definition of $\bm{q}_t$ from [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) (Line [7](https://arxiv.org/html/2607.00252#alg3.l7)) along with the convexity of $f$ (Jensen's inequality), and using our bound on $f(\bm{v}_t)$, we note that

$$
\begin{aligned}
f(\bm{q}_t) &\leq f(\bm{x}_t) + f(\bm{v}_t) \\
&\leq f(\bm{x}_t) + \left(9p(p-1)\right)^{\frac{p}{2}}d^{\frac{p}{2}-1},
\end{aligned}
$$

which completes the proof of [Lemma 7.7](https://arxiv.org/html/2607.00252#S7.Thmtheorem7). ∎
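The two short computations used at the end of the proof can be written out as follows. This is a sketch: the convex-combination form of $\bm{q}_t$ with a weight $\theta\in[0,1]$ is an assumption made here about Line 7 of Algorithm 3, which is not reproduced in this section.

```latex
% Solving (7.5) for f(v_t). Dividing by f(v_t)^{1 - 2/p} is allowed because
% f(v_t) >= f(x^*) = 1 > 0.
f(\bm{v}_t) \le 9\, d^{1-\frac{2}{p}}\, p(p-1)\, f(\bm{v}_t)^{1-\frac{2}{p}}
\;\Longrightarrow\;
f(\bm{v}_t)^{\frac{2}{p}} \le 9\, p(p-1)\, d^{1-\frac{2}{p}}
\;\Longrightarrow\;
f(\bm{v}_t) \le \big(9\, p(p-1)\big)^{\frac{p}{2}}\, d^{\frac{p}{2}-1}.
% Jensen step, assuming q_t = (1-\theta) x_t + \theta v_t with \theta in [0,1]
% (hypothetical form of the update), and using f >= 0:
f(\bm{q}_t) \le (1-\theta)\, f(\bm{x}_t) + \theta\, f(\bm{v}_t)
            \le f(\bm{x}_t) + f(\bm{v}_t).
```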

### 7.3 Proximal Subproblems – Calculus, Algorithms, Proofs

Let

$$
f_{\bm{q}_t}(\widetilde{\bm{x}}) \coloneqq f(\widetilde{\bm{x}}) + ep^p\left\lVert\widetilde{\bm{x}}-\bm{q}_t\right\rVert_{\mathbf{M}}^p.
$$

In this subsection, we design and analyze an algorithm ([Algorithm 4](https://arxiv.org/html/2607.00252#alg4)) that approximately solves the subproblem

$$
\underset{\widetilde{\bm{x}}\in\mathbb{R}^d}{\mathrm{argmin}}\ f_{\bm{q}_t}(\widetilde{\bm{x}}).
$$

Specifically, we will output $(\widetilde{\bm{x}}_{t+1},\lambda_{t+1})$ that satisfy the $\frac{1}{2}$-MS oracle condition ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)) and an appropriate movement bound ([Definition 5.2](https://arxiv.org/html/2607.00252#S5.Thmtheorem2)).

This subproblem is the workhorse of [Algorithm 5](https://arxiv.org/html/2607.00252#alg5), and once we implement and analyze the solver, it is straightforward to plug it into [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) and [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3) to get our final iteration complexity.

Algorithm 4 GpRegressionProxOracle: Implements a $\frac{1}{2}$-MS oracle for $\left\lVert\cdot\right\rVert_{\mathcal{G}_p}$ regression (see [Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19) and [Algorithm 2](https://arxiv.org/html/2607.00252#alg2)).

1: Query $\bm{q}_t$, previous iterate $\bm{x}_t$, intended parameter distance $\gamma$.

2: Define

$$
\begin{aligned}
f_{\bm{q}_t}(\widetilde{\bm{x}}) &\coloneqq f(\widetilde{\bm{x}}) + e p^{p} \left\lVert \widetilde{\bm{x}} - \bm{q}_t \right\rVert_{\mathbf{M}}^{p}, \\
h_{\bm{q}_t}(\widetilde{\bm{x}}) &\coloneqq \left\lVert \widetilde{\bm{x}} - \bm{q}_t \right\rVert_{\nabla^{2} f(\bm{q}_t)}^{2} + e p^{p} \left\lVert \widetilde{\bm{x}} - \bm{q}_t \right\rVert_{\mathbf{M}}^{p}, \\
D_{h_{\bm{q}_t}}(\bm{x}, \bm{y}) &\coloneqq h_{\bm{q}_t}(\bm{x}) - h_{\bm{q}_t}(\bm{y}) - \left\langle \nabla h_{\bm{q}_t}(\bm{y}), \bm{x} - \bm{y} \right\rangle, \\
\widetilde{\bm{x}}_{\bm{q}_t} &\coloneqq \underset{\widetilde{\bm{x}} \in \mathbb{R}^{d}}{\mathrm{argmin}}\ f_{\bm{q}_t}(\widetilde{\bm{x}}).
\end{aligned}
$$

3: Let $T \geq C p^{O(1)} e \log\left( d p e\, h_{\bm{q}_t}(\widetilde{\bm{x}}_{\bm{q}_t}) \left( \frac{4}{p\gamma} \right)^{p} \right)$.

4: Run [Algorithm 2](https://arxiv.org/html/2607.00252#alg2) with input iteration count $T$, base function $f_{\bm{q}_t}$, reference function $h_{\bm{q}_t}$, and initialization $\bm{q}_t$.
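To make the objects defined in line 2 of Algorithm 4 concrete, here is a minimal NumPy sketch that builds $f$, $f_{\bm{q}}$, $h_{\bm{q}}$, and the Bregman divergence $D_{h_{\bm{q}}}$. The function name `make_prox_objects`, the `blocks` representation of the groups, and the random data are assumptions made for illustration. In particular, `M` is an arbitrary positive definite placeholder, not the block-Lewis-weight matrix constructed in the paper, and the Hessian formula assumes every group residual at the evaluation point is nonzero.

```python
import numpy as np

def make_prox_objects(blocks, M, q, p):
    """Build f, f_q, h_q and D_h for groups blocks = [(A_i, b_i), ...].

    M is an arbitrary symmetric PSD placeholder (assumption); the paper
    derives M from block Lewis weights, which this sketch does not compute.
    """
    C_p = np.e * p**p

    def f(x):
        return sum(np.linalg.norm(A @ x - b) ** p for A, b in blocks)

    def hess_f(x):
        # Hessian of sum_i ||A_i x - b_i||_2^p, assuming nonzero residuals.
        H = np.zeros((x.size, x.size))
        for A, b in blocks:
            r = A @ x - b
            n = np.linalg.norm(r)
            Atr = A.T @ r
            H += p * n ** (p - 2) * (A.T @ A)
            H += p * (p - 2) * n ** (p - 4) * np.outer(Atr, Atr)
        return H

    Hq = hess_f(q)  # fixed Hessian at the query point

    def m_norm(v):
        return np.sqrt(v @ M @ v)

    def f_q(x):
        return f(x) + C_p * m_norm(x - q) ** p

    def h_q(x):
        v = x - q
        return v @ Hq @ v + C_p * m_norm(v) ** p

    def grad_h_q(x):
        v = x - q
        return 2.0 * Hq @ v + p * C_p * m_norm(v) ** (p - 2) * (M @ v)

    def bregman(x, y):
        return h_q(x) - h_q(y) - grad_h_q(y) @ (x - y)

    return f, f_q, h_q, bregman

# Illustrative random instance (not data from the paper).
rng = np.random.default_rng(0)
d, p = 3, 4
blocks = [(rng.standard_normal((2, d)), rng.standard_normal(2)) for _ in range(5)]
B = rng.standard_normal((d, d))
M = B.T @ B + np.eye(d)
q, x, y = (rng.standard_normal(d) for _ in range(3))
f, f_q, h_q, bregman = make_prox_objects(blocks, M, q, p)
```

Since $h_{\bm{q}}$ is convex, `bregman(x, y)` is non-negative and vanishes when `x` equals `y`; the proximal objective agrees with `f` at the query point `q`.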

The goal of the rest of this section is to analyze [Algorithm 4](https://arxiv.org/html/2607.00252#alg4). The analysis follows several steps:

1. We find a reference function $h_{\bm{q}_t}$ that depends on the query point $\bm{q}_t$ for which the proximal objective $f_{\bm{q}_t}$ is relatively smooth and relatively strongly convex with $O(p^{O(1)})$ condition number (see [Section 4](https://arxiv.org/html/2607.00252#S4) for a sense of why this is useful). The main result here is [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8).
2. We show that $f_{\bm{q}_t}$ is strongly convex, following from [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2). This will help us understand the argument suboptimality for any point that approximately optimizes $f_{\bm{q}_t}$ in function value. We also show that the reference function $h_{\bm{q}_t}$ is strongly convex, using the same tools, for the same reason.
3. We show a form of smoothness for $f_{\bm{q}_t}$. This helps us bound the gradient at any point that approximately optimizes $f_{\bm{q}_t}$. Combining these later will tell us that an approximate solution to $f_{\bm{q}_t}$ in argument value is also an approximate stationary point, i.e., it satisfies the $\frac{1}{2}$-MS condition ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)).
4. We solve the proximal subproblems. This solution itself follows a few steps:
   - (a) We apply [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1). This tells us that as long as we can approximately solve the Bregman proximal problems (approximately implementing Line [4](https://arxiv.org/html/2607.00252#alg2.l4) in [Algorithm 2](https://arxiv.org/html/2607.00252#alg2)), we will be in good shape.
   - (b) This means we have to figure out how to approximately solve problems of the form $\underset{\bm{x} \in \mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle \bm{g}, \bm{x} \right\rangle + L h_{\bm{q}_t}(\bm{x})$, where $L$ is the smoothness constant derived for $f_{\bm{q}_t}$ with respect to $h_{\bm{q}_t}$. We do this up to an accuracy that approximate mirror descent can handle (see [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1) for details on what we want this approximation to look like). For the approximation to work, we need to solve this problem up to both argument accuracy and approximate stationarity. The main technical result of interest here is [Lemma 7.17](https://arxiv.org/html/2607.00252#S7.Thmtheorem17).
5. We use the smoothness and strong convexity guarantees to show that our solution from the previous step satisfies the $\frac{1}{2}$-MS oracle ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)), which means we can plug it directly into [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3).

#### 7.3.1 Hessian Stability

Throughout this section, we adopt the following notation:

$$
\begin{aligned}
C_{p} &\coloneqq e p^{p}, \\
f(\bm{x}) &\coloneqq \sum_{i=1}^{m} \left\lVert \mathbf{A}_{S_i} \bm{x} - \bm{b}_{S_i} \right\rVert_{2}^{p}, \\
f_{\bm{q}}(\bm{x}) &\coloneqq f(\bm{x}) + C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p}, \\
h_{\bm{q}}(\bm{x}) &\coloneqq \left\lVert \bm{x} - \bm{q} \right\rVert_{\nabla^{2} f(\bm{q})}^{2} + C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p}.
\end{aligned}
$$

We begin by proving our Hessian stability fact, which can equivalently be viewed as showing that $f_{\bm{q}_t}$ is relatively smooth and relatively strongly convex with respect to $h_{\bm{q}_t}$ with $O(p^{O(1)})$ condition number. Our main result is [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8), which relies on the analytical results [Lemma 7.9](https://arxiv.org/html/2607.00252#S7.Thmtheorem9) and [Lemma 7.10](https://arxiv.org/html/2607.00252#S7.Thmtheorem10) that we prove later.

###### Lemma 7\.8\.

For all $\bm{x} \in \mathbb{R}^{d}$ and $p \geq 2$, we have

$$
\frac{1}{2p \cdot e} \nabla^{2} h_{\bm{q}}(\bm{x}) \preceq \nabla^{2} f_{\bm{q}}(\bm{x}) \preceq p \cdot e\, \nabla^{2} h_{\bm{q}}(\bm{x}).
$$

###### Proof of [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8).

Using an arbitrary $\bm{z} \in \mathbb{R}^{d}$, we can bound the quadratic form of the Hessian of $f$ as follows,

$$
\begin{aligned}
\bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z}
&\leq^{\text{(a)}} p (p-1) \sum_{i=1}^{m} \left\lVert \mathbf{A}_{S_i} \bm{x} - \bm{b}_{S_i} \right\rVert_{2}^{p-2} \left\lVert \mathbf{A}_{S_i} \bm{z} \right\rVert_{2}^{2}, \\
&= p (p-1) \sum_{i=1}^{m} \left\lVert \mathbf{A}_{S_i} (\bm{x} - \bm{q}) + \mathbf{A}_{S_i} \bm{q} - \bm{b}_{S_i} \right\rVert_{2}^{p-2} \left\lVert \mathbf{A}_{S_i} \bm{z} \right\rVert_{2}^{2}, \\
&\leq^{\text{(b)}} p (p-1) \sum_{i=1}^{m} \left( \alpha_{p}^{p-2} \left\lVert \mathbf{A}_{S_i} (\bm{x} - \bm{q}) \right\rVert_{2}^{p-2} \left\lVert \mathbf{A}_{S_i} \bm{z} \right\rVert_{2}^{2} + \beta_{p}^{p-2} \left\lVert \mathbf{A}_{S_i} \bm{q} - \bm{b}_{S_i} \right\rVert_{2}^{p-2} \left\lVert \mathbf{A}_{S_i} \bm{z} \right\rVert_{2}^{2} \right), \\
&\leq^{\text{(c)}} p (p-1) \alpha_{p}^{p-2} \sum_{i=1}^{m} \left\lVert \mathbf{A}_{S_i} (\bm{x} - \bm{q}) \right\rVert_{2}^{p-2} \left\lVert \mathbf{A}_{S_i} \bm{z} \right\rVert_{2}^{2} + (p-1) \beta_{p}^{p-2}\, \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z}, \\
&\leq^{\text{(d)}} p (p-1) \alpha_{p}^{p-2} \left( \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} \right)^{(p-2)/p} \left( \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{p} \right)^{2/p} + (p-1) \beta_{p}^{p-2}\, \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z}, \\
&= p (p-1) \alpha_{p}^{p-2} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-2} \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{2} + (p-1) \beta_{p}^{p-2}\, \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z}, \\
&\leq^{\text{(e)}} \frac{(p-1) \alpha_{p}^{p-2}}{C_{p}} \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z} + (p-1) \beta_{p}^{p-2}\, \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z},
\end{aligned}
\tag{7.6}
$$

where in (a) we apply the upper bound from [Lemma 7.1](https://arxiv.org/html/2607.00252#S7.Thmtheorem1), in (b) we pick $\alpha_{p}, \beta_{p} \geq 1$ such that $1/\alpha_{p} + 1/\beta_{p} = 1$ (we will choose them later), in (c) we apply the lower bound from [Lemma 7.1](https://arxiv.org/html/2607.00252#S7.Thmtheorem1), in (d) we use the choice of our weights in designing $\mathbf{M}$ and [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3), and finally in (e) we use the following calculations for the regularizer term for some $\bm{z} \in \mathbb{R}^{d}$,

$$
\begin{aligned}
g_{\bm{q}}(\bm{x}) &\coloneqq C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p}, \\
\nabla g_{\bm{q}}(\bm{x}) &= p C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-2} \mathbf{M} (\bm{x} - \bm{q}), \\
\nabla^{2} g_{\bm{q}}(\bm{x}) &= p C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-2} \mathbf{M} + p (p-2) C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-4} \mathbf{M} (\bm{x} - \bm{q}) (\bm{x} - \bm{q})^{\top} \mathbf{M}, \\
\bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z} &= p C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-2} \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{2} + p (p-2) C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-4} \left( (\bm{x} - \bm{q})^{\top} \mathbf{M} \bm{z} \right)^{2} \geq^{(p \geq 2)} 0.
\end{aligned}
$$

Combining ([7.6](https://arxiv.org/html/2607.00252#S7.E6)) with the definition of $f_{\bm{q}}$ gives us,
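The closed-form gradient and Hessian of the regularizer $g_{\bm{q}}$ can be sanity-checked numerically. The sketch below compares the stated Hessian with a central finite difference of the stated gradient. The helper names, the random instance, and the positive definite matrix `M` are illustrative assumptions, not objects from the paper.

```python
import numpy as np

def g_grad(x, q, M, p, C_p):
    # Gradient formula: p * C_p * ||x - q||_M^(p-2) * M (x - q).
    v = x - q
    n = np.sqrt(v @ M @ v)
    return p * C_p * n ** (p - 2) * (M @ v)

def g_hess(x, q, M, p, C_p):
    # Hessian formula: a scaled copy of M plus a rank-one term.
    v = x - q
    n = np.sqrt(v @ M @ v)
    Mv = M @ v
    return (p * C_p * n ** (p - 2) * M
            + p * (p - 2) * C_p * n ** (p - 4) * np.outer(Mv, Mv))

def fd_hess(x, q, M, p, C_p, h=1e-6):
    # Central finite difference of the gradient, one coordinate at a time.
    d = x.size
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = h
        H[:, j] = (g_grad(x + e, q, M, p, C_p) - g_grad(x - e, q, M, p, C_p)) / (2 * h)
    return H

# Illustrative random instance (assumption, not data from the paper).
rng = np.random.default_rng(0)
d, p = 4, 3.0
C_p = np.e * p**p
B = rng.standard_normal((d, d))
M = B.T @ B + np.eye(d)
x, q = rng.standard_normal(d), rng.standard_normal(d)

H = g_hess(x, q, M, p, C_p)
rel_err = np.linalg.norm(H - fd_hess(x, q, M, p, C_p)) / np.linalg.norm(H)
min_eig = np.linalg.eigvalsh(H).min()
```

A small `rel_err` indicates the closed form matches the numerical derivative, and a non-negative `min_eig` reflects the positive semidefiniteness used in step (e).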

$$
\begin{aligned}
\bm{z}^{\top} \nabla^{2} f_{\bm{q}}(\bm{x}) \bm{z}
&= \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\leq^{\text{using (7.6)}} (p-1) \beta_{p}^{p-2}\, \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} + \left( 1 + \frac{(p-1) \alpha_{p}^{p-2}}{C_{p}} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}.
\end{aligned}
$$

Thus, to finish the proof of the upper bound we need to pick $\alpha_{p}, \beta_{p}$. We split the analysis into two cases: A. $p > 2$ and B. $p = 2$.

##### Case A. ($p > 2$)

For simplicity we pick $\alpha_{p} = p - 1$ and $\beta_{p} = \frac{p-1}{p-2}$, which implies,

$$
\begin{aligned}
\bm{z}^{\top} \nabla^{2} f_{\bm{q}}(\bm{x}) \bm{z}
&\leq (p-1) \left( 1 + \frac{1}{p-2} \right)^{p-2} \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} + \left( 1 + \frac{(p-1) (p-1)^{p-2}}{C_{p}} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\leq (p-1) e\, \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} + \left( 1 + \frac{(p-1)^{p-1}}{C_{p}} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&= \frac{(p-1) e}{2} \bm{z}^{\top} \left( \nabla^{2} h_{\bm{q}}(\bm{x}) - \nabla^{2} g_{\bm{q}}(\bm{x}) \right) \bm{z} + \left( 1 + \frac{(p-1)^{p-1}}{C_{p}} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\leq^{(p \geq 2)} p e\, \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z} + \left( 1 + \frac{(p-1)^{p-1}}{C_{p}} - \frac{(p-1) e}{2} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&= p e\, \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z} + \left( 1 + \frac{(p-1)^{p-1}}{e p^{p}} - \frac{(p-1) e}{2} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\leq^{\text{(Lemma 7.9)}} p e\, \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z},
\end{aligned}
$$

where in the final inequality we use [Lemma 7.9](https://arxiv.org/html/2607.00252#S7.Thmtheorem9), which tells us that for $p \geq 2$ the constant in front of $\bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}$ is non-positive, along with the fact that $\bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}$ is non-negative. To get the lower bound we first exchange $\bm{x}, \bm{q}$ in ([7.6](https://arxiv.org/html/2607.00252#S7.E6)) (and use the values of $\alpha_{p}$ and $\beta_{p}$) to get,

$$
\begin{aligned}
& \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} \leq \frac{(p-1) (p-1)^{p-2}}{e p^{p}} \bm{z}^{\top} \nabla^{2} g_{\bm{x}}(\bm{q}) \bm{z} + (p-1) \left( 1 + \frac{1}{p-2} \right)^{p-2} \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z}, \\
\Rightarrow\ & \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} \leq \frac{(p-1)^{p-1}}{e p^{p}} \bm{z}^{\top} \nabla^{2} g_{\bm{x}}(\bm{q}) \bm{z} + (p-1) e\, \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z}, \\
\Rightarrow\ & \frac{1}{(p-1) e} \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} - \frac{(p-1)^{p-2}}{e^{2} p^{p}} \bm{z}^{\top} \nabla^{2} g_{\bm{x}}(\bm{q}) \bm{z} \leq \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z}.
\end{aligned}
$$

We can finally lower bound,

$$
\begin{aligned}
\bm{z}^{\top} \nabla^{2} f_{\bm{q}}(\bm{x}) \bm{z}
&= \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\geq \frac{1}{(p-1) e} \bm{z}^{\top} \nabla^{2} f(\bm{q}) \bm{z} - \frac{(p-1)^{p-2}}{e^{2} p^{p}} \bm{z}^{\top} \nabla^{2} g_{\bm{x}}(\bm{q}) \bm{z} + \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&= \frac{1}{2 (p-1) e} \bm{z}^{\top} \left( \nabla^{2} h_{\bm{q}}(\bm{x}) - \nabla^{2} g_{\bm{q}}(\bm{x}) \right) \bm{z} - \frac{(p-1)^{p-2}}{e^{2} p^{p}} \bm{z}^{\top} \nabla^{2} g_{\bm{x}}(\bm{q}) \bm{z} + \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\geq^{(g_{\bm{q}}(\bm{x}) = g_{\bm{x}}(\bm{q}))} \frac{1}{2 p e} \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z} + \left( 1 - \frac{1}{2 (p-1) e} - \frac{(p-1)^{p-2}}{e^{2} p^{p}} \right) \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&\geq^{\text{(Lemma 7.10)}} \frac{1}{2 p e} \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z},
\end{aligned}
$$

where in the final inequality we use [Lemma 7.10](https://arxiv.org/html/2607.00252#S7.Thmtheorem10) and the fact that $\bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}$ is non-negative. This finishes the proof for Case A.

We finally consider the corner case with $p = 2$.

##### Case B. ($p = 2$)

In this case the proof is immediate, and follows from writing the quadratic forms for $f_{\bm{q}}$ and $h_{\bm{q}}$. Note that $f$ is quadratic when $p = 2$, so $\nabla^{2} f(\bm{x}) = \nabla^{2} f(\bm{q})$ and $\nabla^{2} h_{\bm{q}}(\bm{x}) = 2 \nabla^{2} f(\bm{x}) + 2 C_{2} \mathbf{M}$. We have,

$$
\begin{aligned}
\bm{z}^{\top} \nabla^{2} f_{\bm{q}}(\bm{x}) \bm{z}
&= \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + \bm{z}^{\top} \nabla^{2} g_{\bm{q}}(\bm{x}) \bm{z}, \\
&= \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + 2 C_{2} \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{2}, \\
&\leq 2 \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + 2 C_{2} \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{2} = \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z},
\end{aligned}
$$

which shows relative smoothness with a constant of $1$, which is smaller (and hence better) than the claimed constant (for $p = 2$) of $2e$ in the lemma. For relative strong convexity we do the same,

$$
\begin{aligned}
\bm{z}^{\top} \nabla^{2} f_{\bm{q}}(\bm{x}) \bm{z}
&= \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + 2 C_{2} \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{2}, \\
&\geq \frac{1}{2} \cdot \left( 2 \bm{z}^{\top} \nabla^{2} f(\bm{x}) \bm{z} + 2 C_{2} \left\lVert \bm{z} \right\rVert_{\mathbf{M}}^{2} \right), \\
&= \frac{1}{2} \bm{z}^{\top} \nabla^{2} h_{\bm{q}}(\bm{x}) \bm{z},
\end{aligned}
$$

which shows relative strong convexity with a constant of $\frac{1}{2}$, which is larger (and hence better) than the claimed constant (for $p = 2$) of $\frac{1}{4e}$ in the lemma. This finishes the proof for Case B.
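The $p = 2$ case can be checked directly on matrices, since both Hessians are constant. The following sketch, under the assumption of random group matrices and an arbitrary positive semidefinite stand-in for $\mathbf{M}$ (neither taken from the paper), verifies the sandwich $\frac{1}{2} \nabla^{2} h_{\bm{q}} \preceq \nabla^{2} f_{\bm{q}} \preceq \nabla^{2} h_{\bm{q}}$ through eigenvalues of the two gap matrices.

```python
import numpy as np

# Illustrative random instance (assumption, not data from the paper).
rng = np.random.default_rng(1)
d, m = 4, 6
C_2 = np.e * 2**2
blocks = [rng.standard_normal((3, d)) for _ in range(m)]
B = rng.standard_normal((d, d))
M = B.T @ B  # arbitrary PSD stand-in for the block-Lewis-weight matrix

# For p = 2, f(x) = sum_i ||A_i x - b_i||_2^2 has the constant Hessian 2 A^T A.
hess_f = 2.0 * sum(A.T @ A for A in blocks)
hess_fq = hess_f + 2.0 * C_2 * M        # Hessian of f_q
hess_h = 2.0 * hess_f + 2.0 * C_2 * M   # Hessian of h_q

# Smallest eigenvalues of the gaps; both should be non-negative.
upper_gap = np.linalg.eigvalsh(hess_h - hess_fq).min()
lower_gap = np.linalg.eigvalsh(hess_fq - 0.5 * hess_h).min()
```

The two gaps equal $\nabla^{2} f$ and $C_{2} \mathbf{M}$ respectively, which is why the check succeeds for any positive semidefinite choice of `M`.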

This completes the proof of [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8). ∎

We now prove the two small technical lemmas used in the above proof.

###### Lemma 7\.9\.

For all $p \geq 2$, $g(p) = 1 + \frac{(p-1)^{p-1}}{e p^{p}} - \frac{(p-1) \cdot e}{2} \leq 0$.

###### Proof\.

First note that at $p = 2$ the function takes a strictly negative value,

$$
g(2) = 1 + \frac{1}{e 2^{2}} - \frac{e}{2} = \frac{4e + 1 - 2e^{2}}{4e} < 0.
$$

We will now show that the function is decreasing in $p$ for $p \geq 2$,

$$
\begin{aligned}
g'(p) &= -\frac{(p-1)^{p-1} p^{p} (\ln(p) + 1)}{e\, p^{2p}} + \frac{(p-1)^{p-1} (\ln(p-1) + 1)}{e\, p^{p}} - \frac{e}{2}, \\
&= -\frac{(p-1)^{p-1} \ln(p/(p-1))}{e\, p^{p}} - \frac{e}{2} < 0.
\end{aligned}
$$

Thus, the function attains its maximum value at $p = 2$ in the range $p \geq 2$, implying it is strictly negative in that range. ∎
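As a numerical cross-check of Lemma 7.9, the sketch below evaluates $g$ on a grid and compares a closed-form derivative with a central finite difference. The closed form used here carries the factor $1/e$ inherited from the middle term of $g$. The function names and the grid $[2, 200]$ are illustrative choices, and a grid check is of course not a proof.

```python
import numpy as np

def g_79(p):
    # (p-1)^(p-1) / p^p computed in log space to avoid overflow for large p.
    ratio = np.exp((p - 1) * np.log(p - 1) - p * np.log(p))
    return 1.0 + ratio / np.e - (p - 1) * np.e / 2.0

def g_79_prime(p):
    # Closed-form derivative: -(p-1)^(p-1) ln(p/(p-1)) / (e p^p) - e/2.
    ratio = np.exp((p - 1) * np.log(p - 1) - p * np.log(p))
    return -ratio * np.log(p / (p - 1)) / np.e - np.e / 2.0

ps = np.linspace(2.0, 200.0, 5000)
vals = g_79(ps)

h = 1e-5
fd = (g_79(ps + h) - g_79(ps - h)) / (2 * h)  # central finite difference
max_val = vals.max()
deriv_err = np.max(np.abs(fd - g_79_prime(ps)))
```

On this grid the maximum is attained at $p = 2$, consistent with the monotonicity argument in the proof.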

###### Lemma 7\.10\.

For all $p \geq 2$, $g(p) = 1 - \frac{1}{2(p-1)e} - \frac{(p-1)^{p-2}}{e^{2} p^{p}} \geq 0$.

###### Proof\.

First note that at $p = 2$ the function takes a strictly positive value,

$$
g(2) = 1 - \frac{1}{2e} - \frac{1^{0}}{e^{2} 2^{2}} = 1 - \frac{1}{2e} - \frac{1}{4e^{2}} = \frac{4e^{2} - 2e - 1}{4e^{2}} > 0.
$$

We will now show that the function is increasing in $p$ for $p \geq 2$,

$$
\begin{aligned}
g'(p) &= \frac{1}{2(p-1)^{2} e} + \frac{(p-1)^{p-2} p^{p} (\ln(p) + 1)}{e^{2} p^{2p}} - \frac{(p-1)^{p-2} \left( \ln(p-1) + (p-2)/(p-1) \right)}{e^{2} p^{p}}, \\
&= \frac{1}{2(p-1)^{2} e} + \frac{(p-1)^{p-2} (\ln(p) + 1)}{e^{2} p^{p}} - \frac{(p-1)^{p-2} \left( \ln(p-1) + 1 - 1/(p-1) \right)}{e^{2} p^{p}}, \\
&= \frac{1}{2(p-1)^{2} e} + \frac{(p-1)^{p-2} \left( \ln(p/(p-1)) + 1/(p-1) \right)}{e^{2} p^{p}} > 0.
\end{aligned}
$$

Thus, the function $g$ attains its minimum value at $p = 2$ in the range $p \geq 2$, implying that it is strictly positive in that range. ∎
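The same kind of grid check applies to Lemma 7.10. The sketch below, with an illustrative function name and grid that are not part of the paper, evaluates $g$ in log space and confirms positivity and monotonicity on $[2, 200]$.

```python
import numpy as np

def g_710(p):
    # (p-1)^(p-2) / p^p computed in log space to avoid overflow for large p.
    ratio = np.exp((p - 2) * np.log(p - 1) - p * np.log(p))
    return 1.0 - 1.0 / (2.0 * (p - 1) * np.e) - ratio / np.e**2

ps = np.linspace(2.0, 200.0, 5000)
vals = g_710(ps)
min_val = vals.min()                          # smallest value on the grid
is_increasing = bool(np.all(np.diff(vals) > 0.0))
```

The minimum on the grid sits at $p = 2$, where $g(2) = \frac{4e^{2} - 2e - 1}{4e^{2}}$, matching the value computed in the proof.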

#### 7.3.2 Strong Convexity of the Proximal Objective and Friends

We begin by showing that the proximal objective enjoys a form of strong convexity\.

###### Lemma 7\.11\.

For all $\bm{x}, \bm{d} \in \mathbb{R}^{d}$, we have

$$
f_{\bm{q}}(\bm{x} + \bm{d}) \geq f_{\bm{q}}(\bm{x}) + \left\langle \nabla f_{\bm{q}}(\bm{x}), \bm{d} \right\rangle + \frac{4}{2^{p}} \left( \left\lVert \mathbf{A} \bm{d} \right\rVert_{\mathcal{G}_p}^{p} + C_{p} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p} \right).
$$

###### Proof of [Lemma 7.11](https://arxiv.org/html/2607.00252#S7.Thmtheorem11).

Let $K_{p} \coloneqq \frac{4}{2^{p}}$.

The plan is to apply [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2) to $f_{\bm{q}}(\bm{x} + \bm{d})$. We start with the regularizer. Notice that

$$
\begin{align*}
\left\lVert \bm{x} + \bm{d} - \bm{q} \right\rVert_{\mathbf{M}}^{p}
&= \left\lVert \mathbf{M}^{1/2} (\bm{x} + \bm{d} - \bm{q}) \right\rVert_{2}^{p} = \left\lVert \mathbf{M}^{1/2} (\bm{x} - \bm{q}) + \mathbf{M}^{1/2} \bm{d} \right\rVert_{2}^{p}, \\
&\geq^{\text{(Section 2.1.2)}} \left\lVert \mathbf{M}^{1/2} (\bm{x} - \bm{q}) \right\rVert_{2}^{p} + \left\langle p \left\lVert \mathbf{M}^{1/2} (\bm{x} - \bm{q}) \right\rVert_{2}^{p-2} \mathbf{M}^{1/2} (\bm{x} - \bm{q}), \mathbf{M}^{1/2} \bm{d} \right\rangle + K_{p} \left\lVert \mathbf{M}^{1/2} \bm{d} \right\rVert_{2}^{p}, \tag{7.7} \\
&= \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} + \left\langle p \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p-2} \mathbf{M} (\bm{x} - \bm{q}), \bm{d} \right\rangle + K_{p} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p}, \\
&= \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} + \left\langle \nabla_{\bm{x}} \left( \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} \right), \bm{d} \right\rangle + K_{p} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p}. \tag{7.8}
\end{align*}
$$

We combine this with the conclusion of [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2), giving

$$
\begin{aligned}
f_{\bm{q}}(\bm{x} + \bm{d})
&= f(\bm{x} + \bm{d}) + C_{p} \left\lVert \bm{x} + \bm{d} - \bm{q} \right\rVert_{\mathbf{M}}^{p}, \\
&\geq^{\text{(Lemma 7.2)}} f(\bm{x}) + \left\langle \nabla f(\bm{x}), \bm{d} \right\rangle + K_{p} \left\lVert \mathbf{A} \bm{d} \right\rVert_{\mathcal{G}_p}^{p} + C_{p} \left\lVert \bm{x} + \bm{d} - \bm{q} \right\rVert_{\mathbf{M}}^{p}, \\
&\geq^{\text{(7.8)}} f(\bm{x}) + \left\langle \nabla f(\bm{x}), \bm{d} \right\rangle + K_{p} \left\lVert \mathbf{A} \bm{d} \right\rVert_{\mathcal{G}_p}^{p} + C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} \\
&\quad + C_{p} \left\langle \nabla_{\bm{x}} \left( \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} \right), \bm{d} \right\rangle + K_{p} C_{p} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p}, \\
&= f(\bm{x}) + C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} + \left\langle \nabla_{\bm{x}} \left( f(\bm{x}) + C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} \right), \bm{d} \right\rangle \\
&\quad + K_{p} \left\lVert \mathbf{A} \bm{d} \right\rVert_{\mathcal{G}_p}^{p} + K_{p} C_{p} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p}, \\
&= f_{\bm{q}}(\bm{x}) + \left\langle \nabla f_{\bm{q}}(\bm{x}), \bm{d} \right\rangle + K_{p} \left( \left\lVert \mathbf{A} \bm{d} \right\rVert_{\mathcal{G}_p}^{p} + C_{p} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p} \right),
\end{aligned}
$$

completing the proof of [Lemma 7.11](https://arxiv.org/html/2607.00252#S7.Thmtheorem11). ∎
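The regularizer bound (7.8) is the inequality $\lVert \bm{u} + \bm{v} \rVert_{\mathbf{M}}^{p} \geq \lVert \bm{u} \rVert_{\mathbf{M}}^{p} + \langle p \lVert \bm{u} \rVert_{\mathbf{M}}^{p-2} \mathbf{M} \bm{u}, \bm{v} \rangle + \frac{4}{2^{p}} \lVert \bm{v} \rVert_{\mathbf{M}}^{p}$ with $\bm{u} = \bm{x} - \bm{q}$ and $\bm{v} = \bm{d}$. The sketch below spot-checks it on random vectors for several values of $p \geq 2$. The helper `sc_gap`, the random positive definite `M`, and the sampled values of $p$ are assumptions for illustration, and random sampling only provides evidence rather than a proof.

```python
import numpy as np

def sc_gap(u, v, M, p):
    """Return (LHS - RHS, scale) for the p-th power bound in the M-norm."""
    def nrm(w):
        return np.sqrt(w @ M @ w)
    lhs = nrm(u + v) ** p
    rhs = (nrm(u) ** p
           + p * nrm(u) ** (p - 2) * (u @ M @ v)   # <p ||u||^(p-2) M u, v>
           + 4.0 / 2**p * nrm(v) ** p)             # K_p ||v||_M^p
    scale = 1.0 + nrm(u) ** p + nrm(v) ** p + lhs  # for a relative tolerance
    return lhs - rhs, scale

# Illustrative random instance (assumption, not data from the paper).
rng = np.random.default_rng(2)
d = 5
B = rng.standard_normal((d, d))
M = B.T @ B + np.eye(d)

worst = np.inf
for p in (2.0, 2.5, 3.0, 4.0, 8.0):
    for _ in range(2000):
        u, v = rng.standard_normal(d), rng.standard_normal(d)
        gap, scale = sc_gap(u, v, M, p)
        worst = min(worst, gap / scale)  # most negative relative gap seen
```

For $p = 2$ the bound holds with equality, so `worst` is expected to sit at the level of floating-point round-off rather than be strictly positive.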

We also show that the subproblems we solve in Line [4](https://arxiv.org/html/2607.00252#alg2.l4) of [Algorithm 2](https://arxiv.org/html/2607.00252#alg2) are strongly convex.

###### Lemma 7\.12\.

Fix $\bm{z}, \bm{q}, \bm{d} \in \mathbb{R}^{d}$ and let $L > 0$. Consider the function

$$
g(\bm{x}) \coloneqq \left\langle \bm{z}, \bm{x} \right\rangle + L \left( \left\lVert \bm{x} - \bm{q} \right\rVert_{\nabla^{2} f(\bm{q})}^{2} + C_{p} \left\lVert \bm{x} - \bm{q} \right\rVert_{\mathbf{M}}^{p} \right).
$$

Then,

$$
g(\bm{x} + \bm{d}) \geq g(\bm{x}) + \left\langle \nabla g(\bm{x}), \bm{d} \right\rangle + L \left( \left\lVert \bm{d} \right\rVert_{\nabla^{2} f(\bm{q})}^{2} + \frac{4 C_{p}}{2^{p}} \left\lVert \bm{d} \right\rVert_{\mathbf{M}}^{p} \right).
$$

In particular, if $\bm{z}$ is the minimizer for $g$, then for any $\bm{d} \in \mathbb{R}^{d}$, we have

$$
\left\lVert \bm{d} \right\rVert_{\mathbf{M}} \leq \frac{2}{p \cdot (4e)^{1/p}} \left( \frac{g(\bm{z} + \bm{d}) - g(\bm{z})}{L} \right)^{1/p}.
$$

###### Proof of[Lemma˜7\.12](https://arxiv.org/html/2607.00252#S7.Thmtheorem12)\.

The proof closely follows that of [Lemma 7.11](https://arxiv.org/html/2607.00252#S7.Thmtheorem11). Expanding the square gives

$$\left\lVert(\bm{x}+\bm{d})-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} = \left\lVert\bm{x}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + \left\langle 2\nabla^{2}f(\bm{q})(\bm{x}-\bm{q}),\bm{d}\right\rangle + \left\lVert\bm{d}\right\rVert_{\nabla^{2}f(\bm{q})}^{2}, \tag{7.9}$$

and using [Section 2.1.2](https://arxiv.org/html/2607.00252#S2.SS1.SSS2) in the same way as in the proof of [Lemma 7.11](https://arxiv.org/html/2607.00252#S7.Thmtheorem11), we have

$$\left\lVert(\bm{x}+\bm{d})-\bm{q}\right\rVert_{\mathbf{M}}^{p} \overset{\text{(7.8)}}{\geq} \left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p} + \left\langle p\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M}(\bm{x}-\bm{q}),\bm{d}\right\rangle + \frac{4}{2^{p}}\left\lVert\bm{d}\right\rVert_{\mathbf{M}}^{p}.$$

Combining this with the definition of $g$ gives the following,

$$\begin{aligned}
g(\bm{x}+\bm{d}) &= \left\langle\bm{z},\bm{x}+\bm{d}\right\rangle + L\left(\left\lVert\bm{x}+\bm{d}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + C_{p}\left\lVert\bm{x}+\bm{d}-\bm{q}\right\rVert_{\mathbf{M}}^{p}\right),\\
&\overset{\text{(7.9), (7.8)}}{\geq} \left\langle\bm{z},\bm{x}\right\rangle + \left\langle\bm{z},\bm{d}\right\rangle + L\left\lVert\bm{x}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + L\left\langle 2\nabla^{2}f(\bm{q})(\bm{x}-\bm{q}),\bm{d}\right\rangle + L\left\lVert\bm{d}\right\rVert_{\nabla^{2}f(\bm{q})}^{2}\\
&\qquad + LC_{p}\left(\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p} + \left\langle p\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M}(\bm{x}-\bm{q}),\bm{d}\right\rangle + \frac{4}{2^{p}}\left\lVert\bm{d}\right\rVert_{\mathbf{M}}^{p}\right),\\
&= g(\bm{x}) + \left\langle\bm{z} + 2L\nabla^{2}f(\bm{q})(\bm{x}-\bm{q}) + LC_{p}\,p\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M}(\bm{x}-\bm{q}),\bm{d}\right\rangle + L\left(\left\lVert\bm{d}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + \frac{4C_{p}}{2^{p}}\left\lVert\bm{d}\right\rVert_{\mathbf{M}}^{p}\right),\\
&= g(\bm{x}) + \left\langle\nabla g(\bm{x}),\bm{d}\right\rangle + L\left(\left\lVert\bm{d}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + \frac{4C_{p}}{2^{p}}\left\lVert\bm{d}\right\rVert_{\mathbf{M}}^{p}\right),
\end{aligned}$$

which proves the first result of the lemma.

To get the second result, we observe that $\nabla g(\bm{z}) = 0$ by the optimality of $\bm{z}$. Applying the first result at $\bm{x} = \bm{z}$, dropping the nonnegative $\left\lVert\bm{d}\right\rVert_{\nabla^{2}f(\bm{q})}^{2}$ term, substituting $C_{p} = ep^{p}$, and rearranging gives the conclusion of [Lemma 7.12](https://arxiv.org/html/2607.00252#S7.Thmtheorem12). ∎
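The uniform-convexity inequality (7.8) that drives the proof above can be sanity-checked numerically. The sketch below samples random vectors and evaluates the gap between the two sides for $p = 4$. The matrix `M` is a random positive definite stand-in (an assumption of this sketch, not the block-Lewis-weight matrix of the paper), and the helper names are hypothetical.

```python
import numpy as np

def m_norm(v, M):
    """Norm induced by a positive definite matrix M: sqrt(v^T M v)."""
    return float(np.sqrt(v @ M @ v))

def uniform_convexity_gap(x, d, M, p):
    """Left side minus right side of
    ||x+d||_M^p >= ||x||_M^p + <p ||x||_M^{p-2} M x, d> + (4 / 2^p) ||d||_M^p."""
    lhs = m_norm(x + d, M) ** p
    grad = p * m_norm(x, M) ** (p - 2) * (M @ x)
    rhs = m_norm(x, M) ** p + grad @ d + (4.0 / 2 ** p) * m_norm(d, M) ** p
    return lhs - rhs

rng = np.random.default_rng(0)
dim, p = 5, 4
B = rng.standard_normal((dim, dim))
M = B.T @ B + np.eye(dim)  # random positive definite stand-in for M

gaps = [
    uniform_convexity_gap(rng.standard_normal(dim), rng.standard_normal(dim), M, p)
    for _ in range(1000)
]
min_gap = min(gaps)  # should be nonnegative up to rounding error
print("smallest gap over 1000 samples:", min_gap)
```

A random test of this kind only illustrates the inequality for one value of $p$; it is not a substitute for the proof.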

#### 7.3.3 Smoothness of the Proximal Objective

We first bound the operator norm of a matrix related to the Hessian of the proximal objective\.

###### Lemma 7\.13\.

For all $\bm{q},\bm{y}\in\mathbb{R}^{d}$, we have

$$\left\lVert\mathbf{M}^{-1/2}\left(\nabla^{2}f_{\bm{q}}(\bm{y})\right)\mathbf{M}^{-1/2}\right\rVert_{\mathrm{op}} \leq ep^{2}(p-1)\left(2f(\bm{q})^{1-\frac{2}{p}} + C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\right).$$

###### Proof of [Lemma 7.13](https://arxiv.org/html/2607.00252#S7.Thmtheorem13).

Recall from the proof of [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8) the definition of the regularization term $g_{\bm{q}}(\bm{y}) \coloneqq C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p}$ for $C_{p} = ep^{p}$, as well as the following calculations,

$$\begin{aligned}
g_{\bm{q}}(\bm{y}) &\coloneqq C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
\nabla g_{\bm{q}}(\bm{y}) &= pC_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M}(\bm{y}-\bm{q}),\\
\nabla^{2}g_{\bm{q}}(\bm{y}) &= pC_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M} + p(p-2)C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-4}\mathbf{M}(\bm{y}-\bm{q})(\bm{y}-\bm{q})^{\top}\mathbf{M}.
\end{aligned}$$

By [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8), we know that

$$\nabla^{2}f_{\bm{q}}(\bm{y}) \preceq ep\left(2\nabla^{2}f(\bm{q}) + \nabla^{2}g_{\bm{q}}(\bm{y})\right).$$

Observe that

$$\begin{aligned}
\mathbf{M}^{-1/2}\left(\nabla^{2}g_{\bm{q}}(\bm{y})\right)\mathbf{M}^{-1/2} &= pC_{p}\left(\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{I} + (p-2)\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-4}\mathbf{M}^{1/2}(\bm{y}-\bm{q})(\bm{y}-\bm{q})^{\top}\mathbf{M}^{1/2}\right),\\
&\preceq pC_{p}\left(\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2} + (p-2)\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-4}\left\lVert\mathbf{M}^{1/2}(\bm{y}-\bm{q})(\bm{y}-\bm{q})^{\top}\mathbf{M}^{1/2}\right\rVert_{\mathrm{op}}\right)\mathbf{I},\\
&= pC_{p}\left(\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2} + (p-2)\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-4}\left\lVert\mathbf{M}^{1/2}(\bm{y}-\bm{q})\right\rVert_{2}^{2}\right)\mathbf{I},\\
&= p(p-1)C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{I},
\end{aligned}$$

where the last step uses $\left\lVert\mathbf{M}^{1/2}(\bm{y}-\bm{q})\right\rVert_{2} = \left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}$. Next, applying [Lemma 7.1](https://arxiv.org/html/2607.00252#S7.Thmtheorem1) (with $\mathbf{M}^{-1/2}\bm{z}$ as the vector in the quadratic form) and Hölder's inequality with norms $\|\cdot\|_{p/(p-2)}$ and $\|\cdot\|_{p/2}$, for $\bm{z}\in\mathbb{R}^{d}$ we have

$$\begin{aligned}
\bm{z}^{\top}\mathbf{M}^{-1/2}\left(\nabla^{2}f(\bm{q})\right)\mathbf{M}^{-1/2}\bm{z} &\leq p(p-1)\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_{i}}\bm{q}-\bm{b}_{S_{i}}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_{i}}\mathbf{M}^{-1/2}\bm{z}\right\rVert_{2}^{2}\\
&\leq p(p-1)\left(\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_{i}}\bm{q}-\bm{b}_{S_{i}}\right\rVert_{2}^{p}\right)^{\frac{p-2}{p}}\left(\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_{i}}\mathbf{M}^{-1/2}\bm{z}\right\rVert_{2}^{p}\right)^{\frac{2}{p}}\\
&\leq p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left\lVert\mathbf{M}^{-1/2}\bm{z}\right\rVert_{\mathbf{M}}^{2} = p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left\lVert\bm{z}\right\rVert_{2}^{2}.
\end{aligned}$$

Combining gives

$$\begin{aligned}
\mathbf{M}^{-1/2}\left(\nabla^{2}f_{\bm{q}}(\bm{y})\right)\mathbf{M}^{-1/2} &\preceq ep\,\mathbf{M}^{-1/2}\left(2\nabla^{2}f(\bm{q}) + \nabla^{2}g_{\bm{q}}(\bm{y})\right)\mathbf{M}^{-1/2},\\
&\preceq \left(2ep^{2}(p-1)f(\bm{q})^{1-\frac{2}{p}} + ep^{2}(p-1)C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\right)\mathbf{I},\\
&= ep^{2}(p-1)\left(2f(\bm{q})^{1-\frac{2}{p}} + C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\right)\mathbf{I},
\end{aligned}$$

completing the proof of [Lemma 7.13](https://arxiv.org/html/2607.00252#S7.Thmtheorem13). ∎
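The bound on the regularizer's Hessian can be checked numerically. The following sketch builds the closed-form Hessian of $g_{\bm{q}}(\bm{y}) = C_{p}\lVert\bm{y}-\bm{q}\rVert_{\mathbf{M}}^{p}$ from the proof, whitens it by $\mathbf{M}^{-1/2}$, and compares its operator norm with $p(p-1)C_{p}\lVert\bm{y}-\bm{q}\rVert_{\mathbf{M}}^{p-2}$. The matrix `M` is a random positive definite stand-in and the function name `hess_g` is hypothetical; both are assumptions of this sketch.

```python
import numpy as np

def hess_g(y, q, M, p, Cp):
    """Closed-form Hessian of g_q(y) = Cp * ||y - q||_M^p, as in the proof."""
    v = y - q
    r = np.sqrt(v @ M @ v)
    Mv = M @ v
    return (p * Cp * r ** (p - 2) * M
            + p * (p - 2) * Cp * r ** (p - 4) * np.outer(Mv, Mv))

rng = np.random.default_rng(1)
dim, p = 4, 4
Cp = np.e * p ** p                      # C_p = e * p^p
B = rng.standard_normal((dim, dim))
M = B.T @ B + np.eye(dim)               # random positive definite stand-in
q, y = rng.standard_normal(dim), rng.standard_normal(dim)

# M^{-1/2} through the eigendecomposition of the symmetric matrix M.
w, V = np.linalg.eigh(M)
M_inv_half = V @ np.diag(w ** -0.5) @ V.T

whitened = M_inv_half @ hess_g(y, q, M, p, Cp) @ M_inv_half
op_norm = np.linalg.norm(whitened, 2)   # largest singular value
r = np.sqrt((y - q) @ M @ (y - q))
bound = p * (p - 1) * Cp * r ** (p - 2)
print(op_norm, bound)
```

Because the whitened Hessian is a multiple of the identity plus a rank-one term, the two printed values should agree up to rounding, so the bound is tight for this term.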

Next, we bound the norm of the gradient at any point $\bm{x}$ that is approximately optimal for $f_{\bm{q}}$.

###### Lemma 7\.14\.

For all $\bm{q},\bm{x}\in\mathbb{R}^{d}$, we have

$$\left\lVert\mathbf{M}^{-1}\nabla f_{\bm{q}}(\bm{x})\right\rVert_{\mathbf{M}} \leq ep^{2}(p-1)\left(2f(\bm{q})^{1-\frac{2}{p}} + C_{p}\max\left\{\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}\right\}^{p-2}\right)\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}.$$

###### Proof of [Lemma 7.14](https://arxiv.org/html/2607.00252#S7.Thmtheorem14).

We use a continuity argument. By Taylor's theorem, for some $\bm{y}$ on the line segment connecting $\bm{x}$ and $\bm{x}_{\bm{q}}$ (the minimizer of $f_{\bm{q}}$), we have

$$\nabla f_{\bm{q}}(\bm{x}) = \nabla f_{\bm{q}}(\bm{x}_{\bm{q}}) + \nabla^{2}f_{\bm{q}}(\bm{y})(\bm{x}-\bm{x}_{\bm{q}}) = \nabla^{2}f_{\bm{q}}(\bm{y})(\bm{x}-\bm{x}_{\bm{q}}).$$

Taking the $\mathbf{M}^{-1}$-norm of both sides gives,

$$\begin{aligned}
\left\lVert\nabla f_{\bm{q}}(\bm{x})\right\rVert_{\mathbf{M}^{-1}} &= \left\lVert\mathbf{M}^{-1/2}\nabla f_{\bm{q}}(\bm{x})\right\rVert_{2},\\
&= \left\lVert\mathbf{M}^{-1/2}\nabla^{2}f_{\bm{q}}(\bm{y})(\bm{x}-\bm{x}_{\bm{q}})\right\rVert_{2},\\
&= \left\lVert\mathbf{M}^{-1/2}\nabla^{2}f_{\bm{q}}(\bm{y})\mathbf{M}^{-1/2}\mathbf{M}^{1/2}(\bm{x}-\bm{x}_{\bm{q}})\right\rVert_{2},\\
&\leq \left\lVert\mathbf{M}^{-1/2}\left(\nabla^{2}f_{\bm{q}}(\bm{y})\right)\mathbf{M}^{-1/2}\right\rVert_{\mathrm{op}}\cdot\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}.
\end{aligned}$$

The rest of the proof involves bounding the operator norm term. This follows directly from [Lemma 7.13](https://arxiv.org/html/2607.00252#S7.Thmtheorem13), from which we get (using convexity of $\|\cdot\|_{\mathbf{M}}$ and the fact that $\bm{y}$ lies between $\bm{x}$ and $\bm{x}_{\bm{q}}$),

$$\begin{aligned}
\left\lVert\mathbf{M}^{-1/2}\nabla^{2}f_{\bm{q}}(\bm{y})\mathbf{M}^{-1/2}\right\rVert_{\mathrm{op}} &\leq ep^{2}(p-1)\left(2f(\bm{q})^{1-\frac{2}{p}} + C_{p}\left\lVert\bm{y}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\right)\\
&\leq ep^{2}(p-1)\left(2f(\bm{q})^{1-\frac{2}{p}} + C_{p}\max\left\{\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}\right\}^{p-2}\right).
\end{aligned}$$

Putting everything together, we get

$$\begin{aligned}
\left\lVert\mathbf{M}^{-1}\nabla f_{\bm{q}}(\bm{x})\right\rVert_{\mathbf{M}} &= \left\lVert\nabla f_{\bm{q}}(\bm{x})\right\rVert_{\mathbf{M}^{-1}},\\
&\leq ep^{2}(p-1)\left(2f(\bm{q})^{1-\frac{2}{p}} + C_{p}\max\left\{\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}\right\}^{p-2}\right)\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}},
\end{aligned}$$

completing the proof of [Lemma 7.14](https://arxiv.org/html/2607.00252#S7.Thmtheorem14). ∎
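The two linear-algebra facts used in this proof, the identity $\lVert\mathbf{M}^{-1}\bm{v}\rVert_{\mathbf{M}} = \lVert\bm{v}\rVert_{\mathbf{M}^{-1}}$ and the operator-norm step, can be illustrated with a small numerical sketch. The matrices below are random stand-ins (an assumption of this sketch): `H` plays the role of $\nabla^{2}f_{\bm{q}}(\bm{y})$ and `u` the role of $\bm{x}-\bm{x}_{\bm{q}}$.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 6
B = rng.standard_normal((dim, dim))
M = B.T @ B + np.eye(dim)      # positive definite stand-in for M
C = rng.standard_normal((dim, dim))
H = C.T @ C                    # PSD stand-in for the Hessian at y
u = rng.standard_normal(dim)   # stand-in for x - x_q

# M^{-1/2} through the eigendecomposition of the symmetric matrix M.
w, V = np.linalg.eigh(M)
M_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T

grad = H @ u                                        # plays the role of grad f_q(x)
Minv_grad = np.linalg.solve(M, grad)
lhs_dual = np.sqrt(grad @ Minv_grad)                # ||grad||_{M^{-1}}
lhs_primal = np.sqrt(Minv_grad @ M @ Minv_grad)     # ||M^{-1} grad||_M

op = np.linalg.norm(M_inv_half @ H @ M_inv_half, 2)
rhs = op * np.sqrt(u @ M @ u)                       # op-norm times ||u||_M
print(lhs_dual, lhs_primal, rhs)
```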

#### 7.3.4 Solving the Proximal Subproblems

We begin by showing that the optimal solution to the proximal problem $\bm{x}_{\bm{q}_{t}} \coloneqq \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ f_{\bm{q}_{t}}(\bm{x})$ is not too far from $\bm{x}^{\star}$.

###### Lemma 7\.15\.

For all proximal queries $\bm{q}_{t}$, we have

$$\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} \leq d^{\frac{1}{2}-\frac{1}{p}}\left(2^{\frac{3}{2}}f(\bm{x}_{t})+4\right).$$

###### Proof\.

In the rest of this proof, we omit the subscript $t$ wherever it is clear which iterates we are working with.

We first show that

$$\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}} \leq \left\lVert\bm{x}^{\star}-\bm{q}\right\rVert_{\mathbf{M}}.$$

To see this, suppose this is not the case. Then, since $f(\bm{x}^{\star}) \leq f(\bm{x}_{\bm{q}})$ by the optimality of $\bm{x}^{\star}$ for $f$, we have

$$f(\bm{x}^{\star}) + C_{p}\left\lVert\bm{x}^{\star}-\bm{q}\right\rVert_{\mathbf{M}}^{p} < f(\bm{x}_{\bm{q}}) + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},$$

which contradicts the optimality of $\bm{x}_{\bm{q}}$ for $f_{\bm{q}}$.

We now write

$$\begin{aligned}
\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} &\leq \left\lVert\bm{x}_{\bm{q}_{t}}-\bm{q}_{t}\right\rVert_{\mathbf{M}} + \left\lVert\bm{x}^{\star}-\bm{q}_{t}\right\rVert_{\mathbf{M}},\\
&\leq 2\left\lVert\bm{x}^{\star}-\bm{q}_{t}\right\rVert_{\mathbf{M}},\\
&\leq 2\left(\left\lVert\bm{x}_{t}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} + \left\lVert\bm{v}_{t}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}\right),
\end{aligned}$$

where in the last inequality, we used the definition of $\bm{q}_{t}$ from Line 6 in [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) and the convexity of $\|\cdot\|_{\mathbf{M}}$. The required control on $\left\lVert\bm{v}_{t}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}$ comes from [Lemma 5.5](https://arxiv.org/html/2607.00252#S5.Thmtheorem5) and [Lemma 3.5](https://arxiv.org/html/2607.00252#S3.Thmtheorem5) (along with the rescaling assumption that makes the optimal value $1$); we have

$$\left\lVert\bm{v}_{t}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} \leq \sqrt{2}\left\lVert\bm{x}_{0}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} \leq 4d^{\frac{1}{2}-\frac{1}{p}}.$$

For the other term, we apply [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2) and get

$$\left\lVert\bm{x}_{t}-\bm{x}^{\star}\right\rVert_{\mathbf{M}} \leq 2^{\frac{3}{2}}d^{\frac{1}{2}-\frac{1}{p}}\left(f(\bm{x}_{t})-f(\bm{x}^{\star})\right)^{\frac{1}{p}} < 2^{\frac{3}{2}}d^{\frac{1}{2}-\frac{1}{p}}f(\bm{x}_{t})^{\frac{1}{p}}.$$

Adding gives us the conclusion of [Lemma 7.15](https://arxiv.org/html/2607.00252#S7.Thmtheorem15). ∎

The next few lemmas are targeted at solving the proximal subproblems\. We begin with a calculation that we will use in showing that the initial Bregman divergence between our initialization and the optimum is small\.

###### Lemma 7\.16\.

In the same setting as [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8), for all $\bm{q}\in\mathbb{R}^{d}$, we have

$$h_{\bm{q}}(\bm{x}_{\bm{q}}) \leq p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{2} + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p} < f(\bm{q}) + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p} \leq 2f(\bm{q}).$$

###### Proof of [Lemma 7.16](https://arxiv.org/html/2607.00252#S7.Thmtheorem16).

By optimality of $\bm{x}_{\bm{q}}$ for the subproblem, we have

$$f(\bm{x}_{\bm{q}}) + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p} \leq f(\bm{q}) + C_{p}\left\lVert\bm{q}-\bm{q}\right\rVert_{\mathbf{M}}^{p} = f(\bm{q}).$$

Rearranging gives,

$$\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p} \leq \frac{f(\bm{q})-f(\bm{x}_{\bm{q}})}{C_{p}} \leq \frac{f(\bm{q})}{C_{p}}. \tag{7.10}$$

We now use the definition of $h_{\bm{q}}$ and [Lemma 7.1](https://arxiv.org/html/2607.00252#S7.Thmtheorem1) to write

$$\begin{aligned}
h_{\bm{q}}(\bm{x}_{\bm{q}}) &= \left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&\overset{\text{Lemma 7.1}}{\leq} p(p-1)\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_{i}}\bm{q}-\bm{b}_{S_{i}}\right\rVert_{2}^{p-2}\left\lVert\mathbf{A}_{S_{i}}(\bm{x}_{\bm{q}}-\bm{q})\right\rVert_{2}^{2} + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&\overset{\text{(a)}}{\leq} p(p-1)\left(\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_{i}}\bm{q}-\bm{b}_{S_{i}}\right\rVert_{2}^{p}\right)^{1-\frac{2}{p}}\left(\sum_{i=1}^{m}\left\lVert\mathbf{A}_{S_{i}}(\bm{x}_{\bm{q}}-\bm{q})\right\rVert_{2}^{p}\right)^{\frac{2}{p}} + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&\overset{\text{(b)}}{\leq} p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{2} + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&\overset{\text{(7.10)}}{\leq} p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left(\frac{f(\bm{q})}{C_{p}}\right)^{\frac{2}{p}} + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&\overset{(C_{p}=ep^{p})}{=} \frac{p-1}{e^{2/p}\,p}f(\bm{q}) + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&< f(\bm{q}) + C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p},\\
&\overset{\text{(7.10)}}{\leq} 2f(\bm{q}),
\end{aligned}$$

where in (a) we used Hölder's inequality with norms $\|\cdot\|_{p/(p-2)}$ and $\|\cdot\|_{p/2}$, and in (b) we used [Theorem 2.3](https://arxiv.org/html/2607.00252#S2.Thmtheorem3).

This completes the proof for the series of inequalities in[Lemma˜7\.16](https://arxiv.org/html/2607.00252#S7.Thmtheorem16)\. ∎

We now have the tools to show how to approximately solve problems in Line [4](https://arxiv.org/html/2607.00252#alg2.l4) of [Algorithm 2](https://arxiv.org/html/2607.00252#alg2) when applied in our setting. Although this and future complexity bounds depend on $f(\bm{x}_{t})$, we will later be able to use [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3) to "bootstrap" and get an unconditional upper bound below.

###### Lemma 7\.17\.

Let $\alpha \leq 1/2$. In the context of [Algorithm 5](https://arxiv.org/html/2607.00252#alg5), there exists an algorithm that approximately solves subproblems of the form (for $p \geq 2$ and $L = pe$),

$$\bm{z} \coloneqq \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle\bm{g},\bm{x}\right\rangle + L\left(\left\lVert\bm{x}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p}\right),$$

in the sense that we output $\bm{x}$ for which,

$$\max\left\{\left\lVert\bm{x}-\bm{z}\right\rVert_{\mathbf{M}},\ \left\lVert\mathbf{M}^{-1}\bm{g} + 2L\left(\mathbf{M}^{-1}\nabla^{2}f(\bm{q})(\bm{x}-\bm{q}) + C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}(\bm{x}-\bm{q})\right)\right\rVert_{\mathbf{M}}\right\} \leq \alpha.$$

The algorithm takes $p^{O(1)}\log\left(\frac{pd\cdot f(\bm{q})}{\alpha}\right)$ linear-system-solves in matrices of the form $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$ for block-diagonal $\mathbf{B}$, where each block in $\mathbf{B}$ has size $\left\lvert S_{i}\right\rvert\times\left\lvert S_{i}\right\rvert$.

###### Proof of [Lemma 7.17](https://arxiv.org/html/2607.00252#S7.Thmtheorem17).

This proof is lengthy, and splitting it into lemmas would disrupt the intended reading flow\. So we break it up into several key components here\.

Motivation for the lemma. First, let us see why this lemma is useful. In each iteration of [Algorithm 4](https://arxiv.org/html/2607.00252#alg4), which in turn calls [Algorithm 2](https://arxiv.org/html/2607.00252#alg2), the main primitive is computing

$$\begin{aligned}
\widetilde{\bm{x}}_{i} &= \underset{\widetilde{\bm{x}}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) + \left\langle\nabla f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}),\widetilde{\bm{x}}-\widetilde{\bm{x}}_{i-1}\right\rangle + pe\,D_{h_{\bm{q}_{t}}}(\widetilde{\bm{x}},\widetilde{\bm{x}}_{i-1}),\\
&= \underset{\widetilde{\bm{x}}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) + \left\langle\nabla f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}),\widetilde{\bm{x}}-\widetilde{\bm{x}}_{i-1}\right\rangle + pe\left(h_{\bm{q}_{t}}(\widetilde{\bm{x}}) - h_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) - \left\langle\nabla h_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}),\widetilde{\bm{x}}-\widetilde{\bm{x}}_{i-1}\right\rangle\right),\\
&= \underset{\widetilde{\bm{x}}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) - pe\,h_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) + \left\langle\nabla f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) - pe\nabla h_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}),\widetilde{\bm{x}}-\widetilde{\bm{x}}_{i-1}\right\rangle + pe\,h_{\bm{q}_{t}}(\widetilde{\bm{x}}),\\
&= \underset{\widetilde{\bm{x}}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle\nabla f_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}) - pe\nabla h_{\bm{q}_{t}}(\widetilde{\bm{x}}_{i-1}),\widetilde{\bm{x}}\right\rangle + pe\,h_{\bm{q}_{t}}(\widetilde{\bm{x}}).
\end{aligned}$$

Observe that the subproblem is of the form

$$\begin{aligned}
\bm{z} &= \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle\bm{g},\bm{x}\right\rangle + pe\,h_{\bm{q}}(\bm{x}),\\
&= \underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle\bm{g},\bm{x}\right\rangle + pe\left(\left\lVert\bm{x}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p}\right),
\end{aligned} \tag{7.11}$$

and so our goal is to show how to solve these types of problems.

The general algorithm. Consider solving the related subproblem (instead of ([7.11](https://arxiv.org/html/2607.00252#S7.E11))),

$$\underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle\bm{g},\bm{x}\right\rangle + L\left(\left\lVert\bm{x}-\bm{q}\right\rVert_{\nabla^{2}f(\bm{q})}^{2} + C_{p}\tau\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{2}\right)$$

for some fixed $\tau \geq 0$. This is a quadratic problem, and we can therefore solve it in $1$ linear-system-solve. It is easy to check that at optimality, we have

$$\bm{g} + 2pe\left(\nabla^{2}f(\bm{q})(\bm{x}-\bm{q}) + C_{p}\tau\mathbf{M}(\bm{x}-\bm{q})\right) = 0,$$

which rearranges to

$$\bm{x}-\bm{q} = -\frac{1}{2pe}\left(\nabla^{2}f(\bm{q}) + C_{p}\tau\mathbf{M}\right)^{-1}\bm{g}.$$

(Recall that $\nabla^{2}f(\bm{q}) = \mathbf{A}^{\top}\mathbf{B}_{1}\mathbf{A}$ for block-diagonal $\mathbf{B}_{1}$ and, by construction, $\mathbf{M} = \mathbf{A}^{\top}\mathbf{W}^{1-\frac{2}{p}}\mathbf{A}$ where $\mathbf{W}$ consists of the block Lewis weights on the diagonal. Thus, $\nabla^{2}f(\bm{q}) + C_{p}\tau\mathbf{M} = \mathbf{A}^{\top}\mathbf{B}_{2}\mathbf{A}$ for block-diagonal $\mathbf{B}_{2}$.)
Note that at optimality for our original subproblem \([7\.11](https://arxiv.org/html/2607.00252#S7.E11)\), we haveτ⋆:=\\lVert​𝒛−𝒒​\\rVert𝐌p−2\\tau^\{\\star\}:=\\left\\lVert\\bm\{z\}\-\\bm\{q\}\\right\\rVert\_\{\\mathbf\{M\}\}^\{p\-2\}where𝒛\\bm\{z\}is the solution of subproblem \([7\.11](https://arxiv.org/html/2607.00252#S7.E11)\)\. Also note that\\lVert​𝒙−𝒒​\\rVert𝐌\\left\\lVert\\bm\{x\}\-\\bm\{q\}\\right\\rVert\_\{\\mathbf\{M\}\}is a decreasing function inτ\\taubecause,

$$\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{2}=\frac{1}{4p^{2}e^{2}}\left\lVert\bm{g}\right\rVert^{2}_{\left(\nabla^{2}f(\bm{q})+C_{p}\tau\mathbf{M}\right)^{-1}\mathbf{M}\left(\nabla^{2}f(\bm{q})+C_{p}\tau\mathbf{M}\right)^{-1}},$$

and for $\tau_{1}\leq\tau_{2}$,

$$\left(\nabla^{2}f(\bm{q})+C_{p}\tau_{1}\mathbf{M}\right)^{-1}\mathbf{M}\left(\nabla^{2}f(\bm{q})+C_{p}\tau_{1}\mathbf{M}\right)^{-1}\succeq\left(\nabla^{2}f(\bm{q})+C_{p}\tau_{2}\mathbf{M}\right)^{-1}\mathbf{M}\left(\nabla^{2}f(\bm{q})+C_{p}\tau_{2}\mathbf{M}\right)^{-1}.$$

We therefore see that if $\tau>\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}$, where $\bm{x}$ is the optimal solution for a fixed $\tau$, then we are over-regularizing and need to decrease $\tau$, and vice-versa. This means we can binary search for the appropriate value of $\tau$. To execute this, we first need to establish the accuracy up to which we have to identify $\tau$.
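The binary search described above can be sketched numerically. The following is a minimal illustration, assuming dense positive definite stand-ins `H` (for $\nabla^{2}f(\bm{q})$) and `M`; the function names, the toy data, and the choice of the upper end of the search interval (the value of $\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}$ at $\tau=0$, which bounds the fixed point by monotonicity) are assumptions made for this example and are not taken from the paper's algorithms.

```python
import numpy as np

def m_norm(v, M):
    """Return the M-norm sqrt(v^T M v)."""
    return float(np.sqrt(v @ M @ v))

def solve_fixed_tau(g, q, H, M, L, Cp, tau):
    """Minimize <g,x> + L*(||x-q||_H^2 + Cp*tau*||x-q||_M^2) with one linear solve."""
    return q - np.linalg.solve(H + Cp * tau * M, g) / (2.0 * L)

def bisect_tau(g, q, H, M, L, Cp, p, iters=100):
    """Bisect for tau satisfying tau = ||x(tau) - q||_M^(p-2)."""
    lo = 0.0
    # ||x(tau) - q||_M is nonincreasing in tau, so its value at tau = 0
    # gives an upper bound on the fixed point (an assumption of this sketch).
    hi = m_norm(solve_fixed_tau(g, q, H, M, L, Cp, 0.0) - q, M) ** (p - 2)
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        x = solve_fixed_tau(g, q, H, M, L, Cp, tau)
        if tau > m_norm(x - q, M) ** (p - 2):
            hi = tau  # over-regularizing: decrease tau
        else:
            lo = tau  # under-regularizing: increase tau
    tau = 0.5 * (lo + hi)
    return tau, solve_fixed_tau(g, q, H, M, L, Cp, tau)

# Toy instance; every value below is hypothetical.
rng = np.random.default_rng(0)
d, p = 4, 4
B = rng.standard_normal((d, d))
C = rng.standard_normal((d, d))
H = B @ B.T + np.eye(d)  # stands in for the Hessian of f at q
M = C @ C.T + np.eye(d)  # stands in for A^T W^(1-2/p) A
g = rng.standard_normal(d)
q = rng.standard_normal(d)
L, Cp = p * np.e, np.e * p**p
tau_star, x_star = bisect_tau(g, q, H, M, L, Cp, p)
```

Each bisection step costs one linear solve, which is why the number of steps (and hence the accuracy to which $\tau$ must be identified) drives the cost of this routine.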

Convergence in Argument. By [Lemma 7.12](https://arxiv.org/html/2607.00252#S7.Thmtheorem12) (setting $\bm{d}=\bm{x}-\bm{z}$), recall that it is enough to solve sub-problem ([7.11](https://arxiv.org/html/2607.00252#S7.E11)) up to additive accuracy $(p/2)^{p}L\alpha^{p}$ to get $\left\lVert\bm{x}-\bm{z}\right\rVert_{\mathbf{M}}\leq\alpha$. Suppose we find $\tau$ for which $\tau^{\star}\leq\tau\leq\tau^{\star}+\delta$. By writing the objectives and comparing, we see that the $\bm{x}$ we find from using $\tau$ gives us at most a $\delta\cdot d$-suboptimal solution compared to $\bm{z}$. Plugging this into the bound from [Lemma 7.12](https://arxiv.org/html/2607.00252#S7.Thmtheorem12) tells us that we should choose $\delta=(p/2)^{p}L\alpha^{p}/d$, and plugging this into the binary search over $\tau\in[0,d^{p}(1+f(\bm{q}))]$ gives us $p^{O(1)}\log\left(\frac{pd\cdot f(\bm{q})}{\alpha}\right)$ steps, as needed.
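To make the step count concrete, here is a small sketch that counts the halvings needed to shrink the search interval $[0,d^{p}(1+f(\bm{q}))]$ to width $\delta=(p/2)^{p}L\alpha^{p}/d$. The helper names and the sample parameter values are hypothetical, and the count ignores the constants hidden in the $p^{O(1)}$ factor.

```python
import math

def bisection_steps(width, delta):
    """Number of halvings needed to shrink an interval of length `width` to at most `delta`."""
    return max(0, math.ceil(math.log2(width / delta)))

def tau_search_steps(p, d, L, alpha, f_q):
    """Halvings for the search over tau, using the interval and accuracy from the text."""
    width = d**p * (1.0 + f_q)                 # interval [0, d^p (1 + f(q))]
    delta = (p / 2.0) ** p * L * alpha**p / d  # target accuracy for tau
    return bisection_steps(width, delta)

# Hypothetical parameter values, chosen only for illustration.
steps = tau_search_steps(p=4, d=10, L=4 * math.e, alpha=1e-3, f_q=5.0)
```

The count grows like $p\log(d/\alpha)+\log(1+f(\bm{q}))$, which matches the logarithmic dependence on $d$, $f(\bm{q})$, and $1/\alpha$ in the stated bound.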

First-order stationary point. We first claim that it is enough to get

$$\left\lVert\mathbf{M}^{-1}\nabla h_{\bm{q}}(\bm{x})-\mathbf{M}^{-1}\nabla h_{\bm{q}}(\bm{z})\right\rVert_{\mathbf{M}}\leq\frac{\alpha}{L}.$$

Indeed, let $\bm{z}$ be the optimal solution for the subproblem. This means that it must satisfy the first order stationary condition, namely,

$$\bm{g}+L\nabla h_{\bm{q}}(\bm{z})=0.$$

Multiplying both sides by $\mathbf{M}^{-1}$, subtracting, and dividing both sides by $L$ gives us the expression we are interested in.

Writing first order stationary conditions gives both

$$\begin{aligned}\bm{g}+2L\left(\nabla^{2}f(\bm{q})(\bm{x}-\bm{q})+C_{p}\tau\mathbf{M}(\bm{x}-\bm{q})\right)&=0\\ \bm{g}+2L\left(\nabla^{2}f(\bm{q})(\bm{z}-\bm{q})+C_{p}\tau^{\star}\mathbf{M}(\bm{z}-\bm{q})\right)&=0.\end{aligned}$$

Multiplying both sides of both equalities by $\mathbf{M}^{-1}$ and subtracting these gives

$$2L\left(\mathbf{M}^{-1}\nabla^{2}f(\bm{q})(\bm{x}-\bm{z})+C_{p}\left(\tau(\bm{x}-\bm{q})-\tau^{\star}(\bm{z}-\bm{q})\right)\right)=0.$$

Expanding out $L(\mathbf{M}^{-1}\nabla h_{\bm{q}}(\bm{x})-\mathbf{M}^{-1}\nabla h_{\bm{q}}(\bm{z}))$ and subtracting the above gives the desired condition

$$2L\left\lvert\tau-\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\right\rvert\cdot\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}\overset{?}{\leq}\alpha.$$

Next, let us run the binary search from above so that we get argument convergence, i.e. $\left\lVert\bm{x}-\bm{z}\right\rVert_{\mathbf{M}}\leq\alpha^{C}\ll 0.1\alpha$ for some constant $C$. Using the fact that the approximate mirror descent step using $\bm{z}$ decreases the objective value ([Lemma 4.4](https://arxiv.org/html/2607.00252#S4.Thmtheorem4)), observe that

$$\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}\leq\left\lVert\bm{z}-\bm{q}\right\rVert_{\mathbf{M}}+\left\lVert\bm{x}-\bm{z}\right\rVert_{\mathbf{M}}\leq\left\lVert\bm{q}-\bm{z}\right\rVert_{\mathbf{M}}+0.1\alpha\lesssim\sqrt{d}(1+f(\bm{q})).$$

It then follows that binary searching $\tau$ to additive accuracy $\alpha(\sqrt{d}(1+f(\bm{q})))^{-1}/L$ is sufficient. By the same argument as above, this takes $p^{O(1)}\log\left(\frac{pd\cdot f(\bm{q}_{t})}{\alpha}\right)$ steps, completing the proof of [Lemma 7.17](https://arxiv.org/html/2607.00252#S7.Thmtheorem17). ∎

We now combine [Lemma 7.17](https://arxiv.org/html/2607.00252#S7.Thmtheorem17) with [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1) and [Algorithm 2](https://arxiv.org/html/2607.00252#alg2) to obtain approximate argument optimality for each proximal subproblem.

###### Lemma 7\.18\.

Let $\gamma>0$ and $\bm{x}_{\bm{q}}\coloneqq\underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ f_{\bm{q}}(\bm{x})$. There exists an algorithm that returns $\bm{x}$ for which

$$\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}\leq\gamma.$$

The algorithm takes at most $O\left(p^{O(1)}\log\left(ph_{\bm{q}}(\bm{x}_{\bm{q}})\left(\frac{4}{p\gamma}\right)^{p}\right)\right)$ iterations of solving subproblems of the form $\underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\langle\bm{g},\bm{x}\right\rangle+eph_{\bm{q}}(\bm{x})$ for fixed vectors $\bm{g}$ and $\bm{q}$.

###### Proof of [Lemma 7.18](https://arxiv.org/html/2607.00252#S7.Thmtheorem18).

This proof resembles [jls21, Lemma 4.5], which uses an exact version of mirror descent arising from [lfn18]. The main difference between our argument and that of [jls21, Lemma 4.5] is that we rigorously identify a concrete upper bound on the complexity needed to satisfy the MS condition and argue that the mirror descent algorithm can handle the inexact Bregman proximal problem solves.

First, we use [Lemma 7.11](https://arxiv.org/html/2607.00252#S7.Thmtheorem11) on the approximate solution $\bm{x}$ and true solution $\bm{x}_{\bm{q}}$ and get,

$$\begin{aligned}f_{\bm{q}}(\bm{x})&\geq f_{\bm{q}}(\bm{x}_{\bm{q}})+\frac{4}{2^{p}}\left(\left\lVert\mathbf{A}(\bm{x}-\bm{x}_{\bm{q}})\right\rVert_{\mathcal{G}_{p}}^{p}+C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{x}\right\rVert_{\mathbf{M}}^{p}\right),\\ &\geq f_{\bm{q}}(\bm{x}_{\bm{q}})+\frac{4C_{p}}{2^{p}}\left\lVert\bm{x}_{\bm{q}}-\bm{x}\right\rVert_{\mathbf{M}}^{p}.\end{aligned}$$

Rearranging, we get

$$\begin{aligned}\left\lVert\bm{x}_{\bm{q}}-\bm{x}\right\rVert_{\mathbf{M}}&\leq\left(\frac{2^{p}}{4C_{p}}\right)^{1/p}\left(f_{\bm{q}}(\bm{x})-f_{\bm{q}}(\bm{x}_{\bm{q}})\right)^{1/p},\\ &=\left(\frac{2^{p}}{4ep^{p}}\right)^{1/p}\left(f_{\bm{q}}(\bm{x})-f_{\bm{q}}(\bm{x}_{\bm{q}})\right)^{1/p},\\ &<\frac{2}{p}\left(f_{\bm{q}}(\bm{x})-f_{\bm{q}}(\bm{x}_{\bm{q}})\right)^{1/p}.\end{aligned}$$

Using the notation from [lfn18], for convex $h\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$, let

$$D_{h}(\bm{x},\bm{y})\coloneqq h(\bm{x})-h(\bm{y})-\left\langle\nabla h(\bm{y}),\bm{x}-\bm{y}\right\rangle.$$

Recall the conclusion of [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8): we have for $\mu=1/(2pe)$ and $L=pe$ that

$$\mu\nabla^{2}h_{\bm{q}}(\bm{x})\preceq\nabla^{2}f_{\bm{q}}(\bm{x})\preceq L\nabla^{2}h_{\bm{q}}(\bm{x}).$$

By [Theorem 4.1](https://arxiv.org/html/2607.00252#S4.Thmtheorem1) and [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8), using the same notation from [Lemma 7.8](https://arxiv.org/html/2607.00252#S7.Thmtheorem8), we have for all iterations $t$ of [Algorithm 2](https://arxiv.org/html/2607.00252#alg2) (with $f=f_{\bm{q}}$ and $h=h_{\bm{q}}$) that,

$$\begin{aligned}f_{\bm{q}}(\bm{x}_{t})-f_{\bm{q}}(\bm{x}_{\bm{q}})&\leq L\left(1-\frac{\mu}{L}\right)^{t}D_{h_{\bm{q}}}(\bm{x}_{\bm{q}},\bm{q})+\max_{1\leq i\leq t}\left\langle\triangle_{i},\bm{x}_{t}-\bm{x}_{\bm{q}}\right\rangle,\\ &=2L\left(1-\frac{\mu}{L}\right)^{t}h_{\bm{q}}(\bm{x}_{\bm{q}})+\max_{1\leq i\leq t}\left\langle\triangle_{i},\bm{x}_{t}-\bm{x}_{\bm{q}}\right\rangle.\end{aligned}$$

Hence, for $t\geq\frac{L}{\mu}\log\left(Lh_{\bm{q}}(\bm{x}_{\bm{q}})\left(\frac{4}{p\gamma}\right)^{p}\right)$, it is easy to check that for $p\geq 2$,

$$\begin{aligned}f_{\bm{q}}(\bm{x}_{t})-f_{\bm{q}}(\bm{x}_{\bm{q}})&\leq 2L\left(\frac{1}{e}\right)^{\log\left(Lh_{\bm{q}}(\bm{x}_{\bm{q}})\left(\frac{4}{p\gamma}\right)^{p}\right)}h_{\bm{q}}(\bm{x}_{\bm{q}})+\max_{1\leq i\leq t}\left\langle\triangle_{i},\bm{x}_{t}-\bm{x}_{\bm{q}}\right\rangle,\\ &=2\left(\frac{p\gamma}{4}\right)^{p}+\max_{1\leq i\leq t}\left\langle\triangle_{i},\bm{x}_{t}-\bm{x}_{\bm{q}}\right\rangle,\\ &\leq\left(\frac{p\gamma}{2}\right)^{p}+\max_{1\leq i\leq t}\left\langle\triangle_{i},\bm{x}_{t}-\bm{x}_{\bm{q}}\right\rangle,\end{aligned}$$

and combining this with [Lemma 7.17](https://arxiv.org/html/2607.00252#S7.Thmtheorem17) to make the error term on the order of our accuracy, we get $\left\lVert\bm{x}_{\bm{q}}-\bm{x}\right\rVert_{\mathbf{M}}\lesssim\gamma$. We thus conclude the proof of [Lemma 7.18](https://arxiv.org/html/2607.00252#S7.Thmtheorem18). ∎
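The arithmetic behind the choice of $t$ can be checked numerically. The sketch below evaluates only the geometric term $2L(1-\mu/L)^{t}h_{\bm{q}}(\bm{x}_{\bm{q}})$ and ignores the inexactness term involving $\triangle_{i}$. The function names and the sample values of $p$, $\gamma$, and $h_{\bm{q}}(\bm{x}_{\bm{q}})$ are hypothetical.

```python
import math

def iterations_needed(p, gamma, h_val):
    """Smallest integer t with t >= (L/mu) * log(L * h * (4/(p*gamma))^p), for L = pe, mu = 1/(2pe)."""
    L = p * math.e
    mu = 1.0 / (2.0 * p * math.e)
    return math.ceil((L / mu) * math.log(L * h_val * (4.0 / (p * gamma)) ** p))

def contraction_term(p, t, h_val):
    """Geometric term 2L (1 - mu/L)^t h from the convergence bound."""
    L = p * math.e
    mu = 1.0 / (2.0 * p * math.e)
    return 2.0 * L * (1.0 - mu / L) ** t * h_val

# Hypothetical values: p = 4, target accuracy gamma = 0.1, h_q(x_q) = 3.
p, gamma, h_val = 4, 0.1, 3.0
t = iterations_needed(p, gamma, h_val)
bound = contraction_term(p, t, h_val)
```

The check relies on $(1-\mu/L)^{t}\leq e^{-t\mu/L}$ and on $2/4^{p}\leq 1/2^{p}$ for $p\geq 2$, which are the two facts used implicitly in the displayed chain.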

The last step is to use our proximal problem solver to build a valid MS oracle\.

###### Lemma 7\.19\.

In the context of [Algorithm 3](https://arxiv.org/html/2607.00252#alg3), there exists an algorithm $(\widetilde{\bm{x}}_{t+1},\lambda_{t+1})=\mathcal{O}_{\mathsf{prox}}(\bm{q}_{t})$ that approximately solves

$$\underset{\widetilde{\bm{x}}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ f(\widetilde{\bm{x}})+ep^{p}\left\lVert\widetilde{\bm{x}}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p}$$

using $O\left(p^{O(1)}\log\left(\frac{pd\cdot f(\bm{x}_{t})}{\varepsilon}\right)\right)$ linear-system-solves in $\mathbf{A}^{\top}\mathbf{B}\mathbf{A}$, in the sense that

$$\left\lVert\frac{1}{ep^{p+1}\left\lVert\widetilde{\bm{x}}_{t+1}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p-2}}\mathbf{M}^{-1}\nabla f(\widetilde{\bm{x}}_{t+1})+(\widetilde{\bm{x}}_{t+1}-\bm{q}_{t})\right\rVert_{\mathbf{M}}\leq\frac{1}{2}\left\lVert\widetilde{\bm{x}}_{t+1}-\bm{q}_{t}\right\rVert_{\mathbf{M}}.$$

###### Proof of [Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19).

The point of this proof is to give an analysis of [Algorithm 4](https://arxiv.org/html/2607.00252#alg4).

For notational simplicity, let $\bm{x}=\widetilde{\bm{x}}_{t+1}$ and $\lambda=\lambda_{t+1}$. We will reintroduce the indices when it is essential to clarify the iterations we are discussing.

First, it is helpful to see why the stated notion of approximation is useful. Let $C_{p}\coloneqq ep^{p}$. Observe that at exact optimality, we have

$$\nabla f(\bm{x}_{\bm{q}})+\underbrace{ep^{p+1}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}_{\lambda^{\star}}\mathbf{M}(\bm{x}_{\bm{q}}-\bm{q})=0.\tag{7.12}$$

This motivates the approximation in our lemma statement, with us asking for a $\frac{1}{2}$-approximate MS oracle ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)) for $f$. This also tells us that at optimality in ([7.12](https://arxiv.org/html/2607.00252#S7.E12)), we have,

$$\begin{aligned}&\nabla f(\bm{x}_{\bm{q}})+ep^{p+1}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M}(\bm{x}_{\bm{q}}-\bm{q})=0,\\ \Leftrightarrow\ &\mathbf{M}^{-1/2}\nabla f(\bm{x}_{\bm{q}})=-pC_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\mathbf{M}^{1/2}(\bm{x}_{\bm{q}}-\bm{q}),\\ \Rightarrow\ &\left\lVert\mathbf{M}^{-1/2}\nabla f(\bm{x}_{\bm{q}})\right\rVert_{2}=pC_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}\left\lVert\mathbf{M}^{1/2}(\bm{x}_{\bm{q}}-\bm{q})\right\rVert_{2},\\ \Leftrightarrow\ &\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}=\left(\frac{\left\lVert\mathbf{M}^{-1}\nabla f(\bm{x}_{\bm{q}})\right\rVert_{\mathbf{M}}}{pC_{p}}\right)^{\frac{1}{p-1}}.\end{aligned}$$

We now break up our analysis into two cases. In the first, suppose that $\left\lVert\mathbf{M}^{-1}\nabla f(\bm{x}_{\bm{q}})\right\rVert_{\mathbf{M}}\leq\varepsilon/\left\lVert\bm{x}_{\bm{q}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}$. Then, by convexity, we have

$$f(\bm{x}_{\bm{q}})-f(\bm{x}^{\star})\leq\left\langle\nabla f(\bm{x}_{\bm{q}}),\bm{x}_{\bm{q}}-\bm{x}^{\star}\right\rangle\leq\left\lVert\mathbf{M}^{-1}\nabla f(\bm{x}_{\bm{q}})\right\rVert_{\mathbf{M}}\left\lVert\bm{x}_{\bm{q}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}\leq\varepsilon.$$

Hence, for the rest of the proof, assume that $\left\lVert\mathbf{M}^{-1}\nabla f(\bm{x}_{\bm{q}})\right\rVert_{\mathbf{M}}\geq\varepsilon/\left\lVert\bm{x}_{\bm{q}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}$ (because if this is not the case, in the algorithm we can simply check whether the MS condition is satisfied; if not, then we know this assumption was violated and we are done anyway). We run the algorithm implied by [Lemma 7.18](https://arxiv.org/html/2607.00252#S7.Thmtheorem18) and obtain an approximate solution $\bm{x}$ for which

$$\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}\leq\alpha\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}\text{ for }\alpha=\frac{1}{5}\min\left\{\frac{C_{p}}{ep(p-1)}\left(\frac{\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}}{f(\bm{q})^{\frac{1}{p}}}\right)^{p-2},1\right\}.\tag{7.13}$$

Since $\alpha<1$, the guarantee in ([7.13](https://arxiv.org/html/2607.00252#S7.E13)) gives us,

$$\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}\leq\alpha\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}\leq\frac{\alpha}{1-\alpha}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\tag{7.14}$$

and further applying triangle inequality gives us

$$\begin{aligned}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}&\leq\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}+\left\lVert\bm{x}_{\bm{q}}-\bm{x}\right\rVert_{\mathbf{M}},\\ &\leq\frac{1-\alpha}{1-\alpha}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}+\frac{\alpha}{1-\alpha}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\\ &\leq\frac{1}{1-\alpha}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}.\end{aligned}\tag{7.15}$$

Hence, we get

$$\begin{aligned}\frac{ep(p-1)f(\bm{q})^{1-\frac{2}{p}}}{C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}\cdot\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}&=\frac{ep(p-1)}{C_{p}}\cdot\left(\frac{f(\bm{q})^{\frac{1}{p}}}{\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}}\right)^{p-2}\cdot\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}},\\ &\overset{(7.13)}{\leq}\frac{1}{5}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}},\\ &\overset{(7.15)}{\leq}\frac{1}{5}\cdot\frac{1}{1-\alpha}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\\ &\leq\frac{1}{4}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\end{aligned}\tag{7.16}$$

where in the last inequality, we used that $\alpha\leq\frac{1}{5}$ due to our choice in ([7.13](https://arxiv.org/html/2607.00252#S7.E13)). We now call [Lemma 7.14](https://arxiv.org/html/2607.00252#S7.Thmtheorem14), divide both sides by $\lambda$, and get

$$\begin{aligned}&\left\lVert\frac{1}{ep^{p+1}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}\mathbf{M}^{-1}\nabla f(\bm{x})+(\bm{x}-\bm{q})\right\rVert_{\mathbf{M}}\\ &\quad\overset{(\text{Lemma 7.14})}{\leq}ep(p-1)\left(\frac{f(\bm{q})^{1-\frac{2}{p}}}{C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}+\max\left\{1,\left(\frac{\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}}{\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}}\right)^{p-2}\right\}\right)\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}},\\ &\quad\overset{(7.15)}{\leq}ep(p-1)\left(\frac{f(\bm{q})^{1-\frac{2}{p}}}{C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}+\frac{1}{(1-\alpha)^{p-2}}\right)\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}},\\ &\quad\overset{(7.14)}{\leq}\frac{ep(p-1)f(\bm{q})^{1-\frac{2}{p}}}{C_{p}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}\cdot\left\lVert\bm{x}-\bm{x}_{\bm{q}}\right\rVert_{\mathbf{M}}+\frac{ep(p-1)\alpha}{(1-\alpha)^{p-1}}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\\ &\quad\overset{(7.15),\ (7.13)}{\leq}\frac{1}{4}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}+\frac{ep(p-1)5^{p-2}}{4^{p-1}}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\\ &\quad\leq\frac{1}{2}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}},\end{aligned}$$

giving us the approximation guarantee.
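As an illustration of the approximation guarantee, the following sketch evaluates both sides of the $\frac{1}{2}$-approximate condition from the lemma statement at a candidate point. It is a toy check under loud assumptions: $f$ is replaced by a plain least squares objective, $p=2$ is used so that the exact proximal point is available from a single linear solve, and the names `ms_sides`, `grad_f`, and all data are hypothetical rather than part of the paper's algorithms. The helper also assumes $\bm{x}\neq\bm{q}$.

```python
import numpy as np

def ms_sides(x, q, M, grad_f_x, p):
    """Return (lhs, rhs) of the 1/2-approximate condition; assumes x != q."""
    diff = x - q
    dist = float(np.sqrt(diff @ M @ diff))         # ||x - q||_M
    lam = np.e * p ** (p + 1) * dist ** (p - 2)    # e p^(p+1) ||x - q||_M^(p-2)
    v = np.linalg.solve(M, grad_f_x) / lam + diff  # M^{-1} grad f(x) / lam + (x - q)
    lhs = float(np.sqrt(v @ M @ v))
    return lhs, 0.5 * dist

# Toy setup (hypothetical): f(x) = 0.5 * ||A x - b||_2^2 and p = 2.
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)
C = rng.standard_normal((3, 3))
M = C @ C.T + np.eye(3)
q = rng.standard_normal(3)
p = 2

def grad_f(x):
    """Gradient of the toy least squares objective."""
    return A.T @ (A @ x - b)

# For p = 2 the proximal objective is f(x) + 4e ||x - q||_M^2, a quadratic,
# so its minimizer solves (A^T A + 8e M) x = A^T b + 8e M q.
x_exact = np.linalg.solve(A.T @ A + 8 * np.e * M, A.T @ b + 8 * np.e * M @ q)
lhs_exact, rhs_exact = ms_sides(x_exact, q, M, grad_f(x_exact), p)

# A point very close to q generally violates the condition when grad f(q) != 0.
x_bad = q + 1e-6 * np.ones(3)
lhs_bad, rhs_bad = ms_sides(x_bad, q, M, grad_f(x_bad), p)
```

At the exact proximal point the left-hand side vanishes up to rounding, so the condition holds with room to spare; the slack of $\frac{1}{2}\left\lVert\bm{x}-\bm{q}\right\rVert_{\mathbf{M}}$ is what allows inexact solves.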

It remains to understand the complexity of solving the proximal subproblem to the accuracy required in ([7.13](https://arxiv.org/html/2607.00252#S7.E13)). Plugging in $\gamma=\alpha\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}$ into [Lemma 7.18](https://arxiv.org/html/2607.00252#S7.Thmtheorem18) and using our bound on $h_{\bm{q}}(\bm{x}_{\bm{q}})$ from [Lemma 7.16](https://arxiv.org/html/2607.00252#S7.Thmtheorem16) gives an iteration complexity of (ignoring the constant in front of the big-$O$)

$$\begin{aligned}&p^{O(1)}\log\left(ph_{\bm{q}}(\bm{x}_{\bm{q}})\left(\frac{2}{p\alpha\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}}\right)^{p}\right)\\ &\quad\leq p^{O(1)}\log\left(p\left(p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{2}+C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p}\right)\left(\frac{2}{p\alpha\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}}\right)^{p}\right)\\ &\quad=p^{O(1)}\log\left(\left(\frac{2}{p}\right)^{p}p\left(\frac{p(p-1)f(\bm{q})^{1-\frac{2}{p}}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{2}+C_{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p}}{\alpha^{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p}}\right)\right)\\ &\quad=p^{O(1)}\log\left(\left(\frac{2}{p}\right)^{p}p\left(\frac{p(p-1)f(\bm{q})^{1-\frac{2}{p}}}{\alpha^{p}\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}^{p-2}}+\frac{C_{p}}{\alpha^{p}}\right)\right)\end{aligned}$$

We have two cases to analyze for the value of $\alpha$. In the first, suppose we get $\alpha=\frac{1}{5}$. By the definition of $\alpha$, this means we have

$$\frac{C_{p}}{ep(p-1)}\left(\frac{\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}}{f(\bm{q})^{\frac{1}{p}}}\right)^{p-2}\geq 1,$$

which means the complexity we get is $p^{O(1)}\log p$. We now handle the other case, i.e., $\alpha=\frac{C_{p}}{5ep(p-1)}\left(\frac{\left\lVert\bm{x}_{\bm{q}}-\bm{q}\right\rVert_{\mathbf{M}}}{f(\bm{q})^{\frac{1}{p}}}\right)^{p-2}$. Here, it will be useful to keep track of the timestep $t$ that we are working with. Recall that

$$\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p}=\left(\frac{\left\lVert\mathbf{M}^{-1}\nabla f(\bm{x}_{\bm{q}_{t}})\right\rVert_{\mathbf{M}}}{pC_{p}}\right)^{\frac{p}{p-1}}\geq\left(\frac{\varepsilon}{pC_{p}\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}}\right)^{\frac{p}{p-1}},\tag{7.17}$$

so the complexity we want to control is given by

$$\begin{aligned}&p^{O(1)}\log\left(\left(\frac{2}{p}\right)^{p}p\left(\frac{2f(\bm{q}_{t})}{\alpha^{p}\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p}}\right)\right)\\ &\qquad\overset{(7.13)}{=}p^{O(1)}\log\left(\left(\frac{2}{p}\right)^{p}p\left(\frac{2\left(5ep(p-1)\right)^{p}f(\bm{q}_{t})^{p-1}}{C_{p}^{p}\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p(p-2)}\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p}}\right)\right),\\ &\qquad\lesssim p^{O(1)}\log\left(p\left(\frac{2\left(10(p-1)\right)^{p}f(\bm{q}_{t})^{p-1}}{p^{p^{2}}\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p(p-1)}}\right)\right),\\ &\qquad\overset{(7.17)}{\leq}p^{O(1)}\log\left(p\left(\frac{2\left(10e(p-1)\right)^{p}p^{p(p+1)}f(\bm{q}_{t})^{p-1}}{p^{p^{2}}\varepsilon^{p}}\right)\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}^{p}\right),\\ &\qquad\overset{(7.17)}{=}p^{O(1)}\log\left(\left(\frac{2\left(10e(p-1)\right)^{p}p^{p+1}f(\bm{q}_{t})^{p-1}}{\varepsilon^{p}}\right)\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}^{p}\right),\\ &\qquad\lesssim p^{O(1)}\log\left(\frac{pf(\bm{q}_{t})\left\lVert\bm{x}_{\bm{q}_{t}}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}}{\varepsilon}\right),\\ &\qquad\overset{(\text{Lemma 7.15})}{\lesssim}p^{O(1)}\log\left(\frac{pf(\bm{q}_{t})df(\bm{x}_{t})}{\varepsilon}\right),\\ &\qquad\overset{(\text{Lemma 7.7})}{\lesssim}p^{O(1)}\log\left(\frac{pf(\bm{x}_{t})}{\varepsilon}\right),\end{aligned}$$

completing the proof of [Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19). ∎

### 7.4 The Algorithm

We are now ready to combine the results from the previous two subsections to build our algorithm for $\mathcal{G}_{p}$-regression and prove [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2). The main algorithmic object here is [Algorithm 5](https://arxiv.org/html/2607.00252#alg5).

Algorithm 5 GpRegression: Optimizes ([1.4](https://arxiv.org/html/2607.00252#S1.E4)) up to $(1+\varepsilon)$-multiplicative error

1: Input: regression problems $(\mathbf{A}_{S_{1}},\bm{b}_{S_{1}}),\dots,(\mathbf{A}_{S_{m}},\bm{b}_{S_{m}})$, accuracy $\varepsilon>0$

2: Using [mo23, Algorithm 2] with input $\left[\mathbf{A}|\bm{b}\right]$, find nonnegative diagonal $\mathbf{W}$ such that for all $\bm{x}\in\mathbb{R}^{d}$ and $c\in\mathbb{R}$,

$$\left\lVert\mathbf{A}\bm{x}-c\bm{b}\right\rVert_{\mathcal{G}_{\infty}}\leq\left\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}-c\mathbf{W}^{1/2}\bm{b}\right\rVert_{2}\leq(2(d+1))^{\frac{1}{2}-\frac{1}{p}}\left\lVert\mathbf{A}\bm{x}-c\bm{b}\right\rVert_{\mathcal{G}_{\infty}}.$$

3: Let $\bm{x}_{0}=\left(\mathbf{A}^{\top}\mathbf{W}^{1-\frac{2}{p}}\mathbf{A}\right)^{-1}\mathbf{A}^{\top}\mathbf{W}^{1-\frac{2}{p}}\bm{b}$. $\triangleright$ $\bm{x}_{0}\coloneqq\underset{\bm{x}\in\mathbb{R}^{d}}{\mathrm{argmin}}\ \left\lVert\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\mathbf{A}\bm{x}-\mathbf{W}^{\frac{1}{2}-\frac{1}{p}}\bm{b}\right\rVert_{2}$.

4: Using [Algorithm 4](https://arxiv.org/html/2607.00252#alg4) and [Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19), implement a $\frac{1}{2}$-MS oracle for $f$ ([Definition 5.1](https://arxiv.org/html/2607.00252#S5.Thmtheorem1)).

5: Run [Algorithm 3](https://arxiv.org/html/2607.00252#alg3) with the oracle from the previous line and with $\bm{x}_{0}$ as the initialization for $O\left(\mathsf{poly}(p)\min\left\{\mathsf{rank}\left(\mathbf{A}\right),m\right\}^{\frac{p-2}{3p-2}}\log\left(\frac{d}{\varepsilon}\right)^{3}\right)$ iterations.

6: return $\widehat{\bm{x}}$, the output of the previous step.
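
As an illustration of step 3 of Algorithm 5, the following is a minimal sketch of the weighted least-squares initialization. It assumes the diagonal of $\mathbf{W}$ is already available as a per-row vector `w`; computing block Lewis weights themselves is not shown, and the function name and the random stand-in weights are hypothetical.

```python
import numpy as np

def weighted_ls_init(A, b, w, p):
    """Return argmin_x || W^(1/2 - 1/p) (A x - b) ||_2 for W = diag(w)."""
    # Scaling rows by w^(1/2 - 1/p) yields W^(1 - 2/p) in the normal equations.
    scale = np.asarray(w, dtype=float) ** (0.5 - 1.0 / p)
    x0, *_ = np.linalg.lstsq(scale[:, None] * A, scale * b, rcond=None)
    return x0

rng = np.random.default_rng(0)
A = rng.standard_normal((12, 3))
b = rng.standard_normal(12)
w = rng.uniform(0.5, 2.0, size=12)  # hypothetical stand-in for the Lewis weights
x0 = weighted_ls_init(A, b, w, p=4)
```

Solving the scaled least-squares problem directly, rather than forming $\mathbf{A}^{\top}\mathbf{W}^{1-2/p}\mathbf{A}$, is a numerical convenience of this sketch; both give the same $\bm{x}_{0}$ when $\mathbf{A}$ has full column rank.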

###### Proof of [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2).

The stationarity condition of the proximal problem suggests choosing $\lambda_{t+1}=ep^{p+1}\left\lVert\widetilde{\bm{x}}_{t+1}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p-2}$.

It is easy to check that

$$\left\lVert\widetilde{\bm{x}}_{t+1}-\bm{q}_{t}\right\rVert_{\mathbf{M}}=\left(\frac{ep^{p+1}\left\lVert\widetilde{\bm{x}}_{t+1}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p-2}}{\left((ep^{p+1})^{\frac{1}{p-1}}\right)^{p-1}}\right)^{\frac{1}{(p-1)-1}},$$

and therefore the triple $\left(\widetilde{\bm{x}}_{t+1},\bm{q}_{t},ep^{p+1}\left\lVert\widetilde{\bm{x}}_{t+1}-\bm{q}_{t}\right\rVert_{\mathbf{M}}^{p-2}\right)$ always satisfies a $\left(p-1,(ep^{p+1})^{1/(p-1)}\right)$-movement bound ([Definition 5.2](https://arxiv.org/html/2607.00252#S5.Thmtheorem2)).
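
To see why the identity holds, it may help to simplify the denominator first. The shorthands $r$ and $c$ below are introduced only for this check and are not the paper's notation for this step.

```latex
% Shorthand (assumed for illustration): r = \lVert \widetilde{x}_{t+1} - q_t \rVert_M,
% c = (e p^{p+1})^{1/(p-1)}, so that c^{p-1} = e p^{p+1}.
\left(\frac{e p^{p+1}\, r^{p-2}}{c^{\,p-1}}\right)^{\frac{1}{(p-1)-1}}
  = \left(\frac{e p^{p+1}\, r^{p-2}}{e p^{p+1}}\right)^{\frac{1}{p-2}}
  = \left(r^{p-2}\right)^{\frac{1}{p-2}}
  = r .
```

The prefactor in $\lambda_{t+1}$ cancels exactly against $c^{p-1}$, which is why the movement-bound constant is $(ep^{p+1})^{1/(p-1)}$.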

Next, we calculate the number of iterations needed to halve the initial error. For an arbitrary initial iterate $\bm{x}$, let $\delta=0.5\left(f(\bm{x})-f(\bm{x}^{\star})\right)$. By [Lemma 7.2](https://arxiv.org/html/2607.00252#S7.Thmtheorem2), we have

$$\left\lVert\bm{x}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}^{s+1}=\left\lVert\bm{x}-\bm{x}^{\star}\right\rVert_{\mathbf{M}}^{p}\leq 2^{3p/2}d^{p/2-1}\left(f(\bm{x})-f(\bm{x}^{\star})\right),$$

so combining this with the fact that $c^{s}=ep^{p+1}$ and applying [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3) with our proximal solver ([Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19)) yields

$$T_{\mathsf{min}}=\frac{p-1}{3}\left(pC_{p}\cdot 2^{3p/2+1}d^{p/2-1}\right)^{\frac{2}{3p-2}}\lesssim p^{5/3}d^{\frac{p-2}{3p-2}}.$$

Next, we initialize $\bm{x}_{0}\coloneqq\left(\mathbf{A}^{\top}\mathbf{W}^{1-2/p}\mathbf{A}\right)^{-1}\mathbf{A}^{\top}\mathbf{W}^{1-2/p}\bm{b}$. Using [Theorem 3.3](https://arxiv.org/html/2607.00252#S3.Thmtheorem3) and [Theorem 3.4](https://arxiv.org/html/2607.00252#S3.Thmtheorem4), we have

$$f(\bm{x}_{0})\leq(2d)^{p/2-1}f(\bm{x}^{\star}),$$

so reaching an iterate $\bm{x}$ for which $f(\bm{x})-f(\bm{x}^{\star})\leq\varepsilon f(\bm{x}^{\star})$ takes $T_{\mathsf{min}}\cdot\log\left(d^{p/2-1}/\varepsilon\right)=p^{8/3}d^{\frac{p-2}{3p-2}}\log\left(\frac{d}{\varepsilon}\right)$ calls to $\mathcal{O}_{\mathsf{prox}}$.
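
To get a feel for this bound, the following sketch evaluates the exponent $\frac{p-2}{3p-2}$ of $d$ and the resulting call count with all constants dropped. The function names are hypothetical, and the numbers only indicate scaling, not actual iteration counts.

```python
import math

def rate_exponent(p):
    """Exponent of d in T_min, i.e. (p - 2) / (3p - 2)."""
    return (p - 2) / (3 * p - 2)

def prox_calls_scaling(p, d, eps):
    """p^(8/3) * d^((p-2)/(3p-2)) * log(d / eps), with constants dropped."""
    return p ** (8 / 3) * d ** rate_exponent(p) * math.log(d / eps)

# The exponent is 0 at p = 2 (plain least squares) and increases toward 1/3.
for p in (2, 4, 16, 256):
    print(p, round(rate_exponent(p), 4))
```

The exponent tends to $1/3$ as $p\to\infty$, which is consistent with the $1/3$ exponent in the rate stated in the abstract for the robust problem.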

We now resolve the full iteration complexity, including the bootstrapping step to show that $f(\bm{x}_{t})$ is reasonably bounded so that we get an unconditional upper bound from [Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19). At the end of iteration $t$, from (loosely) inverting the bound in [Theorem 5.3](https://arxiv.org/html/2607.00252#S5.Thmtheorem3), we know that

$$f(\bm{x}_{t})-f(\bm{x}^{\star})\leq\frac{(Cp^{3})^{\frac{3p-2}{2}}(2d)^{\frac{p}{2}-1}}{t^{\frac{3p-2}{2}}}.$$

Since $\widetilde{\bm{x}}_{t+1}$ only depends on $\bm{q}_{t}$, which in turn only depends on $\bm{x}_{t}$ and $\bm{v}_{t}$, it suffices to use the above bound for $f(\bm{x}_{t})$, which gives us an iteration complexity of $p^{O(1)}\log\left(\frac{pd}{\varepsilon}\right)$ to compute $\widetilde{\bm{x}}_{t+1}$ (which we get from plugging into [Lemma 7.19](https://arxiv.org/html/2607.00252#S7.Thmtheorem19)).

Combining this with the iteration complexity of $\mathcal{O}_{\mathsf{prox}}$ gives us the result of [Theorem 2](https://arxiv.org/html/2607.00252#Thmmainthm2). ∎

## 8 Empirical Evaluation

In this section, we thoroughly compare our method against the baselines mentioned in [Table 1](https://arxiv.org/html/2607.00252#S1.T1). We begin with a synthetic regression experiment in [Section 8.1](https://arxiv.org/html/2607.00252#S8.SS1), in which the synthetic design allows us to control the heterogeneity in the data. This is followed in [Section 8.2](https://arxiv.org/html/2607.00252#S8.SS2) by another experiment on a real-world group-structured regression task drawn from the American Community Survey (ACS).

### 8.1 Synthetic Heterogeneous Regression Construction

We construct synthetic group-structured regression problems designed to test optimization under severe group heterogeneity. The data consist of $m$ groups. Each group $i$ has its own design matrix $\mathbf{A}_{S_{i}}$ and target vector $\bm{b}_{S_{i}}$, and defines a quadratic loss

$$\ell_{i}(\bm{x})=\frac{1}{n_{i}}\|\mathbf{A}_{S_{i}}\bm{x}-\bm{b}_{S_{i}}\|_{2}^{2}.$$

The objective of interest is the worst-group loss (as in ([1.2](https://arxiv.org/html/2607.00252#S1.E2)))

$$F(\bm{x})=\max_{i\in[m]}\ell_{i}(\bm{x}).$$

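
For concreteness, the two definitions above can be sketched in a few lines of NumPy. Representing the data as a list of `(A_i, b_i)` pairs is an assumption of this sketch, as are the function names.

```python
import numpy as np

def group_losses(x, groups):
    """Mean squared residual of each group; groups is a list of (A_i, b_i)."""
    return np.array([np.mean((A @ x - b) ** 2) for A, b in groups])

def worst_group_loss(x, groups):
    """F(x) = max_i loss_i(x)."""
    return float(group_losses(x, groups).max())

# Two tiny groups whose optima disagree along the first coordinate.
groups = [
    (np.eye(2), np.zeros(2)),
    (np.eye(2), np.array([2.0, 0.0])),
]
print(worst_group_loss(np.array([0.0, 0.0]), groups))
print(worst_group_loss(np.array([1.0, 0.0]), groups))
```

In this toy instance, the midpoint between the two group optima has a smaller worst-group loss than either optimum, which is the behavior the robust objective is meant to capture.
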
We generate two qualitatively distinct group types to separate average performance from worst\-group performance\. The majority of groups are benign and geometrically aligned, whereas a small number have very high curvature, are geometrically misaligned, and are far from the population center\. This construction ensures that minimizing the average loss does not coincide with minimizing the worst\-group loss, and that curvature heterogeneity strongly affects optimization behavior\.

We construct all group covariances relative to a shared orthonormal coordinate system\. This allows us to control curvature direction\-by\-direction while keeping the ambient geometry comparable across groups\. Each group, therefore, has a quadratic loss whose Hessian shares eigenvectors with the others but whose eigenvalues vary across groups\.

##### Normal groups\.

Most groups are generated with a moderate condition number\. Their Hessians have eigenvalues that vary across coordinates but remain within a controlled range\. The corresponding optimal parameters are sampled from a distribution concentrated around a common center in parameter space\. Independent noise is added to each group so that these losses are smooth and moderately curved\. As a result, normal groups are geometrically aligned: their curvature structure is similar, and their optima lie in a relatively small region of parameter space\.

##### Outlier groups\.

A small subset of groups is constructed to be adversarial in two distinct ways\. First, each adversarial group has one direction with extremely large curvature, while the remaining directions retain moderate curvature\. These sharp directions differ across adversarial groups\. Second, the optimal parameter of each adversarial group lies far from the population center along its corresponding high\-curvature direction\. In addition, the noise level in these groups is set to be very small, so their losses are sharply concentrated around their optima\. Together, these properties ensure that deviations along the sharp directions incur very large increases in the worst\-group loss\.

##### Implications of our construction\.

This construction produces three key phenomena:

1. The stacked design matrix has a large condition number.
2. The curvature directions that dominate different groups are misaligned.
3. The empirical risk minimizer (which minimizes the average loss) can perform well on most groups while incurring substantial loss on a small number of adversarial groups.

Because most groups share similar geometry, gradient averaging implicitly emphasizes their curvature structure. Therefore, first-order methods reduce the average loss efficiently but make comparatively slow progress along the sharp directions that control the worst-group objective. In contrast, methods that adapt to local curvature or explicitly control worst-case behavior can continue to decrease the max-loss objective.

For the default instance used in [Figure 1](https://arxiv.org/html/2607.00252#S8.F1), the problem dimension is $d=10$ with $100$ groups, of which $5$ are adversarial. The stacked Gram matrix has a condition number on the order of $10^{5}$, and the empirical risk minimizer exhibits a clear gap relative to the robust optimum computed via convex programming. For more details on data generation, see the included Jupyter notebook.
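
The construction described above could be sketched as follows. This is not the notebook's generator: every numeric choice (eigenvalue ranges, the sharp eigenvalue `1e4`, the offset `5.0`, noise levels, samples per group) is a hypothetical placeholder, and the sketch is not tuned to reproduce the reported condition number.

```python
import numpy as np

def make_groups(m=100, n_out=5, d=10, n=50, seed=0):
    """Return a list of (A_i, b_i); the first n_out groups are outliers."""
    rng = np.random.default_rng(seed)
    # Shared orthonormal coordinate system for all group covariances.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    center = rng.standard_normal(d)
    groups = []
    for i in range(m):
        eig = rng.uniform(1.0, 10.0, size=d)            # moderate curvature
        x_opt = center + 0.1 * rng.standard_normal(d)   # optimum near the center
        noise = 0.1
        if i < n_out:
            k = i % d                    # a different sharp direction per outlier
            eig[k] = 1e4                 # one direction with very large curvature
            x_opt = center + 5.0 * Q[:, k]   # optimum far away along that direction
            noise = 1e-3                 # very small noise
        # Rows have population covariance Q diag(eig) Q^T.
        A = (rng.standard_normal((n, d)) * np.sqrt(eig)) @ Q.T
        b = A @ x_opt + noise * rng.standard_normal(n)
        groups.append((A, b))
    return groups

groups = make_groups()
```

Drawing rows with covariance $Q\,\mathrm{diag}(\lambda)\,Q^{\top}$ makes each group's expected Hessian share the eigenvectors $Q$, so curvature can be set direction by direction as in the text.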

#### 8.1.1 Computing the Robust Optimum via Convex Programming

To obtain a reliable reference point for evaluation, we compute the exact optimum value of the worst\-group objective using a convex solver\. Concretely, we solve the robust regression problem that minimizes the maximum group mean\-squared error\. Although the objective is a pointwise maximum over groups, it remains a convex function of the parameter vector because each group loss is a convex quadratic\.

We programmatically formulate this problem using the epigraph trick. We introduce an auxiliary scalar variable $t$ that upper-bounds every group loss, and then minimize this upper bound. If we write the group loss as the mean squared residual for that group, then the epigraph reformulation becomes

$$\min_{\bm{x},\,t}\ \ t\quad\text{subject to}\quad\ell_{i}(\bm{x})\leq t\ \ \text{for every group }i\in[m].$$

Each constraint is a convex quadratic inequality, so the resulting problem is a convex quadratically constrained program. We implement this formulation in CVXPY [diamond2016cvxpy, agrawal2018rewriting] by declaring decision variables for the model parameters and the epigraph variable, adding one quadratic constraint per group, and calling a standard convex solver. We treat the returned epigraph value as the robust optimum value.

In all plots, we report the gap between an algorithm’s current worst\-group loss and this robust optimum value\. This subtraction makes the figures directly comparable across instances and highlights whether a method continues to reduce the true robust suboptimality, rather than merely decreasing a surrogate objective\.

#### 8.1.2 Baselines

We compare the following methods\.

##### Subgradient method\.

We run subgradient descent directly on the nonsmooth max-loss objective ([1.2](https://arxiv.org/html/2607.00252#S1.E2)), using both fixed and diminishing step-size schedules, and report the best variant.
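
A subgradient of the max-loss at $\bm{x}$ is the gradient of any group loss that attains the maximum. The sketch below uses that fact with a diminishing $1/\sqrt{k+1}$ schedule; the schedule, the default step size, and the function name are illustrative assumptions rather than the tuned settings used in the experiments.

```python
import numpy as np

def subgradient_descent(groups, x0, steps=500, step0=0.1):
    """Subgradient descent on F(x) = max_i loss_i(x); returns the best iterate seen."""
    x = np.array(x0, dtype=float)
    best_x, best_val = x.copy(), np.inf
    for k in range(steps):
        losses = [np.mean((A @ x - b) ** 2) for A, b in groups]
        i = int(np.argmax(losses))            # index of an active (worst) group
        if losses[i] < best_val:
            best_x, best_val = x.copy(), losses[i]
        A, b = groups[i]
        g = 2.0 * A.T @ (A @ x - b) / A.shape[0]   # gradient of the active loss
        x = x - step0 / np.sqrt(k + 1) * g         # diminishing step size
    return best_x, best_val

toy = [
    (np.array([[1.0]]), np.array([0.0])),
    (np.array([[1.0]]), np.array([2.0])),
]
x_best, f_best = subgradient_descent(toy, x0=[0.0])
```

Tracking the best iterate matters because the worst-group loss need not decrease monotonically along subgradient steps.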

##### Smoothed gradient methods\.

We apply log-sum-exp smoothing (see [2.2](https://arxiv.org/html/2607.00252#S2.E2)) to approximate the max operator and optimize the resulting smooth objective using:

- Gradient descent,
- Heavy-Ball (Polyak) momentum,
- Nesterov acceleration.
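
The smoothed objective and its gradient can be sketched as below. The sketch assumes the standard form $\mu\log\sum_{i}\exp(\ell_{i}(\bm{x})/\mu)$ with a hypothetical smoothing parameter `mu`; the normalization in the paper's (2.2) may differ.

```python
import numpy as np

def smoothed_max_and_grad(x, groups, mu):
    """Value and gradient of mu * log(sum_i exp(loss_i(x) / mu))."""
    losses = np.array([np.mean((A @ x - b) ** 2) for A, b in groups])
    shift = losses.max()                        # subtract the max for stability
    weights = np.exp((losses - shift) / mu)
    value = shift + mu * np.log(weights.sum())
    weights /= weights.sum()                    # softmax weights over groups
    grad = sum(
        w * 2.0 * A.T @ (A @ x - b) / A.shape[0]
        for w, (A, b) in zip(weights, groups)
    )
    return value, grad

toy = [
    (np.array([[1.0]]), np.array([0.0])),
    (np.array([[1.0]]), np.array([2.0])),
]
value, grad = smoothed_max_and_grad(np.array([1.0]), toy, mu=0.1)
```

Under this form the smoothed value lies between $F(\bm{x})$ and $F(\bm{x})+\mu\log m$, and the gradient is a softmax-weighted average of the group gradients, so any of the three listed first-order methods can use it directly.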

##### Interior\-point method\.

We implement a log\-barrier\-based interior\-point method that solves a sequence of smooth approximations using Newton steps\.

##### Ball\-oracle methods \(ours\)\.

We implement two trust-region style methods that repeatedly solve the smoothed objective using a damped Newton solver ([Section 2.2](https://arxiv.org/html/2607.00252#S2.SS2)):

- Euclidean geometry (naive ball),
- Lewis-weight geometry, where the trust region is defined using a data-dependent positive definite matrix constructed from block Lewis weights.

After each outer step, the center is updated to the new solution, and the trust\-region radius is optionally shrunk\. For simplicity, we do not consider the acceleration of the ball\-oracle method\.
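
One outer step of such a method might look like the following sketch. It assumes log-sum-exp smoothing with parameter `mu`, a given positive definite matrix `M` (the identity for the Euclidean ball), and a simple radial pull-back onto the ball in place of an exact constrained solve. All names and defaults are hypothetical, and this is a simplification rather than the implementation used in the experiments.

```python
import numpy as np

def lse_derivatives(x, groups, mu):
    """Gradient and Hessian of mu * log(sum_i exp(loss_i(x) / mu))."""
    losses, grads, hessians = [], [], []
    for A, b in groups:
        r, n = A @ x - b, A.shape[0]
        losses.append(r @ r / n)
        grads.append(2.0 * A.T @ r / n)
        hessians.append(2.0 * A.T @ A / n)
    losses = np.array(losses)
    w = np.exp((losses - losses.max()) / mu)
    w /= w.sum()                                 # softmax weights
    G = np.array(grads)                          # shape (m, d)
    g = w @ G
    H = sum(wi * Hi for wi, Hi in zip(w, hessians))
    H += ((G.T * w) @ G - np.outer(g, g)) / mu   # curvature from the softmax
    return g, H

def ball_oracle_step(center, groups, M, radius, mu, newton_iters=20, damping=0.5):
    """Damped Newton iterations kept inside {x : ||x - center||_M <= radius}."""
    x = center.copy()
    for _ in range(newton_iters):
        g, H = lse_derivatives(x, groups, mu)
        step = -damping * np.linalg.solve(H + 1e-10 * np.eye(len(x)), g)
        delta = x + step - center
        norm = np.sqrt(delta @ M @ delta)
        if norm > radius:                        # pull back onto the ball
            delta *= radius / norm
        x = center + delta
    return x

toy = [
    (np.array([[1.0]]), np.array([0.0])),
    (np.array([[1.0]]), np.array([2.0])),
]
x_new = ball_oracle_step(np.array([0.0]), toy, np.eye(1), radius=0.5, mu=0.1)
```

An outer loop would then set the center to `x_new` and optionally shrink `radius`, as described above.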

#### 8.1.3 Hyperparameter Tuning

We tune every method via grid search over its relevant hyperparameters:

- Step sizes for gradient and subgradient methods;
- Smoothing parameters and momentum coefficients for smoothed methods;
- Barrier parameters and inner iteration counts for the interior-point method;
- Initial trust-region radius, smoothing strength, and radius decay factors for the ball-oracle methods.

All methods use the same warm start\. For each configuration, we run a fixed number of outer iterations and select the configuration that achieves the lowest worst\-group loss within this budget\.
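
The selection rule can be sketched generically as follows. Here `run` is a hypothetical callable that executes one method for the fixed iteration budget and returns its final worst-group loss; the toy `run` and grid values are placeholders, not the grids used in the experiments.

```python
import itertools

def grid_search(run, grid):
    """Try every combination in grid and keep the lowest returned loss."""
    names = list(grid)
    best_config, best_loss = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        loss = run(**config)          # final worst-group loss within the budget
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

# Toy stand-in for a solver run, used only to exercise the search.
def toy_run(step, mu):
    return (step - 0.1) ** 2 + mu

best_config, best_loss = grid_search(
    toy_run, {"step": [0.01, 0.1, 1.0], "mu": [0.5, 0.05]}
)
```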

#### 8.1.4 Empirical Behavior

Notably, the meaning of an iteration differs across algorithms:

- For subgradient and smoothed gradient methods, one iteration corresponds to one full gradient or subgradient update using all groups.
- For the interior-point method, one iteration corresponds to one outer Newton step of the barrier procedure.
- For the ball-oracle methods, one iteration corresponds to one call to the trust-region Newton solver (i.e., one outer iteration).

In all iteration\-complexity plots, we compare methods using their own natural outer\-iteration count\.

##### Iteration complexity\.

On the adversarial instances, first-order methods make limited progress. Subgradient descent improves briefly but quickly plateaus. Even the best-tuned smoothed-gradient variant stalls far above the optimum. In contrast, both ball-oracle methods steadily decrease the worst-group loss across outer iterations, and the IPM converges rapidly. We also note that the IPM achieves the best final loss among all methods, which is unsurprising, as CVXPY natively uses the same algorithm to compute the maximum-loss optimum.

##### Runtime complexity\.

While we do not spend significant effort simulating a time-complexity model or hyper-optimizing our code, we also plot the wall-clock runtime of all the algorithms to control for the different meanings of an iteration. Because Newton steps are computationally more expensive, first-order methods initially appear competitive in wall-clock time. However, they plateau early and fail to approach high accuracy. Interior-point and ball-oracle methods continue to reduce the worst-group loss gap and eventually achieve near-optimal solutions, whereas first-order methods remain stuck far from the optimum. We observe a very slight benefit from using the Lewis geometry in our ball-oracle method.

![Refer to caption](https://arxiv.org/html/2607.00252v1/figures/Iterations.png) (a) Suboptimality versus iteration count (log–log).
![Refer to caption](https://arxiv.org/html/2607.00252v1/figures/Time.png) (b) Suboptimality versus wall-clock time (seconds).

Figure 1: Comparison of first-order, interior-point, and ball-oracle methods on adversarial heterogeneous regression instances. All curves report the worst-group loss minus the optimal value computed via CVXPY.

### 8.2 Real-world Experiment: ACS Income

To validate the computational behavior of our methods beyond synthetic instances, we evaluate them on a real-world group-structured regression task drawn from the American Community Survey (ACS), accessed through the folktables package [ding2021retiring]. The task is to predict log personal income from $d=10$ standardized demographic and employment features (age, education, occupation, hours worked, etc.) for employed US adults. We group the data by region of residence and use all $m=51$ regions (the $50$ states together with Puerto Rico), subsampling $200$ individuals per region for a total of $10{,}200$ samples. Each group loss $\ell_{i}$ is the mean squared prediction error on region $i$, and the objective is again the worst-group loss $F(\bm{x})=\max_{i\in[m]}\ell_{i}(\bm{x})$. This is a natural distributionally-robust setting: income distributions differ substantially across regions, so the model that minimizes the average error can perform poorly on individual states. All methods are warm-started at the empirical risk minimizer (ERM), and we compute the robust optimum $\mathsf{OPT}$ with CVXPY as the reference value.

##### Convergence and runtime\.

Our primary interest is computational: how quickly each method drives the worst-group suboptimality $F(\bm{x}_{t})-\mathsf{OPT}$ to zero, both in outer iterations and in wall-clock time. [Figure 2](https://arxiv.org/html/2607.00252#S8.F2) plots this gap against iteration count and against runtime, and [Table 2](https://arxiv.org/html/2607.00252#S8.T2) reports the iterations and wall-clock time each method needs to reach a relative suboptimality of $1\%$.
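
The iteration counts in such a comparison can be read off a loss trace as in the sketch below. It assumes that the relative suboptimality is measured against $\mathsf{OPT}$, that the best-so-far value is used, and that index $0$ is the warm start; these conventions and the function name are assumptions of the sketch.

```python
def iterations_to_target(values, opt, rel_tol=0.01):
    """First index t at which the best-so-far value is within rel_tol * opt of opt."""
    best = float("inf")
    for t, value in enumerate(values):
        best = min(best, value)
        if best - opt <= rel_tol * opt:
            return t
    return None  # target not reached within the recorded budget

# Hypothetical trace of worst-group losses, with OPT = 1.0.
trace = [2.0, 1.5, 1.005, 1.0]
hit = iterations_to_target(trace, opt=1.0)
```

Returning `None` corresponds to a method that does not reach the target within its iteration budget.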

The two ball-oracle methods are the fastest on both axes. Starting from the warm start, each reaches the $1\%$ target in a *single* outer iteration and in under $20$ ms of wall-clock time, roughly $3\times$ faster than the interior-point method ($66$ ms, $8$ iterations) and the best-tuned smoothed Heavy-Ball method ($62$ ms, $47$ iterations). The interior-point method converges in a few iterations but incurs a higher per-iteration cost, whereas the smoothed first-order method requires an order of magnitude more iterations to reach the same accuracy. The subgradient method never reaches the $1\%$ target within the budget: it remains essentially pinned at the ERM gap ([Figure 2](https://arxiv.org/html/2607.00252#S8.F2)). The Lewis-weight geometry is marginally faster than the Euclidean ball, consistent with the synthetic results. Overall, on a genuine $51$-group regression problem, the ball-oracle methods reach high-accuracy robust solutions fastest in both iteration count and wall-clock time.

![Refer to caption](https://arxiv.org/html/2607.00252v1/plots/acs_m51_iterations.png) (a) Suboptimality versus iteration count (log–log).
![Refer to caption](https://arxiv.org/html/2607.00252v1/plots/acs_m51_time.png) (b) Suboptimality versus wall-clock time (seconds).

Figure 2: Convergence on the ACS Income task ($m=51$ regions, $d=10$, $10{,}200$ samples). Each curve reports the best-so-far worst-group suboptimality $F(\bm{x}_{t})-\mathsf{OPT}$; the dashed line marks the ERM gap. The ball-oracle methods reach the robust optimum in a single outer iteration and the least wall-clock time, while the subgradient method stalls at the ERM gap.

| Method | Iterations to $1\%$ gap | Wall-clock time (s) |
| --- | --- | --- |
| Subgradient | not reached | n/a |
| Smoothed Heavy-Ball | 47 | 0.062 |
| IPM | 8 | 0.066 |
| Ball-Oracle (Euclidean) | 1 | 0.019 |
| Ball-Oracle (Lewis) | 1 | 0.019 |

Table 2: Cost to reach a $1\%$ relative worst-group suboptimality on the ACS Income task ($m=51$). Iterations are each method's own natural outer iterations. The subgradient method does not reach the target within the iteration budget.

##### Statistical context.

For completeness, we note that the optimization gains correspond to meaningful redistribution of prediction accuracy across regions. The ERM solution attains the lowest *average* error ($108.2$) but a large spread ($\sigma=11.8$) and a worst-group loss of $138.1$, attained by California; other heavily penalized states are New York, Hawaii, Nevada, Florida, and New Jersey, all large or high-income regions on which a $10$-feature linear model underfits. The robust optimum instead nearly equalizes the group losses into a narrow band around $107$–$114$ (Max/Mean drops from $1.28$ to $1.02$), decreasing the loss on the previously neglected states (e.g. California by $24.3$ MSE) at a moderate cost on the homogeneous low-income states that ERM happened to fit well.
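
The quoted figures are mutually consistent, as the short check below illustrates. It uses only numbers stated in this paragraph; the variable names are ours, and reading the California figure as the ERM worst-group loss minus the quoted decrease is our interpretation.

```python
erm_mean = 108.2   # lowest average error, attained by ERM
erm_max = 138.1    # ERM worst-group loss (California)

# Max/Mean ratio for the ERM solution.
erm_ratio = erm_max / erm_mean

# California's loss at the robust optimum, after the quoted 24.3 MSE decrease.
california_robust = erm_max - 24.3

print(round(erm_ratio, 2), round(california_robust, 1))
```

The ratio rounds to the reported $1.28$, and the implied California loss falls inside the reported $107$–$114$ band.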

#### Acknowledgments

We thank Aaron Sidford for useful comments during the early stage of the project\. We also thank the anonymous reviewers at ICLR’26 and NeurIPS’25 for their significant contributions to improving the paper\. The authors used ChatGPT 5\.3 and Claude Opus 4\.6 to implement the algorithms in this paper\. Most of this work was done when NSM and KKP were graduate students at Toyota Technological Institute, Chicago \(TTIC\)\. KKP and NSM were supported through the NSF TRIPOD Institute on Data, Economics, Algorithms and Learning \(IDEAL\) and other awards from DARPA and NSF\. NSM thanks Citadel LLC for sponsoring the conference travel and attendance for the presentation of this work\.

## References
