Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning

arXiv cs.LG Papers

Summary

This paper proposes a robust subspace-constrained quadratic model for learning low-dimensional structures from high-dimensional data, accommodating heavy-tailed noise. A gradient-based algorithm with backtracking line search is developed, and experiments show improved robustness and reconstruction accuracy.

arXiv:2605.20300v1 Announce Type: new Abstract: In this paper, we propose a robust subspace-constrained quadratic model (SCQM) for learning low-dimensional structure from high-dimensional data. Building upon the subspace-constrained quadratic matrix factorization (SQMF) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models. This generalization enables reliable performance under both heavy-tailed and light-tailed noise, thereby substantially enhancing robustness across diverse data regimes. To efficiently address the resulting nonconvex optimization problem, we develop a gradient-based algorithm equipped with a backtracking line-search strategy that ensures stable and efficient convergence. In addition, we present a sensitivity analysis of the $\ell_p^p$ and $\ell_2$ loss functions, elucidating their distinct behaviors under varying noise characteristics. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy.
Original Article
View Cached Full Text

Cached at: 05/21/26, 06:22 AM

# Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning
Source: [https://arxiv.org/html/2605.20300](https://arxiv.org/html/2605.20300)
Zheng Zhai and Xiaohui LiZheng Zhai is with Department of Statistics, Faculty of Arts and Sciences at Beijing Normal University, Zhuhai\. Xiaohui Li is with School of Mathematics and Information Sciences, Yantai University\.

###### Abstract

In this paper, we propose a robust subspace\-constrained quadratic model \(SCQM\) for learning low\-dimensional structure from high\-dimensional data\. Building upon the subspace\-constrained quadratic matrix factorization \(SQMF\) framework, the proposed model accommodates a broad class of noise distributions, including generalized Gaussian and radial Laplace models\. This generalization enables reliable performance under both heavy\-tailed and light\-tailed noise, thereby substantially enhancing robustness across diverse data regimes\. To efficiently address the resulting nonconvex optimization problem, we develop a gradient\-based algorithm equipped with a backtracking line\-search strategy that ensures stable and efficient convergence\. In addition, we present a sensitivity analysis of theℓpp\\ell\_\{p\}^\{p\}andℓ2\\ell\_\{2\}loss functions, elucidating their distinct behaviors under varying noise characteristics\. Extensive numerical experiments corroborate the theoretical analysis and demonstrate that the proposed approach consistently outperforms existing methods in terms of robustness and reconstruction accuracy\.

## IIntroduction

Learning low\-dimensional structures from high\-dimensional data remains a fundamental challenge in data analysis, machine learning, and signal processing\. Traditional methods, including local linear fitting\[[1](https://arxiv.org/html/2605.20300#bib.bib1)\], kernel density estimation \(KDE\)\[[2](https://arxiv.org/html/2605.20300#bib.bib2)\], linear matrix factorization, manifold fitting\[[3](https://arxiv.org/html/2605.20300#bib.bib3)\]and principal curves and surfaces\[[4](https://arxiv.org/html/2605.20300#bib.bib4)\], typically rely on the assumption that data reside in a linear subspace and are perturbed by Gaussian noise\. While these assumptions enable the development of efficient algorithms, they are often unrealistic in real\-world scenarios\. In practice, data may lie on complex nonlinear manifolds and be affected by heavy\-tailed noise or outliers\.

The assumption of flatness often leads to focusing on small, localized regions of the data\. However, when narrowing the focus to such regions, there may be insufficient samples to apply these algorithms effectively, resulting in biased estimates\. This limitation can cause a significant loss of information and degrade model performance\. To overcome these challenges, it is imperative to develop models that can accommodate a broader range of data structures, capturing the intricate, nonlinear nature of real\-world data while remaining robust to noise and outliers\. Such models will be better equipped to provide more accurate, generalizable results and offer enhanced performance in practical applications\.

To address this limitation, manifold learning and nonlinear factorization models have been proposed\. Among them, quadratic matrix factorization \(QMF\)\[[5](https://arxiv.org/html/2605.20300#bib.bib5)\]and subspace\-constrained quadratic matrix factorization \(SQMF\)\[[6](https://arxiv.org/html/2605.20300#bib.bib6)\]have gained attention for their ability to incorporate second\-order information and approximate curved manifolds\. By augmenting linear subspace models\[[7](https://arxiv.org/html/2605.20300#bib.bib7),[8](https://arxiv.org/html/2605.20300#bib.bib8)\]with quadratic terms, these methods can capture both tangent directions and curvature, improving reconstruction accuracy and interpretability\. In particular, subspace\-constrained quadratic matrix factorization \(SQMF\) provides a principled framework for learning low\-dimensional tangent and normal spaces, as well as a quadratic mapping between them\.

Existing quadratic factorization models often rely on squared Euclidean loss or the Frobenius norm, both of which assume Gaussian noise\. While these assumptions offer computational simplicity, they make the models highly sensitive to outliers and misspecifications\. To mitigate the impact of outliers, typical quadratic factorization models introduce a penalty term that penalizes the contribution of the quadratic term, aiming to prevent overfitting\. However, this approach introduces its own challenges, such as the difficult task of selecting appropriate penalty parameters\.

Another key reason why the Frobenius norm loss may be inappropriate is due to the mismatch between its assumptions and the characteristics of real\-world noise distributions\. In practical applications—such as image analysis, sensor data processing, and robust representation learning—data often contains noise that is heavy\-tailed, sparse, or impulsive\. These types of noise can severely degrade model performance and lead to biased estimates\. The assumption of Gaussian noise, which is commonly used in the Frobenius norm loss, fails to capture the true nature of noise in many real\-world scenarios\. This discrepancy highlights the necessity for more flexible and robust loss functions within quadratic factorization models, ones that are specifically designed to account for the diverse and complex noise patterns encountered in practice\. Such loss functions would improve the model’s ability to handle these noise complexities, ultimately enhancing performance and robustness in real\-world applications\.

In response to these challenges, we propose a robust subspace\-constrained quadratic learning framework that allows for a broader class of noise distributions, including generalized Gaussian and Laplace families\. This flexibility enables the model to adapt to both light\-tailed and heavy\-tailed noise while maintaining the expressive power of quadratic manifold representations\. Consequently, the proposed model offers enhanced robustness and accuracy in practical scenarios where Gaussian assumptions fail\.

From an optimization perspective, we extend the classical Frobenius\-norm formulation∥⋅∥F2\\\|\\cdot\\\|\_\{\\rm F\}^\{2\}to matrix factorization under alternative norms, such as the entrywiseℓ1,1\\ell\_\{1,1\}norm and the mixedℓ2,1\\ell\_\{2,1\}norm\. To solve the resulting optimization problem, we propose a gradient descent\-based approach\. Due to the quadratic structure of the mapping and the orthogonality constraints, the optimization remains nonconvex\. To address this, we derive the Karush\-Kuhn\-Tucker \(KKT\) conditions and develop a Riemannian gradient descent algorithm on the Stiefel manifold, ensuring feasibility and convergence through orthogonality\-preserving updates\.

The main contributions of this work are as follows:

- •We propose a generalized subspace\-constrained quadratic framework that extends existing SQMF models to a broad class of loss functions and derive explicit gradients for all variables, including latent coordinates\.
- •We provide a theoretical analysis, based on the implicit function theorem, that explains why theℓpp\\ell\_\{p\}^\{p\}loss and non\-squared Euclidean norm improve robustness\.
- •We develop an efficient Riemannian gradient descent algorithm with orthogonality\-preserving updates on the Stiefel manifold\.
- •Extensive numerical experiments demonstrate that the proposed method significantly improves robustness and reconstruction accuracy in the presence of outliers\.

The structure of the paper is as follows\. In Section[II](https://arxiv.org/html/2605.20300#S2), we introduce the subspace\-constrained quadratic model, detailing its formulations and the associated identifiability property\. We also provide a toy example featuring a circle inℝ2\\mathbb\{R\}^\{2\}to demonstrate the model’s effectiveness\. Section[III](https://arxiv.org/html/2605.20300#S3)focuses on deriving the gradients with respect to the free variables and orthogonal constraints, presenting the KKT conditions, and outlining the proposed optimization algorithm\. In Section[IV](https://arxiv.org/html/2605.20300#S4), we conduct a convexity analysis of the subproblems related tocc,Θ\\Theta, and the projection with respect toτ\\tau\. Section[V](https://arxiv.org/html/2605.20300#S5)offers a sensitivity analysis for theℓpp\\ell\_\{p\}^\{p\}andℓ2\\ell\_\{2\}loss functions\. Section[VII](https://arxiv.org/html/2605.20300#S7)presents numerical experiments to validate the theoretical results, and the paper concludes in Section[VIII](https://arxiv.org/html/2605.20300#S8)with a discussion of promising directions for future research\.

### I\-ARelated works

Since our work leverages a subspace\-constrained quadratic matrix factorization framework to learn latent manifold structures, it is closely related to a broad class of manifold fitting and denoising methods\. These approaches aim to recover low\-dimensional geometric structures embedded in high\-dimensional observations, typically under noise and sampling irregularities\. Representative techniques include local linear fitting methods such as local principal component analysis \(LPCA\)\[[1](https://arxiv.org/html/2605.20300#bib.bib1)\], ridge\-based approaches derived from kernel density estimation \(KDE\)\[[9](https://arxiv.org/html/2605.20300#bib.bib9)\]and its logarithmic variant \(LOG\-KDE\)\[[10](https://arxiv.org/html/2605.20300#bib.bib10)\], tangent\-space aggregation methods such as manifold fitting \(MFIT\)\[[11](https://arxiv.org/html/2605.20300#bib.bib11)\], and higher\-order regression techniques including moving least squares \(MLS\)\[[12](https://arxiv.org/html/2605.20300#bib.bib12)\]\. For a comprehensive comparison, we focus on seven representative methods spanning these categories\.

Ridge estimation constitutes one of the most influential paradigms for nonparametric manifold estimation\. Rather than explicitly parameterizing the manifold, ridge\-based methods define it implicitly as a set of points satisfying specific differential conditions of an estimated probability density function\. In particular, KDE\-based ridge estimation constructs a smooth density estimate from the observed data and identifies the manifold as a subset where the gradient aligns with the principal eigenspace of the Hessian, while the remaining curvature directions exhibit concavity\. This formulation allows the manifold to be recovered directly from data geometry without requiring an explicit embedding or coordinate chart\. In practice, the ridge set rarely admits a closed\-form solution, and iterative procedures such as the subspace\-constrained mean shift \(SCMS\)\[[4](https://arxiv.org/html/2605.20300#bib.bib4),[13](https://arxiv.org/html/2605.20300#bib.bib13)\]algorithm are commonly employed to trace the ridge structure\. The LOG\-KDE variant further modifies this framework by applying the ridge conditions to the logarithm of the density, leading to a different curvature characterization and tangent space estimation, often improving robustness in regions of varying density\.

Beyond ridge\-based techniques, several methods reconstruct manifolds by directly exploiting local tangent or normal space information\. A prominent example is manifold fitting \(MFIT\), which estimates the manifold by enforcing consistency across locally estimated normal spaces\. MFIT aggregates normal directions obtained from neighborhoods of nearby points using spatially adaptive weights and defines the manifold as the set of points where the weighted normal projections vanish\. This approach provides a principled way to fuse local geometric information into a globally coherent manifold estimate and is particularly effective when normal directions can be reliably estimated from noisy data\.

Higher\-order manifold estimation methods aim to move beyond linear or first\-order approximations by explicitly modeling local curvature\. Among these, the moving least squares \(MLS\) framework is one of the most widely studied\. MLS first estimates a local tangent space around each query point and then performs a weighted polynomial regression in this local coordinate system\. The fitted polynomial captures higher\-order geometric features, and the manifold estimate is obtained by projecting the query point onto the polynomial surface\. Owing to its flexibility and strong approximation properties, MLS is well suited for smooth manifolds with nontrivial curvature, albeit at the cost of increased computational complexity and sensitivity to parameter choices\.

Another closely related line of work extends local linear models by incorporating explicit geometric primitives\. Spherical PCA \(SPH\), also known as spherelets, augments local PCA\[[1](https://arxiv.org/html/2605.20300#bib.bib1)\]by fitting low\-dimensional spherical structures instead of affine subspaces\. The method first projects data into a locally estimated affine subspace of dimension one higher than the intrinsic dimension and then fits a sphere within this space\. By modeling curvature through spherical geometry, SPH provides a more accurate approximation than linear methods when the underlying manifold exhibits approximately constant curvature\.

In contrast to the above approaches, which primarily rely on local geometric estimation and projection\-based reconstruction, our method adopts the generalized quadratic approximation perspective with explicit subspace constraints\. By integrating quadratic structure and subspace regularization, the proposed framework bridges manifold learning and data approximation, enabling robust recovery of latent manifolds while maintaining computational efficiency and scalability\. This distinction allows our approach to complement existing local fitting and denoising methods, particularly in settings where global structure and latent representations are of primary interest\.

## IISCQM and its formulations

In this section, we present the generalized SCQM model and explore various loss functions, highlighting their appropriate use cases and scenarios\. We also provide a toy example to demonstrate the advantages of SCQM over the traditional linear local fitting model, illustrating how SCQM model can better capture complex patterns and improve performance in practical applications\.

### II\-AQuadratic Fitting Model with Subspace Constraint

We generalize classical quadratic matrix factorization beyond the Frobenius\-norm objective by allowing a broad class of loss functions\. This enables estimation in models of the form

xi=f​\(τi\)\+ϵi,i=1,…,n,x\_\{i\}=f\(\\tau\_\{i\}\)\+\\epsilon\_\{i\},\\qquad i=1,\\dots,n,\(1\)where the noise termϵi\\epsilon\_\{i\}is not restricted to be Gaussian\. Instead, we allowϵi\\epsilon\_\{i\}to follow either heavy\-tailed or light\-tailed distributions\. A representative example is the generalized Gaussian distribution with densityp​\(ϵ\)∝exp⁡\(−‖ϵ‖pp/η\),x∈ℝD,p\(\\epsilon\)\\propto\\exp\\bigl\(\-\\\|\\epsilon\\\|\_\{p\}^\{\\,p\}/\\eta\\bigr\),\\ x\\in\\mathbb\{R\}^\{D\},which includes the Gaussian model whenp=2p=2and the Laplace model whenp=1p=1\. We also consider isotropic, rotation\-invariant noise modeled by the multivariate radial Laplace distributionp​\(ϵ\)∝exp⁡\(−‖ϵ‖2/η\)p\(\\epsilon\)\\propto\\exp\(\-\\\|\\epsilon\\\|\_\{2\}/\\eta\)\.

When the noise distribution is known, maximum likelihood estimation of the mappingf​\(⋅\)f\(\\cdot\)reduces, up to additive constants, to the optimization problem

minf,\{τi\}​∑i=1nℓ​\(xi−f​\(τi\)\),\\min\_\{f,\\\{\\tau\_\{i\}\\\}\}\\sum\_\{i=1\}^\{n\}\\ell\\bigl\(x\_\{i\}\-f\(\\tau\_\{i\}\)\\bigr\),\(2\)whereℓ​\(⋅\)\\ell\(\\cdot\)is the negative log\-likelihood associated with the assumed noise model\. Typical choices include‖r‖pp\\\|r\\\|\_\{p\}^\{p\},‖r‖2\\\|r\\\|\_\{2\}, and‖r‖22\\\|r\\\|\_\{2\}^\{2\}, corresponding to generalized Gaussian, radial Laplace, and Gaussian noise models, respectively\. This formulation establishes a direct link between the loss function and the statistical properties of the noise\.

In what follows, we study the solution to \([2](https://arxiv.org/html/2605.20300#S2.E2)\) by explicitly specifying both the loss functionℓ​\(⋅\)\\ell\(\\cdot\)and the function class offf\. In particular, we restrictffto the class of quadratic functions equipped with explicit tangent\- and normal\-space basis representations\. This choice offers two key advantages\. First, it provides a clear geometric interpretation of the model parameters while retaining strong representation capability\. Second, the linear model naturally arises as a special case of this framework by setting all quadratic terms to zero\. As a result, this formulation allows us to seamlessly degenerate the quadratic model into its linear counterpart, thereby facilitating a direct and systematic investigation of the effectiveness and necessity of the quadratic terms\.

#### II\-A1Quadratic Function Class

To improve the quality of approximation over a broader domain, we restrictf​\(⋅\)f\(\\cdot\)to the class of quadratic mappings for two main reasons\. First, polynomial parameterizations offer significant practical advantages in optimization, as they lead to smooth objectives with well\-structured gradients and Hessians\. Second, the quadratic model is consistent with the manifold assumption, as it explicitly captures the interaction between the tangent space and the normal space through second\-order terms\. Specifically, we define the restricted class of quadratic mappingsf​\(⋅\)f\(\\cdot\)as follows:

ℱ=\{f:ℝd→ℝD\|f\(τ\)=c\+Uτ\+V𝒜\(τ,τ\),\\displaystyle\\mathcal\{F\}=\\bigl\\\{f:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{D\}\\;\\big\|\\;f\(\\tau\)=c\+U\\tau\+V\\,\\mathcal\{A\}\(\\tau,\\tau\),U⊤U=Id,V⊤V=Is,U⊤V=0\}\.\\displaystyle\\qquad\\qquad U^\{\\top\}U=I\_\{d\},\\;V^\{\\top\}V=I\_\{s\},\\;U^\{\\top\}V=0\\bigr\\\}\.Here,ccdenotes the shift parameter\. The columns ofUUform an orthonormal basis of the tangent space, while the columns ofVVspan a subspace orthogonal to that ofUU\. The vector𝒜​\(τ,τ\)∈ℝs\\mathcal\{A\}\(\\tau,\\tau\)\\in\\mathbb\{R\}^\{s\}represents the action of a third\-order tensor𝒜\\mathcal\{A\}on\(τ,τ\)\(\\tau,\\tau\), defined element\-wise by

\{𝒜​\(τ,τ\)\}k=∑i,j=1d𝒜k,i,j​τi​τj,k=1,…,s\.\\bigl\\\{\\mathcal\{A\}\(\\tau,\\tau\)\\bigr\\\}\_\{k\}=\\sum\_\{i,j=1\}^\{d\}\\mathcal\{A\}\_\{k,i,j\}\\,\\tau\_\{i\}\\tau\_\{j\},\\qquad k=1,\\dots,s\.Since the rank\-one matrixτ​τ⊤\\tau\\tau^\{\\top\}is symmetric, we may assume without loss of generality that each slice𝒜k∈ℝd×d\\mathcal\{A\}\_\{k\}\\in\\mathbb\{R\}^\{d\\times d\}is symmetric\. By concatenatingUUandVVinto a single matrixQ=\[U,V\]∈ℝD×\(d\+s\)Q=\[U,V\]\\in\\mathbb\{R\}^\{D\\times\(d\+s\)\}, the orthogonality constraints can be written compactly asQ⊤​Q=Id\+sQ^\{\\top\}Q=I\_\{d\+s\}\. This yields the equivalent representation

f​\(τ\)=c\+Q​\[τ𝒜​\(τ,τ\)\],Q⊤​Q=Id\+s\.f\(\\tau\)=c\+Q\\begin\{bmatrix\}\\tau\\\\ \\mathcal\{A\}\(\\tau,\\tau\)\\end\{bmatrix\},\\qquad Q^\{\\top\}Q=I\_\{d\+s\}\.\(3\)Exploiting the symmetry of𝒜\\mathcal\{A\}, there exists a matrixΘ\\Thetasuch that𝒜​\(τ,τ\)=Θ⊤​vech​\(τ​τ⊤\),\\mathcal\{A\}\(\\tau,\\tau\)=\\Theta^\{\\top\}\\mathrm\{vech\}\(\\tau\\tau^\{\\top\}\),wherevech​\(⋅\)\\mathrm\{vech\}\(\\cdot\)stacks the upper\-triangular entries of a symmetric matrix into a vector\. Consequently,ℱ\\mathcal\{F\}admits the equivalent formulation

f​\(τ\)=c\+Q​\[τΘ⊤​vech​\(τ​τ⊤\)\],Q⊤​Q=Id\+s\.f\(\\tau\)=c\+Q\\begin\{bmatrix\}\\tau\\\\ \\Theta^\{\\top\}\\mathrm\{vech\}\(\\tau\\tau^\{\\top\}\)\\end\{bmatrix\},\\qquad Q^\{\\top\}Q=I\_\{d\+s\}\.\(4\)Our learning problem is:

\(f^,\{τi^\}\)=arg⁡minf∈ℱ,\{τi\}​∑i=1nℓ​\(xi−f​\(τi\)\)\.\(\\widehat\{f\},\\ \\\{\\widehat\{\\tau\_\{i\}\}\\\}\)=\\arg\\min\_\{f\\in\\mathcal\{F\},\\,\\\{\\tau\_\{i\}\\\}\}\\;\\sum\_\{i=1\}^\{n\}\\ell\\bigl\(x\_\{i\}\-f\(\\tau\_\{i\}\)\\bigr\)\.\(5\)This model can be viewed as a generalized formulation of the subspace\-constrained quadratic factorization problem studied in\[[6](https://arxiv.org/html/2605.20300#bib.bib6)\], in which the Frobenius norm is replaced by a column\-wise loss defined through a specialized norm\.

From Eqn\. \([5](https://arxiv.org/html/2605.20300#S2.E5)\), we observe that the model simultaneously learns both the local geometry, represented by the parameterscc,QQ, andΘ\\Thetain the functionff, as well as the coordinates\{τi\}i=1n\\\{\\tau\_\{i\}\\\}\_\{i=1\}^\{n\}derived from the high\-dimensional input data\{xi\}i=1n\\\{x\_\{i\}\\\}\_\{i=1\}^\{n\}\. Once we obtainf^\\widehat\{f\}, we can use it as a model to refine the new noisy datayyby projecting it ontof^\\widehat\{f\}, which is achieved by solving the following optimization problem:

y^=arg⁡minx=f^​\(τ\)⁡ℓ​\(x−y\)\.\\widehat\{y\}=\\arg\\min\_\{x=\\widehat\{f\}\(\\tau\)\}\\ell\(x\-y\)\.\(6\)
From an algorithmic perspective, due to the nonlinearity of both the quadratic mapping and the general loss function, closed\-form solutions are generally not available for the optimization problem in Eqn\. \([5](https://arxiv.org/html/2605.20300#S2.E5)\) and Eqn\. \([6](https://arxiv.org/html/2605.20300#S2.E6)\)\. In the following sections, we first address the identifiability issue and discuss principled criteria for selecting an appropriate loss function\. We then propose an efficient gradient\-based algorithm to solve the resulting optimization problem numerically\.

### II\-BIdentifiability

The optimization problem in Eqn\. \([5](https://arxiv.org/html/2605.20300#S2.E5)\) is not identifiable due to the inherent invariances in the latent representation\. Specifically, the model exhibits invariance under orthogonal transformations of the latent variables\.

LetR∈O​\(d\)R\\in O\(d\)be any orthogonal matrix, and define the transformed latent variableη=R​τ\\eta=R\\tau\. Since

η​η⊤=R​\(τ​τ⊤\)​R⊤,\\eta\\eta^\{\\top\}=R\(\\tau\\tau^\{\\top\}\)R^\{\\top\},there exists a matrixS​\(R\)S\(R\)such thatvech​\(η​η⊤\)=S​\(R\)​vech​\(τ​τ⊤\)\.\\mathrm\{vech\}\(\\eta\\eta^\{\\top\}\)=S\(R\)\\,\\mathrm\{vech\}\(\\tau\\tau^\{\\top\}\)\.Consequently, by definingUR=U​RT,ΘR=S​\(R\)−⊤​Θ,U\_\{R\}=UR^\{T\},\\Theta\_\{R\}=S\(R\)^\{\-\\top\}\\Theta,we obtain

Θ⊤​vech​\(τ​τ⊤\)=ΘR⊤​vech​\(η​η⊤\),U​τ=UR​η\.\\Theta^\{\\top\}\\mathrm\{vech\}\(\\tau\\tau^\{\\top\}\)=\\Theta\_\{R\}^\{\\top\}\\mathrm\{vech\}\(\\eta\\eta^\{\\top\}\),\\qquad U\\tau=U\_\{R\}\\eta\.This implies that different parameter tuples\(c,U,V,Θ,τ\)\(c,U,V,\\Theta,\\tau\)and\(c,UR,V,ΘR,η\)\(c,U\_\{R\},V,\\Theta\_\{R\},\\eta\), related by orthogonal transformationsη=R​τ\\eta=R\\tau, induce the same input\-output mappingf​\(τ\)=fR​\(η\)f\(\\tau\)=f\_\{R\}\(\\eta\), wherefR​\(⋅\)f\_\{R\}\(\\cdot\)denotes the function parameterized byc,UR,V,ΘRc,U\_\{R\},V,\\Theta\_\{R\}\.

Therefore, the parameters of the quadratic factorization model are identifiable only up to equivalence classes induced by orthogonal transformations in the latent space\. Although a unique parameterization cannot be recovered, the learned mappingf​\(⋅\)f\(\\cdot\)and the associated quadratic manifold are uniquely determined within these equivalence classes\.

### II\-CLoss functions

In this section, we investigate how to choose an appropriate loss function for the quadratic factorization model\. The central principle is that the loss function should be aligned with the statistical distribution of the noise in the data or observations\. Specifically, when the noise exhibits a long\-tailed \(heavy\-tailed\) distribution, anℓpp\\ell\_\{p\}^\{p\}loss withp<2p<2provides a more robust and suitable choice\. In contrast, when the noise follows a short\-tailed distribution, the Gaussian assumption becomes appropriate, and the correspondingℓ22\\ell\_\{2\}^\{2\}loss is a natural and effective option\. This distribution\-aware selection enables the model to better capture the underlying data characteristics and achieve improved robustness and accuracy\.

TABLE I:Common loss functions, together with their gradients and Hessians with respect toxx, and the corresponding noise distributions under a maximum likelihood interpretation\.
### II\-DCriteria for Choosing the Loss Function

We discuss the criteria for loss\-function selection under three different scenarios\. First, we consider the simplest setting, in which the noise distributionϵi\{\\epsilon\_\{i\}\}is known*a priori*\. Second, we study the case where the contaminated noise samplesϵi\{\\epsilon\_\{i\}\}are directly observable\. Third, we address the most realistic scenario, in which the noise is implicitly embedded in the observations and cannot be isolated from the underlying signal\.

##### Known noise distribution

In the first scenario, we select the loss functionℓ​\(⋅\)\\ell\(\\cdot\)according to a principled, likelihood\-based criterion under known noise assumptions\. By the law of large numbers, as the sample size increases, the empirical distribution of the noise converges to its true underlying distribution\. Consequently, a statistically sound and natural choice of the loss function is the negative log\-likelihood induced by the assumed noise model\. This choice leads to statistically consistent estimators and, with high probability, ensures that the optimization objective faithfully reflects the intrinsic characteristics of the noise\. Guided by this principle, we consider several commonly adopted noise models together with their corresponding loss functions in the following analysis\.

When the noise is isotropic Gaussian, i\.e\.,ϵi∼𝒩​\(0,σ2​I\)\\epsilon\_\{i\}\\sim\\mathcal\{N\}\(0,\\sigma^\{2\}I\), the negative log\-likelihood is proportional to the squared Euclidean norm\. Accordingly, we adopt the lossℓ​\(r\)=‖r‖22\\ell\(r\)=\\\|r\\\|\_\{2\}^\{2\}, which recovers the classical least\-squares formulation\. If the noise instead follows a multivariate isotropic \(radial\) Laplace distribution with densityp​\(ϵ\)∝exp⁡\(−λ​‖ϵ‖2\)p\(\\epsilon\)\\propto\\exp\(\-\\lambda\\\|\\epsilon\\\|\_\{2\}\), the corresponding negative log\-likelihood leads to the Euclidean norm lossℓ​\(r\)=‖r‖2\\ell\(r\)=\\\|r\\\|\_\{2\}\. When the noise components are independently distributed according to univariate Laplace laws, the negative log\-likelihood becomes proportional to theℓ1\\ell\_\{1\}norm, and we employℓ​\(r\)=‖r‖1\\ell\(r\)=\\\|r\\\|\_\{1\}, which is well known for its robustness to sparse, large\-magnitude outliers\.

More generally, heavy\-tailed or light\-tailed noise can be modeled by a generalized Gaussian distribution with densityp​\(ϵ\)∝exp⁡\(−\|ϵ\|pp\)p\(\\epsilon\)\\propto\\exp\(\-\|\\epsilon\|\_\{p\}^\{p\}\)forp≥1p\\geq 1, motivating the lossℓ​\(r\)=‖r‖pp\\ell\(r\)=\\\|r\\\|\_\{p\}^\{p\}\. Smaller values ofppyield increased robustness to outliers, while larger values recover the Gaussian setting\. When the noise covariance is anisotropic, i\.e\.,ϵi∼𝒩​\(0,Σ\)\\epsilon\_\{i\}\\sim\\mathcal\{N\}\(0,\\Sigma\)withΣ≻0\\Sigma\\succ 0, the appropriate choice is the squared Mahalanobis lossℓ​\(r\)=r⊤​Σ−1​r\\ell\(r\)=r^\{\\top\}\\Sigma^\{\-1\}r, which accounts for correlations and heterogeneous scaling across dimensions\. Finally, in the presence of mixed noise models or outlier contamination, the Huber loss is frequently adopted, as it interpolates smoothly between theℓ2\\ell\_\{2\}andℓ1\\ell\_\{1\}norms, combining robustness with local smoothness\.

Table[I](https://arxiv.org/html/2605.20300#S2.T1)summarizes several commonly used loss functions together with their associated noise models from a maximum\-likelihood perspective\. Overall, the choice of the loss functionℓ​\(⋅\)\\ell\(\\cdot\)plays a central role in balancing statistical efficiency and robustness, and is therefore critical to the performance of the proposed quadratic factorization framework\.

##### Observable noise samples

In the second scenario, where the noise samples\{ϵi\}i=1n\\\{\\epsilon\_\{i\}\\\}\_\{i=1\}^\{n\}are explicitly observable, the optimal shape parameterppin theℓpp\\ell\_\{p\}^\{p\}loss can be estimated via maximum likelihood estimation \(MLE\)\[[14](https://arxiv.org/html/2605.20300#bib.bib14)\]\. Since no closed\-form solution is available, we suggest to employ a numerical optimization procedure based on the profile likelihood\. In particular, a one\-dimensional search strategy, such as grid search or bisection, is adopted, following standard practice in the literature\[[15](https://arxiv.org/html/2605.20300#bib.bib15),[16](https://arxiv.org/html/2605.20300#bib.bib16)\]\.

##### Unobservable noise embedded in observations\.

In the third and most practical scenario, the noise is intrinsically mixed with the unknown underlying signal, rendering direct maximum likelihood estimation of the noise distribution infeasible\. Nevertheless, meaningful and practically effective criteria can still be employed for loss\-function selection\. In this case, we propose the following strategies:

1. 1\.Control of heavy\-tailed noiseEmploying theℓpp\\ell\_\{p\}^\{p\}loss with a relatively small value ofppmitigates the influence of heavy\-tailed noise\. This choice reduces sensitivity to samples that lie far from the estimated curve, thereby enhancing robustness to large deviations\.
2. 2\.Euclidean\-distance\-based projection\.Both theℓ22\\ell\_\{2\}^\{2\}andℓ2\\ell\_\{2\}losses yield projections that minimize the Euclidean distance between data points and the estimated model, aligning with the classical notion of projection\. In particular, theℓ22\\ell\_\{2\}^\{2\}loss corresponds to the maximum likelihood estimator under Gaussian noise, which is widely adopted in practice\.
3. 3\.A conservative choice\.Theℓ2\\ell\_\{2\}loss serves as a conservative and stable choice, as it exhibits reliable performance across a wide range of noise conditions, including light\-tailed and moderately heavy\-tailed distributions\. Moreover, when computing the optimal local coordinates associated with the fitted low\-dimensional surface \(or curve inℝ2\\mathbb\{R\}^\{2\}\), theℓ2\\ell\_\{2\}andℓ22\\ell\_\{2\}^\{2\}losses are equivalent in the sense that they induce the same minimizers\. Consequently, both losses admit the standard distance\-based interpretation of orthogonal projection, which is consistent with classical geometric formulations\.

![Refer to caption](https://arxiv.org/html/2605.20300v1/x1.png)Figure 1:Illustration of the fitted curves and projection points obtained usingℓpp\\ell\_\{p\}^\{p\}losses with different values ofpp\.

### II\-EA toy example inℝ2\{\\mathbb\{R\}\}^\{2\}

In this experiment, we demonstrate that when the loss function is chosen to match the underlying noise distribution, all models achieve strong performance\. We present a synthetic example demonstrating that the choice of loss function should be consistent with the underlying noise distribution\. Specifically, we generate data according to

xi=\[cos⁡\(ti\),sin⁡\(ti\)\]T\+ϵi,ϵi∼Cp,d​exp⁡\(−‖ϵi‖pp\),x\_\{i\}=\[\\cos\(t\_\{i\}\),\\sin\(t\_\{i\}\)\]^\{T\}\+\\epsilon\_\{i\},\\qquad\\epsilon\_\{i\}\\sim C\_\{p,d\}\\exp\\bigl\(\-\\\|\\epsilon\_\{i\}\\\|\_\{p\}^\{p\}\\bigr\),\(7\)whereCp,dC\_\{p,d\}denotes the normalizing constant,\{ti\}\\\{t\_\{i\}\\\}consists of equally spaced points in the interval\[0,4\]\[0,4\], andϵi\\epsilon\_\{i\}follows a multivariate generalized Gaussian distribution with shape parameterpp\. Also, to assess the performance of theℓ2\\ell\_\{2\}loss under a matched noise model, we additionally consider a setting in which the noise follows a multivariate isotropic Laplace distribution, namely,ϵi∼Cd′​exp⁡\(−‖ϵi‖2\),\\epsilon\_\{i\}\\sim C^\{\\prime\}\_\{d\}\\exp\\bigl\(\-\\\|\\epsilon\_\{i\}\\\|\_\{2\}\\bigr\),which corresponds to the negative log\-likelihood associated with the Euclidean norm\.

![Refer to caption](https://arxiv.org/html/2605.20300v1/x2.png)Figure 2:Illustration of the fitted curves and projection points obtained usingℓpp\\ell\_\{p\}^\{p\}losses with different values ofppwhen restricting the function class to be linear \(𝒜=𝟎\{\\cal A\}=\\bf 0\)\.To this end, we describe how to generate synthetic noise samples whose density is proportional toexp⁡\(−‖ϵ‖pp\)\\exp\(\-\\\|\\epsilon\\\|\_\{p\}^\{p\}\)\. We focus on the multivariate generalized Gaussian distribution\[[17](https://arxiv.org/html/2605.20300#bib.bib17)\], whose probability density function takes exactly this form\. Owing to its separability, the distribution factorizes into independent generalized Gaussian marginals along each coordinate\. Consequently, a random vector inℝd\\mathbb\{R\}^\{d\}can be generated by independently sampling each coordinate from the one\-dimensional densityf​\(t\)∝exp⁡\(−\|t\|p\)\.f\(t\)\\propto\\exp\\left\(\-\|t\|^\{p\}\\right\)\.In practice, such samples can be efficiently constructed using a simple transformation: drawu∼Gamma​\(1/p,1\)u\\sim\\mathrm\{Gamma\}\(1/p,1\)and an independent Rademacher random variables∈\{−1,1\}s\\in\\\{\-1,1\\\}with equal probability, and setϵ=s​u1/p\\epsilon=s\\,u^\{1/p\}\. Repeating this procedure independently for each coordinate yields a random vector whose joint distribution follows the multivariate generalized Gaussian distribution\.

In Fig\.[1](https://arxiv.org/html/2605.20300#S2.F1), the fitted modelffis shown by the red curve, while the blue points denote the noisy observations\. The arrows indicate the orthogonal projections of the noisy data points onto the underlying low\-dimensional quadratic manifold defined byff\. The noisy data are generated under different values of the shape parameterp∈\{1\.0,1\.25,1\.5,1\.75,2\.0\}p\\in\\\{1\.0,1\.25,1\.5,1\.75,2\.0\\\}, and the corresponding∥⋅∥pp\\\|\\cdot\\\|\_\{p\}^\{p\}loss is employed in each case\.

We make the following observations\. First, the quadratic subspace\-constrained model exhibits substantially stronger fitting capability than its linear counterpart, achieving tighter approximation over a broader domain\. Second, the underlying low\-dimensional structure can be successfully recovered by the proposed subspace\-constrained quadratic matrix factorization model, provided that the loss function is properly matched to the noise distribution\. In particular, for heavy\-tailed noise, we recommend using theℓpp\\ell\_\{p\}^\{p\}loss with smaller values ofppto enhance robustness, whereas for light\-tailed noise, larger values ofpp, such as the squared Euclidean loss, are more suitable and yield higher statistical efficiency\.

## IIIGradients and KKT condition

In this section, we derive closed\-form expressions for∇τF\\nabla\_\{\\tau\}F,∇ΘF,∇cF\\nabla\_\{\\Theta\}F,\\nabla\_\{c\}F, and∇QF\\nabla\_\{Q\}F\. The gradient∇τF\\nabla\_\{\\tau\}Fis nontrivial to compute due to the involvement of the element\-extraction operatorvech​\(⋅\)\\mathrm\{vech\}\(\\cdot\)\.

### III\-AGradient with respect to\{τk\}k=1n\\\{\\tau\_\{k\}\\\}\_\{k=1\}^\{n\}

Here, we derive the gradient of the loss function with respect to the latent variableτk\\tau\_\{k\}\. Owing to the separability of the loss across data samples, the objective function can be decomposed into independent terms, each depending only on a single latent representation\. Consequently, the gradient with respect toτk\\tau\_\{k\}can be computed by differentiating only thekk\-th loss term\.

∇τkF=∇τk​∑i=1nℓ​\(xi−f​\(τi\)\)=∇τkℓ​\(xk−f​\(τk\)\)\.\\nabla\_\{\\tau\_\{k\}\}F=\\nabla\_\{\\tau\_\{k\}\}\\sum\_\{i=1\}^\{n\}\\ell\(x\_\{i\}\-f\(\\tau\_\{i\}\)\)=\\nabla\_\{\\tau\_\{k\}\}\\ell\(x\_\{k\}\-f\(\\tau\_\{k\}\)\)\.Due to the nonlinear dependence off​\(τk\)f\(\\tau\_\{k\}\)onτk\\tau\_\{k\}throughvech​\(τk​τkT\)\\mathrm\{vech\}\(\\tau\_\{k\}\\tau\_\{k\}^\{T\}\), a direct computation of∇τkf​\(τk\)\\nabla\_\{\\tau\_\{k\}\}f\(\\tau\_\{k\}\)is nontrivial\. Observing that, for a small perturbationδ∈ℝd\\delta\\in\\mathbb\{R\}^\{d\}, the functionf​\(τk\+δ\)f\(\\tau\_\{k\}\+\\delta\)admits the following first\-order expansion:

f​\(τk\+δ\)=f​\(τk\)\+⟨∇τkf​\(τk\),δ⟩\+𝒪​\(‖δ‖22\)\.f\(\\tau\_\{k\}\+\\delta\)=f\(\\tau\_\{k\}\)\+\\langle\\nabla\_\{\\tau\_\{k\}\}f\(\\tau\_\{k\}\),\\delta\\rangle\+\\mathcal\{O\}\(\\\|\\delta\\\|\_\{2\}^\{2\}\)\.\(8\)Based on \([8](https://arxiv.org/html/2605.20300#S3.E8)\), we derive the gradient by analyzing the differencef​\(τk\+δ\)−f​\(τk\)f\(\\tau\_\{k\}\+\\delta\)\-f\(\\tau\_\{k\}\)by

f​\(τk\+δ\)−f​\(τk\)\\displaystyle f\(\\tau\_\{k\}\+\\delta\)\-f\(\\tau\_\{k\}\)=\\displaystyle=U​δ\+V​ΘT​vech​\(\(τk\+δ\)​\(τk\+δ\)T\)\+V​ΘT​vech​\(τk​τkT\)\\displaystyle U\\delta\+V\\Theta^\{T\}\\mathrm\{vech\}\(\(\\tau\_\{k\}\+\\delta\)\(\\tau\_\{k\}\+\\delta\)^\{T\}\)\+V\\Theta^\{T\}\\mathrm\{vech\}\(\\tau\_\{k\}\\tau\_\{k\}^\{T\}\)=\\displaystyle=U​δ\+V​ΘT​vech​\(τk​δT\+δ​τkT\+δ​δT\)\.\\displaystyle U\\delta\+V\\Theta^\{T\}\\mathrm\{vech\}\(\\tau\_\{k\}\\delta^\{T\}\+\\delta\\tau\_\{k\}^\{T\}\+\\delta\\delta^\{T\}\)\.Define the operatorsTτ​\(δ\)=vech​\(τ​δT\)T\_\{\\tau\}\(\\delta\)=\\mathrm\{vech\}\(\\tau\\delta^\{T\}\)andT​’τ​\(δ\)=vech​\(δ​τT\)T’\_\{\\tau\}\(\\delta\)=\\mathrm\{vech\}\(\\delta\\tau^\{T\}\)\. Observe that bothvech​\(τ​δT\)\\mathrm\{vech\}\(\\tau\\delta^\{T\}\)andvech​\(δ​τT\)\\mathrm\{vech\}\(\\delta\\tau^\{T\}\)induce linear mappings with respect to the perturbationδ\\delta\. In particular, for anyδ1,δ2∈ℝd\\delta\_\{1\},\\delta\_\{2\}\\in\\mathbb\{R\}^\{d\}and any scalarκ∈ℝ\\kappa\\in\\mathbb\{R\}, these operators satisfy the additivity and homogeneity properties,

Tτk​\(κ​δ\)=vech​\(τk​\(κ​δ\)T\)=κ​vech​\(τk​δT\)=κ​Tτk​\(δ\)\.T\_\{\\tau\_\{k\}\}\(\\kappa\\delta\)=\\mathrm\{vech\}\\\!\\bigl\(\\tau\_\{k\}\(\\kappa\\delta\)^\{T\}\\bigr\)=\\kappa\\mathrm\{vech\}\(\\tau\_\{k\}\\delta^\{T\}\)=\\kappa T\_\{\\tau\_\{k\}\}\(\\delta\)\.Tτk​\(δ1\+δ2\)=\\displaystyle T\_\{\\tau\_\{k\}\}\(\\delta\_\{1\}\+\\delta\_\{2\}\)=vech​\(τk​\(δ1\+δ2\)T\)\\displaystyle\\mathrm\{vech\}\\\!\\bigl\(\\tau\_\{k\}\(\\delta\_\{1\}\+\\delta\_\{2\}\)^\{T\}\\bigr\)=\\displaystyle=vech​\(τk​δ1T\)\+vech​\(τk​δ2T\)\\displaystyle\\mathrm\{vech\}\(\\tau\_\{k\}\\delta\_\{1\}^\{T\}\)\+\\mathrm\{vech\}\(\\tau\_\{k\}\\delta\_\{2\}^\{T\}\)=\\displaystyle=Tτk​\(δ1\)\+Tτk​\(δ2\)\.\\displaystyle T\_\{\\tau\_\{k\}\}\(\\delta\_\{1\}\)\+T\_\{\\tau\_\{k\}\}\(\\delta\_\{2\}\)\.
Therefore, bothTτkT\_\{\\tau\_\{k\}\}andTτk′T^\{\\prime\}\_\{\\tau\_\{k\}\}are linear mappings fromℝd\\mathbb\{R\}^\{d\}toℝd​\(d\+1\)2\\mathbb\{R\}^\{\\frac\{d\(d\+1\)\}\{2\}\}\. By the matrix representation theorem for linear transformations\[[18](https://arxiv.org/html/2605.20300#bib.bib18)\], any linear operator admits a matrix representation once the input and output bases are fixed\. In the following, we adopt the standard bases and derive the explicit matrix form of the corresponding linear transformation\.

###### Corollary 1\(Matrix representation theorem\)\.

LetT:ℝn→ℝmT:\\mathbb\{R\}^\{n\}\\to\\mathbb\{R\}^\{m\}be a linear transformation\. Then there exists a unique matrixA∈ℝm×nA\\in\\mathbb\{R\}^\{m\\times n\}such that

T​\(x\)=A​x,∀x∈ℝn\.T\(x\)=Ax,\\quad\\forall x\\in\\mathbb\{R\}^\{n\}\.

In order to find the corresponding transformation matrix\. We take the standard basis of\{e1,e2,…,ed\}∈ℝd\\\{e\_\{1\},e\_\{2\},\.\.\.,e\_\{d\}\\\}\\in\\mathbb\{R\}^\{d\}, Then, anyδ∈ℝd\\delta\\in\\mathbb\{R\}^\{d\}can be written as

Tτk​\(δ\)=\\displaystyle T\_\{\\tau\_\{k\}\}\(\\delta\)=Tτk​\(∑i=1d⟨δ,ei⟩​ei\)=∑i=1d⟨δ,ei⟩​Tτk​\(ei\)\\displaystyle T\_\{\\tau\_\{k\}\}\(\\sum\_\{i=1\}^\{d\}\\langle\\delta,e\_\{i\}\\rangle e\_\{i\}\)=\\sum\_\{i=1\}^\{d\}\\langle\\delta,e\_\{i\}\\rangle T\_\{\\tau\_\{k\}\}\(e\_\{i\}\)=\\displaystyle=\[Tτk​\(e1\),Tτk​\(e2\),…,Tτk​\(ed\)\]​δ\.\\displaystyle\[T\_\{\\tau\_\{k\}\}\(e\_\{1\}\),T\_\{\\tau\_\{k\}\}\(e\_\{2\}\),\.,T\_\{\\tau\_\{k\}\}\(e\_\{d\}\)\]\\delta\.Therefore, by definingMτk=\[Tτk​\(e1\),Tτk​\(e2\),…,Tτk​\(ed\)\]M\_\{\\tau\_\{k\}\}=\[T\_\{\\tau\_\{k\}\}\(e\_\{1\}\),T\_\{\\tau\_\{k\}\}\(e\_\{2\}\),\.\.\.,T\_\{\\tau\_\{k\}\}\(e\_\{d\}\)\]andNτk=\[Tτk′​\(e1\),Tτk′​\(e2\),…,Tτk′​\(ed\)\]N\_\{\\tau\_\{k\}\}=\[T^\{\\prime\}\_\{\\tau\_\{k\}\}\(e\_\{1\}\),T^\{\\prime\}\_\{\\tau\_\{k\}\}\(e\_\{2\}\),\.\.\.,T^\{\\prime\}\_\{\\tau\_\{k\}\}\(e\_\{d\}\)\], we have:

f​\(τk\+δ\)−f​\(τk\)=U​δ\+V​ΘT​\(Mτk\+Nτk\)​δ\+V​ΘT​vech​\(δ​δT\)\.f\(\\tau\_\{k\}\+\\delta\)\-f\(\\tau\_\{k\}\)=U\\delta\+V\\Theta^\{T\}\(M\_\{\\tau\_\{k\}\}\+N\_\{\\tau\_\{k\}\}\)\\delta\+V\\Theta^\{T\}\\mathrm\{vech\}\(\\delta\\delta^\{T\}\)\.By the definitions of the linear operatorsT​\(⋅\)T\(\\cdot\)andT′​\(⋅\)T^\{\\prime\}\(\\cdot\), the explicit forms ofTτk​\(ei\)T\_\{\\tau\_\{k\}\}\(e\_\{i\}\)andTτk′​\(ei\)T^\{\\prime\}\_\{\\tau\_\{k\}\}\(e\_\{i\}\)can be readily derived for alli=1,…,di=1,\\ldots,d\. Neglecting the higher\-order termV​ΘT​vech​\(δ​δT\)V\\Theta^\{T\}\\mathrm\{vech\}\(\\delta\\delta^\{T\}\)in the first\-order expansion, we obtain

∇τf​\(τ\)=U\+V​ΘT​\(Mτ\+Nτ\)\.\\nabla\_\{\\tau\}f\(\\tau\)=U\+V\\Theta^\{T\}\(M\_\{\\tau\}\+N\_\{\\tau\}\)\.Therefore, the gradient∇τF\\nabla\_\{\\tau\}Fis:

∇τkF​\(c,Q,Θ,Φ\)=\\displaystyle\\nabla\_\{\\tau\_\{k\}\}F\(c,Q,\\Theta,\\Phi\)=∇τkℓ​\(xk−f​\(τk\)\)\\displaystyle\\nabla\_\{\\tau\_\{k\}\}\\ell\(x\_\{k\}\-f\(\\tau\_\{k\}\)\)\(9\)=\\displaystyle=∇τkTf​\(τk\)​∇yℓ​\(y−xk\)\|y=f​\(τk\)\.\\displaystyle\{\\nabla^\{T\}\_\{\\tau\_\{k\}\}f\(\\tau\_\{k\}\)\}\\nabla\_\{y\}\\ell\(y\-x\_\{k\}\)\|\_\{y=f\(\\tau\_\{k\}\)\}\.
Note that the Jacobian matrix∇τkf​\(τk\)\\nabla\_\{\\tau\_\{k\}\}f\(\\tau\_\{k\}\)is defined with respect to thekk\-th latent representationτk\\tau\_\{k\}corresponding toxkx\_\{k\}\. In the following, we present the general forms of the matricesMτM\_\{\\tau\}andNτN\_\{\\tau\}, which characterize the linear mappings induced by the quadratic term\. By stacking the vectorsTτ​\(ej\),j=1,…,dT\_\{\\tau\}\(e\_\{j\}\),j=1,\\ldots,d, as the columns ofMτ∈ℝ\{\(d2\+d\)/2\}×dM\_\{\\tau\}\\in\{\\mathbb\{R\}\}^\{\\\{\(d^\{2\}\+d\)/2\\\}\\times d\}, andTτ′​\(ej\),j=1,…,dT^\{\\prime\}\_\{\\tau\}\(e\_\{j\}\),j=1,\\ldots,d, as the columns ofNτ∈ℝ\{\(d2\+d\)/2\}×dN\_\{\\tau\}\\in\{\\mathbb\{R\}\}^\{\\\{\(d^\{2\}\+d\)/2\\\}\\times d\}, we obtain explicit matrix representations of the two linear operators\. Consequently, for anyτ∈ℝd\\tau\\in\\mathbb\{R\}^\{d\}, the transformation matricesMτM\_\{\\tau\}andNτN\_\{\\tau\}admit the following structured forms:

Mτ=\(τ\[1\]0000000τ\[1\]00000……000000τ\[1\]0τ\[2\]0000000τ\[2\]0000……000000τ\[2\]……000000τ\[d\]\),M\_\{\\tau\}=\\left\(\\begin\{array\}\[\]\{ccccccc\}\\tau\_\{\[1\]\}&0&0&0&0&0&0\\\\ 0&\\tau\_\{\[1\]\}&0&0&0&0&0\\\\ &&\.\.\.&\.\.\.&&&\\\\ 0&0&0&0&0&0&\\tau\_\{\[1\]\}\\\\ 0&\\tau\_\{\[2\]\}&0&0&0&0&0\\\\ 0&0&\\tau\_\{\[2\]\}&0&0&0&0\\\\ &&\.\.\.&\.\.\.&&&\\\\ 0&0&0&0&0&0&\\tau\_\{\[2\]\}\\\\ &&\.\.\.&\.\.\.&&&\\\\ 0&0&0&0&0&0&\\tau\_\{\[d\]\}\\end\{array\}\\right\),Nτ=\(τ\[1\]000000τ\[2\]000000……τ\[d\]0000000τ\[2\]000000τ\[3\]00000……0τ\[d\]00000……000000τ\[d\]\)\.N\_\{\\tau\}=\\left\(\\begin\{array\}\[\]\{ccccccc\}\\tau\_\{\[1\]\}&0&0&0&0&0&0\\\\ \\tau\_\{\[2\]\}&0&0&0&0&0&0\\\\ &&\.\.\.&\.\.\.&&&\\\\ \\tau\_\{\[d\]\}&0&0&0&0&0&0\\\\ 0&\\tau\_\{\[2\]\}&0&0&0&0&0\\\\ 0&\\tau\_\{\[3\]\}&0&0&0&0&0\\\\ &&\.\.\.&\.\.\.&&&\\\\ 0&\\tau\_\{\[d\]\}&0&0&0&0&0\\\\ &&\.\.\.&\.\.\.&&&\\\\ 0&0&0&0&0&0&\\tau\_\{\[d\]\}\\end\{array\}\\right\)\.Noting that bothMτM\_\{\\tau\}andNτN\_\{\\tau\}are highly sparse matrices, each containing exactly one nonzero entry per row, we can express them in the following compact form:

Nτ=\[τ\[1\]​e1,τ\[2\]​e1,⋯,τ\[d\]​e1,τ\[2\]​e2,⋯,τ\[d\]​e2,⋯,τd​ed\]T,Mτ=\[τ\[1\]​e1,τ\[1\]​e2,⋯,τ\[1\]​ed,τ\[2\]​e2,⋯,τ\[2\]​ed,⋯,τ\[d\]​ed\]T\.\\begin\{gathered\}N\_\{\\tau\}=\[\\tau\_\{\[1\]\}e\_\{1\},\\tau\_\{\[2\]\}e\_\{1\},\\cdots,\\tau\_\{\[d\]\}e\_\{1\},\\tau\_\{\[2\]\}e\_\{2\},\\cdots,\\tau\_\{\[d\]\}e\_\{2\},\\cdots,\\tau\_\{d\}e\_\{d\}\]^\{T\},\\\\ M\_\{\\tau\}=\[\\tau\_\{\[1\]\}e\_\{1\},\\tau\_\{\[1\]\}e\_\{2\},\\cdots,\\tau\_\{\[1\]\}e\_\{d\},\\tau\_\{\[2\]\}e\_\{2\},\\cdots,\\tau\_\{\[2\]\}e\_\{d\},\\cdots,\\tau\_\{\[d\]\}e\_\{d\}\]^\{T\}\.\\end\{gathered\}Here,τ\[k\]∈ℝ\\tau\_\{\[k\]\}\\in\\mathbb\{R\}denotes thekk\-th scalar component ofτ∈ℝd\\tau\\in\\mathbb\{R\}^\{d\}, which is distinguished fromτk∈ℝd\\tau\_\{k\}\\in\\mathbb\{R\}^\{d\}, thekk\-th latent vector associated with the data pointxk∈ℝDx\_\{k\}\\in\\mathbb\{R\}^\{D\}\. The vectoreke\_\{k\}denotes thekk\-th canonical basis \(one\-hot\) vector, whosekk\-th entry is equal to11and all other entries are zero\.

### III\-BRiemannian Gradient and First\-Order Optimality Condition forQQ

In this subsection, we compute the Riemannian gradientGQG\_\{Q\}and updateQQby performing a gradient step along the tangent direction followed by a retraction onto the Stiefel manifold\[[19](https://arxiv.org/html/2605.20300#bib.bib19)\]:

\{\[Q~,R\]=qr​\(Q\(t\)−ηt​GQ\(t\)\),Q\(t\+1\)=Q~​diag​\(sign​\(diag​\(R\)\)\),\\left\\\{\\begin\{aligned\} &\[\\widetilde\{Q\},R\]=\\mathrm\{qr\}\\\!\\left\(Q^\{\(t\)\}\-\\eta\_\{t\}G\_\{Q^\{\(t\)\}\}\\right\),\\\\ &Q^\{\(t\+1\)\}=\\widetilde\{Q\}\\,\\mathrm\{diag\}\\\!\\big\(\\mathrm\{sign\}\(\\mathrm\{diag\}\(R\)\)\\big\),\\end\{aligned\}\\right\.\(10\)whereηt\>0\\eta\_\{t\}\>0denotes the step size\. Here, the QR decomposition acts as a retraction operator, mapping the updated iterate back onto the Stiefel manifold, while the diagonal sign correction ensures the uniqueness of the factorization and preserves continuity of the iterates\. The Riemannian update onSt​\(d\+s,D\)\\mathrm\{St\}\(d\+s,D\)

GQ=\\displaystyle G\_\{Q\}=∇QF−Qsym​\(QT​∇QF\)\\displaystyle\\nabla\_\{Q\}F\-Q^\{\\mathrm\{sym\}\}\\big\(Q^\{T\}\\nabla\_\{Q\}F\\big\)=\\displaystyle=QQT​∇QF−∇QTF​Q2\+Q⟂\(Q⟂\)T\)∇QF,\\displaystyle Q\\frac\{Q^\{T\}\\nabla\_\{Q\}F\-\\nabla\_\{Q\}^\{T\}FQ\}\{2\}\+Q^\{\\perp\}\(Q^\{\\perp\}\)^\{T\}\)\\nabla\_\{Q\}F,where the Euclidean gradient∇QF\\nabla\_\{Q\}Fwith a specific form by

∇QF​\(c,Q,Θ,Φ\)=\\displaystyle\\nabla\_\{Q\}F\(c,Q,\\Theta,\\Phi\)=∑i=1n∇Qℓ​\(f​\(τi\)−xi\)\\displaystyle\\sum\_\{i=1\}^\{n\}\\nabla\_\{Q\}\\ell\(f\(\\tau\_\{i\}\)\-x\_\{i\}\)=\\displaystyle=∑i=1n∇yℓ​\(y\)\|y=f​\(τi\)−xi​\[τiT,vechT​\(τi​τiT\)​Θ\]\.\\displaystyle\\sum\_\{i=1\}^\{n\}\\nabla\_\{y\}\\ell\(y\)\|\_\{y=f\(\\tau\_\{i\}\)\-x\_\{i\}\}\\left\[\\tau\_\{i\}^\{T\},\\mathrm\{vech\}^\{T\}\(\\tau\_\{i\}\\tau\_\{i\}^\{T\}\)\\Theta\\right\]\.
###### Proposition 1\.

If the Riemannian gradient with respect toQQvanishes, i\.e\.,GQ=𝟎G\_\{Q\}=\\mathbf\{0\}, thenQQsatisfies the first\-order optimality condition

∇QF​\(c,Q,Θ,Φ\)−2​Q​Λ=𝟎,Q⊤​Q=Id\+s,\\nabla\_\{Q\}F\(c,Q,\\Theta,\\Phi\)\-2Q\\Lambda=\\mathbf\{0\},\\qquad Q^\{\\top\}Q=I\_\{d\+s\},\(11\)whereΛ∈ℝ\(d\+s\)×\(d\+s\)\\Lambda\\in\\mathbb\{R\}^\{\(d\+s\)\\times\(d\+s\)\}is a symmetric matrix of Lagrange multipliers\.

The proof is placed in the appendix\. It is worth noting that although the loss function is an affine mapping composed with a convex outer function, the resulting objective is not convex with respect toQQdue to the nonconvex geometry induced by the Stiefel manifold constraint\.

### III\-CGradient forΘ\\Thetaandcc

Since the loss function involves linear transformations with respect to bothccandΘ\\Theta, followed by an outer norm operation, the optimization properties ofccandΘ\\Thetaare inherently coupled\. We therefore analyze these two variables jointly in this subsection\.

##### Gradient with respect toΘ\\Theta

For eachii, the mappingΘ↦V​ΘT​vech​\(τi​τiT\)\\Theta\\;\\mapsto\\;V\\Theta^\{T\}\\mathrm\{vech\}\(\\tau\_\{i\}\\tau\_\{i\}^\{T\}\)is affine inΘ\\Theta\. Since the loss functionℓ​\(⋅\)\\ell\(\\cdot\)is convex, the compositionℓ​\(fΘ,c​\(τi\)−xi\)\\ell\\\!\\left\(f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\-x\_\{i\}\\right\)is convex with respect toΘ\\Theta\. Applying the chain rule, we obtain

∇Θ​∑i=1nℓ​\(fΘ,c​\(τi\)−xi\)\\displaystyle\\nabla\_\{\\Theta\}\\sum\_\{i=1\}^\{n\}\\ell\\\!\\left\(f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\-x\_\{i\}\\right\)=\\displaystyle=∑i=1nvech​\(τi​τiT\)​\(∇yℓ​\(y−xi\)T\)\|y=fΘ,c​\(τi\)​V\.\\displaystyle\\sum\_\{i=1\}^\{n\}\\mathrm\{vech\}\(\\tau\_\{i\}\\tau\_\{i\}^\{T\}\)\\Big\(\\nabla\_\{y\}\\ell\(y\-x\_\{i\}\)^\{T\}\\Big\)\\Big\|\_\{\\,y=f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\}V\.

##### Gradient with respect tocc

Similarly, the inner function is affine incc, and the gradient is given by

∇c​∑i=1nℓ​\(fΘ,c​\(τi\)−xi\)=∑i=1n∇yℓ​\(y−xi\)\|y=fΘ,c​\(τi\)\.\\nabla\_\{c\}\\sum\_\{i=1\}^\{n\}\\ell\\\!\\left\(f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\-x\_\{i\}\\right\)=\\sum\_\{i=1\}^\{n\}\\nabla\_\{y\}\\ell\(y\-x\_\{i\}\)\\Big\|\_\{\\,y=f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\}\.

##### KKT condition

In this section, we give the first\-order optimal condition for our objective, which is a complex nonlinear equation system\.

\{PTQ​\(∇QF\)=Q​\(QT​∇QF−∇QTF​Q2\)\+Q⟂​Q⟂T​∇QF=𝟎,∇ΘF=∑i=1nvech​\(τi​τiT\)​\(∇yℓ​\(y−xi\)T\)\|y=fΘ,c​\(τi\)​V,∇cF=∑i=1n∇yℓ​\(y−xi\)\|y=fΘ,c​\(τi\)=𝟎,∇τiF=\(U\+V​ΘT​\(Mτi\+Nτi\)\)T​∇yℓ​\(y−xk\)\|y=f​\(τi\)=𝟎,\\left\\\{\\begin\{aligned\} &P\_\{T\_\{Q\}\}\(\\nabla\_\{Q\}F\)=Q\(\\frac\{Q^\{T\}\\nabla\_\{Q\}F\-\\nabla\_\{Q\}^\{T\}FQ\}\{2\}\)\+Q^\{\\perp\}\{Q^\{\\perp\}\}^\{T\}\\nabla\_\{Q\}F=\{\\bf 0\},\\\\ &\\nabla\_\{\\Theta\}F=\\sum\_\{i=1\}^\{n\}\\mathrm\{vech\}\(\\tau\_\{i\}\\tau\_\{i\}^\{T\}\)\\Big\(\\nabla\_\{y\}\\ell\(y\-x\_\{i\}\)^\{T\}\\Big\)\\Big\|\_\{\\,y=f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\}V,\\\\ &\\nabla\_\{c\}F=\\sum\_\{i=1\}^\{n\}\\nabla\_\{y\}\\ell\(y\-x\_\{i\}\)\\Big\|\_\{\\,y=f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\}=\{\\bf 0\},\\\\ &\\nabla\_\{\\tau\_\{i\}\}F=\(U\+V\\Theta^\{T\}\(M\_\{\\tau\_\{i\}\}\+N\_\{\\tau\_\{i\}\}\)\)^\{T\}\\nabla\_\{y\}\\ell\(y\-x\_\{k\}\)\|\_\{y=f\(\\tau\_\{i\}\)\}=\{\\bf 0\},\\end\{aligned\}\\right\.\(12\)wherePTQ​\(⋅\)P\_\{T\_\{Q\}\}\(\\cdot\)denotes the orthogonal projection onto the tangent spaceTQ​St​\(d\+s,D\)T\_\{Q\}\\mathrm\{St\}\(d\+s,D\)of the Stiefel manifold, ensuring that the stationarity condition with respect toQQis compatible with the orthogonality constraint\.

Owing to the strong nonlinearity of the above system and the coupling among variables, directly solving the first\-order conditions in \([12](https://arxiv.org/html/2605.20300#S3.E12)\) is intractable in practice\. Instead, we adopt a gradient\-based optimization strategy, where each variable is updated iteratively using \(Riemannian\) gradient descent, as detailed in Section[VI](https://arxiv.org/html/2605.20300#S6)\.

## IVConvexity Analysis

We analyze the convexity of each subproblem with respect to the variablesΘ,c\\Theta,c, and\{τk\}\\\{\\tau\_\{k\}\\\}\. When all other variables are fixed, the subproblem inccis convex, since the outer loss functionℓ​\(⋅\)\\ell\(\\cdot\)is convex and the inner mapping is affine in c\. The same argument applies to the subproblem inΘ\\Theta\. Consequently, the joint problemminΘ,c​∑i=1nℓ​\(fΘ,c​\(τi\)−xi\)\\min\_\{\\Theta,\\,c\}\\ \\sum\_\{i=1\}^\{n\}\\ell\\\!\\left\(f\_\{\\Theta,c\}\(\\tau\_\{i\}\)\-x\_\{i\}\\right\)is a convex optimization problem, as it consists of a sum of convex functions composed with affine mappings\. As a result, for fixed\{τi\}\\\{\\tau\_\{i\}\\\}, any stationary point with respect to\(Θ,c\)\(\\Theta,c\)is globally optimal\.

In contrast, the subproblem with respect toτk\\tau\_\{k\}is more challenging, as the inner transformationf​\(τk\)f\(\\tau\_\{k\}\)is nonlinear\. Nevertheless, we can still investigate the condition under which the Hessian is positive definite\. Since the overall objective is separable across samples,

F​\(c,Q,Θ,Φ\)=∑k=1nℓ​\(f​\(τk\)−xk\),F\(c,Q,\\Theta,\\Phi\)=\\sum\_\{k=1\}^\{n\}\\ell\\big\(f\(\\tau\_\{k\}\)\-x\_\{k\}\\big\),the optimization with respect to\{τk\}i=1n\\\{\\tau\_\{k\}\\\}\_\{i=1\}^\{n\}decomposes into independent single\-sample subproblems\. Consequently, it suffices to analyze the convexity properties of the functionτ↦ℓ​\(f​\(τ\)−x\)\\tau\\mapsto\\ell\\big\(f\(\\tau\)\-x\\big\)for a single data point\.

##### Convexity of the Projection Problem

Here, we investigate the optimization problem with respect to the lower\-dimensional representation\{τk\}k=1n\\\{\\tau\_\{k\}\\\}\_\{k=1\}^\{n\}\. Since each subproblem with respect toτk\\tau\_\{k\}is independent, for notational convenience, we leave out the indexkkand study on general problem as

minτ⁡ℓ​\(f​\(τ\)−x\)\.\\min\_\{\\tau\}\\ell\(f\(\\tau\)\-x\)\.We next derive the second\-order derivativeH​\(τ\)H\(\\tau\)with respect to the latent variableτ\\tau\. Denotey:=f​\(τ\)−xy:=f\(\\tau\)\-x\. The Hessian admits the decomposition as

H​\(τ\)=∇τf​\(τ\)⊤​∇y2ℓ​\(y\)​∇τf​\(τ\)\+∇τ2f​\(τ\)×1∇yℓ​\(y\),\\displaystyle H\(\\tau\)=\\nabla\_\{\\tau\}f\(\\tau\)^\{\\top\}\\nabla\_\{y\}^\{2\}\\ell\(y\)\\nabla\_\{\\tau\}f\(\\tau\)\+\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)\\times\_\{1\}\\nabla\_\{y\}\\ell\(y\),\(13\)Here,∇y2ℓ​\(y\)\\nabla\_\{y\}^\{2\}\\ell\(y\)denotes the Hessian of the loss with respect toyy, and×1\\times\_\{1\}represents the mode–11tensor–vector contraction, i\.e\., the contraction of the first mode of the third\-order tensor∇τ2f​\(τ\)\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)with the vector∇yℓ​\(y\)\\nabla\_\{y\}\\ell\(y\)evaluated aty=f​\(τ\)−xy=f\(\\tau\)\-x\. For clarity, we summarize the dimensions of the involved quantities:∇τf​\(τ\)∈ℝD×d,∇τ2f​\(τ\)∈ℝD×d×d,∇yℓ​\(y\)\|y=f​\(τ\)−x∈ℝD\.\\nabla\_\{\\tau\}f\(\\tau\)\\in\\mathbb\{R\}^\{D\\times d\},\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)\\in\\mathbb\{R\}^\{D\\times d\\times d\},\\nabla\_\{y\}\\ell\(y\)\\big\|\_\{y=f\(\\tau\)\-x\}\\in\\mathbb\{R\}^\{D\}\.

The first term in \([13](https://arxiv.org/html/2605.20300#S4.E13)\) characterizes the curvature contribution induced by the loss function through the Jacobian offf, whereas the second term captures the intrinsic curvature of the model manifold arising from the nonlinear mappingff\. Owing to the convexity of the loss functionℓ\\ell, the Hessian∇y2ℓ​\(y\)\|y=f​\(τ\)−x\\nabla\_\{y\}^\{2\}\\ell\(y\)\\big\|\_\{y=f\(\\tau\)\-x\}is positive semidefinite, which implies that the first term in \([13](https://arxiv.org/html/2605.20300#S4.E13)\) is also positive semidefinite\. In particular, for theℓpp\\ell\_\{p\}^\{p\}loss, the Hessian with respect toyyadmits the diagonal form

∇y2ℓ​\(y\)\|y=f​\(τ\)−x=p​\(p−1\)​diag​\(\{\|f​\(τ\)i−xi\|p−2\}i=1D\),\\nabla\_\{y\}^\{2\}\\ell\(y\)\\big\|\_\{y=f\(\\tau\)\-x\}=p\(p\-1\)\\,\{\\rm diag\}\\\!\\left\(\\bigl\\\{\|f\(\\tau\)\_\{i\}\-x\_\{i\}\|^\{p\-2\}\\bigr\\\}\_\{i=1\}^\{D\}\\right\),which is positive semidefinite for allp≥1p\\geq 1\.

As we restrictfffrom the quadratic form, the second\-order derivative with respect toτ\\tauis constant and given by

\{∇τ2f​\(τ\)\}r​u​v=2​∑sVr​s​𝒜s​u​v\.\\big\\\{\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)\\big\\\}\_\{ruv\}=2\\sum\_\{s\}V\_\{rs\}\\,\\mathcal\{A\}\_\{suv\}\.where∇τ2f​\(τ\)∈ℝD×d×d\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)\\in\{\\mathbb\{R\}\}^\{D\\times d\\times d\}\. Therefore,

∇τ2f​\(τ\)×1∇yℓ​\(y\)\|y=f​\(τ\)−x\\displaystyle\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)\\times\_\{1\}\\nabla\_\{y\}\\ell\(y\)\\big\|\_\{y=f\(\\tau\)\-x\}\(14\)=\\displaystyle=2​∑r=1D\{∑t=1sVr​t​𝒜t​u​v\}​\{∇yℓ​\(y\)\|y=f​\(τ\)−x\}r\.\\displaystyle 2\\sum\_\{r=1\}^\{D\}\\\{\\sum\_\{t=1\}^\{s\}V\_\{rt\}\\,\\mathcal\{A\}\_\{tuv\}\\\}\\\{\\nabla\_\{y\}\\ell\(y\)\\big\|\_\{y=f\(\\tau\)\-x\}\\\}\_\{r\}\.
Owing to the presence of the curvature tensor𝒜\\mathcal\{A\}and its contraction with the gradient∇yℓ​\(y\)\\nabla\_\{y\}\\ell\(y\), the definiteness of the term∇τ2f​\(τ\)×1∇yℓ​\(y\)\|y=f​\(τ\)−x\\nabla\_\{\\tau\}^\{2\}f\(\\tau\)\\times\_\{1\}\\nabla\_\{y\}\\ell\(y\)\\big\|\_\{y=f\(\\tau\)\-x\}cannot be determined directly\. Nevertheless, its magnitude can be controlled by analyzing the interaction between the curvature tensor𝒜\\mathcal\{A\}and the derivative of the loss function\. In particular, for theℓpp\\ell\_\{p\}^\{p\}loss withp\>1p\>1, either a sufficiently small gradient norm∇yℓ​\(y\)\|y=f​\(τ\)−x\\nabla\_\{y\}\\ell\(y\)\|\_\{y=f\(\\tau\)\-x\}\(small noise setting\) or a small curvature scale of the tensor𝒜\\mathcal\{A\}effectively suppresses the contribution of \([14](https://arxiv.org/html/2605.20300#S4.E14)\)\. As a consequence, the aggregated HessianH​\(τ\)H\(\\tau\)remains positive definite over a relatively large region of the parameter space\.

Motivated by this observation, we next establish a theorem that quantitatively characterizes the region in whichH​\(τ\)H\(\\tau\)is positive definite by analyzing the structure of the Hessian matrix\.

###### Theorem 1\(Local convexity radius forℓpp\\ell\_\{p\}^\{p\}\-SCQM\)\.

Let1<p≤21<p\\leq 2and consider the objective

minτ⁡ℓ​\(τ\)=‖f​\(τ\)−x‖pp\.\\min\_\{\\tau\}\\ell\(\\tau\)=\\\|f\(\\tau\)\-x\\\|\_\{p\}^\{p\}\.Assume there exist constantsσ0,ρ,A0\>0\\sigma\_\{0\},\\rho,A\_\{0\}\>0such that, the following conditions hold:

1. 1\.the Jacobian satisfiesσmin​\(∇τf​\(τ\)\)≥σ0\\sigma\_\{\\min\}\(\\nabla\_\{\\tau\}f\(\\tau\)\)\\geq\\sigma\_\{0\};
2. 2\.the residual is bounded away from zero, i\.e\.,minj⁡\|\(f​\(τ\)−x\)j\|p−2≥ρ\\min\_\{j\}\|\(f\(\\tau\)\-x\)\_\{j\}\|^\{p\-2\}\\geq\\rho;
3. 3\.the quadratic tensor satisfies‖𝒜‖:=maxs⁡‖𝒜s‖≤A0\\\|\\mathcal\{A\}\\\|:=\\max\_\{s\}\\\|\\mathcal\{A\}\_\{s\}\\\|\\leq A\_\{0\}\.

Then the Hessian ofℓ​\(τ\)\\ell\(\\tau\)is positive semidefinite throughout thep−1p\-1norm ball𝒩p−1​\(x,rp\)\\mathcal\{N\}\_\{p\-1\}\(x,r\_\{p\}\):

𝒩p−1​\(x,rp\):=\{τ∈𝒯\|‖f​\(τ\)−x‖p−1≤rp\},\\mathcal\{N\}\_\{p\-1\}\(x,r\_\{p\}\)\\;:=\\;\\left\\\{\\tau\\in\\mathcal\{T\}\\;\\middle\|\\;\\\|f\(\\tau\)\-x\\\|\_\{p\-1\}\\leq r\_\{p\}\\right\\\},whererp:=\(\(p−1\)​ρ​σ022​A0\)1p−1\.r\_\{p\}:=\\left\(\\frac\{\(p\-1\)\\rho\\sigma\_\{0\}^\{2\}\}\{2A\_\{0\}\}\\right\)^\{\\\!\\frac\{1\}\{p\-1\}\}\.

The proof for Theorem[1](https://arxiv.org/html/2605.20300#Thmtheorem1)is placed in the Appendix\. It is worth noting that the assumptionσmin​\(∇τf​\(τ\)\)≥σ0\\sigma\_\{\\min\}\(\\nabla\_\{\\tau\}f\(\\tau\)\)\\geq\\sigma\_\{0\}is natural in our setting\. Indeed, since∇τf​\(τ\)=U\+V​Θ⊤​\(Mτ\+Nτ\),\\nabla\_\{\\tau\}f\(\\tau\)=U\+V\\Theta^\{\\top\}\(M\_\{\\tau\}\+N\_\{\\tau\}\),whereUUandVVhave orthonormal columns and satisfyU⊤​V=0U^\{\\top\}V=0, we have for anyzz,

‖∇τf​\(τ\)​z‖22=\\displaystyle\\\|\\nabla\_\{\\tau\}f\(\\tau\)z\\\|\_\{2\}^\{2\}=‖U​z‖22\+‖V​Θ⊤​\(Mτ\+Nτ\)​z‖22\\displaystyle\\\|Uz\\\|\_\{2\}^\{2\}\+\\\|V\\Theta^\{\\top\}\(M\_\{\\tau\}\+N\_\{\\tau\}\)z\\\|\_\{2\}^\{2\}=\\displaystyle=‖z‖22\+‖Θ⊤​\(Mτ\+Nτ\)​z‖22≥‖z‖22\.\\displaystyle\\\|z\\\|\_\{2\}^\{2\}\+\\\|\\Theta^\{\\top\}\(M\_\{\\tau\}\+N\_\{\\tau\}\)z\\\|\_\{2\}^\{2\}\\geq\\\|z\\\|\_\{2\}^\{2\}\.This immediately impliesσmin​\(∇τf​\(τ\)\)≥1,\\sigma\_\{\\min\}\(\\nabla\_\{\\tau\}f\(\\tau\)\)\\geq 1,and therefore one may takeσ0=1\\sigma\_\{0\}=1without loss of generality\. The assumptionminj⁡\|\(f​\(τ\)−x\)j\|p−2≥ρ\\min\_\{j\}\|\(f\(\\tau\)\-x\)\_\{j\}\|^\{p\-2\}\\geq\\rhois mild in practice whenppis close to22, as long as\(f​\(τ\)−x\)j≠0,∀j\(f\(\\tau\)\-x\)\_\{j\}\\neq 0,\\forall jandρ∈\(0,1\)\\rho\\in\(0,1\)is chosen sufficiently small\. In particular, in the special casep=2p=2, the Hessian of the loss reduces to the identity matrix and the above condition holds trivially\. Moreover, the boundedness assumption on the quadratic tensor,‖𝒜‖:=maxs⁡‖𝒜s‖≤A0,\\\|\\mathcal\{A\}\\\|:=\\max\_\{s\}\\\|\\mathcal\{A\}\_\{s\}\\\|\\leq A\_\{0\},is natural, sinceA0A\_\{0\}can be taken as the maximum operator norm over all matrix slices\{𝒜s\}\\\{\\mathcal\{A\}\_\{s\}\\\}of the tensor𝒜\\mathcal\{A\}\.

## VSensitivity analysis

Due to the high dimensionality of the second\-order parameters and the presence of orthogonality constraints, directly conducting a full sensitivity analysis for all variables based on the KKT conditions in Eqn\. \([12](https://arxiv.org/html/2605.20300#S3.E12)\) is analytically intractable\. Instead, we adopt a simplified yet insightful approach by studying the sensitivity of the Fréchet mean\[[20](https://arxiv.org/html/2605.20300#bib.bib20)\]with respect to the input samples\{xi\}i=1n\\\{x\_\{i\}\\\}\_\{i=1\}^\{n\}\. This setting can be regarded as a reduced version of our quadratic model, in which linear, curvature terms and manifold constraints are omitted, while the influence of the loss function remains explicit\.

The Fréchet mean under a general loss functionℓ​\(⋅\)\\ell\(\\cdot\)is defined in Eqn\. \([15](https://arxiv.org/html/2605.20300#S5.E15)\)\. We view the optimizerc∗c^\{\*\}as an implicit function of the data\{xi\}i=1n\\\{x\_\{i\}\\\}\_\{i=1\}^\{n\}\. By invoking the implicit function theorem\[[21](https://arxiv.org/html/2605.20300#bib.bib21)\], we characterize howc∗c^\{\*\}responds to small perturbations\{Δ​xi\}i=1n\\\{\\Delta x\_\{i\}\\\}\_\{i=1\}^\{n\}in the observations\.

###### Proposition 2\(Sensitivity of the Optimal Solution\)\.

Letℓ:ℝD→ℝ\\ell:\\mathbb\{R\}^\{D\}\\to\\mathbb\{R\}be a convex and twice continuously differentiable loss function, and let

c∗​\(X\):=arg⁡minc∈ℝD​∑i=1nℓ​\(xi−c\),c^\{\*\}\(X\):=\\arg\\min\_\{c\\in\\mathbb\{R\}^\{D\}\}\\sum\_\{i=1\}^\{n\}\\ell\(x\_\{i\}\-c\),\(15\)denote the optimal solution for the data matrixX=\{xi\}i=1nX=\\\{x\_\{i\}\\\}\_\{i=1\}^\{n\}\. Define the residuals

ri∗:=xi−c∗,i=1,…,n,r\_\{i\}^\{\*\}:=x\_\{i\}\-c^\{\*\},\\qquad i=1,\\dots,n,and the aggregated Hessian

H:=∑i=1n∇2ℓ​\(ri∗\)\.H:=\\sum\_\{i=1\}^\{n\}\\nabla^\{2\}\\ell\(r\_\{i\}^\{\*\}\)\.
Then, the mappingX↦c∗​\(X\)X\\mapsto c^\{\*\}\(X\)is locally Fréchet differentiable\. In particular, for any perturbation\{Δ​xi\}i=1n\\\{\\Delta x\_\{i\}\\\}\_\{i=1\}^\{n\}, the induced first\-order variation ofc∗​\(X\)c^\{\*\}\(X\)is given by

Δ​c=H−1​∑i=1n∇2ℓ​\(ri∗\)​Δ​xi\.\\Delta c=H^\{\-1\}\\sum\_\{i=1\}^\{n\}\\nabla^\{2\}\\ell\(r\_\{i\}^\{\*\}\)\\,\\Delta x\_\{i\}\.\(16\)

The proof for Proposition[2](https://arxiv.org/html/2605.20300#Thmproposition2)is left in the appendix\. Based on Proposition[2](https://arxiv.org/html/2605.20300#Thmproposition2), we specialize the analysis to theℓpp\\ell\_\{p\}^\{p\}loss with1<p≤21<p\\leq 2and to theℓ2\\ell\_\{2\}loss\. This proposition allows us to elucidate why these loss functions yield estimators that are more robust than those based on the squaredℓ22\\ell\_\{2\}^\{2\}loss\. In particular, we show that the corresponding estimators possess bounded sensitivity to data perturbations, which significantly enhances their robustness to outliers\.

### V\-ARobustness Interpretation

Here, we explain why theℓpp\\ell\_\{p\}^\{p\}loss for1<p≤21<p\\leq 2leads to enhanced robustness, based on the sensitivity result established in Proposition \([2](https://arxiv.org/html/2605.20300#Thmproposition2)\)\. For theℓpp\\ell\_\{p\}^\{p\}loss, the Hessian∇2ℓ​\(ri∗\)\\nabla^\{2\}\\ell\(r\_\{i\}^\{\\ast\}\)is diagonal, with thekk\-th diagonal entry given by

∇2ℓ​\(ri∗\)=p​\(p−1\)​diag​\(\{\|ri​k∗\|p−2\}k=1D\)\.\\nabla^\{2\}\\ell\(r\_\{i\}^\{\\ast\}\)=p\(p\-1\)\{\\rm diag\}\(\\\{\|r\_\{ik\}^\{\\ast\}\|^\{p\-2\}\\\}\_\{k=1\}^\{D\}\)\.
Forp=2p=2, the loss reduces to the squared Euclidean lossℓ​\(r\)=‖r‖22\\ell\(r\)=\\\|r\\\|\_\{2\}^\{2\}, and the Hessian is constant:∇2ℓ​\(ri∗\)=2​ID\.\\nabla^\{2\}\\ell\(r\_\{i\}^\{\\ast\}\)=2I\_\{D\}\.Consequently, the aggregated Hessian satisfiesH=2​n​IDH=2nI\_\{D\}, and the sensitivity formula in Proposition[2](https://arxiv.org/html/2605.20300#Thmproposition2)simplifies toΔ​c=−1n​∑i=1nΔ​xi\.\\Delta c=\-\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\Delta x\_\{i\}\.This shows that the optimal interceptc∗c^\{\\ast\}responds linearly and uniformly to perturbations in the data, without any attenuation from the residual magnitudes\.

For1<p≤21<p\\leq 2, the Hessian weights depend on the residual magnitudes through\|ri​k∗\|p−2\|r\_\{ik\}^\{\\ast\}\|^\{p\-2\}\. In particular, large residuals receive smaller weights, while small residuals are emphasized\. As a result, the perturbationΔ​xi\\Delta x\_\{i\}is effectively reweighted in the sensitivity formula, leading to

Δ​c=H−1​∑i=1ndiag​\(\{\|ri​k∗\|p−2\}k=1D\)​Δ​xi\.\\Delta c=\\,H^\{\-1\}\\sum\_\{i=1\}^\{n\}\\mathrm\{diag\}\(\\\{\|r^\{\*\}\_\{ik\}\|^\{p\-2\}\\\}\_\{k=1\}^\{D\}\)\\Delta x\_\{i\}\.We observe that a perturbationΔ​xi​k\\Delta x\_\{ik\}in thekk\-th coordinate exerts a strong influence on the estimator when the corresponding residual\|ri​k∗\|\|r\_\{ik\}^\{\\ast\}\|is small, while its influence becomes attenuated when\|ri​k∗\|\|r\_\{ik\}^\{\\ast\}\|is large\. This behavior arises because the sensitivity is weighted by the factor\|ri​k∗\|p−2\|r\_\{ik\}^\{\\ast\}\|^\{p\-2\}, whose exponent satisfiesp−2≤0p\-2\\leq 0for1<p≤21<p\\leq 2\. Consequently, large residuals receive diminishing weight in the sensitivity propagation, which explains the enhanced robustness of theℓpp\\ell\_\{p\}^\{p\}loss compared to the squared Euclidean loss\.

Next, we investigate the robustness properties of theℓ2\\ell\_\{2\}norm\. For theℓ2\\ell\_\{2\}lossℓ​\(r\)=‖r‖2\\ell\(r\)=\\\|r\\\|\_\{2\}, when the residualr≠0r\\neq 0, the Hessian admits the closed\-form expression

∇2ℓ​\(r\)=1‖r‖2​\(I−r​r⊤‖r‖22\)=1‖r‖2​Pr⟂,\\nabla^\{2\}\\ell\(r\)=\\frac\{1\}\{\\\|r\\\|\_\{2\}\}\\left\(I\-\\frac\{rr^\{\\top\}\}\{\\\|r\\\|\_\{2\}^\{2\}\}\\right\)=\\frac\{1\}\{\\\|r\\\|\_\{2\}\}\\,P\_\{r^\{\\perp\}\},wherePr⟂P\_\{r^\{\\perp\}\}denotes the orthogonal projection matrix onto the subspace orthogonal torr\. Substituting this expression into \([16](https://arxiv.org/html/2605.20300#S5.E16)\) yields the first\-order sensitivity of the optimal intercept:

Δ​c=H−1​∑i=1n1‖ri∗‖2​Pri∗⟂​Δ​xi\.\\Delta c=H^\{\-1\}\\sum\_\{i=1\}^\{n\}\\frac\{1\}\{\\\|r\_\{i\}^\{\\ast\}\\\|\_\{2\}\}\\,P\_\{\{r\_\{i\}^\{\\ast\}\}^\{\\perp\}\}\\,\\Delta x\_\{i\}\.\(17\)This representation reveals two intrinsic robustness mechanisms of theℓ2\\ell\_\{2\}loss\. First, perturbations associated with large residuals are attenuated by the factor1/‖ri∗‖21/\\\|r\_\{i\}^\{\\ast\}\\\|\_\{2\}, thereby reducing the influence of outliers\. Second, perturbations in the direction of the residualri∗r\_\{i\}^\{\\ast\}do not affectΔ​c\\Delta c, since they are annihilated by the projection operatorPri∗⟂P\_\{\{r\_\{i\}^\{\\ast\}\}^\{\\perp\}\}\. Together, these properties explain the improved stability ofℓ2\\ell\_\{2\}\-based estimators relative to the squared Euclidean loss\.

## VIAlgorithm

In this section, we present a gradient descent algorithm for solving the proposed generalized subspace\-constrained quadratic model\. The algorithm is highly flexible and accommodates a broad class of differentiable \(or subdifferentiable\) loss functions; adapting it to a specific loss only requires replacing the gradient \(or subgradient\) term∇yℓ​\(y−xi\)\\nabla\_\{y\}\\ell\(y\-x\_\{i\}\)accordingly\.

Before introducing the algorithm, we describe two technical mechanisms that are incorporated to improve its convergence behavior\. First, we employ an orthonormal retraction based on an aligned QR decomposition, which preserves the Stiefel manifold constraint and stabilizes the gradient flow\. Second, we adopt an asynchronous learning\-rate adjustment strategy, which enhances numerical stability and accelerates the optimization process\.

##### Orthonormal retraction via aligned QR

Due to the orthogonality constraint imposed onQQ, the Euclidean gradient update reduces to an update along the Riemannian gradient directionGQG\_\{Q\}, which is obtained by projecting the Euclidean gradient onto the tangent space atQQ\. To maintain the orthonormality constraint, the updated matrix is then retracted back onto the Stiefel manifold via a QR decomposition\[[19](https://arxiv.org/html/2605.20300#bib.bib19)\]\. However, the QR decomposition is not unique: for a given matrix, there may exist two orthonormal matricesQQandQ′Q^\{\\prime\}, together with two upper triangular matricesRRandR′R^\{\\prime\}, such thatQ​R=Q′​R′QR=Q^\{\\prime\}R^\{\\prime\}\. To remove this ambiguity, we impose a standard sign convention on the diagonal entries of the upper triangular factor, requiring them to be positive, as discussed in\[[22](https://arxiv.org/html/2605.20300#bib.bib22)\]\. Under this convention, the QR decomposition is unique, and the associated retraction map is continuous\. Consequently, the update sequence\{Q\(t\)\}\\\{Q^\{\(t\)\}\\\}satisfies

limηt→0‖Q\(t\+1\)−Q\(t\)‖F=0,\\lim\_\{\\eta\_\{t\}\\to 0\}\\bigl\\\|Q^\{\(t\+1\)\}\-Q^\{\(t\)\}\\bigr\\\|\_\{\\rm F\}=0,ensuring that successive iterates vary smoothly as the step size diminishes\. This continuity property is essential for guaranteeing a monotonic decrease of the objective function during the gradient descent process and for maintaining the stability of the optimization on the Stiefel manifold\.

Algorithm 1Riemannian Gradient Descent for SCQM0:

X=\[x1,…,xn\]∈ℝD×nX=\[x\_\{1\},\\dots,x\_\{n\}\]\\in\\mathbb\{R\}^\{D\\times n\}; max iterations

TT; tolerance

ε\\varepsilon; initial step sizes

\(ηQ,ηΘ,ηc,ητ\)\(\\eta\_\{Q\},\\eta\_\{\\Theta\},\\eta\_\{c\},\\eta\_\{\\tau\}\); loss

ℓ\\ellwith \(sub\)gradient

∂ℓ​\(⋅\)\\partial\\ell\(\\cdot\); retraction

RetrSt​\(⋅\)\\mathrm\{Retr\}\_\{\\mathrm\{St\}\}\(\\cdot\)and

sym​\(A\)=12​\(A\+AT\)\\mathrm\{sym\}\(A\)=\\tfrac\{1\}\{2\}\(A\+A^\{T\}\)\.

0:

Q=\[U,V\]∈St​\(d\+s,D\)Q=\[U,V\]\\in\\mathrm\{St\}\(d\+s,D\),

Θ\\Theta,

cc,

\{τi\}i=1n\\\{\\tau\_\{i\}\\\}\_\{i=1\}^\{n\}\.

1:Initialize:

c\(0\)=1n​X​𝟏nc^\{\(0\)\}=\\tfrac\{1\}\{n\}X\\mathbf\{1\}\_\{n\}; compute PCA of

X−c\(0\)​𝟏nTX\-c^\{\(0\)\}\\mathbf\{1\}\_\{n\}^\{T\}and set

Q\(0\)=\[U\(0\),V\(0\)\]∈St​\(d\+s,D\)Q^\{\(0\)\}=\[U^\{\(0\)\},V^\{\(0\)\}\]\\in\\mathrm\{St\}\(d\+s,D\);

τi\(0\)=\(U\(0\)\)T​\(xi−c\(0\)\)\\tau\_\{i\}^\{\(0\)\}=\(U^\{\(0\)\}\)^\{T\}\(x\_\{i\}\-c^\{\(0\)\}\);

Θ\(0\)∼𝒩​\(0,σ2​I\)\\Theta^\{\(0\)\}\\sim\\mathcal\{N\}\(0,\\sigma^\{2\}I\)\.

2:for

t=0,…,T−1t=0,\\dots,T\-1do

3:Residuals and weights:for

i=1,…,ni=1,\\dots,nϕi\(t\)=vech​\(τi\(t\)​τi\(t\)​T\),\\displaystyle\\phi\_\{i\}^\{\(t\)\}=\\mathrm\{vech\}\\big\(\\tau\_\{i\}^\{\(t\)\}\\tau\_\{i\}^\{\(t\)T\}\\big\),fi\(t\)=c\(t\)\+U\(t\)​τi\(t\)\+V\(t\)​Θ\(t\)​T​ϕi\(t\),\\displaystyle f\_\{i\}^\{\(t\)\}=c^\{\(t\)\}\+U^\{\(t\)\}\\tau\_\{i\}^\{\(t\)\}\+V^\{\(t\)\}\\Theta^\{\(t\)T\}\\phi\_\{i\}^\{\(t\)\},ri\(t\)=fi\(t\)−xi,gi\(t\)∈∂ℓ​\(ri\(t\)\)\.\\displaystyle r\_\{i\}^\{\(t\)\}=f\_\{i\}^\{\(t\)\}\-x\_\{i\},\\qquad g\_\{i\}^\{\(t\)\}\\in\\partial\\ell\\big\(r\_\{i\}^\{\(t\)\}\\big\)\.
4:end for

5:Euclidean gradients:

∇cF\(t\)=∑i=1ngi\(t\),\\displaystyle\\nabla\_\{c\}F^\{\(t\)\}=\\sum\_\{i=1\}^\{n\}g\_\{i\}^\{\(t\)\},∇ΘF\(t\)=∑i=1nϕi\(t\)​\(V\(t\)​T​gi\(t\)\)T,\\displaystyle\\nabla\_\{\\Theta\}F^\{\(t\)\}=\\sum\_\{i=1\}^\{n\}\\phi\_\{i\}^\{\(t\)\}\\big\(V^\{\(t\)T\}g\_\{i\}^\{\(t\)\}\\big\)^\{T\},∇QF\(t\)=∑i=1ngi\(t\)​\[τi\(t\)​Tϕi\(t\)​T​Θ\(t\)\]\.\\displaystyle\\nabla\_\{Q\}F^\{\(t\)\}=\\sum\_\{i=1\}^\{n\}g\_\{i\}^\{\(t\)\}\\begin\{bmatrix\}\\tau\_\{i\}^\{\(t\)T\}&\\phi\_\{i\}^\{\(t\)T\}\\Theta^\{\(t\)\}\\end\{bmatrix\}\.
6:Riemannian step onSt​\(d\+s,D\)\\mathrm\{St\}\(d\+s,D\):

GQ\(t\)=∇QF\(t\)−Q\(t\)​sym​\(Q\(t\)​T​∇QF\(t\)\),\\displaystyle G\_\{Q\}^\{\(t\)\}=\\nabla\_\{Q\}F^\{\(t\)\}\-Q^\{\(t\)\}\\mathrm\{sym\}\\\!\\big\(Q^\{\(t\)T\}\\nabla\_\{Q\}F^\{\(t\)\}\\big\),Q\(t\+1\)=RetrSt​\(Q\(t\)−ηQ\(t\)​GQ\(t\)\),\\displaystyle Q^\{\(t\+1\)\}=\\mathrm\{Retr\}\_\{\\mathrm\{St\}\}\\big\(Q^\{\(t\)\}\-\\eta\_\{Q\}^\{\(t\)\}G\_\{Q\}^\{\(t\)\}\\big\),extract

Q\(t\+1\)=\[U\(t\+1\),V\(t\+1\)\]Q^\{\(t\+1\)\}=\[U^\{\(t\+1\)\},V^\{\(t\+1\)\}\]\.

7:Euclidean updates:

Θ\(t\+1\)=Θ\(t\)−ηΘ\(t\)​∇ΘF\(t\),\\displaystyle\\Theta^\{\(t\+1\)\}=\\Theta^\{\(t\)\}\-\\eta\_\{\\Theta\}^\{\(t\)\}\\nabla\_\{\\Theta\}F^\{\(t\)\},c\(t\+1\)=c\(t\)−ηc\(t\)​∇cF\(t\)\.\\displaystyle c^\{\(t\+1\)\}=c^\{\(t\)\}\-\\eta\_\{c\}^\{\(t\)\}\\nabla\_\{c\}F^\{\(t\)\}\.
8:Update latent coordinates:for

i=1,…,ni=1,\\dots,nJτi\(t\)=∂fi\(t\)∂τi\|\(U\(t\+1\),V\(t\+1\),Θ\(t\+1\),τi\(t\)\),\\displaystyle J\_\{\\tau\_\{i\}\}^\{\(t\)\}=\\frac\{\\partial f\_\{i\}^\{\(t\)\}\}\{\\partial\\tau\_\{i\}\}\\bigg\|\_\{\(U^\{\(t\+1\)\},V^\{\(t\+1\)\},\\Theta^\{\(t\+1\)\},\\tau\_\{i\}^\{\(t\)\}\)\},τi\(t\+1\)=τi\(t\)−ητ\(t\)​Jτi\(t\)​T​gi\(t\)\.\\displaystyle\\tau\_\{i\}^\{\(t\+1\)\}=\\tau\_\{i\}^\{\(t\)\}\-\\eta\_\{\\tau\}^\{\(t\)\}\\,J\_\{\\tau\_\{i\}\}^\{\(t\)T\}g\_\{i\}^\{\(t\)\}\.
9:end for

10:endfor

##### Asynchronous learning\-rate adjustment

An appropriate choice of the learning rate is crucial for both convergence behavior and computational efficiency\. However, the optimal learning rate is influenced by multiple factors, including the local geometric properties of the objective function at the current iterate and the curvature induced by the constraint manifold\. Asynchronous learning\-rate adjustment enables the use of distinct learning rates for different parameter blocks—such asηQ\\eta\_\{Q\}for the subspace variableQQandηΘ\\eta\_\{\\Theta\}for the quadratic parametersΘ\\Theta—allowing each component to be updated in accordance with its own geometric and optimization characteristics\.

To ensure an effective learning rate, we adopt an asynchronous learning\-rate adjustment strategy\. Specifically, we initialize separate learning ratesηc,ηQ,ηΘ,\\eta\_\{c\},\\eta\_\{Q\},\\eta\_\{\\Theta\},andητ\\eta\_\{\\tau\}for the variablescc,QQ,Θ\\Theta, andτi\{\\tau\_\{i\}\}, respectively\. During each update, the learning rate associated with a given variable is adjusted independently based on a sufficient decrease criterion\.

We employ a backtracking line search strategy based on the Armijo rule to adaptively determine suitable step sizes during optimization\[[23](https://arxiv.org/html/2605.20300#bib.bib23)\]\. Specifically, when updating the variablecc, we first take a tentative gradient step and evaluate the objective functionF​\(c−ηc​∇cF,Θ,Q,\{τi\}\)F\\\!\\left\(c\-\\eta\_\{c\}\\nabla\_\{c\}F,\\;\\Theta,\\;Q,\\;\\\{\\tau\_\{i\}\\\}\\right\)\. This value is compared against the sufficient decrease conditionF​\(c,Θ,Q,\{τi\}\)−α​ηc​‖∇cF​\(c,Θ,Q,\{τi\}\)‖F2,F\(c,\\Theta,Q,\\\{\\tau\_\{i\}\\\}\)\-\\alpha\\eta\_\{c\}\\big\\\|\\nabla\_\{c\}F\(c,\\Theta,Q,\\\{\\tau\_\{i\}\\\}\)\\big\\\|\_\{\\mathrm\{F\}\}^\{2\},whereα∈\(0,1\)\\alpha\\in\(0,1\)is a prescribed constant\. If the condition is satisfied, the step sizeηc\\eta\_\{c\}is accepted; otherwise, it is reduced, for example by settingηc←ηc/2\\eta\_\{c\}\\leftarrow\\eta\_\{c\}/2, and the test is repeated\. The same backtracking procedure is applied independently to the updates ofQQ,Θ\\Theta, and\{τi\}\\\{\\tau\_\{i\}\\\}\. This block\-wise adaptive step\-size strategy allows each parameter group to adjust to the local geometry of the objective function and the associated constraint manifold, thereby improving numerical stability and accelerating the convergence of the overall optimization algorithm\.

##### Learning process

The learning process can be summarized as: Given a data matrixX∈ℝD×nX\\in\\mathbb\{R\}^\{D\\times n\}, the algorithm jointly estimates the latent representations\{τi\}i=1n\\\{\\tau\_\{i\}\\\}\_\{i=1\}^\{n\}, the quadratic mapping parametersΘ\\Theta, the mean vectorcc, and an orthonormal basisQ∈St​\(d\+s,D\)Q\\in\\mathrm\{St\}\(d\+s,D\)\.

The algorithm initializesccas the column mean ofXX, and initializes Q using the leadingd\+sd\+sleft singular vectors of the centered data matrixX−c​𝟏nTX\-c\\mathbf\{1\}\_\{n\}^\{T\}\. Each latent variableτi\\tau\_\{i\}is initialized by projecting the centered data pointxi−cx\_\{i\}\-conto the firstddcolumns ofQQ, whileΘ\\Thetais initialized randomly\.

At each iteration, the Euclidean gradients with respect to all variables—∇QF,∇ΘF,∇cF\\nabla\_\{Q\}F,\\nabla\_\{\\Theta\}F,\\nabla\_\{c\}F, and\{∇τiF\}i=1n\\\{\\nabla\_\{\\tau\_\{i\}\}F\\\}\_\{i=1\}^\{n\}—are first computed\. To enforce the orthonormality constraint onQQ, the gradient∇QF\\nabla\_\{Q\}Fis projected onto the tangent space of the Stiefel manifold, yielding the Riemannian gradient direction\. A Riemannian gradient step is then performed, followed by a retraction onto the Stiefel manifold via QR decomposition\. A backtracking line search is applied along the retracted Riemannian gradient direction to determine a suitable step size\. For the unconstrained parametersΘ\\Thetaandcc, standard Euclidean gradient updates are employed\.

For the update of eachτi\\tau\_\{i\}, we also employ an asynchronous backtracking line search to determine an acceptable step size that ensures sufficient decrease in the objective value\. In other words, the step size associated with eachτi\\tau\_\{i\}is selected independently\. Each latent variableτi\\tau\_\{i\}is then updated via a gradient step that explicitly incorporates the structured matricesMτiM\_\{\\tau\_\{i\}\}andNτiN\_\{\\tau\_\{i\}\}, which arise from the quadratic term of the model\. The procedure is iterated until convergence or until a prescribed maximum number of iterations is reached, yielding a structured quadratic approximation that minimizes the chosen loss function\.

## VIINumerical Experiments

![Refer to caption](https://arxiv.org/html/2605.20300v1/x3.png)Figure 3:Illustration of the fitted curves and projection points obtained usingℓpp\\ell\_\{p\}^\{p\}losses with different values ofpp\.In this section, we demonstrate how the proposed model can be applied to the task of manifold denoising, or more generally, manifold data refinement\. Unlike the toy example in Section[II\-E](https://arxiv.org/html/2605.20300#S2.SS5), where different noise models are paired with their corresponding loss functions, here we evaluate all methods under the same noise distribution\. Specifically, we assess the performance of various methods under additive Gaussian noise with different noise scales\.

In this experimental setting, we assume that the observed dataset\{xi\}\\\{x\_\{i\}\\\}is generated by corrupting a noise\-free dataset\{x~i\}\\\{\\tilde\{x\}\_\{i\}\\\}with Gaussian noise\{ϵi\}\\\{\\epsilon\_\{i\}\\\}, such that

xi=x~i\+ϵi,wherex~i∼ℳ,ϵi∼𝒩​\(𝟎,σ2​I\)\.x\_\{i\}=\\tilde\{x\}\_\{i\}\+\\epsilon\_\{i\},\\quad\\text\{where\}\\quad\\tilde\{x\}\_\{i\}\\sim\\mathcal\{M\},\\ \\epsilon\_\{i\}\\sim\{\\cal N\}\(\{\\bf 0\},\\sigma^\{2\}I\)\.\(18\)
We implement data refinement in two steps\. First, for each observationxix\_\{i\}, we identify itsKKnearest neighbors, denoted as𝒩xi\\mathcal\{N\}\_\{x\_\{i\}\}\. Second, we learn a subspace\-constrained quadratic matrix factorization model to find a local representationf^\\widehat\{f\}, which is parameterized byQ^,Θ^,c^\\widehat\{Q\},\\widehat\{\\Theta\},\\widehat\{c\}\. Next, we projectxix\_\{i\}back onto the learned subspacef^\\widehat\{f\}by solving the following optimization problem:

P​\(xi\)=arg⁡minx∈\{f^​\(τ\),τ∈ℝd\}⁡‖xi−x‖pp\.P\(x\_\{i\}\)=\\arg\\min\_\{x\\in\\\{\\widehat\{f\}\(\\tau\),\\tau\\in\{\\mathbb\{R\}\}^\{d\}\\\}\}\\\|x\_\{i\}\-x\\\|\_\{p\}^\{p\}\.Finally, we evaluate the performance of the manifold approximation by measuring the empirical mean squared error:

ℰ=1N​∑i=1N‖x~i−P​\(xi\)‖22\.\\mathcal\{E\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|\\tilde\{x\}\_\{i\}\-P\(x\_\{i\}\)\\\|\_\{2\}^\{2\}\.A smaller value ofℰ\\mathcal\{E\}indicates a stronger recovery capability of the manifold approximation method\.

### VII\-ASimulation with Spherical Data inℝ3\{\\mathbb\{R\}\}^\{3\}

TABLE II:Mean squared error \(MSE\) of different models under varying noise levels\.![Refer to caption](https://arxiv.org/html/2605.20300v1/x4.png)Figure 4:Performance is compared across models and noise levels using boxplots of the squared reconstruction error,‖x~i−P​\(xi\)‖22\\\|\\tilde\{x\}\_\{i\}\-P\(x\_\{i\}\)\\\|\_\{2\}^\{2\}, computed over all samples\.Here, we compare the performance of the SCQM model with that of its competitors, including SPH, MFIT, and MLS, under different loss functions and across varying noise levels\. We uniformly sample 300 points on the 3D sphere as the ground\-truth signalsx~i\{\\tilde\{x\}\_\{i\}\}\. Noisy observations are then generated according to \([18](https://arxiv.org/html/2605.20300#S7.E18)\)\. We evaluate model performance under multiple noise settings by varying the noise standard deviationσ∈\{0\.05,0\.10,0\.15,0\.20\}\\sigma\\in\\\{0\.05,0\.10,0\.15,0\.20\\\}\. For each noise level, we testℓpp\\ell\_\{p\}^\{p\}loss functions withp∈\{1\.0,1\.25,1\.50,1\.75,2\.0\}p\\in\\\{1\.0,1\.25,1\.50,1\.75,2\.0\\\}, as well as theℓ2\\ell\_\{2\}loss\. For fairness and completeness, all methods are implemented within a local neighborhood of each samplexix\_\{i\}, where the neighborhood contains exactly3030neighboring samples\. The per\-sample squared error‖x~i−P​\(xi\)‖22\\\|\\tilde\{x\}\_\{i\}\-P\(x\_\{i\}\)\\\|\_\{2\}^\{2\}is summarized using boxplots for each combination of model and noise level, as shown in Fig\.[4](https://arxiv.org/html/2605.20300#S7.F4)and Fig\.[3](https://arxiv.org/html/2605.20300#S7.F3)\.

From the observations in Table[II](https://arxiv.org/html/2605.20300#S7.T2)and Figure[4](https://arxiv.org/html/2605.20300#S7.F4), we can conclude that :

- •When an appropriate value ofppis selected, our method consistently outperforms state\-of\-the\-art approaches, including SPH, MFIT, and MLS, across a wide range of noise levels\.
- •In the low\-noise regime \(e\.g\.,σ<0\.15\\sigma<0\.15\), the quadratic model consistently outperforms the corresponding linear model, which in turn validates the effectiveness of incorporating quadratic terms\. However, as the noise level increases, the performance of the quadratic model deteriorates\. For instance, atσ=0\.18\\sigma=0\.18, the quadratic model underperforms the linear model, likely due to overfitting, as it captures spurious patterns arising from the noise rather than the underlying signal\.
- •For the quadratic model under theℓpp\\ell\_\{p\}^\{p\}loss, larger values ofppare favorable in the low\-noise regime \(e\.g\.,σ<0\.12\\sigma<0\.12\)\. However, as the noise level further increases, smaller values ofppbecome more advantageous\. This observation highlights a fundamental trade\-off between accurate signal projection and robustness to noise\. Although minimizing theℓpp\\ell\_\{p\}^\{p\}loss withp=1p=1does not recover the true signal under the minimum\-distance projection criterion, it exhibits the strongest robustness among the considered methods\.
- •The quadratic model with theℓ2\\ell\_\{2\}loss exhibits stable and well\-balanced performance across different noise levels, demonstrating its ability to effectively handle noise and recover the true projection in Euclidean space, as it avoids the additional sensitivity introduced by higher\-order power operations\.
- •When the noise level is small, all methods perform well\. However, theℓpp\\ell\_\{p\}^\{p\}loss withp=1p=1consistently underperforms relative to the other metrics, particularly atσ=0\.05\\sigma=0\.05andσ=0\.10\\sigma=0\.10\. This is likely due to a mismatch between the noise distribution and theℓ1\\ell\_\{1\}metric; in other words, anℓ1\\ell\_\{1\}\-based projection does not coincide with the true projection under the Euclidean distance\. As the noise level increases, theℓ2\\ell\_\{2\}model and theℓpp\\ell\_\{p\}^\{p\}models with smaller values ofpptend to outperform the alternatives\. This behavior can be attributed to improved robustness: these losses are less sensitive to large\-variance noise and outlier\-contaminated samples, leading to more stable performance under heavier noise\.

![Refer to caption](https://arxiv.org/html/2605.20300v1/sqmf_comparison.png)Figure 5:Comparison of reconstruction methods under different loss functions and model settings\. Each row displays 16 reconstructed images\. From top to bottom, the rows correspond to the original images, linear model withℓ2\\ell\_\{2\}loss, SCQM withℓ2\\ell\_\{2\}loss, linear model withℓpp\\ell\_\{p\}^\{p\}loss \(p=2p=2\), SCQM withℓpp\\ell\_\{p\}^\{p\}loss \(p=2p=2\), linear model withℓpp\\ell\_\{p\}^\{p\}loss \(p=1p=1\), and SCQM withℓpp\\ell\_\{p\}^\{p\}loss \(p=1p=1\), respectively\.
### VII\-BPerformance on Real World Dataset

Here, we demonstrate the performance of the robust SCQM on a real\-world dataset using handwritten MNIST\[[24](https://arxiv.org/html/2605.20300#bib.bib24)\]images\. We randomly select 100 samples consisting of two easily confused digits, namely ‘4’ and ‘9’\. The original images are projected onto a low\-dimensional manifold with latent dimensiond=2d=2\. We visualize the reconstructed images corresponding to these projections and examine the reconstruction details across different model settings\.

We evaluate three SCQM models with different loss functions, namelyℓ2\\ell\_\{2\}\(third row\),ℓ22\\ell\_\{2\}^\{2\}\(fifth row\), andℓ1\\ell\_\{1\}\(seventh row\), as shown in Fig\.[5](https://arxiv.org/html/2605.20300#S7.F5)\. In addition, we evaluate the corresponding linear variants by settingΘ=𝟎\\Theta=\\mathbf\{0\}throughout the learning process, thereby removing the quadratic component\. These linear counterparts are displayed in the second, fourth, and sixth rows, corresponding to the three loss functions, respectively\. This comparison serves as an ablation study to systematically assess the effectiveness and contribution of the quadratic term within the SCQM framework\.

![Refer to caption](https://arxiv.org/html/2605.20300v1/x5.png)Figure 6:Visualization of latent\-space \(d=2d=2\) interpolation for the linear model \(Θ=𝟎\\Theta=\\bf 0\) and the quadratic model \(Θ≠𝟎\\Theta\\neq\\bf 0\), learned from data consisting of the three digits ‘2’, ‘6’, and ‘8’\.From Fig\.[5](https://arxiv.org/html/2605.20300#S7.F5), we draw the following conclusions\. First, by comparing the linear and quadratic variants—specifically, the second versus third rows, the fourth versus fifth rows, and the sixth versus seventh rows—we observe that the inclusion of the quadratic term significantly improves the discrimination between the digits ‘4’ and ‘9’ with the improvement being particularly evident in the second column from the right\. Second, theℓ1\\ell\_\{1\}loss and theℓ2\\ell\_\{2\}loss consistently outperform theℓ22\\ell\_\{2\}^\{2\}loss \(SQMF discussed in\[[6](https://arxiv.org/html/2605.20300#bib.bib6)\]\), producing sharper and more visually coherent reconstructions\. This highlights their superior robustness and reconstruction capability in the presence of ambiguous or overlapping digit structures\.

### VII\-CInterpolation via SCQM

In addition, we demonstrate that SCQM can also be used as an interpolation model for generating new high\-dimensional data\. Since we have learned a mappingf^:ℝd→ℝD\\widehat\{f\}:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}^\{D\}, uniformly sampling points in the latent spaceℝd\\mathbb\{R\}^\{d\}allows us to visualize the global landscape off^\\widehat\{f\}through its outputs in the ambient space\. This interpolation procedure further serves as a means to evaluate the expressive capability of different models, as it reveals how well each learned mapping captures the underlying manifold structure\.

To validate the strong expressive capability of the proposed SCQM in comparison with its linear counterpart in the latent space\. We randomly select images corresponding to the digits ‘2’, ‘6’, and ‘8’, and learn an SCQM modelf^\\widehat\{f\}with latent dimensiond=2d=2and quadratic dimensions=20s=20\. We then perform interpolation in the latent space by uniformly sampling between the minimum and maximum values along each latent dimension, resulting in a grid of latent points\{τk\}\\\{\\tau\_\{k\}\\\}\. Each interpolation point is mapped back to the image space via the learned decoderf^:τ↦I\\widehat\{f\}:\\tau\\mapsto I, and the reconstructed images are visualized at their corresponding latent locations\. The results are shown in Fig\.[6](https://arxiv.org/html/2605.20300#S7.F6)\. As illustrated, the quadratic component plays a crucial role in modeling nonlinear variations of the data manifold\. While the linear model produces linear transitions, the quadratic model yields smoother and more continuous interpolations, better capturing the intrinsic geometric structure of the data\.

## VIIIConclusion

This work investigates a robust quadratic fitting framework that generalizes subspace\-constrained quadratic factorization\. This generalization provides a promising approach for addressing scenarios with a limited number of observed data points and for relaxing the local flatness assumption inherent in linear subspace models\. We propose a gradient descent method to solve the resulting robust quadratic fitting problem and conduct a sensitivity analysis for the Fréchet mean problem, which can be viewed as a special case within the quadratic function class\. Numerical experiments are conducted to validate the effectiveness of the proposed framework\.

Several directions remain open for future investigation\. First, the statistical properties of the proposed quadratic model have not been analyzed\. Future work could establish non\-asymptotic results\[[25](https://arxiv.org/html/2605.20300#bib.bib25)\]with respect to the sample sizenn, intrinsic dimensiondd, and additional parameters such as the loss exponentpp\. Second, although the gradient descent algorithm exhibits stable behavior in our experiments, its theoretical convergence properties have not been studied\. Providing convergence guarantees under appropriate conditions is an important direction for future research\. Third, our current analysis focuses on assessing the benefit of incorporating a curvature \(quadratic\) term relative to the linear case\. Extending the framework to more expressive models and exploring richer nonlinear structures may further improve performance and is another promising avenue for future work\.

## References

- \[1\]Nanda Kambhatla and Todd Leen\.Dimension reduction by local principal component analysis\.Neural Computation, 9:1493–1516, 10 1997\.
- \[2\]Christopher R Genovese, Marco Perone\-Pacifico, Isabella Verdinelli, Larry Wasserman, et al\.Nonparametric ridge estimation\.The Annals of Statistics, 42\(4\):1511–1545, 2014\.
- \[3\]Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas, and Hariharan Narayanan\.Fitting a putative manifold to noisy data\.InConference On Learning Theory, pages 688–720, 2018\.
- \[4\]Umut Ozertem and Deniz Erdogmus\.Locally defined principal curves and surfaces\.Journal of Machine learning research, 12\(Apr\):1249–1286, 2011\.
- \[5\]Zheng Zhai, Hengchao Chen, and Qiang Sun\.Quadratic matrix factorization with applications to manifold learning\.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46\(9\):6384–6401, 2024\.
- \[6\]Zheng Zhai and Xiaohui Li\.Subspace\-constrained quadratic matrix factorization: Algorithm and applications\.Pattern Recognition, 161:111333, 2025\.
- \[7\]Guangcan Liu, Zhouchen Lin, and Yong Yu\.Robust subspace segmentation by low\-rank representation\.InProceedings of the 27th international conference on machine learning \(ICML\-10\), pages 663–670, 2010\.
- \[8\]Yongqiang Zhang, Daming Shi, Junbin Gao, and Dansong Cheng\.Low\-rank\-sparse subspace representation for robust regression\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7445–7454, 2017\.
- \[9\]Christopher R\. Genovese, Marco Perone\-Pacifico, Isabella Verdinelli, and Larry Wasserman\.Nonparametric ridge estimation\.Annals of Statistics, 42\(4\):1511–1545, 2014\.
- \[10\]Umut Ozertem and Deniz Erdogmus\.Locally defined principal curves and surfaces\.Journal of Machine Learning Research, 12:1249–1286, 2011\.
- \[11\]Charles Fefferman, Sergei Ivanov, Yaroslav Kurylev, Matti Lassas, and Hariharan Narayanan\.Fitting a putative manifold to noisy data\.InConference on Learning Theory, pages 688–720, 2018\.
- \[12\]Barak Sober and David Levin\.Manifold approximation by moving least\-squares projection\.Constructive Approximation, 52\(3\):433–478, 2020\.
- \[13\]Hengchao Chen and Zheng Zhai\.Power transformed density ridge estimation\.IEEE Signal Processing Letters, 2025\.
- \[14\]Alexey A Roenko, Vladimir V Lukin, I Djurović, and M Simeunović\.Estimation of parameters for generalized gaussian distribution\.In2014 6th International Symposium on Communications, Control and Signal Processing \(ISCCSP\), pages 376–379\. IEEE, 2014\.
- \[15\]Frédéric Pascal, Lionel Bombrun, Jean\-Yves Tourneret, and Yannick Berthoumieu\.Parameter estimation for multivariate generalized gaussian distributions\.IEEE Transactions on Signal Processing, 61\(23\):5960–5971, 2013\.
- \[16\]Minh N Do and Martin Vetterli\.Wavelet\-based texture retrieval using generalized gaussian density and kullback\-leibler distance\.IEEE transactions on image processing, 11\(2\):146–158, 2002\.
- \[17\]Samuel Kotz, Tomasz Kozubowski, and Krzystof Podgorski\.The Laplace distribution and generalizations: a revisit with applications to communications, economics, engineering, and finance\.Springer Science & Business Media, 2012\.
- \[18\]Gilbert Strang\.Introduction to linear algebra\.SIAM, 2022\.
- \[19\]P\-A Absil, Robert Mahony, and Rodolphe Sepulchre\.Optimization algorithms on matrix manifolds\.Princeton University Press, 2008\.
- \[20\]Abhishek Bhattacharya and Rabi Bhattacharya\.Nonparametric inference on manifolds: with applications to shape spaces, volume 2\.Cambridge University Press, 2012\.
- \[21\]Steven G\. Krantz and Harold R\. Parks\.The Implicit Function Theorem: History, Theory, and Applications\.Birkhäuser, 2013\.
- \[22\]Gene H\. Golub and Charles F\. Van Loan\.Matrix Computations\.Johns Hopkins University Press, 4th edition, 2013\.
- \[23\]Jorge Nocedal\.Numerical optimization\.Springer Ser\. Oper\. Res\. Financ\. Eng\./Springer, 2006\.
- \[24\]Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner\.Gradient\-based learning applied to document recognition\.Proceedings of the IEEE, 86\(11\):2278–2324, 2002\.
- \[25\]Martin J Wainwright\.High\-dimensional statistics: A non\-asymptotic viewpoint, volume 48\.Cambridge university press, 2019\.

Similar Articles

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

arXiv cs.LG

This paper studies piecewise-stationary low-rank linear contextual bandits, proposes the SPSC algorithm that achieves dynamic regret scaling with the intrinsic rank instead of the ambient dimension, and characterizes the identification boundary for subspace recovery under scalar feedback.

In-Context Learning Operates as Concept Subspace Learning

arXiv cs.LG

This paper proposes that in-context learning in LLMs operates through low-dimensional concept subspaces, where task-relevant information concentrates in a small fraction of the representation space, supported by experiments on Llama-3-8B and Qwen2.5-7B.

Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations

arXiv cs.CL

This paper proposes a conformal prediction framework for LLMs that leverages internal representations rather than output-level statistics, introducing Layer-Wise Information (LI) scores as nonconformity measures to improve validity-efficiency trade-offs under distribution shift. The method demonstrates stronger robustness to calibration-deployment mismatch compared to text-level baselines across QA benchmarks.