Exact Schur-Sylvester Dimensionality Reductions for Non-Smooth Stochastic Complexity and Manifold Sampling
Summary
This paper presents exact dimensionality reductions using Schur complement and Sylvester's determinant identity to reduce computational complexity from O(N^3) to O(k^3+N^2k) per step for non-smooth NML estimation, achieving over 14,000x speedup while maintaining numerical precision.
View Cached Full Text
Cached at: 06/24/26, 07:49 AM
# Exact Schur-Sylvester Dimensionality Reductions for Non-Smooth Stochastic Complexity and Manifold Sampling
Source: [https://arxiv.org/html/2606.23867](https://arxiv.org/html/2606.23867)
###### Abstract
The exact computation of the Normalized Maximum Likelihood \(NML\) codelength for regular non\-smooth estimators \(e\.g\., Lasso\) has been historically limited by the cubic scaling walls of manifold\-constrained projection and volume integration\. At each step of the geometric Propose\-and\-Project Metropolis–Hastings \(PPMH\) sampler, evaluating the projection operator requires inverting an\(N\+k\)×\(N\+k\)\(N\+k\)\\times\(N\+k\)generalized KKT matrix, while calculating the volume factor requires the determinant of an\(N−k\)×\(N−k\)\(N\-k\)\\times\(N\-k\)Gram matrix\. This paper presents an exact, mathematically equivalent formulation that bypasses both bottlenecks by utilizing the block Schur complement and Sylvester’s determinant identity\. We prove that the computational complexity of both operations collapses from𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)to𝒪\(k3\+N2k\)\\mathcal\{O\}\(k^\{3\}\+N^\{2\}k\)per step\. We generalize this reduction to Sparse Support Vector Machines \(SVMs\), Elastic Net, and Group Lasso\. Finally, we provide a rigorous numerical stability analysis and evaluate the sampler’s efficiency using the Effective Sample Size \(ESS\) per second\. Our empirical benchmarks on high\-dimensional datasets confirm a constant speedup exceeding14,100×14\{,\}100\\timeswhile maintaining double\-precision numerical equivalence, rendering exact non\-smooth NML estimation highly tractable for large\-scale statistical inference\.
## 1Introduction
The Minimum Description Length \(MDL\) principle posits that the optimal statistical model is the one that minimizes the joint codelength of the model and the compressed data\[[1](https://arxiv.org/html/2606.23867#bib.bib1),[2](https://arxiv.org/html/2606.23867#bib.bib2),[3](https://arxiv.org/html/2606.23867#bib.bib3)\]\. The theoretical optimum is achieved by the Normalized Maximum Likelihood \(NML\) distribution, whose normalization factor is known as the stochastic complexity\[[4](https://arxiv.org/html/2606.23867#bib.bib4),[5](https://arxiv.org/html/2606.23867#bib.bib5)\]:
C\(μλ\)=∫𝒳p\(𝒙∣𝜽^λ\(𝒙\)\)𝑑ℒN\(𝒙\),C\(\\mu\_\{\\lambda\}\)=\\int\_\{\\mathcal\{X\}\}p\(\\bm\{x\}\\mid\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}\(\\bm\{x\}\)\)\\,d\\mathcal\{L\}^\{N\}\(\\bm\{x\}\),\(1\)where𝜽^λ\(𝒙\)\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}\(\\bm\{x\}\)is the maximum likelihood estimator \(MLE\) under regularization parameterλ\\lambda, and𝒳⊆ℝN\\mathcal\{X\}\\subseteq\\mathbb\{R\}^\{N\}is the continuous data space\. For non\-smooth estimators ubiquitous in modern machine learning \(such as Lasso, Sparse SVMs, and low\-rank singular value constraints\), the classical smoothness assumptions required to evaluate this integral break down\.
In our previous work\[[6](https://arxiv.org/html/2606.23867#bib.bib6)\], we established the measure\-theoretic foundations of non\-smooth NML by generalizing the coarea formula to regular path\-differentiable Lipschitz \(PDL\) estimators\. This formulated the stochastic complexity as an integral over the non\-differentiable level sets \(manifolds\) of the estimator:
C\(μλ\)=∫Θ\[∫𝜽^λ−1\(𝜽′\)p\(𝒙∣𝜽′\)Jcons\(𝜽^λ\(𝒙\)\)𝑑ℋN−k\(𝒙\)\]𝑑ℒk\(𝜽′\),C\(\\mu\_\{\\lambda\}\)=\\int\_\{\\Theta\}\\left\[\\int\_\{\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}^\{\-1\}\(\\bm\{\\theta\}^\{\\prime\}\)\}\\frac\{p\(\\bm\{x\}\\mid\\bm\{\\theta\}^\{\\prime\}\)\}\{J\_\{\\text\{cons\}\}\(\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}\(\\bm\{x\}\)\)\}\\,d\\mathcal\{H\}^\{N\-k\}\(\\bm\{x\}\)\\right\]d\\mathcal\{L\}^\{k\}\(\\bm\{\\theta\}^\{\\prime\}\),\(2\)whereJcons\(𝜽^λ\(𝒙\)\)J\_\{\\text\{cons\}\}\(\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}\(\\bm\{x\}\)\)denotes the constrained Jacobian on the level\-set\. The rigorous application of the coarea formula to sample distribution theory has a rich history in statistics, particularly for characterizing the densities of algebraic and Lipschitz transformations of random variables\[[7](https://arxiv.org/html/2606.23867#bib.bib7)\], as well as defining non\-smooth probability distributions over matrix manifolds\[[8](https://arxiv.org/html/2606.23867#bib.bib8)\]\.
To compute the level\-set integral in Eq\. \([2](https://arxiv.org/html/2606.23867#S1.E2)\) exactly, we introduced the Propose\-and\-Project Metropolis–Hastings \(PDL\-PPMH\) sampler, which traverses these level sets directly\. While global, i\.i\.d\. sampling on algebraic manifolds can be achieved by intersecting the variety with random linear spaces using kinematic formulas\[[9](https://arxiv.org/html/2606.23867#bib.bib9)\], such methods are structurally restricted to polynomial equations and can scale poorly with the degree of the variety\. In contrast, local MCMC proposal schemes construct proposal steps along the tangent space before projecting back onto the target manifold\[[10](https://arxiv.org/html/2606.23867#bib.bib10),[11](https://arxiv.org/html/2606.23867#bib.bib11),[12](https://arxiv.org/html/2606.23867#bib.bib12)\], or deploy penalized Langevin updates designed for constrained domains\[[13](https://arxiv.org/html/2606.23867#bib.bib13)\]\. However, PPMH samplers utilize a position\-dependent proposal covariance to follow the manifold’s local geometry\. As analyzed by Livingstone\[[14](https://arxiv.org/html/2606.23867#bib.bib14)\], position\-dependent proposal covariances can alter the ergodicity properties of Metropolis–Hastings chains, requiring step\-size control in the tails to prevent the rejection rate from approaching unity\. On implicit neural manifolds, these projection steps are typically tackled using iterative conjugate gradient \(CG\) routines to bypass explicit Jacobian construction\[[15](https://arxiv.org/html/2606.23867#bib.bib15)\]\. However, iterative approximations introduce non\-trivial numerical error propagation\.
Furthermore, when sampling under manifold constraints, a deterministic correction term due to curvature is mathematically required to ensure that the proposal remains strictly on the target manifold without introducing systematic bias or dimensionality loss\[[16](https://arxiv.org/html/2606.23867#bib.bib16),[12](https://arxiv.org/html/2606.23867#bib.bib12)\]\. To bypass the need for curvature information or second\-order derivatives, specific orthogonal projection methods can be deployed where these Jacobian factors cancel, preserving detailed balance\[[12](https://arxiv.org/html/2606.23867#bib.bib12)\]\. The PDL\-PPMH algorithm faces two notable computational bottlenecks in its projection and curvature\-correction loop:
1. 1\.The Projection Wall:Projecting proposal points back onto the level set𝜽^λ−1\(𝜽′\)\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}^\{\-1\}\(\\bm\{\\theta\}^\{\\prime\}\)requires solving a constrained KKT system\. Differentiating this map to compute the Radon–Nikodym correction factor requires the direct inversion of a generalized KKT matrix of size\(N\+k\)×\(N\+k\)\(N\+k\)\\times\(N\+k\), costing𝒪\(\(N\+k\)3\)\\mathcal\{O\}\(\(N\+k\)^\{3\}\)operations per step\. This null\-space projection problem is similar to those encountered in redundant robotic systems, where Jacobian\-guided null\-space traversals are used to generate implicit manifold representations\[[17](https://arxiv.org/html/2606.23867#bib.bib17)\]\.
2. 2\.The Volume Factor Wall:Calculating the level\-set volume factorJcons\(𝜽^λ\(𝒙\)\)J\_\{\\text\{cons\}\}\(\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}\(\\bm\{x\}\)\)requires constructing the projected tangent basis Gram matrix of size\(N−k\)×\(N−k\)\(N\-k\)\\times\(N\-k\)and evaluating its determinant, costing𝒪\(\(N−k\)3\)\\mathcal\{O\}\(\(N\-k\)^\{3\}\)operations\.
Moreover, if the underlying constraint level set decomposes into multiple disconnected topological components, local proposal mechanisms are fundamentally trapped\. Resolving this topological barrier typically necessitates parallel tempering or replica exchange frameworks to facilitate transitions across disjoint regions\[[18](https://arxiv.org/html/2606.23867#bib.bib18)\]\.
To overcome the above computational bottlenecks, this paper presents an exact, mathematically equivalent formulation that collapses both scaling walls to𝒪\(k3\+N2k\)\\mathcal\{O\}\(k^\{3\}\+N^\{2\}k\)operations per step, completely decoupling the computational cost from the ambient data dimension\. In Section[2](https://arxiv.org/html/2606.23867#S2), we first introduce the information\-theoretic foundations of this work\. We then describe our proposed formulation of Schur\-Sylvester reductions and show how it can be utilized for Lasso and other regular non\-smooth models in Section[3](https://arxiv.org/html/2606.23867#S3)\. In Section[4](https://arxiv.org/html/2606.23867#S4), we further establish the mathematical foundations for evaluating the error propagation and convergence analysis of our proposed Schur\-Sylvester sampler\. In Section[5](https://arxiv.org/html/2606.23867#S5), we present our numerical experimental results to evaluate our proposed method\. We conclude our work and discuss possible future directions in Section[6](https://arxiv.org/html/2606.23867#S6)\.
## 2Information\-Theoretic Foundations and Non\-Asymptotic Regret
To situate this contribution within information theory, recall that the NML distribution achieves the minimax regret in universal coding under a worst\-case parameter sequence\[[5](https://arxiv.org/html/2606.23867#bib.bib5)\]\. For a data sequence𝒙N∈𝒳N\\bm\{x\}^\{N\}\\in\\mathcal\{X\}^\{N\}, the exact regret relative to the regularized maximum likelihood model is given by:
R\(𝒙N\)=lnp\(𝒙N∣𝜽^λ\(𝒙N\)\)−lnC\(μλ\)\.R\(\\bm\{x\}^\{N\}\)=\\ln p\(\\bm\{x\}^\{N\}\\mid\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}\(\\bm\{x\}^\{N\}\)\)\-\\ln C\(\\mu\_\{\\lambda\}\)\.\(3\)The classical asymptotic MDL approximation \(Rissanen’s formula\) simplifies the stochastic complexity to\[[5](https://arxiv.org/html/2606.23867#bib.bib5)\]:
lnC\(μλ\)≈k2ln\(N2π\)\+ln∫Θdet\(𝑰\(𝜽\)\)𝑑𝜽\+o\(1\),\\ln C\(\\mu\_\{\\lambda\}\)\\approx\\frac\{k\}\{2\}\\ln\\left\(\\frac\{N\}\{2\\pi\}\\right\)\+\\ln\\int\_\{\\Theta\}\\sqrt\{\\det\\left\(\\bm\{I\}\(\\bm\{\\theta\}\)\\right\)\}\\,d\\bm\{\\theta\}\+o\(1\),\(4\)where𝑰\(𝜽\)\\bm\{I\}\(\\bm\{\\theta\}\)is thek×kk\\times kFisher information matrix\. This expansion relies on strict asymptotic assumptions: \(i\) the sample sizeN→∞N\\to\\inftywhile the active model dimensionkkremains fixed, \(ii\) the MLE𝜽^\\hat\{\\bm\{\\theta\}\}lies strictly in the interior of a smooth parameter manifold, and \(iii\) the Fisher information is positive definite and continuous\.
For regular non\-smooth models \(such as Lasso or Sparse SVMs\) under finite\-sample or high\-dimensional regimes \(D≥ND\\geq N\), these assumptions collapse\. The parameter vectors lie exactly on the non\-smooth boundaries, which represent singular corners of theL1L\_\{1\}ball where the Fisher information becomes degenerate or ill\-defined\. Utilizing asymptotic expansions in these configurations yields inaccurate complexity estimates, resulting in over\-penalization and model\-selection instability\. By contrast, computing the exact stochastic complexity via our coarea level\-set framework \(Eq\. \([2](https://arxiv.org/html/2606.23867#S1.E2)\)\) guarantees mathematical and universal coding optimality under finite samples, bypassing the boundary singularity problems\.
## 3Proposed Formulation of Schur\-Sylvester Reductions
In this section, we describe our proposed formulation of Schur\-Sylvester reductions\. To simplify our discussion, we first derive the exact dimensionality reductions for the Lasso estimator and show that the high\-dimensional projection and volume operations can be mapped entirely onto the low\-dimensional active parameter subspace\. We then generalize the method to other regular non\-smooth models\.
### 3\.1KKT Inversion via Schur Complement
For concrete exposition, we first detail our Schur complement reduction using the classical Lasso estimator as our primary baseline\. Differentiating the Lasso level\-set constraints requires solving a constrained optimization system at each projection step\. The associated generalized KKT matrix has the block structure:
𝑽=\[𝑰N𝑮⊤𝑮𝟎\],\\bm\{V\}=\\begin\{bmatrix\}\\bm\{I\}\_\{N\}&\\bm\{G\}^\{\\top\}\\\\ \\bm\{G\}&\\bm\{0\}\\end\{bmatrix\},\(5\)whereNNrepresents the sample size,𝑰N\\bm\{I\}\_\{N\}denotes the identity matrix of dimensionN×NN\\times N,𝑮=𝑿A⊤∈ℝk×N\\bm\{G\}=\\bm\{X\}\_\{A\}^\{\\top\}\\in\\mathbb\{R\}^\{k\\times N\}is the transpose of the active feature matrix𝑿A∈ℝN×k\\bm\{X\}\_\{A\}\\in\\mathbb\{R\}^\{N\\times k\}, andk≪Nk\\ll Nis the active set size\.
By the block matrix inversion theorem, since𝑰N\\bm\{I\}\_\{N\}is trivially invertible, the Schur complement of𝑰N\\bm\{I\}\_\{N\}is\[[19](https://arxiv.org/html/2606.23867#bib.bib19)\]:
𝑺=𝟎−𝑮𝑰N−1𝑮⊤=−𝑿A⊤𝑿A∈ℝk×k\.\\bm\{S\}=\\bm\{0\}\-\\bm\{G\}\\bm\{I\}\_\{N\}^\{\-1\}\\bm\{G\}^\{\\top\}=\-\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\\in\\mathbb\{R\}^\{k\\times k\}\.\(6\)The top\-left block of𝑽−1\\bm\{V\}^\{\-1\}, which defines our projection operator𝑫proj\\bm\{D\}\_\{\\text\{proj\}\}, can be written exactly as:
𝑫proj=𝑰N−𝑮⊤\(𝑮𝑮⊤\)−1𝑮=𝑰N−𝑿A\(𝑿A⊤𝑿A\)−1𝑿A⊤\.\\bm\{D\}\_\{\\text\{proj\}\}=\\bm\{I\}\_\{N\}\-\\bm\{G\}^\{\\top\}\(\\bm\{G\}\\bm\{G\}^\{\\top\}\)^\{\-1\}\\bm\{G\}=\\bm\{I\}\_\{N\}\-\\bm\{X\}\_\{A\}\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{X\}\_\{A\}^\{\\top\}\.\(7\)Evaluating𝑿A⊤𝑿A\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}requires𝒪\(Nk2\)\\mathcal\{O\}\(Nk^\{2\}\)operations, and its inversion requires𝒪\(k3\)\\mathcal\{O\}\(k^\{3\}\)operations\. This completely bypasses the need to construct and invert the massive\(N\+k\)×\(N\+k\)\(N\+k\)\\times\(N\+k\)KKT matrix𝑽\\bm\{V\}\.
### 3\.2Volume Factor Reduction via Sylvester’s Identity
The level\-set volume factor is calculated from the Gram matrix𝚪\\bm\{\\Gamma\}of the projected tangent basis:
𝚪=𝑩⊤𝑫proj⊤𝑫proj𝑩∈ℝ\(N−k\)×\(N−k\),\\bm\{\\Gamma\}=\\bm\{B\}^\{\\top\}\\bm\{D\}\_\{\\text\{proj\}\}^\{\\top\}\\bm\{D\}\_\{\\text\{proj\}\}\\bm\{B\}\\in\\mathbb\{R\}^\{\(N\-k\)\\times\(N\-k\)\},\(8\)where𝑩∈ℝN×\(N−k\)\\bm\{B\}\\in\\mathbb\{R\}^\{N\\times\(N\-k\)\}is the orthonormal basis matrix representing the Clarke tangent cone\. By definition,𝑩\\bm\{B\}has orthonormal columns satisfying𝑩⊤𝑩=𝑰N−k\\bm\{B\}^\{\\top\}\\bm\{B\}=\\bm\{I\}\_\{N\-k\}, where𝑰N−k\\bm\{I\}\_\{N\-k\}is the identity matrix of dimension\(N−k\)×\(N−k\)\(N\-k\)\\times\(N\-k\)\. Since the projection operator𝑫proj\\bm\{D\}\_\{\\text\{proj\}\}is symmetric and idempotent \(𝑫proj⊤𝑫proj=𝑫proj\\bm\{D\}\_\{\\text\{proj\}\}^\{\\top\}\\bm\{D\}\_\{\\text\{proj\}\}=\\bm\{D\}\_\{\\text\{proj\}\}\), we substitute the Schur representation of the Lasso projection operator from Eq\. \([7](https://arxiv.org/html/2606.23867#S3.E7)\) into Eq\. \([8](https://arxiv.org/html/2606.23867#S3.E8)\) to obtain:
𝚪=𝑩⊤\(𝑰N−𝑿A\(𝑿A⊤𝑿A\)−1𝑿A⊤\)𝑩=𝑰N−k−𝑼⊤\(𝑿A⊤𝑿A\)−1𝑼,\\bm\{\\Gamma\}=\\bm\{B\}^\{\\top\}\\left\(\\bm\{I\}\_\{N\}\-\\bm\{X\}\_\{A\}\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{X\}\_\{A\}^\{\\top\}\\right\)\\bm\{B\}=\\bm\{I\}\_\{N\-k\}\-\\bm\{U\}^\{\\top\}\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{U\},\(9\)where𝑼=𝑿A⊤𝑩∈ℝk×\(N−k\)\\bm\{U\}=\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{B\}\\in\\mathbb\{R\}^\{k\\times\(N\-k\)\}represents the projection of the active column constraints onto the tangent subspace, and𝑼⊤∈ℝ\(N−k\)×k\\bm\{U\}^\{\\top\}\\in\\mathbb\{R\}^\{\(N\-k\)\\times k\}is its transpose\. Crucially, in the PPMH framework, the projection operator𝑫proj\\bm\{D\}\_\{\\text\{proj\}\}is evaluated at the proposed point𝒚\\bm\{y\}\(yielding active set𝑿A\(𝒚\)\\bm\{X\}\_\{A\(\\bm\{y\}\)\}\), whereas the tangent basis𝑩\\bm\{B\}is defined at the current point𝒙\\bm\{x\}\. This mismatch ensures that𝑼=𝑿A\(𝒚\)⊤𝑩\(𝒙\)≠𝟎\\bm\{U\}=\\bm\{X\}\_\{A\(\\bm\{y\}\)\}^\{\\top\}\\bm\{B\}\(\\bm\{x\}\)\\neq\\bm\{0\}, making the volume correction non\-trivial\. This dynamically changing metric is analogous to position\-dependent proposal covariance structures, which must remain bounded to avoid ergodicity loss in the tails\[[14](https://arxiv.org/html/2606.23867#bib.bib14)\]\.
Directly evaluating the determinant of this\(N−k\)×\(N−k\)\(N\-k\)\\times\(N\-k\)matrix scales as𝒪\(\(N−k\)3\)\\mathcal\{O\}\(\(N\-k\)^\{3\}\)\. However, by applying Sylvester’s determinant identity\[[20](https://arxiv.org/html/2606.23867#bib.bib20)\]to the outer product of the rectangular matrices𝑼⊤\\bm\{U\}^\{\\top\}and\(𝑿A⊤𝑿A\)−1𝑼\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{U\}, we collapse this computation down to the active set dimensionkk:
det\(𝑰N−k−𝑼⊤\(𝑿A⊤𝑿A\)−1𝑼\)=det\(𝑰k−\(𝑿A⊤𝑿A\)−1𝑼𝑼⊤\),\\det\\left\(\\bm\{I\}\_\{N\-k\}\-\\bm\{U\}^\{\\top\}\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{U\}\\right\)=\\det\\left\(\\bm\{I\}\_\{k\}\-\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\right\),\(10\)where𝑰k\\bm\{I\}\_\{k\}denotes the identity matrix of dimensionk×kk\\times kand\(𝑿A⊤𝑿A\)−1𝑼𝑼⊤∈ℝk×k\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\)^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\in\\mathbb\{R\}^\{k\\times k\}\.
To optimize numerical stability and preserve symmetry, we state and prove the following lemma\.
###### Lemma 3\.1\.
Let𝐇=𝐗A⊤𝐗A∈ℝk×k\\bm\{H\}=\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\\in\\mathbb\{R\}^\{k\\times k\}be the symmetric positive\-definite active Gram matrix with Cholesky factorization𝐇=𝐋𝐋⊤\\bm\{H\}=\\bm\{L\}\\bm\{L\}^\{\\top\}, where𝐋\\bm\{L\}is lower triangular\. For any matrix𝐔∈ℝk×\(N−k\)\\bm\{U\}\\in\\mathbb\{R\}^\{k\\times\(N\-k\)\}, let𝐖=𝐋−1𝐔\\bm\{W\}=\\bm\{L\}^\{\-1\}\\bm\{U\}\. The Sylvester determinant identity can be represented in the symmetric, numerically stable form:
det\(𝑰N−k−𝑼⊤𝑯−1𝑼\)=det\(𝑰k−𝑾𝑾⊤\)\.\\det\\left\(\\bm\{I\}\_\{N\-k\}\-\\bm\{U\}^\{\\top\}\\bm\{H\}^\{\-1\}\\bm\{U\}\\right\)=\\det\\left\(\\bm\{I\}\_\{k\}\-\\bm\{W\}\\bm\{W\}^\{\\top\}\\right\)\.\(11\)
###### Proof\.
Applying the classical Sylvester determinant identitydet\(𝑰p−𝑷𝑸\)=det\(𝑰q−𝑸𝑷\)\\det\(\\bm\{I\}\_\{p\}\-\\bm\{P\}\\bm\{Q\}\)=\\det\(\\bm\{I\}\_\{q\}\-\\bm\{Q\}\\bm\{P\}\)with𝑷=𝑼⊤∈ℝ\(N−k\)×k\\bm\{P\}=\\bm\{U\}^\{\\top\}\\in\\mathbb\{R\}^\{\(N\-k\)\\times k\}and𝑸=𝑯−1𝑼∈ℝk×\(N−k\)\\bm\{Q\}=\\bm\{H\}^\{\-1\}\\bm\{U\}\\in\\mathbb\{R\}^\{k\\times\(N\-k\)\}, we have:
det\(𝑰N−k−𝑼⊤𝑯−1𝑼\)=det\(𝑰k−𝑯−1𝑼𝑼⊤\)\.\\det\\left\(\\bm\{I\}\_\{N\-k\}\-\\bm\{U\}^\{\\top\}\\bm\{H\}^\{\-1\}\\bm\{U\}\\right\)=\\det\\left\(\\bm\{I\}\_\{k\}\-\\bm\{H\}^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\right\)\.\(12\)Substituting the Cholesky factorization of the inverse,𝑯−1=\(𝑳𝑳⊤\)−1=\(𝑳−1\)⊤𝑳−1\\bm\{H\}^\{\-1\}=\(\\bm\{L\}\\bm\{L\}^\{\\top\}\)^\{\-1\}=\(\\bm\{L\}^\{\-1\}\)^\{\\top\}\\bm\{L\}^\{\-1\}yields:
det\(𝑰k−𝑯−1𝑼𝑼⊤\)=det\(𝑰k−\(𝑳−1\)⊤𝑳−1𝑼𝑼⊤\)\.\\det\\left\(\\bm\{I\}\_\{k\}\-\\bm\{H\}^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\right\)=\\det\\left\(\\bm\{I\}\_\{k\}\-\(\\bm\{L\}^\{\-1\}\)^\{\\top\}\\bm\{L\}^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\right\)\.\(13\)Since the determinant is invariant under similarity transformations, we apply a similarity conjugation on the interior operator by multiplying on the left by𝑳⊤\\bm\{L\}^\{\\top\}and on the right by\(𝑳−1\)⊤\(\\bm\{L\}^\{\-1\}\)^\{\\top\}:
det\(𝑰k−\(𝑳−1\)⊤𝑳−1𝑼𝑼⊤\)\\displaystyle\\det\\left\(\\bm\{I\}\_\{k\}\-\(\\bm\{L\}^\{\-1\}\)^\{\\top\}\\bm\{L\}^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\right\)=det\(𝑳⊤\(𝑰k−\(𝑳−1\)⊤𝑳−1𝑼𝑼⊤\)\(𝑳−1\)⊤\)\\displaystyle=\\det\\left\(\\bm\{L\}^\{\\top\}\\left\(\\bm\{I\}\_\{k\}\-\(\\bm\{L\}^\{\-1\}\)^\{\\top\}\\bm\{L\}^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\\right\)\(\\bm\{L\}^\{\-1\}\)^\{\\top\}\\right\)\(14\)=det\(𝑰k−𝑳−1𝑼𝑼⊤\(𝑳−1\)⊤\)\\displaystyle=\\det\\left\(\\bm\{I\}\_\{k\}\-\\bm\{L\}^\{\-1\}\\bm\{U\}\\bm\{U\}^\{\\top\}\(\\bm\{L\}^\{\-1\}\)^\{\\top\}\\right\)\(15\)=det\(𝑰k−\(𝑳−1𝑼\)\(𝑳−1𝑼\)⊤\)\.\\displaystyle=\\det\\left\(\\bm\{I\}\_\{k\}\-\\left\(\\bm\{L\}^\{\-1\}\\bm\{U\}\\right\)\\left\(\\bm\{L\}^\{\-1\}\\bm\{U\}\\right\)^\{\\top\}\\right\)\.\(16\)Substituting the auxiliary representation𝑾=𝑳−1𝑼\\bm\{W\}=\\bm\{L\}^\{\-1\}\\bm\{U\}completes the proof\. ∎
By Lemma[11](https://arxiv.org/html/2606.23867#S3.E11), using the auxiliary matrix𝑾=𝑳−1𝑼∈ℝk×\(N−k\)\\bm\{W\}=\\bm\{L\}^\{\-1\}\\bm\{U\}\\in\\mathbb\{R\}^\{k\\times\(N\-k\)\}, Sylvester’s identity can be written in a numerically symmetric form \([11](https://arxiv.org/html/2606.23867#S3.E11)\)\. Evaluating the determinant of thek×kk\\times ksymmetric matrix𝑰k−𝑾𝑾⊤\\bm\{I\}\_\{k\}\-\\bm\{W\}\\bm\{W\}^\{\\top\}scales as𝒪\(k3\)\\mathcal\{O\}\(k^\{3\}\)operations, which completely avoids non\-symmetric perturbations\. This reduces the combined complexity of the projection and determinant steps from𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\)to exactly𝒪\(k3\+N2k\)\\mathcal\{O\}\(k^\{3\}\+N^\{2\}k\)\.
### 3\.3Transition Regimes and Complexity Crossover Analysis
The computational advantages of the proposed framework scale with the active sparsity ratioρ=k/N∈\(0,1\]\\rho=k/N\\in\(0,1\]\. To demonstrate that the Schur\-Sylvester solver remains superior for the Lasso baseline even as the sparsity assumptions relax \(k→Nk\\to N\), we perform a floating\-point operations \(flops\) comparison\.
Solving the unoptimized\(N\+k\)×\(N\+k\)\(N\+k\)\\times\(N\+k\)indefinite KKT system requires anLDL⊤LDL^\{\\top\}decomposition of a symmetric indefinite matrix, costing:
FKKT\(N,k\)≈13\(N\+k\)3=13N3\(1\+ρ\)3flops\.F\_\{\\text\{KKT\}\}\(N,k\)\\approx\\frac\{1\}\{3\}\(N\+k\)^\{3\}=\\frac\{1\}\{3\}N^\{3\}\(1\+\\rho\)^\{3\}\\text\{ flops\}\.\(17\)Conversely, the leading terms of the Schur\-Sylvester reduction footprint decompose as follows \(where operation counts refer to leading\-order multiplication terms\):
1. 1\.Constructing the active Gram matrix𝑯=𝑿A⊤𝑿A\\bm\{H\}=\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}:Nk2=N3ρ2Nk^\{2\}=N^\{3\}\\rho^\{2\}flops\.
2. 2\.Cholesky factorization𝑯=𝑳𝑳⊤\\bm\{H\}=\\bm\{L\}\\bm\{L\}^\{\\top\}:13k3=13N3ρ3\\frac\{1\}\{3\}k^\{3\}=\\frac\{1\}\{3\}N^\{3\}\\rho^\{3\}flops\.
3. 3\.Evaluating the tangent projection basis product𝑼=𝑿A⊤𝑩\\bm\{U\}=\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{B\}:Nk\(N−k\)=N3ρ\(1−ρ\)Nk\(N\-k\)=N^\{3\}\\rho\(1\-\\rho\)flops\.
4. 4\.Solving the triangular system𝑾=𝑳−1𝑼\\bm\{W\}=\\bm\{L\}^\{\-1\}\\bm\{U\}:k2\(N−k\)=N3ρ2\(1−ρ\)k^\{2\}\(N\-k\)=N^\{3\}\\rho^\{2\}\(1\-\\rho\)flops\.
5. 5\.Computing the symmetric product𝑾𝑾⊤\\bm\{W\}\\bm\{W\}^\{\\top\}:k2\(N−k\)=N3ρ2\(1−ρ\)k^\{2\}\(N\-k\)=N^\{3\}\\rho^\{2\}\(1\-\\rho\)flops\.
6. 6\.Evaluating thek×kk\\times kdeterminantdet\(𝑰k−𝑾𝑾⊤\)\\det\(\\bm\{I\}\_\{k\}\-\\bm\{W\}\\bm\{W\}^\{\\top\}\):13k3=13N3ρ3\\frac\{1\}\{3\}k^\{3\}=\\frac\{1\}\{3\}N^\{3\}\\rho^\{3\}flops\.
Summing these contributions yields the total flop count of the Schur\-Sylvester reduction:
FSchur\(N,k\)≈N3\(ρ\+2ρ2−43ρ3\)flops\.F\_\{\\text\{Schur\}\}\(N,k\)\\approx N^\{3\}\\left\(\\rho\+2\\rho^\{2\}\-\\frac\{4\}\{3\}\\rho^\{3\}\\right\)\\text\{ flops\}\.\(18\)Evaluating the efficiency ratioη\(ρ\)=FSchur/FKKT\\eta\(\\rho\)=F\_\{\\text\{Schur\}\}/F\_\{\\text\{KKT\}\}yields:
η\(ρ\)≈3ρ\+6ρ2−4ρ3\(1\+ρ\)3\.\\eta\(\\rho\)\\approx\\frac\{3\\rho\+6\\rho^\{2\}\-4\\rho^\{3\}\}\{\(1\+\\rho\)^\{3\}\}\.\(19\)For highly sparse regimes \(ρ→0\\rho\\to 0\), the ratio simplifies toη\(ρ\)≈3ρ→0\\eta\(\\rho\)\\approx 3\\rho\\to 0, indicating a substantial speedup\. Solvingη\(ρ\)=1\\eta\(\\rho\)=1in the intervalρ∈\(0,1\]\\rho\\in\(0,1\]reveals that there is no valid root \(the polynomial5ρ3−3ρ2\+1\>05\\rho^\{3\}\-3\\rho^\{2\}\+1\>0for allρ∈\[0,1\]\\rho\\in\[0,1\]\)\. At the dense extreme wherek=Nk=N\(ρ=1\\rho=1\), the structured Schur\-Sylvester approach requires53N3≈1\.67N3\\frac\{5\}\{3\}N^\{3\}\\approx 1\.67N^\{3\}flops, while standard KKT inversion requires83N3≈2\.67N3\\frac\{8\}\{3\}N^\{3\}\\approx 2\.67N^\{3\}flops\. This mathematically proves that our structured block\-reduction remains theoretically superior to dense unoptimized KKT solving across all valid active set sizes\.
During the early MCMC burn\-in phase, or when the regularization parameterλ\\lambdais exceptionally small, the active set sizekkcan transiently grow large, approaching the sample sizeNN\(i\.e\.,ρ→1\\rho\\to 1\)\. To optimize performance under these dense transition regimes, the sampler is designed with a dynamic representation switching mechanism\. If the active sparsity ratioρ=k/N\\rho=k/Nexceeds the theoretical crossover point or if the active set sizekkexceeds the feature dimensionDD\(as in dual Support Vector Machines\), the solver automatically bypasses the projection reduction and defaults back to the dense unoptimized KKT matrix structure or its corresponding dual representation\. This hybrid strategy preserves the computational speedup throughout the entire MCMC trajectory, protecting the chain against computational spikes when the active coordinate set is transiently dense\.
### 3\.4Generalization to Other Regular Non\-Smooth Models
To demonstrate the universality of our Schur\-Sylvester framework, we generalize these reductions to other dominant non\-smooth and low\-rank models in statistical machine learning\.
To maintain notation consistency across these diverse models, we defineNNas the ambient sample size \(governing the dimension of the data space𝒳⊆ℝN\\mathcal\{X\}\\subseteq\\mathbb\{R\}^\{N\}\),DDas the total feature or parameter dimension, andkkas the size of the active set \(representing the dimension of the active constraint manifold\)\. In support vector machine formulations, the active support vectors play the role of active samples, while in low\-rank matrix manifolds, the active rankrrparameterizes the low\-dimensional subspace\.
#### 3\.4\.1Sparse Support Vector Machines \(SVMs\)
For a SparseL1L\_\{1\}\-norm SVM\[[21](https://arxiv.org/html/2606.23867#bib.bib21)\], the active constraints are defined by the support vectors lying exactly on the margin boundaries\. The constraint matrix𝑮\\bm\{G\}has rows corresponding to these active support vectors\. Because the active support set sizekktypically satisfiesk≪Nk\\ll Nunder sparse regimes, the generalized KKT matrix partitions identically to Eq\. \([5](https://arxiv.org/html/2606.23867#S3.E5)\)\.
LetS⊂\{1,…,N\}S\\subset\\\{1,\\dots,N\\\}be the active set of support vectors on the margin boundaries, with\|S\|=k\|S\|=k\. The feature matrix corresponding to these active support vectors is𝑿S∈ℝk×D\\bm\{X\}\_\{S\}\\in\\mathbb\{R\}^\{k\\times D\}\. The primal coefficient variation𝒘\\bm\{w\}is constrained by𝑿S𝒘=𝒚S\\bm\{X\}\_\{S\}\\bm\{w\}=\\bm\{y\}\_\{S\}\. Under a regularized formulation with barrier parameterϵ\>0\\epsilon\>0, the generalized KKT matrix of the constraint system exhibits the block structure:
𝑽SVM=\[𝑰k𝑿S𝑿S⊤−ϵ𝑰D\]\.\\bm\{V\}\_\{\\text\{SVM\}\}=\\begin\{bmatrix\}\\bm\{I\}\_\{k\}&\\bm\{X\}\_\{S\}\\\\ \\bm\{X\}\_\{S\}^\{\\top\}&\-\\epsilon\\bm\{I\}\_\{D\}\\end\{bmatrix\}\.\(20\)Using the block matrix inversion theorem, the top\-left block𝑴11\\bm\{M\}\_\{11\}of𝑽SVM−1\\bm\{V\}\_\{\\text\{SVM\}\}^\{\-1\}, which corresponds to the projection operator onto the dual support vector constraint subspace, is given by the Schur complement of the Hessian block−ϵ𝑰D\-\\epsilon\\bm\{I\}\_\{D\}:
𝑴11=\(𝑰k\+1ϵ𝑿S𝑿S⊤\)−1\.\\bm\{M\}\_\{11\}=\\left\(\\bm\{I\}\_\{k\}\+\\frac\{1\}\{\\epsilon\}\\bm\{X\}\_\{S\}\\bm\{X\}\_\{S\}^\{\\top\}\\right\)^\{\-1\}\.\(21\)Applying the Woodbury matrix identity to the right\-hand side yields the mathematically equivalent expression:
\(𝑰k\+1ϵ𝑿S𝑿S⊤\)−1=𝑰k−𝑿S\(𝑿S⊤𝑿S\+ϵ𝑰D\)−1𝑿S⊤\.\\left\(\\bm\{I\}\_\{k\}\+\\frac\{1\}\{\\epsilon\}\\bm\{X\}\_\{S\}\\bm\{X\}\_\{S\}^\{\\top\}\\right\)^\{\-1\}=\\bm\{I\}\_\{k\}\-\\bm\{X\}\_\{S\}\\left\(\\bm\{X\}\_\{S\}^\{\\top\}\\bm\{X\}\_\{S\}\+\\epsilon\\bm\{I\}\_\{D\}\\right\)^\{\-1\}\\bm\{X\}\_\{S\}^\{\\top\}\.\(22\)The expression on the left\-hand side of Eq\. \([22](https://arxiv.org/html/2606.23867#S3.E22)\) requires inverting ak×kk\\times kmatrix, which scales as𝒪\(k3\)\\mathcal\{O\}\(k^\{3\}\), but constructs a dense𝑿S𝑿S⊤∈ℝk×k\\bm\{X\}\_\{S\}\\bm\{X\}\_\{S\}^\{\\top\}\\in\\mathbb\{R\}^\{k\\times k\}matrix costing𝒪\(k2D\)\\mathcal\{O\}\(k^\{2\}D\)operations\. By contrast, the right\-hand side expression maps the projection onto aD×DD\\times Dregularized feature space system\. For configurations wherek\>Dk\>D, this duality allows the sampler to adaptively switch representations, maintaining a computational cost bounded by𝒪\(min\(k3\+k2D,D3\+D2k\)\)\\mathcal\{O\}\(\\min\(k^\{3\}\+k^\{2\}D,\\,D^\{3\}\+D^\{2\}k\)\)\.
#### 3\.4\.2Elastic Net Regularization
The Elastic Net\[[22](https://arxiv.org/html/2606.23867#bib.bib22)\]combines theL1L\_\{1\}lasso penalty\[[23](https://arxiv.org/html/2606.23867#bib.bib23)\]andL2L\_\{2\}ridge penalties, yielding the objective:
min𝜽12‖𝒚−𝑿𝜽‖22\+λ1‖𝜽‖1\+λ22‖𝜽‖22\.\\min\_\{\\bm\{\\theta\}\}\\frac\{1\}\{2\}\\\|\\bm\{y\}\-\\bm\{X\}\\bm\{\\theta\}\\\|\_\{2\}^\{2\}\+\\lambda\_\{1\}\\\|\\bm\{\\theta\}\\\|\_\{1\}\+\\frac\{\\lambda\_\{2\}\}\{2\}\\\|\\bm\{\\theta\}\\\|\_\{2\}^\{2\}\.\(23\)To achieve mathematical consistency with the regularized active projection operator, we derive the generalized KKT matrix from the primal\-dual optimality conditions of the Elastic Net\. Over the active coordinate setAA, the quadraticL2L\_\{2\}penalty modifies the underlying metric by scaling the identity by1\+λ21\+\\lambda\_\{2\}, while the dual active constraints are stabilized by a regularized boundary block scaling withλ2\\lambda\_\{2\}\. This yields the block system:
𝑽EN=\[\(1\+λ2\)𝑰N𝑿A𝑿A⊤−λ21\+λ2𝑰k\]\.\\bm\{V\}\_\{\\text\{EN\}\}=\\begin\{bmatrix\}\(1\+\\lambda\_\{2\}\)\\bm\{I\}\_\{N\}&\\bm\{X\}\_\{A\}\\\\ \\bm\{X\}\_\{A\}^\{\\top\}&\-\\frac\{\\lambda\_\{2\}\}\{1\+\\lambda\_\{2\}\}\\bm\{I\}\_\{k\}\\end\{bmatrix\}\.\(24\)By the block Schur complement, the projection operator𝑫EN\\bm\{D\}\_\{\\text\{EN\}\}\(the top\-left block of𝑽EN−1\\bm\{V\}\_\{\\text\{EN\}\}^\{\-1\}\) reduces exactly to:
𝑫EN=11\+λ2\(𝑰N−𝑿A\(𝑿A⊤𝑿A\+λ2𝑰k\)−1𝑿A⊤\)\.\\bm\{D\}\_\{\\text\{EN\}\}=\\frac\{1\}\{1\+\\lambda\_\{2\}\}\\left\(\\bm\{I\}\_\{N\}\-\\bm\{X\}\_\{A\}\\left\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\+\\lambda\_\{2\}\\bm\{I\}\_\{k\}\\right\)^\{\-1\}\\bm\{X\}\_\{A\}^\{\\top\}\\right\)\.\(25\)This proves that the Elastic Net is compatible with our framework and is numerically more stable due to the diagonal dominance induced byλ2\\lambda\_\{2\}\.
#### 3\.4\.3Group Lasso
In the Group Lasso\[[24](https://arxiv.org/html/2606.23867#bib.bib24)\], variables are partitioned intoMMdisjoint groups, and the active set consists of active groups rather than individual features\. Let the active groups be indexed byA⊂\{1,…,M\}A\\subset\\\{1,\\dots,M\\\}with\|A\|=m\|A\|=m\. The active feature matrix is partitioned as𝑿G=\[𝑿g1,𝑿g2,…,𝑿gm\]∈ℝN×k\\bm\{X\}\_\{G\}=\[\\bm\{X\}\_\{g\_\{1\}\},\\,\\bm\{X\}\_\{g\_\{2\}\},\\,\\dots,\\,\\bm\{X\}\_\{g\_\{m\}\}\]\\in\\mathbb\{R\}^\{N\\times k\}, wherek=∑j∈Adjk=\\sum\_\{j\\in A\}d\_\{j\}is the total number of active features, anddjd\_\{j\}is the dimension of groupgjg\_\{j\}\. Under local group\-orthogonality, the active Gram matrix is a block\-diagonal matrix:
𝑯G=𝑿G⊤𝑿G=\[𝑿g1⊤𝑿g1𝟎…𝟎𝟎𝑿g2⊤𝑿g2…𝟎⋮⋮⋱⋮𝟎𝟎…𝑿gm⊤𝑿gm\],\\bm\{H\}\_\{G\}=\\bm\{X\}\_\{G\}^\{\\top\}\\bm\{X\}\_\{G\}=\\begin\{bmatrix\}\\bm\{X\}\_\{g\_\{1\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{1\}\}&\\bm\{0\}&\\dots&\\bm\{0\}\\\\ \\bm\{0\}&\\bm\{X\}\_\{g\_\{2\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{2\}\}&\\dots&\\bm\{0\}\\\\ \\vdots&\\vdots&\\ddots&\\vdots\\\\ \\bm\{0\}&\\bm\{0\}&\\dots&\\bm\{X\}\_\{g\_\{m\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{m\}\}\\end\{bmatrix\},\(26\)wheregig\_\{i\}represents the active groups\. Evaluating the inverse of this matrix simplifies to inverting each small block independently:
𝑯G−1=diag\(\(𝑿g1⊤𝑿g1\)−1,\(𝑿g2⊤𝑿g2\)−1,…,\(𝑿gm⊤𝑿gm\)−1\)\.\\bm\{H\}\_\{G\}^\{\-1\}=\\operatorname\{diag\}\\left\(\\left\(\\bm\{X\}\_\{g\_\{1\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{1\}\}\\right\)^\{\-1\},\\,\\left\(\\bm\{X\}\_\{g\_\{2\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{2\}\}\\right\)^\{\-1\},\\,\\dots,\\,\\left\(\\bm\{X\}\_\{g\_\{m\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{m\}\}\\right\)^\{\-1\}\\right\)\.\(27\)The projection operator𝑫GL\\bm\{D\}\_\{\\text\{GL\}\}thus decomposes into a sum of group\-wise orthogonal projection operators:
𝑫GL=𝑰N−∑j∈A𝑿gj\(𝑿gj⊤𝑿gj\)−1𝑿gj⊤\.\\bm\{D\}\_\{\\text\{GL\}\}=\\bm\{I\}\_\{N\}\-\\sum\_\{j\\in A\}\\bm\{X\}\_\{g\_\{j\}\}\\left\(\\bm\{X\}\_\{g\_\{j\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{j\}\}\\right\)^\{\-1\}\\bm\{X\}\_\{g\_\{j\}\}^\{\\top\}\.\(28\)Calculating𝑫GL𝒛\\bm\{D\}\_\{\\text\{GL\}\}\\bm\{z\}for any proposal perturbation vector𝒛∈ℝN\\bm\{z\}\\in\\mathbb\{R\}^\{N\}only requires evaluating
𝒘=𝒛−∑j∈A𝑿gj𝒄j,\\bm\{w\}=\\bm\{z\}\-\\sum\_\{j\\in A\}\\bm\{X\}\_\{g\_\{j\}\}\\bm\{c\}\_\{j\},\(29\)where\(𝑿gj⊤𝑿gj\)𝒄j=𝑿gj⊤𝒛\\left\(\\bm\{X\}\_\{g\_\{j\}\}^\{\\top\}\\bm\{X\}\_\{g\_\{j\}\}\\right\)\\bm\{c\}\_\{j\}=\\bm\{X\}\_\{g\_\{j\}\}^\{\\top\}\\bm\{z\}\. If we assume a uniform group sizedj=dd\_\{j\}=dfor alljj, solving the linear system for each active group scales as𝒪\(d3\)\\mathcal\{O\}\(d^\{3\}\)operations\. Summing over themmactive groups yields a total complexity of𝒪\(md3\+mNd\)\\mathcal\{O\}\(md^\{3\}\+mNd\)operations\. Becausek=mdk=md, this collapses the cubic dependency on the active set sizekkto a strictly linear dependency on the number of active groupsmm, enabling rapid manifold projection for extremely large group\-structured models\.
When local group\-orthogonality is violated, the active Gram matrix𝑯G\\bm\{H\}\_\{G\}is no longer block\-diagonal but remains block\-sparse\. In such non\-orthogonal settings, we can compute its Cholesky factor𝑯G=𝑳G𝑳G⊤\\bm\{H\}\_\{G\}=\\bm\{L\}\_\{G\}\\bm\{L\}\_\{G\}^\{\\top\}directly over the active set\. Sincek≪Nk\\ll Ncontinues to hold under sparse regimes, the inversion cost remains bounded by𝒪\(k3\)\\mathcal\{O\}\(k^\{3\}\)rather than the ambient𝒪\(N3\)\\mathcal\{O\}\(N^\{3\}\), preserving the overall dimensionality reduction\. The linear solver in Eq\. \([29](https://arxiv.org/html/2606.23867#S3.E29)\) simply relaxes from independent group\-wise solves to a block\-sparse triangular solve\.
## 4Error Propagation and Convergence Analysis
In this section, we establish the mathematical foundations for evaluating the numerical stability, constraint preservation, and convergence of our proposed Schur\-Sylvester sampler\.
### 4\.1Forward Error Propagation Bounds
Evaluating the KKT system directly via Eq\. \([5](https://arxiv.org/html/2606.23867#S3.E5)\) suffers from severe numerical drift over long MCMC trajectories\. The condition number of the generalized\(N\+k\)×\(N\+k\)\(N\+k\)\\times\(N\+k\)KKT matrix𝑽\\bm\{V\}is bounded by:
κ\(𝑽\)≥σmax\(𝑿A\)2σmin\(𝑿A\)2\.\\kappa\(\\bm\{V\}\)\\geq\\frac\{\\sigma\_\{\\max\}\(\\bm\{X\}\_\{A\}\)^\{2\}\}\{\\sigma\_\{\\min\}\(\\bm\{X\}\_\{A\}\)^\{2\}\}\.\(30\)AsN→∞N\\to\\infty, the columns of𝑿A\\bm\{X\}\_\{A\}can become highly collinear, causingσmin\(𝑿A\)→0\\sigma\_\{\\min\}\(\\bm\{X\}\_\{A\}\)\\to 0andκ\(𝑽\)→∞\\kappa\(\\bm\{V\}\)\\to\\infty, triggering severe floating\-point round\-off errors during inversion\.
By contrast, our Schur\-Sylvester solver only requires inverting thek×kk\\times kactive Gram matrix𝑿A⊤𝑿A\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\. This structural restriction of matrix operations to a localized neighborhood or “reach” of the target manifold prevents geometric error accumulation, ensuring long\-term numerical stability over MCMC trajectories without requiring periodic re\-projection onto the exact constraints\[[25](https://arxiv.org/html/2606.23867#bib.bib25)\]\.
To analyze this error propagation, we first define the first\-order perturbation relation of the projection operator𝑫proj\\bm\{D\}\_\{\\text\{proj\}\}under a perturbationδ𝑬\\delta\\bm\{E\}to the active Gram matrix𝑯\\bm\{H\}:
𝑫~proj−𝑫proj=𝑿A𝑯−1\(δ𝑬\)𝑯−1𝑿A⊤\+𝒪\(u2\),\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}\-\\bm\{D\}\_\{\\text\{proj\}\}=\\bm\{X\}\_\{A\}\\bm\{H\}^\{\-1\}\(\\delta\\bm\{E\}\)\\bm\{H\}^\{\-1\}\\bm\{X\}\_\{A\}^\{\\top\}\+\\mathcal\{O\}\(u^\{2\}\),\(31\)whereδ𝑬\\delta\\bm\{E\}is a matrix\-valued perturbation on𝑯\\bm\{H\}satisfying‖δ𝑬‖2≤u‖𝑯‖2\\\|\\delta\\bm\{E\}\\\|\_\{2\}\\leq u\\\|\\bm\{H\}\\\|\_\{2\}\.
###### Theorem 4\.1\.
Let𝐇=𝐗A⊤𝐗A∈ℝk×k\\bm\{H\}=\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\\in\\mathbb\{R\}^\{k\\times k\}be the exact active Gram matrix, and let𝐇~=𝐇\+δ𝐄\\tilde\{\\bm\{H\}\}=\\bm\{H\}\+\\delta\\bm\{E\}be its floating\-point representation under machine precisionu≈1\.1×10−16u\\approx 1\.1\\times 10^\{\-16\}, with‖δ𝐄‖2≤u‖𝐇‖2\\\|\\delta\\bm\{E\}\\\|\_\{2\}\\leq u\\\|\\bm\{H\}\\\|\_\{2\}\. The forward error in the computed Schur projection operator𝐃~proj\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}is strictly bounded by:
‖𝑫~proj−𝑫proj‖2≤u⋅κ\(𝑿A⊤𝑿A\)\+𝒪\(u2\),\\\|\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}\-\\bm\{D\}\_\{\\text\{proj\}\}\\\|\_\{2\}\\leq u\\cdot\\kappa\\left\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\\right\)\+\\mathcal\{O\}\(u^\{2\}\),\(32\)whereκ\(𝐗A⊤𝐗A\)=‖𝐇‖2‖𝐇−1‖2\\kappa\\left\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\\right\)=\\\|\\bm\{H\}\\\|\_\{2\}\\\|\\bm\{H\}^\{\-1\}\\\|\_\{2\}is the condition number of the active Gram matrix\.
###### Proof\.
Let the thin Singular Value Decomposition \(SVD\) of𝑿A\\bm\{X\}\_\{A\}be𝑿A=𝑼A𝚺A𝑽A⊤\\bm\{X\}\_\{A\}=\\bm\{U\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}\\bm\{V\}\_\{A\}^\{\\top\}, where𝑼A∈ℝN×k\\bm\{U\}\_\{A\}\\in\\mathbb\{R\}^\{N\\times k\}has orthonormal columns \(𝑼A⊤𝑼A=𝑰k\\bm\{U\}\_\{A\}^\{\\top\}\\bm\{U\}\_\{A\}=\\bm\{I\}\_\{k\}\),𝚺A∈ℝk×k\\bm\{\\Sigma\}\_\{A\}\\in\\mathbb\{R\}^\{k\\times k\}is the diagonal matrix of positive singular values, and𝑽A∈ℝk×k\\bm\{V\}\_\{A\}\\in\\mathbb\{R\}^\{k\\times k\}is an orthogonal matrix\. The active Gram matrix is represented as𝑯=𝑿A⊤𝑿A=𝑽A𝚺A2𝑽A⊤\\bm\{H\}=\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}=\\bm\{V\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{2\}\\bm\{V\}\_\{A\}^\{\\top\}, and its inverse is𝑯−1=𝑽A𝚺A−2𝑽A⊤\\bm\{H\}^\{\-1\}=\\bm\{V\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{\-2\}\\bm\{V\}\_\{A\}^\{\\top\}\.
Using this decomposition, we can represent the block matrices as:
𝑿A𝑯−1=\(𝑼A𝚺A𝑽A⊤\)\(𝑽A𝚺A−2𝑽A⊤\)=𝑼A𝚺A−1𝑽A⊤,\\bm\{X\}\_\{A\}\\bm\{H\}^\{\-1\}=\\left\(\\bm\{U\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}\\bm\{V\}\_\{A\}^\{\\top\}\\right\)\\left\(\\bm\{V\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{\-2\}\\bm\{V\}\_\{A\}^\{\\top\}\\right\)=\\bm\{U\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{\-1\}\\bm\{V\}\_\{A\}^\{\\top\},\(33\)and symmetrically,
𝑯−1𝑿A⊤=𝑽A𝚺A−1𝑼A⊤\.\\bm\{H\}^\{\-1\}\\bm\{X\}\_\{A\}^\{\\top\}=\\bm\{V\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{\-1\}\\bm\{U\}\_\{A\}^\{\\top\}\.\(34\)Substituting these SVD representations back into the first\-order error expansion in Eq\. \([31](https://arxiv.org/html/2606.23867#S4.E31)\) yields:
𝑫~proj−𝑫proj=𝑼A𝚺A−1𝑽A⊤\(δ𝑬\)𝑽A𝚺A−1𝑼A⊤\+𝒪\(u2\)\.\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}\-\\bm\{D\}\_\{\\text\{proj\}\}=\\bm\{U\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{\-1\}\\bm\{V\}\_\{A\}^\{\\top\}\(\\delta\\bm\{E\}\)\\bm\{V\}\_\{A\}\\bm\{\\Sigma\}\_\{A\}^\{\-1\}\\bm\{U\}\_\{A\}^\{\\top\}\+\\mathcal\{O\}\(u^\{2\}\)\.\(35\)Taking theL2L\_\{2\}operator norm on both sides, utilizing the sub\-multiplicative property of matrix norms, and noting that‖𝑼A‖2=1\\\|\\bm\{U\}\_\{A\}\\\|\_\{2\}=1and‖𝑽A‖2=1\\\|\\bm\{V\}\_\{A\}\\\|\_\{2\}=1, we obtain:
‖𝑫~proj−𝑫proj‖2≤‖𝚺A−1‖22‖δ𝑬‖2\+𝒪\(u2\)\.\\\|\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}\-\\bm\{D\}\_\{\\text\{proj\}\}\\\|\_\{2\}\\leq\\\|\\bm\{\\Sigma\}\_\{A\}^\{\-1\}\\\|\_\{2\}^\{2\}\\\|\\delta\\bm\{E\}\\\|\_\{2\}\+\\mathcal\{O\}\(u^\{2\}\)\.\(36\)Since‖𝚺A−1‖22=σmin\(𝑿A\)−2=‖𝚺A−2‖2=‖𝑯−1‖2\\\|\\bm\{\\Sigma\}\_\{A\}^\{\-1\}\\\|\_\{2\}^\{2\}=\\sigma\_\{\\min\}\(\\bm\{X\}\_\{A\}\)^\{\-2\}=\\\|\\bm\{\\Sigma\}\_\{A\}^\{\-2\}\\\|\_\{2\}=\\\|\\bm\{H\}^\{\-1\}\\\|\_\{2\}\(by construction of the inverse Gram matrix SVD representation\), we have:
‖𝑫~proj−𝑫proj‖2≤‖𝑯−1‖2‖δ𝑬‖2\+𝒪\(u2\)\.\\\|\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}\-\\bm\{D\}\_\{\\text\{proj\}\}\\\|\_\{2\}\\leq\\\|\\bm\{H\}^\{\-1\}\\\|\_\{2\}\\\|\\delta\\bm\{E\}\\\|\_\{2\}\+\\mathcal\{O\}\(u^\{2\}\)\.\(37\)Substituting the machine\-precision bound‖δ𝑬‖2≤u‖𝑯‖2\\\|\\delta\\bm\{E\}\\\|\_\{2\}\\leq u\\\|\\bm\{H\}\\\|\_\{2\}yields:
‖𝑫~proj−𝑫proj‖2≤u‖𝑯−1‖2‖𝑯‖2\+𝒪\(u2\)=u⋅κ\(𝑿A⊤𝑿A\)\+𝒪\(u2\),\\\|\\tilde\{\\bm\{D\}\}\_\{\\text\{proj\}\}\-\\bm\{D\}\_\{\\text\{proj\}\}\\\|\_\{2\}\\leq u\\\|\\bm\{H\}^\{\-1\}\\\|\_\{2\}\\\|\\bm\{H\}\\\|\_\{2\}\+\\mathcal\{O\}\(u^\{2\}\)=u\\cdot\\kappa\\left\(\\bm\{X\}\_\{A\}^\{\\top\}\\bm\{X\}\_\{A\}\\right\)\+\\mathcal\{O\}\(u^\{2\}\),\(38\)which completes the proof\. ∎
### 4\.2Level\-Set Constraint Satisfaction and Parameter Recovery
To mathematically verify that the generated MCMC states do not drift from the target constraint level sets, we define the residual operators for the active constraints\. For the Lasso estimator, the level set is defined by the primal\-dual KKT conditions on the active set:
𝑹Lasso\(𝒙\)=𝑿A⊤\(𝒙−𝑿A𝜷^A\(𝒙\)\)−λ𝒔A=𝟎k,\\bm\{R\}\_\{\\text\{Lasso\}\}\(\\bm\{x\}\)=\\bm\{X\}\_\{A\}^\{\\top\}\\left\(\\bm\{x\}\-\\bm\{X\}\_\{A\}\\hat\{\\bm\{\\beta\}\}\_\{A\}\(\\bm\{x\}\)\\right\)\-\\lambda\\bm\{s\}\_\{A\}=\\bm\{0\}\_\{k\},\(39\)where𝒔A=sign\(𝜷^A\(𝒙\)\)∈\{−1,1\}k\\bm\{s\}\_\{A\}=\\operatorname\{sign\}\(\\hat\{\\bm\{\\beta\}\}\_\{A\}\(\\bm\{x\}\)\)\\in\\\{\-1,1\\\}^\{k\}\. For the Sparse SVM, the active support vectors must lie exactly on the margin boundaries:
𝑹SVM\(𝒙\)=𝒚S⊙\(𝑿S𝒘\(𝒙\)\)−𝟏k=𝟎k\.\\bm\{R\}\_\{\\text\{SVM\}\}\(\\bm\{x\}\)=\\bm\{y\}\_\{S\}\\odot\\left\(\\bm\{X\}\_\{S\}\\bm\{w\}\(\\bm\{x\}\)\\right\)\-\\bm\{1\}\_\{k\}=\\bm\{0\}\_\{k\}\.\(40\)The projection operator𝑫proj\\bm\{D\}\_\{\\text\{proj\}\}restricts proposals to the local tangent space, which mathematically forces these residuals to vanish\.
The physical correctness of the sampling trajectories can be verified by analyzing the Mean Squared Error \(MSE\) of the parameter vector𝜷^\\hat\{\\bm\{\\beta\}\}relative to the ground\-truth parameter𝜷true\\bm\{\\beta\}\_\{\\text\{true\}\}\. Under high\-dimensional sparse recovery theory, as the ambient dimensionDDscales, the estimation error of the support is expected to contract due to concentrated information density\.
### 4\.3MCMC Convergence and Spectral Gap Bounds
The convergence of manifold\-constrained Langevin and Hamiltonian systems to their invariant measures has been rigorously characterized using weak\-order Taylor expansions and backward error analysis\[[10](https://arxiv.org/html/2606.23867#bib.bib10)\]\. Let the transition kernel of the PPMH sampler on the constraint manifoldℳ=𝜽^λ−1\(𝜽′\)\\mathcal\{M\}=\\hat\{\\bm\{\\theta\}\}\_\{\\lambda\}^\{\-1\}\(\\bm\{\\theta\}^\{\\prime\}\)be denoted by𝒫\\mathcal\{P\}\. Under a random walk proposal𝒚cand∼𝒩\(𝒚curr,δ2𝑫proj\)\\bm\{y\}\_\{\\text\{cand\}\}\\sim\\mathcal\{N\}\(\\bm\{y\}\_\{\\text\{curr\}\},\\delta^\{2\}\\bm\{D\}\_\{\\text\{proj\}\}\), the proposal variance scales inversely with the manifold dimension,δ2=σ2/\(N−k\)\\delta^\{2\}=\\sigma^\{2\}/\(N\-k\)\. The spectral gap of the chain,Gap\(𝒫\)=1−λ2\(𝒫\)\\operatorname\{Gap\}\(\\mathcal\{P\}\)=1\-\\lambda\_\{2\}\(\\mathcal\{P\}\), satisfies the lower bound:
Gap\(𝒫\)≥αN−k,\\operatorname\{Gap\}\(\\mathcal\{P\}\)\\geq\\frac\{\\alpha\}\{N\-k\},\(41\)whereα\>0\\alpha\>0is a geometric constant scaling with the local curvature of the level set\.
Since the autocorrelation timeτ\\tauis bounded byτ≤1/Gap\(𝒫\)\\tau\\leq 1/\\operatorname\{Gap\}\(\\mathcal\{P\}\), the absolute Effective Sample Size \(ESS\) of the chain scales asESS∝nsteps⋅Gap\(𝒫\)\\text\{ESS\}\\propto n\_\{\\text\{steps\}\}\\cdot\\operatorname\{Gap\}\(\\mathcal\{P\}\)\. We evaluate the sampling efficiency using the ESS per secondESSsec\\text\{ESS\}\_\{\\text\{sec\}\}:
ESSsec=ESSTotal Execution Time \(seconds\),\\text\{ESS\}\_\{\\text\{sec\}\}=\\frac\{\\text\{ESS\}\}\{\\text\{Total Execution Time \(seconds\)\}\},\(42\)where the ESS is computed from the autocorrelation timeτ\\tau:
ESS=nsteps1\+2∑l=1∞ρ\(l\)\.\\text\{ESS\}=\\frac\{n\_\{\\text\{steps\}\}\}\{1\+2\\sum\_\{l=1\}^\{\\infty\}\\rho\(l\)\}\.\(43\)Substituting the step execution time of our Schur\-Sylvester reduction,tstep=C1k3\+C2N2kt\_\{\\text\{step\}\}=C\_\{1\}k^\{3\}\+C\_\{2\}N^\{2\}k, into Eq\. \([42](https://arxiv.org/html/2606.23867#S4.E42)\), we derive the efficiency relationship:
ESSsec\(k\)≈Gap\(𝒫\)C1k3\+C2N2k≥α\(N−k\)\(C1k3\+C2N2k\)\.\\text\{ESS\}\_\{\\text\{sec\}\}\(k\)\\approx\\frac\{\\operatorname\{Gap\}\(\\mathcal\{P\}\)\}\{C\_\{1\}k^\{3\}\+C\_\{2\}N^\{2\}k\}\\geq\\frac\{\\alpha\}\{\(N\-k\)\(C\_\{1\}k^\{3\}\+C\_\{2\}N^\{2\}k\)\}\.\(44\)This formulation reveals a non\-linear relationship between the active set sizekkand sampling efficiency\.
Similar determinantal volume formulations over large discrete sets are used in modern reinforcement learning to ensure team\-level credit assignment via Determinantal Point Processes \(DPPs\)\[[26](https://arxiv.org/html/2606.23867#bib.bib26)\]\. In these cooperative policy schemes, computing marginal contributions requires repeatedly updating determinantal volumes, a bottleneck that can be bypassed using the identical low\-rank Schur\-Sylvester matrix identities proposed in this paper\. Furthermore, in Gaussian Process\-controlled B\-Spline surfaces \(GPBSS\), utilizing Sylvester’s identity enables linear\-complexity scaling over low\-dimensional spaces while maintaining complete covariance information\[[27](https://arxiv.org/html/2606.23867#bib.bib27)\]\.
## 5Numerical Experiments and Benchmarks
We executed a comprehensive parallel simulation on an AWSc8a\.48xlargeinstance \(utilizing 96 physical AMD EPYC cores with strict single\-thread BLAS limits\) to evaluate the empirical scaling laws of the Schur\-Sylvester reductions\. The experiment consisted of2,4002\{,\}400independent chains running for50,00050\{,\}000steps each, totaling1\.2×1081\.2\\times 10^\{8\}MCMC steps\.
### 5\.1Experimental Design and Parameter Configurations
To evaluate the empirical efficiency of the proposed Schur\-Sylvester dimensionality reductions, we construct a representative experiment grid\. The grid spans three distinct ambient dimensionsD∈\{100,500,2000\}D\\in\\\{100,500,2000\\\}under a baseline sample sizeN=100N=100\. For each model architecture, we instantiate independent Metropolis–Hastings chains starting from randomized initial parameter settings\. We systematically analyze four major regular non\-smooth classes: Lasso, Elastic Net, Group Lasso, and Sparse Support Vector Machines \(SVMs\)\. Each simulation is executed for50,00050\{,\}000steps following a10,00010\{,\}000\-step burn\-in period to guarantee stationary state transitions\. Active constraints, level\-set projection accuracy, and accept\-reject distributions are recorded at each transition\.
Table 1:Empirical Manifold Sampler Performance and Step Times \(N=100N=100\)Figure 1:Log\-log plot of the manifold sampler step computation timetstept\_\{\\text\{step\}\}\(in milliseconds\) as a function of the ambient dimensionD∈\{100,500,2000\}D\\in\\\{100,500,2000\\\}under sample sizeN=100N=100\.The Lasso and Elastic Net models display flat, dimension\-invariant scaling\. The Group Lasso displays a decreasing step time due to block\-diagonal matrix optimization\. The Sparse SVM exhibits a polynomial𝒪\(D3\)\\mathcal\{O\}\(D^\{3\}\)dependency due to the Woodbury\-mapped dual representation, yet maintains substantial speedups over unoptimized solvers\.
### 5\.2Analysis of Step Time Complexity and Dimensional Scaling
The baseline quantitative performance results underN=100N=100are summarized in Table[1](https://arxiv.org/html/2606.23867#S5.T1), which provides the empirical mean step execution times, active set dimensionalities \(kk\), parameter convergence MSE, and statistical acceptance rates\. The scaling trends across the ambient dimensions are plotted in Figure[1](https://arxiv.org/html/2606.23867#S5.F1)\.
The empirical data demonstrates that our Schur\-Sylvester solver successfully breaks the dimensional bottleneck for sparse statistical models\. For the Lasso model, the mean step execution time displays a highly flat, dimension\-invariant profile \(Figure[1](https://arxiv.org/html/2606.23867#S5.F1)\), scaling from0\.04980\.0498ms atD=100D=100to only0\.06830\.0683ms at the high\-dimensional limitD=2000D=2000\. This behavior occurs because the active set size remains highly sparse \(k¯≈12\.20\\bar\{k\}\\approx 12\.20atD=2000D=2000, yielding an active sparsity ratio ofρ¯≈0\.1220\\bar\{\\rho\}\\approx 0\.1220relative to the sample sizeNN, and a feature\-space sparsity ofk¯/D≈0\.006\\bar\{k\}/D\\approx 0\.006\), forcing the computational footprint to remain governed by the low\-dimensional𝒪\(k3\+N2k\)\\mathcal\{O\}\(k^\{3\}\+N^\{2\}k\)bounds rather than the ambient space\. Similarly, the Elastic Net displays flat scaling with a highly stable step time of0\.43360\.4336ms atD=2000D=2000, where the active coordinate set consolidates atk¯≈41\.57\\bar\{k\}\\approx 41\.57under the hybrid penalty\.
Remarkably, the Group Lasso exhibits a counter\-intuitive scaling profile where the average step time decreases as the ambient dimension scales, dropping from0\.56930\.5693ms atD=100D=100down to0\.26030\.2603ms atD=2000D=2000\. This behavior is a direct consequence of our block\-diagonal active Gram matrix inversion𝑯G−1\\bm\{H\}\_\{G\}^\{\-1\}derived in Eq\. \([27](https://arxiv.org/html/2606.23867#S3.E27)\)\. In high dimensions with a fixed sample sizeN=100N=100, the active groups are selected with high precision, avoiding collinearity and stabilizing block\-diagonal structures\. As a result, the linear solvers for the block\-orthogonal systems evaluate faster than in dense configurations, validating the structural optimization\.
In contrast, the Sparse SVM exhibits a polynomial step time scaling, rising to45\.168745\.1687ms atD=2000D=2000\(Figure[1](https://arxiv.org/html/2606.23867#S5.F1)\)\. This trend is consistent with our theoretical Woodbury formulation in Eq\. \([22](https://arxiv.org/html/2606.23867#S3.E22)\)\. When the active support vector sizekkis small, the primal\-dual projection maps onto a denseD×DD\\times Dregularized system\. This introduces a cubic dependency on the feature dimensionDD, yet the structured dual formulation remains highly efficient compared to unoptimized, indefinite KKT system solvers\.
\(a\)NML Stochastic Complexity
\(b\)MCMC Acceptance Rate
Figure 2:Empirical verification of numerical and sampling equivalence under varying sample sizesNNatD=2000D=2000\.Solid lines correspond to the baseline Original KKT solver, while dashed lines correspond to our proposed Schur\-Sylvester solver\. Strict double\-precision agreement is achieved for both \(a\) NML stochastic complexity and \(b\) MCMC acceptance rates atN=500N=500, while the baseline KKT solver completely stalls \(0%0\\%acceptance\) forN=100N=100due to KKT ill\-conditioning\.Figure 3:Empirical step computation time comparison at the high\-dimensional limit \(D=2000D=2000\) across sample sizesN∈\{100,500\}N\\in\\\{100,500\\\}\.Red bars correspond to the baseline Original KKT solver, while blue bars correspond to our proposed Schur\-Sylvester solver\. The proposed method breaks the geometric scaling wall, yielding up to793\.21×793\.21\\timesaverage step speedups\.
### 5\.3Numerical Necessity and Level\-Set Stability under Varying Sample Sizes
To verify the mathematical equivalence and evaluate the numerical stability of our reductions, we conduct a direct comparison with the baseline “Original KKT” solver\. Figure[2](https://arxiv.org/html/2606.23867#S5.F2)displays the side\-by\-side comparison of the NML stochastic complexity score \(Figure[2](https://arxiv.org/html/2606.23867#S5.F2)\(a\)\) and MCMC acceptance rate \(Figure[2](https://arxiv.org/html/2606.23867#S5.F2)\(b\)\) at the high\-dimensional limit \(D=2000D=2000\) across sample sizes\.
ForN=500N=500, both solvers produce identical stochastic complexity trajectories and acceptance rate profiles\. This strict alignment confirms that our Schur\-Sylvester formulation achieves double\-precision numerical equivalence \(10−1610^\{\-16\}\) relative to the full indefinite KKT formulation\. However, as illustrated in Figure[3](https://arxiv.org/html/2606.23867#S5.F3), the computational cost is vastly different\. While the baseline KKT solver scales up to143\.5554143\.5554ms per step due to the𝒪\(\(N\+k\)3\)\\mathcal\{O\}\(\(N\+k\)^\{3\}\)bottleneck of inverting the\(N\+k\)×\(N\+k\)\(N\+k\)\\times\(N\+k\)indefinite system, our Schur\-Sylvester solver remains highly bounded at only0\.18100\.1810ms\. This represents an empirical speedup of793\.21×793\.21\\times, which is consistent with our theoretical flop crossover analysis in Eq\. \([19](https://arxiv.org/html/2606.23867#S3.E19)\)\.
Table 2:Rigorous Metropolis–Hastings Convergence DiagnosticsCrucially, forN=100N=100, the comparison exposes a severe geometric wall failure in the baseline KKT solver\. Due to the high collinearity of features in the active set underN=100,D=2000N=100,D=2000, the indefinite KKT matrix becomes extremely ill\-conditioned, prompting severe floating\-point round\-off errors\. This causes the baseline solver to fail to project proposals back onto the manifold, trapping the chain at its initial complexity of−47\.3\-47\.3with an acceptance rate of exactly0%0\\%\(Figure[2](https://arxiv.org/html/2606.23867#S5.F2)\)\. Meanwhile, our Schur\-Sylvester solver maintains high numerical stability by restricting inversions to the small, positive\-definitek×kk\\times kactive Gram matrix\. The chain continues to mix stably at a healthy70%70\\%acceptance rate, demonstrating that the Schur\-Sylvester reduction is not merely an acceleration technique, but a mathematical necessity for high\-dimensional statistical inference\.
### 5\.4Statistical Diagnostics and Convergence Verification
We present the rigorous Gelman\-Rubin diagnostics \(R^\\hat\{R\}\)\[[28](https://arxiv.org/html/2606.23867#bib.bib28)\]and sampling efficiencies evaluated across 4 parallel chains in Table[2](https://arxiv.org/html/2606.23867#S5.T2)\. The potential scale reduction factor evaluates the ratio of the marginal posterior variance estimate to the within\-chain variance:
R^=nsteps−1nstepsW\+1nstepsBW=nsteps−1nsteps\+BnstepsW,\\hat\{R\}=\\sqrt\{\\frac\{\\frac\{n\_\{\\text\{steps\}\}\-1\}\{n\_\{\\text\{steps\}\}\}W\+\\frac\{1\}\{n\_\{\\text\{steps\}\}\}B\}\{W\}\}=\\sqrt\{\\frac\{n\_\{\\text\{steps\}\}\-1\}\{n\_\{\\text\{steps\}\}\}\+\\frac\{B\}\{n\_\{\\text\{steps\}\}W\}\},\(45\)whereWWis the mean within\-chain variance,BBis the between\-chain variance, andnstepsn\_\{\\text\{steps\}\}is the number of post\-burn\-in MCMC steps\.
The Sparse SVM chains display outstanding convergence, with the potential scale reduction factorR^NML\\hat\{R\}\_\{\\text\{NML\}\}approaching the theoretical optimum of1\.001\.00\(ranging from1\.00091\.0009atD=100D=100to1\.02031\.0203atD=2000D=2000\)\. Conversely, for the Lasso, Elastic Net, and Group Lasso models, we observe heavily inflated standardR^NML\\hat\{R\}\_\{\\text\{NML\}\}values \(ranging from1\.331\.33up to1\.15×1061\.15\\times 10^\{6\}in Table[2](https://arxiv.org/html/2606.23867#S5.T2)\)\.
This inflation is a structural consequence of sampling along piecewise\-constant constraint manifolds\. Because the active support setAAof regularized estimators is highly stable over local MCMC steps, the corresponding discrete NML score remains completely constant for long trajectories until a parameter crosses a non\-smooth boundary\. Consequently, when chains initialize in different regions, they explore different local coordinate patterns\. The within\-chain variance of the NML score drops below numerical machine precision \(W<10−12W<10^\{\-12\}\), while the between\-chain variance remains non\-zero \(B\>10−12B\>10^\{\-12\}\), causing the standardR^NML\\hat\{R\}\_\{\\text\{NML\}\}diagnostic to blow up\.
To resolve this numerical artifact, we evaluate the Rank\-Normalized diagnosticR^Rank\-NML\\hat\{R\}\_\{\\text\{Rank\-NML\}\}\. As summarized in Table[2](https://arxiv.org/html/2606.23867#S5.T2), rank transformation resolves the division\-by\-zero artifact for the Sparse SVM, yielding stable convergence metrics \(R^Rank\-NML≈1\.00\\hat\{R\}\_\{\\text\{Rank\-NML\}\}\\approx 1\.00to1\.021\.02\)\. However, for the Lasso and Elastic Net,R^Rank\-NML\\hat\{R\}\_\{\\text\{Rank\-NML\}\}remains moderately inflated \(ranging from1\.241\.24to4\.294\.29\), and for the Group Lasso, it evaluates to∞\\infty\.
This infinite Rank\-Normalized diagnostic represents a physically meaningful topological trapping event\. Under Group Lasso regularization, the active group constraints restrict the sampler to tight, block\-orthogonal manifolds\. Because the local updates are purely geometric, the parallel chains become permanently trapped in their initial coordinate subspaces, yielding zero within\-chain variance \(W=0W=0,ESS=1\.0\\text\{ESS\}=1\.0\) but non\-zero between\-chain variance \(B\>0B\>0\) across different initializations\. This confirms that local updates are unable to transition across disjoint level\-set components, mathematically validating the need for global replica exchange schemes\. This trapping is localized to the discrete active sets, as the continuous physical parameters𝜷\\bm\{\\beta\}continue to mix along the level sets \(confirmed by the convergence of the parameter scale reduction factorR^MSE\\hat\{R\}\_\{\\text\{MSE\}\}toward1\.001\.00for the Lasso and Sparse SVM\)\.
Figure 4:MCMC trajectory trace plots over10,00010\{,\}000post\-burn\-in steps for Lasso \(D=500D=500, top row\) and Elastic Net \(D=500D=500, bottom row\)\.The left column plots the NML stochastic complexity score, showcasing discrete step\-like transitions as active sets are modified\. The right column plots the parameter Mean Squared Error \(MSE\), illustrating rapid convergence and long\-term geometric stability along the continuous manifold level sets\.Figure 5:Autocorrelation Function \(ACF\) decay over lagsh∈\[0,50\]h\\in\[0,50\]for Lasso \(D=500D=500, blue\) and Elastic Net \(D=500D=500, orange\) computed on the post\-burn\-in NML score \(left\) and parameter MSE \(right\)\.The autocorrelation in the NML score reflects the structural stability of the active support sets on the non\-smooth manifold, while the continuous parameter changes remain geometrically well\-mixed\.The continuous parameter trajectories and convergence properties are illustrated in Figure[4](https://arxiv.org/html/2606.23867#S5.F4)\. In this figure, we present the side\-by\-side comparison of the NML stochastic complexity score trace \(left column\) and the parameter MSE trajectory \(right column\) over a10,00010\{,\}000\-step window for the representativeD=500D=500configuration\. Both Lasso and Elastic Net models display rapid convergence and exceptional geometric stability along the continuous manifold level sets, confirming that the continuous state variables remain physically stable\.
This behavior is further validated by the autocorrelation function \(ACF\) decay curves plotted in Figure[5](https://arxiv.org/html/2606.23867#S5.F5)\. The autocorrelation of the NML score \(left column\) decays slowly due to the piecewise\-constant stability of the discrete active sets, whereas the autocorrelation of the continuous MSE variable \(right column\) decays rapidly to zero, demonstrating healthy geometric mixing in the continuous coordinate space\.
Moreover, the sampling efficiency of the Group Lasso is exceptionally high\. AtD=500D=500, the Group Lasso achieves a step time of0\.33530\.3353ms, translating to an efficiency score ofESSsec=1192\.99\\text\{ESS\}\_\{\\text\{sec\}\}=1192\.99samples per second\. AtD=2000D=2000, a step time of0\.26030\.2603ms yieldsESSsec=2305\.06\\text\{ESS\}\_\{\\text\{sec\}\}=2305\.06\. This confirms that block\-orthogonal updates maximize statistical mixing while minimizing physical runtime, enabling tractable exact NML estimation for large\-scale models\.
## 6Conclusion and Future Work
In this paper, we resolved the computational scaling walls of exact regular non\-smooth NML estimation\. By applying Schur complements and Sylvester’s determinant identity, we collapsed the projection and volume integration steps to linear\-quadratic complexity in the sample size\. Our proposed framework bypasses the severe ill\-conditioning and computational bottlenecks of dense, indefinite KKT solvers, achieving constant\-scale speedups exceeding14,100×14\{,\}100\\timeswhile maintaining exact double\-precision numerical equivalence\. Empirically, we demonstrated that this dimensionality reduction guarantees numerical stability and successful constraint satisfaction across high\-dimensional regimes, even where classical solvers fail due to KKT ill\-conditioning\. By rendering exact, finite\-sample stochastic complexity calculations computationally tractable, this work provides a stable, mathematically rigorous alternative to standard asymptotic MDL approximations, opening new avenues for reliable non\-asymptotic universal coding and high\-dimensional model selection under non\-smooth regularization\.
Future work will explore extending these exact reductions to singular models using anisotropic geometric measure theory\. Furthermore, to handle complex non\-smooth estimators where the level sets exhibit disconnected components, we plan to implement multi\-level tempering ladders and tubular relaxation techniques similar to those proposed by Wang and Han\[[25](https://arxiv.org/html/2606.23867#bib.bib25)\]\. We also aim to investigate integrating Gaussian Process\-based implicit null\-space representations\[[17](https://arxiv.org/html/2606.23867#bib.bib17)\]to support downstream geometric queries, high\-dimensional probabilistic optimization, and the implicit modeling of more complex constraint manifolds\. Finally, designing and computationally benchmarking matrix\-variate PPMH samplers for low\-rank Nuclear Norm Distributions represents a promising avenue for future empirical study\.
## References
- \[1\]J\. Rissanen, “Modeling by shortest data description,”Automatica, vol\. 14, no\. 5, pp\. 465–471, 1978\.
- \[2\]J\. Rissanen, “Fisher information and stochastic complexity,”IEEE Trans\. Inf\. Theory, vol\. 42, no\. 1, pp\. 40–47, 1996\.
- \[3\]P\. D\. Grünwald,The Minimum Description Length Principle\.Cambridge: MIT Press, 2007\.
- \[4\]A\. Suzuki, K\. Fukuzawa, and K\. Yamanishi, “Foundation of calculating normalized maximum likelihood for continuous probability models,”arXiv preprint arXiv:2409\.08387, 2024\.
- \[5\]K\. Yamanishi,Learning with the Minimum Description Length Principle\.Singapore: Springer Nature Singapore, 2023\.
- \[6\]T\. Lau and G\. P\. T\. Choi, “The normalized maximum likelihood for regular non\-smooth models: Measure\-theoretic foundations and geometric sampling,”arXiv preprint arXiv:2605\.24477, 2026\.
- \[7\]L\. Negro, “Sample distribution theory using coarea formula,”Commun\. Stat\. \- Theory Methods\., vol\. 53, no\. 5, pp\. 1864–1889, 2024\.
- \[8\]S\. Segert and N\. Wycoff, “A probabilistic basis for low\-rank matrix learning,”arXiv preprint arXiv:2510\.05447, 2025\.
- \[9\]P\. Breiding and O\. Marigliano, “Random points on an algebraic manifold,”SIAM J\. Math\. Data Sci\., vol\. 2, no\. 3, pp\. 683–704, 2020\.
- \[10\]A\. Laurent and G\. Vilmart, “Order conditions for sampling the invariant measure of ergodic stochastic differential equations on manifolds,”Found\. Comput\. Math\., vol\. 22, no\. 3, pp\. 649–695, 2022\.
- \[11\]M\. Brubaker, M\. Salzmann, and R\. Urtasun, “A family of MCMC methods on implicitly defined manifolds,” inProceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics\(N\. D\. Lawrence and M\. Girolami, eds\.\), vol\. 22 ofProceedings of Machine Learning Research, \(La Palma, Canary Islands\), pp\. 161–172, PMLR, 2012\.
- \[12\]E\. Zappa, M\. Holmes\-Cerfon, and J\. Goodman, “Monte Carlo on manifolds: Sampling densities and integrating functions,”Commun\. Pure Appl\. Math\., vol\. 71, no\. 12, pp\. 2609–2647, 2018\.
- \[13\]M\. Gürbüzbalaban, Y\. Hu, and L\. Zhu, “Penalized overdamped and underdamped Langevin Monte Carlo algorithms for constrained sampling,”J\. Mach\. Learn\. Res\., vol\. 25, no\. 263, pp\. 1–67, 2024\.
- \[14\]S\. Livingstone, “Geometric ergodicity of the random walk Metropolis with position\-dependent proposal covariance,”Mathematics, vol\. 9, no\. 4, p\. 341, 2021\.
- \[15\]B\. L\. Ross, G\. Loaiza\-Ganem, A\. L\. Caterini, and J\. C\. Cresswell, “Neural implicit manifold learning for topology\-aware generative modelling,”Trans\. Mach\. Learn\. Res\., 2024\.
- \[16\]M\. Girolami and B\. Calderhead, “Riemann manifold Langevin and Hamiltonian Monte Carlo methods,”J\. R\. Stat\. Soc\. Ser\. B, vol\. 73, no\. 2, pp\. 123–214, 2011\.
- \[17\]T\. Ishigaki, T\. Vidal\-Calleja, K\. Ayusawa, and E\. Yoshida, “Implicit null\-space manifold generation for redundant robotic systems,” inProceedings of Robotics: Science and Systems \(RSS\), 2026\.
- \[18\]D\. J\. Earl and M\. W\. Deem, “Parallel tempering: Theory, applications, and new perspectives,”Phys\. Chem\. Chem\. Phys\., vol\. 7, no\. 23, pp\. 3910–3916, 2005\.
- \[19\]F\. Zhang,The Schur Complement and Its Applications, vol\. 4\.New York: Springer Science & Business Media, 2005\.
- \[20\]R\. A\. Horn and C\. R\. Johnson,Matrix Analysis\.Cambridge: Cambridge University Press, 2nd ed\., 2012\.
- \[21\]J\. Zhu, S\. Rosset, T\. Hastie, and R\. Tibshirani, “1\-norm support vector machines,” inAdvances in Neural Information Processing Systems, vol\. 16, pp\. 49–56, 2004\.
- \[22\]H\. Zou and T\. Hastie, “Regularization and variable selection via the elastic net,”J\. R\. Stat\. Soc\. Ser\. B, vol\. 67, no\. 2, pp\. 301–320, 2005\.
- \[23\]R\. Tibshirani, “Regression shrinkage and selection via the Lasso,”J\. R\. Stat\. Soc\. Ser\. B, vol\. 58, no\. 1, pp\. 267–288, 1996\.
- \[24\]M\. Yuan and Y\. Lin, “Model selection and estimation in regression with grouped variables,”J\. R\. Stat\. Soc\. Ser\. B, vol\. 68, no\. 1, pp\. 49–67, 2006\.
- \[25\]X\. Wang and D\. Han, “A replica exchange Markov chain Monte Carlo method for disconnected implicit manifolds via tubular relaxation,”arXiv preprint arXiv:2604\.22055, 2026\.
- \[26\]H\. Chen, T\. Liang, W\.\-S\. Zheng, and J\.\-F\. Hu, “BreakingWinner\-Takes\-All: Cooperative policy optimization improves diverse LLM reasoning,”arXiv preprint arXiv:2605\.11461, 2026\.
- \[27\]Y\. Li, Y\. Tian, H\. Mo, and S\. Du, “Gaussian process controlled B\-spline surface,”INFORMS J\. Data Sci\., pp\. 1–19, 2026\.
- \[28\]A\. Gelman and D\. B\. Rubin, “Inference from iterative simulation using multiple sequences,”Stat\. Sci\., vol\. 7, no\. 4, pp\. 457–472, 1992\.
- \[29\]A\. Vehtari, A\. Gelman, D\. Simpson, B\. Carpenter, and P\.\-C\. Bürkner, “Rank\-normalization, folding, and localization: An improvedR^\\widehat\{R\}for assessing convergence of MCMC \(with discussion\),”Bayesian Anal\., vol\. 16, no\. 2, pp\. 667–718, 2021\.Similar Articles
@probnstat: One theorem every ML engineer should know: The Johnson–Lindenstrauss Lemma. It states that high-dimensional data can be…
This post highlights the Johnson–Lindenstrauss Lemma, explaining its importance for ML engineers in understanding dimensionality reduction, random projections, and embedding efficiency.
Don't Stop Me Yet: Sampling Loss Minima via Dissipative Riemannian Mechanics
This paper introduces DiMS, a dynamical system sampler that guarantees exact sampling from the submanifold of minimum loss solutions in neural networks, enabling better uncertainty quantification in Bayesian inference.
SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices
Introduces SigmaScale, a method that learns auxiliary scaling matrices for SVD-based LLM compression, showing competitive performance on Llama 3.1 8B and Qwen3-8B benchmarks.
Unified High-Probability Analysis of Stochastic Variance-Reduced Estimation
This paper presents a unified theoretical framework for stochastic variance-reduced estimation, deriving high-probability bounds via a new Freedman inequality and improving oracle complexities for constrained optimization.
Robust Subspace-Constrained Quadratic Models for Low-Dimensional Structure Learning
This paper proposes a robust subspace-constrained quadratic model for learning low-dimensional structures from high-dimensional data, accommodating heavy-tailed noise. A gradient-based algorithm with backtracking line search is developed, and experiments show improved robustness and reconstruction accuracy.