Pruning Deep Neural Networks via the Marchenko--Pastur Distribution

arXiv cs.LG Papers

Summary

This paper presents a Marchenko-Pastur random matrix approach to pruning deep neural networks, offering theoretical guarantees and achieving strong accuracy retention with minimal fine-tuning on ImageNet for ViT and CNN architectures.

arXiv:2606.02608v1 Announce Type: new Abstract: We study a Marchenko--Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component $R$ has small propagated logit effect $L_s \| R \psi_1(s) \|_\infty$, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune--restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive $L_2$-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge $\sigma_+$ gives a high-probability layerwise budget signal. On ImageNet-1k, after only three distillation epochs, ViT-B/16 $2{:}4{+}$ToMe reaches $83.41\%$ top-1 ($-1.70$ pp from dense) at $59.81\%$ sparse-execution MAC reduction, with $1.388\times$ best-observed A40 native-$2{:}4$ backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives $2.705\times$. At structured sparsity, ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$ ($-0.26$ pp), and ResNet152d CAST-conv+permutation reaches $81.33\%$ ($-1.53$ pp) at ${\sim}50\%$ MAC accounting with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit.


# Pruning Deep Neural Networks via the Marchenko–Pastur Distribution
Source: [https://arxiv.org/html/2606.02608](https://arxiv.org/html/2606.02608)

Leonid Berlyand, Department of Mathematics, Pennsylvania State University, University Park, PA 16802, USA

Theo Bourdais, Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA

Houman Owhadi, Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA 91125, USA

Yitzchak Shmalo, Department of Mathematics, Pennsylvania State University, University Park, PA 16802, USA

###### Abstract

We study a Marchenko–Pastur (MP) random-matrix approach to pruning deep neural networks with very small post-pruning fine-tuning budgets. The main practical contribution is accuracy retention under short calibration and fine-tuning schedules, rather than a long post-pruning reoptimization pipeline. The theory gives deterministic data-path certificates: if the removed component $R$ has small propagated logit effect $L_s \| R \psi_1(s) \|_\infty$, pruning decreases an elastic-net objective and preserves samples whose dense margin exceeds twice the perturbation. The zero-budget case gives perfect pruning; a prune–restore extension models weight restoration inside a fixed sparse-execution pattern; and an additive $L_2$-regularized model shows admissible random-like components vanish at the training limit, with persistent spikes stabilizing as the MP bulk collapses. Under iid-Gaussian sufficient conditions, the fitted MP edge $\sigma_+$ gives a high-probability layerwise budget signal.

On ImageNet-1k, after only three distillation epochs, ViT-B/16 $2{:}4{+}$ToMe reaches $83.41\%$ top-1 ($-1.70$ pp from dense) at $59.81\%$ sparse-execution MAC reduction, with $1.388\times$ best-observed A40 native-$2{:}4$ backend speedup for the same checkpoint and ToMe graph; a separate no-ToMe A100 endpoint gives $2.705\times$. At structured sparsity, ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$ ($-0.26$ pp), and ResNet152d CAST-conv+permutation reaches $81.33\%$ ($-1.53$ pp) at ${\sim}50\%$ MAC accounting with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit.

Keywords: DNNs, ViTs, Random Matrix Theory, Marchenko–Pastur distribution, pruning, regularization

Supplementary information. Full proofs, Gaussian/random-matrix specializations, and mathematical corollaries are provided in Online Resource 1. Full methodology, run recipes, checkpoint ledgers, timing-audit details, and the complete comparison table are provided in Online Resource 2. Mirror copies of the manuscript sources and PDFs are maintained at [https://github.com/yspennstate/RMT_based_pruning_in_deep_learning](https://github.com/yspennstate/RMT_based_pruning_in_deep_learning).

Norm convention. Throughout, $\|x\|_\infty$ denotes the vector maximum norm, while $\|A\|_\infty$ denotes the induced matrix norm

$$\|A\|_\infty := \max_i \sum_j |A_{ij}|.$$
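As a quick numerical illustration of this convention, the following NumPy sketch computes both norms and checks the induced-norm inequality $\|Ax\|_\infty \le \|A\|_\infty \|x\|_\infty$ on a small example. The helper names `vec_inf_norm` and `mat_inf_norm` are illustrative and do not come from the paper.

```python
import numpy as np

def vec_inf_norm(x):
    """Vector maximum norm: the largest absolute entry."""
    return float(np.max(np.abs(x)))

def mat_inf_norm(A):
    """Induced infinity norm: the largest absolute row sum."""
    return float(np.max(np.sum(np.abs(A), axis=1)))

A = np.array([[1.0, -2.0],
              [3.0, 0.5]])
x = np.array([1.0, -1.0])

lhs = vec_inf_norm(A @ x)                 # ||A x||_inf
rhs = mat_inf_norm(A) * vec_inf_norm(x)   # ||A||_inf * ||x||_inf
```

The same matrix norm is available as `np.linalg.norm(A, np.inf)`.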
## 1 Introduction

DNN compression is motivated by overfitting, regularization, and deployment constraints\. Random Matrix Theory \(RMT\) has been used to study trained spectra, implicit self\-regularization, generalization diagnostics, Jacobians, and initialization\[[17](https://arxiv.org/html/2606.02608#bib.bib17),[14](https://arxiv.org/html/2606.02608#bib.bib14),[19](https://arxiv.org/html/2606.02608#bib.bib19),[31](https://arxiv.org/html/2606.02608#bib.bib31),[28](https://arxiv.org/html/2606.02608#bib.bib28),[22](https://arxiv.org/html/2606.02608#bib.bib22),[23](https://arxiv.org/html/2606.02608#bib.bib23),[21](https://arxiv.org/html/2606.02608#bib.bib21),[18](https://arxiv.org/html/2606.02608#bib.bib18),[16](https://arxiv.org/html/2606.02608#bib.bib16)\]\. For fixed pruning hyperparameters, this paper uses MP spectral diagnostics to allocate masks without validation or test access at mask\-construction time\.

A main contribution is that the reported drops are obtained with little post-pruning fine-tuning: one epoch per unstructured pruning cycle and three distillation epochs for CAST/CAST-conv, rather than long retraining schedules. The empirical target is sparsity inside dense affine maps of Vision Transformers and related ImageNet models \[[5](https://arxiv.org/html/2606.02608#bib.bib5),[29](https://arxiv.org/html/2606.02608#bib.bib29)\]. Trained spectra are heterogeneous: the method treats MP-like layers as candidates for larger pruning budgets, while protecting non-MP or heavy-tailed layers. The main numerical results are deliberately front-loaded. On ImageNet-1k, Hybrid Magnitude–SER keeps ViT-B/16 at $83.37\%$ top-1 at $50\%$ unstructured sparsity, while the deployable CAST $2{:}4{+}$ToMe row reaches $83.41\%$ after only three distillation epochs at $59.81\%$ sparse-execution MAC reduction. The same checkpoint and ToMe graph gives a measured $1.36\times$ fixed-batch A40 native-$2{:}4$ speedup ($1.388\times$ best batch-sweep value); a separate no-ToMe ViT-B/16 dense-to-$2{:}4$ A100 endpoint gives $2.705\times$. Wider structured projections improve the accuracy side: ViT-B/16 $6{:}12$ reaches $83.74\%$, ViT-L/16 $8{:}16$ dense+permutation reaches $85.33\%$ ($-0.51$ pp), and ConvNeXtV2-Base $12{:}16$ reaches $86.35\%$ ($-0.37$ pp). For CNNs, ResNet50 $8{:}16$ dense+permutation reaches $75.87\%$, only $0.26$ pp below dense, and ResNet152d CAST-conv+permutation reaches $81.33\%$, $1.53$ pp below dense, with a $1.62\times$ A40 im2col$+2{:}4$ sparse-GEMM audit. The wider $6{:}12$, $8{:}16$, and $12{:}16$ rows are accuracy/MAC-accounting rows, not native sparse Tensor Core throughput claims.

We also present three main theoretical results. First, the deterministic data-path certificate says that if the removed component $R$ has small propagated logit effect $L_s\|R\psi_1(s)\|_\infty$, then pruning decreases an elastic-net objective and preserves every training sample whose dense margin is larger than twice this perturbation (Lemma [5.1](https://arxiv.org/html/2606.02608#S5.Thmthm1), Corollary [5.2](https://arxiv.org/html/2606.02608#S5.Thmthm2), Theorem [5.4](https://arxiv.org/html/2606.02608#S5.Thmthm4)). In the zero-budget or “perfect pruning” case, the margin bound gives no accuracy loss on the training set. Second, the prune–restore certificate models structured $k{:}n$ sparsity: restoring entries inside an already-paid sparse-execution group can improve the certificate while leaving the final sparse-execution pattern unchanged (Theorem [5.5](https://arxiv.org/html/2606.02608#S5.Thmthm5)). Third, the additive $L_2$-regularized theory shows that, under one-sided stationarity and local-path convergence, admissible random-like components vanish at the training limit; the associated MP bulk collapses while persistent signal spikes stabilize (Theorem [4.8](https://arxiv.org/html/2606.02608#S4.Thmthm8), Corollaries [4.9](https://arxiv.org/html/2606.02608#S4.Thmthm9)–[4.12](https://arxiv.org/html/2606.02608#S4.Thmthm12), and the generalized Theorem [4.7](https://arxiv.org/html/2606.02608#S4.Thmthm7)). The MP edge enters as a sufficient random-matrix condition for data-path budgets and as the empirical layer-allocation signal used by SER/CAST. Tables [1](https://arxiv.org/html/2606.02608#S3.T1), [2](https://arxiv.org/html/2606.02608#S3.T2), [3](https://arxiv.org/html/2606.02608#S3.T3), and [4](https://arxiv.org/html/2606.02608#S3.T4) report the main unstructured, structured, MAC-reduction, and deployability results; Table [5](https://arxiv.org/html/2606.02608#S3.T5) gives literature context. Prior pruning taxonomies and ViT/CNN compression citations are collected in Online Resource 2.
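The "twice the perturbation" threshold in the first result reflects an elementary fact: if no logit moves by more than $\varepsilon$ in maximum norm, the gap between the correct logit and any competitor shrinks by at most $2\varepsilon$. The sketch below checks this numerically on random logits. It is an illustration only: `eps` stands in for a bound such as $L_s\|R\psi_1(s)\|_\infty$, which is not computed here, and the function name `margin` is hypothetical.

```python
import numpy as np

def margin(logits, label):
    """Correct-class logit minus the largest competing logit."""
    others = np.delete(logits, label)
    return float(logits[label] - np.max(others))

rng = np.random.default_rng(0)
K, eps = 10, 0.3          # number of classes, assumed perturbation bound
checked = 0               # samples whose dense margin exceeds 2 * eps
preserved = True          # stays True if none of them changes class

for _ in range(1000):
    z = rng.normal(size=K)                    # "dense" logits
    label = int(np.argmax(z))                 # class predicted by the dense model
    delta = rng.uniform(-eps, eps, size=K)    # perturbation with ||delta||_inf <= eps
    if margin(z, label) > 2 * eps:
        checked += 1
        preserved = preserved and int(np.argmax(z + delta)) == label
```

Samples with margin below $2\varepsilon$ are skipped rather than counted as failures, since the certificate makes no claim about them.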

Section [2](https://arxiv.org/html/2606.02608#S2) gives DNN and MP preliminaries. Section [3](https://arxiv.org/html/2606.02608#S3) reports the numerical evidence. Sections [4](https://arxiv.org/html/2606.02608#S4) and [5](https://arxiv.org/html/2606.02608#S5) state the additive and deterministic certificates. Full proofs and mathematical details are in Online Resource 1; full protocols, algorithms, provenance notes, and supplementary numerical material are in Online Resource 2.

## 2 Randomness in Deep Neural Networks

### 2.1 Introduction to Deep Neural Networks

In classification tasks, the goal is to assign each element of a set $S$ to one of $K$ classes. Let $C(s)\in\{1,\dots,K\}$ denote the correct class of $s\in S$. Given a labeled training set $T\subset S$, we seek a classifier that generalizes from $T$ to unseen data. We consider DNNs of the form

$$\phi(\cdot,\alpha)=\rho\circ X(\cdot,\alpha),$$

where $\rho$ is the softmax map and $X(\cdot,\alpha)$ is a composition of affine maps and nonlinearities:

$$X(\cdot,\alpha)=\lambda\circ M_L(\cdot,\alpha)\circ\cdots\circ\lambda\circ M_1(\cdot,\alpha).$$

Here:

- $M_k(\cdot,\alpha)$ is an affine map $\mathbb{R}^{N_{k-1}}\to\mathbb{R}^{N_k}$ with weight matrix $W_k\in\mathbb{R}^{N_k\times N_{k-1}}$ and bias vector $\beta_k\in\mathbb{R}^{N_k}$, so $M_k(x)=W_kx+\beta_k$.
- $\lambda:\mathbb{R}^m\to\mathbb{R}^m$ is a nonlinear activation. In the simplified theory below we take $\lambda$ to be either the coordinatewise absolute value or ReLU.
- The softmax map $\rho:\mathbb{R}^K\to\mathbb{R}^K$ is given by

  $$\rho(v)_i=\frac{e^{v_i}}{\sum_{j=1}^{K}e^{v_j}},\qquad v\in\mathbb{R}^K. \tag{1}$$

The standard cross-entropy loss is

$$L_{\mathrm{CE}}(\alpha)=-\frac{1}{|T|}\sum_{s\in T}\log\big(\phi_{C(s)}(s,\alpha)\big). \tag{2}$$
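To make the notation concrete, here is a minimal NumPy sketch of the forward map $X$, the softmax (1), and the cross-entropy loss (2) on a toy network. The layer sizes, random weights, and function names are illustrative assumptions; ReLU is used for $\lambda$, one of the two activations the simplified theory allows.

```python
import numpy as np

def forward(x, layers):
    """X(x): apply each affine map W x + b, followed by the activation (ReLU here)."""
    for W, b in layers:
        x = np.maximum(W @ x + b, 0.0)
    return x

def softmax(v):
    """Equation (1), shifted by max(v) for numerical stability."""
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

def cross_entropy(samples, labels, layers):
    """Equation (2): mean negative log-probability of the correct class."""
    losses = [-np.log(softmax(forward(s, layers))[c])
              for s, c in zip(samples, labels)]
    return float(np.mean(losses))

# Toy network: 4 inputs -> 5 hidden units -> K = 3 classes, zero biases.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(5, 4)), np.zeros(5)),
          (rng.normal(size=(3, 5)), np.zeros(3))]
samples = rng.normal(size=(8, 4))
labels = rng.integers(0, 3, size=8)
loss = cross_entropy(samples, labels, layers)
```

Because the document's $X$ applies $\lambda$ after the last affine map as well, the logits in this sketch are nonnegative. Practical classifiers usually omit the final activation.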

### 2.2 The MP distribution in machine learning contexts

The Marchenko–Pastur distribution is a basic object in RMT\[[15](https://arxiv.org/html/2606.02608#bib.bib15)\]; high\-dimensional random\-matrix methods more broadly have applications in signal processing, statistics, wireless communication, and machine learning\[[30](https://arxiv.org/html/2606.02608#bib.bib30),[9](https://arxiv.org/html/2606.02608#bib.bib9),[26](https://arxiv.org/html/2606.02608#bib.bib26),[3](https://arxiv.org/html/2606.02608#bib.bib3)\]\. We first define the relevant empirical spectral distributions\.

###### Definition 2.1 (Eigenvalue and singular-value empirical spectral distributions).

Let $G\in\mathbb{R}^{N\times M}$, and let $s_1(G),\dots,s_{\min\{N,M\}}(G)$ denote its singular values, counted with multiplicity and including possible zeros. The singular-value empirical spectral distribution (ESD) of $G$ is

$$\nu_G:=\frac{1}{\min\{N,M\}}\sum_{i=1}^{\min\{N,M\}}\delta_{s_i(G)}.$$

If $A\in\mathbb{R}^{M\times M}$ is symmetric positive semidefinite with eigenvalues $\lambda_1(A),\dots,\lambda_M(A)$, its eigenvalue ESD is

$$\mu_A:=\frac{1}{M}\sum_{i=1}^{M}\delta_{\lambda_i(A)}.$$

###### Theorem 2.2 (Marchenko–Pastur law).

Let $W_N$ be an $N\times M_N$ random matrix with i.i.d. entries of mean $0$, variance $\sigma^2$, and finite fourth moment. Define

$$X_N:=\frac{1}{N}W_N^\top W_N,\qquad c_N:=\frac{M_N}{N}\to c\in(0,\infty).$$

Then the eigenvalue ESD $\mu_{X_N}$ converges almost surely to the Marchenko–Pastur law

$$\mu_{\mathrm{MP}}^{c,\sigma^2}=\left(1-\frac{1}{c}\right)_+\delta_0+\frac{\sqrt{(\lambda_+-x)(x-\lambda_-)}}{2\pi c\sigma^2 x}\,\mathbf{1}_{[\lambda_-,\lambda_+]}(x)\,\mathrm{d}x, \tag{3}$$

where

$$\lambda_\pm=\sigma^2(1\pm\sqrt{c})^2. \tag{4}$$

*Proof.* This is a classical cited random-matrix input rather than a new theorem of the paper; see Online Resource 1 for source notes.
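The statement can be checked empirically at moderate size. The sketch below samples a Gaussian $W_N$, forms $X_N = W_N^\top W_N / N$, and compares its extreme eigenvalues with the edges (4); it also evaluates the continuous part of the density in (3). The dimensions, seed, and function names are arbitrary illustrative choices, and a finite sample only approximates the limiting law.

```python
import numpy as np

def mp_edges(c, sigma2):
    """Bulk edges lambda_- and lambda_+ from equation (4)."""
    return sigma2 * (1 - np.sqrt(c)) ** 2, sigma2 * (1 + np.sqrt(c)) ** 2

def mp_density(x, c, sigma2):
    """Continuous part of the MP law in equation (3); zero outside the bulk."""
    lo, hi = mp_edges(c, sigma2)
    x = np.asarray(x, dtype=float)
    out = np.zeros_like(x)
    inside = (x > lo) & (x < hi)
    xi = x[inside]
    out[inside] = np.sqrt((hi - xi) * (xi - lo)) / (2 * np.pi * c * sigma2 * xi)
    return out

rng = np.random.default_rng(0)
N, M, sigma2 = 2000, 500, 1.0            # aspect ratio c = M / N = 0.25
W = rng.normal(scale=np.sqrt(sigma2), size=(N, M))
X = W.T @ W / N                          # M x M sample covariance
eigs = np.linalg.eigvalsh(X)             # eigenvalues in ascending order

lam_minus, lam_plus = mp_edges(M / N, sigma2)

# Trapezoidal integral of the density over the bulk (no atom since c <= 1).
grid = np.linspace(lam_minus, lam_plus, 20001)
dens = mp_density(grid, M / N, sigma2)
mass = float(np.sum(0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)))
```

With $c = 0.25$ and $\sigma^2 = 1$ the edges are $\lambda_- = 0.25$ and $\lambda_+ = 2.25$, and `eigs.min()` and `eigs.max()` should land close to them.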

###### Remark 2.3.

If $M_N\leq N$ for all $N$, then $c\in(0,1]$ and the atom at zero in ([3](https://arxiv.org/html/2606.02608#S2.E3)) vanishes. When discussing the singular-value scale of $W_N$, we write

$$\sigma_+:=\sqrt{N\lambda_+}$$

for the corresponding MP upper edge.

### 2.3 Reduction of randomness in DNN weights via MP diagnostics

At common initialization scale $W_{ij}$ has variance $g/N$, so the remark above gives $\sigma_+=\sqrt{g}\,(1+\sqrt{c})$ in the rectangular aspect-ratio limit $M/N\to c$. Singular values and the nonzero eigenvalues of $W^\top W$ are then $O(1)$, while the eigenvalues of $X_\ell=W_\ell^\top W_\ell/N$ are $O(1/N)$; learned spectra later deviate from the initial random law \[[17](https://arxiv.org/html/2606.02608#bib.bib17),[27](https://arxiv.org/html/2606.02608#bib.bib27)\]. Training introduces structure, motivating the following signal-plus-randomness suppositions.
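For readers who want the intermediate step, the edge value above follows by substituting the initialization variance $\sigma^2 = g/N$ into (4) and then applying the definition of $\sigma_+$ from Remark 2.3. The display restates those two formulas under the i.i.d. hypotheses of Theorem 2.2; it is not an additional result:

```latex
\begin{aligned}
\lambda_\pm &= \frac{g}{N}\,(1 \pm \sqrt{c})^2 = O(1/N)
  && \text{(bulk edges for the eigenvalues of } X = W^\top W / N\text{)}, \\
\sigma_+ &= \sqrt{N \lambda_+} = \sqrt{g}\,(1 + \sqrt{c})
  && \text{(upper edge on the singular-value scale of } W\text{)}, \\
N \lambda_\pm &= g\,(1 \pm \sqrt{c})^2 = O(1)
  && \text{(bulk edges for the eigenvalues of } W^\top W\text{)}.
\end{aligned}
```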

#### Supposition 1.

After $t$ training steps, the layer-$\ell$ weight matrix can be decomposed as

$$W_\ell(t)=R_\ell(t)+S_\ell(t),$$

where $R_\ell(t)$ is an independent random perturbation and $S_\ell(t)$ is a structured signal component.

#### Supposition 2.

For the spectral spike interpretation, $S_\ell(t)$ is low rank or approximately low rank relative to the width.

These suppositions are modeling devices for the MP-budget argument, not assertions that trained ViTs are globally iid noise plus low-rank signal. Figure [1](https://arxiv.org/html/2606.02608#S2.F1) gives two representative layer diagnostics; the assumptions behind using MP fit for budgets are stated in Sections [4](https://arxiv.org/html/2606.02608#S4) and [5](https://arxiv.org/html/2606.02608#S5).

![Refer to caption](https://arxiv.org/html/2606.02608v1/ViT_base_73LRM_.74MPfit.png)

(a) ViT-B/16 layer example with MP-fit error $0.74$ and bulk fraction $73\%$.

![Refer to caption](https://arxiv.org/html/2606.02608v1/ViT_base_99.73LRM_.01MPfit.png)

(b) ViT-B/16 layer example with MP-fit error $0.01$ and bulk fraction $99.73\%$.

Figure 1: Layerwise MP diagnostics for two projection matrices from the same trained ViT-B/16.

## 3 Numerical evidence: RMT-guided pruning of ViTs

This section reports empirical pruning evidence and gives the minimum algorithmic context needed to read the tables\. Full run recipes, scripts, checkpoint ledgers, calibration details, and migrated numerical appendices are in Online Resource 2\. All ImageNet\-1k top\-1 values in this section are measured after the stated pruning and fine\-tuning recipe on the corresponding dense checkpoint; validation labels are not used to construct masks\. The dense baseline in a row is therefore part of the row’s protocol, and comparisons to external work should be read through the reported drop, compression axis, training budget, and deployment note rather than through raw top\-1 alone\.

The simplest baseline is magnitude pruning. For a target sparsity $s$, it removes the entries with the smallest absolute weights in the prunable tensors, either globally or within the layer set specified by the experiment. Magnitude pruning is useful because it is deterministic, cheap, and hard to dismiss: if an RMT-guided method cannot beat it under the same checkpoint, layer set, and fine-tuning budget, then the spectral signal has not added evidence. It is also incomplete for this setting. At high sparsity it can spend too much budget in layers whose small entries still matter along the data path, and it has no mechanism for allocating different budgets from the layer spectra.

The “classical RMT” baseline is different. It fits a Marchenko–Pastur bulk to a weight matrix or to a reshaped convolutional tensor, estimates an upper bulk edge $\sigma_+$, and treats the sub-edge component as the random-like part suggested by the fitted spectrum. In its direct pruning form, the method removes or sparsifies this component by singular-vector reconstruction rather than by weight magnitude alone. This is an important diagnostic baseline because it asks whether the MP edge by itself identifies removable directions. Table [1](https://arxiv.org/html/2606.02608#S3.T1) shows that it helps relative to magnitude at some higher sparsities, but it is not the final method: a literal sub-edge reconstruction is too coarse for modern ViT weights, and the paper does not claim that every sub-edge direction is noise.
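The sub-edge split can be illustrated on synthetic data that follows the signal-plus-randomness suppositions of Section 2.3. The sketch below is not the paper's fitting procedure: it assumes the noise standard deviation `sigma` is known instead of fitting the bulk, uses the asymptotic edge $\sqrt{N\lambda_+} = \sigma(\sqrt{N}+\sqrt{M})$, and adds a hypothetical finite-size safety factor `slack`.

```python
import numpy as np

def split_at_edge(W, sigma, slack=1.05):
    """Split W by SVD into a super-edge part and a sub-edge remainder.

    sigma: assumed entrywise noise standard deviation (hypothetical input;
           the paper estimates the edge from a fitted MP bulk instead).
    slack: hypothetical safety factor applied to the asymptotic edge.
    """
    N, M = W.shape
    edge = slack * sigma * (np.sqrt(N) + np.sqrt(M))   # sqrt(N * lambda_+), c = M / N
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    keep = s > edge
    kept_part = (U[:, keep] * s[keep]) @ Vt[keep]      # singular-vector reconstruction
    return kept_part, W - kept_part, int(keep.sum()), edge

# Synthetic layer: iid Gaussian noise plus a rank-3 signal with large singular values.
rng = np.random.default_rng(0)
N, M, r, sigma = 400, 200, 3, 0.05
noise = rng.normal(scale=sigma, size=(N, M))
U0, _ = np.linalg.qr(rng.normal(size=(N, r)))
V0, _ = np.linalg.qr(rng.normal(size=(M, r)))
signal = (U0 * np.array([8.0, 6.0, 4.0])) @ V0.T
W = noise + signal

kept_part, removed, n_kept, edge = split_at_edge(W, sigma)
```

On this toy matrix the three planted directions survive and the removed remainder has spectral norm below the edge. Trained ViT weights are not guaranteed to behave this cleanly, which is the caveat stated in the paragraph above.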

The practical unstructured method is SER, short for sparsify–estimate–restore. SER starts from an over-pruned candidate, evaluates the lost mass through the same MP-guided budget view, and restores entries that are most useful for the target sparse model. The point is not to keep the smallest MP-bulk entries automatically; it is to use the MP fit as a layerwise budget signal, then let the restoration step decide which entries should return. Hybrid Magnitude–SER uses plain magnitude pruning at low sparsity, where it is strong and stable, then switches to SER once the pruning level is high enough that layer allocation and restoration matter. In the canonical ViT-B/16 row this gives $83.37\%$ top-1 at $50\%$ unstructured sparsity, compared with $81.67\%$ for magnitude and $82.57\%$ for the direct RMT baseline under the same dense checkpoint and short post-pruning fine-tuning convention.

CAST is the structured-sparsity version of this idea. The target is a fixed $k{:}n$ pattern: in every group of $n$ weights, $k$ entries are kept and the rest are zeroed. A $2{:}4$ pattern is the deployable NVIDIA sparse-tensor-core case; wider $6{:}12$, $8{:}16$, and $12{:}16$ patterns are accuracy/MAC-accounting probes unless a native backend for that exact pattern is explicitly audited. CAST scores candidate group patterns using a certificate-inspired objective: preserve entries that matter for the propagated data-path perturbation while respecting the final group sparsity. The “free restoration” terminology means that, inside a group whose final sparse-execution cost is already fixed, restoring a kept entry can improve the certificate without changing the number of executed sparse weights.
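The $k{:}n$ constraint itself is easy to state in code. The sketch below projects a weight matrix onto a $k{:}n$ pattern using plain magnitude as the score. This is a stand-in: CAST's certificate-inspired scoring is described in Online Resource 2 and is not reproduced here. Grouping consecutive entries along the last axis is also an assumption of this example.

```python
import numpy as np

def kn_project(W, k, n):
    """Keep the k largest-magnitude entries in each group of n consecutive
    weights along the last axis; zero the rest. Returns (pruned, mask)."""
    rows, cols = W.shape
    if cols % n != 0:
        raise ValueError("number of columns must be divisible by n")
    groups = W.reshape(rows, cols // n, n)
    order = np.argsort(-np.abs(groups), axis=-1)             # descending magnitude
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, order[..., :k], True, axis=-1)   # mark the top-k per group
    return (groups * mask).reshape(rows, cols), mask.reshape(rows, cols)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
pruned_24, mask_24 = kn_project(W, k=2, n=4)        # 50% of weights kept
pruned_1216, mask_1216 = kn_project(W, k=12, n=16)  # 75% of weights kept
```

Swapping the magnitude score for another per-entry score only changes the `order` line, while the group constraint stays the same.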

The CAST rows should therefore be split into two categories. Rows marked $2{:}4$ with native sparse-kernel audits are deployment evidence: the same final pattern can be represented by a backend that has a real sparse execution path. The ViT-B/16 $2{:}4{+}$ToMe checkpoint, for example, reaches $83.41\%$ top-1 at $59.81\%$ sparse-execution MAC reduction and has a measured A40 native-$2{:}4$ speedup for the same checkpoint and ToMe graph. The ResNet rows use a different audit: convolution is lowered to im2col and then timed as a $2{:}4$ sparse GEMM endpoint, so those numbers are evidence that the weight pattern can accelerate the lowered matrix multiply, not proof of an end-to-end cuDNN Conv2d speedup. The wider $k{:}n$ rows answer a separate question: how much accuracy is available if the structured group is less constrained than $2{:}4$, or if a future backend exposes a wider pattern.

Token merging is another axis and is held separate in the interpretation. ToMe reduces the number of tokens processed by the ViT graph; $2{:}4$ pruning reduces the executed weights inside eligible linear maps. The $2{:}4{+}$ToMe rows report their combined sparse-execution MAC accounting because that is the graph being evaluated, but the backend speedup claims in Table [4](https://arxiv.org/html/2606.02608#S3.T4) specify whether ToMe is held fixed. This distinction is why the different no-ToMe A100 dense-to-$2{:}4$ endpoint is reported separately in Online Resource 2 rather than pooled with the A40 ToMe-held-fixed audits.

The result blocks have the following roles. Table [1](https://arxiv.org/html/2606.02608#S3.T1) is the matched ViT-B/16 ablation comparing magnitude, direct RMT, SER, and Hybrid Magnitude–SER. Table [2](https://arxiv.org/html/2606.02608#S3.T2) tests whether the Hybrid pattern survives across ViT, Swin, ConvNeXt, ResNet, and Hiera checkpoints at the same sparsity grid. Table [3](https://arxiv.org/html/2606.02608#S3.T3) reports the structured $k{:}n$ rows used for the main MAC-reduction claims. Table [4](https://arxiv.org/html/2606.02608#S3.T4) is the speed audit. The comparison table gives the closest available published contexts without treating them as a controlled leaderboard. The full theory-to-numerics map is in Online Resource 2; it matches each CAST/SER implementation step to the lemma, theorem, corollary, or empirical proxy it is meant to instantiate, and it states explicitly where the link is motivational rather than a verified certificate. The theoretical sections then explain why the algorithms are phrased as certificate-inspired pruning: the deterministic lemmas control the data-path effect of a removed component, while MP provides a sufficient random-matrix condition and a practical layerwise signal, not a complete certified transformer bound.

### 3.1 ViT-B/16 results

Table 1: ViT-B/16 pruning results on ImageNet-1k validation under this paper’s matched checkpoint, prunable-layer set, sparsity grid, evaluation pipeline, and short fine-tuning budget. Entries are top-1 accuracy (%) at sparsity $s$; the dense baseline is $85.11\%$. Bold marks the highest entry in each sparsity column.

| Method | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 | 0.55 | 0.60 | 0.65 | 0.70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Classical magnitude | 85.21 | 85.16 | 85.08 | 84.87 | 84.65 | 84.46 | 84.08 | 83.53 | 82.86 | 81.67 | 80.17 | 77.98 | 74.30 | 67.44 |
| Classical RMT | 85.21 | 85.14 | 85.01 | 84.97 | 84.77 | 84.55 | 84.31 | 83.98 | 83.31 | 82.57 | 81.31 | 79.13 | 76.46 | 71.53 |
| SER | 84.94 | 84.98 | 84.82 | 84.76 | 84.66 | 84.55 | 84.41 | 84.24 | 83.73 | 83.27 | 82.65 | 81.33 | 79.95 | 77.94 |
| Hybrid Magnitude–SER | 85.23 | 85.17 | 85.06 | 84.89 | 84.81 | 84.64 | 84.55 | 84.28 | 83.80 | 83.37 | 82.76 | 81.39 | 79.95 | 78.01 |

Layerwise interpretation and mask-only results are in Online Resource 2.

### 3.2 Broader architecture sweep

Table [2](https://arxiv.org/html/2606.02608#S3.T2) gives the Hybrid Magnitude–SER sweep; Online Resource 2 gives architecture provenance, adaptive-variant details, slope/certificate-audit analyses, aggregate statistics, and checkpoint ledgers.

Table 2: Hybrid Magnitude–SER results across ImageNet-1k checkpoints. Entries are post-FT top-1 accuracy (%) at sparsity $s$. “Base.” is the dense top-1 measured in this paper’s evaluation pipeline. Bold marks selected notable $s=0.50$ entries: small drops from dense, high-accuracy large-model rows, or strong CNN coverage. Online Resource 2 retains the protocol markers and detailed notes for the adaptive rows.

| Architecture (timm checkpoint) | Base. | 0.05 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30 | 0.35 | 0.40 | 0.45 | 0.50 | 0.55 | 0.60 | 0.65 | 0.70 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-B/16 `augreg2_in21k_ft_in1k` | 85.11 | 85.23 | 85.17 | 85.06 | 84.89 | 84.81 | 84.64 | 84.55 | 84.28 | 83.80 | 83.37 | 82.76 | 81.39 | 79.95 | 78.01 |
| ViT-B/16/384 `augreg_in21k_ft` | 86.01 | 86.05 | 86.09 | 86.07 | 85.99 | 86.00 | 85.90 | 85.81 | 85.78 | 85.48 | 85.15 | 84.64 | 83.35 | 82.50 | 81.33 |
| ViT-Large/16 `augreg_in21k_ft` | 85.84 | 85.80 | 85.84 | 85.88 | 85.75 | 84.98 | 85.32 | 85.64 | 85.04 | 85.08 | 84.52 | 84.34 | 83.81 | 83.02 | 82.13 |
| DeiT-Tiny `patch16_224.fb_in1k` | 72.21 | 72.41 | 72.38 | 72.18 | 71.86 | 71.99 | 71.75 | 71.35 | 70.66 | 68.97 | 67.74 | 65.91 | 61.17 | 55.58 | 52.55 |
| DeiT-Small `patch16_224.fb_in1k` | 79.85 | 79.82 | 79.78 | 79.62 | 79.37 | 79.31 | 79.21 | 79.00 | 78.61 | 78.06 | 77.22 | 76.25 | 74.14 | 72.05 | 68.55 |
| DeiT-Base `patch16_224.fb_in1k` | 81.80 | 81.76 | 81.70 | 81.58 | 81.46 | 81.42 | 81.28 | 81.17 | 80.99 | 80.62 | 80.12 | 79.49 | 78.16 | 76.72 | 74.90 |
| Swin-Tiny `patch4_window7_224` | 81.20 | 81.32 | 81.28 | 81.25 | 81.05 | 81.09 | 80.91 | 80.78 | 80.37 | 80.14 | 79.80 | 78.98 | 78.06 | 76.64 | 74.47 |
| ConvNeXt-Base `in22k_ft_in1k` | 85.84 | 85.72 | 85.58 | 85.51 | 85.26 | 85.35 | 85.39 | 85.26 | 85.04 | 84.94 | 84.80 | 84.09 | 83.70 | 83.11 | 81.21 |
| ResNet50d `ra2_in1k` | 80.55 | 79.90 | 80.01 | 80.02 | 80.12 | 80.09 | 80.12 | 80.06 | 79.90 | 79.48 | 79.16 | 78.93 | 77.93 | 77.25 | 76.32 |
| ResNet101d `ra2_in1k` | 82.26 | 82.06 | 82.05 | 82.15 | 82.15 | 82.06 | 82.07 | 81.93 | 81.97 | 81.61 | 81.56 | 81.47 | 80.45 | 80.15 | 79.82 |
| Hiera-Base+ `mae_in1k_ft_in1k` | 84.40 | 85.04 | 84.94 | 84.76 | 84.39 | 77.95 | 83.12 | 82.15 | 82.37 | 82.29 | 82.25 | 82.00 | 80.03 | 80.55 | 78.96 |
| ResNet18 `tv_in1k` | 69.76 | 69.73 | 69.52 | 69.18 | 68.94 | 69.02 | 69.28 | 69.50 | 69.50 | 69.39 | 69.12 | 69.00 | 68.31 | 67.82 | 67.23 |
| ResNet34 `tv_in1k` | 73.28 | 73.00 | 72.67 | 72.39 | 72.62 | 73.06 | 73.09 | 73.16 | 73.16 | 73.00 | 72.88 | 72.74 | 72.12 | 71.59 | 71.44 |
| ResNet50 `tv_in1k` | 76.13 | 75.85 | 75.33 | 75.72 | 76.07 | 76.17 | 76.13 | 76.13 | 76.14 | 75.75 | 75.76 | 75.70 | 74.83 | 74.62 | 74.15 |
| DeiT-Base `patch16_224.fb_in1k` | 81.97 | 81.79 | 81.69 | 81.54 | 81.44 | 81.25 | 81.35 | 81.18 | 80.95 | 80.68 | 80.12 | 79.48 | 78.24 | 76.70 | 74.78 |
| Swin-Tiny `patch4_window7_224` | 81.39 | 81.32 | 81.29 | 81.26 | 81.05 | 80.77 | 80.47 | 80.58 | 80.43 | 80.11 | 79.80 | 78.94 | 78.03 | 76.53 | 74.31 |
| ConvNeXtV2-Base `fcmae_ft_in22k_in1k` | 86.72 | 86.67 | 86.58 | 86.43 | 86.27 | 85.93 | 85.97 | 85.96 | 85.71 | 85.58 | 85.33 | 84.81 | 84.61 | 83.83 | 81.40 |

Figures [2](https://arxiv.org/html/2606.02608#S3.F2) and [3](https://arxiv.org/html/2606.02608#S3.F3) show the same size trend in two views: at matched nominal sparsity or MAC reduction, larger dense checkpoints tend to pay a smaller top-1 penalty. This is theoretically plausible for RMT-guided pruning, because the MP edge is an asymptotic spectral object that is better resolved in larger matrices, and because wider overparameterized layers can contain a larger random-like bulk reservoir whose removal has small data-path effect under the certificate model. The figures are descriptive finite-model evidence for this scaling intuition, not a theorem that size alone guarantees pruneability.

![Refer to caption](https://arxiv.org/html/2606.02608v1/x1.png)

Figure 2: $\Delta$Top-1 relative to each dense baseline after Hybrid Magnitude–SER pruning vs. dense parameter count. Points are Table [2](https://arxiv.org/html/2606.02608#S3.T2) entries; marker style encodes architecture family. Dashed lines are OLS fits at the plotted sparsity levels.

### 3.3 Cert-aware structured sparsity: the FLOP / speedup table

Table [3](https://arxiv.org/html/2606.02608#S3.T3) summarizes structured sparsity; the CAST algorithm and source/checkpoint recipes are in Online Resource 2.

Table 3: CAST $k{:}n$ structured-projection results. “MAC red.” is dense-equivalent MAC-accounting reduction; for rows without a deploy-speed entry this is not an audited sparse-backend wall-clock claim. “Th. spd.” is $1/(1-\mathrm{MAC\ red.})$, and $\Delta$ is top-1 minus the corresponding dense baseline. “Deploy spd.” lists the selected best-observed measured sparse-backend speedup when available; Table [4](https://arxiv.org/html/2606.02608#S3.T4) reports endpoint, batch, and audit details. Rows labeled “(ours)” are this paper’s measurements. Bold marks notable accuracy or measured-speed entries, not row ownership. A dash means no exact sparse-backend wall-clock speedup is claimed.

| Architecture | Source $s$ | Method | Dense MACs | MAC red. | Th. spd. | Deploy spd. | Top-1 (%) | $\Delta$ (pp) |
|---|---|---|---|---|---|---|---|---|
| *ViT family* | | | | | | | | |
| ViT-B/16 | 0.35 | Magnitude 2:4 + ToMe (ours) | 17.56G | 59.81% | 2.49× | 1.388×\* | 82.92 | −2.19 |
| ViT-B/16 | 0.35 | CAST 2:4 + ToMe (ours) | 17.56G | 59.81% | 2.49× | **1.388×** | **83.41** | −1.70 |
| ViT-B/16 | 0.35 | CAST 6:12 SER+$\alpha$=0.5 (ours, no ToMe) | 17.56G | 50% | 2.00× | – | **83.74**§ | **−1.37** |
| ViT-L/16 | 0.35 | CAST 2:4 + ToMe (ours) | 61.55G† | ∼60%† | 2.50× | **1.394×** | **84.37** | −1.47 |
| ViT-L/16 | dense | CAST 8:16 dense+perm (ours, no ToMe) | 61.55G | 50% | 2.00× | – | **85.33**§ | **−0.51** |
| DeiT-B | 0.35 | CAST 2:4 + ToMe | 17.56G | 59.81% | 2.49× | 1.384× | 80.48 | −1.32 |
| DeiT-S | 0.35 | CAST 2:4 + ToMe | 4.61G | 59.81% | 2.49× | 1.376× | 76.96 | −2.89 |
| DeiT-T | 0.35 | CAST 2:4 + ToMe | 1.26G | 59.81% | 2.49× | 1.330× | 65.93 | −6.28 |
| *ResNet family* | | | | | | | | |
| ResNet50 | 0.35 | CAST-conv | 4.09G | 48.5% | 1.94× | – | 73.14 | −2.99 |
| ResNet50 | 0.35 | CAST-conv+perm (ours) | 4.09G | 48.5% | 1.94× | **1.700×**‡ | **75.67**‡ | **−0.46** |
| ResNet50 | dense | CAST 8:16 dense+perm (ours) | 4.09G | 50% | 2.00× | – | **75.87**§ | **−0.26** |
| ResNet50d | 0.35 | CAST-conv | 4.33G | 49.85% | 1.99× | – | 78.08 | −2.47 |
| ResNet50d | 0.35 | CAST-conv+perm (ours) | 4.33G | 49.85% | 1.99× | 1.496×‡ | 78.00‡ | −2.55 |
| ResNet50d | dense | CAST 8:16 dense+perm (ours) | 4.33G | 50% | 2.00× | – | **78.57**§ | −1.98 |
| ResNet101d | 0.35 | CAST-conv | 8.0G | ∼50% | 2.00× | – | 80.13 | −2.13 |
| ResNet101d | 0.35 | CAST-conv+perm (ours) | 8.0G | ∼50% | 2.00× | **1.568×**‡ | **80.59**‡ | −1.67 |
| ResNet101d | dense | CAST 8:16 dense+perm (ours) | 8.0G | 50% | 2.00× | – | **80.92**§ | −1.34 |
| ResNet152d | 0.35 | CAST-conv+perm (ours) | 11.8G | ∼50% | 2.00× | **1.617×**‡ | **81.33**‡ | −1.53 |
| *ConvNeXt family* | | | | | | | | |
| ConvNeXtV2-Base | 0.35 | CAST 2:4 cert + free-restore (ours) | 15.4G | 50% | 2.00× | 1.295× | **85.47** | −1.25 |
| ConvNeXtV2-Base | dense | CAST 12:16 dense+perm (ours, 25% sparse) | 15.4G | 25% | 1.33× | – | **86.35**§ | **−0.37** |
| ConvNeXtV2-Base | dense | CAST 8:16 dense+perm (ours, 50% sparse) | 15.4G | 50% | 2.00× | – | **85.85**§ | **−0.87** |

† Analytical ViT-L/16 ToMe+2:4 estimate. ‡ Exact im2col+$2{:}4$ sparse-GEMM audit, not faster cuDNN Conv2d. § Accuracy/MAC-accounting probe unless a measured endpoint is listed; dense-source wider-pattern rows are not incremental comparisons to the $s=0.35$ source rows. \* Magnitude row shares the ViT-B 2:4+ToMe deployment path; see Online Resource 2.
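The “Th. spd.” column follows directly from the caption's formula $1/(1-\mathrm{MAC\ red.})$. The short sketch below recomputes a few entries from the MAC reductions listed in Table 3; the function name `theoretical_speedup` is an illustrative assumption, not part of the paper's code.

```python
def theoretical_speedup(mac_reduction: float) -> float:
    """Return 1 / (1 - MAC reduction), with the reduction given as a fraction in [0, 1)."""
    if not 0.0 <= mac_reduction < 1.0:
        raise ValueError("mac_reduction must lie in [0, 1)")
    return 1.0 / (1.0 - mac_reduction)


# MAC reductions copied from Table 3, written as fractions.
examples = [
    ("ViT-B/16 2:4 + ToMe", 0.5981),
    ("ResNet50 CAST-conv", 0.485),
    ("ConvNeXtV2-Base 12:16", 0.25),
    ("ViT-L/16 8:16", 0.50),
]
for label, reduction in examples:
    print(f"{label}: {theoretical_speedup(reduction):.2f}x")
# prints 2.49x, 1.94x, 1.33x, 2.00x in that order
```

This is accounting only: as the caption notes, it is not a wall-clock claim unless a measured endpoint is listed.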

![Refer to caption](https://arxiv.org/html/2606.02608v1/x2.png)

Figure 3: Top-1 drop from dense vs. original dense parameter count for rows of Table [3](https://arxiv.org/html/2606.02608#S3.T3) at 48.5–50% dense-equivalent sparse-execution MAC reduction. The y-axis is dense top-1 minus post-FT top-1, so lower is better. The dashed line is an OLS fit in $\log_{10}$ parameters.

Table 4: Deployability audit for the main measured checkpoints, reporting best-observed A40 batch-sweep values. Accuracy is fixed by the saved post-FT checkpoint; no weights are rewritten, and only the in-memory execution endpoint changes. Linear rows use the PyTorch/NVIDIA native $2{:}4$ backend on A40. ResNet rows use exact im2col+$2{:}4$ sparse GEMM relative to dense im2col; these rows do not claim a faster cuDNN Conv2d endpoint. Bold marks notable measured speedups, including the strongest endpoints within the Linear and ResNet audit groups.

| Checkpoint | Audited endpoint | Batch | Dense im/s | Sparse im/s | Speedup | Status |
|---|---|---|---|---|---|---|
| ViT-B/16 + ToMe | native Linear 2:4 | 64 | 1191.8 | 1654.5 | **1.388×** | native sparse Tensor Core path |
| ViT-L/16 + ToMe | native Linear 2:4 | 64 | 620.8 | 865.6 | **1.394×** | native sparse Tensor Core path |
| DeiT-B + ToMe | native Linear 2:4 | 64 | 1189.6 | 1646.7 | 1.384× | native sparse Tensor Core path |
| DeiT-S + ToMe | native Linear 2:4 | 128 | 2708.5 | 3725.8 | 1.376× | native sparse Tensor Core path |
| DeiT-T + ToMe | native Linear 2:4 | 128 | 5158.4 | 6859.4 | 1.330× | native sparse Tensor Core path |
| ConvNeXtV2-B | native Linear 2:4 | 64 | 499.3 | 646.5 | 1.295× | pointwise Linear layers accelerated; dense convs unchanged |
| ResNet50 | im2col sparse GEMM | 128 | 469.5 | 797.9 | **1.700×** | im2col $2{:}4$ sparse GEMM |
| ResNet50d | im2col sparse GEMM | 128 | 445.2 | 666.2 | **1.496×** | im2col $2{:}4$ sparse GEMM |
| ResNet101d | im2col sparse GEMM | 128 | 228.2 | 357.8 | **1.568×** | im2col $2{:}4$ sparse GEMM |
| ResNet152d | im2col sparse GEMM | 128 | 159.9 | 258.7 | **1.617×** | im2col $2{:}4$ sparse GEMM |

#### Deployability interpretation.

Endpoint coverage and the separate A100 no\-ToMe ViT\-B benchmark are documented in Online Resource 2\.
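As a reading aid for Table 4, the speedup column appears to be the ratio of sparse to dense throughput. The sketch below checks that reading against the tabulated values; the interpretation and the helper name `max_rounding_gap` are assumptions, and the tolerance allows for the throughputs being rounded to 0.1 im/s.

```python
# (dense im/s, sparse im/s, reported speedup), copied from Table 4.
TABLE_4 = {
    "ViT-B/16 + ToMe": (1191.8, 1654.5, 1.388),
    "ViT-L/16 + ToMe": (620.8, 865.6, 1.394),
    "DeiT-B + ToMe": (1189.6, 1646.7, 1.384),
    "DeiT-S + ToMe": (2708.5, 3725.8, 1.376),
    "DeiT-T + ToMe": (5158.4, 6859.4, 1.330),
    "ConvNeXtV2-B": (499.3, 646.5, 1.295),
    "ResNet50": (469.5, 797.9, 1.700),
    "ResNet50d": (445.2, 666.2, 1.496),
    "ResNet101d": (228.2, 357.8, 1.568),
    "ResNet152d": (159.9, 258.7, 1.617),
}


def max_rounding_gap(rows: dict) -> float:
    """Largest |sparse/dense - reported speedup| over all rows."""
    return max(abs(sparse / dense - reported) for dense, sparse, reported in rows.values())


gap = max_rounding_gap(TABLE_4)
print(f"largest gap between throughput ratio and reported speedup: {gap:.4f}")
```

Under this reading every row agrees with the throughput ratio to within about one unit in the third decimal place, which is consistent with rounding of the reported throughputs.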

Several CAST rows start from denser, higher-accuracy checkpoints than the cited ResNet references. In that regime the remaining weights are plausibly carrying more useful signal, so the relevant comparison is the reported drop together with the training budget and measured endpoint, not raw top-1 alone. The wider $6{:}12$, $8{:}16$, and $12{:}16$ accuracy/MAC probes remain in Table [3](https://arxiv.org/html/2606.02608#S3.T3); Table [5](https://arxiv.org/html/2606.02608#S3.T5) carries the closest published unstructured and structured-sparsity context; the complete comparison table is provided in Online Resource 2.

### 3.4 Comparison with published pruning and structured-sparsity baselines

Table [5](https://arxiv.org/html/2606.02608#S3.T5) is literature context, not matched baseline evidence. Its main purpose is to show that the small accuracy drops reported here are obtained with very limited fine-tuning compared with many published pruning pipelines; matched evidence is Table [1](https://arxiv.org/html/2606.02608#S3.T1), Table [2](https://arxiv.org/html/2606.02608#S3.T2), and the caveats in Online Resource 2. Reference rows in this table and in the extended Online Resource 2 comparison table were checked against the corresponding source-paper tables; protocol-mismatched rows (different baseline checkpoint, different sparsity axis, different fine-tuning budget, or different deployability claim) are included only as context, not as apples-to-apples baselines. We separate unstructured or fine-grained weight sparsity from deployable sparse-kernel claims: parameter sparsity and FLOP accounting are useful compression proxies, but do not by themselves imply latency or throughput gains on a given backend \[[7](https://arxiv.org/html/2606.02608#bib.bib7),[1](https://arxiv.org/html/2606.02608#bib.bib1),[10](https://arxiv.org/html/2606.02608#bib.bib10)\].

Table 5: Closest ImageNet-1k unstructured and structured-sparsity context for the main rows. This compact table keeps the most comparable representation classes from the full comparison table: unstructured/fine-grained pruning, $2{:}4$ or $N{:}M$ semi-structured sparsity, and audited endpoint rows. The complete literature-context table is provided in Online Resource 2. Rows labeled “ours” are this paper’s measurements; “FT/train” is the reported training or post-pruning fine-tuning budget.

| Method | Architecture | Compression | Top-1 | $\Delta$ | FT/train | Main comparison point |
|---|---|---|---|---|---|---|
| Hybrid Mag–SER (ours) | ViT-B/16 | 50% unstructured | 83.37 | −1.74 | 1/cycle | MP-budgeted unstructured row with short-cycle FT. |
| CAST 2:4+ToMe (ours) | ViT-B/16 | $2{:}4$+ToMe | 83.41 | −1.70 | 3 | Native A40 $2{:}4$ endpoint 1.388×; separate no-ToMe A100 endpoint 2.705×. |
| SNOWS \[[13](https://arxiv.org/html/2606.02608#bib.bib13)\] | ViT-B/16 | $2{:}4$ QKV+Out+MLP | 76.57 | −3.85 | 0 | Closest one-shot ViT $2{:}4$ row; MiniImageNet-1k subset, no endpoint speed row. |
| MaskLLM (vision/4V) \[[6](https://arxiv.org/html/2606.02608#bib.bib6)\] | ViT-B/16 | $2{:}4$ learned mask | 79.46 | +0.31 | 20 mask ep. | Learned-mask ViT $2{:}4$ context (MaskLLM vision/4V setting); weights frozen, no ViT endpoint speed reported. |
| SparseFormer \[[8](https://arxiv.org/html/2606.02608#bib.bib8)\] | ViT-B/16 AugReg | latent-token reduction | 83.40 | −1.20 | 20+5 ep. | Similar accuracy but token reduction, not weight sparsity; 1.85× throughput. |
| CAST 8:16 dense+perm (ours) | ViT-L/16 | 8:16 = 50% | 85.33 | −0.51 | 3 | Main wider-pattern accuracy/MAC row; no native 8:16 endpoint audit. |
| ToMe \[[2](https://arxiv.org/html/2606.02608#bib.bib2)\] | ViT-L/16 MAE | token merging | 85.05 | −0.61 | MAE FT | Closest token-merging accuracy/speed context; not weight sparsity. |
| UniPTS \[[32](https://arxiv.org/html/2606.02608#bib.bib32)\] | ResNet-50 | 50/60/70% unstructured | 75.76/75.37/74.73 | −0.36/−0.75/−1.39 | 16k iters | Closest post-training unstructured ResNet row with limited calibration. |
| Hybrid Mag–SER (ours) | ResNet50 | 50/60/70% unstructured | 75.76/74.83/74.15 | −0.37/−1.30/−1.98 | 1/cycle | This paper’s matched short-cycle unstructured ResNet sweep. |
| AC/DC \[[24](https://arxiv.org/html/2606.02608#bib.bib24)\] | ResNet-50 | 50% unstructured | 77.05 | +0.21 | 100 train | Strong global sparse-training baseline; much longer training. |
| Mishra et al. \[[20](https://arxiv.org/html/2606.02608#bib.bib20)\] | ResNet-50 | 2:4 FP16 | 76.20 | +0.10 | repeated train | Original $2{:}4$ sparse Tensor Core reference; up to 2× sparse math. |
| Pool–Yu perm. \[[25](https://arxiv.org/html/2606.02608#bib.bib25)\] | ResNet-50 | $2{:}4$+perm | 76.29 | +0.13 | repeated train/FT | Closest channel-permutation semi-structured ResNet context (no inference-time overhead). |
| CAST-conv+perm (ours) | ResNet152d | 2:4 im2col | 81.33 | −1.53 | 3 | Main CNN sparse-GEMM audit; 1.617× A40 im2col, not cuDNN Conv2d. |
| CAP \[[12](https://arxiv.org/html/2606.02608#bib.bib12)\] | ConvNeXt-L CLIP | 50/60/70% unstructured | 87.5/87.1/86.8 | −0.3/−0.7/−1.0 | 0 | Closest high-accuracy ConvNeXt unstructured one-shot context; no endpoint speed row. |
| Hybrid Mag–SER (ours) | ConvNeXtV2-B | 50/60/70% unstructured | 85.33/84.61/81.40 | −1.39/−2.11/−5.32 | 1/cycle | This paper’s ConvNeXtV2 unstructured row with short-cycle FT. |
| CAST 12:16 (ours) | ConvNeXtV2-B | ∼25% MAC-accounting | 86.35 | −0.37 | 3 | Main ConvNeXtV2 wider-pattern accuracy/MAC row; no native 12:16 endpoint audit. |

### 3.5 Limitations and scope

The main empirical scope qualifications are wider-pattern deployability, ResNet Conv2d endpoint support, artifact-bundle coverage, matched-ablation scope beyond ViT-B/16, and single-seed CAST row estimates; details are in Online Resource 2. The main theory limitation is the unestimated analytic $L_s$ for modern transformer blocks; see Online Resource 1 for the local-Lipschitz discussion.

## 4 Abstract Additive Perturbation Extensions

This section gives the abstract $W=S+R$ additive analogue of the deterministic data-path framework in Section [5](https://arxiv.org/html/2606.02608#S5). Gaussian/RMT specializations and proofs are in Online Resource 1.

### 4.1 Assumptions for the generalized perturbation framework

Throughout the asymptotic statements in this section, the training set $T$ is finite and fixed independently of $N$, unless a result explicitly states summability conditions for a growing family.

###### Assumption 4.1 (General architecture).

Assume the DNN can be written as

$$\phi=\rho\circ\psi_{2}\circ(R+S)\circ\psi_{1},$$

where $\psi_{1},\psi_{2}$ may depend on the remaining network parameters and $\rho$ is softmax. For each $s\in T$, let $\mathcal{U}_{s}$ be any set containing the admissible interpolation segment $\{(S+\vartheta R)\psi_{1}(s):\vartheta\in[0,1]\}$. We assume that $\psi_{2}$ has a finite local Lipschitz constant with respect to $\ell_{\infty}$ on $\mathcal{U}_{s}$:

$$L_{\psi_{2}}(s):=\sup_{\begin{subarray}{c}v,w\in\mathcal{U}_{s}\\ v\neq w\end{subarray}}\frac{\|\psi_{2}(v)-\psi_{2}(w)\|_{\infty}}{\|v-w\|_{\infty}}<\infty.$$

###### Assumption 4.2 (Perturbative random component).

Assume that for every $v$ measurable with respect to the conditioning variables $\sigma(S,\psi_{1},\psi_{2})$, including deterministic $v$,

$$\mathbb{P}\Big(\|Rv\|_{\infty}>d_{1}(N)J(v)\,\Big|\,S,\psi_{1},\psi_{2}\Big)\leq d_{2}(N),\tag{5}$$

where $d_{1}(N),d_{2}(N)\to 0$ and $J(v)$ depends only on $v$. When a spectral interpretation is desired, we additionally assume that the singular-value ESD of $R$ converges to a deterministic law with finite right edge $b_{+}<\infty$.

For example, if $R\in\mathbb{R}^{N\times N}$ has iid $N(0,1/N)$ entries and $v$ is fixed conditional on $S,\psi_{1},\psi_{2}$, then a Gaussian union bound gives Assumption [4.2](https://arxiv.org/html/2606.02608#S4.Thmthm2) with $J(v)=\|v\|_{2}$, $d_{1}(N)=2\sqrt{2\log(2N)/N}$, and $d_{2}(N)=1/N$. Gaussian/RMT specializations and proofs are in Online Resource 1.
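The Gaussian example can be checked empirically. The sketch below is a small Monte Carlo experiment, not part of the paper: it draws $R$ with iid $N(0,1/N)$ entries, fixes an arbitrary vector $v$, and estimates how often $\|Rv\|_\infty$ exceeds $d_1(N)\|v\|_2$. The function names, the choice $N=64$, the trial count, and the seed are all illustrative assumptions.

```python
import math
import random


def d1(n: int) -> float:
    """Scale d_1(N) = 2 * sqrt(2 * log(2N) / N) from the Gaussian example."""
    return 2.0 * math.sqrt(2.0 * math.log(2 * n) / n)


def exceed_rate(n: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which ||R v||_inf > d_1(N) * ||v||_2."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(n)]  # arbitrary fixed vector
    v_norm = math.sqrt(sum(x * x for x in v))
    sigma = 1.0 / math.sqrt(n)  # standard deviation of N(0, 1/N) entries
    failures = 0
    for _ in range(trials):
        # Each row of R is drawn fresh; (R v)_i = sum_j R_ij v_j.
        sup_norm = max(
            abs(sum(rng.gauss(0.0, sigma) * x for x in v)) for _ in range(n)
        )
        failures += sup_norm > d1(n) * v_norm
    return failures / trials


n = 64
rate = exceed_rate(n, trials=200)
print(f"empirical exceedance rate {rate:.4f}, bound d2(N) = {1.0 / n:.4f}")
```

The union bound is loose at this threshold, so the empirical exceedance rate is expected to sit far below $d_2(N)=1/N$.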

###### Assumption 4.3 (Structured component for spectral conclusions).

For any spectral conclusion, assume that there exists a sequence $k_{N}=o(\min\{N,M\})$ such that

$$\sum_{i>k_{N}}s_{i}(S)^{2}\to 0.$$

The convergence is in probability when $S$ is random.

For a DNN satisfying Assumption [4.1](https://arxiv.org/html/2606.02608#S4.Thmthm1), define

$$h_{\phi}(s):=J(\psi_{1}(s))\,L_{\psi_{2}}(s),\qquad a_{\phi}(N,s):=h_{\phi}(s)d_{1}(N).\tag{6}$$

### 4.2 Generalized perturbation and loss-reduction results

We consider the regularized loss

$$L(\alpha(t))=-\frac{1}{|T|}\sum_{s\in T}\log\big(\phi_{C(s)}(s,\alpha(t))\big)+\lambda_{\mathrm{reg}}\sum_{i=1}^{L}\|W_{i}(t)\|_{F}^{2}.\tag{7}$$

###### Lemma 4.4.

Let

$$X=\psi_{2}\circ(R+S)\circ\psi_{1}(s),$$

and let $\alpha_{W},\alpha_{S}$ denote the corresponding network parameters when the relevant weight matrix is $W=R+S$ and $S$, respectively. Under Assumptions [4.1](https://arxiv.org/html/2606.02608#S4.Thmthm1)–[4.2](https://arxiv.org/html/2606.02608#S4.Thmthm2),

$$\mathbb{P}\Big(\|X(s,\alpha_{S})-X(s,\alpha_{W})\|_{\infty}\leq d_{1}(N)h_{\phi}(s)\Big)\geq 1-d_{2}(N).$$

###### Theorem 4.5.

Assume Assumptions [4.1](https://arxiv.org/html/2606.02608#S4.Thmthm1)–[4.2](https://arxiv.org/html/2606.02608#S4.Thmthm2) and that

$$\langle S,R\rangle_{F}\to 0\qquad\text{in probability as }N\to\infty.$$

Assume further that for all $s\in T$,

$$a_{\phi}(N,s)\to 0\qquad\text{as }N\to\infty.$$

Then removing $R$ yields

$$L(\alpha_{S})=L(\alpha_{W})-\lambda_{\mathrm{reg}}\|R\|_{F}^{2}+o_{\mathbb{P}}(1)\qquad(N\to\infty).$$

Secondary loss and finite\-rank deformation corollaries, with proofs, are stated in Online Resource 1\.

###### Theorem 4.6 (Multi-layer telescoping loss reduction).

Let $\mathcal{P}$ be a finite set of layers selected for pruning, with $|\mathcal{P}|$ fixed independently of $N$. For each $\ell\in\mathcal{P}$, write

$$W_{\ell}=S_{\ell}+R_{\ell},$$

and assume that, after the previous layers in $\mathcal{P}$ have already been replaced by their structured parts, the single-layer hypotheses of Theorem [4.5](https://arxiv.org/html/2606.02608#S4.Thmthm5) continue to hold for layer $\ell$ with perturbation scale $a_{\phi,\ell}(N,s)$. If $\alpha^{(0)}$ denotes the original network parameters and $\alpha^{(|\mathcal{P}|)}$ the parameters after replacing every $W_{\ell}$ by $S_{\ell}$, then

$$L(\alpha^{(|\mathcal{P}|)})=L(\alpha^{(0)})-\lambda_{\mathrm{reg}}\sum_{\ell\in\mathcal{P}}\|R_{\ell}\|_{F}^{2}+o_{\mathbb{P}}(1).$$

More generally, the set of pruned layers may depend on $N$. Let $\mathcal{P}_{N}$ be a sequence of layer sets, ordered in the pruning order. Assume that the corresponding one-layer logit events $E_{\ell,N}$ satisfy

$$\sum_{\ell\in\mathcal{P}_{N}}\mathbb{P}(E_{\ell,N}^{c})\to 0,$$

that their perturbation scales are summable,

$$\sum_{\ell\in\mathcal{P}_{N}}\frac{1}{|T|}\sum_{s\in T}a_{\phi,\ell}(N,s)\to 0,$$

and that the accumulated cross term is negligible,

$$\sum_{\ell\in\mathcal{P}_{N}}\langle S_{\ell},R_{\ell}\rangle_{F}=o_{\mathbb{P}}(1).$$

Then the same telescoping conclusion holds with $\mathcal{P}$ replaced by $\mathcal{P}_{N}$.

###### Theorem 4.7.

Assume Assumptions [4.1](https://arxiv.org/html/2606.02608#S4.Thmthm1)–[4.2](https://arxiv.org/html/2606.02608#S4.Thmthm2). Assume further that for all $s\in T$,

$$a_{\phi}(N,s)\to 0\qquad\text{as }N\to\infty.$$

Suppose:

1. for each $N$, $\alpha(t,N)\to\alpha^{*}(N)$ in probability as $t\to\infty$;
2. the relevant limit weight matrix admits an admissible decomposition $W^{*}(N)=S^{*}(N)+R^{*}(N)$ satisfying $\langle S^{*}(N),R^{*}(N)\rangle_{F}\to 0$ in probability as $N\to\infty$.

For $\vartheta\in[0,1]$, let $\alpha_{\vartheta}^{*}(N)$ denote the parameter vector obtained by replacing $W^{*}(N)$ with $S^{*}(N)+\vartheta R^{*}(N)$. Assume further that there exists a sequence of nonnegative random variables $\epsilon_{N}=o_{\mathbb{P}}(1)$ such that the one-sided directional stationarity event

$$\liminf_{h\downarrow 0}\frac{L(\alpha_{1-h}^{*}(N))-L(\alpha^{*}(N))}{h}\geq-\epsilon_{N}$$

has probability tending to one. Define

$$\bar{a}_{\phi,N}:=\frac{2}{|T|}\sum_{s\in T}a_{\phi}(N,s).$$

Then

$$\|R^{*}(N)\|_{F}^{2}\leq\frac{\bar{a}_{\phi,N}+\epsilon_{N}}{2\lambda_{\mathrm{reg}}}+o_{\mathbb{P}}(1).$$

In particular, if $\bar{a}_{\phi,N}+\epsilon_{N}\to 0$ in probability, then for every $\epsilon>0$,

$$\lim_{N\to\infty}\mathbb{P}\big(\|R^{*}(N)\|_{F}\leq\epsilon\big)=1.$$

If, in addition, there exists an admissible path decomposition

$$W(t,N)=S(t,N)+R(t,N)$$

such that for each fixed $N$,

$$\|S(t,N)-S^{*}(N)\|_{F}+\|R(t,N)-R^{*}(N)\|_{F}\to 0\qquad\text{in probability as }t\to\infty,$$

and if for every fixed $N$ and every $\epsilon>0$,

$$\mathbb{P}\big(\|R^{*}(N)\|_{F}=\epsilon\big)=0,$$

then

$$\lim_{N\to\infty}\lim_{t\to\infty}\mathbb{P}\big(\|R(t,N)\|_{F}\leq\epsilon\big)=1.$$

### 4.3 Main Gaussian/RMT corollaries

The following corollaries record the Gaussian/RMT specialization used in the MP interpretation\. They are stated here because they are main mathematical consequences of the framework; their proofs and the detailed Gaussian assumptions are in Online Resource 1\.

###### Theorem 4.8 (Gaussian stationarity collapse).

Consider the square iid-Gaussian specialization of the additive framework for a trained limit layer

$$W_{2}^{*}(N)=S_{2}^{*}(N)+R_{2}^{*}(N),\qquad R_{2}^{*}(N)_{ij}\sim N\!\left(0,\frac{g_{*}(N)}{N}\right),$$

with the same local-path, cross-term, and one-sided stationarity hypotheses as Theorem [4.7](https://arxiv.org/html/2606.02608#S4.Thmthm7). If the propagated perturbation scale and stationarity error vanish, then

$$\|R_{2}^{*}(N)\|_{F}\to 0\qquad\text{in probability.}$$

If an admissible training-path decomposition converges locally to this limit decomposition, then for every $\epsilon>0$,

$$\lim_{N\to\infty}\lim_{t\to\infty}\mathbb{P}\big(\|R_{2}(t,N)\|_{F}\leq\epsilon\big)=1.$$

*Proof.* See Online Resource 1, proof of the Gaussian stationarity-collapse theorem.

###### Corollary 4.9 (Variance-scale collapse at the limit point).

Under the hypotheses of Theorem [4.8](https://arxiv.org/html/2606.02608#S4.Thmthm8), the Gaussian variance scale satisfies

$$g_{*}(N)N\to 0.$$

Consequently, in the square case the MP singular-value edge of the admissible limit perturbation satisfies

$$\sigma_{+,*}^{\mathrm{bulk}}(N)=2\sqrt{g_{*}(N)}\to 0.$$

For a rectangular $N\times M_{N}$ perturbation with entries $N(0,g_{*}(N)/N)$, the corresponding Frobenius collapse condition is $g_{*}(N)M_{N}\to 0$, equivalent to $g_{*}(N)N\to 0$ when $M_{N}/N\to c\in(0,\infty)$.
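The link between Frobenius collapse and the variance scale can be seen from a one-line expectation. This is a heuristic sketch in expectation only; the corollary's in-probability statement is proved in Online Resource 1. For an $N\times M_{N}$ matrix $R$ with iid $N(0,g_{*}(N)/N)$ entries,

$$\mathbb{E}\,\|R\|_{F}^{2}=\sum_{i=1}^{N}\sum_{j=1}^{M_{N}}\mathbb{E}\big[R_{ij}^{2}\big]=N\,M_{N}\cdot\frac{g_{*}(N)}{N}=g_{*}(N)\,M_{N}.$$

In the square case $M_{N}=N$ this equals $g_{*}(N)N$, so a vanishing Frobenius norm forces $g_{*}(N)N\to 0$, and hence the bulk edge $2\sqrt{g_{*}(N)}\to 0$.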

###### Corollary 4.10 (Bulk collapse and spike stabilization at the limit point).

Under the hypotheses of Theorem [4.8](https://arxiv.org/html/2606.02608#S4.Thmthm8), assume the structured component is spectrally admissible, with singular values $\sigma_{i}^{*}(N)$ and approximate rank scale $k_{N}=o(N)$. If $s_{i}(W_{2}^{*}(N))$ are the singular values of the full layer, then $\|R_{2}^{*}(N)\|_{\mathrm{op}}\to 0$ in probability, each fixed spike satisfies

$$|s_{i}(W_{2}^{*}(N))-\sigma_{i}^{*}(N)|\to 0\qquad\text{in probability},$$

the tail energy $\sum_{i>k_{N}}s_{i}(W_{2}^{*}(N))^{2}\to 0$ in probability, and the singular-value empirical distribution has no mass above any fixed $\epsilon>0$ asymptotically. If $\sigma_{i}^{*}(N)\to\bar{\sigma}_{i}$, then $s_{i}(W_{2}^{*}(N))\to\bar{\sigma}_{i}$ for each fixed $i$.

###### Corollary 4.11 (Singular-subspace stabilization under an asymptotic gap condition).

Under the hypotheses of Theorem [4.8](https://arxiv.org/html/2606.02608#S4.Thmthm8), fix $k\geq 1$ and let $\delta_{k}(N)=\sigma_{k}^{*}(N)-\sigma_{k+1}^{*}(N)$. If $\mathbb{P}(\delta_{k}(N)>0)\to 1$ and

$$\frac{\|R_{2}^{*}(N)\|_{\mathrm{op}}}{\delta_{k}(N)}\to 0\qquad\text{in probability},$$

then the top-$k$ left and right singular-subspace projectors of $W_{2}^{*}(N)$ converge in operator norm to those of $S_{2}^{*}(N)$.

###### Corollary 4.12 (Eventual Gaussian supercriticality of persistent spikes).

Under the hypotheses of Theorem [4.8](https://arxiv.org/html/2606.02608#S4.Thmthm8), assume a square finite-rank Gaussian deformation regime and fixed persistent spikes

$$\sigma_{i}^{*}(N)\to\bar{\sigma}_{i}>0,\qquad 1\leq i\leq k.$$

Then $\mathbb{P}(\sigma_{i}^{*}(N)>\sqrt{g_{*}(N)})\to 1$, so these spikes are eventually supercritical above the Gaussian bulk edge $2\sqrt{g_{*}(N)}$. With the usual spike-separation condition, the corresponding spectral projectors of $W_{2}^{*}(N)$ converge to those of $S_{2}^{*}(N)$, and the inverse-BBP estimator

$$\widehat{\sigma}_{i}^{\mathrm{BBP}}(N):=\frac{1}{2}\left(s_{i}(W_{2}^{*}(N))+\sqrt{s_{i}(W_{2}^{*}(N))^{2}-4g_{*}(N)}\right)$$

is consistent in probability.
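The estimator above is the algebraic inverse of the map $\sigma\mapsto\sigma+g/\sigma$. The sketch below illustrates that round trip; it assumes the forward map $s=\sigma+g/\sigma$ for a supercritical spike $\sigma>\sqrt{g}$, which is the relation the estimator inverts. The function names and the numerical values are hypothetical.

```python
import math


def bbp_forward(sigma: float, g: float) -> float:
    """Assumed limiting observed singular value s = sigma + g / sigma, for sigma > sqrt(g)."""
    if sigma <= math.sqrt(g):
        raise ValueError("spike is not supercritical")
    return sigma + g / sigma


def inverse_bbp(s: float, g: float) -> float:
    """Estimator (s + sqrt(s^2 - 4 g)) / 2; requires s at or above the bulk edge 2 sqrt(g)."""
    discriminant = s * s - 4.0 * g
    if discriminant < 0.0:
        raise ValueError("observed value lies inside the bulk")
    return 0.5 * (s + math.sqrt(discriminant))


g = 0.04        # hypothetical variance scale, bulk edge 2 * sqrt(g) = 0.4
sigma = 1.5     # hypothetical structured spike
s = bbp_forward(sigma, g)
print(f"observed {s:.6f}, recovered {inverse_bbp(s, g):.6f}")
```

The round trip is exact because $s^{2}-4g=(\sigma-g/\sigma)^{2}$ and $\sigma-g/\sigma>0$ whenever $\sigma>\sqrt{g}$. As $g\to 0$ the correction vanishes and the estimator reduces to the observed singular value.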

## 5 Main Theoretical Results: Deterministic Pruning Certificates

This section states deterministic data-path certificates based on $B_{T}(R)$. The mask-specific results concern fixed kept/removed supports and do not assume an additive Gaussian decomposition; random models only supply sufficient bounds for that budget. Proofs are collected in Online Resource 1.

### 5.1 Deterministic data-path certificates for mask pruning

Fix one target layer $A\in\mathbb{R}^{n\times m}$ inside a network and write the logits as

$$z_{A}(s):=\psi_{2}(A\psi_{1}(s))\in\mathbb{R}^{K},\qquad s\in T,$$

where $\psi_{1}$ and $\psi_{2}$ denote the pre-layer and post-layer computations with all other parameters frozen. For a decomposition

$$A=S+R,\qquad A_{\theta}:=S+\theta R,\qquad 0\leq\theta\leq 1,$$

let $L_{s}$ be any finite deterministic upper bound on the local $\ell_{\infty}$-Lipschitz constant of $\psi_{2}$ on the path

$$\{A_{\theta}\psi_{1}(s):0\leq\theta\leq 1\}.$$

Define the data-path budget of the removed component by

$$B_{T}(R):=\frac{2}{|T|}\sum_{s\in T}L_{s}\|R\psi_{1}(s)\|_{\infty}.\tag{8}$$

If $L_{s}$ is itself estimated from the realized network, the statements below are deterministic conditional on that realized upper bound.

###### Lemma 5.1 (Deterministic cross-entropy path bound).

For every decomposition $A=S+R$ and every $\theta\in[0,1]$,

$$\left|\mathcal{L}_{\mathrm{CE}}(A_{\theta})-\mathcal{L}_{\mathrm{CE}}(A_{1})\right|\leq(1-\theta)B_{T}(R),$$

where

$$\mathcal{L}_{\mathrm{CE}}(A):=\frac{1}{|T|}\sum_{s\in T}\left[-z_{A}(s)_{C(s)}+\log\!\left(\sum_{j=1}^{K}e^{z_{A}(s)_{j}}\right)\right].$$
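A toy numerical check of the path bound may help fix the definitions. The sketch below is an illustration under simplifying assumptions, not the paper's pipeline: $\psi_{1}$ is the identity on random feature vectors, $\psi_{2}$ is a fixed linear map $u\mapsto Cu$ (so the max absolute row sum of $C$ is a valid $\ell_{\infty}$-Lipschitz bound $L_{s}$), and $R$ is a magnitude mask with a hypothetical threshold of 0.5. All names and sizes are made up for the example.

```python
import math
import random


def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]


def cross_entropy(logits, label):
    """Numerically stable -z_label + log(sum_j exp(z_j))."""
    top = max(logits)
    return -logits[label] + top + math.log(sum(math.exp(z - top) for z in logits))


def mean_ce(A, C, data):
    """Average cross-entropy of the toy network x -> C (A x)."""
    return sum(cross_entropy(matvec(C, matvec(A, x)), y) for x, y in data) / len(data)


def budget(R, C, data):
    """B_T(R) = (2/|T|) * sum_s L_s * ||R x_s||_inf with L_s = max abs row sum of C."""
    lipschitz = max(sum(abs(c) for c in row) for row in C)
    return 2.0 / len(data) * sum(
        lipschitz * max(abs(u) for u in matvec(R, x)) for x, _ in data
    )


rng = random.Random(0)
n, m, K = 6, 5, 3
A = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
C = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(K)]
data = [([rng.gauss(0.0, 1.0) for _ in range(m)], rng.randrange(K)) for _ in range(20)]

# Magnitude mask: R holds the small entries, S the rest, so A = S + R exactly.
R = [[a if abs(a) < 0.5 else 0.0 for a in row] for row in A]
S = [[a - r for a, r in zip(row_a, row_r)] for row_a, row_r in zip(A, R)]

B = budget(R, C, data)
gaps = {}
for theta in (0.0, 0.25, 0.5, 0.75, 1.0):
    A_theta = [[s + theta * r for s, r in zip(row_s, row_r)] for row_s, row_r in zip(S, R)]
    gaps[theta] = abs(mean_ce(A_theta, C, data) - mean_ce(A, C, data))
    print(f"theta={theta:.2f}  gap={gaps[theta]:.4f}  bound={(1 - theta) * B:.4f}")
```

The factor 2 in $B_{T}(R)$ matches the fact that cross-entropy is 2-Lipschitz in the $\ell_{\infty}$ norm of the logits, since its gradient is a probability vector minus a one-hot vector. In practice the bound is often far from tight, which is why it serves as a certificate rather than an estimate.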

###### Corollary 5.2 (Deterministic margin stability).

Let

$$\eta_{s}(\theta):=(1-\theta)L_{s}\|R\psi_{1}(s)\|_{\infty}.$$

If

$$p_{A}(s)\in\arg\max_{j}z_{A}(s)_{j},\qquad\Delta_{A}^{\mathrm{pred}}(s):=z_{A}(s)_{p_{A}(s)}-\max_{j\neq p_{A}(s)}z_{A}(s)_{j},$$

then the predicted label at $s$ can change between $A_{1}$ and $A_{\theta}$ only if

$$\Delta_{A_{1}}^{\mathrm{pred}}(s)\leq 2\eta_{s}(\theta).$$

For the true-label margin

$$\Delta_{A}^{\mathrm{true}}(s):=z_{A}(s)_{C(s)}-\max_{j\neq C(s)}z_{A}(s)_{j},$$

one has

$$\big|\operatorname{acc}_{A_{\theta}}(T)-\operatorname{acc}_{A_{1}}(T)\big|\leq\frac{1}{|T|}\#\left\{s\in T:\left|\Delta_{A_{1}}^{\mathrm{true}}(s)\right|\leq 2\eta_{s}(\theta)\right\}.$$

Here

$$\operatorname{acc}_{A}(T):=\frac{1}{|T|}\#\{s\in T:\Delta_{A}^{\mathrm{true}}(s)>0\}.$$
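The margin condition can be illustrated directly on logits. The sketch below is a toy experiment under assumed values: it perturbs random logit vectors by at most $\eta$ per coordinate (standing in for the bound $\|z_{A_{\theta}}(s)-z_{A_{1}}(s)\|_{\infty}\leq\eta_{s}(\theta)$) and counts prediction flips among samples whose margin exceeds $2\eta$. The helper `pred_margin`, the value of `eta`, and the logit distribution are hypothetical.

```python
import random


def pred_margin(z):
    """Return (argmax index, top logit minus runner-up logit)."""
    order = sorted(range(len(z)), key=lambda j: z[j], reverse=True)
    return order[0], z[order[0]] - z[order[1]]


rng = random.Random(1)
eta = 0.3  # hypothetical per-sample logit perturbation bound
flips_with_large_margin = 0
flips_total = 0
for _ in range(1000):
    z = [rng.gauss(0.0, 1.0) for _ in range(5)]
    z_perturbed = [value + rng.uniform(-eta, eta) for value in z]
    label, margin = pred_margin(z)
    new_label, _ = pred_margin(z_perturbed)
    if new_label != label:
        flips_total += 1
        if margin > 2 * eta:
            flips_with_large_margin += 1

print(f"flips overall: {flips_total}, flips with margin > 2*eta: {flips_with_large_margin}")
```

The second count is always zero: the top logit can drop by at most $\eta$ and any competitor can rise by at most $\eta$, so a gap larger than $2\eta$ cannot close.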

###### Theorem 5.3 (Deterministic additive Frobenius certificate).

For any decomposition $A=S+R$, not necessarily a mask decomposition, define the one-layer Frobenius-regularized objective

$$\mathcal{J}_{2}(A):=\mathcal{L}_{\mathrm{CE}}(A)+\lambda_{2}\|A\|_{F}^{2}+\mathcal{C},\qquad\lambda_{2}\geq 0,$$

where $\mathcal{C}$ contains all terms independent of $A$. Then for every $\theta\in[0,1]$,

$$\mathcal{J}_{2}(A_{\theta})-\mathcal{J}_{2}(A_{1})\leq(1-\theta)B_{T}(R)-\lambda_{2}(1-\theta^{2})\|R\|_{F}^{2}-2\lambda_{2}(1-\theta)\langle S,R\rangle_{F}.$$

Consequently, if $|\langle S,R\rangle_{F}|\leq\eta$, then

$$\mathcal{J}_{2}(A_{\theta})-\mathcal{J}_{2}(A_{1})\leq(1-\theta)B_{T}(R)-\lambda_{2}(1-\theta^{2})\|R\|_{F}^{2}+2\lambda_{2}(1-\theta)\eta.$$

In particular, full removal is certified to decrease the Frobenius-regularized objective whenever

$$B_{T}(R)+2\lambda_{2}|\langle S,R\rangle_{F}|<\lambda_{2}\|R\|_{F}^{2}.$$
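The regularization terms in this bound come from expanding the Frobenius norm along the path; the following short derivation is a sketch of that step (the full proof is in Online Resource 1). Since $A_{\theta}=S+\theta R$ and $A_{1}=S+R$,

$$\|S+\theta R\|_{F}^{2}-\|S+R\|_{F}^{2}=(\theta^{2}-1)\|R\|_{F}^{2}+2(\theta-1)\langle S,R\rangle_{F}=-(1-\theta^{2})\|R\|_{F}^{2}-2(1-\theta)\langle S,R\rangle_{F}.$$

Multiplying by $\lambda_{2}$ and adding the cross-entropy bound $\mathcal{L}_{\mathrm{CE}}(A_{\theta})-\mathcal{L}_{\mathrm{CE}}(A_{1})\leq(1-\theta)B_{T}(R)$ from Lemma 5.1 gives the stated inequality. Setting $\theta=0$ and bounding $-\langle S,R\rangle_{F}\leq|\langle S,R\rangle_{F}|$ yields the full-removal condition.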

###### Theorem 5.4 (Deterministic elastic-net mask pruning certificate).

Assume $A=S+R$ is a mask decomposition, meaning

$$S\odot R=0,$$

so the kept and removed entries have disjoint supports. For constants $\lambda_{2},\lambda_{1}\geq 0$, define the one-layer elastic-net objective by

$$\mathcal{J}(A):=\mathcal{L}_{\mathrm{CE}}(A)+\lambda_{2}\|A\|_{F}^{2}+\lambda_{1}\|A\|_{1}+\mathcal{C},$$

where $\mathcal{C}$ contains all terms independent of $A$. Then for every $\theta\in[0,1]$,

$$\mathcal{J}(A_{\theta})-\mathcal{J}(A_{1})\leq(1-\theta)B_{T}(R)-\lambda_{2}(1-\theta^{2})\|R\|_{F}^{2}-\lambda_{1}(1-\theta)\|R\|_{1}.$$

For full pruning,

$$\mathcal{J}(S)\leq\mathcal{J}(A)-\lambda_{2}\|R\|_{F}^{2}-\lambda_{1}\|R\|_{1}+B_{T}(R).$$

Consequently, the mask is certified to decrease the elastic-net objective whenever

$$B_{T}(R)<\lambda_{2}\|R\|_{F}^{2}+\lambda_{1}\|R\|_{1}.$$
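Once $B_{T}(R)$ has been computed, checking the certificate is a single comparison. The sketch below shows one way to organize that check; the function name `elastic_net_certificate`, the returned slack, and all numerical values are illustrative assumptions rather than the paper's implementation.

```python
def elastic_net_certificate(budget, R, lam2, lam1):
    """Check B_T(R) < lam2 * ||R||_F^2 + lam1 * ||R||_1.

    Returns (certified, slack), where slack is the right side minus the budget.
    R is the removed component as a list of rows; budget is a precomputed B_T(R).
    """
    frobenius_sq = sum(r * r for row in R for r in row)
    l1_norm = sum(abs(r) for row in R for r in row)
    slack = lam2 * frobenius_sq + lam1 * l1_norm - budget
    return slack > 0.0, slack


# Hypothetical removed entries and regularization weights.
R = [[0.0, 0.1], [-0.2, 0.0]]
certified, slack = elastic_net_certificate(budget=0.002, R=R, lam2=0.01, lam1=0.01)
print(certified, round(slack, 6))  # True 0.0015

# A larger data-path budget breaks the certificate for the same mask.
not_certified, _ = elastic_net_certificate(budget=0.01, R=R, lam2=0.01, lam1=0.01)
print(not_certified)  # False
```

A positive slack is a guaranteed lower bound on the decrease $\mathcal{J}(A)-\mathcal{J}(S)$ by the full-pruning inequality above. A failed certificate says nothing either way: the bound is sufficient, not necessary.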

###### Theorem 5.5 (Deterministic prune–restore certificate).

Let the original layer decompose as

$$A=S+Q+P,\qquad S\odot Q=S\odot P=Q\odot P=0.$$

Interpret $P$ as the entries finally removed and $Q$ as entries restored, or kept, from an over-pruned reservoir. Define

$$A_{\theta}:=S+Q+\theta P,\qquad 0\leq\theta\leq 1.$$

For the elastic-net objective $\mathcal{J}$ of Theorem [5.4](https://arxiv.org/html/2606.02608#S5.Thmthm4), one has

$$\mathcal{J}(A_{\theta})-\mathcal{J}(A)\leq(1-\theta)B_{T}(P)-\lambda_{2}(1-\theta^{2})\|P\|_{F}^{2}-\lambda_{1}(1-\theta)\|P\|_{1}.$$

For the final prune–restore layer $S+Q$,

$$\mathcal{J}(S+Q)\leq\mathcal{J}(A)-\lambda_{2}\|P\|_{F}^{2}-\lambda_{1}\|P\|_{1}+B_{T}(P).$$

Consequently, the final SER-style mask is certified to decrease the elastic-net objective whenever

$$B_{T}(P)<\lambda_{2}\|P\|_{F}^{2}+\lambda_{1}\|P\|_{1}.$$

###### Corollary 5.6 (When restoration improves the certificate upper bound).

Under the hypotheses of Theorem [5.5](https://arxiv.org/html/2606.02608#S5.Thmthm5), set $R:=Q+P$. Comparing the all-pruned certificate for $A=S+R$ with the restore-$Q$ certificate for $A=(S+Q)+P$, the restored upper bound is strictly smaller iff

$$B_{T}(R)-B_{T}(P)>\lambda_{2}\|Q\|_{F}^{2}+\lambda_{1}\|Q\|_{1}.$$
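The equivalence follows from comparing the two upper bounds term by term; the derivation below is a sketch of that comparison. Because $Q\odot P=0$, the norms split as $\|R\|_{F}^{2}=\|Q\|_{F}^{2}+\|P\|_{F}^{2}$ and $\|R\|_{1}=\|Q\|_{1}+\|P\|_{1}$. The all-pruned and restored bounds are

$$U_{\mathrm{all}}=\mathcal{J}(A)-\lambda_{2}\big(\|Q\|_{F}^{2}+\|P\|_{F}^{2}\big)-\lambda_{1}\big(\|Q\|_{1}+\|P\|_{1}\big)+B_{T}(R),\qquad U_{\mathrm{restore}}=\mathcal{J}(A)-\lambda_{2}\|P\|_{F}^{2}-\lambda_{1}\|P\|_{1}+B_{T}(P),$$

so that

$$U_{\mathrm{all}}-U_{\mathrm{restore}}=B_{T}(R)-B_{T}(P)-\lambda_{2}\|Q\|_{F}^{2}-\lambda_{1}\|Q\|_{1}.$$

The restored bound is strictly smaller exactly when this difference is positive. In words, restoring $Q$ pays off in the certificate when the data-path budget it saves exceeds the regularization credit that pruning $Q$ would have earned. The symbols $U_{\mathrm{all}}$ and $U_{\mathrm{restore}}$ are introduced here only for this comparison.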

Coordinatewise weighted variants are stated in Online Resource 1\.

###### Theorem 5.7 (One-sided stationarity controls removable masks).

Under the hypotheses of Theorem [5.4](https://arxiv.org/html/2606.02608#S5.Thmthm4), assume the one-sided directional stationarity condition holds for some $\varepsilon\geq 0$:

$$\liminf_{h\downarrow 0}\frac{\mathcal{J}(A_{1-h})-\mathcal{J}(A_{1})}{h}\geq-\varepsilon.$$

Then

$$2\lambda_{2}\|R\|_{F}^{2}+\lambda_{1}\|R\|_{1}\leq B_{T}(R)+\varepsilon.$$

More generally, if

$$\mathcal{J}(A_{\theta})\geq\mathcal{J}(A_{1})-\varepsilon(1-\theta)\qquad\forall\theta\in[0,1],$$

then for every $\theta<1$,

$$\lambda_{2}(1+\theta)\|R\|_{F}^{2}+\lambda_{1}\|R\|_{1}\leq B_{T}(R)+\varepsilon.$$
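Both inequalities can be read off from the elastic-net bound of Theorem 5.4; the following is a sketch of the argument, with the full proof in Online Resource 1. Setting $\theta=1-h$ in that bound and using $1-(1-h)^{2}=2h-h^{2}$ gives

$$\mathcal{J}(A_{1-h})-\mathcal{J}(A_{1})\leq h\,B_{T}(R)-\lambda_{2}(2h-h^{2})\|R\|_{F}^{2}-\lambda_{1}h\|R\|_{1}.$$

Dividing by $h>0$ and letting $h\downarrow 0$, the stationarity condition yields

$$-\varepsilon\leq B_{T}(R)-2\lambda_{2}\|R\|_{F}^{2}-\lambda_{1}\|R\|_{1},$$

which is the first claim. For the second, combine the assumed lower bound with the same upper bound at a general $\theta<1$ and factor $1-\theta^{2}=(1-\theta)(1+\theta)$:

$$-\varepsilon(1-\theta)\leq(1-\theta)\Big[B_{T}(R)-\lambda_{2}(1+\theta)\|R\|_{F}^{2}-\lambda_{1}\|R\|_{1}\Big],$$

then divide by $1-\theta>0$.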

###### Theorem 5.8 (Deterministic multi-layer mask certificate).

Let $\mathcal{J}(\alpha)$ denote the full cross-entropy plus layerwise elastic-net objective. Consider an ordered finite sequence of mask-pruning operations on layers $j=1,\dots,M$. At step $j$, write the current layer after steps $1,\dots,j-1$ as

$$A_{j}=S_{j}+R_{j},\qquad S_{j}\odot R_{j}=0,$$

and let $B_{j}(R_{j})$ be its data-path budget in that intermediate network. If $\alpha^{(0)}$ and $\alpha^{(M)}$ denote the parameters before and after the $M$ mask removals, then

$$\mathcal{J}(\alpha^{(M)})\leq\mathcal{J}(\alpha^{(0)})-\sum_{j=1}^{M}\lambda_{2,j}\|R_{j}\|_{F}^{2}-\sum_{j=1}^{M}\lambda_{1,j}\|R_{j}\|_{1}+\sum_{j=1}^{M}B_{j}(R_{j}),$$

where $\lambda_{2,j}$ and $\lambda_{1,j}$ are the elastic-net weights applied to layer $j$.

The corresponding multi\-layer margin\-stability corollary is stated in Online Resource 1\.
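The multi-layer certificate is a sum of per-step terms, so it can be evaluated from a small ledger of per-layer quantities. The following sketch, with hypothetical names and invented numbers, computes the certified upper bound on $\mathcal{J}(\alpha^{(M)}) - \mathcal{J}(\alpha^{(0)})$; it assumes the norms and budgets have already been measured and that $\|R_j\|_1$ is the entrywise $\ell_1$ norm.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LayerStep:
    """Per-step quantities entering the certificate (values are hypothetical)."""
    frob_sq: float  # ||R_j||_F^2 of the removed component
    l1: float       # ||R_j||_1 (entrywise, an assumption)
    budget: float   # data-path budget B_j(R_j) in the intermediate network
    lam2: float     # L2 weight lambda_{2,j}
    lam1: float     # L1 weight lambda_{1,j}


def certified_objective_change(steps: Iterable[LayerStep]) -> float:
    """Upper bound on J(alpha^(M)) - J(alpha^(0)): sum of B_j minus penalties."""
    return sum(s.budget - s.lam2 * s.frob_sq - s.lam1 * s.l1 for s in steps)


steps = [
    LayerStep(frob_sq=0.50, l1=4.0, budget=0.010, lam2=0.02, lam1=0.001),
    LayerStep(frob_sq=0.20, l1=2.0, budget=0.008, lam2=0.02, lam1=0.001),
]
change = certified_objective_change(steps)
print(f"certified change in J: {change:+.4f}")
```

With these toy numbers the second layer's budget exceeds its own penalty reduction, yet the total is negative, which illustrates that the bound certifies a decrease whenever the summed budgets stay below the summed penalty reductions.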

###### Theorem 5.9 (Multi-layer stationarity controls jointly removable masks).

Use the notation of Theorem [5.8](https://arxiv.org/html/2606.02608#S5.Thmthm8). For $h \in [0,1]$, let $\alpha_h$ be obtained by applying the same ordered pruning operations with each current layer $A_j = S_j + R_j$ replaced by $S_j + (1-h) R_j$. Assume that each $B_j(R_j)$ remains valid along this fractional path, uniformly for $h \in [0,1]$, so the step-$j$ cross-entropy increase is at most $h B_j(R_j)$. If the one-sided directional stationarity condition

$$\liminf_{h \downarrow 0} \frac{\mathcal{J}(\alpha_h) - \mathcal{J}(\alpha^{(0)})}{h} \geq -\varepsilon$$

holds for some $\varepsilon \geq 0$, then

$$2 \sum_{j=1}^{M} \lambda_{2,j} \|R_j\|_F^2 + \sum_{j=1}^{M} \lambda_{1,j} \|R_j\|_1 \leq \sum_{j=1}^{M} B_j(R_j) + \varepsilon.$$

More generally, if

$$\mathcal{J}(\alpha_h) \geq \mathcal{J}(\alpha^{(0)}) - \varepsilon h \qquad \forall h \in [0,1],$$

then for every $h \in (0,1]$,

$$\sum_{j=1}^{M} \lambda_{2,j} (2 - h) \|R_j\|_F^2 + \sum_{j=1}^{M} \lambda_{1,j} \|R_j\|_1 \leq \sum_{j=1}^{M} B_j(R_j) + \varepsilon.$$
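One way to see where the factor $(2-h)$ comes from is the following sketch. It is not the authors' proof, and it assumes the layer-$j$ penalty has the form $\lambda_{2,j}\|A_j\|_F^2 + \lambda_{1,j}\|A_j\|_1$ with the entrywise $\ell_1$ norm. Because $S_j \odot R_j = 0$, the two parts have disjoint supports, so

$$
\begin{aligned}
\|S_j + (1-h) R_j\|_F^2 &= \|S_j\|_F^2 + (1-h)^2 \|R_j\|_F^2, \\
\|S_j + (1-h) R_j\|_1 &= \|S_j\|_1 + (1-h) \|R_j\|_1 .
\end{aligned}
$$

Since $1 - (1-h)^2 = h(2-h)$, the layer-$j$ penalty drops by $\lambda_{2,j}\, h (2-h) \|R_j\|_F^2 + \lambda_{1,j}\, h \|R_j\|_1$ along the fractional path. Combining this with the assumed cross-entropy increase of at most $h B_j(R_j)$ per step gives

$$
\mathcal{J}(\alpha_h) - \mathcal{J}(\alpha^{(0)}) \leq h \left[ \sum_{j=1}^{M} B_j(R_j) - \sum_{j=1}^{M} \lambda_{2,j} (2-h) \|R_j\|_F^2 - \sum_{j=1}^{M} \lambda_{1,j} \|R_j\|_1 \right].
$$

Dividing by $h > 0$ and inserting the lower bound $\mathcal{J}(\alpha_h) - \mathcal{J}(\alpha^{(0)}) \geq -\varepsilon h$ yields the inequality with $(2-h)$; letting $h \downarrow 0$ under the liminf condition yields the version with the factor $2$. At $h = 1$ the bracket reduces to the right-hand side of the certificate in Theorem 5.8.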

###### Corollary 5.10 (Gaussian sufficient condition for the data-path budget).

Let $R \in \mathbb{R}^{n \times m}$ have independent entries $R_{ij} \sim N(0, g/n)$, independent of the collection $v_s := \psi_1(s)$, $s \in T$. Assume $L_s$ are nonnegative upper bounds for the post-layer Lipschitz constants, deterministic or measurable with respect to variables independent of $R$. Then for every finite $T$ and every $\delta \in (0,1)$, with probability at least $1 - \delta$,

$$B_T(R) \leq \frac{2}{|T|} \sum_{s \in T} L_s \|v_s\|_2 \sqrt{\frac{2g}{n} \log \frac{2n|T|}{\delta}}.$$
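The right-hand side of the Gaussian bound is cheap to evaluate, and a small simulation can sanity-check it. The sketch below is illustrative only: the budget is taken to be $B_T(R) = \frac{2}{|T|}\sum_{s \in T} L_s \|R v_s\|_\infty$, which is an assumption inferred from the abstract's "propagated logit effect" and from the shape of the bound, since this section does not restate the definition. All function names, dimensions, and sample values are hypothetical.

```python
import math
import random


def gaussian_budget_bound(norms, lips, g, n, delta):
    """(2/|T|) * sum_s L_s ||v_s||_2 * sqrt((2g/n) * log(2 n |T| / delta))."""
    t = len(norms)
    scale = math.sqrt(2.0 * g / n * math.log(2.0 * n * t / delta))
    return 2.0 / t * sum(l * v for l, v in zip(lips, norms)) * scale


def assumed_budget(r, vs, lips):
    """ASSUMED budget: (2/|T|) * sum_s L_s * ||R v_s||_inf."""
    total = 0.0
    for v, l in zip(vs, lips):
        rv = [sum(rij * vj for rij, vj in zip(row, v)) for row in r]
        total += l * max(abs(x) for x in rv)
    return 2.0 / len(vs) * total


def violation_rate(n=32, m=16, t=4, g=1.0, delta=0.05, trials=200, seed=0):
    """Fraction of Gaussian draws of R whose assumed budget exceeds the bound."""
    rng = random.Random(seed)
    # Fixed, hypothetical activations v_s and Lipschitz bounds L_s.
    vs = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(t)]
    lips = [1.0 + s for s in range(t)]
    norms = [math.sqrt(sum(x * x for x in v)) for v in vs]
    bound = gaussian_budget_bound(norms, lips, g, n, delta)
    sd = math.sqrt(g / n)  # entries R_ij ~ N(0, g/n)
    bad = 0
    for _ in range(trials):
        r = [[rng.gauss(0.0, sd) for _ in range(m)] for _ in range(n)]
        bad += assumed_budget(r, vs, lips) > bound
    return bad / trials


rate = violation_rate()
print(f"empirical violation rate: {rate:.3f} (corollary allows up to 0.05)")
```

Under the assumed definition, each coordinate of $R v_s$ is a centered Gaussian with variance $g\|v_s\|_2^2/n$, so the logarithmic factor is what a union bound over the $n|T|$ coordinates would produce; the empirical rate is therefore expected to sit well below $\delta$.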

###### Corollary 5.11 (MP-edge form of the Gaussian data-path budget).

Under the hypotheses of Corollary [5.10](https://arxiv.org/html/2606.02608#S5.Thmthm10), define the nominal rectangular MP singular-value edge of $R \in \mathbb{R}^{n \times m}$ by

$$\sigma_{+,R} := \sqrt{g} \left( 1 + \sqrt{\frac{m}{n}} \right).$$

Then, with probability at least $1 - \delta$,

$$B_T(R) \leq \frac{2 \sigma_{+,R}}{1 + \sqrt{m/n}} \, \sqrt{\frac{2}{n} \log \frac{2n|T|}{\delta}} \, \frac{1}{|T|} \sum_{s \in T} L_s \|v_s\|_2.$$

Sub-Gaussian and simultaneous multi-layer budget corollaries are stated in Online Resource 1.
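Because $\sigma_{+,R} / (1 + \sqrt{m/n}) = \sqrt{g}$, the MP-edge form is the variance form of Corollary 5.10 rewritten in terms of the edge. The sketch below checks this numerically and shows how a fitted edge could be converted back to a variance parameter. The layer shape, activation norms, and Lipschitz bounds are hypothetical, and the function names are chosen for illustration.

```python
import math


def mp_edge(g, n, m):
    """Nominal MP singular-value edge: sqrt(g) * (1 + sqrt(m/n))."""
    return math.sqrt(g) * (1.0 + math.sqrt(m / n))


def variance_from_edge(sigma_plus, n, m):
    """Invert the edge formula: g = (sigma_plus / (1 + sqrt(m/n)))**2."""
    return (sigma_plus / (1.0 + math.sqrt(m / n))) ** 2


def budget_bound_from_edge(sigma_plus, n, m, delta, norms, lips):
    """MP-edge form of the Gaussian data-path budget bound."""
    t = len(norms)
    mean_term = sum(l * v for l, v in zip(lips, norms)) / t
    log_term = math.sqrt(2.0 / n * math.log(2.0 * n * t / delta))
    return 2.0 * sigma_plus / (1.0 + math.sqrt(m / n)) * log_term * mean_term


# Hypothetical layer shape and calibration statistics.
n, m, g, delta = 768, 3072, 0.5, 0.05
norms = [10.0, 12.0, 9.0]   # ||v_s||_2 for three calibration samples
lips = [1.0, 1.5, 0.8]      # Lipschitz upper bounds L_s

sigma_plus = mp_edge(g, n, m)
edge_form = budget_bound_from_edge(sigma_plus, n, m, delta, norms, lips)

# Variance form, using g recovered from the edge.
g_back = variance_from_edge(sigma_plus, n, m)
t = len(norms)
variance_form = (2.0 / t) * sum(l * v for l, v in zip(lips, norms)) * math.sqrt(
    2.0 * g_back / n * math.log(2.0 * n * t / delta)
)
print(f"sigma_+ = {sigma_plus:.4f}, edge form = {edge_form:.4f}, "
      f"variance form = {variance_form:.4f}")
```

The practical appeal of the edge form is that $\sigma_{+,R}$ can be read off a fitted spectrum, so the budget bound can be evaluated without estimating $g$ separately.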

### Acknowledgments

LB, HO, and YS acknowledge support from NASA via the AIST program (Kernel Flows: Emulating Complex Models for Massive Data Sets). This work started during LB’s sabbatical stay at Caltech hosted by H. Owhadi; LB and YS are grateful for the hospitality during their NASA-supported visit.

## Statements and Declarations

### Funding

The work of LB was partially supported by NSF grant DMS-2005262 and NSF grant IMPRESS-U 2401227. TB and HO acknowledge support from the Air Force Office of Scientific Research under MURI award number FA9550-20-1-0358 (Machine Learning and Physics-Based Modeling and Simulation) and FOA-AFRL-AFOSR-2023-0004 (Mathematics of Digital Twins), and from the Department of Energy under award number DE-SC0023163 (SEA-CROGS: Scalable, Efficient, and Accelerated Causal Reasoning Operators, Graphs and Spikes for Earth and Embedded Systems). HO and TB also acknowledge support from the DoD Vannevar Bush Faculty Fellowship Program under award number ONR-N000142512035. The work of YS was partially supported by the European Research Council under the European Union’s Horizon 2022 research and innovation program (grant agreement No. 101041711), the Simons Foundation, the Israel Science Foundation (grant number 2258/19), and the Israel Science Foundation (ISF Grant 4101/25).

### Competing interests

The authors declare that they have no competing interests relevant to the content of this article.

### Author contributions

All authors contributed to the mathematical development, interpretation of results, manuscript revision, and approval of the submitted version.

### Data, code, and supplementary information availability

Code, plotting scripts, pruning recipes, manuscript artifacts, and mirror copies of Online Resource 1 and Online Resource 2 are available at [https://github.com/yspennstate/RMT_based_pruning_in_deep_learning](https://github.com/yspennstate/RMT_based_pruning_in_deep_learning). The checkpoints, validation ledgers, sparse-backend timing evidence, and per-run files supporting the main ImageNet-1k tables are available at [https://drive.google.com/drive/folders/1mm990SHAHlYdISHxirvMRdVQEAjpIxDd](https://drive.google.com/drive/folders/1mm990SHAHlYdISHxirvMRdVQEAjpIxDd). ImageNet-1k images are not redistributed; ImageNet is identified by \[[4](https://arxiv.org/html/2606.02608#bib.bib4)\], and users must obtain the dataset under the official download terms \[[11](https://arxiv.org/html/2606.02608#bib.bib11)\]. Online Resource 2 identifies the source checkpoints, recipes, evidence files, timing audits, and complete numerical ledgers for the reported rows.

### Declaration of generative AI and AI-assisted technologies in the manuscript preparation process

During the preparation of this work the author(s) used generative AI tools to accelerate drafting and revision. All mathematical claims, proofs, numerical interpretations, and bibliographic information were subsequently reviewed and edited by the author(s), who take full responsibility for the content of the manuscript.

## References

- Blalock et al. [2020] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? In *Proceedings of Machine Learning and Systems*, volume 2, pages 129–146, 2020. URL [https://proceedings.mlsys.org/paper_files/paper/2020/hash/6c44dc73014d66ba49b28d483a8f8b0d-Abstract.html](https://proceedings.mlsys.org/paper_files/paper/2020/hash/6c44dc73014d66ba49b28d483a8f8b0d-Abstract.html).
- Bolya et al. [2023] Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In *International Conference on Learning Representations*, 2023. URL [https://openreview.net/forum?id=JroZRaRw7Eu](https://openreview.net/forum?id=JroZRaRw7Eu).
- Couillet and Debbah [2011] Romain Couillet and Mérouane Debbah. *Random Matrix Methods for Wireless Communications*. Cambridge University Press, 2011. doi:10.1017/CBO9780511994746. URL [https://doi.org/10.1017/CBO9780511994746](https://doi.org/10.1017/CBO9780511994746).
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In *2009 IEEE Conference on Computer Vision and Pattern Recognition*, pages 248–255, 2009. doi:10.1109/CVPR.2009.5206848.
- Dosovitskiy et al. [2021] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In *International Conference on Learning Representations*, 2021.
- Fang et al. [2024] Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, and Xinchao Wang. MaskLLM: Learnable semi-structured sparsity for large language models. In *Advances in Neural Information Processing Systems*, volume 37, pages 7736–7758. Curran Associates, Inc., 2024. doi:10.52202/079017-0248. URL [https://proceedings.neurips.cc/paper_files/paper/2024/file/0e9a05f5ce62284c91e4a33498899124-Paper-Conference.pdf](https://proceedings.neurips.cc/paper_files/paper/2024/file/0e9a05f5ce62284c91e4a33498899124-Paper-Conference.pdf).
- Gale et al. [2019] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks, 2019. URL [https://arxiv.org/abs/1902.09574](https://arxiv.org/abs/1902.09574).
- Gao et al. [2024] Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, and Mike Zheng Shou. Bootstrapping SparseFormers from vision foundation models. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 17710–17721, June 2024. doi:10.1109/CVPR52733.2024.01677. URL [https://openaccess.thecvf.com/content/CVPR2024/html/Gao_Bootstrapping_SparseFormers_from_Vision_Foundation_Models_CVPR_2024_paper.html](https://openaccess.thecvf.com/content/CVPR2024/html/Gao_Bootstrapping_SparseFormers_from_Vision_Foundation_Models_CVPR_2024_paper.html).
- Ge et al. [2021] Jungang Ge, Ying-Chang Liang, Zhidong Bai, and Guangming Pan. Large-dimensional random matrix theory and its applications in deep learning and wireless communications. *Random Matrices: Theory and Applications*, 10(4):2230001, 2021. doi:10.1142/S2010326322300017. URL [https://doi.org/10.1142/S2010326322300017](https://doi.org/10.1142/S2010326322300017).
- Hoefler et al. [2021] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. *Journal of Machine Learning Research*, 22(241):1–124, 2021. URL [https://www.jmlr.org/papers/v22/21-0366.html](https://www.jmlr.org/papers/v22/21-0366.html).
- ImageNet [2026] ImageNet. ImageNet download and terms. [https://www.image-net.org/download.php](https://www.image-net.org/download.php), 2026. Accessed 2026-05-13.
- Kuznedelev et al. [2023] Denis Kuznedelev, Eldar Kurtić, Elias Frantar, and Dan Alistarh. CAP: Correlation-aware pruning for highly-accurate sparse vision models. In *Advances in Neural Information Processing Systems*, volume 36, pages 28805–28831. Curran Associates, Inc., 2023. URL [https://proceedings.neurips.cc/paper_files/paper/2023/hash/5bd9fbb3a5a985f80c16ddd0ec1dfc43-Abstract-Conference.html](https://proceedings.neurips.cc/paper_files/paper/2023/hash/5bd9fbb3a5a985f80c16ddd0ec1dfc43-Abstract-Conference.html).
- Lucas and Mazumder [2025] Ryan Lucas and Rahul Mazumder. Preserving deep representations in one-shot pruning: A hessian-free second-order optimization framework. In *International Conference on Learning Representations*, 2025. doi:10.48550/arXiv.2411.18376. URL [https://openreview.net/forum?id=eNQp79A5Oz](https://openreview.net/forum?id=eNQp79A5Oz).
- Mahoney and Martin [2019] Michael W. Mahoney and Charles H. Martin. Traditional and heavy tailed self regularization in neural network models. In *Proceedings of the 36th International Conference on Machine Learning*, volume 97 of *Proceedings of Machine Learning Research*, pages 4284–4293. PMLR, 2019. URL [https://proceedings.mlr.press/v97/mahoney19a.html](https://proceedings.mlr.press/v97/mahoney19a.html).
- Marchenko and Pastur [1967] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. *Mathematics of the USSR-Sbornik*, 1(4):457–483, 1967. doi:10.1070/SM1967v001n04ABEH001994. URL [https://www.mathnet.ru/eng/sm4101](https://www.mathnet.ru/eng/sm4101).
- Martin and Mahoney [2020] Charles H. Martin and Michael W. Mahoney. Heavy-tailed universality predicts trends in test accuracies for very large pre-trained deep neural networks. In *Proceedings of the 2020 SIAM International Conference on Data Mining*, pages 505–513, 2020. doi:10.1137/1.9781611976236.57. URL [https://epubs.siam.org/doi/10.1137/1.9781611976236.57](https://epubs.siam.org/doi/10.1137/1.9781611976236.57).
- Martin and Mahoney [2021] Charles H. Martin and Michael W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. *Journal of Machine Learning Research*, 22(165):1–73, 2021. URL [http://jmlr.org/papers/v22/20-410.html](http://jmlr.org/papers/v22/20-410.html).
- Martin et al. [2021] Charles H. Martin, Tongsu (Serena) Peng, and Michael W. Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. *Nature Communications*, 12(1):4122, 2021. doi:10.1038/s41467-021-24025-8. URL [https://doi.org/10.1038/s41467-021-24025-8](https://doi.org/10.1038/s41467-021-24025-8).
- Meng and Yao [2023] Xuran Meng and Jianfeng Yao. Impact of classification difficulty on the weight matrices spectra in deep learning and application to early-stopping. *Journal of Machine Learning Research*, 24(28):1–40, 2023. URL [http://jmlr.org/papers/v24/21-1441.html](http://jmlr.org/papers/v24/21-1441.html).
- Mishra et al. [2021] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. *arXiv preprint arXiv:2104.08378*, 2021. doi:10.48550/arXiv.2104.08378.
- Nait Saada and Tanner [2023] Thiziri Nait Saada and Jared Tanner. On the initialisation of wide low-rank feedforward neural networks. *arXiv preprint arXiv:2301.13710*, 2023. doi:10.48550/arXiv.2301.13710. URL [https://arxiv.org/abs/2301.13710](https://arxiv.org/abs/2301.13710).
- Pastur [2020] Leonid Pastur. On random matrices arising in deep neural networks: Gaussian case. *Pure and Applied Functional Analysis*, 5(6):1395–1424, 2020. URL [https://arxiv.org/abs/2001.06188](https://arxiv.org/abs/2001.06188).
- Pastur and Slavin [2023] Leonid Pastur and Victor Slavin. On random matrices arising in deep neural networks: General I.I.D. case. *Random Matrices: Theory and Applications*, 12(1):2250046, 2023. doi:10.1142/S2010326322500460. URL [https://doi.org/10.1142/S2010326322500460](https://doi.org/10.1142/S2010326322500460).
- Peste et al. [2021] Alexandra Peste, Eugenia Iofinova, Adrian Vladu, and Dan Alistarh. AC/DC: Alternating compressed/decompressed training of deep neural networks. In *Advances in Neural Information Processing Systems*, volume 34, pages 8557–8570. Curran Associates, Inc., 2021. URL [https://proceedings.neurips.cc/paper_files/paper/2021/hash/48000647b315f6f00f913caa757a70b3-Abstract.html](https://proceedings.neurips.cc/paper_files/paper/2021/hash/48000647b315f6f00f913caa757a70b3-Abstract.html).
- Pool and Yu [2021] Jeff Pool and Chong Yu. Channel permutations for N:M sparsity. In *Advances in Neural Information Processing Systems*, volume 34, pages 13316–13327, 2021. URL [https://proceedings.neurips.cc/paper/2021/hash/6e8404c3b93a9527c8db241a1846599a-Abstract.html](https://proceedings.neurips.cc/paper/2021/hash/6e8404c3b93a9527c8db241a1846599a-Abstract.html).
- Serdobolskii [2000] V. Serdobolskii. *Multivariate Statistical Analysis: A High-Dimensional Approach*. Springer Dordrecht, 2000. doi:10.1007/978-94-015-9468-4. URL [https://link.springer.com/book/10.1007/978-94-015-9468-4](https://link.springer.com/book/10.1007/978-94-015-9468-4).
- Staats et al. [2023] Max Staats, Matthias Thamm, and Bernd Rosenow. Boundary between noise and information applied to filtering neural network weight matrices. *Physical Review E*, 108(2):L022302, 2023. doi:10.1103/PhysRevE.108.L022302.
- Thamm et al. [2022] Matthias Thamm, Max Staats, and Bernd Rosenow. Random matrix analysis of deep neural network weight matrices. *Physical Review E*, 106(5):054124, 2022. doi:10.1103/PhysRevE.106.054124. URL [https://doi.org/10.1103/PhysRevE.106.054124](https://doi.org/10.1103/PhysRevE.106.054124).
- Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In *Advances in Neural Information Processing Systems*, volume 30, pages 5998–6008, 2017. URL [https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html](https://papers.nips.cc/paper_files/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html).
- Vershynin [2018] Roman Vershynin. *High-Dimensional Probability: An Introduction with Applications in Data Science*. Cambridge University Press, 2018. doi:10.1017/9781108231596. URL [https://doi.org/10.1017/9781108231596](https://doi.org/10.1017/9781108231596).
- Xiao et al. [2023] Xuanzhe Xiao, Zeng Li, Chuanlong Xie, and Fengwei Zhou. Heavy-tailed regularization of weight matrices in deep neural networks. In *Artificial Neural Networks and Machine Learning – ICANN 2023*, volume 14263 of *Lecture Notes in Computer Science*, pages 236–247. Springer, Cham, 2023. doi:10.1007/978-3-031-44204-9_20.
- Xie et al. [2024] Jingjing Xie, Yuxin Zhang, Mingbao Lin, Zhihang Lin, Liujuan Cao, and Rongrong Ji. UniPTS: A unified framework for proficient post-training sparsity. In *Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition*, pages 5746–5755, June 2024. doi:10.1109/CVPR52733.2024.00549. URL [https://openaccess.thecvf.com/content/CVPR2024/html/Xie_UniPTS_A_Unified_Framework_for_Proficient_Post-Training_Sparsity_CVPR_2024_paper.html](https://openaccess.thecvf.com/content/CVPR2024/html/Xie_UniPTS_A_Unified_Framework_for_Proficient_Post-Training_Sparsity_CVPR_2024_paper.html).
