PRISM: A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head

arXiv cs.CL 05/13/26, 04:00 AM Papers
Summary
This paper introduces PRISM, a geometric risk bound that decomposes model drift in post-training LLM variants into scale, shape, and head axes to diagnose specific failure modes like quantization errors or catastrophic forgetting.
arXiv:2605.11608v1 Announce Type: new Abstract: Comparing post-training LLM variants, such as quantized, LoRA-adapted, and distilled models, requires a diagnostic that identifies how a variant has drifted, not only whether it has degraded. Existing similarity scores such as CKA and SVCCA can flag degradation, but they do not directly link representation drift to risk or mechanism. We propose PRISM, Proxy Risk Inference via Structural Mapping, which exploits the linear output head of LLMs and the empirically near-isometric structure of their backbones to derive a closed-form upper bound on the cross-entropy risk gap between a target model and a post-training variant. The bound is calibrated for variant ranking and decomposes drift into three independently measurable axes: scale mismatch, shape mismatch, and head divergence. Each axis corresponds to a distinct failure mode, including shape distortion under low-bit quantization, scale separability under LoRA forgetting, and head divergence under GGUF k-quantization. As a result, the dominant axis suggests a remediation direction rather than merely raising a degradation flag. Because the shape term is differentiable, the same geometry can also serve as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman correlations of 0.820 for post-training quantization and 0.831 for LoRA forgetting, and its axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.
Cached at: 05/13/26, 06:16 AM
# A Geometric Risk Bound that Decomposes Drift into Scale, Shape, and Head
Source: [https://arxiv.org/html/2605.11608](https://arxiv.org/html/2605.11608)
###### Abstract

Comparing post-training LLM variants (quantized, LoRA-adapted, distilled) needs a diagnostic that pinpoints *how* the variant has drifted, not just *that* it has; existing similarity scores (CKA, SVCCA) flag degradation without linking it to risk or mechanism. We propose PRISM (Proxy Risk Inference via Structural Mapping), which exploits LLMs' linear head and empirically near-isometric backbone to derive a closed-form upper bound on the cross-entropy risk gap $|\mathcal{R}_T - \mathcal{R}_P|$, calibrated for variant ranking, that decomposes drift into three independently measurable axes: *scale mismatch*, *shape mismatch*, and *head divergence*. Each axis maps to a distinct failure mode (shape distortion at low-bit quantization, scale separability under LoRA forgetting, head divergence at GGUF k-quant), so the dominant axis points to a remediation direction rather than a flag. Because the shape term is differentiable, the same geometry doubles as a training-time regularizer against catastrophic forgetting. Across two model families and five benchmarks, PRISM ranks variants with mean Spearman $r_s = 0.820$ (PTQ) and $0.831$ (LoRA forgetting), and the axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.

### 1 Introduction

As large language models \(LLMs\) move from pre\-training to deployment, a new bottleneck emerges: a single base model now produces many post\-training variants—quantized \(GPTQ\[[1](https://arxiv.org/html/2605.11608#bib.bib1)\], GGUF, BitsAndBytes\[[2](https://arxiv.org/html/2605.11608#bib.bib2)\]\), LoRA\-adapted\[[3](https://arxiv.org/html/2605.11608#bib.bib3)\], or distilled\[[4](https://arxiv.org/html/2605.11608#bib.bib4)\]—that must be evaluated before release\[[5](https://arxiv.org/html/2605.11608#bib.bib5),[6](https://arxiv.org/html/2605.11608#bib.bib6)\]\. Existing evaluation largely relies on aggregate accuracy or perplexity, which can reveal that a variant has degraded but not*why*\. As a result, developers often resort to costly trial\-and\-error when debugging low\-bit quantization failures, catastrophic forgetting, or prediction\-head corruption\. What is missing is a diagnostic that not only predicts degradation, but also identifies which component of the model has drifted from the base checkpoint\.

A natural alternative is to compare internal representations directly. Prior representational similarity methods (SVCCA\[[7](https://arxiv.org/html/2605.11608#bib.bib7)\], CKA\[[8](https://arxiv.org/html/2605.11608#bib.bib8)\], and generalized shape metrics) summarize two feature matrices with a single alignment score, which is descriptive but not diagnostic. More fundamentally, lifting any such score into a deployment-time risk bound faces three obstacles: (i) no prior similarity has been tied to downstream cross-entropy risk on the deployed prediction head\[[9](https://arxiv.org/html/2605.11608#bib.bib9),[10](https://arxiv.org/html/2605.11608#bib.bib10),[11](https://arxiv.org/html/2605.11608#bib.bib11)\]; (ii) cross-entropy is Lipschitz in features, but the naive constant scales with the head's full spectral norm and is uninformative at LLM vocabulary scale ($V \sim 10^{5}$); and (iii) a usable diagnostic must further decompose the bound into geometrically interpretable axes that map to distinct failure mechanisms.

Our key insight is that two structural properties of modern LLMs jointly resolve all three obstacles. The linear `lm_head` over a non-linear backbone lets us derive a sharper Lipschitz constant tied to pairwise token-embedding distances rather than the head's spectral norm, keeping the bound informative at LLM vocabulary scale (Sec. [3.2](https://arxiv.org/html/2605.11608#S3.SS2)). With the near-isometry of hidden representations across related LLMs\[[12](https://arxiv.org/html/2605.11608#bib.bib12),[13](https://arxiv.org/html/2605.11608#bib.bib13)\], an orthogonal alignment yields a Procrustes residual that decomposes the feature error *exactly* into scale and shape axes. Together with a covariance-weighted head-side term, these produce PRISM (Proxy Risk Inference via Structural Mapping; Fig. [1](https://arxiv.org/html/2605.11608#S2.F1)), a closed-form upper bound on the cross-entropy risk gap between a target model and its variant.

PRISM exposes three diagnostic axes: *scale mismatch* $\Delta\rho$ (activation magnitude collapse), *shape mismatch* $1-\Omega$ (feature-geometry distortion), and a covariance-weighted *head discrepancy* $\gamma$ (prediction-head divergence). Unlike scalar similarity scores, these axes correspond to distinct empirical failure modes: low-bit quantization primarily induces shape distortion, LoRA fine-tuning yields activation scale separability, and output-projection quantization inflates head divergence.

Beyond post-hoc diagnosis, PRISM also provides a training signal. Under frozen-head LoRA fine-tuning, the head discrepancy term vanishes, making the differentiable shape term a clean regularization target for backbone drift. This allows us to regularize fine-tuning directly by penalizing feature-geometry distortion, reducing catastrophic forgetting without replay (Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4)). Empirically, across Llama-, Qwen-, Ministral-, and DeepSeek-based variants and five benchmarks, PRISM consistently tracks degradation across both post-training quantization (PTQ) and LoRA fine-tuning. The bound achieves mean Spearman correlations of $0.820$ for quantized variants and $0.831$ for LoRA forgetting; beyond ranking, its axes separate distinct failure modes. Moreover, the proposed axis-guided shape regularizer outperforms replay-based baselines in aggregate at mitigating forgetting.

Our contributions are threefold\.

1. Theory: a closed-form CE risk bound with three diagnostic axes. $|\mathcal{R}_T - \mathcal{R}_P|$ admits a closed-form upper bound built from three terms: scale $(\Delta\rho)^{2}$ and shape $2\rho_T\rho_P(1-\Omega)$ (from an exact Procrustes residual decomposition), plus a covariance-weighted head-discrepancy term (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1); Fig. [1](https://arxiv.org/html/2605.11608#S2.F1)).
2. Framework: a unified diagnostic that doubles as a training objective. Computed from features and head weights alone, the bound applies across both PTQ and frozen-head LoRA; the differentiable trace form $\Omega$ further turns it into a training-time regularizer against catastrophic forgetting (Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4)).
3. Empirical: rank-consistent across both settings, with axis-level localization. On Llama and Qwen (Ministral, DeepSeek in Appendix [F.3](https://arxiv.org/html/2605.11608#A6.SS3)) over five benchmarks, the bound ranks PTQ variants ($r_s = 0.820$) and LoRA checkpoints ($r_s = 0.831$) consistently (*predictiveness*); its axes separate failure modes, namely shape distortion at Q2/Q3 PTQ, scale separability under cross-task LoRA drift, and head divergence at Qwen3 Q6\_K/Q8\_0 (*decomposability*); and an axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting (*actionability*).

### 2 Related work

##### Representational similarity and Procrustes shape metrics\.

Representational similarity statistics (SVCCA\[[7](https://arxiv.org/html/2605.11608#bib.bib7)\], CKA\[[8](https://arxiv.org/html/2605.11608#bib.bib8)\], and the generalized shape metrics framework\[[14](https://arxiv.org/html/2605.11608#bib.bib14)\]) measure geometric resemblance between activation matrices, but their link to downstream behavior remains an open problem\[[9](https://arxiv.org/html/2605.11608#bib.bib9),[10](https://arxiv.org/html/2605.11608#bib.bib10),[11](https://arxiv.org/html/2605.11608#bib.bib11)\]; a recent decodability bound\[[15](https://arxiv.org/html/2605.11608#bib.bib15)\] reaches downstream via whitened kernels and newly-trained linear probes rather than the deployed prediction heads. The Linear Representation Hypothesis\[[12](https://arxiv.org/html/2605.11608#bib.bib12)\] and empirical relative-representation isometry across same-family encoders\[[13](https://arxiv.org/html/2605.11608#bib.bib13)\] together motivate restricting alignment to the orthogonal group, and the Platonic Representation Hypothesis\[[16](https://arxiv.org/html/2605.11608#bib.bib16)\] corroborates same-family shared geometry. PRISM lifts the Procrustes residual into a closed-form CE risk bound on the deployed head; the resulting bound splits into separately measurable scale, shape, and head components (Sec. [3](https://arxiv.org/html/2605.11608#S3)).

##### Post\-training quantization and its evaluation\.

Post\-training quantization spans calibration\-based methods \(GPTQ\[[1](https://arxiv.org/html/2605.11608#bib.bib1)\], second\-order reconstruction\) and weight\-only schemes \(GGUF k\-quants, BitsAndBytes\[[2](https://arxiv.org/html/2605.11608#bib.bib2)\]\), with low\-bit performance often constrained by activation outliers\[[17](https://arxiv.org/html/2605.11608#bib.bib17)\]\. Existing cheap diagnostics—layer\-wise reconstruction loss or weight quantization error—are indirect proxies that do not account for nonlinear error accumulation through the full network\. PRISM, in contrast, measures the resulting drift end\-to\-end on the deployed model in a single forward pass \(Sec\.[5\.2](https://arxiv.org/html/2605.11608#S5.SS2)\)\.

##### Catastrophic forgetting under parameter\-efficient fine\-tuning\.

LoRA fine-tuning\[[3](https://arxiv.org/html/2605.11608#bib.bib3)\] risks catastrophic forgetting of pre-training knowledge, even with a frozen `lm_head`. Existing remedies fall into two families: weight-space constraints (EWC\[[18](https://arxiv.org/html/2605.11608#bib.bib18)\]) and data-space rehearsal (experience replay). PRISM's shape regularizer adds a third option, a feature-geometry constraint identifiable from the bound's decomposition: feature-manifold drift ($\Omega$, $\Delta\rho$) relative to the frozen base bounds the downstream risk gap and is usable as both a post-hoc diagnostic and a training-time penalty (Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5)).

##### Efficient evaluation and performance forecasting\.

Scaling laws\[[19](https://arxiv.org/html/2605.11608#bib.bib19),[20](https://arxiv.org/html/2605.11608#bib.bib20)\] predict aggregate loss but not variant-specific degradation; efficient-evaluation subsets\[[21](https://arxiv.org/html/2605.11608#bib.bib21),[22](https://arxiv.org/html/2605.11608#bib.bib22)\] preserve variant orderings without explaining them; weak-to-strong generalization\[[23](https://arxiv.org/html/2605.11608#bib.bib23)\] targets label generation rather than risk estimation. PRISM instead bounds the variant-vs-base risk gap directly from a feature comparison, without relying on benchmark labels or aggregate-loss extrapolation (Sec. [3](https://arxiv.org/html/2605.11608#S3)).

Figure 1: PRISM (Proxy Risk Inference via Structural Mapping) decomposition of the risk gap. The diagram traces the target $T$ and the proxy $P$ from a shared input $(x,y)\sim\mathcal{D}$ through backbone $\phi_M$, features $Z_M$, and head $H_M$ to risk $\mathcal{R}_M$, with the two sides linked by the alignment $W$, and shows one panel each for scale mismatch, shape mismatch, and head discrepancy. For any orthogonal alignment $W\in\mathcal{O}(d)$, the cross-entropy risk gap $|\mathcal{R}_T-\mathcal{R}_P|$ is bounded (Thm. [1](https://arxiv.org/html/2605.11608#Thmtheorem1)) by a feature alignment error $\delta$, decomposed exactly into *scale mismatch* $(\Delta\rho)^{2}$ and *shape mismatch* $2\rho_T\rho_P(1-\Omega_W)$ (Prop. [1](https://arxiv.org/html/2605.11608#Thmproposition1)), plus a head discrepancy $\gamma = K_{\mathrm{pred}}\|\Sigma_P^{1/2}\Delta H\|_F$ where $\Delta H = WH_T - H_P$. The main text uses the identity alignment $W=I$ ($\Omega$), under which $\gamma$ vanishes whenever $H_T=H_P$ (frozen-head LoRA, FP16-head PTQ); both $W=I$ and the Procrustes-optimal $W=W_N$ yield strong rank correlations (Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5)). Each axis localizes a distinct empirical regime: shape distortion at low-bit PTQ, head divergence at GGUF k-quant tiers that quantize `lm_head`, and scale-axis separability under LoRA forgetting (Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)).

### 3 A geometric bound on the cross-entropy risk gap

PRISM (Proxy Risk Inference via Structural Mapping), developed in this section, is a closed-form upper bound on the cross-entropy risk gap $|\mathcal{R}_T-\mathcal{R}_P|$ that decomposes into three diagnostic axes (scale, shape, head). We build it from two structural properties of LLMs: a linear `lm_head` on a non-linear backbone, and the Linear Representation Hypothesis.

#### 3.1 Setup

Let $T$ be a *Target model* (e.g., a full-precision base) and $P$ a *Proxy model* (e.g., a quantized or fine-tuned variant), sharing hidden dimension $d$. Each model factors into a Transformer backbone $\phi_M:\mathcal{X}\to\mathbb{R}^{d}$ followed by a linear prediction head $H_M\in\mathbb{R}^{d\times V}$ (the `lm_head`, vocabulary size $V$). The risk of $M$ under data distribution $\mathcal{D}$ is the cross-entropy expectation

$$\mathcal{R}_M \;=\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell\big(\phi_M(x)H_M,\,y\big)\big],\qquad \ell(v,y) = -v_y + \log\sum_{j=1}^{V} e^{v_j},\; v\in\mathbb{R}^{V}. \tag{1}$$

Stacking $\phi_M$ outputs on shared inputs as rows, we form $Z_T, Z_P\in\mathbb{R}^{n\times d}$ and define the RMS feature scale $\rho_M = \|Z_M\|_F/\sqrt{n}$ and the empirical (uncentered) covariance $\Sigma_P = Z_P^{\top}Z_P/n$. The empirical isometry across model representations\[[13](https://arxiv.org/html/2605.11608#bib.bib13)\], consistent with the Linear Representation Hypothesis\[[12](https://arxiv.org/html/2605.11608#bib.bib12)\], motivates restricting attention to orthogonal alignments $W\in\mathcal{O}(d)$ between the two backbones; the PRISM bound (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1)) holds for any such $W$, with alignment quality determining how tight the bound is in practice.

#### 3.2 The unified risk bound

To bound $|\mathcal{R}_T-\mathcal{R}_P|$ we introduce a hybrid risk $\mathcal{R}_{P\to T} := \mathbb{E}[\ell(\phi_P(x)\,W\cdot H_T,\,y)]$ that evaluates the target's head on the proxy's aligned features. The triangle inequality $|\mathcal{R}_T-\mathcal{R}_P| \leq |\mathcal{R}_T-\mathcal{R}_{P\to T}| + |\mathcal{R}_{P\to T}-\mathcal{R}_P|$ splits the gap into two components, each upper-bounded by a closed-form geometric quantity: a *feature alignment error* $\delta$ and a *head discrepancy* $\gamma$ (full derivations in Appendix [A](https://arxiv.org/html/2605.11608#A1)).

##### Feature error $\delta$.

The cross-entropy loss is Lipschitz in features with constant $K_{\mathrm{feat}} = \max_{j,k}\|h_{T,j}-h_{T,k}\|_2$, where $h_{T,j}$ is the $j$-th column of $H_T$ (simplex polarization, Appendix [A.3](https://arxiv.org/html/2605.11608#A1.SS3); substantially tighter than the naive spectral bound $\sqrt{2}\,\|H_T\|_2$ that scales with vocabulary). The alignment residual $\|Z_T-Z_PW\|_F^{2}/n$ admits an exact identity for every orthogonal $W$:

###### Proposition 1 (Exact Scale–Shape Decomposition).

For any $W\in\mathcal{O}(d)$,

$$\frac{1}{n}\|Z_T-Z_PW\|_F^{2} \;=\; \underbrace{(\rho_T-\rho_P)^{2}}_{\text{Scale Mismatch } (\Delta\rho)^{2}} \;+\; \underbrace{2\rho_T\rho_P\big(1-\Omega_W\big)}_{\text{Shape Mismatch}}, \tag{2}$$

where $\Omega_W$ is the *trace Procrustes similarity*:

$$\Omega_W(Z_T,Z_P) \;:=\; \frac{\operatorname{Tr}(Z_T^{\top}Z_P\,W)}{\|Z_T\|_F\,\|Z_P\|_F} \;\in[-1,\,1]. \tag{3}$$
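As a numerical sanity check, the identity in Eq. (2) can be reproduced on synthetic matrices. The following is a minimal NumPy sketch rather than the authors' code: the helper names `rms_scale` and `trace_similarity`, the random features, and the random orthogonal `W` are all illustrative assumptions.

```python
import numpy as np

def rms_scale(Z):
    """RMS feature scale rho = ||Z||_F / sqrt(n) for an (n, d) feature matrix."""
    return float(np.linalg.norm(Z) / np.sqrt(Z.shape[0]))

def trace_similarity(Z_T, Z_P, W=None):
    """Trace Procrustes similarity Omega_W of Eq. (3); W defaults to the identity."""
    Z_PW = Z_P if W is None else Z_P @ W
    # Tr(Z_T^T Z_P W) equals the elementwise inner product <Z_T, Z_P W>.
    return float(np.sum(Z_T * Z_PW) / (np.linalg.norm(Z_T) * np.linalg.norm(Z_P)))

# Synthetic stand-ins for target and proxy features (assumption, not real activations).
rng = np.random.default_rng(0)
n, d = 64, 8
Z_T = rng.normal(size=(n, d))
Z_P = 0.7 * Z_T + 0.3 * rng.normal(size=(n, d))
W, _ = np.linalg.qr(rng.normal(size=(d, d)))   # an arbitrary orthogonal matrix

rho_T, rho_P = rms_scale(Z_T), rms_scale(Z_P)
omega = trace_similarity(Z_T, Z_P, W)

residual = np.linalg.norm(Z_T - Z_P @ W) ** 2 / n      # left-hand side of Eq. (2)
scale_term = (rho_T - rho_P) ** 2                      # scale mismatch
shape_term = 2.0 * rho_T * rho_P * (1.0 - omega)       # shape mismatch
print(residual, scale_term + shape_term)
```

The two printed values should agree up to floating-point error for any orthogonal `W`, since right-multiplication by `W` leaves the Frobenius norm of `Z_P` unchanged.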

Combined with the Lipschitz bound (Appendix [A](https://arxiv.org/html/2605.11608#A1)), we define

$$\delta \;:=\; K_{\mathrm{feat}}\sqrt{(\rho_T-\rho_P)^{2}+2\rho_T\rho_P(1-\Omega_W)},\qquad |\mathcal{R}_T-\mathcal{R}_{P\to T}| \leq \delta. \tag{4}$$
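To make the Lipschitz constant $K_{\mathrm{feat}}$ concrete, the sketch below computes the largest pairwise distance between head columns and compares it with the naive constant $\sqrt{2}\,\|H_T\|_2$. It is an illustration under stated assumptions: the function name `k_feat` and the small random head are hypothetical, and the dense $V\times V$ Gram matrix used here would need chunking at real vocabulary sizes.

```python
import numpy as np

def k_feat(H):
    """Largest pairwise Euclidean distance between columns of a (d, V) head matrix."""
    G = H.T @ H                                   # (V, V) Gram matrix of the columns
    sq_norms = np.diag(G)
    # ||h_j - h_k||^2 = ||h_j||^2 + ||h_k||^2 - 2 <h_j, h_k>
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * G
    return float(np.sqrt(np.clip(sq_dist, 0.0, None).max()))

# Toy head far below LLM scale (assumption for illustration only).
rng = np.random.default_rng(0)
d, V = 16, 200
H_T = rng.normal(size=(d, V)) / np.sqrt(d)

K_feat = k_feat(H_T)
naive = float(np.sqrt(2.0) * np.linalg.norm(H_T, 2))   # sqrt(2) * largest singular value
print(K_feat, naive)
```

The pairwise constant can never exceed the naive one, because $\|h_j-h_k\|_2 = \|H_T(e_j-e_k)\|_2 \leq \sqrt{2}\,\|H_T\|_2$; how large the gap is on a real `lm_head` depends on the model.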

##### Head error $\gamma$.

Let $\Delta H := WH_T-H_P$ for the alignment $W$; we define $\gamma := K_{\mathrm{pred}}\,\|\Sigma_P^{1/2}\Delta H\|_F$ with $K_{\mathrm{pred}}\leq\sqrt{2}$ (Appendix [A.5](https://arxiv.org/html/2605.11608#A1.SS5)), so that $|\mathcal{R}_{P\to T}-\mathcal{R}_P|\leq\gamma$. The covariance weighting $\Sigma_P^{1/2}$ ensures that only head misalignment in the *active subspace* (directions where the data has support) contributes to $\gamma$.
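The head term can be evaluated without forming a matrix square root, because $\|\Sigma_P^{1/2}\Delta H\|_F^{2} = \operatorname{Tr}(\Delta H^{\top}\Sigma_P\,\Delta H) = \|Z_P\,\Delta H\|_F^{2}/n$. The sketch below uses that shortcut and cross-checks it against an explicit eigendecomposition. The function name `head_discrepancy`, the synthetic data, and the choice of the worst-case value $K_{\mathrm{pred}}=\sqrt{2}$ are assumptions made for illustration.

```python
import numpy as np

def head_discrepancy(Z_P, H_T, H_P, W=None, k_pred=np.sqrt(2.0)):
    """gamma = K_pred * ||Sigma_P^{1/2} (W H_T - H_P)||_F, computed as ||Z_P dH||_F / sqrt(n)."""
    dH = (H_T if W is None else W @ H_T) - H_P
    return float(k_pred * np.linalg.norm(Z_P @ dH) / np.sqrt(Z_P.shape[0]))

# Synthetic proxy features and a lightly perturbed head (assumption, not real weights).
rng = np.random.default_rng(0)
n, d, V = 128, 8, 50
Z_P = rng.normal(size=(n, d))
H_T = rng.normal(size=(d, V))
H_P = H_T + 0.05 * rng.normal(size=(d, V))

gamma = head_discrepancy(Z_P, H_T, H_P)

# Cross-check with an explicit symmetric square root of Sigma_P.
Sigma_P = Z_P.T @ Z_P / n
evals, evecs = np.linalg.eigh(Sigma_P)
Sigma_half = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
gamma_explicit = float(np.sqrt(2.0) * np.linalg.norm(Sigma_half @ (H_T - H_P)))
print(gamma, gamma_explicit)
```

With identical heads and `W=None` the function returns exactly zero, which matches the frozen-head case described in the text.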

###### Theorem 1 (Unified Risk Bound).

For any $W\in\mathcal{O}(d)$,

$$|\mathcal{R}_T-\mathcal{R}_P| \;\leq\; \mathcal{B} \;:=\; \underbrace{K_{\mathrm{feat}}\sqrt{(\rho_T-\rho_P)^{2}+2\rho_T\rho_P\big(1-\Omega_W\big)}}_{\delta:\ \text{feature alignment error}} \;+\; \underbrace{K_{\mathrm{pred}}\,\|\Sigma_P^{1/2}(WH_T-H_P)\|_F}_{\gamma:\ \text{head discrepancy}}. \tag{5}$$

We refer to $\mathcal{B}=\delta+\gamma$ as the *PRISM bound*.
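The theorem and the two-step triangle argument behind it can be exercised end to end on synthetic data. The sketch below is a hedged illustration, not the paper's pipeline: `ce_risk` and `prism_bound` are hypothetical helper names, the features, heads, and labels are random, the alignment is fixed to $W=I$, and $K_{\mathrm{pred}}$ is set to its worst-case value $\sqrt{2}$.

```python
import numpy as np

def ce_risk(Z, H, y):
    """Mean cross-entropy of logits Z @ H against integer labels y (stable log-sum-exp)."""
    logits = Z @ H
    m = logits.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - logits[np.arange(len(y)), y]))

def prism_bound(Z_T, Z_P, H_T, H_P, k_pred=np.sqrt(2.0)):
    """Return (delta, gamma) of Theorem 1 under identity alignment W = I."""
    n = Z_T.shape[0]
    rho_T = np.linalg.norm(Z_T) / np.sqrt(n)
    rho_P = np.linalg.norm(Z_P) / np.sqrt(n)
    omega = np.sum(Z_T * Z_P) / (np.linalg.norm(Z_T) * np.linalg.norm(Z_P))
    # K_feat: largest pairwise distance between columns of the target head.
    G = H_T.T @ H_T
    sq = np.diag(G)[:, None] + np.diag(G)[None, :] - 2.0 * G
    k_feat = np.sqrt(np.clip(sq, 0.0, None).max())
    feat_err = (rho_T - rho_P) ** 2 + 2.0 * rho_T * rho_P * (1.0 - omega)
    delta = k_feat * np.sqrt(max(feat_err, 0.0))
    gamma = k_pred * np.linalg.norm(Z_P @ (H_T - H_P)) / np.sqrt(n)
    return float(delta), float(gamma)

# Synthetic target/proxy pair (assumption: small Gaussian drift in features and head).
rng = np.random.default_rng(0)
n, d, V = 256, 16, 100
Z_T = rng.normal(size=(n, d))
Z_P = Z_T + 0.1 * rng.normal(size=(n, d))
H_T = rng.normal(size=(d, V)) / np.sqrt(d)
H_P = H_T + 0.02 * rng.normal(size=(d, V))
y = rng.integers(0, V, size=n)

delta, gamma = prism_bound(Z_T, Z_P, H_T, H_P)
R_T, R_P = ce_risk(Z_T, H_T, y), ce_risk(Z_P, H_P, y)
R_hybrid = ce_risk(Z_P, H_T, y)        # target head applied to proxy features
gap = abs(R_T - R_P)
print(gap, delta + gamma)
```

On the empirical sample, each half of the triangle inequality is bounded by its own term, so `gap` should not exceed `delta + gamma`. How loose the bound is on this toy data says nothing about its tightness on real models.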

#### 3.3 Three diagnostic axes

Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) (illustrated in Fig. [1](https://arxiv.org/html/2605.11608#S2.F1)) decomposes the risk gap along three *independently measurable* axes, each attached to a distinct failure mode of the proxy.

##### Scale mismatch $\Delta\rho$.

Divergence in activation magnitude directly amplifies the feature error. Aggressive bit-width reduction clips activation outliers\[[2](https://arxiv.org/html/2605.11608#bib.bib2),[17](https://arxiv.org/html/2605.11608#bib.bib17)\], shrinking $\rho_P<\rho_T$; the scale axis isolates this *scale collapse*.

##### Shape mismatch $1-\Omega_W$.

The shape term captures geometric distortion of the feature manifold beyond scale. A drop in $\Omega_W$ signals that the relative arrangement of token representations has been corrupted, which we term *structural distortion*.

##### Head divergence $\|\Sigma_P^{1/2}(WH_T-H_P)\|_F$.

The head term quantifies how differently the prediction heads interpret features, weighted by $\Sigma_P$ so that only directions where the data has support contribute.

##### PRISM applies for any orthogonal alignment\.

Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) holds for every $W\in\mathcal{O}(d)$, yielding a family $\{\Omega_W\}$ of similarity scores. We evaluate two natural specializations: the trace form $\Omega := \Omega_{W=I}$ (used as the main text default) and the Procrustes-optimal nuclear form $\Omega_N := \Omega_{W=W_N}$ (ablation in Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5); full derivation in Appendix [B](https://arxiv.org/html/2605.11608#A2)). $W=W_N$ minimizes the alignment residual but introduces nonzero $\gamma$ when $H_T\approx H_P$ and requires per-step SVD; $W=I$ keeps $\gamma$ at its minimum, vanishing when $H_T=H_P$ (frozen-head LoRA, FP16-head PTQ; GGUF k-quant keeps $\gamma>0$), and is directly differentiable, enabling the regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5). Both yield strong rank correlations (Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5)); the choice is primarily design-driven.
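The Procrustes-optimal alignment mentioned above has a standard closed form via the SVD of $Z_T^{\top}Z_P$, which is also why $\Omega_N$ is a nuclear-norm quantity. The sketch below illustrates that textbook construction on synthetic features. The function name `procrustes_alignment` and the deliberately rotated proxy features are assumptions, and the paper's own derivation (Appendix B) may differ in detail.

```python
import numpy as np

def procrustes_alignment(Z_T, Z_P):
    """Orthogonal W_N maximizing Tr(Z_T^T Z_P W), i.e. minimizing ||Z_T - Z_P W||_F."""
    U, S, Vt = np.linalg.svd(Z_T.T @ Z_P)          # Z_T^T Z_P = U diag(S) Vt
    W_N = Vt.T @ U.T
    # At the optimum the trace equals the sum of singular values (nuclear norm).
    omega_N = S.sum() / (np.linalg.norm(Z_T) * np.linalg.norm(Z_P))
    return W_N, float(omega_N)

# Synthetic proxy whose basis is rotated relative to the target (illustrative assumption).
rng = np.random.default_rng(0)
n, d = 200, 6
Z_T = rng.normal(size=(n, d))
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Z_P = Z_T @ Q + 0.05 * rng.normal(size=(n, d))

W_N, omega_N = procrustes_alignment(Z_T, Z_P)
omega_I = float(np.sum(Z_T * Z_P) / (np.linalg.norm(Z_T) * np.linalg.norm(Z_P)))
res_I = np.linalg.norm(Z_T - Z_P) ** 2 / n
res_N = np.linalg.norm(Z_T - Z_P @ W_N) ** 2 / n
print(omega_I, omega_N, res_I, res_N)
```

In this rotated toy case `omega_N` is much larger than `omega_I`. With `W_N` different from the identity, $W_NH_T-H_P$ is generally nonzero even for identical heads, which is the trade-off the paragraph above describes.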

#### 3.4 Formal extension to autoregressive generation

The bound applies unchanged to sequence-level generation. For a sequence $(c,y)$ with context $c$ and target continuation $y=(y_1,\ldots,y_{|y|})$, the autoregressive risk is

$$\mathcal{R}_M^{\mathrm{AR}} \;=\; \mathbb{E}_{(c,y)\sim\mathcal{D}}\left[\frac{1}{|y|}\sum_{\tau=1}^{|y|}\ell\big(\phi_M(c,y_{<\tau})\cdot H_M,\;y_\tau\big)\right]. \tag{6}$$

Under teacher forcing, the $|y|$ token-level features $\phi_M(c,y_{<\tau})$ of a sequence are extracted in a single forward pass and stacked into $Z_M^{\mathrm{AR}}\in\mathbb{R}^{|y|\times d}$; Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) then applies directly to $(Z_T^{\mathrm{AR}},Z_P^{\mathrm{AR}})$. The full corollary is in Appendix [D](https://arxiv.org/html/2605.11608#A4).

#### 3.5 From diagnostic to training: shape regularization

LoRA fine-tuning drives the backbone shape away from its base: $\Omega(Z_0,Z_t)$ drops as training proceeds, and, because the head is frozen, the full risk gap between base $\theta_0$ and checkpoint $\theta_t$ reduces to backbone drift alone,

$$|\mathcal{R}_0-\mathcal{R}_t| \;\leq\; K_{\mathrm{feat}}\sqrt{(\rho_0-\rho_t)^{2}+2\rho_0\rho_t(1-\Omega)}. \tag{7}$$

Constraining shape should therefore reduce backbone drift and mitigate catastrophic forgetting on downstream tasks. Since $1-\Omega$ is differentiable in $Z_t$, we augment the fine-tuning objective with a *shape regularizer*, a penalty on $1-\Omega$:

$$L_{\mathrm{total}} \;=\; \underbrace{L_{\mathrm{CE}}\bigl(\theta_t;\,\mathcal{D}_{\mathrm{FT}}\bigr)}_{\text{task loss}} \;+\; \lambda\,\bigl(1-\Omega\bigl(Z_0^{\text{ref}},\,Z_t^{\text{ref}}\bigr)\bigr), \tag{8}$$

where $Z_0^{\text{ref}}, Z_t^{\text{ref}}$ are base and current feature matrices on a fixed reference set (training schedule in Sec. [5.1](https://arxiv.org/html/2605.11608#S5.SS1)); the empirical effect is validated in Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4).
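The differentiability of the penalty in Eq. (8) can be made explicit. For $W=I$, writing $a=\|Z_0\|_F$ and $b=\|Z_t\|_F$, the gradient is $\nabla_{Z_t}(1-\Omega) = -\big(Z_0/(ab) - \Omega\,Z_t/b^{2}\big)$. The NumPy sketch below implements this analytic gradient and takes one descent step directly on the features. It is a simplification for illustration: the name `shape_penalty` is hypothetical, the features are random, and in actual LoRA training the gradient would flow to adapter parameters through an autodiff framework rather than to $Z_t$ itself.

```python
import numpy as np

def shape_penalty(Z_0, Z_t):
    """Return 1 - Omega(Z_0, Z_t) and its analytic gradient with respect to Z_t (W = I)."""
    a, b = np.linalg.norm(Z_0), np.linalg.norm(Z_t)
    omega = np.sum(Z_0 * Z_t) / (a * b)
    grad_omega = Z_0 / (a * b) - omega * Z_t / b**2
    return float(1.0 - omega), -grad_omega

# Synthetic base and drifted reference features (assumption, not real activations).
rng = np.random.default_rng(0)
Z_0 = rng.normal(size=(32, 8))
Z_t = Z_0 + 0.5 * rng.normal(size=(32, 8))

penalty, grad = shape_penalty(Z_0, Z_t)

# One plain gradient step on the features, treating them as free variables.
Z_next = Z_t - 1.0 * grad
penalty_next, _ = shape_penalty(Z_0, Z_next)
print(penalty, penalty_next)
```

The gradient is orthogonal to $Z_t$, reflecting that $\Omega$ is invariant to rescaling the features: the penalty acts on shape only and leaves the scale axis $\Delta\rho$ unconstrained.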

### 4 Applications: post-training quantization and LoRA forgetting

The Unified Risk Bound (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1); Fig. [1](https://arxiv.org/html/2605.11608#S2.F1)) applies to two LLM lifecycle settings: post-training quantization (may engage all three axes) and frozen-head LoRA fine-tuning (backbone drift alone).

##### Quantization quality estimation\.

Target $T$ is the BF16 base model and proxy $P$ is the quantized variant. Weight-only PTQ perturbs weights without applying any basis transformation, so $W=I$ is the natural alignment; the head term $\gamma$ enters when the protocol also quantizes `lm_head` (GGUF k-quant) and vanishes otherwise (GPTQ, BnB).

##### Geometric monitoring of catastrophic forgetting\.

Target $T=\theta_0$ is the frozen base model and proxy $P=\theta_t$ a fine-tuned checkpoint, both evaluated on benchmarks *distinct* from the fine-tuning task so that $|\mathcal{R}_T-\mathcal{R}_P|$ captures catastrophic forgetting. Standard LoRA\[[3](https://arxiv.org/html/2605.11608#bib.bib3)\] keeps `lm_head` frozen, so $\gamma=0$ and forgetting reduces to *backbone geometric drift* $(\Delta\rho,\,1-\Omega)$; the shape regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5) penalizes the shape term directly.

### 5 Experiments

We organize the experiments around three claims. *Predictiveness*: the bound tracks $|\Delta\mathcal{R}|$ in rank order across both PTQ and LoRA variants (Sec. [5.2](https://arxiv.org/html/2605.11608#S5.SS2)). *Decomposability*: its three axes localize three qualitatively distinct empirical failure modes (Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)). *Actionability*: the differentiable shape arm doubles as a training-time regularizer that suppresses catastrophic forgetting (Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4)). Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5) closes with a component-wise ablation.

#### 5.1 Experimental setup

##### Models\.

The main analysis spans two model families: Llama-3.1-8B\[[5](https://arxiv.org/html/2605.11608#bib.bib5)\] and Qwen3-8B\[[6](https://arxiv.org/html/2605.11608#bib.bib6)\]; PTQ replications on Ministral-3-8B\[[24](https://arxiv.org/html/2605.11608#bib.bib24)\], DeepSeek-R1-Distill-Llama-8B\[[25](https://arxiv.org/html/2605.11608#bib.bib25)\], and three instruction-tuned counterparts (seven 8B families total) are in Appendix [F.3](https://arxiv.org/html/2605.11608#A6.SS3).

##### Quantization protocols\.

Three PTQ families across bit-widths 2–8, spanning basic rounding to calibration-based compensation: GGUF (round-to-nearest, Q2\_K–Q8\_0), GPTQ\[[1](https://arxiv.org/html/2605.11608#bib.bib1)\] (4-bit, second-order reconstruction), and BitsAndBytes\[[2](https://arxiv.org/html/2605.11608#bib.bib2)\] (INT8, NF4, FP4).

##### Fine\-tuning tasks\.

LoRA fine-tunes on TruthfulQA\[[26](https://arxiv.org/html/2605.11608#bib.bib26)\] (factual grounding) and BBQ\[[27](https://arxiv.org/html/2605.11608#bib.bib27)\] (Bias Benchmark for QA, social-context reasoning), two tasks with contrasting drift geometries (Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)). Downstream forgetting is measured on benchmarks disjoint from both fine-tuning tasks.

##### Benchmarks and scoring\.

Five benchmarks: MMLU\[[28](https://arxiv.org/html/2605.11608#bib.bib28)\], ARC\[[29](https://arxiv.org/html/2605.11608#bib.bib29)\] (multiple-choice knowledge), TriviaQA\[[30](https://arxiv.org/html/2605.11608#bib.bib30)\], SQuAD\[[31](https://arxiv.org/html/2605.11608#bib.bib31)\] (short-horizon QA), and GSM8K\[[32](https://arxiv.org/html/2605.11608#bib.bib32)\] (multi-step reasoning). All risks are computed teacher-forced (prompt $c$ and targets $y$ scored in a single forward pass over the gold span), producing a deterministic per-sample CE loss whose expectation gives the model's risk $\mathcal{R}_M$, and $|\Delta\mathcal{R}|$ is the target-vs-proxy gap we report.

##### Calibration and hyperparameters\.

PRISM and $|\Delta\mathcal{R}|$ are evaluated on fixed held-out subsets shared across all variants of a base ($512$ samples per benchmark for PTQ, $256$ for LoRA forgetting); since observed $|\Delta\mathcal{R}|$ spans 1–2 orders of magnitude across variants, rankings are robust to subset choice. LoRA fine-tunes rank-$32$ attention adapters with frozen head (AdamW, lr $10^{-5}$, batch size $16$, bf16). For the regularization comparison (Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4)), both penalties operate on $\mathcal{D}_{\mathrm{ref}}$ ($32$ pre-training sequences disjoint from $\mathcal{D}_{\mathrm{FT}}$): the trace-norm shape penalty sweeps $\lambda\in\{0.01, 0.05, 0.1, 0.5, 1.0\}$ and a replay-CE baseline sweeps $\lambda\in\{0.001, 0.005, 0.01, 0.05, 0.1\}$, each range matched to its penalty's natural scale ($1-\Omega\sim 0.1$ vs. CE reference loss $\sim 1$ nat). Checkpoints are saved every $25$ steps, with analysis at step $300$; all experiments use a single NVIDIA RTX 5090 (32 GB).

#### 5\.2Predictiveness: the bound tracks the risk gap

![Refer to caption](https://arxiv.org/html/2605.11608v1/x1.png)

Figure 2: The PRISM bound $\mathcal{B}$ tracks the empirical risk gap across two model families and five benchmarks. Each subplot scatters the PRISM bound $\mathcal{B}$ (x-axis, log) against the empirical cross-entropy risk gap $|\Delta\mathcal{R}|$ (y-axis, log). Each point is one quantization variant; colors denote PTQ family (GGUF / GPTQ / BitsAndBytes). Rows: Llama-3.1-8B, Qwen3-8B. Columns: ARC, MMLU, SQuAD, TriviaQA, GSM8K. Per-subplot Spearman $r_s$ is annotated in each panel (mean $r_s{\approx}0.82$ across the $2{\times}5$ grid; per-component breakdown in Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5)); the red dashed $y{=}x$ line marks bound tightness, and the green shaded region below it is the safe zone where the bound provably holds. Replication on Ministral-3-8B and DeepSeek-R1-Distill-Llama-8B: Fig. [5](https://arxiv.org/html/2605.11608#A6.F5) (Appendix [F](https://arxiv.org/html/2605.11608#A6)).

Table 1: Geometric decomposition for Llama-3.1-8B under identity alignment ($W{=}I$) on MMLU. Each benchmark section reports Spearman's $r_s(\mathcal{B},\,|\Delta\mathcal{R}|)$ across all quantization variants. Shading: $\Omega{<}0.80$ / $\Omega{<}0.95$ on $(\Omega,\delta,\mathcal{B},|\Delta\mathcal{R}|)$; $\gamma{=}0$ when the head is preserved ($H_T{=}H_P$).

##### Quantization.

The PRISM bound $\mathcal{B}$ tracks $|\Delta\mathcal{R}|$ with strong rank correlation across Llama and Qwen (mean Spearman $r_s=0.820\pm 0.0471$ (SEM) over the $2{\times}5$ grid; Fig. [2](https://arxiv.org/html/2605.11608#S5.F2); Ministral/DeepSeek replication in Appendix Fig. [5](https://arxiv.org/html/2605.11608#A6.F5)). Bit-width drives the bound monotonically: Q8/Q6 GGUF variants sit in the low-$\mathcal{B}$/low-$|\Delta\mathcal{R}|$ corner, while Q2 configurations move into the upper-right region where shape distortion $(1{-}\Omega)$ dominates the feature error (Table [1](https://arxiv.org/html/2605.11608#S5.T1)). All evaluated variants lie below the $y{=}x$ line (Fig. [2](https://arxiv.org/html/2605.11608#S5.F2)), confirming Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1)'s upper-bound guarantee empirically; the bit-width tier ordering (Q8/Q6 $<$ Q5/Q4 $<$ Q3/Q2 in $\mathcal{B}$) holds across both families.

On Llama MMLU specifically, this correlation reaches Spearman $r_s=0.91$: as bit-width drops from Q8 to Q2, $\mathcal{B}$ and $|\Delta\mathcal{R}|$ rise together (Table [1](https://arxiv.org/html/2605.11608#S5.T1)); the remaining Llama benchmarks (TriviaQA, ARC, SQuAD, GSM8K), the Qwen3-8B counterpart, the feature-only ($\delta$) scatter, and Ministral/DeepSeek decompositions are in Appendices [F.2](https://arxiv.org/html/2605.11608#A6.SS2)–[F.3](https://arxiv.org/html/2605.11608#A6.SS3).
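For readers who want to reproduce the ranking metric, the Spearman coefficient between a bound and a risk gap is the Pearson correlation of their ranks. The sketch below uses only the standard library; the five bound/gap values are hypothetical and are not taken from the paper's tables.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman r_s: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical bound / gap values for five variants (not from the paper).
bound = [1.2, 3.8, 9.5, 40.0, 77.0]
gap = [0.02, 0.05, 0.04, 0.30, 0.90]
r_s = spearman(bound, gap)  # one adjacent pair is swapped in rank
print(f"r_s = {r_s:.3f}")
```

Only the ordering of variants matters here, which is why a bound that is loose in absolute terms can still rank variants well.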

![Refer to caption](https://arxiv.org/html/2605.11608v1/x2.png)

Figure 3: Llama-3.1-8B: the PRISM bound tracks catastrophic forgetting across LoRA fine-tuning steps. Each subplot scatters the bound $\mathcal{B}$ (x-axis, log) against the empirical forgetting $|\Delta\mathcal{R}|$ (y-axis, log) on a downstream benchmark, with one point per LoRA checkpoint colored by training step. Rows: fine-tuning task (TruthfulQA, BBQ). Columns: downstream benchmark (ARC, MMLU, SQuAD, TriviaQA, GSM8K). Under LoRA's frozen `lm_head`, $\gamma{=}0$ so $\mathcal{B}$ reduces to backbone scale ($\Delta\rho$) and shape ($1{-}\Omega$) drift. Spearman $r_s$ per subplot is annotated in each panel. Qwen3-8B replication: Appendix [G](https://arxiv.org/html/2605.11608#A7).

##### LoRA forgetting.

The same predictiveness extends to fine-tuning-induced drift (Fig. [3](https://arxiv.org/html/2605.11608#S5.F3)): as LoRA fine-tuning proceeds on TruthfulQA or BBQ from step 25 to step 300, the bound computed on *downstream* benchmarks tracks the empirical cross-entropy drift step-by-step, with mean Spearman $r_s=0.831\pm 0.0722$ over the $2\times 5$ downstream cells, comparable to the PTQ grid (Fig. [2](https://arxiv.org/html/2605.11608#S5.F2)). On Llama TruthfulQA in particular, forgetting accumulates progressively across checkpoints, where $\mathcal{B}$ and $|\Delta\mathcal{R}|$ trend together (mean $r_s=0.958$ across the 5 downstream benchmarks). The Qwen3-8B replication (Appendix [G](https://arxiv.org/html/2605.11608#A7)) reproduces both patterns in most cells.

#### 5.3 Decomposability: three axes, three failure modes

Scalar correlation alone matches prior representational similarity work in answering *whether* a variant has degraded. PRISM's decomposition adds *which*: the dominant axis of Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) reads off directly from the per-variant numbers and maps to a distinct empirical failure mode.

##### Shape distortion \(low\-bit PTQ\)\.

At Q2 and Q3 across all four families tested (Appendix [F.3](https://arxiv.org/html/2605.11608#A6.SS3) for Ministral and DeepSeek), the shape term $2\rho_T\rho_P(1{-}\Omega)$ typically dominates the scale term $(\Delta\rho)^2$, often by orders of magnitude. At Llama-Q2_K MMLU (Table [1](https://arxiv.org/html/2605.11608#S5.T1)), $\rho_P$ exceeds $\rho_T$ only modestly ($\Delta\rho\approx 4.9$, scale $\approx 24$) yet $\Omega$ drops to $0.78$, driving shape ($\approx 9{,}000$) to dominate; Ministral-Q2_K SQuAD shows the same pattern ($\sim 3$ vs $\sim 9{,}000$). This is consistent with low-bit PTQ corrupting the relational structure of the feature manifold rather than its global scale, a separation invisible to any scalar similarity metric. The same shape dominance appears in LoRA forgetting: at Llama TruthfulQA-FT on TriviaQA (no-reg baseline; Table [18](https://arxiv.org/html/2605.11608#A7.T18), Appendix), $\Omega$ drops to $0.76$ with shape ($\approx 10{,}000$) outweighing scale ($\approx 35$) by ${\sim}280\times$, motivating the shape regularizer of Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4).
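The scale/shape split can be checked numerically on small matrices. The sketch below assumes $\rho_M = \|Z_M\|_F$ and $\Omega = \mathrm{tr}(Z_T^\top Z_P)/(\rho_T\rho_P)$ under identity alignment; the paper's defining equations sit outside this section and may use a different normalization, and the feature matrices are hypothetical.

```python
import math

def frob_inner(a, b):
    """Frobenius inner product tr(a^T b) for equally shaped nested lists."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def scale_shape_split(z_t, z_p):
    """Split ||z_t - z_p||_F^2 into scale and shape terms (identity alignment).

    Assumes rho_M = ||Z_M||_F and Omega = tr(Z_T^T Z_P) / (rho_T * rho_P);
    the paper's own normalisation may differ.
    """
    rho_t = math.sqrt(frob_inner(z_t, z_t))
    rho_p = math.sqrt(frob_inner(z_p, z_p))
    omega = frob_inner(z_t, z_p) / (rho_t * rho_p)
    scale = (rho_t - rho_p) ** 2
    shape = 2.0 * rho_t * rho_p * (1.0 - omega)
    return rho_t, rho_p, omega, scale, shape

# Hypothetical 3-token, 2-dimensional feature matrices.
z_target = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
z_proxy = [[0.9, 0.3], [0.4, 1.8], [1.2, 0.7]]

rho_t, rho_p, omega, scale, shape = scale_shape_split(z_target, z_proxy)
diff = [[a - b for a, b in zip(rt, rp)] for rt, rp in zip(z_target, z_proxy)]
residual = frob_inner(diff, diff)
print(f"Omega={omega:.4f} scale={scale:.4f} shape={shape:.4f} residual={residual:.4f}")
```

Under these assumptions the two terms sum exactly to the squared Frobenius residual, so reporting them separately loses nothing while showing which one drives the drift.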

##### Scale\-axis separability\.

The scale axis surfaces as a non-redundant channel in two regimes. In PTQ, Qwen3-Base Q2_K on GSM8K is a scale-axis outlier: $\rho_P$ jumps from $267$ to $313$ ($|\Delta\rho|{=}46$, vs. $\leq 8$ for all other Qwen3 quantization variants on this benchmark), separately from the shape drift (Qwen decomposition table, Appendix). In LoRA forgetting, the sign of $\Delta\rho$ varies with source task: TruthfulQA-FT drives $\rho_P>\rho_T$ on all five benchmarks, while BBQ-FT produces $\rho_P<\rho_T$ on ARC/MMLU (Tables [18](https://arxiv.org/html/2605.11608#A7.T18), [19](https://arxiv.org/html/2605.11608#A7.T19), Appendix). The $(\Delta\rho,1{-}\Omega)$ decomposition makes these *qualitatively different drift geometries* visible, whereas they remain invisible to scalar similarity.

##### Head divergence (GGUF k-quant tiers that quantize `lm_head`).

When the protocol quantizes the output embedding, the head term $\gamma$ becomes the dominant arm. At Qwen3-Base Q6_K on SQuAD (Qwen decomposition table, Appendix [F.3](https://arxiv.org/html/2605.11608#A6.SS3)), backbone scale and shape are essentially perfect ($(\Delta\rho)^2{\approx}0.12$, $\Omega\approx 1$), so $\delta{=}1.18$, yet the quantized output embedding alone contributes $\gamma{=}75.77$, making $\gamma$ essentially the entire bound (Q8_0 shows the same pattern: $\delta{=}0.74$, $\gamma{=}23.96$). Under BnB INT8 the same Qwen3-Base keeps $\gamma{\equiv}0$ by construction, leaving only $\delta{=}3.81$ as the bound, a $20\times$ reduction from Q6_K's $\mathcal{B}{=}76.95$, determined entirely by which protocol quantizes `lm_head`. The decomposition makes this protocol-level switch read off directly, identifying not just the magnitude of degradation but its dominant channel.
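The arithmetic behind these figures is easy to verify. The additive form $\mathcal{B}=\delta+\gamma$ used below is inferred from the reported Q6_K numbers and from the two-term structure of the bound; the Q8_0 total and the percentage share are computed here and are not quoted from the paper.

$$
\begin{aligned}
\text{Q6\_K:}\quad & \delta+\gamma = 1.18 + 75.77 = 76.95 = \mathcal{B}, \qquad \gamma/\mathcal{B} = 75.77/76.95 \approx 98.5\%,\\
\text{Q8\_0:}\quad & \delta+\gamma = 0.74 + 23.96 = 24.70,\\
\text{BnB INT8:}\quad & \delta+\gamma = 3.81 + 0 = 3.81, \qquad 76.95/3.81 \approx 20.2 .
\end{aligned}
$$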

##### From diagnosis to remediation\.

Each dominant axis suggests an axis-specific remediation. Scale collapse admits per-channel outlier smoothing [[17](https://arxiv.org/html/2605.11608#bib.bib17)], and head divergence admits FP16-`lm_head` retention. Shape preservation admits two lifecycle-specific instances: Hessian-aware reconstruction at PTQ time, and differentiable trace regularization at LoRA training time (Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4)). The bound thus turns from a scalar score into an axis-level diagnostic with actionable structure.

#### 5.4 Actionability: from diagnostic to training objective

![Refer to caption](https://arxiv.org/html/2605.11608v1/x3.png)

Figure 4: Shape regularization vs. replay-CE on Llama-3.1-8B. LoRA fine-tuning on TruthfulQA (top) and BBQ (bottom) under three configurations: *no reg* (anchor), the *replay* baseline, and *our trace* (Eq. [8](https://arxiv.org/html/2605.11608#S3.E8)); the latter two share a $32$-sample reference set and are each method's sweep $|\Delta\mathcal{R}|$-best. Our trace cuts downstream mean $|\Delta\mathcal{R}|$ further than the replay baseline; per-benchmark $\Omega$ and $|\Delta\mathcal{R}|$ in Table [2](https://arxiv.org/html/2605.11608#S5.T2). Qwen3-8B replication: Appendix [G](https://arxiv.org/html/2605.11608#A7).

Table 2: Regularization comparison for Llama-3.1-8B fine-tuned on TruthfulQA (identity alignment $W{=}I$; metrics at step 300). Rows: $\Omega$ (higher = more shape preserved) and $|\Delta\mathcal{R}|$ (lower = less forgetting). Bold/underline: 1st / 2nd-best per benchmark group. Across-benchmark unweighted means for [no reg / replay $\lambda{=}0.01$ (baseline) / trace $\lambda{=}1.0$ (ours)]: $\Omega$ = 0.906 / 0.915 / 0.931; $|\Delta\mathcal{R}|$ = 0.843 / 0.764 / 0.681. In this frozen-`lm_head` LoRA setting $\gamma\equiv 0$ and $\rho_T\approx\rho_P$, so $\delta$ and $\mathcal{B}$ track $\Omega$; the full decomposition is in Appendix Tables [18](https://arxiv.org/html/2605.11608#A7.T18)–[21](https://arxiv.org/html/2605.11608#A7.T21).

The decomposition of Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3) suggests an immediate intervention: shape drift dominates LoRA forgetting, and $1{-}\Omega$ is differentiable in $Z_t$. We add the trace-norm penalty $\lambda(1{-}\Omega(Z_0^{\mathrm{ref}},Z_t^{\mathrm{ref}}))$ on the reference set $\mathcal{D}_{\mathrm{ref}}$. To isolate shape preservation from data re-fitting, we compare against a replay-CE baseline that uses the same $\mathcal{D}_{\mathrm{ref}}$ (cross-entropy on $\mathcal{D}_{\mathrm{ref}}$, swept over $\lambda$).
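A minimal sketch of how such a penalty enters the objective is given below. It assumes the normalized-trace form $\Omega = \mathrm{tr}(Z_0^\top Z_t)/(\|Z_0\|_F\|Z_t\|_F)$, which is an assumption since the regularizer's defining equation is not reproduced in this section; the function names and feature values are hypothetical. In actual training, the same expression would be written in an autodiff framework so that gradients flow into $Z_t$.

```python
import math

def shape_similarity(z_ref0, z_reft):
    """Omega under identity alignment: tr(Z0^T Zt) / (||Z0||_F * ||Zt||_F).

    This normalised-trace form is an assumption made for illustration.
    """
    inner = sum(a * b for r0, rt in zip(z_ref0, z_reft) for a, b in zip(r0, rt))
    norm0 = math.sqrt(sum(a * a for row in z_ref0 for a in row))
    normt = math.sqrt(sum(b * b for row in z_reft for b in row))
    return inner / (norm0 * normt)

def regularized_loss(task_loss, z_ref0, z_reft, lam):
    """Fine-tuning objective: task loss plus lam * (1 - Omega) on the reference set."""
    return task_loss + lam * (1.0 - shape_similarity(z_ref0, z_reft))

z0 = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]          # frozen base-model features
z_scaled = [[1.5 * a for a in row] for row in z0]   # pure rescaling of z0
z_bent = [[1.0, 0.6], [0.8, 2.0], [0.2, 1.0]]       # relational structure changed

loss_scaled = regularized_loss(0.5, z0, z_scaled, lam=1.0)
loss_bent = regularized_loss(0.5, z0, z_bent, lam=1.0)
print(f"scaled: {loss_scaled:.4f}  bent: {loss_bent:.4f}")
```

Under this form the penalty ignores a uniform rescaling of the features and charges only for changes in their relative geometry, which matches its role as a shape-axis regularizer.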

Fig. [4](https://arxiv.org/html/2605.11608#S5.F4) and Table [2](https://arxiv.org/html/2605.11608#S5.T2) compare our trace at $\lambda{=}1.0$ against the replay baseline at $\lambda{=}0.01$ on Llama-3.1-8B (no regularization as anchor reference; both $\lambda$ chosen as each method's sweep optimum under the identical evaluation protocol). Our trace cuts mean downstream $|\Delta\mathcal{R}|$ on TruthfulQA from $0.84$ (no reg) to $0.68$ ($-19\%$); the replay baseline only reaches $0.76$ ($-9\%$). On Llama BBQ, fine-tuning is less shape-disruptive (mean baseline $\Omega=0.932$ vs. TruthfulQA's $0.906$); trace still lifts $\Omega$ ($0.93\to 0.98$) while replay leaves it flat. Mechanistically, replay reduces forgetting indirectly by re-fitting reference data (without directly targeting $\Omega$); trace contracts the shape arm of the bound at the source. Task-dependence and the axis-guided gating signal are analyzed in Appendix [H](https://arxiv.org/html/2605.11608#A8) (Table [22](https://arxiv.org/html/2605.11608#A8.T22)). The Qwen3-8B replication and BBQ per-variant tables are in Appendix [G](https://arxiv.org/html/2605.11608#A7).

#### 5.5 Ablation: component-wise contributions

Table 3: Component-wise ablation: PRISM ranks variants strongly under both alignments. Rows add bound components cumulatively (shape $\to$ +scale $\to$ +head); larger $r_s$ = better ranking; bold/underline: 1st / 2nd-best within each block. Top block: identity $W{=}I$ (main text default; $\mathcal{B}$ reaches $r_s{=}0.82$). Bottom block: Procrustes-optimal $W{=}W_N$ ($\mathcal{B}_N$ reaches $r_s{=}0.91$, modestly tighter since $W_N$ minimizes the alignment residual). The main text adopts $W{=}I$ for SVD-free differentiability and the $H_T{=}H_P$ head-term simplification in frozen-`lm_head` regimes.

Table [3](https://arxiv.org/html/2605.11608#S5.T3) measures the marginal ranking contribution of each PRISM bound component. Under $W{=}I$ (top block), the shape-only similarity $\Omega$ already achieves mean $r_s{=}0.804$: token-feature relational geometry alone carries most of the variant-ranking signal. Adding scale to form $\delta$ lifts the mean to $0.868$ ($+0.064$): scale collapse is the single largest per-component contributor. Adding the head term to form $\mathcal{B}$ moves the aggregate to $0.820$: $\gamma{>}0$ engages only on GGUF k-quant tiers that quantize `lm_head` while $\gamma{\equiv}0$ on GPTQ/BnB (FP16 head), so pooling across these heterogeneous $\gamma$-regimes adds variance to the pooled rank order while preserving the bound's validity. $\mathcal{B}$ remains our default; it is Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1)'s certified upper bound and the only valid metric whenever $\gamma{>}0$.

Under $W{=}W_N$ (bottom block), $\Omega_N\to\delta_N\to\mathcal{B}_N$ all rise monotonically and $\mathcal{B}_N$ achieves $r_s{=}0.912$, confirming the head term contributes most when the alignment absorbs the $H_T\to H_P$ rotation. The $W{=}I$ trace specialization trades $\sim 0.09$ Spearman for SVD-free differentiability, frozen-`lm_head` head-term simplification, and consistency with the Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5) regularizer. Both alignments thus yield strong predictiveness; the choice between them is primarily design-driven.

### 6 Discussion

PRISM combines a closed-form geometric bound on $|\mathcal{R}_T{-}\mathcal{R}_P|$ with an exact three-axis decomposition (scale, shape, head). The decomposition turns variant comparison from a scalar similarity score into an axis-level diagnostic that pinpoints *which* part has drifted; the bound applies to PTQ and frozen-head LoRA, and the differentiable $\Omega$ doubles as a training-time regularizer. Empirically, this single framework ranks variants with comparable Spearman across both regimes ($r_s{\approx}0.82$ on PTQ, $0.831$ on LoRA), the three axes localize qualitatively distinct failure modes, and the axis-guided shape regularizer outperforms experience replay in aggregate at mitigating downstream forgetting.

##### Scope and limitations\.

PRISM bounds the cross-entropy risk gap to a reference base; it is calibrated for variant ranking, where Spearman correlations are strong across both PTQ and LoRA settings (Sec. [5.2](https://arxiv.org/html/2605.11608#S5.SS2)). Tight absolute estimation of $|\Delta\mathcal{R}|$ is a complementary problem we leave to future work. Our evaluation uses teacher-forced features for efficiency (a single deterministic forward pass per variant, without sampling noise or decoding strategy to control for); the bound applies unchanged to any $(Z_T,Z_P)$ matrices, including features collected along free-running generation trajectories, a direct extension we leave to future work. The shape regularizer of Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4) handles the shape axis directly via training; per-axis protocol-level mitigations (Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)) remain a research follow-up enabled by the diagnostic.

##### Future work\.

We describe future directions in detail in Appendix [I](https://arxiv.org/html/2605.11608#A9), including beyond LoRA forgetting (analyze and regularize SFT/distillation drift), diagnostic applications (OOD detection, hyperparameter transfer, drift monitoring), and beyond LLMs (ViT, contrastive image encoders).

### Acknowledgments

We thank Chih\-Han Yu for inspiring this work, and the Appier AI Research team members for their support\. This work was supported in part by the National Science and Technology Council, Taiwan, under Grants 114\-2628\-E\-002\-021\- and 115\-2634\-F\-002 \-012\-, and the Taiwan Centers of Excellence\. Shao\-Hua Sun was supported by the Yushan Fellow Program of the Ministry of Education, Taiwan\.

### References

- Frantar et al. [2023] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In *International Conference on Learning Representations (ICLR)*, 2023.
- Dettmers et al. [2022] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, *Advances in Neural Information Processing Systems*, volume 35, pages 30318–30332. Curran Associates, Inc., 2022.
- Hu et al. [2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In *International Conference on Learning Representations (ICLR)*, 2022.
- Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. *arXiv preprint arXiv:1503.02531*, 2015.
- Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, et al. The Llama 3 herd of models. *arXiv preprint arXiv:2407.21783*, 2024.
- Qwen Team [2025] Qwen Team. Qwen3 technical report. *arXiv preprint arXiv:2505.09388*, 2025.
- Raghu et al. [2017] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2017.
- Kornblith et al. [2019] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In *International Conference on Machine Learning (ICML)*, 2019.
- Ding et al. [2021] Frances Ding, Jean-Stanislas Denain, and Jacob Steinhardt. Grounding representation similarity with statistical testing. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2021.
- Davari et al. [2023] MohammadReza Davari, Stefan Horoi, Amine Natik, Guillaume Lajoie, Guy Wolf, and Eugene Belilovsky. Reliability of CKA as a similarity measure in deep learning. In *International Conference on Learning Representations (ICLR)*, 2023.
- Klabunde et al. [2025] Max Klabunde, Tobias Schumacher, Markus Strohmaier, and Florian Lemmerich. Similarity of neural network models: A survey of functional and representational measures. *ACM Computing Surveys*, 57(9):1–52, 2025. doi:10.1145/3728458.
- Park et al. [2024] Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. In *International Conference on Machine Learning (ICML)*, 2024.
- Moschella et al. [2023] Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, and Emanuele Rodolà. Relative representations enable zero-shot latent space communication. In *International Conference on Learning Representations (ICLR)*, 2023.
- Williams et al. [2021] Alex H Williams, Erin Kunz, Simon Kornblith, and Scott W Linderman. Generalized shape metrics on neural representations. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2021.
- Harvey et al. [2024] Sarah E Harvey, David Lipshutz, and Alex H Williams. What representational similarity measures imply about decodable information. In *Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models*, volume 285 of *Proceedings of Machine Learning Research*, pages 140–151. PMLR, 2024.
- Huh et al. [2024] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. Position: The platonic representation hypothesis. In *Proceedings of the 41st International Conference on Machine Learning*, volume 235 of *Proceedings of Machine Learning Research*, pages 20617–20642. PMLR, 2024.
- Xiao et al. [2023] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In *International Conference on Machine Learning (ICML)*, pages 38087–38099. PMLR, 2023.
- Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. *Proceedings of the National Academy of Sciences (PNAS)*, 114(13):3521–3526, 2017.
- Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. *arXiv preprint arXiv:2001.08361*, 2020.
- Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2022.
- Polo et al. [2024] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinyBenchmarks: Evaluating LLMs with fewer examples. In *International Conference on Machine Learning (ICML)*, 2024.
- Perlitz et al. [2024] Yotam Perlitz, Elron Bandel, Ariel Gera, Ofir Arviv, Liat Ein-Dor, Eyal Shnarch, Noam Slonim, Michal Shmueli-Scheuer, and Leshem Choshen. Efficient benchmarking (of language models). In *Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)*, pages 2519–2536, Mexico City, Mexico, 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.naacl-long.139.
- Burns et al. [2024] Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In *International Conference on Machine Learning (ICML)*, 2024.
- Liu et al. [2026] Alexander H Liu et al. Ministral 3. *arXiv preprint arXiv:2601.08584*, 2026.
- DeepSeek-AI [2025] DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. *Nature*, 645:633–638, 2025. doi:10.1038/s41586-025-09422-z.
- Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 3214–3252, Dublin, Ireland, 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.acl-long.229.
- Parrish et al. [2022] Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson, Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering. In *Findings of the Association for Computational Linguistics: ACL 2022*, pages 2086–2105, Dublin, Ireland, 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.findings-acl.165.
- Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In *International Conference on Learning Representations (ICLR)*, 2021.
- Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. *arXiv preprint arXiv:1803.05457*, 2018.
- Joshi et al. [2017] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In *Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 1601–1611, Vancouver, Canada, 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-1147.
- Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In *Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing*, pages 2383–2392, Austin, Texas, 2016. Association for Computational Linguistics. doi:10.18653/v1/D16-1264.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021.
- Schönemann [1966] Peter H Schönemann. A generalized solution of the orthogonal Procrustes problem. *Psychometrika*, 31(1):1–10, 1966.
- Gower and Dijksterhuis [2004] John C Gower and Garmt B Dijksterhuis. *Procrustes Problems*. Oxford University Press, 2004.
- Ethayarajh [2019] Kawin Ethayarajh. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2019.

### Appendix

### Appendix A Proof of the unified risk bound (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1))

We provide the complete proof of Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1), which the main text states for general $W\in\mathcal{O}(d)$ (trace family $\Omega_W$, Eq. [3](https://arxiv.org/html/2605.11608#S3.E3)). The proof proceeds through five self-contained steps: risk decomposition via triangle inequality, Lipschitz analysis of cross-entropy with respect to features, the alignment-residual decomposition (whose Procrustes-optimal specialization $W=W_N$ yields the nuclear form $\Omega_N$ via SVD), covariance-adjusted head bound, and final assembly. The identity-alignment specialization $W=I$ (used throughout Sec. [5](https://arxiv.org/html/2605.11608#S5) and the regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5)) and the $W=W_N$ specialization (used in the Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5) ablation) follow as corollaries of the same proof.

#### A.1 Relation to classical Procrustes shape metrics

The left-hand side of Eq. ([2](https://arxiv.org/html/2605.11608#S3.E2)) at $W=W_N$ is the squared Procrustes size-and-shape distance $d_1^2(Z_T,Z_P)$ studied in the generalized shape metrics framework [[14](https://arxiv.org/html/2605.11608#bib.bib14)] and its recent decodability analysis [[15](https://arxiv.org/html/2605.11608#bib.bib15)]. Expanding $d_1^2$ reveals that the nuclear norm $\|Z_T^\top Z_P\|_*$ appears inside any Procrustes-based similarity; accordingly, our contribution is *not* the Procrustes distance itself. Rather, it is (a) the explicit split of this residual into the scale term $(\rho_T-\rho_P)^2$ and the shape term $2\rho_T\rho_P(1-\Omega_N)$, which the shape-metric literature treats jointly as a single descriptive dissimilarity, and (b) its lifting to a functional-risk bound (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1)). Section [4](https://arxiv.org/html/2605.11608#S4) (Quantization and LoRA Forgetting paragraphs) shows that the two arms of this split have qualitatively different practical meanings and dominate the risk gap under different LLM lifecycle settings.
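The expansion mentioned above can be written out in a few lines. The sketch assumes $\rho_M=\|Z_M\|_F$ and $\Omega_N=\|Z_T^\top Z_P\|_*/(\rho_T\rho_P)$; these definitions are not restated in this section, so the normalization used in the main text may differ. The middle step is the classical orthogonal Procrustes result that the trace is maximized at the sum of singular values.

$$
\begin{aligned}
d_1^2(Z_T,Z_P) &= \min_{W\in\mathcal{O}(d)}\|Z_T-Z_PW\|_F^2
= \|Z_T\|_F^2+\|Z_P\|_F^2-2\max_{W\in\mathcal{O}(d)}\operatorname{tr}\!\big(Z_T^\top Z_P W\big)\\
&= \rho_T^2+\rho_P^2-2\,\|Z_T^\top Z_P\|_*\\
&= (\rho_T-\rho_P)^2+2\rho_T\rho_P\,(1-\Omega_N).
\end{aligned}
$$

The first equality uses $\|Z_PW\|_F=\|Z_P\|_F$ for orthogonal $W$; the last adds and subtracts $2\rho_T\rho_P$.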

#### A.2 Step 1: risk decomposition via triangle inequality

##### Setup\.

Recall that the risk of model $M$ is $\mathcal{R}_M=\mathbb{E}_{(x,y)}[\ell(\phi_M(x)\cdot H_M,y)]$, where $\ell$ is the cross-entropy loss (Eq. [1](https://arxiv.org/html/2605.11608#S3.E1)). For any orthogonal alignment $W\in\mathcal{O}(d)$, introduce the hybrid risk:

$$\mathcal{R}_{P\to T}:=\mathbb{E}_{(x,y)}\Big[\ell\big(\phi_P(x)\,W\cdot H_T,\;y\big)\Big]. \tag{9}$$

This evaluates the target's head $H_T$ on proxy features aligned by $W$, serving as a bridge between $\mathcal{R}_T$ and $\mathcal{R}_P$. The proof below holds for any $W\in\mathcal{O}(d)$; the two specializations relevant to the paper are $W=I$ (identity alignment, trace form $\Omega$, used throughout the main text) and $W=W_N:=\arg\min_W\|Z_T-Z_PW\|_F^2$ (Procrustes-optimal, nuclear form $\Omega_N$, Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5) ablation).

##### Decomposition\.

By the triangle inequality on absolute values:

$$|\mathcal{R}_T-\mathcal{R}_P|\;\leq\;\underbrace{\big|\mathcal{R}_T-\mathcal{R}_{P\to T}\big|}_{\text{Feature Alignment Error }\delta}\;+\;\underbrace{\big|\mathcal{R}_{P\to T}-\mathcal{R}_P\big|}_{\text{Head Discrepancy }\gamma}. \tag{10}$$

$\delta$ measures how features differ through the same head $H_T$; $\gamma$ measures how heads differ on the same aligned features.

#### A.3 Step 2: bounding the feature alignment error $\delta$

##### Lipschitz property of cross\-entropy with respect to features\.

We bound $\delta$ by showing that the cross-entropy loss is Lipschitz continuous in the feature vector $z$. Define the end-to-end composition $g_T(z,y):=\ell(z\cdot H_T,y)$, treating the classification head $H_T$ and the cross-entropy loss as a single map from features to scalar loss. We bound $g_T$ via direct gradient analysis rather than the Lipschitz composition rule (loss-Lipschitz $\sqrt{2}$ times head operator norm $\|H_T\|_2$): the composition rule is agnostic to the softmax output and gives the spectral constant $\sqrt{2}\|H_T\|_2$ (Remark below), whereas the gradient $\nabla_z g_T=H_T(\hat{p}-e_y)$ exposes the predicted distribution $\hat{p}$ and lets us derive a constant set by pairwise token-embedding distances instead. Our goal is to find the tightest constant $K_{\mathrm{feat}}$ such that:

\|gT\(z1,y\)−gT\(z2,y\)\|≤Kfeat‖z1−z2‖2,∀z1,z2,y\.\|g\_\{T\}\(z\_\{1\},y\)\-g\_\{T\}\(z\_\{2\},y\)\|\\leq K\_\{\\mathrm\{feat\}\}\\\|z\_\{1\}\-z\_\{2\}\\\|\_\{2\},\\quad\\forall z\_\{1\},z\_\{2\},y\.\(11\)

##### Computing the gradient\.

The logit vector is $v=z\cdot H_T\in\mathbb{R}^V$, and the cross-entropy loss is $\ell(v,y)=-v_y+\log\sum_{j=1}^{V}e^{v_j}$. The gradient with respect to the logits is:

$$\frac{\partial\ell}{\partial v}=\hat{p}-e_y, \tag{12}$$

where $\hat{p}=\mathrm{softmax}(v)\in\mathbb{R}^V$ and $e_y$ is the one-hot vector at index $y$. Applying the chain rule gives the gradient with respect to features:

$$\nabla_z\ell=H_T(\hat{p}-e_y)=\sum_{j=1}^{V}(\hat{p}_j-\delta_{jy})\,h_{T,j}, \tag{13}$$

where $h_{T,j}$ denotes the $j$-th column of $H_T$ (the embedding vector for token $j$) and $\delta_{jy}$ is the Kronecker delta (not the feature alignment error $\delta$).
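As a sanity check on Eq. (13), the following minimal sketch compares the closed-form feature gradient against central finite differences. The dimensions, the random head `H`, and the helper names are illustrative assumptions, not part of the paper's code.

```python
import numpy as np

def cross_entropy(z, H, y):
    """Loss g(z, y) = -v_y + logsumexp(v) with logits v = z @ H."""
    v = z @ H
    m = v.max()
    return -v[y] + m + np.log(np.exp(v - m).sum())

def feature_gradient(z, H, y):
    """Closed form of Eq. (13): H (softmax(z H) - e_y)."""
    v = z @ H
    p = np.exp(v - v.max())
    p /= p.sum()
    p[y] -= 1.0                      # p now holds softmax(v) - e_y
    return H @ p

rng = np.random.default_rng(0)
d, V, y = 6, 11, 3                   # toy sizes, chosen arbitrarily
H = rng.normal(size=(d, V))          # columns play the role of token embeddings
z = rng.normal(size=d)

analytic = feature_gradient(z, H, y)
eps = 1e-6
basis = np.eye(d)
numeric = np.array([
    (cross_entropy(z + eps * basis[i], H, y)
     - cross_entropy(z - eps * basis[i], H, y)) / (2 * eps)
    for i in range(d)
])
max_abs_err = float(np.abs(analytic - numeric).max())
```

The two gradients should agree up to finite-difference error, which is the only property this sketch is meant to illustrate.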

##### Simplex polarization\.

We now derive a tight bound on $\|\nabla_z\ell\|_2$ by exploiting the probability simplex constraint $\sum_j\hat{p}_j=1$. Expanding the gradient for true class $y$:

$$\nabla_z\ell=(\hat{p}_y-1)\,h_{T,y}+\sum_{j\neq y}\hat{p}_j\,h_{T,j}. \tag{14}$$

Since $1-\hat{p}_y=\sum_{j\neq y}\hat{p}_j$, we can rewrite:

$$\nabla_z\ell=-\Big(\sum_{j\neq y}\hat{p}_j\Big)h_{T,y}+\sum_{j\neq y}\hat{p}_j\,h_{T,j}=\sum_{j\neq y}\hat{p}_j\,(h_{T,j}-h_{T,y}). \tag{15}$$

This reveals that the gradient is a *non-negative linear combination* of pairwise differences between incorrect-class embeddings and the correct-class embedding, with weights $\hat{p}_j$ summing to $1-\hat{p}_y\le 1$. Taking the norm:

$$\|\nabla_z\ell\|_2\le\sum_{j\neq y}\hat{p}_j\,\|h_{T,j}-h_{T,y}\|_2\le\max_{j\neq y}\|h_{T,j}-h_{T,y}\|_2\cdot\underbrace{\sum_{j\neq y}\hat{p}_j}_{\le\,1}. \tag{16}$$

The bound depends on the true class $y$ only through $\max_{j\neq y}\|h_{T,j}-h_{T,y}\|_2$. Taking the supremum over $y\in\{1,\ldots,V\}$ extends the maximum to all ordered pairs $(j,y)$ with $j\neq y$ (the $j=y$ case contributes $0$), giving the full pairwise diameter:

$$K_{\mathrm{feat}}=\max_{j,k}\|h_{T,j}-h_{T,k}\|_2. \tag{17}$$
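To make Eq. (17) concrete, the sketch below computes the pairwise diameter of a small random head and checks that sampled feature-gradient norms stay below it, alongside the composition-rule constant $\sqrt{2}\,\|H_T\|_2$ for comparison. The head, sizes, and helper names are hypothetical; the dense $V\times V$ distance matrix is only practical for toy vocabularies, and a real vocabulary would need a blocked computation.

```python
import numpy as np

def k_feat(H):
    """Pairwise diameter of the token embeddings (columns of H), Eq. (17)."""
    E = H.T                                        # (V, d): one row per token
    sq = (E ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * E @ E.T
    return float(np.sqrt(np.maximum(d2, 0.0).max()))

def feature_gradient(z, H, y):
    """H (softmax(z H) - e_y), the gradient of Eq. (13)."""
    v = z @ H
    p = np.exp(v - v.max())
    p /= p.sum()
    p[y] -= 1.0
    return H @ p

rng = np.random.default_rng(1)
d, V = 8, 50                                       # toy sizes
H = rng.normal(size=(d, V))
K = k_feat(H)
K_naive = float(np.sqrt(2.0) * np.linalg.norm(H, 2))   # sqrt(2) times the spectral norm

# Largest gradient norm observed over random features and labels.
worst = max(
    float(np.linalg.norm(feature_gradient(3.0 * rng.normal(size=d), H, int(rng.integers(V)))))
    for _ in range(2000)
)
```

On any head, `worst <= K <= K_naive` must hold; how large the gap between `K` and `K_naive` is depends on the embeddings.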

##### Remark\.

This bound depends on the *relative distances* between token embeddings, not their absolute magnitudes: a uniform shift $H_T\to H_T+c\mathbf{1}^\top$, which adds the same vector $c\in\mathbb{R}^d$ to every token embedding, does not change $K_{\mathrm{feat}}$. A naive Cauchy–Schwarz bound gives $K_{\mathrm{feat}}^{\mathrm{naive}}=\sqrt{2}\,\|H_T\|_2$, which is never tighter (since $\|h_{T,j}-h_{T,k}\|_2=\|H_T(e_j-e_k)\|_2\le\sqrt{2}\,\|H_T\|_2$) and is substantially looser in practice. Since the decomposition of Eq. ([10](https://arxiv.org/html/2605.11608#A1.E10)) routes through $\mathcal{R}_{P\to T}$ (which uses $H_T$), $K_{\mathrm{feat}}$ is determined by $H_T$ alone; it takes a single value across all proxy variants of a fixed target model and enters as a constant scaling on the feature term, not as a proxy-dependent quantity that could perturb variant ranking.

##### From gradient bound to Lipschitz inequality\.

Since $g_T(\cdot,y)$ is continuously differentiable, the gradient bound $\|\nabla_z g_T(z,y)\|_2\le K_{\mathrm{feat}}$ implies the Lipschitz inequality stated above by the mean value theorem in integral form: for any $z_1,z_2$,

$$g_T(z_2,y)-g_T(z_1,y)=\int_0^1\nabla_z g_T\big(z_1+t(z_2-z_1),\,y\big)^\top(z_2-z_1)\,dt,$$

and Cauchy–Schwarz gives $|g_T(z_2,y)-g_T(z_1,y)|\le K_{\mathrm{feat}}\,\|z_2-z_1\|_2$.

##### Applying the Lipschitz bound\.

Using the Lipschitz property with $z_1=\phi_T(x)$ and $z_2=\phi_P(x)W$:

$$\begin{aligned}
\delta&=\big|\mathcal{R}_T-\mathcal{R}_{P\to T}\big|=\Big|\mathbb{E}\big[g_T(\phi_T(x),y)-g_T(\phi_P(x)W,y)\big]\Big|\\
&\le\mathbb{E}\Big[\big|g_T(\phi_T(x),y)-g_T(\phi_P(x)W,y)\big|\Big]\\
&\le K_{\mathrm{feat}}\cdot\mathbb{E}_x\big[\|\phi_T(x)-\phi_P(x)W\|_2\big].
\end{aligned} \tag{18}$$

Applying Jensen's inequality ($\mathbb{E}[\|X\|]\le\sqrt{\mathbb{E}[\|X\|^2]}$) to connect this to the alignment residual, with the expectation taken over the $n$ samples whose features form the rows of $Z_T$ and $Z_P$:

$$\delta\le K_{\mathrm{feat}}\cdot\sqrt{\mathbb{E}_x\big[\|\phi_T(x)-\phi_P(x)W\|_2^2\big]}=K_{\mathrm{feat}}\cdot\sqrt{\frac{1}{n}\|Z_T-Z_PW\|_F^2}. \tag{19}$$
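The chain of Eqs. (18) and (19) can be checked numerically on synthetic data. The sketch below is illustrative only: the Gaussian features, the perturbation level, and the helper names `ce_rows` and `k_feat` are assumptions, and the alignment is fixed to $W=I$.

```python
import numpy as np

def ce_rows(Z, H, y):
    """Per-row cross-entropy for features Z (n, d), head H (d, V), labels y (n,)."""
    v = Z @ H
    m = v.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(v - m).sum(axis=1))
    return lse - v[np.arange(len(y)), y]

def k_feat(H):
    """Pairwise diameter of the columns of H, Eq. (17)."""
    E = H.T
    sq = (E ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * E @ E.T
    return float(np.sqrt(np.maximum(d2, 0.0).max()))

rng = np.random.default_rng(2)
n, d, V = 400, 8, 30
H_T = rng.normal(size=(d, V))
Z_T = rng.normal(size=(n, d))
Z_P = Z_T + 0.1 * rng.normal(size=(n, d))        # hypothetical mildly perturbed proxy
y = rng.integers(V, size=n)
W = np.eye(d)                                     # identity alignment

# delta = |R_T - R_{P->T}|: both risks use the target head H_T.
delta = abs(ce_rows(Z_T, H_T, y).mean() - ce_rows(Z_P @ W, H_T, y).mean())
resid = Z_T - Z_P @ W
mean_norm = float(np.linalg.norm(resid, axis=1).mean())   # E||.||_2, last line of Eq. (18)
rms_norm = float(np.sqrt((resid ** 2).sum() / n))         # square root in Eq. (19)
K = k_feat(H_T)
```

By construction `delta <= K * mean_norm <= K * rms_norm`; the second inequality is the Jensen step.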

##### Empirical values\.

Table [4](https://arxiv.org/html/2605.11608#A1.T4) reports $K_{\mathrm{feat}}$ for all 8B models studied here, ranging from $0.93$ (Mistral family) to $3.46$ (Qwen3-Base); $K_{\mathrm{pred}}=\sqrt{2}$ holds universally. These values are bounded by pairwise token-embedding distances and remain independent of $V$ (Llama $V{=}128{,}256$, Qwen3 $V{=}151{,}936$), in contrast to the naive bound $\sqrt{2}\,\|H_T\|_2$, whose spectral norm grows with vocabulary size.

Table 4: Empirical Lipschitz constants of the PRISM bound (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1)) for each evaluated model. $K_{\mathrm{pred}}=\sqrt{2}$ holds universally (Appendix [A.5](https://arxiv.org/html/2605.11608#A1.SS5), simplex polarization), independent of model. $K_{\mathrm{feat}}$ is empirical per model, depending on the spread of token embeddings in $H_T$; for the 8B-scale models studied here, $K_{\mathrm{feat}}$ ranges from $\approx 0.93$ (Mistral family) to $\approx 3.46$ (Qwen3-Base). Since $K_{\mathrm{feat}}$ is a constant per model, it scales the bound's magnitude but does not affect the within-model rank correlation that PRISM is calibrated to.

#### A.4 Step 3: exact geometric decomposition

We derive the exact identity linking the alignment residual to scale and shape mismatches. The identity holds for every $W\in\mathcal{O}(d)$; we then specialize to the two cases relevant to this paper.

##### General identity\.

For any $W\in\mathcal{O}(d)$, expand the squared Frobenius norm:

$$\begin{aligned}
\|Z_T-Z_PW\|_F^2&=\operatorname{Tr}\big[(Z_T-Z_PW)^\top(Z_T-Z_PW)\big]\\
&=\|Z_T\|_F^2+\|Z_PW\|_F^2-2\operatorname{Tr}(Z_T^\top Z_PW).
\end{aligned} \tag{20}$$

Since $W$ is orthogonal, $\|Z_PW\|_F^2=\operatorname{Tr}(W^\top Z_P^\top Z_PW)=\operatorname{Tr}(Z_P^\top Z_P)=\|Z_P\|_F^2$ (cyclic trace and $WW^\top=I_d$). Therefore:

$$\|Z_T-Z_PW\|_F^2=\|Z_T\|_F^2+\|Z_P\|_F^2-2\operatorname{Tr}(Z_T^\top Z_PW). \tag{21}$$

Recall $\rho_M=\|Z_M\|_F/\sqrt{n}$ (so $\|Z_M\|_F^2/n=\rho_M^2$) and $\Omega_W:=\operatorname{Tr}(Z_T^\top Z_PW)/(\|Z_T\|_F\|Z_P\|_F)$ (main-text Eq. [3](https://arxiv.org/html/2605.11608#S3.E3)). Dividing Eq. ([21](https://arxiv.org/html/2605.11608#A1.E21)) by $n$:

$$\frac{1}{n}\|Z_T-Z_PW\|_F^2=\rho_T^2+\rho_P^2-2\rho_T\rho_P\,\Omega_W. \tag{22}$$

Adding and subtracting $2\rho_T\rho_P$ completes the square:

$$\frac{1}{n}\|Z_T-Z_PW\|_F^2=\underbrace{(\rho_T-\rho_P)^2}_{\text{Scale Mismatch}}+\underbrace{2\rho_T\rho_P(1-\Omega_W)}_{\text{Shape Mismatch}}. \tag{23}$$

This is Proposition [1](https://arxiv.org/html/2605.11608#Thmproposition1) in its general form. ∎
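Because Eq. (23) is an exact identity, it can be verified to machine precision for an arbitrary orthogonal $W$. The following sketch does so on synthetic matrices; the function name `scale_shape` and the test data are assumptions made for illustration.

```python
import numpy as np

def scale_shape(Z_T, Z_P, W):
    """Return (scale mismatch, shape mismatch, Omega_W) as defined in Eq. (23)."""
    n = Z_T.shape[0]
    norm_T, norm_P = np.linalg.norm(Z_T), np.linalg.norm(Z_P)   # Frobenius norms
    rho_T, rho_P = norm_T / np.sqrt(n), norm_P / np.sqrt(n)
    omega = np.trace(Z_T.T @ Z_P @ W) / (norm_T * norm_P)
    scale = (rho_T - rho_P) ** 2
    shape = 2.0 * rho_T * rho_P * (1.0 - omega)
    return float(scale), float(shape), float(omega)

rng = np.random.default_rng(3)
n, d = 200, 5
Z_T = rng.normal(size=(n, d))
Z_P = 1.3 * Z_T + 0.2 * rng.normal(size=(n, d))    # hypothetical rescaled, noisy proxy
W, _ = np.linalg.qr(rng.normal(size=(d, d)))        # a random orthogonal matrix

scale, shape, omega = scale_shape(Z_T, Z_P, W)
lhs = float(np.linalg.norm(Z_T - Z_P @ W) ** 2 / n)  # left-hand side of Eq. (23)
```

Here `lhs` and `scale + shape` agree up to rounding, regardless of which orthogonal `W` is drawn.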

##### Specialization 1: $W=I$ (identity alignment, main text).

Setting $W=I$ gives the trace form $\Omega\equiv\Omega_{W=I}=\operatorname{Tr}(Z_T^\top Z_P)/(\|Z_T\|_F\|Z_P\|_F)$, and Eq. ([23](https://arxiv.org/html/2605.11608#A1.E23)) becomes

$$\frac{1}{n}\|Z_T-Z_P\|_F^2=(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega), \tag{24}$$

which is the form used throughout the main-text experiments and the regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5).

##### Specialization 2: $W=W_N$ (Procrustes-optimal, ablation).

The alignment that minimizes the residual is the solution of the classical orthogonal Procrustes problem [[33](https://arxiv.org/html/2605.11608#bib.bib33),[34](https://arxiv.org/html/2605.11608#bib.bib34)]:

$$W_N:=\arg\min_{W\in\mathcal{O}(d)}\|Z_T-Z_PW\|_F^2=\arg\max_{W\in\mathcal{O}(d)}\operatorname{Tr}(Z_T^\top Z_PW), \tag{25}$$

where the second equality follows from Eq. ([21](https://arxiv.org/html/2605.11608#A1.E21)) since the first two terms are $W$-independent. Its closed-form SVD solution is well known; we restate it here for completeness. Let $Z_T^\top Z_P=U\Sigma V^\top$ be the SVD with $U,V\in\mathcal{O}(d)$ and $\Sigma=\mathrm{diag}(\sigma_1,\ldots,\sigma_d)$, $\sigma_i\ge 0$. By cyclic trace:

$$\operatorname{Tr}(Z_T^\top Z_PW)=\operatorname{Tr}(\Sigma\,V^\top WU)=\operatorname{Tr}(\Sigma R),\quad R:=V^\top WU\in\mathcal{O}(d). \tag{26}$$

Since $\sigma_i\ge 0$ and $|R_{ii}|\le 1$ for any orthogonal $R$, $\operatorname{Tr}(\Sigma R)=\sum_i\sigma_iR_{ii}$ is maximized at $R=I_d$, giving:

$$W_N=VU^\top,\qquad\max_W\operatorname{Tr}(Z_T^\top Z_PW)=\sum_i\sigma_i=\|Z_T^\top Z_P\|_*. \tag{27}$$

Therefore $\Omega_{W_N}=\Omega_N:=\|Z_T^\top Z_P\|_*/(\|Z_T\|_F\|Z_P\|_F)$ (nuclear form, Appendix [B](https://arxiv.org/html/2605.11608#A2) Eq. [37](https://arxiv.org/html/2605.11608#A2.E37)), and Eq. ([23](https://arxiv.org/html/2605.11608#A1.E23)) at $W=W_N$ becomes

$$\min_{W\in\mathcal{O}(d)}\frac{1}{n}\|Z_T-Z_PW\|_F^2=(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega_N), \tag{28}$$

the form underlying the Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5) ablation.
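The closed-form solution of Eq. (27) takes a few lines with `numpy.linalg.svd`. The sketch below builds a toy proxy that is a rotated, noisy copy of the target (an assumption made purely for illustration) and compares the trace form at $W=I$ with the nuclear form at $W=W_N$.

```python
import numpy as np

def procrustes(Z_T, Z_P):
    """Return W_N = V U^T and the singular values of Z_T^T Z_P = U S V^T (Eq. 27)."""
    U, s, Vt = np.linalg.svd(Z_T.T @ Z_P)
    return Vt.T @ U.T, s

rng = np.random.default_rng(4)
n, d = 300, 6
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))        # hidden rotation (toy setup)
Z_T = rng.normal(size=(n, d))
Z_P = Z_T @ Q + 0.05 * rng.normal(size=(n, d))      # proxy = rotated target + noise

W_N, s = procrustes(Z_T, Z_P)
norms = np.linalg.norm(Z_T) * np.linalg.norm(Z_P)
omega_I = float(np.trace(Z_T.T @ Z_P) / norms)            # trace form at W = I
omega_N = float(s.sum() / norms)                          # nuclear form
omega_at_WN = float(np.trace(Z_T.T @ Z_P @ W_N) / norms)  # Omega_W evaluated at W_N

res_I = float(np.linalg.norm(Z_T - Z_P) ** 2 / n)
res_N = float(np.linalg.norm(Z_T - Z_P @ W_N) ** 2 / n)
```

In this rotated setup `res_N` is much smaller than `res_I`, and `omega_at_WN` coincides with `omega_N`, which is the content of Eqs. (27) and (28).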

##### Assembling the feature bound\.

Combining Eq. ([19](https://arxiv.org/html/2605.11608#A1.E19)) with the general decomposition Eq. ([23](https://arxiv.org/html/2605.11608#A1.E23)):

$$\delta\le K_{\mathrm{feat}}\sqrt{(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega_W)}. \tag{29}$$

At $W=I$, replace $\Omega_W$ with $\Omega$; at $W=W_N$, with $\Omega_N$.

#### A.5 Step 4: bounding the head discrepancy $\gamma$

##### Lipschitz property with respect to logits.

The head discrepancy measures how much predictions differ when two different “effective heads” ($WH_T$ vs. $H_P$) operate on the same proxy features. Let $z=\phi_P(x)$ denote the (unaligned) proxy feature. Then:

$$\gamma=\big|\mathcal{R}_{P\to T}-\mathcal{R}_P\big|=\Big|\mathbb{E}\big[\ell(z\cdot WH_T,y)-\ell(z\cdot H_P,y)\big]\Big|. \tag{30}$$

To bound this, we use the Lipschitz property of cross-entropy with respect to *logits*. The gradient of $\ell$ with respect to $v$ is $\nabla_v\ell=\hat{p}-e_y$, with norm:

$$\|\nabla_v\ell\|_2=\|\hat{p}-e_y\|_2. \tag{31}$$

Since $\hat{p}$ lies on the probability simplex, $\|\hat{p}\|_2\le\|\hat{p}\|_1=1$, hence

$$\|\hat{p}-e_y\|_2^2\;=\;\|\hat{p}\|_2^2-2\hat{p}_y+1\;\le\;1-2\hat{p}_y+1\;=\;2(1-\hat{p}_y)\;\le\;2.$$

The supremum $\sqrt{2}$ is approached as the model's predicted distribution concentrates on an incorrect class ($\hat{p}_y\to 0$). Thus $K_{\mathrm{pred}}\le\sqrt{2}$, yielding the Lipschitz inequality $|\ell(v_1,y)-\ell(v_2,y)|\le K_{\mathrm{pred}}\,\|v_1-v_2\|_2$ on logits.
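The bound $\|\hat{p}-e_y\|_2\le\sqrt{2(1-\hat{p}_y)}\le\sqrt{2}$ is easy to probe numerically. The sketch below samples random logits and also constructs a near-worst case in which almost all probability mass sits on one wrong class; the vocabulary size and logit scales are arbitrary illustrative choices.

```python
import numpy as np

def logit_grad_norm(v, y):
    """Return (||softmax(v) - e_y||_2, p_y), cf. Eq. (31)."""
    p = np.exp(v - v.max())
    p /= p.sum()
    p_y = float(p[y])
    p[y] -= 1.0
    return float(np.linalg.norm(p)), p_y

rng = np.random.default_rng(5)
V = 20
all_within_bound = True
for _ in range(1000):
    v = 5.0 * rng.normal(size=V)
    y = int(rng.integers(V))
    g, p_y = logit_grad_norm(v, y)
    all_within_bound = all_within_bound and g <= np.sqrt(2.0 * (1.0 - p_y)) + 1e-12

# Near-worst case: true class 0, but nearly all mass on class 1.
v_bad = np.zeros(V)
v_bad[1] = 40.0
g_bad, _ = logit_grad_norm(v_bad, 0)
```

`g_bad` sits just below $\sqrt{2}$, consistent with the supremum being approached but not exceeded.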

##### Deriving the covariance projection\.

Define the head misalignment matrix $\Delta H:=WH_T-H_P\in\mathbb{R}^{d\times V}$. The logit difference for a single sample is $z\cdot\Delta H=\phi_P(x)(WH_T-H_P)$, and:

$$\begin{aligned}
\gamma&\le\mathbb{E}\big[|\ell(z\cdot WH_T,y)-\ell(z\cdot H_P,y)|\big]\\
&\le K_{\mathrm{pred}}\cdot\mathbb{E}_z\big[\|z\cdot\Delta H\|_2\big].
\end{aligned} \tag{32}$$

Applying Jensen's inequality on the concave square root:

$$\big(\mathbb{E}[\|z\cdot\Delta H\|_2]\big)^2\le\mathbb{E}\big[\|z\cdot\Delta H\|_2^2\big]. \tag{33}$$

Expanding the squared norm using the trace operator:

$$\begin{aligned}
\mathbb{E}\big[\|z\cdot\Delta H\|_2^2\big]&=\mathbb{E}\Big[\operatorname{Tr}\big(\Delta H^\top z^\top z\,\Delta H\big)\Big]\\
&=\operatorname{Tr}\Big(\Delta H^\top\underbrace{\mathbb{E}[z^\top z]}_{\Sigma_P}\Delta H\Big)\\
&=\|\Sigma_P^{1/2}\Delta H\|_F^2.
\end{aligned} \tag{34}$$

Taking the square root:

$$\gamma\le K_{\mathrm{pred}}\cdot\|\Sigma_P^{1/2}(WH_T-H_P)\|_F. \tag{35}$$

##### Interpretation\.

The covariance weighting $\Sigma_P^{1/2}$ projects the head error onto the *active subspace* of the data. If the heads disagree only in directions orthogonal to the feature distribution (the null space of $\Sigma_P$), this disagreement has zero cost: it does not affect any prediction. More generally, head misalignment in directions the features rarely traverse contributes little. This is sharper than the loose spectral bound $\gamma\le K_{\mathrm{pred}}\,\|WH_T-H_P\|_2\cdot\mathbb{E}[\|z\|_2]$, which conservatively assumes worst-case alignment between features and head error.
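The sketch below illustrates both points on synthetic data: the identity of Eq. (34), and the fact that a head error confined to $\mathrm{null}(\Sigma_P)$ costs nothing under the covariance-weighted term while the spectral bound remains positive. It assumes the empirical, uncentered second moment $\Sigma_P=Z_P^\top Z_P/n$ and uses toy features whose last two coordinates are zeroed out; all names are hypothetical.

```python
import numpy as np

def head_term(Z_P, dH):
    """||Sigma_P^{1/2} dH||_F with Sigma_P = Z_P^T Z_P / n (uncentered second moment)."""
    Sigma = Z_P.T @ Z_P / Z_P.shape[0]
    w, Q = np.linalg.eigh(Sigma)
    root = (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T   # symmetric PSD square root
    return float(np.linalg.norm(root @ dH))

rng = np.random.default_rng(6)
n, d, V = 500, 6, 12
Z_P = rng.normal(size=(n, d))
Z_P[:, -2:] = 0.0                        # toy features that never use the last two directions
dH = rng.normal(size=(d, V))             # a generic head error

via_sigma = head_term(Z_P, dH)
via_samples = float(np.sqrt(((Z_P @ dH) ** 2).sum() / n))   # sqrt of E||z dH||_2^2

dH_null = np.zeros((d, V))
dH_null[-2:] = rng.normal(size=(2, V))   # head error confined to null(Sigma_P)
null_cost = head_term(Z_P, dH_null)
spectral = float(np.linalg.norm(dH_null, 2) * np.linalg.norm(Z_P, axis=1).mean())
```

`null_cost` is zero up to rounding, whereas `spectral` is clearly positive for the same head error.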

#### A.6 Step 5: assembling the unified bound

For any $W\in\mathcal{O}(d)$, combining the risk decomposition (Eq. [10](https://arxiv.org/html/2605.11608#A1.E10)), the feature bound (Eq. [29](https://arxiv.org/html/2605.11608#A1.E29)), and the head bound (Eq. [35](https://arxiv.org/html/2605.11608#A1.E35)):

$$|\mathcal{R}_T-\mathcal{R}_P|\;\le\;\delta+\gamma\;\le\;K_{\mathrm{feat}}\sqrt{(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega_W)}\;+\;K_{\mathrm{pred}}\cdot\|\Sigma_P^{1/2}(WH_T-H_P)\|_F. \tag{36}$$

This is Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) in its general form. The two specializations used in the paper:

- $W=I$ (identity alignment, main text): $\Omega_W=\Omega$ (trace form) and $WH_T-H_P=H_T-H_P$, giving the bound used throughout Sec. [5](https://arxiv.org/html/2605.11608#S5) and the regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5).
- $W=W_N$ (Procrustes-optimal, ablation): $\Omega_W=\Omega_N$ (nuclear form, achieves the tightest feature-side specialization), giving the bound underlying the Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5) ablation.

This completes the proof of Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1). ∎
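Putting the pieces together, the right-hand side of Eq. (36) can be evaluated directly from two feature matrices and two heads. The sketch below is not the authors' implementation: `prism_bound`, the synthetic data, and the use of empirical (sample-average) risks are assumptions, and $K_{\mathrm{pred}}$ is set to its upper bound $\sqrt{2}$.

```python
import numpy as np

def ce_mean(Z, H, y):
    """Empirical cross-entropy risk for features Z, head H, labels y."""
    v = Z @ H
    m = v.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(v - m).sum(axis=1))
    return float((lse - v[np.arange(len(y)), y]).mean())

def prism_bound(Z_T, Z_P, H_T, H_P, W=None):
    """Evaluate the right-hand side of Eq. (36) and its three axes."""
    n, d = Z_T.shape
    W = np.eye(d) if W is None else W
    norm_T, norm_P = np.linalg.norm(Z_T), np.linalg.norm(Z_P)
    rho_T, rho_P = norm_T / np.sqrt(n), norm_P / np.sqrt(n)
    omega = np.trace(Z_T.T @ Z_P @ W) / (norm_T * norm_P)
    scale = (rho_T - rho_P) ** 2
    shape = 2.0 * rho_T * rho_P * (1.0 - omega)
    E = H_T.T                                     # token embeddings as rows
    sq = (E ** 2).sum(axis=1)
    k_feat = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * E @ E.T, 0.0).max())
    # For the empirical Sigma_P, ||Sigma_P^{1/2} dH||_F equals ||Z_P dH||_F / sqrt(n).
    head = np.linalg.norm(Z_P @ (W @ H_T - H_P)) / np.sqrt(n)
    total = k_feat * np.sqrt(max(scale + shape, 0.0)) + np.sqrt(2.0) * head
    return {"scale": float(scale), "shape": float(shape),
            "head": float(head), "bound": float(total)}

rng = np.random.default_rng(7)
n, d, V = 400, 8, 25
H_T = rng.normal(size=(d, V))
H_P = H_T + 0.02 * rng.normal(size=(d, V))       # hypothetical slightly perturbed head
Z_T = rng.normal(size=(n, d))
Z_P = 0.9 * Z_T + 0.1 * rng.normal(size=(n, d))  # hypothetical shrunk, noisy features
y = rng.integers(V, size=n)

axes = prism_bound(Z_T, Z_P, H_T, H_P)
gap = abs(ce_mean(Z_T, H_T, y) - ce_mean(Z_P, H_P, y))
```

The observed risk gap `gap` never exceeds `axes["bound"]`, and comparing `axes["scale"]`, `axes["shape"]`, and `axes["head"]` indicates which axis dominates for a given variant.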

### Appendix B Tightness of nuclear norm over Frobenius norm

We define the nuclear-form and Frobenius-form Procrustes similarities, show that the former yields an alignment residual that is never looser than the latter (and strictly tighter outside the rank-one case), and clarify the relationship to CKA.

##### Definitions.

The *nuclear-form* Procrustes similarity is

$$\Omega_N(Z_T,Z_P)\;:=\;\frac{\|Z_T^\top Z_P\|_*}{\|Z_T\|_F\|Z_P\|_F}\;=\;\max_{W\in\mathcal{O}(d)}\frac{\operatorname{Tr}(Z_T^\top Z_P\,W)}{\|Z_T\|_F\|Z_P\|_F}, \tag{37}$$

the maximum of the trace family $\Omega_W$ (Sec. [3.2](https://arxiv.org/html/2605.11608#S3.SS2)) over orthogonal alignments, attained at the Procrustes-optimal $W_N$ (Appendix [A](https://arxiv.org/html/2605.11608#A1)). The *Frobenius-form* similarity is $\Omega_F:=\|Z_T^\top Z_P\|_F/(\|Z_T\|_F\|Z_P\|_F)$; unlike $\Omega_W$, it does not arise from the alignment residual at any $W$ and is treated here as an external comparison object connecting to CKA.

##### Statement\.

For any feature matrices $Z_T,Z_P$:

$$\Omega_F\le\Omega_N, \tag{38}$$

with equality if and only if the cross-moment matrix $Z_T^\top Z_P$ has at most one nonzero singular value (i.e., rank $\le 1$).

##### Proof.

Let $\sigma_1,\ldots,\sigma_d$ be the singular values of $Z_T^\top Z_P$. Then:

$$\|Z_T^\top Z_P\|_F=\sqrt{\sum_i\sigma_i^2},\qquad\|Z_T^\top Z_P\|_*=\sum_i\sigma_i. \tag{39}$$

Since $\sigma_i\ge 0$, the basic inequality $\sum_i\sigma_i^2\le\big(\sum_i\sigma_i\big)^2$ (cross-terms are non-negative) gives

$$\|Z_T^\top Z_P\|_F=\sqrt{\sum_i\sigma_i^2}\le\sum_i\sigma_i=\|Z_T^\top Z_P\|_*, \tag{40}$$

i.e., $\|\mathbf{x}\|_2\le\|\mathbf{x}\|_1$ for any non-negative vector. Dividing both sides by $\|Z_T\|_F\|Z_P\|_F$ gives $\Omega_F\le\Omega_N$.
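Both the inequality of Eq. (38) and its equality case can be reproduced from the singular values of the cross-moment matrix. In the sketch below, the generic pair and the rank-one construction (two outer products sharing a sample profile) are illustrative assumptions.

```python
import numpy as np

def omega_F_and_N(Z_T, Z_P):
    """Frobenius-form and nuclear-form similarities from the singular values."""
    s = np.linalg.svd(Z_T.T @ Z_P, compute_uv=False)
    denom = np.linalg.norm(Z_T) * np.linalg.norm(Z_P)
    return float(np.sqrt((s ** 2).sum()) / denom), float(s.sum() / denom)

rng = np.random.default_rng(8)
n, d = 100, 5
Z_T = rng.normal(size=(n, d))
Z_P = Z_T + 0.3 * rng.normal(size=(n, d))
oF, oN = omega_F_and_N(Z_T, Z_P)            # generic case: full-rank cross-moment

# Rank-one cross-moment: both matrices are outer products with the same column profile a.
a = rng.normal(size=(n, 1))
R_T = a @ rng.normal(size=(1, d))
R_P = a @ rng.normal(size=(1, d))
oF1, oN1 = omega_F_and_N(R_T, R_P)
```

In the generic case `oF < oN`, while in the rank-one case the two similarities coincide up to rounding.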

##### Consequence for risk bounding\.

$\Omega_F$ does not arise from the alignment residual $\|Z_T-Z_PW\|_F^2$ at any $W\in\mathcal{O}(d)$ (which produces only trace-based similarities $\Omega_W$); we treat it as an external comparison object. Substituting $\Omega_F$ into the bound formula in place of $\Omega_W$ yields

$$(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega_F)\ge(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega_N), \tag{41}$$

i.e., a value that is never smaller than the alignment-derived nuclear-form term and is *strictly looser* whenever $Z_T^\top Z_P$ has rank at least $2$. Among the alignment-derived members $\{\Omega_W\}$, the nuclear form $\Omega_N$ is the tightest because it exactly solves $\max_{W\in\mathcal{O}(d)}\operatorname{Tr}(Z_T^\top Z_PW)$; the trace form $\Omega$ used in the main text (Eq. [3](https://arxiv.org/html/2605.11608#S3.E3)) satisfies $\Omega\le\Omega_N$ in general, with equality when $Z_T^\top Z_P$ is symmetric PSD (Appendix [C.2](https://arxiv.org/html/2605.11608#A3.SS2)).

##### Relation to CKA\.

Linear CKA [[8](https://arxiv.org/html/2605.11608#bib.bib8)] is defined as $\mathrm{CKA}(Z_T,Z_P)=\|Z_T^\top Z_P\|_F^2/(\|Z_T^\top Z_T\|_F\cdot\|Z_P^\top Z_P\|_F)$, which differs from $\Omega_F^2=\|Z_T^\top Z_P\|_F^2/(\|Z_T\|_F^2\|Z_P\|_F^2)$ in its denominator. The two denominators are related by the inequality $\|Z_M^\top Z_M\|_F\le\|Z_M\|_F^2=\operatorname{Tr}(Z_M^\top Z_M)$ (since $\|A\|_F\le\operatorname{Tr}(A)$ for any positive semidefinite $A$, by $\|\mathbf{x}\|_2\le\|\mathbf{x}\|_1$ on the non-negative eigenvalues), which gives $\mathrm{CKA}\ge\Omega_F^2$. Hence CKA can overestimate alignment relative to $\Omega_F^2$, and it is not directly substitutable into our Procrustes residual. More fundamentally, CKA is not derived from any alignment optimization and therefore lacks the alignment-grounded meaning of $\Omega_N$, which exactly solves the orthogonal Procrustes problem.
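The relation $\mathrm{CKA}\ge\Omega_F^2$ follows from the two denominators alone, which the sketch below checks on synthetic features. The column-centering step reflects the usual CKA convention and is an assumption here; the inequality itself does not depend on it.

```python
import numpy as np

def linear_cka(Z_T, Z_P):
    """Linear CKA in the form quoted above (features assumed already centered)."""
    num = np.linalg.norm(Z_T.T @ Z_P) ** 2
    return float(num / (np.linalg.norm(Z_T.T @ Z_T) * np.linalg.norm(Z_P.T @ Z_P)))

def omega_F_sq(Z_T, Z_P):
    """Squared Frobenius-form similarity Omega_F^2."""
    num = np.linalg.norm(Z_T.T @ Z_P) ** 2
    return float(num / (np.linalg.norm(Z_T) ** 2 * np.linalg.norm(Z_P) ** 2))

rng = np.random.default_rng(9)
n, d = 150, 6
Z_T = rng.normal(size=(n, d))
Z_P = Z_T + 0.5 * rng.normal(size=(n, d))     # hypothetical noisy proxy
Z_T = Z_T - Z_T.mean(axis=0)                  # column-centering (common CKA convention)
Z_P = Z_P - Z_P.mean(axis=0)

cka = linear_cka(Z_T, Z_P)
of2 = omega_F_sq(Z_T, Z_P)
```

For near-isotropic features like these, `cka` is several times larger than `of2`, which shows why the two scores are not interchangeable inside the residual formula.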

### Appendix C Extension to general orthogonal alignments

The main text (Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1)) states the bound at the identity alignment $W=I$. Here we lift Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) to arbitrary orthogonal $W$, characterize the $W=I$ trace and $W=W_N$ nuclear specializations (the latter used in the Sec. [5.5](https://arxiv.org/html/2605.11608#S5.SS5) ablation), and treat the joint optimization over feature and head terms. The two specializations encode different priorities: $W_N$ minimizes the feature alignment residual but generally inflates the head term, while $W=I$ keeps the head term at its minimum when the head is preserved (vanishing exactly under a frozen `lm_head`) and is differentiable without an SVD. The LLM settings studied here, PTQ with an FP16 head and frozen-head LoRA, require both properties.

#### C.1 The general bound for arbitrary $W$

The exact decomposition derived in Appendix [A.4](https://arxiv.org/html/2605.11608#A1.SS4) (Eq. [23](https://arxiv.org/html/2605.11608#A1.E23)) holds for every $W\in\mathcal{O}(d)$ via the alignment-dependent similarity $\Omega_W$ of Eq. [3](https://arxiv.org/html/2605.11608#S3.E3), yielding a family of risk bounds parameterized by $W$:

$$|\mathcal{R}_T-\mathcal{R}_P|\;\le\;\underbrace{K_{\mathrm{feat}}\sqrt{(\rho_T-\rho_P)^2+2\rho_T\rho_P(1-\Omega_W)}}_{\delta(W)}+\underbrace{K_{\mathrm{pred}}\,\|\Sigma_P^{1/2}(WH_T-H_P)\|_F}_{\gamma(W)}. \tag{42}$$

The Procrustes solution $W_N$ minimizes $\delta(W)$ alone, yielding $\Omega_{W_N}=\Omega_N$ (the nuclear-norm similarity, Eq. [37](https://arxiv.org/html/2605.11608#A2.E37)). However, $W_N$ also enters the head term $\gamma(W_N)$, which can be inflated when the rotation is applied to a nearly preserved head ($H_T\approx H_P$). This creates a fundamental trade-off: $W_N$ gives the tightest feature bound but may worsen the head bound, while $W=I$ preserves head coherence at the cost of a slightly looser feature bound.
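The trade-off can be seen in a toy frozen-head setting ($H_T=H_P$), sketched below. The data and noise level are assumptions, the constants $K_{\mathrm{feat}}$ and $K_{\mathrm{pred}}$ are omitted because they do not depend on $W$, and the head term uses the empirical $\Sigma_P=Z_P^\top Z_P/n$.

```python
import numpy as np

rng = np.random.default_rng(10)
n, d, V = 300, 6, 15
H = rng.normal(size=(d, V))                    # shared head: H_T = H_P (toy frozen-head case)
Z_T = rng.normal(size=(n, d))
Z_P = Z_T + 0.3 * rng.normal(size=(n, d))      # hypothetical perturbed backbone features

U, s, Vt = np.linalg.svd(Z_T.T @ Z_P)
W_N = Vt.T @ U.T                               # Procrustes-optimal alignment, Eq. (27)

def residual(W):
    """Root-mean-square alignment residual, the W-dependent part of delta(W)."""
    return float(np.linalg.norm(Z_T - Z_P @ W) / np.sqrt(n))

def head(W):
    """||Sigma_P^{1/2}(W H_T - H_P)||_F, the W-dependent part of gamma(W)."""
    return float(np.linalg.norm(Z_P @ (W @ H - H)) / np.sqrt(n))

I = np.eye(d)
res_I, res_N = residual(I), residual(W_N)
head_I, head_N = head(I), head(W_N)
```

`res_N` is at most `res_I`, but `head_N` is strictly positive while `head_I` is zero: the feature-side gain from $W_N$ is paid for on the head side.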

#### C.2 The trace specialization ($W=I$)

Setting $W=I$ recovers the main-text trace similarity $\Omega=\Omega_{W=I}$ (Eq. [3](https://arxiv.org/html/2605.11608#S3.E3)). Two structural properties characterize this specialization.

##### Ordering of similarities.

Since the Procrustes solution maximizes $\operatorname{Tr}(Z_T^\top Z_PW)$ over $\mathcal{O}(d)$:

$$\operatorname{Tr}(Z_T^\top Z_P)\;\le\;\|Z_T^\top Z_P\|_*, \tag{43}$$

which gives $\Omega\le\Omega_N$. Equality holds if and only if $W_N=I$, i.e., $Z_T^\top Z_P$ is symmetric positive semidefinite.

##### When does $\Omega=\Omega_N$?

Let $Z_T^\top Z_P=U\Sigma V^\top$ be the SVD. Then $\Omega=\Omega_N$ iff $VU^\top=I$, i.e., $U=V$, which holds when $Z_T^\top Z_P$ is symmetric positive semidefinite (SPSD). In post-training quantization with mild noise, $Z_T^\top Z_P\approx Z_T^\top Z_T$ is approximately SPSD, so $\Omega\approx\Omega_N$. As quantization becomes more aggressive, asymmetric perturbations cause $\Omega$ to fall below $\Omega_N$, and this gap reflects genuine asymmetric distortion that a rotation-invariant metric would mask.

##### Head\-side simplification\.

At $W=I$, the head term $K_{\mathrm{pred}}\,\|\Sigma_P^{1/2}(H_T-H_P)\|_F$ carries no rotational contamination and vanishes whenever the prediction head is preserved (FP16 head under PTQ, frozen head under LoRA), recovering the regimes used throughout Sec. [5](https://arxiv.org/html/2605.11608#S5) and the regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5).

#### C.3 Joint optimization on the Stiefel manifold

The tightest bound over all orthogonal alignments requires solving:

$$W_{\mathrm{opt}}=\arg\min_{W\in\mathcal{O}(d)}\Big[\underbrace{K_{\mathrm{feat}}\sqrt{\tfrac{1}{n}\|Z_T-Z_PW\|_F^2}}_{\delta(W)}\;+\;\underbrace{K_{\mathrm{pred}}\,\|\Sigma_P^{1/2}(WH_T-H_P)\|_F}_{\gamma(W)}\Big]. \tag{44}$$

Unlike the standard Procrustes problem (which minimizes $\delta$ alone and admits the closed-form $W_N=VU^\top$), the addition of $\gamma(W)$ introduces a quadratic dependence on $W$ through the head term, related to the *Weighted Orthogonal Procrustes Problem* (WOPP) [[34](https://arxiv.org/html/2605.11608#bib.bib34)].

##### Why no closed\-form solution exists\.

The feature term $\delta(W)$ is concave in $\operatorname{Tr}(Z_T^\top Z_PW)$, while $\gamma(W)$ is convex in $W$. Their sum is neither convex nor concave on $\mathcal{O}(d)$, precluding a simple variational characterization.

##### Special case: isotropic features\.

If $\Sigma_P\approx\lambda I$, the head term simplifies to $K_{\mathrm{pred}}\sqrt{\lambda}\,\|WH_T-H_P\|_F$, which still depends on $W$ through the cross-term $\operatorname{Tr}(H_P^\top WH_T)$ when $\|WH_T-H_P\|_F^2$ is expanded. The joint objective then minimizes $K_{\mathrm{feat}}\sqrt{\|Z_T-Z_PW\|_F^2/n}+K_{\mathrm{pred}}\sqrt{\lambda}\,\|WH_T-H_P\|_F$, and each term individually admits a closed-form Procrustes minimizer; the sum does not. The isotropic assumption is, in any case, unrealistic for LLM representations, which exhibit well-documented anisotropy [[35](https://arxiv.org/html/2605.11608#bib.bib35)].

##### Numerical solution\.

For the general case, $W_{\mathrm{opt}}$ can be approximated via Riemannian gradient descent on the Stiefel manifold, but this entangles scale, shape, and head components, destroying the clean physical interpretation.
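For readers who want to experiment with the joint objective of Eq. (44), the following is a minimal sketch of Riemannian gradient descent on $\mathcal{O}(d)$. It is not taken from the paper: the tangent-space projection, the QR retraction, the backtracking rule, the placeholder value `k_feat = 2.0`, and the synthetic data are all assumptions chosen for simplicity, and the empirical $\Sigma_P=Z_P^\top Z_P/n$ is used for the head term.

```python
import numpy as np

K_PRED = np.sqrt(2.0)

def joint_objective(W, Z_T, Z_P, H_T, H_P, k_feat):
    n = Z_T.shape[0]
    feat = np.linalg.norm(Z_T - Z_P @ W) / np.sqrt(n)
    head = np.linalg.norm(Z_P @ (W @ H_T - H_P)) / np.sqrt(n)
    return float(k_feat * feat + K_PRED * head)

def euclid_grad(W, Z_T, Z_P, H_T, H_P, k_feat, eps=1e-12):
    """Gradient of the objective with respect to W, ignoring the orthogonality constraint."""
    n = Z_T.shape[0]
    A = Z_T - Z_P @ W
    B = Z_P @ (W @ H_T - H_P)
    g_feat = -Z_P.T @ A / (np.linalg.norm(A) + eps)
    g_head = Z_P.T @ B @ H_T.T / (np.linalg.norm(B) + eps)
    return (k_feat * g_feat + K_PRED * g_head) / np.sqrt(n)

def riemannian_descent(Z_T, Z_P, H_T, H_P, k_feat, steps=100):
    W = np.eye(Z_T.shape[1])
    f = joint_objective(W, Z_T, Z_P, H_T, H_P, k_feat)
    for _ in range(steps):
        G = euclid_grad(W, Z_T, Z_P, H_T, H_P, k_feat)
        A = W.T @ G
        xi = W @ (A - A.T) / 2.0               # projection onto the tangent space at W
        t = 1.0
        while t > 1e-10:                       # backtracking: accept only improving steps
            Q, R = np.linalg.qr(W - t * xi)
            Q = Q * np.sign(np.diag(R))        # QR retraction with positive diagonal of R
            f_new = joint_objective(Q, Z_T, Z_P, H_T, H_P, k_feat)
            if f_new < f:
                W, f = Q, f_new
                break
            t *= 0.5
        else:
            break                              # no improving step: treat W as stationary
    return W, f

rng = np.random.default_rng(11)
n, d, V = 200, 5, 12
theta = 0.5                                    # toy setup: a planar rotation of the features
Q0 = np.eye(d)
Q0[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
Z_T = rng.normal(size=(n, d))
Z_P = Z_T @ Q0 + 0.05 * rng.normal(size=(n, d))
H_T = rng.normal(size=(d, V))
H_P = Q0.T @ H_T + 0.05 * rng.normal(size=(d, V))   # head that co-rotates with the features
k_feat = 2.0                                   # placeholder constant, not a measured value

f_init = joint_objective(np.eye(d), Z_T, Z_P, H_T, H_P, k_feat)
W_opt, f_opt = riemannian_descent(Z_T, Z_P, H_T, H_P, k_feat)
```

The backtracking rule only guarantees that the objective does not increase; it gives a local solution at best, and a production implementation would likely use a dedicated manifold-optimization library instead.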

##### Practical irrelevance for this paper’s regimes\.

In the two primary settings:

- Quantization (GPTQ/BnB: $H_T=H_P$; GGUF: $H_T\approx H_P$): The Procrustes-optimal $W_N$ minimizes the feature term $\delta(W)$ alone but inflates the head term $\gamma(W)$ when the head is approximately preserved (cf. the trade-off in Appendix [C.1](https://arxiv.org/html/2605.11608#A3.SS1)). We adopt $W=I$ in the main analyses, where $\gamma$ vanishes under FP16-head retention.
- LoRA ($H_T=H_P$): The head term becomes $K_{\mathrm{pred}}\,\|\Sigma_P^{1/2}(W-I)H_T\|_F$, which vanishes at $W=I$ (and is generically positive at $W\neq I$). We adopt $W=I$ throughout the LoRA experiments and the shape regularizer of Sec. [3.5](https://arxiv.org/html/2605.11608#S3.SS5) because it yields the cleanest decomposition. Under the frozen LoRA head ($H_T=H_P$), the choice $W=I$ gives $\Sigma_P^{1/2}(IH_T-H_P)=0$, so $\gamma=0$. The scale arm $(\rho_T-\rho_P)^2$ is $W$-invariant (observed but not actively regularized, since LoRA primarily perturbs shape rather than scale; Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)), and the shape arm $1-\Omega$ becomes the single differentiable training-time target. The Procrustes-optimal $W_N\neq I$ would tighten the shape arm marginally ($\Omega_N\ge\Omega$) at the cost of inflating $\gamma$ and breaking the at-source shape contraction the regularizer relies on.

The Riemannian formulation becomes relevant only when comparing models with *both* rotated features and divergent heads (e.g., full-parameter SFT). We leave this setting to future work.

### Appendix D Formal extension to autoregressive generation

The bound of Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) applies to any pair of feature matrices $Z_T,Z_P\in\mathbb{R}^{n\times d}$, whatever the origin of the rows. We spell out here the autoregressive instantiation informally summarized in Sec. [3.4](https://arxiv.org/html/2605.11608#S3.SS4).

##### From tokens to matrices.

Fix a single sequence $(c,y_1,\ldots,y_{|y|})$. Each position $\tau$ supplies a (context, target-token) pair instantiating Eq. ([1](https://arxiv.org/html/2605.11608#S3.E1)) with feature $\phi_M(c,y_{<\tau})\in\mathbb{R}^d$ and loss $\ell(\phi_M(c,y_{<\tau})H_M,\,y_\tau)$. Stacking the $|y|$ feature vectors into a $|y|\times d$ block recovers exactly the per-row setting that Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) controls.

###### Corollary 1 (Autoregressive Risk Bound).

For a single sequence $(c, y_1, \ldots, y_{|y|})$, let $Z_T^{\mathrm{AR}}, Z_P^{\mathrm{AR}} \in \mathbb{R}^{|y| \times d}$ be the feature matrices stacked as above, with RMS scales $\rho_T^{\mathrm{AR}}, \rho_P^{\mathrm{AR}}$, Procrustes similarity $\Omega^{\mathrm{AR}}$, and covariance $\Sigma_P^{\mathrm{AR}}$. Then

$$
|\mathcal{R}_T^{\mathrm{AR}} - \mathcal{R}_P^{\mathrm{AR}}| \;\leq\; K_{\mathrm{feat}} \sqrt{(\rho_T^{\mathrm{AR}} - \rho_P^{\mathrm{AR}})^2 + 2\,\rho_T^{\mathrm{AR}}\,\rho_P^{\mathrm{AR}}\,(1 - \Omega^{\mathrm{AR}})} \;+\; K_{\mathrm{pred}}\,\big\|(\Sigma_P^{\mathrm{AR}})^{1/2}(H_T - H_P)\big\|_F. \tag{45}
$$

###### Proof\.

Each row of $Z_M^{\mathrm{AR}}$ is a feature vector of dimension $d$. The Lipschitz analysis of Appendix [A.3](https://arxiv.org/html/2605.11608#A1.SS3) and the Procrustes decomposition of Appendix [A.4](https://arxiv.org/html/2605.11608#A1.SS4) depend only on the matrix shape, not on the provenance of its rows. Applying Theorem [1](https://arxiv.org/html/2605.11608#Thmtheorem1) to $(Z_T^{\mathrm{AR}}, Z_P^{\mathrm{AR}})$ therefore yields the stated inequality. ∎

This unification lets a single bound cover point-wise classification (MMLU, ARC, where $|y| = 1$), short-horizon QA (SQuAD, TriviaQA, where $y$ is a 1–5 token span), and multi-step reasoning (GSM8K, where $y$ is a chain-of-thought solution).
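The stacking step can be made concrete with a small pure-Python sketch. It is illustrative only: `toy_target` and `toy_variant` are hypothetical stand-ins for the two backbones, and the RMS scale $\rho = \|Z\|_F / \sqrt{n}$ and identity-alignment similarity $\Omega = \langle Z_T, Z_P\rangle_F / (\|Z_T\|_F \|Z_P\|_F)$ are assumed definitions that this appendix does not restate. Under those assumptions, the radicand of the feature term equals the mean squared row difference between the two stacked matrices, which the last lines check numerically.

```python
import math
import random


def stack_features(feature_fn, context, answer_tokens):
    """Build the |y| x d block: row tau holds phi(c, y_<tau)."""
    return [feature_fn(context, answer_tokens[:tau])
            for tau in range(len(answer_tokens))]


def frob_inner(A, B):
    """Frobenius inner product of two equally shaped nested lists."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def geometry(Z_T, Z_P):
    """Return (rho_T, rho_P, Omega) under identity alignment (assumed definitions)."""
    n = len(Z_T)
    norm_T = math.sqrt(frob_inner(Z_T, Z_T))
    norm_P = math.sqrt(frob_inner(Z_P, Z_P))
    omega = frob_inner(Z_T, Z_P) / (norm_T * norm_P)
    return norm_T / math.sqrt(n), norm_P / math.sqrt(n), omega


# Hypothetical stand-ins for the target and variant backbones (d = 4).
def toy_target(context, prefix):
    rnd = random.Random(f"{context}|{prefix}")   # string seeds are deterministic
    return [rnd.gauss(0.0, 1.0) for _ in range(4)]


def toy_variant(context, prefix):
    noise = random.Random(f"noise|{context}|{prefix}")
    return [1.05 * x + 0.1 * noise.gauss(0.0, 1.0)
            for x in toy_target(context, prefix)]


context, answer = "Q: 2+3=?", [17, 4, 99, 23, 8]   # toy token ids
Z_T = stack_features(toy_target, context, answer)
Z_P = stack_features(toy_variant, context, answer)

rho_T, rho_P, omega = geometry(Z_T, Z_P)
radicand = (rho_T - rho_P) ** 2 + 2 * rho_T * rho_P * (1 - omega)
diff = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(Z_T, Z_P)]
direct = frob_inner(diff, diff) / len(Z_T)
print(f"{len(Z_T)} rows; radicand = {radicand:.6f}, direct = {direct:.6f}")
```

A classification item corresponds to `answer` of length one, so the same helpers cover the $|y| = 1$ case without modification.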

### Appendix E Model and quantization details

#### E.1 Target models

Table [5](https://arxiv.org/html/2605.11608#A5.T5) lists all target (full-precision) models evaluated in the quantization experiments. All targets are loaded in BF16 as the reference precision; an FP16 proxy is additionally included for each model to measure the BF16 $\to$ FP16 precision-drift baseline.

Table 5: Target models used in quantization experiments. All models are loaded in BF16. Models 2 and 5 (meta-llama) are gated and require accepting the Meta Community License on HuggingFace before access.

#### E.2 Quantization backends and bit-widths

Every target model is evaluated against up to four quantization backends. Table [6](https://arxiv.org/html/2605.11608#A5.T6) summarises the shared GGUF and BitsAndBytes (BnB) configurations applied uniformly across all models.

Table 6: Shared quantization tiers applied to all target models. GGUF quants use the `llama.cpp` k-quant scheme; BnB quants use on-the-fly quantization from the full-precision checkpoint.

#### E.3 Per-model GPTQ repositories

Table [7](https://arxiv.org/html/2605.11608#A5.T7) lists the pre-quantized GPTQ checkpoint used for each target model. All repositories are publicly hosted on HuggingFace and loaded via the standard `transformers` API unless otherwise noted.

Table 7: Pre-quantized GPTQ checkpoints per model. One repository per model, corresponding to the variant actually loaded in our experiments; “—” indicates no GPTQ checkpoint was evaluated for that model. GPTQ bit-width is 4-bit (the canonical setting reported in Sec. [5.1](https://arxiv.org/html/2605.11608#S5.SS1)) for all models except Qwen3-8B-Instruct, where the only public checkpoint we could load uses 8-bit.

#### E.4 GGUF repositories

Table [8](https://arxiv.org/html/2605.11608#A5.T8) lists the GGUF repositories and filename conventions for each model. GGUF files follow one of two naming conventions depending on the repository provider: the *dot convention* (`Model.Q4_K_M.gguf`) used by QuantFactory and mradermacher, or the *dash convention* (`Model-Q4_K_M.gguf`) used by Qwen, bartowski, and most official releases.

Table 8: GGUF repositories and filename templates. Each repository provides quants `Q8_0`, `Q6_K`, `Q5_K_M`, `Q4_K_M`, `Q3_K_M`, and `Q2_K` unless otherwise noted.
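The two filename conventions can be captured in a small helper. This is a sketch rather than the authors' loader: `gguf_filename` is a hypothetical name, and `"Model"` is the placeholder stem used in the text, not an actual repository entry from Table 8.

```python
# Quant tiers listed in the Table 8 caption.
GGUF_QUANTS = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M", "Q3_K_M", "Q2_K"]

# Separator between the model stem and the quant tier for each convention.
SEPARATORS = {"dot": ".", "dash": "-"}


def gguf_filename(model: str, quant: str, convention: str) -> str:
    """Return the GGUF filename for `model` under the dot or dash convention."""
    if quant not in GGUF_QUANTS:
        raise ValueError(f"unknown quant tier: {quant}")
    if convention not in SEPARATORS:
        raise ValueError(f"unknown convention: {convention}")
    return f"{model}{SEPARATORS[convention]}{quant}.gguf"


for quant in GGUF_QUANTS:
    print(gguf_filename("Model", quant, "dot"), gguf_filename("Model", quant, "dash"))
```

Keeping the separator in a lookup table means a repository that deviates from both conventions would need only one extra entry.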
#### E.5 Backend coverage summary

Table [9](https://arxiv.org/html/2605.11608#A5.T9) provides a per-model summary of quantization backend coverage across the combinations we evaluated.

Table 9: Quantization backend coverage per model. ✓ = included in our experiments; — = not evaluated in this study.

##### Experiment scope.

Each model is evaluated on the backend/bit-width combinations marked ✓ in Table [9](https://arxiv.org/html/2605.11608#A5.T9): GGUF (6 bit-widths) and BitsAndBytes (3 configurations) on every model, plus the GPTQ checkpoint listed in Table [7](https://arxiv.org/html/2605.11608#A5.T7) where available. The cells marked “—” are outside our experiment scope under the compute and time budget for this study.

### Appendix F Quantization replication, feature-only scatter, and per-model tables

#### F.1 Replication on Ministral and DeepSeek

Figure [5](https://arxiv.org/html/2605.11608#A6.F5) replicates the main quantization scatter (Fig. [2](https://arxiv.org/html/2605.11608#S5.F2)) on the two additional 8B families, Ministral-3-8B and DeepSeek-R1-Distill-Llama-8B, which are kept out of the main text for symmetry with the LoRA experiments (Sec. [5.1](https://arxiv.org/html/2605.11608#S5.SS1)). The bound tracks $|\Delta\mathcal{R}|$ on these two families with patterns consistent with the Llama–Qwen main analysis, supporting the cross-family generality of the PRISM framework. Per-variant numerical decompositions for all four families are in Appendix [F.3](https://arxiv.org/html/2605.11608#A6.SS3).

![Refer to caption](https://arxiv.org/html/2605.11608v1/x4.png)

Figure 5: PTQ replication on Ministral-3-8B and DeepSeek-R1-Distill-Llama-8B. Counterpart of main-text Fig. [2](https://arxiv.org/html/2605.11608#S5.F2) (Llama-3.1-8B and Qwen3-8B), with identical axes, PTQ family color coding, and benchmark set. Rows: Ministral-3-8B, DeepSeek-R1-Distill-Llama-8B. Columns: ARC, MMLU, SQuAD, TriviaQA, GSM8K. Per-subplot Spearman $r_s$ is annotated in each panel.
#### F.2 Feature-only scatter

Figure [6](https://arxiv.org/html/2605.11608#A6.F6) reports the feature-alignment-only replication of the main quantization result (Fig. [2](https://arxiv.org/html/2605.11608#S5.F2)), with the x-axis replaced by the head-free term $\delta = K_{\mathrm{feat}}\sqrt{(\rho_T - \rho_P)^2 + 2\rho_T\rho_P(1 - \Omega)}$. The grid isolates the backbone contribution to the bound and is the source of the aggregate feature-only correlation $\overline{r_s}(\delta)$ cited in Sec. [5.2](https://arxiv.org/html/2605.11608#S5.SS2).

![Refer to caption](https://arxiv.org/html/2605.11608v1/x5.png)

Figure 6: Feature alignment error $\delta$ alone is already highly predictive of the risk gap. Identical layout to Fig. [2](https://arxiv.org/html/2605.11608#S5.F2) but with the x-axis replaced by the feature-only term $\delta$. Because most protocols retain the `lm_head` in FP16, the head-discrepancy term $\gamma$ is zero for GPTQ and BitsAndBytes and small for GGUF configurations that quantize output embeddings; $\delta$ therefore captures most of the predictive signal of Fig. [2](https://arxiv.org/html/2605.11608#S5.F2), confirming that scale collapse $(\Delta\rho)^2$ and shape mismatch $(1 - \Omega)$ are the dominant degradation channels under PTQ (Sec. [4](https://arxiv.org/html/2605.11608#S4.SS0.SSS0.Px1)).
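As a worked instance of the feature-only term, the sketch below evaluates the square-root factor of $\delta$ for four rows of the Llama-3.1-8B ARC block in the decomposition tables of Appendix F.3 and splits it into its scale and shape parts. This appendix does not state $K_{\mathrm{feat}}$, so the printed ratio $\delta / \sqrt{\cdot}$ is only back-computed from the reported rows and should be read as an illustration, not as a value taken from the paper. The function name `feature_radicand` is hypothetical.

```python
import math


def feature_radicand(rho_t: float, rho_p: float, omega: float) -> float:
    """sqrt((rho_T - rho_P)^2 + 2 rho_T rho_P (1 - Omega)), i.e. delta / K_feat."""
    return math.sqrt((rho_t - rho_p) ** 2 + 2.0 * rho_t * rho_p * (1.0 - omega))


# (method, rho_T, rho_P, Omega, reported delta), Llama-3.1-8B, ARC.
rows = [
    ("Q3_K_M", 137.58, 138.68, 0.9420, 123.0240),
    ("Q2_K", 137.58, 144.67, 0.7408, 266.1578),
    ("NF4", 137.58, 142.78, 0.9155, 151.2430),
    ("FP4", 137.58, 135.87, 0.9117, 150.2742),
]

implied = {}
for method, rho_t, rho_p, omega, delta in rows:
    geom = feature_radicand(rho_t, rho_p, omega)
    implied[method] = delta / geom                      # back-computed ratio
    scale_part = (rho_t - rho_p) ** 2                   # scale arm
    shape_part = 2.0 * rho_t * rho_p * (1.0 - omega)    # shape arm
    print(f"{method:7s} scale={scale_part:9.2f} shape={shape_part:10.2f} "
          f"sqrt={geom:8.3f} delta/sqrt={implied[method]:.3f}")
```

On these rows the shape part exceeds the scale part by two orders of magnitude or more, and the back-computed ratio is nearly constant across methods, which is consistent with a single multiplicative constant within the block.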
#### F.3 Per-model decomposition tables

We report the full geometric decomposition (Eq. [2](https://arxiv.org/html/2605.11608#S3.E2)) for each evaluated model, covering all quantization variants and benchmarks. The table columns list the base RMS scale $\rho_T$, proxy RMS scale $\rho_P$, identity-alignment similarity $\Omega$, feature alignment error $\delta$, head discrepancy $\gamma$, full PRISM bound $\mathcal{B}$, and empirical cross-entropy gap $|\Delta\mathcal{R}|$. Per-dataset Spearman rank correlations $r_s(\delta, |\Delta\mathcal{R}|)$ appear in the left-hand column of each benchmark block.

The base-model tables cover the four 8B families: Llama-3.1-8B extended benchmarks in Table 11, Ministral-3-8B in Table 13, Qwen3-8B in Table 12 (also referenced in Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)), and DeepSeek-R1-Distill-Llama-8B in Table 14. The instruction-tuned counterparts of Llama, Ministral, and Qwen are reported in Tables 15, 17, and 16, respectively, and reproduce the same patterns as their base counterparts. Across ARC/MMLU/SQuAD/TriviaQA the per-benchmark mean Spearman lies in $r_s \in [0.77, 0.89]$ (Table [10](https://arxiv.org/html/2605.11608#A6.T10)); GSM8K is the weakest benchmark across families ($r_s \approx 0.41$) because its long teacher-forced chain-of-thought answer spans dilute per-token loss, shrinking mean $|\Delta\mathcal{R}|$ to $\approx 0.019$, an order of magnitude below the other benchmarks ($0.07$–$0.16$). For Qwen3-8B-Instruct in particular the mean $|\Delta\mathcal{R}|$ on GSM8K is only $0.0033$; at this scale, per-variant differences are dominated by measurement noise rather than the bound's predictive signal.
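The rank correlations quoted above can be recomputed from any benchmark block. The sketch below is a dependency-free Spearman implementation applied to the eleven tabulated rows of the Llama-3.1-8B ARC block (PRISM bound $\mathcal{B}$ against $|\Delta\mathcal{R}|$). On these rows it gives about 0.92, close to but not identical with the 0.94 annotated for that block; a difference of this kind would be expected if the annotated value is computed over raw variants rather than the tabulated rows, as the Table 10 caption suggests, although that reading is an assumption.

```python
def ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman r_s, computed as the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Llama-3.1-8B, ARC block, row order: FP16, Q8_0, Q6_K, Q5_K_M, Q4_K_M,
# Q3_K_M, Q2_K, INT8, NF4, FP4, GPTQ-4bit.
bound = [6.4005, 19.5600, 51.2485, 64.2184, 86.7227, 140.1155, 284.1309,
         57.1279, 151.2430, 150.2742, 128.9004]
gap = [0.0006, 0.0042, 0.0001, 0.0137, 0.0173, 0.0654, 0.4877,
       0.0195, 0.0523, 0.0992, 0.0501]

r_s = spearman(bound, gap)
print(f"r_s over the 11 tabulated rows: {r_s:.3f}")
```

Using the rank-Pearson form rather than the $1 - 6\sum d^2 / (n(n^2 - 1))$ shortcut keeps the function correct when ties occur, as they do in blocks where several $|\Delta\mathcal{R}|$ values round to the same four-decimal figure.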

Table 10: Per-benchmark mean empirical risk gap $|\Delta\mathcal{R}|$ (averaged over per-method aggregated quantization rows, matching the per-model body tables) and per-benchmark mean Spearman $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ (over raw variants, matching the figure's per-subplot $r_s$) across seven 8B model families. GSM8K's small $|\Delta\mathcal{R}|$, driven by long teacher-forced answer spans diluting per-token loss, yields a low signal-to-noise ratio that depresses its rank correlation.

Table 11: Geometric decomposition for Llama-3.1-8B under identity alignment ($W = I$) on extended benchmarks (excl. MMLU). Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants. Shading: $\Omega < 0.80$ / $\Omega < 0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma = 0$ when the head is preserved ($H_T = H_P$).

| Dataset ($r_s$) | Family | Method | $\rho_T$ | $\rho_P$ | $\Omega$ | $\delta$ | $\gamma$ | PRISM $\mathcal{B}$ | $\lvert\Delta\mathcal{R}\rvert$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ARC (0.94) | – | FP16 | 137.58 | 137.63 | 0.9998 | 6.4005 | 0 | 6.4005 | 0.0006 |
| | GGUF | Q8_0 | 137.58 | 137.68 | 0.9992 | 14.1298 | 5.4302 | 19.5600 | 0.0042 |
| | | Q6_K | 137.58 | 138.13 | 0.9955 | 34.2057 | 17.0429 | 51.2485 | 0.0001 |
| | | Q5_K_M | 137.58 | 138.57 | 0.9915 | 47.1100 | 17.1085 | 64.2184 | 0.0137 |
| | | Q4_K_M | 137.58 | 137.71 | 0.9812 | 69.7158 | 17.0069 | 86.7227 | 0.0173 |
| | | Q3_K_M | 137.58 | 138.68 | 0.9420 | 123.0240 | 17.0915 | 140.1155 | 0.0654 |
| | | Q2_K | 137.58 | 144.67 | 0.7408 | 266.1578 | 17.9731 | 284.1309 | 0.4877 |
| | BnB | INT8 | 137.58 | 137.99 | 0.9874 | 57.1279 | 0 | 57.1279 | 0.0195 |
| | | NF4 | 137.58 | 142.78 | 0.9155 | 151.2430 | 0 | 151.2430 | 0.0523 |
| | | FP4 | 137.58 | 135.87 | 0.9117 | 150.2742 | 0 | 150.2742 | 0.0992 |
| | GPTQ | GPTQ-4bit | 137.58 | 138.70 | 0.9363 | 128.9004 | 0 | 128.9004 | 0.0501 |
| TriviaQA (0.97) | – | FP16 | 140.93 | 140.91 | 1.0000 | 0.0467 | 0 | 0.0467 | 0.0003 |
| | GGUF | Q8_0 | 140.93 | 140.89 | 0.9996 | 11.0367 | 5.5270 | 16.5638 | 0.0015 |
| | | Q6_K | 140.93 | 141.11 | 0.9969 | 29.2155 | 17.2531 | 46.4686 | 0.0102 |
| | | Q5_K_M | 140.93 | 141.18 | 0.9930 | 43.5453 | 17.2612 | 60.8066 | 0.0296 |
| | | Q4_K_M | 140.93 | 141.14 | 0.9795 | 74.5725 | 17.2633 | 91.8357 | 0.0842 |
| | | Q3_K_M | 140.93 | 141.39 | 0.9330 | 135.0997 | 17.3007 | 152.4004 | 0.1675 |
| | | Q2_K | 140.93 | 142.25 | 0.7574 | 257.7881 | 17.4123 | 275.2004 | 1.1405 |
| | BnB | INT8 | 140.93 | 141.07 | 0.9875 | 58.3674 | 0 | 58.3674 | 0.0266 |
| | | NF4 | 140.93 | 142.03 | 0.9533 | 113.0393 | 0 | 113.0393 | 0.0773 |
| | | FP4 | 140.93 | 142.54 | 0.9124 | 155.0800 | 0 | 155.0800 | 0.2248 |
| | GPTQ | GPTQ-4bit | 140.93 | 141.30 | 0.9347 | 133.2628 | 0 | 133.2628 | 0.2326 |
| SQuAD (0.81) | – | FP16 | 138.24 | 138.22 | 1.0000 | 0.0473 | 0 | 0.0474 | 0.0020 |
| | GGUF | Q8_0 | 138.24 | 138.17 | 0.9991 | 15.4689 | 5.3498 | 20.8187 | 0.0030 |
| | | Q6_K | 138.24 | 138.21 | 0.9972 | 26.9698 | 16.8428 | 43.8126 | 0.0274 |
| | | Q5_K_M | 138.24 | 138.17 | 0.9941 | 39.1050 | 16.8350 | 55.9400 | 0.0310 |
| | | Q4_K_M | 138.24 | 138.17 | 0.9865 | 59.4253 | 16.8380 | 76.2632 | 0.0488 |
| | | Q3_K_M | 138.24 | 138.24 | 0.9589 | 103.5452 | 16.8464 | 120.3916 | 0.0870 |
| | | Q2_K | 138.24 | 138.53 | 0.8655 | 187.6258 | 16.8715 | 204.4973 | 0.4090 |
| | BnB | INT8 | 138.24 | 138.14 | 0.9924 | 44.6233 | 0 | 44.6233 | 0.0002 |
| | | NF4 | 138.24 | 138.33 | 0.9731 | 83.7629 | 0 | 83.7629 | 0.0009 |
| | | FP4 | 138.24 | 139.19 | 0.9466 | 118.5596 | 0 | 118.5596 | 0.2508 |
| | GPTQ | GPTQ-4bit | 138.24 | 138.33 | 0.9663 | 93.8623 | 0 | 93.8623 | 0.2382 |
| GSM8K (0.51) | – | FP16 | 134.24 | 134.28 | 1.0000 | 0.0956 | 0 | 0.0956 | 0.0001 |
| | GGUF | Q8_0 | 134.24 | 134.24 | 1.0000 | 0.0121 | 5.3871 | 5.3992 | 0.0005 |
| | | Q6_K | 134.24 | 134.23 | 1.0000 | 0.0291 | 16.9360 | 16.9652 | 0.0036 |
| | | Q5_K_M | 134.24 | 134.26 | 1.0000 | 0.0418 | 16.9378 | 16.9797 | 0.0050 |
| | | Q4_K_M | 134.24 | 134.24 | 1.0000 | 0.0162 | 16.9326 | 16.9488 | 0.0211 |
| | | Q3_K_M | 134.24 | 134.65 | 1.0000 | 1.0723 | 16.9967 | 18.0690 | 0.0502 |
| | | Q2_K | 134.24 | 134.51 | 0.9298 | 131.5907 | 16.9517 | 148.5423 | 0.1630 |
| | BnB | INT8 | 134.24 | 134.22 | 1.0000 | 0.0441 | 0 | 0.0441 | 0.0006 |
| | | NF4 | 134.24 | 134.32 | 1.0000 | 0.2076 | 0 | 0.2076 | 0.0356 |
| | | FP4 | 134.24 | 134.86 | 1.0000 | 1.6168 | 0 | 1.6168 | 0.0367 |
| | GPTQ | GPTQ-4bit | 134.24 | 134.18 | 1.0000 | 0.1695 | 0 | 0.1695 | 0.0132 |

Table 12: Geometric decomposition for Qwen3-8B-Base under identity alignment ($W = I$) on all benchmarks. Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants.
Shading: $\Omega < 0.80$ / $\Omega < 0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma = 0$ when the head is preserved ($H_T = H_P$).

| Dataset ($r_s$) | Family | Method | $\rho_T$ | $\rho_P$ | $\Omega$ | $\delta$ | $\gamma$ | PRISM $\mathcal{B}$ | $\lvert\Delta\mathcal{R}\rvert$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (0.84) | – | FP16 | 332.87 | 333.00 | 1.0000 | 0.4310 | 0 | 0.4310 | 0.0057 |
| | GGUF | Q8_0 | 332.87 | 332.73 | 1.0000 | 0.5059 | 26.8061 | 27.3120 | 0.0044 |
| | | Q6_K | 332.87 | 332.12 | 1.0000 | 2.6076 | 85.0765 | 87.6841 | 0.0087 |
| | | Q5_K_M | 332.87 | 332.02 | 0.9995 | 36.2699 | 85.0170 | 121.2869 | 0.0500 |
| | | Q4_K_M | 332.87 | 333.30 | 0.9989 | 54.6755 | 85.3799 | 140.0554 | 0.0754 |
| | | Q3_K_M | 332.87 | 324.42 | 0.9931 | 136.6995 | 82.9287 | 219.6281 | 0.0274 |
| | | Q2_K | 332.87 | 336.88 | 0.9819 | 220.4909 | 86.2315 | 306.7224 | 0.4521 |
| | BnB | INT8 | 332.87 | 332.24 | 0.9992 | 45.5886 | 0 | 45.5886 | 0.0159 |
| | | NF4 | 332.87 | 329.71 | 0.9966 | 95.6438 | 0 | 95.6438 | 0.1076 |
| | | FP4 | 332.87 | 332.87 | 0.9938 | 128.4300 | 0 | 128.4300 | 0.1483 |
| ARC (0.65) | – | FP16 | 332.71 | 332.77 | 1.0000 | 0.1830 | 0 | 0.1831 | 0.0000 |
| | GGUF | Q8_0 | 332.71 | 332.26 | 1.0000 | 1.5660 | 26.7533 | 28.3194 | 0.0183 |
| | | Q6_K | 332.71 | 332.43 | 1.0000 | 0.9876 | 85.0956 | 86.0832 | 0.0010 |
| | | Q5_K_M | 332.71 | 333.21 | 0.9997 | 27.5096 | 85.2474 | 112.7570 | 0.0533 |
| | | Q4_K_M | 332.71 | 333.72 | 0.9988 | 56.0316 | 85.3776 | 141.4092 | 0.0932 |
| | | Q3_K_M | 332.71 | 319.87 | 0.9927 | 143.1645 | 81.5376 | 224.7021 | 0.0184 |
| | | Q2_K | 332.71 | 338.59 | 0.9869 | 189.1571 | 86.5945 | 275.7516 | 0.5978 |
| | BnB | INT8 | 332.71 | 329.14 | 0.9991 | 50.9372 | 0 | 50.9372 | 0.0620 |
| | | NF4 | 332.71 | 328.25 | 0.9969 | 91.0051 | 0 | 91.0051 | 0.1367 |
| | | FP4 | 332.71 | 333.50 | 0.9929 | 137.3760 | 0 | 137.3760 | 0.2419 |
| TriviaQA (0.94) | – | FP16 | 241.01 | 240.83 | 1.0000 | 0.6337 | 0 | 0.6338 | 0.0022 |
| | GGUF | Q8_0 | 241.01 | 241.22 | 1.0000 | 0.7309 | 18.9698 | 19.7007 | 0.0158 |
| | | Q6_K | 241.01 | 241.79 | 1.0000 | 2.6885 | 60.1070 | 62.7955 | 0.0556 |
| | | Q5_K_M | 241.01 | 239.98 | 0.9996 | 23.1271 | 59.5715 | 82.6986 | 0.0776 |
| | | Q4_K_M | 241.01 | 241.74 | 0.9972 | 62.6955 | 60.0594 | 122.7549 | 0.0325 |
| | | Q3_K_M | 241.01 | 235.11 | 0.9804 | 164.3962 | 58.0817 | 222.4779 | 0.3429 |
| | | Q2_K | 241.01 | 253.07 | 0.9351 | 310.5940 | 63.4073 | 374.0012 | 0.4199 |
| | BnB | INT8 | 241.01 | 239.25 | 0.9989 | 39.2009 | 0 | 39.2009 | 0.0145 |
| | | NF4 | 241.01 | 241.11 | 0.9921 | 104.6219 | 0 | 104.6219 | 0.1900 |
| | | FP4 | 241.01 | 241.03 | 0.9832 | 152.7570 | 0 | 152.7570 | 0.2615 |
| SQuAD (0.95) | – | FP16 | 298.04 | 297.92 | 1.0000 | 0.3954 | 0 | 0.3954 | 0.0009 |
| | GGUF | Q8_0 | 298.04 | 298.25 | 1.0000 | 0.7419 | 23.9590 | 24.7008 | 0.0098 |
| | | Q6_K | 298.04 | 298.38 | 1.0000 | 1.1798 | 75.7671 | 76.9469 | 0.0234 |
| | | Q5_K_M | 298.04 | 297.87 | 1.0000 | 0.5839 | 75.6329 | 76.2168 | 0.0275 |
| | | Q4_K_M | 298.04 | 299.88 | 1.0000 | 6.3512 | 76.1849 | 82.5360 | 0.1121 |
| | | Q3_K_M | 298.04 | 297.59 | 0.9938 | 114.2894 | 75.4513 | 189.7407 | 0.2722 |
| | | Q2_K | 298.04 | 307.56 | 0.9786 | 219.0695 | 78.2900 | 297.3595 | 0.8582 |
| | BnB | INT8 | 298.04 | 296.94 | 1.0000 | 3.8134 | 0 | 3.8134 | 0.0124 |
| | | NF4 | 298.04 | 292.80 | 0.9988 | 52.9278 | 0 | 52.9278 | 0.0123 |
| | | FP4 | 298.04 | 304.82 | 0.9947 | 110.2163 | 0 | 110.2163 | 0.0758 |
| GSM8K (0.68) | – | FP16 | 267.17 | 267.23 | 1.0000 | 0.2028 | 0 | 0.2028 | 0.0002 |
| | GGUF | Q8_0 | 267.17 | 266.71 | 1.0000 | 1.5944 | 21.6634 | 23.2578 | 0.0003 |
| | | Q6_K | 267.17 | 270.95 | 1.0000 | 13.0938 | 69.4224 | 82.5161 | 0.0020 |
| | | Q5_K_M | 267.17 | 271.11 | 1.0000 | 13.6428 | 69.4669 | 83.1096 | 0.0070 |
| | | Q4_K_M | 267.17 | 274.46 | 1.0000 | 25.2256 | 70.3239 | 95.5495 | 0.0049 |
| | | Q3_K_M | 267.17 | 274.49 | 1.0000 | 25.3174 | 70.2868 | 95.6042 | 0.0504 |
| | | Q2_K | 267.17 | 313.41 | 0.9835 | 242.2547 | 80.5797 | 322.8344 | 0.2799 |
| | BnB | INT8 | 267.17 | 268.01 | 1.0000 | 2.9317 | 0 | 2.9317 | 0.0024 |
| | | NF4 | 267.17 | 273.95 | 1.0000 | 23.4507 | 0 | 23.4507 | 0.0309 |
| | | FP4 | 267.17 | 271.17 | 1.0000 | 13.8517 | 0 | 13.8517 | 0.0915 |

Table 13: Geometric decomposition for Ministral-3-8B-Base under identity alignment ($W = I$) on all benchmarks. Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants.
Shading: $\Omega < 0.80$ / $\Omega < 0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma = 0$ when the head is preserved ($H_T = H_P$).

| Dataset ($r_s$) | Family | Method | $\rho_T$ | $\rho_P$ | $\Omega$ | $\delta$ | $\gamma$ | PRISM $\mathcal{B}$ | $\lvert\Delta\mathcal{R}\rvert$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (0.87) | – | FP16 | 191.06 | 191.04 | 0.9999 | 2.6982 | 0 | 2.6983 | 0.0033 |
| | GGUF | Q8_0 | 191.06 | 191.04 | 0.9996 | 5.1801 | 4.1685 | 9.3485 | 0.0056 |
| | | Q6_K | 191.06 | 191.13 | 0.9983 | 10.8538 | 12.9094 | 23.7632 | 0.0022 |
| | | Q5_K_M | 191.06 | 191.55 | 0.9954 | 18.1440 | 12.9197 | 31.0637 | 0.0280 |
| | | Q4_K_M | 191.06 | 191.12 | 0.9867 | 30.7499 | 12.8820 | 43.6318 | 0.0187 |
| | | Q3_K_M | 191.06 | 190.43 | 0.9555 | 56.0293 | 12.8388 | 68.8680 | 0.0374 |
| | | Q2_K | 191.06 | 193.99 | 0.8483 | 104.4697 | 13.0266 | 117.4963 | 0.5035 |
| | BnB | INT8 | 191.06 | 191.21 | 0.9956 | 17.6961 | 0 | 17.6961 | 0.0156 |
| | | NF4 | 191.06 | 190.92 | 0.9691 | 46.7819 | 0 | 46.7819 | 0.0910 |
| | | FP4 | 191.06 | 191.57 | 0.9503 | 59.3922 | 0 | 59.3922 | 0.0558 |
| ARC (0.79) | – | FP16 | 191.14 | 191.11 | 0.9999 | 2.6480 | 0 | 2.6480 | 0.0006 |
| | GGUF | Q8_0 | 191.14 | 191.06 | 0.9996 | 5.2859 | 4.1436 | 9.4295 | 0.0011 |
| | | Q6_K | 191.14 | 191.08 | 0.9981 | 11.6035 | 12.8801 | 24.4836 | 0.0047 |
| | | Q5_K_M | 191.14 | 191.55 | 0.9955 | 17.7885 | 12.9031 | 30.6917 | 0.0443 |
| | | Q4_K_M | 191.14 | 190.56 | 0.9835 | 34.1959 | 12.8217 | 47.0176 | 0.0042 |
| | | Q3_K_M | 191.14 | 189.89 | 0.9529 | 57.5874 | 12.7906 | 70.3780 | 0.0387 |
| | | Q2_K | 191.14 | 194.14 | 0.8439 | 106.0342 | 12.9650 | 118.9992 | 0.6424 |
| | BnB | INT8 | 191.14 | 191.46 | 0.9953 | 18.3240 | 0 | 18.3240 | 0.0135 |
| | | NF4 | 191.14 | 190.54 | 0.9655 | 49.4023 | 0 | 49.4023 | 0.0936 |
| | | FP4 | 191.14 | 191.92 | 0.9468 | 61.5350 | 0 | 61.5350 | 0.0265 |
| TriviaQA (0.89) | – | FP16 | 194.05 | 194.02 | 1.0000 | 0.0310 | 0 | 0.0311 | 0.0034 |
| | GGUF | Q8_0 | 194.05 | 194.07 | 0.9997 | 4.9601 | 4.0564 | 9.0165 | 0.0024 |
| | | Q6_K | 194.05 | 194.09 | 0.9981 | 11.7063 | 12.4435 | 24.1498 | 0.0067 |
| | | Q5_K_M | 194.05 | 194.13 | 0.9954 | 18.3912 | 12.4407 | 30.8319 | 0.0092 |
| | | Q4_K_M | 194.05 | 194.23 | 0.9854 | 32.6582 | 12.4456 | 45.1038 | 0.0075 |
| | | Q3_K_M | 194.05 | 193.98 | 0.9452 | 63.2523 | 12.4345 | 75.6868 | 0.1125 |
| | | Q2_K | 194.05 | 195.16 | 0.8057 | 119.4909 | 12.5116 | 132.0025 | 0.4595 |
| | BnB | INT8 | 194.05 | 194.15 | 0.9950 | 19.1477 | 0 | 19.1477 | 0.0130 |
| | | NF4 | 194.05 | 194.10 | 0.9661 | 49.7942 | 0 | 49.7942 | 0.0697 |
| | | FP4 | 194.05 | 194.43 | 0.9417 | 65.3043 | 0 | 65.3043 | 0.0272 |
| SQuAD (0.73) | – | FP16 | 186.67 | 186.62 | 1.0000 | 0.0440 | 0 | 0.0440 | 0.0018 |
| | GGUF | Q8_0 | 186.67 | 186.68 | 1.0000 | 0.0108 | 3.9943 | 4.0050 | 0.0029 |
| | | Q6_K | 186.67 | 186.66 | 0.9987 | 9.5143 | 12.1987 | 21.7130 | 0.0017 |
| | | Q5_K_M | 186.67 | 186.60 | 0.9968 | 14.7202 | 12.1942 | 26.9145 | 0.0160 |
| | | Q4_K_M | 186.67 | 186.82 | 0.9885 | 27.9430 | 12.1948 | 40.1377 | 0.0302 |
| | | Q3_K_M | 186.67 | 186.89 | 0.9626 | 50.2928 | 12.2120 | 62.5048 | 0.0111 |
| | | Q2_K | 186.67 | 188.53 | 0.8724 | 93.3676 | 12.2665 | 105.6341 | 0.2206 |
| | BnB | INT8 | 186.67 | 186.71 | 0.9960 | 16.3442 | 0 | 16.3442 | 0.0001 |
| | | NF4 | 186.67 | 186.50 | 0.9748 | 41.2389 | 0 | 41.2389 | 0.0226 |
| | | FP4 | 186.67 | 187.23 | 0.9598 | 52.2330 | 0 | 52.2330 | 0.0219 |
| GSM8K (0.48) | – | FP16 | 179.04 | 179.13 | 1.0000 | 0.0826 | 0 | 0.0826 | 0.0004 |
| | GGUF | Q8_0 | 179.04 | 179.05 | 1.0000 | 0.0062 | 3.9076 | 3.9138 | 0.0000 |
| | | Q6_K | 179.04 | 178.93 | 1.0000 | 0.1107 | 12.1709 | 12.2816 | 0.0031 |
| | | Q5_K_M | 179.04 | 179.00 | 1.0000 | 0.0470 | 12.1743 | 12.2213 | 0.0018 |
| | | Q4_K_M | 179.04 | 179.22 | 1.0000 | 0.1696 | 12.1799 | 12.3495 | 0.0139 |
| | | Q3_K_M | 179.04 | 179.40 | 1.0000 | 0.3472 | 12.1980 | 12.5452 | 0.0029 |
| | | Q2_K | 179.04 | 181.13 | 0.9561 | 52.5668 | 12.2589 | 64.8258 | 0.0772 |
| | BnB | INT8 | 179.04 | 179.10 | 1.0000 | 0.0531 | 0 | 0.0531 | 0.0025 |
| | | NF4 | 179.04 | 178.63 | 1.0000 | 0.4096 | 0 | 0.4096 | 0.0001 |
| | | FP4 | 179.04 | 179.35 | 1.0000 | 0.3011 | 0 | 0.3011 | 0.0141 |

Table 14: Geometric decomposition for DeepSeek-R1-Distill-Llama-8B under identity alignment ($W = I$) on all benchmarks. Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants.
Shading: $\Omega < 0.80$ / $\Omega < 0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma = 0$ when the head is preserved ($H_T = H_P$).

| Dataset ($r_s$) | Family | Method | $\rho_T$ | $\rho_P$ | $\Omega$ | $\delta$ | $\gamma$ | PRISM $\mathcal{B}$ | $\lvert\Delta\mathcal{R}\rvert$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (0.78) | – | FP16 | 130.42 | 130.30 | 0.9998 | 6.3231 | 0 | 6.3231 | 0.0029 |
| | GGUF | Q8_0 | 130.42 | 130.04 | 0.9986 | 17.8882 | 5.1177 | 23.0059 | 0.0038 |
| | | Q6_K | 130.42 | 130.50 | 0.9956 | 31.7135 | 16.0293 | 47.7428 | 0.0205 |
| | | Q5_K_M | 130.42 | 130.31 | 0.9925 | 41.6249 | 16.0159 | 57.6408 | 0.0273 |
| | | Q4_K_M | 130.42 | 130.73 | 0.9802 | 67.4996 | 16.0775 | 83.5772 | 0.0121 |
| | | Q3_K_M | 130.42 | 133.80 | 0.9452 | 113.9650 | 16.4417 | 130.4067 | 0.1651 |
| | | Q2_K | 130.42 | 134.51 | 0.8355 | 197.7487 | 16.4049 | 214.1535 | 0.4977 |
| | BnB | INT8 | 130.42 | 131.30 | 0.9871 | 54.7277 | 0 | 54.7277 | 0.0277 |
| | | NF4 | 130.42 | 133.20 | 0.9490 | 109.6652 | 0 | 109.6652 | 0.1732 |
| | | FP4 | 130.42 | 135.19 | 0.8903 | 162.1313 | 0 | 162.1313 | 0.2850 |
| | GPTQ | GPTQ-4bit | 130.42 | 132.77 | 0.9407 | 117.9387 | 0 | 117.9387 | 0.0075 |
| ARC (0.90) | – | FP16 | 129.06 | 129.02 | 0.9998 | 6.1255 | 0 | 6.1255 | 0.0029 |
| | GGUF | Q8_0 | 129.06 | 128.80 | 0.9988 | 16.3476 | 5.0448 | 21.3925 | 0.0044 |
| | | Q6_K | 129.06 | 129.14 | 0.9959 | 30.2995 | 15.8091 | 46.1087 | 0.0045 |
| | | Q5_K_M | 129.06 | 128.92 | 0.9924 | 41.4248 | 15.7913 | 57.2161 | 0.0167 |
| | | Q4_K_M | 129.06 | 130.23 | 0.9789 | 69.2502 | 15.9862 | 85.2364 | 0.0569 |
| | | Q3_K_M | 129.06 | 133.12 | 0.9431 | 115.3662 | 16.3273 | 131.6935 | 0.1086 |
| | | Q2_K | 129.06 | 134.02 | 0.8228 | 203.9118 | 16.3259 | 220.2377 | 0.3327 |
| | BnB | INT8 | 129.06 | 129.36 | 0.9861 | 56.0644 | 0 | 56.0644 | 0.0007 |
| | | NF4 | 129.06 | 132.28 | 0.9445 | 113.4221 | 0 | 113.4221 | 0.1337 |
| | | FP4 | 129.06 | 135.34 | 0.8825 | 167.3408 | 0 | 167.3408 | 0.2866 |
| | GPTQ | GPTQ-4bit | 129.06 | 132.05 | 0.9360 | 121.6208 | 0 | 121.6208 | 0.0407 |
| TriviaQA (0.96) | – | FP16 | 143.70 | 143.66 | 0.9999 | 3.9059 | 0 | 3.9060 | 0.0008 |
| | GGUF | Q8_0 | 143.70 | 143.69 | 0.9991 | 15.8440 | 5.6723 | 21.5163 | 0.0069 |
| | | Q6_K | 143.70 | 143.67 | 0.9953 | 36.0293 | 17.5786 | 53.6079 | 0.0073 |
| | | Q5_K_M | 143.70 | 143.78 | 0.9909 | 50.4752 | 17.5929 | 68.0682 | 0.0379 |
| | | Q4_K_M | 143.70 | 143.71 | 0.9724 | 87.8327 | 17.5827 | 105.4155 | 0.1682 |
| | | Q3_K_M | 143.70 | 144.10 | 0.9186 | 150.9221 | 17.6260 | 168.5481 | 0.3294 |
| | | Q2_K | 143.70 | 144.13 | 0.7515 | 263.7237 | 17.6317 | 281.3554 | 1.0967 |
| | BnB | INT8 | 143.70 | 143.85 | 0.9799 | 74.9114 | 0 | 74.9114 | 0.1057 |
| | | NF4 | 143.70 | 143.85 | 0.9351 | 134.6541 | 0 | 134.6541 | 0.0959 |
| | | FP4 | 143.70 | 144.30 | 0.8696 | 191.1589 | 0 | 191.1589 | 0.6411 |
| | GPTQ | GPTQ-4bit | 143.70 | 144.14 | 0.9032 | 164.5533 | 0 | 164.5533 | 0.4587 |
| SQuAD (0.65) | – | FP16 | 133.98 | 133.95 | 1.0000 | 0.0699 | 0 | 0.0699 | 0.0015 |
| | GGUF | Q8_0 | 133.98 | 133.97 | 0.9976 | 24.2320 | 5.2141 | 29.4460 | 0.0162 |
| | | Q6_K | 133.98 | 133.75 | 0.9935 | 39.5629 | 16.4516 | 56.0145 | 0.0584 |
| | | Q5_K_M | 133.98 | 133.78 | 0.9896 | 50.1989 | 16.4562 | 66.6551 | 0.0677 |
| | | Q4_K_M | 133.98 | 133.72 | 0.9743 | 78.8687 | 16.4442 | 95.3129 | 0.1332 |
| | | Q3_K_M | 133.98 | 134.76 | 0.9376 | 123.3841 | 16.5499 | 139.9340 | 0.1084 |
| | | Q2_K | 133.98 | 135.89 | 0.8134 | 214.2984 | 16.6592 | 230.9575 | 0.0205 |
| | BnB | INT8 | 133.98 | 134.24 | 0.9829 | 64.5345 | 0 | 64.5345 | 0.0152 |
| | | NF4 | 133.98 | 134.33 | 0.9430 | 117.6838 | 0 | 117.6838 | 0.2994 |
| | | FP4 | 133.98 | 135.37 | 0.8972 | 158.7895 | 0 | 158.7895 | 0.2282 |
| | GPTQ | GPTQ-4bit | 133.98 | 137.60 | 0.9206 | 140.9024 | 0 | 140.9024 | 0.4021 |
| GSM8K (0.45) | – | FP16 | 120.55 | 120.52 | 1.0000 | 0.0796 | 0 | 0.0797 | 0.0017 |
| | GGUF | Q8_0 | 120.55 | 120.49 | 1.0000 | 0.1506 | 4.7556 | 4.9062 | 0.0055 |
| | | Q6_K | 120.55 | 120.32 | 1.0000 | 0.5861 | 15.0761 | 15.6622 | 0.0056 |
| | | Q5_K_M | 120.55 | 120.22 | 1.0000 | 0.8500 | 15.0619 | 15.9119 | 0.0077 |
| | | Q4_K_M | 120.55 | 120.29 | 1.0000 | 0.6819 | 15.0733 | 15.7552 | 0.0040 |
| | | Q3_K_M | 120.55 | 121.18 | 1.0000 | 1.6511 | 15.2217 | 16.8727 | 0.0316 |
| | | Q2_K | 120.55 | 122.39 | 0.8963 | 143.8615 | 15.3978 | 159.2594 | 0.0087 |
| | BnB | INT8 | 120.55 | 120.83 | 1.0000 | 0.7327 | 0 | 0.7327 | 0.0081 |
| | | NF4 | 120.55 | 120.67 | 1.0000 | 0.3110 | 0 | 0.3110 | 0.0086 |
| | | FP4 | 120.55 | 122.26 | 0.9697 | 77.7554 | 0 | 77.7554 | 0.0210 |
| | GPTQ | GPTQ-4bit | 120.55 | 123.64 | 1.0000 | 8.0360 | 0 | 8.0360 | 0.0323 |

Table 15: Geometric decomposition for Llama-3.1-8B-Instruct under identity alignment ($W = I$) on all benchmarks.
Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants. Shading: $\Omega < 0.80$ / $\Omega < 0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma = 0$ when the head is preserved ($H_T = H_P$).

| Dataset ($r_s$) | Family | Method | $\rho_T$ | $\rho_P$ | $\Omega$ | $\delta$ | $\gamma$ | PRISM $\mathcal{B}$ | $\lvert\Delta\mathcal{R}\rvert$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (0.71) | – | FP16 | 144.62 | 144.62 | 0.9997 | 9.0568 | 0 | 9.0568 | 0.0006 |
| | GGUF | Q8_0 | 144.62 | 144.53 | 0.9954 | 35.9084 | 5.6784 | 41.5868 | 0.0225 |
| | | Q6_K | 144.62 | 144.62 | 0.9923 | 46.6172 | 17.6498 | 64.2670 | 0.0147 |
| | | Q5_K_M | 144.62 | 144.62 | 0.9851 | 65.0189 | 17.6591 | 82.6780 | 0.0134 |
| | | Q4_K_M | 144.62 | 143.11 | 0.9401 | 129.5315 | 17.3881 | 146.9196 | 0.1047 |
| | | Q3_K_M | 144.62 | 141.90 | 0.8548 | 200.8720 | 17.2517 | 218.1237 | 0.0306 |
| | | Q2_K | 144.62 | 144.24 | 0.7331 | 274.3785 | 17.6133 | 291.9918 | 0.6372 |
| | BnB | INT8 | 144.62 | 144.44 | 0.9771 | 80.4019 | 0 | 80.4019 | 0.0423 |
| | | NF4 | 144.62 | 145.41 | 0.9155 | 155.0510 | 0 | 155.0510 | 0.1038 |
| | | FP4 | 144.62 | 144.77 | 0.8618 | 197.7782 | 0 | 197.7782 | 0.0551 |
| | GPTQ | GPTQ-4bit | 144.62 | 143.76 | 0.8708 | 190.6339 | 0 | 190.6339 | 0.0867 |
| ARC (0.53) | – | FP16 | 145.60 | 145.61 | 0.9995 | 12.3411 | 0 | 12.3411 | 0.0001 |
| | GGUF | Q8_0 | 145.60 | 145.61 | 0.9970 | 29.1756 | 5.7418 | 34.9174 | 0.0107 |
| | | Q6_K | 145.60 | 145.68 | 0.9934 | 43.4717 | 17.8881 | 61.3598 | 0.0144 |
| | | Q5_K_M | 145.60 | 145.69 | 0.9879 | 58.8497 | 17.9008 | 76.7505 | 0.0141 |
| | | Q4_K_M | 145.60 | 145.17 | 0.9592 | 108.0466 | 17.7797 | 125.8263 | 0.0028 |
| | | Q3_K_M | 145.60 | 143.71 | 0.8828 | 182.1911 | 17.6230 | 199.8141 | 0.0103 |
| | | Q2_K | 145.60 | 145.45 | 0.7456 | 269.9120 | 17.7991 | 287.7111 | 0.4741 |
| | BnB | INT8 | 145.60 | 145.49 | 0.9795 | 76.5661 | 0 | 76.5661 | 0.0199 |
| | | NF4 | 145.60 | 146.24 | 0.9204 | 151.4443 | 0 | 151.4443 | 0.0432 |
| | | FP4 | 145.60 | 146.18 | 0.8657 | 196.6353 | 0 | 196.6353 | 0.0798 |
| | GPTQ | GPTQ-4bit | 145.60 | 145.43 | 0.8796 | 185.6726 | 0 | 185.6726 | 0.0119 |
| TriviaQA (0.97) | – | FP16 | 143.91 | 143.91 | 1.0000 | 0.0037 | 0 | 0.0037 | 0.0010 |
| | GGUF | Q8_0 | 143.91 | 143.95 | 0.9994 | 12.6885 | 5.6520 | 18.3405 | 0.0021 |
| | | Q6_K | 143.91 | 144.01 | 0.9969 | 29.6264 | 17.4955 | 47.1219 | 0.0210 |
| | | Q5_K_M | 143.91 | 143.91 | 0.9926 | 45.5890 | 17.4801 | 63.0692 | 0.0654 |
| | | Q4_K_M | 143.91 | 144.00 | 0.9781 | 78.3518 | 17.4988 | 95.8505 | 0.0653 |
| | | Q3_K_M | 143.91 | 144.32 | 0.9288 | 141.4285 | 17.5455 | 158.9739 | 0.3116 |
| | | Q2_K | 143.91 | 145.04 | 0.7730 | 253.1214 | 17.6629 | 270.7843 | 1.0636 |
| | BnB | INT8 | 143.91 | 144.33 | 0.9843 | 66.4209 | 0 | 66.4209 | 0.0496 |
| | | NF4 | 143.91 | 144.84 | 0.9450 | 124.5222 | 0 | 124.5222 | 0.1588 |
| | | FP4 | 143.91 | 144.93 | 0.9044 | 164.2065 | 0 | 164.2065 | 0.3070 |
| | GPTQ | GPTQ-4bit | 143.91 | 144.13 | 0.9331 | 136.9950 | 0 | 136.9950 | 0.1271 |
| SQuAD (0.90) | – | FP16 | 138.74 | 138.71 | 1.0000 | 0.0693 | 0 | 0.0693 | 0.0002 |
| | GGUF | Q8_0 | 138.74 | 138.76 | 0.9991 | 15.0126 | 5.3489 | 20.3615 | 0.0031 |
| | | Q6_K | 138.74 | 138.79 | 0.9976 | 24.8796 | 16.7791 | 41.6587 | 0.0171 |
| | | Q5_K_M | 138.74 | 138.77 | 0.9953 | 34.8378 | 16.7760 | 51.6138 | 0.0095 |
| | | Q4_K_M | 138.74 | 138.89 | 0.9873 | 57.5090 | 16.7938 | 74.3028 | 0.0198 |
| | | Q3_K_M | 138.74 | 138.92 | 0.9672 | 92.4334 | 16.7992 | 109.2326 | 0.0664 |
| | | Q2_K | 138.74 | 138.30 | 0.8992 | 161.7335 | 16.7209 | 178.4544 | 0.3098 |
| | BnB | INT8 | 138.74 | 138.72 | 0.9921 | 45.2550 | 0 | 45.2550 | 0.0130 |
| | | NF4 | 138.74 | 138.83 | 0.9693 | 89.3699 | 0 | 89.3699 | 0.0479 |
| | | FP4 | 138.74 | 139.36 | 0.9459 | 118.9474 | 0 | 118.9474 | 0.1904 |
| | GPTQ | GPTQ-4bit | 138.74 | 138.89 | 0.9665 | 93.3836 | 0 | 93.3836 | 0.0678 |
| GSM8K (0.45) | – | FP16 | 134.52 | 134.54 | 1.0000 | 0.0545 | 0 | 0.0546 | 0.0004 |
| | GGUF | Q8_0 | 134.52 | 134.54 | 1.0000 | 0.0633 | 5.3512 | 5.4146 | 0.0007 |
| | | Q6_K | 134.52 | 134.48 | 1.0000 | 0.0905 | 16.7892 | 16.8796 | 0.0019 |
| | | Q5_K_M | 134.52 | 134.56 | 1.0000 | 0.0993 | 16.7999 | 16.8992 | 0.0087 |
| | | Q4_K_M | 134.52 | 134.22 | 1.0000 | 0.7816 | 16.7575 | 17.5391 | 0.0028 |
| | | Q3_K_M | 134.52 | 134.67 | 1.0000 | 0.3912 | 16.8052 | 17.1965 | 0.0215 |
| | | Q2_K | 134.52 | 133.96 | 0.9129 | 145.7227 | 16.7320 | 162.4548 | 0.1025 |
| | BnB | INT8 | 134.52 | 134.54 | 1.0000 | 0.0608 | 0 | 0.0608 | 0.0046 |
| | | NF4 | 134.52 | 134.43 | 1.0000 | 0.2252 | 0 | 0.2252 | 0.0274 |
| | | FP4 | 134.52 | 134.92 | 0.9815 | 67.4488 | 0 | 67.4488 | 0.0217 |
| | GPTQ | GPTQ-4bit | 134.52 | 134.95 | 1.0000 | 1.1173 | 0 | 1.1173 | 0.0302 |

Table 16: Geometric decomposition for Qwen3-8B-Instruct under identity alignment ($W = I$) on
all benchmarks. Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants. Shading: $\Omega < 0.80$ / $\Omega < 0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma = 0$ when the head is preserved ($H_T = H_P$).

| Dataset ($r_s$) | Family | Method | $\rho_T$ | $\rho_P$ | $\Omega$ | $\delta$ | $\gamma$ | PRISM $\mathcal{B}$ | $\lvert\Delta\mathcal{R}\rvert$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU (0.68) | – | FP16 | 159.12 | 159.07 | 1.0000 | 0.1622 | 0 | 0.1622 | 0.0045 |
| | GGUF | Q8_0 | 159.12 | 159.35 | 0.9997 | 12.2662 | 12.0379 | 24.3041 | 0.0013 |
| | | Q6_K | 159.12 | 160.10 | 0.9979 | 35.6754 | 38.8644 | 74.5398 | 0.0219 |
| | | Q5_K_M | 159.12 | 159.94 | 0.9952 | 53.1786 | 38.8830 | 92.0616 | 0.0075 |
| | | Q4_K_M | 159.12 | 157.23 | 0.9888 | 80.8392 | 38.2293 | 119.0684 | 0.0812 |
| | BnB | INT8 | 159.12 | 161.47 | 0.9927 | 66.6772 | 0 | 66.6772 | 0.0537 |
| | | NF4 | 159.12 | 161.29 | 0.9773 | 116.5416 | 0 | 116.5416 | 0.0214 |
| | | FP4 | 159.12 | 161.12 | 0.9465 | 178.7626 | 0 | 178.7626 | 0.0455 |
| | GPTQ | GPTQ-8bit | 159.12 | 158.63 | 0.9998 | 12.1107 | 0 | 12.1107 | 0.0027 |
| ARC (0.75) | – | FP16 | 165.05 | 164.96 | 1.0000 | 0.3011 | 0 | 0.3012 | 0.0025 |
| | GGUF | Q8_0 | 165.05 | 165.00 | 0.9998 | 10.1008 | 12.5005 | 22.6013 | 0.0052 |
| | | Q6_K | 165.05 | 165.43 | 0.9984 | 32.1931 | 40.2195 | 72.4126 | 0.0299 |
| | | Q5_K_M | 165.05 | 165.57 | 0.9962 | 48.9740 | 40.3151 | 89.2891 | 0.0046 |
| | | Q4_K_M | 165.05 | 163.96 | 0.9904 | 77.9990 | 39.8298 | 117.8288 | 0.0994 |
| | BnB | INT8 | 165.05 | 167.48 | 0.9934 | 65.7006 | 0 | 65.7006 | 0.0290 |
| | | NF4 | 165.05 | 165.71 | 0.9776 | 119.2793 | 0 | 119.2793 | 0.0251 |
| | | FP4 | 165.05 | 166.88 | 0.9531 | 173.4289 | 0 | 173.4289 | 0.1047 |
| | GPTQ | GPTQ-8bit | 165.05 | 164.39 | 0.9998 | 10.4952 | 0 | 10.4952 | 0.0000 |
| TriviaQA (0.53) | – | FP16 | 151.01 | 150.98 | 1.0000 | 0.0731 | 0 | 0.0731 | 0.0102 |
| | GGUF | Q8_0 | 151.01 | 151.12 | 0.9999 | 8.1868 | 11.0425 | 19.2294 | 0.0260 |
| | | Q6_K | 151.01 | 150.65 | 0.9975 | 36.1861 | 34.9084 | 71.0945 | 0.0351 |
| | | Q5_K_M | 151.01 | 151.41 | 0.9935 | 58.8856 | 35.0701 | 93.9557 | 0.0280 |
| | | Q4_K_M | 151.01 | 149.70 | 0.9820 | 97.4466 | 34.7298 | 132.1764 | 0.0152 |
| | BnB | INT8 | 151.01 | 151.53 | 0.9907 | 70.5433 | 0 | 70.5433 | 0.2055 |
| | | NF4 | 151.01 | 153.15 | 0.9571 | 152.1411 | 0 | 152.1411 | 0.5815 |
| | | FP4 | 151.01 | 151.49 | 0.9338 | 187.6604 | 0 | 187.6604 | 0.4596 |
| | GPTQ | GPTQ-8bit | 151.01 | 150.97 | 0.9999 | 8.1793 | 0 | 8.1793 | 0.0421 |
| SQuAD (0.85) | – | FP16 | 150.60 | 150.48 | 1.0000 | 0.4282 | 0 | 0.4282 | 0.0109 |
| | GGUF | Q8_0 | 150.60 | 150.60 | 1.0000 | 0.0234 | 11.1297 | 11.1531 | 0.0041 |
| | | Q6_K | 150.60 | 149.92 | 0.9989 | 24.2100 | 35.2401 | 59.4501 | 0.0122 |
| | | Q5_K_M | 150.60 | 151.20 | 0.9954 | 49.3042 | 35.5707 | 84.8748 | 0.1442 |
| | | Q4_K_M | 150.60 | 150.24 | 0.9875 | 81.0862 | 35.3024 | 116.3885 | 0.1393 |
| | BnB | INT8 | 150.60 | 150.93 | 0.9928 | 61.7027 | 0 | 61.7027 | 0.0660 |
| | | NF4 | 150.60 | 155.57 | 0.9656 | 137.9971 | 0 | 137.9971 | 0.2196 |
| | | FP4 | 150.60 | 154.65 | 0.9500 | 165.0564 | 0 | 165.0564 | 0.1680 |
| | GPTQ | GPTQ-8bit | 150.60 | 150.59 | 1.0000 | 0.0350 | 0 | 0.0350 | 0.0143 |
| GSM8K (−0.57) | – | FP16 | 131.00 | 130.96 | 1.0000 | 0.1222 | 0 | 0.1223 | 0.0043 |
| | GGUF | Q8_0 | 131.00 | 131.00 | 1.0000 | 0.0088 | 10.0309 | 10.0397 | 0.0063 |
| | | Q6_K | 131.00 | 130.47 | 1.0000 | 1.8182 | 31.9756 | 33.7938 | 0.0001 |
| | | Q5_K_M | 131.00 | 131.35 | 1.0000 | 1.2038 | 32.2172 | 33.4210 | 0.0043 |
| | | Q4_K_M | 131.00 | 128.35 | 1.0000 | 9.0309 | 31.5401 | 40.5710 | 0.0001 |
| | BnB | INT8 | 131.00 | 131.35 | 1.0000 | 1.2053 | 0 | 1.2053 | 0.0008 |
| | | NF4 | 131.00 | 132.32 | 1.0000 | 4.5062 | 0 | 4.5062 | 0.0119 |
| | | FP4 | 131.00 | 138.99 | 1.0000 | 27.2384 | 0 | 27.2384 | 0.0006 |
| | GPTQ | GPTQ-8bit | 131.00 | 130.86 | 1.0000 | 0.4622 | 0 | 0.4622 | 0.0017 |

Table 17: Geometric decomposition for Ministral-3-8B-Instruct under identity alignment ($W = I$) on all benchmarks. Each benchmark section reports Spearman's $r_s(\mathcal{B}, |\Delta\mathcal{R}|)$ across all quantization variants.
Shading:Ω<0\.80\\Omega\{<\}0\.80/Ω<0\.95\\Omega\{<\}0\.95on\(Ω,δ,ℬ,\|Δℛ\|\)\(\\Omega,\\delta,\\mathcal\{B\},\|\\Delta\\mathcal\{R\}\|\);γ=0\\gamma\{=\}0when the head is preserved \(HT=HPH\_\{T\}\{=\}H\_\{P\}\)\.DatasetFamilyMethod𝝆𝑻\\boldsymbol\{\\rho\_\{T\}\}𝝆𝑷\\boldsymbol\{\\rho\_\{P\}\}𝛀\\boldsymbol\{\\Omega\}𝜹\\boldsymbol\{\\delta\}𝜸\\boldsymbol\{\\gamma\}PRISM𝓑\\boldsymbol\{\\mathcal\{B\}\}\|𝚫𝓡\|\\boldsymbol\{\|\\Delta\\mathcal\{R\}\|\}MMLU–FP16189\.50189\.500\.99992\.149802\.14980\.0000\(𝐫𝐬=0\.76\\mathbf\{r\_\{s\}\{=\}0\.76\}\)GGUFQ8\_0189\.50189\.430\.996813\.99733\.980417\.97770\.0037Q6\_K189\.50189\.440\.995915\.930412\.333228\.26350\.0170Q5\_K\_M189\.50189\.780\.993619\.956412\.339332\.29570\.0267Q4\_K\_M189\.50189\.470\.983531\.954712\.338844\.29350\.0119Q3\_K\_M189\.50187\.060\.962547\.910912\.231060\.14190\.0236Q2\_K189\.50187\.520\.884284\.205512\.253296\.45870\.1117BnBINT8189\.50189\.330\.995217\.1379017\.13790\.0101NF4189\.50189\.630\.962548\.1985048\.19850\.0066FP4189\.50189\.180\.944158\.7507058\.75070\.0564ARC–FP16189\.39189\.390\.99992\.108502\.10850\.0000\(𝐫𝐬=0\.81\\mathbf\{r\_\{s\}\{=\}0\.81\}\)GGUFQ8\_0189\.39189\.320\.996814\.02683\.958617\.98550\.0026Q6\_K189\.39189\.180\.995816\.086612\.253728\.34030\.0127Q5\_K\_M189\.39189\.420\.993420\.129612\.261032\.39060\.0200Q4\_K\_M189\.39189\.590\.983831\.649812\.257943\.90770\.0059Q3\_K\_M189\.39187\.250\.960449\.198512\.164161\.36260\.0089Q2\_K189\.39187\.790\.881885\.092412\.133997\.22640\.1166BnBINT8189\.39189\.260\.995317\.0549017\.05490\.0028NF4189\.39189\.840\.960949\.2064049\.20640\.0368FP4189\.39188\.990\.939860\.9485060\.94850\.0762TriviaQA–FP16193\.86193\.861\.00000\.000000\.00000\.0000\(𝐫𝐬=0\.95\\mathbf\{r\_\{s\}\{=\}0\.95\}\)GGUFQ8\_0193\.86193\.780\.996814\.44383\.839218\.28300\.0083Q6\_K193\.86193\.700\.995616\.885811\.949228\.83500\.0032Q5\_K\_M193\.86193\.820\.993221\.010011\.950932\.96090\.0264Q4\_K\_M193\.86193\.860\.983532\.649111\.948944\.59800\.0389Q3\_K\_M193\.86193\.840\.9510
56\.296711\.950568\.24720\.1490Q2\_K193\.86194\.740\.8363103\.168811\.9755115\.14420\.5791BnBINT8193\.86193\.950\.994518\.8954018\.89540\.0204NF4193\.86194\.070\.963448\.7160048\.71600\.0883FP4193\.86194\.530\.935464\.7803064\.78030\.1946SQuAD–FP16184\.40184\.401\.00000\.000000\.00000\.0000\(𝐫𝐬=0\.71\\mathbf\{r\_\{s\}\{=\}0\.71\}\)GGUFQ8\_0184\.40184\.360\.997312\.48493\.717216\.20220\.0259Q6\_K184\.40184\.180\.996614\.187911\.690825\.87870\.0033Q5\_K\_M184\.40184\.340\.994517\.935611\.695229\.63090\.0275Q4\_K\_M184\.40184\.380\.986927\.745011\.692139\.43720\.0112Q3\_K\_M184\.40183\.810\.965145\.115811\.679756\.79540\.0239Q2\_K184\.40184\.730\.896378\.005211\.686289\.69140\.1217BnBINT8184\.40184\.460\.996514\.3850014\.38500\.0187NF4184\.40184\.560\.971440\.9475040\.94750\.0661FP4184\.40185\.740\.952453\.0345053\.03450\.0804GSM8K–FP16175\.51175\.511\.00000\.000000\.00000\.0000\(𝐫𝐬=0\.83\\mathbf\{r\_\{s\}\{=\}0\.83\}\)GGUFQ8\_0175\.51175\.441\.00000\.06273\.64833\.71100\.0025Q6\_K175\.51175\.321\.00000\.178611\.581111\.75970\.0015Q5\_K\_M175\.51175\.291\.00000\.201111\.582611\.78370\.0032Q4\_K\_M175\.51175\.401\.00000\.100211\.584111\.68430\.0005Q3\_K\_M175\.51174\.951\.00000\.520111\.568812\.08890\.0100Q2\_K175\.51176\.600\.984828\.518211\.604940\.12310\.0266BnBINT8175\.51175\.551\.00000\.042000\.04200\.0002NF4175\.51175\.391\.00000\.107000\.10700\.0001FP4175\.51176\.811\.00001\.206101\.20610\.0076
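In the rows above, the PRISM $\mathcal{B}$ column appears to equal $\delta + \gamma$ up to rounding in the last digit. The sketch below checks this on four rows copied from Table 17 (Ministral-3-8B-Instruct). It is only a consistency check on the tabulated numbers, not a restatement of the bound, whose closed form is Eq. (7) in the main text; the tolerance `TOL` and the variable names are our own choices.

```python
# Rows copied from Table 17: (benchmark, method, delta, gamma, PRISM bound B).
rows = [
    ("MMLU", "Q8_0", 13.9973, 3.9804, 17.9777),
    ("MMLU", "Q2_K", 84.2055, 12.2532, 96.4587),
    ("MMLU", "INT8", 17.1379, 0.0, 17.1379),
    ("TriviaQA", "Q2_K", 103.1688, 11.9755, 115.1442),
]

# Assumed tolerance: table entries are rounded to 4 decimals, so the sum of
# two rounded terms can differ from the rounded total by about 1e-4.
TOL = 2e-4

# Absolute gap between delta + gamma and the reported bound, per row.
residuals = [abs(delta + gamma - bound) for _, _, delta, gamma, bound in rows]

for (bench, method, *_), res in zip(rows, residuals):
    print(f"{bench:9s} {method:5s} |delta + gamma - B| = {res:.4f}")
```

Rows with $\gamma{=}0$ (a preserved head, as for BnB INT8) reduce to $\mathcal{B} = \delta$ under the same check.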

### Appendix G Qwen3-8B forgetting and shape-regularization replications

This appendix reports the Qwen3-8B replications of Sec. [5.2](https://arxiv.org/html/2605.11608#S5.SS2) and Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4), along with the remaining trace-norm decomposition tables (Llama BBQ and both Qwen fine-tuning tasks). Qwen3-8B differs from Llama-3.1-8B in depth (36 vs. 32 layers), vocabulary size ($151{,}936$ vs. $128{,}256$), and pre-training corpus (multilingual, $\sim$36T tokens vs. English-dominant, $\sim$15T tokens); reproducing the main-text patterns under these shifts provides cross-family evidence for the claims of Secs. [5.2](https://arxiv.org/html/2605.11608#S5.SS2)–[5.4](https://arxiv.org/html/2605.11608#S5.SS4).

#### G.1 Forgetting: bound tracks cross-task drift on Qwen3-8B

![Refer to caption](https://arxiv.org/html/2605.11608v1/x6.png)

Figure 7: Qwen3-8B: the PRISM bound tracks catastrophic forgetting across LoRA fine-tuning steps. Each subplot scatters the PRISM bound $\mathcal{B}$ (x-axis, log) against the empirical forgetting $|\Delta\mathcal{R}|$ (y-axis, log) on a held-out benchmark, with one point per LoRA checkpoint colored by training step. Rows: fine-tuning task (TruthfulQA, BBQ). Columns: held-out evaluation benchmark (ARC, MMLU, SQuAD, TriviaQA, GSM8K). Because LoRA keeps the `lm_head` frozen, the head-discrepancy term $\gamma$ vanishes and the bound is governed entirely by backbone scale ($\Delta\rho$) and shape ($1-\Omega$) drift, as predicted by Sec. [4](https://arxiv.org/html/2605.11608#S4.SS0.SSS0.Px2). Spearman $r_s$ per subplot is annotated in each panel. This replicates Fig. [3](https://arxiv.org/html/2605.11608#S5.F3) on a different family, whose pre-training corpus, depth, and vocabulary size all differ from Llama-3.1-8B.

Figure [7](https://arxiv.org/html/2605.11608#A7.F7) reproduces the two structural observations of the Llama main-text result (Fig. [3](https://arxiv.org/html/2605.11608#S5.F3)) on Qwen3-8B. First, *the PRISM bound rises in lockstep with empirical forgetting across most (task, benchmark) combinations*: as LoRA fine-tuning proceeds on TruthfulQA or BBQ, $\Omega(Z_0, Z_t)$ drifts away from $1$ and $\rho_t$ from $\rho_0$, and this backbone drift tracks the empirical cross-entropy drift on most held-out benchmarks, including benchmarks (e.g., GSM8K) whose distributions differ substantially from the fine-tuning data. Second, *different source tasks induce qualitatively different drift geometries*, matching Llama: TruthfulQA fine-tuning drives predominantly shape drift while BBQ moves both scale and shape, a distinction the decomposition of Eq. ([7](https://arxiv.org/html/2605.11608#S3.E7)) makes visible whereas a unified distance would collapse it. Because Qwen3-8B differs from Llama-3.1-8B across multiple training and architecture axes (pre-training corpus, depth, vocabulary), this replication is direct evidence that the backbone-drift signal is a cross-family property rather than a Llama-specific artifact.

##### Per-subplot variability under low-forgetting regimes.

A small subset of subplots in Fig. [7](https://arxiv.org/html/2605.11608#A7.F7), most notably Qwen3-8B fine-tuned on BBQ and evaluated on MMLU and TriviaQA, shows weak or slightly negative per-subplot $r_s$. These cases share a common signature: $|\Delta\mathcal{R}|$ at $\lambda{=}0$ sits at order $10^{-3}$ nats per token, two to three orders of magnitude below the $\sim 10^{-1}$ nat scale at which the bound is informative across the rest of the grid. Within the same (model, fine-tuning task) pair, this is benchmark-specific: Qwen3-8B BBQ at $\lambda{=}0$ shows $|\Delta\mathcal{R}| \approx 0.0035$ on TriviaQA versus $\approx 0.288$ on ARC (Table [21](https://arxiv.org/html/2605.11608#A7.T21)), roughly two orders of magnitude apart within a single fine-tuning trajectory. When the forgetting magnitude is at this scale, per-checkpoint ranking becomes uninformative; the resulting $r_s$ reflects checkpoint-trajectory noise rather than a failure of the bound. This pattern is a property of the (model, fine-tuning task, benchmark) triple (Qwen3-8B BBQ on TriviaQA is robust to fine-tuning) rather than of the diagnostic. The same noise-dominated effect appears in Llama-3.1-8B BBQ $\to$ SQuAD ($r_s \approx 0.30$, Fig. [3](https://arxiv.org/html/2605.11608#S5.F3)): $|\Delta\mathcal{R}|$ saturates early in fine-tuning, so late checkpoints sit at roughly the same level and the per-checkpoint Spearman is driven by checkpoint noise rather than continued drift.
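To make the saturation argument concrete, the following minimal sketch computes Spearman's $r_s$ in pure Python for two trajectories against the same rising bound: one where forgetting keeps growing and one where it has plateaued near $10^{-3}$ with small jitter. All numbers are hypothetical and chosen only for illustration; they are not taken from Fig. 7 or from any table in this paper, and the helper names are our own.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical checkpoints: the bound rises monotonically with training step.
bound = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
# Hypothetical forgetting that keeps growing with the bound.
drifting = [0.02, 0.05, 0.11, 0.20, 0.41, 0.80]
# Hypothetical forgetting that has saturated near 1e-3 with small jitter.
saturated = [0.0030, 0.0036, 0.0033, 0.0037, 0.0032, 0.0035]

r_drifting = spearman(bound, drifting)
r_saturated = spearman(bound, saturated)
print(f"drifting:  r_s = {r_drifting:.3f}")
print(f"saturated: r_s = {r_saturated:.3f}")
```

In the saturated case the ordering of checkpoints is set by jitter of order $10^{-4}$, so the rank correlation is weak even though the bound itself behaves identically in both cases.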

#### G.2 Shape regularization: Qwen3-8B replication

![Refer to caption](https://arxiv.org/html/2605.11608v1/x7.png)

Figure 8: Shape regularization vs. replay-CE on Qwen3-8B (replication of Fig. [4](https://arxiv.org/html/2605.11608#S5.F4)). Same three configurations as Llama: *no reg*, *replay* ($\lambda{=}0.01$), and *trace* ($\lambda{=}1.0$). Qwen3-8B's baseline forgetting is already small (mean $|\Delta\mathcal{R}| < 0.27$ on both fine-tuning tasks vs. Llama's $0.84$ on TruthfulQA), leaving little room for improvement; the trace-vs-replay gap narrows accordingly but the qualitative mechanism (trace lifts $\Omega$, replay does not) holds. Per-config decompositions: Tables [20](https://arxiv.org/html/2605.11608#A7.T20) and [21](https://arxiv.org/html/2605.11608#A7.T21).

Figure [8](https://arxiv.org/html/2605.11608#A7.F8) reproduces the main-text Llama comparison (Fig. [4](https://arxiv.org/html/2605.11608#S5.F4)) on Qwen3-8B. Because Qwen3-8B's baseline forgetting is already small, both methods stay near baseline level; the qualitative pattern (trace lifts $\Omega$, i.e., contracts $1-\Omega$, while replay does not) holds. Per-config decompositions are in Tables [20](https://arxiv.org/html/2605.11608#A7.T20) and [21](https://arxiv.org/html/2605.11608#A7.T21) below.

#### G.3 Additional trace-norm decomposition tables

Tables [18](https://arxiv.org/html/2605.11608#A7.T18)–[21](https://arxiv.org/html/2605.11608#A7.T21) give the full per-(model, fine-tuning task) decomposition at step $300$ for the three configurations (no reg, replay $\lambda{=}0.01$, trace $\lambda{=}1.0$); the main-text Table [2](https://arxiv.org/html/2605.11608#S5.T2) compresses the Llama-TruthfulQA combination to $\Omega$ and $|\Delta\mathcal{R}|$ only.

Table 18: Regularization comparison for Llama-3.1-8B fine-tuned on TruthfulQA under identity alignment ($W{=}I$). Rows group by evaluation benchmark; each benchmark block lists metrics at step 300 for: no reg, replay $\lambda{=}0.01$ (baseline), trace $\lambda{=}1.0$ (ours). Bold / underline: 1st / 2nd-best smallest $|\Delta\mathcal{R}|$ and largest $\Omega$ per block. Shading: $\Omega{<}0.80$ / $\Omega{<}0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma{=}0$ under frozen `lm_head`.

Table 19: Regularization comparison for Llama-3.1-8B fine-tuned on BBQ under identity alignment ($W{=}I$). Rows group by evaluation benchmark; each benchmark block lists metrics at step 300 for: no reg, replay $\lambda{=}0.01$ (baseline), trace $\lambda{=}1.0$ (ours). Bold / underline: 1st / 2nd-best smallest $|\Delta\mathcal{R}|$ and largest $\Omega$ per block. Shading: $\Omega{<}0.80$ / $\Omega{<}0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma{=}0$ under frozen `lm_head`.

Table 20: Regularization comparison for Qwen3-8B-Base fine-tuned on TruthfulQA under identity alignment ($W{=}I$). Rows group by evaluation benchmark; each benchmark block lists metrics at step 300 for: no reg, replay $\lambda{=}0.01$ (baseline), trace $\lambda{=}1.0$ (ours). Bold / underline: 1st / 2nd-best smallest $|\Delta\mathcal{R}|$ and largest $\Omega$ per block. Shading: $\Omega{<}0.80$ / $\Omega{<}0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma{=}0$ under frozen `lm_head`.

Table 21: Regularization comparison for Qwen3-8B-Base fine-tuned on BBQ under identity alignment ($W{=}I$). Rows group by evaluation benchmark; each benchmark block lists metrics at step 300 for: no reg, replay $\lambda{=}0.01$ (baseline), trace $\lambda{=}1.0$ (ours). Bold / underline: 1st / 2nd-best smallest $|\Delta\mathcal{R}|$ and largest $\Omega$ per block. Shading: $\Omega{<}0.80$ / $\Omega{<}0.95$ on $(\Omega, \delta, \mathcal{B}, |\Delta\mathcal{R}|)$; $\gamma{=}0$ under frozen `lm_head`.

### Appendix H When shape regularization helps: task-dependence

TruthfulQA-FT, where shape drift dominates downstream forgetting, illustrates the regularizer's intended use case: trace cuts mean $|\Delta\mathcal{R}|$ from 0.843 (no reg) to 0.681 ($-19\%$), outperforming experience replay (0.764, $-9\%$). BBQ-FT serves as a robustness check in the small-$|\Delta\mathcal{R}|$ regime: averaging the per-benchmark $|\Delta\mathcal{R}|$ values from Table [19](https://arxiv.org/html/2605.11608#A7.T19) gives means of 0.179 (no reg), 0.195 (trace), and 0.241 (replay), all substantially smaller than TruthfulQA's 0.843, reflecting BBQ's modest baseline shape drift (mean baseline $\Omega = 0.932$ vs. $0.906$ for TruthfulQA-FT). Trace outperforms replay in aggregate (0.195 vs. 0.241), although it does not improve on the no-reg mean (0.179); per-benchmark substantive wins (detailed below) persist on shape-driven cells.

##### Geometric mechanism remains intact.

On BBQ, the contrast between trace and experience replay is most visible in the geometric signal: trace lifts $\Omega$ from 0.93 to 0.98 (a $73\%$ reduction in $1-\Omega$) while experience replay leaves $\Omega$ flat (0.93 $\to$ 0.93). The regularizer's targeted quantity moves as designed.

##### Per-benchmark wins on shape-driven cells.

Trace's largest $|\Delta\mathcal{R}|$ reductions appear where forgetting is substantive: on BBQ-FT (Table [19](https://arxiv.org/html/2605.11608#A7.T19)), TriviaQA $|\Delta\mathcal{R}|$ drops from 0.143 to 0.017 ($-88\%$) and GSM8K from 0.061 to 0.013 ($-79\%$); on TruthfulQA-FT (Table [2](https://arxiv.org/html/2605.11608#S5.T2)), SQuAD from 1.337 to 1.054 ($-21\%$) and TriviaQA from 2.583 to 2.124 ($-18\%$). On BBQ-FT ARC and MMLU, the no-reg condition shows substantial shape drift ($\Omega$ drops to 0.88, 0.87) with relatively small $|\Delta\mathcal{R}|$ (0.077, 0.353): the no-reg evidence indicates shape drift does not translate proportionally into $|\Delta\mathcal{R}|$ growth on these cells, so trace's shape restoration lacks a proportional $|\Delta\mathcal{R}|$ target. These cells' shorter answer spans further compress the per-cell CE dynamic range, limiting the observable signal-to-noise of any regularization effect; when absolute $|\Delta\mathcal{R}|$ is small, per-sample CE noise forms a relatively larger fraction of the signal. Rank correlation across the 2$\times$5 grid (Sec. [5.2](https://arxiv.org/html/2605.11608#S5.SS2), mean Spearman $\approx 0.83$) provides the scale-invariant calibration of the bound's predictive power.
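The percentage reductions quoted above follow from the standard relative change $(\text{after} - \text{before})/\text{before}$. The short sketch below recomputes them from the four (no reg, trace) pairs stated in this paragraph; the function and dictionary names are our own and carry no meaning beyond this check.

```python
def relative_change(before, after):
    """Signed relative change; negative values mean forgetting was reduced."""
    return (after - before) / before


# (fine-tuning task, benchmark) -> (|dR| without regularization, |dR| with trace),
# copied from the paragraph above.
cells = {
    ("BBQ-FT", "TriviaQA"): (0.143, 0.017),
    ("BBQ-FT", "GSM8K"): (0.061, 0.013),
    ("TruthfulQA-FT", "SQuAD"): (1.337, 1.054),
    ("TruthfulQA-FT", "TriviaQA"): (2.583, 2.124),
}

# Relative change per cell, rounded to whole percent.
percent = {
    cell: round(100 * relative_change(before, after))
    for cell, (before, after) in cells.items()
}

for (task, bench), pct in percent.items():
    print(f"{task:14s} {bench:9s} {pct:+d}%")
```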

##### When to apply: empirical validation of the gating signal.

Shape regularization is indicated when (i) $\Omega$ drifts substantially during fine-tuning and (ii) downstream forgetting is shape-dominated; PRISM's decomposition (Sec. [5.3](https://arxiv.org/html/2605.11608#S5.SS3)) provides both signals *before* regularization is applied. To quantify how well the gating prediction matches the empirical trace effect, we compute two per-setting scalars across the four (model, fine-tuning task) combinations, both at step $300$ on the same five downstream benchmarks:

- Baseline shape-drift signal $1-\bar{\Omega} := \tfrac{1}{5}\sum_{b}\bigl(1-\Omega_{b}^{\lambda=0}\bigr)$, the across-benchmark mean shape mismatch under no regularization, with per-benchmark $\Omega_{b}^{\lambda=0}$ read from the no-reg rows of Tables [18](https://arxiv.org/html/2605.11608#A7.T18)–[21](https://arxiv.org/html/2605.11608#A7.T21). Larger $1-\bar{\Omega}$ means larger headroom under condition (i).
- Relative trace effect $\Delta|\Delta\mathcal{R}|/|\Delta\mathcal{R}|_{0} := \bigl(\overline{|\Delta\mathcal{R}|}_{\mathrm{trace}} - \overline{|\Delta\mathcal{R}|}_{\lambda=0}\bigr)/\overline{|\Delta\mathcal{R}|}_{\lambda=0}$: the trace-vs-no-reg relative change in across-benchmark mean empirical forgetting, with trace at $\lambda{=}1.0$ (negative $=$ trace helps; positive $=$ trace hurts).
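The two scalars defined above can be computed directly from per-benchmark values. The sketch below is a minimal illustration under stated assumptions: the per-benchmark inputs are hypothetical placeholders (the rows of Tables 18–21 are not reproduced here), and the function names are ours rather than part of any released PRISM code.

```python
from statistics import fmean


def baseline_shape_drift(omega_noreg):
    """1 - Omega-bar: mean shape mismatch over benchmarks with no regularizer."""
    return fmean([1.0 - omega for omega in omega_noreg])


def relative_trace_effect(dr_trace, dr_noreg):
    """Relative change in across-benchmark mean |dR|: trace vs. no reg.

    Negative means trace reduced mean forgetting; positive means it increased.
    """
    mean_trace = fmean(dr_trace)
    mean_noreg = fmean(dr_noreg)
    return (mean_trace - mean_noreg) / mean_noreg


# Hypothetical per-benchmark values for one (model, fine-tuning task) setting,
# ordered as five downstream benchmarks. Illustrative only, not table rows.
omega_noreg = [0.90, 0.92, 0.88, 0.91, 0.92]
dr_noreg = [0.50, 1.00, 0.75, 1.25, 0.50]
dr_trace = [0.40, 0.80, 0.60, 1.00, 0.40]

drift = baseline_shape_drift(omega_noreg)
effect = relative_trace_effect(dr_trace, dr_noreg)
print(f"1 - Omega-bar         = {drift:.4f}")
print(f"relative trace effect = {effect:+.1%}")
```

Note that the second scalar is a ratio of across-benchmark means, following the formula above, rather than a mean of per-benchmark ratios; the two can differ noticeably when a few benchmarks dominate the mean.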

Table 22: Diagnostic-gating prediction vs. empirical trace effect across the four (model, fine-tuning task) settings, ordered by decreasing $1-\bar{\Omega}$. Trace yields the largest $|\Delta\mathcal{R}|$ reduction where shape drift is largest (Llama TruthfulQA), is essentially neutral on both Qwen settings where $1-\bar{\Omega}$ sits at the noise floor, and shows a partial cell-level exception on Llama BBQ explained in the next paragraph.

Three of four settings match the gating prediction directly: trace produces the largest benefit where $1-\bar{\Omega}$ is largest (Llama TruthfulQA: $-19.2\%$), and is essentially neutral (magnitude $<3\%$) on both Qwen settings where $1-\bar{\Omega}$ sits at the noise floor. Llama BBQ is the partial exception: its aggregate $1-\bar{\Omega} = 0.0678$ exceeds the noise floor (condition (i) satisfied), but the drift is concentrated on ARC and MMLU where condition (ii) is violated: substantial shape drift accompanies small $|\Delta\mathcal{R}|$ growth, so trace's shape restoration lacks a proportional $|\Delta\mathcal{R}|$ target (per-benchmark paragraph above). On the two Llama-BBQ benchmarks where condition (ii) does hold (TriviaQA and GSM8K), trace yields $-88\%$ and $-79\%$ reductions respectively, so the aggregate $+8.6\%$ reflects mixed cell-level behavior rather than a failure of the shape-drift gating itself. Adaptive per-cell deployment of the regularizer, driven by PRISM's per-benchmark decomposition online, is the natural extension and is left to future work, alongside the short-answer fine-tuning regimes noted earlier in this appendix.

##### Empirical confirmation: Qwen3-8B replication.

The Qwen3-8B replication (Tables [20](https://arxiv.org/html/2605.11608#A7.T20)–[21](https://arxiv.org/html/2605.11608#A7.T21), Fig. [8](https://arxiv.org/html/2605.11608#A7.F8)) extends the regime-dependence story from *tasks* (TruthfulQA-FT vs. BBQ-FT on Llama) to *models*, providing direct empirical evidence for condition (i) above. Qwen3-8B is markedly more robust to LoRA forgetting on these two fine-tuning sources than Llama-3.1-8B: across-benchmark mean $|\Delta\mathcal{R}|$ at $\lambda{=}0$ is $0.263$ on Qwen TruthfulQA-FT vs. Llama's $0.843$ ($3.2\times$ smaller), and $0.112$ on Qwen BBQ-FT vs. Llama's $0.179$. Baseline shape preservation is correspondingly closer to its ceiling on Qwen ($\Omega$ at $\lambda{=}0$: $0.991$ on Qwen TruthfulQA-FT, $0.999$ on Qwen BBQ-FT vs. Llama's $0.906$ and $0.932$), leaving the shape regularizer with essentially no drift to repair: $1-\Omega$ collapses from Llama's $0.094$ / $0.068$ to Qwen's $0.009$ / $0.001$, an order of magnitude smaller. The differences across no-reg, replay, and trace correspondingly shrink to the evaluation-noise floor: across-benchmark mean $|\Delta\mathcal{R}|$ on Qwen TruthfulQA-FT is $0.263$ / $0.282$ / $0.270$ (no-reg / replay / trace), and on Qwen BBQ-FT is $0.112$ / $0.110$ / $0.112$, with method-to-method spread under $0.02$ on both. This is the same low-SNR phenomenon documented for GSM8K under PTQ in Appendix [F.3](https://arxiv.org/html/2605.11608#A6.SS3) (Table [10](https://arxiv.org/html/2605.11608#A6.T10)), where the bound's per-benchmark mean Spearman drops from $\geq 0.77$ on ARC/MMLU/SQuAD/TriviaQA to $\approx 0.41$ on GSM8K precisely because $|\Delta\mathcal{R}|$ collapses to $\approx 0.019$. We report the Qwen3-8B regularization numbers in full for transparency and reproducibility, but emphasize that the shape regularizer's mechanism (contracting $1-\Omega$ at the source of LoRA backbone drift) can demonstrate a measurable benefit only when $1-\Omega$ has substantive headroom to contract. Qwen3-8B fine-tuning on TruthfulQA/BBQ violates condition (i) above, and the Qwen3-8B numbers therefore do not contradict the Llama TruthfulQA-FT result; they confirm the regime-dependence the bound itself predicts.
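The headroom and noise-floor figures quoted in this paragraph can be re-derived from the reported aggregates. The sketch below does so using only numbers stated above; the dictionary layout and variable names are our own, and nothing here goes beyond restating the arithmetic.

```python
# Baseline (lambda = 0) across-benchmark mean Omega, as reported above.
omega_noreg = {
    "Llama TruthfulQA-FT": 0.906,
    "Llama BBQ-FT": 0.932,
    "Qwen TruthfulQA-FT": 0.991,
    "Qwen BBQ-FT": 0.999,
}

# Across-benchmark mean |dR| on Qwen for (no-reg, replay, trace), as reported.
qwen_dr = {
    "Qwen TruthfulQA-FT": (0.263, 0.282, 0.270),
    "Qwen BBQ-FT": (0.112, 0.110, 0.112),
}

# Shape headroom 1 - Omega available to the regularizer in each setting.
headroom = {name: round(1.0 - omega, 3) for name, omega in omega_noreg.items()}

# Largest gap between any two methods within each Qwen setting.
spread = {name: max(vals) - min(vals) for name, vals in qwen_dr.items()}

# Ratio of Llama to Qwen baseline forgetting on TruthfulQA-FT.
forgetting_ratio = round(0.843 / 0.263, 1)

for name, value in headroom.items():
    print(f"{name:20s} 1 - Omega = {value:.3f}")
for name, value in spread.items():
    print(f"{name:20s} method spread = {value:.3f}")
print(f"Llama / Qwen baseline forgetting (TruthfulQA-FT): {forgetting_ratio}x")
```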

### Appendix I Future work

- Beyond LoRA forgetting. PRISM's diagnostic and regularization roles (Sec. [5.4](https://arxiv.org/html/2605.11608#S5.SS4)) extend naturally to broader fine-tuning (full SFT, distillation, continual learning), where backbone drift is substantially larger. The shape regularizer constrains drift; $\Omega$ and $\Delta\rho$ together monitor it per step in ways validation loss does not.
- Diagnostic applications. The single-forward-pass computation of $\mathcal{B}$ enables three deployment uses: *per-sample OOD detection* via scale and shape residuals on individual inputs; *cross-scale hyperparameter transfer* between a small proxy and its scaled target to predict tuning robustness; and *production drift monitoring* on live traffic.
- Beyond LLMs. Vision Transformers and contrastive image encoders share LLMs' backbone-then-linear-head structure, with final-layer LayerNorm (and strict L2 normalization in CLIP/SigLIP) enforcing tight feature scales, which makes the PRISM scale axis naturally small. Natural targets include ViT quantization, vision-encoder distillation, and cross-modal alignment (consistent with the Platonic Representation Hypothesis [[16](https://arxiv.org/html/2605.11608#bib.bib16)]).