Hidden-State Privacy Has an Empty Middle

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
This paper demonstrates that Gaussian releases of transformer hidden states cannot simultaneously achieve moderate privacy and utility, proving an 'empty middle' and proposing a split-memory architecture to overcome this limitation.
arXiv:2605.24042v1 Announce Type: new Abstract: Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
Cached at: 05/26/26, 08:59 AM
# Hidden-State Privacy Has an Empty Middle
Source: [https://arxiv.org/html/2605.24042](https://arxiv.org/html/2605.24042)
Alexander Okezue Bell, Stanford University, 450 Jane Stanford Way, Stanford, CA 94305, okezue@stanford.edu

###### Abstract

Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\leq 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$–$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at $9.3$. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.

## 1 Introduction

Decoder-only transformer hidden states are almost surely injective in the input prompt (Nikolaou et al., [2026](https://arxiv.org/html/2605.24042#bib.bib1)), so any cached, logged, or vector-indexed activation (Tang and others, [2025](https://arxiv.org/html/2605.24042#bib.bib10); Liu et al., [2024](https://arxiv.org/html/2605.24042#bib.bib11); Morris et al., [2023](https://arxiv.org/html/2605.24042#bib.bib9)) is a functionally lossless record of the user’s input. The natural defense is a Gaussian release, in which noise is added to the activation before it is stored. This paper asks whether the Gaussian release class is enough for the concrete release object we evaluate: single-layer residual-stream activations, with full key-value (KV) cache release left as a deployment-motivated extension. It is not. Of $1{,}536$ Gaussian covariances we tested, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker, and we prove a complementary worst-case obstruction over a Fisher-ball adversary class that is consistent with this empty middle and rules out uniform safety for any full-rank Gaussian release at $O(1)$ Fisher utility. The escape is to redesign the model. We introduce a split-memory architecture trained from scratch that sits cleanly inside the moderate-both region the Gaussian class cannot reach, at a fixed-token language-modeling loss penalty.

The proof relies on a gradient-covariance decomposition of the hidden state. Diagonalizing the per-example loss-gradient covariance yields a Fisher subspace $P_B$ spanned by the top-$k$ eigenvectors (Amari, [1998](https://arxiv.org/html/2605.24042#bib.bib7); Martens, [2020](https://arxiv.org/html/2605.24042#bib.bib8)) and its complement $P_I$. On modern 7–14B models, replacing $h$ with $P_I h$ at $k=128$ preserves the output distribution to small KL, while replacing it with $P_B h$ destroys predictions at a cost of several nats. The asymmetry correlates with spectral concentration. On Mistral-7B (Jiang et al., [2023](https://arxiv.org/html/2605.24042#bib.bib18)), Qwen3-14B (Qwen Team, [2025](https://arxiv.org/html/2605.24042#bib.bib19)), and DeepSeek-R1-14B (DeepSeek-AI, [2025](https://arxiv.org/html/2605.24042#bib.bib20)), the top-$128$ eigenvectors of the $n_{\mathrm{cal}}=200$ empirical gradient covariance capture more than $99\%$ of calibration-sample energy, although split-half validation shows the subspace itself is not stable at this calibration size. On the GPT-2 family (Radford et al., [2019](https://arxiv.org/html/2605.24042#bib.bib12)) the spectrum is diffuse and the asymmetry weakens or reverses.

We formalize the utility-privacy tradeoff as a minimax game between a Gaussian release and a Bayes-optimal retrieval attacker, derive the covariance-aware optimum $\Sigma^\star_{\mathrm{Mah}}$ in closed form, and prove a worst-case lower bound over a Fisher-ball adversary class showing that every full-rank Gaussian release admits a direction exponentially distinguishable in $d$. This rules out uniform Gaussian safety at constant utility while remaining consistent with the empirical empty middle on the realized prompt-difference distribution. The diagonal inverse-Fisher mechanism $\Sigma_{\mathrm{diag}}=\sigma^2\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal release against this class, and empirically the only Gaussian mechanism with worst-attacker top-1 $\leq 0.001$ at every point of a 32-model-layer sweep. The predictive scalar $G_{\mathrm{Mah}}$ equals one over the squared matrix fidelity between the trace-normalized Fisher and margin covariances, with a projector-separation lower bound that pins down which architectures admit a large defense. Generalized-eigen reaches a measured $13\times$ Euclidean Pareto gain on Mistral-7B but collapses to $100\%$ top-1 under adaptive attack, and of $1{,}536$ Gaussian cells zero meet both moderate utility and moderate privacy.

Scaling across ten models from 124M to 14.8B gives an empirical law $m_B/m_{\mathrm{full}}\approx\sqrt{k/d}$ at $R^2=0.93$; the fixed-projector isotropy theorem (Appendix [Q.5](https://arxiv.org/html/2605.24042#A17.SS5)) gives a sufficient condition for reading this as any-rank-$k$-projection geometry. Two constructive results close the picture. (1) A full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes at exact match but $0\%$ under $\Sigma_{\mathrm{diag}}$. (2) A split-memory transformer (SMT) trained from scratch with logits reading from a low-dimensional trunk reaches $G_{\mathrm{Mah}}\in[20,33]$ across probe layers against a same-budget GPT baseline at $1.1$–$1.3$, at a fixed-token language-modeling loss penalty. Compute usage, code, data, and SMT checkpoints are in Appendices [AC](https://arxiv.org/html/2605.24042#A29) and [AD](https://arxiv.org/html/2605.24042#A30).

## 2 Background

Our analysis applies to decoder-only transformer language models. For a model with vocabulary $\mathcal{V}$, context length $K$, and hidden width $d$, and an input $x\in\mathcal{V}^{\leq K}$, we denote by $h_\ell(x)\in\mathbb{R}^d$ the last-token residual-stream state at layer $\ell$. The next-token distribution is $p_\theta(\cdot\mid x)=\mathrm{softmax}(W_u\cdot\mathrm{norm}(h_L(x)))$. For a prefix $x$ and ground-truth next token $y$, $\mathcal{L}(x,y)=-\log p_\theta(y\mid x)$ is the per-token cross-entropy loss. We work at a single interior layer $\ell$ throughout, usually chosen at proportional depth $\ell=L/2$, and write $h=h_\ell(x)$ when the prefix is clear from context.

Nikolaou et al. ([2026](https://arxiv.org/html/2605.24042#bib.bib1)) prove that the map $x\mapsto h_\ell(x)$ is almost surely injective under standard weight initialization for a decoder-only transformer, and that no collision is ever introduced during gradient training. They complement the theorem with SipIt, an algorithm that reconstructs $x$ from $h_\ell(x)$ in time linear in the prefix length, and observe zero collisions across billions of pairwise comparisons on six production-scale transformers. Any system that caches, logs, quantizes, or transmits hidden states (KV-cache compression (Tang and others, [2025](https://arxiv.org/html/2605.24042#bib.bib10); Liu et al., [2024](https://arxiv.org/html/2605.24042#bib.bib11)), embedding-based retrieval (Morris et al., [2023](https://arxiv.org/html/2605.24042#bib.bib9))) is therefore handling a functionally lossless copy of the user’s input.

The geometric object that governs how $h$ participates in prediction is the hidden-state Fisher information $F_x=\mathbb{E}_{y\sim p_\theta(\cdot\mid h)}[\nabla_h\log p_\theta(y\mid h)\,\nabla_h\log p_\theta(y\mid h)^\top]$, with population version $F=\mathbb{E}_x[F_x]$. Its top eigenvectors are the directions along which small perturbations to $h$ produce the largest change in the predicted token distribution (Amari, [1998](https://arxiv.org/html/2605.24042#bib.bib7); Martens, [2020](https://arxiv.org/html/2605.24042#bib.bib8)). In practice we compute the empirical gradient covariance $\Sigma_g=\frac{1}{n}\sum_{i=1}^{n}\nabla_h\mathcal{L}(x_i,y_i)\,\nabla_h\mathcal{L}(x_i,y_i)^\top$ and use its top-$k$ eigenvectors $U_B$ as the Fisher subspace basis, with projector $P_B=U_BU_B^\top$. The two coincide up to scaling in the population limit at the cross-entropy optimum (Kunstner et al., [2019](https://arxiv.org/html/2605.24042#bib.bib30)); we do not claim equality at finite calibration size. All theorems are stated for $F$; all measurements substitute $\Sigma_g$. Prior work has shown that hidden activations leak substantial information about their inputs (Morris et al., [2023](https://arxiv.org/html/2605.24042#bib.bib9); Song and Raghunathan, [2020](https://arxiv.org/html/2605.24042#bib.bib32); Pan et al., [2020](https://arxiv.org/html/2605.24042#bib.bib33); Li and others, [2025](https://arxiv.org/html/2605.24042#bib.bib37)). What remains open is the geometric structure of the recoverable signal and its relationship to the directions the model uses for prediction.
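As a minimal sketch of the construction above, assuming the per-example gradients $\nabla_h\mathcal{L}(x_i,y_i)$ have already been collected into an array, the following NumPy code forms $\Sigma_g$, the top-$k$ projector $P_B$, and its complement. The function name `fisher_projectors` and the synthetic gradients are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fisher_projectors(grads: np.ndarray, k: int):
    """grads has shape (n, d): one loss gradient w.r.t. h per calibration example."""
    n, d = grads.shape
    sigma_g = grads.T @ grads / n              # empirical (uncentered) gradient covariance
    evals, evecs = np.linalg.eigh(sigma_g)     # eigenvalues in ascending order
    U_B = evecs[:, ::-1][:, :k]                # top-k eigenvectors as columns
    P_B = U_B @ U_B.T                          # projector onto the Fisher subspace
    P_I = np.eye(d) - P_B                      # projector onto the complement
    return sigma_g, P_B, P_I

# Synthetic stand-in for real gradients (assumption): anisotropic Gaussian rows.
rng = np.random.default_rng(0)
grads = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.1, 16)
sigma_g, P_B, P_I = fisher_projectors(grads, k=4)
print(np.trace(P_B), np.trace(P_I))
```

With real models the gradients would come from a backward pass through the layers above $\ell$; that step is omitted here.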

## 3 Geometric setup

We characterize the geometry of the hidden state at a fixed layer through three positive semidefinite matrices. The state covariance $\Sigma_h=\mathbb{E}_x[h(x)\,h(x)^\top]$ describes where hidden states live in $\mathbb{R}^d$. The Fisher $F$ defined in Section [2](https://arxiv.org/html/2605.24042#S2) describes where the loss is sensitive to perturbations of $h$. The margin-direction covariance $\Sigma_\delta=\mathbb{E}_{x,x'}[\hat\delta_{x,x'}\,\hat\delta_{x,x'}^\top]$, with $\hat\delta_{x,x'}=(h(x)-h(x'))/\|h(x)-h(x')\|$, describes the directions along which distinct prompts differ. The top-Fisher projector $P_B$ is the orthogonal projector onto the top-$k$ eigenvectors of $F$ in theory and of $\Sigma_g$ in practice; $P_I=I-P_B$ projects onto the low-Fisher complement.

Three scalar quantities summarize a model’s geometry. Fisher concentration $E_k=\sum_{i=1}^{k}\lambda_i^F/\sum_{i=1}^{d}\lambda_i^F$ measures what fraction of gradient variance lives in the top-$k$ subspace. Channel coupling $\kappa=\mathrm{tr}(P_B\Sigma_\delta)/((k/d)\,\mathrm{tr}(\Sigma_\delta))$ is the ratio of the $\Sigma_\delta$-mass that $P_B$ captures to the random-projection baseline. Finally, the effective-rank fraction $\rho=r_{95}/d$ with $r_{95}=\min\{k:E_k\geq 0.95\}$ is a scale-normalized spectral summary. For 7–14B models, where $n_{\mathrm{cal}}=200$, $\rho$ is a sample-effective-rank estimate rather than a stable projector estimate (Appendix [A](https://arxiv.org/html/2605.24042#A1)). Empirically, $E_k$ and the cumulative spectrum correlate with training scale and architectural choices, while $\kappa$ measures alignment with the margin covariance.
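The three scalars can be computed directly from their definitions. The sketch below is illustrative only: the helper name `geometry_scalars` and the toy spectrum are assumptions, and a real measurement would substitute $\Sigma_g$ for $F$ as described in Section 2.

```python
import numpy as np

def geometry_scalars(F, Sigma_delta, k):
    """Return (E_k, kappa, rho) for a PSD Fisher F and margin covariance Sigma_delta."""
    d = F.shape[0]
    evals, evecs = np.linalg.eigh(F)              # ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]    # switch to descending order
    cum = np.cumsum(evals) / evals.sum()          # cum[j - 1] equals E_j
    E_k = cum[k - 1]
    U_B = evecs[:, :k]
    P_B = U_B @ U_B.T
    kappa = np.trace(P_B @ Sigma_delta) / ((k / d) * np.trace(Sigma_delta))
    r95 = int(np.searchsorted(cum, 0.95)) + 1     # smallest j with E_j >= 0.95
    return E_k, kappa, r95 / d

# Toy example (assumption): a rotated Fisher with a known spectrum summing to 100.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
spectrum = np.array([60.0, 30.0, 6.0, 2.0, 1.0, 0.5, 0.3, 0.2])
F = Q @ np.diag(spectrum) @ Q.T
E_k, kappa, rho = geometry_scalars(F, np.eye(8), k=2)
print(E_k, kappa, rho)
```

With an isotropic $\Sigma_\delta$ the coupling $\kappa$ equals one by construction, which is the random-projection baseline the definition normalizes against.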

The KL and $\ell_2$ metrics see different objects. The expected KL between clean and perturbed next-token distributions under a small perturbation $\delta h$ admits the second-order expansion

$$\mathbb{E}_x\!\bigl[\,\mathrm{KL}\bigl(p_\theta(\cdot\mid h)\,\|\,p_\theta(\cdot\mid h+\delta h)\bigr)\bigr]\;=\;\tfrac{1}{2}\,\delta h^\top F\,\delta h+O(\|\delta h\|^3),\tag{1}$$

so KL sees $F$. The median nearest-neighbor $\ell_2$ margin is governed by $\Sigma_\delta$, because $\|P(h_x-h_{x'})\|^2=\|h_x-h_{x'}\|^2\cdot\hat\delta_{x,x'}^\top P\,\hat\delta_{x,x'}$ for any orthogonal projector $P$. The local expansion governs additive-noise utility ($\tfrac{1}{2}\mathrm{tr}(F\Sigma)$). Deterministic projections like $h\mapsto P_Ih$ have perturbation $\delta h=-P_Bh$, which is not small in the Fisher metric, so we treat their KL as an empirical diagnostic rather than a quadratic prediction.
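Equation (1) can be checked numerically on a toy readout. The sketch below assumes a plain linear-softmax head with no normalization layer and a single hidden state, so it tests the per-example $F_x$ rather than the population $F$; the sizes, the matrix `W`, and the helper names are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
V, d = 50, 8                          # toy vocabulary size and hidden width
W = 0.5 * rng.normal(size=(V, d))     # toy readout: logits = W @ h
h = rng.normal(size=d)
p = softmax(W @ h)

# Fisher of a softmax readout with respect to h: W^T (diag(p) - p p^T) W.
F_x = W.T @ (np.diag(p) - np.outer(p, p)) @ W

u = rng.normal(size=d)
u /= np.linalg.norm(u)                # fixed unit direction for the perturbation
ratios = []
for eps in (1e-1, 1e-2, 1e-3):
    delta = eps * u
    exact = kl(p, softmax(W @ (h + delta)))
    quad = 0.5 * delta @ F_x @ delta
    ratios.append(exact / quad)
    print(f"eps={eps:g}  exact={exact:.3e}  quadratic={quad:.3e}  ratio={exact / quad:.4f}")
```

The ratio of exact KL to the quadratic form should approach one as the perturbation shrinks, which is the regime in which $\tfrac{1}{2}\mathrm{tr}(F\Sigma)$ is a meaningful utility cost.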

###### Proposition 1 (Random-projection margin law).

Let $P$ be any fixed rank-$k$ orthogonal projector in $\mathbb{R}^d$, and let $u$ be a random unit vector drawn uniformly from $S^{d-1}$. Then $\|Pu\|^2\sim\mathrm{Beta}(k/2,(d-k)/2)$ with mean $k/d$, and $\mathbb{E}[\|Pu\|]\to\sqrt{k/d}$ as $d\to\infty$.
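A quick Monte Carlo check of Proposition 1, offered as a sketch rather than part of the paper's protocol. It takes $P$ to be the projector onto the first $k$ coordinates, which loses no generality because the uniform distribution on the sphere is rotation invariant; the dimensions and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 256, 32, 10_000

# Uniform unit vectors on S^{d-1}: normalize standard Gaussian draws.
u = rng.normal(size=(n, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)

# Samples of ||P u||^2 for P = projector onto the first k coordinates.
sq = np.sum(u[:, :k] ** 2, axis=1)

a, b = k / 2, (d - k) / 2
beta_mean = a / (a + b)                            # equals k / d
beta_var = a * b / ((a + b) ** 2 * (a + b + 1))    # variance of Beta(a, b)

mean_sq, var_sq, mean_norm = sq.mean(), sq.var(), np.sqrt(sq).mean()
print(f"mean ||Pu||^2 = {mean_sq:.4f}   (k/d = {beta_mean:.4f})")
print(f"var  ||Pu||^2 = {var_sq:.2e}   (Beta variance = {beta_var:.2e})")
print(f"mean ||Pu||   = {mean_norm:.4f}   (sqrt(k/d) = {np.sqrt(k / d):.4f})")
```

The empirical mean of $\|Pu\|$ sits slightly below $\sqrt{k/d}$ at finite $d$, as Jensen's inequality requires, and the gap shrinks as the Beta distribution concentrates.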

A finite-sample fixed-projector concentration version (Appendix [Q.5](https://arxiv.org/html/2605.24042#A17.SS5)) turns this into a prediction for any rank-$k$ projection whenever the inter-prefix difference distribution is approximately isotropic. The data are therefore consistent with $P_B$ behaving like any fixed rank-$k$ projector rather than specifically the Fisher eigenspace, and the structural content of the decomposition lives in KL rather than in the $\ell_2$ margin.

##### Measurement protocol and scope.

We use ten open-weight decoder-only transformers with 124M to 14.8B parameters: GPT-2 Small/Large/XL (Radford et al., [2019](https://arxiv.org/html/2605.24042#bib.bib12)), TinyLlama-1.1B (Touvron et al., [2023](https://arxiv.org/html/2605.24042#bib.bib31)), Phi-2, Qwen2.5-3B, Qwen3-14B (Qwen Team, [2025](https://arxiv.org/html/2605.24042#bib.bib19)), Mistral-7B (Jiang et al., [2023](https://arxiv.org/html/2605.24042#bib.bib18)), DeepSeek-R1-Distill-Qwen-14B (DeepSeek-AI, [2025](https://arxiv.org/html/2605.24042#bib.bib20)), and OLMoE-1B-7B (Muennighoff et al., [2025](https://arxiv.org/html/2605.24042#bib.bib21)), each at proportional depth $\ell=L/2$. We release a single last-token residual-stream hidden state per prefix, on prefix lengths 32 or 64 and WikiText-style data, against $50{,}000$-distractor banks (full KV-cache release is motivating but out of scope). We compute $\Sigma_g$ with $n_{\mathrm{cal}}=2000$ for models of $\leq$3B parameters and $n_{\mathrm{cal}}=200$ for 7–14B models (sample-covariance caveat in Appendix [A](https://arxiv.org/html/2605.24042#A1)). For KL, we replace the mid-layer $h$ with $P_Bh$ or $P_Ih$ via a forward-pass hook. For retrieval, the attacker sees $\tilde h=h+\xi$ and ranks distractors in the hidden space (Morris et al., [2023](https://arxiv.org/html/2605.24042#bib.bib9); Song and Raghunathan, [2020](https://arxiv.org/html/2605.24042#bib.bib32); Li and others, [2025](https://arxiv.org/html/2605.24042#bib.bib37)). Calibration and hardware details are in Appendices [A](https://arxiv.org/html/2605.24042#A1) and [AC](https://arxiv.org/html/2605.24042#A29).

##### The KL projection asymmetry.

Replacing $h$ with $P_Ih$ at $k=128$ preserves the next-token distribution to small KL on every modern architecture: 0.06 nats on Mistral-7B, 0.20 on Qwen3-14B, and 0.06 on DeepSeek-R1-14B, against 5–10 nats for the corresponding $P_Bh$ projection. The KL ratio spans $52\times$–$509\times$ across the seven 1B+ parameter models. On the GPT-2 family the asymmetry reverses: $E_{128}\in[0.53,0.58]$ leaves $P_I$ carrying roughly half the predictive variation rather than almost none, with the ratio sitting at 0.5–0.6. The direction flip correlates with the broader spectral shape and architectural choices, not with $E_{128}$ alone (TinyLlama and Phi-2 at $E_{128}\approx 0.58$ show the modern direction at $52$–$62\times$). RMSNorm, RoPE, gated-linear MLPs, and 1–15T-token training produce the concentrated regime, while GPT-2’s LayerNorm-with-mean-subtraction (Ba et al., [2016](https://arxiv.org/html/2605.24042#bib.bib16)) and shorter training produce the reversal. The per-model table, transplant tests, bit-width grid, surface-perturbation categories, and per-prefix fiber analysis are detailed in Appendices [C](https://arxiv.org/html/2605.24042#A3), [L](https://arxiv.org/html/2605.24042#A12), [D](https://arxiv.org/html/2605.24042#A4), [E](https://arxiv.org/html/2605.24042#A5), [N](https://arxiv.org/html/2605.24042#A14), [P](https://arxiv.org/html/2605.24042#A16), [K](https://arxiv.org/html/2605.24042#A11), [J](https://arxiv.org/html/2605.24042#A10), and [I](https://arxiv.org/html/2605.24042#A9).

##### The margin scaling law.

At $k=128$ the behavior-margin fraction $m_B/m_{\mathrm{full}}$ across the 10 models tracks $\sqrt{k/d}$ at slope $0.93$, $R^2=0.93$. The fixed-projector isotropy theorem (Appendix [Q.5](https://arxiv.org/html/2605.24042#A17.SS5)) gives a sufficient condition under operator-norm anisotropy $\varepsilon_{\mathrm{iso}}$; in our data $\varepsilon_{\mathrm{iso}}$ is large, so we read the law primarily as an empirical alignment supported by direct $\mathrm{tr}(P_B\Sigma_\delta)$ measurements rather than as a tight theorem consequence. The structural content lives in KL. The per-model table, $(E_k,\kappa,\rho)$ grid, margin-distribution percentiles, and support-code cross-layer transcoder (CLT) bound are in Appendices [G](https://arxiv.org/html/2605.24042#A7), [F](https://arxiv.org/html/2605.24042#A6), [H](https://arxiv.org/html/2605.24042#A8), [M](https://arxiv.org/html/2605.24042#A13), and [O](https://arxiv.org/html/2605.24042#A15).

## 4 The weak-attacker mirage

A defender who sees the Fisher $F$ and the margin-direction covariance $\Sigma_\delta$ can in principle place noise along directions where a small utility cost buys a large retrieval-attack cost. Here we develop the mechanism that is optimal against a Euclidean $\ell_2$ retrieval attacker, measure a $13\times$ Pareto reduction in retrieval-attack success on Mistral-7B at matched KL, and show in Section [5](https://arxiv.org/html/2605.24042#S5) that the gain does not survive a covariance-aware attacker. The $13\times$ gain is a mirage and a canonical example of weak-attacker overfitting. It is worth walking through because the mechanism is exactly what equation ([1](https://arxiv.org/html/2605.24042#S3.E1)) suggests should work, and exactly what an adaptive adversary defeats.

### 4.1 The optimization and its closed form

Consider a defender who adds mean-zero Gaussian noise $\xi\sim\mathcal{N}(0,\Sigma_\xi)$ to the hidden state $h$ before it is cached or transmitted. The leading-order utility cost is the expected KL divergence between clean and perturbed next-token distributions, which by equation ([1](https://arxiv.org/html/2605.24042#S3.E1)) equals $\tfrac{1}{2}\,\mathrm{tr}(F\Sigma_\xi)+O(\|\xi\|^3)$. Against a Euclidean attacker, the retrieval signal-to-noise degrades in proportion to $\mathrm{tr}(\Sigma_\delta\Sigma_\xi)\,/\,\mathbb{E}\|h(x)-h(x')\|^2$. The defender’s problem is therefore

$$\max_{\Sigma_\xi\succeq 0}\;\;\mathrm{tr}(\Sigma_\delta\,\Sigma_\xi)\quad\text{subject to}\quad\tfrac{1}{2}\,\mathrm{tr}(F\,\Sigma_\xi)\leq\mathcal{K},\tag{2}$$

where $\mathcal{K}$ is the utility budget in nats of KL. This is a linear objective over the PSD cone with a single trace constraint, so the optimum is attained at a rank-one matrix. Concretely, the change of variable $u=F^{1/2}v$ reduces equation ([2](https://arxiv.org/html/2605.24042#S4.E2)) to a top-eigenvalue problem on the whitened margin covariance $F^{-1/2}\Sigma_\delta F^{-1/2}$, and the optimizer is

$$\Sigma^\star_{\mathrm{Euc}}=2\mathcal{K}\,v_1v_1^\top,\qquad v_1=F^{-1/2}u_1,\qquad u_1=\arg\max_{\|u\|=1}u^\top F^{-1/2}\Sigma_\delta F^{-1/2}u,\qquad v_1^\top Fv_1=1,\tag{3}$$

with Pareto ratio over isotropic at matched utility $G_{\mathrm{Euc},1}=\lambda_1/\bar\lambda$, where $\lambda_i$ are the generalized eigenvalues of $\Sigma_\delta v=\lambda Fv$ in decreasing order and $\bar\lambda=\mathrm{tr}(\Sigma_\delta)/\mathrm{tr}(F)$. KKT details are in Appendix [S](https://arxiv.org/html/2605.24042#A19).

For empirical Pareto exploration we instead use a rank-$k$ generalized-eigen mechanism that spreads the budget over the top $k$ generalized eigenvectors under equal-Fisher allocation, $\Sigma_{\mathrm{GE},k}(\mathcal{K})=(2\mathcal{K}/k)\sum_{i=1}^{k}v_iv_i^\top$ with $v_i^\top Fv_i=1$. Its gain is $G_{\mathrm{Euc},k}=(\tfrac{1}{k}\sum_{i\leq k}\lambda_i)/\bar\lambda$, and we use $k=128$ below.
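The closed forms above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, $F$ is taken positive definite so that $F^{-1/2}$ exists, and the matrices are random stand-ins. Setting `k=1` recovers the rank-one optimum of equation (3).

```python
import numpy as np

def inv_sqrt_pd(A):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V / np.sqrt(w)) @ V.T

def generalized_eigen_mechanism(F, S, budget, k):
    """Rank-k release covariance and its predicted Euclidean gain over isotropic noise."""
    F_mh = inv_sqrt_pd(F)
    lam, U = np.linalg.eigh(F_mh @ S @ F_mh)     # whitened margin covariance
    lam, U = lam[::-1][:k], U[:, ::-1][:, :k]    # top-k generalized eigenpairs
    Vk = F_mh @ U                                # columns satisfy v_i^T F v_i = 1
    Sigma = (2.0 * budget / k) * Vk @ Vk.T
    lam_bar = np.trace(S) / np.trace(F)
    return Sigma, lam.mean() / lam_bar

# Random stand-ins for F and Sigma_delta (assumptions).
rng = np.random.default_rng(0)
d, budget = 12, 0.5
A = rng.normal(size=(d, d)); F = A @ A.T + 0.1 * np.eye(d)
B = rng.normal(size=(d, d)); S = B @ B.T / d
Sigma_iso = (2.0 * budget / np.trace(F)) * np.eye(d)   # isotropic noise at the same budget

results = {}
for k in (1, 4):
    Sigma, gain = generalized_eigen_mechanism(F, S, budget, k)
    utility = 0.5 * np.trace(F @ Sigma)                        # should equal the budget
    measured = np.trace(S @ Sigma) / np.trace(S @ Sigma_iso)   # objective ratio vs isotropic
    results[k] = (utility, gain, measured)
    print(f"k={k}: utility={utility:.4f}  predicted gain={gain:.3f}  measured={measured:.3f}")
```

Spreading the budget over more directions lowers the predicted gain because the mean of the top-$k$ generalized eigenvalues is non-increasing in $k$.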

We compare four mechanisms: isotropic ($\Sigma_\xi=\sigma^2I$), complement ($\sigma^2P_I$), Fisher-complement ($\sigma^2P_IF^\dagger P_I$), and the rank-$k$ generalized-eigen mechanism of Section [4.1](https://arxiv.org/html/2605.24042#S4.SS1). Mechanism-design considerations and the choice of distractor bank are in Appendix [B](https://arxiv.org/html/2605.24042#A2).

### 4.2 Measurement and the $13\times$ result

We evaluate all four on Mistral-7B at $\ell=16$ and GPT-2 Small at $\ell=6$, with $k=128$, against a 500-prefix bank ranked by $\ell_2$ distance. Figure [1](https://arxiv.org/html/2605.24042#S4.F1) shows the Pareto frontier.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x1.png)

Figure 1: Attack success vs. KL cost for four defenses at $\sigma\in\{0.1,0.3,0.5,1,2,5,10\}$ under the Euclidean attacker. (A) Mistral-7B ($E_{128}=0.99$): generalized-eigen reaches $7\%$ at KL $\approx 7$ while baselines reach $93$–$95\%$ ($13\times$ gain). (B) GPT-2 Small ($E_{128}=0.56$): no exploitable direction, so all four mechanisms cluster.

On Mistral-7B, the rank-$128$ mechanism at $\sigma=5$ reaches mean KL $7.05$ with retrieval top-1 of $7\%$, versus $93$–$95\%$ for baselines at matched KL, a $13\times$ gain. The values $\lambda_1=634.9$ and $\bar\lambda=4.30$ predict $148\times$ for the rank-one $\Sigma^\star_{\mathrm{Euc}}$; the rank-$128$ figure is lower because spreading averages $\lambda_1$ against the smaller $\lambda_2,\ldots,\lambda_{128}$. GPT-2 Small at $E_{128}=0.56$ shows no exploitable structure; subspace-restricted attackers underperform plain $\ell_2$ by 3–4 points (Appendix [T](https://arxiv.org/html/2605.24042#A20)). The mirage has a specific source: noise lives in the rank-$k$ generalized eigenspace and a covariance-aware attacker projects it out, with the rank-one optimum trivially worse (Proposition [9](https://arxiv.org/html/2605.24042#Thmproposition9), Appendix [Q.6](https://arxiv.org/html/2605.24042#A17.SS6)).

## 5 The adaptive game and its theory

A covariance-aware attacker who knows $\Sigma$ should rank candidates by the Bayes-optimal Mahalanobis score $d_\Sigma^2=(\tilde h-h_c)^\top\Sigma^{-1}(\tilde h-h_c)$, which downweights noised directions. Formal properties of the underlying $P_B/P_I$ decomposition (Pythagorean margin identity, margin-angle characterization, fixed-projector concentration) are in Appendix [Q](https://arxiv.org/html/2605.24042#A17). The defender should then minimize $J(\Sigma)=\mathrm{tr}(\Sigma_\delta\,\Sigma^{-1})$ at fixed utility $\tfrac{1}{2}\mathrm{tr}(F\Sigma)\leq\mathcal{K}$. With ridge-stabilized $F_\lambda$ and $S_\rho$ (stabilized versions of $F$ and $\Sigma_\delta$) and $C=F_\lambda^{1/2}S_\rho F_\lambda^{1/2}$, the closed-form solution (Appendix [S](https://arxiv.org/html/2605.24042#A19)) is

$$\Sigma^\star_{\mathrm{Mah}}=\tfrac{2\mathcal{K}}{\mathrm{tr}(C^{1/2})}\,F_\lambda^{-1/2}\,C^{1/2}\,F_\lambda^{-1/2},\quad G_{\mathrm{Mah}}=\tfrac{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}{[\mathrm{tr}(C^{1/2})]^2}\geq 1,\tag{4}$$

with equality iff $\rho_F=\rho_S$ (equivalently $F_\lambda\propto S_\rho$, as the fidelity form ([5](https://arxiv.org/html/2605.24042#S5.E5)) below makes explicit). $G_{\mathrm{Mah}}$ replaces $G_{\mathrm{Euc}}$ as the correct predictor against an adaptive attacker, depending on the full spectrum of $C$ rather than only the top $k_\xi$ eigenvalues. A valid Gaussian-mechanism release adds an isotropic floor, $\Sigma_\eta=\Sigma^\star_{\mathrm{Mah}}+\eta I$ (Appendix [S](https://arxiv.org/html/2605.24042#A19)).
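Equation (4) is straightforward to evaluate numerically. The sketch below is a hedged illustration with hypothetical function names and random positive definite stand-ins for $F_\lambda$ and $S_\rho$; it checks that the mechanism spends exactly the budget, that the ratio of the isotropic objective to the optimal one matches $G_{\mathrm{Mah}}$, and that $G_{\mathrm{Mah}}=1$ when the two matrices are proportional.

```python
import numpy as np

def sqrt_psd(A):
    """Symmetric square root of a PSD matrix."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def mahalanobis_optimum(F, S, budget):
    """Minimizer of tr(S Sigma^{-1}) subject to 0.5 * tr(F Sigma) = budget, and the gain G_Mah."""
    F_h = sqrt_psd(F)
    F_mh = np.linalg.inv(F_h)
    C_h = sqrt_psd(F_h @ S @ F_h)             # C^{1/2} with C = F^{1/2} S F^{1/2}
    t = np.trace(C_h)
    Sigma = (2.0 * budget / t) * F_mh @ C_h @ F_mh
    G = np.trace(F) * np.trace(S) / t**2
    return Sigma, G

# Random positive definite stand-ins (assumptions), with a small ridge.
rng = np.random.default_rng(1)
d, budget = 10, 0.5
A = rng.normal(size=(d, d)); F = A @ A.T + 0.1 * np.eye(d)
B = rng.normal(size=(d, d)); S = B @ B.T + 0.1 * np.eye(d)

Sigma, G = mahalanobis_optimum(F, S, budget)
utility = 0.5 * np.trace(F @ Sigma)
J_opt = np.trace(S @ np.linalg.inv(Sigma))       # attacker signal under the optimum
sigma2_iso = 2.0 * budget / np.trace(F)          # isotropic variance at the same budget
J_iso = np.trace(S) / sigma2_iso
_, G_same = mahalanobis_optimum(F, 3.0 * F, budget)

print(f"utility={utility:.4f}  G_Mah={G:.3f}  J_iso/J_opt={J_iso / J_opt:.3f}  G(F, 3F)={G_same:.6f}")
```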

Theorem [16](https://arxiv.org/html/2605.24042#Thmproposition16) (Appendix [Q.12](https://arxiv.org/html/2605.24042#A17.SS12)) establishes

$$G_{\mathrm{Mah}}\;=\;\frac{1}{\mathcal{F}(\rho_F,\rho_S)^2},\qquad\mathcal{F}(\rho_F,\rho_S)=\mathrm{tr}\,\sqrt{\rho_F^{1/2}\,\rho_S\,\rho_F^{1/2}},\tag{5}$$

where $\rho_F,\rho_S$ are the trace-normalized Fisher and margin covariances. $G_{\mathrm{Mah}}$ is large when $\rho_F$ and $\rho_S$ are far apart in Bures-Wasserstein distance and equal to one when they coincide. The empirical range $G_{\mathrm{Mah}}\in[1.7,9.3]$ across our five tested models corresponds to fidelity values $\mathcal{F}\in[0.33,0.77]$. The projector-separation form (Theorem [17](https://arxiv.org/html/2605.24042#Thmproposition17), Appendix [Q.13](https://arxiv.org/html/2605.24042#A17.SS13)) lower-bounds $G_{\mathrm{Mah}}$ in terms of the three-axis quantities, $G_{\mathrm{Mah}}\geq 1/(\sqrt{E_kq_B}+\sqrt{(1-E_k)(1-q_B)})^2$, with $q_B=\mathrm{tr}(P_B\rho_S)$. Writing $\varepsilon_F=1-E_k$ and $\varepsilon_S=q_B$, the denominator satisfies $\sqrt{E_kq_B}+\sqrt{(1-E_k)(1-q_B)}\leq\sqrt{\varepsilon_S}+\sqrt{\varepsilon_F}$, so the condition $\sqrt{\varepsilon_F}+\sqrt{\varepsilon_S}<1/\sqrt{10}$ guarantees $G_{\mathrm{Mah}}>10$. We realize this design target constructively in Section [8](https://arxiv.org/html/2605.24042#S8).
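The agreement between the trace form (4) and the fidelity form (5) follows by substituting the trace normalizations. The short derivation below is a reconstruction under the assumption that $\rho_F=F_\lambda/\mathrm{tr}(F_\lambda)$ and $\rho_S=S_\rho/\mathrm{tr}(S_\rho)$; its last line checks the quoted fidelity range against the quoted range of $G_{\mathrm{Mah}}$.

```latex
\begin{align*}
\rho_F^{1/2}\,\rho_S\,\rho_F^{1/2}
  &= \frac{F_\lambda^{1/2} S_\rho F_\lambda^{1/2}}{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}
   = \frac{C}{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}, \\
\mathcal{F}(\rho_F,\rho_S)
  &= \frac{\mathrm{tr}(C^{1/2})}{\sqrt{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}}, \\
\frac{1}{\mathcal{F}(\rho_F,\rho_S)^2}
  &= \frac{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}{[\mathrm{tr}(C^{1/2})]^2}
   = G_{\mathrm{Mah}}, \\
G_{\mathrm{Mah}} = 9.3 \;&\Rightarrow\; \mathcal{F} = 1/\sqrt{9.3} \approx 0.33,
\qquad
G_{\mathrm{Mah}} = 1.7 \;\Rightarrow\; \mathcal{F} = 1/\sqrt{1.7} \approx 0.77 .
\end{align*}
```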

Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) (Appendix [Q.10](https://arxiv.org/html/2605.24042#A17.SS10)) gives the structural lower bound on full-state Gaussian release. For any $F_\lambda\succ 0,\Sigma\succ 0$ at utility budget $\tfrac{1}{2}\mathrm{tr}(F_\lambda\Sigma)\leq\mathcal{K}$ and Fisher-ball adversary class $\mathcal{A}_\rho=\{\Delta:\Delta^\top F_\lambda\Delta\leq\rho^2\}$,

$$\sup_{\Delta\in\mathcal{A}_\rho}\;\Delta^\top\Sigma^{-1}\Delta\;\geq\;\frac{\rho^2d}{2\mathcal{K}},\tag{6}$$

attained by the unique full-matrix minimax mechanism $\Sigma^\star_{\mathrm{full}}=(2\mathcal{K}/d)\,F_\lambda^{-1}$. The Bayes-error consequence (Corollary [14](https://arxiv.org/html/2605.24042#Thmproposition14)) is that some Fisher-ball direction (equivalently, some pair in the enlarged Fisher-ball adversary class $\mathcal{A}_\rho$) remains exponentially distinguishable in $d$ at any $\mathcal{K}=O(1)$. The class $\mathcal{A}_\rho$ is a strict superset of the empirical adjacency set, so the bound is a population claim over a strictly larger adversary that does not by itself explain a $0/1{,}536$ measurement, but it does rule out uniform Gaussian safety over $\mathcal{A}_\rho$ at small budget.
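The bound in equation (6) can be probed numerically. The sketch below uses the fact that the supremum over the Fisher ball equals $\rho^2\,\lambda_{\max}(F_\lambda^{-1/2}\Sigma^{-1}F_\lambda^{-1/2})$, which follows from the substitution $\Delta=F_\lambda^{-1/2}w$ with $\|w\|\leq\rho$. The random covariances and helper names are illustrative assumptions.

```python
import numpy as np

def inv_sqrt_pd(A):
    """Inverse symmetric square root of a positive definite matrix."""
    w, V = np.linalg.eigh(A)
    return (V / np.sqrt(w)) @ V.T

def worst_case_signal(F, Sigma, rho):
    """sup of Delta^T Sigma^{-1} Delta over the Fisher ball Delta^T F Delta <= rho^2."""
    F_mh = inv_sqrt_pd(F)
    M = F_mh @ np.linalg.inv(Sigma) @ F_mh
    return rho**2 * np.linalg.eigvalsh(M)[-1]

rng = np.random.default_rng(2)
d, budget, rho = 16, 0.5, 1.0
A = rng.normal(size=(d, d)); F = A @ A.T + 0.1 * np.eye(d)
bound = rho**2 * d / (2.0 * budget)

worst = []
for _ in range(20):
    B = rng.normal(size=(d, d))
    Sigma = B @ B.T + 0.1 * np.eye(d)
    Sigma *= 2.0 * budget / np.trace(F @ Sigma)   # rescale to spend exactly the budget
    worst.append(worst_case_signal(F, Sigma, rho))

Sigma_full = (2.0 * budget / d) * np.linalg.inv(F)   # the minimax mechanism
attained = worst_case_signal(F, Sigma_full, rho)
print(f"bound={bound:.3f}  min over random releases={min(worst):.3f}  inverse-Fisher={attained:.3f}")
```

Every randomly drawn full-rank release sits at or above the bound, and the inverse-Fisher release meets it with equality, which is the sense in which the signal grows linearly in $d$ at fixed budget.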

Restricting to diagonal $\Sigma=\mathrm{diag}(s_1,\ldots,s_d)$ and the diagonal-Fisher adversary class $\mathcal{A}_{D,\rho}=\{\Delta:\Delta^\top D\Delta\leq\rho^2\}$ with $D=\mathrm{diag}(F)$, Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15) (Appendix [Q.11](https://arxiv.org/html/2605.24042#A17.SS11)) gives the unique minimizer

$$\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})\;=\;\tfrac{2\mathcal{K}}{d}\,D^{-1}, \tag{7}$$

attaining $\rho^{2}d/(2\mathcal{K})$, with first-order utility cost exactly $U(\Sigma^{\star}_{\mathrm{diag}})=\tfrac{1}{2}\,\mathrm{tr}(D\,\Sigma^{\star}_{\mathrm{diag}})=\mathcal{K}$ by construction. We use $\mathcal{K}$ as the canonical parameter and treat any $\sigma$ in the empirical sweeps as a reporting knob related to $\mathcal{K}$ through $\sigma^{2}=2\mathcal{K}/d$ in the convention $\Sigma_{\mathrm{diag}}=\sigma^{2}\,\mathrm{diag}(1/F_{ii})$, i.e. the noise-scale parameter in that pseudo-form already absorbs the $2\mathcal{K}/d$ factor. This upgrades the equal-coordinate-cost characterization of $\Sigma_{\mathrm{diag}}$ (Proposition [11](https://arxiv.org/html/2605.24042#Thmproposition11)) to a true minimax statement against the Fisher-ball adversary class. The diagonal release distributes the local utility cost evenly across coordinates, so no single low-Fisher coordinate becomes the worst-case adjacency direction.
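A minimal sketch of the release in equation (7), assuming a precomputed vector of diagonal Fisher entries, is given below. The helper names (`sigma_diag_star`, `release`) and the randomly drawn `fisher_diag` are hypothetical stand-ins; the sketch only illustrates that each coordinate pays the same first-order cost $\mathcal{K}/d$.

```python
import numpy as np

def sigma_diag_star(fisher_diag, budget):
    """Per-coordinate variances of eq. (7): (2K/d) / F_ii."""
    d = fisher_diag.shape[0]
    return (2.0 * budget / d) / fisher_diag

def release(hidden, fisher_diag, budget, rng):
    """Add zero-mean Gaussian noise with covariance Sigma*_diag(K) to a hidden state."""
    std = np.sqrt(sigma_diag_star(fisher_diag, budget))
    return hidden + rng.normal(size=hidden.shape) * std

rng = np.random.default_rng(0)
fisher_diag = rng.uniform(0.01, 10.0, size=768)   # hypothetical diag(F), not measured values
variances = sigma_diag_star(fisher_diag, budget=2.0)

per_coord_cost = 0.5 * fisher_diag * variances    # every entry equals K/d
print(per_coord_cost.sum(), per_coord_cost.min(), per_coord_cost.max())

noisy = release(np.zeros(768), fisher_diag, 2.0, rng)
```

In the $\sigma$ convention of the text, the same variances are $\sigma^{2}/F_{ii}$ with $\sigma^{2}=2\mathcal{K}/d$.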

The pair ([6](https://arxiv.org/html/2605.24042#S5.E6))–([7](https://arxiv.org/html/2605.24042#S5.E7)) gives the theoretical reading of the empirical results in Section [6](https://arxiv.org/html/2605.24042#S6): an empty middle in the Gaussian frontier and a universal Gaussian winner at the diagonal-minimax point. Both are population statements, and the empirical content of Section [6](https://arxiv.org/html/2605.24042#S6) is that they also hold against the realistic adjacency set we measure.

## 6 Main empirical defense

Re-running the Mistral-7B sweep with the adaptive attacker $(\tilde{h}-h_{c})^{\top}(\Sigma+\tau^{\star}I)^{-1}(\tilde{h}-h_{c})$ collapses generalized-eigen to $100\%$ top-1 at every $\sigma$. Noise lives in the rank-$k_{\xi}$ generalized eigenspace and the attacker projects it out. The measured $G_{\mathrm{Mah}}=3.81$ on Mistral against the empirical $G_{\mathrm{Euc}}=1043$ (Table [25](https://arxiv.org/html/2605.24042#A20.T25), distinct from the $\lambda_{1}/\bar{\lambda}\approx 148$ rank-one figure in Section [4](https://arxiv.org/html/2605.24042#S4)) quantifies how misleading the Euclidean predictor was. Equation ([5](https://arxiv.org/html/2605.24042#S5.E5)) is the correct predictor.
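The projection effect can be sketched on synthetic data. In the toy below, noise is confined to a rank-$k$ subspace and a Mahalanobis retrieval score of the form above recovers the target from a candidate bank. The bank, the subspace, the sizes, and the fixed `tau` (standing in for the tuned $\tau^{\star}$) are all hypothetical choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def mahalanobis_retrieve(released, bank, sigma, tau):
    """Index of the bank entry minimizing (h~ - h_c)^T (Sigma + tau I)^{-1} (h~ - h_c)."""
    d = sigma.shape[0]
    precision = np.linalg.inv(sigma + tau * np.eye(d))
    diff = bank - released
    scores = np.einsum("nd,de,ne->n", diff, precision, diff)
    return int(np.argmin(scores))

def euclidean_retrieve(released, bank):
    """Index of the bank entry closest to the released state in Euclidean distance."""
    return int(np.argmin(np.sum((bank - released) ** 2, axis=1)))

rng = np.random.default_rng(0)
d, k, n, scale, target = 16, 4, 200, 50.0, 7      # hypothetical toy sizes
bank = rng.normal(size=(n, d))
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal basis of a rank-k subspace
sigma = scale**2 * basis @ basis.T                # rank-k noise covariance
released = bank[target] + scale * basis @ rng.normal(size=k)

print(euclidean_retrieve(released, bank),
      mahalanobis_retrieve(released, bank, sigma, tau=1e-3))
```

With a small `tau`, the precision matrix weights the noise-free complement of the subspace by roughly $1/\tau$, so any candidate that differs from the target outside the noisy subspace is heavily penalized regardless of the noise scale.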

At $\sigma=5$ across a 32-point sweep (5 models, 4–12 layers, $50{,}000$-candidate bank), the only Gaussian mechanism with worst-over-attackers top-1 $\leq 0.001$ at every cell is the diagonal-minimax release $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)\,D^{-1}$ of equation ([7](https://arxiv.org/html/2605.24042#S5.E7)) (parameterized in code by $\sigma^{2}=2\mathcal{K}/d$). Isotropic and $\Sigma^{\star}_{\mathrm{Mah}}$ work only on mid-Mistral and fail elsewhere because $\mathrm{tr}(F)$ decays with depth while $\Sigma^{\star}_{\mathrm{diag}}$'s first-order cost $\mathcal{K}$ is invariant. Per-cell $\Sigma_{\mathrm{diag}}$ achieves top-1 $\leq 0.1$ across all 32 layers, frequently at low distortion $\sigma\in[0.1,0.5]$ (Appendix [T](https://arxiv.org/html/2605.24042#A20)).

Of $1{,}536$ Gaussian sweep cells, *zero* achieve top-1 agreement $\geq 0.5$ with the clean model and worst-attacker top-1 $\leq 0.5$ simultaneously (Figure [2](https://arxiv.org/html/2605.24042#S6.F2)); lowering to $U\geq 0.2$ still yields zero, and only at $U\geq 0.1$ do three GPT-2 Small early-layer cells qualify (Appendix [T.5](https://arxiv.org/html/2605.24042#A20.SS5)). Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) lower-bounds the Gaussian frontier over the Fisher-ball adversary class; the $0/1{,}536$ sweep is the matching empirical observation on our tested adjacency sets, mechanisms, models, layers, and attackers.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x2.png)

Figure 2: The empty middle. Every Gaussian release cell across the 5-model 32-layer sweep, plotted in (utility, privacy) space. The shaded box is an illustrative moderate-both operating region; full threshold-sensitivity is reported in Appendix [T.5](https://arxiv.org/html/2605.24042#A20.SS5). Zero Gaussian cells land inside it. The diagonal-Fisher mechanism $\Sigma_{\mathrm{diag}}$ rides the privacy edge but cannot enter. A split-memory transformer (Section [8](https://arxiv.org/html/2605.24042#S8)) trained from scratch (orange stars at three probe layers $\ell\in\{4,6,8\}$ of the same model) sits inside the box, reachable from outside the Gaussian release class.

Under matched-$\varepsilon$ Rényi differential privacy no mechanism is universal: isotropic wins at strict budgets $\varepsilon\in\{1,3\}$, $\Sigma_{\mathrm{diag}}$ at loose $\varepsilon=16$ (Appendix [U.2](https://arxiv.org/html/2605.24042#A21.SS2); structural reason in Proposition [12](https://arxiv.org/html/2605.24042#Thmproposition12)). Sweeping $\Sigma_{\alpha}=c_{\alpha}\,\mathrm{diag}(F_{ii})^{-\alpha}$ at matched $\mathcal{K}$ confirms the structural prediction. High-$\alpha$ ($\alpha\geq 0.5$) wins on $27/32$ model-layer points, with family-specific empirical optima (Appendix [Q.11.1](https://arxiv.org/html/2605.24042#A17.SS11.SSS1)).
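The $\Sigma_{\alpha}$ family can be sketched as follows, assuming that "matched $\mathcal{K}$" means equal first-order cost $\tfrac{1}{2}\sum_{i}F_{ii}s_{i}=\mathcal{K}$, which fixes $c_{\alpha}=2\mathcal{K}/\sum_{i}F_{ii}^{1-\alpha}$. That normalization, the helper `sigma_alpha`, and the random `fisher_diag` are assumptions for illustration. Under them, $\alpha=0$ is the isotropic release and $\alpha=1$ is the diagonal inverse-Fisher release of equation (7).

```python
import numpy as np

def sigma_alpha(fisher_diag, alpha, budget):
    """Variances c_alpha * F_ii^{-alpha}, with c_alpha chosen so 0.5 * sum(F_ii * var) = K."""
    shape = fisher_diag ** (-alpha)
    c_alpha = 2.0 * budget / np.sum(fisher_diag * shape)
    return c_alpha * shape

rng = np.random.default_rng(0)
fisher_diag = rng.uniform(0.01, 10.0, size=768)   # hypothetical diag(F)

for alpha in (0.0, 0.5, 1.0):
    var = sigma_alpha(fisher_diag, alpha, budget=1.0)
    cost = 0.5 * np.sum(fisher_diag * var)
    # Worst-case signal over the diagonal Fisher ball with rho = 1: max_i 1 / (F_ii * s_i).
    worst = np.max(1.0 / (fisher_diag * var))
    print(alpha, cost, worst)
```

At $\alpha=1$ the worst-case signal equals $d/(2\mathcal{K})$, the minimax value; smaller $\alpha$ spends the same budget but leaves a larger worst-case direction in this toy.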

$\Sigma_{\mathrm{diag}}$ is the only Gaussian mechanism that survives adaptive retrieval at high distortion, and the moderate-both region of the Gaussian frontier is empty (variational-bottleneck negative result in Appendix [X](https://arxiv.org/html/2605.24042#A24)).

## 7 Stronger attacks

The retrieval attackers of Section [6](https://arxiv.org/html/2605.24042#S6) upper-bound inversion via bank search; a learned inverter directly predicts the prefix from the released activation; and a sequence inverter sees the full per-token trajectory. The latter two are stronger evaluators of $\Sigma_{\mathrm{diag}}$.

A 57M-parameter single-vector decoder trained on $500{,}000$ clean GPT-2 hidden-state/prefix pairs (Appendix [V](https://arxiv.org/html/2605.24042#A22)) reaches token accuracy $0.078$ on the clean test state, with no exact matches. Generalized-eigen at $\sigma=5$ suppresses it $8.7\times$, isotropic $7.1\times$, and $\Sigma^{\star}_{\mathrm{Mah}}$ only $1.4\times$. This inverts the Mahalanobis-retrieval ranking, because the Fisher-shaped $\Sigma^{\star}_{\mathrm{Mah}}$ places noise on directions the inverter does not use (full ranking and adversarial-targeting evaluation in Appendices [V](https://arxiv.org/html/2605.24042#A22), [W](https://arxiv.org/html/2605.24042#A23)).

A stronger attacker sees the per-token sequence of hidden vectors at layer $\ell=6$. The architecture is the 57M-parameter SeqInv with the activation encoder replaced by a per-token linear projection of the released hidden state, so the attacker observes one $d$-dim vector per prefix token. Training uses $50{,}000$ steps on WikiText-103 prefixes of length 32, mechanism-aware (trained against the exact defense it later evaluates against).

Table 1: Sequence-inverter attack on GPT-2 Small layer 6, prefix length 32, 500 test prefixes. The clean baseline recovers 94% of prefixes exactly, an order of magnitude stronger than the single-vector inverter (0% exact match, EM). $\Sigma_{\mathrm{diag}}$ at $\sigma=5$ continues to suppress the strictly stronger attacker to 0% EM and 1.1% token accuracy (TA). Uniform random token accuracy for a 50,257-token vocabulary is $1/50{,}257\approx 2\times 10^{-5}$, so this value should be compared to unconditional language-prior baselines rather than described as uniform chance. Isotropic at the same $\sigma$ leaves 22.6% of 32-token sequences exactly recoverable.

The $0\%$ to $94\%$ EM jump confirms the single-vector null-EM was a conservative upper bound. $\Sigma_{\mathrm{diag}}$ at $\sigma=5$ still holds this attacker to zero EM and $1.1\%$ TA, far below the $22.6\%$ EM that isotropic leaves recoverable. We did not run an activation-permuted or zero-activation baseline, so we cannot certify the $1.1\%$ as the unconditional decoder prior; the operational comparison is the gap to isotropic.
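For reference, the two metrics can be sketched as below. The definitions used here (EM as the fraction of prefixes recovered token-for-token, TA as the fraction of matching positions) are the standard ones and are assumed rather than quoted from the paper's evaluation code; the token ids are made up.

```python
def exact_match(pred, gold):
    """Fraction of sequences recovered token-for-token."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def token_accuracy(pred, gold):
    """Fraction of positions where the predicted token equals the gold token."""
    hits = sum(a == b for p, g in zip(pred, gold) for a, b in zip(p, g))
    total = sum(len(g) for g in gold)
    return hits / total

gold = [[5, 9, 2, 7], [1, 1, 4, 8]]   # hypothetical token ids
pred = [[5, 9, 2, 7], [1, 3, 4, 0]]

print(exact_match(pred, gold), token_accuracy(pred, gold))   # 0.5 0.75
print(1 / 50257)   # uniform-chance token accuracy for a 50,257-token vocabulary
```

A decoder can score well above $1/50{,}257$ on TA by emitting frequent tokens alone, which is why the text compares against language-prior baselines and the isotropic gap rather than uniform chance.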

## 8 The Split-Memory Transformer

The projector-separation lower bound (Theorem [17](https://arxiv.org/html/2605.24042#Thmproposition17), Appendix [Q.13](https://arxiv.org/html/2605.24042#A17.SS13)) says that $G_{\mathrm{Mah}}>10$ is guaranteed when $E_{k}$ is close to one and $q_{B}$ is small. The five tested architectures top out at $G_{\mathrm{Mah}}=9.28$ (Qwen3-14B layer 10), just below threshold. We show by construction (Figure [3](https://arxiv.org/html/2605.24042#S8.F3)) that a split-architecture language model trained from scratch can sit cleanly inside the high-$G_{\mathrm{Mah}}$ regime at multiple intermediate layers.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x3.png)

Figure 3: SMT architecture. Each layer's hidden state is factored into a narrow predictive trunk $u_{l}\in\mathbb{R}^{r}$ and a wide memory branch $v_{l}\in\mathbb{R}^{m}$ with $r+m=d$. Cross-coupling is asymmetric: $u_{l}$ enters $v_{l+1}$ unrestricted via $W_{uv}$, while $v_{l}$ enters $u_{l+1}$ only through a learnable scalar-gated $\gamma W_{vu}$ initialized at 0.01. The output projection reads only from $u_{L}$, and the gated cross-coupling biases Fisher mass toward the $r$-dimensional trunk after training.

The layer update is

$$u_{l+1}=u_{l}+\mathrm{Block}^{u}\!\bigl(\mathrm{LN}(u_{l}+\gamma W_{vu}^{(l)}v_{l})\bigr),\quad v_{l+1}=v_{l}+\mathrm{Block}^{v}\!\bigl(\mathrm{LN}(v_{l}+W_{uv}^{(l)}u_{l})\bigr),$$

with logits $=W_{\mathrm{out}}\,\mathrm{LN}(u_{L})$ reading only from the trunk. We optionally add a Hutchinson Jacobian penalty $\lambda_{J}\,\widehat{\|J_{\mathrm{logit},v_{l}}\|_{F}^{2}}$ at probe layers.
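The layer update can be sketched in a few lines of NumPy. This is a toy under stated assumptions: `block` is a small tanh MLP standing in for the transformer blocks $\mathrm{Block}^{u}$ and $\mathrm{Block}^{v}$, states are row vectors (so $W_{vu}v_{l}$ is written `v @ W_vu`), the sizes are made up, and the Jacobian penalty is omitted. None of the names come from the paper's code.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, w_in, w_out):
    """Placeholder for Block^u / Block^v (a real model would use attention plus an MLP)."""
    return np.tanh(x @ w_in) @ w_out

def smt_layer(u, v, p, gamma):
    """One split-memory update with row-vector states u: (T, r) and v: (T, m)."""
    u_next = u + block(layer_norm(u + gamma * (v @ p["W_vu"])), p["Wu_in"], p["Wu_out"])
    v_next = v + block(layer_norm(v + u @ p["W_uv"]), p["Wv_in"], p["Wv_out"])
    return u_next, v_next

def init_layer(rng, r, m, hidden=32, s=0.1):
    return {
        "W_vu": s * rng.normal(size=(m, r)), "W_uv": s * rng.normal(size=(r, m)),
        "Wu_in": s * rng.normal(size=(r, hidden)), "Wu_out": s * rng.normal(size=(hidden, r)),
        "Wv_in": s * rng.normal(size=(m, hidden)), "Wv_out": s * rng.normal(size=(hidden, m)),
    }

def forward(u, v, layers, w_out, gamma=0.01):
    for p in layers:
        u, v = smt_layer(u, v, p, gamma)
    return layer_norm(u) @ w_out   # logits read only from the trunk u_L

rng = np.random.default_rng(0)
r, m, vocab, tokens = 8, 24, 50, 5   # hypothetical toy sizes
layers = [init_layer(rng, r, m) for _ in range(3)]
w_out = rng.normal(size=(r, vocab))
u0, v0 = rng.normal(size=(tokens, r)), rng.normal(size=(tokens, m))
logits = forward(u0, v0, layers, w_out)
```

Setting `gamma=0` in this sketch makes the logits exactly independent of the memory branch, which is the limiting case of the asymmetric coupling; the small initial gate $\gamma=0.01$ starts training close to that limit.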

We train four 12-layer architectures at matched $\approx 90$M parameters and probe layers $\ell\in\{4,6,8\}$ with $k=128$ on WikiText-103 for $20{,}000$ steps in bfloat16: a baseline GPT ($d=768$); SMT main ($r=128$, $m=640$, $\lambda_{J}=10^{-3}$); SMT no-Jac ($\lambda_{J}=0$); SMT $r{=}64$ ($r=64$, $m=704$, $\lambda_{J}=10^{-3}$). Table [2](https://arxiv.org/html/2605.24042#S8.T2) reports the resulting $G_{\mathrm{Mah}}$ and Fisher quantities.

Table 2: $G_{\mathrm{Mah}}\gg 10$ realized on a 12-layer SMT at matched parameter budget on WikiText-103. Bold: $G_{\mathrm{Mah}}\geq 10$. Baseline at $E_{k}\approx k/d$; SMT $r=64$ reaches $G_{\mathrm{Mah}}\in[20.7,32.7]$. $L_{\mathrm{lm}}$ is per-architecture; SMT trades $\Delta L_{\mathrm{lm}}\in[+1.57,+1.85]$ nats for the geometric gain.

We make three main observations from these models. The baseline GPT shows $E_{k}\approx 0.22$ at $k=128$, close to the random-projection floor $k/d=0.167$, with $G_{\mathrm{Mah}}\in[1.1,1.3]$ across layers, so the Fisher mass is essentially uniform. Both SMT variants at $r=128$ achieve $E_{k}\geq 0.99$. By routing logits only through a 128-dim trunk, the architectural prior biases the Fisher to concentrate in 128 coordinates while keeping total hidden width fixed. SMT $r=64$ concentrates further ($E_{k}=0.998$) and produces $G_{\mathrm{Mah}}$ in the range 20–33, comfortably above the 10 threshold predicted by Theorem [17](https://arxiv.org/html/2605.24042#Thmproposition17) as the regime where the Mahalanobis-defense gain over isotropic is large.
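To make the $k/d$ floor concrete, the sketch below evaluates a top-$k$ energy fraction on synthetic spectra. It assumes $E_{k}$ is the share of $\mathrm{tr}(F)$ carried by the $k$ largest Fisher eigenvalues; that definition is given earlier in the paper and is only paraphrased here, and the spectra are made-up examples rather than measured ones.

```python
import numpy as np

def top_k_energy(fisher, k):
    """Fraction of tr(F) carried by the k largest eigenvalues (assumed definition of E_k)."""
    eigs = np.linalg.eigvalsh(fisher)   # returned in ascending order
    return eigs[-k:].sum() / eigs.sum()

d, k = 768, 128
print(k / d)                            # uniform-spectrum floor, about 0.167
print(top_k_energy(np.eye(d), k))       # a flat Fisher sits exactly at the floor

# Hypothetical concentrated spectrum: 128 heavy trunk directions, 640 light ones.
spectrum = np.concatenate([np.full(128, 1.0), np.full(640, 1e-3)])
print(top_k_energy(np.diag(spectrum), k))
```

Under this reading, the baseline's $E_{k}\approx 0.22$ is only slightly above the flat-spectrum value, while the SMT values near one correspond to almost all Fisher mass sitting in the trunk directions.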

SMT realizes the regime the five base models do not reach: any architecture that reads logits from a fixed low-dimensional projection biases Fisher mass toward that projection, and Theorem [17](https://arxiv.org/html/2605.24042#Thmproposition17) then implies large $G_{\mathrm{Mah}}$. A four-scale fixed-token sweep from 30M to 1B (Appendix [Z](https://arxiv.org/html/2605.24042#A26)) maintains a $6$–$24\times$ $G_{\mathrm{Mah}}$ advantage over flat-at-$1.3$ baselines but pays a widening language-modeling (LM) loss penalty ($+1.18$ to $+2.15$ nats) and shows large 90M seed variance ($G_{\mathrm{Mah}}\in\{9.94,16.49,32.44\}$, CoV $0.48$), so we read it as a fixed-compute observation rather than the long-run answer (Appendix [Y](https://arxiv.org/html/2605.24042#A25), with the regularizer-based attempt on a frozen GPT-2 in Appendix [AA](https://arxiv.org/html/2605.24042#A27)).

## 9 Discussion

The empty middle gives a class-level Pareto baseline for hidden-state release. Prior caching-privacy work (Liu et al., [2024](https://arxiv.org/html/2605.24042#bib.bib11); Dettmers et al., [2022](https://arxiv.org/html/2605.24042#bib.bib29)) optimized individual mechanisms without a within-class reference, so it could not separate a bad covariance from a limited class. Non-Gaussian release classes (quantization, dropout, variational bottlenecks) inherit none of these obstructions and are the natural next target for the same sweep. The adaptive collapse of generalized-eigen is not a single-mechanism defect: any Gaussian release whose covariance aligns with a low-rank model-derived basis lives in a subspace the attacker can project out, so aligning defender geometry to model geometry is brittle on this task. The split-half Fisher validation (Appendix [A](https://arxiv.org/html/2605.24042#A1)) sharpens the measurement constraint: fixed top-$k$ Fisher projectors need $n_{\mathrm{cal}}\gg d$, not the $n_{\mathrm{cal}}=200$ regime that suffices for calibration-sample top-$k$ energy but not subspace identification. SMT reads as a recipe rather than a single architecture: any model whose logits route through a fixed low-dimensional trunk inherits the Fisher concentration the moderate-both region requires, and its LM-loss penalty is the cost of the cleanest instance rather than a lower bound on the recipe.

For practitioners shipping cached activations, the diagonal inverse-Fisher release is the closest available model-agnostic Gaussian default. Hidden-state inversion (Nikolaou et al., [2026](https://arxiv.org/html/2605.24042#bib.bib1); Morris et al., [2023](https://arxiv.org/html/2605.24042#bib.bib9); Song and Raghunathan, [2020](https://arxiv.org/html/2605.24042#bib.bib32); Li et al., [2025](https://arxiv.org/html/2605.24042#bib.bib37); Pan et al., [2020](https://arxiv.org/html/2605.24042#bib.bib33); Nazir et al., [2025](https://arxiv.org/html/2605.24042#bib.bib13)) and KV-cache compression (Liu et al., [2024](https://arxiv.org/html/2605.24042#bib.bib11); Tang et al., [2025](https://arxiv.org/html/2605.24042#bib.bib10); Kitaev et al., [2020](https://arxiv.org/html/2605.24042#bib.bib38); Dettmers et al., [2022](https://arxiv.org/html/2605.24042#bib.bib29)) now have class-level spectral diagnostics in $E_{k}$, the cumulative spectrum, and $G_{\mathrm{Mah}}$. The Fisher literature (Amari, [1998](https://arxiv.org/html/2605.24042#bib.bib7); Martens, [2020](https://arxiv.org/html/2605.24042#bib.bib8); Kunstner et al., [2019](https://arxiv.org/html/2605.24042#bib.bib30)) treats $P_{B}$ as a preconditioner, and we use it as a release-design object.
Interpretability work (Elhage et al., [2022](https://arxiv.org/html/2605.24042#bib.bib6); Templeton et al., [2024](https://arxiv.org/html/2605.24042#bib.bib4); Rajamanoharan et al., [2024](https://arxiv.org/html/2605.24042#bib.bib5); Dunefsky et al., [2024](https://arxiv.org/html/2605.24042#bib.bib2); Ameisen et al., [2025](https://arxiv.org/html/2605.24042#bib.bib3); Park et al., [2024](https://arxiv.org/html/2605.24042#bib.bib35); Nanda et al., [2023](https://arxiv.org/html/2605.24042#bib.bib34); Bhaskar et al., [2024](https://arxiv.org/html/2605.24042#bib.bib15); LLM-Interp, [2025](https://arxiv.org/html/2605.24042#bib.bib14)) picks bases without checking loss-sensitivity alignment, and the KL asymmetry flips between GPT-2 (Radford et al., [2019](https://arxiv.org/html/2605.24042#bib.bib12); Ba et al., [2016](https://arxiv.org/html/2605.24042#bib.bib16)) and modern architectures (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.24042#bib.bib23); Su et al., [2024](https://arxiv.org/html/2605.24042#bib.bib24); Shazeer, [2020](https://arxiv.org/html/2605.24042#bib.bib25); Jiang et al., [2023](https://arxiv.org/html/2605.24042#bib.bib18); Qwen Team, [2025](https://arxiv.org/html/2605.24042#bib.bib19); DeepSeek-AI, [2025](https://arxiv.org/html/2605.24042#bib.bib20); Touvron et al., [2023](https://arxiv.org/html/2605.24042#bib.bib31); Muennighoff et al., [2025](https://arxiv.org/html/2605.24042#bib.bib21)), so basis-dependent claims need model scoping. Differential privacy (Abadi et al., [2016](https://arxiv.org/html/2605.24042#bib.bib26); Dwork and Roth, [2014](https://arxiv.org/html/2605.24042#bib.bib17)) and membership inference (Shokri et al., [2017](https://arxiv.org/html/2605.24042#bib.bib27); Carlini et al., [2021](https://arxiv.org/html/2605.24042#bib.bib28)) target training-time influence, while ours is inference-time release.

##### Limitations and future questions.

The variational-bottleneck candidate (Appendix [R](https://arxiv.org/html/2605.24042#A18)) produces 0/44 moderate-both, leaving non-Gaussian classes open. The fixed-token SMT sweep (Appendix [Z](https://arxiv.org/html/2605.24042#A26)) is a single-architecture-family probe; data-optimal scaling, margin-depletion regularizers, stronger attackers, and finer sweeps remain open.

## References

- M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
- S. Amari (1998) Natural gradient works efficiently in learning. Neural Computation 10(2), pp. 251–276.
- E. Ameisen, J. Lindsey, A. Pearce, W. Gurnee, N. L. Turner, B. Chen, C. Citro, et al. (2025) Circuit tracing: revealing computational graphs in language models. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)
- J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
- A. Bhaskar, D. Friedman, and N. Saphra (2024) Causation does not imply correlation. In NeurIPS Workshop on Attributing Model Behavior at Scale. External Links: [Link](https://openreview.net/forum?id=JYyqhr8zJ8)
- N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al. (2021) Extracting training data from large language models. In 30th USENIX Security Symposium, pp. 2633–2650.
- DeepSeek-AI (2025) DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer (2022) LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35, pp. 30318–30332.
- J. Dunefsky, P. Chlenski, and N. Nanda (2024) Transcoders find interpretable LLM feature circuits. In Advances in Neural Information Processing Systems, Vol. 38.
- C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Vol. 9.
- N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, et al. (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652.
- A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, et al. (2023) Mistral 7b. arXiv preprint arXiv:2310.06825.
- N. Kitaev, Ł. Kaiser, and A. Levskaya (2020) Reformer: the efficient transformer. In International Conference on Learning Representations.
- F. Kunstner, L. Balles, and P. Hennig (2019) Limitations of the empirical Fisher approximation for natural gradient descent. Advances in Neural Information Processing Systems 32.
- X. Li et al. (2025) Hidden state leakage: recovering prompts from layer-wise activations. arXiv preprint arXiv:2404.10001.
- Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V. Braverman, B. Chen, and X. Hu (2024) KIVI: a tuning-free asymmetric 2bit quantization for KV cache. In International Conference on Machine Learning.
- LLM-Interp (2025) CLT-Forge: a scalable library for cross-layer transcoders. GitHub Repository. External Links: [Link](https://github.com/LLM-Interp/CLT-Forge)
- J. Martens (2020) New insights and perspectives on the natural gradient method. Journal of Machine Learning Research 21(146), pp. 1–76.
- J. X. Morris, V. Kuleshov, V. Shmatikov, and A. Sprenkle (2023) Text embeddings reveal (almost) as much as text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.
- N. Muennighoff, L. Soldaini, D. Groeneveld, K. Lo, J. Morrison, S. Min, W. Shi, P. Walsh, O. Tafjord, N. Lambert, et al. (2025) OLMoE: open mixture-of-experts language models. In International Conference on Learning Representations.
- N. Nanda, A. Lee, and M. Wattenberg (2023) Emergent linear representations in world models of self-supervised sequence models. In BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP.
- E. Nazir et al. (2025) Extracting prompts by inverting LLM outputs. arXiv preprint arXiv:2405.15012.
- G. Nikolaou, T. Mencattini, D. Crisostomi, A. Santilli, Y. Panagakis, and E. Rodolà (2026) Language models are injective and hence invertible. In International Conference on Learning Representations.
- X. Pan, M. Zhang, S. Ji, and M. Yang (2020) Privacy risks of general-purpose language models. In 2020 IEEE Symposium on Security and Privacy, pp. 1314–1331.
- K. Park, Y. J. Choe, and V. Veitch (2024) The linear representation hypothesis and the geometry of large language models. In International Conference on Machine Learning.
- Qwen Team (2025) Qwen3 technical report. arXiv preprint. External Links: [Link](https://qwenlm.github.io/blog/qwen3/)
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog.
- S. Rajamanoharan, T. Lieberum, N. Sonnerat, A. Conmy, V. Varma, J. Kramar, and N. Nanda (2024) Jumping ahead: improving reconstruction fidelity with JumpReLU sparse autoencoders. In BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP.
- N. Shazeer (2020) GLU variants improve transformer. arXiv preprint arXiv:2002.05202.
- R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy, pp. 3–18.
- C. Song and A. Raghunathan (2020) Information leakage in embedding models. In Proceedings of the 2020 ACM Conference on Computer and Communications Security, pp. 377–390.
- J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu (2024) RoFormer: enhanced transformer with rotary position embedding. Neurocomputing 568, pp. 127063.
- Z. Tang et al. (2025) Unveiling and mitigating privacy risks of KV-cache in LLM. arXiv preprint arXiv:2508.09442.
- A. Templeton, T. Conerly, J. Marcus, J. Lindsey, T. Bricken, B. Chen, A. Pearce, C. Citro, E. Ameisen, A. Jones, et al. (2024) Scaling monosemanticity: extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread. External Links: [Link](https://transformer-circuits.pub/2024/scaling-monosemanticity/)
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023) LLaMA: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- B. Zhang and R. Sennrich (2019) Root mean square layer normalization. Advances in Neural Information Processing Systems 32.

## Appendix A Measurement protocols and the GPT-2 Medium result

Three measurement protocols are consistent with KL under projection, and they can give different answers on the same model. This appendix makes the distinction explicit and reports why we use the full-forward protocol everywhere.

### A.1 Three protocols

Let $h$ be the hidden state at layer $\ell$ and $P$ a rank-$k$ projector.

*Protocol A (direct readout).* Compute $p_{A}=\mathrm{softmax}\bigl(W_{u}\cdot\mathrm{norm}(Ph)\bigr)$, skipping layers $\ell+1,\ldots,L$ entirely. This asks “what distribution does the layer-$\ell$ state predict if we treat it as terminal and read it out linearly?”

*Protocol B (full forward via injection hook).* Replace $h$ with $Ph$ via a forward-pass hook at layer $\ell$ and run the remaining $L-\ell$ layers normally. This asks “what distribution does the full model produce when we perturb its layer-$\ell$ activation?”

*Protocol C (component substitution).* Replace one component of $h$ with an alternative while keeping the rest, then run Protocol B. This is used in Appendix [L](https://arxiv.org/html/2605.24042#A12).
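Protocols A and B can be contrasted on a toy model as sketched below. Everything here is an assumption made for illustration: the two-layer residual "model", the gain-free RMS norm standing in for $\mathrm{norm}$, the choice of $\mathrm{KL}(\text{clean}\,\|\,\text{perturbed})$ as the divergence direction, and $P_{I}=I-P_{B}$ as the complementary projector. The printed values carry no empirical meaning.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def protocol_a(h, proj, w_u):
    """Direct readout: treat P h as terminal and skip the remaining layers."""
    return softmax(w_u @ rms_norm(proj @ h))

def protocol_b(h, proj, remaining_layers, w_u):
    """Full forward: inject P h at layer l, then run layers l+1..L normally."""
    x = proj @ h
    for layer in remaining_layers:
        x = layer(x)
    return softmax(w_u @ rms_norm(x))

rng = np.random.default_rng(0)
d, k, vocab = 16, 4, 30                      # hypothetical toy sizes
h = rng.normal(size=d)
w_u = rng.normal(size=(vocab, d))
weights = [0.3 * rng.normal(size=(d, d)) for _ in range(2)]
layers = [lambda x, w=w: x + np.tanh(w @ x) for w in weights]   # toy residual layers
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))
p_b = basis @ basis.T                        # rank-k projector
p_i = np.eye(d) - p_b                        # complementary rank-(d-k) projector

clean_a = protocol_a(h, np.eye(d), w_u)
clean_b = protocol_b(h, np.eye(d), layers, w_u)
for name, proj in (("P_I", p_i), ("P_B", p_b)):
    print(name,
          kl(clean_a, protocol_a(h, proj, w_u)),
          kl(clean_b, protocol_b(h, proj, layers, w_u)))
```

With an empty list of remaining layers the two protocols coincide, which makes explicit that any disagreement between them comes from how the downstream layers respond to the injected state.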

### A.2 The two protocols disagree on direction

Protocols A and B can give opposite signs for $\mathrm{KL}(P_{I}h)-\mathrm{KL}(P_{B}h)$. Under Protocol A, projecting onto $P_{I}$ preserves the bulk of $h$'s direction because $P_{I}$ has dimension $d-k$; the direction after the final layer norm is close to the clean one and the logits are close to the clean ones, so $\mathrm{KL}(P_{I}h)$ is small. Projecting onto $P_{B}$ yields a narrow direction that the final layer norm amplifies, giving low-entropy logits far from the clean distribution and a large $\mathrm{KL}(P_{B}h)$. Protocol A therefore predicts $\mathrm{KL}(P_{I}h)<\mathrm{KL}(P_{B}h)$ whenever $k\ll d$, regardless of model architecture or training regime.

Protocol B has no architecture-independent sign. It measures how the remaining transformer layers respond to an off-manifold injected state, and the response is an empirical property of the trained model rather than a consequence of local Fisher geometry. Empirically, modern models (Mistral-7B, Qwen3-14B, DeepSeek-R1-14B, TinyLlama, Phi-2) recover from $h\mapsto P_{I}h$ but not from $h\mapsto P_{B}h$, while GPT-2-family models show a weaker or reversed pattern depending on layer and $k$. Protocol B is therefore the deployment-relevant protocol, but its sign on a given model is measured, not predicted from the Fisher spectrum alone.

The two protocols therefore predict different signs of the asymmetry on the same underlying model\. Every measurement reported in the main text uses Protocol B\.

### A.3 The GPT-2 Medium result

One earlier measurement on GPT-2 Medium at layer 12 with $k=64$ and 8-bit quantization reports $\mathrm{KL}(P_{B}h)>\mathrm{KL}(P_{I}h)$ (the Mistral-like direction), which is the opposite sign from our Protocol B measurement on GPT-2 Small, Large, and XL. Three differences plausibly explain the gap: $k=64$ rather than $128$, 8-bit quantization of the projected state, and possibly a direct-readout protocol (A rather than B). The combination of $k=64$ and 8-bit quantization narrows the Fisher subspace enough that Protocol A behavior dominates even under partial forward passes. We do not rerun that experiment here; the point is that the direction of the KL asymmetry reported in the literature depends on the measurement protocol, and a protocol mismatch is sufficient to flip the sign.

The practical takeaway is that Protocol B is the right one for any claim about hidden-state leakage or defense, because a real attacker against a deployed system sees the model’s full output distribution rather than a direct readout of a mid-layer state. The Fisher-concentration story in Section [E](https://arxiv.org/html/2605.24042#A5) therefore only applies to claims measured under Protocol B; claims measured under Protocol A would flip for every model, modern or legacy.

### A.4 Surface-perturbation categories

A separate protocol-sensitivity test is the extent to which the projection asymmetry depends on the category of input variation (Figure [4](https://arxiv.org/html/2605.24042#A1.F4)). Across 12 surface perturbation pairs (capitalization, whitespace, semantic), the per-category scaffold fraction is approximately category-invariant at $\approx 83\%$. The mean KL is dominated by a small number of pairs that change the top-1 prediction; the median KL across the remaining pairs is $0.08$ nats.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x4.png)

Figure 4: Perturbation analysis by category. (A) The scaffold fraction of feature flips (green) is approximately constant across perturbation categories, with the backbone fraction (red) accordingly small. (B) Mean KL divergence is concentrated in the small number of pairs that also change the top-1 prediction. The category invariance is consistent with the Kruskal-Wallis test reported in Appendix [Q](https://arxiv.org/html/2605.24042#A17) ($H=2.33$, $p=0.31$).

### A.5 Empirical Fisher: sample-size validation

For the three 7–14B models we estimate $\Sigma_{g}$ with $n_{\mathrm{cal}}=200$ prefixes. Because the empirical covariance has rank at most $200$, a rank-$128$ projector can capture a large fraction of calibration-sample energy without identifying a stable population subspace. In an equal-spectrum rank-$200$ baseline the top-$128$ fraction would be $128/200=0.64$, while heavy-tailed sample spectra can give much larger values. We therefore treat $E_{128}$ on these models as a sample-covariance concentration measure, not as a stable-subspace estimate.

To quantify the gap, we ran a split-half validation on the three 7–14B models at the same mid-network layers used in the main body. We draw $400$ disjoint WikiText-103 prefixes per model and split them into a $200$-prefix train half and a $200$-prefix held-out half. We then form $\Sigma_{g}^{\mathrm{tr}}$ and $\Sigma_{g}^{\mathrm{he}}$ on each half independently and report (a) the on-train concentration $E_{128}^{\mathrm{tr}}=\mathrm{tr}(P_{128}^{\mathrm{tr}}\Sigma_{g}^{\mathrm{tr}})/\mathrm{tr}(\Sigma_{g}^{\mathrm{tr}})$, (b) the held-out via-train-projector energy $E_{128}^{\mathrm{he,tr}}=\mathrm{tr}(P_{128}^{\mathrm{tr}}\Sigma_{g}^{\mathrm{he}})/\mathrm{tr}(\Sigma_{g}^{\mathrm{he}})$, and (c) the held-out self-concentration $E_{128}^{\mathrm{he,he}}$ using its own top-$128$ projector.

Table 3: Split-half validation of the empirical gradient-covariance top-$128$ subspace on the three 7–14B models. Each half independently concentrates almost all sample energy in some 128-dimensional subspace, but the train-half projector captures only $5$–$7\%$ of held-out energy.

Each half on its own concentrates $\sim 99\%$ of energy in $128$ coordinates, but the specific $128$-coordinate subspace identified from the train half captures only $5$–$7\%$ of the held-out energy. The headline $>99\%$ figure is therefore a calibration-sample concentration statement at $n_{\mathrm{cal}}=200$, not evidence that a fixed top-$128$ projector is stable. This does not affect the diagonal minimax theorem or the qualitative finding that $\Sigma_{\mathrm{diag}}$ is the robust Gaussian release in the adaptive sweep. It does mean that projector-dependent quantities on 7–14B models, including $P_{B}/P_{I}$ diagnostics and $r_{95}$ estimates, should be read as calibration-sample diagnostics unless $n_{\mathrm{cal}}\gg d$.
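The three split-half quantities can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic isotropic "gradients" (an assumption made here so the example is self-contained), with function names chosen for this sketch; it is not the paper's pipeline. It reproduces the qualitative effect described above: when $n<d$, each half concentrates most of its own energy in a top-$k$ subspace, yet the train-half projector captures only about $k/d$ of the held-out energy.

```python
import numpy as np

def grad_cov(G):
    """Empirical covariance (1/n) * sum_i g_i g_i^T from the rows of G (n x d)."""
    return G.T @ G / G.shape[0]

def top_k_projector(S, k):
    # eigh returns eigenvalues in ascending order; keep the last k eigenvectors.
    _, V = np.linalg.eigh(S)
    Vk = V[:, -k:]
    return Vk @ Vk.T

def energy(P, S):
    """Fraction of trace(S) captured by projector P."""
    return float(np.trace(P @ S) / np.trace(S))

def split_half(G_train, G_held, k):
    S_tr, S_he = grad_cov(G_train), grad_cov(G_held)
    P_tr, P_he = top_k_projector(S_tr, k), top_k_projector(S_he, k)
    return {
        "tr": energy(P_tr, S_tr),      # on-train concentration
        "he_tr": energy(P_tr, S_he),   # held-out energy through the train projector
        "he_he": energy(P_he, S_he),   # held-out self-concentration
    }

# Synthetic stand-in: n = 20 samples per half in d = 200 dimensions, k = 16.
rng = np.random.default_rng(0)
d, n, k = 200, 20, 16
result = split_half(rng.normal(size=(n, d)), rng.normal(size=(n, d)), k)
```

Because each half has rank at most $n$, the self-concentration values are at least $k/n$ regardless of whether any stable subspace exists, which is the equal-spectrum baseline argument in miniature.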

## Appendix B Experimental design

This appendix records the methodological choices that the main body’s results depend on: why we use the gradient covariance as a Fisher proxy, how the layer and calibration set were chosen, how the subspace grid is defined, how quantization is applied, and how the one-step margin is computed. These choices were fixed before any of the empirical results in Sections [E](https://arxiv.org/html/2605.24042#A5)–[F](https://arxiv.org/html/2605.24042#A6) were measured, and the specifics are reported here so that downstream analyses can be reproduced exactly.

### B.1 Reframing as a low-dimensional spectral summary

The interpretation we end with differs in shape from the one a direct reading of the asymmetry would suggest. The asymmetry is real, large on modern architectures, and exploitable for utility-preserving releases, but the robust cross-model object is not an individual Fisher direction. The first diagnostic is the Fisher concentration $E_{k}$, but $E_{k}$ alone does not determine the projection-asymmetry sign: TinyLlama-1.1B and Phi-2 at $E_{128}\approx 0.58$ sit in the modern regime while GPT-2 Small at $E_{128}\approx 0.56$ sits in the reversed regime. The full cumulative spectrum, the normalized effective rank $r_{95}/d$, the coupling $\kappa$, and the architectural/training regime are needed to explain the observed cases. Once the geometric content is read as a low-dimensional spectral summary rather than a privileged subspace, the natural release-design problem is no longer to find the right subspace but to find the diagonal weighting that distributes utility cost evenly across coordinates, which is what $\Sigma_{\mathrm{diag}}$ does. The pieces of the empirical story we initially mis-described as evidence for two channels (the gap between $\mathrm{KL}(P_{B}h)$ and $\mathrm{KL}(P_{I}h)$, the $13\times$ Pareto gain on Mistral, the $\sqrt{k/d}$ margin law) all reduce, on closer reading, to consequences of $E_{k}$ being close to one or being close to $k/d$ together with the broader spectral shape. The defense mechanisms that survive the adaptive attacker are the ones that respect this structure rather than overfit to it.

A second methodological consequence concerns the scope of release-class statements. Prior caching-privacy work focused on individual mechanisms (rotation, quantization, dropout) without a release-class baseline; the empty-middle finding gives such a baseline. The Gaussian release class as a whole has no moderate-both cell at any utility-privacy budget we tested, and Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) explains why this is structural rather than a missed mechanism within the class. The question is therefore not how to optimize within the Gaussian class but which non-Gaussian classes can fill the moderate-both region, and the variational-bottleneck attempt of Appendix [R](https://arxiv.org/html/2605.24042#A18) provides a starting point with its 0/44 negative result as a calibrated baseline against which to measure subsequent attempts.

A third methodological note concerns interpretability bases. Sparse autoencoders, JumpReLU features, and transcoders select a basis under which hidden states factor cleanly. Our results introduce a complementary criterion: the basis that captures loss sensitivity is the Fisher basis, and the basis that captures sparsity may not be the same one. The KL asymmetry under ablation flips sign between GPT-2 and modern architectures, so any interpretability claim that relies on a fixed basis being meaningful needs to be scoped to a particular model and training regime. The Fisher basis is not unique either, since it is data-dependent and finite-sample-noisy at our calibration sizes; what is robust across calibration sets and seeds is the scalar $E_{k}$, not the individual eigenvectors. The appropriate object for cross-model interpretability comparisons is therefore a low-dimensional spectral summary rather than a coordinate-by-coordinate decomposition.

### B.2 Gradient covariance as a behavior proxy

We use the empirical gradient covariance $\Sigma_{g}=\tfrac{1}{n}\sum_{i}\nabla_{h}\mathcal{L}(x_{i},y_{i})\nabla_{h}\mathcal{L}(x_{i},y_{i})^{\top}$ as a proxy for the true Fisher information $F=\mathbb{E}_{y}[\nabla_{h}\log p\,\nabla_{h}\log p^{\top}]$. The two differ at finite training: the Fisher integrates over the model’s predicted distribution and the empirical covariance integrates over the empirical label distribution. At a converged model, they agree up to a scalar rescaling [Amari, [1998](https://arxiv.org/html/2605.24042#bib.bib7), Martens, [2020](https://arxiv.org/html/2605.24042#bib.bib8), Kunstner et al., [2019](https://arxiv.org/html/2605.24042#bib.bib30)]. Using $\Sigma_{g}$ rather than $F$ is computationally cheaper because it avoids sampling from the model’s output. Any constant rescaling would be absorbed by the eigenvalue ranking, but finite-sample instability of the empirical projector remains a real limitation at $n_{\mathrm{cal}}=200$ and is measured in Appendix [A](https://arxiv.org/html/2605.24042#A1).
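As a rough illustration of the quantities defined above, the sketch below builds $\Sigma_{g}$ from per-example gradients and reads the cumulative energy $E_{k}$ off its eigenvalues. To stay self-contained it assumes a toy linear readout $p=\mathrm{softmax}(W_{u}h)$ with cross-entropy loss, for which $\nabla_{h}\mathcal{L}=W_{u}^{\top}(p-e_{y})$; the readout, the random data, and the function names are assumptions for this example, not the paper's models.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def grad_h_cross_entropy(W_u, h, y):
    """Gradient of -log softmax(W_u h)[y] with respect to h (toy linear readout)."""
    p = softmax(W_u @ h)
    p[y] -= 1.0          # p - one_hot(y)
    return W_u.T @ p

def gradient_covariance(grads):
    """Sigma_g = (1/n) * sum_i g_i g_i^T."""
    G = np.asarray(grads)
    return G.T @ G / G.shape[0]

def cumulative_energy(S):
    """E_k for k = 1..d: share of eigenvalue mass in the top-k eigenvalues."""
    w = np.linalg.eigvalsh(S)[::-1]   # descending order
    w = np.clip(w, 0.0, None)         # guard against tiny negative round-off
    return np.cumsum(w) / w.sum()

# Toy data (hypothetical sizes).
rng = np.random.default_rng(0)
d, vocab, n = 16, 40, 300
W_u = rng.normal(size=(vocab, d))
H = rng.normal(size=(n, d))
Y = rng.integers(0, vocab, size=n)

Sigma_g = gradient_covariance([grad_h_cross_entropy(W_u, h, y) for h, y in zip(H, Y)])
E = cumulative_energy(Sigma_g)        # E[k - 1] is E_k
```

Since $E_{k}$ is a ratio of eigenvalue sums, multiplying $\Sigma_{g}$ by any positive constant leaves it unchanged, which is the sense in which a scalar rescaling between $\Sigma_{g}$ and $F$ would not matter for the ranking.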

### B.3 Layer selection

We use the mid-network layer at proportional depth $\ell=L/2$ throughout. Three considerations motivate the choice: (i) late layers carry task-specific computation and their projections destroy predictions in ways that obscure the asymmetry signal; (ii) early layers carry token-level features and their projections leave predictions near-untouched because most downstream computation has not happened; (iii) the mid-layer balances these. Appendix [C](https://arxiv.org/html/2605.24042#A3) confirms that late-layer behavior ($\ell=11$ on GPT-2 Small) does produce a qualitatively different asymmetry: the gap between the $P_{B}$ and $P_{I}$ KL values changes sign at a crossover near $k\approx 220$.

### B.4 Calibration set design

The calibration set is a fixed random sample of prefixes from a natural-language validation split. For models with $\leq 3$B parameters we use $n_{\mathrm{cal}}=2000$ prefixes; for 7–14B models we use $n_{\mathrm{cal}}=200$ because memory-intensive gradient accumulation scales with model size. The prefix length is $32$ tokens. Calibration and evaluation sets are disjoint: the first $n_{\mathrm{cal}}$ prefixes go into calibration, and the next $n_{\mathrm{eval}}=500$ go into all downstream KL, margin, and retrieval evaluations.

### B.5 Subspace dimension grid

We evaluate $k\in\{32,64,128,256\}$ on GPT-2 Small and $k\in\{32,64,128,256,512\}$ on the 7–14B models. The main-body results use $k=128$ as the canonical point because it sits above the half-energy crossover on GPT-2 ($E_{128}\approx 0.56$) and below the saturation point on modern architectures ($E_{128}\approx 0.99$), giving a single $k$ value where the asymmetry is observable in both regimes.

### B.6 Quantization scheme

Margin tables are reported at 8-bit quantization unless otherwise stated. We use symmetric uniform quantization with per-coordinate step $\alpha_{j}=s_{j}/(2^{b-1}-1)$, where $b$ is the bit width and $s_{j}$ is the 99.5th-percentile absolute value of coordinate $j$ on the calibration set. Values outside the calibration range are clipped. The deterministic bound in Proposition [4](https://arxiv.org/html/2605.24042#Thmproposition4) holds for all values inside the calibration range, which covers $\geq 99\%$ of evaluation-time hidden states in our measurements.
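A minimal sketch of a symmetric uniform quantizer of this form follows. It assumes the scale is the 99.5th percentile of the absolute value per coordinate and that rounding is to the nearest level; the function names and the Gaussian calibration data are placeholders introduced for this example.

```python
import numpy as np

def calibrate_scales(H_cal, q=99.5):
    """Per-coordinate scale s_j: q-th percentile of |h_j| over calibration rows."""
    return np.percentile(np.abs(H_cal), q, axis=0)

def quantize(h, s, bits=8):
    """Symmetric uniform quantization with step alpha_j = s_j / (2^(bits-1) - 1)."""
    levels = 2 ** (bits - 1) - 1
    alpha = s / levels
    clipped = np.clip(h, -s, s)         # values outside the calibration range are clipped
    codes = np.round(clipped / alpha)   # integer codes in [-levels, levels]
    return codes * alpha                # dequantized value

# Placeholder calibration data: 1000 hidden states of width 16.
rng = np.random.default_rng(0)
H_cal = rng.normal(size=(1000, 16))
s = calibrate_scales(H_cal)

h = np.clip(rng.normal(size=16), -s, s)   # a state inside the calibration range
h_q = quantize(h, s, bits=8)
max_err = np.max(np.abs(h_q - h) / (s / 127))   # error in units of the step size
```

For in-range values, round-to-nearest gives a per-coordinate error of at most half a step, $\alpha_{j}/2$; clipped values can incur a larger error, which is why a bound of this kind only covers states inside the calibration range.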

### B.7 One-step margin computation

The one-step margin at a prefix $x$ is $m(x)=\min_{v\neq y}\|h(x\oplus y)-h(x\oplus v)\|_{2}$, where $y$ is the ground-truth next token, $v$ ranges over the vocabulary, $\oplus$ is concatenation, and $h$ is the layer-$\ell$ hidden state of the last token. The minimum is taken over the $50{,}257$ GPT-2 vocabulary entries (or the corresponding vocabulary on other tokenizers). Computing the full minimum is memory-prohibitive for 7–14B models at 500 prefixes; we therefore use a 500-distractor bank (disjoint from the calibration set) as a uniform sample of the vocabulary distribution. The reported margins are medians across prefixes.
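The margin definition translates directly into code. In the sketch below, `toy_hidden` is a hypothetical stand-in for the layer-$\ell$ last-token hidden state (a position-weighted embedding average passed through `tanh`), and the vocabulary size, prefix, and bank size are arbitrary; only `one_step_margin` mirrors the formula above.

```python
import numpy as np

def one_step_margin(hidden_fn, prefix, y, candidates):
    """min over v != y of || h(prefix + [y]) - h(prefix + [v]) ||_2."""
    h_true = hidden_fn(prefix + [y])
    dists = [
        np.linalg.norm(h_true - hidden_fn(prefix + [v]))
        for v in candidates
        if v != y
    ]
    return float(min(dists))

# Hypothetical hidden-state function over a 100-token toy vocabulary.
rng = np.random.default_rng(0)
EMB = rng.normal(size=(100, 16))

def toy_hidden(tokens):
    weights = np.arange(1, len(tokens) + 1)[:, None]   # later tokens weigh more
    return np.tanh((weights * EMB[tokens]).sum(axis=0) / len(tokens))

prefix, y = [3, 14, 15], 9
full_margin = one_step_margin(toy_hidden, prefix, y, range(100))

# Distractor bank: a uniform sample of the vocabulary instead of the full set.
bank = rng.choice(100, size=20, replace=False)
bank_margin = one_step_margin(toy_hidden, prefix, y, bank)
```

One property worth keeping in mind when reading bank-based margins: a minimum over a subset can never be smaller than the minimum over the full vocabulary, so a sampled bank yields an upper bound on the exact margin.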

### B.8 Eigenvalue spectrum structure

The eigenspectrum in Figure [7](https://arxiv.org/html/2605.24042#A4.F7) is the full 1024-dimensional gradient-covariance spectrum on GPT-2 Medium at layer 12. The spectrum spans fourteen orders of magnitude between the largest and smallest eigenvalues. The first 128 eigenvalues carry $50.1\%$ of total gradient energy; the bottom 512 carry less than $0.01\%$ combined. This extreme ill-conditioning is what makes the identity subspace $P_{I}$ perceptually unmeasurable at first order: first-order Fisher sensitivity is below machine epsilon for most of $P_{I}$.

## Appendix C Extended utility and margin results

The projections $P_{B}h$ and $P_{I}h$ depend on $k$, the choice of layer, and the numerical precision of the stored state. The main body reports results at $k=128$ and $\ell=L/2$ with float32 weights. This appendix expands the grid.

### C.1 Full KL/top-1 grid on GPT-2 Small

Table [4](https://arxiv.org/html/2605.24042#A3.T4) reports the KL divergence and top-1 agreement of projected states relative to the clean distribution across $k\in\{32,64,128,256\}$ at $\ell=6$ on GPT-2 Small. The pattern is consistent: at small $k$, projecting onto $P_{I}$ preserves top-1 agreement better than projecting onto $P_{B}$ because most of the hidden-state magnitude lives in $P_{I}$; at $k=128$ (the main-body choice) $P_{B}$ and $P_{I}$ cross, with $P_{B}$ retaining more top-1 agreement and $P_{I}$ producing a larger KL. A random rank-$k$ subspace produces uniformly worse top-1 agreement and a uniformly larger KL than either $P_{B}$ or $P_{I}$.

Table 4: KL divergence and top-1 agreement at $\ell=6$ on GPT-2 Small under float32 projection.

### C.2 Margin preservation under 4/6/8/16-bit quantization

Proposition [4](https://arxiv.org/html/2605.24042#Thmproposition4) gives a deterministic bound on post-quantization margins. Table [5](https://arxiv.org/html/2605.24042#A3.T5) reports the measured behavior- and identity-subspace margins across bit widths. At $k=128$, the identity margin loses only $\approx 0.5$ units between 16-bit and 4-bit, well under the $2\varepsilon_{b}$ deterministic bound; the behavior margin rises slightly at 4-bit because quantization noise pushes nearest-neighbor tokens apart, but the $P_{B}/P_{I}$ ratio is preserved within $5\%$.

Table 5: Median one-step margin on GPT-2 Small at $\ell=6$ across bit widths. The full 8-bit margin is $47.4$.

### C.3 Late-layer behavior

At $\ell=11$ (final layer of GPT-2 Small), the pattern is similar, but the KL values are larger and the margins smaller. The full margin shrinks from $47.4$ at $\ell=6$ to $11.1$ at $\ell=11$, and the $P_{B}/P_{I}$ KL ratio inverts in sign for small $k$: at $k=32$, $\mathrm{KL}(P_{B}h)>\mathrm{KL}(P_{I}h)$ because the top 32 Fisher directions now carry much of the pre-logit computation. At $k=256$, $\mathrm{KL}(P_{B}h)$ falls below $\mathrm{KL}(P_{I}h)$ and the behavior fraction of margin rises above $50\%$. The crossover between the two regimes is the $k$ at which the Fisher subspace captures approximately half the gradient energy, which on GPT-2 Small is $k\approx 100$ at $\ell=6$ and $k\approx 220$ at $\ell=11$. The $k=128$ used in the main body therefore sits above the crossover at $\ell=6$, in the $P_{B}$-preserving regime, but below it at $\ell=11$, in the $P_{I}$-preserving regime, matching the GPT-2 Medium measurement in Bhaskar et al. [[2024](https://arxiv.org/html/2605.24042#bib.bib15)].

### C.4 Gradient energy across layers

Table [6](https://arxiv.org/html/2605.24042#A3.T6) reports the cumulative gradient-covariance energy at $\ell=6$ and $\ell=11$ of GPT-2 Small. The top 128 directions capture $55$–$56\%$ of the gradient variance at both layers, consistent with GPT-2 being a low-concentration model where $E_{128}$ is well below the $>99\%$ regime of modern 7–14B models.

Table 6: Cumulative Fisher energy $E_{k}$ at two layers of GPT-2 Small.

### C.5 Cross-model KL and margin figures

Two figures extend the asymmetry and margin analyses: Figure [5](https://arxiv.org/html/2605.24042#A3.F5) compares $P_{B}$ vs $P_{I}$ KL across five models, and Figure [6](https://arxiv.org/html/2605.24042#A3.F6) shows the utility-margin Pareto frontier on GPT-2 Small.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x5.png)

Figure 5: KL divergence under $P_{B}$ versus $P_{I}$ projection at $k=128$ across five models. The gap between the two curves widens from $0.5\times$ on GPT-2 Small (identity more destructive) to $153\times$ on DeepSeek-R1-14B (behavior more destructive), matching the direction-flip reported in Table [8](https://arxiv.org/html/2605.24042#A5.T8).

![Refer to caption](https://arxiv.org/html/2605.24042v1/x6.png)

Figure 6: Utility-margin Pareto frontier on GPT-2 Small at $\ell=6$ under 8-bit quantization. Two configurations are Pareto-optimal for different utility/privacy preferences: the identity projection at $k=32$ (high margin, moderate utility) and the behavior projection at $k=256$ (low margin, high utility).

## Appendix D Gradient-energy spectrum

The concentration of gradient energy in a low-dimensional subspace is what makes the two-channel structure possible. This appendix reports the cumulative fraction of gradient-covariance eigenvalue mass captured by the top-$k$ eigenvectors across models, complementing the summary in Figure [9](https://arxiv.org/html/2605.24042#A5.F9)(A).

Table [7](https://arxiv.org/html/2605.24042#A4.T7) reports the cumulative energy for GPT-2 Medium at layer 12 ($d{=}1024$). The spectrum decays smoothly: 50% of gradient energy concentrates in the top 128 directions (12.5% of the full dimension), while the remaining 896 directions carry the other 50%. This half-and-half split aligns with the empirical finding that projecting onto either the top-128 or bottom-896 subspace destroys roughly half the margin.

Table 7: Cumulative gradient-covariance energy for GPT-2 Medium, layer 12 ($d{=}1024$).

The leading eigenvalue is $4.7\times 10^{14}$ times larger than the trailing eigenvalue, indicating that the gradient covariance is severely ill-conditioned (Figure [7](https://arxiv.org/html/2605.24042#A4.F7)). Numerically, the bottom eigenvalues are near machine epsilon and contribute negligible gradient signal. This extreme dynamic range is consistent with the identity subspace carrying almost no first-order behavioral information while still maintaining large $\ell_{2}$ separations between hidden states. Figure [8](https://arxiv.org/html/2605.24042#A4.F8) compares the cumulative-energy curves across six models.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x7.png)

Figure 7: Gradient-covariance eigenspectrum on GPT-2 Medium at layer 12. The spectrum spans fourteen orders of magnitude; the effective numerical rank is $\approx 500$, matching the $E_{512}\approx 89\%$ entry in Table [7](https://arxiv.org/html/2605.24042#A4.T7).

![Refer to caption](https://arxiv.org/html/2605.24042v1/x8.png)

Figure 8: Cumulative gradient-covariance energy versus subspace dimension across six models. The modern architectures reach high calibration-sample energy at smaller $k$ than GPT-2 Large; Mistral-7B, Qwen3-14B, DeepSeek-R1-14B, and Phi-2 reach $90\%$ energy at $k\leq 128$, while GPT-2 Large still requires several hundred directions. The curves cross the $90\%$ horizontal dashed line at qualitatively different locations, driving the projection-asymmetry sign difference reported in Table [8](https://arxiv.org/html/2605.24042#A5.T8).

## Appendix E Per-model projection asymmetry

The empirical core of the paper is that replacing the hidden state $h$ with its projection onto the Fisher complement $P_{I}h$ preserves the model’s predictions, while replacing it with the projection onto the Fisher subspace $P_{B}h$ destroys them. We measure this by injecting the projected state back into the residual stream at layer $\ell$ and completing the forward pass to the final logits, then reporting the median KL divergence between the resulting next-token distribution and the clean one over a held-out set of $2000$ prefixes drawn from a natural-language validation split. All measurements use $k=128$ and the layer at proportional depth $\ell=L/2$. Table [8](https://arxiv.org/html/2605.24042#A5.T8) reports both projections across ten models spanning 124M to 14.8B parameters; Figure [9](https://arxiv.org/html/2605.24042#A5.F9) summarizes the spectrum and KL asymmetry.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x9.png)

Figure 9: (A) Cumulative Fisher energy $E_{k}$ vs. $k/d$ for the ten models in Table [8](https://arxiv.org/html/2605.24042#A5.T8). Modern 7–14B models saturate at $k/d\approx 0.03$; GPT-2 needs $\approx 0.4$. (B) Median KL under $P_{B}$ and $P_{I}$ projections; on every modern architecture $\mathrm{KL}(P_{I}h)\ll\mathrm{KL}(P_{B}h)$, on the GPT-2 family the asymmetry reverses.

Table 8: Projection asymmetry at $k=128$, layer $\ell=L/2$. $\mathrm{KL}(P_{B}h)$ and $\mathrm{KL}(P_{I}h)$ are median KL divergences when $h$ is replaced by its projection onto the Fisher subspace or its complement. OLMoE-1B-7B, used for the margin scaling law (Section [F](https://arxiv.org/html/2605.24042#A6)), is omitted here; see Appendix [T](https://arxiv.org/html/2605.24042#A20).

On the seven modern-architecture models the asymmetry is in the direction the abstract claims: projecting onto the Fisher complement $P_{I}$ leaves the model’s predictions nearly intact, while projecting onto $P_{B}$ destroys them. On Mistral-7B, replacing $h$ with $P_{I}h$ costs only 0.065 nats of KL, less than the cost of standard $8$-bit activation quantization [Dettmers et al., [2022](https://arxiv.org/html/2605.24042#bib.bib29)], while replacing it with $P_{B}h$ costs 5.29 nats, enough to essentially randomize the next-token prediction. The ratio spans 52–509$\times$ across the modern group. On Qwen3-14B and DeepSeek-R1-14B the residual subspace carries over 99% of the predictive information measured this way.

The empirical pattern correlates with spectral concentration. On Mistral-7B, Qwen3-14B, and DeepSeek-R1-14B the top-$128$ eigenvectors of the empirical gradient covariance estimated from $n_{\mathrm{cal}}=200$ prefixes capture more than $99\%$ of the calibration-sample gradient energy (the train-half top-$128$ projector covers only $5$–$7\%$ of held-out gradient energy at $n_{\mathrm{cal}}=200$, Appendix [A](https://arxiv.org/html/2605.24042#A1)); in these measurements, the sample gradient signal lives in a small fraction of the ambient $d$ dimensions, and these are exactly the models on which $\mathrm{KL}(P_{I}h)$ is small. We caution against reading this as a direct consequence of the local Fisher expansion in equation ([1](https://arxiv.org/html/2605.24042#S3.E1)). The deterministic projection $h\mapsto P_{I}h$ has perturbation $\delta h=-P_{B}h$, which lies in the high-Fisher subspace and is generally not small in the Fisher metric, so $\tfrac{1}{2}\,\delta h^{\top}F\,\delta h$ would predict a large KL for $P_{I}h$, not a small one. The fact that $P_{I}h$ nevertheless preserves the next-token distribution on modern models is therefore a non-local empirical phenomenon: the remaining transformer layers read the off-manifold injected state and recover a logit distribution close to the clean one, and the local quadratic approximation does not govern the outcome. We treat the projection-KL ratios in Table [8](https://arxiv.org/html/2605.24042#A5.T8) as empirical diagnostics of where prediction-tracking information appears to reside under nonlocal interventions, with $E_{k}$ as a correlate of the direction of the asymmetry rather than a quantitative predictor of its magnitude. The local expansion is the right tool for additive-noise utility (Section [4](https://arxiv.org/html/2605.24042#S4)); it is not the right tool for explaining why deleting the high-Fisher component of $h$ is recoverable.

The GPT-2 family shows the opposite direction. On GPT-2 Small, Large, and XL the top-$128$ Fisher subspace captures only between $53$ and $58\%$ of gradient energy, and the empirical KL behavior reverses: $\mathrm{KL}(P_{I}h)$ is comparable to or slightly larger than $\mathrm{KL}(P_{B}h)$ at fixed $k$. We do not derive this reversal from the local expansion either; the projection $h\mapsto P_{I}h$ on a diffuse spectrum still removes a high-Fisher-mass component, and the local quadratic does not predict that the resulting nonlocal forward pass should yield a small KL. The empirical observation is that on diffuse-spectrum models neither projection is recoverable, while on concentrated-spectrum models the $P_{I}$ projection is. $E_{128}$ alone is not the controlling scalar: TinyLlama-1.1B and Phi-2 at $E_{128}\approx 0.58$, essentially the same value as GPT-2 Small/Large/XL ($0.53$–$0.58$), sit firmly in the modern regime with $52\times$–$62\times$ asymmetry. We treat $E_{k}$, the cumulative spectrum, and architectural/training choices as joint diagnostics, with the underlying nonlocal mechanism (whether subsequent layers can compensate for the off-manifold injection) operating differently in the two regimes. Architectural choices that concentrate gradient, specifically RMSNorm [Zhang and Sennrich, [2019](https://arxiv.org/html/2605.24042#bib.bib23)], rotary position embeddings [Su et al., [2024](https://arxiv.org/html/2605.24042#bib.bib24)], gated-linear MLPs [Shazeer, [2020](https://arxiv.org/html/2605.24042#bib.bib25)], and training recipes with 1–15T tokens, produce the concentrated spectrum that makes $P_{B}$ a thin slice of high-sensitivity directions. GPT-2’s LayerNorm-with-mean-subtraction [Ba et al., [2016](https://arxiv.org/html/2605.24042#bib.bib16)] and shorter training budget produce the diffuse spectrum that makes $P_{B}$ and $P_{I}$ comparably informative.

The margin metric tells a different story. At the same $k=128$, the median nearest-neighbor $\ell_{2}$ margin between projected hidden states shows no analogous asymmetry. The behavior-subspace margin fraction $m_{B}/m_{\mathrm{full}}$ lies within a few percent of the random-projection prediction $\sqrt{k/d}$ on every model, and on TinyLlama-1.1B the behavior and random fractions agree to four decimal places (0.2377 vs. 0.2377). By contrast the identity-subspace margin fraction is above 93% on every model. This is consistent with Proposition [1](https://arxiv.org/html/2605.24042#Thmproposition1): the behavior-margin fraction is governed by random-projection geometry because the inter-prefix difference distribution is close to isotropic at the relevant scale, so any rank-$k$ projector, Fisher-aligned or not, captures a fraction of squared $\ell_{2}$ mass close to $k/d$ and hence a norm fraction close to $\sqrt{k/d}$. The GPT-2 direction flip is absent from the margin table, and the Mistral-vs-GPT-2 ratio of behavior-margin fractions tracks $\sqrt{k/d}$, whereas the KL asymmetry tracks $E_{k}$. The structural content of the Fisher decomposition lives in the KL metric (which weighs by loss sensitivity) and washes out under the margin metric (which weighs by raw $\ell_{2}$ spread).
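The random-projection baseline can be checked numerically. The sketch below draws isotropic Gaussian vectors as a stand-in for inter-prefix differences (an idealization; real differences are only approximately isotropic) and measures how much of each vector's norm survives a random rank-$k$ projector. The sizes $d=2048$ and $k=128$ are chosen so that $\sqrt{k/d}=0.25$; the function names are illustrative.

```python
import numpy as np

def random_subspace_basis(d, k, rng):
    """Orthonormal basis (d x k) of a random k-dimensional subspace via QR."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q

def norm_fraction(Q, deltas):
    """||P delta|| / ||delta|| per row, where P = Q Q^T."""
    coords = deltas @ Q   # subspace coordinates; their norm equals ||P delta||
    return np.linalg.norm(coords, axis=1) / np.linalg.norm(deltas, axis=1)

rng = np.random.default_rng(0)
d, k, n = 2048, 128, 500
Q = random_subspace_basis(d, k, rng)
deltas = rng.normal(size=(n, d))          # isotropic stand-in for h(x) - h(x')

frac = norm_fraction(Q, deltas)
median_frac = float(np.median(frac))      # compare with sqrt(k / d)
squared_mean = float(np.mean(frac ** 2))  # compare with k / d
```

The squared fraction averages to $k/d$ and the norm fraction concentrates near $\sqrt{k/d}$, with no reference to any Fisher structure, which is why agreement with this baseline is not evidence for a distinguished subspace.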

The asymmetry is the claim that motivates everything downstream. Spectral concentration is why a defender can place noise in low-Fisher directions at low utility cost (Section [4](https://arxiv.org/html/2605.24042#S4)). The margin fraction’s agreement with $\sqrt{k/d}$ is why the scaling law across 10 models is a random-projection result rather than evidence for two channels (Section [F](https://arxiv.org/html/2605.24042#A6)). And the GPT-2 reversal is why claims about hidden-state structure in the interpretability literature [Elhage et al., [2022](https://arxiv.org/html/2605.24042#bib.bib6), Park et al., [2024](https://arxiv.org/html/2605.24042#bib.bib35), Nanda et al., [2023](https://arxiv.org/html/2605.24042#bib.bib34), Templeton et al., [2024](https://arxiv.org/html/2605.24042#bib.bib4)] need to be scoped to the model whose spectrum was measured: the direction of the asymmetry is determined by the training regime rather than by the transformer architecture itself. Appendix [L](https://arxiv.org/html/2605.24042#A12) reports a direct transplantation test of where the predictions actually live, Appendix [C](https://arxiv.org/html/2605.24042#A3) extends the KL/top-1/margin grid across bit-widths and layers, and Appendix [D](https://arxiv.org/html/2605.24042#A4) reports the GPT-2 Medium eigenspectrum.

## Appendix FMargin scaling law per\-model

At a fixedkk, the behavior\-margin fractionmB/mfullm\_\{B\}/m\_\{\\mathrm\{full\}\}across our ten models is a clean function ofddalone, read in earlier work as evidence for a low\-dimensional behavior channel\[Nazir and others,[2025](https://arxiv.org/html/2605.24042#bib.bib13)\]\. We show the law is predicted to leading order by Proposition[1](https://arxiv.org/html/2605.24042#Thmproposition1)applied to approximately isotropic inter\-prefix differences, with no appeal to the Fisher subspace being distinguished\.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x10.png)

Figure 10: Behavior-margin fraction $m_B/m_{\mathrm{full}}$ vs. $\sqrt{k/d}$ at $k=128$, $\ell=L/2$. Dashed line is $y=\sqrt{k/d}$ from Proposition [1](https://arxiv.org/html/2605.24042#Thmproposition1); linear fit gives $R^2=0.93$.

At $k=128$, the margin fraction tracks $\sqrt{k/d}$ across all ten models: the linear regression gives slope $0.93$, intercept $0.03$, and $R^2=0.93$ (Figure [10](https://arxiv.org/html/2605.24042#A6.F10)). The largest deviation is $13\%$ (OLMoE-1B-7B, $0.283$ vs. $0.250$), scale has no systematic effect on the residual, and the residual does not correlate with $E_{128}$. Proposition [1](https://arxiv.org/html/2605.24042#Thmproposition1) extends to a fixed-projector sufficient condition (Appendix [Q.5](https://arxiv.org/html/2605.24042#A17.SS5)): for any rank-$k$ projector, the expected squared $\ell_2$ fraction equals $k/d \pm (k/d)\,\varepsilon_{\mathrm{iso}}$, where $\varepsilon_{\mathrm{iso}} = \|d\,\Sigma_\delta^{(\mathrm{unit})} - I\|_{\mathrm{op}}$. In our data $\varepsilon_{\mathrm{iso}}$ is large in absolute terms, so the operator-norm bound is conservative; the $\sqrt{k/d}$ law should be read primarily as an empirical alignment result, supported by random-projector variance calculations and direct measurements of $\mathrm{tr}(P_B\Sigma_\delta)$ (Appendix [Q.5](https://arxiv.org/html/2605.24042#A17.SS5)), rather than as a tight consequence of the worst-case theorem. The structural content lives in KL, where $\mathrm{KL}(P_I h) \ll \mathrm{KL}(P_B h)$ on modern models and the reverse holds on GPT-2 at matched $E_k$.

The 7–14B models also have much smaller sample effective rank, with $r_{95}\approx 100$ at $n_{\mathrm{cal}}=200$, but split-half validation shows that these top-$k$ projectors are not stable at this calibration size. Appendix [G](https://arxiv.org/html/2605.24042#A7) gives the per-model interpolation and Appendix [H](https://arxiv.org/html/2605.24042#A8) places each model on the $(E_k,\kappa,\rho)$ grid.
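The isotropic baseline behind the $\sqrt{k/d}$ law can be checked numerically. The following is a minimal sketch on synthetic data, not the paper's measurement code: it assumes exactly isotropic Gaussian differences and a random rank-$k$ projector, and the names `projected_norm_fraction`, `basis`, and `deltas` are illustrative.

```python
import numpy as np

def projected_norm_fraction(deltas, basis):
    """Return ||P delta|| / ||delta|| per row, where P = basis @ basis.T."""
    coords = deltas @ basis  # coordinates of each row in the k-dim subspace
    return np.linalg.norm(coords, axis=1) / np.linalg.norm(deltas, axis=1)

rng = np.random.default_rng(0)
d, k, n = 1024, 128, 2000  # illustrative sizes, not the paper's models

# Orthonormal basis of a random k-dimensional subspace (reduced QR).
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))

# Synthetic, exactly isotropic inter-prefix differences (an assumption).
deltas = rng.standard_normal((n, d))

frac = projected_norm_fraction(deltas, basis)
mean_sq_fraction = float(np.mean(frac ** 2))  # should be close to k/d
mean_fraction = float(np.mean(frac))          # should be close to sqrt(k/d)

print(mean_sq_fraction, k / d)
print(mean_fraction, np.sqrt(k / d))
```

Under these assumptions the squared fraction concentrates at $k/d$, so any measured departure from $\sqrt{k/d}$ reflects anisotropy of the real difference distribution rather than the projector itself.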

## Appendix G Effective-rank estimates

For models where we collected $n_{\mathrm{cal}}=2000$ calibration prefixes (GPT-2 Large, GPT-2 XL, TinyLlama-1.1B, Phi-2, Qwen2.5-3B), the empirical effective rank $r_{95}=\min\{k : E_k \geq 0.95\}$ is read directly from the gradient-covariance eigenspectrum. For Mistral-7B, Qwen3-14B, and DeepSeek-R1-14B we use $n_{\mathrm{cal}}=200$, for which Appendix [A](https://arxiv.org/html/2605.24042#A1) shows that the top-$128$ projector is not split-half stable. The following values are therefore sample effective ranks. The cumulative energy crosses $0.95$ between the measurement points at $k=64$ and $k=128$, so we linearly interpolate:

$$r_{95} \;\approx\; 64 \,+\, 64\cdot\frac{0.95 - E_{64}}{E_{128} - E_{64}}. \tag{8}$$

The resulting sample estimates are $r_{95}\approx 96$ on Mistral-7B ($E_{64}=0.909$, $E_{128}=0.992$), $r_{95}\approx 100$ on Qwen3-14B ($E_{64}=0.892$, $E_{128}=0.994$), and $r_{95}\approx 107$ on DeepSeek-R1-14B ($E_{64}=0.863$, $E_{128}=0.992$). The interpolation is a compact description of the measured sample spectrum, not a population-rank estimate.
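The interpolation in equation (8) is simple enough to reproduce directly. The sketch below plugs in the three $(E_{64}, E_{128})$ pairs quoted above; the function name `interpolate_r95` and its keyword arguments are illustrative, not taken from the paper's code.

```python
def interpolate_r95(e64, e128, target=0.95, k_lo=64, k_hi=128):
    """Linear interpolation of the rank at which cumulative energy hits `target`."""
    return k_lo + (k_hi - k_lo) * (target - e64) / (e128 - e64)

# (E_64, E_128) pairs as quoted in the text.
spectra = {
    "Mistral-7B": (0.909, 0.992),
    "Qwen3-14B": (0.892, 0.994),
    "DeepSeek-R1-14B": (0.863, 0.992),
}

estimates = {name: interpolate_r95(e64, e128) for name, (e64, e128) in spectra.items()}
for name, value in estimates.items():
    print(f"{name}: r95 ~ {value:.1f}")
```

Rounded to the nearest integer, the three values agree with the sample estimates stated in the text.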

For smaller models with $n_{\mathrm{cal}}=2000$, the same interpolation recovers the directly measured $r_{95}$ to within $4$ units, so the arithmetic is not the limiting issue. The limiting issue is subspace stability at $n_{\mathrm{cal}}=200$. We keep $\rho=r_{95}/d$ in Table [9](https://arxiv.org/html/2605.24042#A8.T9) because it is useful as a sample spectral summary, but we do not interpret the 7–14B values as stable population projector dimensions.

## Appendix H Cross-model regime analysis

Section [3](https://arxiv.org/html/2605.24042#S3) introduces three scalar axes $E_k$, $\kappa$, and $\rho$ that summarize the geometry of the Fisher decomposition at a fixed layer. This appendix places each model in our measurement set on the $(E_k,\kappa)$ plane, identifies the four qualitative regimes, and reports the per-prefix cross-sectional analyses that explain Phi-2’s anomalous position and the reasoning-vs-standard divergence.

### H.1 The four regimes

Let $E_k$ be the Fisher concentration (fraction of gradient-covariance eigenvalue mass in the top-$k$ directions) and $\kappa=\mathrm{tr}(P_B\Sigma_\delta)/\bigl((k/d)\,\mathrm{tr}(\Sigma_\delta)\bigr)$ the Fisher-margin coupling. The $(E_k,\kappa)$ plane partitions into four qualitative regimes:

- Low $E_k$, low $\kappa$ (GPT-2 family): the Fisher is diffuse and has no special alignment with the margin-direction covariance. Both projections $P_B h$ and $P_I h$ carry comparable predictive information, and the projection-asymmetry sign is near zero or mildly reversed (Section [E](https://arxiv.org/html/2605.24042#A5)).
- High $E_k$, low $\kappa$ (Mistral-7B, Qwen3-14B, DeepSeek-R1-14B): the Fisher is highly concentrated and the margin-covariance mass sits mostly in the complement $P_I$. This regime supports the strongest generalized-eigen defense because both ingredients (low-utility directions in $P_B^{\perp}$ and high-margin directions in $P_I$) line up.
- High $E_k$, high $\kappa$ (Phi-2): the Fisher is concentrated but the top Fisher directions also capture unusual amounts of margin-covariance mass, indicating that the training-data distribution placed more prompt-distinguishing information in the gradient-sensitive directions than average. The generalized-eigen advantage is moderate, and the behavior-versus-random excess is the largest in the measurement set ($0.257$ vs. $0.214$, a $20\%$ excess).
- Low $E_k$, high $\kappa$: unobserved. This regime would correspond to a diffuse Fisher concentrated on margin-carrying directions, which is not produced by any of the training recipes we measured.
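The two axes that define the regimes follow directly from their definitions. The sketch below computes $E_k$ and $\kappa$ for a given Fisher matrix and margin covariance; it uses a synthetic diagonal Fisher and an isotropic $\Sigma_\delta$ as stand-ins (both assumptions), and the function name `fisher_axes` is illustrative. No regime thresholds are encoded, since the text does not state numeric cutoffs.

```python
import numpy as np

def fisher_axes(F, Sigma_delta, k):
    """Return (E_k, kappa) for Fisher matrix F and margin covariance Sigma_delta."""
    d = F.shape[0]
    evals, evecs = np.linalg.eigh(F)            # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:k]           # indices of the k largest eigenvalues
    E_k = evals[top].sum() / evals.sum()        # Fisher concentration
    U = evecs[:, top]                           # orthonormal basis of P_B
    captured = np.trace(U.T @ Sigma_delta @ U)  # equals tr(P_B Sigma_delta)
    kappa = captured / ((k / d) * np.trace(Sigma_delta))
    return float(E_k), float(kappa)

# Synthetic example (not measured data): 4 strong Fisher directions out of 16,
# paired with an exactly isotropic margin covariance.
F = np.diag([10.0] * 4 + [1.0] * 12)
E_k, kappa = fisher_axes(F, np.eye(16), k=4)
print(E_k, kappa)
```

With an isotropic $\Sigma_\delta$ the coupling is exactly $1$ whatever the Fisher spectrum, which makes $\kappa=1$ the natural reference point for "no special alignment".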

### H.2 Per-model placement

Table [9](https://arxiv.org/html/2605.24042#A8.T9) places every model in our measurement set on the three axes. The 7–14B models have strong calibration-sample concentration ($E_{128}>0.99$) and cluster in the high-$E_k$/low-$\kappa$ regime, the three GPT-2 models cluster in the low-$E_k$/low-$\kappa$ regime, and Phi-2 is alone in the high-$E_k$/high-$\kappa$ corner.

Table 9: Placement of ten models on the three-axis geometry. $E_{128}$ is Fisher concentration, $\kappa$ is coupling between $P_B$ and $\Sigma_\delta$, and $\rho$ is the effective-rank fraction. For 7–14B models, $E_{128}$ and $\rho$ are calibration-sample spectral summaries because Appendix [A](https://arxiv.org/html/2605.24042#A1) shows that the top-$128$ projector is not split-half stable at $n_{\mathrm{cal}}=200$. $\kappa$ values for models where random-projection baselines were not recorded are estimated as $(m_B/m_{\text{rand}})^2$ on the subset where the comparison exists.

### H.3 Cross-sectional per-prefix analyses

Two per\-prefix observations refine the model\-level picture\.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x11.png)

Figure 11: (A) Per-prefix anti-correlation between behavior-projection and identity-projection fractions on Qwen3-14B: prefixes where $P_B h$ destroys prediction are precisely those where $P_I h$ preserves it, with Pearson $r=-0.52$ ($p=1.8\times 10^{-2}$) on the 16-prefix subset shown and Spearman $\rho=-0.71$ on the full 500-prefix set. The anti-correlation is itself a signature of the asymmetry described in Section [E](https://arxiv.org/html/2605.24042#A5) and provides a per-prefix control that the model-level KL asymmetry is not an artifact of averaging. (B) Margin composition by hidden-state-margin quintile on GPT-2 Small: the bottom quintile of prefixes (smallest full margins) has the largest residual margin outside the Fisher subspace.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x12.png)

Figure 12: Gradient-energy concentration for reasoning (DeepSeek-R1-Distill-Qwen-14B) versus standard (Qwen3-14B) models at matched architecture and scale. The reasoning model has slightly lower $E_{128}$ at identical $d$ and $L$, consistent with chain-of-thought training adding to rather than replacing the base model’s Fisher structure.

The cross-sectional analyses support three conclusions: the three-axis framework is empirically load-bearing across models; the per-prefix anti-correlation confirms that the asymmetry holds at the level of individual examples; and Phi-2 and the MoE models are outliers whose $\kappa$ values sit at the upper end of the observed range. Figure [12](https://arxiv.org/html/2605.24042#A8.F12) compares Fisher concentration on a reasoning model (DeepSeek-R1-Distill-Qwen-14B) against the standard Qwen3-14B at matched architecture.

## Appendix I Cross-architecture extended results

This appendix reports the full GPT-2 Medium (layer 12, $d=1024$) and Gemma-2B (layer 13, $d=2304$) utility and margin results that complement the main-body Table [8](https://arxiv.org/html/2605.24042#A5.T8) and extend the cross-architecture picture.

### I.1 GPT-2 Medium layer 12 utility grid

The GPT-2 Medium layer 12 result is the origin of the opposite-direction measurement. Table [10](https://arxiv.org/html/2605.24042#A9.T10) reports the full grid at bits $=32$.

Table 10: GPT-2 Medium at layer 12 under float32 projection. Baseline perplexity is $90.4$. At $k=64$ the behavior-projection KL exceeds the identity-projection KL, matching the measurement-protocol direction flip relative to GPT-2 Small at layer 6; by $k=256$ the two curves have crossed and the identity-projection KL is larger, recovering the low-concentration GPT-2 pattern.

The crossover between $P_B$ and $P_I$ KL happens near $k\approx 128$ on GPT-2 Medium at $\ell=12$, which is the same $k$ that the main body uses on other architectures. The sign of the asymmetry at $k=128$ on GPT-2 Medium therefore depends sharply on measurement conditions and is not a reliable comparison point without controlling for the crossover location.

### I.2 Gemma-2B layer 13 gradient energy

Table [11](https://arxiv.org/html/2605.24042#A9.T11) reports the cumulative gradient energy on Gemma-2B. The spectrum is intermediate between the GPT-2 family (low concentration) and the modern 7–14B models (high concentration): $E_{128}=0.63$, higher than GPT-2 Small ($0.56$) but well below Mistral-7B ($0.99$).

Table 11: Cumulative gradient energy on Gemma-2B at layer 13 ($d=2304$, $n_{\mathrm{cal}}=2000$ prefixes). The effective rank is $r_{95}=681$, giving $\rho=r_{95}/d=0.30$, which is consistent with the Phi-2 band reported in Table [9](https://arxiv.org/html/2605.24042#A8.T9).

Gemma-2B sits in the moderate $E_k$, low $\kappa$ regime from Appendix [H](https://arxiv.org/html/2605.24042#A8). Its $E_{128}=0.63$ is close to Qwen2.5-3B ($0.67$), consistent with training-data composition rather than architecture or scale being the primary driver of concentration at this scale. We did not measure per-prefix margins on Gemma-2B, so the model does not appear in the scaling regression in Section [F](https://arxiv.org/html/2605.24042#A6).

## Appendix J Prediction and retrieval live on different noise scales

The KL asymmetry and the retrieval-attack curves in Sections [E](https://arxiv.org/html/2605.24042#A5)–[4](https://arxiv.org/html/2605.24042#S4) have different functional forms in the noise level $\sigma$. Understanding why clarifies when a noise-injection defense can work at all, and why the generalized-eigen mechanism is the right tool for the high-concentration regime.

### J.1 Two sensitivity functions

Prediction KL under Gaussian noise $\xi\sim\mathcal{N}(0,\Sigma_\xi)$ added at layer $\ell$ follows the quadratic expansion ([1](https://arxiv.org/html/2605.24042#S3.E1)):

$$\mathbb{E}_\xi[\mathrm{KL}] \;=\; \tfrac{1}{2}\,\mathrm{tr}(F\,\Sigma_\xi) \,+\, O(\sigma^3),$$

which grows quadratically near $\sigma=0$ and saturates at the output entropy $H(p_\theta)$ at large $\sigma$. The retrieval attack is different. For a nearest-neighbor attacker observing $\tilde{h}=h+\xi$ and ranking candidates by $\ell_2$ distance, the event that a distractor $h'$ is ranked ahead of $h$ reduces to

$$\langle \xi,\, \hat{\delta}\rangle \;>\; \tfrac{1}{2}\,\|h-h'\|,$$

where $\hat{\delta}=(h'-h)/\|h-h'\|$. Since $\langle\xi,\hat{\delta}\rangle\sim\mathcal{N}(0,\hat{\delta}^{\top}\Sigma_\xi\,\hat{\delta})$, the per-distractor failure probability is

$$\Phi\Bigl(-\frac{\|h-h'\|}{2\,\sqrt{\hat{\delta}^{\top}\Sigma_\xi\hat{\delta}}}\Bigr),$$

which is a Gaussian tail and has a sharp phase transition when the noise standard deviation along $\hat{\delta}$ reaches roughly half the inter-prefix margin.
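The Gaussian-tail expression can be checked against a direct simulation of the nearest-neighbor comparison. This is a minimal sketch on synthetic vectors, not the paper's attack code: the dimension, the diagonal $\Sigma_\xi$, and the names `swap_probability`, `h_prime`, and `released` are all illustrative assumptions.

```python
import math
import numpy as np

def swap_probability(h, h_prime, Sigma_xi):
    """Probability that distractor h_prime is ranked ahead of h under l2 retrieval."""
    diff = h_prime - h
    margin = float(np.linalg.norm(diff))
    delta_hat = diff / margin
    noise_std = math.sqrt(float(delta_hat @ Sigma_xi @ delta_hat))
    z = -margin / (2.0 * noise_std)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z

rng = np.random.default_rng(0)
d, n = 8, 200_000
h = np.zeros(d)
h_prime = np.array([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # margin of 2
Sigma_xi = np.diag(np.linspace(0.5, 2.0, d))                   # synthetic covariance

analytic = swap_probability(h, h_prime, Sigma_xi)

# Monte Carlo: release h + xi and test which candidate is closer in l2.
xi = rng.standard_normal((n, d)) * np.sqrt(np.diag(Sigma_xi))
released = h + xi
closer_to_distractor = (
    np.linalg.norm(released - h_prime, axis=1) < np.linalg.norm(released - h, axis=1)
)
empirical = float(closer_to_distractor.mean())

print(analytic, empirical)
```

Only the noise variance along $\hat{\delta}$ enters the formula, which is why noise placed in directions orthogonal to the inter-prefix difference does nothing against this attacker.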

### J.2 The two scales

The quadratic KL and the Gaussian-tail retrieval error cross different operational thresholds at different noise scales. Prediction KL reaches a fixed threshold $\kappa_0$ at

$$\sigma_{\mathrm{pred}} \;\sim\; \sqrt{2\kappa_0/\mathrm{tr}(F\Sigma_\xi/\sigma^2)},$$

which on Mistral-7B ($\mathrm{tr}(F)=8.49$, complement noise, $\kappa_0=2$ nats) gives $\sigma_{\mathrm{pred}}\approx 0.2$. Retrieval reaches a fixed threshold (e.g., $50\%$ attack success) at a much larger

$$\sigma_{\mathrm{retr}} \;\sim\; m_{\mathrm{full}}\,/\,\bigl(2\,\Phi^{-1}(1/N_{\mathrm{eff}})\sqrt{\mathrm{tr}(\Sigma_\xi/\sigma^2)}\bigr),$$

which on Mistral-7B with $m_{\mathrm{full}}=5.13$ and 500 distractors gives $\sigma_{\mathrm{retr}}\approx 1.5$. The ratio $\sigma_{\mathrm{retr}}/\sigma_{\mathrm{pred}}\approx 7.5$ is a nearly order-of-magnitude “window” in which adding noise has already destroyed the prediction but has not yet degraded the retrieval attack. Any isotropic or subspace-restricted noise defense operating in this window incurs utility cost without buying privacy.

### J.3 Why the window closes on GPT-2 but not on modern models

The generalized-eigen defense works because it changes the ratio $\sigma_{\mathrm{retr}}/\sigma_{\mathrm{pred}}$ in the defender’s favor. Its noise covariance $\Sigma_\xi^\star$ has $\mathrm{tr}(F\Sigma_\xi^\star)/\sigma^2=k_\xi$ (the number of generalized eigenvectors it occupies) instead of $\mathrm{tr}(F)\approx 8.49$, so the prediction scale $\sigma_{\mathrm{pred}}$ for a matched KL budget is larger by a factor of $\sqrt{\mathrm{tr}(F)/k_\xi}$. The retrieval scale $\sigma_{\mathrm{retr}}$ is unchanged, but the discriminative signal the attacker sees per unit $\sigma$ is larger by $\sqrt{\mathrm{tr}(\Sigma_\delta\Sigma_\xi^\star)/\sigma^2}$. On Mistral-7B the net effect is a $13\times$ reduction in attack success at matched KL budget; on GPT-2 Small both $\mathrm{tr}(F)$ and the generalized eigenvalue spectrum are spread across many directions, so $\mathrm{tr}(F)/k_\xi\approx 1$ and the defense provides no advantage.

### J.4 Why fixed-magnitude noise is fundamentally limited

The analysis above generalizes. Any isotropic or complement-noise defense faces a utility/privacy tradeoff set by the ratio of the two scales. A defense that strictly separates the two scales must modify hidden states in a way that destroys inter-prefix margin without inflating KL. Candidate mechanisms include representation bottlenecks (projecting $h$ into a lower-dimensional subspace before release) and stochastic rounding in Fisher-eigenbasis coordinates; whether any such mechanism meaningfully dominates isotropic Gaussian release under adaptive attack is the covariance-aware question addressed by equation ([4](https://arxiv.org/html/2605.24042#S5.E4)).

## Appendix K Directional vs. magnitude redundancy in the Fisher complement

Section [E](https://arxiv.org/html/2605.24042#A5) shows that the deterministic projection $h\leftarrow P_I h$ preserves Mistral-7B’s output distribution to median KL $=0.065$. A natural inference is that adding noise inside $P_I$ should also be cheap in utility, but a deterministic projection and an additive noise covariance have different KL costs, and conflating them leads to incorrect defense design. This appendix reports the gap and derives it.

### K.1 The four operations are not interchangeable

Table [12](https://arxiv.org/html/2605.24042#A11.T12) summarizes the KL cost of six operations on Mistral-7B at layer 16, $k=128$, all normalized to the same nominal $\sigma=0.1$ in their respective target subspaces.

Table 12: KL cost of six operations on Mistral-7B. The deterministic $P_I$ projection, isotropic noise, complement noise, and Fisher-complement noise all fall in a narrow $0.06$–$0.12$ band; generalized-eigen noise is $\sim 40\times$ cheaper.

### K.2 $P_I$ redundancy is for a specific direction

The KL cost of a deterministic perturbation $v$ is $\tfrac{1}{2}v^{\top}Fv$ to leading order, and the KL cost of Gaussian noise with covariance $\Sigma_\xi$ is $\tfrac{1}{2}\mathrm{tr}(F\Sigma_\xi)$. For the projection $h\leftarrow P_I h$, the perturbation is $v=-P_B h$, which lies in a $k$-dimensional subspace with small average $v^{\top}Fv$ but contains the specific direction $P_B h$; for random noise in $P_I$, the covariance is $\sigma^2 P_I$ and the trace picks up all $d-k$ eigenvalues of $F$ restricted to $P_I$. The first operation probes one direction and the second probes $d-k$ directions, and the two can have very different KL costs even when the magnitude $\|v\|\sim\sqrt{\mathrm{tr}(\Sigma_\xi)}$ is matched.
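The gap between the two leading-order costs can be made concrete with a toy Fisher matrix. The sketch below is purely illustrative: it assumes a synthetic Fisher with a decaying spectrum in a random basis, evaluates only the quadratic formulas (so it does not capture the beyond-quadratic robustness discussed in this appendix), and uses hypothetical names such as `kl_deterministic` and `kl_gaussian`.

```python
import numpy as np

def kl_deterministic(F, v):
    """Leading-order KL cost of a fixed perturbation v: 1/2 v^T F v."""
    return 0.5 * float(v @ F @ v)

def kl_gaussian(F, Sigma_xi):
    """Leading-order expected KL cost of noise with covariance Sigma_xi: 1/2 tr(F Sigma_xi)."""
    return 0.5 * float(np.trace(F @ Sigma_xi))

rng = np.random.default_rng(0)
d, k, sigma = 64, 8, 0.1

# Hypothetical Fisher: decaying eigenvalues in a random orthonormal basis Q.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = 1.0 / (1.0 + np.arange(d)) ** 2
F = (Q * eigvals) @ Q.T

P_I = Q[:, k:] @ Q[:, k:].T       # complement of the top-k Fisher directions
Sigma_xi = sigma ** 2 * P_I       # noise confined to P_I
noise_cost = kl_gaussian(F, Sigma_xi)

# Single directions with squared norm matched to tr(Sigma_xi).
scale = np.sqrt(np.trace(Sigma_xi))
cost_top = kl_deterministic(F, scale * Q[:, 0])      # most Fisher-sensitive direction
cost_bottom = kl_deterministic(F, scale * Q[:, -1])  # least Fisher-sensitive direction

# Monte Carlo check that E[1/2 xi^T F xi] matches 1/2 tr(F Sigma_xi).
xi = sigma * rng.standard_normal((20_000, d)) @ P_I
mc_cost = 0.5 * float(np.mean(np.sum((xi @ F) * xi, axis=1)))

print(cost_bottom, noise_cost, cost_top, mc_cost)
```

At matched magnitude the single-direction cost ranges from far below to far above the trace cost depending on which direction is probed, which is the sense in which the two operations are not interchangeable.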

The ranking in Table [12](https://arxiv.org/html/2605.24042#A11.T12) is consistent with this analysis. The deterministic projection hits a single direction in $P_B$, where training has made the model locally robust in a way the quadratic approximation does not capture (see Appendix [J](https://arxiv.org/html/2605.24042#A10)); the isotropic, complement, and Fisher-complement noises all spread across many directions including some with nontrivial $v^{\top}Fv$ even inside $P_I$ (Fisher has non-zero eigenvalues up to $r_{95}\gg k$); and the generalized-eigen noise specifically picks the top-$k$ directions of $\Sigma_\delta F^{-1}$, which by construction have near-zero $v^{\top}Fv$.

### K.3 Implication for defense design

The operational consequence is that projecting onto $P_I$ is not a practical defense: it is a deterministic operation, so an attacker who knows the model can invert it exactly, and it offers no plausible deniability. Gaussian noise in $P_I$ is a practical defense but is not materially better than isotropic noise on modern models. The generalized-eigen mechanism is the only one of the four that exploits the full structure of $F$ and $\Sigma_\delta$, and its Pareto advantage reflects this. The $P_B/P_I$ decomposition from Section [3](https://arxiv.org/html/2605.24042#S3) is a necessary but not sufficient basis for a defense; the generalized eigenvectors of $(\Sigma_\delta,F)$ are the sufficient one.
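Generalized eigenvectors of the pair $(\Sigma_\delta, F)$ solve $\Sigma_\delta v = \lambda F v$ and can be obtained by whitening with a Cholesky factor of $F$. The following is a sketch under assumptions: the matrices are random synthetic stand-ins, the small `ridge` term is added only for numerical stability, and the function name is hypothetical. It returns directions only; how the paper scales these into a release covariance is not specified here.

```python
import numpy as np

def generalized_eigen_directions(Sigma_delta, F, k, ridge=1e-8):
    """Top-k solutions of Sigma_delta v = lam * F v, via Cholesky whitening of F."""
    d = F.shape[0]
    L = np.linalg.cholesky(F + ridge * np.eye(d))  # F ~ L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ Sigma_delta @ L_inv.T              # whitened margin covariance
    evals, U = np.linalg.eigh(M)                   # ascending eigenvalues
    top = np.argsort(evals)[::-1][:k]
    V = L_inv.T @ U[:, top]                        # columns satisfy v^T F v ~ 1
    return evals[top], V

rng = np.random.default_rng(0)
d, k = 16, 3
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
F = A @ A.T / d + 0.1 * np.eye(d)   # synthetic positive-definite Fisher
Sigma_delta = B @ B.T / d           # synthetic margin covariance

lams, V = generalized_eigen_directions(Sigma_delta, F, k)

# Each column maximizes margin variance per unit of Fisher (KL) cost.
rayleigh = np.array([(v @ Sigma_delta @ v) / (v @ F @ v) for v in V.T])
print(lams, rayleigh)
```

The Rayleigh quotient $v^{\top}\Sigma_\delta v / v^{\top}Fv$ equals the generalized eigenvalue, so the leading directions are those that buy the most inter-prefix margin variance per unit of leading-order KL.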

## Appendix L Channel transplantation: $P_I$ carries prediction, $P_B$ does not

In this paper we call $P_B$ the Fisher subspace rather than the behavior subspace. This is intentional: the top-$k$ Fisher eigenvectors are by construction the directions along which downstream loss is most sensitive to small perturbations, but that does not mean predictions are encoded in those directions. We test the distinction directly by transplanting the $P_B$ or the $P_I$ component of one prefix’s hidden state onto another prefix’s complementary component, running the forward pass to the final logits, and measuring which prefix’s prediction the resulting distribution tracks.

### L.1 Setup

We draw $25$ unmatched prefix pairs $(x_A,x_B)$ from a natural-language validation split. At layer $\ell=L/2$ on Mistral-7B we compute $h_A,h_B$, project each onto $P_B$ and $P_I$, and construct two interpolated states:

$$h_{\mathrm{beh\text{-}swap}}(\lambda) \;=\; P_I h_A + \bigl((1-\lambda)P_B h_A + \lambda P_B h_B\bigr),$$

$$h_{\mathrm{id\text{-}swap}}(\lambda) \;=\; P_B h_A + \bigl((1-\lambda)P_I h_A + \lambda P_I h_B\bigr).$$

The first linearly interpolates the Fisher component (from $A$’s to $B$’s) while keeping the complement from $A$; the second does the reverse. We complete the forward pass from the injected state and measure the KL divergence of the next-token distribution to $p_\theta(\cdot\mid x_A)$ and to $p_\theta(\cdot\mid x_B)$.
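The two interpolated states are straightforward to construct once an orthonormal basis for $P_B$ is available. The sketch below builds them on random vectors; it covers only the state construction, not the forward pass, and the basis `U` and the function names are illustrative assumptions.

```python
import numpy as np

def split_components(h, U):
    """Split h into its P_B part (span of U's orthonormal columns) and its P_I part."""
    fisher_part = U @ (U.T @ h)
    return fisher_part, h - fisher_part

def beh_swap(h_a, h_b, U, lam):
    """Keep A's complement; interpolate the Fisher component from A to B."""
    fisher_a, comp_a = split_components(h_a, U)
    fisher_b, _ = split_components(h_b, U)
    return comp_a + (1.0 - lam) * fisher_a + lam * fisher_b

def id_swap(h_a, h_b, U, lam):
    """Keep A's Fisher component; interpolate the complement from A to B."""
    fisher_a, comp_a = split_components(h_a, U)
    _, comp_b = split_components(h_b, U)
    return fisher_a + (1.0 - lam) * comp_a + lam * comp_b

rng = np.random.default_rng(0)
d, k = 32, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # hypothetical Fisher basis
h_a, h_b = rng.standard_normal(d), rng.standard_normal(d)

fully_beh = beh_swap(h_a, h_b, U, 1.0)
fully_id = id_swap(h_a, h_b, U, 1.0)
print(np.linalg.norm(fully_beh - h_a), np.linalg.norm(fully_id - h_a))
```

At $\lambda=0$ both constructions return $h_A$ unchanged, and at $\lambda=1$ exactly one of the two components has been replaced by $B$'s.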

### L.2 Results

Table [13](https://arxiv.org/html/2605.24042#A12.T13) reports the median KLs at $\lambda=0$ (pure $A$) and $\lambda=1$ (fully swapped). The identity-swap at $\lambda=1$ moves the prediction distribution nearly all the way to $B$’s ($\mathrm{KL}\to A=5.91$, $\mathrm{KL}\to B=0.59$), even though the Fisher component is still $A$’s. The behavior-swap at $\lambda=1$ moves the prediction distribution essentially not at all ($\mathrm{KL}\to A=0.03$, $\mathrm{KL}\to B=10.4$), even though the Fisher component has been entirely replaced by $B$’s.

Table 13: Channel transplantation on Mistral-7B. Replacing the $P_B$ component with another prefix’s has essentially no effect on the prediction; replacing the $P_I$ component moves the prediction all the way to the other prefix’s.

### L.3 Interpretation

The result says that on Mistral-7B the content the model reads to produce its prediction lives in $P_I$, not in $P_B$. The Fisher subspace picks out directions along which local perturbations produce large expected loss change, but that is a statement about the downstream computation’s sensitivity and not about where the information is encoded. An analogy is a seismograph tuned to a narrow band: it is maximally sensitive to motion in that band, but the earthquake waveform is encoded in the full frequency spectrum, and most of the signal lives outside the band the seismograph amplifies.

This is why Section [3](https://arxiv.org/html/2605.24042#S3) names $P_B$ the “Fisher subspace” rather than the “behavior subspace.” The former is defined by a property of the loss landscape (top eigenvectors of $F$); the latter would imply that the subspace is where behavior-relevant computation occurs, which the transplantation experiment shows is wrong for modern models. The direction of the asymmetry follows: removing $P_B h$ leaves predictions intact because what the model actually reads is $P_I h$; replacing $P_B h$ with another prefix’s similarly leaves predictions intact.

## Appendix M Support-code margin distributions

This appendix reports the per-prefix distributions of three margin metrics measured on the same 100 held-out prefixes: the support-code Hamming margin (integer distance in the binary code of the CLT from Appendix [O](https://arxiv.org/html/2605.24042#A15)), the hidden-state $\ell_2$ margin at layer 6 under 8-bit quantization, and the amplitude-code $\ell_2$ margin (distance between real-valued CLT activation vectors).

### M.1 Percentile summary

Table [14](https://arxiv.org/html/2605.24042#A13.T14) reports the $p_{10},p_{50},p_{90}$ percentiles of each metric. The support-code margin has a broader range than the hidden $\ell_2$ margin on this metric-normalized axis, which is the basis for its use as an injective combinatorial code.

Table 14: Per-prefix margin percentiles on 100 held-out prefixes at layer 6 of GPT-2 Small. The support code has a heavier right tail than the two continuous metrics, consistent with its role as a combinatorially rich injective code; its percentile range spans $88$ Hamming distance units versus $39$ and $31$ for the two continuous metrics.

### M.2 Anticorrelation and z-score renormalization

The support-code margin and the hidden-state margin are weakly positively correlated across prefixes (Spearman $\rho=0.21$, $p=0.033$) but have opposite quartile behavior. Partitioning 100 prefixes by hidden-margin quartile:

- Bottom quartile (smallest hidden margins): mean support/hidden ratio $=2.23$.
- Top quartile (largest hidden margins): mean support/hidden ratio $=1.03$.

When the continuous representation provides a small margin (making SipIt-style inversion fragile), the support code provides a disproportionately large margin relative to the continuous baseline. The Spearman correlation between hidden margin and the support/hidden ratio is $\rho=-0.39$ ($p=6.2\times 10^{-5}$).

Because the three metrics live on different scales, comparing raw values is misleading. After z-score renormalization against per-prefix medians:

- median z-score: support $=-0.29$, hidden $=-0.24$, amplitude $=-0.26$;
- fraction of prefixes where support exceeds hidden in z-score: $52\%$.

Raw counts (support exceeds hidden): $72\%$ of prefixes. The $20$-point drop between the raw and z-score counts quantifies how much of the support-code-dominance reading is metric-scale rather than informational.

### M.3 Multi-dataset angular stability

The $\sqrt{k/d}$ scaling law of Section [F](https://arxiv.org/html/2605.24042#A6) rests on the inter-prefix difference distribution being approximately isotropic. The multi-dataset check measures the angle between the per-dataset top-128 $\Sigma_\delta$ eigenvectors on $1{,}000$ WikiText prefixes and $300$ code prefixes. The mean principal angle is $56.4^{\circ}$, with range $[4.9^{\circ},89.8^{\circ}]$ across the 128 pairs. Neither dataset’s top eigenvectors are anchored to the other’s: the angular distribution is approximately uniform on $[0,\pi/2]$, consistent with inter-prefix directions being near-isotropic even across genre boundaries.
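Principal angles between two $k$-dimensional subspaces are the arccosines of the singular values of $U_1^{\top}U_2$, where $U_1$ and $U_2$ are orthonormal bases. The sketch below assumes this standard SVD construction, which may differ in detail from the paper's procedure, and uses random subspaces rather than the measured eigenvectors.

```python
import numpy as np

def principal_angles_deg(U1, U2):
    """Principal angles (degrees, ascending) between the column spans of U1 and U2."""
    singular_values = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return np.degrees(np.arccos(np.clip(singular_values, -1.0, 1.0)))

rng = np.random.default_rng(0)
d, k = 64, 8  # illustrative sizes; the text uses the top-128 eigenvectors

Q, _ = np.linalg.qr(rng.standard_normal((d, 2 * k)))
U_same = Q[:, :k]
U_orth = Q[:, k:2 * k]                                  # orthogonal to U_same
U_rand, _ = np.linalg.qr(rng.standard_normal((d, k)))   # unrelated random subspace

angles_same = principal_angles_deg(U_same, U_same)
angles_orth = principal_angles_deg(U_same, U_orth)
angles_rand = principal_angles_deg(U_same, U_rand)
print(angles_rand.mean(), angles_rand.min(), angles_rand.max())
```

Two top-$k$ bases yield exactly $k$ principal angles, matching the "128 pairs" count in the text.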

## Appendix N Surface-perturbation experiment details

This appendix reports the 12\-pair controlled edit experiment \(original\) and the 49\-pair expansion \(robustness check\) that together establish the scaffold/backbone partition of CLT features under surface variation\.

### N.1 12-pair controlled experiment

Twelve base/variant pairs were selected to cover capitalization \(7 pairs\), whitespace \(2 pairs\), and semantic edits \(3 pairs\)\. For each pair, the CLT was run on both the base and the variant prefix; features active in one but not the other were counted as “flips” and classified as scaffold or backbone according to their behavioral score \(Appendix[O](https://arxiv.org/html/2605.24042#A15)\)\. Table[15](https://arxiv.org/html/2605.24042#A14.T15)reports the raw counts\.

Table 15: 12-pair surface perturbation decomposition on GPT-2 Small. The scaffold fraction of CLT feature flips is $\approx 82\%$ on every pair regardless of edit type; the output KL varies by two orders of magnitude (from $0.016$ on pronoun swap to $1.820$ on four-word title case), and top-1 agreement is preserved on $8$ of the $12$ pairs.

The central observation is that the scaffold fraction is essentially constant across edit categories (Kruskal–Wallis $H=2.33$, $p=0.31$; Appendix [Q](https://arxiv.org/html/2605.24042#A17)). Scaffold features carry surface-lexical variation and flip with it; backbone features carry syntactically predictive computation and do not flip. This is the structural basis for the KL asymmetry on GPT-2 at late layers reported in Appendix [C](https://arxiv.org/html/2605.24042#A3).
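The flip count and scaffold fraction for one base/variant pair reduce to a set difference followed by a lookup. The sketch below is illustrative: the feature identifiers and scores are made up, and treating any strictly positive behavioral score as "backbone" is an assumption based on the statement in Appendix O that scaffold features score exactly zero.

```python
def flip_decomposition(active_base, active_variant, behavioral_score):
    """Count features active in exactly one prefix and split them by behavioral score."""
    flips = set(active_base) ^ set(active_variant)  # symmetric difference
    # Assumption: a strictly positive score marks a backbone feature.
    backbone = {f for f in flips if behavioral_score.get(f, 0.0) > 0.0}
    scaffold = flips - backbone
    fraction = len(scaffold) / len(flips) if flips else float("nan")
    return {
        "flips": len(flips),
        "scaffold": len(scaffold),
        "backbone": len(backbone),
        "scaffold_fraction": fraction,
    }

# Hypothetical feature ids and scores, for illustration only.
base = {"f1", "f2", "f3", "f4", "f5"}
variant = {"f1", "f2", "f6", "f7", "f8", "f9"}
scores = {"f3": 0.002}

result = flip_decomposition(base, variant, scores)
print(result)
```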

### N.2 49-pair robustness expansion

Because the 12\-pair experiment could be specific to the curated base/variant selection, we re\-ran the analysis on 49 automatically\-generated pairs spanning 17 edit types drawn from WikiText\. Table[16](https://arxiv.org/html/2605.24042#A14.T16)reports per\-category mean scaffold fraction and mean KL divergence\.

Table 16: 49-pair expanded perturbation analysis. The overall mean scaffold fraction is $77.3\%$, within $5$ points of the curated 12-pair value of $82\%$. Per-category scaffold fractions sit in a $[0.757,0.783]$ band across all 17 edit categories, so the scaffold/backbone partition is robust to edit selection.

The $77.3\%$ overall mean scaffold fraction on 49 pairs matches the $82\%$ 12-pair mean to within the difference in edit selection. The 12-pair selection was skewed toward high-KL capitalization pairs; the 49-pair set is more representative of an automated perturbation distribution, with smaller KLs on average and slightly lower scaffold fraction. Neither experiment shows a category where backbone flips dominate.

### N.3 Cross-validation with input-dependent alignment

An independent check: the alignment between the per-prefix behavior vector and the population-level $P_B$ averages to $\cos^2\theta\approx 0.60$ over 99 held-out prefixes (range $[0.45,0.88]$, std $0.09$). The alignment is not degenerate at either end: no prefix has $\theta$ close to $0$ or $\pi/2$, so the per-prefix Fisher contributions are neither fully aligned with nor orthogonal to the population basis. This is consistent with the scaffold/backbone partition being a population-level property of the model, not an artifact of specific prefixes.
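The alignment statistic can be read as the fraction of a vector's squared norm captured by the subspace, $\cos^2\theta = \|P_B v\|^2/\|v\|^2$. The sketch below assumes that reading; the basis `U` and the test vectors are synthetic.

```python
import numpy as np

def cos2_alignment(v, U):
    """Squared cosine of the angle between v and the span of U's orthonormal columns."""
    return float(np.sum((U.T @ v) ** 2) / np.sum(v ** 2))

rng = np.random.default_rng(0)
d, k = 32, 4
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U = Q[:, :k]                    # hypothetical population-level basis of P_B

inside = U @ rng.standard_normal(k)              # lies entirely in the subspace
outside = Q[:, k:] @ rng.standard_normal(d - k)  # lies entirely in the complement
mixed = inside / np.linalg.norm(inside) + outside / np.linalg.norm(outside)

print(cos2_alignment(inside, U), cos2_alignment(outside, U), cos2_alignment(mixed, U))
```

For reference, $\cos^2\theta=0.45$ and $0.88$ correspond to angles of roughly $48^{\circ}$ and $20^{\circ}$, so the reported range stays well away from both $0$ and $\pi/2$.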

## Appendix O Circuit Tracing and Feature-Level Analysis

We extract 600 representative CLT features \(50 per layer across 12 layers\) and analyze their activation patterns, behavioral roles, and circuit\-level attribution\. Figure[13](https://arxiv.org/html/2605.24042#A15.F13)summarizes the feature inventory; Figure[14](https://arxiv.org/html/2605.24042#A15.F14)shows an example circuit graph\.

Of the 600 sampled features, 570 (95%) are classified as scaffold and 30 (5%) as backbone. All scaffold features have a behavioral score of exactly zero under KL-ablation, while backbone features range from $4\times 10^{-5}$ to 0.094 with a median of 0.002 (Figure [13](https://arxiv.org/html/2605.24042#A15.F13)A). The backbone features are not spread uniformly across layers. They concentrate in layers 0, 3, 6, and 9, while layers 1, 2, 4, 5, 7, 8, 10, and 11 contain zero backbone features in our sample. This clustering at roughly 3-layer intervals suggests that behaviorally active computation occurs at specific stages in the residual stream rather than continuously across depth.

A counterintuitive finding is that scaffold features fire more strongly than backbone features, with median maximum activation 30\.1 for scaffold vs 25\.2 for backbone \(Figure[13](https://arxiv.org/html/2605.24042#A15.F13)B\)\. Scaffold features are therefore not quiet features that the model barely activates\. They fire vigorously but their decoder write vectors project onto directions orthogonal to the behaviorally relevant subspace, so their large activations do not propagate to the output distribution\. This is consistent with the ablation decomposition reported in the original cross\-layer transcoder results, where scaffold\-only reconstruction retains substantial signal despite carrying zero behavioral impact\.

Inspecting individual features makes the scaffold/backbone distinction concrete \(Table[17](https://arxiv.org/html/2605.24042#A15.T17)\)\. The three highest\-activating scaffold features are: L10\_F12066 \(max act\. 90\.9\), which fires on period tokens across diverse contexts and tracks sentence\-boundary punctuation; L9\_F6427 \(max act\. 81\.3\), which fires on the word “level” in compound noun phrases; and L10\_F1807 \(max act\. 78\.0\), which fires on the morphological fragment “didn” in negation constructions\. All three encode surface\-level lexical or positional cues that distinguish one prompt from another without contributing to what comes next\.

The three highest\-scoring backbone features tell the opposite story\. L3\_F2239 \(score 0\.094\) fires on sentence\-final periods before quotation attribution patterns like “tempting \. She added,” where ablating it disrupts the model’s prediction of the following attribution verb\. L6\_F2235 \(score 0\.081\) fires on closing quotation marks before speaker verbs in reported\-speech contexts\. L6\_F3334 \(score 0\.054\) fires on periods before temporal continuations like “arsenal \. By February\.” All three are active at syntactically predictive positions where the next token depends on discourse structure, and all three produce measurable KL divergence when ablated\.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x13.png)

Figure 13: CLT feature analysis across 600 sampled features (50 per layer). (A) Behavioral score distribution. Scaffold features cluster at zero, backbone features span a wide range. (B) Max activation vs behavioral score. Scaffold features fire more strongly on average but have zero behavioral impact.

Table 17: Example scaffold and backbone features. Scaffold features track surface-level lexical cues; backbone features fire at syntactically predictive positions.

For five diverse prompts, we extract the CLT’s full activation graph (which features are active at each layer, their activation magnitudes, and their decoder write vectors’ contribution to downstream layers; see Figure [14](https://arxiv.org/html/2605.24042#A15.F14)). The backbone fraction per prompt ranges from 12.5% to 15.8% (Table [18](https://arxiv.org/html/2605.24042#A15.T18)), confirming stability across prompt types.

Table 18: Per-prompt scaffold and backbone feature counts summed across all 12 layers.

| Prompt | Backbone | Scaffold | Backbone % |
|---|---|---|---|
| “The capital of France is” | 1,546 | 8,434 | 15.5 |
| “def fibonacci(n):” | 1,044 | 6,151 | 14.5 |
| “In 1776, the United States” | 1,049 | 5,591 | 15.8 |
| “The opposite of "large" is "” | 1,461 | 8,371 | 14.9 |
| “import numpy as np\\nresult =” | 871 | 6,091 | 12.5 |

In the circuit graphs, backbone features at middle layers (3–7) write strongly to the residual stream at the final layer, forming direct pathways to the output logits (the CLT mechanistic summary across ablations is in Figure [15](https://arxiv.org/html/2605.24042#A15.F15); the original utility/margin sweep is in Figure [16](https://arxiv.org/html/2605.24042#A15.F16)). Scaffold features at early layers (0–1) and late layers (10–11), despite being the most numerous and most strongly activated, write to intermediate layers without propagating to the output. This wiring pattern explains how the model maintains a large population of identity-carrying features without interfering with its predictive computation, since the scaffold features simply do not connect to the output pathway.
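The backbone percentages in Table 18 follow directly from the two count columns. The snippet below is a minimal arithmetic check, assuming the percentage is defined as backbone / (backbone + scaffold); the helper name `backbone_percent` is ours and is not part of the paper's code.

```python
# (prompt, backbone count, scaffold count) copied from Table 18.
rows = [
    ("The capital of France is", 1546, 8434),
    ("def fibonacci(n):", 1044, 6151),
    ("In 1776, the United States", 1049, 5591),
    ('The opposite of "large" is "', 1461, 8371),
    ("import numpy as np\\nresult =", 871, 6091),
]


def backbone_percent(backbone, scaffold):
    """Backbone share of all active features, in percent (assumed definition)."""
    return 100.0 * backbone / (backbone + scaffold)


percents = [round(backbone_percent(b, s), 1) for _, b, s in rows]
for (prompt, _, _), pct in zip(rows, percents):
    print(f"{prompt!r}: {pct}%")
```

Under that assumed definition, the recomputed values reproduce the last column of the table to one decimal place.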

![Refer to caption](https://arxiv.org/html/2605.24042v1/x14.png)

Figure 14: Circuit graph for “The capital of France is.” Token nodes (blue, bottom) connect to scaffold features (green) and backbone features (red) across layers, which write to the predicted next-token logits (yellow, top). Backbone features dominate the high-weight connections; late-layer scaffold features are numerous but do not drive predictions.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x15.png)

Figure 15: CLT mechanistic summary on GPT-2 Small. (A) Reconstruction error under five ablation conditions (full, support-only, amplitude-only, backbone-only, scaffold-only). (B) Feature flips from 12 surface perturbation pairs, split by scaffold versus backbone. (C) Median one-step margin measured three ways: Hamming distance in support code, $\ell_2$ distance in hidden state, $\ell_2$ distance in amplitude code.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x16.png)

Figure 16: Original utility-margin sweep on GPT-2 Small. (A) Utility cost $\Delta\mathrm{PPL}$ on a log scale at layer 6 as a function of $k$, comparing behavior projection (red), identity projection (green), and random projection (purple). (B) Median one-step margin at 8-bit for the same three modes. The two panels together give the utility/privacy axis that the modular appendices analyze in detail.

## Appendix P Behavior fiber analysis details

This appendix reports the causal fiber structure, prompt\-family statistics, and ablation decomposition that give internal consistency checks on the scaffold/backbone partition\.

### P.1 Fiber overlaps on 100 paired prompts

The behavior fiber is the set of prefix pairs that induce the same output distribution to within a small KL threshold\. We sample 100 such pairs and measure, for each pair, the fraction of overlap between the two prefixes’ CLT feature sets in three basis choices \(backbone only, scaffold only, full support\)\. Table[19](https://arxiv.org/html/2605.24042#A16.T19)reports the means\.

Table 19: Mean overlap on 100 fiber-paired prompts ($\mathrm{KL}_{\text{pair}} < 0.05$). The mean pairwise KL on the full set is $1.975$ nats. Backbone overlap exceeds scaffold overlap by $\approx 8$ percentage points, consistent with backbone features carrying the shared predictive structure of fiber-equivalent prompts.

### P.2 Faithfulness across 200 causal interventions

A faithfulness test checks whether a causal intervention on the CLT (ablating a feature) produces the predicted direction of change in the output. Over 200 interventions, the direction-agreement rate is $100\%$: every ablation moves the output KL in the direction predicted by the feature’s attribution signature. The mean CLT-predicted $\Delta$ is $3.75$ and the mean realized amplitude is $3.33$, indicating that the attribution magnitudes are also close to realized effect sizes.

### P.3 Prompt-family breakdown

Table[20](https://arxiv.org/html/2605.24042#A16.T20)reports mean sparsity and scaffold fraction across four prompt families: natural text, canary prompts \(constructed to test specific feature activations\), and the two halves of the surface\-perturbation set\.

Table 20: Prompt-family statistics. All four families have backbone fraction in $[0.28, 0.31]$ and scaffold fraction in $[0.69, 0.72]$; the partition is stable across prompt families and is not an artifact of a specific prompt distribution. The mean Hamming distance between surface-perturbation pairs is $75.3$ features.

### P.4 Ablation decomposition

Table [21](https://arxiv.org/html/2605.24042#A16.T21) reports the reconstruction error on held-out activations when restricting to different parts of the CLT code: full, support-only (binary), amplitude-only (real-valued), backbone-only, and scaffold-only. Backbone-only reconstruction retains remarkable quality despite activating only $\approx 78$ features versus $948$ for scaffold-only.

Table 21: Reconstruction error under five ablation conditions. Backbone-only activations recover the activations with the lowest error ($110.5$) despite using $13\times$ fewer features than scaffold-only, confirming that backbone features are both behaviorally active (Appendix [O](https://arxiv.org/html/2605.24042#A15)) and informationally dense.

## Appendix Q Formal Properties of the Two-Channel Decomposition

We state and verify several formal properties of the behavior/identity decomposition that underpin the empirical results in the main text\.

### Q.1 Pythagorean margin decomposition

###### Proposition 2\.

Let $P_B = U_k U_k^{\top}$ be an orthogonal projector ($P_B^2 = P_B$, $P_B^{\top} = P_B$) and let $P_I = I - P_B$. For any $\boldsymbol{\delta} \in \mathbb{R}^d$,

$$\|P_B\boldsymbol{\delta}\|_2^2 + \|P_I\boldsymbol{\delta}\|_2^2 = \|\boldsymbol{\delta}\|_2^2. \tag{9}$$

###### Proof.

Since $\boldsymbol{\delta} = P_B\boldsymbol{\delta} + P_I\boldsymbol{\delta}$, it suffices to show $\langle P_B\boldsymbol{\delta},\, P_I\boldsymbol{\delta}\rangle = 0$. Expanding the inner product:

$$\langle P_B\boldsymbol{\delta},\, P_I\boldsymbol{\delta}\rangle = (P_B\boldsymbol{\delta})^{\top}(I - P_B)\boldsymbol{\delta} = \boldsymbol{\delta}^{\top}P_B^{\top}(I - P_B)\boldsymbol{\delta}.$$

Using symmetry ($P_B^{\top} = P_B$) and idempotency ($P_B^2 = P_B$):

$$\boldsymbol{\delta}^{\top}P_B(I - P_B)\boldsymbol{\delta} = \boldsymbol{\delta}^{\top}(P_B - P_B^2)\boldsymbol{\delta} = 0.$$

The two components are orthogonal, so $\|\boldsymbol{\delta}\|^2 = \|P_B\boldsymbol{\delta} + P_I\boldsymbol{\delta}\|^2 = \|P_B\boldsymbol{\delta}\|^2 + \|P_I\boldsymbol{\delta}\|^2$. ∎

This pointwise identity does not directly extend to margins, because the margin is a minimum over vocabulary tokens and the minimizer may differ across projections\. However, it implies an inequality on margins\.

###### Corollary 3\.

Let $m_B := \min_{v \neq y}\|P_B\boldsymbol{\delta}_v\|_2$, $m_I := \min_{v \neq y}\|P_I\boldsymbol{\delta}_v\|_2$, and $m_{\mathrm{full}} := \min_{v \neq y}\|\boldsymbol{\delta}_v\|_2$, where the minimum is over the finite vocabulary $\mathcal{V}$ (so it is attained). Then

$$m_B^2 + m_I^2 \leq m_{\mathrm{full}}^2. \tag{10}$$

###### Proof.

Let $\hat{v} \in \arg\min_{v \neq y}\|\boldsymbol{\delta}_v\|_2$, which exists because $\mathcal{V}$ is finite. By definition of the minimum, $m_B \leq \|P_B\boldsymbol{\delta}_{\hat{v}}\|_2$ and $m_I \leq \|P_I\boldsymbol{\delta}_{\hat{v}}\|_2$. Since norms are nonnegative, squaring preserves the inequalities. Adding and applying Proposition [2](https://arxiv.org/html/2605.24042#Thmproposition2):

$$m_B^2 + m_I^2 \leq \|P_B\boldsymbol{\delta}_{\hat{v}}\|^2 + \|P_I\boldsymbol{\delta}_{\hat{v}}\|^2 = \|\boldsymbol{\delta}_{\hat{v}}\|^2 = m_{\mathrm{full}}^2. \qquad ∎$$
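Proposition 2 and Corollary 3 are easy to sanity-check numerically. The following is a minimal NumPy sketch on synthetic data: the sizes `d`, `k`, `n_tokens` and the Gaussian difference vectors are illustrative assumptions, not the paper's GPT-2 setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_tokens = 64, 8, 500  # illustrative sizes, not the paper's

# Orthonormal basis U_k of a k-dimensional "behavior" subspace.
U_k, _ = np.linalg.qr(rng.standard_normal((d, k)))
P_B = U_k @ U_k.T
P_I = np.eye(d) - P_B

# One synthetic difference vector delta_v per competing token v != y.
deltas = rng.standard_normal((n_tokens, d))

# Proposition 2: the squared norms add up for every delta_v.
sq_B = np.sum((deltas @ P_B) ** 2, axis=1)
sq_I = np.sum((deltas @ P_I) ** 2, axis=1)
sq_full = np.sum(deltas ** 2, axis=1)
pythagoras_gap = np.max(np.abs(sq_B + sq_I - sq_full))

# Corollary 3: each margin takes its own minimum over v, so r <= 1.
m_B, m_I, m_full = np.sqrt(sq_B.min()), np.sqrt(sq_I.min()), np.sqrt(sq_full.min())
r = (m_B ** 2 + m_I ** 2) / m_full ** 2
print(f"max Pythagoras gap = {pythagoras_gap:.2e}, r = {r:.3f}")
```

On unstructured random data the three minimizers usually differ, so `r` tends to sit well below 1; values near 1 indicate that the same token is nearest in all three geometries.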

Table [22](https://arxiv.org/html/2605.24042#A17.T22) verifies this empirically. The ratio $r = (m_B^2 + m_I^2)/m_{\mathrm{full}}^2$ is close to 1.0 at all subspace dimensions, indicating that the nearest non-true token is approximately the same across projections. Corollary [3](https://arxiv.org/html/2605.24042#Thmproposition3) guarantees $r \leq 1$ for fp32 margins, but the margins in Table [22](https://arxiv.org/html/2605.24042#A17.T22) are computed after 8-bit quantization, a nonlinear operation that does not commute with projection. The small deviations above 1.0 at $k \geq 64$ reflect this: quantization noise can shift which token achieves the minimum before and after quantization.

Table 22: Pythagorean margin verification at layer 6 of GPT-2 Small (8-bit quantization). The ratio $r$ is close to 1.0 at all $k$, indicating near-equality in Corollary [3](https://arxiv.org/html/2605.24042#Thmproposition3).

The near-equality has a structural interpretation: the token closest to the true token in $\ell_2$ is also approximately the closest in both the behavior and identity subspaces. The two channels do not create adversarial “blind spots” where a token is close in one subspace but far in the other.

### Q.2 Margin preservation under quantization

###### Proposition 4\.

Let $Q_b : \mathbb{R}^d \to \mathbb{R}^d$ act coordinatewise and satisfy $|Q_b(x)_j - x_j| \leq \alpha_j/2$ for every coordinate $j$ and all inputs within the operating range (no clipping or saturation). Define

$$\epsilon_b := \frac{1}{2}\sqrt{\sum_{j=1}^{d}\alpha_j^2}. \tag{11}$$

Then for any $\mathbf{h}$ in the operating range, $\|Q_b(\mathbf{h}) - \mathbf{h}\|_2 \leq \epsilon_b$. If the transform first projects onto a subspace via orthogonal projector $P$ and then quantizes, the one-step margin satisfies

$$m_{Q_b \circ P} \geq \max\{0,\; m_P - 2\epsilon_b\}, \tag{12}$$

where $m_P := \min_{v \neq y}\|P\mathbf{h}_y - P\mathbf{h}_v\|_2$ and $m_{Q_b \circ P} := \min_{v \neq y}\|Q_b(P\mathbf{h}_y) - Q_b(P\mathbf{h}_v)\|_2$.

###### Proof.

The coordinatewise bound $|Q_b(x)_j - x_j| \leq \alpha_j/2$ gives, by squaring and summing, $\|Q_b(\mathbf{x}) - \mathbf{x}\|_2^2 \leq \frac{1}{4}\sum_j \alpha_j^2 = \epsilon_b^2$. For the margin, let $\mathbf{u}_y = P\mathbf{h}_y$ and $\mathbf{u}_v = P\mathbf{h}_v$. Write

$$Q_b(\mathbf{u}_y) - Q_b(\mathbf{u}_v) = (\mathbf{u}_y - \mathbf{u}_v) + (Q_b(\mathbf{u}_y) - \mathbf{u}_y) - (Q_b(\mathbf{u}_v) - \mathbf{u}_v).$$

By the triangle inequality in the form $\|a + b + c\|_2 \geq \|a\|_2 - \|b\|_2 - \|c\|_2$:

$$\begin{aligned}
\|Q_b(\mathbf{u}_y) - Q_b(\mathbf{u}_v)\|_2 &\geq \|\mathbf{u}_y - \mathbf{u}_v\|_2 - \|Q_b(\mathbf{u}_y) - \mathbf{u}_y\|_2 - \|Q_b(\mathbf{u}_v) - \mathbf{u}_v\|_2 \\
&\geq \|\mathbf{u}_y - \mathbf{u}_v\|_2 - 2\epsilon_b.
\end{aligned}$$

Since the left-hand side is a norm and hence nonnegative, minimizing over $v \neq y$ gives Equation [12](https://arxiv.org/html/2605.24042#A17.E12). ∎

In our setting, $Q_b$ is symmetric uniform quantization with scale $\alpha_j = s_j/(2^{b-1} - 1)$ calibrated on training data, so the no-clipping assumption holds for inputs within the calibration range. This bound explains why the identity-complement margin is robust to quantization (Appendix [C](https://arxiv.org/html/2605.24042#A3)): the margin $m_I$ is large (${\sim}44$ on GPT-2 Small) while $2\epsilon_b$ at 8 bits is small relative to this margin, so $m_{Q_8 \circ P_I}$ remains close to $m_I$.
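The bound in Equation 12 can be exercised on synthetic data. The sketch below assumes a round-to-nearest quantizer with step $\alpha_j = s_j/(2^{b-1}-1)$, where $s_j$ is taken as the largest absolute value seen in coordinate $j$ so that nothing clips; the random hidden states, the sizes, and the helper `quantize` are illustrative assumptions rather than the paper's calibration pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n_tokens, bits = 32, 8, 200, 8  # illustrative sizes

U, _ = np.linalg.qr(rng.standard_normal((d, k)))
P = U @ U.T  # orthogonal projector onto a k-dimensional subspace

h_y = rng.standard_normal(d)              # stand-in for the true hidden state
h_v = rng.standard_normal((n_tokens, d))  # stand-ins for the competitors
u_y, u_v = P @ h_y, h_v @ P

# Per-coordinate range s_j covering every projected vector (no clipping).
s = np.max(np.abs(np.vstack([u_y, u_v])), axis=0)
alpha = s / (2 ** (bits - 1) - 1)


def quantize(x):
    """Round each coordinate to the nearest multiple of alpha_j."""
    return alpha * np.round(x / alpha)


eps_b = 0.5 * np.sqrt(np.sum(alpha ** 2))
m_P = np.min(np.linalg.norm(u_y - u_v, axis=1))
m_Q = np.min(np.linalg.norm(quantize(u_y) - quantize(u_v), axis=1))
lower_bound = max(0.0, m_P - 2 * eps_b)
print(f"m_P = {m_P:.4f}, m_Q = {m_Q:.4f}, bound = {lower_bound:.4f}")
```

The guarantee is worst-case, so the quantized margin typically lands much closer to `m_P` than to `lower_bound`.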

### Q.3 Margin angle characterization

The Pythagorean decomposition \(Proposition[2](https://arxiv.org/html/2605.24042#Thmproposition2)\) admits a geometric interpretation in terms of the angle between the margin\-achieving difference vector and the behavior subspace\.

###### Proposition 5 (Margin angle).

Let $P_B$ be an orthogonal projector ($P_B^2 = P_B$, $P_B^{\top} = P_B$) and $P_I = I - P_B$. For any nonzero $\boldsymbol{\delta} \in \mathbb{R}^d$, define $\theta \in [0, \pi/2]$ by $\cos(\theta) := \|P_B\boldsymbol{\delta}\|_2 / \|\boldsymbol{\delta}\|_2$. Then:

$$\frac{\|P_B\boldsymbol{\delta}\|_2^2}{\|\boldsymbol{\delta}\|_2^2} = \cos^2(\theta), \qquad \frac{\|P_I\boldsymbol{\delta}\|_2^2}{\|\boldsymbol{\delta}\|_2^2} = \sin^2(\theta). \tag{13}$$

###### Proof.

The first identity is immediate from the definition. For the second: since $P_I = I - P_B$, we have $\boldsymbol{\delta} = P_B\boldsymbol{\delta} + P_I\boldsymbol{\delta}$. By symmetry and idempotency, $\langle P_B\boldsymbol{\delta},\, P_I\boldsymbol{\delta}\rangle = \boldsymbol{\delta}^{\top}P_B(I - P_B)\boldsymbol{\delta} = \boldsymbol{\delta}^{\top}(P_B - P_B^2)\boldsymbol{\delta} = 0$, so the two components are orthogonal and $\|\boldsymbol{\delta}\|_2^2 = \|P_B\boldsymbol{\delta}\|_2^2 + \|P_I\boldsymbol{\delta}\|_2^2$. Dividing by $\|\boldsymbol{\delta}\|_2^2$ gives $1 = \cos^2(\theta) + \|P_I\boldsymbol{\delta}\|_2^2 / \|\boldsymbol{\delta}\|_2^2$, hence $\|P_I\boldsymbol{\delta}\|_2^2 / \|\boldsymbol{\delta}\|_2^2 = \sin^2(\theta)$. ∎

Table [23](https://arxiv.org/html/2605.24042#A17.T23) applies this to GPT-2 Small at layer 6. The identity fraction decreases monotonically with $k$: at $k = 32$, 96.6% of the full margin variance lies in the identity subspace, corresponding to $\theta = 79.4^{\circ}$ (the margin vector is nearly perpendicular to the behavior subspace). Even at $k = 256$ (one-third of the full dimension), $\theta = 56.8^{\circ}$ and the identity fraction remains above 70%.
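The angle and the identity fraction are two views of the same number via Equation 13, $\sin^2(\theta) = \|P_I\boldsymbol{\delta}\|_2^2/\|\boldsymbol{\delta}\|_2^2$. As a small worked conversion using only the two values quoted above (the function names are ours, chosen for illustration):

```python
import math


def angle_from_identity_fraction(identity_fraction):
    """theta in degrees such that sin^2(theta) = identity_fraction."""
    return math.degrees(math.asin(math.sqrt(identity_fraction)))


def identity_fraction_from_angle(theta_deg):
    """sin^2(theta): share of the squared margin in the identity subspace."""
    return math.sin(math.radians(theta_deg)) ** 2


theta_k32 = angle_from_identity_fraction(0.966)  # identity fraction quoted at k = 32
frac_k256 = identity_fraction_from_angle(56.8)   # angle quoted at k = 256
print(f"k=32: theta = {theta_k32:.1f} deg; k=256: identity fraction = {frac_k256:.3f}")
```

Both conversions agree with the text: a 96.6% identity fraction corresponds to an angle that rounds to $79.4^{\circ}$, and $56.8^{\circ}$ corresponds to an identity fraction just above 70%.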

Table 23: Margin angle decomposition at layer 6 of GPT-2 Small (8-bit). The margin-achieving difference vector is nearly perpendicular to the behavior subspace at all $k$.

### Q.4 Random projection baseline

We prove that the expected squared norm retained by a uniformly random rank-$k$ subspace projection is $k/d$ times the full squared norm, so the expected margin is at most $\sqrt{k/d}$ times the full margin, then verify that the empirical random-subspace margins track this prediction to within 2 percentage points.

###### Proposition 6 (Random projection).

Let $P_R = UU^{\top}$ be the orthogonal projector onto a uniformly random $k$-dimensional subspace of $\mathbb{R}^d$ (i.e., $U \in \mathbb{R}^{d \times k}$ has columns drawn uniformly from the Stiefel manifold $\mathrm{St}(k,d)$). Then for any fixed $\boldsymbol{\delta} \in \mathbb{R}^d$:

$$\mathbb{E}[\|P_R\boldsymbol{\delta}\|_2^2] = \frac{k}{d}\|\boldsymbol{\delta}\|_2^2. \tag{14}$$

Consequently, the expected squared margin under random projection scales as $(k/d) \cdot m_{\mathrm{full}}^2$, and the expected margin scales as $\sqrt{k/d} \cdot m_{\mathrm{full}}$.

###### Proof.

Since $U^{\top}U = I_k$, we have $P_R^2 = U(U^{\top}U)U^{\top} = UU^{\top} = P_R$ and $P_R^{\top} = P_R$, so $P_R$ is an orthogonal projector. Therefore $\|P_R\boldsymbol{\delta}\|_2^2 = \boldsymbol{\delta}^{\top}P_R^2\boldsymbol{\delta} = \boldsymbol{\delta}^{\top}P_R\boldsymbol{\delta}$. Taking expectations: $\mathbb{E}[\|P_R\boldsymbol{\delta}\|_2^2] = \boldsymbol{\delta}^{\top}\mathbb{E}[P_R]\,\boldsymbol{\delta}$.

Let $A = \mathbb{E}[P_R]$. We show $A = (k/d)\,I_d$. For any $O \in O(d)$, left-invariance of the uniform measure on $\mathrm{St}(k,d)$ gives $OU \stackrel{d}{=} U$, so $A = OAO^{\top}$ for all $O \in O(d)$. Write $A = (a_{rs})$. For each $i$, let $D_i = \mathrm{diag}(1, \ldots, -1, \ldots, 1)$ with $-1$ in position $i$. Then $D_i \in O(d)$ and $(D_i A D_i)_{ij} = -a_{ij}$ for $j \neq i$. Since $A = D_i A D_i$, we get $a_{ij} = -a_{ij}$, so $a_{ij} = 0$ for all $j \neq i$. Thus $A$ is diagonal. For any pair $i \neq j$, the permutation matrix $\Pi_{ij}$ swapping coordinates $i$ and $j$ satisfies $\Pi_{ij} \in O(d)$, and conjugation by $\Pi_{ij}$ swaps the $i$-th and $j$-th diagonal entries. Since $A = \Pi_{ij}A\Pi_{ij}^{\top}$, we get $a_{ii} = a_{jj}$. All diagonal entries are equal, so $A = c\,I_d$.

To find $c$: $c \cdot d = \mathrm{tr}(A) = \mathbb{E}[\mathrm{tr}(UU^{\top})] = \mathbb{E}[\mathrm{tr}(U^{\top}U)] = \mathrm{tr}(I_k) = k$, so $c = k/d$.

For the margin: let $\hat{v}$ achieve $m_{\mathrm{full}} = \|\boldsymbol{\delta}_{\hat{v}}\|_2$. Since $m_R = \min_{v \neq y}\|P_R\boldsymbol{\delta}_v\|_2 \leq \|P_R\boldsymbol{\delta}_{\hat{v}}\|_2$, squaring and taking expectations gives $\mathbb{E}[m_R^2] \leq (k/d)\,m_{\mathrm{full}}^2$. This yields the upper bound $\mathbb{E}[m_R] \leq \sqrt{\mathbb{E}[m_R^2]} \leq \sqrt{k/d} \cdot m_{\mathrm{full}}$. A matching lower bound (concentration of $m_R$ around $\sqrt{k/d} \cdot m_{\mathrm{full}}$) does not follow from this proposition alone but is an empirical observation consistent with Johnson-Lindenstrauss-type concentration for finite point sets. ∎
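Equation 14 can be checked by Monte Carlo. The sketch below samples random subspaces by taking the QR factorization of a Gaussian matrix, whose column span is uniformly distributed over $k$-dimensional subspaces; the sizes and number of draws are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n_draws = 64, 16, 2000  # illustrative sizes

delta = rng.standard_normal(d)
full_sq = float(delta @ delta)

captured = np.empty(n_draws)
for t in range(n_draws):
    # Orthonormal basis of a uniformly distributed k-dimensional subspace.
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    # ||P_R delta||^2 / ||delta||^2, using ||U U^T delta|| = ||U^T delta||.
    captured[t] = np.sum((U.T @ delta) ** 2) / full_sq

mean_fraction = captured.mean()
print(f"empirical fraction {mean_fraction:.4f} vs k/d = {k / d:.4f}")
```

The sample mean should match $k/d$ up to Monte Carlo error; individual draws scatter around it with the Beta-distributed spread used later in Proposition 8.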

Table [24](https://arxiv.org/html/2605.24042#A17.T24) verifies this prediction against empirical random-subspace margins at layer 6 of GPT-2 Small. The match is striking: the predicted and observed margin percentages agree within 2 percentage points at every $k$. The behavior subspace also matches the random prediction, confirming that the behavior subspace is not anti-aligned with the margin direction. The identity subspace is the only one that substantially exceeds the random baseline.

Table 24: Random projection baseline verification. Predicted margin % is $\sqrt{k/d} \times 100$ with $d = 768$. Observed values are median margins at 8-bit as a percentage of the full margin.

This result has an important interpretive consequence: the behavior subspace retains approximately the same margin as a random subspace of equal dimension, meaning it contains no special margin-relevant structure. The identity complement, by contrast, retains margin far above the random baseline at every $k$. The two-channel separation is asymmetric: the identity channel is specifically aligned with margin-carrying directions, while the behavior channel is neutral with respect to margin.

### Q.5 Fixed-projector isotropy theorem

Proposition [1](https://arxiv.org/html/2605.24042#Thmproposition1) of the main body shows that a rank-$k$ subspace chosen uniformly at random captures an expected $k/d$ fraction of an isotropic margin. That argument is textbook and uses the Beta distribution of random orthogonal projections. The stronger claim we actually need is that any *fixed* rank-$k$ projector $P$ (including the Fisher behavior projector $P_B$, which is data-dependent, not uniformly random) captures close to $k/d$ on average whenever the normalized-margin covariance is approximately isotropic. The following gives a sufficient-condition test.

Let $\hat{\delta} = (h_x - h_{x'}) / \|h_x - h_{x'}\|$ for a distribution over prefix pairs, and let

$$S_{\delta} := \mathbb{E}[\hat{\delta}\hat{\delta}^{\top}].$$

Let $\varepsilon_{\mathrm{iso}} := \|d\,S_{\delta} - I\|_{\mathrm{op}}$.

###### Proposition 7 (Fixed-projector isotropy).

For any fixed rank-$k$ orthogonal projector $P$,

$$\Big|\mathbb{E}\|P\hat{\delta}\|^2 - \tfrac{k}{d}\Big| \leq \tfrac{k}{d}\,\varepsilon_{\mathrm{iso}}. \tag{15}$$

In particular, if $\varepsilon_{\mathrm{iso}} \ll 1$, then the expected fraction of squared margin captured by any rank-$k$ subspace is within $(k/d)\,\varepsilon_{\mathrm{iso}}$ of $k/d$, regardless of how $P$ is chosen.

###### Proof.

$\mathbb{E}\|P\hat{\delta}\|^2 = \mathbb{E}[\mathrm{tr}(P\hat{\delta}\hat{\delta}^{\top})] = \mathrm{tr}(PS_{\delta})$. Write $S_{\delta} = I/d + E$ with $\|dE\|_{\mathrm{op}} = \varepsilon_{\mathrm{iso}}$. Then $\mathrm{tr}(PS_{\delta}) = \mathrm{tr}(P)/d + \mathrm{tr}(PE) = k/d + \mathrm{tr}(PE)$. Since $P \succeq 0$ and $\mathrm{rank}(P) = k$, we have $\mathrm{tr}(P) = k$ and $\|E\|_{\mathrm{op}} = \varepsilon_{\mathrm{iso}}/d$, so $|\mathrm{tr}(PE)| \leq \mathrm{tr}(P)\,\|E\|_{\mathrm{op}} = (k/d)\,\varepsilon_{\mathrm{iso}}$. ∎

This sharpens the main-body claim that the $\sqrt{k/d}$ margin law is “random-projection geometry”: it is any-rank-$k$-projection geometry under the sufficient condition that $S_{\delta}$ is close to isotropic. A data-dependent projector like $P_B$ is subject to the same bound as a random $P_R$, although the operator-norm condition is conservative in our data.
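Because Equation 15 is a deterministic inequality in $S_{\delta}$, it can be checked directly on any estimate of that matrix. The sketch below builds a synthetic, mildly anisotropic $S_{\delta}$ from normalized Gaussian vectors (a stand-in for real hidden-state differences, which we do not have here) and tests a fixed coordinate-aligned projector.

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, n_pairs = 48, 12, 5000  # illustrative sizes

# Synthetic anisotropic differences (stand-in for h_x - h_x'), then normalized.
scales = np.linspace(0.5, 2.0, d)
diffs = rng.standard_normal((n_pairs, d)) * scales
unit = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)

S_delta = unit.T @ unit / n_pairs                          # estimate of E[dd^T]
eps_iso = np.linalg.norm(d * S_delta - np.eye(d), ord=2)   # operator norm

# A fixed (non-random) projector: the last k coordinate axes.
P = np.zeros((d, d))
P[-k:, -k:] = np.eye(k)

captured = np.trace(P @ S_delta)    # expected squared projected norm
deviation = abs(captured - k / d)
bound = (k / d) * eps_iso
print(f"deviation = {deviation:.4f}, bound = {bound:.4f}, eps_iso = {eps_iso:.3f}")
```

Here the projector is deliberately aligned with the high-variance coordinates, so the deviation is nonzero but still inside the band $(k/d)\,\varepsilon_{\mathrm{iso}}$.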

#### Q.5.1 Empirical check

We estimate $S_{\delta}$ from $20{,}000$ random pairs of prefix hidden states at layer $\ell$ on each model; for GPT-2 Small at $\ell = 6$ (the setting for Figures [9](https://arxiv.org/html/2605.24042#A5.F9) and [1](https://arxiv.org/html/2605.24042#S4.F1)) we obtain $\mathrm{tr}(S_{\delta}) = 0.9995$ (so the spectrum of $S_{\delta}$ is nearly a probability distribution), $\varepsilon_{\mathrm{iso}} = 29.28$, and for random rank-$k$ projectors $P_R$ (averaged over ten draws):

Across all ten models in our scaling paper set, spanning $d \in [768, 5120]$, empirical $\mathrm{tr}(P_R S_{\delta})$ matches $k/d$ to $\leq 0.0021$ absolute, well inside the theorem’s conservative band $(k/d)\,\varepsilon_{\mathrm{iso}}$. $\varepsilon_{\mathrm{iso}}$ grows with $d$ (the $d\,S_{\delta} - I$ operator becomes more anisotropic at higher dimensions), but the empirical deviation from $k/d$ under random projectors actually *shrinks* with $d$, reaching $0.0001$–$0.0004$ on the 14B models. The operator-norm bound is therefore a conservative sufficient condition; the proposition below explains the tightness via Frobenius rather than operator anisotropy. The table above reports Haar-random projectors $P_R$ only; the relevant projector for the empirical $\sqrt{k/d}$ law is the data-dependent Fisher projector $P_B$, whose alignment with $\Sigma_{\delta}$ is measured separately by the coupling $\kappa = \mathrm{tr}(P_B\Sigma_{\delta}) / ((k/d)\,\mathrm{tr}(\Sigma_{\delta}))$ in Table [9](https://arxiv.org/html/2605.24042#A8.T9) of Appendix [H](https://arxiv.org/html/2605.24042#A8), rather than guaranteed by the worst-case theorem.

###### Proposition 8 (Variance form of the random-projection law).

Let $S_{\delta} \succeq 0$ with $\mathrm{tr}(S_{\delta}) = 1$, define $E := S_{\delta} - \frac{1}{d}I$ so that $\mathrm{tr}(E) = 0$, and let $P$ be a Haar-random rank-$k$ orthogonal projector on $\mathbb{R}^d$. Then

$$\mathbb{E}\,\mathrm{tr}(PS_{\delta}) = \frac{k}{d}, \qquad \mathrm{Var}\bigl(\mathrm{tr}(PS_{\delta})\bigr) = \frac{2k(d-k)}{d(d-1)(d+2)}\,\|E\|_F^2.$$

Consequently, by Chebyshev,

$$\Pr\left(\left|\mathrm{tr}(PS_{\delta}) - \frac{k}{d}\right| \geq t\right) \leq \frac{2k(d-k)}{d(d-1)(d+2)} \cdot \frac{\|E\|_F^2}{t^2}.$$

###### Proof.

Mean: $\mathbb{E}[P] = (k/d)I$ by rotational invariance, so $\mathbb{E}\,\mathrm{tr}(PS_{\delta}) = (k/d)\,\mathrm{tr}(S_{\delta}) = k/d$. For the variance, write $\mathrm{tr}(PS_{\delta}) - k/d = \mathrm{tr}((P - (k/d)I)\,E)$, and define the quadratic form $Q(E) := \mathbb{E}\bigl[\mathrm{tr}((P - (k/d)I)\,E)^2\bigr]$ on the trace-zero symmetric subspace. By $O(d)$-invariance, $Q(E) = c\|E\|_F^2$ for some scalar $c$. To evaluate $c$, take $E = xx^{\top} - (1/d)I$ for a fixed unit vector $x$. Then $\mathrm{tr}(PE) = x^{\top}Px - k/d$, and $x^{\top}Px \sim \mathrm{Beta}(k/2, (d-k)/2)$, so $\mathrm{Var}(x^{\top}Px) = 2k(d-k)/(d^2(d+2))$. Also $\|xx^{\top} - (1/d)I\|_F^2 = 1 - 1/d = (d-1)/d$. Therefore $c = \mathrm{Var}(x^{\top}Px)/\|E\|_F^2 = 2k(d-k)/(d(d-1)(d+2))$. ∎
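The variance formula of Proposition 8 can be compared against simulation. The sketch below uses a small diagonal unit-trace $S$ with a linearly spaced spectrum (an arbitrary illustrative choice) and Haar-random subspaces sampled via QR of a Gaussian matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n_draws = 12, 4, 5000  # illustrative sizes

# A unit-trace PSD matrix S with a mildly anisotropic spectrum.
eigvals = np.linspace(1.0, 3.0, d)
eigvals /= eigvals.sum()
S = np.diag(eigvals)
E = S - np.eye(d) / d  # trace-zero deviation from isotropy

# Closed-form variance from Proposition 8 (np.sum(E**2) is ||E||_F^2).
predicted_var = 2 * k * (d - k) / (d * (d - 1) * (d + 2)) * np.sum(E ** 2)

samples = np.empty(n_draws)
for t in range(n_draws):
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))  # Haar-random subspace
    samples[t] = np.trace(U.T @ S @ U)                # tr(P S) with P = U U^T

empirical_mean, empirical_var = samples.mean(), samples.var()
print(f"mean {empirical_mean:.4f} (k/d = {k / d:.4f}); "
      f"var {empirical_var:.3e} (predicted {predicted_var:.3e})")
```

The empirical mean and variance should agree with $k/d$ and the closed form up to sampling error, which shrinks as `n_draws` grows.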

The Frobenius rather than operator scaling explains why the empirical deviations in the table above shrink with $d$ even though $\varepsilon_{\mathrm{iso}} = \|dS_{\delta} - I\|_{\mathrm{op}}$ grows: the relevant anisotropy is $\|E\|_F^2$, which can stay small even when $\|dS_{\delta} - I\|_{\mathrm{op}}$ is large because operator norm picks out a single eigendirection while Frobenius averages over all eigendirections weighted by $\lambda_i^2$. A low-rank or near-low-rank deviation drives the operator norm without inflating the Frobenius mass, and the empirical $S_{\delta}$ on these models has exactly this profile. The fixed-projector bound of Proposition [7](https://arxiv.org/html/2605.24042#Thmproposition7) is the worst-case over a single direction; Proposition [8](https://arxiv.org/html/2605.24042#Thmproposition8) is the typical-case over a random rank-$k$ subspace.

### Q.6 Why low-rank defenses collapse under adaptive retrieval

The empirical collapse of the generalized-eigen and complement defenses to $100\%$ top-1 under the adaptive Mahalanobis attacker (Section [6](https://arxiv.org/html/2605.24042#S6)) has a clean proof.

###### Proposition 9 (Rank-deficient release is trivially invertible).

Let $y = x + \xi$ with $\xi \sim \mathcal{N}(0, \Sigma)$ and $\Sigma \succeq 0$ singular, and let $P_0$ denote orthogonal projection onto $\ker(\Sigma)$. Then $P_0 y = P_0 x$ almost surely. Consequently, for a finite candidate bank $\{x_1, \ldots, x_N\}$ whose projections $\{P_0 x_i\}$ are pairwise distinct, an attacker with access to $\Sigma$ recovers the true item exactly from $P_0 y$.

###### Proof.

$\xi$ is supported on $\mathrm{Im}(\Sigma) = \ker(\Sigma)^{\perp}$, so $P_0\xi = 0$ almost surely and $P_0 y = P_0 x$. If the $P_0 x_i$ are pairwise distinct, $P_0 y$ uniquely identifies $x_i$. ∎

For generalized-eigen noise at rank $k_{\xi} < d$, the nullspace projector $P_0 = I - V_{k_{\xi}}V_{k_{\xi}}^{\top}$ has rank $d - k_{\xi} \geq 1$ and the hidden states of distinct prefixes are generically distinct under this projection (the inter-prefix covariance $\Sigma_{\delta}$ has full support on typical models). Proposition [9](https://arxiv.org/html/2605.24042#Thmproposition9) is therefore the deterministic mechanism behind the $100\%$-top-1 collapse we measure. Any rank-deficient release is broken under an attacker who knows $\Sigma$.
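The attack implied by Proposition 9 takes only a few lines. The sketch below is an illustration on synthetic data, not the paper's attacker: it assumes a random candidate bank, a rank-deficient covariance built from a random orthonormal basis, and an arbitrarily chosen true index.

```python
import numpy as np

rng = np.random.default_rng(5)
d, rank, n_bank = 16, 10, 50  # illustrative sizes

# Singular covariance Sigma = V diag(lam) V^T with rank < d.
V_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, V_null = V_full[:, :rank], V_full[:, rank:]
lam = rng.uniform(1.0, 5.0, size=rank)
P0 = V_null @ V_null.T  # orthogonal projector onto ker(Sigma)

bank = rng.standard_normal((n_bank, d))  # candidate hidden states x_i
true_idx = 17                            # arbitrary choice for the demo
xi = V @ (np.sqrt(lam) * rng.standard_normal(rank))  # noise lies in Im(Sigma)
y = bank[true_idx] + xi                  # released vector

# Attacker: compare the noise-free component P0 y against every P0 x_i.
dists = np.linalg.norm(bank @ P0 - P0 @ y, axis=1)
recovered_idx = int(np.argmin(dists))
print(f"recovered index {recovered_idx}, residual {dists[recovered_idx]:.2e}")
```

The noise here is large relative to the inter-candidate spacing, yet the nullspace component of the release is untouched, so the nearest-neighbor match is exact up to floating-point error.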

### Q.7 Small isotropic floor fails gracefully, not trivially

A natural fix for rank-deficient releases is to add an isotropic floor, $\Sigma_{\eta} = \Sigma + \eta I$ with $\eta > 0$. The following proposition gives the large-deviation rate for pairwise Bayes error as $\eta \to 0$.

###### Proposition 10 (Floor asymptotic).

Let $\Sigma \succeq 0$ be singular with nullspace projector $P_0$, let $\Sigma_\eta = \Sigma + \eta I$, and let $\delta \in \mathbb{R}^d$ with $P_0 \delta \neq 0$. Then

$$\delta^{\top} \Sigma_\eta^{-1} \delta = \frac{\|P_0 \delta\|^2}{\eta} + O(1) \qquad \text{as } \eta \to 0^{+},$$

and the pairwise Bayes error of the Gaussian mechanism $\mathcal{N}(x, \Sigma_\eta)$ between two candidates at separation $\delta$ satisfies

$$P_\eta(\mathrm{error}) \leq \exp\!\left(-\frac{\|P_0 \delta\|^2}{8\eta} + O(1)\right).$$

###### Proof\.

Diagonalize $\Sigma = U\,\mathrm{diag}(\lambda_1, \ldots, \lambda_r, 0, \ldots, 0)\,U^{\top}$ with $r = \mathrm{rank}(\Sigma)$. Then $\Sigma_\eta^{-1} = U \left[\mathrm{diag}(1/(\lambda_i + \eta))_{i \leq r} \oplus \mathrm{diag}(1/\eta)_{i > r}\right] U^{\top}$. Decomposing $\delta = P_0^{\perp}\delta + P_0\delta$ in the same basis, with $\delta_i$ the coordinates of $\delta$ in the basis $U$, $\delta^{\top}\Sigma_\eta^{-1}\delta = \sum_{i \leq r} \delta_i^2/(\lambda_i + \eta) + \|P_0\delta\|^2/\eta$. The first sum is bounded by $\|P_0^{\perp}\delta\|^2/\lambda_{\min}^{+}$ as $\eta \to 0$, where $\lambda_{\min}^{+}$ is the smallest nonzero eigenvalue of $\Sigma$, giving the leading-order $\|P_0\delta\|^2/\eta + O(1)$. The Bayes error bound is the Gaussian tail $\Phi(-\tfrac{1}{2}\sqrt{\delta^{\top}\Sigma_\eta^{-1}\delta}) \leq \exp(-\delta^{\top}\Sigma_\eta^{-1}\delta/8)$, substituting the leading term. ∎

This quantifies the cost of the rank-deficiency escape: even with a small full-rank floor, any adjacency pair with nonzero nullspace component is exponentially distinguishable as the floor shrinks. The floor therefore has to be chosen large enough that $\|P_0\delta\|^2/\eta$ is a modest constant for the hardest adjacency pair, which for our measured adjacency sets pushes $\eta$ up toward $\eta \sim \mathrm{tr}(\Sigma)/d$, exactly the regime where $\Sigma_\eta$ is close to a full-rank scalar multiple of isotropic noise. Low-rank structure is effectively inaccessible.
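The floor asymptotic of Proposition 10 is easy to check numerically. The sketch below builds a synthetic rank-deficient covariance (the eigenvalues, dimension, and seed are arbitrary assumptions) and compares $\delta^{\top}\Sigma_\eta^{-1}\delta$ with the leading term $\|P_0\delta\|^2/\eta$ as $\eta$ shrinks.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 6, 4                                     # width and rank (illustrative)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
lam = np.array([2.0, 1.5, 1.0, 0.5, 0.0, 0.0])  # last d - r eigenvalues are zero
Sigma = U @ np.diag(lam) @ U.T

P0 = U[:, r:] @ U[:, r:].T                      # projector onto ker(Sigma)
delta = rng.standard_normal(d)
null_mass = float(delta @ P0 @ delta)           # ||P0 delta||^2

def mahalanobis_sq(eta):
    """delta^T (Sigma + eta I)^{-1} delta for the floored covariance."""
    return float(delta @ np.linalg.solve(Sigma + eta * np.eye(d), delta))

for eta in (1e-1, 1e-2, 1e-3):
    total = mahalanobis_sq(eta)
    remainder = total - null_mass / eta         # the O(1) term of Proposition 10
    print(f"eta={eta:g}  signal={total:.3f}  remainder={remainder:.4f}")
```

The remainder stays bounded by $\|P_0^{\perp}\delta\|^2/\lambda_{\min}^{+}$ while the signal itself grows like $1/\eta$, which is the behaviour the proof describes.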

### Q.8 $\Sigma_{\mathrm{diag}}$ is the unique equal-cost diagonal mechanism

The empirical success of the diagonal-minimax release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,D^{-1}$ with $D = \mathrm{diag}(F)$, parameterized in the implementation as $\Sigma_{\mathrm{diag}} = \sigma^2\,\mathrm{diag}(1/F_{ii})$ with $\sigma^2 = 2\mathcal{K}/d$, across all 32 model-layer points has a simple structural explanation.

###### Proposition 11 (Equal-coordinate-cost characterization).

Among diagonal Gaussian release covariances $\Sigma = \mathrm{diag}(s_1, \ldots, s_d)$ with $s_i > 0$ and total first-order KL budget $U(\Sigma) = \tfrac{1}{2}\sum_i F_{ii} s_i \leq \mathcal{K}$, the unique covariance that equalizes the first-order expected KL contribution of each coordinate,

$$\tfrac{1}{2}\,F_{ii}\,s_i = c \quad \forall i,$$

and saturates the budget has $c = \mathcal{K}/d$, hence $s_i = 2\mathcal{K}/(d\,F_{ii})$, i.e. $\Sigma = (2\mathcal{K}/d)\,D^{-1}$.

###### Proof\.

Equation ([1](https://arxiv.org/html/2605.24042#S3.E1)) gives $\mathbb{E}[\mathrm{KL}] = \tfrac{1}{2}\mathrm{tr}(F\,\Sigma) = \tfrac{1}{2}\sum_i F_{ii} s_i$ for diagonal $\Sigma$. Requiring $F_{ii} s_i = 2c$ for all $i$ pins $s_i$ to $2c/F_{ii}$, and saturating the budget $\sum_i F_{ii} s_i = 2\mathcal{K}$ gives $c = \mathcal{K}/d$. ∎

$\Sigma^\star_{\mathrm{diag}}(\mathcal{K})$ is therefore not a heuristic: it is the unique diagonal mechanism that distributes the local utility cost evenly across coordinates, which by construction also distributes the adaptive-attacker Mahalanobis signal $\delta_i^2 F_{ii}\,d/(2\mathcal{K})$ evenly, so no single low-Fisher coordinate becomes the worst-case adjacency direction.
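As a small sanity check of Proposition 11, the following sketch constructs $\Sigma^\star_{\mathrm{diag}}$ from a toy Fisher diagonal and verifies that every coordinate carries the same first-order KL cost $\mathcal{K}/d$. The function name `sigma_diag` and the Fisher values are hypothetical choices for illustration.

```python
import numpy as np

def sigma_diag(fisher_diag, kl_budget):
    """Diagonal of the release covariance: s_i = 2K / (d * F_ii)."""
    F = np.asarray(fisher_diag, dtype=float)
    return (2.0 * kl_budget / F.size) / F

F_diag = np.array([4.0, 1.0, 0.25, 0.01])   # toy Fisher diagonal (assumed)
K = 1.0
s = sigma_diag(F_diag, K)

per_coord_cost = 0.5 * F_diag * s           # first-order KL cost per coordinate
total_cost = per_coord_cost.sum()           # total first-order KL budget used

# Mahalanobis signal of a unit perturbation along each coordinate axis.
unit_signal = 1.0 / s                       # equals F_ii * d / (2K)
print(per_coord_cost, total_cost, unit_signal)
```

Low-Fisher coordinates receive proportionally more noise, so the per-coordinate signal $\delta_i^2 F_{ii}\,d/(2\mathcal{K})$ is weighted by the Fisher diagonal rather than by raw Euclidean scale.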

### Q.9 Average-optimal and worst-case-optimal covariances can differ sharply

Table [29](https://arxiv.org/html/2605.24042#A21.T29) shows that $\Sigma^\star_{\mathrm{Mah}}$, the minimizer of the average Mahalanobis signal, is not always the minimizer of the worst-case $\varepsilon$. A two-dimensional counterexample shows this is not a measurement artifact but a structural property of the two optimization problems.

###### Proposition 12 (Average vs worst-case separation).

Take $F = I_2$, diagonal release covariance $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$ with utility budget $\mathrm{tr}(\Sigma) \leq \kappa$, and adjacency set $\mathcal{A} = \{\pm e_1, \pm e_2\} \subset \mathbb{R}^2$. The worst-case RDP objective $\sup_{\Delta \in \mathcal{A}} \Delta^{\top}\Sigma^{-1}\Delta = \max(\sigma_1^{-2}, \sigma_2^{-2})$ is minimized by the isotropic covariance $\sigma_1^2 = \sigma_2^2 = \kappa/2$. Now let $\Sigma_\delta = \mathrm{diag}(L, \ell)$ with $L \neq \ell$. The average Mahalanobis objective $\mathrm{tr}(\Sigma_\delta \Sigma^{-1}) = L/\sigma_1^2 + \ell/\sigma_2^2$ is minimized at $\sigma_1^2 : \sigma_2^2 = \sqrt{L} : \sqrt{\ell}$, which is strictly anisotropic whenever $L \neq \ell$.

###### Proof\.

For the worst-case objective, Lagrangian KKT on $\sigma_1^2 + \sigma_2^2 = \kappa$ with $t = \max(\sigma_1^{-2}, \sigma_2^{-2})$ gives $\sigma_1^{-2} = \sigma_2^{-2}$ at the optimum, so $\sigma_1^2 = \sigma_2^2 = \kappa/2$. For the average objective, Lagrangian KKT gives $L/\sigma_1^4 = \ell/\sigma_2^4$, i.e. $\sigma_1^2/\sigma_2^2 = \sqrt{L/\ell}$, anisotropic iff $L \neq \ell$. ∎

The two optima can be arbitrarily far apart as $L/\ell \to \infty$. This is the conceptual reason Tables [28](https://arxiv.org/html/2605.24042#A21.T28) and [29](https://arxiv.org/html/2605.24042#A21.T29) do not have a universal winner: $\Sigma^\star_{\mathrm{Mah}}$ minimizes the *average*, isotropic minimizes the *worst-case*, and neither dominates the other under an attacker that adapts to the released $\Sigma$. $\Sigma_{\mathrm{diag}}$ is a different object altogether: the unique diagonal equal-cost mechanism, whose empirical advantage at high $\varepsilon$ comes from equalizing coordinatewise Fisher cost.
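The two-dimensional counterexample of Proposition 12 can be evaluated directly. The sketch below uses only the standard library; the values $\kappa = 2$, $L = 100$, $\ell = 1$ are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def worst_case(s1, s2):
    """max(1/sigma_1^2, 1/sigma_2^2) over the adjacency set {±e1, ±e2}."""
    return max(1.0 / s1, 1.0 / s2)

def average(s1, s2, L, ell):
    """tr(Sigma_delta Sigma^{-1}) = L / sigma_1^2 + ell / sigma_2^2."""
    return L / s1 + ell / s2

def average_optimum(kappa, L, ell):
    """Variances with sigma_1^2 : sigma_2^2 = sqrt(L) : sqrt(ell), trace kappa."""
    a, b = math.sqrt(L), math.sqrt(ell)
    return kappa * a / (a + b), kappa * b / (a + b)

kappa, L, ell = 2.0, 100.0, 1.0
iso = (kappa / 2, kappa / 2)              # worst-case optimum
avg = average_optimum(kappa, L, ell)      # average-case optimum

for name, (s1, s2) in {"isotropic": iso, "average-opt": avg}.items():
    print(name, "worst:", worst_case(s1, s2), "avg:", average(s1, s2, L, ell))
```

Each covariance wins on its own objective and loses on the other, and the gap widens as $L/\ell$ grows.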

### Q.10 Gaussian impossibility for full-state release

The empty-middle observation in Appendix [T.5](https://arxiv.org/html/2605.24042#A20.SS5) reflects a structural lower bound that holds for every full-rank Gaussian release. We state it in terms of the Fisher-ball adversary class and convert it to a pairwise distinguishability bound.

###### Theorem 13 (Gaussian Fisher-ball lower bound).

Let $F_\lambda \succ 0$, $\Sigma \succ 0$, and let the local utility budget be $U(\Sigma) = \tfrac{1}{2}\,\mathrm{tr}(F_\lambda \Sigma) \leq \mathcal{K}$. Define the adversary class $\mathcal{A}_\rho = \{\Delta \in \mathbb{R}^d : \Delta^{\top} F_\lambda \Delta \leq \rho^2\}$. Then

$$\sup_{\Delta \in \mathcal{A}_\rho}\;\Delta^{\top}\Sigma^{-1}\Delta \;\geq\; \frac{\rho^2 d}{2\mathcal{K}},$$

with equality attained by the unique minimax mechanism $\Sigma^\star_{\mathrm{full}} = (2\mathcal{K}/d)\,F_\lambda^{-1}$.

###### Proof\.

Let $Y = F_\lambda^{1/2}\Sigma F_\lambda^{1/2}$, so $\mathrm{tr}(Y) = \mathrm{tr}(F_\lambda\Sigma) \leq 2\mathcal{K}$. The change of variable $\Delta = F_\lambda^{-1/2} z$ gives $\Delta^{\top}F_\lambda\Delta = \|z\|_2^2$ and $\Delta^{\top}\Sigma^{-1}\Delta = z^{\top}Y^{-1}z$, so

$$\sup_{\Delta\in\mathcal{A}_\rho}\;\Delta^{\top}\Sigma^{-1}\Delta \;=\; \sup_{\|z\|_2\leq\rho}\; z^{\top}Y^{-1}z \;=\; \frac{\rho^2}{\lambda_{\min}(Y)}.$$

Because the smallest eigenvalue is at most the mean eigenvalue, $\lambda_{\min}(Y) \leq \mathrm{tr}(Y)/d \leq 2\mathcal{K}/d$, which gives the lower bound. Equality requires $Y$ isotropic, $Y = (2\mathcal{K}/d)I$, hence $\Sigma = (2\mathcal{K}/d)F_\lambda^{-1}$. ∎

The adversary class $\mathcal{A}_\rho$ is a strict superset of the empirical adjacency set used in Appendix [T.5](https://arxiv.org/html/2605.24042#A20.SS5): any prompt-difference vector $\Delta = h(x) - h(x')$ with $\Delta^{\top}F_\lambda\Delta \leq \rho^2$ is in $\mathcal{A}_\rho$, but the converse fails because not every Fisher-ball element is realisable as a hidden-state difference. Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) is therefore a population bound: it lower-bounds the worst-case Mahalanobis signal over a strictly larger class than the realized one, so it cannot tightly explain the $0/1{,}536$ measurement, but it does establish that no Gaussian full-state release can be uniformly safe at small utility budget.
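The bound of Theorem 13 can be checked on synthetic matrices. In this sketch the Fisher matrix and the competing covariance are random positive-definite stand-ins (an assumption, not measured model quantities), and `worst_case_signal` evaluates $\rho^2/\lambda_{\min}(Y)$ exactly as in the proof.

```python
import numpy as np

def sym_power(A, p):
    """Matrix power of a symmetric positive-definite matrix via eigh."""
    w, Q = np.linalg.eigh(A)
    return (Q * w**p) @ Q.T

def worst_case_signal(F, Sigma, rho):
    """sup over the Fisher ball of Delta^T Sigma^{-1} Delta = rho^2 / lambda_min(Y)."""
    F_half = sym_power(F, 0.5)
    Y = F_half @ Sigma @ F_half
    Y = 0.5 * (Y + Y.T)                       # remove rounding asymmetry
    return rho**2 / np.linalg.eigvalsh(Y)[0]

rng = np.random.default_rng(2)
d, K, rho = 16, 1.0, 1.0
A = rng.standard_normal((d, d))
F = A @ A.T + 0.1 * np.eye(d)                 # synthetic Fisher matrix
bound = rho**2 * d / (2.0 * K)

Sigma_star = (2.0 * K / d) * np.linalg.inv(F) # minimax mechanism of Theorem 13
B = rng.standard_normal((d, d))
Sigma_rand = B @ B.T + 0.1 * np.eye(d)
Sigma_rand *= 2.0 * K / np.trace(F @ Sigma_rand)   # rescale to the same budget

star_signal = worst_case_signal(F, Sigma_star, rho)
rand_signal = worst_case_signal(F, Sigma_rand, rho)
print(bound, star_signal, rand_signal)
```

The minimax covariance meets the bound, while a generic covariance at the same budget exceeds it, often by a wide margin.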

###### Corollary 14 (Existence of an exponentially distinguishable pair).

Under the conditions of Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13), there exists $\Delta^\star \in \mathcal{A}_\rho$ such that the pairwise Bayes error of distinguishing $h$ from $h + \Delta^\star$ under $\mathcal{N}(\cdot, \Sigma)$ is bounded by

$$P_{\mathrm{err}}(\Delta^\star) \;\leq\; \exp\!\left(-\,\tfrac{1}{8}\,\Delta^{\star\top}\Sigma^{-1}\Delta^\star\right) \;\leq\; \exp\!\left(-\,\frac{\rho^2 d}{16\,\mathcal{K}}\right).$$

###### Proof\.

The pairwise Bayes error of two equal-covariance Gaussians with means $h, h+\Delta$ is $\Phi(-\tfrac{1}{2}\sqrt{\Delta^{\top}\Sigma^{-1}\Delta})$, and $\Phi(-x) \leq e^{-x^2/2}$ for $x \geq 0$, giving the first inequality. The second follows by taking $\Delta^\star$ to attain the supremum in Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13). ∎

The corollary makes precise the sense in which the Gaussian frontier has an empty middle: at any utility budget $\mathcal{K} = O(1)$, some Fisher-ball direction (equivalently, some pair in the enlarged Fisher-ball adversary class $\mathcal{A}_\rho$) remains exponentially distinguishable in $d$. Filling the moderate-both region requires a release class outside the full-state Gaussian family, which is what motivates the predictive-quotient mechanism of Appendix [R](https://arxiv.org/html/2605.24042#A18).

### Q.11 Diagonal minimax theorem

The minimax statement of Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) restricts cleanly to diagonal mechanisms with diagonal Fisher surrogate $D = \mathrm{diag}(F)$.

###### Theorem 15 (Diagonal Fisher-ball minimax).

Let $D = \mathrm{diag}(F) \succ 0$ and restrict $\Sigma$ to diagonal Gaussian covariances $\Sigma = \mathrm{diag}(s_1,\ldots,s_d)$ with $s_i > 0$. Let the diagonal utility budget be $U_D(\Sigma) = \tfrac{1}{2}\sum_i D_i s_i \leq \mathcal{K}$ and the adversary class be $\mathcal{A}_{D,\rho} = \{\Delta : \Delta^{\top}D\Delta \leq \rho^2\}$. The unique solution to

$$\inf_{\Sigma\;\mathrm{diag}}\;\sup_{\Delta\in\mathcal{A}_{D,\rho}}\;\Delta^{\top}\Sigma^{-1}\Delta \quad\text{subject to}\quad U_D(\Sigma) \leq \mathcal{K}$$

is $\Sigma^\star_{\mathrm{diag}} = (2\mathcal{K}/d)\,D^{-1}$, attaining $\rho^2 d/(2\mathcal{K})$.

###### Proof\.

Because both $D$ and $\Sigma$ are diagonal,

$$\sup_{\Delta^{\top}D\Delta\leq\rho^2}\;\Delta^{\top}\Sigma^{-1}\Delta \;=\; \rho^2\,\lambda_{\max}\!\left(D^{-1/2}\Sigma^{-1}D^{-1/2}\right) \;=\; \rho^2\,\max_i \frac{1}{D_i s_i}.$$

The minimax problem reduces to $\min_{s_i>0}\,\max_i (D_i s_i)^{-1}$ subject to $\sum_i D_i s_i \leq 2\mathcal{K}$. Since $\max_i (D_i s_i)^{-1} = 1/\min_i D_i s_i$, and the smallest of $d$ positive numbers with sum at most $2\mathcal{K}$ is at most $2\mathcal{K}/d$ with equality only when all entries are equal, the objective is minimized by equalizing the entries; equalizing $D_i s_i = c$ and using $\sum_i D_i s_i = 2\mathcal{K}$ gives $c = 2\mathcal{K}/d$, hence $s_i = 2\mathcal{K}/(d D_i)$. ∎

Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15) upgrades Proposition [11](https://arxiv.org/html/2605.24042#Thmproposition11) from an equal-cost characterization to a true minimax statement: $\Sigma_{\mathrm{diag}}$ is the unique diagonal release that minimizes the worst-case Mahalanobis signal over the Fisher-ball adversary class. The full-matrix parent statement of Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) drops the diagonal restriction and replaces $D$ with $F_\lambda$, recovering the classical Mahalanobis whitening $\Sigma \propto F_\lambda^{-1}$.

#### Q.11.1 Empirical $\alpha$-sweep validation

To test how tightly the Fisher-ball minimax prediction transfers to a realistic adversary, we run an $\alpha$-sweep over the family $\Sigma_\alpha = c_\alpha \cdot \mathrm{diag}(F_{ii})^{-\alpha}$ with $c_\alpha$ chosen to fix $\mathrm{tr}(F\,\Sigma_\alpha) = 2\mathcal{K}$. Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15) predicts the minimax point at $\alpha = 1$ for the Fisher-ball adversary $\mathcal{A}_{D,\rho} = \{\Delta : \Delta^{\top}D\Delta \leq \rho^2\}$. We measure two empirical proxies on a different adversary, namely a 20,000-pair adjacency bank of realistic prompt differences (5,000 random one-token substitutions, 5,000 top-256 LM-prob replacements, 5,000 same-frequency-bin replacements, and 5,000 behavior-hard low-KL pairs): the worst-case Mahalanobis signal $\max_{\Delta\in\mathcal{A}_{\mathrm{emp}}}\Delta^{\top}\Sigma_\alpha^{-1}\Delta$, and the realized retrieval top-1 against an adaptive $\Sigma_\alpha$-aware attacker on a 50,000-candidate bank with 2,000 queries.

The sweep covers $\alpha \in \{0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5\}$ at $\mathcal{K} \in \{0.3, 1, 3, 7\}$, with three independent seeds for adjacency construction and noise draws, across all 32 model-layer points used elsewhere in the paper (GPT-2 Small all 12 layers, Mistral-7B layers $\{4,8,12,16,20,24,28,31\}$, Phi-2 layers $\{4,12,20,28\}$, Qwen3-14B layers $\{10,20,30,39\}$, DeepSeek-R1-14B layers $\{12,24,36,47\}$). Figure [17](https://arxiv.org/html/2605.24042#A17.F17) reports the resulting curves, normalized at $\alpha = 0$.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x17.png)

Figure 17: (A) Empirical worst-case Mahalanobis $\max_{\Delta\in\mathcal{A}_{\mathrm{emp}}}\Delta^{\top}\Sigma_\alpha^{-1}\Delta$, normalized to $\alpha = 0$, averaged over layers within each model. Bands are $\pm 1$ standard deviation across layers and seeds. (B) Realised retrieval top-1 attack success at matched utility budget. Dotted line marks $\alpha = 1.0$, the theoretical Fisher-ball minimax point.

Three findings. First, the qualitative direction predicted by Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15) holds robustly: the high-$\alpha$ regime ($\alpha \geq 0.5$) attains the lower worst-case Mahalanobis signal at $27$ of the $32$ tested model-layer points. Second, $\alpha = 1$ exactly is the empirical minimum at only $1$ of the $32$ points (DeepSeek-R1-14B at layer 24); on GPT-2 the empirical minimum sits at $\alpha \in \{0.5, 0.75\}$, on Mistral-7B's early-to-mid layers at $\alpha = 1.5$, and on the latest layer of every modern model at $\alpha < 0.5$. Third, the retrieval-attack proxy and the worst-case Mahalanobis proxy disagree on the location of the optimum: the retrieval top-1 minimum on Mistral and the 14B models sits at low $\alpha$ (often $\alpha = 0$), even where the worst-case Mahalanobis is minimized at $\alpha = 1.5$. This is the average-versus-worst-case separation of Proposition [12](https://arxiv.org/html/2605.24042#Thmproposition12): $\Sigma_{\mathrm{diag}}$ minimizes the worst-case signal under the Fisher-ball model of the adversary, but isotropic noise can give lower realized attack success on a sample-driven adversary that concentrates its mass in high-Fisher coordinates the attacker actively exploits via the Mahalanobis metric.

The structural prediction of Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15) is therefore the right one for any adversary that approximates the Fisher-ball, and a useful upper bound for narrower adversaries; the empirical optimum on a realistic 20,000-pair adjacency set is in $\alpha \in \{0.25, 0.5, 0.75, 1.0, 1.25, 1.5\}$ depending on model and metric, and $\alpha = 1$ is the canonical theoretically justified default rather than the universal empirical minimum.
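The theoretical half of the $\alpha$-sweep can be reproduced in a few lines. The sketch below evaluates only the Fisher-ball worst case $\rho^2 \max_i 1/(D_i s_i)$ from Theorem 15 on a synthetic log-normal Fisher diagonal (an assumption); it does not model the empirical 20,000-pair adjacency bank or the retrieval attacker, so it shows where the theory places the optimum, not where the measurements do.

```python
import numpy as np

def sigma_alpha(F_diag, alpha, K):
    """Diagonal of Sigma_alpha = c_alpha * F^{-alpha}, with tr(F Sigma_alpha) = 2K."""
    s = F_diag ** (-alpha)
    return s * (2.0 * K / np.sum(F_diag * s))

def fisher_ball_worst_case(F_diag, s, rho=1.0):
    """rho^2 * max_i 1 / (D_i s_i), the closed form from the proof of Theorem 15."""
    return rho**2 * np.max(1.0 / (F_diag * s))

rng = np.random.default_rng(3)
F_diag = np.exp(rng.normal(0.0, 1.5, size=64))   # synthetic Fisher diagonal
K = 1.0
alphas = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]

curve = {a: fisher_ball_worst_case(F_diag, sigma_alpha(F_diag, a, K)) for a in alphas}
best_alpha = min(curve, key=curve.get)
for a in alphas:
    print(f"alpha={a:4.2f}  worst-case signal={curve[a]:.2f}")
print("Fisher-ball optimum at alpha =", best_alpha)
```

Against the idealized Fisher-ball adversary the minimum always lands at $\alpha = 1$ with value $\rho^2 d/(2\mathcal{K})$; the departures reported above arise because the empirical adjacency set is a strict subset of that ball.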

### Q.12 $G_{\mathrm{Mah}}$ as inverse-squared matrix fidelity

The predictive scalar $G_{\mathrm{Mah}}$ has a clean interpretation in terms of the matrix fidelity between the normalized Fisher and normalized margin geometries.

###### Theorem 16 (Fidelity identity for $G_{\mathrm{Mah}}$).

Let $\rho_F = F_\lambda/\mathrm{tr}(F_\lambda)$ and $\rho_S = S_\rho/\mathrm{tr}(S_\rho)$ be the trace-normalized Fisher and margin covariances, and let

$$\mathcal{F}(\rho_F,\rho_S) \;=\; \mathrm{tr}\,\sqrt{\rho_F^{1/2}\,\rho_S\,\rho_F^{1/2}}$$

denote their matrix fidelity. Then

$$G_{\mathrm{Mah}} \;=\; \frac{1}{\mathcal{F}(\rho_F,\rho_S)^2}.$$

###### Proof\.

By definition (Equation ([4](https://arxiv.org/html/2605.24042#S5.E4))),

$$G_{\mathrm{Mah}} \;=\; \frac{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}{\left[\mathrm{tr}\!\left((F_\lambda^{1/2}S_\rho F_\lambda^{1/2})^{1/2}\right)\right]^2}.$$

Substituting $F_\lambda = \mathrm{tr}(F_\lambda)\,\rho_F$ and $S_\rho = \mathrm{tr}(S_\rho)\,\rho_S$ into the denominator,

$$\mathrm{tr}\!\left((F_\lambda^{1/2}S_\rho F_\lambda^{1/2})^{1/2}\right) = \sqrt{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}\cdot\mathrm{tr}\!\left((\rho_F^{1/2}\rho_S\rho_F^{1/2})^{1/2}\right) = \sqrt{\mathrm{tr}(F_\lambda)\,\mathrm{tr}(S_\rho)}\cdot\mathcal{F}(\rho_F,\rho_S).$$

Squaring the denominator and cancelling the trace product gives $G_{\mathrm{Mah}} = 1/\mathcal{F}(\rho_F,\rho_S)^2$. ∎

Theorem [16](https://arxiv.org/html/2605.24042#Thmproposition16) reframes $G_{\mathrm{Mah}}$ as a geometric quantity: it is large when the normalized Fisher and normalized margin covariances are far apart in the Bures-Wasserstein sense, and equal to one when they coincide. The empirically observed range $G_{\mathrm{Mah}} \in [1.7, 9.3]$ corresponds to fidelity values $\mathcal{F} \in [0.33, 0.77]$ on the tested models.
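The fidelity identity of Theorem 16 can be confirmed numerically. The sketch below computes $G_{\mathrm{Mah}}$ from the definition quoted in the proof and the fidelity from the trace-normalized matrices; the random positive-definite `F` and `S` are stand-ins for the Fisher and margin covariances, and the helper names are hypothetical.

```python
import numpy as np

def sym_sqrt(A):
    """Square root of a symmetric positive-semidefinite matrix via eigh."""
    w, Q = np.linalg.eigh(A)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def trace_sqrt_product(F, S):
    """tr( (F^{1/2} S F^{1/2})^{1/2} )."""
    F_half = sym_sqrt(F)
    return np.trace(sym_sqrt(F_half @ S @ F_half))

def g_mah(F, S):
    """G_Mah = tr(F) tr(S) / [tr((F^{1/2} S F^{1/2})^{1/2})]^2."""
    return np.trace(F) * np.trace(S) / trace_sqrt_product(F, S) ** 2

def fidelity(F, S):
    """Matrix fidelity of the trace-normalized matrices rho_F and rho_S."""
    return trace_sqrt_product(F / np.trace(F), S / np.trace(S))

rng = np.random.default_rng(4)
A, B = rng.standard_normal((2, 5, 5))
F = A @ A.T + 0.1 * np.eye(5)      # synthetic Fisher matrix
S = B @ B.T + 0.1 * np.eye(5)      # synthetic margin covariance

print("G_Mah         :", g_mah(F, S))
print("1 / fidelity^2:", 1.0 / fidelity(F, S) ** 2)
print("G_Mah(F, F)   :", g_mah(F, F))
```

The first two printed values agree to rounding, and the last is one, matching the statement that $G_{\mathrm{Mah}} = 1$ when the two normalized geometries coincide.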

### Q.13 Projector-separation lower bound for $G_{\mathrm{Mah}}$

A useful corollary of Theorem [16](https://arxiv.org/html/2605.24042#Thmproposition16) bounds $G_{\mathrm{Mah}}$ from below using the projector-mass quantities $E_k = \mathrm{tr}(P_B\rho_F)$ and $q_B = \mathrm{tr}(P_B\rho_S)$ that already appear in the paper's three-axis framework.

###### Theorem 17 (Projector-separation lower bound).

For any orthogonal projector $P$ on $\mathbb{R}^d$,

$$G_{\mathrm{Mah}} \;\geq\; \frac{1}{\left(\sqrt{\mathrm{tr}(P\rho_F)\,\mathrm{tr}(P\rho_S)} + \sqrt{\mathrm{tr}((I-P)\rho_F)\,\mathrm{tr}((I-P)\rho_S)}\right)^2}.$$

In particular, taking $P = P_B$ and writing $E_k = \mathrm{tr}(P_B\rho_F)$, $q_B = \mathrm{tr}(P_B\rho_S)$,

$$G_{\mathrm{Mah}} \;\geq\; \frac{1}{\left(\sqrt{E_k q_B} + \sqrt{(1-E_k)(1-q_B)}\right)^2}.$$

###### Proof\.

The two-element binary measurement $\{P, I-P\}$ acts on $\rho_F$ and $\rho_S$ as a quantum-to-classical channel, producing distributions $p = (a, 1-a)$ and $q = (b, 1-b)$ with $a = \mathrm{tr}(P\rho_F)$ and $b = \mathrm{tr}(P\rho_S)$. The matrix fidelity is monotone non-decreasing under such measurements (data-processing inequality for fidelity, e.g. Nielsen and Chuang Chapter 9), so the classical fidelity of the post-measurement distributions upper-bounds the original matrix fidelity:

$$\mathcal{F}(\rho_F,\rho_S) \;\leq\; \sqrt{ab} + \sqrt{(1-a)(1-b)}.$$

Substituting into Theorem [16](https://arxiv.org/html/2605.24042#Thmproposition16) gives the bound. Specializing to $P = P_B$ uses the definitions of $E_k$ and $q_B$ directly. ∎

The projector-separation form gives a quantitative criterion for when a model can support a large $G_{\mathrm{Mah}}$: $G_{\mathrm{Mah}} > 10$ is guaranteed whenever $E_k \geq 1-\varepsilon_F$ and $q_B \leq \varepsilon_S$ with $\sqrt{\varepsilon_F}+\sqrt{\varepsilon_S} < 1/\sqrt{10} \approx 0.316$, because then $\sqrt{E_k q_B} \leq \sqrt{\varepsilon_S}$ and $\sqrt{(1-E_k)(1-q_B)} \leq \sqrt{\varepsilon_F}$. The five tested models do not reach this regime: the smallest observed $\sqrt{\varepsilon_F}+\sqrt{\varepsilon_S}$ is approximately $0.36$ on Qwen3-14B at layer 10, giving a guaranteed lower bound of $G_{\mathrm{Mah}} \geq 7.7$, which is consistent with the measured value of $9.3$ given the slack of the bound. Designing an architecture in the $\sqrt{\varepsilon_F}+\sqrt{\varepsilon_S} < 0.316$ regime requires concentrating the Fisher mass and depleting the margin mass in the same low-dimensional subspace, a constructive direction realized in Appendix [Y](https://arxiv.org/html/2605.24042#A25), where a Split-Memory Transformer trained from scratch attains $G_{\mathrm{Mah}}$ in the range $20$–$33$ at three probe layers.
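The arithmetic behind the $7.7$ figure and the $0.316$ threshold can be reproduced with a short standard-library sketch. The text reports only the sum $\sqrt{\varepsilon_F}+\sqrt{\varepsilon_S} \approx 0.36$; the even split into $\varepsilon_F = \varepsilon_S = 0.0324$ used below is a hypothetical choice made purely so that both forms of the bound can be evaluated.

```python
import math

def projector_bound(E_k, q_B):
    """Theorem 17 with P = P_B: 1 / (sqrt(E_k q_B) + sqrt((1-E_k)(1-q_B)))^2."""
    f_upper = math.sqrt(E_k * q_B) + math.sqrt((1.0 - E_k) * (1.0 - q_B))
    return 1.0 / f_upper**2

def epsilon_bound(eps_F, eps_S):
    """Looser bound 1 / (sqrt(eps_F) + sqrt(eps_S))^2 used in the text."""
    return 1.0 / (math.sqrt(eps_F) + math.sqrt(eps_S)) ** 2

eps_F = eps_S = 0.0324             # hypothetical split, sqrt sum = 0.36
loose = epsilon_bound(eps_F, eps_S)
tight = projector_bound(1.0 - eps_F, eps_S)
threshold = 1.0 / math.sqrt(10.0)  # sqrt sum needed to guarantee G_Mah > 10

print(f"epsilon bound   : {loose:.2f}")
print(f"projector bound : {tight:.2f}")
print(f"threshold       : {threshold:.3f}")
```

The looser $\varepsilon$ form depends only on the sum of square roots, which is why it can be quoted without knowing the individual masses; the full projector form is never smaller.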

## Appendix R Predictive Quotient Release Theory

The Gaussian impossibility result of Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) says that no full-state Gaussian release can fill the moderate-utility, moderate-privacy region of the Pareto frontier at $O(1)$ utility budget: some direction in the Fisher ball always remains exponentially distinguishable in $d$. The path out is to leave the full-state Gaussian release class. This appendix develops the theory for releasing a low-dimensional learned latent that is locally sufficient for behavior, and gives a Fano-style lower bound on the attacker's recovery error whose dependence is on the latent dimension $r$ rather than the hidden width $d$.

### R.1 Local quotient factorization

Let $f_\theta : \mathcal{X}^T \to \mathbb{R}^{|V|}$ be a frozen autoregressive language model and let $h_\ell : \mathcal{X}^T \to \mathbb{R}^d$ be the hidden state at layer $\ell$. For a fixed continuation horizon $H$, the behavior map is the smooth function

$$b_H : \mathbb{R}^d \to \mathbb{R}^m, \qquad b_H(h) \;=\; \big(\,\mathrm{logits}_t(\,f_\theta\,;\,h)\,\big)_{t=1}^{H},$$

where the logits at horizon $t$ are computed by injecting $h$ at layer $\ell$ and continuing the frozen forward pass for $t$ steps under teacher forcing, and $m = H|V|$. We assume $b_H$ is $C^1$ on an open neighbourhood $U \subset \mathbb{R}^d$ of every hidden state we will encounter.

###### Theorem 18 (Local quotient factorization).

Let $b_H : U \to \mathbb{R}^m$ be $C^1$ and assume $\mathrm{rank}\,Db_H(h) = r$ for all $h \in U$. Then for every $h_0 \in U$ there exist an open neighbourhood $V_0 \ni h_0$, $V_0 \subseteq U$, smooth submersions $\phi : V_0 \to \mathbb{R}^r \times \mathbb{R}^{d-r}$ written $\phi(h) = (u(h), v(h))$, and a smooth map $g : \mathbb{R}^r \to \mathbb{R}^m$ such that

$$b_H(h) \;=\; g(u(h)) \quad \text{for all } h \in V_0.$$

###### Proof\.

Apply the constant-rank theorem (e.g. Lee, *Introduction to Smooth Manifolds*, Theorem 4.12). Because $Db_H$ has constant rank $r$ on $U$, there exist a diffeomorphism $\phi : V_0 \to \tilde{V}_0 \subseteq \mathbb{R}^d$ with coordinates $(u, v)$ and a diffeomorphism $\psi$ on the codomain such that $\psi \circ b_H \circ \phi^{-1}$ is the canonical projection $(u, v) \mapsto (u, 0)$. Setting $g(u) = \psi^{-1}(u, 0)$ and reading off the first factor gives $b_H(h) = g(u(h))$. ∎

The map $u : V_0 \to \mathbb{R}^r$ is the local predictive quotient: it is the smallest object on which behavior can locally depend, and any variation of $h$ along $v$ (that is, within the level sets of $u$) cannot affect $b_H$. The full-state hidden representation $h$ contains $u$ together with $d - r$ irrelevant directions for the chosen behavior, so a defender who releases $u$ rather than $h$ is releasing exactly the predictive content. We do not assume the global rank of $Db_H$ is constant, only that locally around every typical hidden state the rank is bounded by $r$; in practice the encoder $q_\phi$ in the predictive-quotient mechanism (Section [4](https://arxiv.org/html/2605.24042#S4)) is trained to produce a learned approximation of a sufficient $r$-dimensional latent, and the empirical question is how small $r$ can be made while preserving downstream KL.

### R.2 Utility cost and information leakage

Suppose the defender releases the perturbed quotient

$$Z \;=\; q(H) + \eta, \qquad \eta \sim \mathcal{N}(0, \sigma^2 I_r),$$

where $q(H) \in \mathbb{R}^r$ is a sufficient quotient coordinate for $b_H$ in the sense of Theorem [18](https://arxiv.org/html/2605.24042#Thmproposition18). Let $X$ be the discrete random prompt, $H = h(X)$ the induced hidden-state random variable, and $F_q$ the Fisher information of $b_H$ with respect to $q$.

###### Theorem 19 (Utility cost and information leakage).

Under the regularity conditions above:

1. The expected behavior KL between releasing $H$ and releasing $Z$ admits the local expansion
   $$\mathbb{E}\,\mathrm{KL}\!\left(b_H(H)\,\Big\|\,b_H(\hat{H}(Z))\right) \;=\; \frac{\sigma^2}{2}\,\mathrm{tr}(F_q) + O(\sigma^3),$$
   where $\hat{H}$ is any consistent decoder of $H$ from $Z$.
2. If $\mathrm{Cov}(q(H)) \preceq \Lambda$, then
   $$I(X;Z) \;\leq\; \frac{1}{2}\log\det\!\left(I_r + \frac{\Lambda}{\sigma^2}\right) \;\leq\; \frac{r}{2}\log\!\left(1 + \frac{\mathrm{tr}\,\Lambda}{r\,\sigma^2}\right).$$
3. For any attacker $\hat{X}$ that estimates $X$ from $Z$, the exact-match error obeys Fano's inequality:
   $$\Pr[\hat{X} \neq X] \;\geq\; 1 - \frac{I(X;Z) + \log 2}{H(X)}.$$

###### Proof\.

Part (1) is the second-order KL expansion in quotient coordinates: writing $b_H(\hat{H}(Z)) = b_H(H) + Db_H \cdot D\hat{H} \cdot \eta + O(\sigma^2)$ and substituting into $\mathrm{KL}(p\,\|\,q) = \tfrac{1}{2}(p-q)^{\top}F(p-q) + O(\|p-q\|^3)$ gives the leading term $\tfrac{\sigma^2}{2}\mathrm{tr}(F_q)$ after taking expectation over $\eta$. Part (2) is the Gaussian-channel mutual information bound with covariance constraint: $I(q(H);Z) \leq \tfrac{1}{2}\log\det(I_r + \mathrm{Cov}(q(H))/\sigma^2)$ by the Gaussian-channel capacity theorem (Cover and Thomas, Theorem 9.1.1) applied to a parallel additive-Gaussian channel; the second inequality is concavity of $\log\det$ and Jensen's inequality. Since $X \to q(H) \to Z$ is a Markov chain, $I(X;Z) \leq I(q(H);Z)$ by data processing. Part (3) is Fano's inequality for the discrete-input channel $X \to Z$ (Cover and Thomas, Theorem 2.10.1). ∎

The structural force of Theorem [19](https://arxiv.org/html/2605.24042#Thmproposition19) is the absence of $d$ from the leakage bound. For the full-state Gaussian release of Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) with isotropic $\Sigma = \sigma^2 I$, the analogous mutual-information bound is $I(X;Z) \leq \tfrac{d}{2}\log(1 + \mathrm{tr}(\mathrm{Cov}(H))/(d\sigma^2))$, which scales with the hidden width and forces the empty middle. For the quotient release the bound scales with $r$, so the moderate-both region opens up for any $r \ll d$ at noise scales sufficient to drive $\sigma^2 \gtrsim \mathrm{tr}\,\Lambda/r$.

A sufficient condition for both moderate utility and moderate privacy is therefore

$$\frac{\sigma^2}{2}\,\mathrm{tr}(F_q) \;\ll\; 1 \quad \text{and} \quad \frac{r}{2}\log\!\left(1 + \frac{\mathrm{tr}\,\Lambda}{r\,\sigma^2}\right) \;\ll\; H(X).$$

The Gaussian release at $r = d$ cannot satisfy both because $\mathrm{tr}(F_q)$ is replaced by $\mathrm{tr}(F)$ which scales with $d$; a learned quotient with $r \approx 16$–$64$ on hidden widths $d \in [768, 5120]$ can in principle satisfy both at moderate $\sigma$.
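To make the role of the latent dimension concrete, the following standard-library sketch evaluates the leakage bound of Theorem 19 (part 2) and the resulting Fano bound (part 3) for a narrow and a wide release. All numbers are hypothetical: a per-coordinate signal-to-noise ratio of one and a uniform prior over $10^{12}$ prompts are assumed purely for illustration, and quantities are in nats.

```python
import math

def leakage_bound_nats(dim, trace_lambda, sigma2):
    """(dim / 2) * log(1 + tr(Lambda) / (dim * sigma^2)), in nats."""
    return 0.5 * dim * math.log1p(trace_lambda / (dim * sigma2))

def fano_error_lower_bound(mi_nats, entropy_nats):
    """1 - (I + log 2) / H(X), clamped at 0 where the bound is vacuous."""
    return max(0.0, 1.0 - (mi_nats + math.log(2.0)) / entropy_nats)

entropy = math.log(1e12)   # hypothetical uniform prior over 10^12 prompts
sigma2 = 1.0

# Same per-coordinate SNR of 1 for both releases: tr(Lambda) = dim * sigma^2.
results = {}
for name, dim in {"quotient r=32": 32, "full state d=4096": 4096}.items():
    mi = leakage_bound_nats(dim, trace_lambda=dim * sigma2, sigma2=sigma2)
    results[name] = (mi, fano_error_lower_bound(mi, entropy))
    print(f"{name:18s} leakage <= {mi:8.1f} nats, "
          f"attacker error >= {results[name][1]:.3f}")
```

Under these assumptions the narrow release keeps the leakage bound below the prompt entropy, so Fano's inequality guarantees a nontrivial attacker error, whereas at full width the same noise level gives a vacuous guarantee.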

### R.3 Sequential exact-match lower bound

The injectivity result of Nikolaou et al. \[[2026](https://arxiv.org/html/2605.24042#bib.bib1)\] establishes that a noiseless hidden activation determines the prompt uniquely, with a constructive sequential decoder (SipIt). Under a Gaussian release the attacker's effective reconstruction problem becomes a noisy version of that sequential decoding, and the resulting exact-match probability is governed by per-step Mahalanobis margins.

Fix a layer $\ell$ and noise covariance $\Sigma_{t}$ at step $t$. Let $h_{t}(\pi\oplus v)\in\mathbb{R}^{d}$ be the hidden state at step $t$ obtained by appending token $v$ to prefix $\pi$. The released observation is $z_{t}=h_{t}(\pi\oplus y)+\xi_{t}$ with $\xi_{t}\sim\mathcal{N}(0,\Sigma_{t})$. Define the one-step Mahalanobis margin at the true continuation $y$ given prefix $\pi$ as

$$m_{t}(\pi)\;=\;\min_{v\neq y}\;\big\|h_{t}(\pi\oplus y)-h_{t}(\pi\oplus v)\big\|_{\Sigma_{t}^{-1}}.$$

###### Theorem 20 (Sequential exact-match lower bound).

For the Bayes-optimal sequential MAP decoder operating on the released sequence $z_{1:T}$,

$$\Pr\!\left[\hat{y}_{t}\neq y_{t}\,\big|\,\hat{y}_{<t}=y_{<t}\right]\;\leq\;(|V|-1)\,\exp\!\left(-\frac{m_{t}(\pi_{t})^{2}}{8}\right),$$

where $\pi_{t}=(y_{1},\ldots,y_{t-1})$ is the realized true prefix at step $t$. Consequently the exact-sequence recovery probability satisfies

$$\Pr[\hat{x}=x]\;\geq\;\prod_{t=1}^{T}\Big[1-(|V|-1)\,\exp\!\left(-\,m_{t}(\pi_{t})^{2}/8\right)\Big]_{+}.$$

###### Proof\.

For any wrong continuation $v\neq y$ at step $t$, the pairwise Bayes error of distinguishing $\mathcal{N}(h_{t}(\pi_{t}\oplus y),\Sigma_{t})$ from $\mathcal{N}(h_{t}(\pi_{t}\oplus v),\Sigma_{t})$ is $\Phi(-\tfrac{1}{2}\|h_{t}(\pi_{t}\oplus y)-h_{t}(\pi_{t}\oplus v)\|_{\Sigma_{t}^{-1}})$. Every pairwise distance is at least $m_{t}$ and $\Phi(-x)$ is decreasing in $x$, so the Gaussian tail bound $\Phi(-x)\leq\tfrac{1}{2}\exp(-x^{2}/2)\leq\exp(-x^{2}/2)$ for $x\geq 0$, applied with $x=m_{t}/2$, gives the per-pair bound $\exp(-m_{t}^{2}/8)$. Union-bounding over the $|V|-1$ wrong tokens gives the conditional one-step error bound. Multiplying conditional success probabilities along the true path and using $\Pr[\hat{x}=x]=\prod_{t}\Pr[\hat{y}_{t}=y_{t}\mid\hat{y}_{<t}=y_{<t}]$ for the MAP decoder yields the product bound. ∎
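The product bound of Theorem 20 is simple to evaluate once per-step margins are known. The following is a minimal sketch; the function name `exact_match_lower_bound` and the example margin values are hypothetical, chosen only to show the two regimes of the bound.

```python
import math

def exact_match_lower_bound(margins, vocab_size):
    """Lower bound on Pr[x_hat = x] from per-step Mahalanobis margins m_t."""
    bound = 1.0
    for m in margins:
        # Union bound on the one-step error: (|V| - 1) * exp(-m_t^2 / 8)
        step_error = (vocab_size - 1) * math.exp(-m * m / 8.0)
        # [.]_+ clipping: a vacuous step makes the whole product zero
        bound *= max(0.0, 1.0 - step_error)
    return bound

# Hypothetical margins over a 10-token sequence with GPT-2's vocabulary size.
high = exact_match_lower_bound([12.0] * 10, vocab_size=50257)
low = exact_match_lower_bound([9.0] * 10, vocab_size=50257)
print(high, low)
```

With the assumed margins, $m_t=12$ keeps the bound close to one, while $m_t=9$ makes every factor clip to zero, so the bound says nothing about the attacker in that regime.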

###### Corollary 21 (Multi-view margin additivity).

If the attacker observes $L$ conditionally independent views $z_{t}^{(1)},\ldots,z_{t}^{(L)}$ of the same hidden state at step $t$, with covariances $\Sigma_{t}^{(\ell)}$, the effective margin is

$$m_{t,\mathrm{multi}}(\pi)^{2}\;=\;\sum_{\ell=1}^{L}\big\|h_{t}(\pi\oplus y)-h_{t}(\pi\oplus v)\big\|_{\Sigma_{t}^{(\ell)\,-1}}^{2}\quad\text{at the closest }v\neq y.$$

The exact-match lower bound of Theorem [20](https://arxiv.org/html/2605.24042#Thmproposition20) applies with $m_{t}$ replaced by $m_{t,\mathrm{multi}}$.

###### Proof\.

Under conditional independence, the joint log-likelihood ratio is the sum of single-view log-likelihood ratios, so the effective squared Mahalanobis distance is the sum of per-view squared distances. The per-pair Bayes error is then at most $\exp(-m_{t,\mathrm{multi}}^{2}/8)$ and the rest of the argument is unchanged. ∎
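Margin additivity can be sketched in a few lines: squared per-view margins add, so equal views accumulate margin at a square-root rate. The helper names and the single-view margin and target threshold below are hypothetical illustrations, not measured values.

```python
import math

def multi_view_margin(per_view_margins):
    """Effective margin when squared per-view Mahalanobis margins add."""
    return math.sqrt(sum(m * m for m in per_view_margins))

def views_needed(single_view_margin, threshold):
    """Number of equal-margin independent views needed to reach a threshold."""
    return math.ceil((threshold / single_view_margin) ** 2)

# Hypothetical: each view contributes margin 3.0 and the target margin is 10.25.
n = views_needed(3.0, 10.25)
print(n, multi_view_margin([3.0] * n), multi_view_margin([3.0] * (n - 1)))
```

Under these assumptions 12 views are needed: 11 views still fall short of the target, and the twelfth crosses it.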

The corollary explains why a learned inverter with single-view final-token input can record exact-match $0$ even when reconstruction is information-theoretically possible at multi-view input: the per-step margin must satisfy $m_{t}^{2}\gtrsim 8\log|V|$ for the exact-match guarantee to be nontrivial, and a single noised final-state view is typically below this threshold on tested defenses, while a multi-layer or multi-position release sums per-view margins and crosses the threshold. The constructive consequence is that the strongest attacker against a quotient-release mechanism should observe the entire token-wise latent sequence $(\tilde{z}_{1},\ldots,\tilde{z}_{T})$ rather than a single vector.

###### Corollary 22 (Vocabulary-margin threshold).

For per-step error at most $\delta$ at every step, it suffices that $m_{t}\geq\sqrt{8\log\tfrac{|V|-1}{\delta}}$. For GPT-2's $|V|=50{,}257$, the threshold values are $m_{t}\approx 10.25$ for $\delta=0.1$, $11.11$ for $\delta=0.01$, and $11.91$ for $\delta=0.001$.
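The quoted thresholds follow from solving $(|V|-1)\exp(-m_t^2/8)\leq\delta$ for $m_t$. A short check of the arithmetic, with `margin_threshold` as a hypothetical helper name:

```python
import math

def margin_threshold(vocab_size, delta):
    """Smallest m_t with (|V| - 1) * exp(-m_t^2 / 8) <= delta."""
    return math.sqrt(8.0 * math.log((vocab_size - 1) / delta))

thresholds = {delta: margin_threshold(50257, delta) for delta in (0.1, 0.01, 0.001)}
for delta, m in thresholds.items():
    print(delta, round(m, 2))
```

Because the dependence on $\delta$ is only through $\sqrt{\log(1/\delta)}$, tightening the per-step error by a factor of 100 moves the threshold by less than two units of margin.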

This threshold makes the empirical sequence-inverter result of Section [7](https://arxiv.org/html/2605.24042#S7) read as a direct theorem prediction: a single noised last-token view sits below $m_{t}\approx 10$ on every tested defense and reaches exact-match zero, while the full token-wise trajectory adds per-view margins via Corollary [21](https://arxiv.org/html/2605.24042#Thmproposition21) and crosses the threshold to deliver the measured $94\%$ exact-match recovery on clean GPT-2 hidden states.

## Appendix S Derivation of the optimal defense covariance

This appendix gives the full derivation of the inference-time defense covariance against two attacker models. Section [4](https://arxiv.org/html/2605.24042#S4) stated the rank-one closed-form optimum $\Sigma^{\star}_{\mathrm{Euc}}$ for the Euclidean $\ell_{2}$ retrieval attacker, and introduced the rank-$k$ generalized-eigen mechanism $\Sigma_{\mathrm{GE},k}$ as a separate design choice for the empirical Pareto sweep. We derive both here. A covariance-aware adaptive attacker should instead use a Mahalanobis metric shaped by $\Sigma$; under that attacker the correct defense is a different covariance $\Sigma^{\star}_{\mathrm{Mah}}$ derived in Section [S](https://arxiv.org/html/2605.24042#A19).3.

### S.1 Euclidean-attacker objective: rank-one closed form and rank-$k$ heuristic

Consider a defender who adds mean-zero Gaussian noise $\xi\sim\mathcal{N}(0,\Sigma_{\xi})$ with $\Sigma_{\xi}\succeq 0$. Leading-order utility cost is $\tfrac{1}{2}\mathrm{tr}(F\Sigma_{\xi})+O(\sigma^{3})$ by ([1](https://arxiv.org/html/2605.24042#S3.E1)). The retrieval-attack signal, for a Euclidean ($\ell_{2}$) attacker ranking candidates by $\|\tilde{h}-h_{c}\|_{2}$, degrades in proportion to $\mathrm{tr}(\Sigma_{\delta}\Sigma_{\xi})$ for the noise component along margin directions (the inter-prefix distance spread squared). Maximizing privacy per unit utility gives

$$\max_{\Sigma_{\xi}\succeq 0}\mathrm{tr}(\Sigma_{\delta}\Sigma_{\xi})\quad\text{s.t.}\quad\tfrac{1}{2}\,\mathrm{tr}(F\Sigma_{\xi})\leq\mathcal{K}.\tag{16}$$

###### Proposition 23 (Rank-one Euclidean optimum).

Let $F\succ 0$ and $\Sigma_{\delta}\succeq 0$. The optimum of equation ([16](https://arxiv.org/html/2605.24042#A19.E16)) is rank-one and given by

$$\Sigma^{\star}_{\mathrm{Euc}}=2\mathcal{K}\,v_{1}v_{1}^{\top},\qquad v_{1}=F^{-1/2}u_{1},\qquad u_{1}=\arg\max_{\|u\|=1}u^{\top}F^{-1/2}\Sigma_{\delta}F^{-1/2}u,\qquad v_{1}^{\top}Fv_{1}=1,\tag{17}$$

with optimal value $\mathrm{tr}(\Sigma_{\delta}\Sigma^{\star}_{\mathrm{Euc}})=2\mathcal{K}\lambda_{1}$, where $\lambda_{1}$ is the top generalized eigenvalue of $\Sigma_{\delta}v=\lambda Fv$.

###### Proof\.

Change variables $W=F^{1/2}\Sigma_{\xi}F^{1/2}\succeq 0$, so $\Sigma_{\xi}=F^{-1/2}WF^{-1/2}$. The KL-budget constraint $\tfrac{1}{2}\mathrm{tr}(F\Sigma_{\xi})\leq\mathcal{K}$ becomes $\mathrm{tr}(W)\leq 2\mathcal{K}$ and the objective becomes $\mathrm{tr}(\Sigma_{\delta}F^{-1/2}WF^{-1/2})=\mathrm{tr}(MW)$ with $M:=F^{-1/2}\Sigma_{\delta}F^{-1/2}\succeq 0$. Maximizing $\mathrm{tr}(MW)$ over $W\succeq 0$ with $\mathrm{tr}(W)\leq 2\mathcal{K}$ is the standard PSD trace LP: by Ky Fan's theorem, $\mathrm{tr}(MW)\leq 2\mathcal{K}\,\lambda_{\max}(M)$, with the bound attained at the rank-one extreme point $W^{\star}=2\mathcal{K}\,u_{1}u_{1}^{\top}$ where $u_{1}$ is a top eigenvector of $M$.

Translating back: $W^{\star}=2\mathcal{K}\,u_{1}u_{1}^{\top}$ gives $\Sigma^{\star}_{\mathrm{Euc}}=2\mathcal{K}\,F^{-1/2}u_{1}u_{1}^{\top}F^{-1/2}=2\mathcal{K}\,v_{1}v_{1}^{\top}$ with $v_{1}=F^{-1/2}u_{1}$. Since $\|u_{1}\|=1$, $v_{1}^{\top}Fv_{1}=u_{1}^{\top}u_{1}=1$. Finally $\lambda_{\max}(M)=\lambda_{1}$ because $u^{\top}Mu=(F^{-1/2}u)^{\top}\Sigma_{\delta}(F^{-1/2}u)$ and the substitution $v=F^{-1/2}u$ identifies the Rayleigh quotient with the generalized eigenproblem $\Sigma_{\delta}v=\lambda Fv$ under the constraint $v^{\top}Fv=1$. ∎
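The construction in the proof can be checked numerically. The sketch below follows the same change of variables on a small random instance; the matrices, dimension, and budget are synthetic stand-ins (not model Fisher or margin covariances), and `rank_one_euclidean_optimum` is a hypothetical helper name.

```python
import numpy as np

def rank_one_euclidean_optimum(F, Sigma_delta, K):
    """Rank-one maximizer of tr(Sigma_delta S) s.t. 0.5 * tr(F S) <= K."""
    # F^{-1/2} from the eigendecomposition of the symmetric PD matrix F
    w, Q = np.linalg.eigh(F)
    F_inv_sqrt = (Q / np.sqrt(w)) @ Q.T
    # M = F^{-1/2} Sigma_delta F^{-1/2}, symmetrized against round-off
    M = F_inv_sqrt @ Sigma_delta @ F_inv_sqrt
    evals, evecs = np.linalg.eigh(0.5 * (M + M.T))
    lam1, u1 = evals[-1], evecs[:, -1]   # eigh returns ascending eigenvalues
    v1 = F_inv_sqrt @ u1                 # satisfies v1^T F v1 = 1
    return 2.0 * K * np.outer(v1, v1), lam1

# Synthetic instance (assumed values, for illustration only).
rng = np.random.default_rng(0)
d, K = 6, 0.5
A = rng.standard_normal((d, d))
F = A @ A.T + 0.1 * np.eye(d)
B = rng.standard_normal((d, d))
Sigma_delta = B @ B.T

Sigma_star, lam1 = rank_one_euclidean_optimum(F, Sigma_delta, K)
fisher_cost = 0.5 * np.trace(F @ Sigma_star)          # should equal K
objective = np.trace(Sigma_delta @ Sigma_star)        # should equal 2 K lambda_1
iso_objective = 2.0 * K * np.trace(Sigma_delta) / np.trace(F)
print(fisher_cost, objective, 2.0 * K * lam1, iso_objective)
```

On this instance the Fisher cost saturates the budget, the objective matches $2\mathcal{K}\lambda_1$, and the isotropic release at the same budget achieves a smaller objective.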

The Pareto ratio over isotropic noise at matched utility budget $\mathcal{K}$ is therefore $G_{\mathrm{Euc},1}=\lambda_{1}/\bar{\lambda}$ with $\bar{\lambda}=\mathrm{tr}(\Sigma_{\delta})/\mathrm{tr}(F)$, large exactly when $F$ is concentrated and $\Sigma_{\delta}$ has mass outside the top Fisher eigendirection.

#### Rank-$k$ generalized-eigen mechanism

The rank-one optimum $\Sigma^{\star}_{\mathrm{Euc}}$ places the entire noise budget on a single direction, which is both maximally rank-deficient and immediately defeated by the covariance-aware attacker of Proposition [9](https://arxiv.org/html/2605.24042#Thmproposition9). For empirical exploration we instead use the rank-$k$ generalized-eigen mechanism with equal-Fisher-budget allocation across the top-$k$ generalized eigenvectors,

$$\Sigma_{\mathrm{GE},k}(\mathcal{K})=\frac{2\mathcal{K}}{k}\sum_{i=1}^{k}v_{i}v_{i}^{\top},\qquad\Sigma_{\delta}v_{i}=\lambda_{i}Fv_{i},\qquad v_{i}^{\top}Fv_{i}=1.\tag{18}$$

The Fisher cost is $\tfrac{1}{2}\mathrm{tr}(F\,\Sigma_{\mathrm{GE},k})=(\mathcal{K}/k)\sum_{i}v_{i}^{\top}Fv_{i}=\mathcal{K}$, matching the budget; the privacy objective is $\mathrm{tr}(\Sigma_{\delta}\Sigma_{\mathrm{GE},k})=(2\mathcal{K}/k)\sum_{i=1}^{k}\lambda_{i}$, giving Pareto ratio $G_{\mathrm{Euc},k}=(\tfrac{1}{k}\sum_{i=1}^{k}\lambda_{i})/\bar{\lambda}\leq G_{\mathrm{Euc},1}$. Spreading is suboptimal for equation ([16](https://arxiv.org/html/2605.24042#A19.E16)), strictly so whenever $\lambda_{k}<\lambda_{1}$; we use it because (i) it spreads the utility cost across the top-$k$ low-utility directions instead of concentrating it on $v_{1}$, (ii) it gives a richer empirical Pareto frontier for the four-mechanism comparison in Section [4](https://arxiv.org/html/2605.24042#S4), and (iii) the released covariance has rank $k$ instead of rank one, so against a non-adaptive $\ell_{2}$ attacker it offers more directions for noise to act in. None of these advantages survive Proposition [9](https://arxiv.org/html/2605.24042#Thmproposition9): any rank-deficient release, including both $\Sigma^{\star}_{\mathrm{Euc}}$ and $\Sigma_{\mathrm{GE},k}$ for $k<d$, is broken by an attacker who knows $\Sigma$.
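The rank-$k$ mechanism of equation (18) and its Pareto ratio can be sketched the same way. The instance below is synthetic (random PSD matrices, assumed dimension and budget), and `generalized_eigen_mechanism` is a hypothetical helper name rather than the authors' implementation.

```python
import numpy as np

def generalized_eigen_mechanism(F, Sigma_delta, K, k):
    """Equal-budget rank-k release over the top-k generalized eigenvectors."""
    w, Q = np.linalg.eigh(F)
    F_inv_sqrt = (Q / np.sqrt(w)) @ Q.T
    M = F_inv_sqrt @ Sigma_delta @ F_inv_sqrt
    evals, evecs = np.linalg.eigh(0.5 * (M + M.T))
    order = np.argsort(evals)[::-1][:k]       # indices of the k largest lambda_i
    V = F_inv_sqrt @ evecs[:, order]          # columns v_i with v_i^T F v_i = 1
    Sigma = (2.0 * K / k) * (V @ V.T)
    return Sigma, evals[order]

# Synthetic instance (assumed values, for illustration only).
rng = np.random.default_rng(1)
d, K = 8, 0.5
A = rng.standard_normal((d, d))
F = A @ A.T + 0.1 * np.eye(d)
B = rng.standard_normal((d, d))
Sigma_delta = B @ B.T
lam_bar = np.trace(Sigma_delta) / np.trace(F)

Sigma_3, lams_3 = generalized_eigen_mechanism(F, Sigma_delta, K, 3)
cost_3 = 0.5 * np.trace(F @ Sigma_3)              # Fisher cost, should equal K
objective_3 = np.trace(Sigma_delta @ Sigma_3)     # should equal (2K/3) * sum(lams_3)

# Pareto ratio G_{Euc,k} for k = 1..d; averaging over more eigenvalues can only lower it.
gains = [generalized_eigen_mechanism(F, Sigma_delta, K, k)[1].mean() / lam_bar
         for k in range(1, d + 1)]
print(cost_3, objective_3, gains)
```

The printed gains are non-increasing in $k$, which is the inequality $G_{\mathrm{Euc},k}\leq G_{\mathrm{Euc},1}$ seen from the averaging of the top-$k$ generalized eigenvalues.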

### S.2 Adaptive-attacker objective: Bayes-optimal retrieval

The Euclidean objective assumes the attacker uses raw $\ell_{2}$ and that privacy scales with noise variance along margin directions. An attacker who knows the defender's covariance $\Sigma$ should not use raw $\ell_{2}$: under $\tilde{h}\sim\mathcal{N}(h_{x},\Sigma)$, the Bayes-optimal Gaussian score between candidates is the Mahalanobis distance

$$d_{\Sigma}(\tilde{h},h_{c})^{2}=(\tilde{h}-h_{c})^{\top}\Sigma^{-1}(\tilde{h}-h_{c}),$$

which automatically down-weights directions where the defender added lots of noise. For two prefixes with clean states $h_{x},h_{x'}$ and $\delta=h_{x}-h_{x'}$, the pairwise Bayes error is $P(x\to x')=\Phi(-\tfrac{1}{2}\sqrt{\delta^{\top}\Sigma^{-1}\delta})$, so adaptive distinguishability is controlled by the Mahalanobis signal $\delta^{\top}\Sigma^{-1}\delta$. Averaging over hard-negative pairs yields

$$J(\Sigma)=\mathbb{E}_{\delta\sim\Sigma_{\delta}}[\delta^{\top}\Sigma^{-1}\delta]=\mathrm{tr}(\Sigma_{\delta}\Sigma^{-1}).$$

The defender minimizes $J$ at fixed utility budget:

$$\min_{\Sigma\succ 0}\mathrm{tr}(\Sigma_{\delta}\Sigma^{-1})\quad\text{s.t.}\quad\tfrac{1}{2}\,\mathrm{tr}(F\Sigma)\leq\mathcal{K}.\tag{19}$$

### S.3 Closed-form solution

Ridge-stabilize: $F_{\lambda}=F+\lambda I$, $S_{\rho}=\Sigma_{\delta}+\rho I$. Define $C=F_{\lambda}^{1/2}S_{\rho}F_{\lambda}^{1/2}$.

###### Proposition 24 (Mahalanobis-optimal covariance).

The solution to

$$\min_{\Sigma\succ 0}\mathrm{tr}(S_{\rho}\Sigma^{-1})\quad\text{s.t.}\quad\tfrac{1}{2}\,\mathrm{tr}(F_{\lambda}\Sigma)=\mathcal{K}$$

is

$$\Sigma^{\star}_{\mathrm{Mah}}=\frac{2\mathcal{K}}{\mathrm{tr}(C^{1/2})}F_{\lambda}^{-1/2}C^{1/2}F_{\lambda}^{-1/2},\tag{20}$$

with optimal objective

$$J^{\star}(\mathcal{K})=\frac{[\mathrm{tr}(C^{1/2})]^{2}}{2\mathcal{K}}.\tag{21}$$

###### Proof\.

Let $Y=F_{\lambda}^{1/2}\Sigma F_{\lambda}^{1/2}$, so $\Sigma=F_{\lambda}^{-1/2}YF_{\lambda}^{-1/2}$ and $\Sigma^{-1}=F_{\lambda}^{1/2}Y^{-1}F_{\lambda}^{1/2}$. Then

$$\mathrm{tr}(S_{\rho}\Sigma^{-1})=\mathrm{tr}(F_{\lambda}^{1/2}S_{\rho}F_{\lambda}^{1/2}Y^{-1})=\mathrm{tr}(CY^{-1}),$$

and the KL-budget utility constraint is $\mathrm{tr}(Y)=2\mathcal{K}$. Diagonalize $C=U\mathrm{diag}(c_{i})U^{\top}$; by invariance, the optimum admits $Y=U\mathrm{diag}(y_{i})U^{\top}$ and the problem reduces to

$$\min_{y_{i}>0}\sum_{i}\frac{c_{i}}{y_{i}}\quad\text{s.t.}\quad\sum_{i}y_{i}=2\mathcal{K}.$$

Lagrange multipliers give $-c_{i}/y_{i}^{2}+\mu=0$, i.e. $y_{i}=\sqrt{c_{i}}/\sqrt{\mu}$. Using $\sum_{i}y_{i}=2\mathcal{K}$ yields $y_{i}=2\mathcal{K}\sqrt{c_{i}}/\sum_{j}\sqrt{c_{j}}$. Therefore $Y^{\star}=2\mathcal{K}\,C^{1/2}/\mathrm{tr}(C^{1/2})$, and substituting back gives ([20](https://arxiv.org/html/2605.24042#A19.E20)). The objective value is

$$\mathrm{tr}(C(Y^{\star})^{-1})=\frac{[\mathrm{tr}(C^{1/2})]^{2}}{2\mathcal{K}}.$$

∎
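A numerical check of equations (20) and (21) on a small synthetic instance is sketched below. The matrices are random stand-ins for $F_\lambda$ and $S_\rho$, and the helper names `sym_sqrt` and `mahalanobis_optimal_covariance` are hypothetical.

```python
import numpy as np

def sym_sqrt(A):
    """Symmetric PSD square root via eigendecomposition."""
    w, Q = np.linalg.eigh(0.5 * (A + A.T))
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def mahalanobis_optimal_covariance(F_lam, S_rho, K):
    """Closed form (20) and its objective value (21)."""
    F_half = sym_sqrt(F_lam)
    F_inv_half = np.linalg.inv(F_half)
    C_half = sym_sqrt(F_half @ S_rho @ F_half)     # C^{1/2}
    t = np.trace(C_half)
    Sigma = (2.0 * K / t) * F_inv_half @ C_half @ F_inv_half
    return Sigma, t * t / (2.0 * K)

# Synthetic ridge-stabilized instance (assumed values, for illustration only).
rng = np.random.default_rng(2)
d, K = 6, 0.5
A = rng.standard_normal((d, d))
F_lam = A @ A.T + 0.1 * np.eye(d)
B = rng.standard_normal((d, d))
S_rho = B @ B.T + 0.1 * np.eye(d)

Sigma_mah, J_star = mahalanobis_optimal_covariance(F_lam, S_rho, K)
budget = 0.5 * np.trace(F_lam @ Sigma_mah)                 # should equal K
J_at_opt = np.trace(S_rho @ np.linalg.inv(Sigma_mah))      # should equal J_star
Sigma_iso = 2.0 * K * np.eye(d) / np.trace(F_lam)          # isotropic at the same budget
J_iso = np.trace(S_rho @ np.linalg.inv(Sigma_iso))
print(budget, J_at_opt, J_star, J_iso)
```

On this instance the budget is met exactly, the objective agrees with the closed form, and the isotropic release at the same budget leaves a larger Mahalanobis signal.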

### S.4 Pareto ratio over isotropic noise

Under isotropic $\Sigma_{\mathrm{iso}}=2\mathcal{K}I/\mathrm{tr}(F_{\lambda})$ (which saturates $\tfrac{1}{2}\mathrm{tr}(F_{\lambda}\Sigma_{\mathrm{iso}})=\mathcal{K}$), the adaptive distinguishability is

$$J_{\mathrm{iso}}=\mathrm{tr}(S_{\rho}\Sigma_{\mathrm{iso}}^{-1})=\frac{\mathrm{tr}(F_{\lambda})\mathrm{tr}(S_{\rho})}{2\mathcal{K}}.$$

Define

$$G_{\mathrm{Mah}}=\frac{J_{\mathrm{iso}}}{J^{\star}}=\frac{\mathrm{tr}(F_{\lambda})\mathrm{tr}(S_{\rho})}{[\mathrm{tr}((F_{\lambda}^{1/2}S_{\rho}F_{\lambda}^{1/2})^{1/2})]^{2}}.\tag{22}$$

This is the adaptive-attacker analogue of $G_{\mathrm{Euc}}$, and it is the correct predictor when the attacker uses $\Sigma^{-1}$-shaped retrieval. The fidelity inequality $\mathrm{tr}\sqrt{F_{\lambda}^{1/2}S_{\rho}F_{\lambda}^{1/2}}\leq\sqrt{\mathrm{tr}(F_{\lambda})\,\mathrm{tr}(S_{\rho})}$ gives $G_{\mathrm{Mah}}\geq 1$ with equality iff $\rho_{F}=\rho_{S}$ (equivalently $F_{\lambda}\propto S_{\rho}$); the fidelity identity in Appendix [Q.12](https://arxiv.org/html/2605.24042#A17.SS12) makes this explicit. Unlike $G_{\mathrm{Euc}}$, $G_{\mathrm{Mah}}$ depends on the full spectrum of $C$ rather than only the top $k_{\xi}$ eigenvalues; it is the right scalar when no low-rank truncation is imposed.
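Equation (22) is a one-line computation once a matrix square root is available. The sketch below evaluates it on assumed inputs: a diagonal "concentrated Fisher" toy case, a proportional pair where the gain should collapse to one, and a random pair. The helper names are hypothetical.

```python
import numpy as np

def sym_sqrt(A):
    """Symmetric PSD square root via eigendecomposition."""
    w, Q = np.linalg.eigh(0.5 * (A + A.T))
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def mahalanobis_gain(F_lam, S_rho):
    """G_Mah = tr(F) tr(S) / [tr((F^{1/2} S F^{1/2})^{1/2})]^2."""
    F_half = sym_sqrt(F_lam)
    fidelity = np.trace(sym_sqrt(F_half @ S_rho @ F_half))
    return np.trace(F_lam) * np.trace(S_rho) / fidelity**2

# Toy diagonal case (assumed values): one dominant Fisher coordinate, flat margins.
F_toy = np.diag([100.0, 1.0, 1.0, 1.0])
S_toy = np.eye(4)
G_toy = mahalanobis_gain(F_toy, S_toy)          # (103 * 4) / (10 + 3)^2

# Proportional case: S = 3 F should give G_Mah = 1.
G_prop = mahalanobis_gain(F_toy, 3.0 * F_toy)

# Random PSD pair: G_Mah should be at least 1.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))
G_rand = mahalanobis_gain(A @ A.T + 0.1 * np.eye(5), B @ B.T + 0.1 * np.eye(5))
print(G_toy, G_prop, G_rand)
```

In the commuting toy case the fidelity term reduces to $\sum_i\sqrt{f_i s_i}$, so $G_{\mathrm{Mah}}\geq 1$ is just the Cauchy-Schwarz inequality.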

### S.5 Connection to Rényi differential privacy

For adjacent hidden states $h,h'$ with $\Delta=h-h'$, the Rényi divergence between the two Gaussian releases is

$$D_{\alpha}(\mathcal{N}(h,\Sigma)\,\|\,\mathcal{N}(h',\Sigma))=\frac{\alpha}{2}\Delta^{\top}\Sigma^{-1}\Delta,$$

so the empirical $(\alpha)$-Rényi DP budget over an adjacency set $\mathcal{A}$ is

$$\varepsilon_{\alpha}(\Sigma)=\frac{\alpha}{2}\sup_{\Delta\in\mathcal{A}}\Delta^{\top}\Sigma^{-1}\Delta,\tag{23}$$

converted to $(\varepsilon,\delta)$-DP via $\varepsilon(\delta)=\min_{\alpha>1}[\varepsilon_{\alpha}+\log(1/\delta)/(\alpha-1)]$. A rank-deficient $\Sigma$ yields $\varepsilon_{\alpha}=\infty$ whenever any $\Delta\in\mathcal{A}$ has nonzero component outside the noised subspace; for a valid DP mechanism we therefore use the floored release covariance $\Sigma_{\eta}=\Sigma^{\star}_{\mathrm{Mah}}+\eta I$, parameterized by $\eta\in\{10^{-4},10^{-3},10^{-2}\}\cdot\mathrm{tr}(\Sigma^{\star}_{\mathrm{Mah}})/d$.
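Because $\varepsilon_\alpha$ in (23) is linear in $\alpha$, the conversion to $(\varepsilon,\delta)$-DP can be minimized in closed form. Writing $c=\sup_{\Delta\in\mathcal{A}}\Delta^{\top}\Sigma^{-1}\Delta$, setting the derivative of $\alpha c/2+\log(1/\delta)/(\alpha-1)$ to zero gives $\alpha^{\star}=1+\sqrt{2\log(1/\delta)/c}$ and $\varepsilon=c/2+\sqrt{2c\log(1/\delta)}$. This closed form is a calculus step added here, not a formula stated in the paper; the sketch below checks it against a grid search with assumed values of $c$ and $\delta$.

```python
import math

def rdp_to_dp(c, delta):
    """Minimize alpha * c / 2 + log(1/delta) / (alpha - 1) over alpha > 1.

    c is the worst-case Mahalanobis signal sup Delta^T Sigma^{-1} Delta (c > 0).
    """
    L = math.log(1.0 / delta)
    alpha_star = 1.0 + math.sqrt(2.0 * L / c)
    eps_star = 0.5 * alpha_star * c + L / (alpha_star - 1.0)
    return alpha_star, eps_star

def rdp_to_dp_grid(c, delta, step=1e-3, n=20000):
    """Brute-force check of the same minimization on a grid of alpha values."""
    L = math.log(1.0 / delta)
    return min(0.5 * a * c + L / (a - 1.0)
               for a in (1.0 + step * j for j in range(1, n + 1)))

# Assumed values for illustration: c = 4.0, delta = 1e-5.
alpha_star, eps_star = rdp_to_dp(4.0, 1e-5)
eps_grid = rdp_to_dp_grid(4.0, 1e-5)
print(alpha_star, eps_star, eps_grid)
```

The same expression makes the rank-deficiency remark concrete: as $c\to\infty$ the resulting $\varepsilon$ diverges, which is why a floor $\eta I$ is needed for a finite budget.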

### S.6 Attacker hierarchy

The Mahalanobis-optimal mechanism above is the right defender choice against an adaptive retrieval attacker who knows $\Sigma$. Against a weaker $\ell_{2}$ attacker, the rank-one generalized-eigen solution $\Sigma^{\star}_{\mathrm{Euc}}$ of equation ([16](https://arxiv.org/html/2605.24042#A19.E16)) is optimal. Against an attacker who also has access to a learned nonlinear inversion model, neither Gaussian mechanism is provably optimal, and an empirical-inverter benchmark is needed. Appendix [T](https://arxiv.org/html/2605.24042#A20) reports empirical retrieval outcomes for both mechanisms under $\ell_{2}$, subspace-$P_{I}$, and $\Sigma$-adaptive Mahalanobis attackers.

## Appendix T Additional defense results

This appendix reports the full attacker-metric sweep, the per-axis ablation, and the multi-model Mahalanobis measurements that support Section [6](https://arxiv.org/html/2605.24042#S6).

### T.1 Multi-model Mahalanobis sweep and $G_{\mathrm{Mah}}$ predictor

Table [25](https://arxiv.org/html/2605.24042#A20.T25) reports, at the mid-network layer of each model with $50{,}000$-candidate bank at $\sigma=5$, the Euclidean and Mahalanobis gain predictors together with the three most informative mechanisms' actual worst-over-attackers top-1. Two facts fall out. First, $G_{\mathrm{Euc}}$ is a vastly inflated predictor of the usable privacy gap. It ranges from $24$ on GPT-2 L6 to $7{,}918$ on DeepSeek L24, while $G_{\mathrm{Mah}}$ remains between $1.71$ and $8.30$ on the same points (Figure [18](https://arxiv.org/html/2605.24042#A20.F18)), a compression of several orders of magnitude. Second, even the smaller $G_{\mathrm{Mah}}$ prediction is not realized by $\Sigma^{\star}_{\mathrm{Mah}}$, whose worst-attacker top-1 tracks isotropic within a few percent on every model and equals $1.000$ on the three modern models. The only mechanism that delivers near-zero attack at this high-distortion setting is $\Sigma_{\mathrm{diag}}$.

Table 25: Mid-network-layer worst-over-attackers top-1 at $\sigma=5$ under adaptive retrieval, with $k=128$, $n_{\mathrm{cal}}\in\{300,500\}$ for Mahalanobis-gain estimation (distinct from the $\Sigma_{g}$ calibration in Section [3](https://arxiv.org/html/2605.24042#S3) at $n_{\mathrm{cal}}=2000/200$), $50{,}000$-candidate bank, $2{,}000$ query prefixes. $G_{\mathrm{Euc}}/G_{\mathrm{Mah}}$ ratios range from $14$ (GPT-2) to $273$ (Mistral) to $940$ (Qwen3) to $1{,}043$ (DeepSeek), quantifying how much the Euclidean predictor overstates the gap. An earlier $5{,}000$-bank Pass 1 sweep over GPT-2 Small, Mistral-7B, Phi-2, and Qwen2.5-7B reproduces the same qualitative ordering.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x18.png)

Figure 18: $G_{\mathrm{Euc}}$ vs $G_{\mathrm{Mah}}$ across the five models. The Euclidean predictor spans three orders of magnitude and massively overstates the practical privacy gap. The Mahalanobis predictor ranges from $1.71$ to $8.30$ and is the correct scalar against an adaptive attacker.

### T.2 Attacker-metric sweep

Table [26](https://arxiv.org/html/2605.24042#A20.T26) reports the sweep over three candidate attacker distance metrics on Mistral-7B at layer 16 with $k=128$. For each defense mechanism and noise level $\sigma$, we report retrieval top-1 accuracy against 500 i.i.d. distractors under three metrics: plain $\ell_{2}$ in the full hidden space, $\ell_{2}$ restricted to $P_{I}$, and the generalized attacker distance $M$ that weighs $P_{B}$ components by a learned scalar. The plain full-space $\ell_{2}$ metric is at least as strong as either subspace-restricted alternative on every row, so the decomposition is a defender's tool and not an attacker's weapon.

Table 26: Attacker-metric sweep on Mistral-7B.

### T.3 Connection to the three geometric axes

The Pareto gain of the generalized-eigen defense over isotropic noise is determined by three properties of the model's hidden-state geometry, introduced in Section [3](https://arxiv.org/html/2605.24042#S3). $E_{k}$ (Fisher concentration) controls whether low-utility directions exist. When $E_{k}$ is close to 1, the gradient eigenvalue mass is already in $P_{B}$, so the average direction in $P_{I}$ has small $v^{\top}Fv$ and noise there is cheap per unit magnitude. $\kappa$ (channel coupling) controls whether the margin mass sits outside $P_{B}$, where low $\kappa$ means $\Sigma_{\delta}$ has substantial mass in $P_{I}$, which is where the defense wants to place noise to combine cheap utility cost with high discriminative cost for the attacker. The effective-rank fraction $\rho=r_{95}/d$ does not control the Pareto gain directly. It bounds how many generalized eigenvectors are needed to saturate the gain, and in practice the top $k_{\xi}\approx k$ eigenvectors are sufficient because the generalized eigenvalue spectrum decays at roughly the same rate as $F$.

The empirical ordering under the $\ell_{2}$ attacker in Figure [1](https://arxiv.org/html/2605.24042#S4.F1) matches the three-axis values from Table [8](https://arxiv.org/html/2605.24042#A5.T8). Mistral-7B at $(E_{128},\kappa,\rho)\approx(0.99,\text{low},0.17)$ supports a $13\times$ Euclidean defense, while GPT-2 Small at $(0.56,\text{moderate},0.49)$ supports essentially no advantage over isotropic noise. Under the adaptive Mahalanobis attacker, however, the Euclidean Pareto gain does not transfer. $G_{\mathrm{Mah}}$ (Table [25](https://arxiv.org/html/2605.24042#A20.T25)) is a better predictor and stays in the range $1.7$–$9.3$ across every model and layer we tested, with Qwen3-14B at layer 10 attaining the highest value.

### T.4 $\Sigma_{\mathrm{diag}}$'s layer-flat KL signature on concentrated-Fisher models

Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15) reparameterizes the diagonal mechanism by its KL budget $\mathcal{K}$ as $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)\,D^{-1}$, so the first-order mean KL $U(\Sigma^{\star}_{\mathrm{diag}})=\tfrac{1}{2}\,\mathrm{tr}(D\,\Sigma^{\star}_{\mathrm{diag}})=\mathcal{K}$ is, by construction, independent of the layer. The implementation parameterizes the same mechanism by a noise-scale $\sigma$ via $\Sigma_{\mathrm{diag}}=\sigma^{2}\,\mathrm{diag}(1/F_{ii})$ with $\sigma^{2}=2\mathcal{K}/d$, so the structural prediction is layer-flatness of the measured KL, not a particular absolute value. Measured across the sweep, the layer-invariance is tight for modern concentrated-Fisher models and loose only for GPT-2.

The $\leq 2.33\%$ range on modern models is the expected signature of an equal-coordinate-cost diagonal mechanism: the layer-to-layer variation in $\mathrm{tr}(F)$ is absorbed into the per-coordinate $1/F_{ii}$ scaling, leaving the first-order KL essentially constant at $\mathcal{K}$. The empirical implementation rescales the noise by $2\mathcal{K}/d$ to match the budget form $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)\,D^{-1}$ of Theorem [15](https://arxiv.org/html/2605.24042#Thmproposition15), so the measured 9–16 nats track the chosen $\mathcal{K}$ across layers and models, not the literal $\tfrac{1}{2}\sigma^{2}d\approx 51{,}200$ nats one would get from the unscaled pseudo-form $\Sigma_{\mathrm{diag}}=\sigma^{2}\mathrm{diag}(1/F_{ii})$ at $\sigma=5$, $d=4096$. The takeaway is layer-invariance under the equal-cost design, not a match to any fixed absolute $\tfrac{1}{2}\sigma^{2}d$ value. GPT-2 is the exception because the empirical Fisher at some layers is dominated by a few very large $F_{ii}$ entries, so the diagonal $1/F_{ii}$ noise places disproportionate mass on the remaining low-$F_{ii}$ coordinates and the higher-order corrections to the local KL expansion become non-negligible.
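The arithmetic behind the two parameterizations can be checked directly. In the sketch below the function names and the two diagonal Fisher profiles are hypothetical; only $\sigma=5$ and $d=4096$ are taken from the text.

```python
import math

def unscaled_first_order_kl(sigma, d):
    """0.5 * tr(D * sigma^2 * D^{-1}) = 0.5 * sigma^2 * d for the unscaled form."""
    return 0.5 * sigma**2 * d

def budgeted_first_order_kl(K, fisher_diag):
    """0.5 * tr(D * Sigma) for Sigma = (2K / d) * diag(1 / F_ii)."""
    d = len(fisher_diag)
    variances = [(2.0 * K / d) / f for f in fisher_diag]
    return 0.5 * sum(f * v for f, v in zip(fisher_diag, variances))

kl_unscaled = unscaled_first_order_kl(sigma=5.0, d=4096)

# Two hypothetical diagonal Fisher profiles with very different traces.
flat_profile = [1.0] * 64
spiky_profile = [500.0] * 4 + [0.01] * 60
kl_flat = budgeted_first_order_kl(12.0, flat_profile)
kl_spiky = budgeted_first_order_kl(12.0, spiky_profile)
print(kl_unscaled, kl_flat, kl_spiky)
```

Both profiles return the chosen budget $\mathcal{K}=12$ to first order, which is the layer-flatness the section describes; the unscaled form instead gives $\tfrac{1}{2}\cdot 25\cdot 4096=51{,}200$.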

### T.5 The empty middle: no moderate-utility privacy regime

We pool the $1{,}536$ non-quantization (mechanism, $\sigma$, model, layer) rows of the full sweep and ask how often any cell achieves both moderate utility and moderate privacy simultaneously. We count a cell as “moderate utility” if its measured top-1 agreement against the clean model is $\geq 0.5$, and as “moderate privacy” if the worst-over-attackers retrieval top-1 is $\leq 0.5$.

The only three cells that achieve worst-attacker top-1 $\leq 0.5$ with top-1 agreement $\geq 0.1$ are all GPT-2 Small at early layers (L0–L1) and are limited to $\Sigma^{\star}_{\mathrm{Mah}}$ and isotropic at $\sigma=5$. Within the mechanism family of this sweep, there is effectively no regime that simultaneously preserves useful prediction and blocks adaptive retrieval. The practical privacy–utility frontier is either near-clean-model-but-trivially-invertible or wrecked-model-but-private. This is the most honest single-number takeaway of the paper. The empty middle persists across the operational thresholds tested. Writing $A$ for the worst-attacker top-1 and $U$ for the top-1 agreement, cells with $A\leq 0.5$ first appear only when the utility threshold is relaxed to $U\geq 0.1$; these are the three GPT-2 Small cells above, at the highest $\sigma$, which is not a regime any defender would deploy for utility reasons.

The Mahalanobis-optimal solution $\Sigma^{\star}_{\mathrm{Mah}}=(2\mathcal{K}/\mathrm{tr}(C^{1/2}))\,F_{\lambda}^{-1/2}C^{1/2}F_{\lambda}^{-1/2}$ uses the full Fisher, which on the 7B/14B models is ill-conditioned and dominated by a few large eigenvalues. The diagonal-minimax mechanism $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)\,D^{-1}$ with $D=\mathrm{diag}(F)$ is cheaper to compute ($O(d)$ vs $O(d^{3})$) and spreads noise across all coordinates in proportion to the inverse per-coordinate Fisher (the implementation parameterizes this as $\Sigma_{\mathrm{diag}}=\sigma^{2}\mathrm{diag}(1/F_{ii})$ with $\sigma^{2}=2\mathcal{K}/d$). Empirically this diagonal mechanism is the strongest defense at concentrated-Fisher layers: Table [27](https://arxiv.org/html/2605.24042#A20.T27) and Figure [19](https://arxiv.org/html/2605.24042#A20.F19) report the adaptive-attacker worst-over-attackers top-1 at $\sigma=5$ on Mistral-7B across 8 layers, comparing isotropic, $\Sigma^{\star}_{\mathrm{Mah}}$, and $\Sigma_{\mathrm{diag}}$.

Table 27: Worst-over-attackers top-1 ranges at $\sigma=5$ across 32 model-layer points (5 models, 4–12 layers each, $n_{\mathrm{bank}}=50{,}000$, $k=128$). The diagonal-minimax release $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)\,D^{-1}$ (parameterized in the implementation as $\Sigma_{\mathrm{diag}}=\sigma^{2}\mathrm{diag}(1/F_{ii})$ with $\sigma^{2}=2\mathcal{K}/d$) stays at $\leq 0.001$ top-1 at every one of the 32 points, while isotropic and $\Sigma^{\star}_{\mathrm{Mah}}$ span essentially $[0,1]$ as the layer varies from mid-network (where isotropic works) to late layers (where it fails). Qwen3-14B and DeepSeek-R1-14B have iso $\geq 0.994$ at every tested layer, so isotropic never works on these modern 14B models at $\sigma=5$, but $\Sigma_{\mathrm{diag}}$ always does.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x19.png)

Figure 19: Best-attack top-1 at $\sigma=5$ vs relative layer depth across 5 models. A: isotropic. B: diagonal-minimax $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)D^{-1}$ (implementation parameterization $\Sigma_{\mathrm{diag}}=\sigma^{2}\mathrm{diag}(1/F_{ii})$, $\sigma^{2}=2\mathcal{K}/d$). Isotropic fails on every modern model except mid-Mistral; $\Sigma_{\mathrm{diag}}$ is at $\leq 0.001$ on every one of the 32 points tested.

The two compounding effects are (i) isotropic's utility cost $\tfrac{1}{2}\,\mathrm{tr}(F\,\Sigma_{\mathrm{iso}})=\tfrac{1}{2}\sigma^{2}\mathrm{tr}(F)$ scales with $\mathrm{tr}(F)$, which varies by $60\times$ across models and decays with depth (Mistral L4 $\to$ L31 is $59.9\to 1.6$), while the utility cost of $\Sigma^{\star}_{\mathrm{diag}}(\mathcal{K})$, $\tfrac{1}{2}\,\mathrm{tr}(D\,\Sigma^{\star}_{\mathrm{diag}})=\mathcal{K}$, is fixed by construction; (ii) inter-prefix distinguishability lives in low-$F_{ii}$ coordinates on concentrated-Fisher models, exactly where $\Sigma_{\mathrm{diag}}$ pours its noise.

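The fixed-by-construction utility cost in (i) can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: the Fisher diagonal `F_diag`, the budget `K`, the isotropic variance `iso_var`, and both helper names are hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def diag_inverse_fisher_cov(fisher_diag, kl_budget):
    """Diagonal of Sigma*_diag(K) = (2K/d) * diag(1/F_ii)."""
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    d = fisher_diag.shape[0]
    return (2.0 * kl_budget / d) / fisher_diag

def first_order_kl(fisher_diag, cov_diag):
    """First-order utility cost 0.5 * tr(D Sigma) for diagonal D and Sigma."""
    return 0.5 * float(np.sum(np.asarray(fisher_diag) * np.asarray(cov_diag)))

rng = np.random.default_rng(0)
d = 768
# Hypothetical per-coordinate Fisher values spanning several orders of magnitude.
F_diag = np.exp(rng.normal(0.0, 2.0, size=d))
K = 0.5  # hypothetical first-order KL budget

cov_diag = diag_inverse_fisher_cov(F_diag, K)
diag_cost = first_order_kl(F_diag, cov_diag)          # equals K regardless of F_diag

iso_var = 1.0                                          # hypothetical sigma^2
iso_cost = first_order_kl(F_diag, np.full(d, iso_var)) # 0.5 * sigma^2 * tr(D)

# One noised release h + xi with xi ~ N(0, Sigma_diag).
h = rng.normal(size=d)
released = h + rng.normal(size=d) * np.sqrt(cov_diag)
```

Under these assumptions the diagonal mechanism's cost stays at `K` whatever the Fisher spectrum looks like, while the isotropic cost moves with the sum of the Fisher diagonal.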
### T.6 Why the full-space $\ell_{2}$ attack dominates subspace-restricted attacks

The empirical observation that plain $\ell_{2}$ retrieval matches or beats $P_{I}$-restricted retrieval on every noise setting has a closed-form explanation. Write the attacker's query as $q=h+\xi$ and denote by $c$ any candidate from the distractor bank. By the Pythagorean identity (Proposition [2](https://arxiv.org/html/2605.24042#Thmproposition2)),

$$\|q-c\|^{2}\;=\;\|P_{B}(q-c)\|^{2}\,+\,\|P_{I}(q-c)\|^{2}.$$

The full-space $\ell_{2}$ attack ranks candidates by the left-hand side, while the $P_{I}$-restricted attack ranks by the second term. Let $c=h'$ be an arbitrary distractor. For the true match $c=h$, we have $q-c=\xi$ and $P_{B}(q-c)=P_{B}\xi$. For a distractor, $q-c=\xi+(h-h')$ and $P_{B}(q-c)=P_{B}\xi+P_{B}(h-h')$.

###### Proposition 25 ($\ell_{2}$ vs identity-restricted gap).

The expected additional penalty that the full-space $\ell_{2}$ attacker applies to a distractor beyond what the $P_{I}$-restricted attacker applies, relative to the true match, is

$$\mathbb{E}\bigl[\|P_{B}(q-h')\|^{2}-\|P_{B}(q-h)\|^{2}\bigr]\;=\;\|P_{B}(h-h')\|^{2}\;=\;(h-h')^{\top}P_{B}(h-h'),$$

which is nonnegative and strictly positive for every difference $h-h'$ with nonzero projection under $P_{B}$. Averaging over the distractor distribution and normalizing by the mean squared distance $\mathbb{E}\|h-h'\|^{2}$,

$$\mathbb{E}_{h,h'}\bigl[\|P_{B}(h-h')\|^{2}\bigr]\,/\,\mathbb{E}_{h,h'}\bigl[\|h-h'\|^{2}\bigr]\;=\;\mathrm{tr}(P_{B}\Sigma_{\delta})/\mathrm{tr}(\Sigma_{\delta}),$$

the same Fisher/margin alignment quantity that controls the behavior-versus-random margin excess in Proposition [6](https://arxiv.org/html/2605.24042#Thmproposition6) and the coupling coefficient $\kappa$ defined in Section [3](https://arxiv.org/html/2605.24042#S3).

###### Proof\.

Expanding the first line using $q-h=\xi$ and $q-h'=\xi+(h-h')$,

$$\begin{aligned}
\|P_{B}(q-h')\|^{2}-\|P_{B}(q-h)\|^{2}
&=\|P_{B}\xi+P_{B}(h-h')\|^{2}-\|P_{B}\xi\|^{2}\\
&=2\,\langle P_{B}\xi,\,P_{B}(h-h')\rangle+\|P_{B}(h-h')\|^{2}.
\end{aligned}$$

Under mean-zero noise $\mathbb{E}[\xi]=0$, the cross term vanishes in expectation and the residual is $\|P_{B}(h-h')\|^{2}\geq 0$. Averaging the numerator $\|P_{B}(h-h')\|^{2}$ and the denominator $\|h-h'\|^{2}$ over the inter-prefix distribution gives the ratio $\mathrm{tr}(P_{B}\Sigma_{\delta})/\mathrm{tr}(\Sigma_{\delta})$ by definition of $\Sigma_{\delta}$. ∎

The proposition says the full-space $\ell_{2}$ attacker applies an additional non-negative penalty to every distractor that the $P_{I}$-restricted attacker does not, and the expected size of this penalty is exactly the Fisher-margin alignment $\mathrm{tr}(P_{B}\Sigma_{\delta})/\mathrm{tr}(\Sigma_{\delta})$. When this quantity is nonzero, full-space $\ell_{2}$ is strictly better than $P_{I}$-restricted retrieval in expectation. The attacker's strategy is therefore to retain the $P_{B}$ component of the residual, not discard it. This is the reason Table [26](https://arxiv.org/html/2605.24042#A20.T26) shows $\ell_{2}$-full matching or beating $P_{I}$-restricted on every row.
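The identity behind Proposition 25 is easy to verify numerically. The sketch below is illustrative only: the dimensions, the randomly drawn subspace standing in for the behavior subspace, and the name `extra_penalty` are assumptions, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 64, 16

# Hypothetical behavior subspace: orthonormal basis U (d x k), projector P_B = U U^T.
U, _ = np.linalg.qr(rng.normal(size=(d, k)))
P_B = U @ U.T

h = rng.normal(size=d)        # true hidden state
h_prime = rng.normal(size=d)  # distractor
delta = h - h_prime

def extra_penalty(xi):
    """||P_B(q - h')||^2 - ||P_B(q - h)||^2 for the query q = h + xi."""
    q = h + xi
    return np.sum((P_B @ (q - h_prime)) ** 2) - np.sum((P_B @ (q - h)) ** 2)

# Per-draw decomposition: cross term plus the deterministic signal.
xi = rng.normal(size=d)
cross_term = 2.0 * (P_B @ xi) @ (P_B @ delta)
signal = np.sum((P_B @ delta) ** 2)

# Monte Carlo average over mean-zero noise: the cross term averages out.
mc_mean = np.mean([extra_penalty(rng.normal(size=d)) for _ in range(5000)])
```

For a single draw the penalty equals `cross_term + signal` exactly; only after averaging over the noise does it reduce to `signal`, which is why the proposition is a statement in expectation.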

## Appendix U Formal privacy bound under behavior projection

We formalize the privacy implications of the two-channel decomposition using the Gaussian mechanism from differential privacy. Let $\mathsf{s}\sim\mathsf{s}'$ denote that sequences $\mathsf{s}$ and $\mathsf{s}'$ differ by a single-token substitution (the standard adjacency relation for DP in language modeling). Define the adjacency $\ell_{2}$-sensitivities:

$$\Delta:=\sup_{\mathsf{s}\sim\mathsf{s}'}\|\mathbf{h}(\mathsf{s})-\mathbf{h}(\mathsf{s}')\|_{2},\qquad\Delta_{B}:=\sup_{\mathsf{s}\sim\mathsf{s}'}\|P_{B}(\mathbf{h}(\mathsf{s})-\mathbf{h}(\mathsf{s}'))\|_{2}.$$

###### Proposition 26 (Sensitivity reduction).

$\Delta_{B}\leq\Delta$.

###### Proof\.

Fix adjacent $\mathsf{s}\sim\mathsf{s}'$ and let $\boldsymbol{v}=\mathbf{h}(\mathsf{s})-\mathbf{h}(\mathsf{s}')$. Since $P_{B}$ is an orthogonal projector, its eigenvalues lie in $\{0,1\}$, so $0\preceq P_{B}\preceq I$ in the Loewner order. Therefore $\|P_{B}\boldsymbol{v}\|_{2}^{2}=\boldsymbol{v}^{\top}P_{B}^{2}\boldsymbol{v}=\boldsymbol{v}^{\top}P_{B}\boldsymbol{v}\leq\boldsymbol{v}^{\top}I\boldsymbol{v}=\|\boldsymbol{v}\|_{2}^{2}$. Taking square roots and then the supremum over adjacent pairs gives $\Delta_{B}\leq\Delta$. ∎

###### Proposition 27 (Differential privacy via behavior projection).

Let $M(\mathsf{s})=P_{B}\mathbf{h}(\mathsf{s})+\mathbf{z}$ with $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I_{d})$. For any $0<\varepsilon\leq 1$ and $\delta\in(0,1)$, $M$ satisfies $(\varepsilon,\delta)$-differential privacy with respect to single-token substitutions whenever

$$\sigma\geq\frac{\Delta_{B}\sqrt{2\ln(1.25/\delta)}}{\varepsilon}.\tag{24}$$

###### Proof\.

The function $\mathsf{s}\mapsto P_{B}\mathbf{h}(\mathsf{s})$ has $\ell_{2}$-sensitivity $\Delta_{B}$ by definition, and $\Delta_{B}\leq\Delta$ by Proposition [26](https://arxiv.org/html/2605.24042#Thmproposition26). The result follows from the classical Gaussian mechanism theorem [Dwork and Roth, [2014](https://arxiv.org/html/2605.24042#bib.bib17), Theorem A.1], which states that adding $\mathcal{N}(0,\sigma^{2}I)$ noise to a function with $\ell_{2}$-sensitivity $\Delta_{2}$ yields $(\varepsilon,\delta)$-DP for $0<\varepsilon\leq 1$ whenever $\sigma\geq\Delta_{2}\sqrt{2\ln(1.25/\delta)}/\varepsilon$. ∎

###### Proposition 28 (Privacy amplification under projection).

For fixed $0<\varepsilon\leq 1$ and $\delta\in(0,1)$, define the classical Gaussian calibrations $\sigma_{\mathrm{full}}^{\mathrm{DR}}:=\Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$ and $\sigma_{B}^{\mathrm{DR}}:=\Delta_{B}\sqrt{2\ln(1.25/\delta)}/\varepsilon$. Then

$$\frac{\sigma_{B}^{\mathrm{DR}}}{\sigma_{\mathrm{full}}^{\mathrm{DR}}}=\frac{\Delta_{B}}{\Delta}\leq 1.\tag{25}$$

Equivalently, at fixed noise $\sigma$, the classical calibration gives $\varepsilon_{B}^{\mathrm{DR}}=(\Delta_{B}/\Delta)\cdot\varepsilon_{\mathrm{full}}^{\mathrm{DR}}$.

###### Proof\.

Dividing the two calibrations: $\sigma_{B}^{\mathrm{DR}}/\sigma_{\mathrm{full}}^{\mathrm{DR}}=\Delta_{B}/\Delta$. The inequality follows from Proposition [26](https://arxiv.org/html/2605.24042#Thmproposition26). For the fixed-$\sigma$ statement: $\varepsilon_{B}^{\mathrm{DR}}=\Delta_{B}\sqrt{2\ln(1.25/\delta)}/\sigma=(\Delta_{B}/\Delta)\cdot\Delta\sqrt{2\ln(1.25/\delta)}/\sigma=(\Delta_{B}/\Delta)\cdot\varepsilon_{\mathrm{full}}^{\mathrm{DR}}$. ∎
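The calibration in Eq. (24) and the ratio in Eq. (25) reduce to a few lines of arithmetic. The sketch below uses made-up sensitivities (`Delta = 4.0`, `Delta_B = 1.0`) purely for illustration; the function names are hypothetical.

```python
import math

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism noise scale, stated for 0 < eps <= 1."""
    if not (0.0 < eps <= 1.0) or not (0.0 < delta < 1.0):
        raise ValueError("calibration is stated for 0 < eps <= 1 and 0 < delta < 1")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

def eps_at_fixed_sigma(sensitivity, sigma, delta):
    """Invert the calibration formula: the eps implied by a given sigma."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / sigma

# Hypothetical sensitivities; Delta_B <= Delta by Proposition 26.
Delta, Delta_B = 4.0, 1.0
eps, delta = 0.5, 1e-6

sigma_full = gaussian_sigma(Delta, eps, delta)
sigma_B = gaussian_sigma(Delta_B, eps, delta)

# At the noise level calibrated for the full state, the projected release
# gets a proportionally smaller eps.
eps_B = eps_at_fixed_sigma(Delta_B, sigma_full, delta)
```

With these numbers the projected release needs a quarter of the noise at the same $\varepsilon$, or equivalently reaches a quarter of the $\varepsilon$ at the same noise.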

We now prove that for a random rank-$k$ projector, the sensitivity concentrates around $\sqrt{k/d}$ times the full sensitivity, providing a theoretical baseline for the empirical privacy amplification.

###### Proposition 29 (Uniform distortion under random projection).

Let $P_{R}$ be a Haar-random rank-$k$ orthogonal projector on $\mathbb{R}^{d}$. Let $\mathcal{S}=\{\boldsymbol{v}_{1},\ldots,\boldsymbol{v}_{N}\}\subset\mathbb{R}^{d}$ be a fixed finite set. For $\rho\in(0,1)$ and $\eta\in(0,1)$, if $k\geq 72\ln(4N/\eta)/\rho^{2}$, then with probability at least $1-\eta$:

$$(1-\rho)\frac{k}{d}\|\boldsymbol{v}_{i}\|_{2}^{2}\leq\|P_{R}\boldsymbol{v}_{i}\|_{2}^{2}\leq(1+\rho)\frac{k}{d}\|\boldsymbol{v}_{i}\|_{2}^{2}\qquad\text{for all }i\in[N].$$

###### Proof\.

Fix $i$ with $\boldsymbol{v}_{i}\neq 0$ and let $\boldsymbol{u}=\boldsymbol{v}_{i}/\|\boldsymbol{v}_{i}\|_{2}$. Set $X=\|P_{R}\boldsymbol{u}\|_{2}^{2}$. By rotational invariance of the Haar measure, $X$ has the same distribution as $Y/W$ where $Y=\sum_{j=1}^{k}G_{j}^{2}\sim\chi^{2}_{k}$ and $W=\sum_{j=1}^{d}G_{j}^{2}\sim\chi^{2}_{d}$ with $G\sim\mathcal{N}(0,I_{d})$.

Set $t=\rho/3$. By the chi-square tail bound (for $t\in(0,1)$: $\Pr(|Z-n|>tn)\leq 2\exp(-nt^{2}/8)$ for $Z\sim\chi^{2}_{n}$, proved via the Chernoff bound and the inequality $x-\ln(1+x)\geq x^{2}/4$ for $x\in[0,1]$), we have $\Pr(|Y-k|>tk)\leq 2e^{-kt^{2}/8}$ and $\Pr(|W-d|>td)\leq 2e^{-dt^{2}/8}\leq 2e^{-kt^{2}/8}$. On the intersection of these events (probability $\geq 1-4e^{-kt^{2}/8}=1-4e^{-k\rho^{2}/72}$):

$$\frac{1-t}{1+t}\,\frac{k}{d}\leq X\leq\frac{1+t}{1-t}\,\frac{k}{d}.$$

Since $t=\rho/3$ and $\rho\in(0,1)$: $(1+t)/(1-t)=1+2t/(1-t)\leq 1+\rho$ (because $(2\rho/3)/(1-\rho/3)\leq\rho$), and similarly $(1-t)/(1+t)\geq 1-\rho$. So $(1-\rho)(k/d)\leq X\leq(1+\rho)(k/d)$.

Union-bounding over $N$ vectors: failure probability $\leq 4Ne^{-k\rho^{2}/72}$. Setting this $\leq\eta$ gives the condition $k\geq 72\ln(4N/\eta)/\rho^{2}$. ∎
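A quick simulation illustrates the $k/d$ concentration and the rank condition. This is a sketch under stated assumptions: the projector is sampled as the span of a Gaussian matrix via QR, and `min_rank`, `haar_projector`, and the test vectors `V` are illustrative names and data, not part of the paper.

```python
import math
import numpy as np

def min_rank(num_vectors, rho, eta):
    """Smallest integer k with k >= 72 ln(4N/eta) / rho^2."""
    return math.ceil(72.0 * math.log(4.0 * num_vectors / eta) / rho**2)

def haar_projector(d, k, rng):
    """Rank-k orthogonal projector onto the column span of a d x k Gaussian matrix."""
    Q, _ = np.linalg.qr(rng.normal(size=(d, k)))
    return Q @ Q.T

rng = np.random.default_rng(2)
d, k, N = 512, 128, 20
P_R = haar_projector(d, k, rng)

V = rng.normal(size=(N, d))  # hypothetical fixed vectors
# P_R is symmetric, so row i of V @ P_R is (P_R v_i)^T.
ratios = np.sum((V @ P_R) ** 2, axis=1) / np.sum(V ** 2, axis=1)

k_required = min_rank(N, rho=0.5, eta=0.1)
```

The observed `ratios` cluster near `k / d = 0.25`, while `k_required` shows that the explicit constant 72 makes the sufficient condition conservative: the empirical concentration appears at much smaller $k$ than the bound demands.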

###### Corollary 30 (Projected sensitivity concentration).

Let $\mathcal{D}=\{\mathbf{h}(\mathsf{s})-\mathbf{h}(\mathsf{s}'):\mathsf{s}\sim\mathsf{s}'\}$ with $|\mathcal{D}|=N$. Under the conditions of Proposition [29](https://arxiv.org/html/2605.24042#Thmproposition29), with probability $\geq 1-\eta$:

$$\Delta_{P_{R}}\leq\sqrt{(1+\rho)\frac{k}{d}}\cdot\Delta.$$

Consequently, for the classical Gaussian calibration: $\varepsilon_{P_{R}}^{\mathrm{DR}}\leq\sqrt{(1+\rho)(k/d)}\cdot\varepsilon_{\mathrm{full}}^{\mathrm{DR}}$.

###### Proof\.

Proposition [29](https://arxiv.org/html/2605.24042#Thmproposition29) applied to $\mathcal{D}$ gives $\|P_{R}\boldsymbol{v}\|_{2}^{2}\leq(1+\rho)(k/d)\|\boldsymbol{v}\|_{2}^{2}$ for all $\boldsymbol{v}\in\mathcal{D}$. Taking the supremum: $\Delta_{P_{R}}^{2}\leq(1+\rho)(k/d)\cdot\Delta^{2}$. Taking square roots and substituting into Proposition [28](https://arxiv.org/html/2605.24042#Thmproposition28) gives the DP statement. ∎

The above holds for random projectors. The behavior subspace $P_{B}$ is data-dependent, yet empirically matches the random-projection prediction: Table [24](https://arxiv.org/html/2605.24042#A17.T24) shows that the behavior-subspace margin tracks $\sqrt{k/d}\cdot m_{\mathrm{full}}$ to within 2 percentage points at all $k$. Under this empirical observation, $\Delta_{B}/\Delta\approx\sqrt{k/d}$, and the projection-based amplification factor is $\sqrt{d/k}$. A rank-deficient release $P_{B}\mathbf{h}+\mathbf{z}$ formally gives $\varepsilon=\infty$ whenever any $\Delta$ has nonzero component outside the noised subspace, so the proposition above serves as a sensitivity-reduction argument rather than a full-rank DP mechanism. Valid Gaussian-mechanism releases use the full-rank $\Sigma_{\eta}=\Sigma^{\star}_{\mathrm{Mah}}+\eta I$ and we report empirical Rényi-DP accounting next.

### U.1 Empirical Rényi-DP accounting

For a full-rank Gaussian release $M(\mathbf{h})=\mathbf{h}+\xi$ with $\xi\sim\mathcal{N}(0,\Sigma)$, the Rényi divergence between adjacent releases at difference $\Delta$ is $D_{\alpha}=(\alpha/2)\,\Delta^{\top}\Sigma^{-1}\Delta$, and the empirical $\alpha$-Rényi budget over an adjacency set $\mathcal{A}$ is

$$\varepsilon_{\alpha}(\Sigma)=\tfrac{\alpha}{2}\,\sup_{\Delta\in\mathcal{A}}\Delta^{\top}\Sigma^{-1}\Delta,\tag{26}$$

converted to $(\varepsilon,\delta)$-DP by $\varepsilon(\delta)=\min_{\alpha>1}[\varepsilon_{\alpha}+\log(1/\delta)/(\alpha-1)]$ over the grid $\alpha\in\{2,4,8,16,32,64,128\}$. We build $\mathcal{A}$ on GPT-2 Small at layer 6 from 500 nearest-neighbor and random-pair difference vectors in the hidden-state bank, and report the accountant at $\delta=10^{-6}$. Table [28](https://arxiv.org/html/2605.24042#A21.T28) shows the matched-utility comparison between isotropic noise and $\Sigma^{\star}_{\mathrm{Mah}}+\eta I$ with $\eta=10^{-3}\,\mathrm{tr}(\Sigma^{\star}_{\mathrm{Mah}})/d$.
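The accountant in Eq. (26) and its conversion to $(\varepsilon,\delta)$ can be sketched in a few lines. The code below is an illustration, not the paper's accountant: the adjacency vectors `A` are random unit vectors and the covariance is a hypothetical isotropic one, chosen so the result can be checked by hand.

```python
import numpy as np

ALPHA_GRID = (2, 4, 8, 16, 32, 64, 128)

def renyi_epsilon(cov, adjacency, delta=1e-6, alphas=ALPHA_GRID):
    """Empirical eps(delta): Eq. (26) minimized over the alpha grid."""
    adjacency = np.atleast_2d(np.asarray(adjacency, dtype=float))
    # Solve Sigma x = Delta for every adjacency vector instead of inverting Sigma.
    solved = np.linalg.solve(cov, adjacency.T)              # shape (d, n)
    worst = np.max(np.sum(adjacency.T * solved, axis=0))    # sup Delta^T Sigma^-1 Delta
    candidates = [0.5 * a * worst + np.log(1.0 / delta) / (a - 1.0) for a in alphas]
    return float(min(candidates))

rng = np.random.default_rng(3)
d = 16
A = rng.normal(size=(500, d))                    # hypothetical difference vectors
A /= np.linalg.norm(A, axis=1, keepdims=True)    # unit norm, for a hand-checkable case

eps_iso = renyi_epsilon(100.0 * np.eye(d), A)    # Sigma = 100 I, so the sup is 0.01
eps_more_noise = renyi_epsilon(400.0 * np.eye(d), A)
```

In this toy case the minimum over the grid is attained at $\alpha=64$; because the grid stops at 128, the reported $\varepsilon$ can never fall below $\log(1/\delta)/127$ however much noise is added.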

Table 28: Empirical Rényi-DP $\varepsilon$ at $\delta=10^{-6}$ on five models (mid-network layer) at matched $\sigma$, over an adjacency set of 500 nearest-neighbor and random prefix pairs. The ranking of $\Sigma^{\star}_{\mathrm{Mah}}$ vs isotropic is model-dependent: $\Sigma^{\star}_{\mathrm{Mah}}$ wins on GPT-2, Qwen3-14B, DeepSeek-R1-14B; isotropic wins on Mistral, Phi-2. The absolute $\varepsilon$ values are large because the adjacency set includes worst-case NN pairs; they are comparable within each model but not across models without further normalization. This matters for Table [29](https://arxiv.org/html/2605.24042#A21.T29) below, where we calibrate each mechanism to equal $\varepsilon$ before comparing attack success.

### U.2 Matched-$\varepsilon$ defense comparison

A fair side-by-side of mechanisms requires calibrating each to the same privacy budget. Table [29](https://arxiv.org/html/2605.24042#A21.T29) reports worst-over-attackers retrieval top-1 at target $\varepsilon\in\{1,3,8,16\}$ (Rényi accountant with $\delta=10^{-6}$) across five models, with the scalar $c$ chosen so $\varepsilon(c\,\Sigma)=\varepsilon_{\mathrm{target}}$ for each base mechanism $\Sigma\in\{I,\mathrm{diag}(1/F_{ii}),\Sigma^{\star}_{\mathrm{Mah}}\}$.
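One way to find the scalar $c$ is bisection, since scaling $\Sigma$ by $c$ divides every Mahalanobis signal by $c$ and the accountant is monotone in that signal. The paper does not say how $c$ is computed, so the following is only a plausible sketch; `accountant`, `calibrate_scale`, the search range, and the random adjacency set are all assumptions.

```python
import numpy as np

ALPHA_GRID = (2, 4, 8, 16, 32, 64, 128)

def accountant(worst_signal, delta=1e-6):
    """eps(delta) when the worst Delta^T Sigma^-1 Delta equals worst_signal."""
    return min(0.5 * a * worst_signal + np.log(1.0 / delta) / (a - 1.0)
               for a in ALPHA_GRID)

def calibrate_scale(base_cov, adjacency, eps_target, delta=1e-6, iters=200):
    """Find c > 0 with eps(c * base_cov) close to eps_target (bisection on log c)."""
    solved = np.linalg.solve(base_cov, adjacency.T)
    base_signal = float(np.max(np.sum(adjacency.T * solved, axis=0)))
    floor = np.log(1.0 / delta) / (max(ALPHA_GRID) - 1.0)
    if eps_target <= floor:
        raise ValueError("target is below what this alpha grid can certify")
    lo, hi = -60.0, 60.0  # hypothetical search range for log c
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        # Scaling Sigma by c divides the Mahalanobis signal by c.
        if accountant(base_signal / np.exp(mid), delta) > eps_target:
            lo = mid   # too little noise: increase c
        else:
            hi = mid
    return float(np.exp(hi)), base_signal

rng = np.random.default_rng(6)
d = 8
A = rng.normal(size=(50, d))                         # hypothetical adjacency vectors
F_diag = np.exp(rng.normal(size=d))                  # hypothetical Fisher diagonal
c_iso, s_iso = calibrate_scale(np.eye(d), A, eps_target=3.0)
c_diag, s_diag = calibrate_scale(np.diag(1.0 / F_diag), A, eps_target=3.0)
```

After calibration both base mechanisms report the same $\varepsilon$ on the shared adjacency set, so any remaining difference in attack success reflects the shape of the covariance rather than the amount of privacy budget spent.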

Table 29: Defense comparison at four matched empirical Rényi-DP budgets $\varepsilon\in\{1,3,8,16\}$, $\delta=10^{-6}$, mid-network layer per model. No mechanism is universal: aggregating winners across all 20 model-$\varepsilon$ cells, isotropic wins or ties at $\varepsilon=1$ on 5/5 models, at $\varepsilon=3$ on 3/5, at $\varepsilon=8$ on 3/5, and at $\varepsilon=16$ on $0/5$, with $\Sigma_{\mathrm{diag}}$ taking over at the loose-budget end. $\Sigma^{\star}_{\mathrm{Mah}}$ is never strictly best. The range of best-attacker top-1 across these cells is $[0.000,0.082]$; all are below $0.1$, meaning every calibrated Gaussian mechanism at $\varepsilon\leq 16$ is already in the near-zero-leakage regime. The utility cost at these $\varepsilon$ values is heavy, with mean KL on the order of $8$–$15$ nats.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x20.png)

Figure 20: Leading-order mean KL vs empirical $\varepsilon(\delta=10^{-6})$ on GPT-2 Small for isotropic Gaussian and $\Sigma^{\star}_{\mathrm{Mah}}$ releases. $\Sigma^{\star}_{\mathrm{Mah}}$ achieves $20$–$30\%$ lower $\varepsilon$ than iso on GPT-2 at every $\sigma$, but the ordering reverses on Mistral and Phi-2 (iso is tighter) and reverts to $\Sigma^{\star}_{\mathrm{Mah}}$ again on Qwen3-14B / DeepSeek (see Table [28](https://arxiv.org/html/2605.24042#A21.T28)). The mechanism's Rényi-DP advantage is highly model-dependent; the only mechanism with a consistent edge at matched $\varepsilon$ is $\Sigma_{\mathrm{diag}}$ at $\varepsilon=16$, not at strict budgets.

Figure [20](https://arxiv.org/html/2605.24042#A21.F20) plots leading-order mean KL against empirical $\varepsilon$ on GPT-2 Small. Table [29](https://arxiv.org/html/2605.24042#A21.T29) is the paper's most honest summary of the utility–privacy frontier under adaptive retrieval. Two observations matter. First, $\Sigma_{\mathrm{diag}}$ is the strongest family in the high-$\varepsilon$ regime ($\varepsilon=16$, wins $5/5$) and the only family that stays at or below $0.001$ worst-attacker top-1 at $\sigma=5$ on every one of the 32 sweep points in Table [27](https://arxiv.org/html/2605.24042#A20.T27); but at strict $\varepsilon=1$ it is not better than isotropic. Second, the absolute leakage at every calibrated cell is at most $8.2\%$ top-1; the utility cost of achieving this is $\geq 7.5$ nats mean KL on GPT-2 and $\geq 11$ nats on the 14B models. There is no cell where a mechanism achieves moderate privacy at moderate utility, which Section [T.5](https://arxiv.org/html/2605.24042#A20.SS5) quantifies as the empty-middle finding.

### U.3 Training-time DP-SGD does not bound inference-time activation release

Differentially private stochastic gradient descent (DP-SGD) fine-tuning provides $(\varepsilon,\delta)$-DP guarantees on training data but says nothing about new inference prompts released as hidden-state activations. We verify this empirically. Using Opacus we fine-tune GPT-2 Small on $8{,}000$ WikiText prefixes for $3$ epochs at target $\varepsilon\in\{2,4,8\}$, $\delta=10^{-6}$, max-grad-norm $C=1.0$, with the LM head untied from the input embedding and the embeddings frozen (required to avoid Opacus shape mismatches on tied-weight modules). Achieved privacy is $\varepsilon\in\{1.99,4.00,8.00\}$, matching targets. We then measure $\ell_{2}$ retrieval top-1 on $1{,}500$ held-out inference prefixes at layer $6$ after fine-tuning: $\mathbf{1.000}$ top-1 at every $\varepsilon$. Training-time DP-SGD therefore provides no measurable protection against inference-time hidden-state retrieval, as expected: the two threat models target different information and require different mechanisms.

### U.4 Worst-case DP covariance via semidefinite programming

The closed-form $\Sigma^{\star}_{\mathrm{Mah}}$ minimizes the average Mahalanobis signal $\mathrm{tr}(\Sigma_{\delta}\Sigma^{-1})$. The worst-case variant replaces the average with $\max_{\Delta\in\mathcal{A}}\Delta^{\top}\Sigma^{-1}\Delta$ and solves

$$\min_{\Sigma\succ 0,\,t}\;t\quad\text{s.t.}\quad\Delta_{i}^{\top}\Sigma^{-1}\Delta_{i}\leq t\;\;\forall i\in\mathcal{A},\quad\tfrac{1}{2}\,\mathrm{tr}(F\Sigma)\leq\mathcal{K},$$

which becomes a semidefinite program via the Schur-complement reformulation $\begin{bmatrix}\Sigma&\Delta_{i}\\ \Delta_{i}^{\top}&t\end{bmatrix}\succeq 0$. To keep the SDP tractable we solve in a reduced basis $U=[\text{top-}r\text{ eigenvectors of }F,\text{ of }\Sigma_{\delta},\text{ and GE directions}]$ and add an isotropic floor $\eta I$ for full-rank validity. On GPT-2 Small at $d=768$, $r_{\mathrm{eff}}=192$, $|\mathcal{A}|=40$, the SDP terminates in $792$ s with $t^{\star}=118.1$, and yields $\varepsilon(\delta=10^{-6})=1.5\times 10^{13}$ over the same $\mathcal{A}$. For comparison, the closed-form $\Sigma^{\star}_{\mathrm{Mah}}$ and isotropic mechanisms calibrated to the same $\mathcal{K}$ give $\varepsilon\approx 6{,}327$ and $\varepsilon\approx 5{,}921$, respectively. The SDP's reduced-basis solution is optimized for the $40$ specific training adjacency pairs but is nearly rank-deficient outside that span, so the empirical $\varepsilon$ accountant (which is worst-case over the full adjacency set) blows up. The full-rank closed-form mechanisms, not the SDP, are the correct basis for an inference-time Gaussian release.
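The Schur-complement step is what turns the nonconvex-looking constraint $\Delta_{i}^{\top}\Sigma^{-1}\Delta_{i}\leq t$ into a linear matrix inequality. The sketch below only checks that equivalence numerically on a small random instance; it does not solve the SDP, and `Sigma`, `Delta_i`, and the helper names are hypothetical.

```python
import numpy as np

def schur_block(cov, delta_vec, t):
    """Assemble the block matrix [[Sigma, Delta], [Delta^T, t]]."""
    d = cov.shape[0]
    block = np.empty((d + 1, d + 1))
    block[:d, :d] = cov
    block[:d, d] = delta_vec
    block[d, :d] = delta_vec
    block[d, d] = t
    return block

def is_psd(mat, tol=1e-9):
    """True when the smallest eigenvalue of a symmetric matrix is >= -tol."""
    return bool(np.min(np.linalg.eigvalsh(mat)) >= -tol)

rng = np.random.default_rng(4)
d = 6
B = rng.normal(size=(d, d))
Sigma = B @ B.T + 0.5 * np.eye(d)   # hypothetical positive-definite covariance
Delta_i = rng.normal(size=d)        # hypothetical adjacency difference

# Mahalanobis signal Delta^T Sigma^-1 Delta.
signal = float(Delta_i @ np.linalg.solve(Sigma, Delta_i))

feasible = is_psd(schur_block(Sigma, Delta_i, 1.1 * signal))    # t above the signal
infeasible = is_psd(schur_block(Sigma, Delta_i, 0.9 * signal))  # t below the signal
```

For positive-definite `Sigma` the block matrix is positive semidefinite exactly when `t` is at least the Mahalanobis signal, which is why one such block per adjacency pair encodes the worst-case constraint.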

## Appendix V Learned inversion benchmark

The retrieval attackers of Section [4](https://arxiv.org/html/2605.24042#S4) upper-bound inversion by searching a bank for the closest match in some metric. A learned inverter is a stronger evaluator: it takes the released activation $\tilde{h}$ and directly predicts the $T$-token prefix, bypassing the bank entirely. This appendix trains one on GPT-2 Small and reports exact-match and token-accuracy against the four defense mechanisms of Section [4](https://arxiv.org/html/2605.24042#S4).

### V.1 Architecture and training

The inverter is a 6-layer transformer decoder with width $512$, $8$ heads, and $M=16$ activation memory tokens. An activation encoder $W\tilde{h}+b\in\mathbb{R}^{M\times 512}$ (LayerNormed) provides the key–value memory for cross-attention in each decoder block. Token embeddings are tied to the output head. Training uses AdamW at learning rate $3\times 10^{-4}$ with cosine schedule, $500$-step warmup, batch size $256$ (GPT-2) or $64$ (Mistral), gradient clipping at $1.0$, for $50{,}000$ steps on $100{,}000$ WikiText prefixes for GPT-2 and $40{,}000$ steps on $50{,}000$ prefixes for Mistral. Released activations are the last-token hidden state at layer $\ell=L/2$. Parameter counts: $57$M for GPT-2 ($d=768$), $75$M for Mistral ($d=4096$).

We evaluate four training corruption modes per model: clean-trained (released state $=$ clean $h$), and three adaptive inverters trained on states corrupted by a specific defender mechanism (isotropic, generalized-eigen, or $\Sigma^{\star}_{\mathrm{Mah}}$) at $\sigma=1$. Each trained inverter is then evaluated against all five release mechanisms (clean, isotropic, complement, generalized-eigen, $\Sigma^{\star}_{\mathrm{Mah}}$) at $\sigma\in\{1,3,5\}$.

### V.2 Setup and metrics

We evaluate on $1{,}000$ held-out prefixes, disjoint from training. For each evaluation the attacker receives $\tilde{h}=h+\xi$ where $\xi$ is drawn from the released covariance of each mechanism at noise level $\sigma$, and decodes greedily. Reported metrics are

- Exact-match (EM): fraction of test prefixes the inverter reconstructs exactly token-for-token.
- Token accuracy (TA): average fraction of per-position tokens that match ground truth.

The clean baseline measures how much of the prefix the inverter can recover from an*untouched*last\-token hidden state, which sets the ceiling for any noise\-based defense against this attacker\.
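The two metrics defined above can be computed as follows. This is a small sketch assuming predictions and targets are equal-length integer token arrays of shape (number of prefixes, prefix length); the toy arrays and function names are made up for illustration.

```python
import numpy as np

def exact_match(pred, target):
    """Fraction of prefixes reconstructed exactly, token for token."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean(np.all(pred == target, axis=1)))

def token_accuracy(pred, target):
    """Average fraction of per-position tokens that match ground truth."""
    pred, target = np.asarray(pred), np.asarray(target)
    return float(np.mean(pred == target))

# Toy example: three prefixes of four tokens each.
target = np.array([[5, 7, 9, 2], [1, 1, 4, 8], [3, 6, 6, 0]])
pred = np.array([[5, 7, 9, 2], [1, 0, 4, 8], [9, 9, 9, 9]])

em = exact_match(pred, target)      # only the first prefix matches exactly
ta = token_accuracy(pred, target)   # 7 of 12 positions match
```

The toy case shows why the two metrics can diverge sharply: a single wrong token removes a prefix from EM while barely moving TA, which is how EM can be zero everywhere while TA still registers partial leakage.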

### V.3 Result

Exact-match reconstruction of the 32-token prefix is 0 across every inverter and every release mechanism on both models: reconstructing $32$ discrete tokens from a single $d$-dimensional hidden state is beyond this training budget. Token accuracy (TA) gives a graded signal on partial leakage; Table [30](https://arxiv.org/html/2605.24042#A22.T30) reports TA on GPT-2 Small and Table [31](https://arxiv.org/html/2605.24042#A22.T31) on Mistral.

Table 30: Token accuracy (TA) of a learned inverter on GPT-2 Small (layer 6, $d=768$), at $\sigma=5$ release noise. Columns are release mechanisms, rows are inverter training corruption modes and training set size. Clean-test baseline improves from TA $0.055$ (100k pairs) to $0.078$ (500k pairs, $100{,}000$ steps, no overfit); chance is $1/|V|\approx 2\times 10^{-5}$. Exact-match was 0 on every cell at both budgets.

Table 31: Token accuracy (TA) of a learned inverter on Mistral-7B (layer 16, $d=4096$), at $\sigma=5$ release noise. Clean-test baseline is TA $0.042$; chance is $1/|V|\approx 3\times 10^{-5}$. The Mistral-specific inverters were trained on $50{,}000$ pairs; this is insufficient to generalize (final train loss reached 0 for clean and $0.37$ for iso).

Five findings. First, at 500k training pairs the clean-trained GPT-2 inverter reaches TA $0.078$ on the clean test state and does not overfit (final training loss $1.73$, down from $2.0$ mid-training), giving a meaningful gradient of suppression across release mechanisms.

Second, generalized-eigen noise at $\sigma=5$ drives the clean-trained inverter to TA $0.009$, an $8.7\times$ suppression from the clean baseline, while isotropic at matched $\sigma$ reaches $7.1\times$ and $\Sigma^{\star}_{\mathrm{Mah}}$ only $1.4\times$. In the learned-inversion threat model generalized-eigen is the most privacy-preserving mechanism and $\Sigma^{\star}_{\mathrm{Mah}}$ is nearly transparent to the inverter, which is the opposite of the Mahalanobis-retrieval story where $\Sigma^{\star}_{\mathrm{Mah}}$ matches isotropic and generalized-eigen collapses to $100\%$ top-1. The two threat models reward different defender strategies: a retrieval attacker benefits from rank deficiency (it projects out the noised subspace and retrieves from the clean residual), whereas a learned inverter is hampered by it (the generation decoder cannot inverse-project), and the Fisher-shaped $\Sigma^{\star}_{\mathrm{Mah}}$ places noise on low-Fisher directions that the inverter does not rely on.
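The suppression factors in the second finding are ratios of the clean baseline to the noised token accuracy. The short sketch below reproduces that arithmetic from the TA values quoted in the text; the dictionary keys are informal labels, not identifiers from the paper's code.

```python
# Token accuracies quoted in the text for the clean-trained GPT-2 inverter
# (500k training pairs), evaluated at sigma = 5.
clean_baseline_ta = 0.078
ta_at_sigma_5 = {
    "generalized-eigen": 0.009,
    "isotropic": 0.011,
    "Sigma_Mah": 0.055,
}

# Suppression factor = clean baseline TA / TA under the release mechanism.
suppression = {name: clean_baseline_ta / ta for name, ta in ta_at_sigma_5.items()}

# Mechanisms ordered from most to least suppression of the inverter.
ranking = sorted(suppression, key=suppression.get, reverse=True)
```

Rounded to one decimal, the ratios recover the $8.7\times$, $7.1\times$, and $1.4\times$ figures, with generalized-eigen first and $\Sigma^{\star}_{\mathrm{Mah}}$ last.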

Third, adaptive training amplifies the attack only $1.3$–$2\times$ over the clean-trained baseline: iso-trained reaches TA $0.022$ on iso-released states (vs $0.011$ for clean-trained, $1.96\times$); gen-eigen-trained reaches $0.017$ on gen-eigen-released (vs $0.009$, $1.88\times$); $\Sigma^{\star}_{\mathrm{Mah}}$-trained reaches $0.073$ on $\Sigma^{\star}_{\mathrm{Mah}}$-released (vs $0.055$, $1.34\times$). Adaptive training is a $\sim 2\times$ threat in practice, not the order-of-magnitude amplification we anticipated. The iso-trained inverter turns out to be the best attacker in every column except the gen-eigen column (where gen-eigen-trained wins by $0.004$ TA), because its training distribution contains enough clean-ish mass to transfer.

Fourth, ranking defenses by worst-case TA over the four attackers gives a clean ordering that *inverts* the Mahalanobis-retrieval ranking of Section [6](https://arxiv.org/html/2605.24042#S6). Under learned inversion at $\sigma=5$, from most to least protective: gen-eigen (worst-attacker TA $0.017$) $>$ iso ($0.022$) $>$ complement ($0.027$) $>$ $\Sigma^{\star}_{\mathrm{Mah}}$ ($0.073$). The theorem-optimal Mahalanobis defense ranks last. The reason is the Fisher-shaped noise: $\Sigma^{\star}_{\mathrm{Mah}}$ concentrates variance in low-Fisher directions, which are exactly the directions an inversion decoder does not use to predict tokens, so the attacker sees almost-unperturbed hidden states for the purpose of reconstruction. This is the mirror of what makes $\Sigma^{\star}_{\mathrm{Mah}}$ strong against Mahalanobis retrieval (which *does* use those directions via $\Sigma^{-1}$).

Fifth, gen-eigen-trained inversion collapses across the board (TA $\leq 0.017$, and only $0.014$ on clean test states vs $0.078$ for other training modes). Training on rank-deficient-corrupted activations appears to destroy the inverter's ability to learn useful token priors in the first place, so adaptive training offers no path to defeating gen-eigen at this budget. This confirms that rank deficiency is a fundamental obstruction to learned inversion and a complementary direction to full-rank defenses.

Mistral is uniformly harder for the inverter, with clean-test TA $0.042$ and all noised TAs $0.006$–$0.009$; the smaller training budget ($50{,}000$ pairs vs $100{,}000$ on GPT-2 Small) and larger hidden dimension ($d=4096$) are both consistent with a conservative upper bound, not a tight one.

Raw data and training curves for every cell of Tables [30](https://arxiv.org/html/2605.24042#A22.T30) and [31](https://arxiv.org/html/2605.24042#A22.T31) are available in the repository (Appendix [AD](https://arxiv.org/html/2605.24042#A30)): one JSON per (model, training-corruption) run, carrying the full training-loss trace (logged every $200$ steps) and the $13$-row evaluation matrix (the clean release plus the four noised release mechanisms at three $\sigma$ values each).

The single-vector inverter is a conservative upper bound. Exact-match is 0 everywhere on it. The full-trajectory sequence inverter of Section [7](https://arxiv.org/html/2605.24042#S7) (Table [1](https://arxiv.org/html/2605.24042#S7.T1)) reaches $94\%$ exact-match on clean GPT-2 prefixes using the same SeqInv architecture with the activation encoder replaced by a per-token linear projection of the released hidden state, and confirms that $\Sigma_{\mathrm{diag}}$ at $\sigma=5$ holds the stronger attacker to $0\%$ exact-match.

## Appendix WAdversarial Channel Targeting

The adversarial experiment injects noise into the layer\-6 hidden state of GPT\-2 Small and measures the asymmetric effect on output predictions depending on which subspace the noise targets\.

The gradient covariance eigenvectors are recomputed from $N=1{,}000$ calibration samples (a subset of the 5,000 used for the main experiments). The behavior projector $P_{B}=U_{128}U_{128}^{\top}$ and identity projector $P_{I}=I-P_{B}$ are constructed at $k=128$, matching the main experiments. For each of 200 test prefixes (disjoint from calibration), a noise vector $\boldsymbol{\epsilon}\sim\mathcal{N}(0,I_{d})$ is drawn and projected onto each subspace. The projected noise is then rescaled so that the $\ell_{2}$ norm is the same across all three conditions (identity, behavior, random) for a given $\sigma$. The target norm is $\sigma\cdot\bar{h}$, where $\bar{h}=87.3$ is the mean hidden-state norm at layer 6. This normalization ensures that any difference in KL between conditions reflects the directional sensitivity of the model, not a difference in noise magnitude.
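The projection-and-rescale step can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' code: the function names, the 6-dimensional space, and the two-axis "behavior" basis are hypothetical, and the "random" condition is assumed here to be the unprojected Gaussian draw, which the text does not spell out.

```python
import math
import random

def project(vec, basis):
    """Project vec onto the span of an orthonormal basis (list of vectors)."""
    out = [0.0] * len(vec)
    for u in basis:
        coeff = sum(ui * vi for ui, vi in zip(u, vec))
        for i, ui in enumerate(u):
            out[i] += coeff * ui
    return out

def rescale_to_norm(vec, target_norm):
    """Rescale vec so that its l2 norm equals target_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return [v * target_norm / norm for v in vec]

def matched_noise(d, behavior_basis, sigma, mean_hidden_norm, rng):
    """Draw eps ~ N(0, I_d) and return norm-matched noise for each condition."""
    eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
    behavior = project(eps, behavior_basis)             # P_B eps
    identity = [e - b for e, b in zip(eps, behavior)]   # P_I eps = (I - P_B) eps
    target = sigma * mean_hidden_norm                   # sigma * h_bar
    return {
        "behavior": rescale_to_norm(behavior, target),
        "identity": rescale_to_norm(identity, target),
        "random": rescale_to_norm(eps, target),         # assumption: unprojected draw
    }

# Toy setting (hypothetical): d = 6, behavior subspace = first two axes.
d = 6
basis = [[1.0 if i == j else 0.0 for i in range(d)] for j in range(2)]
noise = matched_noise(d, basis, sigma=1.0, mean_hidden_norm=87.3,
                      rng=random.Random(0))
norms = {name: math.sqrt(sum(x * x for x in v)) for name, v in noise.items()}
```

All three vectors end up with the same norm $\sigma\cdot\bar{h}$, so any difference in downstream KL between them can only come from direction.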

The noise is injected via a forward hook on the output of Transformer block 6 that adds the noise to the last-token hidden state. The model then processes the perturbed representation through layers 7–11 and the output head. We measure the KL divergence $\mathrm{KL}(p_{\text{base}}\,\|\,p_{\text{perturbed}})$ at the last position and whether the top-1 predicted token is preserved.
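The two measurements can be computed directly from last-position logits. The sketch below is a minimal illustration under that assumption; the function names and the three-token logit vectors are hypothetical and only show the arithmetic, not the forward-hook machinery.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_from_logits(base_logits, perturbed_logits):
    """KL(p_base || p_perturbed), in nats, computed in log space."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(perturbed_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_p, log_q))

def top1_preserved(base_logits, perturbed_logits):
    """True if the argmax token is the same before and after perturbation."""
    return (base_logits.index(max(base_logits))
            == perturbed_logits.index(max(perturbed_logits)))

# Toy logits (hypothetical values).
base = [2.0, 1.0, 0.1]
perturbed = [1.9, 1.2, 0.1]
kl = kl_from_logits(base, perturbed)
same_top1 = top1_preserved(base, perturbed)
```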

Table [32](https://arxiv.org/html/2605.24042#A23.T32) reports this analysis across models, layers, and subspace dimensions. The behavior/identity KL ratio ranges from $1.8\times$ to $3.1\times$ across all 9 configurations at $\sigma=1.0$, confirming the asymmetry is a robust property of the two-channel decomposition rather than a single-configuration artifact. The ratio is highest at small $k$ (e.g., $3.1\times$ at layer 11, $k=64$), where the behavior subspace is most concentrated and directionally specific. The ratio also holds on GPT-2 Medium ($1.8$–$2.2\times$ at layer 12), confirming cross-model consistency.

Table 32: Adversarial channel targeting across models, layers, and $k$ at $\sigma=1.0$. Ratio is behavior KL / identity KL.

This result has a direct security implication: an adversary with access to hidden states and knowledge of the gradient covariance decomposition can selectively target the identity channel to disrupt SipIt-based prompt recovery while minimizing observable changes to model outputs. The attack requires only the top-$k$ eigenvectors of the gradient covariance, which can be estimated from a modest number of input-output pairs without access to model weights. Note, however, that the defense analysis in Appendix [T](https://arxiv.org/html/2605.24042#A20) shows that at the full-space $\ell_{2}$ attack level this adversarial knowledge does not increase retrieval success; the asymmetry is a defender’s tool.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x21.png)Figure 21: Ablation KL for each attention head at five layers of GPT-2 Small. The maximum per-head KL is roughly $24\times$ smaller than the per-MLP KL at the matched layer, indicating that attention heads are individually much less load-bearing for the predictive computation than MLP blocks, consistent with the Fisher spectrum being concentrated on MLP output directions. The single high-KL cell visible in the heatmap is the layer-9 induction head whose ablation costs more than any other single head ($\sim 0.4$ nats), still well below the per-MLP KL at the same layer; we treat it as a known per-head outlier rather than a counterexample to the spectrum-on-MLP-outputs reading.

## Appendix X Predictive Quotient Release: Empirical Test on GPT-2 Small

The local quotient factorization theorem (Theorem [18](https://arxiv.org/html/2605.24042#Thmproposition18)) and the leakage bound (Theorem [19](https://arxiv.org/html/2605.24042#Thmproposition19)) say that releasing a learned $r$-dimensional quotient $z$ instead of the full hidden state $h$, which we call predictive quotient release (PQR; Figure [22](https://arxiv.org/html/2605.24042#A24.F22)), should be able to fill the moderate-utility, moderate-privacy region of the Pareto frontier when $r\ll d$, because the mutual-information leakage scales with $r$ and not $d$. We test this empirically on GPT-2 Small at layer 6 across a $44$-cell sweep, and find that the mechanism does not in practice fill the empty middle at this model scale. We document the failure mode honestly because it sharpens the story rather than weakening it: the constant-rank theorem identifies the right object in principle, but realizing it as a usable defense requires more capacity than our sweep gave it.

### X.1 Mechanism and sweep

![Refer to caption](https://arxiv.org/html/2605.24042v1/x22.png)Figure 22: Predictive quotient release. The frozen language model is split at layer $\ell$. The token-wise hidden state $h_{t}^{(\ell)}$ is encoded by $q_{\phi}$ to a low-dimensional latent $z_{t}\in\mathbb{R}^{r}$ with $r\ll d$. The defender adds Gaussian release noise $\eta\sim\mathcal{N}(0,\sigma_{\mathrm{rel}}^{2}I_{r})$ to obtain $\tilde{z}_{t}$, which is the externally released object (red arrow). Inside the trusted serving boundary (dashed box) a decoder $D_{\psi}$ reconstructs $\hat{h}_{t}^{(\ell)}$, which is fed back into the LM to continue the forward pass and produce logits.

The encoder is a two-layer MLP $\mathrm{LN}(h)\to\mathrm{Linear}(d,4r)\to\mathrm{GELU}\to\mathrm{Linear}(4r,2r)$ followed by a split into $\mu,\log\sigma^{2}$. The decoder is a two-layer MLP $\mathrm{Linear}(r,4r)\to\mathrm{GELU}\to\mathrm{Linear}(4r,d)$. The released latent is $z=q_{\phi}(h)+\eta$ with $\eta\sim\mathcal{N}(0,\sigma_{\mathrm{rel}}^{2}I_{r})$. Training minimizes a sum of three terms: a 16-token continuation KL distillation between the clean and decoder-reinjected logits, a Gaussian information-bottleneck regularizer $\beta\,\mathrm{KL}(q_{\phi}(z|h)\,\|\,\mathcal{N}(0,I_{r}))$, and an InfoNCE retrieval-adversarial term with gradient reversal at coefficient $\gamma$.

The sweep spans $r\in\{16,32,64,128\}$, $\beta\in\{10^{-3},10^{-2},10^{-1},3\cdot 10^{-1}\}$, $\gamma\in\{0,0.1,0.3\}$, and $\sigma_{\mathrm{rel}}\in\{0,0.1,0.2,0.5\}$. Stage 2A holds $\gamma=\sigma_{\mathrm{rel}}=0$ to identify the utility floor across $(r,\beta)$. Stage 2B sweeps $\gamma\times\sigma_{\mathrm{rel}}$ at the most promising $(r,\beta)$ pairs ($r\in\{32,64\}$, $\beta\in\{10^{-2},10^{-1}\}$). Stage 2C reruns the two best Stage 2B configurations across three seeds. Each cell trains for $50{,}000$ steps with batch size $128$, AdamW at $2\times 10^{-4}$ with $1{,}000$-step warmup and cosine decay, on WikiText-103 prefixes of length $64$. The frozen GPT-2 Small forward pass is in bfloat16; the encoder, decoder, and adversary are in float32.

Evaluation uses a $50{,}000$-prefix retrieval bank with $2{,}000$ held-out queries. For each cell we compute (a) the utility metric $t_{1}$, the fraction of queries on which the next-token argmax under decoder reinjection matches the clean argmax (computed over $200$ queries), and (b) the privacy metric $\mathrm{attack}_{\mathrm{top-1}}$, the fraction of noisy released latents whose nearest-bank-item under squared $\ell_{2}$ in $\mathbb{R}^{r}$ is the correct query identity.
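The privacy metric is a nearest-neighbor retrieval rate, which the following sketch illustrates on synthetic latents. It is an illustration only: the bank size, latent width, and noise levels are hypothetical, and the toy assumes the bank stores the clean latent of each query, which may differ from how the authors build their bank.

```python
import random

def sq_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def attack_top1(released, bank, true_ids):
    """Fraction of released latents whose nearest bank item is the true one."""
    hits = 0
    for z, true_id in zip(released, true_ids):
        nearest = min(range(len(bank)), key=lambda j: sq_l2(z, bank[j]))
        hits += nearest == true_id
    return hits / len(released)

# Toy setting (hypothetical sizes): 50 bank items in R^8, 10 queries.
rng = random.Random(0)
r, bank_size = 8, 50
bank = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(bank_size)]
true_ids = list(range(10))

def release(sigma_rel):
    """Add isotropic Gaussian release noise to each query's clean latent."""
    return [[x + rng.gauss(0.0, sigma_rel) for x in bank[i]] for i in true_ids]

clean_rate = attack_top1(release(0.0), bank, true_ids)   # no release noise
noisy_rate = attack_top1(release(5.0), bank, true_ids)   # heavy release noise
```

With no release noise every query sits exactly on its bank entry, so retrieval is perfect; this is the same reason the $\sigma_{\mathrm{rel}}=0$ cells in the sweep report an attack rate of 1.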

### X.2 Result

Across all $44$ cells, no cell satisfies $t_{1}\geq 0.5$ and $\mathrm{attack}_{\mathrm{top-1}}\leq 0.5$. The two extremes look as follows:

- The utility ceiling under the full-sequence reinjection protocol that matches training is set by the cell $(r=64,\beta=10^{-3},\sigma_{\mathrm{rel}}=0)$: $t_{1}=0.415$, $\mathrm{attack}=1.000$. With $\sigma_{\mathrm{rel}}=0$ the attacker recovers every query because the encoder is deterministic at inference, so retrieval over the bank is trivially perfect.
- The privacy floor is set by cells with high $\sigma_{\mathrm{rel}}$ and high $\gamma$, e.g. $(r=64,\beta=10^{-1},\gamma=0.3,\sigma_{\mathrm{rel}}=0.5)$: $t_{1}=0.070$, $\mathrm{attack}=0.022$. Privacy is essentially complete but the decoder reconstruction has so much error that next-token prediction agreement collapses to noise.

The closest near-miss cells are the three Stage 2C seeds at $(r=64,\beta=10^{-1},\gamma=0.1,\sigma_{\mathrm{rel}}=0.2)$, which attain $\mathrm{attack}\in\{0.262,0.287,0.341\}$ at $t_{1}\in\{0.035,0.035,0.055\}$. These cells cross the privacy threshold ($\mathrm{attack}\leq 0.5$) but fall an order of magnitude short of the utility threshold.
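The moderate-both test applied to each cell is a simple two-threshold check. The sketch below applies it to the five cells quoted above; the cell labels are hypothetical, and pairing the three seed attack rates with the three $t_{1}$ values in listed order is an assumption, since the text gives them as sets.

```python
def in_moderate_both(t1, attack, t1_min=0.5, attack_max=0.5):
    """True if a cell has moderate utility AND moderate privacy."""
    return t1 >= t1_min and attack <= attack_max

# (label, t1, attack) for the cells quoted in the text.
# Seed pairing is assumed to follow the listed order.
cells = [
    ("utility ceiling", 0.415, 1.000),
    ("privacy floor", 0.070, 0.022),
    ("stage 2C seed 1", 0.035, 0.262),
    ("stage 2C seed 2", 0.035, 0.287),
    ("stage 2C seed 3", 0.055, 0.341),
]

filled = [label for label, t1, attack in cells if in_moderate_both(t1, attack)]

# How far each near-miss seed falls short of the utility threshold.
shortfall = {label: 0.5 / t1 for label, t1, _ in cells if label.startswith("stage 2C")}
```

The `shortfall` factors are roughly 9 to 14, which is the "order of magnitude" gap the paragraph describes.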

### X.3 Why the mechanism failed

Two factors compound in the wrong direction. The bottleneck reconstruction error, measured at $\sigma_{\mathrm{rel}}=0$ where there is no privacy noise, is already substantial: under the full-sequence reinjection protocol that matches training, the deterministic encode-decode round-trip sees $t_{1}$ values of $0.205,0.290,0.415,0.385$ at $r=16,32,64,128$ respectively. The architecture already loses more than half of the next-token agreement at the deterministic regime, and counterintuitively the largest tested $r$ does not improve over $r=64$ at $50{,}000$ training steps. The privacy noise then compounds on top of this: at $\sigma_{\mathrm{rel}}=0.2$, $t_{1}$ falls below $0.06$ on every cell.

This is a real limitation of the simple two-layer MLP encoder/decoder at our training budget, not a refutation of the constant-rank theorem. The theorem says behavior locally factors through some $r$-dimensional quotient, but the practical rank of next-token prediction at GPT-2 Small layer 6 is empirically close to $d=768$, so a $6\times$ compression already costs half the agreement before any privacy noise is added. To make the quotient release work at this scale would require either a larger $r$ (closer to $d$, defeating the leakage advantage), a deeper encoder/decoder (e.g. a multi-layer transformer in place of the two-layer MLP), or a much larger training budget than $50{,}000$ steps per cell.

### X.4 What this means for the empty-middle finding

The Gaussian impossibility result of Theorem [13](https://arxiv.org/html/2605.24042#Thmproposition13) continues to hold: no full-rank Gaussian release of the full hidden state can fill the moderate-both region at $O(1)$ utility budget. The local-quotient-release mechanism is the natural class outside Gaussian full-state release, and Theorem [19](https://arxiv.org/html/2605.24042#Thmproposition19) predicts it should be able to fill the empty middle in principle. Our empirical sweep at GPT-2 Small does not realize that prediction. The honest reading is that the moderate-both region is achievable in a strictly larger function class than full-rank Gaussian release of $h$, but achieving it under a learned encoder-decoder bottleneck requires both architectural capacity and training scale beyond what the $44$-cell sweep covered. We leave the design of a release mechanism that empirically fills the moderate-both region as the central open problem of this line of work.

## Appendix Y Split-Memory Transformer training details and ablations

We train four 12-layer architectures at matched $\approx 90$M parameters, batch size $32$, sequence length $256$, AdamW at $3\times 10^{-4}$ with $1{,}000$-step warmup and cosine decay to zero, in bfloat16, for $20{,}000$ steps on WikiText-103. The four architectures are

- Baseline GPT: standard 12-layer pre-norm decoder, $d=768$, $8$ heads, FF $3072$.
- SMT main: $r=128$, $m=640$, $4$ heads in trunk, $4$ in memory, FF $512$ (trunk) / $1280$ (memory), $\lambda_{J}=10^{-3}$.
- SMT no-Jac: same as SMT main with $\lambda_{J}=0$.
- SMT $r=64$: $r=64$, $m=704$, $\lambda_{J}=10^{-3}$, otherwise same.

After training we measure $E_{k}$, $q_{B}$, $G_{\mathrm{Mah}}$, and $\mathrm{tr}(F)$ at probe layers $\ell\in\{4,6,8\}$ following the same Fisher-diagonal and margin-covariance computation used for the five base models in the paper.

### Y.1 Per-architecture interpretation

Three findings stand out beyond the main-body table. The baseline 12-layer GPT trained at the same parameter count and same data shows $E_{k}\approx 0.22$ at $k=128$ over $d=768$, close to the random-projection floor $k/d=0.167$, with $G_{\mathrm{Mah}}$ in $[1.1,1.3]$. The Fisher mass is essentially uniform across coordinates. Both SMT variants at $r=128$ achieve $E_{k}\geq 0.99$, meaning the top-$128$ Fisher coordinates capture at least $99\%$ of the gradient energy. By routing logits only through a $128$-dimensional trunk, we force the Fisher to concentrate in $128$ coordinates while keeping the same total hidden width. SMT $r=64$ concentrates Fisher even more sharply ($E_{k}=0.998$) and produces $G_{\mathrm{Mah}}$ in $[20,33]$.

The Hutchinson Jacobian penalty $\lambda_{J}$ matters only marginally. The no-Jac variant achieves the highest $G_{\mathrm{Mah}}$ at layer 6 ($11.92$) but is slightly below the penalised variant at layer 8. The architectural prior of routing logits through a low-dimensional trunk does most of the work, and the explicit gradient penalty fine-tunes which coordinates carry the loss sensitivity.

### Y.2 Limitations

The trained SMT models do not match a same-budget GPT on raw language-modeling perplexity at this training scale (final cross-entropy $\sim 5.6$ baseline vs. $5.8$–$6.0$ SMT), so the design currently buys Fisher concentration at a small modeling cost. The interaction between the LM-loss and Fisher-concentration optima at larger training scale is left open.

## Appendix Z Fixed-Token Scaling of Split-Memory Transformers

The 90M Split-Memory Transformer of Section [8](https://arxiv.org/html/2605.24042#S8) is one point on a four-scale fixed-token sweep. We trained matched-$d$ SMT and GPT-baseline pairs at four parameter scales (30M, 90M, 300M, 1B), at three step counts (5k, 20k, 80k) at the 90M tier, and at three seeds at 90M with steps 20k. Every run uses the same WikiText-103 prefix budget (164M tokens, batch 32, sequence 256), so what changes between tiers is parameters per token, not training duration. This appendix records the scaling structure that emerges and the architectural inferences supported by it.

### Z.1 SMT breaks the baseline ceiling across scales

Table [33](https://arxiv.org/html/2605.24042#A26.T33) reports median-across-probe-layers $G_{\mathrm{Mah}}$ for SMT and matched-$d$ baselines at each tier, alongside the empirical Fisher concentration $E_{128}$, the Fisher-trunk margin fraction $q_{B}$ normalized by the random-subspace expectation $128/d$, and the language-modeling-loss penalty $\Delta L=L^{\mathrm{SMT}}_{\mathrm{lm}}-L^{\mathrm{base}}_{\mathrm{lm}}$.

Table 33: Fixed-token SMT scaling sweep. SMT $G_{\mathrm{Mah}}$ stays an order of magnitude above the matched-$d$ baseline at every scale, while the baseline is approximately flat at $G_{\mathrm{Mah}}\approx 1.3$. The Fisher-trunk margin fraction $q_{B}/(128/d)$ collapses from $0.78$ to $0.07$ across the SMT sweep but is essentially constant for the baseline, identifying $q_{B}$ as the architectural variable that drives the gap. The fixed-token modeling-loss penalty $\Delta L$ widens from $1.18$ to $2.15$ nats across the same range.

A four-point log-log fit gives $G_{\mathrm{Mah}}^{\mathrm{SMT}}\propto P^{0.32}$ at $R^{2}=0.71$, with the 1B point falling $1.85\times$ below the 30M-90M-300M extrapolation. The fit is reported because reviewers will ask for it, but the headline claim is the qualitative pattern in Figure [23](https://arxiv.org/html/2605.24042#A26.F23)A. SMT remains an order of magnitude above the baseline across $33\times$ in parameters, while the baseline is approximately flat at $G_{\mathrm{Mah}}\approx 1.3$. The flat baseline is itself a structural finding: same-budget GPT models do not stumble into a high-$G_{\mathrm{Mah}}$ regime as parameters grow, which extends the empty-middle behaviour the rest of the paper identifies on pretrained models to models trained from scratch across scale.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x23.png)Figure 23: (A) $G_{\mathrm{Mah}}$ vs. parameters. SMT (orange circles) maintains a $6$–$24\times$ advantage from 30M to 1B. Matched-$d$ baselines (blue squares) remain flat at $\approx 1.3$. Faint markers at 90M are individual seeds. (B) $E_{128}$ (top-128 Fisher mass) decays in SMT as the trunk dimension $r$ grows. The dashed reference is $\min(1,128/r)$, the value expected if Fisher were uniform on the trunk. (C) Margin mass $q_{B}$ in the top-128 Fisher subspace, normalized by the random-subspace expectation $128/d$. The baseline holds near $0.75$ across all scales (close to isotropic), while SMT collapses from $0.78$ to $0.07$. (D) Fixed-token modeling-loss penalty $\Delta L$ grows monotonically with scale.

### Z.2 The baseline is flat

The matched-$d$ GPT baseline gives $G_{\mathrm{Mah}}\in\{1.60,1.27,1.19,1.28\}$ across $30$M to $1.65$B parameters. A log-log fit returns slope $-0.06$, statistically indistinguishable from zero. Combined with the empty-middle finding on pretrained models (Section [6](https://arxiv.org/html/2605.24042#S6)), this is empirical evidence that conventional decoder-only training does not enter the high-$G_{\mathrm{Mah}}$ regime at any scale we tested. The architectural axis (predictive-trunk readout, suppressed memory-to-trunk coupling) is the missing piece, and this sweep shows that it survives a $33\times$ parameter sweep.
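The reported slope can be approximately reproduced with an ordinary least-squares fit in log-log space. The sketch below assumes the four $G_{\mathrm{Mah}}$ values are listed in order of increasing scale and uses the nominal tier sizes 30M, 90M, 300M, and 1B as the parameter counts; both are assumptions, and the exact parameter counts used by the authors may differ.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx

params = [30e6, 90e6, 300e6, 1e9]    # assumed nominal tier sizes
g_mah = [1.60, 1.27, 1.19, 1.28]     # baseline values quoted in the text
slope = loglog_slope(params, g_mah)
print(f"baseline log-log slope: {slope:.3f}")
```

Under these assumptions the slope comes out close to the quoted $-0.06$, and most of it is driven by the single higher value at the smallest tier.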

### Z.3 Top-128 Fisher concentration decays as the trunk grows

The SMT trunk dimension grows from $r=64$ at 30M to $r=512$ at 1B, while the analysis dimension $k=128$ remains fixed. The observed $E_{128}\in\{0.998,0.994,0.661,0.363\}$ tracks an architectural reference more closely than a leakage interpretation:

###### Proposition 31 (Fixed-$k$ Fisher concentration under a growing trunk).

Let $U\subset\mathbb{R}^{d}$ be a trunk subspace of dimension $r$, and suppose the trace-normalized Fisher $\rho_{F}$ is supported entirely on $U$ and uniform on $U$:

$$\rho_{F}=\frac{1}{r}P_{U}.$$

Let $E_{k}$ be the mass of the top $k$ Fisher eigenvalues. Then

$$E_{k}=\min\!\left(1,\frac{k}{r}\right).$$

More generally, if $\rho_{F}$ is supported on $U$ then $E_{k}$ can decrease with $r$ even when $C_{u}=\mathrm{tr}(P_{U}\rho_{F})=1$.

###### Proof.

The nonzero eigenvalues of $\rho_{F}=P_{U}/r$ are $1/r$, repeated $r$ times. The top-$k$ mass is $\min(1,k/r)$. ∎

The reference $\min(1,128/r)$ at the four SMT scales gives $\{1.0,1.0,0.5,0.25\}$, compared to the observed $\{0.998,0.994,0.661,0.363\}$. The observed decay is consistent with predictive rank growing inside an unchanged trunk and is not by itself evidence that Fisher mass has escaped into the memory branch. A direct measurement of the trunk-confinement quantity $C_{u}=\mathrm{tr}(P_{U}\rho_{F})$, where $P_{U}$ projects onto the entire trunk rather than onto a fixed-rank subspace, would be required to attribute the $E_{128}$ decay to leakage rather than to rank growth, and we leave this measurement open.
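The reference values follow directly from Proposition 31. The sketch below recomputes them and the gap to the observed $E_{128}$. The trunk dimensions $r=64$ and $r=512$ are stated in the text and $r=128$ at 90M matches Appendix Y; $r=256$ at 300M is an assumption inferred from the quoted reference value of $0.5$.

```python
def uniform_trunk_reference(k, r):
    """E_k when Fisher is uniform on an r-dimensional trunk (Proposition 31)."""
    return min(1.0, k / r)

trunk_dims = [64, 128, 256, 512]            # 256 is inferred, not stated
observed = [0.998, 0.994, 0.661, 0.363]     # observed E_128 per tier

reference = [uniform_trunk_reference(128, r) for r in trunk_dims]
gap = [round(obs - ref, 3) for obs, ref in zip(observed, reference)]

for r, ref, obs, g in zip(trunk_dims, reference, observed, gap):
    print(f"r={r:4d}  reference={ref:.3f}  observed={obs:.3f}  gap={g:+.3f}")
```

At the two larger tiers the observed value sits above the uniform reference, which is what one would expect if Fisher mass inside the trunk is somewhat non-uniform rather than leaking out of it.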

### Z.4 Fisher–margin coupling $q_{B}$ is the hidden driver

Where $E_{128}$ tells a stable architectural story, $q_{B}/(128/d)$ tells a second one. The baseline value sits at $0.75\pm 0.04$ across all four scales (close to the isotropic-margin value of $1.0$), while SMT collapses from $0.78$ at 30M to $0.07$ at 1B (Figure [23](https://arxiv.org/html/2605.24042#A26.F23)C). At 30M, SMT has not yet shaken margin mass out of its Fisher trunk. At 1B, it has, by an order of magnitude. The increase in $G_{\mathrm{Mah}}$ from $9.8$ to $23.4$ across the SMT sweep is mediated by this depletion, not by improvements in Fisher concentration (which is already near $1$ at the 30M scale).

The block\-fidelity model below makes this explicit and explains why a flat baseline is the natural expectation:

###### Proposition 32 (Block-fidelity scaling of $G_{\mathrm{Mah}}$).

Let $U$ be an $r$-dimensional predictive trunk and $U^{\perp}$ the memory branch. Suppose

$$\rho_{F}=\frac{1}{r}P_{U},\qquad\rho_{S}=q\,\frac{1}{r}P_{U}+(1-q)\,\frac{1}{d-r}P_{U^{\perp}}.$$

Then $G_{\mathrm{Mah}}=1/q$. If margin mass is isotropic then $q=r/d$ and $G_{\mathrm{Mah}}=d/r$. If the architecture or training depletes prompt-margin mass from the trunk so that $q\ll r/d$, then $G_{\mathrm{Mah}}\gg d/r$.

###### Proof.

Since $\rho_{F}^{1/2}=P_{U}/\sqrt{r}$,

$$\rho_{F}^{1/2}\rho_{S}\rho_{F}^{1/2}=\frac{q}{r^{2}}P_{U},\quad\bigl(\rho_{F}^{1/2}\rho_{S}\rho_{F}^{1/2}\bigr)^{1/2}=\frac{\sqrt{q}}{r}P_{U},$$

so $\mathcal{F}(\rho_{F},\rho_{S})=\mathrm{tr}\bigl(\sqrt{q}\,P_{U}/r\bigr)=\sqrt{q}$, and $G_{\mathrm{Mah}}=\mathcal{F}^{-2}=1/q$. ∎
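The identity $G_{\mathrm{Mah}}=1/q$ can be checked numerically in the basis where both matrices are diagonal. In that commuting case the fidelity $\mathrm{tr}\bigl(\rho_{F}^{1/2}\rho_{S}\rho_{F}^{1/2}\bigr)^{1/2}$ reduces to $\sum_{i}\sqrt{f_{i}s_{i}}$ over the diagonal entries. The sketch below is a toy check under that simplification; the function names and the chosen $d$, $r$, $q$ are hypothetical.

```python
import math

def g_mah_diagonal(fisher_diag, margin_diag):
    """G_Mah = fidelity^(-2) for two diagonal, trace-one matrices."""
    fidelity = sum(math.sqrt(f * s) for f, s in zip(fisher_diag, margin_diag))
    return 1.0 / fidelity ** 2

def block_model(d, r, q):
    """Diagonals of rho_F and rho_S in the block model of Proposition 32."""
    fisher = [1.0 / r] * r + [0.0] * (d - r)
    margin = [q / r] * r + [(1.0 - q) / (d - r)] * (d - r)
    return fisher, margin

d, r = 768, 128

# Isotropic margin: q = r / d, so G_Mah should equal d / r = 6.
g_isotropic = g_mah_diagonal(*block_model(d, r, q=r / d))

# Depleted trunk: q well below r / d, so G_Mah should equal 1 / q = 20.
g_depleted = g_mah_diagonal(*block_model(d, r, q=0.05))
```

Only the margin mass $q$ inside the trunk enters the result; how the remaining $1-q$ is spread over the memory branch does not matter because $\rho_{F}$ vanishes there.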

For the matched-$d$ baseline at 1B, $q_{B}\approx 0.75\cdot 128/2304\approx 0.042$, predicting $G_{\mathrm{Mah}}\approx 24$ if Fisher were perfectly trunk-confined. The observed baseline value is $G_{\mathrm{Mah}}\approx 1.3$ because Fisher is not trunk-confined ($E_{128}\approx 0.087$ at 1B), so the prerequisite for Proposition [32](https://arxiv.org/html/2605.24042#Thmproposition32) fails. SMT satisfies it at 30M-90M ($E_{128}\approx 1$) and partially at 300M-1B, and the corresponding $G_{\mathrm{Mah}}$ tracks $1/q_{B}$ to within the slack of the bound.

The relevant constructive lemma is that low memory\-to\-trunk coupling produces the trunk\-confinement Proposition[32](https://arxiv.org/html/2605.24042#Thmproposition32)requires:

###### Proposition 33 (Small memory-to-trunk coupling implies small memory Fisher).

Let a layer state be $h=(u,v)$ and suppose the downstream behaviour map locally factors as $z(h)=g(u+\gamma Av)$, where $\gamma\geq 0$ is the memory-to-trunk coupling, $A$ a linear map, and $z$ the logits. Let $H_{z}$ be the logit Fisher and assume $J_{g}^{\top}H_{z}J_{g}\preceq LI$. Then the hidden-state Fisher block on the memory branch satisfies

$$F_{vv}=\gamma^{2}A^{\top}J_{g}^{\top}H_{z}J_{g}A\preceq\gamma^{2}L\,A^{\top}A,\qquad\mathrm{tr}(F_{vv})\leq\gamma^{2}L\,\|A\|_{F}^{2}.$$

###### Proof.

By the chain rule, $J_{v}z=\gamma J_{g}A$, so the $(v,v)$ block of $F=J_{z}^{\top}H_{z}J_{z}$ is $(\gamma J_{g}A)^{\top}H_{z}(\gamma J_{g}A)=\gamma^{2}A^{\top}J_{g}^{\top}H_{z}J_{g}A$. The Loewner and trace bounds follow from $J_{g}^{\top}H_{z}J_{g}\preceq LI$. ∎

This explains why the no-Jacobian-penalty SMT variant in Appendix [Y](https://arxiv.org/html/2605.24042#A25) performs comparably to the penalized variant. The architecture (output reading from $u_{L}$, gated $\gamma$ initialized at $0.01$) already drives $\gamma$ small, and the explicit penalty is a refinement of an effect the architecture imposes by construction.

### Z.5 Inverse-coupling regime: when $q_{B}$ is the only knob

Combining Propositions[31](https://arxiv.org/html/2605.24042#Thmproposition31)–[33](https://arxiv.org/html/2605.24042#Thmproposition33)with the projector\-separation theorem \(Theorem[17](https://arxiv.org/html/2605.24042#Thmproposition17)\) gives the operating regime that the SMT scaling sweep visits:

###### Corollary 34 (Inverse-coupling law in the high-$E$ regime).

Suppose $\rho_{F}$ is supported mostly in a rank-$k$ subspace $B$, with $E_{k}=\mathrm{tr}(P_{B}\rho_{F})\geq 1-\varepsilon$, and that the within-$B$ spectra of $\rho_{F}$ and $\rho_{S}$ are not pathologically mismatched. Then $G_{\mathrm{Mah}}\approx 1/q_{B}$ with $q_{B}=\mathrm{tr}(P_{B}\rho_{S})$. The conservative projector-separation bound

$$G_{\mathrm{Mah}}\geq\frac{1}{\bigl(\sqrt{E_{k}q_{B}}+\sqrt{(1-E_{k})(1-q_{B})}\bigr)^{2}}$$

collapses to $1/q_{B}$ in the $E_{k}\to 1$, $q_{B}\to 0$ corner.
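To see how conservative the bound is away from the exact corner, it can be evaluated at a Fisher concentration slightly below one. The sketch below does this for a few illustrative $q_{B}$ values; the function name and the chosen $q_{B}$ grid are hypothetical, and $E_{k}=0.996$ is used only because it is the value reported for the 90M seeds.

```python
import math

def projector_separation_bound(e_k, q_b):
    """Lower bound on G_Mah from the projector-separation inequality."""
    overlap = math.sqrt(e_k * q_b) + math.sqrt((1.0 - e_k) * (1.0 - q_b))
    if overlap == 0.0:
        return math.inf
    return 1.0 / overlap ** 2

rows = []
for q_b in (0.2, 0.1, 0.05):
    at_corner = projector_separation_bound(1.0, q_b)      # E_k = 1 exactly
    near_corner = projector_separation_bound(0.996, q_b)  # E_k slightly below 1
    rows.append((q_b, 1.0 / q_b, at_corner, near_corner))

for q_b, inv_q, at_corner, near_corner in rows:
    print(f"q_B={q_b:.2f}  1/q_B={inv_q:5.1f}  "
          f"bound(E=1)={at_corner:5.1f}  bound(E=0.996)={near_corner:5.1f}")
```

At $E_{k}=1$ the bound equals $1/q_{B}$, while a $0.4\%$ shortfall in concentration already pulls the guaranteed value noticeably below $1/q_{B}$ for small $q_{B}$. That slack is one way to read observed products $q_{B}\cdot G_{\mathrm{Mah}}$ that sit somewhat below one.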

The 90M three-seed sweep occupies exactly this regime. All three seeds give $E_{128}\approx 0.996$, so Fisher concentration is fixed. What varies is $q_{B}$, and the resulting $G_{\mathrm{Mah}}$ traces $1/q_{B}$:

The product $q_{B}\cdot G_{\mathrm{Mah}}$ is $0.81$, $0.86$, $0.81$ across the three seeds, consistent with Corollary [34](https://arxiv.org/html/2605.24042#Thmproposition34)’s prediction that $G_{\mathrm{Mah}}\approx 1/q_{B}$ when $E_{k}$ is fixed near $1$. SMT reliably concentrates Fisher, and whether it reaches very high $G_{\mathrm{Mah}}$ depends on whether training also pushes prompt margins away from the Fisher-sensitive trunk, an event that is initialization-dependent.

### Z.6 The fixed-token modeling-loss penalty widens with scale

The third pattern in Figure [23](https://arxiv.org/html/2605.24042#A26.F23)D is that $\Delta L=L^{\mathrm{SMT}}_{\mathrm{lm}}-L^{\mathrm{base}}_{\mathrm{lm}}$ grows monotonically across our four scales: $+1.18\to+1.57\to+2.00\to+2.15$ nats. Section [8](https://arxiv.org/html/2605.24042#S8) of the main paper says “whether the modeling-loss penalty closes at LLM scale is the natural follow-up.” On a fixed-token comparison the answer in our data is that the penalty widens, not closes.

The relevant caveat is that a $164$M-token budget severely undertrains every tier and disproportionately undertrains the $1$B tier. Chinchilla-optimal training for a $1$B-parameter model is on the order of $20$B tokens, about two orders of magnitude beyond our budget. We interpret the $\Delta L$ growth across our sweep as evidence that, at a fixed token budget, the baseline benefits more from extra parameters than SMT does. Whether the penalty closes under data-optimal scaling is a separate question that this experiment does not answer. We highlight this distinction in the discussion (Section [9](https://arxiv.org/html/2605.24042#S9)) rather than letting the fixed-token result stand as evidence on the open question.
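The size of the gap is easy to verify from the quoted training configuration. The sketch below assumes the 20,000-step count from Appendix Y applies to the fixed-token budget, which is consistent with the quoted 164M tokens at batch 32 and sequence length 256.

```python
import math

batch_size = 32
seq_len = 256
steps = 20_000                      # assumed: step count from Appendix Y

tokens = batch_size * seq_len * steps
chinchilla_tokens = 20e9            # order-of-magnitude figure quoted in the text

ratio = chinchilla_tokens / tokens
decades = math.log10(ratio)

print(f"training tokens:    {tokens:,}")
print(f"shortfall factor:   {ratio:.0f}x")
print(f"orders of magnitude: {decades:.2f}")
```

The budget works out to about 164M tokens, roughly 120 times short of the quoted data-optimal figure for the 1B tier.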

### Z.7 Seed variance is large and coupling-driven

The 90M three-seed sweep gives SMT $G_{\mathrm{Mah}}$ values $\{9.94,16.49,32.44\}$ (CoV $0.48$), against baseline $\{1.27,1.27,1.28\}$ (CoV $0.004$). The SMT 30M, 300M, and 1B numbers in Table [33](https://arxiv.org/html/2605.24042#A26.T33) are single-seed point estimates and should be read with the 90M variance in mind. The variance is not in $E_{128}$, which sits at $\approx 0.996$ for all three seeds, but in $q_{B}$, as Corollary [34](https://arxiv.org/html/2605.24042#Thmproposition34) predicts. The architectural Fisher concentration is reproducible, while the depletion of margin mass from the Fisher subspace is initialization-sensitive.
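The quoted coefficients of variation can be recomputed from the three seed values. The sketch below uses the population standard deviation (dividing by $n$), which is the convention that reproduces both quoted figures; that the authors used this convention is an inference from the numbers, not something the text states.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return statistics.pstdev(values) / statistics.mean(values)

smt_g_mah = [9.94, 16.49, 32.44]
baseline_g_mah = [1.27, 1.27, 1.28]

cov_smt = coefficient_of_variation(smt_g_mah)
cov_baseline = coefficient_of_variation(baseline_g_mah)

print(f"SMT CoV:      {cov_smt:.2f}")
print(f"baseline CoV: {cov_baseline:.3f}")
```

With the sample standard deviation (dividing by $n-1$) the SMT figure would come out nearer $0.59$, so the choice of convention matters at three seeds.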

This identifies a concrete open problem: SMT-style architectures benefit from a training procedure that drives $q_{B}$ low reliably across initializations. The Hutchinson-trace penalty in Appendix [Y](https://arxiv.org/html/2605.24042#A25) regularizes the memory branch’s contribution to logits ($\gamma$ in Proposition [33](https://arxiv.org/html/2605.24042#Thmproposition33)), but does not directly target $q_{B}$. A targeted regularizer that minimizes margin mass in the top-$k$ Fisher subspace would address the right variable, and we leave this to follow-up work.

### Z.8 The 90M step sweep: a transient coupling phase

A single-seed SMT sweep over training duration at 90M shows non-monotonic $G_{\mathrm{Mah}}$:

By Corollary [34](https://arxiv.org/html/2605.24042#Thmproposition34) the non-monotonicity in $G_{\mathrm{Mah}}$ is the inverse of a non-monotonicity in $q_{B}$: the run starts decoupled (high $G$, low $q_{B}$), passes through a coupled phase around $20$k steps (low $G$, high $q_{B}$), and decouples again by $80$k steps while continuing to improve LM loss. This is a single-seed observation and we do not make a main-text claim from it, but it points to a non-trivial training dynamic on which the privacy geometry is not a monotone function of utility convergence. A multi-seed step sweep with a curriculum that targets $q_{B}$ alongside $L_{\mathrm{lm}}$ would test whether the early-decoupled regime can be stabilized.

## Appendix AA Training-time regularizer has four objectives and all four fail

A natural question is whether the defensive structure we identify at inference time can be baked in at training time, that is, whether a model can be trained so that perturbations along the margin\-dominant directions are cheap in utility\. We ran four candidate objectives on GPT\-2 Small and found that all four fall into one of three failure modes, each of which is a symmetry of the KL anchor that the regularizer rides along\. The study is a strengthened negative result, where training\-time channel surgery is harder than the original paper framed it, and the reason is structural\.

### AA.1 The four objectives

With $\mathcal{L}_{\mathrm{LM}}$ the autoregressive cross-entropy, $\mu=0.1$ the weight on a KL anchor $\mathrm{KL}(p_{\theta_{0}}\,\|\,p_{\theta})$ to a frozen baseline, and $m_{I},m_{B}$ the soft-min margins in $P_{I}$ and $P_{B}$ across eight distractors per step, the four objectives are:

- Default (hinge): $\mathcal{L}+\lambda\,(m_{I}-\tau)_{+}^{2}$.
- Aniso: $\mathcal{L}+\lambda\,[(m_{I}-\tau)_{+}^{2}-m_{B}^{2}/4]$.
- Ratio: $\mathcal{L}+\lambda\,(m_{I}/m_{B}-\tau_{\mathrm{ratio}})_{+}^{2}$ with $\tau_{\mathrm{ratio}}=1.5$.
- Internal anchor: $\mathcal{L}+\lambda\,[(m_{I}-\tau)_{+}^{2}+10^{-2}\,\|h_{\theta}-h_{\theta_{0}}\|^{2}]$.

Each was swept over $\lambda\in\{0,10^{-4},3\times 10^{-4},10^{-3}\}$ for $10{,}000$ steps on WikiText-103 prefixes at batch size $1$, seed $42$, on an AWS g5.2xlarge.
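The four regularizer terms can be sketched in plain Python over scalar margins. This is a minimal illustration, assuming the margins $m_I$, $m_B$ and the anchor distance have already been computed for the step. The function names, the `tau` argument (whose value the text does not give), and the `anchor_sq` argument are hypothetical and are not taken from the released code.

```python
def hinge_sq(x):
    """Squared hinge (x)_+^2."""
    return max(x, 0.0) ** 2


def regularizer(variant, m_I, m_B, lam, tau, tau_ratio=1.5, anchor_sq=0.0):
    """Regularizer term added to the base loss for one step.

    tau is a hypothetical threshold (its value is not stated in the text);
    anchor_sq stands for ||h_theta - h_theta0||^2.
    """
    if variant == "default":
        term = hinge_sq(m_I - tau)
    elif variant == "aniso":
        # The -m_B^2 / 4 reward favors larger margins along P_B.
        term = hinge_sq(m_I - tau) - m_B ** 2 / 4.0
    elif variant == "ratio":
        term = hinge_sq(m_I / m_B - tau_ratio)
    elif variant == "anchor":
        term = hinge_sq(m_I - tau) + 1e-2 * anchor_sq
    else:
        raise ValueError(f"unknown variant: {variant}")
    return lam * term


# The lambda sweep described in the text.
LAMBDAS = [0.0, 1e-4, 3e-4, 1e-3]

# Ratio penalty at a margin ratio of 2.0 for each lambda in the sweep.
for lam in LAMBDAS:
    print(lam, regularizer("ratio", m_I=2.0, m_B=1.0, lam=lam, tau=0.0))
```

Because every variant is multiplied by the same `lam`, the internal-anchor weight is effectively `lam * 1e-2`, which is the scaling discussed in AA.3.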

### AA.2 Results

Table [34](https://arxiv.org/html/2605.24042#A27.T34) and Figure [24](https://arxiv.org/html/2605.24042#A27.F24) report the sweep.

Table 34: Four training objectives across the $\lambda$ sweep on GPT-2 Small. Bold: the only row where the absolute margin $m_{\mathrm{full}}$ differs from baseline by more than a factor of two, and it goes up (from $146.6$ to $407.3$) rather than down.

![Refer to caption](https://arxiv.org/html/2605.24042v1/x24.png)

Figure 24: Training-time objectives on GPT-2 Small. (A) Trajectory in margin-fraction space. Default, ratio, and anchor cluster at or near baseline $(0.435, 0.873)$; aniso moves to $(0.963, 0.274)$ at $\lambda=10^{-3}$, but (see panel B and text) this is inflation in absolute margin, not redistribution. (B) $\Delta(m_B/m_{\mathrm{full}})$ vs. perplexity. The dotted zero line marks no channel surgery. Aniso’s excursion is accompanied by a $3.5\times$ PPL increase and by a $2.8\times$ inflation in $m_{\mathrm{full}}$.

### AA.3 Why all four fail: the KL-anchor invariance group

The KL anchor $\mathrm{KL}(p_{\theta_0}\,\|\,p_\theta)$ fixes the model’s output distribution on every training prefix but leaves internal hidden states unconstrained up to a continuous family of transformations. If $\theta$ produces hidden state $h(x)$ and unembedding $W_u$ with logits $z=W_u\cdot\mathrm{norm}(h)$, then for any invertible linear map $A$ commuting with the final layer norm, the transformation $h\mapsto Ah$, $W_u\mapsto W_uA^{-1}$ leaves $z$ unchanged, so the KL anchor is blind to it. Uniform rescaling $A=\alpha I$ is the simplest such symmetry, and rescaling restricted to the Fisher subspace ($A=\mathrm{diag}(\alpha,\ldots,\alpha,1,\ldots,1)$ in the $P_B$-aligned basis) is another. These invariances let the regularizer achieve its stated objective by moving along the invariance manifold rather than cutting across it.
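The invariance can be checked on a toy example. The sketch below is illustrative only: it drops the final layer norm (treating it as the identity, which is a simplifying assumption), uses a hypothetical $2\times 3$ unembedding, and inflates a single coordinate by a hypothetical factor `alpha_B = 4.0`. Logits are unchanged, while the distance between two hidden states grows.

```python
def matvec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Matrix-matrix product for nested lists."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Toy values (hypothetical): hidden width 3, vocabulary size 2.
W_u = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]
h1 = [0.5, -1.0, 4.0]
h2 = [1.5, -1.0, 4.0]

# A = diag(alpha_B, 1, 1): inflate only the first coordinate.
alpha_B = 4.0
A = [[alpha_B, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A_inv = [[1.0 / alpha_B, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Logits before and after h -> A h, W_u -> W_u A^{-1}.
z_before = matvec(W_u, h1)
z_after = matvec(matmul(W_u, A_inv), matvec(A, h1))

# Inter-prefix distance before and after the same map.
dist_before = sq_dist(h1, h2)
dist_after = sq_dist(matvec(A, h1), matvec(A, h2))

print("logits:", z_before, z_after)
print("squared distance:", dist_before, dist_after)
```

Any loss that only sees the logits, such as the KL anchor, assigns the same value to both parameterizations even though the hidden-state geometry differs.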

Default (uniform rescaling). All three margins shrink by a common factor $\alpha$ (e.g., $\alpha=0.83$ at $\lambda=10^{-3}$), and the two fractions $m_B/m_{\mathrm{full}}$ and $m_I/m_{\mathrm{full}}$ are preserved to within $2\%$. The model is exercising $A=\alpha I$. The net effect for a defender is that nothing has been rearranged, only scaled, and an attacker who whitens the activation distribution before matching is unaffected. This reproduces the original paper’s negative result.
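Why whitening neutralizes uniform rescaling can be seen in a few lines. The sketch below uses per-coordinate standardization as a diagonal stand-in for full whitening (an assumption made to keep the example dependency-free) and a small hypothetical activation bank; the scale factor $0.83$ is the one quoted above.

```python
import statistics


def standardize(bank):
    """Z-score each coordinate over the bank (diagonal stand-in for whitening)."""
    cols = list(zip(*bank))
    mu = [statistics.fmean(c) for c in cols]
    sd = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mu, sd)] for row in bank]


# Hypothetical bank of four 2-d hidden states.
bank = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0], [2.0, 2.0]]

# The model applies A = alpha * I to every hidden state.
alpha = 0.83
scaled = [[alpha * x for x in row] for row in bank]

z_clean = standardize(bank)
z_scaled = standardize(scaled)

# Largest coordinate-wise difference after standardization.
max_gap = max(
    abs(a - b) for ra, rb in zip(z_clean, z_scaled) for a, b in zip(ra, rb)
)
print(max_gap)
```

The standardized banks agree up to floating-point error, so any matching rule applied after standardization returns the same neighbors with or without the rescaling.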

Aniso (Fisher-subspace inflation). The $-m_B^2/4$ reward directly encourages larger inter-prefix distance along $P_B$. The model satisfies this by applying $A=\mathrm{diag}(\alpha_B,\ldots,\alpha_B,1,\ldots,1)$ with $\alpha_B>1$, compensating in $W_u$, and the KL anchor is blind because $A$ commutes with the final layer norm on the $P_B$-restricted subspace. At $\lambda=3\times 10^{-4}$ we see mild inflation ($\alpha_B\approx 1.1$, $m_B$ up from $63.7$ to $69.9$); at $\lambda=10^{-3}$ the inflation is extreme ($\alpha_B\approx 6$, $m_B=392$). This is not channel surgery: $m_{\mathrm{full}}$ grows from $147$ to $407$, so the prefixes become more distinguishable to a full-space $\ell_2$ attacker, not less. The striking $m_B/m_{\mathrm{full}}=0.96$ ratio reflects the inflation, not a redistribution of prompt-distinguishing content.
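As a consistency check on the reported margins (using only the values quoted in the text and in the Table 34 caption, with $\alpha_B$ read as the ratio of $m_B$ after and before training), the arithmetic works out as:

$$
\alpha_B \approx \frac{392}{63.7} \approx 6.2,
\qquad
\frac{m_{\mathrm{full}}^{\mathrm{aniso}}}{m_{\mathrm{full}}^{\mathrm{base}}} = \frac{407.3}{146.6} \approx 2.78,
\qquad
\frac{m_B}{m_{\mathrm{full}}}\bigg|_{\mathrm{aniso}} = \frac{392}{407.3} \approx 0.96,
\qquad
\frac{m_B}{m_{\mathrm{full}}}\bigg|_{\mathrm{base}} = \frac{63.7}{146.6} \approx 0.43 .
$$

The fraction rises from about $0.43$ to about $0.96$ only because the numerator grew roughly sixfold while the denominator also grew, which is the inflation reading rather than a transfer of margin from $P_I$ to $P_B$.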

Ratio (invariance by construction). The penalty $(m_I/m_B-\tau_{\mathrm{ratio}})_+^2$ is by construction invariant under any symmetry that preserves the ratio. Uniform rescaling preserves it exactly; the $m_B$-inflation symmetry does not, but it changes the ratio only slowly at small $\alpha_B$. At $\lambda\leq 10^{-3}$ the effective push on the ratio is at most $10^{-3}\cdot 2\cdot(2.0-1.5)=10^{-3}$ per step, orders of magnitude below the LM gradient, so the ratio never escapes the baseline value of $2.0$. Choosing $\tau_{\mathrm{ratio}}$ much smaller than the baseline ratio and $\lambda$ much larger would engage the penalty but likely push the model into the aniso-style inflation regime.
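The size of that push follows from differentiating the penalty with respect to the ratio $r=m_I/m_B$. Reading "effective push" as this derivative is our interpretation; the numbers are the baseline ratio, threshold, and largest $\lambda$ quoted in the text:

$$
\frac{\partial}{\partial r}\,\lambda\,(r-\tau_{\mathrm{ratio}})_+^2
= 2\lambda\,(r-\tau_{\mathrm{ratio}})_+,
\qquad
2\cdot 10^{-3}\cdot(2.0-1.5) = 10^{-3}
\quad\text{at } r=2.0,\ \tau_{\mathrm{ratio}}=1.5,\ \lambda=10^{-3}.
$$

Smaller values of $\lambda$ in the sweep scale this down proportionally.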

Internal anchor (weight too small to engage). We intended the anchor term $\|h_\theta-h_{\theta_0}\|^2$ to break the rescaling invariance by pinning $h$ to the baseline. In our implementation the effective anchor weight is $\lambda\cdot 10^{-2}$, which at $\lambda=10^{-3}$ gives $10^{-5}$. Against an LM loss of $\approx 3$, the anchor contributes $\approx 0.01$ and is never the binding constraint. The observed behavior is pure uniform rescaling, $\alpha=0.44$ at the largest $\lambda$, matching the default variant’s failure mode. A properly weighted anchor (weight $\geq 10^{-1}$ against LM loss) might succeed, and we leave that to future work.

### AA.4 Why inference-time defense doesn’t have this problem

The generalized-eigen mechanism in Section [4](https://arxiv.org/html/2605.24042#S4) operates at inference time on a *fixed* model. The model’s internal hidden-state distribution and the Fisher $F$ are whatever training produced, the defender’s only freedom is the choice of $\Sigma_\xi$, and the optimization in equation ([2](https://arxiv.org/html/2605.24042#S4.E2)) is against a fixed objective with no gauge invariance. The training-time problem has an extra continuous family of equivalent solutions (every element of the KL-anchor invariance group), and gradient descent along any differentiable regularizer tends to move along this family rather than across it. Breaking the invariance requires either (i) an objective that is not gauge-invariant, for which our internal-anchor attempt is the natural candidate but at higher weight than we used, or (ii) an auxiliary constraint on the hidden-state distribution itself (for instance, fixing the second moment $\mathbb{E}[hh^\top]$ or the Fisher trace $\mathrm{tr}(F)$). Both are more invasive than a simple regularizer and likely incur larger PPL cost than the inference-time defense.

### AA.5 What this does not say

The study does not rule out training-time defenses entirely. It only shows that four natural regularizers applied to GPT-2 Small fail, and it identifies why (the KL-anchor invariance group). Three positive directions remain for future work, namely a properly weighted internal-representation anchor, an explicit gauge-fixing constraint on $\mathrm{tr}(F)$ or $\mathbb{E}[hh^\top]$, and a regularizer that treats the Fisher eigenspectrum directly rather than the margin magnitudes. We did not run these.

The study also does not generalize automatically to modern high-$E_k$ architectures. GPT-2 Small has $E_{128}\approx 0.56$, so the Fisher subspace $P_B$ captures only about half the gradient energy. On Mistral-7B with $E_{128}\approx 0.99$, the Fisher subspace is tightly concentrated and training-time modifications may interact with the invariance group differently. We did not test this because each variant takes $\approx 40$ minutes per lambda on GPT-2 Small and the 7–14B models would be $15$–$30\times$ slower.

## Appendix AB Robustness and sensitivity

This appendix collects the robustness checks that support the main-body claims: cross-scale margin decomposition, attention-head ablation, second-order KL scaling, multi-dataset subspace stability, and input-dependent alignment.

### AB.1 Cross-scale margin decomposition

The cross-scale result that $m_B/m_{\mathrm{full}}\approx\sqrt{k/d}$ holds with $R^2=0.93$ across our ten models (Section [F](https://arxiv.org/html/2605.24042#A6)). The per-prefix cross-sectional decomposition on Qwen3-14B confirms that prefixes with large $P_B$-projection KL are precisely those with small $P_I$-projection KL, with Spearman rank correlation $\rho=-0.71$ (Figure [11](https://arxiv.org/html/2605.24042#A8.F11)(A)). This per-prefix anti-correlation is a signature of the asymmetry being a model-wide property, not a spurious average.
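To give a sense of scale for the $\sqrt{k/d}$ law, the sketch below evaluates it for GPT-2 Small. It assumes $k=128$ (the subspace size used for $E_{128}$ elsewhere in the appendices; that this is the $k$ entering the law is our assumption) and uses the model's hidden width $d=768$. The function name is illustrative.

```python
import math


def predicted_margin_fraction(k, d):
    """Scaling-law prediction m_B / m_full ~ sqrt(k / d)."""
    return math.sqrt(k / d)


# GPT-2 Small: hidden width d = 768; k = 128 is an assumed subspace size.
gpt2_small_pred = predicted_margin_fraction(128, 768)

# Baseline fraction reported for GPT-2 Small in the Figure 24 caption.
gpt2_small_obs = 0.435

print(round(gpt2_small_pred, 3), gpt2_small_obs)
```

Under these assumptions the prediction is about $0.41$, close to the baseline fraction $0.435$ reported for GPT-2 Small in Appendix AA.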

### AB.2 Attention-head ablation

A separate check is whether attention heads are individually load-bearing for predictive computation. Table [35](https://arxiv.org/html/2605.24042#A28.T35) reports per-layer maximum and median attention-head ablation KL across five layers of GPT-2 Small.

Table 35: Single attention-head ablation KL on GPT-2 Small. The maximum per-head KL across all 60 heads is $0.005$ at layer 11, roughly $24\times$ smaller than the per-MLP KL at the matched layer. Attention heads are individually much less load-bearing than MLP blocks, consistent with the Fisher spectrum being concentrated on MLP output directions (Figure [21](https://arxiv.org/html/2605.24042#A23.F21)).

### AB.3 Second-order KL scaling

Proposition [1](https://arxiv.org/html/2605.24042#S3.E1) predicts KL $\propto\sigma^2$ in the small-noise regime. Table [36](https://arxiv.org/html/2605.24042#A28.T36) reports the empirical scaling. The log-log slope is $1.64$ with $R^2=0.95$, indicating that the quadratic approximation holds only at small $\sigma$ and higher-order corrections become nontrivial at $\sigma\geq 1$, where KL exceeds $0.5$ nats.

Table 36: KL divergence under isotropic Gaussian noise at GPT-2 Small layer 6. Log-log slope $=1.64$ ($R^2=0.95$), so the leading $\sigma^2$ growth holds at small $\sigma$ but transitions to slower growth above $\sigma\approx 1$ as the model’s output distribution saturates toward uniform.
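A log-log slope and $R^2$ of this kind can be obtained with an ordinary least-squares fit of $\log\mathrm{KL}$ against $\log\sigma$. The sketch below shows one way to do it; the function name is illustrative, and the data are synthetic (an exact $\sigma^2$ law), not the measurements in Table 36.

```python
import math


def loglog_fit(sigmas, kls):
    """Least-squares slope and R^2 of log(KL) against log(sigma)."""
    xs = [math.log(s) for s in sigmas]
    ys = [math.log(k) for k in kls]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    return slope, r2


# Synthetic data following an exact quadratic law, KL = 0.5 * sigma^2.
sigmas = [0.1, 0.2, 0.5, 1.0, 2.0]
kls = [0.5 * s ** 2 for s in sigmas]

slope, r2 = loglog_fit(sigmas, kls)
print(slope, r2)
```

An exact quadratic law gives slope $2$; a fitted slope below $2$, such as the reported $1.64$, indicates that the large-$\sigma$ points grow more slowly than $\sigma^2$.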
### AB.4 Multi-dataset subspace stability

The Fisher subspace’s stability across datasets is tested by computing $\Sigma_g$ independently on $1{,}000$ WikiText prefixes and $300$ code prefixes, then measuring the principal angles between the top-$128$ eigenvectors of the two covariance matrices. The mean principal angle is $56.4^\circ$ with range $[4.9^\circ, 89.8^\circ]$ across the $128$ pairs. The distribution of angles is approximately uniform on $[0,\pi/2]$, consistent with inter-prefix directions being near-isotropic on both datasets and the Fisher subspace being more data-dependent than the scaling law suggests.

### AB.5 Input-dependent alignment

An independent stability check: the alignment $\cos^2\theta$ between the per-prefix gradient and the population $P_B$ has mean $0.60$ with std $0.09$ across $99$ held-out prefixes (range $[0.45, 0.88]$). No prefix sits at either extreme, so the population $P_B$ captures a substantial but not dominant fraction of each prefix’s own gradient-sensitive directions. The $40\%$ residual is what drives per-prefix heterogeneity in the KL asymmetry without compromising the population-level direction flip.
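One common way to define such an alignment is the fraction of a gradient's squared norm that falls inside the subspace, $\cos^2\theta=\|U^\top g\|^2/\|g\|^2$ for an orthonormal basis $U$ of $P_B$. The sketch below implements that definition; whether it matches the paper's exact estimator is an assumption, and the vectors are toy values.

```python
def cos2_alignment(g, basis):
    """Fraction of ||g||^2 captured by the span of an orthonormal basis.

    basis is a list of orthonormal vectors (the columns of U).
    """
    proj_sq = sum(sum(u_i * g_i for u_i, g_i in zip(u, g)) ** 2 for u in basis)
    norm_sq = sum(x * x for x in g)
    return proj_sq / norm_sq


# Toy gradient in 3 dimensions and a 1-d subspace along the first axis.
g = [3.0, 4.0, 0.0]
U_1d = [[1.0, 0.0, 0.0]]
U_2d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(cos2_alignment(g, U_1d))
print(cos2_alignment(g, U_2d))
```

With this definition, $1-\cos^2\theta$ is the share of the gradient's energy outside the population subspace, which corresponds to the $40\%$ residual at the reported mean of $0.60$.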

## Appendix AC Compute Resources

All experiments were conducted on single-GPU machines. Hardware used: NVIDIA H100 80 GB HBM3 on RunPod Secure Cloud (eight pods over the project lifetime, $\sim 112$ productive hours plus $\sim 190$ hours of idle, restart, and failed-launch billing), NVIDIA A10G 24 GB on AWS EC2 g5.2xlarge (seven instances, total $\sim 120$ hours, mostly bootstrap and Pass 1), and Apple M-series laptops for smoke tests and figure rendering (see footnote 3). The AWS bootstrap line in Table [37](https://arxiv.org/html/2605.24042#A29.T37) is overhead from GPU-quota constraints and redundant environments, not productive compute. An additional $\sim 15$ GPU-hours of debugging are not reflected in Table [37](https://arxiv.org/html/2605.24042#A29.T37).

Footnote 3: The Mistral learned-inverter run used $50$k pairs and $40$k steps versus the plan’s $200$k/$100$k, so its outputs are a conservative upper bound on learned-inversion success (Appendix [V](https://arxiv.org/html/2605.24042#A22)).

Table 37: Compute resources by experimental component. H100 cost is \$3.03/hr on RunPod Secure Cloud (\$2.99/hr in AP-IN-1, used for the 3-GPU SMT scaling sweep at \$8.97/hr aggregate); A10G cost is \$1.21/hr on AWS on-demand. Productive-compute totals exclude idle pod time, network-volume storage, and failed/debugging launches that were not directly tied to a reported result; with these the realized billed cost over the project lifetime was approximately \$1,220.

| Component | GPU | Time (approx.) | Cost (approx.) |
|---|---|---|---|
| Pass 1 Mahalanobis defense sweep (Mistral-7B, $5$k bank) | A10G 24 GB | 1.5 h | \$2 |
| Pass 2+5 multi-model defense sweep: GPT-2 Small, 12 layers | H100 80 GB | 3 h | \$9 |
| Mistral-7B, 8 layers | H100 80 GB | 10 h | \$30 |
| Phi-2, 4 layers | H100 80 GB | 2 h | \$6 |
| Qwen3-14B, 4 layers | H100 80 GB | 8 h | \$24 |
| DeepSeek-R1-Distill-Qwen-14B, 4 layers | H100 80 GB | 8 h | \$24 |
| Pass 3 learned inverter: GPT-2 Small, 4 corruption modes ($500$k pairs, $100$k steps) | H100 80 GB | 24 h | \$73 |
| Mistral-7B, 2 corruption modes ($50$k pairs, $40$k steps, AWS) | A10G 24 GB | 8 h | \$10 |
| initial AWS smoke tests (7 instances, killed or short) | A10G 24 GB | 100 h | \$121 |
| Pass 4 extended: RDP on 5 models | H100 80 GB | 1 h | \$3 |
| matched-$\varepsilon$ calibration sweep on 5 models | H100 80 GB | 1 h | \$3 |
| DP-SGD fine-tune GPT-2 at $\varepsilon\in\{2,4,8\}$ | H100 80 GB | 0.5 h | \$2 |
| SDP worst-case covariance on GPT-2 (cvxpy+SCS) | H100 80 GB | 0.3 h | \$1 |
| Pass 5 isotropy check on 10 scaling-paper models | H100 80 GB | 1 h | \$3 |
| $\Sigma_{\mathrm{diag}}$ minimax $\alpha$-sweep on 32 model-layers $\times\,7\alpha\times 4\kappa\times 3$ seeds, with $20{,}000$-pair adjacency bank (4 categories) per layer | H100 80 GB | 13 h | \$39 |
| PQR training: GPT-2 Small full sweep ($r\times\beta\times\gamma\times\sigma_{\mathrm{rel}}=44$ cells, $50{,}000$ steps each) | H100 80 GB | 24 h | \$73 |
| PQR batch evaluation across 44 cells (two protocols: last-token + full-seq) | H100 80 GB | 1.5 h | \$5 |
| Sequence inverter on GPT-2 Small (3 mechanisms $\times\,50{,}000$ steps, $50$k WikiText pairs) | H100 80 GB | 4 h | \$12 |
| SMT-vs-baseline 4-architecture training ($20{,}000$ steps each, WikiText) | H100 80 GB | 4.5 h | \$14 |
| SMT $G_{\mathrm{Mah}}$ measurement at probe layers $\{4,6,8\}$ for 4 architectures | H100 80 GB | 1 h | \$3 |
| SMT scaling sweep (4 tiers, 2 arms, 3 seeds + 2 steps at 90M; 16 runs) | 3$\times$H100 | $\sim$57 GPU-h | \$170 |
| parallel $G_{\mathrm{Mah}}$ measurement on 16 checkpoints | 3$\times$H100 | 0.3 GPU-h | \$3 |
| Split-half Fisher validation on Mistral-7B / Qwen3-14B / DeepSeek-R1-14B ($n_{\mathrm{cal}}=400$, mid layer, Appendix [A](https://arxiv.org/html/2605.24042#A1)) | H100 80 GB | 1.5 h | \$5 |
| Pod-restart, debug, and pull overhead across 10 H100 pods | H100 80 GB | 5.5 h | \$17 |
| Idle pod time, network-volume storage, and failed launches not in line items above | mixed | $\sim$190 h-eq | $\sim$\$580 |
| Local smoke tests and figure rendering | MPS (local) | 2 h | \$0 |
| Total productive (line items above, excluding idle/storage row) | | $\sim$292 GPU-hours | $\sim$\$667 |
| Total realized billing (productive + idle/storage) | | | $\sim$\$1,225 |
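Most single-GPU line items in Table 37 can be cross-checked as hours times the hourly rate quoted in the caption, rounded to the nearest dollar. The sketch below does this for a few rows; the rounding rule and the helper name are assumptions, and the multi-GPU and overhead rows are not covered.

```python
# Hourly rates quoted in the Table 37 caption (USD per GPU-hour).
RATES = {"H100": 3.03, "A10G": 1.21}


def line_cost(gpu, hours):
    """Approximate cost of a line item, rounded to whole dollars."""
    return round(RATES[gpu] * hours)


# (GPU, hours, cost listed in the table) for a few single-GPU rows.
rows = [
    ("A10G", 1.5, 2),    # Pass 1 Mahalanobis defense sweep
    ("H100", 10, 30),    # Mistral-7B, 8 layers
    ("H100", 24, 73),    # Pass 3 learned inverter, GPT-2 Small
    ("A10G", 100, 121),  # initial AWS smoke tests
    ("H100", 13, 39),    # Sigma_diag minimax alpha-sweep
]

for gpu, hours, listed in rows:
    print(gpu, hours, line_cost(gpu, hours), listed)
```

For these rows the recomputed values agree with the listed costs.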
## Appendix AD Code and Data Availability

Code and all small-artifact JSONs are released at [https://github.com/okezue/tcc-research](https://github.com/okezue/tcc-research). The repository pins fixed seeds and bundles the exact calibration prefixes used for every KL, margin, and defense measurement. A small supplementary zip uploaded with this submission contains the same JSON measurement and training-log files for every checkpoint reported in the paper. Trained model checkpoints and other large artifacts ($\approx 22$ GB total) are deposited at the Zenodo record [https://doi.org/10.5281/zenodo.19992762](https://doi.org/10.5281/zenodo.19992762), with the same layout as the JSON archive, so a single `unzip -d artifacts/` on each archive recreates the local working tree.

##### Released artifacts\.

Subspace eigenvectors, per-layer Mahalanobis sweeps, learned-inverter training logs and evaluation matrices, RDP and matched-$\varepsilon$ calibration JSONs, DP-SGD fine-tune outputs, SDP solver results, isotropy checks, $\alpha$-sweep validation outputs, predictive-quotient training logs and eval summaries, sequence-inverter checkpoints, SMT training logs and probe-layer measurements, and split-half Fisher gradient matrices for the three 7–14B models. Raw per-experiment JSONs live under `artifacts/mahalanobis/`, `artifacts/learned_inverter/`, `artifacts/rdp/`, `artifacts/matched_eps/`, `artifacts/isotropy/`, `artifacts/dp_sgd/`, `artifacts/sdp/`, `artifacts/layer_sweep/`, `artifacts/sigma_diag_validate/`, `artifacts/quotient_release/`, `artifacts/sequence_inverter/`, `artifacts/smt/`, `artifacts/smt_scaling/`, and `artifacts/split_half_fisher/`.

##### Trained model weights \(Zenodo deposit\)\.

The four 12-layer architectures from Section [8](https://arxiv.org/html/2605.24042#S8) (baseline GPT, SMT main, SMT no-Jac, SMT $r{=}64$) are released as PyTorch `.pt` checkpoints under `artifacts/smt/` ($\approx$2.9 GB), including the optimizer state at the final step. The 16 SMT scaling-sweep checkpoints from Appendix [Z](https://arxiv.org/html/2605.24042#A26) (4 tiers $\times$ 2 arms plus three seed and two step variants at 90M) are under `artifacts/smt_scaling/` ($\approx$17 GB). The 57M-parameter sequence-inverter checkpoints (clean, isotropic-trained, $\Sigma_{\mathrm{diag}}$-trained variants from Section [7](https://arxiv.org/html/2605.24042#S7)) are under `artifacts/sequence_inverter/`. The single-vector inverter checkpoints (clean, iso-trained, gen-eigen-trained, $\Sigma^{\star}_{\mathrm{Mah}}$-trained) at both 100k and 500k training pairs are under `artifacts/learned_inverter/`. The predictive-quotient-release encoders and decoders for the 44-cell $r\times\beta\times\gamma\times\sigma_{\mathrm{rel}}$ sweep are under `artifacts/quotient_release/` ($\approx$161 MB).

##### Frozen base models and Fisher matrices\.

We do not redistribute the pretrained base models (GPT-2, Mistral-7B, Phi-2, Qwen3-14B, DeepSeek-R1-Distill-Qwen-14B, OLMoE-1B-7B, TinyLlama-1.1B, Qwen2.5-3B); the repository’s `download_models.sh` pulls them from Hugging Face under their respective licenses. The empirical Fisher diagonals $F_{ii}$ and top-$k$ eigenvectors $U_B$ for every (model, layer) pair in our 32-point sweep are released as `.npy` arrays under `artifacts/fisher/`, so the diagonal-Fisher release $\Sigma_{\mathrm{diag}}$ can be reconstructed without recomputing $\Sigma_g$. The RunPod stdout logs are preserved under `artifacts/runpod_logs/`.
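Reconstructing the release from a Fisher diagonal is a one-line formula, $\Sigma^\star_{\mathrm{diag}}(\mathcal{K})=(2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$, as stated in the abstract. The sketch below applies it to a plain Python list and checks the budget under the assumption that the first-order KL is $\tfrac{1}{2}\mathrm{tr}(F\Sigma)$ restricted to the diagonal. Function names and the toy Fisher values are hypothetical, and loading the released `.npy` arrays is left out to keep the example dependency-free.

```python
import random


def sigma_diag(fisher_diag, kl_budget):
    """Per-coordinate noise variances (2K / d) / F_ii."""
    d = len(fisher_diag)
    return [2.0 * kl_budget / (d * f) for f in fisher_diag]


def first_order_kl(fisher_diag, variances):
    """Assumed first-order KL: 0.5 * sum_i F_ii * Sigma_ii."""
    return 0.5 * sum(f * v for f, v in zip(fisher_diag, variances))


def sample_release(h, variances, rng):
    """Release h + xi with xi ~ N(0, diag(variances))."""
    return [x + rng.gauss(0.0, v ** 0.5) for x, v in zip(h, variances)]


# Toy Fisher diagonal (hypothetical values) and KL budget.
F_diag = [0.5, 2.0, 4.0, 0.1]
K = 0.3

variances = sigma_diag(F_diag, K)
kl = first_order_kl(F_diag, variances)

rng = random.Random(42)
released = sample_release([1.0, -1.0, 0.5, 2.0], variances, rng)
print(variances, kl, released)
```

Each coordinate contributes $\mathcal{K}/d$ to the assumed first-order KL, so the total equals the budget regardless of the Fisher values, with low-Fisher coordinates receiving proportionally more noise.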

##### Reproduction\.

Re-running the full experimental suite (excluding bootstrap and failed launches) takes approximately $290$ GPU-hours on the hardware itemized in Appendix [AC](https://arxiv.org/html/2605.24042#A29); per-component reproduction scripts are in `scripts/reproduce/` and the SMT scaling sweep launches via `launch_smt_scaling.sh`.