From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

arXiv cs.LG 06/18/26, 04:00 AM Papers
Summary
This paper proposes a post-hoc certification framework for sparse autoencoder (SAE) based interpretability, deriving an upper bound on the frozen language model's risk using measurable quantities. The framework is validated on GPT-2 Small, Gemma-2B, and Llama-3-8B, showing non-vacuous bounds and revealing depth-dependent behavior.
arXiv:2606.18383v1 Announce Type: new
# From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability
Source: [https://arxiv.org/html/2606.18383](https://arxiv.org/html/2606.18383)
###### Abstract

Sparse autoencoders \(SAEs\) are increasingly used to extract interpretable features from language models \(LMs\), yet a central question remains: when can an SAE\-based explanation be treated as a faithful view of an underlying frozen LM? We study this through a post\-hoc generalization framework that certifies the LM via a sparse proxy, obtained by replacing a native hidden activation with its pretrained SAE reconstruction\. Our framework derives an upper bound on the base model’s expected risk using four measurable quantities: proxy risk, SAE reconstruction gap, concept\-pool mismatch, and sparse complexity\. We interpret this certificate as an operational criterion for explanatory faithfulness\. In particular, a non\-vacuous bound indicates that the extracted sparse features retain meaningful predictive information, while small reconstruction and mismatch errors indicate that the proxy remains behaviorally close to the original model\. Empirically, we show that the bound becomes non\-vacuous on GPT\-2 Small, Gemma\-2B, and Llama\-3\-8B at practical sample sizes\. A detailed layerwise analysis of Llama\-3\-8B reveals a strong depth dependence, with later layers becoming much easier to certify, associated with both stronger local fidelity and weaker downstream error amplification\. Finally, through feature\-shuffling ablations, we show that the decomposition distinguishes genuine semantic alignment from mere statistical sparsity, providing a useful diagnostic for when SAE\-based explanations become less reliable\.

Machine Learning, ICML

## 1 Introduction

Sparse autoencoders (SAEs) (Cunningham et al., [2023](https://arxiv.org/html/2606.18383#bib.bib5)) are increasingly used to interpret frozen language models (LMs) through sparse, human-inspectable features, raising a foundational question: *when should an SAE-based explanation be trusted as a faithful lens on the underlying LM?* We study this question through a post-hoc certification framework that asks whether the sparse structure induced by a pretrained SAE can yield a *non-vacuous generalization bound for the frozen LM itself*, rather than serving only as an informal interpretive tool. Concretely, we replace the native hidden activation of a frozen LM at a chosen layer with its SAE reconstruction while leaving all downstream layers unchanged, thereby obtaining an SAE-induced *sparse proxy*. We show that the true risk of the original LM can be upper-bounded by four measurable proxy-induced quantities: (i) the empirical risk of the sparse proxy, (ii) the SAE reconstruction-induced approximation error, (iii) the probability that a chosen active concept pool fails to cover the features required for prediction, and (iv) a sparse complexity term governed by the active feature-pool size rather than the full LM parameter count. This decomposition makes the trust criterion explicit: the SAE-induced proxy can be treated as a reliable interpretive lens only when the resulting bound is non-vacuous and the reconstruction and pool-mismatch terms remain small, indicating that the proxy is both informative about the frozen model and behaviorally close to it.

This leads to the main viewpoint of the paper. Although the certified object is the frozen LM, the certificate is valuable precisely because it is expressed through an SAE-induced sparse proxy. We therefore use *faithfulness* in a strict operational sense: the sparse proxy must be informative enough to certify that the frozen model is non-trivial relative to an uninformed baseline, while remaining behaviorally close to the original network’s outputs. Under this view, *trust* is the conjunction of usefulness and behavioral faithfulness. While this does not claim that the active SAE features constitute a complete semantic or causal explanation of the model, it does establish that the sparse proxy is sufficiently informative and low-distortion to support reliable interpretation. Appendix [H](https://arxiv.org/html/2606.18383#A8) makes this operational by linking the non-vacuousness of the certificate to the former, and the reconstruction and pool-mismatch terms to the latter.

Our experiments support this operational perspective in practice. Across GPT-2 Small, Gemma-2B, and Llama-3-8B, the resulting certificates become non-vacuous at practical sample sizes. We then present a layerwise case study on Llama-3-8B, where certifiability varies sharply with patch location: later layers are substantially easier to certify than early and middle ones. To understand this effect, we separate local SAE reconstruction quality from downstream error propagation and find that later-layer proxies exhibit both stronger local alignment and weaker error amplification. Qualitatively, tighter late-layer certificates are accompanied by SAE features whose logit-lens verbalizations are more contextually aligned with the model’s next-token behavior. Complementary GPT-2 Small results in Appendix [E](https://arxiv.org/html/2606.18383#A5) show much weaker layer sensitivity, indicating that the strength of this depth effect is model-specific rather than universal. We therefore use Llama-3-8B as a diagnostically informative case study of when and why patch location matters.

Taken together, the paper contributes both a principled post\-hoc trust criterion for SAE\-based explanation of frozen language models and an empirical analysis showing when that criterion is most and least informative in practice\.

#### Contributions\.

Our main contributions are as follows: \(i\) We introduce a post\-hoc certification framework for frozen LMs in which a pretrained SAE defines a sparse proxy at a chosen hidden layer, and we derive a risk bound for the frozen model that decomposes into four measurable terms: proxy risk, reconstruction gap, concept\-pool mismatch, and sparse complexity\.

\(ii\) We show that the certificate becomes non\-vacuous on GPT\-2 Small, Gemma\-2B, and Llama\-3\-8B at practical sample sizes\.

(iii) We perform a layerwise and horizon-conditioned case study on Llama-3-8B to analyze a setting where patch location materially affects certification. This analysis shows that later layers are easier to certify and are associated with both stronger local fidelity and weaker downstream error amplification; complementary GPT-2 results in Appendix [E](https://arxiv.org/html/2606.18383#A5) show that such depth dependence is not universal across model–SAE pairs. We make the code available at [https://github.com/newcodevelop/SAE-Faithfulness](https://github.com/newcodevelop/SAE-Faithfulness).

## 2 Related Work

### 2.1 Sparse autoencoders and interpretable features

Sparse autoencoders (SAEs) are now a standard tool for probing hidden representations in large language models. They are motivated by the observations of superposition and polysemanticity: a model may represent many more features than there are neurons, and individual neurons may mix unrelated concepts (Elhage et al., [2022](https://arxiv.org/html/2606.18383#bib.bib7)). SAEs address this by learning an over-complete but sparse feature basis, often yielding features that are easier to interpret than native activations (Bricken et al., [2023](https://arxiv.org/html/2606.18383#bib.bib4); Cunningham et al., [2023](https://arxiv.org/html/2606.18383#bib.bib5)). We use this machinery differently from most prior work. Rather than using SAEs primarily for qualitative analysis, we treat them as a device for defining a finite proxy class and a representation-level complexity measure.

### 2.2 Generalization bounds for large language models

The empirical success of heavily over-parameterized models has exposed the limitations of classical uniform-convergence intuition (Zhang et al., [2017](https://arxiv.org/html/2606.18383#bib.bib18); Nagarajan & Kolter, [2021](https://arxiv.org/html/2606.18383#bib.bib11)). Recent work has developed non-vacuous bounds for language models using bounded losses, PAC-Bayes or compression-based arguments, and data-aware analyses (Dziugaite & Roy, [2017](https://arxiv.org/html/2606.18383#bib.bib6); Lotfi et al., [2024a](https://arxiv.org/html/2606.18383#bib.bib9), [b](https://arxiv.org/html/2606.18383#bib.bib10)). Our paper is closest in spirit to this line, but differs in what is being compressed. We do not compress model weights; instead, we certify a frozen predictor through a sparse feature pool derived from internal activations.

### 2.3 Compression, description length, and structural explanations

Compression-based explanations of generalization are closely related to Occam-style bounds and minimum description length principles (Rissanen, [1978](https://arxiv.org/html/2606.18383#bib.bib13); Blumer et al., [1987](https://arxiv.org/html/2606.18383#bib.bib3); Arora et al., [2018](https://arxiv.org/html/2606.18383#bib.bib1)). The key idea is that generalization can sometimes be explained by a concise description of the learned function, even when the overall parameter count is large. Our contribution fits this perspective, but with an emphasis on *structural interpretability*. The resulting certificate is not presented as the tightest possible post-hoc risk bound; rather, it aims to expose a small set of interpretable ingredients (concept-pool size, reconstruction error, and pool mismatch) that make the bound informative.

## 3 Preliminaries

We first define the SAE notation, then state the post-hoc certification protocol, and finally formalize the proposed bound.

### 3.1 Sparse Autoencoder (SAE)

We analyze a base LM, denoted $M$, which maps an input $x$ to a high-dimensional hidden representation $h(x) \in \mathbb{R}^{d}$ at a specific layer. To interpret this dense representation, we use a sparse autoencoder (SAE) $S$, consisting of an encoder $S_{E} : \mathbb{R}^{d} \rightarrow \mathbb{R}^{m}$ and a decoder $S_{D} : \mathbb{R}^{m} \rightarrow \mathbb{R}^{d}$, where the dictionary size $m$ is typically much larger than the model width $d$ ($m \gg d$). The SAE decomposes the activation into a sparse set of interpretable features via the following operations:

i) Encoding: The dense hidden state is projected to a pre-activation feature vector $a(x) := S_{E}(h(x)) \in \mathbb{R}^{m}$.

ii) Sparsification: We apply a non-linear $\mathrm{TopK}$ operator, which retains the $k$ coefficients with the largest magnitudes and sets the rest to zero. This yields the interpretable sparse code $c(x) := \mathrm{TopK}(a(x))$.

iii) Reconstruction: The sparse code is mapped back to the original activation space to produce the approximate hidden state $\hat{h}(x) := S_{D}(c(x))$.
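The three operations above can be illustrated with a minimal pure-Python sketch. The helper names (`matvec`, `top_k`, `sae_forward`) and the toy weights are hypothetical, and the sketch assumes a bias-free linear encoder and decoder; a real pretrained SAE may include biases or an additional non-linearity.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: one inner list per output coordinate


def matvec(W: Matrix, v: Vector) -> Vector:
    """Return the matrix-vector product W @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def top_k(a: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of a and set the rest to zero."""
    keep = set(sorted(range(len(a)), key=lambda j: abs(a[j]), reverse=True)[:k])
    return [a[j] if j in keep else 0.0 for j in range(len(a))]


def sae_forward(h: Vector, W_enc: Matrix, W_dec: Matrix, k: int) -> Tuple[Vector, Vector, Vector]:
    """Encode, sparsify, and reconstruct one hidden state."""
    a = matvec(W_enc, h)      # i)   pre-activation features a(x), length m
    c = top_k(a, k)           # ii)  sparse code c(x) with at most k non-zeros
    h_hat = matvec(W_dec, c)  # iii) reconstruction h_hat(x), length d
    return a, c, h_hat


# Toy example with d = 2, m = 4, k = 1 (illustrative values only).
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # m x d
W_dec = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]      # d x m
a, c, h_hat = sae_forward([2.0, 1.0], W_enc, W_dec, k=1)
```

In this toy case only the third feature survives sparsification, so the reconstruction is a scaled copy of the third decoder column.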

#### Proxy predictor\.

The proxy predictor $S \circ M$ is obtained by feeding $\hat{h}(x)$ into the downstream part of $M$ (from the insertion layer onward) to produce a predictive distribution over outputs. We write $(S \circ M)(x)$ for the resulting predictive distribution.

### 3.2 Overview of the Theoretical Approach

The goal of Section [4](https://arxiv.org/html/2606.18383#S4) is to derive a generalization certificate for the base model $M$ using the proxy predictor $S \circ M$.

Our analysis proceeds in two phases:

i) Phase 1 (Freezing): The base model $M$ and the SAE components ($S_{E}$, $S_{D}$) are pretrained and fixed. For the purpose of our theorem, they are treated as frozen oracles, not as variable hypotheses.

ii) Phase 2 (Certification): On a held-out calibration stream, we construct a concept pool $G^{*}$ from the union of observed Top-$k$ SAE supports and use its size $P := |G^{*}|$ as the complexity measure instead of the raw parameter count of the base model $M$. This in turn allows the bound to become non-vacuous at practical sample sizes.

## 4 Problem Definition

### 4.1 Risk Formulation

Let $\mathcal{X}$ be the input space and $\mathcal{D}$ be an unknown distribution over $\mathcal{X}$. In the language modeling setting, we take a sample to be a token sequence $x = x_{1:T}$. We define the population risk as

$$\mathcal{R}(M) := \mathbb{E}_{x_{1:T} \sim \mathcal{D}}\big[\ell(M, x_{1:T})\big]$$

and, given $N$ i.i.d. samples $\{x^{(i)}_{1:T}\}_{i=1}^{N}$, the empirical risk as

$$\hat{\mathcal{R}}(M) := \frac{1}{N}\sum_{i=1}^{N} \ell\big(M, x^{(i)}_{1:T}\big)$$

Note that the token sequences $\{x^{(i)}_{1:T}\}_{i=1}^{N}$ must be i.i.d. samples for the bound to hold. To this end, we break the text into contiguous token sequences and then sample these sequences uniformly at random from the dataset. This is the approach also used by Lotfi et al. ([2024a](https://arxiv.org/html/2606.18383#bib.bib9)).
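A minimal sketch of this sampling step is given below. The function names are hypothetical, and two details are assumptions not stated in the text: chunks are non-overlapping, and sampling is done with replacement.

```python
import random
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[int], T: int) -> List[List[int]]:
    """Split a token stream into non-overlapping contiguous chunks of length T."""
    n_chunks = len(tokens) // T  # a trailing partial chunk is dropped
    return [list(tokens[i * T:(i + 1) * T]) for i in range(n_chunks)]


def sample_chunks(chunks: List[List[int]], N: int, seed: int = 0) -> List[List[int]]:
    """Draw N chunks uniformly at random, with replacement (an assumption)."""
    rng = random.Random(seed)
    return [chunks[rng.randrange(len(chunks))] for _ in range(N)]


def empirical_risk(losses: Sequence[float]) -> float:
    """Average of per-sequence losses, i.e. the empirical risk."""
    return sum(losses) / len(losses)


chunks = chunk_tokens(list(range(10)), T=4)  # toy token ids 0..9
sample = sample_chunks(chunks, N=5)
```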

### 4.2 The Sparse Autoencoder (SAE) based Generalization Framework

To formalize the complexity of $M$, we introduce a sparse autoencoder (SAE) probe, denoted $S$. The proxy predictor $S \circ M$ is defined by replacing the internal activation $h_{t} = M(x_{<t})$ with its SAE reconstruction $\tilde{h}_{t} = S(h_{t}) = S(M(x_{<t}))$, which is then fed through the same downstream layers to produce smoothed proxy probabilities $\tilde{p}_{S \circ M}$.

###### Definition 4.1 (Sparse Autoencoder Class).

Let $\mathcal{H}_{k,m}$ be the class of functions realizable by an SAE with dictionary $W \in \mathbb{R}^{d \times m}$ (with unit-norm columns) and sparsity constraint $k$. For any input $x$, the output is $S(x) = W \cdot c(x)$, where $\|c(x)\|_{0} \leq k$. The SAE effectively compresses the dense activation $M(x)$ into a sparse code $c$.

###### Definition 4.2 (Reconstruction Inefficiency).

We define a *loss-level* reconstruction gap $\epsilon_{\mathrm{loss}}$ as the expected discrepancy in loss between the original predictor and the proxy predictor:

$$\epsilon_{\mathrm{loss}} = \mathbb{E}_{x_{1:T} \sim \mathcal{D}}\big[\,|\ell(M, x_{1:T}) - \ell(S \circ M, x_{1:T})|\,\big] \tag{1}$$

To rigorously bound the complexity, we formalize the hypothesis space of the sparse proxy\.

### 4.3 Setup and notation

#### Loss and risk.

In the language modeling setting, let the model induce next-token probabilities $p_{M}(\cdot \mid x_{<t})$ over a vocabulary of size $V$. The standard bits-per-dimension (BPD) loss is:

$$\ell_{\mathrm{bpd}}(M, x_{1:T}) := -\frac{1}{T}\sum_{t=1}^{T} \log p_{M}(x_{t} \mid x_{<t}) \tag{2}$$

Since $\ell_{\mathrm{bpd}}$ is unbounded when $p_{M}(x_{t} \mid x_{<t})$ can be arbitrarily small, we use *prediction smoothing*, as first proposed and defined in Lotfi et al. ([2024a](https://arxiv.org/html/2606.18383#bib.bib9)). For a fixed $\alpha \in (0,1)$:

$$\tilde{p}_{M}(\cdot \mid x_{<t}) := (1-\alpha)\,p_{M}(\cdot \mid x_{<t}) + \alpha/V \tag{3}$$

We then define the smoothed BPD loss:

$$\ell(M, x_{1:T}) := -\frac{1}{T}\sum_{t=1}^{T} \log \tilde{p}_{M}(x_{t} \mid x_{<t}) \tag{4}$$

This loss is bounded. Since $\alpha/V \leq \tilde{p}_{M}(x_{t} \mid x_{<t}) \leq 1 - \alpha + \alpha/V$, and with the logarithm in Eq. (4) taken base 2 so that the loss is measured in bits, we have $\log_{2}(V/\alpha) - \Delta \leq \ell(M, x_{1:T}) \leq \log_{2}(V/\alpha) =: B$, where $\Delta = \log_{2}(1 + (1-\alpha)V/\alpha)$. For a rigorous derivation, see Appendix A.2 of Lotfi et al. ([2024a](https://arxiv.org/html/2606.18383#bib.bib9)).
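The smoothing step and the resulting loss range can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the logarithm is taken base 2 (so losses are in bits), and the toy vocabulary has $V = 4$.

```python
import math
from typing import List, Tuple


def smooth(p: List[float], alpha: float) -> List[float]:
    """Prediction smoothing: mix p with the uniform distribution over V tokens."""
    V = len(p)
    return [(1.0 - alpha) * q + alpha / V for q in p]


def smoothed_loss_bits(step_probs: List[List[float]], targets: List[int], alpha: float) -> float:
    """Average of -log2 p_tilde(x_t | x_<t) over the T positions of one sequence."""
    total = 0.0
    for p, t in zip(step_probs, targets):
        total -= math.log2(smooth(p, alpha)[t])
    return total / len(targets)


def loss_bounds(V: int, alpha: float) -> Tuple[float, float]:
    """Return (B - Delta, B), the lower and upper ends of the loss range."""
    B = math.log2(V / alpha)
    delta = math.log2(1.0 + (1.0 - alpha) * V / alpha)
    return B - delta, B


# Toy model that puts all its mass on one token at each of two positions.
probs = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
worst = smoothed_loss_bits(probs, [1, 0], alpha=0.5)  # targets get zero raw probability
best = smoothed_loss_bits(probs, [0, 3], alpha=0.5)   # targets get raw probability one
lo, hi = loss_bounds(V=4, alpha=0.5)
```

The two extreme cases attain the two ends of the range: a target with zero raw probability gives a loss of $B$, and a target with raw probability one gives $B - \Delta$.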

Let $z_{i} = x^{(i)}_{1:T}$ denote the $i$-th sampled sequence. For a deterministic predictor $f$ (the LM) and $N$ i.i.d. samples $\{z_{i}\}_{i=1}^{N}$, define the population and empirical risks

$$R(f) := \mathbb{E}_{z \sim \mathcal{D}}[\ell(f; z)]; \qquad \widehat{R}_{S}(f) := \frac{1}{N}\sum_{i=1}^{N} \ell(f; z_{i}) \tag{5}$$

The subscript $S$ is used to refer to the sparse-proxy setting.

#### Approximation error between $M$ and $S \circ M$.

Define the point-wise loss gap

$$\Delta_{\mathrm{loss}}(z) := \big|\ell(M; z) - \ell(S \circ M; z)\big| \in [0, \Delta] \tag{6}$$

and its population and empirical means

$$\epsilon_{\mathrm{loss}} := \mathbb{E}_{z \sim \mathcal{D}}[\Delta_{\mathrm{loss}}(z)]; \qquad \widehat{\epsilon}_{\mathrm{loss}} := \frac{1}{N}\sum_{i=1}^{N} \Delta_{\mathrm{loss}}(z_{i}) \tag{7}$$

### 4.4 Concept-pool assumption and restricted proxy class

For an index set $G \subseteq [m] := \{1, \dots, m\}$, let $\mathbf{1}_{G} \in \{0,1\}^{m}$ denote its indicator vector. Define the masked activation vector

$$a_{G}(x) := a(x) \odot \mathbf{1}_{G} \tag{8}$$

That is, $a_{G}(x)$ agrees with $a(x)$ on the coordinates indexed by $G$ and is zero elsewhere. Define the pool-restricted code and reconstruction

$$c_{G}(x) := \mathrm{TopK}(a_{G}(x)), \quad \widehat{h}_{G}(x) := S_{D}(c_{G}(x)) \tag{9}$$

Let $h_{G}$ denote the predictor obtained by feeding $\widehat{h}_{G}(x)$ into the downstream layers of $M$, analogously to $S \circ M$. Thus $\{h_{G}\}$ is a family of proxy predictors indexed by pools $G$.

#### Top-$k$ support.

Let $\mathrm{supp}(v) := \{j : v_{j} \neq 0\}$. Define the top-$k$ support event with respect to a pool $G$:

$$E_{G}(x) := \Big\{\mathrm{supp}\big(\mathrm{TopK}(a(x))\big) \subseteq G\Big\} \tag{10}$$

We posit an integer $P \in \{k, \dots, m\}$ (the concept-pool size) and a subset $G^{\star} \subseteq [m]$ with $|G^{\star}| = P$ such that

$$\Pr_{x \sim \mathcal{D}}\big(E_{G^{\star}}(x)\big) \geq 1 - \eta \tag{11}$$

for some $\eta \in [0,1]$.

We define the following restricted hypothesis class:

$$\mathcal{H}_{P} := \{h_{G} : G \subseteq [m],\ |G| = P\} \tag{12}$$

Then $|\mathcal{H}_{P}| = \binom{m}{P}$.
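To get a feel for the size of this class, the sketch below computes $\log\binom{m}{P}$ and compares it with the standard upper bound $P\log(em/P)$, which is the form of the complexity term used in Section 4.5. The function names and the example values of $m$ and $P$ are hypothetical.

```python
import math


def log_binom(m: int, P: int) -> float:
    """Natural log of C(m, P), computed stably via the log-gamma function."""
    return math.lgamma(m + 1) - math.lgamma(P + 1) - math.lgamma(m - P + 1)


def occam_upper(m: int, P: int) -> float:
    """P * log(e * m / P), a standard upper bound on log C(m, P)."""
    return P * math.log(math.e * m / P)


# Hypothetical dictionary size and pool size, chosen only for illustration.
m_example, P_example = 65_536, 10_000
exact = log_binom(m_example, P_example)
upper = occam_upper(m_example, P_example)
```

Both quantities grow roughly linearly in $P$ (up to a logarithmic factor), so the complexity term depends on the active pool size and not on the parameter count of $M$.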

###### Theorem 4.1 (Generalization bound via compression (Occam) under a concept pool).

Let $\delta \in (0,1)$ and define $\delta_{1} = \delta_{2} = \delta/2$. Then with probability at least $1 - \delta$ over $S \sim \mathcal{D}^{N}$,

$$\begin{aligned} R(M) \leq{}& \widehat{R}_{S}(h_{G^{\star}}) + \widehat{\epsilon}_{\mathrm{loss}} + \eta B \\ &+ B\sqrt{\frac{\log|\mathcal{H}_{P}| + \log(2/\delta)}{2N}} + B\sqrt{\frac{\log(4/\delta)}{2N}} \end{aligned} \tag{13}$$

###### Proof.

Proof sketch. The result follows by combining: (i) transfer from the base model to the unrestricted SAE proxy through the reconstruction gap; (ii) transfer from the unrestricted proxy to the pool-restricted proxy through the mismatch event; and (iii) an Occam bound over the finite class $\mathcal{H}_{P}$. The full proof is given in Appendix [B.1](https://arxiv.org/html/2606.18383#A2.Thmtheorem1a).

∎
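For orientation, the three steps of the proof sketch can be written as a chain of inequalities. This is a reconstruction from the statement of Theorem 4.1, assuming that the concentration steps are an Occam bound with failure probability $\delta_{1}$ and a two-sided Hoeffding bound with failure probability $\delta_{2}$; the full argument in Appendix B.1 may differ in detail.

```latex
\begin{align*}
R(M) &\le R(S \circ M) + \epsilon_{\mathrm{loss}}
  && \text{(i) reconstruction gap, Eq.~(7)} \\
R(S \circ M) &\le R(h_{G^{\star}}) + \eta B
  && \text{(ii) $h_{G^{\star}} = S \circ M$ on $E_{G^{\star}}$; losses differ by at most $B$ otherwise} \\
R(h_{G^{\star}}) &\le \widehat{R}_{S}(h_{G^{\star}})
  + B\sqrt{\frac{\log|\mathcal{H}_{P}| + \log(1/\delta_{1})}{2N}}
  && \text{(iii) Occam bound, w.p.\ $\ge 1-\delta_{1}$} \\
\epsilon_{\mathrm{loss}} &\le \widehat{\epsilon}_{\mathrm{loss}}
  + B\sqrt{\frac{\log(2/\delta_{2})}{2N}}
  && \text{Hoeffding, w.p.\ $\ge 1-\delta_{2}$}
\end{align*}
```

Substituting $\delta_{1} = \delta_{2} = \delta/2$ gives $\log(1/\delta_{1}) = \log(2/\delta)$ and $\log(2/\delta_{2}) = \log(4/\delta)$, which matches the two square-root terms in Eq. (13); a union bound over the two failure events gives the overall probability $1 - \delta$.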

### 4.5 Measurement of Complexity Parameters

Theorem [4.1](https://arxiv.org/html/2606.18383#S4.Thmtheorem1a) is stated for a pool $G^{*}$ with the associated concept-pool size $P$ and pool penalty $\eta$. The same argument applies to any pool $G$ that is fixed before the evaluation draw, thanks to the union bound over all hypotheses in the hypothesis space $\mathcal{H}_{P}$. In our experiments, the instantiated pool $G^{*}$ is selected from a calibration stream disjoint from the evaluation data.

In particular, given a calibration corpus $\mathcal{D}_{\mathrm{cal}}$ with $N_{\mathrm{cal}}$ examples, we define the pool

$$G^{*} := \bigcup_{x \in \mathcal{D}_{\mathrm{cal}}} \mathrm{supp}\!\left(\mathrm{TopK}(S_{E}(M(x)))\right) \tag{14}$$

Its size $P = |G^{*}|$ induces the complexity term $P\log(em/P)$ by Lemma [B.4](https://arxiv.org/html/2606.18383#A2.Thmtheorem4). The pool-mismatch quantity $\eta$ is likewise estimated empirically for $G^{*}$ on the evaluation sample, with the corresponding finite-sample concentration term ($O(B/\sqrt{N})$) absorbed into the reported certificate.
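A minimal sketch of the pool construction and the empirical mismatch rate follows. The function names and toy activations are hypothetical. As a simplifying assumption, each example is represented by a single pre-activation vector; for a token sequence, the support would have to be collected over all positions.

```python
from typing import Iterable, List, Set


def top_k_support(a: List[float], k: int) -> Set[int]:
    """Indices of the non-zero entries among the k largest-magnitude entries of a."""
    order = sorted(range(len(a)), key=lambda j: abs(a[j]), reverse=True)[:k]
    return {j for j in order if a[j] != 0.0}


def build_pool(cal_activations: Iterable[List[float]], k: int) -> Set[int]:
    """Union of Top-k supports over the calibration activations, as in Eq. (14)."""
    pool: Set[int] = set()
    for a in cal_activations:
        pool |= top_k_support(a, k)
    return pool


def mismatch_rate(eval_activations: List[List[float]], pool: Set[int], k: int) -> float:
    """Fraction of evaluation examples whose Top-k support is not contained in the pool."""
    misses = sum(1 for a in eval_activations if not top_k_support(a, k) <= pool)
    return misses / len(eval_activations)


# Toy data with m = 4 features and k = 1.
cal = [[3.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 0.5]]
evl = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0],
       [0.0, 4.0, 0.0, 1.0], [0.0, 0.0, 0.0, 2.0]]
pool = build_pool(cal, k=1)
eta_hat = mismatch_rate(evl, pool, k=1)
```

Keeping the calibration and evaluation data disjoint, as the text requires, is what allows the pool to be treated as fixed before the evaluation draw.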

In the empirical evaluation, we replace the population mismatch rate $\eta$ by its empirical estimate augmented with a finite-sample concentration term, namely:

$$\eta \leq \hat{\eta} + \sqrt{\frac{\log(2/\delta)}{2N}}$$

This yields the following empirical certificate:

$$\begin{aligned} R(M) \leq{}& \widehat{R}_{S}(h_{G^{*}}) + \widehat{\epsilon}_{\mathrm{loss}} + B\left(\hat{\eta} + \sqrt{\frac{\log(2/\delta)}{2N}}\right) \\ &+ B\sqrt{\frac{P\log(em/P) + \log(2/\delta)}{2N}} + B\sqrt{\frac{\log(4/\delta)}{2N}} \end{aligned} \tag{15}$$

All reported empirical certificates in the paper are computed using Eq. [15](https://arxiv.org/html/2606.18383#S4.E15).
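The right-hand side of Eq. (15) is straightforward to evaluate once its ingredients are measured. The sketch below is an illustration, not the authors' released code: the function name is hypothetical, natural logarithms are assumed inside the square roots, and all input values are made up.

```python
import math


def certificate_bits(proxy_risk: float, eps_loss: float, eta_hat: float,
                     B: float, P: int, m: int, N: int, delta: float) -> float:
    """Evaluate the right-hand side of Eq. (15)."""
    # Mismatch term: empirical rate plus its finite-sample correction.
    eta_term = B * (eta_hat + math.sqrt(math.log(2.0 / delta) / (2.0 * N)))
    # Occam term over the pool-indexed class, with log C(m, P) <= P log(em/P).
    complexity = P * math.log(math.e * m / P)
    occam_term = B * math.sqrt((complexity + math.log(2.0 / delta)) / (2.0 * N))
    # Concentration of the empirical reconstruction gap.
    gap_term = B * math.sqrt(math.log(4.0 / delta) / (2.0 * N))
    return proxy_risk + eps_loss + eta_term + occam_term + gap_term


# Hypothetical inputs, chosen only to show how the bound shrinks with N.
common = dict(proxy_risk=6.0, eps_loss=1.0, eta_hat=0.005,
              B=18.0, P=10_000, m=65_536, delta=0.05)
bound_small_n = certificate_bits(N=100_000, **common)
bound_large_n = certificate_bits(N=1_000_000, **common)
```

As $N$ grows, the three square-root terms vanish and the bound approaches the $N$-independent part $\widehat{R}_{S}(h_{G^{*}}) + \widehat{\epsilon}_{\mathrm{loss}} + B\hat{\eta}$.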

Although the reconstruction-gap term rewards accurate SAE reconstructions, the certificate is not reduced to reconstruction quality alone: the pool-mismatch and sparse-complexity terms penalize non-transferable or overly diffuse sparse supports. Appendix [J](https://arxiv.org/html/2606.18383#A10) formalizes why even a perfect unrestricted SAE copy does not automatically yield a non-vacuous certificate.

#### Constructing the pool\-restricted predictor\.

To instantiate $h_{G^{*}}$, we run the base model up to the probed layer, encode the hidden state with the SAE, mask all features outside $G^{*}$, apply the Top-$k$ operator, decode the result, and resume the forward pass with the reconstructed activation. This yields the empirical risk $\widehat{R}_{S}(h_{G^{*}})$ and the mismatch statistics used in the bound.
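The mask, Top-$k$, and decode steps of this procedure can be sketched as follows. The function names and toy values are hypothetical, the decoder is assumed to be linear and bias-free, and running the base model before and after the patched layer is omitted.

```python
from typing import List, Set


def pool_restricted_code(a: List[float], pool: Set[int], k: int) -> List[float]:
    """Mask features outside the pool, then keep the k largest-magnitude entries."""
    masked = [a[j] if j in pool else 0.0 for j in range(len(a))]          # a_G(x)
    keep = set(sorted(range(len(masked)), key=lambda j: abs(masked[j]), reverse=True)[:k])
    return [masked[j] if j in keep else 0.0 for j in range(len(masked))]  # c_G(x)


def decode(W_dec: List[List[float]], c: List[float]) -> List[float]:
    """Linear decoder producing the reconstructed activation h_hat_G(x)."""
    return [sum(w * v for w, v in zip(row, c)) for row in W_dec]


a = [2.0, 1.0, 3.0, 1.0]                               # toy pre-activations, m = 4
W_dec = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5]]  # toy decoder, d x m
c_hit = pool_restricted_code(a, pool={2, 3}, k=1)      # pool covers the top-1 feature
c_miss = pool_restricted_code(a, pool={0, 1}, k=1)     # pool misses the top-1 feature
h_hit, h_miss = decode(W_dec, c_hit), decode(W_dec, c_miss)
```

When the pool contains the unrestricted Top-$k$ support, the restricted code coincides with the unrestricted one; otherwise a different feature is selected and the reconstruction changes, which is the situation the mismatch rate accounts for.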

### 4.6 Discussion and scope

#### Why the finite proxy class is valid.

The base model $M$ and the SAE are fixed before drawing the evaluation sample. Once these components are frozen, the remaining family of predictors is indexed only by feature pools $G \subseteq [m]$ of size $P$. This is what makes the Occam-style argument applicable: the relevant hypothesis class is the finite family $\mathcal{H}_{P} = \{h_{G} : |G| = P\}$, not the original parameter space of the language model.

#### What the theorem does and does not show.

The theorem is a post-hoc guarantee for a trained predictor mediated by a sparse proxy. It does *not* prove that sparse features emerge during training, nor that sparse semantics are the sole reason large language models generalize. The paper should therefore be read as providing a structural certification framework and associated diagnostics, not a complete theory of language-model learning.

Throughout the paper, we use *trust* in a minimal risk-faithful sense: the certificate must show that the frozen model is informative relative to an uninformed baseline and that the SAE-induced proxy remains behaviorally close to it. The margin to the baseline captures usefulness, while $\epsilon_{\mathrm{loss}} + \eta B$ captures faithfulness of the pool-restricted sparse proxy (Appendix [H](https://arxiv.org/html/2606.18383#A8)). Our claims therefore concern proxy informativeness and low distortion, not the semantic completeness or causal sufficiency of individual SAE features.

## 5 Experiments

Our experiments are organized around three questions: (Q1) Can the SAE-based certificate become non-vacuous at practical sample sizes? (Q2) How does the bound behave when the SAE patching site differs across layers? and (Q3) Do sparse statistics capture semantic structure rather than only raw sparsity counts?

We evaluate GPT-2 Small (Radford et al., [2019](https://arxiv.org/html/2606.18383#bib.bib12)), Gemma-2B (Team et al., [2024](https://arxiv.org/html/2606.18383#bib.bib15)), and Llama-3-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.18383#bib.bib8)) alongside pretrained SAEs at a fixed internal layer. Experiments use English text from the C4 dataset ([https://huggingface.co/datasets/allenai/c4](https://huggingface.co/datasets/allenai/c4)); full setup details are in Appendix [C](https://arxiv.org/html/2606.18383#A3).

![Refer to caption](https://arxiv.org/html/2606.18383v1/x1.png)

Figure 1: We plot the bound certificate (in bits) against evaluation sample size $N$ for GPT-2 Small (blue), Gemma-2B (red), and Llama-3-8B. For all experiments, $\alpha = 0.5$.

### 5.1 Non-Vacuous Generalization Bounds

We first ask whether the certificate in Eq. ([15](https://arxiv.org/html/2606.18383#S4.E15)) becomes non-vacuous at realistic sample sizes. Unless otherwise noted, we use Top-$k$ with $k = 64$ for reconstruction, and the calibration pass uses 2.24M tokens (approximately 70k sequences of length 32). Figure [1](https://arxiv.org/html/2606.18383#S5.F1) shows the bound as a function of $N$, with values in Table [1](https://arxiv.org/html/2606.18383#S5.T1). All three models cross below the random-guess baseline (which is $\log_{2}(V/\alpha)$) at practical sample sizes. Gemma-2B reaches this regime earlier than GPT-2 Small, despite being much larger in parameter count. Because Llama-3-8B has 8 billion parameters, its bound is non-vacuous only when using the sparse proxy, and achieving it requires a larger sample size $N$ than for Gemma-2B or GPT-2 Small. This supports the framework’s core thesis: the relevant measure of complexity is the sparse proxy, not the raw parameter count.
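The sample size at which a certificate of the form in Eq. (15) first drops below a baseline can be solved in closed form if the empirical terms and the pool size $P$ are treated as fixed while $N$ varies (an assumption made here for illustration). Writing the $N$-independent part as `floor` $= \widehat{R}_{S}(h_{G^{*}}) + \widehat{\epsilon}_{\mathrm{loss}} + B\hat{\eta}$, the remaining terms equal $B c / \sqrt{N}$ for a constant $c$. The function names and example values below are hypothetical.

```python
import math
from typing import Optional


def finite_sample_coeff(P: int, m: int, delta: float) -> float:
    """Constant c such that the N-dependent part of Eq. (15) equals B * c / sqrt(N)."""
    complexity = P * math.log(math.e * m / P)
    return (math.sqrt(math.log(2.0 / delta) / 2.0)
            + math.sqrt((complexity + math.log(2.0 / delta)) / 2.0)
            + math.sqrt(math.log(4.0 / delta) / 2.0))


def crossing_n(floor: float, baseline: float, B: float,
               P: int, m: int, delta: float) -> Optional[int]:
    """Smallest N with floor + B*c/sqrt(N) < baseline, or None if no such N exists."""
    if floor >= baseline:
        return None  # the bound stays vacuous even with infinitely many samples
    c = finite_sample_coeff(P, m, delta)
    return math.floor((B * c / (baseline - floor)) ** 2) + 1


# Hypothetical values, for illustration only.
coeff = finite_sample_coeff(P=10_000, m=65_536, delta=0.05)
n_cross = crossing_n(floor=7.0, baseline=17.0, B=18.0, P=10_000, m=65_536, delta=0.05)
n_never = crossing_n(floor=17.5, baseline=17.0, B=18.0, P=10_000, m=65_536, delta=0.05)
```

Because the required $N$ scales with $P\log(em/P)$ divided by the squared margin between the baseline and the floor, a larger active pool or a smaller margin both push the crossing point to larger sample sizes.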

![Refer to caption](https://arxiv.org/html/2606.18383v1/x2.png)

Figure 2: *Left:* Llama-3-8B observed bounds across layers. The horizontal dotted line marks the $16.98$-bit baseline. Layers 20, 24, 28, and 30 eventually become non-vacuous with increasing samples, whereas layers 4, 8, 12, and 16 do not, because their asymptotic floors remain above the baseline. *Right:* Output fidelity on Llama-3-8B as $N$ grows. The key pattern is that later layers preserve the base model’s output behavior much more faithfully (and remain stable with increasing $N$), which explains the depth-wise drop in proxy risk and reconstruction gap.

Table 1: Generalization Bounds and Crossing Sample Sizes. We report the evaluated generalization bound (in bits) alongside the random-guess log-loss baseline ($\log_{2}(V/\alpha)$). The sample size $N$ represents the evaluation scale, while Crossing $N$ denotes the sample-size threshold at which the bound becomes non-vacuous (drops below the baseline).

| Model | Sample Size ($N$) | Active Pool ($P$) | Empirical Risk ($\hat{R}_{S}(h_{G})$) | Bound (Bits) | Baseline (Bits) | Crossing Sample (Crossing $N$) |
|---|---|---|---|---|---|---|
| GPT-2 Small | 62,296 | 24,510 | 7.36 | 15.08 | 15.62 | 54,170 |
| Gemma-2B | 28,722 | 16,078 | 6.42 | 17.22 | 17.97 | 24,975 |
| Llama-3-8B | 256,620 | 130,701 | 6.21 | 16.31 | 16.98 | 223,148 |

### 5.2 Layerwise certifiability: A case study on Llama-3-8B

We next ask whether the choice of patch layer materially affects certification in a model where layerwise variation is substantial. Llama-3-8B is a useful case study for this purpose because, unlike GPT-2 Small (Appendix [E](https://arxiv.org/html/2606.18383#A5)), it exhibits strong heterogeneity across patch locations under the same certification protocol. We therefore use Llama-3-8B to analyze when layer choice matters, and why. Concretely, we evaluate pretrained SAEs at layers $\{4, 8, 12, 16, 20, 24, 28, 30\}$ and measure the empirical proxy risk, reconstruction gap, pool-mismatch rate, active pool size, and resulting certificate at each layer.

A clear depth trend emerges. The certificate becomes progressively sharper as the patched layer moves later in the network. Early and middle layers remain vacuous because both the empirical proxy risk and the reconstruction-gap term are extremely large; indeed, for layers 4, 8, and 12, these two terms alone already exceed the random-guess baseline, which entails a vacuous bound even with infinite sample size. By contrast, later layers become substantially easier to certify, and the bound becomes non-vacuous from layer 20 onward with sufficient samples. Table [2](https://arxiv.org/html/2606.18383#S5.T2) summarizes the layerwise decomposition of the bound.

Table 2: Layerwise decomposition of the Llama-3-8B certificate. The baseline is $16.98$ bits. Despite a monotonically increasing $P$, certifiability improves with depth, driven by lower proxy risks and reconstruction gaps. $N^{\star}$ denotes the sample size at which the bound becomes non-vacuous. † “Never” means that even the asymptotic floor $\hat{R}_{\mathcal{S}}(h_{G}) + \hat{\epsilon}_{\mathrm{loss}} + \hat{\eta}B$ remains above the baseline, leading to vacuous behaviour even with infinite samples.

| Layer | $P$ | $\hat{R}_{\mathcal{S}}(h_{G})$ | $\hat{\epsilon}_{\mathrm{loss}}$ | $\hat{\eta}$ | First non-vacuous $N^{\star}$ | Bound @ $N^{\star}$ | Asymptotic floor |
|---|---|---|---|---|---|---|---|
| 4 | 72,747 | 15.71 | 10.42 | 0.0088 | Never† | – | 26.29 |
| 8 | 84,112 | 15.14 | 9.85 | 0.0144 | Never† | – | 25.25 |
| 12 | 88,638 | 12.12 | 6.83 | 0.0201 | Never† | – | 19.30 |
| 16 | 88,661 | 10.72 | 5.45 | 0.0517 | Never† | – | 17.10 |
| 20 | 109,638 | 8.73 | 3.45 | 0.0197 | 1.12M | 16.88 | 12.53 |
| 24 | 119,717 | 7.90 | 2.61 | 0.0122 | 560k | 16.91 | 10.74 |
| 28 | 126,357 | 7.84 | 2.56 | 0.0086 | 560k | 16.74 | 10.56 |
| 30 | 130,701 | 6.21 | 0.92 | 0.0041 | 256k | 16.31 | 7.20 |

The key observation is that the late-layer advantage is *not* explained by a smaller complexity term. As Table [2](https://arxiv.org/html/2606.18383#S5.T2) shows, the active concept-pool size $P$ grows monotonically with depth and approaches near-saturation of the SAE dictionary in the latest layers. Consequently, the complexity does not shrink with depth; if anything, it becomes slightly larger. Instead, both the empirical proxy risk $\widehat{R}_{S}(h_{G^{\star}})$ and the reconstruction-gap term $\hat{\epsilon}_{\mathrm{loss}}$ decrease sharply as we move to later layers. In other words, later layers are easier to certify not because they are combinatorially smaller, but because the sparse proxy preserves the model’s predictive behavior much more faithfully there. Figure [2](https://arxiv.org/html/2606.18383#S5.F2) (Left) visualizes this depth dependence directly.

To understand this layer-wise effect more directly, we also measure *output fidelity* between the original model $M$, the unrestricted SAE proxy $S \circ M$, and the pool-restricted proxy $h_{G^{*}}$. The results are reported in Figure [2](https://arxiv.org/html/2606.18383#S5.F2) (Right), which visualizes the corresponding late-layer fidelity trends (the exact values are provided in Appendix Table [6](https://arxiv.org/html/2606.18383#A5.T6)). Two observations stand out. First, output fidelity improves substantially from layer 20 to layer 30: KL divergence drops sharply, top-1 agreement rises from roughly 36% to 67%, and the average absolute gold-token log-probability difference decreases by almost a factor of three. Second, the fidelity metrics for $M$ versus $S \circ M$ and $M$ versus $h_{G^{*}}$ are nearly identical across all evaluated layers (omitted from the figures, as the differences only appear at the second decimal place). This indicates that the dominant source of distortion is the SAE bottleneck itself rather than the additional restriction to the concept pool $G^{*}$. In other words, once the SAE reconstruction is fixed, pruning to $G^{*}$ incurs almost no further damage in these layers.
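The three output-fidelity metrics named above can be sketched as follows. This is a minimal NumPy illustration, not the evaluation code used for Figure 2: the function name `fidelity_metrics`, the random logits, and the choice to report the gold-token gap in nats are all assumptions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def fidelity_metrics(base_logits, proxy_logits, gold):
    """Compare base and proxy next-token distributions position by position."""
    lp_base = log_softmax(base_logits)
    lp_proxy = log_softmax(proxy_logits)
    # KL(base || proxy) per position, in nats, converted to bits below.
    kl_nats = (np.exp(lp_base) * (lp_base - lp_proxy)).sum(axis=-1)
    rows = np.arange(len(gold))
    agree = lp_base.argmax(axis=-1) == lp_proxy.argmax(axis=-1)
    gap = np.abs(lp_base[rows, gold] - lp_proxy[rows, gold])
    return {
        "kl_bits": float(kl_nats.mean() / np.log(2)),
        "top1_agreement": float(agree.mean()),
        "gold_logprob_gap": float(gap.mean()),  # nats (an assumption)
    }


# Toy data: 8 positions over a hypothetical 100-token vocabulary.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 100))
proxy = base + 0.3 * rng.normal(size=(8, 100))
gold = rng.integers(0, 100, size=8)
metrics = fidelity_metrics(base, proxy, gold)
print(metrics)
```

Passing the same logits for both arguments gives zero KL, full top-1 agreement, and zero gold-token gap, which is a quick sanity check before running the comparison on real model outputs.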

Implication for the sparse lens. This behaviour suggests that the informativeness of an SAE-based lens is strongly layer-dependent: patching at different layers does not yield the same degree of behavioral faithfulness or the same certificate tightness for the underlying LM.

#### Disentangling Representation Quality from Error Propagation\.

While our layerwise results demonstrate that later\-layer SAE proxies are more trustworthy, in the sense that they yield tighter bounds and smaller proxy distortion, this alone does not identify the root cause of the loose bound seen for the earlier layers\. A looser certificate at an early layer could reflect two distinct phenomena: either the local SAE representation is intrinsically insufficient, or a minor local reconstruction error is being drastically amplified by the deep downstream computation\. To determine whether later layers offer superior semantic alignment or merely benefit from a shorter path to the final output, we introduce a horizon\-conditioned rollout analysis\.

Let $F_{\ell \rightarrow \ell+h}$ denote the subnetwork mapping the residual activation at layer $\ell$ to the activation $h$ blocks later (when $\ell+h$ exceeds the network depth, we use the full remaining computation to the final next-token distribution). For a patch layer $\ell$ and rollout horizon $h \geq 0$, we compare the intermediate activation distribution at layer $\ell+h$, obtained by feeding the base activation $h_{\ell}(x)$ versus the SAE reconstruction $\hat{h}_{\ell}(x)$ through $F_{\ell \rightarrow \ell+h}$. We measure the divergence:

$$\mathrm{KL}^{(\ell,h)}(x) = \mathrm{KL}\!\left(p_{\mathrm{base}}^{(\ell,h)}(\cdot \mid x) \,\|\, p_{\mathrm{proxy}}^{(\ell,h)}(\cdot \mid x)\right)$$

This isolates the two effects: matched short horizons ($h \in \{0,1\}$) probe the local representational fidelity of the SAE patch, while the growth of $\mathrm{KL}^{(\ell,h)}$ as $h$ increases quantifies how strongly that local mismatch is amplified by downstream computation.
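The rollout can be illustrated on a toy network. In the sketch below, the tanh residual blocks, the shared `readout` matrix used to turn an intermediate activation into a distribution, and the Gaussian perturbation standing in for SAE reconstruction error are all assumptions chosen to keep the example self-contained; none of them is taken from the paper's implementation.

```python
import numpy as np


def log_softmax(v):
    z = v - v.max()
    return z - np.log(np.exp(z).sum())


def kl_bits(logits_p, logits_q):
    """KL(p || q) in bits between the softmax distributions of two logit vectors."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return float((np.exp(lp) * (lp - lq)).sum() / np.log(2))


def rollout(act, blocks, start, horizon):
    """Apply blocks[start : start + horizon] to an activation.

    Python slicing clips at the network depth, so an over-long horizon
    simply uses all remaining blocks.
    """
    for W in blocks[start:start + horizon]:
        act = act + np.tanh(W @ act)  # toy residual block
    return act


rng = np.random.default_rng(0)
d, vocab, depth = 16, 50, 8
blocks = [rng.normal(scale=0.5, size=(d, d)) for _ in range(depth)]
readout = rng.normal(size=(vocab, d))  # hypothetical shared readout

h_base = rng.normal(size=d)
h_proxy = h_base + 0.05 * rng.normal(size=d)  # stand-in for SAE error

layer = 2
curve = [
    kl_bits(readout @ rollout(h_base, blocks, layer, hz),
            readout @ rollout(h_proxy, blocks, layer, hz))
    for hz in range(depth - layer + 1)
]
for hz, value in enumerate(curve):
    print(f"h = {hz}: KL = {value:.4f} bits")
```

The first entry of `curve` corresponds to the local ($h = 0$) mismatch, and the shape of the remaining entries shows how that mismatch changes as more downstream blocks are applied.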

![Refer to caption](https://arxiv.org/html/2606.18383v1/x3.png)

Figure 3: Horizon-conditioned Base-Proxy KL: Local Fidelity vs Downstream Error Accumulation

#### Results: The Dual Advantage of Later Layers.

Figure[3](https://arxiv.org/html/2606.18383#S5.F3)reveals three distinct regimes for Llama\-3\-8B:

i) Early layers ($\ell = 4, 8$): Exhibit nontrivial local distortion that undergoes severe amplification even after 1–2 rollout steps, ultimately reaching the largest full-horizon divergence ($> 11$ bits).

ii) Middle layers ($\ell = 12, 16, 20$): Show lower short-horizon divergence, but still suffer from substantial error accumulation over longer horizons.

iii) Late layers ($\ell = 24, 28, 30$): Demonstrate both high local fidelity (low initial KL for $h \in \{0,1,2\}$) and remarkable stability, with divergence remaining flat or growing minimally.

Crucially, this shows that the late-layer advantage is not merely a byproduct of a shorter computational path. If downstream length were the sole factor, short-horizon divergences would align across all layers, yet they differ markedly between early and late layers. These results support a dual explanation for the late-layer advantage: later layers exhibit both better local proxy fidelity and weaker downstream amplification of reconstruction error.

For completeness, the corresponding GPT-2 Small layerwise certificate curves are reported in Appendix [E](https://arxiv.org/html/2606.18383#A5); unlike Llama-3-8B, they remain tightly clustered across depth, which suggests that the layerwise performance pattern is not a universal property of every model but depends largely on how the model and the SAE are trained. The contrast between Llama-3-8B and GPT-2 Small illustrates the intended role of the framework: to diagnose when patch location is consequential and when it is not.

![Refer to caption](https://arxiv.org/html/2606.18383v1/x4.png)

Figure 4: Ablation: Semantic Specificity. Density of the per-sequence reconstruction gap. Green (Real SAE): The baseline error is tightly clustered near 0 bits, indicating high semantic fidelity. Red (Shuffled Features): Randomly permuting the active feature indices, while strictly preserving the per-sample sparsity $k$ and activation magnitudes, drastically shifts the error distribution to the right (mean shifts of $\approx 6.5$, $8.5$, and $9.5$ bits for GPT-2 Small, Gemma-2B, and Llama-3-8B, respectively).

### 5.3 Qualitative Case Study

We further inspect whether tighter late-layer certificates correspond to more interpretable active SAE features. For two deduction prompts, we compare an early layer (Layer 12) and a late layer (Layer 24) using the base–proxy next-token KL and feature-level logit-lens verbalizations of active SAE decoder directions. The early-layer proxy exhibits high KL divergence ($> 7.5$), incorrect next-token predictions, and weakly contextual feature hints. In contrast, the late-layer proxy is much closer to the base model: KL drops sharply, predictions approach or match the target next token, and active features verbalize contextually relevant completions such as ‘silent, silence, quiet’ and ‘sol, dissolve, soluble’. Full examples are provided in Appendix Section [I](https://arxiv.org/html/2606.18383#A9). This supports the same pattern as the certificate: later layers yield sparse proxies that are both lower-distortion and more semantically aligned with the model’s next-token behavior.

As an additional behavioral check, we evaluated the SAE\-patched proxies on three downstream tasks \(Appendix[D](https://arxiv.org/html/2606.18383#A4)\)\. The same layerwise pattern persists there: layers that are easier to certify also incur substantially smaller task\-performance degradation\. This does not validate the theorem directly, but it shows that the certification signal tracks practical behavioral fidelity beyond the bounded\-loss analysis\.

### 5.4 Semantic Specificity vs. Statistical Sparsity

To assess whether sparse statistics capture semantic structure rather than only raw sparsity counts, we perform a feature-shuffling ablation: for each input, we preserve the exact activation magnitudes and sparsity pattern ($k$) of the SAE code, but randomly permute the feature indices before decoding.
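One way to read this ablation is sketched below: each active magnitude is moved to a randomly chosen feature index, so the number of active features and their values are unchanged while the decoder directions they multiply are scrambled. The function name `shuffle_feature_indices`, the random decoder, and the choice to draw new indices from the whole dictionary are assumptions; the text does not specify the exact permutation scheme.

```python
import numpy as np


def shuffle_feature_indices(code, rng):
    """Reassign active magnitudes to random feature indices.

    The count of active features and their magnitudes are preserved;
    only the mapping from magnitude to feature index changes.
    """
    code = np.asarray(code, dtype=float)
    active = np.flatnonzero(code)
    new_idx = rng.choice(code.size, size=active.size, replace=False)
    shuffled = np.zeros_like(code)
    shuffled[new_idx] = code[active]
    return shuffled


rng = np.random.default_rng(0)
m, d, k = 512, 32, 8  # toy dictionary size, hidden size, sparsity
decoder = rng.normal(size=(d, m)) / np.sqrt(d)  # hypothetical SAE decoder

# Build a toy sparse code with exactly k active features.
code = np.zeros(m)
code[rng.choice(m, size=k, replace=False)] = rng.uniform(0.5, 2.0, size=k)

real = decoder @ code
ablated = decoder @ shuffle_feature_indices(code, rng)
print("active features:", np.count_nonzero(code))
print("reconstruction shift:", np.linalg.norm(real - ablated))
```

Because the sparsity level is identical before and after shuffling, any change in the decoded activation comes from which features are active, not from how many.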

Figure [4](https://arxiv.org/html/2606.18383#S5.F4) demonstrates that this purely semantic disruption catastrophically inflates the reconstruction gap across all three models. Because the combinatorial sparsity remains strictly identical under this intervention, the complexity term of our bound is completely unaffected. The resulting failure of the certificate is therefore driven exclusively by the empirical reconstruction gap. This confirms a crucial property of our framework: while the complexity term $P \log(em/P)$ rigorously bounds the size of the sparse candidate pool, the overall certificate remains highly sensitive to whether the chosen features are actually *semantically aligned* with the target distribution.

To evaluate the SAE proxy under distributional shift, Appendix [G](https://arxiv.org/html/2606.18383#A7) presents a synthetic token corruption study. Under random corruption, the asymptotic bound remains above the trivial baseline $\log_{2}(V/\alpha)$, rendering the certificate uninformative. Consequently, SAE-derived interpretations lack formal guarantees in this regime.

Thus, two proxies with identical sparsity can receive very different certificates, because the certificate depends not only on how many features are active, but also on whether the active feature directions carry the right predictive information\.

## 6 Conclusion

We presented a post\-hoc certification framework that upper\-bounds the risk of a frozen LM through an SAE\-induced sparse proxy\. The resulting decomposition is useful not only because it can become non\-vacuous in practical settings, but also because it localizes the sources of certification failure across proxy risk, reconstruction gap, concept\-pool mismatch, and complexity\. Empirically, certification is strongly layer\-dependent, and later layers in Llama\-3\-8B are substantially easier to certify because they preserve the base model’s behavior more faithfully\. We view this as a distribution\-dependent, risk\-level criterion for when an SAE\-based proxy can be used as a cautious interpretive lens, not as a complete notion of explanation quality\. More broadly, the framework provides a practical way to diagnose when sparse proxies succeed and when they should not be over\-interpreted\.

## References

- Arora et al\. \(2018\)Arora, S\., Ge, R\., Neyshabur, B\., and Zhang, Y\.Stronger generalization bounds for deep nets via a compression approach, 2018\.URL[https://arxiv\.org/abs/1802\.05296](https://arxiv.org/abs/1802.05296)\.
- Bisk et al\. \(2019\)Bisk, Y\., Zellers, R\., Bras, R\. L\., Gao, J\., and Choi, Y\.Piqa: Reasoning about physical commonsense in natural language, 2019\.URL[https://arxiv\.org/abs/1911\.11641](https://arxiv.org/abs/1911.11641)\.
- Blumer et al\. \(1987\)Blumer, A\., Ehrenfeucht, A\., Haussler, D\., and Warmuth, M\. K\.Occam’s razor\.*Information Processing Letters*, 24\(6\):377–380, 1987\.ISSN 0020\-0190\.doi:https://doi\.org/10\.1016/0020\-0190\(87\)90114\-1\.URL[https://www\.sciencedirect\.com/science/article/pii/0020019087901141](https://www.sciencedirect.com/science/article/pii/0020019087901141)\.
- Bricken et al\. \(2023\)Bricken, T\., Templeton, A\., Batson, J\., Chen, B\., Jermyn, A\., Conerly, T\., Turner, N\., Anil, C\., Denison, C\., Askell, A\., Lasenby, R\., Wu, Y\., Kravec, S\., Schiefer, N\., Maxwell, T\., Joseph, N\., Hatfield\-Dodds, Z\., Tamkin, A\., Nguyen, K\., McLean, B\., Burke, J\. E\., Hume, T\., Carter, S\., Henighan, T\., and Olah, C\.Towards monosemanticity: Decomposing language models with dictionary learning\.*Transformer Circuits Thread*, 2023\.https://transformer\-circuits\.pub/2023/monosemantic\-features/index\.html\.
- Cunningham et al\. \(2023\)Cunningham, H\., Ewart, A\., Riggs, L\., Huben, R\., and Sharkey, L\.Sparse autoencoders find highly interpretable features in language models, 2023\.URL[https://arxiv\.org/abs/2309\.08600](https://arxiv.org/abs/2309.08600)\.
- Dziugaite & Roy \(2017\)Dziugaite, G\. K\. and Roy, D\. M\.Computing nonvacuous generalization bounds for deep \(stochastic\) neural networks with many more parameters than training data, 2017\.URL[https://arxiv\.org/abs/1703\.11008](https://arxiv.org/abs/1703.11008)\.
- Elhage et al\. \(2022\)Elhage, N\., Hume, T\., Olsson, C\., Schiefer, N\., Henighan, T\., Kravec, S\., Hatfield\-Dodds, Z\., Lasenby, R\., Drain, D\., Chen, C\., Grosse, R\., McCandlish, S\., Kaplan, J\., Amodei, D\., Wattenberg, M\., and Olah, C\.Toy models of superposition, 2022\.URL[https://arxiv\.org/abs/2209\.10652](https://arxiv.org/abs/2209.10652)\.
- Grattafiori et al\. \(2024\)Grattafiori, A\., Dubey, A\., Jauhri, A\., Pandey, A\., Kadian, A\., et al\.The llama 3 herd of models, 2024\.URL[https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)\.
- Lotfi et al\. \(2024a\)Lotfi, S\., Finzi, M\., Kuang, Y\., Rudner, T\. G\. J\., Goldblum, M\., and Wilson, A\. G\.Non\-vacuous generalization bounds for large language models, 2024a\.URL[https://arxiv\.org/abs/2312\.17173](https://arxiv.org/abs/2312.17173)\.
- Lotfi et al\. \(2024b\)Lotfi, S\., Kuang, Y\., Amos, B\., Goldblum, M\., Finzi, M\., and Wilson, A\. G\.Unlocking tokens as data points for generalization bounds on larger language models, 2024b\.URL[https://arxiv\.org/abs/2407\.18158](https://arxiv.org/abs/2407.18158)\.
- Nagarajan & Kolter \(2021\)Nagarajan, V\. and Kolter, J\. Z\.Uniform convergence may be unable to explain generalization in deep learning, 2021\.URL[https://arxiv\.org/abs/1902\.04742](https://arxiv.org/abs/1902.04742)\.
- Radford et al\. \(2019\)Radford, A\., Wu, J\., Child, R\., Luan, D\., Amodei, D\., and Sutskever, I\.Language models are unsupervised multitask learners\.2019\.
- Rissanen \(1978\)Rissanen, J\.Modeling by shortest data description\.*Automatica*, 14\(5\):465–471, 1978\.ISSN 0005\-1098\.doi:https://doi\.org/10\.1016/0005\-1098\(78\)90005\-5\.URL[https://www\.sciencedirect\.com/science/article/pii/0005109878900055](https://www.sciencedirect.com/science/article/pii/0005109878900055)\.
- Sakaguchi et al\. \(2019\)Sakaguchi, K\., Bras, R\. L\., Bhagavatula, C\., and Choi, Y\.Winogrande: An adversarial winograd schema challenge at scale, 2019\.URL[https://arxiv\.org/abs/1907\.10641](https://arxiv.org/abs/1907.10641)\.
- Team et al\. \(2024\)Team, G\., Riviere, M\., Pathak, S\., Sessa, P\. G\., Hardin, C\., et al\.Gemma 2: Improving open language models at a practical size, 2024\.URL[https://arxiv\.org/abs/2408\.00118](https://arxiv.org/abs/2408.00118)\.
- Wang \(2025\)Wang, Z\.Logitlens4llms: Extending logit lens analysis to modern large language models, 2025\.URL[https://arxiv\.org/abs/2503\.11667](https://arxiv.org/abs/2503.11667)\.
- Zellers et al\. \(2019\)Zellers, R\., Holtzman, A\., Bisk, Y\., Farhadi, A\., and Choi, Y\.Hellaswag: Can a machine really finish your sentence?, 2019\.URL[https://arxiv\.org/abs/1905\.07830](https://arxiv.org/abs/1905.07830)\.
- Zhang et al\. \(2017\)Zhang, C\., Bengio, S\., Hardt, M\., Recht, B\., and Vinyals, O\.Understanding deep learning requires rethinking generalization, 2017\.URL[https://arxiv\.org/abs/1611\.03530](https://arxiv.org/abs/1611.03530)\.

## Appendix A Useful bound on $B$ and $\Delta$

Recall

$$B := \log_{2}\!\left(\frac{V}{\alpha}\right), \quad \Delta := \log_{2}\!\left(1 + \frac{(1-\alpha)V}{\alpha}\right).$$

Then

$$B - \Delta = \log_{2}\!\left(\frac{V/\alpha}{1 + (1-\alpha)V/\alpha}\right) = \log_{2}\!\left(\frac{V}{\alpha + (1-\alpha)V}\right). \tag{16}$$

Now,

$$V \geq \alpha + (1-\alpha)V \iff \alpha V \geq \alpha \iff V \geq 1.$$

Hence $B - \Delta \geq 0$, so $\Delta \leq B$.

Therefore, if $l_{A}, l_{B} \in [B-\Delta, B]$, then both quantities lie in an interval of length $\Delta$, which implies

$$|l_{A} - l_{B}| \leq \Delta.$$

Thus,

$$|l_{A} - l_{B}| \in [0, \Delta] \subseteq [0, B].$$
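The interval $[B-\Delta, B]$ can be checked numerically. The sketch below assumes the smoothed loss has the form $-\log_2\big((1-\alpha)\,p + \alpha/V\big)$, where $p$ is the model probability of the gold token; this form is an assumption chosen because its extremes at $p = 0$ and $p = 1$ reproduce $B$ and $B - \Delta$ exactly, and the GPT-2 vocabulary size is used only as an illustrative value of $V$.

```python
import math


def smoothed_loss_bits(p_gold, V, alpha):
    """Assumed smoothing: mix the model probability with a uniform distribution."""
    return -math.log2((1 - alpha) * p_gold + alpha / V)


V, alpha = 50_257, 0.5  # illustrative: GPT-2 vocabulary size, alpha = 0.5
B = math.log2(V / alpha)
Delta = math.log2(1 + (1 - alpha) * V / alpha)

worst = smoothed_loss_bits(0.0, V, alpha)  # gold token given zero probability
best = smoothed_loss_bits(1.0, V, alpha)   # gold token given probability one

print(f"B = {B:.4f}, Delta = {Delta:.4f}, B - Delta = {B - Delta:.6f}")
print(f"loss range under the assumed smoothing: [{best:.6f}, {worst:.4f}]")
```

Any two losses in this range differ by at most $\Delta$, which is in turn at most $B$, matching the inclusion derived above.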

## Appendix B Auxiliary lemmas

###### Lemma B.1 (Decomposition of risk via loss gap).

For any two predictors $f, g$,

$$R(f) \leq R(g) + \mathbb{E}_{z \sim \mathcal{D}}\big|\ell(f;z) - \ell(g;z)\big|. \tag{17}$$

In particular,

$$R(M) \leq R(S \circ M) + \epsilon_{\mathrm{loss}}. \tag{18}$$

###### Proof.

For every $z$, $\ell(f;z) \leq \ell(g;z) + |\ell(f;z) - \ell(g;z)|$. Take expectation over $z \sim \mathcal{D}$. ∎

###### Lemma B.2 (Pool mismatch bound).

Define

$$\epsilon_{\mathrm{pool}} := \mathbb{E}_{z \sim \mathcal{D}}\big|\ell(S \circ M; z) - \ell(h_{G^{\star}}; z)\big|. \tag{19}$$

Then

$$\epsilon_{\mathrm{pool}} \leq \eta B. \tag{20}$$

###### Proof.

Fix $x$. If $E_{G^{\star}}(x)$ holds, then $\mathrm{TopK}(a(x))$ has support contained in $G^{\star}$, hence $a_{G^{\star}}(x)$ agrees with $a(x)$ on all coordinates that survive $\mathrm{TopK}$ and is zero elsewhere. By determinism of $\mathrm{TopK}$ and the fixed tie-breaking, we obtain $c_{G^{\star}}(x) = c(x)$ and therefore $\widehat{h}_{G^{\star}}(x) = \widehat{h}(x)$. Thus, $(S \circ M)(x) = h_{G^{\star}}(x)$, implying $\ell(S \circ M; z) = \ell(h_{G^{\star}}; z)$ for all labels $y$. If $E_{G^{\star}}(x)$ fails, the absolute loss difference is at most $\Delta \leq B$, since both losses lie in $[B-\Delta, B]$ (Appendix A). Therefore, for all $z = (x, y)$,

$$\big|\ell(S \circ M; z) - \ell(h_{G^{\star}}; z)\big| \leq B \cdot \mathbf{1}\{E_{G^{\star}}(x)^{c}\}. \tag{21}$$

Taking expectation over $z \sim \mathcal{D}$ yields

$$\epsilon_{\mathrm{pool}} \leq B \Pr_{x \sim \mathcal{D}}\big(E_{G^{\star}}(x)^{c}\big) \leq \eta B. \tag{22}$$

∎

###### Lemma B.3 (Uniform convergence for finite classes (Occam bound)).

Let $\mathcal{H}$ be a finite set of predictors and assume $\ell(h;z) \in [0, B]$ for all $h \in \mathcal{H}$ and $z$. Then for any $\delta \in (0,1)$, with probability at least $1-\delta$ over $S \sim \mathcal{D}^{N}$,

$$\forall h \in \mathcal{H}: \quad R(h) \leq \widehat{R}_{S}(h) + B\sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2N}}. \tag{23}$$

###### Proof.

Fix $h \in \mathcal{H}$. By Hoeffding’s inequality applied to i.i.d. variables $\ell(h; z_{i}) \in [0, B]$,

$$\Pr\!\left(R(h) - \widehat{R}_{S}(h) > t\right) \leq \exp\!\left(-\frac{2Nt^{2}}{B^{2}}\right). \tag{24}$$

Apply a union bound over all $h \in \mathcal{H}$ and set the right-hand side to $\delta$ to solve for $t$. ∎
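For readers who want the last step of the proof of Lemma B.3 spelled out, the union bound over $\mathcal{H}$ combined with the per-hypothesis Hoeffding inequality (24) gives the following short derivation, which uses only the symbols already defined in the lemma:

```latex
\begin{align*}
\Pr\!\left(\exists h \in \mathcal{H}:\; R(h) - \widehat{R}_{S}(h) > t\right)
  &\leq |\mathcal{H}| \exp\!\left(-\frac{2Nt^{2}}{B^{2}}\right). \\
\intertext{Setting the right-hand side equal to $\delta$ and solving for $t$:}
\frac{2Nt^{2}}{B^{2}} &= \log|\mathcal{H}| + \log(1/\delta), \\
t &= B\sqrt{\frac{\log|\mathcal{H}| + \log(1/\delta)}{2N}}.
\end{align*}
```

Substituting this $t$ back shows that, with probability at least $1-\delta$, no $h \in \mathcal{H}$ violates the inequality in (23).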

###### Lemma B.4 (Counting pools).

For $\mathcal{H}_{P}$ defined above, $|\mathcal{H}_{P}| = \binom{m}{P}$ and

$$\log|\mathcal{H}_{P}| = \log\binom{m}{P} \leq P \log\!\left(\frac{em}{P}\right). \tag{25}$$

###### Proof.

The cardinality is the number of $P$-subsets of $[m]$. The inequality uses $\binom{m}{P} \leq (em/P)^{P}$. ∎
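The inequality in Lemma B.4 can be checked numerically. The sketch below evaluates the exact log-binomial with `math.lgamma` and compares it with $P \log(em/P)$; the dictionary size `m = 131_072` is a hypothetical value chosen for illustration (it is not stated in this section), and the pool sizes are a mix of small values and two entries from Table 2.

```python
import math


def log_binom(m, P):
    """Natural log of the binomial coefficient C(m, P), via log-gamma."""
    return math.lgamma(m + 1) - math.lgamma(P + 1) - math.lgamma(m - P + 1)


def occam_complexity(m, P):
    """Upper bound P * log(e * m / P) from Lemma B.4."""
    return P * math.log(math.e * m / P)


m = 131_072  # hypothetical SAE dictionary size, for illustration only
pool_sizes = (64, 1_000, 72_747, 130_701)

for P in pool_sizes:
    exact = log_binom(m, P)
    bound = occam_complexity(m, P)
    print(f"P = {P:>7,}: log C(m, P) = {exact:12.1f} <= {bound:12.1f}")
```

Working in log space avoids overflow: the binomial coefficients themselves are far too large to represent as floating-point numbers at these sizes.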

###### Lemma B.5 (Concentration of $\epsilon_{\mathrm{loss}}$).

For any $\delta \in (0,1)$, with probability at least $1-\delta$,

$$\epsilon_{\mathrm{loss}} \leq \widehat{\epsilon}_{\mathrm{loss}} + B\sqrt{\frac{\log(2/\delta)}{2N}}. \tag{26}$$

###### Proof.

The variables $\Delta_{\mathrm{loss}}(z_{i}) \in [0, B]$ are i.i.d. and Hoeffding applies. ∎

###### Theorem B\.1\(Generalization bound via compression \(Occam\) under a concept pool\)\.

Letδ∈\(0,1\)\\delta\\in\(0,1\)and defineδ1=δ2=δ/2\\delta\_\{1\}=\\delta\_\{2\}=\\delta/2\. Then with probability at least1−δ1\-\\deltaoverS∼𝒟NS\\sim\\mathcal\{D\}^\{N\},

R\(M\)≤R^S\(hG⋆\)\+ϵ^loss\+ηB\\displaystyle R\(M\)\\leq\\widehat\{R\}\_\{S\}\(h\_\{G^\{\\star\}\}\)\+\\widehat\{\\epsilon\}\_\{\\mathrm\{loss\}\}\+\\eta B\(27\)\+Blog⁡\|ℋP\|\+log⁡\(2/δ\)2N\+Blog⁡\(4/δ\)2N\\displaystyle\+B\\sqrt\{\\frac\{\\log\|\\mathcal\{H\}\_\{P\}\|\+\\log\(2/\\delta\)\}\{2N\}\}\+B\\sqrt\{\\frac\{\\log\(4/\\delta\)\}\{2N\}\}

###### Proof\.

By Lemma [B.1](https://arxiv.org/html/2606.18383#A2.Thmtheorem1), $R(M) \leq R(S \circ M) + \epsilon_{\mathrm{loss}}$. By Lemma [B.2](https://arxiv.org/html/2606.18383#A2.Thmtheorem2), $R(S \circ M) \leq R(h_{G^\star}) + \epsilon_{\mathrm{pool}} \leq R(h_{G^\star}) + \eta B$. Apply Lemma [B.3](https://arxiv.org/html/2606.18383#A2.Thmtheorem3) to $\mathcal{H}_P$ with confidence $\delta_2$ and evaluate at $h_{G^\star} \in \mathcal{H}_P$:

$$R(h_{G^\star}) \leq \widehat{R}_S(h_{G^\star}) + B\sqrt{\frac{\log|\mathcal{H}_P| + \log(1/\delta_2)}{2N}} \tag{28}$$

Apply Lemma [B.5](https://arxiv.org/html/2606.18383#A2.Thmtheorem5) with confidence $\delta_1$:

$$\epsilon_{\mathrm{loss}} \leq \widehat{\epsilon}_{\mathrm{loss}} + B\sqrt{\frac{\log(2/\delta_1)}{2N}} \tag{29}$$

Substituting $\delta_1 = \delta_2 = \delta/2$ gives $\log(1/\delta_2) = \log(2/\delta)$ and $\log(2/\delta_1) = \log(4/\delta)$. Combine the inequalities and use a union bound to obtain ([27](https://arxiv.org/html/2606.18383#A2.E27)).

∎
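To make the right-hand side of Eq. (27) concrete, the following sketch evaluates it from its ingredients. It is an illustration under stated assumptions: the argument names are hypothetical, and `log_class_size` stands for $\log|\mathcal{H}_P|$ (or any upper bound on it, such as $P\log(em/P)$).

```python
import math

def occam_bound(emp_risk: float, eps_loss_hat: float, eta: float, B: float,
                log_class_size: float, N: int, delta: float) -> float:
    """Right-hand side of Eq. (27) for a finite proxy class."""
    # Occam term for the finite class, applied with confidence delta/2.
    complexity = B * math.sqrt((log_class_size + math.log(2.0 / delta)) / (2.0 * N))
    # Hoeffding term for the reconstruction gap, applied with confidence delta/2.
    recon_conc = B * math.sqrt(math.log(4.0 / delta) / (2.0 * N))
    return emp_risk + eps_loss_hat + eta * B + complexity + recon_conc
```

Both square-root terms vanish as $N$ grows, so the bound approaches $\widehat{R}_S(h_{G^\star}) + \widehat{\epsilon}_{\mathrm{loss}} + \eta B$ in the large-sample limit.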

## Appendix C Experimental Setup

We evaluate three pretrained language models: GPT-2 Small, Gemma-2B, and Llama-3-8B. For each model, we insert a pretrained SAE at a fixed internal layer, replace the native hidden activation with its SAE reconstruction, and continue the forward pass through the unchanged downstream blocks. Unless otherwise noted, all experiments use English text from C4, tokenized into contiguous sequences of length 32. We construct the empirical concept pool $\hat{G}$ from a calibration stream that is disjoint from the evaluation stream, so that the pool is fixed before the certificate is measured on held-out data. In our implementation, the calibration pass uses 2.24M tokens, while the evaluation curves are obtained by varying the evaluation sample size $N$ on a separate stream.

Table 3: Global experimental settings shared across all model families.

| Item | Setting |
|---|---|
| Text source | C4 (English) |
| Sequence construction | Contiguous token sequences |
| Sequence length | 32 tokens |
| Calibration stream | Disjoint from evaluation stream |
| Calibration budget | 2.24M tokens |
| Evaluation sample size $N$ | Varied to generate certificate curves |
| Sparse coding rule | Top-$k$ masking of SAE pre-activations |
| Default sparsity | $k=64$ |
| Loss | Smoothed bits-per-dimension |
| Smoothing parameter | $\alpha=0.5$ |
| Bounded-loss constant | $B=\log(V/\alpha)$ |
| Confidence level | $\delta=0.05$ |
| Pool construction | Using Eqn. [14](https://arxiv.org/html/2606.18383#S4.E14) |
| Pool-restricted predictor | Encode → mask outside $\hat{G}$ → Top-$k$ → decode → resume downstream forward pass |

Across all models, we use the same sparse reconstruction protocol. Given the SAE pre-activation vector, we retain the Top-$k$ coordinates with largest magnitude and set the rest to zero, with $k=64$ unless stated otherwise. The loss is the smoothed bits-per-dimension objective with smoothing parameter $\alpha=0.5$, yielding bounded-loss constant $B=\log(V/\alpha)$. We report certificates at confidence level $\delta=0.05$. Operationally, the pool-restricted predictor is constructed by running the base model up to the probed layer, encoding the hidden state with the SAE, masking all features outside $G^{*}$, applying Top-$k$, decoding back to activation space, and resuming the forward pass with the reconstructed hidden state.
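The mask-then-Top-$k$-then-decode step of the pool-restricted predictor can be sketched as follows. This is a toy illustration in plain Python, not the authors' implementation: the linear decoder with a bias, the function names, and the example values are all assumptions.

```python
def pool_restricted_topk(pre_acts, pool, k):
    """Mask features outside `pool`, then keep the k largest-magnitude survivors."""
    # Only in-pool coordinates are eligible (this is the "mask outside the pool" step).
    candidates = [j for j in range(len(pre_acts)) if j in pool]
    # Top-k by magnitude among the eligible coordinates.
    keep = set(sorted(candidates, key=lambda j: abs(pre_acts[j]), reverse=True)[:k])
    return [pre_acts[j] if j in keep else 0.0 for j in range(len(pre_acts))]

def decode(code, decoder_rows, bias):
    """Assumed linear SAE decoder: bias + sum_j code[j] * decoder_rows[j]."""
    out = list(bias)
    for a, row in zip(code, decoder_rows):
        if a != 0.0:
            for i, w in enumerate(row):
                out[i] += a * w
    return out

# Hypothetical 5-feature SAE over a 2-dimensional activation space.
pre_acts = [0.2, -3.0, 1.5, 0.9, -0.1]
pool = {0, 2, 3}          # feature 1 has the largest magnitude but is outside the pool
code = pool_restricted_topk(pre_acts, pool, k=2)
decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
recon = decode(code, decoder_rows, bias=[0.5, 0.5])
```

In this toy case the strongest feature is dropped because it lies outside the pool, which is exactly the situation that the pool-mismatch term is meant to account for.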

The global settings shared across all experiments are summarized in Table [3](https://arxiv.org/html/2606.18383#A3.T3), and the model-specific choices are summarized in Table [4](https://arxiv.org/html/2606.18383#A3.T4). For the main certificate curves, we use one primary publicly available SAE checkpoint per model family. The layerwise sweeps in Section [5.2](https://arxiv.org/html/2606.18383#S5.SS2) and Appendix [E](https://arxiv.org/html/2606.18383#A5) use additional layer-specific checkpoints where needed.

Table 4: Model-specific setup. Listed checkpoints are the primary SAEs used in our experiments.

| Model | SAE Release | Hookpoint |
|---|---|---|
| GPT-2 Small | [gpt2-small-res-jb](https://huggingface.co/jbloom/GPT2-Small-SAEs) | L6:resid_pre |
| Gemma-2B | [gemma-2b-res-jb](https://huggingface.co/jbloom/Gemma-2b-Residual-Stream-SAEs) | L12:resid_post |
| Llama-3-8B | [sae-llama-...32x](https://huggingface.co/EleutherAI/sae-llama-3-8b-32x) | L30:resid_post |

## Appendix D Downstream Task Preservation

Table 5: Zero-shot downstream-task preservation under sparse semantic proxying. We report the accuracy of the base model $M$, the pool-restricted proxy $h_G$, and the drop $\Delta\mathrm{Acc} = \mathrm{Acc}(M) - \mathrm{Acc}(h_G)$ on WinoGrande, PIQA, and HellaSwag. Later Llama layers preserve downstream behavior substantially better than earlier layers.

| Model | Layer | $P$ | WinoGrande $\mathrm{Acc}(M)$ | WinoGrande $\mathrm{Acc}(h_G)$ | WinoGrande $\Delta\mathrm{Acc}$ | PIQA $\mathrm{Acc}(M)$ | PIQA $\mathrm{Acc}(h_G)$ | PIQA $\Delta\mathrm{Acc}$ | HellaSwag $\mathrm{Acc}(M)$ | HellaSwag $\mathrm{Acc}(h_G)$ | HellaSwag $\Delta\mathrm{Acc}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-2 Small | 6 | 24,510 | 0.510 | 0.517 | -0.007 | 0.614 | 0.604 | 0.010 | 0.343 | 0.333 | 0.010 |
| Gemma-2B | 12 | 16,078 | 0.619 | 0.555 | 0.064 | 0.776 | 0.710 | 0.066 | 0.547 | 0.455 | 0.092 |
| Llama-3-8B | 4 | 72,747 | 0.705 | 0.503 | 0.202 | 0.804 | 0.487 | 0.317 | 0.634 | 0.273 | 0.361 |
| Llama-3-8B | 8 | 84,112 | 0.705 | 0.509 | 0.196 | 0.804 | 0.486 | 0.318 | 0.634 | 0.252 | 0.382 |
| Llama-3-8B | 12 | 88,638 | 0.705 | 0.526 | 0.179 | 0.804 | 0.502 | 0.302 | 0.634 | 0.306 | 0.328 |
| Llama-3-8B | 16 | 88,661 | 0.705 | 0.561 | 0.144 | 0.804 | 0.597 | 0.207 | 0.634 | 0.403 | 0.231 |
| Llama-3-8B | 20 | 109,638 | 0.705 | 0.560 | 0.145 | 0.804 | 0.635 | 0.169 | 0.634 | 0.441 | 0.193 |
| Llama-3-8B | 24 | 119,717 | 0.705 | 0.646 | 0.059 | 0.804 | 0.736 | 0.068 | 0.634 | 0.541 | 0.093 |
| Llama-3-8B | 28 | 126,357 | 0.705 | 0.592 | 0.113 | 0.804 | 0.762 | 0.042 | 0.634 | 0.562 | 0.072 |
| Llama-3-8B | 30 | 130,701 | 0.705 | 0.619 | 0.086 | 0.804 | 0.762 | 0.042 | 0.634 | 0.582 | 0.052 |

Table [5](https://arxiv.org/html/2606.18383#A4.T5) examines whether the sparse semantic proxy also preserves the *practical* task behavior of the frozen model on zero-shot downstream benchmarks. These are commonsense understanding tasks, namely: i) WinoGrande (Sakaguchi et al., [2019](https://arxiv.org/html/2606.18383#bib.bib14)), ii) PIQA (Bisk et al., [2019](https://arxiv.org/html/2606.18383#bib.bib2)), and iii) HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.18383#bib.bib17)). We use the full test sets of the respective datasets.

The same layerwise pattern predicted by the certificate is clearly visible here: as we move to deeper layers, the proxy becomes behaviorally closer to the original model, and the task-level degradation decreases accordingly. In particular, at early Llama layers, where the certificate is vacuous, the proxy accuracy is often close to random guessing, whereas at later layers, where the certificate becomes non-vacuous and the certified risk decreases, downstream performance also improves substantially. For example, on HellaSwag, the accuracy drop decreases from $0.361$ at layer 4 to $0.052$ at layer 30; on PIQA, it decreases from $0.317$ to $0.042$; and on WinoGrande, it decreases from $0.202$ to $0.086$. Thus, the improvement with depth suggested by the certificate is reflected not only in the bounded-loss analysis but also in downstream task accuracy, although the trend is not strictly monotonic at every layer (for example, the WinoGrande drop rises from $0.059$ at layer 24 to $0.113$ at layer 28). We emphasize that this experiment is not a proof of the theorem, but an additional behavioral check showing that the practical performance of the sparse proxy is consistent with the certificate.

## Appendix E GPT-2 Small layerwise certificate analysis

Table 6: Layer-wise output fidelity on Llama-3-8B. Fidelity improves substantially with depth.

| Layer | $P$ | $\mathrm{KL}(M \parallel S \circ M)$ | Top-1 Agree. $(M, S \circ M)$ | $\lvert \Delta \log p_{\text{gold}} \rvert$ | Loss $(M)$ | Loss $(S \circ M)$ |
|---|---|---|---|---|---|---|
| 4 | 72,747 | 8.353 | 0.008 | 8.372 | 5.289 | 15.718 |
| 8 | 84,112 | 7.723 | 0.000 | 7.678 | 5.289 | 15.150 |
| 12 | 88,638 | 5.335 | 0.147 | 5.318 | 5.289 | 12.134 |
| 16 | 88,661 | 4.111 | 0.181 | 4.193 | 5.289 | 10.723 |
| 20 | 109,638 | 2.675 | 0.363 | 2.819 | 5.288 | 8.731 |
| 24 | 119,717 | 2.044 | 0.465 | 2.282 | 5.288 | 7.902 |
| 28 | 126,357 | 2.115 | 0.496 | 2.414 | 5.288 | 7.843 |
| 30 | 130,701 | 0.766 | 0.670 | 0.970 | 5.288 | 6.207 |

For completeness, we also performed a layerwise certificate sweep for GPT-2 Small across layers $\{0,2,4,6,8,10\}$, shown in Figure [5](https://arxiv.org/html/2606.18383#A5.F5). In contrast to the strong late-layer trend observed for Llama-3-8B in the main text, GPT-2 exhibits only weak layer sensitivity: the bound curves remain tightly clustered across depth and all reach non-vacuity at broadly similar sample scales.

![Refer to caption](https://arxiv.org/html/2606.18383v1/x5.png)

Figure 5: Layerwise analysis for GPT-2 Small.

We therefore do not treat GPT-2 as a second depth-dependent case study; instead, these results justify the use of layer 6 in the main experiments as a representative midpoint rather than as a specially optimized choice.

## Appendix F Top-k sensitivity

![Refer to caption](https://arxiv.org/html/2606.18383v1/x6.png)

Figure 6: Sensitivity of the sparse semantic generalization certificate to the Top-$k$ sparsity level at fixed representative layers: GPT-2 Small (layer 6), Gemma-2B (layer 12), and Llama-3-8B (layer 30). Across $k \in \{16,32,64,128\}$, the qualitative ordering remains unchanged, while the sample size required for non-vacuity decreases as $k$ increases, most notably for Llama-3-8B.

Figure [6](https://arxiv.org/html/2606.18383#A6.F6) shows that the qualitative behavior of the certificate is stable across a broad range of sparsity levels. In all three settings ($k \in \{16,32,128\}$; note that $k=64$ is already studied in the main paper), the same ordering is preserved: Gemma-2B reaches non-vacuity first, GPT-2 Small next, and Llama-3-8B last. Quantitatively, the required sample size for non-vacuity shifts smoothly with $k$: for $k=16$, the crossing points are approximately $N \approx 28{,}229$ (Gemma-2B), $59{,}765$ (GPT-2 Small), and $1{,}363{,}037$ (Llama-3-8B); for $k=32$, they become $25{,}717$, $55{,}112$, and $373{,}629$; and for $k=128$, they further reduce to $24{,}909$, $54{,}072$, and $209{,}357$, respectively. Thus, increasing $k$ in this range does not alter the qualitative conclusion or the relative ranking of models, but it generally makes non-vacuity easier to attain, especially for Llama-3-8B. This suggests that the main phenomenon is robust to the sparsity threshold: changing $k$ primarily shifts the quantitative sample complexity of certification rather than the underlying trend itself.
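One way to reason about such crossing points is to solve the empirical certificate of Eq. (30) in Appendix H for $N$. The sketch below does this under a simplifying assumption that is not made in the paper: the empirical terms $\widehat{R}_S(h_G) + \widehat{\epsilon}_{\mathrm{loss}} + B\hat{\eta}$ are treated as a constant `empirical_terms` that does not change with $N$. All function names and inputs are hypothetical.

```python
import math

def concentration_constant(B, P, m, delta):
    """C such that the three square-root terms of Eq. (30) sum to C / sqrt(N)."""
    return B * (
        math.sqrt(math.log(2.0 / delta) / 2.0)
        + math.sqrt((P * math.log(math.e * m / P) + math.log(2.0 / delta)) / 2.0)
        + math.sqrt(math.log(4.0 / delta) / 2.0)
    )

def bound_at(N, empirical_terms, B, P, m, delta):
    """Value of the certificate at sample size N, with the empirical terms held fixed."""
    return empirical_terms + concentration_constant(B, P, m, delta) / math.sqrt(N)

def crossing_sample_size(empirical_terms, baseline, B, P, m, delta):
    """Smallest integer N at which the bound drops strictly below `baseline`.

    Returns None when the empirical terms alone already reach the baseline,
    in which case no sample size makes the certificate non-vacuous.
    """
    gap = baseline - empirical_terms
    if gap <= 0.0:
        return None
    c = concentration_constant(B, P, m, delta)
    # empirical_terms + c / sqrt(N) < baseline  <=>  N > (c / gap) ** 2
    return math.floor((c / gap) ** 2) + 1
```

Under this simplification the crossing point scales as $(C/\text{gap})^2$, so a larger pool size $P$ or a smaller margin between the empirical terms and the baseline both push the required $N$ up quadratically.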

## Appendix G Behavior of the Certificate under Synthetic Input Corruption

To examine how the certificate behaves when inputs depart from structured natural language, we perform a synthetic corruption study in which increasing fractions of tokens are replaced at random\. We report the resulting decomposition of the certificate into empirical risk, reconstruction gap, pool mismatch, and complexity\. The goal of this analysis is to understand which terms of the certificate degrade when the sparse proxy is evaluated on increasingly corrupted inputs\.
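A corruption step of this kind can be sketched as below. The source only states that a fraction of tokens is replaced at random; sampling replacements uniformly from the vocabulary, the function name, and the seeding scheme are assumptions made for illustration.

```python
import random

def corrupt_tokens(token_ids, fraction, vocab_size, seed=0):
    """Replace a `fraction` of positions with token ids drawn uniformly at random.

    A replacement may coincide with the original token by chance, so the
    realized corruption rate can be slightly below `fraction`.
    """
    rng = random.Random(seed)
    n_corrupt = round(fraction * len(token_ids))
    # Choose distinct positions to overwrite.
    positions = rng.sample(range(len(token_ids)), n_corrupt)
    corrupted = list(token_ids)
    for pos in positions:
        corrupted[pos] = rng.randrange(vocab_size)
    return corrupted
```

With `fraction=0.0` the sequence is returned unchanged, and with `fraction=1.0` every position is resampled, matching the two extremes of the corruption sweep.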

![Refer to caption](https://arxiv.org/html/2606.18383v1/x7.png)

Figure 7: Decomposition of the certificate under synthetic corruption. We visualize the contributions of empirical risk, reconstruction gap, pool mismatch, and complexity to the total bound (Top-$k$, $k=64$). On uncorrupted text, the bound lies below the uninformed baseline (dashed line), whereas increasing random token corruption progressively weakens the certificate (15% corruption and 100% corruption). The degradation is driven primarily by empirical risk rather than by the complexity term.

Figure [7](https://arxiv.org/html/2606.18383#A7.F7) shows how the certificate changes for GPT-2 Small, Gemma-2B, and Llama-3-8B as random corruption is increased. On uncorrupted English text, the total bound is well below the uninformed baseline, indicating that the sparse proxy remains informative in this regime. As the corruption level increases, the bound rises substantially and eventually becomes non-informative relative to the same baseline.

The decomposition helps localize the source of this degradation\. The dominant change comes from the empirical risk term, while the complexity contribution remains essentially unchanged\. This indicates that failure under corruption is not due to a looser combinatorial complexity penalty, but to the sparse proxy becoming less predictive on inputs that no longer resemble the structured text on which the SAE was trained\.

## Appendix H Operational meaning of trust

We use *trust* in a minimal risk-faithful sense: the sparse SAE view should (i) certify that the underlying frozen model is non-trivial relative to an uninformed baseline, thus useful, and (ii) remain close to that model in population risk, thus faithful. Let the empirical bound be defined as follows. For any $P$ and corresponding $G$,

$$
\begin{aligned}
U_{\delta}(G) :={} & \widehat{R}_{\mathcal{S}}(h_G) + \widehat{\epsilon}_{\mathrm{loss}} + B\!\left(\hat{\eta} + \sqrt{\frac{\log(2/\delta)}{2N}}\right) \\
& + B\sqrt{\frac{P\log(em/P) + \log(2/\delta)}{2N}} + B\sqrt{\frac{\log(4/\delta)}{2N}}
\end{aligned}
\tag{30}
$$

so that, with probability at least $1-\delta$,

$$R(M) \leq U_{\delta}(G).$$

Hence, if $U_{\delta}(G) < R_{\mathrm{uninf}}$, meaning the bound is better than the random baseline, then

$$R(M) < R_{\mathrm{uninf}},$$

meaning that, through the bound, the sparse view certifies that the underlying frozen model is non-trivial relative to the uninformed baseline.

To quantify faithfulness, we bound the risk discrepancy between the frozen model and the pool-restricted sparse proxy. By Lemma 1,

$$|R(M) - R(S \circ M)| \leq \epsilon_{\mathrm{loss}},$$

and by Lemma 1 together with Lemma 2,

$$|R(S \circ M) - R(h_{G^\star})| \leq \eta B.$$

Therefore, by the triangle inequality,

$$|R(M) - R(h_{G^\star})| \leq \epsilon_{\mathrm{loss}} + \eta B.$$

This yields the following notion of explanatory trust.

#### Definition (risk-faithful explanatory trust).

Fix $\gamma > 0$ and $\tau > 0$. We say that the SAE-based sparse lens is $(\gamma,\tau)$-trustworthy on distribution $D$ if, with high probability,

$$
R(M) \leq R_{\mathrm{uninf}} - \gamma \quad (\text{Usefulness}) \qquad \text{and} \qquad |R(M) - R(h_{G^\star})| \leq \tau \quad (\text{Faithfulness}) \tag{31}
$$

The first condition captures usefulness; the second captures faithfulness. In our setting, a sufficient choice is

$$\gamma := R_{\mathrm{uninf}} - U_{\delta}(\hat{G}), \qquad \tau := \epsilon_{\mathrm{loss}} + \eta B.$$

Thus, trust means that the sparse proxy is simultaneously informative about the frozen model and behaviorally close to it.
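The two margins in the definition can be computed directly once the certificate value is known. The sketch below is illustrative only: the function names and the tolerance `tau_max` are hypothetical, and since $\epsilon_{\mathrm{loss}}$ and $\eta$ are population quantities, in practice one would presumably plug in their high-probability upper bounds rather than exact values.

```python
def trust_margins(u_bound, r_uninf, eps_loss, eta, B):
    """Return (gamma, tau) for the sufficient choice stated in the definition.

    gamma = R_uninf - U_delta(G_hat)   (usefulness margin)
    tau   = eps_loss + eta * B         (faithfulness radius)
    """
    gamma = r_uninf - u_bound
    tau = eps_loss + eta * B
    return gamma, tau

def is_trustworthy(gamma, tau, tau_max):
    """Check usefulness (gamma > 0) and faithfulness within a chosen tolerance."""
    return gamma > 0.0 and tau <= tau_max
```

A non-positive `gamma` corresponds to a vacuous certificate, while a large `tau` signals that the proxy may be informative yet too far from the frozen model to be read as an explanation of it.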

## Appendix I Qualitative case study

Qualitative case study: Later-layer SAE concepts

Prompt A: Deductive inheritance. Prompt: Premises: Every violinist is a musician. No musician is completely silent on stage. Mira is a violinist. Therefore, Mira is not completely

| Layer | Gold | Base | Proxy | KL | Top 3 SAE concepts (feature, effect, activation $a$: logit-lens tokens) |
|---|---|---|---|---|---|
| 12 | ' silent' | ' silent' | ' !' | 8.12 | F3807, supports, $a=0.59$: isses, /full, ARRIER; F99917, supports, $a=0.34$: stics, references, sip; F65022, supports, $a=0.40$: anymore, necessarily, yet |
| 24 | ' silent' | ' silent' | ' silent' | 0.01 | F36987, supports, $a=3.08$: silent, silence, quiet; F124438, supports, $a=1.38$: stateless, ainless, odor; F71772, supports, $a=2.33$: itude, mente, strangers |

Prompt B: Domain constraint. Prompt: Premises: Every enzyme in this tray is a protein. No protein in this tray is insoluble. Catalase is an enzyme in this tray. Therefore, Catalase is not

| Layer | Gold | Base | Proxy | KL | Top 3 SAE concepts (feature, effect, activation $a$: logit-lens tokens) |
|---|---|---|---|---|---|
| 12 | ' insol' | ' insol' | ' !' | 7.61 | F106873, supports, $a=0.53$: must, adol, eroon; F114367, supports, $a=0.63$: oriously, ched, withstanding; F65022, supports, $a=0.40$: anymore, necessarily, yet |
| 24 | ' insol' | ' insol' | ' a' | 1.47 | F20271, supports, $a=1.05$: sol, dissolve, soluble; F103948, supports, $a=1.32$: suspended, traces, content; F119119, supports, $a=0.60$: active, \_active, \_\_active |

Figure 8: Qualitative comparison of early and late SAE layers on two deduction prompts. We show (i) the prompt, (ii) the base–proxy agreement summary through KL, and (iii) the top active SAE concepts verbalized using feature-level logit-lens tokens. Lower KL and sharper token verbalizations at later layers indicate a more faithful sparse proxy. Note that all gold/base/proxy next-token strings are tokenizer subword pieces rather than necessarily whole words.

We complement the dataset-level certification results with a qualitative check of whether tighter late-layer certificates are accompanied by more semantically aligned SAE features. For a prompt $x$, we compare the base model $M$ with an SAE-patched proxy $S \circ M$, obtained by replacing the residual activation $h_{\ell}(x)$ at layer $\ell$ with its reconstruction $\hat{h}_{\ell}(x)$ and executing the remaining forward pass. We measure proxy drift using the next-token divergence $\mathrm{KL}\!\left(p_{M}(\cdot \mid x) \,\|\, p_{S \circ M}(\cdot \mid x)\right)$, where lower values indicate that the sparse reconstruction better preserves the model's predictive behavior. To inspect the active SAE features, we verbalize each feature in vocabulary space using a feature-level logit lens (Wang, [2025](https://arxiv.org/html/2606.18383#bib.bib16)). Let the SAE decoder be written as a dictionary of feature directions, so that each active feature $j$ contributes a decoder vector $d_j \in \mathbb{R}^{d_{\mathrm{model}}}$ to the reconstructed residual stream. Given the LM unembedding matrix $W_U \in \mathbb{R}^{d_{\mathrm{model}} \times |\mathcal{V}|}$, we score vocabulary items by $s_j = d_j W_U$ and report the top tokens as lexical hints for feature $j$.
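The two quantities used in this case study, the next-token KL divergence and the feature-level logit-lens scores $s_j = d_j W_U$, can be sketched on toy inputs as follows. Everything here is illustrative: the tiny vocabulary, the matrices, the use of nats, and the `eps` clamp on the proxy probabilities are assumptions, not details from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    Terms with p_i = 0 contribute zero; q_i is clamped at `eps` (an assumption)
    to avoid division by zero.
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def logit_lens_top_tokens(d_j, W_U, vocab, top_n=3):
    """Score every vocabulary item by s_j = d_j W_U and return the top_n tokens."""
    scores = [
        sum(d_j[i] * W_U[i][v] for i in range(len(d_j)))
        for v in range(len(vocab))
    ]
    order = sorted(range(len(vocab)), key=lambda v: scores[v], reverse=True)
    return [vocab[v] for v in order[:top_n]]

# Toy example: d_model = 2 and a 3-token vocabulary (hypothetical values).
vocab = [" silent", " loud", " quiet"]
W_U = [[2.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
d_j = [1.0, 0.5]
hints = logit_lens_top_tokens(d_j, W_U, vocab, top_n=2)
drift = kl_divergence([1.0, 0.0], [0.5, 0.5])  # base is certain, proxy is split evenly
```

On real models `p` and `q` would be the softmax outputs of $M$ and $S \circ M$ at the final prompt position, and `d_j` a row of the SAE decoder.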

Figure [8](https://arxiv.org/html/2606.18383#A9.F8) shows two deduction prompts comparing an early layer (Layer 12) and a late layer (Layer 24). At Layer 12, the proxy exhibits high KL divergence ($>7.5$), incorrect next-token predictions (e.g., predicting ‘!’ instead of ‘silent’ or ‘insol’), and active features whose logit-lens verbalizations (e.g., ‘isses’, ‘must’) are weakly related to the context. At Layer 24, the proxy is much closer to the base model: KL drops sharply, the proxy approaches or matches the target next-token predictions, and the top active features become contextually aligned, verbalizing relevant completions such as ‘silent, silence, quiet’ and ‘sol, dissolve, soluble’. Thus, the layers that yield tighter certificates also produce sparse proxies whose outputs and feature verbalizations better match the base model’s next-token behavior. As an additional behavioral check, Appendix [D](https://arxiv.org/html/2606.18383#A4) shows that the same layerwise pattern holds on three downstream tasks: layers that are easier to certify also incur smaller task-performance degradation, suggesting that the certificate tracks practical behavioral fidelity beyond the bounded-loss analysis.

## Appendix J Why Perfect Reconstruction Does Not Trivialize the Certificate

A natural concern is that the proposed certificate may simply reward an SAE for acting as a perfect copy machine\. In particular, if the SAE reconstruction exactly reproduces the hidden activation of the frozen LM, then the reconstruction\-gap term should vanish\. Does this imply that the resulting certificate automatically becomes tight, independently of whether the learned sparse features are reusable or meaningful?

We show that the answer is no\. Perfect reconstruction of the unrestricted SAE proxy removes only one term in the bound\. The certificate remains sensitive to whether the sparse support generalizes from calibration to evaluation, and to whether the reconstruction is achieved through a compact, reusable concept pool\. Thus, the bound is not merely a reconstruction score; it is a compression–fidelity certificate\.

Recall the empirical certificate:

$$
\begin{aligned}
R(M) \leq{} & \widehat{R}_S(h_{G^{\ast}}) + \widehat{\epsilon}_{\mathrm{loss}} + B\left(\hat{\eta} + \sqrt{\frac{\log(2/\delta)}{2N}}\right) \\
& + B\sqrt{\frac{P\log(em/P) + \log(2/\delta)}{2N}} + B\sqrt{\frac{\log(4/\delta)}{2N}}
\end{aligned}
\tag{32}
$$

Here $\widehat{\epsilon}_{\mathrm{loss}}$ measures the loss-level discrepancy between the base model $M$ and the unrestricted SAE proxy $S \circ M$, while $\hat{\eta}$ measures the probability that the evaluation-time Top-$k$ support is not covered by the calibration-induced concept pool $G^{\ast}$. The term $P = |G^{\ast}|$ controls the size of the finite proxy class.

#### Perfect unrestricted reconstruction removes only one term.

Suppose that the SAE acts as a perfect unrestricted copy of the base activation on the evaluation distribution, so that

$$S \circ M = M \qquad \text{and hence} \qquad \epsilon_{\mathrm{loss}} = 0.$$

This eliminates the reconstruction-gap term. However, the certified predictor in the bound is not the unrestricted proxy $S \circ M$ but the pool-restricted proxy $h_{G^{\ast}}$, obtained by masking all SAE features outside the calibration-induced pool $G^{\ast}$. Therefore, even if $S \circ M$ reconstructs perfectly, the certificate can still be loose if the active supports required at evaluation time are not contained in $G^{\ast}$.

Formally, let

$$A(x) := \operatorname{supp}\!\left(\operatorname{TopK}(S_E(M(x)))\right)$$

denote the active Top-$k$ SAE support for input $x$, and define the calibration-induced concept pool

$$G^{\ast} := \bigcup_{x \in D_{\mathrm{cal}}} A(x).$$

The pool-mismatch probability is

$$\eta := \Pr_{x \sim \mathcal{D}}\!\left[A(x) \nsubseteq G^{\ast}\right].$$

This term is independent of the pointwise reconstruction quality of the unrestricted SAE proxy. It measures whether the sparse features needed at evaluation time are covered by the concept pool discovered during calibration.
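The three definitions above translate almost directly into set operations. The following sketch builds the pool from calibration supports and estimates the mismatch rate on an evaluation sample; the function names and the toy supports are hypothetical, and the estimate is the plain empirical frequency rather than a confidence-adjusted quantity.

```python
def topk_support(pre_acts, k):
    """A(x): indices of the k largest-magnitude SAE pre-activations."""
    order = sorted(range(len(pre_acts)), key=lambda j: abs(pre_acts[j]), reverse=True)
    return frozenset(order[:k])

def build_pool(calibration_supports):
    """G*: union of the active supports seen on the calibration stream."""
    pool = set()
    for support in calibration_supports:
        pool |= support
    return pool

def empirical_mismatch(eval_supports, pool):
    """Fraction of evaluation inputs whose support is not contained in the pool."""
    misses = sum(1 for support in eval_supports if not support <= pool)
    return misses / len(eval_supports)

# Toy supports (hypothetical feature indices).
pool = build_pool([frozenset({0, 1}), frozenset({1, 2})])
eta_hat = empirical_mismatch(
    [frozenset({0, 2}), frozenset({2, 3}), frozenset({1, 2}), frozenset({4, 5})],
    pool,
)
```

Note that a single uncovered feature is enough to count an input as a mismatch, which is why low-reuse features can inflate $\eta$ even when most of the active support is covered.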

###### Proposition J.1 (Perfect unrestricted copying is insufficient for a non-vacuous certificate).

Let $B = \log_{2}(|\mathcal{V}|/\alpha)$ be the uninformed smoothed log-loss baseline. Suppose the unrestricted SAE proxy reconstructs the base model perfectly on the evaluation distribution, so that $\epsilon_{\mathrm{loss}} = 0$ and $S \circ M = M$. Ignoring finite-sample concentration terms, a sufficient condition for the resulting certificate to remain vacuous is

$$R(h_{G^{\ast}}) + \eta B \geq B.$$

In particular, even in the favorable case where the pool-restricted proxy has the same risk as the base model, $R(h_{G^{\ast}}) = R(M)$, the asymptotic certificate is vacuous whenever

$$\eta \geq 1 - \frac{R(M)}{B}.$$

Therefore, perfect reconstruction of the unrestricted SAE proxy does not by itself imply a non-vacuous certificate.

###### Proof.

As $N \to \infty$, the finite-sample concentration terms in Eq. [32](https://arxiv.org/html/2606.18383#A10.E32) vanish. Under perfect unrestricted reconstruction, $\epsilon_{\mathrm{loss}} = 0$. The asymptotic certificate therefore reduces to

$$R(M) \leq R(h_{G^{\ast}}) + \eta B.$$

The certificate is non-vacuous only if its right-hand side is below the uninformed baseline $B$. Hence, it remains vacuous whenever

$$R(h_{G^{\ast}}) + \eta B \geq B.$$

If additionally $R(h_{G^{\ast}}) = R(M)$, this condition becomes

$$R(M) + \eta B \geq B,$$

or equivalently,

$$\eta \geq 1 - \frac{R(M)}{B}.$$

Thus, even under perfect unrestricted copying, support mismatch alone can prevent the certificate from becoming non-vacuous. A better base model, meaning lower $R(M)$, can tolerate a larger mismatch rate before the certificate becomes vacuous. ∎
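A small numerical illustration of the threshold $\eta \geq 1 - R(M)/B$ may help. The values of `B` and the base-model risk below are hypothetical round numbers chosen so the arithmetic is easy to follow; they are not measurements from the paper.

```python
def vacuity_threshold(risk_M: float, B: float) -> float:
    """Mismatch rate at or above which the asymptotic certificate is vacuous,
    assuming the pool-restricted proxy has the same risk as the base model."""
    return 1.0 - risk_M / B

def asymptotic_certificate(risk_pool_proxy: float, eta: float, B: float) -> float:
    """Large-N limit of the bound under perfect unrestricted reconstruction."""
    return risk_pool_proxy + eta * B

# Hypothetical values: baseline B = 16 and base-model risk R(M) = 4.
B, risk_M = 16.0, 4.0
threshold = vacuity_threshold(risk_M, B)                  # 1 - 4/16 = 0.75
at_threshold = asymptotic_certificate(risk_M, 0.75, B)    # 4 + 0.75 * 16 = 16, vacuous
below = asymptotic_certificate(risk_M, 0.50, B)           # 4 + 0.50 * 16 = 12, non-vacuous
```

In this toy setting, a perfectly reconstructing SAE whose evaluation supports escape the calibration pool on three quarters of inputs would still yield a vacuous certificate.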

#### Why this penalizes non-transferable copy-like representations.

The proposition shows that the certificate is not automatically optimized by pointwise copying. A copy-like SAE may reconstruct individual activations well, but if it does so using highly input-specific or low-reuse features, then the union of supports observed during calibration may fail to cover the features activated on evaluation inputs. This increases $\eta$. Similarly, if good copying requires a very large concept pool $G^{\ast}$, then $P = |G^{\ast}|$ becomes large and the sparse complexity term increases at finite sample size:

$$B\sqrt{\frac{P\log(em/P) + \log(2/\delta)}{2N}}.$$

Thus, the certificate favors sparse proxies that are not only locally accurate, but also reusable and support-stable across samples.

#### Relationship to semanticity\.

This argument should not be read as a proof that every SAE achieving a strong certificate has human\-semantic features\. The theorem certifies an operational property: the SAE\-induced sparse proxy is informative, low\-distortion, and compressive enough to support a non\-vacuous risk certificate for the frozen LM\. Human\-semanticity of individual features is not directly encoded in the theorem\. However, the certificate can indirectly reward semantically organized representations when such organization leads to reusable, stable, low\-distortion sparse supports\. Conversely, a purely copy\-like representation is not guaranteed to be certified unless it also satisfies these reuse and stability requirements\.

This is why the feature\-shuffling ablation is informative\. Shuffling preserves the per\-example sparsity pattern and activation magnitudes, but destroys feature identity\. The resulting degradation shows that the certificate is sensitive to the organization of feature directions, not merely to the number of active features\. Therefore, the proposed bound is nontrivially useful: it does not certify all accurate reconstructions equally, but distinguishes sparse proxies according to whether their reconstruction is achieved through a stable and reusable concept pool\.