Discovering Millions of Interpretable Features with Sparse Autoencoders

arXiv cs.LG Papers

Summary

This paper introduces Qwen3-Instruct SAE, a suite of sparse autoencoders trained on Qwen3 instruction-tuned models, enabling the discovery of millions of interpretable features and demonstrating refusal steering capabilities.

arXiv:2606.26620v1 Announce Type: new Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for decomposing superposed language model representations into sparse and interpretable features. However, training SAEs is computationally expensive, and available open-source SAE models remain limited. In this work, we introduce Qwen3-Instruct SAE, a comprehensive suite of SAEs trained on the Qwen3 instruction-tuned model family, covering Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. For Qwen3-1.7B and Qwen3-4B, we train layer-wise SAEs at three key activation sites: residual streams, MLP outputs, and attention outputs. For Qwen3-8B, we train SAEs on a subset of residual stream layers. We systematically evaluate these SAEs using both activation-level reconstruction metrics and model-level recovery metrics, revealing distinct sparsity–fidelity trade-offs across layers and components. Finally, we demonstrate the utility of Qwen3-Instruct SAE through a refusal-steering case study, showing that selected SAE features can causally steer instruction-tuned Qwen3 models toward refusal behavior. Our release provides a practical resource for studying sparse representations, feature-level mechanisms, and behavioral interventions in instruction-tuned language models.


# Discovering Millions of Interpretable Features with Sparse Autoencoders
Source: [https://arxiv.org/html/2606.26620](https://arxiv.org/html/2606.26620)
XinYang He¹,², Wei Wang¹, Bing Zhao¹, Xuan Ren¹, WenBo Li¹, WeiXu Qiao¹, Hu Wei¹\*, Lin Qu¹

¹AI DATA, Alibaba Group Holding Limited; ²Beijing Institute of Technology; \*Corresponding author

Correspondence: kongwang@alibaba-inc.com

###### Abstract

Sparse autoencoders (SAEs) have emerged as a powerful tool for decomposing superposed language model representations into sparse and interpretable features. However, training SAEs is computationally expensive, and available open-source SAE models remain limited. In this work, we introduce Qwen3-Instruct SAE, a comprehensive suite of SAEs trained on the Qwen3 instruction-tuned model family, covering Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. For Qwen3-1.7B and Qwen3-4B, we train layer-wise SAEs at three key activation sites: residual streams, MLP outputs, and attention outputs. For Qwen3-8B, we train SAEs on a subset of residual stream layers. We systematically evaluate these SAEs using both activation-level reconstruction metrics and model-level recovery metrics, revealing distinct sparsity–fidelity trade-offs across layers and components. Finally, we demonstrate the utility of Qwen3-Instruct SAE through a refusal-steering case study, showing that selected SAE features can causally steer instruction-tuned Qwen3 models toward refusal behavior. Our release provides a practical resource for studying sparse representations, feature-level mechanisms, and behavioral interventions in instruction-tuned language models.¹

¹ We will release the code and model weights in the coming months.


## 1 Introduction

Large language models (LLMs) exhibit impressive capabilities in reasoning, planning, and a wide range of downstream tasks Guo et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib18)); Yin et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib54)); Shang et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib45)). Yet the internal mechanisms underlying these capabilities remain poorly understood, posing significant challenges for model interpretability, safety alignment, and trustworthy deployment Li et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib30)). Mechanistic interpretability seeks to address this gap by reverse-engineering the computations performed inside neural networks Zhang and Nanda ([2024](https://arxiv.org/html/2606.26620#bib.bib55)); Patel and Pavlick ([2022](https://arxiv.org/html/2606.26620#bib.bib39)); Stolfo et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib47)); Huben et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib24)), thereby revealing how learned representations and circuits Conmy et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib9)); Marks et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib33)) produce model behavior.

A central challenge in mechanistic interpretability is the superposition hypothesis Chen et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib8)); Elhage et al. ([2022](https://arxiv.org/html/2606.26620#bib.bib12)); Hänni et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib25)), which suggests that neural networks represent far more features than they have neurons by encoding multiple features in overlapping directions within the same activation space Mikolov et al. ([2013](https://arxiv.org/html/2606.26620#bib.bib37)); Gurnee et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib19)). This phenomenon makes it difficult to isolate and interpret individual features directly from model activations. Sparse Autoencoders (SAEs) have emerged as a principled approach to addressing this problem by decomposing superposed representations into a larger set of sparse, and often interpretable, latent features Huben et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib24)).

![Refer to caption](https://arxiv.org/html/2606.26620v1/x1.png)

Figure 1: Timeline of open-source SAE model releases, illustrating the progression from early exploration to ecosystem expansion. Our work, Qwen3-Instruct SAE, extends SAE analysis to the Qwen3 instruction-tuned model family.

Recent work has demonstrated the effectiveness of SAEs in extracting interpretable features from large-scale language models Gao et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib16)); Templeton et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib49)). In particular, Gemma Scope Lieberum et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib31)) provided a comprehensive suite of SAEs trained on the Gemma 2 model family, establishing an important open resource for mechanistic interpretability research. Subsequent efforts, including GemmaScope 2 (https://deepmind.google/blog/gemma-scope-2-helping-the-ai-safety-community-deepen-understanding-of-complex-language-model-behavior/), Qwen-Scope Deng et al. ([2026](https://arxiv.org/html/2606.26620#bib.bib11)) and LlamaScope He et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib22)), extended this paradigm to additional model families and further highlighted the value of large-scale SAE releases as shared research infrastructure.

To further accelerate progress in SAE research, we introduce Qwen3-Instruct SAE, a comprehensive public release of Sparse Autoencoders trained on the Qwen3 instruction-tuned model family Yang et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib53)). Qwen3-Instruct SAE covers Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. For Qwen3-1.7B and Qwen3-4B, we provide SAEs for every layer across three locations: MLP outputs, attention outputs, and residual streams. For Qwen3-8B, we currently release SAEs for a subset of residual stream layers. By releasing this broad collection of layer-wise SAEs, we aim to provide the community with a practical foundation for probing internal representations in Qwen3 models and to facilitate future work on feature discovery, circuit analysis, and comparative mechanistic interpretability.

In this paper, we describe the construction of Qwen3-Instruct SAE, following and extending the methodological framework established by prior large-scale SAE releases Lieberum et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib31)). Our contributions are threefold:

- We release Qwen3-Instruct SAE, a comprehensive SAE suite for the Qwen3 instruction-tuned model family, covering Qwen3-1.7B, Qwen3-4B, and Qwen3-8B.
- We provide layer-wise SAEs across three key activation sites (residual streams, MLP outputs, and attention outputs) for Qwen3-1.7B and Qwen3-4B, and additionally release SAEs for a subset of Qwen3-8B residual stream layers.
- We evaluate Qwen3-Instruct SAE with reconstruction and model-recovery metrics, and demonstrate its utility through a refusal-steering case study on instruction-tuned Qwen3 models.

## 2 Related work

#### Sparse Autoencoders for Mechanistic Interpretability\.

A central challenge in mechanistic interpretability is that internal model representations are often polysemantic: a single neuron may be activated by multiple semantically unrelated contexts Elhage et al. ([2022](https://arxiv.org/html/2606.26620#bib.bib12)); Chen et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib8)); Hänni et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib25)). This makes it difficult to directly interpret neuron activations or hidden states in terms of discrete concepts. SAEs have emerged as a promising approach for addressing this problem by learning an overcomplete and sparse latent representation that can decompose dense activations into more interpretable features Huben et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib24)). In the context of language models, SAEs have been used to identify monosemantic features, analyze feature activations across contexts, and support downstream work on circuit analysis and on steering model behavior O’Brien et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib38)); Wu et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib52)).

In parallel, researchers have investigated a range of design choices for SAE training, including architectural variants Bussmann et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib5)), alternative activation functions Bussmann et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib4)); Rajamanoharan et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib42)); Gao et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib16)), and standardized evaluation protocols Karvonen et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib28)); Chanin and Garriga-Alonso ([2026](https://arxiv.org/html/2606.26620#bib.bib6)); Wu et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib52)). These efforts have collectively established SAEs as a core tool in the mechanistic interpretability toolkit.

#### Scaling SAEs to Large Language Models\.

A critical line of research has focused on scaling SAE training to state-of-the-art large language models and systematically releasing the resulting feature dictionaries for the broader research community Templeton et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib49)); Gao et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib16)); Deng et al. ([2026](https://arxiv.org/html/2606.26620#bib.bib11)). For open-source models, Lieberum et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib31)) introduced Gemma Scope, a comprehensive collection of SAEs trained on all layers and sublayers of the Gemma 2 model family, representing the first large-scale open-source SAE suite. Building upon this foundation, LlamaScope (He et al., [2024](https://arxiv.org/html/2606.26620#bib.bib22)) and Qwen-Scope (which is concurrent work with ours) Deng et al. ([2026](https://arxiv.org/html/2606.26620#bib.bib11)) extended SAE training to the LLaMA-3.1-8B and Qwen3 model families, releasing SAEs trained on each layer and sublayer and providing evidence that interpretable features emerge consistently across different model architectures. These efforts collectively suggest that SAE-based interpretability is not specific to any particular model family but reflects general properties of how large language models organize their internal representations.

However, training a comprehensive suite of SAEs for large-scale models is still highly resource-intensive. To further accelerate progress in SAE research, our work extends this line of research to the Qwen3 instruction-tuned model family through the release of Qwen3-Instruct SAE, a comprehensive SAE suite covering Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. The suite provides full layer-wise coverage across the MLP outputs, attention outputs, and residual streams for Qwen3-1.7B and Qwen3-4B, together with a partial residual-stream release for Qwen3-8B (layers 0–8). We aim to provide a useful resource for the community and to facilitate future mechanistic interpretability research.

In Appendix [B](https://arxiv.org/html/2606.26620#A2), we provide a detailed comparison of the open-source SAE models, with a particular focus on how our work differs from Qwen-Scope.

## 3 Methodology

In this section, we describe in detail the training procedure, hyperparameters, and computational infrastructure used in our experiments\.

### 3.1 Sparse autoencoders

Given activations $\mathbf{x} \in \mathbb{R}^{n}$ from a language model, a sparse autoencoder (SAE) encodes and reconstructs the activations using an encoder and a decoder:

$$\mathbf{f}(\mathbf{x}) = \sigma\left(\mathbf{W}_{\mathrm{enc}}\mathbf{x} + \mathbf{b}_{\mathrm{enc}}\right), \tag{1}$$

$$\hat{\mathbf{x}} = \mathbf{W}_{\mathrm{dec}}\mathbf{f}(\mathbf{x}) + \mathbf{b}_{\mathrm{dec}}, \tag{2}$$

where $\mathbf{f}(\mathbf{x}) \in \mathbb{R}^{m}$ denotes the sparse latent, $\sigma(\cdot)$ is a nonlinearity such as ReLU, $\mathbf{W}_{\mathrm{enc}} \in \mathbb{R}^{m \times n}$ and $\mathbf{b}_{\mathrm{enc}} \in \mathbb{R}^{m}$ are the encoder parameters, and $\mathbf{W}_{\mathrm{dec}} \in \mathbb{R}^{n \times m}$ and $\mathbf{b}_{\mathrm{dec}} \in \mathbb{R}^{n}$ are the decoder parameters. Thus, $\mathbf{f}(\mathbf{x})$ is a set of linear weights that specify how to combine the $m \gg n$ columns of $\mathbf{W}_{\mathrm{dec}}$ to reconstruct $\mathbf{x}$. The columns of $\mathbf{W}_{\mathrm{dec}}$, which we denote by $\mathbf{w}_{i}$ for $i = 1, \ldots, m$, represent the directions of the learned dictionary into which the SAE decomposes $\mathbf{x}$.

The loss function for training the SAE consists of two key components: reconstruction loss and sparsity regularization:

$$\mathcal{L}(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|_{2}^{2} + \lambda \|\mathbf{f}(\mathbf{x})\|_{1} \tag{3}$$

The reconstruction loss ensures that the SAE learns to reconstruct the input data accurately, meaning the features encoded in the sparse representation must also be present in the input activations. The sparsity regularization, in turn, enforces sparsity by penalizing nonzero values in $\mathbf{f}(\mathbf{x})$, and $\lambda$ is a hyperparameter that controls the strength of the sparsity penalty.
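As a minimal sketch of Eqs. (1)–(3), the snippet below runs one forward pass of a toy SAE with $\sigma = \mathrm{ReLU}$ and evaluates the $L_1$-regularized loss. The sizes ($n = 2$, $m = 3$), the weight values, and the helper names `matvec`, `sae_forward`, and `sae_loss` are all hypothetical and chosen only for illustration; they are not taken from the released models or training code.

```python
# Toy SAE forward pass and L1-regularized loss (illustrative values only).

def matvec(W, v):
    """Multiply a matrix W (list of rows) by a vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Return (f, x_hat) following Eqs. (1)-(2) with sigma = ReLU."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = [max(0.0, p) for p in pre]                      # sparse latent
    x_hat = [r + b for r, b in zip(matvec(W_dec, f), b_dec)]
    return f, x_hat

def sae_loss(x, x_hat, f, lam):
    """Squared-error reconstruction plus lam * L1 norm of f, as in Eq. (3)."""
    recon = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
    sparsity = sum(abs(fi) for fi in f)
    return recon + lam * sparsity

# Hypothetical parameters: W_enc is m x n, W_dec is n x m.
W_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_enc = [0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, x_hat, f, lam=0.1)
print(f, x_hat, loss)
```

In this toy case only the first latent fires, so the reconstruction is a scaled copy of the first decoder column and the second input coordinate is left unexplained.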

### 3.2 JumpReLU SAEs

Prior work Lieberum et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib31)) focuses heavily on JumpReLU SAEs because they have been shown to be a slight Pareto improvement over other approaches, including Gated SAEs Rajamanoharan et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib41)) and TopK SAEs Gao et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib16)). Gemma Scope has also demonstrated strong empirical performance in practice. We therefore adopt JumpReLU as the primary training method for Qwen3-Instruct SAE.

#### JumpReLU

The JumpReLU activation uses a shifted Heaviside step function as a gating mechanism on top of a conventional ReLU:

$$\sigma(\mathbf{z}) = \text{JumpReLU}_{\theta}(\mathbf{z}) := \mathbf{z} \odot H(\mathbf{z} - \theta) \tag{4}$$

where $\mathbf{z} = \mathbf{W}_{\mathrm{enc}}\mathbf{x} + \mathbf{b}_{\mathrm{enc}}$ is the pre-activation, $\theta > 0$ is the JumpReLU's learnable threshold parameter, $\odot$ denotes elementwise multiplication, and $H$ is the Heaviside step function, which is 1 if its input is positive and 0 otherwise:

$$H(\mathbf{z} - \theta) = \begin{cases} 1, & \mathbf{z} - \theta > 0, \\ 0, & \mathbf{z} - \theta \leq 0. \end{cases} \tag{5}$$

The JumpReLU activation function leaves the pre-activations unchanged above the threshold, but sets them to zero at or below the threshold, with a different learned threshold per latent. As a result, the number of active features can vary adaptively across tokens, instead of being constrained to a fixed sparsity pattern as in TopK SAEs Gao et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib16)).
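The following sketch applies Eqs. (4)–(5) elementwise and contrasts the result with a plain ReLU. The pre-activations and per-latent thresholds are made-up numbers, and the function name `jumprelu` is only illustrative.

```python
def jumprelu(z, theta):
    """Elementwise JumpReLU: keep z_i when z_i > theta_i, otherwise output 0."""
    return [zi if zi > ti else 0.0 for zi, ti in zip(z, theta)]

def relu(z):
    """Plain ReLU, shown for comparison."""
    return [max(0.0, zi) for zi in z]

z = [0.3, 1.2, -0.5, 0.8]          # hypothetical pre-activations
theta = [0.5, 0.5, 0.1, 0.8]       # hypothetical per-latent thresholds (> 0)

jump_out = jumprelu(z, theta)      # [0.0, 1.2, 0.0, 0.0]
relu_out = relu(z)                 # [0.3, 1.2, 0.0, 0.8]
n_active = sum(1 for v in jump_out if v != 0.0)
print(jump_out, relu_out, n_active)
```

Small positive pre-activations (0.3) and values sitting exactly on the threshold (0.8) are zeroed by JumpReLU but passed through by ReLU, which is why the number of active latents depends on the learned thresholds rather than on a fixed budget.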

#### Loss function

The loss function used for JumpReLU SAEs is as follows:

$$\mathcal{L}(\mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|_{2}^{2} + \lambda \|\mathbf{f}(\mathbf{x})\|_{0} \tag{6}$$

The JumpReLU SAE uses a standard squared-error reconstruction loss and directly regularizes the number of active (non-zero) latents using the $L_0$ penalty. Because the $L_0$ penalty and the JumpReLU activation are piecewise constant with respect to the threshold parameters $\theta$, their gradients with respect to $\theta$ are zero almost everywhere, so direct gradient-based optimization of $\theta$ is not feasible. We therefore adopt the straight-through-estimator (STE) based optimization strategy of Rajamanoharan et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib42)). This procedure introduces an additional hyperparameter, the kernel density estimation bandwidth $\varepsilon$, which affects the fidelity of the gradient approximation used to train the threshold parameters $\theta$.
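To make the role of the bandwidth $\varepsilon$ concrete, here is a hedged sketch of one kernel-based pseudo-derivative construction commonly used for JumpReLU thresholds. It assumes a rectangle kernel and the pseudo-derivative forms $-\frac{1}{\varepsilon}K\!\left(\frac{z-\theta}{\varepsilon}\right)$ for the Heaviside term and $-\frac{\theta}{\varepsilon}K\!\left(\frac{z-\theta}{\varepsilon}\right)$ for the JumpReLU term; the text above does not state which kernel or exact form the authors use, so treat both as assumptions. All function names are hypothetical.

```python
def rect(u):
    """Rectangle kernel (assumed here): 1 on (-1/2, 1/2), 0 elsewhere."""
    return 1.0 if -0.5 < u < 0.5 else 0.0

def l0_norm(f):
    """Number of non-zero latents, the sparsity term in Eq. (6)."""
    return sum(1 for fi in f if fi != 0.0)

def pseudo_grad_heaviside(z, theta, eps):
    """Assumed pseudo-derivative of H(z - theta) with respect to theta."""
    return -(1.0 / eps) * rect((z - theta) / eps)

def pseudo_grad_jumprelu(z, theta, eps):
    """Assumed pseudo-derivative of JumpReLU_theta(z) with respect to theta."""
    return -(theta / eps) * rect((z - theta) / eps)

theta, eps = 0.5, 0.001            # eps matches the bandwidth quoted in the paper

# A pre-activation within eps/2 of the threshold receives a non-zero signal.
near_h = pseudo_grad_heaviside(0.5002, theta, eps)
near_j = pseudo_grad_jumprelu(0.5002, theta, eps)

# A pre-activation far from the threshold contributes nothing.
far_h = pseudo_grad_heaviside(0.9, theta, eps)

print(near_h, near_j, far_h, l0_norm([0.0, 1.2, 0.0, 0.7]))
```

Under these assumptions only pre-activations inside a window of width $\varepsilon$ around the threshold contribute to the threshold gradient: a smaller $\varepsilon$ gives a sharper but noisier estimate, while a larger one smooths over more samples.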

### 3.3 Training details

#### Training Data

We train SAEs on the activations of Qwen3 models generated using the dataset FineWeb-Edu Penedo et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib40)), which consists of a large-scale collection of educational text filtered from FineWeb. We train all the models using 14× NVIDIA H20-3e GPUs.

#### Location

We train SAEs at three different positions within each transformer block, as shown in Figure [2](https://arxiv.org/html/2606.26620#S3.F2): (1) post-residual stream; (2) post-MLP residual; and (3) post-attention output, which captures the output of the attention sublayer after the final linear transformation $W_O$. We zero-index the layers throughout this work, such that layer 0 refers to the first transformer block after the embedding layer.

![Refer to caption](https://arxiv.org/html/2606.26620v1/x2.png)

Figure 2: Illustration of the three SAE training locations within a transformer block of the Qwen3 model.

#### Hyperparameters

We use a consistent set of hyperparameters across all trained SAEs. The learning rate is set to $\text{lr} = 2 \times 10^{-4}$ with a linear warmup over the first 1,000 steps, and the JumpReLU bandwidth is set to $\varepsilon = 0.001$. The sparsity penalty is fixed at $\lambda = 1.0$ with a target $L_0$, and the sparsity loss is warmed up over the first 2,000 steps. We train with a batch size of 20,480 tokens. All models are optimized using Adam with $\beta_1 = 0.0$, $\beta_2 = 0.999$, and $\varepsilon_{\text{Adam}} = 10^{-8}$.
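The hyperparameters above can be collected into a small configuration, sketched below together with a warmup helper. The dictionary keys and the function `linear_warmup` are hypothetical names. The text states that the learning-rate warmup is linear; the linear shape of the sparsity warmup and the start value of zero for both schedules are assumptions made for this sketch.

```python
# Hyperparameters as reported in the text; key names are illustrative.
CONFIG = {
    "lr": 2e-4,
    "lr_warmup_steps": 1000,
    "jumprelu_bandwidth": 0.001,
    "sparsity_coeff": 1.0,
    "sparsity_warmup_steps": 2000,
    "batch_size_tokens": 20480,
    "adam_betas": (0.0, 0.999),
    "adam_eps": 1e-8,
}

def linear_warmup(step, warmup_steps, final_value):
    """Ramp linearly from 0 to final_value over warmup_steps, then hold."""
    if step >= warmup_steps:
        return final_value
    return final_value * step / warmup_steps

# Values at a few training steps (sparsity schedule shape is assumed linear).
lr_500 = linear_warmup(500, CONFIG["lr_warmup_steps"], CONFIG["lr"])
lam_500 = linear_warmup(500, CONFIG["sparsity_warmup_steps"], CONFIG["sparsity_coeff"])
lr_5000 = linear_warmup(5000, CONFIG["lr_warmup_steps"], CONFIG["lr"])
print(lr_500, lam_500, lr_5000)
```

With these settings the learning rate reaches its final value after 1,000 steps, while the sparsity coefficient is still ramping until step 2,000, so early updates are dominated by the reconstruction term.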

## 4 Evaluation

In this section, we analyze the trained SAEs from multiple perspectives, following established evaluation practices in prior work Lieberum et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib31)). We first describe the data used to evaluate the SAEs, and then present the evaluation methods and metrics. We primarily analyze the SAEs trained on Qwen3-1.7B; results for Qwen3-4B are provided in Appendix [A](https://arxiv.org/html/2606.26620#A1).

### 4.1 Evaluation dataset

For all SAE models, we use data drawn from the same distribution as the training data, and reserve 1B tokens for evaluation. Specifically, we split the FineWeb-Edu subset Sample-10BT Penedo et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib40)) into a training set and an evaluation set, using 90% of the data for training and the remaining 10% for evaluation. The context length used for evaluation is 1024.

### 4.2 Evaluating the Trade-off Between Sparsity and Fidelity

#### Methodology

For Qwen3-1.7B, we train SAE models with dictionary sizes of 16K and 65K. For each dictionary size, we consider multiple SAEs with different sparsity levels ($L_0 = 80, 160$), and plot the corresponding curves to show the reconstruction fidelity attainable at each level of sparsity.

#### Metrics

Following prior work, we measure sparsity by the average $L_0$-norm of the latent activations, $\mathbb{E}_{x}[\|f(x)\|_{0}]$. To evaluate reconstruction fidelity, we adopt two metrics:

\(1\) Delta LM loss \(DLL\)\. We evaluate the extent to which the model’s original performance can be recovered after replacing the original activations with SAE\-reconstructed activations during the forward pass\.

(2) Fraction of variance explained (FVE). As a secondary metric of reconstruction fidelity, we use the FVE, which is computed from the SAE reconstruction loss relative to a baseline that always predicts the dataset mean, and measures the proportion of variance in the input activations captured by the SAE. Unlike downstream performance-based metrics, FVE only reflects reconstruction quality at the activation level and does not account for the downstream effect of reconstruction errors on model performance.
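The three quantities above can be sketched on toy data as follows. The FVE is written as one minus the ratio of the SAE's squared reconstruction error to the squared error of a mean-predicting baseline, which matches the description in the text; the DLL is written as the language-model loss with SAE-reconstructed activations spliced in minus the original loss, which is the usual reading of the name but is an assumption here. The activation values and the two loss numbers are invented for illustration and are not results from the paper.

```python
def mean_l0(latents):
    """Average number of non-zero latents per token."""
    return sum(sum(1 for v in f if v != 0.0) for f in latents) / len(latents)

def fve(xs, x_hats):
    """Fraction of variance explained: 1 - SSE(x, x_hat) / SSE(x, mean(x))."""
    n_tokens, dim = len(xs), len(xs[0])
    mean = [sum(x[d] for x in xs) / n_tokens for d in range(dim)]
    resid = sum((x[d] - xh[d]) ** 2 for x, xh in zip(xs, x_hats) for d in range(dim))
    total = sum((x[d] - mean[d]) ** 2 for x in xs for d in range(dim))
    return 1.0 - resid / total

def delta_lm_loss(loss_with_sae, loss_original):
    """Increase in LM loss when SAE reconstructions replace the activations."""
    return loss_with_sae - loss_original

# Toy activations for four tokens and their (hypothetical) reconstructions.
xs = [[1.0, 0.0], [3.0, 0.0], [2.0, 2.0], [2.0, -2.0]]
x_hats = [[1.0, 0.0], [3.0, 0.0], [2.0, 1.0], [2.0, -1.0]]
latents = [[0.0, 1.5, 0.0], [2.0, 0.0, 0.1]]

sparsity = mean_l0(latents)
explained = fve(xs, x_hats)
dll = delta_lm_loss(2.35, 2.30)    # hypothetical loss values
print(sparsity, explained, dll)
```

A lower DLL and a higher FVE both indicate a more faithful SAE, but they can disagree: FVE treats all activation directions equally, whereas DLL is only sensitive to errors that change the model's predictions.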

#### DLL Results

Figure [3](https://arxiv.org/html/2606.26620#S4.F3) shows how DLL varies across different layers and positions in Qwen3-1.7B (Residual Stream, MLP, and Attention) under different sparsity levels and dictionary sizes.

![Refer to caption](https://arxiv.org/html/2606.26620v1/x3.png) (a)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x4.png) (b)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x5.png) (c)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x6.png) (d)

Figure 3: Layer-wise DLL at Different Positions in Qwen3-1.7B: Residual Stream, MLP, and Attention

Overall, SAEs trained on the residual stream and MLP tend to recover model performance better than those trained on attention outputs. A possible reason is that residual stream and MLP representations may encode semantic content more directly Geva et al. ([2021](https://arxiv.org/html/2606.26620#bib.bib17)); Dai et al. ([2022](https://arxiv.org/html/2606.26620#bib.bib10)); Elhage et al. ([2021](https://arxiv.org/html/2606.26620#bib.bib13)), while attention representations are more tightly coupled with contextual interactions Vig and Belinkov ([2019](https://arxiv.org/html/2606.26620#bib.bib51)). As a result, reconstruction errors in attention activations may have a larger effect on downstream model behavior.

We further observe a non-monotonic layer-wise pattern for SAEs trained on residual stream and MLP activations. In the first two layers, these SAEs recover most of the original model performance. Their recovery ability then decreases markedly in layers 2–4, before gradually improving again in deeper layers (5–27). A possible explanation is that transformer layers play different roles across depth Skean et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib46)); Ikeda et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib26)). Early layers encode relatively simple local features, which may be easier for SAEs to reconstruct, while layers 2–4 may form a transitional stage with stronger feature mixing, making recovery harder. In deeper layers, representations may become more stable and task-relevant Jin et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib27)); Gurnee and Tegmark ([2024](https://arxiv.org/html/2606.26620#bib.bib20)); Fan et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib14)), leading to improved recovery again. The similar pattern for both residual-stream and MLP activations suggests that this is likely a stage-wise property of model computation rather than an artifact of a specific activation space.

Another finding is that, for SAEs trained on attention-layer activations, the SAE with a 65k dictionary is less stable than the one with a 16k dictionary in recovering model performance under the same sparsity level ($L_0$). A possible explanation is that, under the same $L_0$ constraint, the larger 65k dictionary encourages greater feature splitting Chanin et al. ([2026](https://arxiv.org/html/2606.26620#bib.bib7)), decomposing otherwise stable attention features into finer-grained but less consistent components. It may also exacerbate feature absorption, where more specific features partially absorb activation mass from broader, more robust ones Chanin et al. ([2026](https://arxiv.org/html/2606.26620#bib.bib7)). As a result, the learned decomposition becomes more fragmented, which may make performance recovery less stable than with the 16k dictionary.

#### FVE Results

Figure [4](https://arxiv.org/html/2606.26620#S4.F4) shows the FVE of SAEs at different insertion positions across layers under different dictionary size–$L_0$ configurations. As shown in Figure [4](https://arxiv.org/html/2606.26620#S4.F4), across all dictionary size–$L_0$ settings, SAEs at each position achieve consistently substantial FVE throughout the network. This indicates that, under a wide range of configurations, the trained SAEs are able to reconstruct the original activations reasonably well, providing evidence for their overall effectiveness.

![Refer to caption](https://arxiv.org/html/2606.26620v1/x7.png) (a)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x8.png) (b)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x9.png) (c)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x10.png) (d)

Figure 4: Layer-wise FVE at Different Positions in Qwen3-1.7B: Residual Stream, MLP, and Attention

We also observe an outlier at the Layer 2 attention module for the 65k SAE, which yields relatively low FVE under both $L_0 = 80$ and $L_0 = 160$. A possible explanation is that this layer lies in an early transitional stage where attention patterns are less stable and less easily decomposable, so a larger dictionary mainly increases feature splitting and optimization difficulty rather than reconstruction quality. As a result, the 65k SAE may learn a more fragmented and less effective representation at this layer.

#### Sparsity\-fidelity trade\-off

To analyze the effect of sparsity under a fixed dictionary size, we group SAEs by dictionary size and plot both FVE and DLL as functions of target $L_0$. For each dictionary size, results are aggregated separately for attention, MLP, and residual-stream SAEs, and we average the metric values across all layers. This provides a module-level view of how reconstruction quality and performance recovery change with sparsity while holding dictionary size constant. The results are shown in Figure [5](https://arxiv.org/html/2606.26620#S4.F5):

![Refer to caption](https://arxiv.org/html/2606.26620v1/x11.png) (a)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x12.png) (b)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x13.png) (c)
![Refer to caption](https://arxiv.org/html/2606.26620v1/x14.png) (d)

Figure 5: Average FVE and DLL as functions of target $L_0$ under fixed dictionary sizes. For each dictionary size, results are aggregated separately for attention, MLP, and residual-stream SAEs, and the metric values are averaged across all layers.

From the results, we observe that $L_0$ has only a limited effect on FVE. In contrast, for DLL, a larger $L_0$ generally leads to improved performance.

## 5 Case Study: Steering with Qwen3 SAE

Recent work has shown that SAEs can be used to steer model behavior toward specific concepts Arad et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib1)); Fang et al. ([2026](https://arxiv.org/html/2606.26620#bib.bib15)); Bhargav and Zhu ([2025](https://arxiv.org/html/2606.26620#bib.bib3)), highlighting the practical utility of SAE features in real-world applications. In this section, we investigate whether trained SAEs can be used to steer the model toward refusal behavior.

### 5.1 Problem setup

We focus on the refusal task\. Specifically, we intervene on the model using SAE features and expect the intervened model to refuse the user’s request regardless of whether the request is harmful or harmless\.

### 5.2 Dataset & Metric

#### Dataset

We evaluate the effectiveness of SAE feature steering on three datasets: WildGuard Han et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib21)), XSTest Röttger et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib44)), and a mixed dataset constructed by Arditi et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib2)). Detailed dataset descriptions and summary statistics are provided in Appendix [C](https://arxiv.org/html/2606.26620#A3).

#### Metric

To evaluate the effectiveness of SAE features in the refusal task, we adopt the refusal score from prior work Lermen et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib29)); Liu et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib32)); Robey et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib43)); Arditi et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib2)), which classifies a model response as a refusal ($\text{refusal\_score} = 1$) if it contains at least one predefined refusal indicator (e.g., “I’m sorry”, “As an AI”), and as a non-refusal ($\text{refusal\_score} = 0$) otherwise. The complete list of refusal indicators is presented in Appendix [C.2](https://arxiv.org/html/2606.26620#A3.SS2).
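A substring-based refusal score of this kind can be sketched in a few lines. The indicator list below contains only the two examples quoted in the text, not the full list from the appendix. Case-sensitive matching and the normalization of curly apostrophes are assumptions of this sketch, and the sample responses are invented.

```python
# Only the two indicators quoted in the text; the full list is in the appendix.
REFUSAL_INDICATORS = ("I'm sorry", "As an AI")

def refusal_score(response, indicators=REFUSAL_INDICATORS):
    """Return 1 if the response contains any refusal indicator, else 0."""
    text = response.replace("\u2019", "'")   # treat curly and straight apostrophes alike
    return 1 if any(ind in text for ind in indicators) else 0

def refusal_rate(responses, indicators=REFUSAL_INDICATORS):
    """Fraction of responses classified as refusals."""
    return sum(refusal_score(r, indicators) for r in responses) / len(responses)

# Invented example responses.
responses = [
    "I\u2019m sorry, but I can't help with that request.",
    "Sure! Here is a simple recipe for banana bread.",
    "As an AI, I cannot assist with this.",
    "The capital of France is Paris.",
]
scores = [refusal_score(r) for r in responses]
rate = refusal_rate(responses)
print(scores, rate)
```

Because the score is purely lexical, it can miss refusals phrased without any listed indicator and can flag compliant answers that happen to contain one, which is worth keeping in mind when reading the refusal rates.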

### 5.3 SAE Feature Identification

To identify the SAE features responsible for refusal, we follow the approach of O’Brien et al. ([2025](https://arxiv.org/html/2606.26620#bib.bib38)). We first construct a refusal example consisting of a harmful user request and a refusal response; the prompts we use are shown in Figure [6](https://arxiv.org/html/2606.26620#S5.F6). Next, we capture the hidden activations at the target transformer layer and retain only the activations on the assistant tokens. These activations are then passed through the trained SAE to obtain sparse feature activations. Finally, we identify candidate refusal features by measuring how often each feature is activated across refusal tokens and how strongly it is activated when present.
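The last step, measuring how often and how strongly each feature fires on the refusal tokens, might look like the sketch below. The latent values are invented, and combining the two statistics into a single score by multiplying them is a hypothetical choice made here for illustration; the paper's actual selection rule is described in its appendix and may differ.

```python
def feature_stats(latents):
    """For each feature, return (activation frequency, mean activation when active).

    `latents` is a list over assistant tokens of SAE latent vectors.
    """
    n_tokens, n_features = len(latents), len(latents[0])
    stats = []
    for j in range(n_features):
        active = [f[j] for f in latents if f[j] > 0.0]
        freq = len(active) / n_tokens
        strength = sum(active) / len(active) if active else 0.0
        stats.append((freq, strength))
    return stats

def rank_candidates(latents):
    """Rank feature indices by frequency * strength (a hypothetical score)."""
    stats = feature_stats(latents)
    return sorted(range(len(stats)),
                  key=lambda j: stats[j][0] * stats[j][1],
                  reverse=True)

# Invented SAE latents for four assistant tokens and three features.
latents = [
    [0.0, 2.0, 0.1],
    [0.0, 3.0, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 2.0, 0.0],
]
stats = feature_stats(latents)
ranking = rank_candidates(latents)
print(stats, ranking)
```

In this toy example feature 1 fires on every refusal token with a high mean activation, so it would be the first candidate to test with a steering intervention.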

![Refer to caption](https://arxiv.org/html/2606.26620v1/x15.png)

Figure 6: Prompt used to search for SAE features responsible for refusal behavior.

To determine the target layer, we follow the method of Arditi et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib2)). The details of how we determine the target layer $l^{*}$ and SAE feature are provided in Appendix [D](https://arxiv.org/html/2606.26620#A4). The SAE feature IDs and intervention layers we used are shown in Table [1](https://arxiv.org/html/2606.26620#S5.T1).

Table 1: Model, SAE Feature ID and Target Intervention Layer.

### 5.4 Steering with SAE Feature

Having determined the target SAE feature and the corresponding layer $l^{*}$, we intervene on the model by steering the residual stream activations along the decoder direction of the selected feature. We implement the steering intervention via a forward hook registered on the residual stream at layer $l^{*}$. At each forward pass, the activation $\mathbf{x}$ is modified as:

$$\hat{\mathbf{x}}=\mathbf{x}+\alpha\cdot\mathbf{w}_{j}, \tag{7}$$

where $\mathbf{w}_{j}$ denotes the SAE decoder direction of the selected SAE feature $j$ and $\alpha$ is the steering coefficient. Note that the intervention is applied to all token positions except the first one. Responses are generated using greedy decoding.
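The update in Equation (7) can be sketched on a plain array as follows. This is an illustrative sketch only: the function name `steer_residual` and the `(seq_len, d_model)` layout are assumptions, and in practice the same arithmetic would run inside the forward hook on the layer's output tensor.

```python
import numpy as np

def steer_residual(x, w_j, alpha):
    """Return x + alpha * w_j at every token position except the first.

    x:     (seq_len, d_model) residual stream activations at the target layer
    w_j:   (d_model,) decoder direction of the selected SAE feature
    alpha: steering coefficient
    """
    x = np.asarray(x, dtype=float)
    w_j = np.asarray(w_j, dtype=float)
    steered = x.copy()
    steered[1:] += alpha * w_j   # position 0 is left untouched
    return steered
```

Copying the input keeps the function free of side effects, which makes it easy to compare steered and unsteered activations side by side.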

### 5.5 Main Results

Table 2: Refusal rates of different steering methods across datasets and models. Bold values indicate the best performance for each model and dataset.

Table [2](https://arxiv.org/html/2606.26620#S5.T2) reports the refusal rates of different steering methods across datasets and models. We observe that the *No steering* baseline exhibits consistently low refusal rates on safe prompts but also fails to refuse a substantial portion of unsafe prompts. Applying steering with a *random feature* largely collapses refusal behavior across all datasets, with refusal rates dropping close to zero on both safe and unsafe prompts. This suggests that random feature steering disrupts model behavior indiscriminately, confirming that the choice of steering feature is critical.

In contrast, steering with the *target feature* consistently achieves the highest refusal rates on unsafe prompts across all datasets and both models. Notably, Qwen3-4B-it achieves refusal rates of 0.95 and 0.93 on unsafe prompts on the XSTest and Mix datasets, respectively. These results demonstrate that the target SAE feature effectively steers the model toward refusal behavior on harmful inputs. However, we observe that the refusal rate on WildGuard is comparatively lower than on the other datasets, suggesting that the identified SAE features may capture refusal behavior that transfers unevenly across evaluation settings. We leave a more thorough investigation of this limitation to future work. Finally, we discuss the effect of different steering coefficients $\alpha$ on the refusal rate in Appendix [E](https://arxiv.org/html/2606.26620#A5).

## 6 Conclusion

We introduce Qwen3-Instruct SAE, a comprehensive suite of sparse autoencoders trained on the Qwen3 instruction-tuned model family. Covering multiple model scales and three key activation sites (residual streams, MLP outputs, and attention outputs), Qwen3-Instruct SAE provides a fine-grained resource for analyzing sparse representations in instruction-following language models. Through systematic evaluation with reconstruction and model-recovery metrics, we characterize the sparsity–fidelity trade-offs of the trained SAEs across layers and components. We further demonstrate the practical utility of Qwen3-Instruct SAE through a refusal-steering case study, showing that selected SAE features can support feature-level behavioral interventions. We hope this resource will facilitate future work on mechanistic interpretability, circuit analysis, and controlled steering of instruction-tuned language models.

## Limitations

Our work has several limitations\. First, although Qwen3\-Instruct SAE covers multiple model scales and activation sites, our analysis focuses primarily on reconstruction fidelity, model recovery, and one refusal\-steering case study\. Future work could conduct broader evaluations of feature interpretability, including automated feature labeling, causal circuit discovery, and downstream task\-level analyses\.

Second, our steering experiments focus on refusal behavior\. While this provides an illustrative example of how SAE features can be used for behavioral intervention, refusal is only one type of model behavior\. Additional studies are needed to determine whether the same methodology generalizes to other capabilities, safety\-relevant behaviors, and linguistic or reasoning phenomena\.

Third, training and evaluating large\-scale SAE collections remains computationally expensive\. Although our release is intended to reduce this barrier for future research, extending the analysis to larger Qwen models, additional instruction\-tuned model families, and more diverse training corpora remains an important direction for future work\. We currently train SAEs only for a subset of the residual stream layers of Qwen3\-8B\. We will continue training and release SAEs for all layers in the future\.

Finally, although Qwen3\-Instruct SAE covers Qwen3\-1\.7B, Qwen3\-4B, and Qwen3\-8B, our quantitative evaluation in this paper focuses on Qwen3\-1\.7B and Qwen3\-4B\. For Qwen3\-8B, we currently release SAEs only for a subset of residual\-stream layers and do not yet provide a full systematic evaluation\. Extending the analysis to Qwen3\-8B is an important direction for future work\.

## References

- Arad et al\. \(2025\)Dana Arad, Aaron Mueller, and Yonatan Belinkov\. 2025\.[SAEs are good for steering – if you select the right features](https://doi.org/10.18653/v1/2025.emnlp-main.519)\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 10241–10259, Suzhou, China\. Association for Computational Linguistics\.
- Arditi et al\. \(2024\)Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda\. 2024\.[Refusal in language models is mediated by a single direction](https://arxiv.org/abs/2406.11717)\.*Preprint*, arXiv:2406\.11717\.
- Bhargav and Zhu \(2025\)Samaksh Bhargav and Zining Zhu\. 2025\.[Feature\-guided SAE steering for refusal\-rate control using contrasting prompts](https://openreview.net/forum?id=Yz1gZJFRvT)\.In*Mechanistic Interpretability Workshop at NeurIPS 2025*\.
- Bussmann et al\. \(2024\)Bart Bussmann, Patrick Leask, and Neel Nanda\. 2024\.[Batchtopk sparse autoencoders](https://openreview.net/forum?id=d4dpOCqybL)\.In*NeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning*\.
- Bussmann et al\. \(2025\)Bart Bussmann, Noa Nabeshima, Adam Karvonen, and Neel Nanda\. 2025\.[Learning multi\-level features with matryoshka sparse autoencoders](https://openreview.net/forum?id=m25T5rAy43)\.In*Forty\-second International Conference on Machine Learning*\.
- Chanin and Garriga\-Alonso \(2026\)David Chanin and Adrià Garriga\-Alonso\. 2026\.[Synthsaebench: Evaluating sparse autoencoders on scalable realistic synthetic data](https://arxiv.org/abs/2602.14687)\.*Preprint*, arXiv:2602\.14687\.
- Chanin et al\. \(2026\)David Chanin, James Wilken\-Smith, Tomáš Dulka, Hardik Bhatnagar, Satvik Golechha, and Joseph Isaac Bloom\. 2026\.[A is for absorption: Studying feature splitting and absorption in sparse autoencoders](https://openreview.net/forum?id=R73ybUciQF)\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*\.
- Chen et al\. \(2023\)Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet\. 2023\.[Dynamical versus bayesian phase transitions in a toy model of superposition](https://arxiv.org/abs/2310.06301)\.*Preprint*, arXiv:2310\.06301\.
- Conmy et al\. \(2023\)Arthur Conmy, Augustine N\. Mavor\-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga\-Alonso\. 2023\.[Towards automated circuit discovery for mechanistic interpretability](https://openreview.net/forum?id=89ia77nZ8u)\.In*Thirty\-seventh Conference on Neural Information Processing Systems*\.
- Dai et al\. \(2022\)Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei\. 2022\.[Knowledge neurons in pretrained transformers](https://doi.org/10.18653/v1/2022.acl-long.581)\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 8493–8502, Dublin, Ireland\. Association for Computational Linguistics\.
- Deng et al\. \(2026\)Boyi Deng, Xu Wang, Yaoning Wang, Yu Wan, Yubo Ma, Baosong Yang, Haoran Wei, Jialong Tang, Huan Lin, Ruize Gao, Tianhao Li, Qian Cao, Xuancheng Ren, Xiaodong Deng, An Yang, Fei Huang, Dayiheng Liu, and Jingren Zhou\. 2026\.[Qwen\-scope: Turning sparse features into development tools for large language models](https://arxiv.org/abs/2605.11887)\.*Preprint*, arXiv:2605\.11887\.
- Elhage et al\. \(2022\)Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield\-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah\. 2022\.[Toy models of superposition](https://transformer-circuits.pub/2022/toy_model/index.html)\.*Transformer Circuits Thread*\.
- Elhage et al\. \(2021\)Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield\-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, and 6 others\. 2021\.[A mathematical framework for transformer circuits](https://transformer-circuits.pub/2021/framework/index.html)\.*Transformer Circuits Thread*\.
- Fan et al\. \(2024\)Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, and Zhongyuan Wang\. 2024\.[Not all layers of llms are necessary during inference](https://arxiv.org/abs/2403.02181)\.*Preprint*, arXiv:2403\.02181\.
- Fang et al\. \(2026\)Yi Fang, Wenjie Wang, Mingfeng Xue, Boyi Deng, Fengli Xu, Dayiheng Liu, and Fuli Feng\. 2026\.[Controllable llm reasoning via sparse autoencoder\-based steering](https://arxiv.org/abs/2601.03595)\.*Preprint*, arXiv:2601\.03595\.
- Gao et al\. \(2025\)Leo Gao, Tom Dupre la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu\. 2025\.[Scaling and evaluating sparse autoencoders](https://openreview.net/forum?id=tcsZt9ZNKD)\.In*The Thirteenth International Conference on Learning Representations*\.
- Geva et al\. \(2021\)Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy\. 2021\.[Transformer feed\-forward layers are key\-value memories](https://doi.org/10.18653/v1/2021.emnlp-main.446)\.In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing*, pages 5484–5495, Online and Punta Cana, Dominican Republic\. Association for Computational Linguistics\.
- Guo et al\. \(2025\)Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z\. F\. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, and 174 others\. 2025\.[Deepseek\-r1 incentivizes reasoning in llms through reinforcement learning](https://doi.org/10.1038/s41586-025-09422-z)\.*Nature*, 645\(8081\):633–638\.
- Gurnee et al\. \(2023\)Wes Gurnee, Neel Nanda, Matthew Pauly, Katherine Harvey, Dmitrii Troitskii, and Dimitris Bertsimas\. 2023\.[Finding neurons in a haystack: Case studies with sparse probing](https://openreview.net/forum?id=JYs1R9IMJr)\.*Transactions on Machine Learning Research*\.
- Gurnee and Tegmark \(2024\)Wes Gurnee and Max Tegmark\. 2024\.[Language models represent space and time](https://openreview.net/forum?id=jE8xbmvFin)\.In*The Twelfth International Conference on Learning Representations*\.
- Han et al\. \(2024\)Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri\. 2024\.[Wildguard: Open one\-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs](https://openreview.net/forum?id=Ich4tv4202)\.In*The Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track*\.
- He et al\. \(2024\)Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu\-Gang Jiang, and Xipeng Qiu\. 2024\.[Llama scope: Extracting millions of features from llama\-3\.1\-8b with sparse autoencoders](https://arxiv.org/abs/2410.20526)\.*Preprint*, arXiv:2410\.20526\.
- Huang et al\. \(2023\)Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen\. 2023\.Catastrophic jailbreak of open\-source llms via exploiting generation\.*arXiv preprint arXiv:2310\.06987*\.
- Huben et al\. \(2024\)Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey\. 2024\.[Sparse autoencoders find highly interpretable features in language models](https://openreview.net/forum?id=F76bwRSLeK)\.In*The Twelfth International Conference on Learning Representations*\.
- Hänni et al\. \(2024\)Kaarel Hänni, Jake Mendel, Dmitry Vaintrob, and Lawrence Chan\. 2024\.[Mathematical models of computation in superposition](https://arxiv.org/abs/2408.05451)\.*Preprint*, arXiv:2408\.05451\.
- Ikeda et al\. \(2025\)Wataru Ikeda, Kazuki Yano, Ryosuke Takahashi, Jaesung Lee, Keigo Shibata, and Jun Suzuki\. 2025\.[Layerwise importance analysis of feed\-forward networks in transformer\-based language models](https://arxiv.org/abs/2508.17734)\.*Preprint*, arXiv:2508\.17734\.
- Jin et al\. \(2025\)Mingyu Jin, Qinkai Yu, Jingyuan Huang, Qingcheng Zeng, Zhenting Wang, Wenyue Hua, Haiyan Zhao, Kai Mei, Yanda Meng, Kaize Ding, Fan Yang, Mengnan Du, and Yongfeng Zhang\. 2025\.[Exploring concept depth: How large language models acquire knowledge and concept at different layers?](https://aclanthology.org/2025.coling-main.37/)In*Proceedings of the 31st International Conference on Computational Linguistics*, pages 558–573, Abu Dhabi, UAE\. Association for Computational Linguistics\.
- Karvonen et al\. \(2025\)Adam Karvonen, Can Rager, Johnny Lin, Curt Tigges, Joseph Isaac Bloom, David Chanin, Yeu\-Tong Lau, Eoin Farrell, Callum Stuart McDougall, Kola Ayonrinde, Demian Till, Matthew Wearden, Arthur Conmy, Samuel Marks, and Neel Nanda\. 2025\.[SAEBench: A comprehensive benchmark for sparse autoencoders in language model interpretability](https://openreview.net/forum?id=qrU3yNfX0d)\.In*Forty\-second International Conference on Machine Learning*\.
- Lermen et al\. \(2024\)Simon Lermen, Charlie Rogers\-Smith, and Jeffrey Ladish\. 2024\.[Lora fine\-tuning efficiently undoes safety training in llama 2\-chat 70b](https://arxiv.org/abs/2310.20624)\.*Preprint*, arXiv:2310\.20624\.
- Li et al\. \(2025\)Shen Li, Liuyi Yao, Lan Zhang, and Yaliang Li\. 2025\.[Safety layers in aligned large language models: The key to LLM security](https://openreview.net/forum?id=kUH1yPMAn7)\.In*The Thirteenth International Conference on Learning Representations*\.
- Lieberum et al\. \(2024\)Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin Shah, and Neel Nanda\. 2024\.[Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2](https://doi.org/10.18653/v1/2024.blackboxnlp-1.19)\.In*Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP*, pages 278–300, Miami, Florida, US\. Association for Computational Linguistics\.
- Liu et al\. \(2024\)Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao\. 2024\.[AutoDAN: Generating stealthy jailbreak prompts on aligned large language models](https://openreview.net/forum?id=7Jwpw4qKkb)\.In*The Twelfth International Conference on Learning Representations*\.
- Marks et al\. \(2025\)Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller\. 2025\.[Sparse feature circuits: Discovering and editing interpretable causal graphs in language models](https://openreview.net/forum?id=I4e82CIDxv)\.In*The Thirteenth International Conference on Learning Representations*\.
- Marks and Tegmark \(2024\)Samuel Marks and Max Tegmark\. 2024\.[The geometry of truth: Emergent linear structure in large language model representations of true/false datasets](https://openreview.net/forum?id=aajyHYjjsk)\.In*First Conference on Language Modeling*\.
- Mazeika et al\. \(2024\)Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks\. 2024\.[Harmbench: A standardized evaluation framework for automated red teaming and robust refusal](https://arxiv.org/abs/2402.04249)\.*Preprint*, arXiv:2402\.04249\.
- Mazeika et al\. \(2023\)Mantas Mazeika, Andy Zou, Norman Mu, Long Phan, Zifan Wang, Chunru Yu, Adam Khoja, Fengqing Jiang, Aidan O’Gara, Ellie Sakhaee, Zhen Xiang, Arezoo Rajabi, Dan Hendrycks, Radha Poovendran, Bo Li, and David Forsyth\. 2023\.Tdc 2023 \(llm edition\): The trojan detection challenge\.In*NeurIPS Competition Track*\.
- Mikolov et al\. \(2013\)Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean\. 2013\.[Efficient estimation of word representations in vector space](http://arxiv.org/abs/1301.3781)\.In*1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2\-4, 2013, Workshop Track Proceedings*\.
- O’Brien et al\. \(2025\)Kyle O’Brien, David Majercak, Xavier Fernandes, Richard G\. Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi\-Sangdeh\. 2025\.[Steering language model refusal with sparse autoencoders](https://openreview.net/forum?id=PMK1jdGQoc)\.In*ICML 2025 Workshop on Reliable and Responsible Foundation Models*\.
- Patel and Pavlick \(2022\)Roma Patel and Ellie Pavlick\. 2022\.[Mapping language models to grounded conceptual spaces](https://openreview.net/forum?id=gJcEM8sxHK)\.In*International Conference on Learning Representations*\.
- Penedo et al\. \(2024\)Guilherme Penedo, Hynek Kydlíček, Loubna Ben allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf\. 2024\.[The fineweb datasets: Decanting the web for the finest text data at scale](https://openreview.net/forum?id=n6SCkn2QaG)\.In*The Thirty\-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track*\.
- Rajamanoharan et al\. \(2024\)Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, and Neel Nanda\. 2024\.[Improving sparse decomposition of language model activations with gated sparse autoencoders](https://openreview.net/forum?id=zLBlin2zvW)\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*\.
- Rajamanoharan et al\. \(2025\)Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, Janos Kramar, and Neel Nanda\. 2025\.[Jumping ahead: Improving reconstruction fidelity with jumpreLU sparse autoencoders](https://openreview.net/forum?id=mMPaQzgzAN)\.
- Robey et al\. \(2024\)Alexander Robey, Eric Wong, Hamed Hassani, and George J\. Pappas\. 2024\.[Smoothllm: Defending large language models against jailbreaking attacks](https://arxiv.org/abs/2310.03684)\.*Preprint*, arXiv:2310\.03684\.
- Röttger et al\. \(2024\)Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy\. 2024\.[XSTest: A test suite for identifying exaggerated safety behaviours in large language models](https://doi.org/10.18653/v1/2024.naacl-long.301)\.In*Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\)*, pages 5377–5400, Mexico City, Mexico\. Association for Computational Linguistics\.
- Shang et al\. \(2025\)Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, Ying Xin, Ziming Miao, Scarlett Li, Fan Yang, and Mao Yang\. 2025\.[rstar2\-agent: Agentic reasoning technical report](https://arxiv.org/abs/2508.20722)\.*Preprint*, arXiv:2508\.20722\.
- Skean et al\. \(2025\)Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Nikul Patel, Jalal Naghiyev, Yann LeCun, and Ravid Shwartz\-Ziv\. 2025\.[Layer by layer: Uncovering hidden representations in language models](https://openreview.net/forum?id=WGXb7UdvTX)\.In*Forty\-second International Conference on Machine Learning*\.
- Stolfo et al\. \(2023\)Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan\. 2023\.[A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis](https://doi.org/10.18653/v1/2023.emnlp-main.435)\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 7035–7052, Singapore\. Association for Computational Linguistics\.
- Taori et al\. \(2023\)Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B\. Hashimoto\. 2023\.Stanford alpaca: An instruction\-following llama model\.[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)\.
- Templeton et al\. \(2024\)Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C\. Daniel Freeman, Theodore R\. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, and 3 others\. 2024\.[Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)\.*Transformer Circuits Thread*\.
- Turner et al\. \(2024\)Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J\. Vazquez, Ulisse Mini, and Monte MacDiarmid\. 2024\.[Steering language models with activation engineering](https://arxiv.org/abs/2308.10248)\.*Preprint*, arXiv:2308\.10248\.
- Vig and Belinkov \(2019\)Jesse Vig and Yonatan Belinkov\. 2019\.[Analyzing the structure of attention in a transformer language model](https://doi.org/10.18653/v1/W19-4808)\.In*Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP*, pages 63–76, Florence, Italy\. Association for Computational Linguistics\.
- Wu et al\. \(2025\)Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts\. 2025\.[Axbench: Steering LLMs? even simple baselines outperform sparse autoencoders](https://openreview.net/forum?id=K2CckZjNy0)\.In*Forty\-second International Conference on Machine Learning*\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others\. 2025\.[Qwen3 technical report](https://arxiv.org/abs/2505.09388)\.*Preprint*, arXiv:2505\.09388\.
- Yin et al\. \(2025\)Huifeng Yin, Yu Zhao, Minghao Wu, Xuanfan Ni, Bo Zeng, Hao Wang, Tianqi Shi, Liangying Shao, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang\. 2025\.[Marco\-o1 v2: Towards widening the distillation bottleneck for reasoning models](https://doi.org/10.18653/v1/2025.acl-long.1145)\.In*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 23506–23516, Vienna, Austria\. Association for Computational Linguistics\.
- Zhang and Nanda \(2024\)Fred Zhang and Neel Nanda\. 2024\.[Towards best practices of activation patching in language models: Metrics and methods](https://arxiv.org/abs/2309.16042)\.*Preprint*, arXiv:2309\.16042\.
- Zou et al\. \(2023\)Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J\. Zico Kolter, and Matt Fredrikson\. 2023\.[Universal and transferable adversarial attacks on aligned language models](https://arxiv.org/abs/2307.15043)\.*Preprint*, arXiv:2307\.15043\.

## Appendix A Evaluation of the Qwen3-4B SAE

In this appendix, we present a quantitative evaluation of the SAEs trained on Qwen3-4B. Specifically, we assess the quality of the learned representations along two dimensions: Fraction of Variance Explained (FVE), which measures reconstruction fidelity, and Delta LM loss (DLL), which captures the impact on the model’s predictive performance. The evaluation results are shown in Figure [7](https://arxiv.org/html/2606.26620#A1.F7).
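For readers who want to reproduce these two metrics, the sketch below shows one common way to compute them. The normalization used here for FVE (total variance around the per-dimension mean) and the sign convention for DLL (loss with the SAE reconstruction spliced in, minus the clean loss) are assumptions; the definitions given in the main text take precedence where they differ.

```python
import numpy as np

def fraction_of_variance_explained(x, x_hat):
    """FVE = 1 - ||x - x_hat||^2 / ||x - mean(x)||^2, summed over all tokens.

    x, x_hat: (n_tokens, d_model) original and SAE-reconstructed activations.
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    residual = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total

def delta_lm_loss(loss_with_sae, loss_clean):
    """Increase in language-modeling loss caused by the SAE reconstruction."""
    return loss_with_sae - loss_clean
```

Under these conventions a perfect reconstruction gives an FVE of 1 and a DLL of 0, while replacing every activation with the mean gives an FVE of 0.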

![Refer to caption](https://arxiv.org/html/2606.26620v1/x16.png) (a) FVE
![Refer to caption](https://arxiv.org/html/2606.26620v1/x17.png) (b) DLL

Figure 7: Evaluation of the Qwen3-4B SAE. Left: Fraction of Variance Explained (FVE). Right: Delta LM loss (DLL).

The results demonstrate that SAEs trained at the residual stream and attention output positions achieve strong reconstruction quality, as reflected by their high FVE scores. However, SAEs at the MLP output of intermediate layers exhibit notably lower FVE. One possible explanation is that intermediate MLP layers encode more complex and entangled features that are difficult to disentangle with a linear sparse decomposition.

Regarding the DLL metric, we observe a trend consistent with Qwen3-1.7B: SAEs at the MLP output and residual stream positions achieve better model performance recovery than those at the attention output. This discrepancy may be related to the distinct functional roles of different components within the transformer architecture. Specifically, the MLP layers and residual stream are primarily responsible for processing complex semantic information and storing factual knowledge. In contrast, attention layers specialize in modeling contextual dependencies across tokens, capturing relational patterns that are inherently more distributed and thus harder to recover through SAE interventions.

## Appendix B Comparison of Open-Source SAE Models

We compare several popular open\-source SAE models from multiple perspectives; the details are summarized in Table[3](https://arxiv.org/html/2606.26620#A2.T3)\.

Table 3: Comparison of open-source SAE models. R: residual stream; A: attention output; M: MLP output; TC: transcoder.

Our Qwen3-Instruct SAE differs from Qwen-Scope in several key aspects. First, while Qwen-Scope targets the base variants of Qwen3 models, our work focuses on their instruction-tuned counterparts, which are more representative of real-world deployment scenarios. Second, Qwen-Scope relies on Qwen3 proprietary pretraining data, whereas we train our SAEs on the open-source FineWeb-Edu dataset, improving reproducibility. Third, Qwen-Scope only trains SAEs at the residual stream, while we additionally train SAEs at the attention output and MLP output, providing more comprehensive coverage of the model’s internal representations. Detailed information on our Qwen3-Instruct SAE is provided in Table [4](https://arxiv.org/html/2606.26620#A2.T4).

Table 4: Training configurations of Qwen3-Instruct SAE. R: residual stream; A: attention output; M: MLP output.

## Appendix C Details of Dataset and Metric

### C.1 Dataset details

We evaluate on two categories of prompts: *unsafe* and *safe*. For unsafe prompts, we use the WildGuard dataset Han et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib21)), filtering samples labeled as *harmful* based on the `prompt_harm_label` field. For safe prompts, we retain samples labeled as *unharmful* from the same dataset. We additionally evaluate on XSTest Röttger et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib44)), a benchmark specifically designed to assess refusal behavior, where samples are partitioned into *safe* and *unsafe* subsets based on the `label` field. Furthermore, we incorporate the safe and unsafe datasets constructed in Arditi et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib2)), where the safe dataset is derived from Alpaca Taori et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib48)), and the unsafe dataset consists of harmful instructions drawn from AdvBench Zou et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib56)), MaliciousInstruct Huang et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib23)), TDC2023 Mazeika et al. ([2023](https://arxiv.org/html/2606.26620#bib.bib36)), and HarmBench Mazeika et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib35)). The statistics of all datasets are summarized in Table [5](https://arxiv.org/html/2606.26620#A3.T5).

Table 5: Statistics of evaluation datasets.

### C.2 Refusal indicators

Formally, given a model response $c$, the refusal score is defined as:

$$\text{refusal\_score}(c)=\begin{cases}1 & \text{if } \exists\, r\in\mathcal{R},\ r\subseteq c\\ 0 & \text{otherwise}\end{cases} \tag{8}$$

where $\mathcal{R}$ denotes the set of refusal indicators, matched case-insensitively across the entire response. The complete list is provided in Figure [8](https://arxiv.org/html/2606.26620#A3.F8).
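Equation (8) amounts to a case-insensitive substring check, which can be sketched as below. The two-item `EXAMPLE_INDICATORS` tuple is a placeholder containing only the phrases quoted in the main text; the full set is the one listed in Figure 8. The helper `refusal_rate` is a hypothetical convenience for averaging the score over a set of responses.

```python
# Placeholder subset; the complete indicator set is given in Figure 8.
EXAMPLE_INDICATORS = ("I'm sorry", "As an AI")

def refusal_score(response, indicators=EXAMPLE_INDICATORS):
    """Return 1 if any indicator occurs in the response (case-insensitive), else 0."""
    text = response.lower()
    return int(any(indicator.lower() in text for indicator in indicators))

def refusal_rate(responses, indicators=EXAMPLE_INDICATORS):
    """Mean refusal score over a non-empty list of responses."""
    return sum(refusal_score(r, indicators) for r in responses) / len(responses)
```

Because the match is purely lexical, a response that quotes an indicator phrase without actually refusing is still counted as a refusal.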

![Refer to caption](https://arxiv.org/html/2606.26620v1/x18.png)

Figure 8: The predefined set of refusal indicators $\mathcal{R}$ used for computing $\text{refusal\_score}(c)$.

## Appendix D SAE Feature Selection

We first extract the most effective refusal direction following Arditi et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib2)). Specifically, we adopt the *difference-in-means* (DIM) technique Arditi et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib2)); Marks and Tegmark ([2024](https://arxiv.org/html/2606.26620#bib.bib34)) to identify the direction in the residual stream along which harmful and harmless activations diverge. For each layer $l\in[L]$ and post-instruction token position $i\in\mathcal{I}$, we compute:

$$r_{i}^{(l)}=\mu_{i}^{(l)}-\nu_{i}^{(l)} \tag{9}$$

where $\mu_{i}^{(l)}$ and $\nu_{i}^{(l)}$ denote the mean residual stream activations over the harmful training set $\mathcal{D}_{\text{harmful}}^{\text{train}}$ and the harmless training set $\mathcal{D}_{\text{harmless}}^{\text{train}}$, respectively:

$$\mu_{i}^{(l)}=\frac{1}{|\mathcal{D}_{\text{harmful}}^{\text{train}}|}\sum_{t\in\mathcal{D}_{\text{harmful}}^{\text{train}}}x_{i}^{(l)}(t) \tag{10}$$

$$\nu_{i}^{(l)}=\frac{1}{|\mathcal{D}_{\text{harmless}}^{\text{train}}|}\sum_{t\in\mathcal{D}_{\text{harmless}}^{\text{train}}}x_{i}^{(l)}(t) \tag{11}$$

The direction of $r_{i}^{(l)}$ captures the axis along which the two distributions differ, while its magnitude reflects the degree of separation between them.
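Equations (9) to (11) reduce to a difference of two per-position means, sketched below. The array layout `(n_prompts, n_layers, n_positions, d_model)` and the function name are assumptions made for illustration; the activations themselves would have to be cached from the model beforehand.

```python
import numpy as np

def difference_in_means(harmful_acts, harmless_acts):
    """Compute r_i^(l) = mu_i^(l) - nu_i^(l) for every layer l and position i.

    harmful_acts:  (n_harmful, n_layers, n_positions, d_model)
    harmless_acts: (n_harmless, n_layers, n_positions, d_model)
    Returns an array of shape (n_layers, n_positions, d_model).
    """
    mu = np.asarray(harmful_acts, dtype=float).mean(axis=0)   # Eq. (10)
    nu = np.asarray(harmless_acts, dtype=float).mean(axis=0)  # Eq. (11)
    return mu - nu                                            # Eq. (9)
```

The two prompt sets may differ in size, since each mean is taken over its own first axis.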

#### Vector Selection\.

Enumerating all combinations of $i\in\mathcal{I}$ and $l\in[L]$ yields $|\mathcal{I}|\times L$ candidate vectors. For each candidate $r_{i}^{(l)}$, we compute three metrics on the validation sets $\mathcal{D}_{\text{harmful}}^{\text{val}}$ and $\mathcal{D}_{\text{harmless}}^{\text{val}}$:

- bypass\_score: the average refusal rate on $\mathcal{D}_{\text{harmful}}^{\text{val}}$ under directional ablation of $r_{i}^{(l)}$.
- induce\_score: the average refusal rate on $\mathcal{D}_{\text{harmless}}^{\text{val}}$ under activation addition of $r_{i}^{(l)}$.
- kl\_score: the average KL divergence between the output distributions on $\mathcal{D}_{\text{harmless}}^{\text{val}}$ with and without directional ablation of $r_{i}^{(l)}$.
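The two interventions named in these metrics are not restated here; the sketch below follows their usual form in Arditi et al. (2024) as we understand it, where ablation removes the component of each activation along the unit-normalized direction and addition shifts each activation by the direction itself. Function names and the `(seq_len, d_model)` layout are illustrative assumptions.

```python
import numpy as np

def directional_ablation(x, r):
    """Remove the component of each row of x along direction r.

    x: (seq_len, d_model) activations; r: (d_model,) candidate direction.
    """
    x = np.asarray(x, dtype=float)
    r_hat = np.asarray(r, dtype=float) / np.linalg.norm(r)
    return x - np.outer(x @ r_hat, r_hat)

def activation_addition(x, r):
    """Shift every row of x by the (unnormalized) direction r."""
    return np.asarray(x, dtype=float) + np.asarray(r, dtype=float)
```

After ablation, every activation is orthogonal to the candidate direction, which is what allows bypass\_score to measure how much refusal depends on it.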

We select the optimal vector $r_{i^{*}}^{(l^{*})}$ with minimum bypass\_score, subject to the following constraints: (1) $\text{induce\_score}>0$, ensuring the direction is sufficient to induce refusal; (2) $\text{kl\_score}<0.1$, filtering out directions that cause significant behavioral changes on harmless prompts; and (3) $l<0.8L$, restricting the search to earlier layers to avoid directions that merely suppress refusal tokens at the unembedding level.
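The constrained selection can be sketched as a filter followed by a minimum. The dictionary keys and the function name `select_direction` are hypothetical, and the layer indexing convention (0-based or 1-based) is not specified in the text, so the comparison against $0.8L$ is shown as written.

```python
def select_direction(candidates, n_layers, kl_max=0.1, layer_frac=0.8):
    """Pick the candidate with minimum bypass_score among those meeting all constraints.

    candidates: iterable of dicts with keys
        'layer', 'position', 'bypass_score', 'induce_score', 'kl_score'.
    Returns the chosen dict, or None if no candidate is feasible.
    """
    feasible = [
        c for c in candidates
        if c["induce_score"] > 0                 # (1) able to induce refusal
        and c["kl_score"] < kl_max               # (2) small change on harmless prompts
        and c["layer"] < layer_frac * n_layers   # (3) restricted to earlier layers
    ]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["bypass_score"])
```

Returning `None` when nothing passes the filters makes an over-tight threshold visible instead of silently falling back to an unconstrained choice.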

![Refer to caption](https://arxiv.org/html/2606.26620v1/x19.png)

Figure 9: SAE feature activation heatmap of the assistant response tokens for Qwen3-1.7B. Each row represents an SAE feature and each column represents a token. The color intensity indicates the activation value.

Given the selected direction $r_{i^{*}}^{(l^{*})}$, we proceed to identify the most effective SAE feature for steering. Specifically, we run the model on the prompt shown in Figure [6](https://arxiv.org/html/2606.26620#S5.F6) and capture the residual stream activations at layer $l^{*}$. We then decompose these activations using the SAE and retain only the features that are active on at least 5 tokens simultaneously, as illustrated in Figure [9](https://arxiv.org/html/2606.26620#A4.F9). Finally, we enumerate all retained features and select the most effective one as our steering feature. The selected layer and feature index for each model are summarized in Table [1](https://arxiv.org/html/2606.26620#S5.T1).
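The retention step can be sketched as a count over tokens. The sketch assumes that "active" means an SAE activation strictly above zero and that the count runs over the tokens of the captured response. Both are interpretations, and the function name and toy matrix are hypothetical. The final choice of the "most effective" feature is not implemented here because its criterion is not specified in this section.

```python
from typing import List


def features_active_on_min_tokens(
    feature_acts: List[List[float]],
    min_tokens: int = 5,
    threshold: float = 0.0,
) -> List[int]:
    """Return indices of features active on at least `min_tokens` tokens.

    feature_acts[t][j] is the SAE activation of feature j on token t.
    A feature counts as active when its activation exceeds `threshold`
    (assumed to be zero here).
    """
    if not feature_acts:
        return []
    num_features = len(feature_acts[0])
    counts = [0] * num_features
    for token_acts in feature_acts:
        for j, a in enumerate(token_acts):
            if a > threshold:
                counts[j] += 1
    return [j for j, c in enumerate(counts) if c >= min_tokens]


# Toy activations: 6 tokens x 3 features (made-up numbers).
acts = [
    [1.0, 0.5, 0.0],
    [0.8, 0.0, 0.3],
    [1.2, 0.4, 0.2],
    [0.9, 0.0, 0.1],
    [1.1, 0.6, 0.4],
    [0.7, 0.2, 0.5],
]

retained = features_active_on_min_tokens(acts)
print(retained)
```

Here feature 1 fires on only four tokens and is dropped, while features 0 and 2 pass the five-token threshold and would move on to the enumeration step.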

## Appendix E Effect of Steering Coefficient $\alpha$

Figure [10](https://arxiv.org/html/2606.26620#A5.F10) illustrates how the refusal rates for safe and unsafe requests vary with the steering coefficient $\alpha$ across two models (Qwen3-1.7B and Qwen3-4B) and three datasets (Custom, WildGuard, and XSTest).

![Refer to caption](https://arxiv.org/html/2606.26620v1/x20.png)

Figure 10: Refusal rate vs. steering coefficient $\alpha$ across models and datasets.

As $\alpha$ increases, the refusal rate for unsafe requests follows a consistent inverted U-shaped curve, in line with the observations of Turner et al. ([2024](https://arxiv.org/html/2606.26620#bib.bib50)): it peaks at $\alpha\approx 0.3$–$0.5$ and collapses to near zero at larger values. This suggests that an overly large $\alpha$ over-perturbs the model's internal representations, disrupting its reasoning ability. Meanwhile, the refusal rate on the safe dataset also increases with $\alpha$, which is expected because stronger steering amplifies the model's general tendency toward refusal. This effect is particularly evident in Qwen3-4B on XSTest, where the refusal rate peaks at $\sim$52%. We also observe that the larger model responds more strongly to steering, achieving higher peak refusal rates on unsafe requests (90%–95% for Qwen3-4B vs. 40%–77% for Qwen3-1.7B). Based on these results, we select $\alpha\approx 0.3$–$0.5$ as the optimal range for our main experiments.
