Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

arXiv cs.LG 06/09/26, 04:00 AM Papers
Summary
Query Lens extends Logit Lens to interpret sparse autoencoder features by jointly considering encoder-side key features and decoder-side value features, and accounting for indirect effects from downstream modules. The paper also introduces the Subspace Channel Hypothesis, suggesting downstream modules read features through layer-specific subspaces.
arXiv:2606.07617v1 Announce Type: new
Cached at: 06/09/26, 08:51 AM
# Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects
Source: [https://arxiv.org/html/2606.07617](https://arxiv.org/html/2606.07617)
###### Abstract

While sparse autoencoders provide features more interpretable than individual neurons, reliably characterizing them remains challenging\. We propose Query Lens, which extends Logit Lens to enable more comprehensive and faithful interpretations of sparse features\. By jointly considering encoder\-side key features and decoder\-side value features, we identify both the inputs that activate a feature and the outputs it promotes\. We also account for indirect, module\-mediated effects that arise when the feature is processed by downstream modules, going beyond the direct effect captured by Logit Lens\. In experiments, we find that Query Lens yields coherent token signatures for features that remain uninterpretable under Logit Lens\. Finally, we propose the Subspace Channel Hypothesis, suggesting that downstream modules read features through layer\-specific subspaces\.

Machine Learning, ICML


## 1 Introduction

Explaining the inner workings of large language models (LLMs) remains a central challenge in the field of mechanistic interpretability. A core objective of this line of work (Bau et al., [2019](https://arxiv.org/html/2606.07617#bib.bib31); Mu and Andreas, [2021](https://arxiv.org/html/2606.07617#bib.bib32); Dai et al., [2022](https://arxiv.org/html/2606.07617#bib.bib33); Park et al., [2025](https://arxiv.org/html/2606.07617#bib.bib34)) is to assign human-interpretable descriptions to the LLMs’ internal representations, a.k.a. features. Recent progress in sparse dictionary learning, particularly through sparse autoencoders (SAEs), has accelerated this research direction by providing a more tractable target of analysis: internal activations represented as sparse combinations of dictionary elements (Huben et al., [2024](https://arxiv.org/html/2606.07617#bib.bib29)).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x1.png) (a) Schematic view of Residual Stream Dynamics.
![Refer to caption](https://arxiv.org/html/2606.07617v1/x2.png) (b) Comparison of Logit Lens and Query Lens.

Figure 1: Overview of Query Lens. (a) A feature written into the residual stream is read as a *query* by downstream modules, producing *indirect effects*. (b) Logit Lens projects features directly into vocabulary space and misses these indirect effects. Query Lens accounts for them and provides a more faithful interpretation.

A common approach to characterizing SAE features is data-driven: the model is run on large corpora to identify inputs that strongly activate a target feature, and the resulting high-activation contexts are used to infer the feature’s semantics. Although a large body of prior work (Bills et al., [2023](https://arxiv.org/html/2606.07617#bib.bib20); Bricken et al., [2023](https://arxiv.org/html/2606.07617#bib.bib19); Choi et al., [2024](https://arxiv.org/html/2606.07617#bib.bib22); Paulo et al., [2025](https://arxiv.org/html/2606.07617#bib.bib21)) relies on this practice, it suffers from two notable limitations. First, securing feature-sensitive examples typically requires exhaustive model runs over large corpora; in some cases, access to the underlying data is even infeasible due to privacy constraints (Dar et al., [2023](https://arxiv.org/html/2606.07617#bib.bib10)). Second, input-conditioned descriptions alone are not output-grounded: they often fail to sufficiently capture a feature’s causal effect on the model’s generation, limiting their reliability for downstream applications such as steering (Gur-Arieh et al., [2025](https://arxiv.org/html/2606.07617#bib.bib17)).

A prominent alternative for interpreting SAE features is to project them directly into the vocabulary space, primarily by applying Logit Lens (Nostalgebraist, [2020](https://arxiv.org/html/2606.07617#bib.bib37); Bloom and Lin, [2024](https://arxiv.org/html/2606.07617#bib.bib39)). While this method avoids the complexity of collecting activating contexts and provides an output-grounded summary, it nonetheless comes with two fundamental drawbacks. (1) Completeness: While Logit Lens captures how a feature direction affects the output logits, it does not explain which inputs originally activate the feature, i.e., the input-side causality. (2) Faithfulness: A large portion of SAE features remain elusive under Logit Lens. In particular, many features, especially in earlier layers, exhibit diffuse token patterns or are dominated by uninterpretable tokens, rather than converging to a coherent semantic concept.

In this work, we aim to address the two aforementioned issues. To improve completeness, we extend probing beyond SAE decoders to include SAE encoders that have received comparatively less attention. Adopting the key-value memory view (Geva et al., [2021](https://arxiv.org/html/2606.07617#bib.bib9)), we refer to encoder features as *key features* and decoder features as *value features*. Concretely, key features compute activations in response to the input, whereas value features are added to the residual stream weighted by these activations. Accordingly, this structure maps directly onto two aspects of a feature’s causal role: key features characterize which inputs activate a feature, while value features specify which outputs it promotes.

In terms of faithfulness, we argue that by design Logit Lens reveals only a subset of viable input–feature and feature–output interactions. Specifically, when a feature direction is added to the residual stream, its effect on the output distribution decomposes into two components: a direct effect and an indirect effect. (While the discussion here centers on feature–output interactions, the same direct/indirect decomposition applies to input–feature effects.) The *direct effect* propagates through the residual stream to the output logits, whereas the *indirect effect* arises when downstream modules consume the feature in the residual stream, inducing additional changes in the output logits. Logit Lens focuses on the direct effect while largely ignoring the indirect effect, leading to the prevalence of uninterpretable token distributions.

To this end, we propose Query Lens, a framework that extends Logit Lens to interpret SAE features in embedding space in a more comprehensive and faithful manner. First, to characterize a feature’s causal role on both the input and output sides, Query Lens adaptively switches between key and value features: the former captures what activates a feature, while the latter reveals what it promotes. Furthermore, Query Lens accounts for both the direct and indirect effects of features, providing more reliable explanations of their functions across a wider range of cases. In sum, these improvements enable token-level interpretations that more accurately reflect the causal footprint of SAE features.

By explicitly considering indirect effects, Query Lens further supports a mechanistic analysis of how features are converted and read by downstream modules in LLMs. We observe that a feature produces markedly different impacts across Transformer components, despite being added to the residual stream as a static vector. Motivated by this finding, we introduce the Subspace Channel Hypothesis. We assume that the heterogeneous processing of a feature vector stems from selective readout: rather than uniformly consuming the full feature, each module extracts information from a low-dimensional subspace, termed a channel. We investigate this phenomenon by learning low-rank linear maps from features to module responses, revealing that feature readout is mediated by layer-specific channels. Our code is available at [https://github.com/HYU-NLP/query-lens](https://github.com/HYU-NLP/query-lens).

## 2 Background

### 2.1 Residual Stream View of Transformers

We adopt the residual stream view of Transformer language models, where hidden states form a single stream updated only through residual additions (Elhage et al., [2021](https://arxiv.org/html/2606.07617#bib.bib15)). Each layer consists of two residual blocks, a multi-head self-attention block $R_{\text{A}}(\cdot)$ and an MLP block $R_{\text{M}}(\cdot)$, both of which add residual updates to the stream. Formally,

$$h_{\text{mid}}^{l}=h_{\text{pre}}^{l}+R_{\text{A}}^{l}(h_{\text{pre}}^{l}),\quad h_{\text{post}}^{l}=h_{\text{mid}}^{l}+R_{\text{M}}^{l}(h_{\text{mid}}^{l}).$$

The residual stream then propagates to the next layer by setting $h_{\text{pre}}^{l+1}=h_{\text{post}}^{l}$. We omit LayerNorm for brevity.

### 2.2 Sparse Autoencoders

A bottleneck in interpreting neural networks is that many neurons are *polysemantic*, responding to multiple, unrelated explanations due to *superposition* (Elhage et al., [2022](https://arxiv.org/html/2606.07617#bib.bib25); Bricken et al., [2023](https://arxiv.org/html/2606.07617#bib.bib19)). To disentangle such features into a more localized basis, recent work has adopted sparse autoencoders (SAEs). SAEs reconstruct the original residual stream vectors while encouraging sparse feature activations. Given a target vector $h_{\text{post}}\in\mathbb{R}^{d_{m}}$, an SAE computes

$$\hat{h}_{\text{post}}=W_{\text{dec}}\,f\!\left(W_{\text{enc}}^{\top}h_{\text{post}}\right),\tag{1}$$

where $W_{\text{enc}},W_{\text{dec}}\in\mathbb{R}^{d_{m}\times d_{\text{dict}}}$ denote the encoder and decoder weight matrices, $f(\cdot)$ is a pointwise nonlinearity, and $\hat{h}_{\text{post}}$ denotes the SAE reconstruction of $h_{\text{post}}$. By using an overcomplete dictionary ($d_{\text{dict}}\gg d_{m}$) with sparse activations, SAEs learn features that are more monosemantic and thus more characterizable than individual neurons, which motivates their analysis.

### 2.3 Key-Value Memory

We employ the *key–value memory* view of the MLP block (Geva et al., [2021](https://arxiv.org/html/2606.07617#bib.bib9)) and extend it to sparse dictionaries. The SAE computation (Eq. ([1](https://arxiv.org/html/2606.07617#S2.E1))) can be written as a sum of *sub-updates*. Let $k_{i}$ and $v_{i}$ denote the $i$-th columns of $W_{\text{enc}}$ and $W_{\text{dec}}$. Then, the SAE reconstruction can be decomposed as

$$\hat{h}_{\text{post}}=\sum_{i=1}^{d_{\text{dict}}}a_{i}\!\left(h_{\text{post}}\right)v_{i},\quad a_{i}\!\left(h_{\text{post}}\right)=f\!\left(\langle h_{\text{post}},k_{i}\rangle\right).\tag{2}$$

Each sub-update is computed by taking an inner product between the input vector $h_{\text{post}}$ and a column vector from the encoder ($k_{i}$) to obtain a scalar activation ($a_{i}$), and using it to weight the corresponding vector from the decoder ($v_{i}$).

This yields an attention-style analogy: encoder column vectors $\{k_{i}\}$ serve as *key features* that produce sparse activations from the input, while decoder column vectors $\{v_{i}\}$ serve as *value features* combined using these activations.
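The equivalence between the matrix form in Eq. (1) and the sum of sub-updates in Eq. (2) can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a plain ReLU for the pointwise nonlinearity $f$, omits biases as Eq. (1) does, and uses toy sizes and random weights chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, d_dict = 8, 32  # toy sizes (assumed, not from the paper)

W_enc = rng.normal(size=(d_m, d_dict))  # columns are key features k_i
W_dec = rng.normal(size=(d_m, d_dict))  # columns are value features v_i
h_post = rng.normal(size=d_m)           # stand-in residual stream vector

# Matrix form, Eq. (1): h_hat = W_dec f(W_enc^T h), with f = ReLU (assumption).
a = np.maximum(W_enc.T @ h_post, 0.0)
h_hat_matrix = W_dec @ a

# Key-value form, Eq. (2): one sub-update a_i * v_i per dictionary element.
h_hat_sum = np.zeros(d_m)
for i in range(d_dict):
    a_i = max(float(h_post @ W_enc[:, i]), 0.0)  # key k_i reads the input
    h_hat_sum += a_i * W_dec[:, i]               # value v_i is written, weighted by a_i

print(np.allclose(h_hat_matrix, h_hat_sum))
```

With Top-K or JumpReLU SAEs the activation function differs, but the decomposition into per-feature sub-updates is unchanged once the activations $a_i$ are computed.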

## 3 Query Lens

### 3.1 Key Concepts and Limitations of Logit Lens

Logit Lens (Nostalgebraist, [2020](https://arxiv.org/html/2606.07617#bib.bib37)) was originally proposed as a method for inspecting *intermediate hidden states* in a Transformer by asking what the model would predict if decoding were performed from an intermediate layer. It projects a residual stream vector into vocabulary space as

$$y^{l}=U^{\top}h_{\text{post}}^{l}\in\mathbb{R}^{|V|},$$

where $U\in\mathbb{R}^{d_{m}\times|V|}$ is the unembedding matrix, and $|V|$ is the size of the vocabulary $V$.

Prior work has further applied Logit Lens to model *parameters* (Geva et al., [2021](https://arxiv.org/html/2606.07617#bib.bib9), [2022](https://arxiv.org/html/2606.07617#bib.bib40)). Since $h_{\text{post}}^{l}$ can be estimated using a sum of sub-updates as in Eq. ([2](https://arxiv.org/html/2606.07617#S2.E2)), projecting a single sub-update isolates its contribution in vocabulary space:

$$U^{\top}\!\left(a_{i}^{l}(h_{\text{post}}^{l})\,v_{i}^{l}\right)=a_{i}^{l}(h_{\text{post}}^{l})\,U^{\top}v_{i}^{l}.$$

Because a positive scalar activation $a_{i}^{l}(h_{\text{post}}^{l})$ only rescales logits and does not change their ranking, the static value vector determines the promoted token signature via $U^{\top}v_{i}^{l}$.
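As a small illustration of why the static value vector alone fixes the token signature, the sketch below applies Logit Lens to a toy value vector and to a positively rescaled copy. The unembedding matrix, value vector, sizes, and the helper `logit_lens_topk` are hypothetical stand-ins, not part of any library or of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, vocab = 8, 20  # toy sizes (assumed)

U = rng.normal(size=(d_m, vocab))  # stand-in unembedding matrix
v = rng.normal(size=d_m)           # stand-in value feature v_i^l

def logit_lens_topk(U, vec, k=5):
    """Project a residual-stream direction into vocabulary space; return top-k token ids."""
    scores = U.T @ vec
    return np.argsort(-scores)[:k]

top_unit = logit_lens_topk(U, v)          # activation a = 1
top_scaled = logit_lens_topk(U, 3.7 * v)  # activation a = 3.7 (any positive value)

print(top_unit, top_scaled)
```

The two rankings coincide because multiplying every logit by the same positive activation preserves their order; a negative multiplier would reverse it, which is why the sketch restricts itself to positive activations.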

Despite its widespread use, interpreting SAE features with Logit Lens suffers from two key limitations. First, many SAE features yield uninterpretable tokens under Logit Lens. We attribute this in part to what Logit Lens measures: it primarily reflects the *direct effect* of adding a feature direction to the residual stream, while ignoring *indirect effects* that arise when downstream modules consume the perturbed stream and propagate its influence through their computations. Second, current practice places little emphasis on key features in SAE encoders. Although a few seminal studies (Geva et al., [2021](https://arxiv.org/html/2606.07617#bib.bib9); Dar et al., [2023](https://arxiv.org/html/2606.07617#bib.bib10)) analyze projections of MLP key vectors, this direction of work remains weakly connected to the dominant data-driven approach to SAE feature interpretation based on activating examples.

These limitations motivate the questions we address: how can we interpret a feature’s causal role by \(1\) characterizing what activates it and what it promotes, and \(2\) accounting for both direct effects and indirect, module\-mediated effects?

### 3.2 Residual Stream Dynamics

We begin by framing this question using the residual stream. Let $x$ denote the one-hot indicator of the input token, so that the input embedding is obtained via lookup as $e=Ex$, where $E\in\mathbb{R}^{d_{m}\times|V|}$ denotes the embedding matrix. Let $a$ be a feature activation from the forward pass and $y$ the output logits, where $y=U^{\top}h$ and $U\in\mathbb{R}^{d_{m}\times|V|}$ is the unembedding matrix. The activation depends on the input and the output depends on the activation, i.e., $x\mapsto a$ and $a\mapsto y$. Our goal is to characterize how these dependencies are expressed along the residual stream by relating variations in the input $x$ to changes in the feature activation $a$, and variations in $a$ to changes in the output logits $y$.

#### 3.2.1 Forward Dynamics

We first formalize how a local perturbation to a feature activation propagates forward to the output logits. Consider feature $i$ at layer $l\in\{1,\ldots,L\}$, with post-activation $a_{i}^{\,l}$, i.e., the scalar obtained after applying the nonlinearity. The change in the output logits $y$ induced by perturbing this activation from $a^{\ast}$ by $\Delta a$ can be approximated as

$$y\!\left(a^{\ast}+\Delta a\right)\approx y\!\left(a^{\ast}\right)+\left.\frac{\partial y}{\partial a_{i}^{\,l}}\right|_{a_{i}^{\,l}=a^{\ast}}\Delta a.$$

This is a first-order linearization, where the induced logit change per unit change in the scalar activation is fixed by $\partial y/\partial a_{i}^{\,l}$. From $y=U^{\top}h_{\text{post}}^{L}$ and $h_{\text{post}}^{l}\approx\sum_{i}a_{i}^{\,l}\,v_{i}^{\,l}$, and with the chain rule, it can be expanded as

$$\frac{\partial y}{\partial a_{i}^{\,l}}=\frac{\partial y}{\partial h_{\text{post}}^{L}}\,\frac{\partial h_{\text{post}}^{L}}{\partial h_{\text{post}}^{l}}\,\frac{\partial h_{\text{post}}^{l}}{\partial a_{i}^{\,l}}=U^{\top}\frac{\partial h_{\text{post}}^{L}}{\partial h_{\text{post}}^{l}}v_{i}^{\,l}.\tag{3}$$

The intermediate mapping from $h_{\text{post}}^{l}$ to $h_{\text{post}}^{L}$ can be written as a product of Jacobians across downstream residual blocks. To be specific, for each layer $k$, differentiating the equation from the definition of residual update gives

$$\frac{\partial h_{\text{mid}}^{k}}{\partial h_{\text{pre}}^{k}}=I+J_{\text{A}}^{k},\qquad\frac{\partial h_{\text{post}}^{k}}{\partial h_{\text{mid}}^{k}}=I+J_{\text{M}}^{k},\tag{4}$$

where $J_{\text{A}}^{k}\coloneqq\partial R_{\text{A}}^{k}/\partial h_{\text{pre}}^{k}$ and $J_{\text{M}}^{k}\coloneqq\partial R_{\text{M}}^{k}/\partial h_{\text{mid}}^{k}$. (Each Jacobian is evaluated at the corresponding pre-module residual from the forward pass on a reference input $x$, e.g., $J_{\text{M}}^{k}$ at $h_{\text{mid}}^{k}(x)$; we leave this notation implicit for brevity.) Since $h_{\text{pre}}^{k}=h_{\text{post}}^{k-1}$, repeated use of the chain rule yields

$$\frac{\partial h_{\text{post}}^{L}}{\partial h_{\text{post}}^{l}}=\prod_{k=l+1}^{L}\frac{\partial h_{\text{post}}^{k}}{\partial h_{\text{mid}}^{k}}\frac{\partial h_{\text{mid}}^{k}}{\partial h_{\text{pre}}^{k}}=\prod_{k=l+1}^{L}\bigl(I+J_{\text{M}}^{k}\bigr)\bigl(I+J_{\text{A}}^{k}\bigr).\tag{5}$$
Combining Eq. ([3](https://arxiv.org/html/2606.07617#S3.E3)) and Eq. ([5](https://arxiv.org/html/2606.07617#S3.E5)) gives

$$\frac{\partial y}{\partial a_{i}^{\,l}}=U^{\top}\left[\prod_{k=l+1}^{L}\bigl(I+J_{\text{M}}^{k}\bigr)\bigl(I+J_{\text{A}}^{k}\bigr)\right]v_{i}^{\,l}.$$

We refer to $\partial y/\partial a_{i}^{\,l}$ as the forward dynamics of feature $(l,i)$: it quantifies the logit change induced by a unit change in the post-activation $a_{i}^{\,l}$.
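The forward dynamics can be sanity-checked on a toy network by comparing the Jacobian product with a finite difference of the logits. This is a hedged sketch under strong assumptions: a single token position, no LayerNorm, and both the attention and MLP blocks replaced by small `tanh` residual maps with random weights. All names (`make_block`, `residual`, `jacobian`, `run_from`, `stream_transition`) are hypothetical helpers, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, d_hidden, vocab, n_layers = 6, 10, 15, 4  # toy sizes (assumed)
I = np.eye(d_m)

def make_block():
    # Toy residual block R(h) = W2 tanh(W1 h); stands in for attention or MLP.
    return (rng.normal(scale=0.3, size=(d_hidden, d_m)),
            rng.normal(scale=0.3, size=(d_m, d_hidden)))

blocks = [(make_block(), make_block()) for _ in range(n_layers)]  # (R_A, R_M) per layer
U = rng.normal(size=(d_m, vocab))

def residual(block, h):
    W1, W2 = block
    return W2 @ np.tanh(W1 @ h)

def jacobian(block, h):
    W1, W2 = block
    return W2 @ np.diag(1.0 - np.tanh(W1 @ h) ** 2) @ W1

def run_from(h, start):
    """Run layers start, ..., n_layers-1 (0-indexed) on stream h and return logits."""
    for attn, mlp in blocks[start:]:
        h = h + residual(attn, h)
        h = h + residual(mlp, h)
    return U.T @ h

def stream_transition(h, start):
    """Product of (I + J_M^k)(I + J_A^k) over downstream layers, later layers on the left."""
    T = I.copy()
    for attn, mlp in blocks[start:]:
        J_a = jacobian(attn, h)      # evaluated at h_pre^k
        h = h + residual(attn, h)    # h_mid^k
        J_m = jacobian(mlp, h)       # evaluated at h_mid^k
        h = h + residual(mlp, h)     # h_post^k
        T = (I + J_m) @ (I + J_a) @ T
    return T

l = 1                             # the feature lives after the first layer
h_post_l = rng.normal(size=d_m)   # stand-in for h_post^l on a reference input
v = rng.normal(size=d_m)          # stand-in value feature v_i^l

forward_dynamics = U.T @ stream_transition(h_post_l, l) @ v
eps = 1e-6
finite_diff = (run_from(h_post_l + eps * v, l) - run_from(h_post_l - eps * v, l)) / (2 * eps)
direct_only = U.T @ v             # the identity-transition score

print(np.max(np.abs(forward_dynamics - finite_diff)))
```

In a real Transformer the attention Jacobian also couples token positions and LayerNorm contributes its own factor, so this sketch only illustrates the chain-rule structure of Eq. (5), not a full implementation.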

#### 3.2.2 Backward Dynamics

Similarly, we formalize how a feature activation responds to input perturbations. Here, $a_{i}^{\,l}$ denotes the *pre-activation*, i.e., the scalar before applying the nonlinearity, which we use as a proxy for the post-activation, since common activation functions used in SAEs can be non-differentiable; we justify this proxy in Appendix [F](https://arxiv.org/html/2606.07617#A6).

Consider perturbing the input token indicator $x$ from a reference value $x^{\ast}$ by $\Delta x$. The induced change in the activation $a_{i}^{\,l}$ can be approximated as

$$a_{i}^{\,l}\!\left(x^{\ast}+\Delta x\right)\approx a_{i}^{\,l}\!\left(x^{\ast}\right)+\left.\frac{\partial a_{i}^{\,l}}{\partial x}\right|_{x=x^{\ast}}\Delta x.$$
From $h_{\text{pre}}^{1}=Ex$ and $a_{i}^{\,l}=\langle h_{\text{post}}^{l},k_{i}^{\,l}\rangle$, the chain rule gives

$$\frac{\partial a_{i}^{\,l}}{\partial x}=\frac{\partial a_{i}^{\,l}}{\partial h_{\text{post}}^{l}}\,\frac{\partial h_{\text{post}}^{l}}{\partial h_{\text{pre}}^{1}}\,\frac{\partial h_{\text{pre}}^{1}}{\partial x}=(k_{i}^{\,l})^{\top}\frac{\partial h_{\text{post}}^{l}}{\partial h_{\text{pre}}^{1}}E.\tag{6}$$

The remaining term $\partial h_{\text{post}}^{l}/\partial h_{\text{pre}}^{1}$ expands into a product of Jacobians over the prefix of residual blocks from $k=1$ to $l$. Concretely, using Eq. ([4](https://arxiv.org/html/2606.07617#S3.E4)) and noting that $h_{\text{pre}}^{k+1}=h_{\text{post}}^{k}$, repeated application of the chain rule gives

$$\frac{\partial h_{\text{post}}^{l}}{\partial h_{\text{pre}}^{1}}=\prod_{k=1}^{l}\frac{\partial h_{\text{post}}^{k}}{\partial h_{\text{mid}}^{k}}\frac{\partial h_{\text{mid}}^{k}}{\partial h_{\text{pre}}^{k}}=\prod_{k=1}^{l}\bigl(I+J_{\text{M}}^{k}\bigr)\bigl(I+J_{\text{A}}^{k}\bigr),\tag{7}$$

where $J_{\text{A}}^{k}$ and $J_{\text{M}}^{k}$ are defined as in Eq. ([4](https://arxiv.org/html/2606.07617#S3.E4)). Combining Eq. ([6](https://arxiv.org/html/2606.07617#S3.E6)) and Eq. ([7](https://arxiv.org/html/2606.07617#S3.E7)) yields

$$\frac{\partial a_{i}^{\,l}}{\partial x}=(k_{i}^{\,l})^{\top}\left[\prod_{k=1}^{l}\bigl(I+J_{\text{M}}^{k}\bigr)\bigl(I+J_{\text{A}}^{k}\bigr)\right]E.$$

We refer to $\partial a_{i}^{\,l}/\partial x$ as the backward dynamics of feature $(l,i)$: it specifies the direction in input space that most increases the feature’s pre-activation.

### 3.3 Understanding the Dynamics

The above derivations reveal an analogy between the forward and backward dynamics, which can be factored into three elements: (1) Feature Vector, (2) Stream Transition, and (3) Readout. They describe what is injected into the residual stream, how the signal is transported across layers, and how it is expressed in vocabulary space, respectively.

##### Feature Vector\.

The feature vectors are recovered by unpacking the *local* coupling between a feature activation and the residual stream at the same layer. Asking what the dictionary *writes* to the stream and what it *reads* from the stream amounts to identifying two local derivatives, which reduce directly to the dictionary vectors themselves: $\partial h_{\text{post}}^{l}/\partial a_{i}^{\,l}=v_{i}^{\,l}$ and $\partial a_{i}^{\,l}/\partial h_{\text{post}}^{l}=(k_{i}^{\,l})^{\top}$. Thus, feature vectors are not auxiliary artifacts; they *are* the dictionary’s local read/write directions: value features govern *outgoing* influence on the residual stream, while key features capture *incoming* sensitivity to residual stream variations.

##### Stream Transition\.

The *stream transition* denotes the Jacobians $\partial h_{\text{post}}^{l}/\partial h_{\text{pre}}^{1}$ and $\partial h_{\text{post}}^{L}/\partial h_{\text{post}}^{l}$, since they specify how a local read/write at layer $l$ is transported through the residual stream and expressed at its endpoints, $h_{\text{pre}}^{1}$ and $h_{\text{post}}^{L}$. Specifically, this is a $d_{m}\times d_{m}$ mapping, and as Eq. ([5](https://arxiv.org/html/2606.07617#S3.E5)) and Eq. ([7](https://arxiv.org/html/2606.07617#S3.E7)) show, it is expressed as a product of identity-plus-Jacobian factors: $\prod_{k}(I+J_{\text{M}}^{k})(I+J_{\text{A}}^{k})$ (abbreviated as $\prod_{k}(I+J^{k})$ hereafter). Each term in the expansion of this product can be interpreted as a distinct computational path by which the local perturbation reaches an endpoint stream. For instance, the term $J_{\text{M}}^{5}$ corresponds to the path where the induced residual change is consumed by the layer-5 MLP block, modifying its output and in turn altering the residual stream that continues forward.

From this perspective, Logit Lens retains only the identity term in the stream transition, considering only the computation from a local perturbation to the endpoint stream, i\.e\., the*direct effect*\. By including the remaining terms—which capture how downstream modules consume and reshape the perturbation, i\.e\., the*indirect effects*—we obtain a more faithful transport of the local effect to the endpoint stream\.
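The path interpretation can be made concrete for a single layer, where the product $(I+J_{\text{M}})(I+J_{\text{A}})$ expands into four terms. The sketch below uses random matrices as hypothetical stand-ins for the two Jacobians and shows that the full score splits into the direct (identity) part and the indirect remainder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, vocab = 5, 9  # toy sizes (assumed)
I = np.eye(d_m)

# Hypothetical Jacobians of one downstream attention block and one MLP block.
J_A = 0.2 * rng.normal(size=(d_m, d_m))
J_M = 0.2 * rng.normal(size=(d_m, d_m))
U = rng.normal(size=(d_m, vocab))  # stand-in unembedding
v = rng.normal(size=d_m)           # stand-in value feature

# Stream transition for one layer and its expansion into computational paths.
T = (I + J_M) @ (I + J_A)
paths = {
    "direct (identity)": I,
    "via attention": J_A,
    "via MLP": J_M,
    "via attention, then MLP": J_M @ J_A,
}

full = U.T @ T @ v            # score with the full transition
direct = U.T @ v              # identity term only (direct effect)
indirect = U.T @ (T - I) @ v  # all remaining paths (indirect effects)
per_path = {name: U.T @ P @ v for name, P in paths.items()}

print(np.allclose(full, direct + indirect))
```

With $L-l$ downstream layers the expansion has $2^{2(L-l)}$ terms, so enumerating paths explicitly is only practical for very small cases; the product form carries all of them implicitly.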

##### Readout\.

The readout defines how a local effect is expressed in vocabulary space, namely as a map $\mathbb{R}^{d_{m}}\to\mathbb{R}^{|V|}$. Concretely, the endpoint readouts correspond to the embedding and unembedding matrices: $\partial h_{\text{pre}}^{1}/\partial x=E$ at the input endpoint and $\partial y/\partial h_{\text{post}}^{L}=U^{\top}$ at the output endpoint.

For the backward readout in particular, we modify $E$: our goal is not to interpret the transported direction at $h_{\text{pre}}^{1}$ itself, but to identify which specific input-token change from the current token would realize it. This motivates adjusting the readout in terms of *token substitutions*. Specifically, if $e_{x}$ induces a direction $\Delta h_{\text{pre}}^{1}$ at the input endpoint, we choose the substitute token whose embedding shift $(e_{x^{\prime}}-e_{x})$ best aligns with $\Delta h_{\text{pre}}^{1}$. To implement this, we form a centered embedding matrix that represents substitution effects, $\widetilde{E}=E-e_{x}\mathbf{1}^{\top}$, where each column $\tilde{e}_{t}=e_{t}-e_{x}$ corresponds to the embedding change resulting from substituting the input token with $t$. We then normalize each centered vector to unit norm, yielding a centered-normalized matrix $\widehat{E}$ with columns $\hat{e}_{t}=\tilde{e}_{t}/\lVert\tilde{e}_{t}\rVert_{2}$. Reading out with $\widehat{E}$ therefore captures how well substituting each candidate token $t$ for the input $x$ matches the transported direction $\Delta h_{\text{pre}}^{1}$. (Because $x$ is discrete, no substitution is guaranteed to match $\Delta h_{\text{pre}}^{1}$; however, the features we interpret are empirically confirmed to activate on some tokens, so this is not a practical concern.)
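The centered-normalized readout can be sketched in a few lines. The embedding matrix, token indices, and the helper `centered_normalized` below are hypothetical; in particular, leaving the self-substitution column (where $t=x$ and $\tilde{e}_{t}=0$) at zero is an assumption made here for the sketch, since the text does not specify how that column is handled.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, vocab = 8, 12  # toy sizes (assumed)
E = rng.normal(size=(d_m, vocab))  # toy embedding matrix, one column per token

def centered_normalized(E, x):
    """Columns are unit-norm substitution directions (e_t - e_x) / ||e_t - e_x||."""
    e_x = E[:, [x]]                              # shape (d_m, 1), broadcasts over columns
    E_tilde = E - e_x                            # centered matrix E - e_x 1^T
    norms = np.linalg.norm(E_tilde, axis=0, keepdims=True)
    safe = np.where(norms > 0.0, norms, 1.0)     # column x is all zeros; keep it at zero
    return E_tilde / safe

x, target = 3, 7                                 # current token and a planted substitute
E_hat = centered_normalized(E, x)

# A transported direction that points exactly along the x -> target substitution.
delta_h = 2.5 * (E[:, target] - E[:, x])
scores = delta_h @ E_hat                         # alignment with each substitution
best = int(np.argmax(scores))

print(best)
```

Because every column of `E_hat` has unit norm, each score equals the norm of `delta_h` times the cosine between `delta_h` and the substitution direction, so tokens with large embedding norms are not favored merely for their magnitude.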

Table 1: Input and Output scores (%) across methods on four model/SAE configurations. Values are the mean score over layers along with their respective 95% confidence intervals. For each (configuration, score-type) the largest value is in bold.

### 3.4 Definition of Query Lens

We finally propose two variants of Query Lens (QL) by configuring the three elements of the residual dynamics.

##### Value Variant ($\text{QL}_{\textsc{Value}}$).

We score tokens by transporting the value feature to the output endpoint with the full stream transition and reading out through the unembedding:

$$s_{\textsc{Value}}\;=\;U^{\top}\!\left[\prod_{k>l}\bigl(I+J^{k}\bigr)\right]v_{i}^{\,l}\;\in\;\mathbb{R}^{|V|}.\tag{8}$$

We take the top-$k$ tokens under $s_{\textsc{Value}}$ as the tokens that the feature promotes at the output when activated, interpreting them as the feature’s *output-side* causal role.

##### Key Variant ($\text{QL}_{\textsc{Key}}$).

We score tokens by transporting the key feature to the input endpoint with the full stream transition and reading out in terms of token substitutions:

$$s_{\textsc{Key}}^{\top}\;=\;(k_{i}^{\,l})^{\top}\!\left[\prod_{k\leq l}\bigl(I+J^{k}\bigr)\right]\widehat{E}\;\in\;\mathbb{R}^{|V|}.$$

We take the top-$k$ tokens under $s_{\textsc{Key}}$ as the tokens that most increase the feature activation, interpreting them as the feature’s *input-side* causal role. We describe how both scores are computed efficiently in Appendix [G](https://arxiv.org/html/2606.07617#A7).
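One generic way to evaluate both scores without ever forming the $d_{m}\times d_{m}$ product is to apply the factors to the feature vector one at a time: from the right for the value variant, and from the left (via transposes) for the key variant. The sketch below illustrates this idea with random stand-in factors; it is an assumption made for illustration and is not claimed to be the procedure of Appendix G. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_m, vocab = 6, 10  # toy sizes (assumed)

def random_factors(n):
    """n stand-in factors (I + J^k), ordered from the earliest block to the latest."""
    return [np.eye(d_m) + 0.1 * rng.normal(size=(d_m, d_m)) for _ in range(n)]

upstream = random_factors(4)    # blocks with k <= l, used by the key variant
downstream = random_factors(6)  # blocks with k > l, used by the value variant
U = rng.normal(size=(d_m, vocab))
E_hat = rng.normal(size=(d_m, vocab))
E_hat /= np.linalg.norm(E_hat, axis=0, keepdims=True)  # unit-norm columns
k_vec = rng.normal(size=d_m)    # stand-in key feature
v_vec = rng.normal(size=d_m)    # stand-in value feature

def value_score(U, factors, v):
    """s_Value = U^T [prod (I + J^k)] v, applying the earliest factor to v first."""
    out = v
    for F in factors:
        out = F @ out
    return U.T @ out

def key_score(k, factors, E_hat):
    """s_Key = E_hat^T [prod (I + J^k)]^T k, applying the latest factor's transpose first."""
    out = k
    for F in reversed(factors):
        out = F.T @ out
    return E_hat.T @ out

def explicit_product(factors):
    """Reference: the full matrix product with later blocks on the left."""
    T = np.eye(d_m)
    for F in factors:
        T = F @ T
    return T

s_value = value_score(U, downstream, v_vec)
s_key = key_score(k_vec, upstream, E_hat)
top_value = np.argsort(-s_value)[:3]  # top-3 promoted token ids
top_key = np.argsort(-s_key)[:3]      # top-3 activating substitutions

print(top_value, top_key)
```

In an autodiff framework the same two loops correspond to a Jacobian-vector product (value side) and a vector-Jacobian product (key side), which avoids materializing any Jacobian at all.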

![Refer to caption](https://arxiv.org/html/2606.07617v1/x3.png)

![Refer to caption](https://arxiv.org/html/2606.07617v1/x4.png) (a) $I(T)$: GPT-2 Small
![Refer to caption](https://arxiv.org/html/2606.07617v1/x5.png) (b) $I(T)$: Gemma-3-270M
![Refer to caption](https://arxiv.org/html/2606.07617v1/x6.png) (c) $I(T)$: Gemma-3-1B
![Refer to caption](https://arxiv.org/html/2606.07617v1/x7.png) (d) $I(T)$: Qwen-3-1.7B
![Refer to caption](https://arxiv.org/html/2606.07617v1/x8.png) (e) $O(T)$: GPT-2 Small
![Refer to caption](https://arxiv.org/html/2606.07617v1/x9.png) (f) $O(T)$: Gemma-3-270M
![Refer to caption](https://arxiv.org/html/2606.07617v1/x10.png) (g) $O(T)$: Gemma-3-1B
![Refer to caption](https://arxiv.org/html/2606.07617v1/x11.png) (h) $O(T)$: Qwen-3-1.7B

Figure 2: Input (top row) and output (bottom row) scores by layer group for four model/SAE settings, comparing the Logit Lens ($\text{LL}_{\textsc{Key}}$, $\text{LL}_{\textsc{Value}}$) and Query Lens ($\text{QL}_{\textsc{Key}}$, $\text{QL}_{\textsc{Value}}$) variants from Table [1](https://arxiv.org/html/2606.07617#S3.T1). Each bar shows the per-method score for the indicated layer (or averaged within a layer group); error bars are 95% confidence intervals.

## 4 Experiments

In this section, we evaluate whether Query Lens better captures a feature’s causal behavior from both the input and output sides, compared to baselines including Logit Lens\.

### 4.1 Experimental Configuration

##### Features.

We analyze sparse dictionary features across four LLMs spanning three model families: GPT-2 Small (Radford et al., [2019](https://arxiv.org/html/2606.07617#bib.bib12)), Gemma-3 (270M and 1B) (Team, [2025a](https://arxiv.org/html/2606.07617#bib.bib42)), and Qwen-3-1.7B (Team, [2025b](https://arxiv.org/html/2606.07617#bib.bib47)). For GPT-2 Small, we use the OpenAI Top-K SAE (32K) (Gao et al., [2025](https://arxiv.org/html/2606.07617#bib.bib11)). For Gemma-3, we use Gemma Scope 2 JumpReLU SAEs (65K) (McDougall et al., [2025](https://arxiv.org/html/2606.07617#bib.bib43)). For Qwen-3-1.7B, we use the Qwen-Scope Top-K SAE (32K) (Deng et al., [2026](https://arxiv.org/html/2606.07617#bib.bib50)). Details of the specific model checkpoints and SAEs used in our experiments are listed in Appendix [A](https://arxiv.org/html/2606.07617#A1). For GPT-2 Small, we randomly sample 100 features per layer; for Gemma-3 and Qwen-3-1.7B, we randomly sample 100 features from each of four representative layers. Additional results (different SAE widths, model sizes, and transcoders (Dunefsky et al., [2024](https://arxiv.org/html/2606.07617#bib.bib30))) are reported in Appendix [D](https://arxiv.org/html/2606.07617#A4).

##### Baselines and Setup\.

We compare Query Lens against four baselines. To enable a direct comparison, we first define the following Logit Lens (LL) counterparts: $\text{LL}_{\textsc{Key}}$, which utilizes the key feature, the identity transition, and the embedding readout; and $\text{LL}_{\textsc{Value}}$, which employs the value feature, the identity transition, and the unembedding readout. Tuned Lens (TL) (Belrose et al., [2025](https://arxiv.org/html/2606.07617#bib.bib36)) extends $\text{LL}_{\textsc{Value}}$ by replacing the identity transition with a per-layer affine map $(A^{l}, b^{l})$ trained to transport intermediate hidden states toward the final-layer representation:

$$s_{\textsc{TL}} \;=\; U^{\top}\left(A^{l}\, v_{i}^{\,l} + b^{l}\right).$$

Zero-Out (ZO) and Token Change (TC) (Templeton et al., [2024](https://arxiv.org/html/2606.07617#bib.bib18); Gur-Arieh et al., [2025](https://arxiv.org/html/2606.07617#bib.bib17)) are ablation-style baselines that take a finite difference of the logits between two operating points $(a^{-}, a^{+})$ on $y = f(a)$:

$$s_{\textsc{ZO/TC}} \;=\; y\!\left(a_{\text{post}}^{l,i} = a^{+}\right) - y\!\left(a_{\text{post}}^{l,i} = a^{-}\right),$$

with $(a^{-}, a^{+}) = (0,\, a_{\text{clean}})$ for ZO and $(a^{-}, a^{+}) = (a_{\text{clean}},\, a_{\text{clamp}})$ for TC, where $a_{\text{clean}}$ is the original value of $a$; we report TC at three clamp settings, $a_{\text{clamp}} \in \{1, 5, 10\}$ (denoted $\text{TC}_{a=1}$, $\text{TC}_{a=5}$, $\text{TC}_{a=10}$).

In modeling how the output depends on a feature’s activation $a$ (or, on the key side, how the input drives $a$), each method diverges from QL as follows. LL and TL can be viewed as taking tangents of a simplified model whose stream transition is linearized (identity for LL and a learned affine map for TL). ZO and TC are secants between two operating points, scoring by the logit difference $y(a^{+}) - y(a^{-})$. QL takes a tangent, not a secant, of the true rather than a linearized transition. Appendix [B](https://arxiv.org/html/2606.07617#A2) gives a geometric view.
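The tangent/secant distinction can be illustrated on a toy scalar curve. The sketch below is not the paper's implementation: the function `f` is a made-up stand-in for the map from a feature activation to a single token's logit, and the tangent is approximated by a central difference rather than by automatic differentiation.

```python
import math


def f(a: float) -> float:
    """Hypothetical nonlinear map from a feature activation to one logit."""
    return 2.0 * math.tanh(0.5 * a)


def secant(fn, a_minus: float, a_plus: float) -> float:
    """Finite difference of the output between two operating points."""
    return fn(a_plus) - fn(a_minus)


def tangent(fn, a: float, eps: float = 1e-5) -> float:
    """Local slope at a, estimated here with a central difference."""
    return (fn(a + eps) - fn(a - eps)) / (2.0 * eps)


a_clean = 1.0  # assumed original activation of the feature

# Zero-Out: (a-, a+) = (0, a_clean)
zero_out = secant(f, 0.0, a_clean)

# Token Change: (a-, a+) = (a_clean, a_clamp) for each clamp setting
token_change = {c: secant(f, a_clean, c) for c in (1.0, 5.0, 10.0)}

# Tangent of the (toy) true transition at the clean operating point
slope = tangent(f, a_clean)
```

In this toy setting the TC score at `a_clamp = 1` is exactly zero because the clamp coincides with `a_clean`, while the larger clamps saturate. This is one way to picture why a secant score depends on the chosen operating points and a tangent does not.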

For each method, we denote by $T$ the set of top-$k$ tokens it identifies for a given feature, and we fix $k = 25$ throughout. Further implementation details are provided in Appendix [C](https://arxiv.org/html/2606.07617#A3).

### 4.2 Quantitative Evaluation

##### Input Score\.

To test whether the tokens identified for each feature by a given method correspond to inputs that strongly activate the feature, we define the *input score* as follows. For each feature, we collect a set of sentences $S$ (with $|S| \geq 20$) where it exhibits non-zero activation on at least one token (Bills et al., [2023](https://arxiv.org/html/2606.07617#bib.bib20); Choi et al., [2024](https://arxiv.org/html/2606.07617#bib.bib22); Paulo et al., [2025](https://arxiv.org/html/2606.07617#bib.bib21)). For each sentence in $S$, we select the most activated token and define $A$ as the resulting set across $S$. The input score $I(T)$ is the fraction of tokens in $A$ appearing in $T$:

$$I(T) \;=\; \frac{\left|\{t \in A \mid t \in T\}\right|}{|A|}.$$

Intuitively, a high $I(T)$ means that the method’s proposed tokens are themselves the ones that strongly activate the feature in natural inputs.
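As a minimal sketch of this definition, the input score reduces to a set overlap. The function name and the example tokens are illustrative only, and treating $A$ as a set (so repeated tokens collapse) follows the wording above rather than any released code.

```python
def input_score(top_tokens, most_activated):
    """I(T): fraction of the tokens in A that also appear in T."""
    T = set(top_tokens)
    A = set(most_activated)  # duplicates across sentences collapse
    if not A:
        raise ValueError("A must contain at least one token")
    return len(A & T) / len(A)


# Hypothetical data: the most activated token from each sentence in S.
most_activated_per_sentence = [" Paris", " London", " Paris", " Tokyo", " the"]
# Hypothetical top-k tokens proposed by some interpretation method.
method_top_tokens = [" Paris", " Tokyo", " Berlin"]

score = input_score(method_top_tokens, most_activated_per_sentence)
```

Here $A$ has four distinct tokens, two of which appear in $T$, so the score is 0.5.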

##### Output Score\.

To evaluate whether the tokens identified for each feature by a given method correspond to tokens that the feature causally promotes in generation, we define the *output score* as follows. We construct a set of $N = 100$ neutral prefixes $\mathcal{P} = \{p_{j}\}_{j=1}^{N}$ (e.g., *“Findings show that”*). For each prefix $p \in \mathcal{P}$, we obtain two next-token distributions: a *clean* distribution without intervention, and a *steered* distribution in which we clamp the feature post-activation to a value $\alpha$ drawn from a set $\mathcal{A}$ of steering strengths. For each token $t \in V$, let $\rho^{p}_{\text{clean}}(t)$ and $\rho^{p}_{\alpha}(t) \in [0,1]$ denote its rank percentile under the clean and steered next-token distributions. The per-token rank-percentile delta induced by steering is

$$\Delta\rho(t) \;=\; \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \frac{1}{|\mathcal{A}|} \sum_{\alpha \in \mathcal{A}} \bigl(\rho^{p}_{\alpha}(t) - \rho^{p}_{\text{clean}}(t)\bigr),$$

and we let $S$ be the top-$25$ tokens by $\Delta\rho$, i.e., those most promoted by steering. The output score $O(T)$ is defined as the fraction of tokens in $S$ appearing in $T$:

$$O(T) \;=\; \frac{\left|\{\,t \in S \mid t \in T\,\}\right|}{|S|}.$$

Intuitively, a high $O(T)$ means that the method’s proposed tokens are themselves the ones the feature promotes during generation. See Appendix [I](https://arxiv.org/html/2606.07617#A9) for further evaluation details.
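The output score can be sketched in a few lines, under stated assumptions. The rank-percentile convention (lowest logit maps to 0, highest to 1, ties broken by token id) is our assumption, since the exact definition is deferred to the appendix. The function names, the nested-dictionary layout, and the four-token vocabulary are all hypothetical.

```python
def rank_percentiles(logits):
    """Token id -> rank percentile in [0, 1]; the highest logit gets 1.0."""
    order = sorted(range(len(logits)), key=lambda t: logits[t])
    denom = max(len(logits) - 1, 1)
    return {t: rank / denom for rank, t in enumerate(order)}


def delta_rho(clean_logits, steered_logits):
    """Mean rank-percentile change over prefixes p and strengths alpha.

    clean_logits[p] is a list of logits; steered_logits[p][alpha] likewise.
    """
    vocab = len(next(iter(clean_logits.values())))
    total = [0.0] * vocab
    for p, clean in clean_logits.items():
        rho_clean = rank_percentiles(clean)
        per_alpha = steered_logits[p]
        for steered in per_alpha.values():
            rho_alpha = rank_percentiles(steered)
            for t in range(vocab):
                total[t] += (rho_alpha[t] - rho_clean[t]) / len(per_alpha)
    return [x / len(clean_logits) for x in total]


def output_score(top_tokens, deltas, k=25):
    """O(T): fraction of the k most promoted tokens that appear in T."""
    ranked = sorted(range(len(deltas)), key=lambda t: deltas[t], reverse=True)
    promoted = set(ranked[:k])
    return len(promoted & set(top_tokens)) / len(promoted)


# Toy data: one prefix, two steering strengths, vocabulary of four token ids.
clean = {"p0": [3.0, 2.0, 1.0, 0.0]}
steered = {"p0": {1.0: [3.0, 2.0, 1.0, 4.0], 5.0: [0.0, 2.0, 1.0, 9.0]}}

deltas = delta_rho(clean, steered)
score = output_score([3], deltas, k=1)  # the paper fixes k = 25
```

In the toy data, steering lifts token 3 from the bottom to the top of the ranking at both strengths, so its delta is 1 and a method that proposes token 3 gets a perfect score at `k = 1`.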

##### Results\.

Table [1](https://arxiv.org/html/2606.07617#S3.T1) reports input and output scores averaged across features for each model and SAE. Across all configurations, $\text{QL}_{\textsc{Key}}$ achieves the best $I(T)$ and $\text{QL}_{\textsc{Value}}$ achieves the best $O(T)$. Two factors explain this pattern. (1) Faithfulness: On each side, the QL variant outperforms its LL counterpart, confirming that modeling indirect effects contributes to more faithful interpretations. (2) Completeness: $\text{QL}_{\textsc{Key}}$ achieves the highest $I(T)$ but trails on $O(T)$, while $\text{QL}_{\textsc{Value}}$ shows the opposite, so no single variant is best on both sides. This confirms that designing dedicated key and value variants is effective for a complete interpretation of a feature’s causal roles. Figure [4](https://arxiv.org/html/2606.07617#S4.F4) illustrates examples of QL’s tokens and the actual model behavior they explain.

Figure [2](https://arxiv.org/html/2606.07617#S3.F2) breaks both scores down by layer group. We observe two patterns. First, both $I(T)$ and $O(T)$ decrease as the feature moves farther from its corresponding endpoint (later layers for $I(T)$, earlier layers for $O(T)$). Longer stream transitions involve more intervening modules, producing larger indirect effects that form a bottleneck for predicting the feature’s causal effect. Second, QL’s relative gain over LL widens as the feature moves farther from the endpoint. This confirms QL’s design: it captures the indirect effects that LL misses, and its contribution scales with the size of those effects. Appendix [E](https://arxiv.org/html/2606.07617#A5) provides an analysis of how the impact of indirect effects varies with layer depth.

TL underperforms QL across all settings. We attribute this to a distributional mismatch: the learned map is trained on full residual stream vectors $h^{l}$, which are dense combinations of many concurrently active features. A single SAE feature vector $v_{i}^{\,l}$ has a different norm and structure, so the map does not generalize reliably to such inputs. For TC, the output score can be competitive at certain clamp values, but the best clamp varies across settings, and within a single setting the score remains sensitive to the clamp choice.

Table 2: Interpretability scores (mean $\pm$ 95% CI half-width) of token-set explanations. Within each Key ($\text{LL}_{\textsc{Key}}$ vs. $\text{QL}_{\textsc{Key}}$) and Value ($\text{LL}_{\textsc{Value}}$ vs. $\text{QL}_{\textsc{Value}}$) pair, the larger value is in bold.

![Refer to caption](https://arxiv.org/html/2606.07617v1/x12.png) (a) GPT-2 Small
![Refer to caption](https://arxiv.org/html/2606.07617v1/x13.png) (b) Gemma-3-1B

Figure 3: Qualitative examples of Logit Lens and Query Lens top tokens with their interpretability scores on GPT-2 Small and Gemma-3-1B. The higher score per block is shown in bold.

![Refer to caption](https://arxiv.org/html/2606.07617v1/x14.png)

Figure 4: Qualitative examples showing that $\text{QL}_{\textsc{Key}}$ tokens (yellow) match the inputs that activate the feature (i.e., high $I(T)$) and $\text{QL}_{\textsc{Value}}$ tokens (green) match the outputs promoted by steering (i.e., high $O(T)$), on GPT-2 Small and Gemma-3-1B. Steering examples are prefixed with the steering factor $\alpha$ and the base prompt.

### 4.3 Qualitative Evaluation

We evaluate whether token sets identified by Query Lens form more human-interpretable semantic units than those from Logit Lens. We prompt GPT-5-nano (OpenAI, [2025](https://arxiv.org/html/2606.07617#bib.bib38)) to assign an interpretability score (0–10) to each feature’s top-$k$ tokens based on how concentrated they are around a single coherent theme; see Appendix [K](https://arxiv.org/html/2606.07617#A11) for the full prompt.

Table [2](https://arxiv.org/html/2606.07617#S4.T2) shows that Query Lens consistently achieves higher interpretability scores than Logit Lens, indicating that its token sets converge more reliably to coherent semantic descriptions. Figure [3](https://arxiv.org/html/2606.07617#S4.F3) illustrates this pattern with examples: Logit Lens top tokens are dominated by tokenizer fragments, while Query Lens top tokens form a single recognizable concept, with interpretability scores rising accordingly.

## 5 Subspace Channel Hypothesis

In this section, we conduct a focused analysis of module\-mediated effects captured by Query Lens\.

##### Intuition\.

Beyond feature interpretation, Query Lens affords the following *query–response* view: along the residual stream, a feature acts as a *query* to each downstream module, which writes back its *response*. To make this concrete, we simplify the stream transition by keeping only the identity and first-order Jacobian terms:

$$s_{\textsc{Value}} \;\approx\; U^{\top}\!\left(v_{i}^{\,l} \;+\; \sum_{k>l} J^{k} v_{i}^{\,l}\right).$$

Each Jacobian term then represents a single-hop interaction between the feature and a downstream module: the value feature $v_{i}^{\,l}$ is the *query*, while each Jacobian–vector product (JVP) $r_{i}^{\,l\rightarrow k} = J^{k} v_{i}^{\,l} \in \mathbb{R}^{d_{\mathrm{m}}}$ is the *module response* produced by the module at layer $k$.
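To make the first-order expression concrete, here is a small numeric sketch with hand-written matrices. Everything in it is illustrative: the two-dimensional stream, the three-token vocabulary, the explicit Jacobians, and the helper names are assumptions. A real model would obtain each $J^{k} v$ through automatic differentiation rather than by materializing $J^{k}$.

```python
def matvec(M, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def first_order_value_scores(U_T, v, jacobians):
    """U^T (v + sum_k J^k v): direct term plus one-hop module responses."""
    responses = [matvec(J, v) for J in jacobians]  # r^{l->k} = J^k v
    total = list(v)
    for r in responses:
        total = [a + b for a, b in zip(total, r)]
    return matvec(U_T, total), responses


# Hypothetical toy setup: d_m = 2, vocabulary of 3 tokens.
v = [1.0, 0.0]                                # value feature (the query)
J_a = [[0.0, 0.0], [2.0, 0.0]]                # module that rewrites into dim 1
J_b = [[0.5, 0.0], [0.0, 0.0]]                # module that amplifies dim 0
U_T = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # rows are unembedding vectors

direct = matvec(U_T, v)                       # identity-only (Logit Lens) readout
scores, responses = first_order_value_scores(U_T, v, [J_a, J_b])
```

In this toy case token 1 scores zero under the direct readout but 2.0 once the response of the first module is added, which shows how a token promoted only through downstream modules would be missed by the identity transition.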

##### Hypothesis\.

Motivated by the observation that modules respond differently to the same feature, we hypothesize that downstream modules read a query $v_{i}^{\,l}$ *selectively* through module-specific low-dimensional subspaces. In other words, for a pair of layers $(l, k)$ with $k > l$, we posit that the response at layer $k$ is primarily determined by the projection of $v_{i}^{\,l}$ onto a subspace associated with the downstream module, rather than by the full feature direction. We refer to this module-specific subspace as a *channel*, and term this the Subspace Channel Hypothesis.

![Refer to caption](https://arxiv.org/html/2606.07617v1/x15.png) (a) Distribution of $\mathrm{OL}$, grouped by whether the pair shares the same destination layer. $n$ is the number of pairs per group.
![Refer to caption](https://arxiv.org/html/2606.07617v1/x16.png) (b) $\mathrm{OL}$ as a function of the distance $|k_{1} - k_{2}|$ between destination layers, for pairs that share a source layer.

Figure 5: Overlap statistics for the learned maps $\{W^{l\rightarrow k}\}$. (a) Pairs sharing a destination layer cluster at higher $\mathrm{OL}$ than those with different destinations. (b) Among pairs sharing a source layer, overlap decays with the distance between destination layers.

##### Experiments.

To verify our hypothesis, we conduct an experiment according to the following procedure:

1. Sample queries. For each layer $l$ of GPT-2 Small ($L = 12$), we sample $N = 1000$ value features $\{v_{i}^{\,l}\}_{i=1}^{N}$.
2. Compute module responses. For each sampled feature $v_{i}^{\,l}$, we compute its module response $r_{i}^{\,l\rightarrow k}$ at all downstream layers $k > l$. Across the model, there are $\binom{L}{2} = 66$ layer pairs $(l, k)$, and for each pair we obtain $N$ $(\text{query}, \text{response})$ pairs by evaluating the responses induced by the $N$ sampled queries at layer $l$.
3. Learn low-rank maps from queries to responses. For each layer pair $(l, k)$, we fit a linear map from queries at layer $l$ to responses at layer $k$:

   $$r_{i}^{\,l\rightarrow k} \;\approx\; W^{l\rightarrow k} v_{i}^{\,l}, \qquad W^{l\rightarrow k} \in \mathbb{R}^{d_{\mathrm{m}} \times d_{\mathrm{m}}}.$$

   To enforce low dimensionality, each map is constrained to be low-rank via a LoRA-style factorization, $W^{l\rightarrow k} = B^{l\rightarrow k} A^{l\rightarrow k}$. We use $A^{l\rightarrow k} \in \mathbb{R}^{r \times d_{\mathrm{m}}}$ and $B^{l\rightarrow k} \in \mathbb{R}^{d_{\mathrm{m}} \times r}$ with $r = d_{\mathrm{m}} / n_{\mathrm{layers}} = 64$. This factorization restricts the map to select an effective $r$-dimensional basis that best predicts downstream JVPs from the query feature directions. Training details are in Appendix [J.1](https://arxiv.org/html/2606.07617#A10.SS1).
4. Compare learned maps. We consider all pairs $\{(W_{i}, W_{j})\}_{1 \leq i < j \leq |\mathcal{W}|}$ from the learned matrices $\mathcal{W} = \{W^{l\rightarrow k}\}$, and measure the overlap between their column spaces, denoted by $\mathrm{OL}$ (short for *overlap*). Let $Q_{i} \in \mathbb{R}^{d_{\mathrm{m}} \times r}$ and $Q_{j} \in \mathbb{R}^{d_{\mathrm{m}} \times r}$ be orthonormal basis matrices for the column spaces of $W_{i}$ and $W_{j}$. We define

   $$\mathrm{OL}(W_{i}, W_{j}) \;=\; \frac{1}{r} \sum_{p=1}^{r} \cos^{2}\theta_{p} \;=\; \frac{1}{r} \left\| Q_{i}^{\top} Q_{j} \right\|_{F}^{2},$$

   where $\{\theta_{p}\}_{p=1}^{r}$ are the principal angles between the two $r$-dimensional subspaces. This measures the degree of sharing channels between two distinct feature readings.
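The overlap in step 4 can be sketched without any linear-algebra library, assuming each subspace is given by a list of spanning column vectors. The Gram-Schmidt helper and the four-dimensional examples are illustrative choices, not the paper's code. For a factorized map $W = BA$ with $A$ of full row rank, the columns of $B$ would serve as the spanning set, since they span the same column space as $W$.

```python
def orthonormal_basis(columns, tol=1e-10):
    """Modified Gram-Schmidt; drops (near-)linearly-dependent columns."""
    basis = []
    for col in columns:
        w = [float(x) for x in col]
        for q in basis:
            proj = sum(a * b for a, b in zip(q, w))
            w = [wi - proj * qi for wi, qi in zip(w, q)]
        norm = sum(x * x for x in w) ** 0.5
        if norm > tol:
            basis.append([x / norm for x in w])
    return basis


def overlap(cols_i, cols_j):
    """OL = (1/r) * ||Q_i^T Q_j||_F^2 for two r-dimensional column spaces."""
    Qi, Qj = orthonormal_basis(cols_i), orthonormal_basis(cols_j)
    if len(Qi) != len(Qj) or not Qi:
        raise ValueError("both subspaces must have the same nonzero dimension")
    r = len(Qi)
    frob_sq = sum(
        sum(a * b for a, b in zip(qi, qj)) ** 2 for qi in Qi for qj in Qj
    )
    return frob_sq / r


# Toy subspaces of R^4 with r = 2 (hypothetical data).
span_a = [[1, 0, 0, 0], [0, 1, 0, 0]]
span_same = [[1, 1, 0, 0], [1, -1, 0, 0]]   # same plane, different basis
span_half = [[1, 0, 0, 0], [0, 0, 1, 0]]    # shares one direction with span_a
span_orth = [[0, 0, 1, 0], [0, 0, 0, 1]]    # orthogonal to span_a

ol_same = overlap(span_a, span_same)
ol_half = overlap(span_a, span_half)
ol_orth = overlap(span_a, span_orth)
```

The three cases give overlaps of 1, 0.5, and 0 (up to floating-point error). The first case shows that the measure depends on the subspace rather than on the particular basis.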

##### Result\.

Figure [5(a)](https://arxiv.org/html/2606.07617#S5.F5.sf1) summarizes pairwise subspace overlap among the learned maps $\{W^{l\rightarrow k}\}$, grouped by whether two maps share the same destination layer $k$. In particular, pairs that share the target layer (e.g., $W^{0\rightarrow 5}$ and $W^{2\rightarrow 5}$) concentrate at substantially higher $\mathrm{OL}$ values, whereas pairs with different targets are skewed toward low overlap. This indicates that the column space of $W^{l\rightarrow k}$, i.e., the low-dimensional *channel* for feature reading, is consistent across different source layers $l$ for a fixed $k$. Conversely, when the consuming module changes (i.e., different $k$), the channels overlap far less, suggesting that each module reads queries through its own subspace.

We further examine how channel similarity depends on destination distance (Fig. [5(b)](https://arxiv.org/html/2606.07617#S5.F5.sf2)). For a fixed source layer $l$, we calculate the overlap between two maps, $\mathrm{OL}(W^{l\rightarrow k_{1}}, W^{l\rightarrow k_{2}})$, and aggregate the results by the distance between target layers, i.e., $|k_{1} - k_{2}|$. We find that pairwise overlap decays as the distance between destination layers grows. This indicates that the channel is shared more between nearby modules and becomes increasingly distinct with distance.

## 6 Related Work

##### Feature Interpretation\.

Approaches to interpreting SAE features broadly fall into two main lines. Activation-based methods run the model on large corpora and use the inputs that strongly activate a feature to summarize its meaning (Bills et al., [2023](https://arxiv.org/html/2606.07617#bib.bib20); Bricken et al., [2023](https://arxiv.org/html/2606.07617#bib.bib19); Choi et al., [2024](https://arxiv.org/html/2606.07617#bib.bib22); Paulo et al., [2025](https://arxiv.org/html/2606.07617#bib.bib21); Templeton et al., [2024](https://arxiv.org/html/2606.07617#bib.bib18)). Parameter-based methods, in contrast, read meaning directly from learned weights, most commonly by projecting SAE decoder vectors into vocabulary space via Logit Lens (Nostalgebraist, [2020](https://arxiv.org/html/2606.07617#bib.bib37); Bloom and Lin, [2024](https://arxiv.org/html/2606.07617#bib.bib39)). The two lines differ not only in computational style (corpus collection vs. static projection) but also in what they observe: the former reveals which inputs activate a feature, while the latter reveals which outputs it promotes (Gur-Arieh et al., [2025](https://arxiv.org/html/2606.07617#bib.bib17); Arad et al., [2025](https://arxiv.org/html/2606.07617#bib.bib41)). Parameter-based methods, however, are not intrinsically restricted to the output side, as the key–value memory view of MLPs shows (Geva et al., [2021](https://arxiv.org/html/2606.07617#bib.bib9), [2022](https://arxiv.org/html/2606.07617#bib.bib40); Dar et al., [2023](https://arxiv.org/html/2606.07617#bib.bib10)). Yet SAE feature interpretation has kept this split tied to methodology (activation-based for input, parameter-based for output), and Query Lens closes this gap by deriving both interpretations from analogous formulations over the SAE’s encoder and decoder directions.

Other extensions of the parameter-based work build on Logit Lens with different goals: Belrose et al. ([2025](https://arxiv.org/html/2606.07617#bib.bib36)) improve calibration with per-layer affine maps (*Tuned Lens*); Katz et al. ([2024](https://arxiv.org/html/2606.07617#bib.bib44)) project gradients to interpret learning dynamics (*Backward Lens*); Hernandez et al. ([2024](https://arxiv.org/html/2606.07617#bib.bib45)) decode relation-specific attributes via linear relational embeddings (*Attribute Lens*); and Pal et al. ([2023](https://arxiv.org/html/2606.07617#bib.bib57)) anticipate tokens several positions ahead from a single hidden state (*Future Lens*). Query Lens directs this extension toward faithful interpretation of SAE features by accounting for the indirect effects.

##### Layer Communication\.

Elhage et al. ([2021](https://arxiv.org/html/2606.07617#bib.bib15)) conceptualized the residual stream as a high-dimensional *communication channel*: each layer reads from and writes to low-rank subspaces of the stream, and the effective interaction between any two layers is captured by their *virtual weights*. Merullo et al. ([2024](https://arxiv.org/html/2606.07617#bib.bib54)) concretized this view at the level of attention heads, applying SVD to their QK/OV weight matrices to identify low-rank channels through which earlier heads pass information to later ones. The Subspace Channel Hypothesis extends this picture from attention heads to layers (attention and MLP), taking a functional view of how each downstream layer responds to an SAE feature.

## 7 Conclusion

We introduce Query Lens, a parameter-based method for SAE feature interpretation that improves Logit Lens along two dimensions: it spans both encoder and decoder directions to characterize input- and output-side causality, and it models the indirect, module-mediated effects of features in the residual stream. Across settings, Query Lens produces more interpretable token signatures than Logit Lens and prior baselines. Building on this framework, we propose the Subspace Channel Hypothesis and provide evidence that downstream modules read features through layer-specific low-rank subspaces. An open question that Query Lens raises is what function querying achieves: why features develop in such a form, and how it shapes model behavior. The discovered channels also open the door to practical applications such as model editing (Meng et al., [2022](https://arxiv.org/html/2606.07617#bib.bib46)).

## Impact Statement

This paper presents work whose goal is to advance the field of mechanistic interpretability\. While progress in machine learning can have broad societal consequences, we believe that research aimed at improving understanding of ML models is unlikely to introduce additional harms beyond those already associated with the underlying models, and instead provides tools that can help identify, anticipate, and avert such harms\.

## Acknowledgements

This work was supported by Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korea government \(MSIT\) \(No\. RS\-2020\-II201373, Artificial Intelligence Graduate School Program \(Hanyang University\)\)\. This work was supported by Institute of Information & Communications Technology Planning & Evaluation \(IITP\) under the artificial intelligence semiconductor support program to nurture the best talents \(IITP\-\(2026\)\-RS\-2023\-00253914\) grant funded by the Korea government \(MSIT\)\. This work was supported by the National Research Foundation of Korea \(NRF\) grant funded by the Korea government \(MSIT\) \(RS\-2025\-00558151\)\.

## References

- D\. Arad, A\. Mueller, and Y\. Belinkov \(2025\)SAEs are good for steering – if you select the right features\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 10241–10259\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.519/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.519),ISBN 979\-8\-89176\-332\-6Cited by:[§I\.2](https://arxiv.org/html/2606.07617#A9.SS2.SSS0.Px2.p1.1),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- A\. Bau, Y\. Belinkov, H\. Sajjad, N\. Durrani, F\. Dalvi, and J\. Glass \(2019\)Identifying and controlling important neurons in neural machine translation\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=H1z-PsR5KX)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p1.1)\.
- N\. Belrose, I\. Ostrovsky, L\. McKinney, Z\. Furman, L\. Smith, D\. Halawi, S\. Biderman, and J\. Steinhardt \(2025\)Eliciting latent predictions from transformers with the tuned lens\.External Links:2303\.08112,[Link](https://arxiv.org/abs/2303.08112)Cited by:[Appendix C](https://arxiv.org/html/2606.07617#A3.SS0.SSS0.Px2.p1.6),[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px2.p1.4),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p2.1)\.
- Y\. Bengio, N\. Léonard, and A\. Courville \(2013\)Estimating or propagating gradients through stochastic neurons for conditional computation\.External Links:1308\.3432,[Link](https://arxiv.org/abs/1308.3432)Cited by:[Appendix F](https://arxiv.org/html/2606.07617#A6.SS0.SSS0.Px2.p1.1)\.
- S\. Bills, N\. Cammarata, D\. Mossing, H\. Tillman, L\. Gao, G\. Goh, I\. Sutskever, J\. Leike, J\. Wu, and W\. Saunders \(2023\)Language models can explain neurons in language models\.Note:[https://openaipublic\.blob\.core\.windows\.net/neuron\-explainer/paper/index\.html](https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p2.1),[§4\.2](https://arxiv.org/html/2606.07617#S4.SS2.SSS0.Px1.p1.8),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- J\. Bloom and J\. Lin \(2024\)Understanding sae features with the logit lens\.Note:[https://www\.lesswrong\.com/posts/qykrYY6rXXM7EEs8Q/understanding\-sae\-features\-with\-the\-logit\-lens](https://www.lesswrong.com/posts/qykrYY6rXXM7EEs8Q/understanding-sae-features-with-the-logit-lens)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p3.1),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- T\. Bricken, A\. Templeton, J\. Batson, B\. Chen, A\. Jermyn, T\. Conerly, N\. Turner, C\. Anil, C\. Denison, A\. Askell, R\. Lasenby, Y\. Wu, S\. Kravec, N\. Schiefer, T\. Maxwell, N\. Joseph, Z\. Hatfield\-Dodds, A\. Tamkin, K\. Nguyen, B\. McLean, J\. E\. Burke, T\. Hume, S\. Carter, T\. Henighan, and C\. Olah \(2023\)Towards monosemanticity: decomposing language models with dictionary learning\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2023/monosemantic-features/index.html)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.07617#S2.SS2.p1.1),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- D\. Choi, V\. Huang, K\. Meng, D\. D\. Johnson, J\. Steinhardt, and S\. Schwettmann \(2024\)Scaling automatic neuron description\.Note:[https://transluce\.org/neuron\-descriptions](https://transluce.org/neuron-descriptions)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p2.1),[§4\.2](https://arxiv.org/html/2606.07617#S4.SS2.SSS0.Px1.p1.8),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- D\. Dai, L\. Dong, Y\. Hao, Z\. Sui, B\. Chang, and F\. Wei \(2022\)Knowledge neurons in pretrained transformers\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 8493–8502\.External Links:[Link](https://aclanthology.org/2022.acl-long.581/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.581)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p1.1)\.
- G\. Dar, M\. Geva, A\. Gupta, and J\. Berant \(2023\)Analyzing transformers in embedding space\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 16124–16170\.External Links:[Link](https://aclanthology.org/2023.acl-long.893/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.893)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.07617#S3.SS1.p3.1),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- B\. Deng, X\. Wang, Y\. Wang, Y\. Wan, Y\. Ma, B\. Yang, H\. Wei, J\. Tang, H\. Lin, R\. Gao, T\. Li, Q\. Cao, X\. Ren, X\. Deng, A\. Yang, F\. Huang, D\. Liu, and J\. Zhou \(2026\)Qwen\-Scope: turning sparse features into development tools for large language models\.External Links:2605\.11887,[Link](https://arxiv.org/abs/2605.11887)Cited by:[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Dunefsky, P\. Chlenski, and N\. Nanda \(2024\)Transcoders find interpretable LLM feature circuits\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=J6zHcScAo0)Cited by:[§D\.2](https://arxiv.org/html/2606.07617#A4.SS2.SSS0.Px1.p1.2),[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen, R\. Grosse, S\. McCandlish, J\. Kaplan, D\. Amodei, M\. Wattenberg, and C\. Olah \(2022\)Toy models of superposition\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2022/toy_model/index.html)Cited by:[§2\.2](https://arxiv.org/html/2606.07617#S2.SS2.p1.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2021\)A mathematical framework for transformer circuits\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2021/framework/index.html)Cited by:[§2\.1](https://arxiv.org/html/2606.07617#S2.SS1.p1.2),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px2.p1.1)\.
- L\. Gao, S\. Biderman, S\. Black, L\. Golding, T\. Hoppe, C\. Foster, J\. Phang, H\. He, A\. Thite, N\. Nabeshima, S\. Presser, and C\. Leahy \(2020\)The pile: an 800gb dataset of diverse text for language modeling\.External Links:2101\.00027,[Link](https://arxiv.org/abs/2101.00027)Cited by:[Appendix C](https://arxiv.org/html/2606.07617#A3.SS0.SSS0.Px1.p1.2)\.
- L\. Gao, T\. D\. la Tour, H\. Tillman, G\. Goh, R\. Troll, A\. Radford, I\. Sutskever, J\. Leike, and J\. Wu \(2025\)Scaling and evaluating sparse autoencoders\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=tcsZt9ZNKD)Cited by:[§D\.1](https://arxiv.org/html/2606.07617#A4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- M\. Geva, A\. Caciularu, K\. Wang, and Y\. Goldberg \(2022\)Transformer feed\-forward layers build predictions by promoting concepts in the vocabulary space\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 30–45\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.3/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.3)Cited by:[§3\.1](https://arxiv.org/html/2606.07617#S3.SS1.p2.1),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\)Transformer feed\-forward layers are key\-value memories\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 5484–5495\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.446/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.446)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p4.1),[§2\.3](https://arxiv.org/html/2606.07617#S2.SS3.p1.5),[§3\.1](https://arxiv.org/html/2606.07617#S3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2606.07617#S3.SS1.p3.1),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- Y\. Gur\-Arieh, R\. Mayan, C\. Agassy, A\. Geiger, and M\. Geva \(2025\)Enhancing automated interpretability with output\-centric feature descriptions\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 5757–5778\.External Links:[Link](https://aclanthology.org/2025.acl-long.288/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.288),ISBN 979\-8\-89176\-251\-0Cited by:[§I\.2](https://arxiv.org/html/2606.07617#A9.SS2.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2606.07617#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px2.p1.6),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- E\. Hernandez, A\. S\. Sharma, T\. Haklay, K\. Meng, M\. Wattenberg, J\. Andreas, Y\. Belinkov, and D\. Bau \(2024\)Linearity of relation decoding in transformer language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=w7LU2s14kE)Cited by:[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p2.1)\.
- R\. Huben, H\. Cunningham, L\. R\. Smith, A\. Ewart, and L\. Sharkey \(2024\)Sparse autoencoders find highly interpretable features in language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=F76bwRSLeK)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p1.1)\.
- E\. Jang, S\. Gu, and B\. Poole \(2017\)Categorical reparameterization with gumbel\-softmax\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=rkE3y85ee)Cited by:[Appendix F](https://arxiv.org/html/2606.07617#A6.SS0.SSS0.Px2.p1.1)\.
- S\. Katz, Y\. Belinkov, M\. Geva, and L\. Wolf \(2024\)Backward lens: projecting language model gradients into the vocabulary space\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 2390–2422\.External Links:[Link](https://aclanthology.org/2024.emnlp-main.142/),[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-main.142)Cited by:[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p2.1)\.
- J\. Lin \(2023\)Neuronpedia: interactive reference and tooling for analyzing neural networks\.Note:Software available from neuronpedia\.orgExternal Links:[Link](https://www.neuronpedia.org/)Cited by:[§I\.1](https://arxiv.org/html/2606.07617#A9.SS1.p1.3)\.
- C\. McDougall, A\. Conmy, J\. Kramár, T\. Lieberum, S\. Rajamanoharan, and N\. Nanda \(2025\)Gemma scope 2\.Note:[https://huggingface\.co/google/gemma\-scope\-2](https://huggingface.co/google/gemma-scope-2)Technical report, Google DeepMindCited by:[§D\.1](https://arxiv.org/html/2606.07617#A4.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- K\. Meng, D\. Bau, A\. J\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.InAdvances in Neural Information Processing Systems,A\. H\. Oh, A\. Agarwal, D\. Belgrave, and K\. Cho \(Eds\.\),External Links:[Link](https://openreview.net/forum?id=-h6WAS6eE4)Cited by:[§7](https://arxiv.org/html/2606.07617#S7.p1.1)\.
- J\. Merullo, C\. Eickhoff, and E\. Pavlick \(2024\)Talking heads: understanding inter\-layer communication in transformer language models\.InThe Thirty\-eighth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=LUsx0chTsL)Cited by:[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px2.p1.1)\.
- J\. Mu and J\. Andreas \(2021\)Compositional explanations of neurons\.External Links:2006\.14032,[Link](https://arxiv.org/abs/2006.14032)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p1.1)\.
- Nostalgebraist \(2020\)Interpreting GPT: the logit lens\.External Links:[Link](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p3.1),[§3\.1](https://arxiv.org/html/2606.07617#S3.SS1.p1.4),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2025\)GPT\-5 system card\.openai\.com/index/gpt\-5\-system\-card\.Cited by:[§I\.2](https://arxiv.org/html/2606.07617#A9.SS2.SSS0.Px2.p1.1),[§4\.3](https://arxiv.org/html/2606.07617#S4.SS3.p1.3)\.
- K\. Pal, J\. Sun, A\. Yuan, B\. Wallace, and D\. Bau \(2023\)Future lens: anticipating subsequent tokens from a single hidden state\.InProceedings of the 27th Conference on Computational Natural Language Learning \(CoNLL\),J\. Jiang, D\. Reitter, and S\. Deng \(Eds\.\),Singapore,pp\. 548–560\.External Links:[Link](https://aclanthology.org/2023.conll-1.37/),[Document](https://dx.doi.org/10.18653/v1/2023.conll-1.37)Cited by:[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p2.1)\.
- K\. Park, Y\. J\. Choe, Y\. Jiang, and V\. Veitch \(2025\)The geometry of categorical and hierarchical concepts in large language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=bVTM2QKYuA)Cited by:[§1](https://arxiv.org/html/2606.07617#S1.p1.1)\.
- G\. S\. Paulo, A\. T\. Mallen, C\. Juang, and N\. Belrose \(2025\)Automatically interpreting millions of features in large language models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=EemtbhJOXc)Cited by:[§I\.2](https://arxiv.org/html/2606.07617#A9.SS2.SSS0.Px1.p1.2),[§1](https://arxiv.org/html/2606.07617#S1.p2.1),[§4\.2](https://arxiv.org/html/2606.07617#S4.SS2.SSS0.Px1.p1.8),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever,et al\.\(2019\)Language models are unsupervised multitask learners\.OpenAI blog1\(8\),pp\. 9\.External Links:[Link](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)Cited by:[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- G\. Team \(2025a\)Gemma 3 technical report\.External Links:2503\.19786,[Link](https://arxiv.org/abs/2503.19786)Cited by:[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- Q\. Team \(2025b\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px1.p1.1)\.
- A\. Templeton, T\. Conerly, J\. Marcus, J\. Lindsey, T\. Bricken, B\. Chen, A\. Pearce, C\. Citro, E\. Ameisen, A\. Jones, H\. Cunningham, N\. L\. Turner, C\. McDougall, M\. MacDiarmid, C\. D\. Freeman, T\. R\. Sumers, E\. Rees, J\. Batson, A\. Jermyn, S\. Carter, C\. Olah, and T\. Henighan \(2024\)Scaling monosemanticity: extracting interpretable features from claude 3 sonnet\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)Cited by:[§4\.1](https://arxiv.org/html/2606.07617#S4.SS1.SSS0.Px2.p1.6),[§6](https://arxiv.org/html/2606.07617#S6.SS0.SSS0.Px1.p1.1)\.

## Appendix A Models and SAEs

Table [3](https://arxiv.org/html/2606.07617#A1.T3) lists the Hugging Face repositories for the LLMs and sparse autoencoders used in our main experiments \(Section [4](https://arxiv.org/html/2606.07617#S4)\)\.

Table 3: Hugging Face repositories for the four model/SAE pairs evaluated in Table [1](https://arxiv.org/html/2606.07617#S3.T1)\.

Table [4](https://arxiv.org/html/2606.07617#A1.T4) lists the repositories for the additional configurations evaluated in Appendix [D](https://arxiv.org/html/2606.07617#A4): the SAE settings \(Table [6](https://arxiv.org/html/2606.07617#A4.T6)\) and the Qwen\-3 transcoders \(Table [7](https://arxiv.org/html/2606.07617#A4.T7)\)\.

Table 4: Hugging Face repositories for the additional SAE configurations \(Table [6](https://arxiv.org/html/2606.07617#A4.T6)\) and transcoder configurations \(Table [7](https://arxiv.org/html/2606.07617#A4.T7)\)\.

| Setting | Model Repository | Dictionary Repository |
| --- | --- | --- |
| GPT\-2 Small \(128K\) | [openai\-community/gpt2](https://huggingface.co/openai-community/gpt2) | [jbloom/GPT2\-Small\-OAI\-v5\-128k\-resid\-post\-SAEs](https://huggingface.co/jbloom/GPT2-Small-OAI-v5-128k-resid-post-SAEs) |
| Gemma\-3\-270M \(16K\) | [google/gemma\-3\-270m](https://huggingface.co/google/gemma-3-270m) | [google/gemma\-scope\-2\-270m\-pt/resid\_post](https://huggingface.co/google/gemma-scope-2-270m-pt) \(l0\_medium\) |
| Gemma\-3\-1B \(16K\) | [google/gemma\-3\-1b\-pt](https://huggingface.co/google/gemma-3-1b-pt) | [google/gemma\-scope\-2\-1b\-pt/resid\_post](https://huggingface.co/google/gemma-scope-2-1b-pt) \(l0\_medium\) |
| Gemma\-3\-4B \(16K\) | [google/gemma\-3\-4b\-pt](https://huggingface.co/google/gemma-3-4b-pt) | [google/gemma\-scope\-2\-4b\-pt/resid\_post](https://huggingface.co/google/gemma-scope-2-4b-pt) \(l0\_medium\) |
| Gemma\-3\-4B \(65K\) | [google/gemma\-3\-4b\-pt](https://huggingface.co/google/gemma-3-4b-pt) | [google/gemma\-scope\-2\-4b\-pt/resid\_post](https://huggingface.co/google/gemma-scope-2-4b-pt) \(l0\_medium\) |
| Qwen\-3\-0\.6B | [Qwen/Qwen3\-0\.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | [mwhanna/qwen3\-0\.6b\-transcoders\-lowl0](https://huggingface.co/mwhanna/qwen3-0.6b-transcoders-lowl0) |
| Qwen\-3\-1\.7B | [Qwen/Qwen3\-1\.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | [mwhanna/qwen3\-1\.7b\-transcoders\-lowl0](https://huggingface.co/mwhanna/qwen3-1.7b-transcoders-lowl0) |
| Qwen\-3\-4B | [Qwen/Qwen3\-4B](https://huggingface.co/Qwen/Qwen3-4B) | [mwhanna/qwen3\-4b\-transcoders](https://huggingface.co/mwhanna/qwen3-4b-transcoders) |

## Appendix B Geometric Comparison of Methods

![Refer to caption](https://arxiv.org/html/2606.07617v1/x17.png)\(a\) LL and TL are tangents of simplified linear surrogate models; QL is the local tangent of the true $f$ at $a_{\mathrm{clean}}$\.
![Refer to caption](https://arxiv.org/html/2606.07617v1/x18.png)\(b\) Query Lens vs\. ablation\-style surrogates: ZO and TC are secants anchored at $a=0$ and $a=a_{\mathrm{clamp}}$ respectively\.

Figure 6: Schematic comparison of feature readouts as approximations to the true mapping $y=f(a)$ from feature activation $a$ to logit $y$\. Additive constants are omitted in legend formulas\.

Query Lens reads off the local tangent of the true activation\-to\-logit map $y=f(a)$ at the clean operating point $a_{\mathrm{clean}}$, with slope $f'(a_{\mathrm{clean}})=U^{\top}\!\left[\prod_{k>l}(I+J^{k})\right]v_{i}^{\,l}$\. Each baseline can be seen as diverging from this tangent in one of two ways, which we visualize in the two panels of Figure [6](https://arxiv.org/html/2606.07617#A2.F6): \(a\) taking the tangent of a *simplified, linearized* model rather than the true $f$, or \(b\) replacing the tangent with a *secant* between two operating points\.

##### \(a\) Tangent of a linearized model \(LL, TL\)\.

Logit Lens and Tuned Lens take the slope at $a_{\mathrm{clean}}$ of a *virtual linear model* obtained by simplifying the downstream stream transition, rather than of the true $f$ \(Figure [6\(a\)](https://arxiv.org/html/2606.07617#A2.F6.sf1)\)\. LL removes every downstream path that re\-enters a subsequent module, leaving only the residual connections that carry the value vector directly to the output; the resulting model is linear in $a$, so its tangent is the constant slope $U^{\top}v_{i}^{\,l}$\. TL is less severe: it collapses the downstream stream transition into a single learned per\-layer affine map $W^{l}$, again yielding a linear model with constant slope $U^{\top}W^{l}v_{i}^{\,l}$\. Both slopes are parameter\-only and blind to the operating point, because the simplified model discards exactly the input\-dependent indirect effects that QL retains\.

##### \(b\) Secant instead of tangent \(ZO, TC\)\.

Zero\-Out and Token Change keep the true $f$ but replace the analytic tangent with a *secant* between two operating points \(Figure [6\(b\)](https://arxiv.org/html/2606.07617#A2.F6.sf2)\)\. ZO sets the activation to $0$ and reads off $s_{\mathrm{ZO}}=f(a_{\mathrm{clean}})-f(0)$, the secant of $f$ between $a=0$ and $a=a_{\mathrm{clean}}$\. TC instead clamps the activation to a fixed value and measures $s_{\mathrm{TC}}=f(a_{\mathrm{clamp}})-f(a_{\mathrm{clean}})$, the secant between $a=a_{\mathrm{clean}}$ and $a=a_{\mathrm{clamp}}$\. Unlike QL’s tangent, a secant depends on the choice of the second point and averages the behavior of $f$ over a finite interval rather than at $a_{\mathrm{clean}}$ itself\.

Across both axes, QL is the only surrogate that is simultaneously a tangent rather than a secant and taken on the true model rather than a simplified one, evaluated at the actual operating point\.
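
The tangent/secant distinction can be made concrete on a toy scalar map. The sketch below is illustrative only: `f` is a hypothetical nonlinear activation-to-logit function (not taken from any model), the operating points are made up, and the two secant readouts are divided by their interval length purely so that they can be compared with the tangent slope. The ZO and TC scores described above are the raw differences.

```python
import math

def f(a: float) -> float:
    """Hypothetical nonlinear activation-to-logit map, used only for illustration."""
    return math.tanh(a) + 0.1 * a * a

def tangent_slope(fn, a: float, eps: float = 1e-6) -> float:
    """Central finite-difference estimate of fn'(a): the local tangent at a."""
    return (fn(a + eps) - fn(a - eps)) / (2.0 * eps)

def zero_out_score(fn, a_clean: float) -> float:
    """ZO readout f(a_clean) - f(0): a secant anchored at a = 0."""
    return fn(a_clean) - fn(0.0)

def token_change_score(fn, a_clean: float, a_clamp: float) -> float:
    """TC readout f(a_clamp) - f(a_clean): a secant anchored at a = a_clamp."""
    return fn(a_clamp) - fn(a_clean)

# Assumed operating points (arbitrary choices for the example).
a_clean, a_clamp = 1.0, 4.0

ql_slope = tangent_slope(f, a_clean)
zo_slope = zero_out_score(f, a_clean) / a_clean
tc_slope = token_change_score(f, a_clean, a_clamp) / (a_clamp - a_clean)

print(f"tangent at a_clean: {ql_slope:.4f}")
print(f"ZO secant slope:    {zo_slope:.4f}")
print(f"TC secant slope:    {tc_slope:.4f}")
```

On this toy map the three slopes differ, and the two secant slopes would change again if `a_clamp` or the zero anchor were moved, whereas the tangent depends only on `a_clean`.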

## Appendix C Baseline Implementation Details

##### Hidden\-state Sampling for QL, ZO, and TC\.

QL, ZO, and TC all require evaluating the model at a clean operating point $a_{\text{clean}}$\. We obtain these hidden states by running each model on a fixed corpus of $1024\times 32$ tokens drawn from The Pile \(Gao et al\., [2020](https://arxiv.org/html/2606.07617#bib.bib16)\)\. As shown in Appendix [H\.1](https://arxiv.org/html/2606.07617#A8.SS1), the token sets predicted by QL are largely invariant to this sample choice\.

##### Tuned Lens\.

For GPT\-2 Small, we use the pretrained lens released with the tuned\-lens library \(Belrose et al\., [2025](https://arxiv.org/html/2606.07617#bib.bib36)\)\. For the remaining models, we train our own lens with the same library\. Each lens is a stack of per\-layer linear translators that minimize the per\-token KL divergence between the lens prediction at that layer and the \(detached\) final\-layer next\-token distribution, trained on the validation split of wikitext\-103\-raw\-v1 \([https://huggingface\.co/datasets/Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext)\)\. We optimize with SGD \(Nesterov momentum $0.9$, weight decay $10^{-3}$\) under a linear learning\-rate decay to zero with no warmup, using a base learning rate of $0.01$ for $2000$ steps, batch size $1$, maximum sequence length $2048$, and bfloat16 precision\. Table [5](https://arxiv.org/html/2606.07617#A3.T5) reports the resulting lens quality on held\-out wikitext\-103 text: every lens substantially lowers the average per\-token KL to the final\-layer distribution relative to the logit lens \(by 42–79%\)\.
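
As a rough illustration of the quantities in this paragraph, the sketch below computes a per-token KL divergence and a linear learning-rate decay in plain Python. It is not the tuned-lens library's code: the function names are hypothetical, the logits are made-up numbers, and the KL direction (final-layer distribution as the reference) is an assumption, since the text above does not fix it.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def linear_decay_lr(step, base_lr=0.01, total_steps=2000):
    """Learning rate decayed linearly from base_lr to zero, with no warmup."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# Toy logits for a single token position (made-up numbers).
final_logits = [2.0, 0.5, -1.0]
logit_lens_logits = [0.0, 1.0, 0.5]
tuned_lens_logits = [1.5, 0.6, -0.8]

p_final = softmax(final_logits)
# Assumed direction: KL(final || lens).
kl_logit_lens = kl_divergence(p_final, softmax(logit_lens_logits))
kl_tuned_lens = kl_divergence(p_final, softmax(tuned_lens_logits))

# Reduction of the tuned lens relative to the logit lens on this toy example.
relative_reduction = 1.0 - kl_tuned_lens / kl_logit_lens
print(f"KL reduction on the toy example: {100 * relative_reduction:.1f}%")
```

In an actual evaluation the KL would be averaged over tokens and layers before taking the relative reduction.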

Table 5: Tuned lens quality, measured as the per\-token KL divergence to the final\-layer distribution, averaged over layers \($\downarrow$ lower is better\); the parenthetical is the reduction relative to the logit lens\. † From the tuned\-lens library; the others we train ourselves\.

## Appendix D Additional Configurations

In this section, we report Query Lens results on configurations omitted from the main text \(Section [4](https://arxiv.org/html/2606.07617#S4)\): additional SAE widths and model sizes \(Section [D\.1](https://arxiv.org/html/2606.07617#A4.SS1)\), and transcoders applied to Qwen\-3 models \(Section [D\.2](https://arxiv.org/html/2606.07617#A4.SS2)\)\.

### D\.1 Additional SAEs

We evaluate Query Lens on five additional model/SAE configurations: GPT\-2 Small with the 128K\-width OpenAI Top\-K SAE \(Gao et al\., [2025](https://arxiv.org/html/2606.07617#bib.bib11)\), Gemma\-3\-270M and Gemma\-3\-1B with the 16K\-width Gemma Scope 2 JumpReLU SAEs \(McDougall et al\., [2025](https://arxiv.org/html/2606.07617#bib.bib43)\), and Gemma\-3\-4B with both 16K and 65K Gemma Scope 2 SAEs; the corresponding repositories are listed in Table [4](https://arxiv.org/html/2606.07617#A1.T4)\. We follow the same sampling and evaluation protocol described in Section [4\.1](https://arxiv.org/html/2606.07617#S4.SS1)\. Table [6](https://arxiv.org/html/2606.07617#A4.T6) reports Input and Output scores for all configurations\.

Table 6: Input and Output scores \(%\) across methods on five additional SAE configurations\. Values are the mean score over layers along with their respective 95% confidence intervals\. For each \(configuration, score\-type\) the largest value is in bold\.

The patterns observed in the main results \(Table [1](https://arxiv.org/html/2606.07617#S3.T1)\) generalize to these additional configurations: $\text{QL}_{\text{Key}}$ achieves the highest Input Score and $\text{QL}_{\text{Value}}$ the highest Output Score across SAE widths and model sizes, with the relative gain over Logit Lens preserved\.

### D\.2 Transcoders

##### Background\.

A *transcoder* learns a sparse dictionary directly for the MLP residual update \(Dunefsky et al\., [2024](https://arxiv.org/html/2606.07617#bib.bib30)\)\. Unlike an SAE, which reconstructs a residual stream vector, a transcoder is trained to map the MLP input $h_{\text{mid}}$ to its MLP residual update $R_{\text{M}}$:

$$\widehat{R}_{\text{M}}=f\!\left(h_{\text{mid}}W_{\text{enc}}^{\top}\right)W_{\text{dec}}.$$

Each *key feature* $k_{i}^{l}$ \(row of $W_{\text{enc}}$\) reads from the pre\-MLP residual $h_{\text{mid}}^{l}$, and each *value feature* $v_{i}^{l}$ \(row of $W_{\text{dec}}$\) is written into the residual stream as part of the MLP output at layer $l$\.
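
The transcoder map can be sketched in a few lines of plain Python. This is an illustration under stated assumptions: the activation $f$ is taken to be a ReLU (the transcoders used in the experiments may use a different sparse activation), biases are omitted, and the weights and input are tiny made-up matrices rather than trained parameters.

```python
def relu(values):
    """Assumed activation f; a real transcoder may use a different sparse nonlinearity."""
    return [max(0.0, x) for x in values]

def transcoder_forward(h_mid, W_enc, W_dec):
    """Return (activations, predicted MLP residual update).

    Rows of W_enc are key features k_i (read from h_mid);
    rows of W_dec are value features v_i (written to the residual stream).
    """
    pre_acts = [sum(k_j * h_j for k_j, h_j in zip(k, h_mid)) for k in W_enc]
    acts = relu(pre_acts)
    d_model = len(W_dec[0])
    r_hat = [0.0] * d_model
    for a_i, v_i in zip(acts, W_dec):
        for j in range(d_model):
            r_hat[j] += a_i * v_i[j]
    return acts, r_hat

# Hypothetical 3-feature dictionary over a 2-dimensional residual stream.
W_enc = [[1.0, 0.0], [0.0, -1.0], [1.0, 1.0]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
h_mid = [1.0, 2.0]

acts, r_hat = transcoder_forward(h_mid, W_enc, W_dec)
print(acts, r_hat)
```

Only the features whose key has a positive inner product with `h_mid` contribute their value vector to the predicted update.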

##### Stream Transition for Transcoders\.

Query Lens extends to transcoders with a small change in the indexing of the stream transition\. For the forward dynamics, the value feature $v_{i}^{l}$ is written into the residual stream as an MLP output, so the transition product begins after the layer\-$l$ MLP block:

$$\frac{\partial y}{\partial a_{i}^{l}}=U^{\top}\left[\prod_{k=l+1}^{L}(I+J_{\text{M}}^{k})(I+J_{\text{A}}^{k})\right]v_{i}^{l}.$$

For the backward dynamics, the key feature reads from $h_{\text{mid}}^{l}$ rather than $h_{\text{post}}^{l}$, so the prefix transition product terminates after the layer\-$l$ attention block:

$$\frac{\partial a_{i}^{l}}{\partial x}=(k_{i}^{l})^{\top}\left[\bigl(I+J_{\text{A}}^{l}\bigr)\prod_{k=1}^{l-1}(I+J_{\text{M}}^{k})(I+J_{\text{A}}^{k})\right]\widehat{E}.$$

The remaining elements of the framework \(feature vectors and readout\) carry over unchanged\.

Table 7: Input and Output scores \(%\) on transcoders for Qwen\-3 at three scales\. Values are the mean score over layers along with their respective 95% confidence intervals\. For each \(configuration, score\-type\) the largest value is in bold\.

##### Results\.

We evaluate Query Lens on transcoders trained for Qwen\-3 at three scales \(0\.6B, 1\.7B, 4B\)\. Table [7](https://arxiv.org/html/2606.07617#A4.T7) reports Input and Output scores following the same protocol as the main experiments\. Across all three transcoder scales, Query Lens preserves the qualitative behavior observed for SAEs: $\text{QL}_{\text{Key}}$ leads on the input side and $\text{QL}_{\text{Value}}$ on the output side\. The sole exception is the output side of Qwen\-3\-4B, where Token Change marginally surpasses $\text{QL}_{\text{Value}}$ within overlapping confidence intervals\. Overall, the framework extends naturally to sparse dictionaries trained on MLP outputs rather than residual stream vectors\.

## Appendix E The Fidelity–Variance Tradeoff in Stream Transition Modeling

As established in the main text, the full stream transition from a value feature $v_{i}^{\,l}$ at layer $l$ to the output logits encompasses all possible residual paths through the intervening blocks\. Formally, the Jacobian of the logits $y$ with respect to the feature activation $a_{i}^{\,l}$ is given by:

$$\frac{\partial y}{\partial a_{i}^{\,l}}=U^{\top}\left[\prod_{k=l+1}^{L}\bigl(I+J_{\text{M}}^{k}\bigr)\bigl(I+J_{\text{A}}^{k}\bigr)\right]v_{i}^{\,l},$$

where the product spans $L-l$ layers, each containing two residual modules \(attention and MLP\)\. Expanding this product yields $2^{2(L-l)}$ additive terms, one for each way of choosing, per module, between its local Jacobian and the identity path\.

Modeling this product in full is one extreme of a spectrum of choices for how much of the stream transition to include\. At the other extreme, the Logit Lens \(LL\) keeps only the identity term, discarding all indirect effects\. An intermediate, *first\-order* choice retains the identity and first\-order Jacobian terms but drops the higher\-order module interactions:

$$\frac{\partial y}{\partial a_{i}^{\,l}}\approx U^{\top}\left[I+\sum_{k=l+1}^{L}\left(J_{\text{M}}^{k}+J_{\text{A}}^{k}\right)\right]v_{i}^{\,l}.$$
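
The three fidelity levels can be compared on a toy example. The sketch below is a minimal illustration, not the paper's implementation: the two $2\times 2$ Jacobians and the value vector are made-up, the unembedding $U^{\top}$ is omitted so that the transported vector itself is inspected, and the list order is assumed to be the order in which modules are applied.

```python
def matvec(J, u):
    """Matrix-vector product for small list-of-lists matrices."""
    return [sum(J_ij * u_j for J_ij, u_j in zip(row, u)) for row in J]

def add(u, w):
    return [a + b for a, b in zip(u, w)]

def full_transition(jacobians, v):
    """Apply prod_k (I + J^k) to v, one module at a time (earliest module first)."""
    u = list(v)
    for J in jacobians:
        u = add(u, matvec(J, u))
    return u

def first_order_transition(jacobians, v):
    """Apply (I + sum_k J^k) to v, dropping products of two or more Jacobians."""
    u = list(v)
    for J in jacobians:
        u = add(u, matvec(J, v))
    return u

def logit_lens_transition(v):
    """Identity path only: the value vector reaches the output unchanged."""
    return list(v)

def num_path_terms(L, l):
    """Additive terms in the expanded product: two modules per layer, L - l layers."""
    return 2 ** (2 * (L - l))

# Hypothetical module Jacobians and value vector.
J_first = [[0.0, 0.5], [0.0, 0.0]]
J_second = [[0.0, 0.0], [0.5, 0.0]]
v = [0.0, 1.0]

direct = logit_lens_transition(v)
first_order = first_order_transition([J_first, J_second], v)
full = full_transition([J_first, J_second], v)
print(direct, first_order, full, num_path_terms(4, 2))
```

Here the full and first-order results differ by exactly the second-order term `J_second @ J_first @ v`, which is the kind of module interaction the first-order truncation drops.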
##### Two Regimes of Indirect\-Effect Mediation\.

As discussed in Section [4\.2](https://arxiv.org/html/2606.07617#S4.SS2.SSS0.Px3), the benefit of mediating indirect effects grows as a feature moves farther from its stream endpoint, where the longer transition accumulates larger indirect contributions\. We break this down layer by layer to ask how much of the stream transition is worth modeling\. We sample 100 features at each of four source layers of Gemma\-3\-1B \(16K SAE\) and compare the output scores of the three strategies \(LL, first\-order, and full\) on a common set of features\. Table [8](https://arxiv.org/html/2606.07617#A5.T8) reports the mean output score per strategy and an oracle that picks the best strategy for each feature\.

The comparison reveals two regimes separated by the feature’s distance from the output endpoint\. *Far from the endpoint* \(layers 7 and 13\), the transition is long and most of the feature’s effect on the logits flows through indirect paths; modeling them is essential, and the first\-order and full transitions score several times higher than LL\. *Near the endpoint* \(layer 22\), the transition is short and the direct projection already dominates; here mediating indirect effects no longer pays off, LL is best, and first\-order and then full are progressively worse\.

Table 8: Stream transition fidelity on Gemma\-3\-1B \(16K SAE\): mean output score per strategy \(LL / first\-order \(FO\) / full\) and the relative variance of the FO and full transitions, for 100 features sampled at each source layer\. “Oracle” picks the best strategy per feature \($\Delta$: gap over the best static layer\-wise strategy\)\. Layers run from far \(long transition\) to near \(short\) the output endpoint\.

##### A Variance Cost of Indirect\-Effect Mediation\.

We attribute the second regime to the *variance* that indirect\-effect mediation injects into the linearization\. Each transition maps a feature to a vector $g$ that depends on the sampled hidden state; treating the token\-wise vectors $\{g_{n}\}$ as samples, we measure their context\-sensitivity with the Relative Variance

$$\text{Relative Variance}=\frac{\mathbb{E}\!\left[\|g-\mu\|^{2}\right]}{\mathbb{E}\!\left[\|g\|^{2}\right]},\quad\mu=\mathbb{E}[g],$$

where a value near $0$ means the transition is essentially context\-invariant and a value near $1$ means it varies strongly across contexts\. The last two columns of Table [8](https://arxiv.org/html/2606.07617#A5.T8) show that the full transition has much higher relative variance than the first\-order one at every layer \(e\.g\., $0.89$ vs\. $0.32$ at layer 7\), and that this variance grows sharply with transition length, from $0.20$ near the endpoint to $0.89$ far from it; LL, a fixed projection, carries no transition variance at all\.
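
A minimal sketch of this statistic, assuming the expectations are replaced by sample means over the token-wise vectors (the function name and the toy inputs are hypothetical):

```python
def relative_variance(vectors):
    """E[||g - mu||^2] / E[||g||^2] with expectations taken as sample means."""
    n = len(vectors)
    d = len(vectors[0])
    mu = [sum(g[j] for g in vectors) / n for j in range(d)]
    spread = sum(sum((g[j] - mu[j]) ** 2 for j in range(d)) for g in vectors) / n
    energy = sum(sum(x * x for x in g) for g in vectors) / n
    return spread / energy

# Identical vectors across contexts: context-invariant transition.
print(relative_variance([[1.0, 2.0], [1.0, 2.0]]))
# Vectors that cancel on average: fully context-dependent transition.
print(relative_variance([[1.0, 0.0], [-1.0, 0.0]]))
# An intermediate case.
print(relative_variance([[1.0, 0.0], [0.0, 1.0]]))
```

Because $\mathbb{E}\|g\|^{2}=\mathbb{E}\|g-\mu\|^{2}+\|\mu\|^{2}$, the statistic always lies in $[0,1]$ whenever the denominator is non-zero.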

This variance cost can be derived analytically\. For i\.i\.d\. scalar Jacobians $J^{k}$ with mean $\mu$ and variance $\sigma^{2}$ composed across $L$ modules, the full and first\-order transitions have variance ratio

$$\frac{\operatorname{Var}\!\left[\prod_{k=1}^{L}\bigl(1+J^{k}\bigr)\right]}{\operatorname{Var}\!\left[1+\sum_{k=1}^{L}J^{k}\right]}=\frac{(\alpha+\sigma^{2})^{L}-\alpha^{L}}{L\,\sigma^{2}},\qquad\alpha=(1+\mu)^{2},$$

which is strictly greater than $1$ for all $L\geq 2$ and $\sigma^{2}>0$ whenever $\alpha\geq 1$ \(for instance $\mu\geq 0$\), and in that regime grows monotonically in $L$: every higher\-order cross\-term adds variance with no counterpart in the first\-order sum\. The full transition is then noisier than its first\-order truncation, increasingly so for longer transitions\.
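
The closed form can be checked exactly on a small case. The sketch below is an illustration under stated assumptions: each scalar Jacobian is drawn from a hypothetical two-point distribution taking the values $\mu\pm\sigma$ with equal probability (which has mean $\mu$ and variance $\sigma^{2}$), and all $2^{L}$ outcomes are enumerated, so no random sampling is involved. The parameters are chosen with $\mu\geq 0$, so that $\alpha\geq 1$.

```python
from itertools import product

def variance_ratio_closed_form(mu, sigma2, L):
    """((alpha + sigma^2)^L - alpha^L) / (L * sigma^2) with alpha = (1 + mu)^2."""
    alpha = (1.0 + mu) ** 2
    return ((alpha + sigma2) ** L - alpha ** L) / (L * sigma2)

def population_variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def variance_ratio_exact(mu, sigma, L):
    """Enumerate all 2^L equally likely outcomes of the assumed two-point law."""
    full_values, first_order_values = [], []
    for js in product((mu - sigma, mu + sigma), repeat=L):
        prod_term = 1.0
        for j in js:
            prod_term *= 1.0 + j
        full_values.append(prod_term)          # prod_k (1 + J^k)
        first_order_values.append(1.0 + sum(js))  # 1 + sum_k J^k
    return population_variance(full_values) / population_variance(first_order_values)

mu, sigma, L = 0.0, 0.5, 3
closed = variance_ratio_closed_form(mu, sigma ** 2, L)
exact = variance_ratio_exact(mu, sigma, L)
print(closed, exact)
```

The two values agree because the variance of a product of independent factors depends only on their first two moments; for parameters with $\alpha$ well below $1$ the ratio need not exceed $1$, so the choice of regime matters when reusing this check.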

Together this gives a signal\-versus\-variance reading: mediating indirect effects helps only when their signal outweighs the variance it adds\. This holds far from the endpoint but not near it, where the signal is negligible and the variance\-free LL wins\.

##### First\-Order Mediation as an Intermediate Strategy\.

The first\-order transition is a fair middle ground: it captures the dominant first\-order indirect signal at roughly a third of the full transition’s relative variance\. As a result it matches or exceeds the full transition on output score at every layer in Table [8](https://arxiv.org/html/2606.07617#A5.T8) \(it is the single best strategy at layer 17\), while never trailing LL by much\. It is not, however, uniformly optimal: near the endpoint LL is still better, and on individual features the full transition remains preferable\. First\-order mediation is thus a reasonable static default rather than a universal answer\.

##### Toward Adaptive Fidelity\.

The oracle quantifies the headroom left by any static choice: picking the best strategy *per feature* adds a further $+0.013$ to $+0.032$ over the best static layer\-wise strategy at every layer, and the best static strategy itself flips with depth\. No single fidelity is right for all features\. This calls for an adaptive strategy that decides, per feature, how much of the stream transition to mediate: a faithful low\-variance approximation where the indirect signal is weak, and the full transition where it is strong\. We leave the design of such per\-feature fidelity selection to future work\.

## Appendix F Pre\-Activation as a Proxy for Post\-Activation

The backward dynamics of Section 3\.2\.2 measure a feature’s input\-side sensitivity by differentiating its*pre\-activation*with respect to the input\. All SAEs in our study \(Section 4\.1\) rely on either JumpReLU or TopK activations, both of which involve discontinuous thresholding operations and thus yield gradients ill\-suited to this purpose\. We show that differentiating the pre\-activation instead recovers a well\-defined input direction that reflects which input perturbations would activate the feature\.

##### Pre\-Activation Gives a Well\-Defined Input Direction\.

Both nonlinearities apply a threshold that zeros out insufficiently active features\. The post\-activation gradient of a zeroed\-out feature is itself zero, and is therefore uninformative about which input perturbations would bring the feature to life\. The pre\-activation gradient does not suffer this collapse: it remains informative even when the feature is inactive, and indicates the input perturbations that would push it toward activation\. Moreover, increasing the pre\-activation is exactly what is required to increase the post\-activation, because the activation function acts only as a gate and does not alter the underlying relationship between the input and the pre\-activation\.

##### Connection to Straight\-Through Estimators\.

Differentiating the pre\-activation rather than the post\-activation can be viewed as applying a straight\-through estimator \(Bengio et al\., [2013](https://arxiv.org/html/2606.07617#bib.bib48)\): the forward pass respects the discontinuous activation, while the backward pass treats it as the identity and lets gradients flow through unimpeded\. Replacing this hard estimator with a smooth continuous relaxation, such as Gumbel\-Softmax \(Jang et al\., [2017](https://arxiv.org/html/2606.07617#bib.bib49)\), is a promising direction for future work\.
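
The collapse of the post-activation gradient, and the straight-through alternative, can be seen on a single toy feature. The sketch below is a simplification under stated assumptions: the gradient is taken with respect to the hidden state the key reads from (rather than through the prefix layers to the input), the key, threshold, hidden state, and step size are made-up numbers, and the JumpReLU derivative is evaluated away from the jump.

```python
def dot(u, w):
    return sum(a * b for a, b in zip(u, w))

def jumprelu(z, theta):
    """JumpReLU: pass z through unchanged above the threshold, output zero otherwise."""
    return z if z > theta else 0.0

def post_activation_grad(k, h, theta):
    """d JumpReLU(k.h) / d h away from the jump: k if the feature fires, else zero."""
    return list(k) if dot(k, h) > theta else [0.0] * len(k)

def pre_activation_grad(k):
    """d (k.h) / d h = k regardless of the gate (the straight-through view)."""
    return list(k)

# Hypothetical key feature, threshold, and hidden state with the feature inactive.
k = [1.0, 2.0]
theta = 1.0
h = [0.1, 0.1]

grad_post = post_activation_grad(k, h, theta)   # carries no directional information
grad_pre = pre_activation_grad(k)               # points toward activation

# Moving the hidden state along the pre-activation gradient switches the feature on.
h_stepped = [h_j + 0.5 * g_j for h_j, g_j in zip(h, grad_pre)]
activation_before = jumprelu(dot(k, h), theta)
activation_after = jumprelu(dot(k, h_stepped), theta)
print(grad_post, grad_pre, activation_before, activation_after)
```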

## Appendix G Efficient Computation of Query Lens

The Query Lens scores of Eq\. \([8](https://arxiv.org/html/2606.07617#S3.E8)\) and its key counterpart both apply the stream transition $\prod_{k}(I+J^{k})$, a $d_{\mathrm{m}}\times d_{\mathrm{m}}$ map, to a feature vector\. Read at face value, the definition suggests two sources of cost: building the transition Jacobian, and summing the many terms it produces when the product is expanded\. We incur neither\. Each score reduces to a single forward or backward pass under automatic differentiation, at the cost of one ordinary pass through the relevant layers\.

##### Jacobian Products\.

The transition never stands on its own\. In the value score it multiplies $v_{i}^{\,l}$ from the left, and in the key score it multiplies $(k_{i}^{\,l})^{\top}$ from the right, so in both cases we only need the transition applied to one vector\. This is precisely a Jacobian–vector product \(JVP, on the value side\) or a vector–Jacobian product \(VJP, on the key side\), which automatic differentiation returns in a single pass without instantiating the Jacobian\. Forming the full $d_{\mathrm{m}}\times d_{\mathrm{m}}$ transition would instead take $d_{\mathrm{m}}$ such passes, one per column, and $O(d_{\mathrm{m}}^{2})$ memory to hold it; the JVP and VJP need one pass and $O(d_{\mathrm{m}})$ memory for the running vector\.

##### In\-Place Accumulation\.

Expanding $\prod_{k}(I+J^{k})$ yields a separate term for every computational path through the residual stream, a count exponential in the number of blocks, which might suggest computing and summing these path terms one by one\. We never form this sum\. The JVP carries a single tangent and the VJP a single cotangent, updated one block at a time, $u\leftarrow u+J^{k}u$ on the value side and $w\leftarrow w+(J^{k})^{\top}w$ on the key side\. Each update applies one factor $(I+J^{k})$ to the running vector, so the contribution of every path is accumulated in place as the vector is transported across layers\. The transport therefore costs $O(L)$ factor applications rather than the $O(2^{L})$ terms of the explicit expansion\.
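
The equivalence between in-place accumulation and the explicit path sum can be checked on a toy example. The sketch below is not the paper's implementation: it uses explicit, made-up $2\times 2$ Jacobians in plain Python, whereas an automatic-differentiation framework would supply the products $J^{k}u$ and $(J^{k})^{\top}w$ without materializing $J^{k}$. The list order is assumed to be the order in which blocks are applied.

```python
from itertools import product

def matvec(J, u):
    return [sum(J_ij * u_j for J_ij, u_j in zip(row, u)) for row in J]

def matvec_transposed(J, w):
    """Compute J^T w without forming the transpose."""
    return [sum(J[i][j] * w[i] for i in range(len(J))) for j in range(len(J[0]))]

def jvp_transport(jacobians, v):
    """Value side: u <- u + J^k u, earliest block first."""
    u = list(v)
    for J in jacobians:
        u = [a + b for a, b in zip(u, matvec(J, u))]
    return u

def vjp_transport(jacobians, w):
    """Key side: w <- w + (J^k)^T w, latest block first."""
    c = list(w)
    for J in reversed(jacobians):
        c = [a + b for a, b in zip(c, matvec_transposed(J, c))]
    return c

def explicit_path_sum(jacobians, v):
    """Sum one term per path: each block is either traversed (J^k) or skipped (I)."""
    total, n_paths = [0.0] * len(v), 0
    for choice in product((False, True), repeat=len(jacobians)):
        term = list(v)
        for J, through_block in zip(jacobians, choice):
            if through_block:
                term = matvec(J, term)
        total = [a + b for a, b in zip(total, term)]
        n_paths += 1
    return total, n_paths

# Hypothetical block Jacobians and feature vectors.
jacobians = [
    [[0.0, 0.5], [0.0, 0.0]],
    [[0.0, 0.0], [0.5, 0.0]],
    [[0.25, 0.0], [0.0, 0.0]],
]
v, w = [0.0, 1.0], [1.0, 0.0]

transported = jvp_transport(jacobians, v)          # 3 factor applications
path_total, n_paths = explicit_path_sum(jacobians, v)  # 2**3 path terms
cotangent = vjp_transport(jacobians, w)
print(transported, path_total, n_paths, cotangent)
```

Both routes give the same vector, and the VJP result satisfies the usual adjoint identity `dot(cotangent, v) == dot(w, transported)`.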

## Appendix H Robustness Analysis

### H\.1 Data Invariance of Predicted Token Sets

Because the stream transition in Query Lens is computed from hidden states of a sampled corpus \(Appendix [C](https://arxiv.org/html/2606.07617#A3)\), we test whether the predicted token sets change with the choice of sample\. We draw three independent Pile subsets with different random seeds \($42$, $17$, $123$\), run Query Lens on each to obtain top\-$k$ token sets $T_{1},T_{2},T_{3}$, and measure their overlap with the three\-way Jaccard similarity

$$J(T_{1},T_{2},T_{3})=\frac{\left|T_{1}\cap T_{2}\cap T_{3}\right|}{\left|T_{1}\cup T_{2}\cup T_{3}\right|}.$$

Table 9: Data invariance of predicted token sets for GPT\-2 Small at two SAE widths, measured by the three\-way Jaccard similarity $J(T_{1},T_{2},T_{3})$ across three independently sampled Pile subsets \(mean $\pm$ standard error across features\)\.

Table [9](https://arxiv.org/html/2606.07617#A8.T9) reports this for GPT\-2 Small at two SAE widths\. Both variants exceed $0.83$ at both SAE widths: the token sets recovered from independent corpora are largely shared, indicating that a Query Lens explanation is a property of the feature itself rather than of the particular sample used to compute it\. This is what sets Query Lens apart from activation\-based methods, which instead read a feature’s meaning off the content of selected input examples\.
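
The three-way Jaccard similarity is straightforward to compute from the token sets. A minimal sketch follows; the function name, the convention for three empty sets, and the example tokens are assumptions made for illustration.

```python
def three_way_jaccard(t1, t2, t3):
    """|T1 ∩ T2 ∩ T3| / |T1 ∪ T2 ∪ T3| for three collections of tokens."""
    s1, s2, s3 = set(t1), set(t2), set(t3)
    union = s1 | s2 | s3
    if not union:
        return 1.0  # assumed convention: three empty sets are treated as identical
    return len(s1 & s2 & s3) / len(union)

# Hypothetical top-k token sets from three corpus samples.
tokens_seed_a = ["Paris", "London", "Rome"]
tokens_seed_b = ["Paris", "London", "Berlin"]
tokens_seed_c = ["Paris", "London", "Rome"]

similarity = three_way_jaccard(tokens_seed_a, tokens_seed_b, tokens_seed_c)
print(similarity)
```

In this toy case two of the four distinct tokens are shared by all three sets.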

### H\.2 Sensitivity to the Number of Top Tokens $k$

Throughout the main experiments we fix the explanation size to the top $k=25$ tokens \(Section [4\.1](https://arxiv.org/html/2606.07617#S4.SS1)\)\. We verify that this choice does not drive our conclusions\. For each of the four configurations in Table [1](https://arxiv.org/html/2606.07617#S3.T1), we recompute the Input and Output scores while varying $k\in\{5,10,20,25\}$ under the identical metric, comparing QL against the matched LL baseline \(the key variant for the Input score and the value variant for the Output score\)\.

Table [10](https://arxiv.org/html/2606.07617#A8.T10) reports the results\. Two patterns hold across all settings\. First, both scores grow monotonically with $k$ for every method, since a larger token set $T$ is more likely to cover the target set\. Second, the ordering between QL and LL holds at every $k$: QL leads on the input side by a large margin \(often by an order of magnitude\), and on the output side at $k=25$ and almost all smaller $k$\. The only exception is a marginal reversal for Qwen\-3\-1\.7B at $k=5$ \($6.50$ vs\. $6.65$\)\. QL’s gains over LL therefore do not hinge on the specific choice of $k=25$\.

Table 10: Sensitivity of Input and Output scores \(%\) to the number of top tokens $k$, for the four configurations of Table [1](https://arxiv.org/html/2606.07617#S3.T1)\. Input scores use the key variant \($\text{LL}_{\text{Key}}$, $\text{QL}_{\text{Key}}$\) and Output scores the value variant \($\text{LL}_{\text{Value}}$, $\text{QL}_{\text{Value}}$\)\. For each \(configuration, score, $k$\) the larger of QL and LL is in bold\.

## Appendix I Evaluation Details

### I.1 Activation Data

For each feature, the set of sentences $S$ used in the input-side evaluation (Section [4.2](https://arxiv.org/html/2606.07617#S4.SS2.SSS0.Px1)) is constructed as follows. For GPT-2 Small and Gemma-3 features, we use the pre-computed top-activating examples from Neuronpedia (Lin, [2023](https://arxiv.org/html/2606.07617#bib.bib14)). For Qwen-3-1.7B features, Qwen-Scope SAEs are not indexed by Neuronpedia, so we collect activations manually: we run the model on $16384\times 128$ tokens randomly sampled from `monology/pile-uncopyrighted` ([https://huggingface.co/datasets/monology/pile-uncopyrighted](https://huggingface.co/datasets/monology/pile-uncopyrighted)) and retain the top-$20$ activating contexts per feature.
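
One way to retain the top-20 contexts without storing every activation is a bounded min-heap, sketched below. This is an illustrative choice rather than the authors' procedure: the function `top_k_contexts`, the `(context_id, activation)` stream format, and the synthetic activations are all assumptions.

```python
import heapq


def top_k_contexts(activation_stream, k=20):
    """Keep the k highest-activating contexts from a stream of (context_id, activation)."""
    heap = []  # min-heap of (activation, context_id); heap[0] is the weakest kept entry
    for context_id, act in activation_stream:
        if len(heap) < k:
            heapq.heappush(heap, (act, context_id))
        elif act > heap[0][0]:
            heapq.heapreplace(heap, (act, context_id))
    return sorted(heap, reverse=True)  # strongest activation first


# Synthetic stand-in for one feature's activations over 1000 contexts (all distinct).
stream = [(i, (i * 37) % 1009) for i in range(1000)]
top = top_k_contexts(stream, k=20)
print(top[:3])
```

Memory stays proportional to $k$ per feature regardless of how many tokens are processed.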

### I.2 Steering Details

##### Choice of steering strengths $\mathcal{A}$.

For each feature, we select the steering strength $\alpha$ to target a fixed KL divergence between the clean and steered next-token distributions (Gur-Arieh et al., [2025](https://arxiv.org/html/2606.07617#bib.bib17); Paulo et al., [2025](https://arxiv.org/html/2606.07617#bib.bib21)), yielding $\mathcal{A}=\{\alpha_{\mathrm{KL}=0.25},\,\alpha_{\mathrm{KL}=0.5}\}$. KL targeting normalises the steering intensity across features that have very different natural activation scales.
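
The idea of KL targeting can be illustrated with a toy search for $\alpha$. The sketch below makes two simplifying assumptions that are not from the paper: steering shifts the next-token logits linearly by $\alpha$ times a fixed direction, and the divergence is taken as KL(clean ‖ steered). Under these assumptions the KL is non-decreasing in $\alpha\geq 0$, so bisection suffices. All function names and numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    # KL(p || q); terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def steered_kl(logits, direction, alpha):
    # Toy assumption: steering adds alpha * direction directly to the logits.
    clean = softmax(logits)
    steered = softmax([z + alpha * d for z, d in zip(logits, direction)])
    return kl_divergence(clean, steered)


def alpha_for_kl(logits, direction, target, tol=1e-8, max_doublings=60):
    """Bisection for the alpha >= 0 whose KL matches the target."""
    lo, hi = 0.0, 1.0
    for _ in range(max_doublings):  # grow the bracket until it contains the target
        if steered_kl(logits, direction, hi) >= target:
            break
        lo, hi = hi, hi * 2.0
    else:
        raise ValueError("target KL not reachable for this direction")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if steered_kl(logits, direction, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


logits = [2.0, 1.0, 0.5, -1.0]     # hypothetical clean logits
direction = [0.0, 0.0, 1.0, 0.5]   # hypothetical steering effect on the logits
alphas = {t: alpha_for_kl(logits, direction, t) for t in (0.25, 0.5)}
print(alphas)
```

In a real model the map from $\alpha$ to the steered distribution is nonlinear, so the same search would wrap a forward pass instead of the linear shift used here.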

##### Generation setup\.

Following Arad et al. ([2025](https://arxiv.org/html/2606.07617#bib.bib41)), we generate 20 tokens using a temperature of 0.75 after each of the prefixes listed in Table [11](https://arxiv.org/html/2606.07617#A9.T11). While the original study uses 50 prefixes, we augment the set to 100 prefixes generated with GPT-5.2 (OpenAI, [2025](https://arxiv.org/html/2606.07617#bib.bib38)) to increase prompt diversity.

Table 11: Neutral prefixes used for generation in the steering experiments.

- “Findings show that”
- “It’s no surprise that”
- “It’s been a long time since”
- “I once heard that”
- “Have you ever noticed that”
- “In my experience,”
- “Then the man said:”
- “I couldn’t believe when”
- “The craziest part was when”
- “I believe that”
- “The first thing I heard was”
- “If you think about it,”
- “The news mentioned”
- “Let me tell you a story about”
- “I was shocked to learn that”
- “She saw a”
- “Someone once told me that”
- “For some reason,”
- “It is observed that”
- “It might sound strange, but”
- “I can’t help but wonder if”
- “Studies indicate that”
- “They always warned me that”
- “It makes sense that”
- “According to reports,”
- “Nobody expected that”
- “At first, I didn’t believe that”
- “Research suggests that”
- “Funny thing is,”
- “That reminds me of the time when”
- “It has been noted that”
- “I never thought I’d say this, but”
- “It all comes down to”
- “I remember when”
- “What surprised me most was”
- “One time, I saw that”
- “It all started when”
- “The other day, I overheard that”
- “I was just thinking about how”
- “The legend goes that”
- “Back in the day,”
- “Imagine a world where”
- “If I recall correctly,”
- “You won’t believe what happened when”
- “They never expected that”
- “People often say that”
- “A friend of mine once said,”
- “I always knew that”
- “Once upon a time,”
- “I just found out that”
- “Over the years, I noticed that”
- “Looking back, I realize that”
- “From what I can tell,”
- “As far as I know,”
- “For a long time, I thought that”
- “It all made sense when”
- “The strange thing is that”
- “People rarely talk about how”
- “I recently discovered that”
- “To my surprise,”
- “Every now and then, I notice that”
- “There is growing evidence that”
- “The more I learn, the more I see that”
- “Experts often point out that”
- “I still remember the moment when”
- “I used to believe that”
- “Over time, it became clear that”
- “It started to make sense when”
- “Many people forget that”
- “What nobody told me was that”
- “You might have noticed that”
- “It was only later that I realized”
- “It felt like”
- “I keep coming back to the idea that”
- “For as long as I can remember,”
- “It suddenly hit me that”
- “The more I think about it,”
- “I have a feeling that”
- “Some people argue that”
- “There is a common belief that”
- “Lately, I have been noticing that”
- “One pattern I keep seeing is that”
- “Some say that”
- “Over time, people began to realize that”
- “I started to notice that”
- “From time to time, I hear that”
- “It became obvious that”
- “There was a moment when I realized that”
- “It turned out that”
- “What’s interesting is that”
- “Looking around, you can see that”
- “You could say that”
- “It often happens that”
- “For many people, it seems that”
- “Little did I know that”
- “It often turns out that”
- “I get the sense that”
- “One thing that stands out is that”
- “If history has taught us anything, it is that”
- “Something I have been wondering about is”

## Appendix J Additional Details on Subspace Channel Experiments

### J.1 Training Details for Low-Rank Mappings

This section provides training details for the low-rank linear maps $W^{l\rightarrow k}$ used to model feature reading channels in Section [5](https://arxiv.org/html/2606.07617#S5).

##### Data and Regression Setup\.

For each source layer $l\in\{1,\ldots,L\}$ of GPT-2 Small ($L=12$), we randomly sample $N=1000$ value features $\{v_{i}^{l}\}_{i=1}^{1000}$ from the GPT-2 Small SAE 32k. For each sampled feature $v_{i}^{l}$, we compute its downstream module responses

$$r_{i}^{l\rightarrow k}=J^{k}v_{i}^{l}\in\mathbb{R}^{d_{m}},$$

for all destinations $k>l$, where $J^{k}$ denotes the Jacobian of the residual block at layer $k$ with respect to the residual stream. This yields pairs $(v_{i}^{l},r_{i}^{l\rightarrow k})$ for each ordered pair $(l,k)$, which defines a supervised regression problem from value features to module responses.

##### Training\.

For each pair $(l,k)$ we split the samples independently into a training set (90%) and a validation set (10%). As described in Section [5](https://arxiv.org/html/2606.07617#S5), we then train a low-rank matrix $W^{l\rightarrow k}\in\mathbb{R}^{d_{m}\times d_{m}}$ of rank $r=d_{m}/L=64$ to minimize the reconstruction error between predicted and true module responses:

$$\mathcal{L}_{\text{MSE}}=\left\|r_{i}^{l\rightarrow k}-W^{l\rightarrow k}v_{i}^{l}\right\|_{2}^{2}.$$

Training is performed independently for each pair $(l,k)$ with the Adam optimizer (learning rate $\eta=1\times 10^{-3}$, no weight decay) for 1000 epochs at `bfloat16` precision, using a single gradient-accumulation step and random seed 42.

##### Evaluation\.

We evaluate each map with four metrics: mean squared error (MSE, $\left\|r-\hat{r}\right\|_{2}^{2}$), relative L2 error ($\left\|r-\hat{r}\right\|_{2}/\left\|r\right\|_{2}$), cosine similarity ($r\cdot\hat{r}/(\left\|r\right\|_{2}\left\|\hat{r}\right\|_{2})$), and explained energy ratio ($1-\left\|r-\hat{r}\right\|_{2}^{2}/\left\|r\right\|_{2}^{2}$). Table [12](https://arxiv.org/html/2606.07617#A10.T12) reports these averaged over the $66$ learned maps. The reconstruction generalizes well to held-out features: on the validation split it attains high cosine similarity and explains over half of the response energy despite the severe rank reduction.
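
The four metrics can be written out for a single response vector as in the sketch below, which follows the formulas in the text (note that MSE here is the squared L2 norm of the residual, as stated above, not a per-dimension average). The function name and the two-dimensional toy vectors are hypothetical.

```python
import math


def reconstruction_metrics(r, r_hat):
    """Per-sample metrics for a true response r and a predicted response r_hat."""
    err_sq = sum((a - b) ** 2 for a, b in zip(r, r_hat))
    norm_r = math.sqrt(sum(a * a for a in r))
    norm_hat = math.sqrt(sum(b * b for b in r_hat))
    dot = sum(a * b for a, b in zip(r, r_hat))
    return {
        "mse": err_sq,                                  # ||r - r_hat||_2^2
        "relative_l2": math.sqrt(err_sq) / norm_r,      # ||r - r_hat||_2 / ||r||_2
        "cosine": dot / (norm_r * norm_hat),
        "explained_energy": 1.0 - err_sq / norm_r**2,   # 1 - ||r - r_hat||_2^2 / ||r||_2^2
    }


r = [3.0, 4.0]      # toy "true" module response
r_hat = [3.0, 3.0]  # toy low-rank prediction
metrics = reconstruction_metrics(r, r_hat)
print(metrics)
```

By construction the explained energy ratio equals one minus the squared relative L2 error, so the two columns carry the same information on a per-sample basis; they can differ once each is averaged separately over samples.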

Table 12: Reconstruction performance averaged over 66 mappings.

### J.2 Statistical Tests for $\mathrm{OL}$

Figure [5(a)](https://arxiv.org/html/2606.07617#S5.F5.sf1) suggests that overlaps between maps sharing the same destination layer are systematically higher than overlaps between maps with different destinations. We quantify this separation with a permutation test that keeps the full set of learned maps fixed while randomizing the grouping.

##### Grouping and Observed Gap\.

The $66$ learned maps across $12$ layers form $2145$ pairs $(W^{l_{1}\rightarrow k_{1}},W^{l_{2}\rightarrow k_{2}})$, of which $220$ share the same destination layer ($k_{1}=k_{2}$) and $1925$ have different destinations ($k_{1}\neq k_{2}$). Table [13](https://arxiv.org/html/2606.07617#A10.T13) reports the mean and standard deviation of each group, which give an observed mean gap of $\Delta_{\text{mean}}=0.4161$.
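
The group sizes follow from simple counting, which the short check below reproduces by enumerating every map $(l,k)$ with $k>l$ for $L=12$ and then every unordered pair of maps. The variable names are illustrative only.

```python
from itertools import combinations

L = 12
# One map per ordered (source, destination) pair with destination > source.
maps = [(l, k) for l in range(1, L + 1) for k in range(l + 1, L + 1)]
pairs = list(combinations(maps, 2))

same_dest = sum(1 for (_, k1), (_, k2) in pairs if k1 == k2)
diff_dest = len(pairs) - same_dest

print(len(maps), len(pairs), same_dest, diff_dest)  # 66 2145 220 1925
```

The same-destination group is roughly a tenth of all pairs, which is why a permutation test that preserves the group sizes is a natural way to compare the two means.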

Table 13: Mean and standard deviation of $\mathrm{OL}(W_{i},W_{j})$ over the pairs in each group.

##### Permutation Test.

To assess significance without parametric assumptions, we run a one-sided permutation test: we keep the learned maps fixed and randomly shuffle the destination-layer labels $k$, redefining the two groups on every permutation. For each shuffled assignment, we recompute (1) the mean gap $\Delta_{\text{mean}}$ and (2) the separation statistic AUC. The $p$-value is estimated as the fraction of permutations whose statistic is at least as large as the observed value. With $N_{\text{perm}}=10{,}000$, we obtain $p=0.0001$ for both $\Delta_{\text{mean}}$ and AUC, indicating that the observed separation is highly unlikely under random destination assignments.
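
A minimal sketch of the label-shuffling test for the mean gap is given below, run on synthetic overlaps rather than the paper's data. Several details are assumptions: the overlap values are simulated with a built-in same-destination bonus, only the mean-gap statistic is shown (not AUC), and the $p$-value uses the common add-one estimator $(1+\text{hits})/(1+N_{\text{perm}})$, which the source does not confirm.

```python
import random
from itertools import combinations


def mean_gap(overlap, dest):
    """Mean overlap of same-destination pairs minus mean of different-destination pairs."""
    same, diff = [], []
    for (i, j), value in overlap.items():
        (same if dest[i] == dest[j] else diff).append(value)
    return sum(same) / len(same) - sum(diff) / len(diff)


def permutation_p_value(overlap, dest, n_perm, rng):
    observed = mean_gap(overlap, dest)
    shuffled = list(dest)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign destination labels, keep overlaps fixed
        if mean_gap(overlap, shuffled) >= observed:
            hits += 1
    # Add-one estimator (an assumption; avoids reporting p = 0).
    return observed, (hits + 1) / (n_perm + 1)


rng = random.Random(0)
L = 12
# Destination label of each of the 66 maps (l -> k with k > l).
dest = [k for l in range(1, L + 1) for k in range(l + 1, L + 1)]

# Synthetic overlaps: same-destination pairs get a bonus plus Gaussian noise.
overlap = {}
for i, j in combinations(range(len(dest)), 2):
    bonus = 0.4 if dest[i] == dest[j] else 0.0
    overlap[(i, j)] = 0.3 + bonus + rng.gauss(0.0, 0.05)

observed, p_value = permutation_p_value(overlap, dest, n_perm=200, rng=rng)
print(round(observed, 3), p_value)
```

Shuffling the labels preserves how many maps carry each destination, so both groups keep their original sizes in every permutation and only the assignment of overlaps to groups changes.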

## Appendix K Qualitative Analysis

### K.1 Prompt Details

**Prompt for Semantic Summarization and Interpretability Scoring**

You will be given a ranked list of vocabulary tokens that are most strongly promoted in the output (or most strongly activating on the input) of a single sparse feature in a language model. Your job is to judge how interpretable this feature is.

Do two things:

1) **Meaning summary**: name the single dominant theme that the tokens support.
2) **Interpretability score** (0–10): rate the *topical concentration* of the token list — i.e., what fraction of tokens directly support one coherent theme.

**How to read the tokens**

- Tokens may be subword/BPE pieces (e.g., “ansas” from “Kansas”, “ Direct” from “Director”). Consider the canonical word or concept the piece belongs to before judging coherence.
- Multiple languages or scripts referring to the same concept (e.g., “food” in English, Chinese, Thai, Korean) count as ONE coherent theme. Do not penalize multilingual coverage; treat it the same as monolingual coverage of the same concept.
- Tokens are typically rank-ordered by promotion strength. Earlier tokens carry slightly more weight, but treat the full list when computing concentration.
- If two coherent themes coexist, pick the dominant one for Summary, but do not lower the score solely because more than one theme is present — judge on coverage of the dominant theme.

**Scoring rubric** (use the % of tokens directly supporting the dominant theme)

- 9–10: ≥80% of tokens directly support a single, specific theme. Off-topic tokens are rare and look like residual noise.
- 7–8: 60–79% support; theme is clear; noticeable but minority off-topic tokens.
- 4–6: 40–59% support; theme is identifiable but ∼half the list is mixed or off-topic.
- 2–3: 20–39% support; weak clustering; theme is plausible but much of the list is unrelated.
- 0–1: \<20% support, or no recognizable specialized theme. Tokens look generic, scattered, or like punctuation/format noise.

**Calibration examples** (do not echo these in your output)

*Example A — Score 9.*
Tokens: “ meal, meals, food, foods, cuisine, gastronomy, ⟨zh:meal⟩, ⟨th:food⟩, ⟨ko:food⟩, comida, Mahlzeit, cibo, repas, makanan, yemek, meal time, dining, edible, nourishment, appetite, snack, feast, buffet, culinary, diet”
Summary: “Food and meals across multiple languages.”
Explanation: “Almost every token names food, a meal, or eating; the multilingual variants describe the same concept.” → 9

*Example B — Score 6.*
Tokens: “ manager, Director, supervisor, Manager, director, CEO, chief, Co, ord, inate, Sup, ervis, or, staff, team, office, colleague, the, and, with, to, of, in, on, ed”
Summary: “Senior management / leadership roles.”
Explanation: “About half the tokens denote leadership/management roles; the remainder are common function words and BPE fragments unrelated to the theme.” → 6

*Example C — Score 1.*
Tokens: a scattered mix of unrelated fragments across Arabic, CJK, Devanagari, Sinhala, Tamil, Bengali, Kannada, Cyrillic, and Vietnamese scripts together with German BPE pieces and punctuation/format symbols.
Summary: “None”
Explanation: “Tokens are scattered across unrelated languages, punctuation, and format pieces with no recognizable single theme.” → 1

**Rules**

- Be strict. Topical concentration, not confidence in your guess, drives the score.
- If no specific theme is supported, set Summary to “None” and give a score in 0–1.
- Do not infer intent, sentiment, or specific entity identities beyond what the tokens directly support.

**Output**

Return RAW JSON only. No markdown fences, no commentary outside the JSON.

{“Score”: \<integer 0 to 10\>, “Summary”: “\<one-sentence specific meaning/function or ‘None’\>”, “Explanation”: “\<one short sentence: which tokens support the theme and roughly what fraction of the list does\>”}
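
Because the prompt asks for raw JSON with three fixed keys, the judge's reply can be checked mechanically. The sketch below is one hypothetical way to do so: the function `parse_judge_output`, its lower-cased return keys, and the sample reply are assumptions and not part of the paper's pipeline.

```python
import json


def parse_judge_output(raw):
    """Parse the judge's raw JSON reply and validate the 0-10 integer score."""
    data = json.loads(raw)
    score = data["Score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 10:
        raise ValueError(f"Score must be an integer in [0, 10], got {score!r}")
    return {
        "score": score,
        "summary": str(data["Summary"]).strip(),
        "explanation": str(data["Explanation"]).strip(),
    }


# Hypothetical reply in the format requested by the prompt.
raw = (
    '{"Score": 9, "Summary": "Food and meals across multiple languages.", '
    '"Explanation": "Almost every token names food, a meal, or eating."}'
)
parsed = parse_judge_output(raw)
print(parsed["score"], parsed["summary"])
```

Rejecting out-of-range or non-integer scores early keeps malformed replies from silently entering an average.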

### K.2 Qualitative Examples

Section [4.3](https://arxiv.org/html/2606.07617#S4.SS3) and Figure [3](https://arxiv.org/html/2606.07617#S4.F3) illustrated how Query Lens token sets converge to a single recognizable concept where Logit Lens returns tokenizer fragments. Figures [7](https://arxiv.org/html/2606.07617#A11.F7)–[22](https://arxiv.org/html/2606.07617#A11.F22) extend that comparison with more examples, spanning all eight model configurations (SAEs and transcoders) with up to ten *key* and ten *value* features each. Following the same format, each table lists both methods’ top-scoring tokens and their GPT-5-nano interpretability scores, with the Query Lens row shaded and its score in **bold**.

![Refer to caption](https://arxiv.org/html/2606.07617v1/x19.png)
Figure 7: Qualitative *key* feature examples on GPT-2 Small (32K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x20.png)
Figure 8: Qualitative *value* feature examples on GPT-2 Small (32K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x21.png)
Figure 9: Qualitative *key* feature examples on Gemma-3-270M (65K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x22.png)
Figure 10: Qualitative *value* feature examples on Gemma-3-270M (65K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x23.png)
Figure 11: Qualitative *key* feature examples on Gemma-3-1B (65K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x24.png)
Figure 12: Qualitative *value* feature examples on Gemma-3-1B (65K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x25.png)
Figure 13: Qualitative *key* feature examples on Gemma-3-4B (65K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x26.png)
Figure 14: Qualitative *value* feature examples on Gemma-3-4B (65K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x27.png)
Figure 15: Qualitative *key* feature examples on Qwen-3-1.7B-Base (32K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x28.png)
Figure 16: Qualitative *value* feature examples on Qwen-3-1.7B-Base (32K).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x29.png)
Figure 17: Qualitative *key* feature examples on Qwen-3-0.6B (transcoder).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x30.png)
Figure 18: Qualitative *value* feature examples on Qwen-3-0.6B (transcoder).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x31.png)
Figure 19: Qualitative *key* feature examples on Qwen-3-1.7B (transcoder).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x32.png)
Figure 20: Qualitative *value* feature examples on Qwen-3-1.7B (transcoder).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x33.png)
Figure 21: Qualitative *key* feature examples on Qwen-3-4B (transcoder).

![Refer to caption](https://arxiv.org/html/2606.07617v1/x34.png)
Figure 22: Qualitative *value* feature examples on Qwen-3-4B (transcoder).