Size Doesn't Matter: Cosine-Scored Sparse Autoencoders
Summary
This paper proposes replacing the inner product scoring in sparse autoencoders with a learned combination of cosine similarity and input magnitude, showing that the resulting features are more interpretable and concept-aligned, with the optimizer consistently preferring cosine over inner product.
# Size Doesn’t Matter: Cosine-Scored Sparse Autoencoders
Source: [https://arxiv.org/html/2606.15054](https://arxiv.org/html/2606.15054)
###### Abstract
Sparse autoencoders (SAEs) detect features via inner product, so a feature’s activation scales with both its directional alignment and the input’s norm. Under BatchTopK, high-norm tokens inflate all pre-activations simultaneously, claiming dictionary slots regardless of content alignment. This matters because sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read. We replace the score with a learned blend of cosine similarity and input magnitude, letting the optimizer choose how much norm to use; a per-feature extension lets each feature decide independently. In both regimes, training is free to recover inner product but never does, with no feature ever choosing more than half-magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts far more often than the standard encoder, filling dictionary slots that inner product wastes on norm detectors. Loss reweighting that equalizes gradients barely closes the gap, confirming forward-pass score geometry as the lever. The advantage is not universal across tasks or depths, but we believe cosine scoring should be the default for dictionary learning on normalized representations.
Sparse Autoencoders, Mechanistic Interpretability, Dictionary Learning
## 1 Introduction
Sparse autoencoders are a standard dictionary-learning tool for mechanistic interpretability (Bricken et al., [2023](https://arxiv.org/html/2606.15054#bib.bib9); Cunningham et al., [2023](https://arxiv.org/html/2606.15054#bib.bib10); Gao et al., [2024](https://arxiv.org/html/2606.15054#bib.bib12); Bussmann et al., [2024](https://arxiv.org/html/2606.15054#bib.bib15); Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)). A trained SAE is read as a feature dictionary: if feature $i$ fires on activation $x$, then $x$ is taken to contain the concept represented by the corresponding decoder direction. The interpretation depends on how the encoder detects those directions. The standard rule is an inner product $\langle w_i, x\rangle$, so a feature fires in proportion to both its directional alignment with $x$ and the magnitude $\|x\|$.
Figure 1: The cosine encoder: architecture and headline results. Left: Standard SAE encoder computes $\langle w_i, x_c\rangle$, coupling alignment with norm. Cosine encoder unit-normalizes $w_i$ and replaces the score with $e^{b}\|x_c\|^{a}\cos(x_c, w_i) + b_{\mathrm{enc},i}$; $a$ interpolates between pure cosine ($a = 0$) and inner product ($a = 1$). Right: Matched reconstruction (FVE $\approx 0.77$) with $+14.9$% sparse-probing top-1. Training drives $a \approx 0.26$, far from the inner-product limit. Qwen3-8B L18, 500M tokens, $d_{\mathrm{sae}} = 65{,}536$.

Figure 2: Cosine-scored SAEs win on probing because standard features fire on token norm. (A) Result: sparse-probing top-1 across eight tasks (Qwen3-8B L18, 500M tokens, $d_{\mathrm{sae}} = 65{,}536$, matched FVE $\approx 0.77$). Per-feature cosine wins on 7/8 tasks; sentiment is the only exception. (B) Cause: standard’s *unmatched* features (those with no nearest-neighbor counterpart in cosine’s dictionary) fire $22\times$ more on the highest-norm token quartile than on the lowest, versus $4.7\times$ for cosine; they encode magnitude, not content. (C) Confirmation: on the same high-norm tokens, the standard SAE reconstructs at $9.5\times$ the input norm, while cosine stays close to the input scale ($0.55\times$). Removing the $\|x\|$ factor from the score recovers content-encoding features. Provenance: Appendix [D.2](https://arxiv.org/html/2606.15054#A4.SS2).

Figure 3: Score-surface geometry. Each panel plots the encoder pre-activation $\|x_c\|^{a}\cos(x_c, w_i)$ over alignment (x-axis) and input norm (y-axis). Black curves join equally-scored pairs. *Left:* $a = 1$ (inner product); hyperbolic curves, high-norm tokens outscore better-aligned low-norm ones. *Center:* $a = 0$ (cosine); vertical curves, norm ignored. *Right:* global learned $a \approx 0.26$; mild tilt, close to cosine. Bottom: per-feature $a_i$ distribution at the headline setting.

This is the wrong scoring geometry for transformers with pre-sublayer normalization. The inner-product score mixes content alignment with token norm, but normalization strips that norm before the model reads the activation; the encoder detects a quantity the downstream computation ignores. Under BatchTopK selection (Bussmann et al., [2024](https://arxiv.org/html/2606.15054#bib.bib15)), this mismatch compounds: a single batch-wide threshold determines which features fire, and high-norm tokens inflate all their pre-activations past this threshold, claiming disproportionately many slots regardless of content alignment. Over training, this selection pressure starves content-encoding features: they rarely win TopK slots, receive no learning signal, and never develop. The result is a dictionary dominated by features that fire on magnitude rather than meaning. This is not a minor calibration issue. If dictionary elements fire on norm rather than content, they cannot reliably be used for sparse probing, feature-based steering, or circuit analysis; the practitioner interprets them as model concepts, but the features encode a quantity the model discards. A score aligned with what the model actually reads should depend on direction, not magnitude.
We replace the inner-product score with a cosine score scaled by a learned exponent $a$ on the input norm:

$$s_i(x) \;=\; \|x\|^{a} \cdot \cos(x, w_i) \;+\; b_{\mathrm{enc},i},$$

which interpolates between cosine ($a = 0$) and inner product ($a = 1$); we use unit-normalized encoder rows so $a = 0$ recovers cosine cleanly (§[3](https://arxiv.org/html/2606.15054#S3)). (Footnote 1: Up to a learned global temperature; see §[3](https://arxiv.org/html/2606.15054#S3).) We study two parameter regimes: a single global $a$ shared across features, and a per-feature extension $a_i = a_{\mathrm{base}} + \delta_i$ that lets each feature learn its own norm dependence. The key property is that the optimizer is free to recover inner product ($a = 1$) if magnitude is genuinely useful, but is not forced to use it by default. This lets training discover the right geometry rather than assuming it. In both regimes, training consistently drives $a$ toward zero; no feature ever approaches the inner-product regime, confirming that direction, not magnitude, is the useful signal.
Through experiments on Qwen3-8B and Gemma-2-2B, we demonstrate that cosine-scored SAEs match standard BatchTopK on reconstruction fidelity, model-behavior preservation, and per-feature interpretability, while producing dictionaries that align far more often with human-recognizable concepts. A matched-feature decomposition reveals that the majority of the probing gap comes not from improving shared features, but from features the standard encoder never learns at all; its dictionary slots are instead occupied by norm detectors. We isolate the cause to forward-pass TopK selection under norm inflation: loss reweighting that equalizes gradients across norm levels barely closes the gap, confirming that the score geometry itself, not the training signal, is the lever. The advantage replicates across layers and models, though it is not universal; deep LayerNorm layers and sentiment are exceptions where magnitude carries task-relevant signal.
Contributions.
- Architecture. We introduce cosine-scored sparse autoencoders, replacing the inner-product encoder score with a learned-exponent cosine score that interpolates between pure cosine ($a = 0$) and inner product ($a = 1$). We study global and per-feature parameterizations and identify the design choices that make each stable (§[3](https://arxiv.org/html/2606.15054#S3)).
- Result. At matched reconstruction, KL, and per-feature interpretability, the cosine encoder lifts sparse-probing top-1 by $+14.9$% on Qwen3-8B. A matched-feature decomposition shows that the majority of the gap comes from features the standard encoder never learns at all, rather than from improving shared features. The result replicates across layers and on Gemma-2-2B (§[4.1](https://arxiv.org/html/2606.15054#S4.SS1), §[4.3](https://arxiv.org/html/2606.15054#S4.SS3)).
- Mechanism. The standard encoder’s unmatched features fire overwhelmingly on high-norm tokens; they encode magnitude, not content. Forward-pass TopK selection under norm inflation is the primary cause; loss reweighting that equalizes gradients across norm levels barely closes the gap, confirming that the score geometry itself is the lever. The optimizer independently confirms this: training drives $a$ far from the inner-product regime, with no feature ever choosing more than half-magnitude dependence (§[4.3](https://arxiv.org/html/2606.15054#S4.SS3), Fig. [6](https://arxiv.org/html/2606.15054#S4.F6)).
For practitioners, this is a drop-in encoder change that produces more interpretable dictionaries at no reconstruction cost. More broadly, it suggests that scoring geometry is an underexplored design axis for dictionary learning: if inner product is the wrong inductive bias for normalized representations, standard SAE dictionaries trained on modern transformers may systematically undercount content features. We believe cosine scoring should be the default for dictionary learning on normalized representations, and that the failure mode we identify likely affects any inner-product SAE trained on a post-RMSNorm site.
## 2 Background
A sparse autoencoder (SAE) takes an activation $x \in \mathbb{R}^{d}$, projects it through a single-layer encoder–decoder pair to a sparse code $z \in \mathbb{R}^{d_{\mathrm{sae}}}$, and reads the decoder rows $\{W_{\mathrm{dec},i}\}$ as a feature dictionary:

$$z = \sigma\!\left((x - b_{\mathrm{dec}})\,W_{\mathrm{enc}}^{\top} + b_{\mathrm{enc}}\right), \qquad \hat{x} = zW_{\mathrm{dec}} + b_{\mathrm{dec}},$$

with reconstruction loss

$$\mathcal{L} = \|x - \hat{x}\|^{2}.$$

Different SAE variants make $\sigma$ sparse in different ways, including TopK (Gao et al., [2024](https://arxiv.org/html/2606.15054#bib.bib12)), BatchTopK (Bussmann et al., [2024](https://arxiv.org/html/2606.15054#bib.bib15)), JumpReLU (Rajamanoharan et al., [2024b](https://arxiv.org/html/2606.15054#bib.bib14)), gated activations (Rajamanoharan et al., [2024a](https://arxiv.org/html/2606.15054#bib.bib13)), and AbsTopK (Zhu et al., [2025](https://arxiv.org/html/2606.15054#bib.bib19)). SAEs tend to assume the linear representation hypothesis: concepts correspond to linear directions in activation space (Park et al., [2024](https://arxiv.org/html/2606.15054#bib.bib5), [2025](https://arxiv.org/html/2606.15054#bib.bib6); Elhage et al., [2022](https://arxiv.org/html/2606.15054#bib.bib11)). Recent variants modify dictionary geometry (Korznikov et al., [2025](https://arxiv.org/html/2606.15054#bib.bib20); Bussmann et al., [2025](https://arxiv.org/html/2606.15054#bib.bib16)) or loss shape (Nasiri-Sarvi and others, [2026](https://arxiv.org/html/2606.15054#bib.bib21)) but leave the encoder score unchanged. A feature’s pre-activation is still an inner product $\langle w_i, x\rangle = \|w_i\|\,\|x\|\,\cos(w_i, x)$, so larger-norm tokens raise the score even when directional alignment is unchanged.
Following standard practice (Bricken et al., [2023](https://arxiv.org/html/2606.15054#bib.bib9); Gao et al., [2024](https://arxiv.org/html/2606.15054#bib.bib12); Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)), decoder rows are constrained to unit norm during training: re-normalized after each optimizer step, with the component of the decoder gradient parallel to each row projected away to account for the Adam-normalization interaction (Gao et al., [2024](https://arxiv.org/html/2606.15054#bib.bib12)). The dictionary therefore represents directions, and any per-feature scale is absorbed into the encoder. The encoder reads the input centered on the decoder bias ($x \mapsto x - b_{\mathrm{dec}}$, as in the equation above), tying its zero point to the reconstruction origin (Bricken et al., [2023](https://arxiv.org/html/2606.15054#bib.bib9)).
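The unit-norm constraint described above can be sketched for a single decoder row as follows. This is a minimal illustration, not the authors' training code: the row and gradient values are toy numbers, the helper names are hypothetical, and a plain SGD step stands in for the Adam update used in the recipe.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out_parallel(grad, row):
    """Drop the component of `grad` that is parallel to a unit-norm decoder row."""
    coeff = dot(grad, row)
    return [g - coeff * r for g, r in zip(grad, row)]

def renormalize(row):
    """Rescale a decoder row back to unit L2 norm."""
    n = math.sqrt(dot(row, row))
    return [r / n for r in row]

row = [0.6, 0.8]    # unit-norm decoder row (toy values)
grad = [1.0, 1.0]   # hypothetical raw gradient for that row

# Step 1: keep only the gradient component orthogonal to the row.
grad_perp = project_out_parallel(grad, row)

# Step 2: take an optimizer step (plain SGD here, an assumption), then re-normalize.
lr = 0.1
stepped = [r - lr * g for r, g in zip(row, grad_perp)]
new_row = renormalize(stepped)
```

After both steps the row still has unit norm, and the applied gradient changed only its direction, which is why any per-feature scale has to live in the encoder.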
We hook the SAE on the residual stream: the activation $x$ that flows untouched through every transformer block, with each sublayer adding to it. Most modern transformers place an RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2606.15054#bib.bib1)) on the path *into* every sublayer:

$$\mathrm{RMSNorm}(x) \;=\; \sqrt{d}\,\frac{x}{\|x\|} \odot g, \qquad g \in \mathbb{R}^{d}.$$

Each sublayer therefore reads $x/\|x\|$ up to the per-coordinate gain $g$, not $x$ itself. The SAE, hooked one step earlier on the residual stream, still sees $x$ in full. Two activations with the same direction and different norms look nearly identical to the model but very different to an inner-product encoder, which scores them in the ratio of their norms. Residual-stream norms are heavy-tailed in practice, driven by rogue dimensions (Timkey and van Schijndel, [2021](https://arxiv.org/html/2606.15054#bib.bib3)) and outlier features (Dettmers et al., [2022](https://arxiv.org/html/2606.15054#bib.bib4)), so this bias is not hypothetical.
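A small numerical check of this mismatch, using the RMSNorm formula above. The vectors, the gain `g`, and the encoder row `w` are made-up toy values, and the sketch omits the small epsilon that practical RMSNorm implementations add for numerical stability.

```python
import math

def rmsnorm(x, g):
    """RMSNorm(x) = sqrt(d) * x / ||x||, multiplied elementwise by the gain g."""
    d = len(x)
    norm = math.sqrt(sum(v * v for v in x))
    return [math.sqrt(d) * v / norm * gi for v, gi in zip(x, g)]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [1.0, -2.0, 0.5, 3.0]
x_big = [10.0 * v for v in x]   # same direction, 10x the norm
g = [1.0, 1.0, 1.0, 1.0]        # hypothetical gain vector
w = [0.5, -0.5, 0.5, 0.5]       # hypothetical unit-norm encoder row

# The sublayer input is the same for both activations...
same_to_model = all(
    math.isclose(p, q) for p, q in zip(rmsnorm(x, g), rmsnorm(x_big, g))
)
# ...but the inner-product score differs by the ratio of the norms.
score_ratio = inner(w, x_big) / inner(w, x)
```

Here `same_to_model` is true while `score_ratio` is 10 up to floating-point rounding, which is the scale sensitivity the paragraph describes.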
Directly swapping activation directions between random token pairs (no SAE involvement) produces $219\times$ more downstream KL-divergence than swapping their norms at the same layer (Appendix [C.1](https://arxiv.org/html/2606.15054#A3.SS1)). The model’s computation is direction-dominated; the SAE’s encoder should be too.
We use BatchTopK throughout (Bussmann et al., [2024](https://arxiv.org/html/2606.15054#bib.bib15)), the SAEBench default (Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)). BatchTopK imposes a single batch-wide activation budget rather than forcing exactly $k$ features per token, letting the model allocate more features to complex tokens and fewer to simple ones. Given pre-activations $u = \sigma(xW_{\mathrm{enc}}^{\top} + b_{\mathrm{enc}}) \in \mathbb{R}^{N \times d_{\mathrm{sae}}}$ over a batch of $N$ tokens, $z = \operatorname{BatchTopK}(u)$ keeps only the $kN$ largest entries and zeros the rest. This flexibility also means high-norm tokens can claim disproportionately many slots: their inflated pre-activations dominate the batch-wide ranking regardless of directional alignment. Our cosine score modifies only the encoder pre-activation, not the sparsity mechanism; it is compatible in principle with any selector (TopK, JumpReLU, gated), though we test only BatchTopK here (cf. Appendix [A](https://arxiv.org/html/2606.15054#A1)).
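The slot-claiming effect can be reproduced on a two-token toy batch. The sketch below is a plain-Python illustration of the "keep the $kN$ largest entries" rule, not the SAEBench implementation; the cosine values and norms are chosen by hand and the function names are hypothetical.

```python
def batch_topk(u, k):
    """Keep the k * N largest positive entries of an N x d_sae score matrix."""
    n_tokens = len(u)
    budget = k * n_tokens
    entries = [(val, t, f) for t, row in enumerate(u) for f, val in enumerate(row)]
    entries.sort(key=lambda e: e[0], reverse=True)
    z = [[0.0] * len(row) for row in u]
    for val, t, f in entries[:budget]:
        if val > 0.0:
            z[t][f] = val
    return z

def active_per_token(z):
    return [sum(1 for v in row if v != 0.0) for row in z]

# Token 0 is low-norm but has one well-aligned feature (cos = 0.9).
# Token 1 is high-norm with only weak alignments.
cosines = [[0.9, 0.1, 0.05, 0.0], [0.3, 0.25, 0.2, 0.15]]
norms = [1.0, 10.0]

# Inner-product scores multiply each token's cosines by its norm.
inner_scores = [[n * c for c in row] for n, row in zip(norms, cosines)]

inner_active = active_per_token(batch_topk(inner_scores, k=2))
cosine_active = active_per_token(batch_topk(cosines, k=2))
```

With inner-product scores the high-norm token takes all four slots in the budget and the well-aligned feature on token 0 never fires (`inner_active == [0, 4]`); with cosine scores that feature ranks first (`cosine_active == [1, 3]`).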
More broadly, modern architectures are adding normalization (QK-norm (Henry et al., [2020](https://arxiv.org/html/2606.15054#bib.bib2); Dehghani et al., [2023](https://arxiv.org/html/2606.15054#bib.bib49)), nGPT (Loshchilov et al., [2024](https://arxiv.org/html/2606.15054#bib.bib50))), strengthening the case for direction-aware SAE encoders.
Evaluation metrics. We follow SAEBench (Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)) throughout. Reconstruction is reported as fraction of variance explained, $\mathrm{FVE} = 1 - \mathbb{E}\|x - \hat{x}\|^{2}/\mathrm{Var}(x)$, computed per token and aggregated. Sparse probing fits a linear probe restricted to the $k$ SAE features that best predict a labeled concept and reports the resulting accuracy (top-$k$) over eight classification datasets; top-1 isolates the single best feature and is the most direct test of whether one dictionary direction matches a human-recognizable concept. Auto-interp asks an LLM to describe a feature from its top-activating contexts and a second LLM to score whether the description predicts the firing pattern; we report the fraction of features judged interpretable (Appendix [F.1](https://arxiv.org/html/2606.15054#A6.SS1)).
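As a concrete reading of the FVE formula, the sketch below computes it on a four-token toy batch. It assumes $\mathrm{Var}(x)$ means the total variance around the batch mean, summed over coordinates; SAEBench's exact aggregation may differ, and the function name is hypothetical.

```python
def fve(xs, xhats):
    """FVE = 1 - E||x - x_hat||^2 / Var(x), with Var(x) taken as total variance."""
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    mse = sum(
        sum((a - b) ** 2 for a, b in zip(x, xh)) for x, xh in zip(xs, xhats)
    ) / n
    var = sum(sum((a - m) ** 2 for a, m in zip(x, mean)) for x in xs) / n
    return 1.0 - mse / var

xs = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]

perfect = fve(xs, xs)                                      # exact reconstruction
mean_only = fve(xs, [[0.0, 0.0]] * 4)                      # always predict the mean
overscaled = fve(xs, [[3.0 * v for v in x] for x in xs])   # right direction, 3x scale
```

FVE is 1 for exact reconstruction, 0 for predicting the mean, and negative when reconstructions have the right direction but the wrong scale, which is the regime that appears later in the per-quartile analysis.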
## 3 The Cosine Encoder
The fix is straightforward in principle: remove $\|x\|$ from the score. The design question is how much norm information to retain, since the decoder must still reconstruct $x$ at full scale. We define a one-parameter family that lets the optimizer answer this question.
Notation. Let $x_c = x - b_{\mathrm{dec}}$ be the input centered on the decoder bias. We keep encoder rows $w_i$ unit-normalized, recomputing $w_i \leftarrow w_i/\|w_i\|$ on every forward pass with gradients flowing through the normalization. (Footnote 2: So cosine is just an inner product on the sphere: with $\|w_i\| = 1$, $\cos(x_c, w_i) = \langle x_c/\|x_c\|, w_i\rangle$, and at $a = 1$ the score below reduces to $\langle x_c, w_i\rangle$ exactly.) Decoder rows are also unit-normalized, per the held-fixed community recipe. Encoder unit-normalization is what makes the score true cosine. Dropping it (keeping input-side normalization but using raw $w_i$) yields $\|w_i\| \cdot \cos(x_c, w_i)$, a half-cosine that reintroduces $\|w_i\|$ as a magnitude factor. $\|w_i\|$ then drifts unconstrained (no loss term penalizes it) and the dictionary collapses to $93$% dead features at L27/50M. (Footnote 3: The full design-space matrix (these three settings plus per-axis ablations on encoder/decoder normalization and post-decode norm restoration) is in Appendix [B.2](https://arxiv.org/html/2606.15054#A2.SS2).)
Score, pipeline, and initialization. We define the pre-activation as

$$s_i(x) \;=\; \underbrace{\exp\!\big(a\log\|x_c\| + b\big)}_{e^{b}\,\|x_c\|^{a}} \cdot \cos(x_c, w_i) \;+\; b_{\mathrm{enc},i}.$$

The log-exp parameterization makes the scale explicit: a linear fit $\log(\mathrm{scale}) = a\log\|x_c\| + b$ exponentiates to $\mathrm{scale} = e^{b}\|x_c\|^{a}$, where $a$ is the norm-dependence exponent and $e^{b}$ is the scale at $\|x_c\| = 1$ rather than a separately introduced temperature. This form gives well-conditioned gradients in $a$ regardless of $\|x\|$ (Appendix [B](https://arxiv.org/html/2606.15054#A2)). It also interpolates cleanly between two familiar cases. When $a = 0$, $\|x_c\|^{a} = 1$ and $s_i = e^{b}\cos(x_c, w_i) + b_{\mathrm{enc},i}$ is pure cosine at scale $e^{b}$. When $a = 1$, $s_i = e^{b}\,\langle x_c, w_i\rangle + b_{\mathrm{enc},i}$ is inner product up to a global scale, since $w_i$ is unit-norm. After applying a ReLU, this score is passed to BatchTopK (Bussmann et al., [2024](https://arxiv.org/html/2606.15054#bib.bib15)). The decoder remains unchanged: $\hat{x} = zW_{\mathrm{dec}} + b_{\mathrm{dec}}$, with $\mathcal{L} = \|x - \hat{x}\|^{2}$. We initialize $a = 0$ and $b = \log\sqrt{d_{\mathrm{model}}}$.
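The two limiting cases can be checked numerically for a single feature. This is a minimal sketch of the score as written above, before the ReLU and BatchTopK; the function name and the toy vectors are illustrative assumptions, and a real encoder would compute all features in one batched matrix product.

```python
import math

def cosine_score(x_c, w, a, b, b_enc):
    """s_i(x) = exp(a * log||x_c|| + b) * cos(x_c, w_i) + b_enc_i for one feature."""
    x_norm = math.sqrt(sum(v * v for v in x_c))
    w_norm = math.sqrt(sum(v * v for v in w))
    w_unit = [v / w_norm for v in w]   # encoder row re-normalized on each forward pass
    cos = sum(p * q for p, q in zip(x_c, w_unit)) / x_norm
    scale = math.exp(a * math.log(x_norm) + b)
    return scale * cos + b_enc

x_c = [3.0, 4.0]                     # centered input with norm 5
x_c_big = [30.0, 40.0]               # same direction, norm 50
w = [2.0, 0.0]                       # raw encoder row; direction [1, 0] after normalization

# a = 1, b = 0: the score equals the inner product <x_c, w_unit> = 3.
inner_limit = cosine_score(x_c, w, a=1.0, b=0.0, b_enc=0.0)

# a = 0, b = 0: the score is the cosine (0.6) and ignores the input norm.
pure = cosine_score(x_c, w, a=0.0, b=0.0, b_enc=0.0)
pure_big = cosine_score(x_c_big, w, a=0.0, b=0.0, b_enc=0.0)
```

At the stated initialization ($a = 0$, $b = \log\sqrt{d_{\mathrm{model}}}$) every score starts as $\sqrt{d_{\mathrm{model}}}\cos(x_c, w_i) + b_{\mathrm{enc},i}$, so any norm dependence has to be learned.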
The body tracks the two variants used in the headline 500M comparison. A third corner of the family, pinned cosine ($a = 0$, where $a$ is frozen and only $b$ trains), is deferred to App. [B.3](https://arxiv.org/html/2606.15054#A2.SS3), alongside the full design-space ablation in App. [B.2](https://arxiv.org/html/2606.15054#A2.SS2).
Global learned $a$. A single scalar $a \in \mathbb{R}$ shared across features, freed at initialization. Pinning $a = 0$ (pure cosine) discards all norm information from the encoder; since the decoder reconstructs $x$ at full scale, the system must recover magnitude from the sparse code alone, and in practice this collapses (Appendix [B.2](https://arxiv.org/html/2606.15054#A2.SS2)). Freeing $a$ lets the optimizer retain exactly as much magnitude as reconstruction requires.
Per-feature $a_i$. Different features may want different norm dependence; a position-encoding feature plausibly needs more magnitude than a semantic one. The direct replacement $a \to a_i$, $b \to b_i$ gives $s_i(x) = e^{b_i}\|x_c\|^{a_i}\cos(x_c, w_i) + b_{\mathrm{enc},i}$, adding $\leq 0.1\%$ parameters. Fully free $a_i$ values are unstable at deep layers (cascading to extreme values at L27; Appendix [B.2](https://arxiv.org/html/2606.15054#A2.SS2)). We instead parameterize $a_i = a_{\mathrm{base}} + \delta_i$ as a shared base plus per-feature offset (both initialized to $0$); $b_i$ remains fully free. The shared base anchors the distribution while per-feature deltas allow heterogeneity.
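A sketch of the shared-base parameterization, showing how two features can respond differently to the same change in input norm. The two-feature dictionary, the values of `a_base` and `delta`, and the function name are toy assumptions rather than learned quantities.

```python
import math

def per_feature_scores(x_c, W, a_base, delta, b, b_enc):
    """Scores with per-feature exponent a_i = a_base + delta_i and scale offset b_i."""
    x_norm = math.sqrt(sum(v * v for v in x_c))
    scores = []
    for w, d_i, b_i, be_i in zip(W, delta, b, b_enc):
        w_norm = math.sqrt(sum(v * v for v in w))
        cos = sum(p * q for p, q in zip(x_c, w)) / (x_norm * w_norm)
        a_i = a_base + d_i
        scores.append(math.exp(a_i * math.log(x_norm) + b_i) * cos + be_i)
    return scores

W = [[1.0, 0.0], [0.0, 1.0]]     # two toy encoder rows
a_base = 0.1                     # shared base exponent (toy value)
delta = [0.0, 0.4]               # per-feature offsets, so a = [0.1, 0.5]
b = [0.0, 0.0]
b_enc = [0.0, 0.0]

small = per_feature_scores([3.0, 4.0], W, a_base, delta, b, b_enc)
big = per_feature_scores([30.0, 40.0], W, a_base, delta, b, b_enc)

# Growth of each feature's score when the input norm grows 10x.
growth = [hi / lo for hi, lo in zip(big, small)]
```

Because the biases are zero here, each feature's score grows by $10^{a_i}$ when the norm grows tenfold: about 1.26 for the first feature and about 3.16 for the second.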
Training recipe. All experiments use the community SAEBench recipe (Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)): BatchTopK with $k = 80$, Adam at learning rate $5 \cdot 10^{-5}$, the AuxK auxiliary loss of Gao et al. ([2024](https://arxiv.org/html/2606.15054#bib.bib12)) to revive dead features, unit-norm decoder rows, and geometric-median initialization for $b_{\mathrm{dec}}$. The encoder is initialized as the transpose of the decoder. The recipe is held fixed across the standard baseline and all cosine variants; only the encoder score changes. Appendix [J](https://arxiv.org/html/2606.15054#A10) gives the full hyperparameters and recipe-lineage detail.
Figure 4: Aggregate sparse-probing accuracy. Top-1, top-2, and top-5 probe accuracy across all eight SAEBench datasets. FVE-matched at $\approx 0.77$. Black: standard SAE; violet: per-feature cosine encoder. The gap narrows at higher $k$ but remains large ($+9.4$% at top-5). Per-dataset breakdown: Fig. [2](https://arxiv.org/html/2606.15054#S1.F2) Panel A.
## 4 Experiments
Unless stated otherwise, results use Qwen3-8B at layer 18, 500M FineWeb tokens, $d_{\mathrm{sae}} = 65{,}536$, and the recipe in §[3](https://arxiv.org/html/2606.15054#S3). Mechanism sweeps use the same model at smaller budgets (2–50M tokens). Appendix [E](https://arxiv.org/html/2606.15054#A5) gives cross-layer and cross-model checks on Qwen L9/L27, Gemma-2-2B, and three LayerNorm models; Appendix [B.2](https://arxiv.org/html/2606.15054#A2.SS2) gives the architecture ablation.
### 4.1 Reconstruction and sparse probing
At the headline scale, reconstruction is tied between the standard BatchTopK SAE and both cosine variants. Aggregate FVE differs by $\leq 0.4$%, and the SAEBench substitution metrics (KL-score and CE-score, both bounded in $[0,1]$ with $1$ meaning the SAE preserves the model’s logits exactly) agree to within $0.001$ ($0.984$–$0.985$ KL-score, $0.991$–$0.993$ CE-score; Appendix [D](https://arxiv.org/html/2606.15054#A4)). The difference is in feature use.
On the eight SAEBench single-feature top-$k$ probing datasets, the global cosine regime improves top-1 by $+13.3$% and the per-feature regime by $+14.9$% (Table [1](https://arxiv.org/html/2606.15054#S4.T1)). This gap is approximately $24\times$ the 5-seed probe-training noise of $\pm 0.63$%. A 3-seed SAE-training run at 50M tokens bounds SAE-training variance at SD $< 0.001$ FVE (Appendix [K](https://arxiv.org/html/2606.15054#A11)). Single-feature probing directly measures whether individual dictionary elements correspond to human-recognizable concepts (Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)), the core desideratum of interpretability tools. The gap is not a capacity artifact: giving the standard encoder $3\times$ more dictionary slots at matched L0 makes dead rates *worse* ($89.4$% vs. $77.4$%), not better (Figure [24](https://arxiv.org/html/2606.15054#A7.F24)). Per-feature interpretability remains matched: an LLM-judged evaluation on $1000$ stratified features scores $80.1$% (per-feature cosine) vs. $82.1$% (standard) on the $\geq 4$ auto-interp criterion ($p = 0.88$; Appendix [F.1](https://arxiv.org/html/2606.15054#A6.SS1)).
The gain is uneven across tasks (Fig. [4](https://arxiv.org/html/2606.15054#S3.F4)). Language and code detection account for the largest gaps, but the per-feature advantage remains $+9.1$% after removing both. Sentiment is the one reversal: standard is higher by $+3.5$%, consistent with magnitude carrying task-relevant signal there. We return to this case in §[5](https://arxiv.org/html/2606.15054#S5).
The learned norm dependence remains far from the inner-product limit (Fig. [3](https://arxiv.org/html/2606.15054#S1.F3), Table [1](https://arxiv.org/html/2606.15054#S4.T1)). The global $a$ converges to $\approx 0.26$; the per-feature $a_i$ have mean $\approx 0.08$, with no $a_i > 0.5$. Both regimes initialize at $a = 0$, so these values are learned rather than imposed. Appendix [K](https://arxiv.org/html/2606.15054#A11) reports initialization sweeps and a postnorm-MSE alternative where $a$ becomes negative.
Table 1: Matched reconstruction. Qwen3-8B L18, 500M tokens. Only the score changes. Single SAE-training seed at 500M (variance bounded at 50M, Appendix [K](https://arxiv.org/html/2606.15054#A11)); Q4 FVE diagnoses the standard baseline’s per-quartile failure (§[4.2](https://arxiv.org/html/2606.15054#S4.SS2)). Provenance: Appendix [D.2](https://arxiv.org/html/2606.15054#A4.SS2).
### 4.2 High-norms break inner-product reconstruction
Aggregate FVE matches, but the standard SAE fails on the high-norm tail predicted by §[2](https://arxiv.org/html/2606.15054#S2). We bin tokens by $\|x_c\|$ into quartiles and write *Q4* for the highest quartile (the $25\%$ of tokens with the largest $\|x_c\|$). Standard and cosine match on Q1–Q3, but the standard encoder over-activates on Q4: for the standard SAE we trained under the community recipe of §[3](https://arxiv.org/html/2606.15054#S3) (the same SAE used in Table [1](https://arxiv.org/html/2606.15054#S4.T1)), Q4 reconstructions have $9.5\times$ the input norm, and Q4 FVE is $-184$. (Footnote 4: The large negative per-quartile number does not contradict aggregate FVE parity: per-quartile FVE normalizes by *within*-quartile variance ($1 - \mathrm{MSE}_q/\mathrm{Var}_q(x)$), and within-Q4 variance is small relative to the global variance that aggregate FVE normalizes against, so Q4’s large absolute MSE contributes only modestly to the aggregate numerator.) Both cosine regimes keep Q4 FVE positive (Fig. [6](https://arxiv.org/html/2606.15054#S4.F6)). An independently trained reference SAE shows the same pathology, with Q4 FVE $= -136$ (Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)).
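The quartile diagnostic can be sketched as follows: sort tokens by input norm, split them into four equal bins, and compare reconstruction norm to input norm within each bin. The data below is synthetic and built so that only the top quartile is over-scaled; the inputs are assumed to be already centered, and the function names are hypothetical.

```python
import math

def l2(v):
    return math.sqrt(sum(a * a for a in v))

def quartile_norm_ratio(xs, xhats):
    """Mean ||x_hat|| / ||x|| per input-norm quartile, ordered Q1 (lowest) to Q4."""
    order = sorted(range(len(xs)), key=lambda i: l2(xs[i]))
    n = len(order)  # assumes at least four tokens
    ratios = []
    for q in range(4):
        idx = order[q * n // 4:(q + 1) * n // 4]
        ratios.append(sum(l2(xhats[i]) / l2(xs[i]) for i in idx) / len(idx))
    return ratios

# Eight synthetic tokens with norms 1..8 along one axis.
xs = [[float(i), 0.0] for i in range(1, 9)]

# Reconstructions are exact except for the two highest-norm tokens,
# which are inflated to 9.5x their input norm.
xhats = [x if l2(x) <= 6.0 else [9.5 * v for v in x] for x in xs]

ratios = quartile_norm_ratio(xs, xhats)
```

On this synthetic batch the ratio is 1 in Q1 through Q3 and 9.5 in Q4, mirroring the shape of the reported failure while saying nothing about its cause.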
This is the failure mode predicted in §[2](https://arxiv.org/html/2606.15054#S2): the inner-product score writes $\|x\|$ into the sparse code, and the decoder turns that scale information into inflated reconstructions, even though downstream RMSNorm has already discarded it.
### 4.3 Comparing the learned dictionaries
The standard SAE does not merely encode content worse; it fails to learn content features at all, spending most of its capacity on features that detect token magnitude instead. The $+14.9$% sparse-probing gap from §[4.1](https://arxiv.org/html/2606.15054#S4.SS1) persists at matched aggregate FVE, so the two architectures must learn qualitatively different dictionaries. This section identifies which features each one learns.
Dead features do not explain the gap. With the auxiliary loss enabled, both architectures keep $\geq 99\%$ of features alive and match on reconstruction and on LLM-judged per-feature interpretability ($80.1\%$ vs. $82.1\%$, $p = 0.88$; Appendix [F](https://arxiv.org/html/2606.15054#A6)). The dead-feature margin closes, but the $+14.9$% sparse-probing gap remains.
We pair features across the standard and cosine dictionaries when their per-token activation traces have Pearson correlation $\geq 0.7$ *and* their decoder rows have cosine similarity $> 0.7$: i.e., they activate on the same tokens *and* point in the same direction in residual space. This matches $8{,}661$ features. \[[exp58b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58b_matched_feature_probing/README.md)\] The remaining $55{,}789$ standard-only features behave as *norm-conditioned features*: they fire on $\|x\|$ rather than on a content direction. \[[exp58c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58c_norm_detector_characterization/README.md)\]
- *Selective.* $86\%$ of their nonzero activations land in Q4 (vs. the $25\%$ a content-blind detector would receive), so they fire on high-norm tokens.
- *Strong.* When they do fire on a Q4 token, their mean activation is $48\times$ the per-token mean across the full token population.
- *Crowding.* BatchTopK on a high-norm token therefore spends most of its sparse-code budget on these features, leaving little capacity for direction-encoding ones (Fig. [6](https://arxiv.org/html/2606.15054#S4.F6)). On tokens where cosine-unique features fire, the standard SAE fires $328$ features/token vs. cosine’s $122$ (Appendix [G](https://arxiv.org/html/2606.15054#A7)).
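The pairing criterion stated before this list (Pearson correlation of activation traces $\geq 0.7$ and decoder cosine similarity $> 0.7$) can be sketched for a single candidate pair. How candidates are enumerated and how ties between several possible partners are resolved is not described in this section, so the sketch only tests one pair; all traces and decoder rows are toy values and the function names are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length traces (undefined for constant traces)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def is_matched(act_std, act_cos, dec_std, dec_cos, thresh=0.7):
    """Both conditions must hold: same firing pattern and same decoder direction."""
    return pearson(act_std, act_cos) >= thresh and cosine(dec_std, dec_cos) > thresh

act_std = [0.0, 1.0, 0.0, 2.0, 0.0, 3.0]   # standard feature, per-token activations
act_cos = [0.0, 0.9, 0.0, 2.1, 0.1, 2.8]   # cosine feature firing on the same tokens

same_direction = is_matched(act_std, act_cos, [1.0, 0.0], [0.9, 0.1])
orthogonal_decoders = is_matched(act_std, act_cos, [1.0, 0.0], [0.0, 1.0])
```

The second pair fails even though the activation traces agree, because the decoder rows point in different directions; requiring both conditions is what makes a match mean "same tokens and same residual-space direction".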
Figure 5: Discovery dominates separability. Sparse-probing accuracy when each SAE uses only features shared with the other dictionary (“shared features”) versus its full dictionary (“all features”). Standard’s flat slope shows its unique features add no probe signal; cosine’s steep rise shows its unique features encode interpretable concepts. The gap on the right is the total probing advantage, driven almost entirely by feature discovery rather than cleaner encoding of shared directions.

Figure 6: Norm-conditioned feature allocation. Standard’s unmatched features fire overwhelmingly on Q4; cosine’s unmatched features spread across all quartiles. The standard encoder spends most of its capacity on features whose firing is locked to token norm. Per-quartile reconstruction: Appendix [G.1](https://arxiv.org/html/2606.15054#A7.SS1).

Of the standard SAE’s $64{,}450$ alive features, only $13.4\%$ direction-encode under this criterion; the rest primarily encode norm. \[[exp58c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58c_norm_detector_characterization/README.md)\] The between-quartile component drives most of the overall cos $>$ inner rate ($80$–$87\%$). Within individual quartiles the advantage is a more modest $40$–$70\%$ (Figure [14](https://arxiv.org/html/2606.15054#A3.F14)), consistent with norm variation corrupting TopK selection across the full distribution rather than within each quartile independently. \[[exp19](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/19_norm_stratified/README.md)\] We call these features “norm-conditioned” descriptively: they may encode $\|x\|$-correlated information our probing tasks do not measure (auto-interp parity is consistent with this; Appendix [F](https://arxiv.org/html/2606.15054#A6)), but they do not encode the eight concept categories sparse probing targets.
Restricting both dictionaries to the $8{,}661$ matched features drops the top\-1 gap from $+14.9\%$ to $+2.0\%$ \(Fig\. [5](https://arxiv.org/html/2606.15054#S4.F5)\)\. Reading $+2.0\%$ as the separability contribution and the residual $+12.9\%$ as the discovery contribution: $\sim 87\%$ of the gap comes from features the cosine encoder discovers and standard does not, $\sim 13\%$ from cleaner encoding of shared directions\. The unmatched standard features add almost nothing to top\-5\. The unmatched cosine features add $+7.8\%$, confirming that the discovered features encode content, not just norm\-correlated firing\. \[[exp58b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58b_matched_feature_probing/README.md)\] Direct behavioral steering produces matched KL divergence for both architectures \(ratio $0.94$–$1.00\times$\); the quality gap is which features exist, not their individual intervention power \(Appendix [G](https://arxiv.org/html/2606.15054#A7)\)\. \[[exp55d](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/55d_feature_steering/README.md)\]
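The discovery/separability split quoted above is simple arithmetic on the two reported gaps. A minimal sketch follows; the function name `decompose_gap` is hypothetical, and the only inputs are the full-dictionary and matched-feature top-1 gaps from the text.

```python
def decompose_gap(full_gap, matched_gap):
    """Split a probing gap into a separability part (what survives when
    both dictionaries are restricted to matched features) and a
    discovery part (the residual)."""
    discovery = full_gap - matched_gap
    return {
        "separability_pts": matched_gap,
        "discovery_pts": discovery,
        "separability_share": matched_gap / full_gap,
        "discovery_share": discovery / full_gap,
    }

parts = decompose_gap(full_gap=14.9, matched_gap=2.0)
print(round(100 * parts["discovery_share"]),
      round(100 * parts["separability_share"]))  # 87 13
```

The shares are fractions of the total gap, so they sum to one by construction; the decomposition attributes everything not explained by matched features to discovery.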
Ruling out gradient equalization\. A natural alternative is that cosine helps by equalizing gradient magnitudes across norm quartiles \(inner\-product gradients scale with $\|x_c\|$\)\. The asymmetry is real: the median Q4/Q1 gradient ratio is $1.6$–$2.0\times$ for standard vs\. $0.8$–$1.0\times$ for cosine\. But reweighting the standard loss to equalize gradients by construction closes only $1.9$–$6.8\%$ of the probing gap \(Appendix [G\.3](https://arxiv.org/html/2606.15054#A7.SS3)\); the forward\-pass TopK selection geometry, not backward\-pass gradient flow, is the lever\.
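To see why inner-product gradients scale with the input norm while cosine gradients need not, the score-level gradients with respect to one encoder row can be written out. This is a minimal derivation sketch from the score definitions alone, assuming the pure-cosine case ($a=0$, scale and bias dropped) and ignoring the rest of the backward pass (BatchTopK mask, decoder, loss):

$$
\begin{aligned}
s_i = \langle w_i, x_c\rangle
  &\;\Rightarrow\; \frac{\partial s_i}{\partial w_i} = x_c,
  \qquad \Big\|\frac{\partial s_i}{\partial w_i}\Big\| = \|x_c\|,\\
c_i = \frac{\langle w_i, x_c\rangle}{\|w_i\|\,\|x_c\|}
  &\;\Rightarrow\; \frac{\partial c_i}{\partial w_i}
    = \frac{1}{\|w_i\|}\Big(\frac{x_c}{\|x_c\|} - c_i\,\frac{w_i}{\|w_i\|}\Big),
  \qquad \Big\|\frac{\partial c_i}{\partial w_i}\Big\| = \frac{\sqrt{1-c_i^{2}}}{\|w_i\|}.
\end{aligned}
$$

The first norm grows linearly with $\|x_c\|$; the second is independent of it. With a learned exponent $a \neq 0$, the prefactor $e^{b}\|x_c\|^{a}$ multiplies the cosine gradient and reintroduces a norm dependence of order $\|x_c\|^{a}$.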
Four qualitatively different lines of evidence converge on the same root cause: behavioral \(sparse probing\), structural \(norm\-detector characterization\), interventional \(gradient falsification\), and architectural \(learned $a$\) all point to the inner\-product encoder’s conflation of direction and magnitude in the BatchTopK competition\.
### 4\.4 Removing the auxiliary loss for dead features
The headline runs in §[4\.1](https://arxiv.org/html/2606.15054#S4.SS1) keep the auxiliary dead\-feature loss \(Gao et al\., [2024](https://arxiv.org/html/2606.15054#bib.bib12)\) enabled in both arms, which equalizes alive\-feature counts \($\geq 99\%$\) and reconstruction \(FVE within $0.4\%$\) so that the surviving $+14.9\%$ sparse\-probing gap cannot be attributed to a dead\-feature artifact\. Disabling the auxiliary loss isolates what the score swap does to the dead\-feature distribution on its own\. All major public SAE recipes we are aware of \(Bricken et al\., [2023](https://arxiv.org/html/2606.15054#bib.bib9); Gao et al\., [2024](https://arxiv.org/html/2606.15054#bib.bib12); Karvonen et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib25)\) include this loss or an equivalent, so this is a research\-only stress test rather than a deployment recipe\.
Without the auxiliary loss the cosine advantage survives at every layer only for the per\-feature variant\. On Qwen3\-8B, 50M tokens, $d_{\mathrm{sae}}{=}16{,}384$ \(Fig\. [7](https://arxiv.org/html/2606.15054#S4.F7)\), per\-feature cosine reduces dead features at every layer \(L9 $61.0\to 41.1\%$, L18 $82.8\to 72.4\%$, L27 $78.3\to 28.1\%$\) and improves FVE at every layer \($+3.5/+1.6/+7.6\%$\)\. The global\-$a$ variant tracks per\-feature at L9 and L27 but actually *loses* to Standard at L18 \($85.3\%$ dead vs\. $82.8\%$\): a single learned scale is a poor compromise at L18’s mid\-range norm, and the per\-feature flexibility is what closes that gap\. The pattern reproduces on Gemma\-2\-2B at 50M tokens: the standard encoder leaves $54$–$69\%$ of features dead across L7/L13/L19, while cosine stays at $0.0$–$0.1\%$, with FVE gains of $+3.9$ to $+6.7\%$ \(Table [10](https://arxiv.org/html/2606.15054#A5.T10); Fig\. [20](https://arxiv.org/html/2606.15054#A5.F20)\)\.
Figure 7: Without the auxiliary dead\-feature loss, per\-feature cosine wins at every layer\. Qwen3\-8B, 50M tokens, $d_{\mathrm{sae}}{=}16{,}384$, $\sqrt{d}$ init\. Left: dead\-feature %; right: FVE\. With the auxiliary loss on \(community recipe; headline at $d_{\mathrm{sae}}{=}65{,}536$\), both architectures hit $\sim 0\%$ dead and FVE parity \(§[4\.1](https://arxiv.org/html/2606.15054#S4.SS1)\); the gap here is what the auxiliary loss masks\. The global\-$a$ variant is omitted for readability; see the body\.

Full convergence curves are in Appendix [E](https://arxiv.org/html/2606.15054#A5)\.
### 4\.5 Cross\-model behavior
Our results persist on Gemma\-2\-2B \(a different RMSNorm family, $d_{\mathrm{sae}}=9{,}216$\) but do not persist as strongly on non\-RMSNorm models\. On Gemma\-2\-2B, the fixed\-SAE top\-1 gap is $+3.4\pm 0.3\%$, smaller than on Qwen but same\-signed; the gap scales with model and dictionary size, since a $7\times$ smaller dictionary gives the TopK competition less room for norm\-detector degeneracy\. On three LayerNorm models \(Pythia\-2\.8B/6\.9B, Falcon\-7B\), cosine loses its advantage at deep layers \(Appendix [E\.3](https://arxiv.org/html/2606.15054#A5.SS3)\)\. A plausible explanation is that LayerNorm’s mean subtraction preserves magnitude\-correlated structure that RMSNorm erases: the residual stream’s per\-coordinate mean carries norm information into the sublayer, so inner\-product scoring is partially correct at depth in LayerNorm models\. We lack a direct intervention test and flag this as an open question\.
The cosine encoder trained on 50M tokens exceeds the standard’s 500M sparse\-probing top\-1 at L9 \($+17.3\%$\) and L27 \($+8.7\%$\), a $10\times$ token\-budget saving \(Appendix [E](https://arxiv.org/html/2606.15054#A5)\)\. Norm\-conditioned feature allocation and discovery dominance both reproduce across these settings, with magnitude scaling with dictionary size \(Appendix [E](https://arxiv.org/html/2606.15054#A5)\)\.
## 5 Limitations and Future Works
*Open mechanism gaps and scope bounds\.* The mechanism is not fully isolated: §[6](https://arxiv.org/html/2606.15054#S4.F6) rules out gradient flow as the load\-bearing lever, but a reverse score\-swap on a trained cosine checkpoint and an initialization sweep over $a$ remain future work\. The decoder is also not norm\-invariant: under norm\-noise training \($\mathrm{Uniform}(0.5, 2.0)$\), cosine FVE drops $4.5$–$5.8\%$ vs\. $1.8$–$3.4\%$ for standard, so a deployment norm shift may erode the gain \(Appendix [H](https://arxiv.org/html/2606.15054#A8)\)\. The gain is also bounded in scope: cosine loses to inner product at deep LayerNorm layers \(Pythia\-2\.8B drops from $100\%$ at L8 to $40\%$ at L24; Appendix [E\.3](https://arxiv.org/html/2606.15054#A5.SS3)\), sentiment is the one task reversal \(standard $+3.5\%$ top\-1\), and the 500M headline is single\-seed and dictionary\-level rather than feature\-paired \(Leask et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib29)\)\. Untested selectors \(TopK, JumpReLU, gated\), scale \($>8$B\), instruction\-tuned/RLHF checkpoints, circuit\-level interventions, and competing geometries \(QK\-norm, nGPT\) are deferred to Appendix [H](https://arxiv.org/html/2606.15054#A8)\.
*Future works\.* Several recipe choices deserve a closer look\. The first is decoder unit\-norm: dropping it costs roughly $22$ FVE points in the magnitude\-bypass variant \(Table [2](https://arxiv.org/html/2606.15054#A2.T2)\), but the constraint is not mirrored in the model’s normalized readout, and exploring softer regularizers \(norm penalties, scheduled annealing, or absorbing $\|w_i\|$ into a per\-feature bias\) would tell us whether unit\-norm is truly essential or just convenient\. A second direction is to fold the per\-coordinate RMSNorm gain $g\in\mathbb{R}^{d}$ into the score itself: the current cosine score $\cos(x_c,w_i)$ ignores $g$, but downstream sublayers read $g\odot x_c$ \(§[2](https://arxiv.org/html/2606.15054#S2)\), so a variant like $\cos(g\odot x_c,w_i)$ would align the encoder strictly with what the model actually consumes\. Finally, the auxiliary dead\-feature loss may be unnecessary under cosine: even without it, cosine retains $+6.3\%$ FVE and $2.39\times$ more alive features than standard at L27 \(Appendix [E](https://arxiv.org/html/2606.15054#A5), Table [9](https://arxiv.org/html/2606.15054#A5.T9)\), so a cleaner, auxiliary\-free recipe looks within reach\.
## 6 Discussion and Conclusion
For SAEs trained on normalized residual streams, raw inner\-product scoring is the wrong default; the mismatch is starkest under RMSNorm, which erases global scale entirely\. Inner\-product scoring mixes content alignment with token norm, while the downstream transformer path removes most of that global scale before reading the residual stream\. Replacing the score with a learned norm\-scaled cosine keeps reconstruction matched, but changes which features the dictionary spends capacity on\.
At matched FVE, KL/CE substitution scores, and per\-feature LLM\-judged interpretability, our cosine SAE recovers substantially better single\-feature coverage of probing targets \(§[4\.3](https://arxiv.org/html/2606.15054#S4.SS3), §[4\.1](https://arxiv.org/html/2606.15054#S4.SS1)\)\. The global\-$a$ version is the practical recipe: it adds only two scalars \($a$ and $b$\), worked at every scale we tested, and recovers most of the per\-feature gain\. The per\-feature extension gives the strongest numbers, but is less minimal; the full recipe is in Appendix [J](https://arxiv.org/html/2606.15054#A10)\.
The evidence points to the forward\-pass TopK selection geometry as the primary mechanism\. Equalizing gradient magnitudes closes only $1.9$–$6.8\%$ of the gap \(§[6](https://arxiv.org/html/2606.15054#S4.F6)\): the inner\-product score still inflates all pre\-activations on high\-norm tokens, BatchTopK still selects the same norm\-conditioned features, and the same firing pattern reappears\. This mechanism matters because pre\-sublayer normalization has already discarded or attenuated token\-level magnitude from the downstream computation; the norm\-conditioned features that dominate the standard dictionary encode information the model does not fully read\. Whether the gain is best understood as TopK selection geometry alone or as a combination of selection geometry and RMS\-geometric alignment remains open; both point to the same practical recommendation: practitioners training SAEs on normalized models should use cosine scoring as the default encoder, with the strongest gains expected under RMSNorm\.
Concretely: unit\-normalize encoder rows on each forward pass, replace the score with $e^{b}\|x_c\|^{a}\cos(x_c,w_i)+b_{\mathrm{enc},i}$, and initialize $a{=}0$, $b{=}\log\sqrt{d_{\mathrm{model}}}$\. The global variant adds two scalar parameters; the per\-feature variant parameterizes $a_i=a_{\mathrm{base}}+\delta_i$ with per\-feature $b_i$, adding $1+2\,d_{\mathrm{sae}}$ parameters \($<0.1\%$ overhead\)\.
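The parameter overhead quoted above can be checked with a short calculation. The sketch below is illustrative only: it assumes the base SAE consists of an encoder and a decoder matrix of shape $d_{\mathrm{sae}}\times d_{\mathrm{model}}$ plus the two bias vectors, and it uses `d_model=4096` as an assumed residual width (this value is not stated in the text); `d_sae=65536` is the headline dictionary size.

```python
def added_param_overhead(d_model, d_sae):
    """Fraction of extra parameters relative to a base SAE with
    W_enc, W_dec (d_sae x d_model each), b_enc (d_sae), b_dec (d_model)."""
    base = 2 * d_sae * d_model + d_sae + d_model
    added_global = 2                  # scalars a and b
    added_per_feature = 1 + 2 * d_sae  # a_base, delta_i, b_i
    return added_global / base, added_per_feature / base

# d_model=4096 is an assumption for illustration; d_sae is the headline size.
global_frac, per_feature_frac = added_param_overhead(d_model=4096, d_sae=65536)
print(f"global: {global_frac:.2e}, per-feature: {per_feature_frac:.4%}")
```

Under these assumptions the per-feature overhead is close to $1/d_{\mathrm{model}}$, so it stays below $0.1\%$ whenever the residual width exceeds roughly one thousand.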
## Impact Statement
This paper advances mechanistic interpretability and dictionary learning\. We do not identify additional societal consequences beyond those generally associated with interpretability research\.
## References
- A\. Arora, Z\. Wu, J\. Steinhardt, and S\. Schwettmann \(2026\) Language model circuits are sparse in the neuron basis\. External Links: 2601\.22594\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- S\. Basu et al\. \(2026\) Interpretability without actionability\. External Links: 2603\.18353\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- A\. Borobia et al\. \(2026\) How pruning reshapes features\. External Links: 2603\.25325\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- T\. Bricken, A\. Templeton, J\. Batson, B\. Chen, A\. Jermyn, T\. Conerly, N\. L\. Turner, C\. Anil, C\. Denison, A\. Askell, R\. Lasenby, Y\. Wu, S\. Kravec, N\. Schiefer, T\. Maxwell, N\. Joseph, Z\. Hatfield\-Dodds, A\. Tamkin, K\. Nguyen, B\. McLean, J\. E\. Burke, T\. Hume, S\. Carter, T\. Henighan, and C\. Olah \(2023\) Towards monosemanticity: decomposing language models with dictionary learning\. Note: Anthropic technical report; [https://transformer\-circuits\.pub/2023/monosemantic\-features](https://transformer-circuits.pub/2023/monosemantic-features)\. Cited by: [§1](https://arxiv.org/html/2606.15054#S1.p1.6), [§2](https://arxiv.org/html/2606.15054#S2.p1.7), [§4\.4](https://arxiv.org/html/2606.15054#S4.SS4.p1.3)\.
- B\. Bussmann, P\. Leask, and N\. Nanda \(2024\) BatchTopK sparse autoencoders\. External Links: 2412\.06410\. Cited by: [Appendix J](https://arxiv.org/html/2606.15054#A10.p1.1), [§1](https://arxiv.org/html/2606.15054#S1.p1.6), [§1](https://arxiv.org/html/2606.15054#S1.p2.1), [§2](https://arxiv.org/html/2606.15054#S2.p1.7), [§2](https://arxiv.org/html/2606.15054#S2.p4.5), [§3](https://arxiv.org/html/2606.15054#S3.p3.18)\.
- B\. Bussmann, N\. Nabeshima, A\. Karvonen, and N\. Nanda \(2025\) Learning multi\-level features with matryoshka sparse autoencoders\. External Links: 2503\.17547\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- D\. Chanin and A\. Garriga\-Alonso \(2026\) SynthSAEBench: evaluating sparse autoencoders on scalable realistic synthetic data\. External Links: 2602\.14687\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p4.1)\.
- D\. Chanin, J\. Wilken\-Smith, T\. Dulka, H\. Bhatnagar, and J\. Bloom \(2025\) A is for absorption: studying feature splitting and absorption in sparse autoencoders\. In Advances in Neural Information Processing Systems \(NeurIPS\), Note: Oral presentation\. External Links: 2409\.14507\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- H\. Cunningham, A\. Ewart, L\. Riggs, R\. Huben, and L\. Sharkey \(2023\) Sparse autoencoders find highly interpretable features in language models\. External Links: 2309\.08600\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1), [§1](https://arxiv.org/html/2606.15054#S1.p1.6)\.
- M\. Dehghani, J\. Djolonga, B\. Mustafa, P\. Padlewski, J\. Heek, J\. Gilmer, A\. Steiner, M\. Caron, R\. Geirhos, I\. Alabdulmohsin, R\. Jenatton, L\. Beyer, M\. Tschannen, A\. Arnab, X\. Wang, C\. Riquelme, M\. Minderer, J\. Puigcerver, U\. Evci, M\. Kumar, S\. van Steenkiste, G\. F\. Elsayed, A\. Mahendran, F\. Yu, A\. Oliver, F\. Huot, J\. Bastings, M\. P\. Collier, A\. Gritsenko, V\. Birodkar, C\. Vasconcelos, Y\. Tay, T\. Mensink, A\. Kolesnikov, F\. Pavetic, D\. Tran, T\. Kipf, M\. Lucic, X\. Zhai, D\. Keysers, J\. Harmsen, and N\. Houlsby \(2023\) Scaling vision transformers to 22 billion parameters\. In International Conference on Machine Learning \(ICML\), External Links: 2302\.05442\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p5.1)\.
- T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer \(2022\) LLM\.int8\(\): 8\-bit matrix multiplication for transformers at scale\. In Advances in Neural Information Processing Systems \(NeurIPS\), External Links: 2208\.07339\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p2.5)\.
- N\. Elhage, T\. Hume, C\. Olsson, N\. Schiefer, T\. Henighan, S\. Kravec, Z\. Hatfield\-Dodds, R\. Lasenby, D\. Drain, C\. Chen, R\. Grosse, S\. McCandlish, J\. Kaplan, D\. Amodei, M\. Wattenberg, and C\. Olah \(2022\) Toy models of superposition\. Note: Anthropic technical report\. External Links: 2209\.10652\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- J\. Engels, I\. Liao, E\. J\. Michaud, W\. Gurnee, and M\. Tegmark \(2025\) Not all language model features are one\-dimensionally linear\. In International Conference on Learning Representations \(ICLR\), External Links: 2405\.14860\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- D\. Friedman, A\. Lampinen, L\. Dixon, D\. Chen, and A\. Ghandeharioun \(2024\) Interpretability illusions in the generalization of simplified models\. In International Conference on Machine Learning \(ICML\), External Links: 2312\.03656\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- L\. Gao, T\. D\. la Tour, H\. Tillman, G\. Goh, R\. Troll, A\. Radford, I\. Sutskever, J\. Leike, and J\. Wu \(2024\) Scaling and evaluating sparse autoencoders\. External Links: 2406\.04093\. Cited by: [Appendix J](https://arxiv.org/html/2606.15054#A10.p1.1), [§B\.1](https://arxiv.org/html/2606.15054#A2.SS1.SSS0.Px6.p1.7), [§1](https://arxiv.org/html/2606.15054#S1.p1.6), [§2](https://arxiv.org/html/2606.15054#S2.p1.7), [§3](https://arxiv.org/html/2606.15054#S3.p7.3), [§4\.4](https://arxiv.org/html/2606.15054#S4.SS4.p1.3)\.
- J\. Grindrod \(2026\) Sparse auto\-encoders and holism about large language models\. External Links: 2603\.26207\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- A\. Gulko, Y\. Peng, and S\. Kumar \(2025\) CE\-Bench: towards a reliable contrastive evaluation benchmark of interpretability of sparse autoencoders\. Note: BlackboxNLP Workshop @ EMNLP\. External Links: 2509\.00691\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p4.1)\.
- A\. Henry, P\. R\. Dachapally, S\. Pawar, and Y\. Chen \(2020\) Query\-key normalization for transformers\. In Findings of the Association for Computational Linguistics: EMNLP, External Links: 2010\.04245\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p5.1)\.
- J\. Jedryszek and O\. Crook \(2026\) Stable and steerable sparse autoencoders with weight regularization\. External Links: 2603\.04198\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- A\. Karvonen, C\. Rager, J\. Lin, C\. Tigges, J\. Bloom, D\. Chanin, Y\. Lau, E\. Farrell, A\. Conmy, C\. McDougall, F\. Lo Piano, A\. Templeton, S\. Marks, B\. Wright, T\. Bricken, T\. Conerly, L\. Smith, and N\. Nanda \(2025\) SAEBench: a comprehensive benchmark for sparse autoencoders in language model interpretability\. In International Conference on Machine Learning \(ICML\), External Links: 2503\.09532\. Cited by: [Appendix J](https://arxiv.org/html/2606.15054#A10.p1.1), [§E\.2](https://arxiv.org/html/2606.15054#A5.SS2.p1.1), [§G\.1](https://arxiv.org/html/2606.15054#A7.SS1.p1.3), [§1](https://arxiv.org/html/2606.15054#S1.p1.6), [§2](https://arxiv.org/html/2606.15054#S2.p1.7), [§2](https://arxiv.org/html/2606.15054#S2.p4.5), [§2](https://arxiv.org/html/2606.15054#S2.p6.3), [§3](https://arxiv.org/html/2606.15054#S3.p7.3), [§4\.1](https://arxiv.org/html/2606.15054#S4.SS1.p2.14), [§4\.2](https://arxiv.org/html/2606.15054#S4.SS2.p1.6), [§4\.4](https://arxiv.org/html/2606.15054#S4.SS4.p1.3)\.
- L\. Kempf et al\. \(2026\) Simple LLM baselines are competitive for model diffing\. External Links: 2602\.10371\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- P\. Koromilas et al\. \(2026\) PolySAE: modeling feature interactions in sparse autoencoders via polynomial decoding\. External Links: 2602\.01322\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- V\. Korznikov, N\. Belrose, and L\. Sharkey \(2025\) OrtSAE: orthogonal sparse autoencoders uncover atomic features\. External Links: 2509\.22033\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- P\. Leask, B\. Bussmann, et al\. \(2025\) Sparse autoencoders do not find canonical units of analysis\. In International Conference on Learning Representations \(ICLR\), External Links: 2502\.04878\. Cited by: [Appendix H](https://arxiv.org/html/2606.15054#A8.p7.7), [Appendix I](https://arxiv.org/html/2606.15054#A9.p3.3), [§5](https://arxiv.org/html/2606.15054#S5.p1.10)\.
- Z\. Liu \(2025\) Provably extracting the features from a general superposition\. External Links: 2512\.15987\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- I\. Loshchilov, C\. Hsieh, S\. Sun, and B\. Ginsburg \(2024\) nGPT: normalized transformer with representation learning on the hypersphere\. External Links: 2410\.01131\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p5.1)\.
- H\. Luo et al\. \(2026\) From atoms to trees: building a structured feature forest with hierarchical sparse autoencoders\. External Links: 2602\.11881\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- L\. Ma et al\. \(2026\) Falsifying sparse autoencoder reasoning features in language models\. External Links: 2601\.05679\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- A\. Miller, Y\. Draye, and B\. Schölkopf \(2026\) Identifying intervenable and interpretable features via orthogonality regularization\. External Links: 2602\.04718\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- J\. Minder, F\. Tao, J\. B\. Costa, T\. Hannemann, A\. Geiger, and S\. Ravfogel \(2025\) Overcoming sparsity artifacts in crosscoders to interpret chat\-tuning\. External Links: 2504\.02922\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p3.1)\.
- S\. Narayanaswamy et al\. \(2026\) Improving robustness in sparse autoencoders via masked regularization\. External Links: 2604\.06495\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- A\. Nasiri\-Sarvi et al\. \(2026\) MonoLoss: a training objective for interpretable monosemantic representations\. External Links: 2602\.12403\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p3.1), [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- K\. Park, Y\. J\. Choe, Y\. Jiang, and V\. Veitch \(2025\) The geometry of categorical and hierarchical concepts in large language models\. In International Conference on Learning Representations \(ICLR\), Note: Oral presentation\. External Links: 2406\.01506\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- K\. Park, Y\. J\. Choe, and V\. Veitch \(2024\) The linear representation hypothesis and the geometry of large language models\. In International Conference on Machine Learning \(ICML\), External Links: 2311\.03658\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- L\. Prieto et al\. \(2026\) From data statistics to feature geometry: how correlations shape superposition\. External Links: 2603\.09972\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- Qwen Team \(2026\) Qwen\-Scope: sparse autoencoders for Qwen3\.5 models\. Note: Hugging Face collection; [https://huggingface\.co/collections/Qwen/qwen\-scope](https://huggingface.co/collections/Qwen/qwen-scope)\. Accessed 2026\-05\-03\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p7.1)\.
- S\. Rajamanoharan, A\. Conmy, L\. Smith, T\. Lieberum, V\. Varma, J\. Kramár, R\. Shah, and N\. Nanda \(2024a\) Improving dictionary learning with gated sparse autoencoders\. External Links: 2404\.16014\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- S\. Rajamanoharan, T\. Lieberum, N\. Sonnerat, A\. Conmy, V\. Varma, J\. Kramár, and N\. Nanda \(2024b\) Jumping ahead: improving reconstruction fidelity with JumpReLU sparse autoencoders\. External Links: 2407\.14435\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- S\. S\. Saraswatula and D\. A\. Klindt \(2025\) Data whitening improves sparse autoencoder learning\. External Links: 2511\.13981\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p2.1)\.
- A\. Saurez, Y\. Lee, and D\. Har \(2026\) Why linear interpretability works: invariant subspaces as a result of architectural constraints\. External Links: 2602\.09783\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p2.1)\.
- A\. Shafran et al\. \(2026\) From directions to regions: decomposing activations in language models via local geometry\. External Links: 2602\.02464\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- C\. Tian, K\. Tian, and N\. Hu \(2025\) Measuring sparse autoencoder feature sensitivity\. External Links: 2509\.23717\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p5.1)\.
- C\. Tigges, O\. J\. Hollinsworth, A\. Geiger, and N\. Nanda \(2023\) Linear representations of sentiment in large language models\. External Links: 2310\.15154\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p2.1)\.
- W\. Timkey and M\. van Schijndel \(2021\) All bark and no bite: rogue dimensions in transformer language models obscure representational quality\. In Proceedings of the Conference on Empirical Methods in Natural Language Processing \(EMNLP\), External Links: 2109\.04404\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p2.5)\.
- J\. Wang et al\. \(2025\) Dimensional collapse in transformer attention outputs: a challenge for sparse dictionary learning\. External Links: 2508\.16929\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p2.1)\.
- Z\. Wu, A\. Arora, A\. Geiger, Z\. Wang, J\. Huang, D\. Jurafsky, C\. D\. Manning, and C\. Potts \(2025\) AxBench: steering LLMs? even simple baselines outperform sparse autoencoders\. External Links: 2501\.17148\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- W\. Yang et al\. \(2026\) Step\-level sparse autoencoder for reasoning process interpretation\. External Links: 2603\.03031\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p6.1)\.
- B\. Zhang and R\. Sennrich \(2019\) Root mean square layer normalization\. In Advances in Neural Information Processing Systems \(NeurIPS\), External Links: 1910\.07467\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p2.1)\.
- X\. Zhu, M\. M\. Khalili, and Z\. Zhu \(2025\) AbsTopK: rethinking sparse autoencoders for bidirectional features\. External Links: 2510\.00404\. Cited by: [§2](https://arxiv.org/html/2606.15054#S2.p1.7)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks \(2023\) Representation engineering: a top\-down approach to AI transparency\. External Links: 2310\.01405\. Cited by: [Appendix A](https://arxiv.org/html/2606.15054#A1.p2.1)\.
## Appendix A Extended Related Work
This appendix supplements the body Related Work \(§[2](https://arxiv.org/html/2606.15054#S2)\)\.
Residual\-stream geometry\. Whitening preprocessing \(Saraswatula and Klindt, [2025](https://arxiv.org/html/2606.15054#bib.bib17)\) and dimensional collapse in attention outputs \(Wang et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib45)\); representation engineering \(Zou et al\., [2023](https://arxiv.org/html/2606.15054#bib.bib7)\), sentiment\-direction work \(Tigges et al\., [2023](https://arxiv.org/html/2606.15054#bib.bib8)\), and invariant subspaces forced by linear interfaces \(Saurez et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib47)\)\.
Adjacent objective\-level work\. Latent\-scaling crosscoders \(Minder et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib18)\) and MonoLoss \(vision SAEs\) \(Nasiri\-Sarvi et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib21)\)\.
Alternative SAE benchmarks\. CE\-Bench \(Gulko et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib26)\); SynthSAEBench \(Chanin and Garriga\-Alonso, [2026](https://arxiv.org/html/2606.15054#bib.bib27)\)\.
Concurrent failure modes\. Feature absorption \(Chanin et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib30)\) and proposed fixes via masked regularization \(Narayanaswamy et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib39)\), orthogonality \(Miller et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib40)\), hierarchical decoders \(Luo et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib42)\), weight regularization \(Jedryszek and Crook, [2026](https://arxiv.org/html/2606.15054#bib.bib38)\); polynomial decoders \(Koromilas et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib41)\), correlation\-shaped superposition \(Prieto et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib36)\), provable feature extraction \(Liu, [2025](https://arxiv.org/html/2606.15054#bib.bib37)\); sensitivity gaps \(Tian et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib32)\), superficial reasoning features \(Ma et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib31)\), pruning robustness \(Borobia et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib53)\); multi\-dimensional features \(Engels et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib34)\), region\-based decompositions \(Shafran et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib35)\)\.
SAE limitations and baselines\. Prompting outperforms SAE steering \(Wu et al\., [2025](https://arxiv.org/html/2606.15054#bib.bib28)\); raw MLP neurons match SAE sparsity for circuit discovery \(Arora et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib43)\); LLM baselines match SAE\-based model diffing \(Kempf et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib46)\); SAE features fail to translate into actionable corrections \(Basu et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib52)\); interpretability illusions OOD \(Friedman et al\., [2024](https://arxiv.org/html/2606.15054#bib.bib33)\)\. SAEs remain in use for polysemantic neurons \(Cunningham et al\., [2023](https://arxiv.org/html/2606.15054#bib.bib10); Grindrod, [2026](https://arxiv.org/html/2606.15054#bib.bib44)\); reasoning\-process SAEs \(Yang et al\., [2026](https://arxiv.org/html/2606.15054#bib.bib51)\)\.
Position relative to prior SAE variants\. Only the encoder score changes; the sparsity mechanism, decoder, and training objective are held fixed\. The intervention is compatible with concurrent improvements \(Matryoshka, JumpReLU, AbsTopK, masked regularization\)\. Released SAE suites for the Qwen family \(Qwen Team, [2026](https://arxiv.org/html/2606.15054#bib.bib22)\) are available for large Qwen\-family runs\.
## Appendix B Cosine SAE Architecture
This appendix provides the formal variant definitions for reproducibility \(§[B\.1](https://arxiv.org/html/2606.15054#A2.SS1)\), documents the design\-space exploration that led to the final architecture \(§[B\.2](https://arxiv.org/html/2606.15054#A2.SS2)\), and reports what the optimizer learns when given freedom \(§[B\.3](https://arxiv.org/html/2606.15054#A2.SS3)\)\.
### B\.1 Variant Definitions
Notation: $x\in\mathbb{R}^{d}$ residual\-stream activation; $x_c=x-b_{\mathrm{dec}}$; $W_{\mathrm{enc}},W_{\mathrm{dec}}\in\mathbb{R}^{d_{\mathrm{sae}}\times d}$; $\operatorname{normalize}(\cdot)$ is $\ell_2$\-normalization along the last axis\. All variants share BatchTopK \($k=80$\), optimizer, training data, token budget, and the recipe of §[J](https://arxiv.org/html/2606.15054#A10)\.
##### Standard BatchTopK SAE \(baseline\)\.
$$
\begin{aligned}
z &= \operatorname{BatchTopK}\!\big(\operatorname{ReLU}(x_c W_{\mathrm{enc}}^{\top} + b_{\mathrm{enc}})\big),\\
\hat{x} &= z W_{\mathrm{dec}} + b_{\mathrm{dec}}, \qquad \mathcal{L} = \|x-\hat{x}\|^{2}.
\end{aligned}
$$

Encoder rows are unconstrained, so the pre\-bias score is $s_i(x)=\|W_{\mathrm{enc},i}\|\,\|x_c\|\,\cos(W_{\mathrm{enc},i},x_c)$\.
##### Adaptive Cosine SAE \(global $a$\)\.
$$
\begin{aligned}
x_{\mathrm{unit}} &= \operatorname{normalize}(x_c),\\
W_{\mathrm{unit}} &= \operatorname{normalize}(W_{\mathrm{enc}}, \mathrm{dim}{=}{-}1),\\
\cos(x_c, w_i) &= (x_{\mathrm{unit}} W_{\mathrm{unit}}^{\top})_i,\\
\gamma(x) &= \exp\!\big(a \log\|x_c\| + b\big), \quad a, b \in \mathbb{R},\\
s_i(x) &= \gamma(x)\,\cos(x_c, w_i) + b_{\mathrm{enc},i},\\
z &= \operatorname{BatchTopK}\!\big(\operatorname{ReLU}(s)\big),\\
\hat{x} &= z W_{\mathrm{dec}} + b_{\mathrm{dec}}, \quad \mathcal{L} = \|x-\hat{x}\|^{2}.
\end{aligned}
$$

Ignoring the bias $b_{\mathrm{enc},i}$: at $a=0$, $s_i=e^{b}\cos(x_c,w_i)$; at $a=1$, $s_i=e^{b}\langle W_{\mathrm{unit},i},x_c\rangle$\. Init: $a=0$, $b=\log\sqrt{d_{\mathrm{model}}}$\. Added params: $2$ scalars\.
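A minimal pure-Python sketch of the global adaptive cosine score may help make the two endpoints concrete. It is an illustration under stated assumptions, not the released implementation: the function name, the toy weights, and the `eps` clamp guarding against zero norms are hypothetical additions, and BatchTopK is left out so that only the pre-activation score is shown.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2(v):
    return math.sqrt(dot(v, v))

def adaptive_cosine_scores(x_c, W_enc, b_enc, a, b, eps=1e-12):
    """s_i = exp(a * log||x_c|| + b) * cos(x_c, w_i) + b_enc[i].
    The eps clamp is an assumption added here to avoid log(0) and 0/0."""
    x_norm = max(l2(x_c), eps)
    gamma = math.exp(a * math.log(x_norm) + b)
    scores = []
    for w, bias in zip(W_enc, b_enc):
        cos = dot(w, x_c) / (max(l2(w), eps) * x_norm)
        scores.append(gamma * cos + bias)
    return scores

# Hypothetical toy setup with d_model = 2 and two dictionary rows.
d_model = 2
W_enc = [[3.0, 4.0], [1.0, 0.0]]
b_enc = [0.0, 0.0]
x_c = [1.0, 2.0]
b_init = math.log(math.sqrt(d_model))  # init from the text: b = log sqrt(d_model)

# a = 0: pure cosine, so rescaling the input leaves the scores unchanged.
s_a0 = adaptive_cosine_scores(x_c, W_enc, b_enc, a=0.0, b=b_init)
s_a0_scaled = adaptive_cosine_scores([5.0 * c for c in x_c], W_enc, b_enc, a=0.0, b=b_init)

# a = 1, b = 0: recovers the inner product with unit-normalized rows.
s_a1 = adaptive_cosine_scores(x_c, W_enc, b_enc, a=1.0, b=0.0)
print(s_a0, s_a0_scaled, s_a1)
```

At the stated initialization the score equals $\sqrt{d_{\mathrm{model}}}\cos(x_c,w_i)$ plus the bias, which keeps pre-activations on a scale comparable to inner products of unit-variance activations; that reading of the init is an interpretation, not a claim from the text.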
##### Per\-Feature Adaptive Cosine SAE \(base\+delta\)\.
$$\begin{aligned}
a_i &= a_{\mathrm{base}} + \delta_i, \quad \big(a_{\mathrm{base}}, \{\delta_i\}\big)\in\mathbb{R}^{1+d_{\mathrm{sae}}},\\
\gamma_i(x) &= \exp\!\big(a_i\log\|x_c\| + b_i\big),\\
s_i(x) &= \gamma_i(x)\,\cos(x_c,w_i) + b_{\mathrm{enc},i}.
\end{aligned}$$
Encode/decode/loss as Adaptive Cosine. Init: $a_{\mathrm{base}}=0$, $\delta_i=0$, $b_i=\log\sqrt{d_{\mathrm{model}}}$. Added params: $2d_{\mathrm{sae}}+1$ scalars ($\leq 0.1\%$). The base+delta parameterization prevents per-feature instability at deep layers (§[B.2.2](https://arxiv.org/html/2606.15054#A2.SS2.SSS2)).
##### Magnitude\-Bypass SAE\.
$$\begin{aligned}
x_{\mathrm{unit}} &= x_c/\|x_c\|, \quad W_{\mathrm{unit}} = \operatorname{normalize}(W_{\mathrm{enc}}, \mathrm{dim}{=}{-}1),\\
z &= \operatorname{BatchTopK}\!\big(\operatorname{ReLU}(x_{\mathrm{unit}} W_{\mathrm{unit}}^{\top})\big),\\
W_{\mathrm{dec}}^{u} &= \operatorname{normalize}(W_{\mathrm{dec}}, \mathrm{dim}{=}{-}1),\\
x_{\mathrm{raw}} &= zW_{\mathrm{dec}}^{u},\\
\hat{x} &= \|x_c\|\cdot\frac{x_{\mathrm{raw}}}{\|x_{\mathrm{raw}}\|} + b_{\mathrm{dec}}, \quad \mathcal{L} = \|x-\hat{x}\|^{2}.
\end{aligned}$$
Encoder activations bounded in $[0,1]$; magnitude $\|x_c\|$ enters only the decoder rescale. The sparse code carries no magnitude information.
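The forward pass above can be sketched for a single token as follows. This is a simplified illustration: BatchTopK is omitted (every positive cosine is kept), the helper names are hypothetical, and the code assumes at least one feature is active so that the raw reconstruction is nonzero.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def normalize(v):
    n = l2_norm(v)
    return [c / n for c in v]


def magnitude_bypass_forward(x, W_enc, W_dec, b_dec):
    """Single-token forward pass; BatchTopK is omitted for brevity."""
    x_c = [x_j - b_j for x_j, b_j in zip(x, b_dec)]
    x_unit = normalize(x_c)
    # ReLU of cosine scores: every entry of z lies in [0, 1].
    z = [max(0.0, sum(p * q for p, q in zip(x_unit, normalize(w)))) for w in W_enc]
    W_dec_unit = [normalize(row) for row in W_dec]
    x_raw = [sum(z[i] * W_dec_unit[i][j] for i in range(len(z))) for j in range(len(x))]
    # Norm restoration: rescale the raw reconstruction to ||x_c||, then add b_dec back.
    scale = l2_norm(x_c) / l2_norm(x_raw)
    x_hat = [scale * r + b_j for r, b_j in zip(x_raw, b_dec)]
    return z, x_hat


# Toy dictionary (illustrative): three features in two dimensions, tied encoder/decoder.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_dec = [1.0, -1.0]
z, x_hat = magnitude_bypass_forward([31.0, 39.0], W, W, b_dec)
z_big, _ = magnitude_bypass_forward([301.0, 399.0], W, W, b_dec)
```

Here `z` and `z_big` coincide even though the second input has ten times the centered norm, which is the sense in which the sparse code carries no magnitude information; the norm only reappears in `x_hat`.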
##### Gradient asymmetry\.
For an active feature $i$:
$$\frac{\partial s_i^{\mathrm{std}}}{\partial W_{\mathrm{enc},i}} = x_c \;\;\Rightarrow\;\; \left\|\frac{\partial s_i^{\mathrm{std}}}{\partial W_{\mathrm{enc},i}}\right\| = \|x_c\|.$$
For all cosine variants, $s_i$ depends on $x_c$ through $x_c/\|x_c\|$ and a scalar, so the encoder-gradient norm is bounded independently of input magnitude.
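The cosine-side bound can be spelled out with a short derivation. This is a sketch under the definitions above; $\theta_i$ (the angle between $x_c$ and $W_{\mathrm{enc},i}$) is notation introduced here, and $\gamma$ is the scalar of the adaptive variant ($\gamma\equiv 1$ for magnitude-bypass, $\gamma_i$ in the per-feature case).

$$\begin{aligned}
\frac{\partial}{\partial w}\cos(x_c, w) &= \frac{1}{\|w\|}\left(\frac{x_c}{\|x_c\|} - \cos(x_c,w)\,\frac{w}{\|w\|}\right),\\
\left\|\frac{\partial}{\partial w}\cos(x_c, w)\right\| &= \frac{\sin\theta}{\|w\|} \;\leq\; \frac{1}{\|w\|},\\
\left\|\frac{\partial s_i}{\partial W_{\mathrm{enc},i}}\right\| &= \gamma(x)\,\frac{\sin\theta_i}{\|W_{\mathrm{enc},i}\|}, \qquad \gamma(x) = e^{b}\|x_c\|^{a}.
\end{aligned}$$

Read this way, the gradient norm is exactly independent of $\|x_c\|$ at $a=0$, and for a learned $0<a<1$ it grows only as $\|x_c\|^{a}$, compared with the linear growth $\|x_c\|$ of the standard encoder.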
##### Held fixed across variants\.
BatchTopK $k=80$; geometric-median init of $b_{\mathrm{dec}}$; decoder columns Kaiming + unit-norm; encoder $=W_{\mathrm{dec}}^{\top}$ ($0.1\cdot W_{\mathrm{dec}}^{\top}$ for Standard); Adam, learning rate, schedule; token budget; $d_{\mathrm{sae}}$. All $\dagger$ and unmarked rows use the AuxK auxiliary loss of Gao et al. ([2024](https://arxiv.org/html/2606.15054#bib.bib12)); the $\ddagger$ rows (from an early 5M-token sweep) predate its adoption, but exp46 confirms aux-k is a no-op for cosine-encoder architectures at that budget (§[B.2.3](https://arxiv.org/html/2606.15054#A2.SS2.SSS3)).
### B.2 Design-Space Exploration
The final architecture emerged from a series of failures. Each design choice prevents a specific collapse mode. Table [2](https://arxiv.org/html/2606.15054#A2.T2) summarizes the full ablation; the subsections below explain the rationale.
Table 2: Design-space ablation. Headline setting: Qwen3-8B L18, 500M tokens; $\dagger$ from L27/50M (same recipe); $\ddagger$ from L27/5M (early sweep; no aux-k, but exp46 confirms aux-k is a no-op at this budget). *Cos>inner* is the per-feature win rate (%) on the diagnostic of §[C](https://arxiv.org/html/2606.15054#A3). [[exp40/42c/44](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)] [[exp46](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/46_normscope_ablation/README.md)] [[exp48](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/48_input_norm_verification/README.md)]
#### B.2.1 Why the scale parameter is necessary
The most natural first attempt is pure cosine scoring: normalize both the input and the encoder rows, so $s_i=\cos(x_c,w_i)+b_{\mathrm{enc},i}$. This removes all norm dependence from feature selection. It also fails catastrophically: FVE $=0.009$ with 97% dead features (Table [2](https://arxiv.org/html/2606.15054#A2.T2), row “naive cosine”). [[exp10](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/10_cosine_sae_training/README.md)]
The failure is a reconstruction-scale mismatch. Cosine scores are bounded in $[-1,1]$; after BatchTopK and ReLU, the sparse code $z$ has entries in $[0,1]$. The decoder must reconstruct $x$ at its full scale ($\|x_c\|\sim 64$–$400$ depending on layer), but $zW_{\mathrm{dec}}$ produces vectors of order $\|z\|_1\cdot 1\approx k=80$. At shallow layers where $\|x_c\|\approx 64$ this nearly works; at L27 where $\|x_c\|\approx 407$ the decoder cannot bridge the gap. Features receive near-zero gradient because reconstruction error is dominated by the global scale mismatch rather than directional errors, and 97% die within the first few thousand steps.
Adding a learned global intercept $b$ (so the scale is $e^{b}$, a single scalar) partially recovers: FVE rises to $0.449$ but 76% of features still die (row “global $b$, no $a$”). The intercept provides a fixed boost but cannot adapt to the token-level norm variation that the decoder needs. Freeing the exponent $a$ (so scale $=e^{a\log\|x_c\|+b}=e^{b}\|x_c\|^{a}$) lets the encoder pass exactly as much per-token magnitude as reconstruction requires. The optimizer converges to $a\approx 0.26$ globally and $\bar{a}_i\approx 0.08$ per-feature; 0% dead, FVE $=0.77$. [[exp12](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/12_adaptive_scaling/README.md)] [[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)]
#### B.2.2 Per-feature instability and the base+delta fix
Allowing each feature its own exponent $a_i$ is appealing: a positional feature may need more norm information than a semantic one. The direct parameterization ($a_i$ free, initialized at 0) works at shallow layers and short budgets (L9/5M: 0% dead, matching the global variant). At L27 with 50M tokens, it collapses: 83.4% dead (54,657 of 65,536 features). This cascade occurs even with encoder unit-normalization active and is specific to the per-feature degree of freedom in the exponent, not to missing normalization constraints. [[exp47](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/47_perfeature_robustness/README.md)]
The failure mechanism is a winner-take-all cascade. At L27, $\|x_c\|\approx 407$ while the initial scale is $e^{b}=\sqrt{d}=64$, a $6.4\times$ mismatch. Some features randomly receive slightly more gradient early in training (due to initialization variance in $W_{\mathrm{enc}}$ directions), fire more often, and grow their $a_i$ to capture more of the high-norm tokens. As their $a_i$ increases, their pre-activations on high-norm tokens inflate further, raising the batch-wide TopK threshold. Features that started with slightly less gradient now fall below the threshold, receive zero gradient, and die. The cascade is self-reinforcing: at step 5,500, approximately 43,000 features die simultaneously in a single 500-step window. By step 10,000 the dictionary has 83% dead features and cannot recover.
The base+delta parameterization $a_i=a_{\mathrm{base}}+\delta_i$ prevents this cascade. The shared $a_{\mathrm{base}}$ moves all features together during early training, ensuring the global scale tracks $\|x_c\|$ before any per-feature divergence can occur. Once $a_{\mathrm{base}}$ has converged (typically by step 5,000), the per-feature $\delta_i$ offsets begin differentiating. The result: 0.4% dead at L27/50M (vs. 83.4% unconstrained), with FVE $=0.772$ vs. $0.721$. [[exp47](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/47_perfeature_robustness/README.md)]
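One way to read the anchoring effect is through the chain rule: since $a_i=a_{\mathrm{base}}+\delta_i$, the shared base receives the sum of the per-feature exponent gradients, so it keeps moving every feature's scale even when some features get no gradient of their own. The sketch below illustrates this under that reading; the function names and the gradient values are hypothetical.

```python
import math


def per_feature_scales(x_norm, a_base, delta, b):
    """gamma_i = exp((a_base + delta_i) * log(x_norm) + b_i) for each feature i."""
    log_norm = math.log(x_norm)
    return [math.exp((a_base + d_i) * log_norm + b_i) for d_i, b_i in zip(delta, b)]


def exponent_grads(grad_a):
    """Chain rule for a_i = a_base + delta_i, given dL/da_i for each feature.

    Returns (dL/da_base, [dL/d delta_i]): the base receives the sum over features.
    """
    return sum(grad_a), list(grad_a)


d_model = 4096  # sqrt(d_model) = 64, as in the text
b_init = [math.log(math.sqrt(d_model))] * 3
scales_at_init = per_feature_scales(407.0, a_base=0.0, delta=[0.0, 0.0, 0.0], b=b_init)

# Hypothetical gradients: feature 2 is inactive, so its own exponent gradient is zero.
grad_base, grad_delta = exponent_grads([-0.3, -0.1, 0.0])
```

At initialization every scale equals $\sqrt{d}=64$ regardless of the input norm; in the toy gradients, `grad_delta[2]` is zero but `grad_base` is not, so an update to the base would still change the third feature's scale.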
Figure 8: Winner-take-all cascade in unconstrained per-feature $a_i$. Qwen3-8B L27, 50M tokens. (A) Unconstrained per-feature (red) loses 67% of features in a single 500-step window at step 5,500; base+delta (blue) retains all features throughout. (B) Base+delta also achieves higher final FVE ($0.77$ vs. $0.72$) because surviving features encode useful content rather than fighting for norm-dominated TopK slots. [[exp47](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/47_perfeature_robustness/README.md)]
At the headline setting (L18, 500M tokens), base+delta achieves the best sparse probing in the entire table ($0.815$) at 0% dead and matched FVE ($0.771$).
#### B.2.3 Stabilization mechanisms: encoder normalization and norm restoration
Once input normalization strips $\|x\|$ from the encoder’s view, something must prevent norm-dominance from re-entering through other paths. Two mechanisms independently solve this problem: (1) encoder unit-normalization, which prevents encoder-row magnitudes from drifting and recapitulating norm dominance on the weight side; and (2) post-decode norm restoration, which rescales the reconstruction to match $\|x_c\|$ externally. Either one suffices; removing both is catastrophic.
##### Encoder normalization\.
Unit-normalizing encoder rows ($w_i\leftarrow w_i/\|w_i\|$ on each forward pass) ensures the cosine score is a true cosine similarity. Dropping this constraint while keeping input normalization (row “input norm only, no scale”) gives FVE $=0.297$ at 93.5% dead. The failure mode: without $\|w_i\|=1$, encoder row norms drift freely. Rows that happen to grow large dominate the TopK competition (their pre-activations scale with $\|w_i\|$), starving smaller rows of gradient. This recapitulates the same norm-dominance pathology we set out to fix, but now on the encoder side rather than the input side. The collapse is layer-independent: free-encoder adaptive cosine dies at 90.8% (L9), 94.6% (L18), and 93.2% (L27), all at 50M tokens. [[exp23](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/23_input_norm_ablation/README.md)] [[exp45](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/45_encoder_norm_ablation/README.md)]
Releasing encoder normalization while keeping post-decode norm restoration and a global learned $a,b$ (row “no enc. norm, global $a,b$”) also collapses: 93.2% dead, with $a$ running away to $0.62$ (toward inner product). The restoration step helps the decoder but cannot prevent the encoder-side norm drift that kills feature selection. The one exception: per-feature $b_i$ with free encoder rows achieves 0.4% dead (row “no enc. norm, per-feat $b_i$”). Here each feature’s individual scale $e^{b_i}$ can compensate for encoder-row magnitude variation, effectively absorbing $\|w_i\|$ into $b_i$. This works but at a cost: FVE matches the baseline ($0.770$); sparse-probing performance is untested for this variant. [[exp46](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/46_normscope_ablation/README.md)]
##### Norm restoration\.
Within the magnitude-bypass family (no learned $a$, sparse codes bounded in $[0,1]$), norm restoration is the load-bearing stabilizer. A factorial ablation at L27/5M shows: restoration-on variants all land within 0.8% FVE ($0.550$–$0.558$) at 0% dead regardless of encoder/decoder normalization; restoration-off variants collapse to $\sim 89$% dead with FVE $0.08$–$0.14$ (Figure [9](https://arxiv.org/html/2606.15054#A2.F9)). Encoder and decoder unit-norm constraints are decorative when restoration is active. [[exp46](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/46_normscope_ablation/README.md)]
##### Learned exponent as a third path\.
With a learned exponent $a$ (the adaptive and per-feature variants), neither encoder normalization nor norm restoration is strictly necessary. The per-feature base+delta variant achieves 0% dead and FVE $0.771$ with encoder normalization but without restoration (Table [2](https://arxiv.org/html/2606.15054#A2.T2), row 6). Conversely, base+delta without encoder normalization but with restoration achieves FVE $=0.740$ at 20% dead (row “no enc. norm, base+$\delta_i$”); functional but degraded. The learned exponent provides partial self-stabilization: pre-activations scale as $\|x_c\|^{a_i}$, so the sparse code carries per-token magnitude directly and the decoder can reconstruct at the correct scale without an external rescale.
The base+delta parameterization is uniquely robust because the shared $a_{\mathrm{base}}$ anchors all features to a common scale (preventing the encoder-side divergence that kills free-encoder variants) while the per-feature $\delta_i$ offsets allow heterogeneity without competitive pressure. This is why it survives with only one stabilizer active. Other parameterizations (global $a$, unconstrained per-feature) require encoder normalization.
##### Summary and open questions\.
In practice, our recommended variants use encoder normalization (which is cheap and guarantees true cosine geometry). The adaptive/per-feature family does not use norm restoration; the magnitude-bypass family relies on it. We have not ablated all pairwise combinations at the headline setting. Specifically, (1) per-feature base+delta with both encoder normalization and norm restoration active, and (2) per-feature base+delta without either, remain untested at 500M tokens. Whether the 20% dead rate of the no-enc-norm variant (exp48, L27/50M) improves with longer training or worsens is unknown. These are natural directions for future exploration.
Figure 9: Magnitude-bypass family: norm restoration is load-bearing; encoder/decoder normalization is decorative. L27, 5M tokens, aux-k on (confirmed no-op). All four restoration-on variants (blue) achieve FVE $\approx 0.55$ at 0% dead regardless of enc/dec norm. Restoration-off variants (red) collapse to $\geq 88$% dead. For the adaptive family, encoder normalization plays the equivalent stabilizing role (see text above). [[exp46](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/46_normscope_ablation/README.md)]
#### B.2.4 Initialization sensitivity
The scale parameter $b$ is initialized to $\log\sqrt{d_{\mathrm{model}}}$, which sets the initial pre-activation magnitude to $\sqrt{d}\cdot\cos(\theta)$, with $\sqrt{d}\approx 64$ for Qwen3-8B. This works when residual-stream norms are of similar order or larger (Qwen L9: $\|x\|\approx 58$, L27: $\approx 407$). It fails catastrophically when norms are much smaller. [[exp42](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42_mistral_init/README.md)]
On Mistral-7B, activation norms at L8 are only $6.3$ ($10\times$ smaller than $\sqrt{d}=64$). The $\sqrt{d}$ initialization overshoots: pre-activations are an order of magnitude too large, and features saturate and die immediately (100% dead at L8, 97.6% at L16). No gradient flows from dead features, so the optimizer cannot recover. This is the opposite failure mode from Qwen L27, where $\sqrt{d}$ undershoots ($\|x\|=407$ vs. $\sqrt{d}=64$) but features stay alive and the optimizer climbs toward the correct scale. [[exp25](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/25_multimodel_matrix/README.md)]
The asymmetry: undershoot is recoverable (gradients still flow), overshoot is not (dead features produce no gradient). We resolve this with a norm-adaptive initialization rule: if the mean centered activation norm $\operatorname{mean}\|x_{\mathrm{train}}-b_{\mathrm{dec}}\|$ differs from $\sqrt{d_{\mathrm{model}}}$ by more than $2\times$, use $b=\log(\operatorname{mean}\|x\|)$ instead. With this fix, Mistral achieves positive cosine advantage at all layers (+1.9% FVE at L8, +3.0% at L24, 55% fewer dead features). [[exp42](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42_mistral_init/README.md)]
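The initialization rule can be sketched as a small helper. Two points here are assumptions rather than statements from the text: "differs by more than $2\times$" is read as a ratio outside $[1/2, 2]$, and the `overshoot_only` flag is a hypothetical switch that applies the rule only when $\sqrt{d_{\mathrm{model}}}$ exceeds the measured norm, the regime the following paragraph identifies as the one where the fix is needed.

```python
import math


def init_log_scale(mean_centered_norm, d_model, tolerance=2.0, overshoot_only=False):
    """Return the initial b (log-scale) for the cosine encoder.

    mean_centered_norm: mean of ||x_train - b_dec|| measured on training activations.
    """
    default_scale = math.sqrt(d_model)
    ratio = mean_centered_norm / default_scale
    too_small = ratio < 1.0 / tolerance  # sqrt(d) would overshoot the activations
    too_large = ratio > tolerance        # sqrt(d) would undershoot the activations
    if too_small or (too_large and not overshoot_only):
        return math.log(mean_centered_norm)
    return math.log(default_scale)


# Norms quoted in the text; d_model = 4096 gives sqrt(d_model) = 64.
b_mistral_l8 = init_log_scale(6.3, 4096)
b_qwen_l9 = init_log_scale(58.0, 4096)
b_qwen_l27 = init_log_scale(407.0, 4096, overshoot_only=True)
```

Under these assumptions the Mistral L8 case switches to $\log(6.3)$, while both Qwen cases keep the $\log\sqrt{d}$ default.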
A subtlety at longer budgets: on Qwen L27 at 50M tokens, norm-adaptive init hurts relative to $\sqrt{d}$ because it removes the gradient pressure that teaches $a$ to compensate for scale (the optimizer “needs to struggle” with a wrong $b$ to discover that magnitude matters). The $\sqrt{d}$ default works when norms exceed the init; norm-adaptive is necessary only in the overshoot regime. [[exp34](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/34_multi_seed/README.md)] [[exp41](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/41_optimal_500m/README.md)]
Table 3: Initialization sensitivity across models. The direction of the mismatch determines recoverability. [[exp42](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42_mistral_init/README.md)] [[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)]
#### B.2.5 Group-size experiments
Rather than one global $a$ or 65,536 free $a_i$ values, an intermediate option partitions the features into $G$ groups, each sharing a single exponent $a$. We tested $G\in\{1,4,16,64,256,9216\}$ at L13 on Gemma-2-2B/50M tokens. [[exp37](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/37_regularized_training/README.md)]
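A minimal sketch of group sharing, reading $G$ as the number of groups (so $G=1$ is fully global and $G=d_{\mathrm{sae}}$ is fully per-feature). The contiguous, equal-size assignment of features to groups is an assumption made for illustration; the text does not say how features are assigned.

```python
def group_exponents(group_a, d_sae):
    """Expand G shared exponents into one exponent per feature.

    Assumes contiguous, equal-size groups: feature i belongs to group i // (d_sae // G).
    """
    num_groups = len(group_a)
    if d_sae % num_groups != 0:
        raise ValueError("d_sae must be divisible by the number of groups")
    group_size = d_sae // num_groups
    return [group_a[i // group_size] for i in range(d_sae)]


# Illustrative exponent values for G = 4 and G = 1 on a 9,216-feature dictionary.
a_four_groups = group_exponents([0.1, 0.2, 0.3, 0.4], d_sae=9216)
a_global = group_exponents([0.25], d_sae=9216)
```

With $d_{\mathrm{sae}}=9{,}216$, four groups give 2,304 features per shared exponent.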
$G=4$ (4 groups of 2,304 features) achieves the best SAEBench composite in this sweep: KL-score $0.9785$, CE-score $0.9786$, RAVEL $0.619$, outperforming both the fully global setting ($G=1$: KL $0.9767$, RAVEL $0.618$) and the fully free setting ($G=9216$: KL $0.9783$, RAVEL $0.609$). Dead-feature rates decrease monotonically with the number of groups (41% at $G=1$ to 20% at $G=9216$), but downstream task performance is non-monotone.
We note two caveats. First, this experiment uses Gemma-2-2B at 50M tokens with a smaller dictionary ($d_{\mathrm{sae}}=9{,}216$); results may not transfer directly to the headline Qwen3-8B/65,536 setting. Second, differences between adjacent settings of $G$ are small (within $0.002$ on KL-score), so the $G=4$ optimum is suggestive rather than definitive. The broader conclusion is more robust: some per-feature heterogeneity in norm dependence improves downstream metrics relative to a single global $a$, while fully unconstrained per-feature freedom can hurt.
Figure 10: Group-size sweep. Gemma-2-2B L13, 50M tokens, $d_{\mathrm{sae}}=9{,}216$. (A) Dead-feature rate decreases monotonically with the number of groups $G$. (B) KL-score and RAVEL peak at $G=4$ (star), though differences are small ($<0.002$ KL-score between adjacent settings). The result is suggestive of an intermediate optimum but has not been replicated at the headline scale. [[exp37](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/37_regularized_training/README.md)]
In practice, we use per-feature base+delta rather than group-sharing because: (a) it achieves 0% dead at the headline setting; (b) the shared base provides a similar anchoring effect to small group sizes; (c) it adds negligible parameter overhead ($2d_{\mathrm{sae}}+1$ scalars). A systematic comparison of group-sharing vs. base+delta at the headline scale (Qwen3-8B, 500M tokens, $d_{\mathrm{sae}}=65{,}536$) remains a natural future direction.
#### B.2.6 Summary of design lessons
1. Pure cosine (no scale) fails because the decoder cannot bridge the norm gap between bounded sparse codes and real activation magnitudes (§[B.2.1](https://arxiv.org/html/2606.15054#A2.SS2.SSS1)).
2. The learned exponent $a$ lets the encoder pass exactly as much magnitude as reconstruction requires; the optimizer consistently chooses $a\ll 1$ (§[B.3](https://arxiv.org/html/2606.15054#A2.SS3)).
3. Per-feature $a_i$ must be anchored (base+delta) to prevent winner-take-all cascades at deep layers where $\|x_c\|\gg\sqrt{d}$ (§[B.2.2](https://arxiv.org/html/2606.15054#A2.SS2.SSS2)).
4. Either encoder unit-normalization or post-decode norm restoration prevents collapse; the magnitude-bypass family requires restoration, while the adaptive family requires encoder norm. Per-feature base+delta can partially survive with only one, but works best with encoder normalization (§[B.2.3](https://arxiv.org/html/2606.15054#A2.SS2.SSS3)).
5. Moderate per-feature freedom (base+delta or $G=4$) outperforms both fully global and fully free parameterizations (§[B.2.5](https://arxiv.org/html/2606.15054#A2.SS2.SSS5)).
6. Initialization must not overshoot: $\sqrt{d}$ works when norms exceed the init; norm-adaptive is needed otherwise (§[B.2.4](https://arxiv.org/html/2606.15054#A2.SS2.SSS4)).
### B.3 Empirical Convergence
What does the optimizer choose when given freedom? This section reports the learned values of $a$ and $a_i$ at the headline setting.
Table 4: Learned $a$ across layers (Qwen3-8B, community recipe, 500M tokens). [[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)]
At L9 the optimizer drives $a$ to near zero (pure cosine); at deeper layers it retains modest norm dependence ($a\approx 0.26$). No feature at any layer or token budget learns $a_i>0.5$; the inner-product regime ($a=1$) is never preferred. Per-feature means are consistently lower than the global $a$, suggesting that when features can individually tune their norm dependence, most want less magnitude rather than more.
Figure 11: Per-feature $a_i$ distribution at three token budgets (Qwen3-8B L27). Mass shifts toward larger $a_i$ with more data; no $a_i>0.5$ in any setting. [[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)]
Table 5: $\{a_i\}$ statistics across token budgets. [[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)] [[exp16](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/16_perfeature_scaling/README.md)] [[exp17](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/17_production_scale/README.md)]
At short budgets (5M), 88–99% of features remain near $a_i\approx 0$. With more data, the distribution shifts: at 500M only 23% stay near zero. This suggests that early in training, features learn content directions first (cosine is sufficient), and only later differentiate their norm dependence as finer-grained reconstruction demands emerge. Even at 500M, the distribution remains far below the inner-product limit ($a=1$), confirming that the optimizer consistently prefers direction-dominated scoring.
Pinned cosine ($a=0$). Freezing $a$ at 0 and training only $b$ produces a fixed-scale cosine encoder. The trained-$\{a_i\}$ distributions above show this is suboptimal: the optimizer wants $a>0$ at deeper layers, and pinned cosine cannot adapt to per-token norm variation. This variant is included in the ablation table for completeness but is not recommended; global $a$ adds one scalar and strictly dominates.
## Appendix C Diagnostic: Cosine vs. Inner Product as a Causal Predictor
Extended version of §[4.2](https://arxiv.org/html/2606.15054#S4.SS2). Given fixed decoder directions from a trained Standard SAE, does cosine score have higher absolute correlation with a causal projection-ablation effect than inner product?
Definitions. Projection ablation: for unit decoder direction $d_f$, $\mathrm{ablate}(x;f)=x-(d_f^{\top}x)\,d_f$, re-installed via a forward hook at the ablation layer. KL-divergence ablation effect: $y_t=\mathrm{KL}\!\big(\,p_{\mathrm{orig}}(\cdot\mid x_t)\,\|\,p_{\mathrm{abl}}(\cdot\mid x_t;f)\,\big)$.
Win-rate protocol. For each feature $f$ with unit decoder direction $d_f$ and $m$ held-out tokens: $s^{\cos}_t=\cos(x_t,d_f)$, $s^{\mathrm{ip}}_t=\langle x_t,d_f\rangle$, and $y_t$ is the KL-divergence ablation effect defined above. Decoder directions $d_f$ are taken from trained Standard SAEs. Cosine wins for feature $f$ iff $|\mathrm{corr}(s^{\cos},y)|>|\mathrm{corr}(s^{\mathrm{ip}},y)|$; the win rate is the fraction of features on which cosine wins. The 50% null corresponds to no systematic advantage for either scoring rule. Projection ablation makes the perturbation magnitude exactly $\langle x,d_f\rangle$, biasing KL toward $|\langle x,d_f\rangle|$, which means the diagnostic is *conservative* with respect to cosine (Figure [12](https://arxiv.org/html/2606.15054#A3.F12)).
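The scoring side of the protocol can be sketched as below, taking the per-token KL effects $y_t$ as given (computing them requires a model forward pass with the ablation hook, which is not shown). Function names and the toy numbers are illustrative assumptions; `pearson` assumes both inputs have nonzero variance.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pearson(u, v):
    """Pearson correlation; assumes both sequences have nonzero variance."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def project_out(x, d_f):
    """Projection ablation x - (d_f . x) d_f for a unit direction d_f."""
    coeff = dot(d_f, x)
    return [x_j - coeff * d_j for x_j, d_j in zip(x, d_f)]


def cosine_wins(tokens, d_f, kl_effects):
    """True iff |corr(cosine score, y)| > |corr(inner-product score, y)|."""
    s_ip = [dot(x, d_f) for x in tokens]
    s_cos = [ip / math.sqrt(dot(x, x)) for ip, x in zip(s_ip, tokens)]
    return abs(pearson(s_cos, kl_effects)) > abs(pearson(s_ip, kl_effects))


def win_rate(per_feature_data):
    """Fraction of (tokens, d_f, kl_effects) triples on which cosine wins."""
    wins = [cosine_wins(tokens, d_f, y) for tokens, d_f, y in per_feature_data]
    return sum(wins) / len(wins)


# Toy data: one feature whose effect tracks cosine, one whose effect tracks inner product.
tokens = [[1.0, 0.0], [2.0, 2.0], [0.5, 3.0], [6.0, 8.0]]
d_f = [1.0, 0.0]
y_tracks_cos = [1.0, 0.7, 0.15, 0.6]
y_tracks_ip = [1.0, 2.0, 0.5, 6.0]
rate = win_rate([(tokens, d_f, y_tracks_cos), (tokens, d_f, y_tracks_ip)])
```

On this toy data cosine wins on the first feature and loses on the second, so `rate` is 0.5, the value the text calls the null.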
Figure 12: Win-rate diagnostic protocol. (a) Projection ablation removes a decoder direction from the residual stream; the downstream KL-divergence measures its causal importance. (b) Per-sample cosine and inner-product scores are correlated with this causal effect; the scoring rule with higher absolute correlation “wins” for that feature.
Figure 13: Cosine-vs.-inner-product win rate by depth. Deep RMSNorm: 70–90%. Deep LayerNorm: at chance. [[exp25](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/25_multimodel_matrix/README.md)]
Depth dependence on RMSNorm. Qwen3-8B: L9 $\approx 44\%$, L18 $\approx 65\%$, L27 $\approx 78\%$. Norm-patching at L18 produces $\sim 10\times$ less downstream-KL change than direction-patching at fixed norm. Cross-model RMSNorm: Mistral-7B 74–88%; Gemma-2-2B matches Qwen with a low value at L20. LayerNorm (Pythia-2.8B / Pythia-6.9B / Falcon-7B) drops from $\sim 90\%$ shallow to 40–57% deep (Fig. [13](https://arxiv.org/html/2606.15054#A3.F13)). [[exp25](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/25_multimodel_matrix/README.md)] On constructed token pairs (high-cosine / low-norm vs. low-cosine / high-norm), the high-cosine token has the larger KL effect on $\sim 90\%$ of features at deep layers.
Caveats. Within a single norm quartile the rate drops to 40–70%, so the 70–90% rate is partly between-quartile (Simpson’s paradox; Figure [14](https://arxiv.org/html/2606.15054#A3.F14)); the training-time results in §[4.1](https://arxiv.org/html/2606.15054#S4.SS1) do not depend on this measurement. The inflation gap grows with depth (+18% at L9, +23% at L18, +33% at L27), consistent with deeper layers having more norm variation driving the between-quartile effect. The magnitude-bypass 69% rate at 500M (§[4.3](https://arxiv.org/html/2606.15054#S4.SS3)) does not replicate at 5M tokens (27–53% across four restoration-on variants). [[exp44](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/44_norm_stratified_fve/README.md)] [[exp46](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/46_normscope_ablation/README.md)] [[exp19](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/19_norm_stratified/README.md)]
Figure 14: Simpson’s paradox in the cos>inner diagnostic. Within individual norm quartiles (blue), the win rate hovers at 40–70%; the overall rate (red) is 80–87%. The gap grows with depth as residual-stream norm variation increases. This confirms that the between-quartile component (norm variation corrupting cross-token TopK selection) drives most of the overall cos>inner rate. Standard SAE, Qwen3-8B, 50M tokens. [[exp19](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/19_norm_stratified/README.md)]
### C.1 SAE-Free Direction vs. Norm Patching
The cos>inner diagnostic in §[C](https://arxiv.org/html/2606.15054#A3) relies on SAE decoder directions. This subsection removes that dependency entirely: for 500 random cross-prompt token pairs at each layer, we decompose the residual-stream activation into direction ($\hat{x}=x/\|x\|$) and magnitude ($\|x\|$), then measure KL-divergence from the unpatched output after swapping each component independently (Figure [15](https://arxiv.org/html/2606.15054#A3.F15)).
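The two patches can be written down directly from the direction/magnitude decomposition. The sketch below only constructs the patched activations; running them through the model and measuring KL is not shown, and the function names and the target/source naming are assumptions for illustration.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def patch_direction(x_target, x_source):
    """Keep the target's norm, take the source's direction."""
    scale = l2_norm(x_target) / l2_norm(x_source)
    return [scale * c for c in x_source]


def patch_norm(x_target, x_source):
    """Keep the target's direction, take the source's norm."""
    scale = l2_norm(x_source) / l2_norm(x_target)
    return [scale * c for c in x_target]


# Toy pair (illustrative): norms 5 and 10, different directions.
x_a = [3.0, 4.0]
x_b = [0.0, 10.0]
direction_patched = patch_direction(x_a, x_b)
norm_patched = patch_norm(x_a, x_b)
```

For this pair, `direction_patched` is `[0.0, 5.0]` (norm of `x_a`, direction of `x_b`) and `norm_patched` is `[6.0, 8.0]` (direction of `x_a`, norm of `x_b`).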
Figure 15: The model reads direction, not magnitude. KL-divergence caused by swapping direction (blue) vs. norm (orange) between random token pairs at three Qwen3-8B layers. Direction patches cause $87$–$2{,}560\times$ more disruption; the ratio grows with depth as fewer subsequent layers remain for magnitude to influence direction. $n=500$ pairs per layer. [[exp8](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/08_activation_patching/README.md)]
Scaling test. Replacing a single token’s direction with a random unit vector (at fixed norm) causes $2.2\times$ (L9), $3.5\times$ (L18), and $11.6\times$ (L27) more KL than scaling its norm by $0.5$–$5\times$ at fixed direction. The depth gradient is consistent with RMSNorm progressively erasing magnitude from the sublayer-input path.
## Appendix DSparse\-Probing Coverage at Matched FVE
Setup\.Qwen/Qwen3\-8B\-Baselayer 18, 500M FineWeb tokens,dsae=65,536d\_\{\\mathrm\{sae\}\}=65\{,\}536, recipe in §[J](https://arxiv.org/html/2606.15054#A10)\. Four architectures: Standard, Adaptive Cosine SAE, Per\-Feature Adaptive Cosine SAE, Magnitude\-Bypass SAE\. Body presentation: §[4\.1](https://arxiv.org/html/2606.15054#S4.SS1)\.
Not held fixed across variants\.Encoder score range \(cosine∈\[−1,1\]\\in\[\-1,1\]vs\. unbounded inner product\); learned scale parameteraa\(≤0\.1%\\leq 0\.1\\%extra parameters\); encoder initialization \(norm\-aware vs\.d\\sqrt\{d\}\); activation dynamic range; wall\-clock time \(8–14% overhead\)\.
Table 6:Core SAEBench metrics, Qwen3\-8B L18, 500M tokens,dsae=65,536d\_\{\\mathrm\{sae\}\}=65\{,\}536\.\[[exp40/42c/44](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]Reconstruction parity\.FVE∈\[0\.767,0\.771\]\\in\[0\.767,0\.771\]; KL\-div score∈\[0\.984,0\.985\]\\in\[0\.984,0\.985\]; CE\-loss score∈\[0\.991,0\.993\]\\in\[0\.991,0\.993\]; dead features≤4\.3%\\leq 4\.3\\%\.
Sparse probing\.Top\-1 differences from Standard:\+11\.6%\+11\.6\\%\(Magnitude\-Bypass SAE\),\+13\.3%\+13\.3\\%\(Adaptive Cosine SAE\),\+14\.9%\+14\.9\\%\(Per\-Feature Adaptive Cosine SAE\)\. Per\-dataset breakdown: Table[8](https://arxiv.org/html/2606.15054#A4.T8)\. Robustness to dataset removal:
Table 7:Top\-1 vs\. Standard, with high\-margin datasets removed\.\[[exp40](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]Decoder direction overlap\.Mutual nearest\-neighbor matching: 73–78%\. Strict\>0\.95\>0\.95\-cosine\+\+Jaccard: 6–17%\. Restricted to alive\-and\-paired subset atn=3n=3–66:<1%<1\\%\(§[H](https://arxiv.org/html/2606.15054#A8)\)\. Headline comparisons are not paired feature\-to\-feature tests\.\[[exp56b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/56b_feature_overlap/README.md)\]
Per-feature interpretability. $80.1\%$ (Per-Feature Adaptive Cosine SAE) vs. $82.1\%$ (Standard), $p=0.88$ (§[F](https://arxiv.org/html/2606.15054#A6)). Without the auxiliary loss, total interpretable features differ by $\approx 4.5\times$ (§[F.1](https://arxiv.org/html/2606.15054#A6.SS1)).
Compute overhead. Training wall-clock vs. Standard: $+8\%$ (Adaptive Cosine SAE), $+10\%$ (Per-Feature Adaptive Cosine SAE), $+14\%$ (Magnitude-Bypass SAE). Inference: one extra $\log\|x_c\|$ + $\exp$ per token before BatchTopK (Adaptive Cosine SAE, Per-Feature Adaptive Cosine SAE); post-decode norm-and-multiply for Magnitude-Bypass SAE.
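The extra per-token $\log$ and $\exp$ can be illustrated with a small sketch of the blended score. It assumes the score takes the form $\cos(w_i, x_c)\cdot\|x_c\|^{a}$, evaluated as $\cos\cdot\exp(a\log\|x_c\|)$; the exact parameterization is defined in the main text and is not reproduced in this appendix, so the function below is a hypothetical stand-in rather than the released implementation.

```python
import math


def norm(v):
    return math.sqrt(sum(c * c for c in v))


def blended_score(w, x_c, a):
    """Assumed score: cos(w, x_c) * ||x_c||**a, computed via one log and one exp."""
    cos = sum(wi * xi for wi, xi in zip(w, x_c)) / (norm(w) * norm(x_c))
    return cos * math.exp(a * math.log(norm(x_c)))


w = [1.0, 0.0]        # unit-norm encoder row (assumption)
x_c = [3.0, 4.0]      # toy input with norm 5
x_big = [30.0, 40.0]  # same direction, 10x the norm

pure_cosine = blended_score(w, x_c, a=0.0)         # ignores the norm entirely
inner_product_like = blended_score(w, x_c, a=1.0)  # recovers <w, x_c> for unit-norm w
```

With $a=0$ the score is unchanged when the input is rescaled; with $a=1$ and a unit-norm encoder row it coincides with the inner product, which is the regime the optimizer is reported never to approach.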
### D\.1Per\-Dataset Sparse Probing
Table[8](https://arxiv.org/html/2606.15054#A4.T8)reports the single\-feature top\-1 by dataset for all four architectures\. Per\-Feature Adaptive Cosine SAE has higher or equal top\-1 to Standard on seven of eight datasets; amazon\_sentiment is the one dataset on which Standard is higher\.
Table 8:Single\-feature top\-1 by dataset\.\[[exp40](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]
### D\.2Per\-Architecture Sparse\-Probing Bars
Figure[16](https://arxiv.org/html/2606.15054#A4.F16)plots the per\-dataset top\-1 numbers from Table[8](https://arxiv.org/html/2606.15054#A4.T8)alongside aggregate top\-1/2/5 across the eight datasets, comparing Standard against Per\-Feature Adaptive Cosine SAE; this is the provenance figure referenced from Fig\.[2](https://arxiv.org/html/2606.15054#S1.F2)and Table[1](https://arxiv.org/html/2606.15054#S4.T1)\.
Figure 16:Per\-feature top\-1 by dataset \(Standard vs\. Per\-Feature Adaptive Cosine SAE\) and aggregate top\-kk\.\[[exp40](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]
## Appendix ESample Efficiency and Cross\-Model Generalization
Without auxiliary loss\.The headline of §[4\.1](https://arxiv.org/html/2606.15054#S4.SS1)keeps the auxiliary loss enabled in both arms\. With the auxiliary loss removed, the alive\-feature differences become visible: on Gemma\-2\-2B at 50M tokens, Standard shows 54–69% dead vs\.≈0%\\approx 0\\%for Adaptive Cosine SAE \(Table[10](https://arxiv.org/html/2606.15054#A5.T10)\)\. At 500M on Qwen3\-8B withd\\sqrt\{d\}init,dsae=16,384d\_\{\\mathrm\{sae\}\}=16\{,\}384, no auxiliary loss, the global\-aaAdaptive Cosine SAE shows\+6\.3%\+6\.3\\%FVE and2\.39×2\.39\\timesalive features over Standard at L27 \(Table[9](https://arxiv.org/html/2606.15054#A5.T9)\)\. Both architectures plateau by 300M tokens\.
Table 9:Reference runs at 500M Qwen3\-8B\. Aux: with the auxiliary loss; No\-aux: without\.\[[exp40](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]\[[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)\]Figure 17:FVE vs\. token budget on Qwen3\-8B \(500M, our recipe\)\. Adaptive Cosine SAE at the 50M checkpoint reaches Standard’s 500M FVE at L9\.\[[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)\]Table 10:Without auxiliary loss, Gemma\-2\-2B, 50M tokens\.\[[exp35](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/35_gemma_saebench/README.md)\]Cross\-model coverage\.RMSNorm: Qwen3\-8B, Gemma\-2\-2B, Mistral\-7B\. LayerNorm: Pythia\-2\.8B, Pythia\-6\.9B, Falcon\-7B, Pythia\-70M\. Token budgets vary \(500M for the Qwen L18 run, 50M for most cross\-model rows, 5M for some\); rows in Table[11](https://arxiv.org/html/2606.15054#A5.T11)are not normalized for budget or recipe\.
Table 11:Cross\-model behavior\.*cos\>\>inner*: per\-feature win rate on the diagnostic of §[C](https://arxiv.org/html/2606.15054#A3)\(50% null\); “a→ba\\to b” on LayerNorm rows is shallow→\\todeep\.*FVEΔ\\Delta*: cosine−\-standard, in absolute FVE units\.*Alive×\\times*: alive\-feature ratio of cosine over standard\.\[[exp25](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/25_multimodel_matrix/README.md)\]\[[exp35](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/35_gemma_saebench/README.md)\]Variance\.n=3n=3seeds at Qwen L27, 50M tokens: FVE0\.7370\.737\(Adaptive Cosine SAE\) vs\.0\.6570\.657\(Standard\); seed\-wise SD<0\.001<0\.001in both arms; alive\-feature ratio3\.26×3\.26\\times\.\[[exp34](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/34_multi_seed/README.md)\]The 500M L18 sparse\-probing run is single\-seed\. Cos\>\>inner above 50% at L27:p<2\.4×10−6p<2\.4\\times 10^\{\-6\}\. Per\-feature interpretability rate difference:p=0\.88p=0\.88\.\[[exp33](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/33_feature_interpretability/README.md)\]
### E\.1Depth Dependence Within Qwen3\-8B
Figure [18](https://arxiv.org/html/2606.15054#A5.F18) shows the four main variants at L9, L18, and L27 (all 50M tokens, same recipe). The key observations:
1. Architectures converge at shallow layers. At L9 ($\|x\|\approx 58\approx\sqrt{d}$), all four variants achieve 0% dead features and FVE within 0.6%. The $\|x\|/\sqrt{d}$ ratio is near unity, so the scale mismatch that drives divergence at deeper layers does not arise.
2. Cos$>$inner is below 50% at L9. Inner product is a better causal predictor at L9 (33–44%), consistent with magnitude carrying genuine signal at shallow layers where RMSNorm has not yet accumulated. The diagnostic crosses 50% between L9 and L18, and reaches 67–78% at L27.
3. Per-feature (no anchor) collapses only at L27. At L18 ($\|x\|/\sqrt{d}=1.53$), the unconstrained per-feature variant survives with 0% dead and the best FVE ($0.726$). At L27 ($\|x\|/\sqrt{d}=6.3$), it collapses to 83.4% dead. The cascade (§[B.2.2](https://arxiv.org/html/2606.15054#A2.SS2.SSS2)) is gated by the norm-to-init ratio, not by absolute depth.
4. Learned $a$ tracks the magnitude-as-noise gradient. Global $a$ rises from $0.025$ (L9) to $0.257$ (L27); at L9 the optimizer wants pure cosine, at L27 it retains modest norm sensitivity for reconstruction. No value exceeds $0.3$; the inner-product regime ($a=1$) is never approached.

Figure 18: Architecture behavior across depth. Qwen3-8B, 50M tokens, all four main variants. (A) FVE converges at L9 and diverges at L27. (B) Per-feature (no anchor) collapses catastrophically at L27; magnitude-bypass shows mild dead features at L18. (C) Cos$>$inner crosses 50% between L9 and L18, reaching 78% for magnitude-bypass at L27. (D) The optimizer drives $a$ toward zero at shallow layers and toward $\sim 0.26$ at deep layers, never approaching $a=1$.[[exp43c/43d/43](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/43_4arch_50m_l27/README.md)]

The depth story connects the mechanism (§[4.3](https://arxiv.org/html/2606.15054#S4.SS3)) to the architecture (§[B.2](https://arxiv.org/html/2606.15054#A2.SS2)): RMSNorm progressively erases magnitude from the residual stream at deeper layers, making inner-product scoring progressively worse as a feature detector. The cosine encoder’s advantage grows with depth because the “noise” it removes (magnitude) becomes a larger fraction of the total signal at deeper layers. This explains why the headline result (L18, 500M) is a moderate case; deeper layers show an even larger divergence between cosine and inner-product dictionaries, but at the cost of requiring more careful parameterization (base+delta rather than free per-feature).
### E\.2Cross\-Budget Sparse Probing
Comparing our 50M\-token cosine SAEs against the independently trained reference SAE\(Karvonenet al\.,[2025](https://arxiv.org/html/2606.15054#bib.bib25)\)at 500M tokens \(same model, same BatchTopK recipe, different codebase\):
Table 12:Sparse\-probing top\-1 across token budgets\. The cosine SAE at10×10\\timesfewer tokens surpasses the standard reference at L9 and L27\.\[[exp51](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/51_saebench_multilayer/README.md)\]The reference SAE matches our own standard baseline within noise at L18 \(both≈0\.679\\approx 0\.679at 500M\), confirming it is representative\. The cross\-budget gap is largest at L9, where the cosine encoder’s direction\-only scoring is most beneficial relative to the shallow\-layer norm distribution\.
### E\.3Cross\-Model Summary
On Gemma\-2\-2B at 50M,dsae=9,216d\_\{\\mathrm\{sae\}\}=9\{,\}216: fixed\-SAE top\-1 sparse\-probing difference\+3\.4±0\.3%\+3\.4\\pm 0\.3\\%\.\[[exp57c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/57c_gemma_replication/README.md)\]On Pythia\-2\.8B, Pythia\-6\.9B, Falcon\-7B, the cosine top\-1 advantage at deep layers does not hold; cos\>\>inner drops from∼100%\\sim 100\\%at shallow to 40% at deep \(Pythia\-2\.8B L24 sharpest\)\.\[[exp25](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/25_multimodel_matrix/README.md)\]
Scaling with model size and expansion ratio\.A3×33\\times 3matrix \(Qwen3\-1\.7B/4B/8B×\\times4×\\times/8×\\times/16×\\timesexpansion, 50M tokens, same recipe\) isolates the two potential drivers \(Figure[19](https://arxiv.org/html/2606.15054#A5.F19)\)\. Model size is the primary factor: the row mean jumps from\+6\.1\+6\.1% atdmodel=2048d\_\{\\mathrm\{model\}\}=2048to\+11\.6\+11\.6% atdmodel=2560d\_\{\\mathrm\{model\}\}=2560, then plateaus\. Expansion ratio has no systematic effect \(column means\+9\.5\+9\.5/\+8\.7\+8\.7/\+10\.1\+10\.1%\)\. This revises the earlier interpretation that dictionary size drives the gap; the Gemma\-2\-2B result \(\+3\.4\+3\.4% atdmodel=2304d\_\{\\mathrm\{model\}\}=2304, 4×\\timesexpansion\) is explained by its small model dimension, not its small dictionary\.\[[exp57](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/57_scaling_matrix/README.md)\]
Figure 19:Model size, not expansion ratio, drives the cosine advantage\.Sparse\-probing top\-1 gap \(cosine−\-standard, in percentage points\) across the Qwen3 family\. Row means \(right\) show a2×2\\timesjump from 1\.7B to≥\\geq4B; column means \(bottom\) are flat\. All cells use saprmarks recipe, 50M tokens,k=80k=80\.\[[exp57](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/57_scaling_matrix/README.md)\]
### E\.4FVE and Dead\-Feature Trends Without Auxiliary Loss
By\-layer view of the no\-auxiliary\-loss runs cited in §[4\.1](https://arxiv.org/html/2606.15054#S4.SS1)\(Qwen3\-8B, 50M tokens,dsae=16,384d\_\{\\mathrm\{sae\}\}=16\{,\}384\)\. FVE \(Fig\.[20](https://arxiv.org/html/2606.15054#A5.F20), left\) places Adaptive Cosine SAE and Per\-Feature Adaptive Cosine SAE above Standard at L9 and L27, with the gap narrowing at L18\. Dead\-feature rate \(right\) separates the cosine variants at L18: Per\-Feature stays below Standard, while the single\-global\-aaAdaptive variant matches Standard there and only recovers at L27\. With the auxiliary loss restored, the dead\-feature gap collapses to1\.9%1\.9\\%at L18 \(500M tokens\)\.


Figure 20:Without auxiliary loss, FVE and dead\-feature rate by layer \(Qwen3\-8B, 50M tokens\)\. At 500M tokens the L18 dead\-feature gap \(Per\-Feature Adaptive Cosine SAE vs\. Standard\) is1\.9%1\.9\\%\.\[[exp17](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/17_production_scale/README.md)\]\[[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)\]
## Appendix FFeature Interpretability
Setup\.1000 features per architecture, balanced across activation\-norm strata; Sonnet 4\.5 as LLM judge scoring 1–5 on description\-prediction accuracy; “interpretable” iff score≥4\\geq 4\. Single\-judge protocol; no human spot check\.
Figure 21:Per\-feature interpretability rates; total interpretable feature counts differ by≈4\.5×\\approx 4\.5\\timeswhen the alive\-feature gap is open \(no auxiliary loss\)\.\[[exp33](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/33_feature_interpretability/README.md)\]Per\-feature rates\.80\.1%80\.1\\%\(Per\-Feature Adaptive Cosine SAE\) vs\.82\.1%82\.1\\%\(Standard\);p=0\.88p=0\.88\.\[[exp33](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/33_feature_interpretability/README.md)\]
Total interpretable\.Without the auxiliary loss:≈13,100\\approx 13\{,\}100\(Per\-Feature Adaptive Cosine SAE\) vs\.≈2,900\\approx 2\{,\}900\(Standard\)\. With the auxiliary loss \(Table[6](https://arxiv.org/html/2606.15054#A4.T6)\): both≥50\\geq 50k alive\.
Decoder direction overlap\.\>0\.7\>0\.7cosine match:31%31\\%of10001000sampled features\. Strict\-identity overlap is66–17%17\\%\(§[4\.1](https://arxiv.org/html/2606.15054#S4.SS1)\)\.\[[exp56b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/56b_feature_overlap/README.md)\]
### F\.1Reference Counts
Table 13:LLM\-judged interpretability \(1000 stratified features, Sonnet 4\.5\)\.\[[exp33](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/33_feature_interpretability/README.md)\]
### F\.2LLM Interpretability at 500M Tokens
At 500M tokens with the production recipe (matched alive counts via aux-k), LLM-judged describe-then-predict rates show no cosine advantage (Table [14](https://arxiv.org/html/2606.15054#A6.T14)). Standard achieves the highest per-feature rate (25.0%), with per-feature cosine close (24.0%) and global cosine lower (19.0%).
Table 14: LLM-judged interpretability at 500M tokens, 200 stratified features per architecture, Sonnet 4.5 judge. Interpretable iff $\geq 50\%$ prediction accuracy.[[exp53](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/53_llm_interp_500m/README.md)]

The frequency breakdown reveals a crossover: per-feature cosine wins on low-frequency features (38.0% vs. 25.5%) but loses on high-frequency ones (14.0% vs. 24.0%). This is consistent with sparse probing’s finding that cosine discovers rare-concept features; the LLM judge finds these features harder to describe (fewer activating examples to form a pattern from) but they align better with labeled concepts. The describe-then-predict protocol favors features that fire frequently enough for the judge to identify a simple pattern, biasing toward high-frequency features where standard excels. Sparse probing, which uses ground-truth labels, is not subject to this frequency bias.
## Appendix GMechanism: Gradient Equalization and Q4 Reconstruction
This section uses the three cosine variants from §[3](https://arxiv.org/html/2606.15054#S3):*globalaa*\(single learned exponent\),*per\-feature*\(ai=abase\+δia\_\{i\}=a\_\{\\mathrm\{base\}\}\+\\delta\_\{i\}\), and*magnitude\-bypass*\(pinneda=0a\{=\}0with norm restoration; Appendix[B](https://arxiv.org/html/2606.15054#A2)\)\.
Encoder-gradient asymmetry. For an active feature under a fixed BatchTopK/ReLU mask, the inner-product score gradient is $\partial s_i/\partial w_i = x_c$, so the encoder update inherits a factor of $\|x_c\|$. Cosine scoring replaces this with a normalized direction times a scalar $\|x_c\|^{a}$; the trained $a\ll 1$ (Table [4](https://arxiv.org/html/2606.15054#A2.T4)).
Figure 22:Encoder\-gradient ratio Q4 \(high\-norm\) / Q1 \(low\-norm\)\. Standard:35\.3%35\.3\\%of features have Q4/Q1\>2\>2\. Per\-feature cosine:13\.5%13\.5\\%\.\[[exp28](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/28_gradient_analysis/README.md)\]Equalization signature\.Cosine scoring compresses the gradient ratio toward unity\. The median per\-feature Q4/Q1 ratio drops from1\.55×1\.55\\times\(Standard\) to1\.03×1\.03\\times\(per\-feature cosine\)\. Only13\.5%13\.5\\%of cosine features exceed Q4/Q1\>2\>2, versus35\.3%35\.3\\%under standard scoring\. Q1\-specialized features increase from1818to562562\(Fig\.[22](https://arxiv.org/html/2606.15054#A7.F22)\)\.\[[exp28](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/28_gradient_analysis/README.md)\]
Recipe and dictionary\-size context\.On Gemma\-2\-2B \(dsae=9,216d\_\{\\mathrm\{sae\}\}=9\{,\}216, 4×\\timesexpansion, 50M tokens\), the saprmarks recipe’s auxiliary loss eliminates dead features for both architectures, narrowing the cosine advantage to\+3\.4\+3\.4%\. The scaling matrix \(§[E\.3](https://arxiv.org/html/2606.15054#A5.SS3)\) shows that model dimension, not expansion ratio, drives the gap; the headline\+14\.9\+14\.9% emerges at Qwen3\-8B’sdmodel=4096d\_\{\\mathrm\{model\}\}=4096\.\[[exp57c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/57c_gemma_replication/README.md)\]\[[exp57](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/57_scaling_matrix/README.md)\]
Discovery vs. separability. How much of the probing gap comes from features the cosine encoder discovers versus cleaner encoding of shared features? We restrict both 500M dictionaries to $8{,}661$ mutually matched features (activation correlation $\geq 0.7$ and decoder cosine $>0.7$). On matched features alone, top-1 difference shrinks from $+14.87\%$ to $+1.95\%$; top-5 from $+10.56\%$ to $+2.70\%$. The standard SAE’s unmatched features contribute nothing beyond its matched set (full $-$ matched at top-5: $0.7831-0.7837$). The cosine SAE’s unmatched features add $+7.8\%$ ($0.8888-0.8108$).[[exp58b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58b_matched_feature_probing/README.md)]
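The top-5 decomposition can be checked directly from the four accuracies quoted in this paragraph. The sketch below only redoes that arithmetic; the dictionary keys are illustrative labels, and small differences from the quoted percentages reflect rounding of the four-decimal inputs.

```python
# Top-5 sparse-probing accuracies quoted in the text.
full = {"standard": 0.7831, "cosine": 0.8888}     # all features
matched = {"standard": 0.7837, "cosine": 0.8108}  # 8,661 mutually matched features only

# Gap between architectures, on the full dictionaries and on matched features alone.
full_gap = full["cosine"] - full["standard"]           # about 0.106
matched_gap = matched["cosine"] - matched["standard"]  # about 0.027

# What each architecture's unmatched features add beyond its matched set.
unmatched_gain = {name: full[name] - matched[name] for name in full}

# Share of the full gap that survives when restricted to matched features.
matched_share = matched_gap / full_gap
```

Under this accounting the cosine SAE's unmatched features account for most of the top-5 gap, while the standard SAE's unmatched features contribute essentially nothing.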
Unmatched features by norm quartile. Where do each model’s unique features fire? Among strict-unmatched features at L18, $86.3\%$ of Standard activations fall in Q4 vs. $56.9\%$ for per-feature cosine. The mean norm of Standard-unique activations is $9{,}689$ ($48\times$ the sampled token mean). On tokens where per-feature cosine-unique features fire, Standard fires $328$ features/token vs. $122$ for per-feature cosine.[[exp58c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58c_norm_detector_characterization/README.md)]
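Per-token feature counts can differ from $k$ because BatchTopK selects over the whole batch, not per token. The toy sketch below, with made-up cosines and norms, illustrates how a single high-norm token can claim every slot under inner-product scoring while cosine scoring splits them evenly. It is a simplified stand-in for the training code, not the library implementation.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive pre-activations across the whole batch."""
    batch = len(pre_acts)
    flat = [
        (v, t, i)
        for t, row in enumerate(pre_acts)
        for i, v in enumerate(row)
        if v > 0
    ]
    keep = sorted(flat, reverse=True)[: k * batch]
    out = [[0.0] * len(row) for row in pre_acts]
    for v, t, i in keep:
        out[t][i] = v
    return out


def active_per_token(acts):
    """Number of features firing on each token."""
    return [sum(1 for v in row if v > 0) for row in acts]


# Hypothetical data: token 0 is well aligned but low norm, token 1 is poorly aligned but high norm.
cosines = [[0.9, 0.4, 0.1, 0.05], [0.3, 0.25, 0.2, 0.15]]
norms = [1.0, 10.0]
inner = [[c * n for c in row] for row, n in zip(cosines, norms)]

counts_inner = active_per_token(batch_topk(inner, k=2))
counts_cosine = active_per_token(batch_topk(cosines, k=2))
```

Here `counts_inner` gives all four slots to the high-norm token, whereas `counts_cosine` assigns two features to each token.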
Decoder geometry control\.For ten cached TPP probe directions,wprobeWdec⊤w\_\{\\mathrm\{probe\}\}W\_\{\\mathrm\{dec\}\}^\{\\top\}has matched concentration: entropy ratio1\.000011\.00001, effective\-dimension ratio1\.000071\.00007\.\[[exp58a](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/58a_probe_weight_analysis/README.md)\]
Feature steering behavioral equivalence\.To verify that the sparse\-probing gap reflects feature*discovery*, not per\-feature*power*, we directly steer individual features at amplification factors0\.5×0\.5\\times,2×2\\times, and5×5\\timesand measure downstream KL\-divergence\. Across1111top\-activating standard features and99cosine features \(5050prompts each\), mean KL ratios are1\.04×1\.04\\times\(0\.5×0\.5\\times\),1\.00×1\.00\\times\(2×2\\times\), and1\.00×1\.00\\times\(5×5\\times\)—cosine and standard features produce indistinguishable behavioral effects at matched amplification\. This confirms that the\+14\.9\+14\.9% probing gap is not explained by standard features being individually “stronger”; rather, cosine discovers more concept\-aligned directions\.\[[exp55d](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/55d_feature_steering/README.md)\]
Figure 23: Per-quartile FVE (top) and reconstruction-norm ratio $\|\hat{x}\|/\|x\|$ (bottom).[[exp55c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/55c_norm_stratified_reference/README.md)]

Table 15: FVE by activation-norm quartile (Q1 lowest, Q4 highest).[[exp55c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/55c_norm_stratified_reference/README.md)]

Q4 reconstruction. Q1–Q3 FVE is matched; Q4 splits: Standard $-183.5$, global cosine $0.33$, per-feature cosine $0.25$, magnitude-bypass $-0.20$. Reconstruction-norm ratio (output-norm / input-norm) on Q4: Standard $9.5\times$; cosine variants $\approx 1\times$.[[exp55c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/55c_norm_stratified_reference/README.md)]
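A strongly negative FVE is what one would expect when reconstructions overshoot the input norm. The sketch below assumes the usual definition, $\mathrm{FVE}=1-\sum\|x-\hat{x}\|^2/\sum\|x-\bar{x}\|^2$, computed within a set of tokens; whether the paper centers per quartile or globally is not stated here, so treat this as an illustration on toy data, not a reproduction of Table 15.

```python
import math


def vec_norm(v):
    return math.sqrt(sum(c * c for c in v))


def fve(xs, x_hats):
    """Fraction of variance explained: 1 - residual sum of squares / total sum of squares."""
    d = len(xs[0])
    mean = [sum(x[j] for x in xs) / len(xs) for j in range(d)]
    resid = sum((a - b) ** 2 for x, xh in zip(xs, x_hats) for a, b in zip(x, xh))
    total = sum((a - m) ** 2 for x in xs for a, m in zip(x, mean))
    return 1.0 - resid / total


def norm_ratio(xs, x_hats):
    """Mean of ||x_hat|| / ||x|| across tokens."""
    return sum(vec_norm(xh) / vec_norm(x) for x, xh in zip(xs, x_hats)) / len(xs)


# Toy tokens (hypothetical) and a reconstruction with the right direction but 9.5x the norm.
xs = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0], [0.0, -10.0]]
overshoot = [[9.5 * c for c in x] for x in xs]

fve_perfect = fve(xs, xs)
fve_overshoot = fve(xs, overshoot)
ratio_overshoot = norm_ratio(xs, overshoot)
```

Even with perfectly aligned directions, a $9.5\times$ norm overshoot drives FVE far below zero on this toy set; the specific value depends on the data and is not comparable to the $-183.5$ reported above.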
Controls\.Can cosine scoring be applied post\-hoc to a trained Standard SAE? No: FVE drops−18\-18to−33%\-33\\%,L0L\_\{0\}jumps from 80 to 500, and<11%<11\\%of features overlap with from\-scratch cosine \(§[G\.2](https://arxiv.org/html/2606.15054#A7.SS2)\)\.\[[exp29](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/29_posthoc_normalization/README.md)\]\[[exp17](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/17_production_scale/README.md)\]Input normalization without unit\-norm encoder rows kills33%33\\%of alive features\.\[[exp45](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/45_encoder_norm_ablation/README.md)\]Global cosine has higher FVE than Standard at every learning rate tested\.\[[exp30](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/30_lr_sweep/README.md)\]
Dictionary size control\.Giving the standard encoder3×3\\timesmore dictionary slots at matched L0 makes dead rates*worse*, not better:89\.4%89\.4\\%dead \(49k,k=80k\{=\}80\) vs\.77\.4%77\.4\\%\(16k,k=80k\{=\}80\), while a 16k global cosine achieves28\.2%28\.2\\%dead \(Figure[24](https://arxiv.org/html/2606.15054#A7.F24)\)\. The gradient concentration is a property of the scoring function, not dictionary capacity\. With3×3\\timesdictionary*and*3×3\\timesL0 \(k=240k\{=\}240,3×3\\timesparameters,3×3\\timescompute\), the standard SAE can slightly exceed the cosine SAE’s FVE \(0\.7510\.751vs\.0\.7370\.737\), confirming that cosine is∼3×\\sim 3\\timesmore parameter\-efficient\.\[[exp26](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/26_dictionary_size/README.md)\]
Figure 24: Larger dictionary does not fix dead features. Qwen3-8B L27, 50M tokens. (A) Tripling dictionary slots at the same L0 increases the dead rate from 77.4% to 89.4%; the cosine SAE at $1/3$ the dictionary achieves 28.2%. (B) Matching cosine’s FVE requires $3\times$ parameters and $3\times$ L0. The dead-feature problem is a scoring-function pathology, not a capacity limitation.[[exp26](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/26_dictionary_size/README.md)]

Cos$>$inner across architectures. Which variant actually shows a cos$>$inner metric difference? At L18 (community recipe), only magnitude-bypass reaches $69\%$. Global cosine ($a=0.258$), per-feature cosine (mean $a_i=0.076$), and Standard all cluster at 62–63%. Only the architecture that removes magnitude entirely from $z$ separates on this metric.[[exp44](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/44_norm_stratified_fve/README.md)]
Architectures and the cos\-vs\-inner difference\.All cosine variants improve sparse probing: magnitude\-bypass\+11\.6%\+11\.6\\%top\-1, global cosine\+13\.3%\+13\.3\\%, per\-feature cosine\+14\.9%\+14\.9\\%\. Only magnitude\-bypass shows the cos\-vs\-inner metric difference\. The RMS\-geometric\-alignment mechanism is depth\-dependent; the gradient\-equalization signature appears for all cosine variants\. At L9, cos\>\>inner falls below 50%, yet both cosine variants still improve sparse probing\.\[[exp40/42c/44](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]
### G\.1Norm\-Stratified FVE
Per\-quartile values are reported once in Table[15](https://arxiv.org/html/2606.15054#A7.T15)above\. Reconstruction\-norm ratio on Q4 tokens: Standard9\.5×9\.5\\times; cosine variants≈1\.0×\\approx 1\.0\\times\. The reference SAE ofKarvonenet al\.\([2025](https://arxiv.org/html/2606.15054#bib.bib25)\)reproduces the pattern with Q4 FVE=−136=\-136\.\[[exp55c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/55c_norm_stratified_reference/README.md)\]
### G\.2Post\-Hoc Normalization
Cosine scoring applied at inference time to a trained Standard SAE: FVE−18\-18to−33%\-33\\%,L0L\_\{0\}from8080to500500, Jaccard with from\-scratch global cosine10\.6%10\.6\\%\.\[[exp29](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/29_posthoc_normalization/README.md)\]
### G\.3Per\-Feature Gradient Ratio
For each alive feature $i$: average $\|\nabla_{w_i}\mathcal{L}\|$ over tokens in each $\|x_c\|$ quartile, take Q4-to-Q1 ratio, report median across alive features (Fig. [25](https://arxiv.org/html/2606.15054#A7.F25)). At Qwen L9/L18/L27 with 10M tokens: median Q4/Q1 is $1.6$–$2.0\times$ (Standard) vs. $0.8$–$1.0\times$ (global cosine); Q1-specialized features grow $\sim 4\times$ under cosine.[[exp54](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/54_gradient_saprmarks/README.md)]
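The ratio defined above can be sketched in a few lines. The helper names are hypothetical and the quartile assignment is by simple rank, which may differ in detail from the analysis code; the inputs are per-token gradient norms for each feature plus the per-token input norms.

```python
import statistics


def quartile_labels(norms):
    """Assign each token a quartile index 0..3 (Q1..Q4) by the rank of its input norm."""
    order = sorted(range(len(norms)), key=lambda t: norms[t])
    labels = [0] * len(norms)
    for rank, t in enumerate(order):
        labels[t] = min(3, rank * 4 // len(norms))
    return labels


def q4_q1_ratio(grad_norms, labels):
    """Mean gradient norm on Q4 tokens divided by the mean on Q1 tokens, for one feature."""
    q1 = [g for g, q in zip(grad_norms, labels) if q == 0]
    q4 = [g for g, q in zip(grad_norms, labels) if q == 3]
    return statistics.mean(q4) / statistics.mean(q1)


def median_ratio(per_feature_grad_norms, norms):
    """Median Q4/Q1 ratio across features."""
    labels = quartile_labels(norms)
    return statistics.median([q4_q1_ratio(g, labels) for g in per_feature_grad_norms])


# Toy data (hypothetical): 8 tokens, 3 features.
norms = [1, 2, 3, 4, 5, 6, 7, 8]
grads = [
    [1, 2, 3, 4, 5, 6, 7, 8],  # gradient grows with input norm
    [1, 1, 1, 1, 1, 1, 1, 1],  # gradient independent of input norm
    [2, 2, 1, 1, 1, 1, 3, 3],  # mild high-norm preference
]
result = median_ratio(grads, norms)
```

A median near 1 indicates that high-norm and low-norm tokens contribute comparable gradient magnitude to a typical feature.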
Figure 25: Gradient asymmetry exists, but does not explain the probing gap. (A) Median per-feature Q4/Q1 gradient ratio across Qwen layers (10M tokens). (B) Equalizing per-quartile gradients in the standard encoder by reweighting the reconstruction loss ($1/\|x_c\|$; 50M tokens) negligibly shifts top-1. Per-feature ratio definition and full reweighting sweep in text below.

##### Reweighting test.
If gradient asymmetry were the primary cause, equalizing it should close the gap. Reweighting the Standard reconstruction loss equalizes per-quartile gradient contributions by construction, without modifying the score. We test mild reweighting ($1/\|x_c\|$) and strong reweighting ($1/\|x_c\|^{2}$), spanning partial to near-complete cancellation of the encoder-gradient norm factor. At 50M tokens with FVE $\geq 0.97$, per-feature cosine reaches top-1 $0.648$. Standard moves from $0.530$ (no reweighting) to $0.532$ (mild) and $0.536$ (strong). Across $k\in\{1,2,5\}$, reweighting closes only $1.9$–$6.8\%$ of the cosine difference. The forward pass still inflates every pre-activation on Q4 tokens, BatchTopK still selects the same features, and the same norm-conditioned firing pattern reappears.[[exp59](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/59_gradient_equalization/README.md)]
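The two reweighting schemes can be sketched as a per-token weight on the squared reconstruction error. This assumes the weight is applied as $1/\|x_c\|^{p}$ with $p\in\{1,2\}$ and no further normalization, which is not spelled out above; the function name is hypothetical.

```python
def reweighted_loss(sq_errors, norms, power):
    """Mean over tokens of ||x - x_hat||^2 / ||x_c||**power (power=0 is the unweighted loss)."""
    return sum(e / n ** power for e, n in zip(sq_errors, norms)) / len(sq_errors)


# Toy batch (hypothetical): two tokens with the same relative error, norms 1 and 10,
# so the squared error of the second token is 100x larger.
sq_errors = [1.0, 100.0]
norms = [1.0, 10.0]

unweighted = reweighted_loss(sq_errors, norms, power=0)
mild = reweighted_loss(sq_errors, norms, power=1)
strong = reweighted_loss(sq_errors, norms, power=2)
```

Under strong reweighting both tokens contribute equally to the loss in this toy case. The score that feeds BatchTopK is untouched, which is why the selection pattern described above can persist.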
## Appendix HLimitations and Scope
Scope\.All results are on the residual stream; no MLP\-internal, attention\-internal, head\-output, or pre\-norm activations\. Causal diagnostics use greedy decoding and deterministic activation collection\. Multi\-layer / 50M\+ runs cover Qwen3\-8B fully; Gemma\-2\-2B partially; Mistral\-7B after the init fix; Pythia and Falcon at 5M\. No model is\>8\>8B; no QK\-norm or per\-head\-norm models\. Training data: FineWeb \(English\-dominant\)\. SAEBench is English\-only \(europarl is English\-vs\-other classification, not multilingual concept probing\)\. No claim about multilingual residual streams, vision, audio, multimodal, or instruction\-tuned / RLHF models\.
Where the metric difference weakens\.At Qwen L9, cos\>\>inner is below50%50\\%\. Deep LayerNorm: cos\>\>inner drops from100%100\\%to40%40\\%on Pythia\-2\.8B between L8 and L24 \(inner\-to\-KL correlation0\.4410\.441vs\. cos\-to\-KL0\.2710\.271\); Gemma\-2\-2B L20 \(RMSNorm\) is at≈53%\\approx 53\\%\. Within\-norm\-quartile rates on Qwen are4040–70%70\\%\(Simpson’s paradox; the 70–90% rate is partly between\-quartile\)\. The architectural FVE / alive\-feature differences persist across these settings\.\[[exp25](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/25_multimodel_matrix/README.md)\]
Architecture\-specific failure modes\.Without auxiliary loss: Adaptive Cosine SAE retains3\.3×3\.3\\timesalive\-feature ratio over Standard\. With auxiliary loss: both architectures reach∼0%\\sim 0\\%dead\. Per\-Feature Adaptive Cosine SAE:83%83\\%dead at 50M / L27 \(6565k independent scale parameters do not converge under low budget at high norms\)\. Magnitude\-Bypass SAE:4\.3%4\.3\\%persistent dead at 500M \(auxiliary\-loss gradients∼6\\sim 6orders weaker through bounded\[0,1\]\[0,1\]activations; §[B\.2\.3](https://arxiv.org/html/2606.15054#A2.SS2.SSS3)\)\. At 5M the auxiliary loss is inert for Magnitude\-Bypass SAE \(§[4\.3](https://arxiv.org/html/2606.15054#S4.SS3)\)\. Adaptive Cosine SAE is the architecture without these failure modes in our coverage\.
Initialization. Mistral has $\|x\|\approx 6$ vs. Qwen $\approx 400$. $\sqrt{d}$ init on Mistral: $>95\%$ dead. Norm-adaptive init helps at 5M but reduces FVE at 50M+ on Qwen. Default: $\sqrt{d}$ when $\|x\|$ is within $2\times$ of $\sqrt{d}$, else $\log(\operatorname{mean}\|x\|)$ (§[B.2.4](https://arxiv.org/html/2606.15054#A2.SS2.SSS4)).[[exp42](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42_mistral_init/README.md)]
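The default rule can be written as a small selector. This sketch reads "within $2\times$" as the symmetric band $\sqrt{d}/2 \leq \|x\| \leq 2\sqrt{d}$, which is an assumption, and it returns the stated value without specifying which encoder parameter it initializes (that detail is in §B.2.4). The function name and the $d_{\mathrm{model}}=4096$ used for both examples are illustrative.

```python
import math


def default_init(mean_norm, d_model):
    """Pick the init rule from the mean input norm, following the stated default."""
    root_d = math.sqrt(d_model)
    # Assumed reading of "within 2x of sqrt(d)": a symmetric multiplicative band.
    if root_d / 2 <= mean_norm <= 2 * root_d:
        return "sqrt_d", root_d
    return "log_mean_norm", math.log(mean_norm)


# Norms quoted in this appendix: Qwen3-8B L9 is about 58, Mistral about 6.
qwen_l9_rule, qwen_l9_value = default_init(58.0, 4096)
mistral_rule, mistral_value = default_init(6.0, 4096)
```

With $\sqrt{4096}=64$, the Qwen L9 norm falls inside the band and keeps the $\sqrt{d}$ init, while the Mistral norm falls outside it and switches to the norm-adaptive value.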
Evaluation scope\.The 500M Qwen L18 sparse\-probing run is single\-seed\. Cross\-layer and cross\-model SAEBench replications are pending \(§[H\.1](https://arxiv.org/html/2606.15054#A8.SS1)\)\. No feature steering, circuit\-level interventions, or downstream task accuracy\. Sentiment is the only top\-1 reversal in Table[8](https://arxiv.org/html/2606.15054#A4.T8)\(Standard0\.9150\.915vs\. Per\-Feature Adaptive Cosine SAE0\.8800\.880\)\.
Decoder under norm shift\.Under norm\-noise training \(Uniform\(0\.5,2\.0\)\(0\.5,2\.0\)\), cosine FVE drops4\.54\.5–5\.8%5\.8\\%vs\.1\.81\.8–3\.4%3\.4\\%for Standard\.\[[exp31](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/31_norm_injection/README.md)\]The encoder is norm\-invariant; the decoder reconstructsxxand depends on norm\. Where train and eval norms diverge, the headline difference may erode\.*adaptive\_l2*achieves FVE0\.7380\.738; postnorm\-loss achieves SAE→\\toKL0\.3800\.380; postnorm forcesa→−0\.4a\\to\-0\.4\.\[[exp22](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/22_postnorm_scale/README.md)\]\[[exp14](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/14_postnorm_loss/README.md)\]
Architectures discover different features\.Strict\-identity overlap of alive features between Standard and Per\-Feature Adaptive Cosine SAE at L18:<1%<1\\%atn=3n=3–66paired, consistent with the non\-canonical nature of SAE dictionaries\(Leasket al\.,[2025](https://arxiv.org/html/2606.15054#bib.bib29)\)\. Headline aggregate comparisons are across disjoint feature sets\. Inner\-product ablation\(x⋅f\)f\(x\\cdot f\)fscales with‖x‖\\\|x\\\|, biasing the comparison toward Standard; under norm\-invariant ablation, the Standard SAE→\\toKL difference shrinks by70%70\\%at L27\.\[[exp56b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/56b_feature_overlap/README.md)\]
### H\.1Open Gaps
Coverage gaps. Whether Standard’s dead features overlap the cosine-only sparse-probing features is not known. LLM-judged interpretability ran at L27 only; L9 and L18 are untested. No 8B+ LayerNorm model has been run at 50M+ tokens. Cross-architecture coverage stops at 5M tokens for some models.
Production-scale variance. Gradient-equalization measurements are at 5M tokens; the 500M re-measurement is pending.
Architecture-design open questions. Whether Per-Feature Adaptive Cosine SAE or Magnitude-Bypass SAE is the preferred magnitude-stripping architecture is unsettled. The base$+\delta$ parameterization as a fix for Per-Feature Adaptive Cosine SAE’s 50M dead-feature regime has not been validated at scale. No mechanistic account exists for why Per-Feature Adaptive Cosine SAE leads at top-1 while Adaptive Cosine SAE leads at top-5.
Beyond sparse probing. OOD shift and stochastic decoding (high temperature, nucleus) are not tested.
## Appendix I Extended Discussion
Triangulation across architectures. Magnitude-Bypass SAE (+11.6% top-1), Adaptive Cosine SAE (+13.3%), and Per-Feature Adaptive Cosine SAE (+14.9%) all gain at matched FVE (§[4.1](https://arxiv.org/html/2606.15054#S4.SS1)). \[[exp40/42c/44](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\] Fitted $a = 0.258$ at the headline setting; $a_i \in [-0.5, 0.5]$ across 65,536 features (Table [4](https://arxiv.org/html/2606.15054#A2.T4)). \[[exp42c](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/42c_noc_500m/README.md)\]
Mechanism dissociation. The cos-vs-inner score advantage requires fully removing magnitude (Magnitude-Bypass SAE only); the gradient-equalization signature appears for all cosine variants (§[4.3](https://arxiv.org/html/2606.15054#S4.SS3)). Cos $>$ inner is below 50% at L9, yet both cosine probes still improve sparse probing at L9.
Open questions. Can magnitude be retained where informative (e.g. sentiment) without dominating feature detection? Do the dead features of the standard SAE correspond to the cosine features that drive sparse probing? Decoder-cosine overlap between cosine and standard dictionaries is 6–17% at the $>0.95$ threshold (§[4.1](https://arxiv.org/html/2606.15054#S4.SS1), consistent with Leask et al. ([2025](https://arxiv.org/html/2606.15054#bib.bib29))); a stitching or meta-SAE analysis across cosine seeds is needed to distinguish stable-target densification from dictionary resampling. \[[exp56b](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/56b_feature_overlap/README.md)\]
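A decoder-cosine overlap statistic of this kind could be computed along the following lines. This is a sketch under stated assumptions, not the paper's measurement code: the function name `overlap_fraction` is hypothetical, matching uses signed cosine with a best-match rule (the paper may use a different matching convention), and the toy dictionaries are made up for illustration.

```python
import math

def unit(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def overlap_fraction(dict_a, dict_b, threshold=0.95):
    """Fraction of directions in dict_a whose best cosine match in dict_b exceeds threshold."""
    a_unit = [unit(v) for v in dict_a]
    b_unit = [unit(v) for v in dict_b]
    matched = 0
    for a in a_unit:
        # Best signed cosine similarity against every direction in the other dictionary.
        best = max(sum(p * q for p, q in zip(a, b)) for b in b_unit)
        if best > threshold:
            matched += 1
    return matched / len(a_unit)

# Toy decoder dictionaries (rows are feature directions); values are illustrative only.
cosine_dict = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
standard_dict = [[2.0, 0.1, 0.0], [0.0, 1.0, 1.0], [1.0, -1.0, 0.0]]

frac = overlap_fraction(cosine_dict, standard_dict)
```

Here only the first toy direction has a near-duplicate in the other dictionary, so one of four directions counts as shared. At real dictionary sizes the pairwise loop would normally be replaced by a batched matrix product.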
## Appendix J Training Recipe
Training uses the Marks-lab dictionary\_learning library \([https://github\.com/saprmarks/dictionary\_learning](https://github.com/saprmarks/dictionary_learning)\) accessed via sae-bench (Karvonen et al., [2025](https://arxiv.org/html/2606.15054#bib.bib25)). The recipe matches OpenAI’s TopK SAE training (Gao et al., [2024](https://arxiv.org/html/2606.15054#bib.bib12)) except for the BatchTopK selection rule (Bussmann et al., [2024](https://arxiv.org/html/2606.15054#bib.bib15)) and the encoder-score swap (§[B](https://arxiv.org/html/2606.15054#A2)).
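To illustrate how the two departures from the TopK recipe interact, the sketch below combines a BatchTopK selection (keep the k × batch-size largest scores across the whole batch) with a swappable encoder score. The score form $\|x\|^{a} \cos(w, x)$ is an assumption made for illustration: it reduces to cosine at $a = 0$ and to the inner product at $a = 1$ for unit-norm $w$, but the paper's actual parameterization is the one defined in its Appendix B and may differ. Function names are hypothetical and do not correspond to the dictionary\_learning API.

```python
import math

def blended_score(w, x, a):
    """Hypothetical score ||x||**a * cos(w, x); not the paper's confirmed formula.

    a = 0 gives pure cosine; a = 1 with unit-norm w gives the inner product.
    """
    dot = sum(p * q for p, q in zip(w, x))
    w_norm = math.sqrt(sum(p * p for p in w))
    x_norm = math.sqrt(sum(q * q for q in x))
    return (x_norm ** a) * dot / (w_norm * x_norm)

def batch_topk(scores, k):
    """Keep the k * batch_size largest scores over the whole batch; zero the rest."""
    budget = k * len(scores)
    flat = [(s, b, i) for b, row in enumerate(scores) for i, s in enumerate(row)]
    out = [[0.0] * len(row) for row in scores]
    for s, b, i in sorted(flat, reverse=True)[:budget]:
        out[b][i] = max(s, 0.0)  # ReLU: negative scores never activate
    return out

def active_per_token(acts):
    return [sum(1 for s in row if s > 0.0) for row in acts]

W = [[1.0, 0.0], [0.0, 1.0]]    # toy unit-norm encoder directions
X = [[10.0, 1.0], [0.1, 0.9]]   # a high-norm token, then a low-norm token

inner = batch_topk([[blended_score(w, x, 1.0) for w in W] for x in X], k=1)
cosine = batch_topk([[blended_score(w, x, 0.0) for w in W] for x in X], k=1)

inner_counts = active_per_token(inner)
cosine_counts = active_per_token(cosine)
```

In this toy batch the high-norm token takes both available slots under $a = 1$ and the low-norm token gets none, whereas under $a = 0$ each token keeps its best-aligned feature. That is the slot-claiming effect the abstract attributes to inner-product scoring under BatchTopK.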
Table 16: Headline training recipe (Qwen3-8B L18, 500M tokens). All four architectures (Standard, Adaptive Cosine SAE, Per-Feature Adaptive Cosine SAE, Magnitude-Bypass SAE) share these settings. \[[exp40](https://github.com/SilenNaihin/cosine-scored-saes/blob/main/experiments/40_saprmarks_recipe/README.md)\]

Data. Training tokens come from the FineWeb sample-10BT subset (HuggingFaceFW/fineweb), tokenized with each model’s bundled tokenizer. SAEBench probing datasets are at the versions shipped with sae-bench.
## Appendix K Statistical Summary
Table 17: Effect sizes and significance. Sparse probing is single-seed at 500M; FVE, dead-feature, and diagnostic rows have $n = 3$ seed bounds at 50M.