Evidence for feature-specific error correction in LLMs
Summary
This paper provides the first empirical evidence for feature-specific error correction in large language models, showing that residual-stream activations are robust to small perturbations but less robust along candidate feature directions, supporting the theory of computation in superposition.
# Evidence for feature-specific error correction in LLMs
Source: [https://arxiv.org/html/2606.24964](https://arxiv.org/html/2606.24964)
###### Abstract
Understanding the features of large language models (LLMs) is a central goal of interpretability. LLMs are commonly assumed to use superposition to represent more features than they have dimensions. They may not only represent features in superposition but also perform computation in superposition. Theory predicts that computing in superposition requires error correction that privileges feature directions over generic ones, but this prediction has not been tested empirically. We propose an empirical test of error correction in LLMs based on activation perturbations. Perturbing residual-stream activations, we find that they are robust to small perturbations (forming activation plateaus consistent with error correction) but less robust along candidate feature directions ("pure" directions, constructed from contrastive prompt pairs) than along mixtures of two such directions, indicating that the pure directions are privileged. We quantify this privilegedness by modeling the perturbation effect as a function of the $L^{p}$-norm of its decomposition into feature components. For $p=2$ the response is a quadratic form with at most as many nonzero eigenvalues as the residual-stream dimension, which cannot privilege the many feature directions superposition requires. $p>2$ lifts this constraint and is consistent with feature-specific error correction. We find $p>2$ for contrastive, MELBO, and SAE-decoder directions, and $p\approx 2$ for random and PCA directions (controls). These results replicate across Gemma-2-9B, Qwen3-1.7B, Llama-3.1-8B, Mistral-7B-v0.3, Aya-Expanse-8B, and Yi-1.5-9B. We further validate our method on a toy model of error correction with known ground-truth features, recovering $p>2$ for true feature directions, degrading toward $2$ as we rotate away from them.

Code: [https://github.com/FranciscoHS/fsec-paper](https://github.com/FranciscoHS/fsec-paper)
Feature Geometry, Error Correction, Computation in Superposition
## 1 Introduction
Representations in large language models (LLMs) are poorly understood. It is commonly assumed that LLMs make use of superposition (Elhage et al., [2022](https://arxiv.org/html/2606.24964#bib.bib6)) to represent more concepts than they have dimensions available, and potentially to compute in superposition (CiS) (Hänni et al., [2024](https://arxiv.org/html/2606.24964#bib.bib10); Adler & Shavit, [2024](https://arxiv.org/html/2606.24964#bib.bib1); Olah et al., [2025](https://arxiv.org/html/2606.24964#bib.bib19)). However, we have only indirect evidence for superposition, chiefly the success of sparse autoencoders (Cunningham et al., [2023](https://arxiv.org/html/2606.24964#bib.bib4); Gao et al., [2024](https://arxiv.org/html/2606.24964#bib.bib7); Templeton et al., [2026](https://arxiv.org/html/2606.24964#bib.bib24)) at extracting interpretable directions, and only theoretical evidence for CiS (Hänni et al., [2024](https://arxiv.org/html/2606.24964#bib.bib10)).
An empirical prediction of CiS is that neural networks must correct interference noise while preserving feature signal (Hänni et al., [2024](https://arxiv.org/html/2606.24964#bib.bib10)). This requires networks to be less sensitive to perturbations along non-feature directions than along feature directions. We name this property feature-specific error correction (FSEC). While we cannot rule FSEC in or out without ground-truth feature directions, we can still ask whether FSEC-like behaviour occurs for any directions: if the model’s error correction privileges certain directions, those directions are candidate features, and we can detect them by measuring sensitivity, requiring only a generic input to perturb rather than feature-specific labeled data.
We provide the first empirical evidence of FSEC, showing that the robustness of LLM activations to perturbations privileges certain candidate feature directions over others. Concretely, we perturb residual-stream activations at early layers and measure the downstream response as a function of perturbation direction and magnitude. We construct candidate feature directions via contrastive means for a variety of concepts, including languages, programming languages, gender, sentiment, registers, and verb tenses. Across Gemma-2-9B (Team et al., [2024](https://arxiv.org/html/2606.24964#bib.bib23)), Qwen3-1.7B (Yang et al., [2025](https://arxiv.org/html/2606.24964#bib.bib27)), Llama-3.1-8B (Grattafiori et al., [2024](https://arxiv.org/html/2606.24964#bib.bib9)), Mistral-7B-v0.3 (Jiang et al., [2023](https://arxiv.org/html/2606.24964#bib.bib14)), Aya-Expanse-8B (Dang et al., [2024](https://arxiv.org/html/2606.24964#bib.bib5)), and Yi-1.5-9B (Young et al., [2024](https://arxiv.org/html/2606.24964#bib.bib28)), we find that contrastive feature directions elicit a stronger downstream response than mixtures thereof, consistent with FSEC that privileges feature directions while suppressing interference along non-feature directions. We formalize this by modeling the LLMs’ response to perturbation as an $L^{p}$ norm of the perturbation’s decomposition into candidate feature directions. The $p=2$ case reduces to a basis-invariant quadratic form, meaning no choice of basis is privileged; $p>2$ breaks this invariance, indicating that the candidate pure feature directions are more sensitive than their mixtures, as predicted by FSEC. We measure $p\approx 2.3$ for contrastive directions across models. MELBO (Mack & Turner, [2024b](https://arxiv.org/html/2606.24964#bib.bib17)) and SAE directions, which also aim to recover model features, likewise yield $p>2$, albeit with smaller values. PCA and random directions do not, consistent with the interpretation that $p>2$ reflects alignment with the model’s features. We validate this methodology in a toy model of error correction with ground-truth features (Section [5](https://arxiv.org/html/2606.24964#S5)), confirming that $p$ degrades toward $2$ as directions are misaligned with the true features.
Our contributions are:
1. We propose feature-specific error correction (FSEC) as a test of computation in superposition: we model the LLM’s response to a perturbation as a function of the $L^{p}$-norm of the perturbation’s decomposition into candidate feature directions, and FSEC predicts $p>2$.
2. We find evidence of FSEC for three types of candidate feature directions (contrastive, SAE-decoder, and MELBO), each with $p>2$, and show the contrastive result replicates across six LLMs from different families. PCA and random directions yield $p\approx 2$, consistent with them not being privileged.
3. We also show that FSEC occurs in a toy model of error correction (Vaintrob, [2026](https://arxiv.org/html/2606.24964#bib.bib26)).
## 2 Related Work
Activation plateaus. Prior work has established that in-distribution activations of LLMs are resistant to perturbations (Heimersheim & Mendel, [2024](https://arxiv.org/html/2606.24964#bib.bib11); Janiak et al., [2024](https://arxiv.org/html/2606.24964#bib.bib13); Shinkle & Heimersheim, [2025](https://arxiv.org/html/2606.24964#bib.bib22)), a phenomenon known as activation plateaus. We introduce a novel measurement of the activation plateau boundary geometry, and identify its connection to FSEC.
Direction-dependent sensitivity for feature finding. Prior work has exploited the fact that LLMs have direction-dependent sensitivity to perform unsupervised optimizations for directions maximizing downstream response, resulting in interpretable steering vectors, including MELBO (Mack & Turner, [2024b](https://arxiv.org/html/2606.24964#bib.bib17), [a](https://arxiv.org/html/2606.24964#bib.bib16)). We make use of the same phenomenon to empirically probe error correction, and apply our analysis to directions obtained in this way (Section [4.3](https://arxiv.org/html/2606.24964#S4.SS3)). Unlike this line of work, our analysis adds the novel study of the sensitivity geometry and its connection to error correction.
Error correction for computation in superposition. Hänni et al. ([2024](https://arxiv.org/html/2606.24964#bib.bib10)) argue theoretically that computation in superposition requires error correction that privileges feature directions. We provide empirical evidence in favor of superposition and error correction occurring in LLMs.
## 3 Methodology
We probe error correction in LLMs by perturbing residual stream activations and measuring the downstream response. In-distribution activations are robust to small perturbations, a phenomenon known as activation plateaus (Heimersheim & Mendel, [2024](https://arxiv.org/html/2606.24964#bib.bib11); Shinkle & Heimersheim, [2025](https://arxiv.org/html/2606.24964#bib.bib22)). This robustness is direction-dependent, i.e., the model is more sensitive to perturbations along some directions than others. Two lines of evidence suggest feature directions in particular are privileged: empirically, prior work recovers interpretable directions by optimizing for sensitivity, the inverse of robustness (Mack & Turner, [2024b](https://arxiv.org/html/2606.24964#bib.bib17), [a](https://arxiv.org/html/2606.24964#bib.bib16)); and theoretically, FSEC predicts that feature directions are privileged. We test this by comparing the downstream response along candidate feature directions to the response along non-feature baselines.
In all experiments, we perturb the residual stream at an early layer $\ell$ (default $\ell=2$) and measure the downstream response at the second-to-last layer, maximizing the number of intervening layers. This is because activation plateaus are known to be more pronounced the greater the distance between perturbation and measurement (Shinkle & Heimersheim, [2025](https://arxiv.org/html/2606.24964#bib.bib22)). We avoid the last residual stream layer, which is known to behave atypically. The downstream response is computed by patching the perturbed activation $\mathbf{a}(\alpha)$ back into the model (Meng et al., [2022](https://arxiv.org/html/2606.24964#bib.bib18); Heimersheim & Nanda, [2024](https://arxiv.org/html/2606.24964#bib.bib12)), performing a forward pass, and taking the $L^{2}$ distance between the perturbed and unperturbed residual streams at the measurement layer. We show in Section [4.3](https://arxiv.org/html/2606.24964#S4.SS3) that our results are robust to varying both the perturbation and measurement layers, and hold also when measuring cosine distance or KL-divergence in the logits.
To distinguish directional effects from those of magnitude, we follow prior perturbation analyses of activation plateaus (Heimersheim & Mendel, [2024](https://arxiv.org/html/2606.24964#bib.bib11)) and perturb by rotating the activation vector $\mathbf{a}$ towards the perturbation direction $\mathbf{d}$ while keeping the activation’s norm constant. We refer to this as a norm-matched perturbation. Concretely, a perturbation of angle $\alpha$ of $\mathbf{a}$ toward $\mathbf{d}$ is:

$$
\mathbf{a}(\alpha)=\cos(\alpha)\,\mathbf{a}+\sin(\alpha)\,\lVert\mathbf{a}\rVert\,\frac{\mathbf{d}_{\perp}}{\lVert\mathbf{d}_{\perp}\rVert},\tag{1}
$$

where $\mathbf{d}_{\perp}$ is the component of $\mathbf{d}$ that is orthogonal to $\mathbf{a}$. We typically perturb at the last token position, but show in Section [4.3](https://arxiv.org/html/2606.24964#S4.SS3) that our results are robust to this choice.
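The rotation in Equation 1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `norm_matched_perturbation` is hypothetical, and activations are treated as plain 1-D arrays rather than hooked model tensors.

```python
import numpy as np

def norm_matched_perturbation(a, d, alpha):
    """Rotate activation `a` by `alpha` radians toward `d`, keeping its norm (Eq. 1)."""
    a = np.asarray(a, dtype=float)
    d = np.asarray(d, dtype=float)
    a_norm = np.linalg.norm(a)
    a_hat = a / a_norm
    # Component of d orthogonal to a.
    d_perp = d - np.dot(d, a_hat) * a_hat
    d_perp_norm = np.linalg.norm(d_perp)
    if d_perp_norm < 1e-12:
        raise ValueError("d is numerically parallel to a; no orthogonal component")
    return np.cos(alpha) * a + np.sin(alpha) * a_norm * d_perp / d_perp_norm
```

Because the two summands are orthogonal and have squared norms $\cos^{2}\alpha\,\lVert\mathbf{a}\rVert^{2}$ and $\sin^{2}\alpha\,\lVert\mathbf{a}\rVert^{2}$, the output has the same norm as the input and makes an angle of exactly $\alpha$ with it.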
We quantify the model’s sensitivity along a given direction as the plateau-breaking angle, the smallest perturbation angle for which the downstream response exceeds a threshold $T$. We set $T$ per direction pair, at a level that is guaranteed to be crossed by both single-axis sweeps (see footnote 1). We aggregate the downstream response across a fixed set of $N=30$ inputs (anchors) $\{\mathbf{x}_{n}\}_{n=1}^{N}$, the last-token residual-stream activations of 5-token FineWeb prompts (Penedo et al., [2024](https://arxiv.org/html/2606.24964#bib.bib21)), by taking the median over anchors. Writing $L^{2}(\mathbf{x}_{n},\alpha;\mathbf{d})$ for the response of anchor $\mathbf{x}_{n}$ when perturbed by angle $\alpha$ along direction $\mathbf{d}$, the single-axis response curve is

$$
L^{2}(\alpha;\mathbf{d})=\operatorname*{median}_{n=1,\dots,N}L^{2}(\mathbf{x}_{n},\alpha;\mathbf{d}).\tag{2}
$$

$\max_{\alpha}L^{2}(\alpha;\mathbf{d}_{i})$ is then the largest median response attained when sweeping along $\mathbf{d}_{i}$ alone. For a pair $(\mathbf{d}_{1},\mathbf{d}_{2})$ we set the threshold

$$
T=f\cdot\min\!\big(\max_{\alpha}L^{2}(\alpha;\mathbf{d}_{1}),\;\max_{\alpha}L^{2}(\alpha;\mathbf{d}_{2})\big),\tag{3}
$$

a fraction $f$ of the smaller of the two single-axis maxima. Taking the smaller maximum ensures both axes reach $T$, so the single-axis plateau-breaking angles $\alpha_{1},\alpha_{2}$ that calibrate the fit are always defined. Throughout we use $f=0.5$. We show in Section [4.3](https://arxiv.org/html/2606.24964#S4.SS3) that our results are robust to varying $T$ within half-to-double its nominal value.

Footnote 1: We initially tried a single global threshold derived from random directions (a fraction $f$ of the median, over a set of isotropic random unit directions, of the single-axis plateau height each one reaches) but abandoned it. Direction sensitivities span a wide range (up to a factor of $\sim 70$ between the least and most responsive directions for the KL-divergence response metric), so a global threshold is either never reached by the least responsive directions or crossed only at very small angles by the most responsive ones, leaving most superellipse fits ill-defined. Setting $T$ per pair from each pair’s own single-axis maxima guarantees both axes cross and resolves this. We record this here for ease of reproducibility.
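The thresholding procedure of Equations 2 and 3 can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the response curves are assumed to be already measured on a discrete grid of angles, and the plateau-breaking angle is taken as the first grid point above $T$ without interpolation (the text does not say whether the authors interpolate).

```python
import numpy as np

def median_response(per_anchor):
    """Eq. 2: median over anchors; `per_anchor` has shape (n_anchors, n_angles)."""
    return np.median(np.asarray(per_anchor, dtype=float), axis=0)

def pair_threshold(curve_1, curve_2, f=0.5):
    """Eq. 3: fraction f of the smaller of the two single-axis maxima."""
    return f * min(np.max(curve_1), np.max(curve_2))

def plateau_breaking_angle(alphas, curve, threshold):
    """Smallest swept angle whose response exceeds the threshold, or None."""
    above = np.flatnonzero(np.asarray(curve) > threshold)
    if above.size == 0:
        return None  # the sweep never crosses the threshold
    return float(np.asarray(alphas)[above[0]])

# Toy usage with synthetic linear response curves (not real model data).
alphas = np.arange(0, 10) * 0.1
curve_1, curve_2 = 10.0 * alphas, 4.0 * alphas
T = pair_threshold(curve_1, curve_2)
alpha_1 = plateau_breaking_angle(alphas, curve_1, T)
alpha_2 = plateau_breaking_angle(alphas, curve_2, T)
```

Using the smaller of the two maxima is what makes both `alpha_1` and `alpha_2` defined in this toy case; a threshold above the weaker curve's maximum would return `None` for it.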
We perform perturbations along six types of directions: contrastive (Panickssery et al., [2023](https://arxiv.org/html/2606.24964#bib.bib20); Turner et al., [2023](https://arxiv.org/html/2606.24964#bib.bib25)), MELBO (Mack & Turner, [2024b](https://arxiv.org/html/2606.24964#bib.bib17)), SAE latents (Lieberum et al., [2024](https://arxiv.org/html/2606.24964#bib.bib15)), PCA directions, random, and random-difference directions. The first three are all candidate feature directions, while PCA, random, and random-difference directions function as non-feature baselines. PCA directions are computed by performing PCA on the residual-stream activations arising from a randomly selected sample of 10000 5-token-long FineWeb inputs (Penedo et al., [2024](https://arxiv.org/html/2606.24964#bib.bib21)). Random directions are sampled isotropically on the unit sphere, $\hat{\mathbf{d}}_{\text{rand}}\sim\mathrm{Unif}(\mathcal{S}^{d-1})$.
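The two simplest baselines can be sketched with NumPy alone. The function names below are hypothetical, and two details are assumptions rather than facts from the text: activations are mean-centred before PCA, and PCA is computed via an SVD of the activation matrix. Normalising isotropic Gaussian samples is a standard way to draw uniformly from the unit sphere.

```python
import numpy as np

def random_unit_directions(n, dim, seed=0):
    """Draw n directions uniformly on the unit sphere in R^dim."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n, dim))  # isotropic Gaussian samples
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def top_pca_directions(acts, k):
    """Top-k principal directions (rows) of activations shaped (n_samples, dim)."""
    x = np.asarray(acts, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)  # assumption: centre before PCA
    # Rows of vt are orthonormal and ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:k]
```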
Contrastive directions are constructed as the difference of mean activations between two sets of $P\geq 30$ matched prompt pairs $\mathcal{P}^{+},\mathcal{P}^{-}$ differing along a single concept (e.g., for gender, “He ran home” $\in\mathcal{P}^{+}$ is paired with “She ran home” $\in\mathcal{P}^{-}$) (Panickssery et al., [2023](https://arxiv.org/html/2606.24964#bib.bib20); Turner et al., [2023](https://arxiv.org/html/2606.24964#bib.bib25)). The prompts are LLM generated and human verified, and are available in our code release. Letting $\mathbf{a}(s)\in\mathbb{R}^{d}$ denote the activation at the perturbation layer $\ell$ for prompt $s$,

$$
\mathbf{d}_{\text{contrast}}=\frac{1}{P}\sum_{i=1}^{P}\big(\mathbf{a}(s_{i}^{+})-\mathbf{a}(s_{i}^{-})\big),\tag{4}
$$

$$
\hat{\mathbf{d}}_{\text{contrast}}=\mathbf{d}_{\text{contrast}}/\lVert\mathbf{d}_{\text{contrast}}\rVert_{2}.\tag{5}
$$
Random-difference directions apply this same construction to randomly paired activations rather than concept-matched ones, giving a control that is matched to contrastive directions in everything except semantic content. We average $P=30$ differences between pairs $(s_{i},s_{i}^{\prime})$ drawn at random from the same 10000-input FineWeb sample used for PCA, and normalize:

$$
\mathbf{d}_{\text{rand-diff}}=\frac{1}{P}\sum_{i=1}^{P}\big(\mathbf{a}(s_{i})-\mathbf{a}(s_{i}^{\prime})\big),\tag{6}
$$

$$
\hat{\mathbf{d}}_{\text{rand-diff}}=\mathbf{d}_{\text{rand-diff}}/\lVert\mathbf{d}_{\text{rand-diff}}\rVert_{2}.\tag{7}
$$

Unlike contrastive directions, the paired prompts share no concept, so we do not expect them to be features. Unlike isotropic random directions, $\hat{\mathbf{d}}_{\text{rand-diff}}$ consists of differences of activations, and hence inherits their covariance. This makes it our most stringent non-feature baseline: it differs from a contrastive direction only in the absence of concept-coherent pairing. We construct 40 such directions.
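Equations 4 to 7 share one construction (mean of paired differences, then normalisation), so a single helper covers both the contrastive and the random-difference case. The sketch below is illustrative: the function names are hypothetical, activations are assumed to be precomputed arrays of shape `(P, dim)`, and drawing the random pairs without replacement is an assumption not stated in the text.

```python
import numpy as np

def mean_difference_direction(acts_a, acts_b):
    """Unit-normalised mean of paired differences (Eqs. 4-5 and 6-7)."""
    diff = np.asarray(acts_a, dtype=float) - np.asarray(acts_b, dtype=float)
    d = diff.mean(axis=0)
    return d / np.linalg.norm(d)

def random_difference_direction(pool, n_pairs=30, seed=0):
    """Random-difference control: pair activations from `pool` at random."""
    pool = np.asarray(pool, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: 2 * n_pairs distinct inputs, sampled without replacement.
    idx = rng.choice(len(pool), size=2 * n_pairs, replace=False)
    return mean_difference_direction(pool[idx[:n_pairs]], pool[idx[n_pairs:]])
```

For a contrastive direction, `acts_a` and `acts_b` would hold the activations of the concept-matched prompts $s_{i}^{+}$ and $s_{i}^{-}$ in the same order.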
MELBO directions (Mack & Turner, [2024b](https://arxiv.org/html/2606.24964#bib.bib17)) are constructed by optimising a unit-norm direction $\hat{\mathbf{d}}$ at the perturbation layer to maximise the $L^{2}$ distance between perturbed and unperturbed activations at a downstream layer. Unlike contrastive directions, this requires no labelled prompt pairs: the directions are found unsupervised, and have been shown to produce interpretable steering when applied at sufficient magnitude.
SAE latents are decoder columns of a sparse autoencoder trained on the model’s residual stream activations (Bricken et al., [2023](https://arxiv.org/html/2606.24964#bib.bib2); Cunningham et al., [2023](https://arxiv.org/html/2606.24964#bib.bib4)). Each column is a candidate feature direction under the SAE’s decomposition. We select the 33 most active SAE latents in the same sample of 10000 FineWeb inputs used for PCA directions. We do this for Gemma-2-9B only and use Gemma Scope (Lieberum et al., [2024](https://arxiv.org/html/2606.24964#bib.bib15)) (the width-16k residual-stream SAE at layer 2).
We also perform perturbations toward combinations of two directions $\mathbf{d}_{1},\mathbf{d}_{2}$. We first restrict them to the tangent space at $\hat{\mathbf{a}}=\mathbf{a}/\lVert\mathbf{a}\rVert$ via the projector $I-\hat{\mathbf{a}}\hat{\mathbf{a}}^{\top}$, and orthonormalize (Gram–Schmidt) to obtain unit vectors $\hat{\mathbf{d}}_{1}^{\perp},\hat{\mathbf{d}}_{2}^{\perp}$ satisfying $\hat{\mathbf{a}}\cdot\hat{\mathbf{d}}_{i}^{\perp}=0$ and $\hat{\mathbf{d}}_{1}^{\perp}\cdot\hat{\mathbf{d}}_{2}^{\perp}=0$. We then form the unit tangent direction

$$
\hat{\mathbf{d}}(\varphi)=\cos\varphi\,\hat{\mathbf{d}}_{1}^{\perp}+\sin\varphi\,\hat{\mathbf{d}}_{2}^{\perp},\qquad\varphi\in[0,\pi/2],\tag{8}
$$

which interpolates between $\hat{\mathbf{d}}_{1}^{\perp}$ at $\varphi=0$ and $\hat{\mathbf{d}}_{2}^{\perp}$ at $\varphi=\pi/2$. Perturbing $\mathbf{a}$ by angle $\alpha$ toward $\hat{\mathbf{d}}(\varphi)$ then proceeds exactly as in the single-direction case (Equation [1](https://arxiv.org/html/2606.24964#S3.E1)),

$$
\mathbf{a}(\alpha,\varphi)=\cos\alpha\,\mathbf{a}+\sin\alpha\,\lVert\mathbf{a}\rVert\,\hat{\mathbf{d}}(\varphi).\tag{9}
$$
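The two-direction construction of Equations 8 and 9 can be sketched as below. The function names are hypothetical and the code operates on plain NumPy vectors; it assumes neither direction is parallel to $\mathbf{a}$ and that the two directions are not parallel to each other after projection.

```python
import numpy as np

def tangent_basis(a, d1, d2):
    """Project d1, d2 onto the tangent space at a, then Gram-Schmidt them."""
    a_hat = np.asarray(a, dtype=float) / np.linalg.norm(a)

    def project(d):
        d = np.asarray(d, dtype=float)
        return d - np.dot(d, a_hat) * a_hat  # apply (I - a_hat a_hat^T)

    u1 = project(d1)
    u1 = u1 / np.linalg.norm(u1)
    u2 = project(d2)
    u2 = u2 - np.dot(u2, u1) * u1  # remove the component along u1
    u2 = u2 / np.linalg.norm(u2)
    return u1, u2

def mixed_perturbation(a, d1, d2, alpha, phi):
    """Eq. 9: rotate a by alpha toward the mixture direction d_hat(phi) of Eq. 8."""
    a = np.asarray(a, dtype=float)
    u1, u2 = tangent_basis(a, d1, d2)
    d_phi = np.cos(phi) * u1 + np.sin(phi) * u2
    return np.cos(alpha) * a + np.sin(alpha) * np.linalg.norm(a) * d_phi
```

Note that Gram-Schmidt leaves the first direction untouched and modifies only the second, which is why the analysis later restricts itself to pairs with small mutual overlap.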
We observe behavior (see Section [4](https://arxiv.org/html/2606.24964#S4)) that appears consistent with the downstream response depending on the perturbation’s projections onto privileged directions, raised to a common power $p$. To formalize this, we model the downstream response $R$ to a perturbation vector $\mathbf{v}$ as

$$
R(\mathbf{v})=F\!\left(\sum_{i}w_{i}\,\bigl|\hat{\mathbf{f}}_{i}^{\top}\mathbf{v}\bigr|^{p}\right),\tag{10}
$$

where $\{\hat{\mathbf{f}}_{i}\}$ are unit directions, the weights $w_{i}\geq 0$ are free parameters capturing each direction’s sensitivity, and $F$ is a scalar function.
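As a short worked step, it may help to see why $p=2$ is the case in which no set of directions can be privileged. Setting $p=2$ in Equation 10, the argument of $F$ becomes

$$
\sum_{i}w_{i}\,\bigl(\hat{\mathbf{f}}_{i}^{\top}\mathbf{v}\bigr)^{2}
=\mathbf{v}^{\top}\Bigl(\sum_{i}w_{i}\,\hat{\mathbf{f}}_{i}\hat{\mathbf{f}}_{i}^{\top}\Bigr)\mathbf{v}
=\mathbf{v}^{\top}M\,\mathbf{v},
\qquad M=\sum_{i}w_{i}\,\hat{\mathbf{f}}_{i}\hat{\mathbf{f}}_{i}^{\top}.
$$

Here $M$ is a symmetric positive semidefinite $d\times d$ matrix (the symbol $M$ is introduced only for this illustration), so it has at most $d$ nonzero eigenvalues however many directions $\hat{\mathbf{f}}_{i}$ enter the sum. The response then depends on $\mathbf{v}$ only through this quadratic form, and the individual $\hat{\mathbf{f}}_{i}$ cannot be recovered from it. For $p\neq 2$ the sum does not reduce to a quadratic form in this way.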
When we perturb along two directions simultaneously and vary their relative weighting via the mixing angle $\varphi$, the plateau-breaking angle becomes a function of $\varphi$; plotted in suitable coordinates, this function traces out a superellipse (see Figure [2](https://arxiv.org/html/2606.24964#S4.F2)). Taking $\hat{\mathbf{f}}_{1}=\hat{\mathbf{d}}_{1}^{\perp}$ and $\hat{\mathbf{f}}_{2}=\hat{\mathbf{d}}_{2}^{\perp}$, the inner sum collapses to two terms provided the remaining $\hat{\mathbf{f}}_{j}$ contribute negligibly along $\hat{\mathbf{d}}(\varphi)$. This is expected under superposition, where features are packed nearly orthogonally (Elhage et al., [2022](https://arxiv.org/html/2606.24964#bib.bib6)), so the projections of $\hat{\mathbf{d}}(\varphi)$ onto other directions $\hat{\mathbf{f}}_{j}$ are small; the quality of the superellipse fits in Section [4](https://arxiv.org/html/2606.24964#S4) supports the approximation. For the norm-matched perturbation of Equation [9](https://arxiv.org/html/2606.24964#S3.E9), the displacement is $\mathbf{v}=\mathbf{a}(\alpha,\varphi)-\mathbf{a}=\sin\alpha\,\lVert\mathbf{a}\rVert\,\hat{\mathbf{d}}(\varphi)-(1-\cos\alpha)\,\mathbf{a}$. The term $-(1-\cos\alpha)\,\mathbf{a}$ lies along $\mathbf{a}$ and is therefore orthogonal to each $\hat{\mathbf{f}}_{i}=\hat{\mathbf{d}}_{i}^{\perp}$, so it drops out of every projection, $\hat{\mathbf{f}}_{i}^{\top}\mathbf{v}=\sin\alpha\,\lVert\mathbf{a}\rVert\,(\hat{\mathbf{f}}_{i}^{\top}\hat{\mathbf{d}}(\varphi))$; only the tangential component contributes, so

$$
R=F\!\left(\sin^{p}\alpha\,\lVert\mathbf{a}\rVert^{p}\bigl[w_{1}\cos^{p}\varphi+w_{2}\sin^{p}\varphi\bigr]\right).\tag{11}
$$

The plateau breaks when $F$’s argument crosses a fixed level $\xi^{*}$ (the value of the argument at which the response reaches the threshold $T$), giving the implicit equation for the plateau-breaking angle $\alpha(\varphi)$:

$$
\sin^{p}\alpha(\varphi)\,\lVert\mathbf{a}\rVert^{p}\bigl[w_{1}\cos^{p}\varphi+w_{2}\sin^{p}\varphi\bigr]=\xi^{*}.\tag{12}
$$

The single-axis sweeps (at $\varphi=0$ and $\varphi=\pi/2$) calibrate the weights at this same threshold: $\varphi=0$ gives $w_{1}\,\lVert\mathbf{a}\rVert^{p}=\xi^{*}/\sin^{p}\alpha_{1}$, and $\varphi=\pi/2$ gives $w_{2}\,\lVert\mathbf{a}\rVert^{p}=\xi^{*}/\sin^{p}\alpha_{2}$, where $\alpha_{1},\alpha_{2}$ are the plateau-breaking angles measured along $\hat{\mathbf{d}}_{1}^{\perp}$ and $\hat{\mathbf{d}}_{2}^{\perp}$ individually. We calibrate on the single-axis sweeps because each isolates one weight; this is a choice of convenience rather than a requirement. Substituting and dividing through by $\xi^{*}$ eliminates the unknown threshold and the activation norm, yielding

$$
\left(\frac{\sin\alpha(\varphi)\cos\varphi}{\sin\alpha_{1}}\right)^{\!p}+\left(\frac{\sin\alpha(\varphi)\sin\varphi}{\sin\alpha_{2}}\right)^{\!p}=1.\tag{13}
$$

This means that the plateau-breaking angles define a superellipse of exponent $p$ in the normalised coordinates $(\sin\alpha(\varphi)\,\cos\varphi/\sin\alpha_{1},\;\sin\alpha(\varphi)\,\sin\varphi/\sin\alpha_{2})$. We estimate $p$ by minimizing the squared residual of this equation over the measured $(\sin\alpha(\varphi)\,\cos\varphi,\;\sin\alpha(\varphi)\,\sin\varphi)$ points.
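The fit of Equation 13 can be sketched as a one-dimensional search over $p$. The text specifies only that the squared residual is minimised; the dense grid search, the search range $[1, 6]$, and the function names below are assumptions made for this illustration, and a continuous optimiser could be substituted.

```python
import numpy as np

def superellipse_loss(p, x, y):
    """Sum of squared residuals of x^p + y^p = 1 (Eq. 13, normalised coordinates)."""
    return float(np.sum((x**p + y**p - 1.0) ** 2))

def fit_exponent(alpha_phi, phi, alpha_1, alpha_2, p_grid=None):
    """Estimate p from plateau-breaking angles alpha(phi) and single-axis angles."""
    if p_grid is None:
        p_grid = np.linspace(1.0, 6.0, 5001)  # assumed search range, step 0.001
    alpha_phi = np.asarray(alpha_phi, dtype=float)
    phi = np.asarray(phi, dtype=float)
    # Normalise each axis by its own single-axis plateau-breaking angle.
    x = np.clip(np.sin(alpha_phi) * np.cos(phi) / np.sin(alpha_1), 0.0, None)
    y = np.clip(np.sin(alpha_phi) * np.sin(phi) / np.sin(alpha_2), 0.0, None)
    losses = [superellipse_loss(p, x, y) for p in p_grid]
    return float(p_grid[int(np.argmin(losses))])

# Synthetic check: generate angles from a known exponent and recover it.
p_true, a1, a2 = 2.4, 0.3, 0.5
phi = np.linspace(0.0, np.pi / 2, 9)
sin_alpha = ((np.cos(phi) / np.sin(a1)) ** p_true
             + (np.sin(phi) / np.sin(a2)) ** p_true) ** (-1.0 / p_true)
p_fit = fit_exponent(np.arcsin(sin_alpha), phi, a1, a2)
```

The synthetic data here are noise-free, so this only checks that the fit inverts Equation 13; it says nothing about fit quality on real measurements.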
## 4 Results
We begin by illustrating our measurement procedure on a concrete example\.
Figure 1: Measuring plateau-breaking angles. Downstream response as a function of perturbation angle for two contrastive directions (Wealth, Gender) and an equal combination of both, at Gemma-2-9B layer 2, illustrating how plateau-breaking angles are extracted. The plateau-breaking angle is the angle at which the downstream $L^{2}$ distance first exceeds the threshold (here $T\approx 139$). The grey dashed “Random” curve shows the median response across perturbations along 10 isotropic random unit directions. All curves are medians over 30 FineWeb anchor prompts; we omit uncertainty bands because the absolute $L^{2}$ scale varies substantially from anchor to anchor in a way that is shared across all curves and largely cancels in within-anchor comparisons. The within-anchor gap between each feature direction and the random baseline is nonetheless robust: at every angle, the feature curve exceeds the random baseline for most anchors ($\geq 63\%$), and the median feature-minus-random difference is significantly positive (its 95% bootstrap confidence interval over anchors excludes zero).

Figure 2: Iso-plateau boundary. Plateau-breaking angles for the Wealth $\times$ Gender pair at Gemma-2-9B layer 2 (per-pair threshold $T\approx 139$). The superellipse exponent is fit in the normalised coordinates $(\sin\alpha(\varphi)\cos\varphi/\sin\alpha_{1},\;\sin\alpha(\varphi)\sin\varphi/\sin\alpha_{2})$ of Section [3](https://arxiv.org/html/2606.24964#S3). The boundary is well fit by a superellipse of exponent $p_{\mathrm{fit}}=2.40$ (fit residual $1.2\%$); $p>2$ indicates these directions are privileged.

Figure [1](https://arxiv.org/html/2606.24964#S4.F1) shows the downstream $L^{2}$ response in Gemma-2-9B when perturbing along the Wealth contrastive direction, the Gender contrastive direction, and an equal combination of the two. The vertical dashed lines mark the plateau-breaking angle for each of the directions. Repeating this measurement across a range of mixing angles $\varphi$ yields a plateau-breaking angle for each, which we plot in Figure [2](https://arxiv.org/html/2606.24964#S4.F2). The resulting boundary is well fit by a superellipse with exponent $p_{\mathrm{fit}}=2.40$.
To verify that combining contrastive directions in this way is meaningful, we show in Figure [3](https://arxiv.org/html/2606.24964#S4.F3) that steering along these combinations produces interpretable compositional behaviour changes.
Figure 3: Compositional steering at Gemma-2-9B layer 2. Sample completions for the prompt “The other day I met someone who” under no steering, +Poverty alone, +Female alone, and the Poverty+Female composite. Steering uses the contrastive Wealth and Gender directions (Appendix [A](https://arxiv.org/html/2606.24964#A1)); each row shows a single pole: the low-wealth (poverty) pole of Wealth and the feminine pole of Gender. Highlighted spans: orange for poverty and blue for feminine words.

Concretely, steering along the Wealth direction (toward its low-wealth pole) produces poverty-themed completions from a neutral input, steering along the Gender direction produces feminine-coded completions, and steering along a combination of the two produces completions that are both poverty-themed and feminine-coded. This supports the reasonableness of the combination-perturbation setup, though we note that interpretable steering alone does not establish that a direction corresponds to a model feature.
### 4.1 Superellipse exponents across direction types
We now systematically characterise these patterns across directions, direction types, and models. The superellipse exponent $p$ has a simple geometric meaning. Perturbing along a single direction breaks the plateau at some angle; perturbing along a mixture of two directions may break it sooner, later, or at the same point, and which of these occurs is exactly what $p$ records. If the model responds only to the overall size of a perturbation (its $L^{2}$ norm), then splitting it evenly between two directions is no different from concentrating it on one, and the plateau-breaking angles trace an ellipse ($p=2$). If instead the model is specially sensitive to the individual directions, so that a perturbation matters only insofar as it aligns with one of them, then an even mixture, which aligns only partially with each, must be pushed further before the plateau breaks. The boundary then bulges outward, reaching a square ($p\to\infty$) in the limit where each direction has a fully independent threshold and the response fires only as the projection onto either one crosses it. The exponent $p>2$ measures how far toward this “independent thresholds” regime the model sits. Each axis is normalised by its own plateau-breaking angle so that only the privileging of the pure directions over their mixtures, not their individual sensitivities, affects $p$. This selectivity is exactly what error correction requires: staying responsive to individual features while suppressing generic mixtures of them. We give the formal statement, that $p=2$ cannot privilege feature directions under superposition whereas $p>2$ can, in Section [6.1](https://arxiv.org/html/2606.24964#S6.SS1).
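A small worked case makes the size of this bulge concrete. Assume, for illustration only, that the two single-axis angles coincide ($\alpha_{1}=\alpha_{2}$) and take the even mixture $\varphi=\pi/4$, so that $\cos\varphi=\sin\varphi=2^{-1/2}$. Equation 13 then reads

$$
2\left(\frac{\sin\alpha(\pi/4)}{\sqrt{2}\,\sin\alpha_{1}}\right)^{\!p}=1
\quad\Longrightarrow\quad
\frac{\sin\alpha(\pi/4)}{\sin\alpha_{1}}=2^{\frac{1}{2}-\frac{1}{p}}.
$$

For $p=2$ the ratio is $1$: the mixture breaks the plateau at the same rescaled magnitude as a pure direction. For $p=2.4$ it is $2^{1/12}\approx 1.06$, so the even mixture must be pushed about 6% further in $\sin\alpha$. As $p\to\infty$ the ratio tends to $\sqrt{2}\approx 1.41$, the corner of the square.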
We repeat the analysis across six direction types. For contrastive directions, we form 318 pairs from 33 directions (14 binary semantic concepts, 10 natural languages, 9 programming languages), keeping only pairs with intra-pair cosine similarity below 0.1. The filter is needed because our two-direction analysis orthogonalises the pair (Section [3](https://arxiv.org/html/2606.24964#S3)): if the two directions already have significant overlap, orthogonalising one against the other distorts it substantially, so the perturbation no longer probes the labelled feature. The conservative 0.1 threshold keeps this distortion small. The full list and pairwise overlap are in Appendix [A](https://arxiv.org/html/2606.24964#A1). For MELBO, we use 528 pairs obtained via the procedure described in Section [3](https://arxiv.org/html/2606.24964#S3). For SAE latents, we use top-activating decoder columns from Gemma Scope’s residual stream layer 2 width-16k SAE. PCA and random directions serve as non-feature baselines, as do random-difference directions, for which we form 780 pairs from 40 directions and keep the 710 with intra-pair cosine similarity below 0.1.
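The pair-selection step can be sketched as follows. The function name is hypothetical, and filtering on the absolute value of the cosine similarity is an assumption: the text says only "cosine similarity below 0.1".

```python
from itertools import combinations

import numpy as np

def low_overlap_pairs(directions, max_cos=0.1):
    """Index pairs (i, j), i < j, whose |cosine similarity| is below max_cos."""
    d = np.asarray(directions, dtype=float)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)  # unit-normalise rows
    return [
        (i, j)
        for i, j in combinations(range(len(d)), 2)
        if abs(float(d[i] @ d[j])) < max_cos  # assumption: absolute value
    ]

# Number of unordered pairs before filtering, for the set sizes in the text.
n_pairs_33 = len(list(combinations(range(33), 2)))  # 528
n_pairs_40 = len(list(combinations(range(40), 2)))  # 780
```

These counts match the figures above: 33 directions give 528 candidate pairs (318 contrastive pairs survive the filter), and 40 random-difference directions give 780 (710 survive).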
Figure 4: Superellipse exponents by direction type. Each dot is one fitted superellipse exponent $p$ for a pair of directions. The black horizontal lines are the per-column medians. The dashed orange line corresponds to the $p=2$ isotropic reference. The white markers are the per-column means, with error bars corresponding to the 95% confidence interval on the mean, computed by direction bootstrapping. Candidate feature directions (contrastive, MELBO, top-activating SAE latents) sit consistently above $p=2$. The PCA, random, and random-difference baselines cluster at $p\approx 2$. The superellipse fits are good: no per-condition median fit residual exceeds $2\%$.
We observe that contrastive directions (mean 2.42, 95% CI [2.29, 2.62], median 2.30), MELBO directions (mean 2.21, 95% CI [2.01, 2.42], median 2.16), and top-activating SAE latents (mean 2.28, 95% CI [2.14, 2.47], median 2.19) sit consistently above $p=2$. This indicates that the model is more sensitive to perturbations along these directions than along their mixtures, consistent with them being features and with FSEC. We emphasize that our result is not merely that these directions are sensitive, as prior work on activation plateaus and MELBO has shown, but that they are more sensitive than mixtures of them, which is precisely what $p>2$ measures. The same is not true for any of our baselines: top PCA directions (mean 2.03, 95% CI [1.94, 2.12], median 1.97), random-difference directions (mean 2.05, 95% CI [2.00, 2.10], median 2.02), and random directions (mean 2.03, 95% CI [2.00, 2.07], median 2.02) all cluster at $p\approx 2$. As discussed in Section [1](https://arxiv.org/html/2606.24964#S1), $p=2$ means that, in the per-axis-normalised coordinates, the plateau boundary is elliptical: mixtures break through at the same rescaled magnitude as the pure directions, so neither direction is privileged over its mixtures.
This is inconsistent with the baseline directions being aligned with feature directions, under the assumption of FSEC\.
### 4.2 Rotating away from feature directions
We have shown that candidate feature directions have $p>2$ while controls do not. We now show that $p$ decays monotonically to $p=2$ as we rotate away from candidate feature directions. We take a pair of contrastive feature directions, rotate each direction toward a fixed random orthogonal direction by an angle $\theta$, re-orthonormalize the pair, and refit, so that $\theta=0$ recovers the original contrastive pair and $\theta=\pi/2$ yields a random direction. We sweep $\theta\in[0,\pi/2]$ over a subsample of 40 overlap-filtered pairs (subsampled only to keep the computational cost manageable, as the full sweep is more expensive than the single-pair fits of Section [4.1](https://arxiv.org/html/2606.24964#S4.SS1)). For each pair we average over four independent rotation realizations: within each, every direction is tilted toward its own fixed random orthogonal target (Figure [5](https://arxiv.org/html/2606.24964#S4.F5)).
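The rotation step can be sketched as follows. This is a minimal illustration under stated assumptions: each random target is taken to be orthogonal to its own direction only, the pair is re-orthonormalised with Gram-Schmidt, and the names `random_orthogonal_unit` and `rotate_pair` are hypothetical rather than the paper's.

```python
import numpy as np

def random_orthogonal_unit(d, rng):
    """Random unit vector orthogonal to the unit vector d."""
    w = rng.standard_normal(d.shape[0])
    w -= (w @ d) * d                      # remove the component along d
    return w / np.linalg.norm(w)

def rotate_pair(d1, d2, theta, w1, w2):
    """Tilt each direction toward its own target by theta, then re-orthonormalise."""
    u1 = np.cos(theta) * d1 + np.sin(theta) * w1
    u2 = np.cos(theta) * d2 + np.sin(theta) * w2
    u1 = u1 / np.linalg.norm(u1)
    u2 = u2 - (u2 @ u1) * u1              # Gram-Schmidt against u1 (assumed)
    return u1, u2 / np.linalg.norm(u2)

rng = np.random.default_rng(0)
dim = 64                                  # toy dimension, not the residual-stream width
d1, d2 = np.eye(dim)[0], np.eye(dim)[1]   # stand-ins for a contrastive pair
w1 = random_orthogonal_unit(d1, rng)      # fixed targets, reused for every theta
w2 = random_orthogonal_unit(d2, rng)
for theta in np.linspace(0.0, np.pi / 2, 5):
    u1, u2 = rotate_pair(d1, d2, theta, w1, w2)
    print(f"theta={theta:.2f}  <u1,d1>={u1 @ d1:.3f}  <u1,u2>={u1 @ u2:+.1e}")
```

With the targets held fixed, the cosine between each rotated direction and its original falls from 1 at $\theta=0$ to 0 at $\theta=\pi/2$, which is the quantity on the x-axis of the sweep.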
Figure 5: Superellipse exponent versus rotation away from contrastive directions. Each point is the median fitted $p$ over the 40 overlap-filtered contrastive pairs and four independent rotation realizations per pair; the shaded band shows the interquartile range. The dashed line marks the $p=2$ isotropic reference. At $\cos\theta=1$ the perturbation directions are the contrastive directions; the median there ($p\approx 2.4$) is slightly above the full contrastive-set median ($2.30$, Figure [4](https://arxiv.org/html/2606.24964#S4.F4)) as it is computed over a 40-pair subsample. At $\cos\theta=0$ they have been rotated fully onto a random orthogonal direction. $p$ degrades monotonically from $\approx 2.4$ toward $2$ as the directions are rotated away from the contrastive features.
The fitted exponent decreases monotonically from $p\approx 2.4$ at the contrastive directions to $p\approx 2.0$ once they are fully rotated to random, indicating that $p$ tracks alignment with the candidate feature directions. Because we lack ground-truth features in the LLM, this sweep can only rotate away from candidate features. In Section [5](https://arxiv.org/html/2606.24964#S5) we close this gap in a toy model with known features.
### 4.3 Ablations
We now show that our results hold across a variety of settings. To do so, we vary our setup along the following axes:
- Perturbation method: additive rather than norm-matched.
- Perturbation layer: $\ell\in\{5,10,20\}$ in addition to the default $\ell=2$.
- Measurement layer: layers 30 and 35 in addition to the default second-to-last (layer 40 for Gemma-2-9B).
- Response metric: cosine distance at the second-to-last residual stream layer and KL-divergence in logit space in addition to $L^2$ distance.
- Plateau-breaking threshold: 50% and 200% of the per-pair threshold $T$.
- Activation source: wikipedia-en, wikipedia-zh, and the-stack (default FineWeb).
- Model: Qwen3-1.7B, Llama-3.1-8B, Mistral-7B-v0.3, Aya-Expanse-8B, and Yi-1.5-9B.
Figure 6: Robustness of $p>2$ for contrastive directions across setup choices. Each dot is one fitted superellipse exponent $p$ for a pair of contrastive directions; the colored marker shows the per-setting mean with 95% CI, and the horizontal dash shows the median. The dashed orange line is the $p=2$ isotropic reference. Columns vary the model, perturbation layer, measurement layer, response metric, response threshold, perturbation method, activation source, and perturbation token position. Fitted exponents remain above $p=2$ across all variations. The superellipse fits are good: no per-condition median fit residual exceeds $2\%$.
All our results hold across every ablation; the full results are shown in Figure [6](https://arxiv.org/html/2606.24964#S4.F6). For contrastive feature directions, the means are all in the [2.24, 2.63] interval, and the medians in [2.11, 2.50], all consistent with contrastive feature directions being privileged.
## 5 Toy model validation
To validate our method, we test whether it detects FSEC in a toy model with known ground-truth features. This lets us (i) verify that our methodology yields high $p$ and good superellipse fits on known ground-truth features, and (ii) repeat the misalignment sweep of Section [4.2](https://arxiv.org/html/2606.24964#S4.SS2) from a known starting point, rotating away from the *true* features rather than from imperfect candidates, since the toy model exposes the true feature directions. Because our candidate feature directions in LLMs are imperfect approximations of the true features, we expect the measured $p$ to be biased downward, since rotating away from a true feature direction should yield a less privileged direction.
We use the two-layer denoising network of Vaintrob ([2026](https://arxiv.org/html/2606.24964#bib.bib26)) as our toy model. It denoises two-hot $d$-dimensional inputs corrupted with Gaussian noise using $H<d$ neurons, so the $d$ features are denoised (i.e., computed) in superposition over fewer than $d$ dimensions. The input is expressed in the feature basis, giving us direct access to the ground-truth feature directions.
We apply our two-direction perturbation analysis as before, perturbing additively along pairs of ground-truth inactive feature directions. To measure the effect of misalignment, we rotate the perturbation directions away from the true feature directions toward random directions by varying amounts. We sweep three feature-to-neuron ratios ($4\times$, $8\times$, $16\times$) with $H=1024$ neurons. Full details of the toy model setup are in Appendix [C](https://arxiv.org/html/2606.24964#A3).
Figure 7: Superellipse exponents versus alignment with feature directions. Each dot is the median of $80$ fitted superellipse exponents; the shaded region shows the interquartile range. The dashed line marks the $p=2$ isotropic reference. Perturbation directions take the form $\mathbf{u}\propto\cos\theta\,\mathbf{e}_j+\sin\theta\,\mathbf{w}$ (normalized to unit length) for $\theta\in[0,\pi/2]$, where $\mathbf{e}_j$ is a feature axis and $\mathbf{w}$ is an isotropic random unit vector in $\mathbb{R}^d$; the x-axis shows the median realized $|\langle\mathbf{u},\mathbf{e}_j\rangle|$ over the 80 pairs. The leftmost points correspond to ground-truth feature directions; the rightmost to isotropic random unit vectors, whose realized overlap with the feature axes is of order $1/\sqrt{d}$ per pair. The three lines correspond to feature-to-neuron ratios of $4\times$, $8\times$, and $16\times$, all with $H=1024$ neurons. Ground-truth feature directions yield $p$ in the range $3$–$3.4$; $p$ degrades monotonically toward $2$ as directions are rotated away from the true features.
Figure [7](https://arxiv.org/html/2606.24964#S5.F7) shows the fitted exponent $p$ as a function of the perturbation directions' alignment with the true feature axes. For ground-truth feature directions, we obtain $p$ in the range $3$–$3.4$, well above $2$, increasing with the feature-to-neuron ratio (highest for $16\times$). As directions are rotated away from the true features, $p$ degrades monotonically toward $2$, which is recovered for random directions. This mirrors the LLM misalignment sweep (Section [4.2](https://arxiv.org/html/2606.24964#S4.SS2)), but now starting from ground-truth features rather than candidates. This is consistent with our prediction that FSEC produces $p>2$ and that imperfect feature alignment biases $p$ downward.
Due to the significant differences between this toy model and LLMs (two layers, binary features, $\tanh^3$ activation function), we do not place much weight on the specific values obtained; we view this primarily as validation of the methodology and the directional privileging prediction.
## 6 Discussion
### 6.1 Why $p>2$ is necessary for error correction
When neural networks use superposition, features are embedded non-orthogonally, so each active feature produces a small spurious "interference" activation along the others. To compute in superposition despite this interference noise, a network must implement error correction (Hänni et al., [2024](https://arxiv.org/html/2606.24964#bib.bib10)), and such a mechanism must treat feature directions differently from non-feature directions. To test whether this is the case, we measure the exponent $p$ in our response model (Equation [10](https://arxiv.org/html/2606.24964#S3.E10)).
The exponent $p$ controls how selectively the network can respond to individual feature directions. When $p=2$, the response reduces to a quadratic form,
$$R(\hat{\mathbf{d}})=F\!\left(\hat{\mathbf{d}}^{\top}\mathbf{A}\,\hat{\mathbf{d}}\right),\tag{14}$$
where $\mathbf{A}=\sum_{i}w_{i}\hat{\mathbf{f}}_{i}\hat{\mathbf{f}}_{i}^{\top}$ is a positive semidefinite matrix with at most $d_{\text{model}}$ nonzero eigenvalues. Under superposition, we expect far more features than dimensions, yet a quadratic form has only $d_{\text{model}}$ degrees of freedom and therefore cannot privilege that many directions above all others. Even without superposition, $p=2$ does not allow all feature directions to be more responsive than all non-feature directions: if the eigenvalues of $\mathbf{A}$ are not all equal, convexity implies that some non-feature directions (linear combinations of high-eigenvalue features) will be more responsive than the lowest-eigenvalue features; if all eigenvalues are equal, the response is isotropic and no direction is privileged at all. In either case, $p=2$ is insufficient for error correction that privileges feature directions.
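The no-superposition case of this argument can be checked numerically. The sketch below uses three orthonormal stand-in features with unequal, purely illustrative weights and compares the $p=2$ response argument for a mixture of the two strongest features against that of the weakest feature. The function name `response_p2` is hypothetical, and the outer function $F$ is omitted.

```python
import numpy as np

# Three orthonormal "features" with unequal weights (illustrative values only).
features = np.eye(3)
weights = np.array([3.0, 2.5, 1.0])
A = sum(w * np.outer(f, f) for w, f in zip(weights, features))

def response_p2(d):
    """Argument of F for p = 2: sum_i w_i <f_i, d_hat>^2, which equals d_hat^T A d_hat."""
    d_hat = d / np.linalg.norm(d)
    return float(np.sum(weights * (features @ d_hat) ** 2))

mixture = features[0] + features[1]   # a non-feature direction
weakest = features[2]                 # the lowest-weight feature
unit_mix = mixture / np.linalg.norm(mixture)

print("mixture        :", response_p2(mixture))
print("quadratic form :", float(unit_mix @ A @ unit_mix))
print("weakest feature:", response_p2(weakest))
```

The mixture's response sits between the two largest weights and therefore above the weakest feature's, so under $p=2$ at least one non-feature direction is more responsive than a feature direction.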
At the other extreme, $p\to\infty$ makes the response sensitive only to the single largest feature projection, perfectly suppressing interference from all other directions. This extreme is likely not how LLMs function either. We therefore expect the true exponent to lie somewhere between these extremes: $2<p<\infty$.
Empirically, we find $p\approx 2.3$ for contrastive directions and $p\approx 2.2$ for MELBO and SAE directions, consistently above $p=2$ and consistent with FSEC. Since our candidate feature directions are imperfect approximations of the model's true features, the measured $p$ is likely a lower bound on the exponent that would be obtained with ground-truth feature directions: imperfect alignment would dilute the directional sensitivity, biasing $p$ toward $2$. Our toy model analysis (Section [5](https://arxiv.org/html/2606.24964#S5)) supports this, showing that $p$ degrades monotonically toward $2$ as directions are rotated away from the true features. In fact, moderate misalignment of perturbation directions in the form of $\sim 0.8$ cosine similarity with the feature directions reduces the fitted superellipse exponent $p$ from $3$–$3.4$ to $2.3$–$2.6$, indicating that our measured $p\approx 2.3$ for contrastive directions in LLMs is consistent with significantly higher $p$ for true feature directions.
### 6.2 Alternative interpretations
Our finding that $p>2$ for candidate feature directions is consistent with FSEC. However, there exist alternative interpretations of this result and of activation plateaus more broadly.
Activation plateaus, the near-zero downstream response to small perturbations that underlies our measurement, might not be signatures of error correction at all. They could instead reflect learned robustness, perhaps related to broad basins of attraction in the activation space of LLMs. This could hold whether or not LLMs implement error correction: it could be that error correction is present but plateaus are not the right empirical signature, or that error correction is absent and plateaus reflect a different phenomenon. However, our main result does not depend on interpreting plateaus as error correction per se: even if the mechanism underlying plateaus is not error correction, the finding that $p>2$ for candidate feature directions but $p\approx 2$ for baselines remains, and still indicates that the network treats these directions preferentially.
It is also possible that LLMs implement error correction in a way that is not directly direction-aware, i.e., without features being linear directions that are consistent across the activation space. For instance, the in-distribution activations of LLMs may lie on a lower-dimensional manifold (Goodfellow et al., [2016](https://arxiv.org/html/2606.24964#bib.bib8); Cheng et al., [2023](https://arxiv.org/html/2606.24964#bib.bib3)), and activation plateaus could reflect a mechanism by which small off-manifold perturbations are mapped back on-manifold. This would be broadly consistent with our observations but would admit a different interpretation than the one we have put forward.
## 7 Conclusion
We have proposed perturbation analysis of residual-stream activations as an empirical test of whether feature directions are privileged and whether LLMs implement error correction. We find that candidate feature directions in the form of contrastive directions, MELBO directions, and top-activating SAE latents are privileged, consistent with superposition and error correction. Our results hold across six model families and a wide range of experimental settings.
Our results consistently select $p\approx 2.3$ for our best candidates for feature directions. As discussed in Section [6](https://arxiv.org/html/2606.24964#S6), this is likely a lower bound: imperfect alignment between our candidate feature directions and the model's true features would dilute the directional sensitivity, biasing $p$ toward $2$. This provides a quantitative target for toy models of error correction: a good toy model should reproduce $p>2$ under comparable experimental conditions. Indeed, the toy model of Section [5](https://arxiv.org/html/2606.24964#S5) passes this test, yielding $p$ in the range $3$–$3.4$ for ground-truth features. More broadly, the exponent cleanly separates candidate feature directions ($p>2$) from baselines ($p\approx 2$).
#### Future work.
This discriminative power suggests using the exponent directly as an unsupervised feature-finding objective: searching for directions that maximize it. We have explored this in preliminary experiments, and our tentative conclusion is that it does not by itself recover features: $p>2$ appears to be necessary but not sufficient for a direction to be a feature, so maximizing the exponent alone surfaces high-$p$ directions that do not otherwise behave like features. We flag this to avoid overstating the objective's promise; these results are preliminary and not reported here, and we leave a fuller treatment to future work. The toy model offers a controlled test of this idea, since its ground-truth features let us check whether maximizing the exponent recovers them. On the modeling side, extending the toy-model validation to standard activation functions and deeper architectures would test how closely a toy model of error correction can reproduce the exponents we measure in LLMs.
## 8 Acknowledgements
We thank Dmitry Vaintrob for pointing us to the denoising toy model and for useful discussions\.
## References
- Adler & Shavit \(2024\)Adler, M\. and Shavit, N\.On the complexity of neural computation in superposition\.*arXiv preprint arXiv:2409\.15318*, 2024\.
- Bricken et al\. \(2023\)Bricken, T\., Templeton, A\., Batson, J\., Chen, B\., Jermyn, A\., Conerly, T\., Turner, N\. L\., Anil, C\., Denison, C\., Askell, A\., Lasenby, R\., Wu, Y\., Kravec, S\., Schiefer, N\., Maxwell, T\., Joseph, N\., Hatfield\-Dodds, Z\., Tamkin, A\., Nguyen, K\., McLean, B\., Burke, J\. E\., Hume, T\., Carter, S\., Henighan, T\., and Olah, C\.Towards monosemanticity: Decomposing language models with dictionary learning\.*Transformer Circuits Thread*, 2023\.URL[https://transformer\-circuits\.pub/2023/monosemantic\-features/index\.html](https://transformer-circuits.pub/2023/monosemantic-features/index.html)\.
- Cheng et al\. \(2023\)Cheng, E\., Kervadec, C\., and Baroni, M\.Bridging information\-theoretic and geometric compression in language models\.In*Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pp\. 12397–12420, 2023\.
- Cunningham et al\. \(2023\)Cunningham, H\., Ewart, A\., Riggs, L\., Huben, R\., and Sharkey, L\.Sparse autoencoders find highly interpretable features in language models\.*arXiv preprint arXiv:2309\.08600*, 2023\.
- Dang et al\. \(2024\)Dang, J\., Singh, S\., D’souza, D\., Ahmadian, A\., Salamanca, A\., Smith, M\., Peppin, A\., Hong, S\., Govindassamy, M\., Zhao, T\., et al\.Aya expanse: Combining research breakthroughs for a new multilingual frontier\.*arXiv preprint arXiv:2412\.04261*, 2024\.
- Elhage et al\. \(2022\)Elhage, N\., Hume, T\., Olsson, C\., Schiefer, N\., Henighan, T\., Kravec, S\., Hatfield\-Dodds, Z\., Lasenby, R\., Drain, D\., Chen, C\., et al\.Toy models of superposition\.*arXiv preprint arXiv:2209\.10652*, 2022\.
- Gao et al\. \(2024\)Gao, L\., la Tour, T\. D\., Tillman, H\., Goh, G\., Troll, R\., Radford, A\., Sutskever, I\., Leike, J\., and Wu, J\.Scaling and evaluating sparse autoencoders\.*arXiv preprint arXiv:2406\.04093*, 2024\.
- Goodfellow et al\. \(2016\)Goodfellow, I\., Bengio, Y\., and Courville, A\.*Deep learning*\.MIT press Cambridge, MA, USA, 2016\.
- Grattafiori et al\. \(2024\)Grattafiori, A\., Dubey, A\., Jauhri, A\., Pandey, A\., Kadian, A\., Al\-Dahle, A\., Letman, A\., Mathur, A\., Schelten, A\., Vaughan, A\., et al\.The llama 3 herd of models\.*arXiv preprint arXiv:2407\.21783*, 2024\.
- Hänni et al\. \(2024\)Hänni, K\., Mendel, J\., Vaintrob, D\., and Chan, L\.Mathematical models of computation in superposition\.*arXiv preprint arXiv:2408\.05451*, 2024\.
- Heimersheim & Mendel \(2024\)Heimersheim, S\. and Mendel, J\.\[interim research report\] Activation plateaus & sensitive directions in GPT2\.AI Alignment Forum, July 2024\.URL[https://www\.alignmentforum\.org/posts/LajDyGyiyX8DNNsuF/interim\-research\-report\-activation\-plateaus\-and\-sensitive\-1](https://www.alignmentforum.org/posts/LajDyGyiyX8DNNsuF/interim-research-report-activation-plateaus-and-sensitive-1)\.Work produced at Apollo Research\.
- Heimersheim & Nanda \(2024\)Heimersheim, S\. and Nanda, N\.How to use and interpret activation patching\.*arXiv preprint arXiv:2404\.15255*, 2024\.
- Janiak et al\. \(2024\)Janiak, J\., Karwowski, J\., Mangat, C\. S\., Giglemiani, G\., Petrova, N\., and Heimersheim, S\.Characterizing stable regions in the residual stream of llms\.*arXiv preprint arXiv:2409\.17113*, 2024\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Sablayrolles, A\., Mensch, A\., Bamford, C\., Chaplot, D\. S\., de Las Casas, D\., Bressand, F\., Lengyel, G\., Lample, G\., Saulnier, L\., Lavaud, L\. R\., Lachaux, M\.\-A\., Stock, P\., Scao, T\. L\., Lavril, T\., Wang, T\., Lacroix, T\., and Sayed, W\. E\.Mistral 7b\.*ArXiv*, abs/2310\.06825, 2023\.
- Lieberum et al\. \(2024\)Lieberum, T\., Rajamanoharan, S\., Conmy, A\., Smith, L\., Sonnerat, N\., Varma, V\., Kramár, J\., Dragan, A\., Shah, R\., and Nanda, N\.Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2\.In*Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP*, pp\. 278–300, 2024\.
- Mack & Turner \(2024a\)Mack, A\. and Turner, A\.Deep causal transcoding: A framework for mechanistically eliciting latent behaviors in language models, 12 2024a\.URL[https://www\.lesswrong\.com/posts/fSRg5qs9TPbNy3sm5/deep\-causal\-transcoding\-a\-framework\-for\-mechanistically](https://www.lesswrong.com/posts/fSRg5qs9TPbNy3sm5/deep-causal-transcoding-a-framework-for-mechanistically)\.LessWrong\.
- Mack & Turner \(2024b\)Mack, A\. and Turner, A\.Mechanistically eliciting latent behaviors in language models, 4 2024b\.URL[https://www\.lesswrong\.com/posts/ioPnHKFyy4Cw2Gr2x/mechanistically\-eliciting\-latent\-behaviors\-in\-language\-1](https://www.lesswrong.com/posts/ioPnHKFyy4Cw2Gr2x/mechanistically-eliciting-latent-behaviors-in-language-1)\.LessWrong\.
- Meng et al\. \(2022\)Meng, K\., Bau, D\., Andonian, A\., and Belinkov, Y\.Locating and editing factual associations in gpt\.*Advances in neural information processing systems*, 35:17359–17372, 2022\.
- Olah et al\. \(2025\)Olah, C\., Turner, N\. L\., and Conerly, T\.A toy model of interference weights\.*Transformer Circuits Thread*, 2025\.URL[https://transformer\-circuits\.pub/2025/interference\-weights/index\.html](https://transformer-circuits.pub/2025/interference-weights/index.html)\.
- Panickssery et al\. \(2023\)Panickssery, N\., Gabrieli, N\., Schulz, J\., Tong, M\., Hubinger, E\., and Turner, A\. M\.Steering llama 2 via contrastive activation addition\.*arXiv preprint arXiv:2312\.06681*, 2023\.
- Penedo et al\. \(2024\)Penedo, G\., Kydlíček, H\., Ben allal, L\., Lozhkov, A\., Mitchell, M\., Raffel, C\., Von Werra, L\., Wolf, T\., et al\.The fineweb datasets: Decanting the web for the finest text data at scale\.*Advances in Neural Information Processing Systems*, 37:30811–30849, 2024\.
- Shinkle & Heimersheim \(2025\)Shinkle, M\. and Heimersheim, S\.Activation plateaus: Where and how they emerge\.LessWrong, October 2025\.URL[https://www\.lesswrong\.com/posts/WMfSbt7AAcJdHzysB/activation\-plateaus\-where\-and\-how\-they\-emerge](https://www.lesswrong.com/posts/WMfSbt7AAcJdHzysB/activation-plateaus-where-and-how-they-emerge)\.Accessed: 2026\-04\-27\.
- Team et al\. \(2024\)Team, G\., Riviere, M\., Pathak, S\., Sessa, P\. G\., Hardin, C\., Bhupatiraju, S\., Hussenot, L\., Mesnard, T\., Shahriari, B\., Ramé, A\., et al\.Gemma 2: Improving open language models at a practical size\.*arXiv preprint arXiv:2408\.00118*, 2024\.
- Templeton et al\. \(2026\)Templeton, A\., Conerly, T\., Marcus, J\., Lindsey, J\., Bricken, T\., Chen, B\., Pearce, A\., Citro, C\., Ameisen, E\., Jones, A\., et al\.Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet\.*arXiv preprint arXiv:2605\.29358*, 2026\.
- Turner et al\. \(2023\)Turner, A\. M\., Thiergart, L\., Leech, G\., Udell, D\., Vazquez, J\. J\., Mini, U\., and MacDiarmid, M\.Steering language models with activation engineering\.*arXiv preprint arXiv:2308\.10248*, 2023\.
- Vaintrob \(2026\)Vaintrob, D\.A tale of three theories: Sparsity, frustration, and statistical field theory\.LessWrong, January 2026\.URL[https://www\.lesswrong\.com/posts/siu22scEfuKxpSgfK/a\-tale\-of\-three\-theories\-sparsity\-frustration\-and](https://www.lesswrong.com/posts/siu22scEfuKxpSgfK/a-tale-of-three-theories-sparsity-frustration-and)\.Accessed: 2026\-05\-06\.
- Yang et al\. \(2025\)Yang, A\., Li, A\., Yang, B\., Zhang, B\., Hui, B\., Zheng, B\., Yu, B\., Gao, C\., Huang, C\., Lv, C\., et al\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*, 2025\.
- Young et al\. \(2024\)Young, A\., Chen, B\., Li, C\., Huang, C\., Zhang, G\., Zhang, G\., Wang, G\., Li, H\., Zhu, J\., Chen, J\., et al\.Yi: Open foundation models by 01\. ai\.*arXiv preprint arXiv:2403\.04652*, 2024\.
## Appendix A Contrastive direction inventory
We extract 33 contrastive directions across three families. Each direction is the difference of mean last-token activations at the perturbation layer between a positive prompt set and a matched negative set, normalised to a unit vector.
The 14 binary semantic directions are Age, Certainty, Era, Gender, Health, Honesty, Literary, Number, Person, Refusal, Sentiment, Status, Tense, and Wealth. Each uses 30 matched prompt pairs except Literary, which contrasts the slang and literary registers across 50 prompt templates. Prompts are LLM-generated and human-verified.
The 10 natural-language directions cover Arabic, Chinese, Dutch, French, German, Italian, Japanese, Portuguese, Russian, and Spanish. Each pair is an English prompt and its translation in the target language; we use 36 matched prompts per direction, drawn from a single multilingual prompt pool so that all 10 directions share the same English source side.
The 9 programming-language directions cover C++, Go, Haskell, Java, JavaScript, Lisp, Python, Rust, and TypeScript. Each pair is an English prompt and an equivalent code snippet in the target language; 30 matched prompts per direction.
Forming all $\binom{33}{2}=528$ unordered pairs, we keep the 318 with raw overlap $|\langle\mathbf{d}_i,\mathbf{d}_j\rangle|<0.1$ for our two-direction analysis (Figure [8](https://arxiv.org/html/2606.24964#A1.F8)). The filter is needed because our two-direction analysis orthogonalises each pair via Gram–Schmidt (Section [3](https://arxiv.org/html/2606.24964#S3)): if two directions overlap significantly, removing the shared component leaves only a small residual, which renormalisation to unit length then amplifies, so the resulting direction reflects this residual rather than the labelled feature. The $0.1$ threshold is conservative: at this overlap the orthogonalised direction remains more than $99\%$ aligned with the original, and the distortion grows with overlap.
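The alignment figure quoted for the $0.1$ threshold can be reproduced with a short calculation. The sketch below builds two unit vectors with a chosen cosine overlap in a 2D plane (which loses no generality, since only the plane they span matters), orthogonalises the second against the first, and reports how aligned the result remains with the original. The function name is illustrative.

```python
import numpy as np

def alignment_after_orthogonalisation(overlap):
    """Cosine between a unit direction and its Gram-Schmidt-orthogonalised,
    renormalised version, given its cosine overlap with the other direction."""
    d1 = np.array([1.0, 0.0])
    d2 = np.array([overlap, np.sqrt(1.0 - overlap**2)])  # unit, <d1, d2> = overlap
    residual = d2 - (d2 @ d1) * d1                       # remove the shared component
    d2_orth = residual / np.linalg.norm(residual)        # renormalise to unit length
    return float(d2_orth @ d2)

for c in (0.1, 0.3, 0.6, 0.9):
    print(f"overlap={c:.1f}  alignment={alignment_after_orthogonalisation(c):.4f}")
```

The alignment equals $\sqrt{1-c^2}$ for overlap $c$, so it stays above $0.99$ at $c=0.1$ and drops quickly as the overlap grows.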
Figure 8: Pairwise cosine overlap of the 33 contrastive directions. Cells show $|\langle\mathbf{d}_i,\mathbf{d}_j\rangle|$ on Gemma-2-9B at layer 2. Directions are ordered and labelled by family (semantic, natural language, programming language). White $\times$ marks the 210 pairs dropped by the $|\cos|<0.1$ filter applied throughout the paper; the remaining 318 pairs are the ones plotted as Contrastive in the beeswarms. Most filtered pairs sit inside the natural-language block.
## Appendix B Distribution of superellipse fit residuals
For every fitted pair we record the mean radial fraction $\rho=\tfrac{1}{M}\sum_{k=1}^{M}\bigl|\,(|x_k|^p+|y_k|^p)^{1/p}-1\,\bigr|$, i.e. the average fractional distance of the $M$ contour points from the fitted superellipse, with $\rho=0$ corresponding to a perfect fit. The number of contour points $M$ varies per fit (it is the cardinality of the iso-threshold level set on the 2D $L^2$ grid); the per-fit median is $33$ (IQR $29$–$37$).
Each pair contributes one dot in Figure [9](https://arxiv.org/html/2606.24964#A2.F9).
Every sub-group median residual is below 2%.
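As a small sanity check on the residual definition, the sketch below computes $\rho$ for synthetic contour points. It assumes the points are already expressed in per-axis-normalised coordinates, so that the fitted curve is $|x|^p+|y|^p=1$. The function name `fit_residual` and the synthetic data are illustrative.

```python
import numpy as np

def fit_residual(x, y, p):
    """Mean fractional radial distance of contour points from |x|^p + |y|^p = 1."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    lp_norm = (np.abs(x) ** p + np.abs(y) ** p) ** (1.0 / p)
    return float(np.mean(np.abs(lp_norm - 1.0)))

p = 2.3
phi = np.linspace(0.0, np.pi / 2, 33)      # 33 points, the reported per-fit median M
r = (np.cos(phi) ** p + np.sin(phi) ** p) ** (-1.0 / p)
x, y = r * np.cos(phi), r * np.sin(phi)    # points lying exactly on the curve

print(fit_residual(x, y, p))               # on-curve points
print(fit_residual(1.05 * x, 1.05 * y, p)) # every point pushed 5% outward
```

Because the $L^p$ norm is homogeneous, scaling every contour point by a common factor shifts $\rho$ by exactly that fractional amount, which makes the $2\%$ bound on median residuals easy to interpret.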
Figure 9: Distribution of superellipse fit residuals across all conditions. Each dot is one pair's fit residual $\rho$. The eight left-most columns mirror the ablation axes of Figure [6](https://arxiv.org/html/2606.24964#S4.F6) (Model, Perturbation layer, Measurement layer, Response metric, Response threshold, Perturbation method, Token position, Activation source). The right-most column gives the residual distribution per direction family (Contrastive, MELBO, SAE, PCA, Random). Per-sub-group mean (white diamond) $\pm$ 95% direction-bootstrap CI and median (horizontal bar) are colour-matched to the sub-group's dots. The y-axis is clipped at 12.5% to keep the bulk of the distribution legible; this clips 4 of the 10,719 plotted dots ($<0.05\%$).
## Appendix C Toy model details
### C.1 Architecture
The toy model follows Vaintrob ([2026](https://arxiv.org/html/2606.24964#bib.bib26)). It is a two-layer network with tied weights: an encoder $E\in\mathbb{R}^{H\times d}$ and decoder $D=E^{\top}$, with a $\tanh^3$ activation function applied element-wise to the hidden layer. The entries of $E$ are drawn i.i.d. from $\{0,+1,-1\}$ with probabilities $\{1-q,q/2,q/2\}$, where $q$ is a sparsity parameter controlling the fraction of non-zero entries. The network receives a noisy two-hot input $x=x_{\text{clean}}+\epsilon$, where $x_{\text{clean}}$ has exactly $s=2$ active coordinates set to $1$ and the remaining set to $0$, and $\epsilon\sim\mathcal{N}(0,\sigma^2 I)$ with $\sigma^2=0.03$. The network operates in superposition, with the number of features $d$ exceeding the number of neurons $H$.
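A minimal sketch of this architecture is given below, under a few assumptions: $\tanh^3$ is read as the element-wise cube of $\tanh$, the sizes are far smaller than those used in the paper, and the two scalar gains `c_in` and `c_out` are left at placeholder values of 1 rather than optimized. The function names are hypothetical.

```python
import numpy as np

def sample_encoder(H, d, q, rng):
    """H x d matrix with i.i.d. entries in {0, +1, -1}, probabilities {1-q, q/2, q/2}."""
    return rng.choice([0.0, 1.0, -1.0], size=(H, d), p=[1 - q, q / 2, q / 2])

def forward(E, x, c_in=1.0, c_out=1.0):
    """Tied-weight pass: c_out * E^T tanh(c_in * E x)^3 (cube assumed element-wise)."""
    hidden = np.tanh(c_in * (E @ x)) ** 3
    return c_out * (E.T @ hidden)

rng = np.random.default_rng(42)
H, d, q = 64, 256, 0.15                  # small illustrative sizes (4x ratio)
E = sample_encoder(H, d, q, rng)

x_clean = np.zeros(d)
x_clean[[0, 1]] = 1.0                    # two-hot input, s = 2
x = x_clean + rng.normal(0.0, np.sqrt(0.03), size=d)   # sigma^2 = 0.03

y = forward(E, x)
print(y.shape, y[:2], np.abs(y[2:]).mean())
```

The printed values compare the outputs on the two active coordinates with the average magnitude on the inactive ones, which is the quantity the denoising objective is meant to separate.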
### C.2 Optimization
We do not train the encoder and decoder via gradient descent. Instead, we fix the random encoder structure and optimize two scalar parameters: an input gain $c_{\text{in}}$ and an output scale $c_{\text{out}}$, so that the network computes $y=c_{\text{out}}\cdot D\tanh^3(c_{\text{in}}\cdot Ex)$. We minimize a per-coordinate weighted MSE loss:
$$\mathcal{L}=\frac{1}{B\cdot d}\left[\lambda\sum_{i\in\text{on}}(c_{\text{out}}f_i-1)^2+\sum_{i\in\text{off}}(c_{\text{out}}f_i)^2\right],\tag{15}$$
where $f=D\tanh^3(c_{\text{in}}\cdot Ex)$, "on" denotes coordinates where $x_{\text{clean}}=1$, and "off" denotes coordinates where $x_{\text{clean}}=0$. The weighting $\lambda$ upweights active-coordinate residuals to discourage the trivial solution of outputting zero, which otherwise dominates for very sparse inputs. Given $c_{\text{in}}$, the optimal $c_{\text{out}}$ has a closed-form solution:
$$c_{\text{out}}^{*}=\frac{\lambda\cdot\langle f,x_{\text{clean}}\rangle_{\text{on}}}{\lambda\cdot\|f\|^2_{\text{on}}+\|f\|^2_{\text{off}}}.\tag{16}$$
We sweep $c_{\text{in}}$ over a 121-point grid on $[0.05,8.0]$ and select the value minimizing $\mathcal{L}$.
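The closed form follows from setting the derivative of the loss with respect to $c_{\text{out}}$ to zero, and it can be checked numerically. In the sketch below, `f` is a random stand-in for $D\tanh^3(c_{\text{in}}\cdot Ex)$ on a single sample, the $1/(B\cdot d)$ normalisation is dropped because it does not affect the minimiser, and the values of `d` and `lam` are illustrative.

```python
import numpy as np

def weighted_loss(c_out, f, on_mask, lam):
    """Unnormalised per-coordinate weighted squared error for one sample."""
    on_term = lam * np.sum((c_out * f[on_mask] - 1.0) ** 2)
    off_term = np.sum((c_out * f[~on_mask]) ** 2)
    return on_term + off_term

def optimal_c_out(f, on_mask, lam):
    """Closed-form minimiser of weighted_loss with respect to c_out."""
    numerator = lam * np.sum(f[on_mask])          # <f, x_clean> over active coordinates
    denominator = lam * np.sum(f[on_mask] ** 2) + np.sum(f[~on_mask] ** 2)
    return numerator / denominator

rng = np.random.default_rng(0)
d, lam = 512, 60.0                                # illustrative values
on_mask = np.zeros(d, dtype=bool)
on_mask[:2] = True                                # two active coordinates
f = rng.normal(0.0, 0.3, size=d)                  # stand-in for the decoder output
f[on_mask] += 2.0

c_star = optimal_c_out(f, on_mask, lam)
grid = np.linspace(0.5 * c_star, 1.5 * c_star, 2001)
losses = [weighted_loss(c, f, on_mask, lam) for c in grid]
c_grid = grid[int(np.argmin(losses))]
print(c_star, c_grid)
```

The brute-force grid minimum agrees with the closed form to within the grid spacing, which is why only $c_{\text{in}}$ needs to be swept.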
The sparsity parameter $q$ was swept over $\{0.025,0.05,0.075,0.1,0.125,0.15\}$ for each $(d,H)$ configuration, selecting $q^{*}$ by minimum training loss. The selected values were $q^{*}=0.15$ for $4\times$, $q^{*}=0.10$ for $8\times$, and $q^{*}=0.075$ for $16\times$. The weighting parameter was set to $\lambda\in\{500,1000,2000\}$ for $d\in\{4096,8192,16384\}$ respectively, scaled roughly as $\lambda\approx d/(s\cdot\text{SNR}^2)$ with $s=2$ and empirical $\text{SNR}\approx 2$.
### C.3 Perturbation procedure
The active set is fixed to $S=\{0,1\}$ throughout, so all perturbations are performed around the same clean input $x^{*}=e_{0}+e_{1}$. We sample $80$ pairs $(j,k)$ uniformly without replacement from the inactive coordinates $\{2,3,\ldots,d-1\}$. For each pair, we perturb along the two ground-truth feature directions $e_{j}$ and $e_{k}$. To measure the effect of misalignment, we rotate $e_{j}$ and $e_{k}$ toward random directions by varying amounts. Specifically, for each pair we sample an isotropic random unit pair $(w_{u},w_{v})$, mutually orthogonal but otherwise unconstrained, and interpolate between the feature direction and the random direction at each misalignment level. The same $(w_{u},w_{v})$ pair is used across all misalignment levels for a given feature pair.
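One way to read the sampling and rotation steps is sketched below. Several details are assumptions, since the text does not pin them down: pairs are formed by drawing $160$ distinct inactive coordinates, the orthogonal pair comes from a Gram-Schmidt step on Gaussian vectors, and "interpolate" is taken to mean normalized linear interpolation with a parameter $t\in[0,1]$. The dimension, the misalignment grid, and all function names are illustrative.

```python
import math
import random

def sample_pairs(d, n_pairs=80, rng=random):
    """Draw 2*n_pairs distinct coordinates from {2, ..., d-1} and pair them."""
    idx = rng.sample(range(2, d), 2 * n_pairs)
    return [(idx[2 * i], idx[2 * i + 1]) for i in range(n_pairs)]

def _normalize(v):
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

def random_orthonormal_pair(d, rng=random):
    """Isotropic unit vector w_u and a unit vector w_v orthogonal to it."""
    w_u = _normalize([rng.gauss(0.0, 1.0) for _ in range(d)])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    proj = sum(a * b for a, b in zip(w_u, v))
    w_v = _normalize([b - proj * a for a, b in zip(w_u, v)])  # Gram-Schmidt step
    return w_u, w_v

def misaligned_direction(feature_index, w, t):
    """Unit vector between e_{feature_index} (t=0) and w (t=1).

    Normalized linear interpolation is an assumed scheme; the text does not
    give the interpolation formula.
    """
    vec = [t * w_i for w_i in w]
    vec[feature_index] += 1.0 - t
    return _normalize(vec)

random.seed(42)
d = 512                                  # small illustrative dimension
pairs = sample_pairs(d)
j, k = pairs[0]
w_u, w_v = random_orthonormal_pair(d)    # reused across all misalignment levels
levels = [0.0, 0.25, 0.5, 0.75, 1.0]     # hypothetical misalignment grid
dirs_j = [misaligned_direction(j, w_u, t) for t in levels]
dirs_k = [misaligned_direction(k, w_v, t) for t in levels]
```

Reusing one $(w_{u},w_{v})$ per feature pair means that only the misalignment level changes within a pair, so differences between levels are not confounded with a change of random target.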
### C.4 Superellipse fitting
For each direction pair and misalignment level, we sweep $60$ mixing angles $\varphi$ uniformly on $[0,\pi/2]$. For each $\varphi$, we perform a 2500-point linear scan of the perturbation magnitude $\alpha$ on $[0,25]$ and identify the smallest $\alpha$ at which $\|f(x^{*}+\alpha\cdot\hat{d}(\varphi))-f(x^{*})\|_{2}>\tau$. The threshold is set to $\tau=0.5\cdot\max_{\alpha\leq 25}\|f(x^{*}+\alpha\cdot e_{j})-f(x^{*})\|_{2}$ for the first inactive feature $j=2$, computed per network. The superellipse exponent is estimated by minimizing $\sum_{\varphi}\left((x_{\varphi}/r_{0})^{n}+(y_{\varphi}/r_{\pi/2})^{n}-1\right)^{2}$ over $n\in[0.5,60]$, where $(x_{\varphi},y_{\varphi})=\alpha(\varphi)\cdot(\cos\varphi,\sin\varphi)$. The perturbation sweep is deterministic (no noise is added at perturbation time). All random seeds are set to 42. We show in Figure [10](https://arxiv.org/html/2606.24964#A3.F10) an example of the superelliptical boundaries we obtain for the toy model, in the style of Figure [2](https://arxiv.org/html/2606.24964#S4.F2) in the main text.
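The scan-and-fit procedure can be sketched as follows. The optimizer is an assumption: the text states only the objective and the range $n\in[0.5,60]$, and a golden-section search is used here on the premise that the objective has a single minimum on that range. The axis radii $r_{0}$ and $r_{\pi/2}$ are taken from the first and last angles of the sweep. The function names and the stand-in response curve are hypothetical.

```python
import math

def plateau_break(response, tau, alpha_max=25.0, n_scan=2500):
    """Smallest alpha on a linear grid over [0, alpha_max] with response > tau."""
    for k in range(n_scan):
        alpha = alpha_max * k / (n_scan - 1)
        if response(alpha) > tau:
            return alpha
    return None  # plateau never broken within the scanned range

def fit_superellipse_exponent(phis, alphas, n_lo=0.5, n_hi=60.0, tol=1e-7):
    """Least-squares fit of n in (x/r_0)^n + (y/r_{pi/2})^n = 1.

    Golden-section search is an assumed optimizer; it presumes the objective
    has a single minimum on [n_lo, n_hi].
    """
    r0, r1 = alphas[0], alphas[-1]       # axis radii at phi = 0 and phi = pi/2
    pts = [(abs(a * math.cos(p)) / r0, abs(a * math.sin(p)) / r1)
           for p, a in zip(phis, alphas)]

    def objective(n):
        return sum((u ** n + v ** n - 1.0) ** 2 for u, v in pts)

    g = (math.sqrt(5.0) - 1.0) / 2.0     # inverse golden ratio
    a, b = n_lo, n_hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = objective(c), objective(d)
    while b - a > tol:
        if fc < fd:                      # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = objective(c)
        else:                            # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = objective(d)
    return 0.5 * (a + b)

# Stand-in response curve (hypothetical), flat up to alpha = 3 and then rising.
alpha_star = plateau_break(lambda a: math.tanh(max(0.0, a - 3.0)), tau=0.5)

# Synthetic check: a boundary generated with a known exponent is recovered.
n_true, r = 2.91, 4.2                    # illustrative values
phis = [0.5 * math.pi * i / 59 for i in range(60)]
alphas = []
for p in phis:
    cs, sn = abs(math.cos(p)), abs(math.sin(p))
    alphas.append(r / (cs ** n_true + sn ** n_true) ** (1.0 / n_true))
n_fit = fit_superellipse_exponent(phis, alphas)
print(alpha_star, n_fit)
```

On real sweeps the boundary is noisy and the objective need not be unimodal, so a dense grid over $n$ followed by local refinement would be a more cautious choice than the search used in this sketch.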
Figure 10: Iso-plateau boundary. Plateau-breaking perturbation magnitudes for perturbations aligned with combinations of two feature directions in the toy model with $d=8192$. The fitted superellipse exponent is $2.91$ with a mean residual of $0.13\%$.

Figure [11](https://arxiv.org/html/2606.24964#A3.F11) shows the toy-model analogue of Figure [1](https://arxiv.org/html/2606.24964#S4.F1): the $L^{2}$ response as a function of perturbation magnitude along two feature axes, their equal-weight combination, and a random-direction baseline. The feature axes break the plateau first ($\alpha\approx 4.1$–$4.3$) and random directions last ($\alpha=5.93$); the combination breaks at $\alpha=4.70$, later than either axis, whereas $p=2$ would predict it breaks at the same magnitude as the axes.
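The comparison with $p=2$ can be made concrete with the superellipse equation itself: for equal axis radii $r$, the boundary at $\varphi=\pi/4$ lies at $\alpha=r\cdot 2^{1/2-1/n}$, which equals $r$ when $n=2$ and exceeds $r$ when $n>2$. The sketch below evaluates this under two assumptions: that the axis radius is $r=4.2$, the midpoint of the reported range, and that the exponent $2.91$ from Figure 10 applies to the direction pair shown in Figure 11. The function name is hypothetical.

```python
import math

def superellipse_radius(phi, r0, r1, n):
    """Boundary radius alpha(phi) of (x/r0)^n + (y/r1)^n = 1, first quadrant."""
    u = math.cos(phi) / r0
    v = math.sin(phi) / r1
    return (u ** n + v ** n) ** (-1.0 / n)

r_axis = 4.2   # assumed: midpoint of the reported 4.1 to 4.3 axis range
for n in (2.0, 2.91):
    alpha_mix = superellipse_radius(math.pi / 4, r_axis, r_axis, n)
    print(f"n={n}: equal-weight break at alpha = {alpha_mix:.2f}")
```

Under these assumptions the $n=2.91$ boundary places the equal-weight break at about $\alpha=4.68$, close to the reported $4.70$, while $n=2$ returns the axis radius itself.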
Figure 11: Toy-model response-versus-magnitude sweep. $L^{2}$ response of the toy model ($d=8192$, $H=1024$) to additive perturbations of magnitude $\alpha$ along two inactive feature axes $\mathbf{e}_{2}$ and $\mathbf{e}_{3}$, their equal-weight combination, and a random-direction baseline (median over 10 isotropic unit directions), with the plateau-breaking threshold $\tau$ (dotted) and each curve’s plateau-breaking magnitude (dashed verticals). As in Figure [1](https://arxiv.org/html/2606.24964#S4.F1), feature axes break the plateau first, their combination later, and random directions last.