# Non-linear Interventions on Large Language Models
Source: [https://arxiv.org/html/2605.14749](https://arxiv.org/html/2605.14749)
###### Abstract
Intervention is one of the most representative and widely used methods for understanding the internal representations of large language models (LLMs). However, existing intervention methods are confined to linear interventions grounded in the Linear Representation Hypothesis, leaving features encoded along non-linear manifolds beyond their reach. In this work, we introduce a general formulation of intervention that extends naturally to non-linearly represented features, together with a learning procedure that further enables intervention on *implicit* features lacking a direct output signature. We validate our framework on refusal bypass steering, where it steers the model more precisely than linear baselines by intervening on a non-linear feature governing refusal.
## 1 Introduction
Intervention is a central tool for understanding the inner representations of large language models (LLMs). By modifying a model's internal activations during inference and observing the resulting change in output, interventions yield causal evidence linking specific components to model behavior (Vig et al., [2020](https://arxiv.org/html/2605.14749#bib.bib21); Meng et al., [2022](https://arxiv.org/html/2605.14749#bib.bib20); Geiger et al., [2024](https://arxiv.org/html/2605.14749#bib.bib16)). Through intervention, we can understand how interpretable features are represented in the model's hidden states (Huang et al., [2024](https://arxiv.org/html/2605.14749#bib.bib9)) and, by steering those features, effectively control LLM behaviors that are hard to address through prompting alone, such as response style (Turner et al., [2025](https://arxiv.org/html/2605.14749#bib.bib23)), hallucination (Li et al., [2023](https://arxiv.org/html/2605.14749#bib.bib12)), and refusal (Arditi et al., [2024](https://arxiv.org/html/2605.14749#bib.bib13)).
Most existing methods for intervening on such features are linear: they modify hidden states by adding or ablating a fixed direction in activation space (Li et al., [2023](https://arxiv.org/html/2605.14749#bib.bib12); Arditi et al., [2024](https://arxiv.org/html/2605.14749#bib.bib13)). This design rests on the Linear Representation Hypothesis (LRH), which posits that interpretable concepts are encoded as directions in the model's representation space (Mikolov et al., [2013](https://arxiv.org/html/2605.14749#bib.bib10); Park et al., [2023](https://arxiv.org/html/2605.14749#bib.bib11)). Recent work, however, has shown that some concepts are instead represented along non-linear manifolds: for example, days of the week organized as a circular structure (Engels et al., [2025](https://arxiv.org/html/2605.14749#bib.bib14)), and the geometric structures used to perform counting (Gurnee et al., [2025](https://arxiv.org/html/2605.14749#bib.bib24)). Since linear interventions can, by construction, only manipulate features encoded as directions, they cannot reach this class of non-linearly represented structure.
To overcome this limitation, we introduce a general formulation of non-linear intervention. We generalize linear interventions by replacing their underlying linear transformation with an invertible non-linear feature map. In addition, we propose a procedure for learning this map via *interchange interventions* (Geiger et al., [2024](https://arxiv.org/html/2605.14749#bib.bib16)), together with a loss design that extends the procedure to *implicit* features.
We instantiate the framework on refusal bypass steering (Arditi et al., [2024](https://arxiv.org/html/2605.14749#bib.bib13); Wollschläger et al., [2025](https://arxiv.org/html/2605.14749#bib.bib17)), a representative implicit-feature intervention task. Our non-linear intervention attains steering effectiveness comparable to strong linear baselines while editing orders of magnitude fewer hidden-state locations. Further analysis indicates that this advantage stems from a genuinely non-linear feature map governing refusal that resides primarily in the model's middle layers.
Our work makes three contributions. First, we propose a general formulation of non-linear intervention that subsumes existing linear interventions and naturally extends to features encoded along non-linear manifolds. Second, we introduce a learning procedure for the non-linear feature map based on interchange interventions, together with a loss design that further enables learning *implicit* features that do not surface in the model's output. Third, we empirically apply our framework to refusal bypass steering and show that intervening on a non-linear feature governing refusal steers the model more precisely than linear baselines. Code is available at [https://anonymous.4open.science/r/nonlinear-intervention-77AC/](https://anonymous.4open.science/r/nonlinear-intervention-77AC/).
## 2 A General Formulation of Non-linear Interventions
### 2.1 Linear Interventions as Change of Basis
A common approach to intervening on a language model $\mathcal{M}$ at inference time is to identify *linear feature directions* in its representation space and perturb the hidden state along them. Given orthonormal feature directions $\{\mathbf{v}_i\}_{i=1}^{k} \subset \mathbb{R}^{d}$ and corresponding scalar coefficients $\{\alpha_i\}_{i=1}^{k}$, this linear intervention modifies the hidden state $\mathbf{h} \in \mathbb{R}^{d}$ at a chosen position as

$$\mathbf{h} \;\leftarrow\; \mathbf{h} + \sum_{i=1}^{k} \alpha_i \mathbf{v}_i. \tag{1}$$
Eq. (1) can equivalently be viewed as a change of basis. Let $W \in \mathbb{R}^{d \times d}$ be an orthogonal matrix that maps the representation space onto a *linear feature space*, satisfying $W^{\top}\mathbf{e}_i = \mathbf{v}_i$, where $\{\mathbf{e}_i\}_{i=1}^{d}$ is the standard basis of $\mathbb{R}^{d}$. Each coordinate of $W\mathbf{h}$ is then the activation value of one feature at $\mathbf{h}$, and Eq. (1) can be rewritten as

$$\mathbf{h} \;\leftarrow\; W^{-1}\!\left(W\mathbf{h} + \sum_{i=1}^{k} \alpha_i \mathbf{e}_i\right), \tag{2}$$

that is, $W$ maps $\mathbf{h}$ into the feature space, the intervention perturbs the resulting feature coordinates along their axes, and $W^{-1}$ maps back to the original representation space.
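To make the change-of-basis view concrete, here is a minimal PyTorch sketch (our own illustration, not the paper's released code) checking that Eq. (1) and Eq. (2) perform the same edit when the first $k$ rows of the orthogonal matrix $W$ are the feature directions $\mathbf{v}_i$:

```python
import torch

torch.manual_seed(0)
d, k = 8, 2

# Orthogonal W whose first k rows are the feature directions v_i,
# so that W^T e_i = v_i as in Eq. (2).
W = torch.linalg.qr(torch.randn(d, d)).Q.T
V = W[:k]                          # stacked feature directions, shape (k, d)
alpha = torch.tensor([0.5, -1.0])  # intervention coefficients
h = torch.randn(d)

# Eq. (1): perturb h directly along the feature directions.
h1 = h + alpha @ V

# Eq. (2): map into feature space, shift coordinates, map back (W^{-1} = W^T).
z = W @ h
z[:k] += alpha
h2 = W.T @ z

assert torch.allclose(h1, h2, atol=1e-5)
```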
### 2.2 Non-linear Interventions via Invertible Feature Maps
Linear interventions can only manipulate features that admit a linear encoding, yet not all interpretable features in $\mathcal{M}$'s representation space lie along linear directions (Engels et al., [2025](https://arxiv.org/html/2605.14749#bib.bib14); Kantamneni and Tegmark, [2025](https://arxiv.org/html/2605.14749#bib.bib15)). To accommodate non-linear features, we replace $W$ with an invertible *non-linear feature map* $f_{\theta} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ parameterized by $\theta$. By direct analogy with Eq. (2), the non-linear intervention takes the form

$$\mathbf{h} \;\leftarrow\; f_{\theta}^{-1}\!\left(f_{\theta}(\mathbf{h}) + \sum_{i=1}^{k} \alpha_i \mathbf{e}_i\right). \tag{3}$$

$f_{\theta}$ maps $\mathbf{h}$ to its feature-space coordinates; the intervention perturbs these coordinates along their axes by the coefficients $\{\alpha_i\}$; and $f_{\theta}^{-1}$ maps the modified features back into the original hidden-state space. Eq. (2) is recovered as the special case in which $f_{\theta}$ is restricted to a linear map.
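The sketch below illustrates Eq. (3) with a toy invertible map; the pair `f` / `f_inv` stands in for the learned feature map of Section 4, and the inverse is recovered by the same contractive fixed-point iteration used for i-ResNets in Appendix A.1:

```python
import torch

def nonlinear_intervention(h, f, f_inv, alpha):
    """Eq. (3) with k = len(alpha): edit the leading feature coordinates of h."""
    z = f(h)                  # map hidden state into non-linear feature space
    z[: len(alpha)] += alpha  # perturb the targeted coordinates along their axes
    return f_inv(z)           # map back into the residual stream

# Toy invertible map: identity plus a contractive residual (Lipschitz < 1).
f = lambda x: x + 0.5 * torch.tanh(x)

def f_inv(z, steps=50):
    # No closed-form inverse; recover x via the fixed-point iteration
    # x <- z - g(x), which converges because g is a contraction.
    x = z.clone()
    for _ in range(steps):
        x = z - 0.5 * torch.tanh(x)
    return x

h = torch.randn(4)
h_edit = nonlinear_intervention(h, f, f_inv, alpha=torch.tensor([2.0]))
assert torch.allclose(f_inv(f(h)), h, atol=1e-4)
```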
## 3 Learning $f_{\theta}$ via Interchange Intervention
### 3.1 Training Objective via Interchange Intervention
This section describes how we learn the non-linear feature map $f_{\theta}$ associated with a target feature $\mathcal{F}$. We train $f_{\theta}$ via *interchange interventions* (Geiger et al., [2024](https://arxiv.org/html/2605.14749#bib.bib16)), which provide causal supervision over $\mathcal{F}$ by transferring its value between hidden states drawn from contrasting inputs. We first prepare two sets of prompts: a positive set $\mathcal{D}^{+} = \{x_i^{+}\}_{i=1}^{N}$ of inputs that exhibit the feature $\mathcal{F}$, and a negative set $\mathcal{D}^{-} = \{x_i^{-}\}_{i=1}^{N}$ of inputs that do not. For each pair $(x^{-}, x^{+}) \in \mathcal{D}^{-} \times \mathcal{D}^{+}$, we forward both inputs through $\mathcal{M}$ up to the intervention site, obtaining hidden states $\mathbf{h}^{-}, \mathbf{h}^{+} \in \mathbb{R}^{d}$. The *interchange intervention* replaces the targeted coordinates of $f_{\theta}(\mathbf{h}^{-})$ with those of $f_{\theta}(\mathbf{h}^{+})$:

$$\mathbf{h}^{-} \;\leftarrow\; f_{\theta}^{-1}\!\left(f_{\theta}(\mathbf{h}^{-}) + \sum_{i=1}^{k} \alpha_i \mathbf{e}_i\right), \qquad \alpha_i = \big(f_{\theta}(\mathbf{h}^{+}) - f_{\theta}(\mathbf{h}^{-})\big)_i. \tag{4}$$

This recovers Eq. (3) with the coefficients pinned to $x^{+}$. Let $\mathcal{M}_{\text{int}}(x^{-}, x^{+};\, \theta)$ denote the intervened forward pass of $\mathcal{M}$. With pairs sampled as $(x^{-}, x^{+}) \sim \mathcal{D}^{-} \times \mathcal{D}^{+}$ and $\mathcal{M}$ frozen, we train $\theta$ to minimize

$$\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x^{-},\, x^{+})}\!\left[\ell\big(\mathcal{M}_{\text{int}}(x^{-}, x^{+};\, \theta)\big)\right], \tag{5}$$

where $\ell$ is designed so that its minimization makes $\mathcal{M}_{\text{int}}$ exhibit $\mathcal{F}$.
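A minimal sketch of the interchange step of Eq. (4), assuming a hook at the intervention site with access to both hidden states; `f_theta.inverse` is an assumed interface rather than the paper's exact API:

```python
import torch

def interchange(h_neg, h_pos, f_theta, k=1):
    """Eq. (4): pin the first k feature coordinates of h^- to those of h^+."""
    z_neg = f_theta(h_neg)
    z_pos = f_theta(h_pos)
    # alpha_i = (f(h^+) - f(h^-))_i, i.e., transplant the targeted coordinates.
    z_new = torch.cat([z_pos[..., :k], z_neg[..., k:]], dim=-1)
    return f_theta.inverse(z_new)
```

During training, this function replaces the hidden state in the $x^{-}$ forward pass, and the downstream loss $\ell$ of Eq. (5) is backpropagated into $\theta$ while $\mathcal{M}$ stays frozen.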
### 3.2 Loss Design for Implicit Features
When $\mathcal{F}$ has a direct output signature, $\ell$ can be defined on $\mathcal{M}$'s output distribution to encourage or suppress specific tokens. For implicit features that do not surface in the output, such as refusal or stylistic shifts, no such direct objective for training $f_{\theta}$ is available, making the loss design non-trivial. We propose a self-supervised-style loss that learns $f_{\theta}$ from data alone, by enforcing causal influence over many features correlated with $\mathcal{F}$.
For each (layer, token position) site $s$ lying causally downstream of the intervention site, we extract a feature direction $v_s \in \mathbb{R}^{d}$ as the class-mean difference of unintervened activations $h_s(x)$ over $\mathcal{D}^{+}$ versus $\mathcal{D}^{-}$. Projections onto $v_s$ are therefore correlated with $\mathcal{F}$. We keep only those sites at which $v_s^{\top} h_s(x)$ separates $x^{+}$ from $x^{-}$ with AUC above a threshold $\tau$, collecting them into $\mathcal{S}$. Both $v_s$ and $\mathcal{S}$ are computed once from $\mathcal{D}^{\pm}$ before training and held fixed.
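A sketch of this site-selection step, assuming pre-extracted unintervened activations and using scikit-learn's `roc_auc_score`; the threshold `tau` here is illustrative, not the paper's reported value:

```python
import torch
from sklearn.metrics import roc_auc_score

def select_loss_sites(acts_pos, acts_neg, tau=0.9):
    """Compute directions v_s and the site set S from unintervened activations.

    acts_pos / acts_neg map each candidate site s to an (N, d) tensor of
    activations over D^+ / D^-.
    """
    directions, sites = {}, []
    for s in acts_pos:
        v = acts_pos[s].mean(dim=0) - acts_neg[s].mean(dim=0)  # class-mean diff
        v = v / v.norm()
        scores = torch.cat([acts_pos[s] @ v, acts_neg[s] @ v]).numpy()
        labels = [1] * acts_pos[s].shape[0] + [0] * acts_neg[s].shape[0]
        if roc_auc_score(labels, scores) >= tau:  # keep only separable sites
            directions[s] = v
            sites.append(s)
    return directions, sites
```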
Letting $h_s^{\text{int}}$ denote the hidden state at $s$ under the intervened forward pass and $\mu_s^{+} = \mathbb{E}_{x \sim \mathcal{D}^{+}}[\,v_s^{\top} h_s(x)\,]$ the mean projection over $x^{+}$, we take $\ell$ in Eq. (5) to be

$$\ell(\mathcal{M}_{\text{int}}) \;=\; \sum_{s \in \mathcal{S}} \max\!\big(0,\; \mu_s^{+} - v_s^{\top} h_s^{\text{int}}\big). \tag{6}$$

The hinge form saturates once a site's projection reaches $\mu_s^{+}$, preventing overfitting to any single site. Minimizing $\mathcal{L}$ thus drives $f_{\theta}$ to discover a feature whose intervention causally aligns many $\mathcal{F}$-correlated components at once.
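A sketch of Eq. (6) as it would appear in a training loop, assuming the intervened hidden states $h_s^{\text{int}}$ have been collected from the forward pass:

```python
import torch

def implicit_loss(hidden_int, directions, mu_pos):
    """Eq. (6): hinge loss over the frozen site set S.

    hidden_int[s]: intervened hidden state h_s^int at site s, shape (d,)
    directions[s]: frozen direction v_s, shape (d,)
    mu_pos[s]:     mean positive-class projection mu_s^+, scalar
    """
    loss = 0.0
    for s, v in directions.items():
        proj = v @ hidden_int[s]
        # Saturates at zero once the projection reaches mu_s^+.
        loss = loss + torch.clamp(mu_pos[s] - proj, min=0.0)
    return loss
```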
This particular instantiation is one natural choice. The direction extractor, site-selection criterion, and surrogate loss are not fixed; each can be substituted within the framework of Eq. (5).
## 4 Experiments
### 4.1 Setup
We empirically validate our framework on refusal bypass steering in safety-aligned LLMs, a representative steering task. This setting is an implicit-feature intervention task: the target feature $\mathcal{F}$ is *bypass refusal*, which is not explicitly manifested in any specific output token of the model.
#### Models and data.
We evaluate all methods on Llama-3-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.14749#bib.bib29)) and Qwen2.5-7B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2605.14749#bib.bib30)), with model weights frozen throughout. We construct $\mathcal{D}^{+}$ from 2,000 harmless Alpaca prompts (Taori et al., [2023](https://arxiv.org/html/2605.14749#bib.bib28)) that elicit compliant responses from the model, and $\mathcal{D}^{-}$ from 2,000 harmful SALAD-Bench prompts (Li et al., [2024](https://arxiv.org/html/2605.14749#bib.bib27)) that elicit refusals.
#### Evaluation.
We quantitatively evaluate intervention quality along two axes: *how strongly* the intervened model exhibits the target feature $\mathcal{F}$, and *how much* we perturbed the model to achieve it. To measure whether the intervened model produces genuinely meaningful responses to harmful prompts rather than merely avoiding surface-level refusals, we use the StrongREJECT score (Souly et al., [2024](https://arxiv.org/html/2605.14749#bib.bib26)). StrongREJECT returns a score in $[0, 1]$ over 313 harmful prompts, computed by an LLM-based evaluator. To measure how much we intervened, we record the total intervention magnitude applied to the model. Specifically, for every edited site, we compute the $\ell_2$ distance between the pre-intervention hidden state $h$ and the post-intervention hidden state $h'$, and sum them:

$$\frac{1}{|\mathcal{D}_{\mathrm{test}}|} \sum_{x \in \mathcal{D}_{\mathrm{test}}} \sum_{s \in \mathcal{E}_x} \big\|h'_s(x) - h_s(x)\big\|_2,$$

where $\mathcal{E}_x$ is the set of sites the method edits when generating a response to prompt $x$.
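A sketch of this magnitude metric, assuming the per-site hidden states before and after editing have been recorded; this is our illustration, not the paper's evaluation code:

```python
import torch

def intervention_magnitude(edits_per_prompt):
    """Mean total L2 edit norm per test prompt (the metric above).

    edits_per_prompt: list over D_test; each element is a list of
    (h_before, h_after) tensor pairs, one per edited site in E_x.
    """
    totals = [
        sum((h_after - h_before).norm(p=2).item()
            for h_before, h_after in edits)
        for edits in edits_per_prompt
    ]
    return sum(totals) / len(totals)
```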
#### Baselines.
As baselines, we compare against two representative linear intervention methods for refusal steering: *Difference in Means* (DIM) (Arditi et al., [2024](https://arxiv.org/html/2605.14749#bib.bib13)) and *Refusal Direction Optimization* (RDO) (Wollschläger et al., [2025](https://arxiv.org/html/2605.14749#bib.bib17)). DIM extracts a refusal direction from the class-mean activation difference. RDO learns a refusal direction with ablation, addition, and retain losses. Both methods can be evaluated under two intervention schemes: *ablation*, which projects out the refusal direction at every token and every module output, and *actadd*, which adds the direction scaled by a fixed coefficient $\alpha$ at a designated layer for every token.
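For reference, both schemes reduce to one-line edits of a hidden state $h$ given a unit refusal direction $v$; a sketch for a single hidden-state vector:

```python
import torch

def ablation(h, v):
    """Project the unit refusal direction v out of h; applied at every
    token and every module output."""
    return h - (h @ v) * v

def actadd(h, v, alpha):
    """Add the refusal direction scaled by a fixed coefficient alpha;
    applied at a designated layer for every token."""
    return h + alpha * v
```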
#### Our method.
We instantiate $f_{\theta}$ as an i-ResNet (Behrmann et al., [2019](https://arxiv.org/html/2605.14749#bib.bib25)), an invertible non-linear neural network. We use $k = 1$ for direct comparison with the one-dimensional linear baselines, and heuristically select one intervention position per model. At inference time, to avoid requiring a harmless prompt $x^{+}$ for each query, we precompute the mean feature activation over a set of harmless prompts and use this cached value. Specifically, we compute the mean first-coordinate value of $f_{\theta}(h)$ over the harmless training set at the selected site, denoted $\bar{\mu}^{+}$, and clamp each test activation to this value.
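A sketch of this inference-time clamping, where `f_theta.inverse` is an assumed interface for the trained i-ResNet and `mu_bar_pos` holds the cached $\bar{\mu}^{+}$:

```python
import torch

@torch.no_grad()
def clamp_first_coordinate(h, f_theta, mu_bar_pos):
    """Inference-time steering: clamp the first feature coordinate of f(h)
    to the cached harmless-prompt mean, then map back."""
    z = f_theta(h)
    z[..., 0] = mu_bar_pos   # precomputed mean over harmless prompts
    return f_theta.inverse(z)
```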
### 4.2 Results
Figure 1: StrongREJECT scores on Llama 3 8B and Qwen 2.5 7B without intervention, with the linear baselines, and with our non-linear intervention. (Although we faithfully reproduced the official RDO repository, the RDO ActAdd variant on Qwen yielded a StrongREJECT score of 0.020 in our runs; we judged this result unreliable and therefore excluded it from the figure and table. Wollschläger et al. ([2025](https://arxiv.org/html/2605.14749#bib.bib17)) report ActAdd performance comparable to that of the Ablation variant.)

Figure 1 reports StrongREJECT scores under no intervention, the linear baselines, and our non-linear intervention. Although the non-linear method intervenes at a single position per sample, it reaches scores comparable to the baselines, which instead modify activations at a much larger number of sites.
Table 1: Per-sample intervention magnitude. *Sites*: average positions edited per example; *$L_2$*: total edit norm. Our non-linear method edits one site, with an $L_2$ norm over two orders of magnitude below every linear baseline.

As shown in Table 1, the total magnitude of the non-linear intervention is more than two orders of magnitude smaller than that of every linear baseline. This empirically suggests that our non-linear intervention captures the feature inside the model more precisely than the linear baselines do.
## 5 Analysis
### 5.1 Linear Intervention at the Loss Sites
Our objective in Eq. (6) pushes $v_s^{\top} h_s^{\text{int}}$ toward $\mu_s^{+}$ at every site $s \in \mathcal{S}$. We ask whether the gain over linear baselines comes solely from our choice of $\mathcal{S}$ and the directions $\{v_s\}$. Concretely, at every $s \in \mathcal{S}$ we apply a linear intervention along $v_s$ with a coefficient chosen so that $v_s^{\top} h_s$ is shifted to $\mu_s^{+}$. As shown in Table 2, this falls substantially short of our method on both models, indicating that $f_{\theta}$ does more than align these projections: it implements a non-linear, causal manipulation of the refusal feature that *induces* the per-site alignment as a downstream effect.
Table 2: StrongREJECT after intervening linearly at every loss site $s \in \mathcal{S}$ along $v_s$, versus our learned non-linear intervention at a single site.
### 5.2 Role of Non-linearity
To examine the role of $f_{\theta}$'s non-linearity, we re-train it under the same loss and pipeline as Section 3, but constrained to a linear map. As reported in Table 3, StrongREJECT drops sharply on both models, indicating that the non-linearity of $f_{\theta}$ contributes substantially to the effectiveness of our intervention.
Table 3: StrongREJECT when $f_{\theta}$ is constrained to a linear map, versus our i-ResNet $f_{\theta}$, trained under the same objective.
### 5.3 Layer Sweep
To check whether our non-linear intervention is effective at any layer, we repeat the same training and evaluation under the same setup while varying only the intervention layer. Figure 2 shows that, for both models, the intervention is largely ineffective at early layers, peaks at middle layers, and declines at later layers. This suggests that our method is not so powerful that it can find an effective feature map at any layer; rather, it captures a non-linear feature governing refusal and compliance that resides primarily in the middle layers.
Figure 2: StrongREJECT across intervention layers on Llama 3 8B and Qwen 2.5 7B with our non-linear intervention.
## 6 Conclusion
We overcome the linearity restriction of existing interventions by providing a general formulation of non-linear intervention. Beyond the formulation itself, we propose a procedure for learning the non-linear feature map via interchange interventions, together with a scheme for supervising this procedure on *implicit* features through hidden states of the model that are correlated with the target feature. We instantiate this framework on refusal bypass steering, a representative implicit-feature steering task; our experiments and analysis show that non-linear intervention effectively steers refusal behavior and, in doing so, uncovers a non-linear feature inside the model.

Several limitations remain, however. We realize $f_{\theta}$ only as an i-ResNet, although the same framework admits other invertible architectures. We also do not overcome a limitation shared with linear interventions: the position at which to intervene must still be chosen heuristically. We leave both directions to future work.
## References
- A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda (2024). Refusal in language models is mediated by a single direction. arXiv:[2406.11717](https://arxiv.org/abs/2406.11717).
- J. Behrmann, W. Grathwohl, R. T. Q. Chen, D. Duvenaud, and J. Jacobsen (2019). Invertible residual networks. In *Proceedings of the 36th International Conference on Machine Learning*, PMLR 97, pp. 573–582. [Link](https://proceedings.mlr.press/v97/behrmann19a.html).
- J. Engels, E. J. Michaud, I. Liao, W. Gurnee, and M. Tegmark (2025). Not all language model features are one-dimensionally linear. In *The Thirteenth International Conference on Learning Representations*. [Link](https://openreview.net/forum?id=d63a4AM4hb).
- A. Geiger, Z. Wu, C. Potts, T. Icard, and N. D. Goodman (2024). Finding alignments between interpretable causal variables and distributed neural representations. arXiv:[2303.02536](https://arxiv.org/abs/2303.02536).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, et al. (2024). The Llama 3 herd of models. arXiv:[2407.21783](https://arxiv.org/abs/2407.21783).
- W. Gurnee, E. Ameisen, I. Kauvar, T. Julius, A. Pearce, C. Olah, and J. Batson (2025). When models manipulate manifolds: the geometry of a counting task. *Transformer Circuits Thread*. [Link](https://transformer-circuits.pub/2025/linebreaks/index.html).
- J. Huang, Z. Wu, C. Potts, M. Geva, and A. Geiger (2024). RAVEL: Evaluating interpretability methods on disentangling language model representations. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pp. 8669–8687. [Link](https://aclanthology.org/2024.acl-long.470/).
- S. Kantamneni and M. Tegmark (2025). Language models use trigonometry to do addition. In *ICLR 2025 Workshop on Building Trust in Language Models and Applications*. [Link](https://openreview.net/forum?id=CqViN4dQJk).
- P. Langley (2000). Crafting papers on machine learning. In *Proceedings of the 17th International Conference on Machine Learning (ICML 2000)*, pp. 1207–1216.
- K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg (2023). Inference-time intervention: Eliciting truthful answers from a language model. In *Thirty-seventh Conference on Neural Information Processing Systems*. [Link](https://openreview.net/forum?id=aLLuYpn83y).
- L. Li, B. Dong, R. Wang, X. Hu, W. Zuo, D. Lin, Y. Qiao, and J. Shao (2024). SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models. arXiv:[2402.05044](https://arxiv.org/abs/2402.05044).
- K. Meng, D. Bau, A. Andonian, and Y. Belinkov (2022). Locating and editing factual associations in GPT. In *Advances in Neural Information Processing Systems 35*. arXiv:[2202.05262](https://arxiv.org/abs/2202.05262).
- T. Mikolov, W. Yih, and G. Zweig (2013). Linguistic regularities in continuous space word representations. In *Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pp. 746–751. [Link](https://aclanthology.org/N13-1090/).
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018). Spectral normalization for generative adversarial networks. arXiv:[1802.05957](https://arxiv.org/abs/1802.05957).
- K. Park, Y. J. Choe, and V. Veitch (2023). The linear representation hypothesis and the geometry of large language models. In *Causal Representation Learning Workshop at NeurIPS 2023*. [Link](https://openreview.net/forum?id=T0PoOJg8cK).
- Qwen Team: A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, et al. (2025). Qwen2.5 technical report. arXiv:[2412.15115](https://arxiv.org/abs/2412.15115).
- A. Souly, Q. Lu, D. Bowen, T. Trinh, E. Hsieh, S. Pandey, P. Abbeel, J. Svegliato, S. Emmons, O. Watkins, et al. (2024). A StrongREJECT for empty jailbreaks. In *Advances in Neural Information Processing Systems 37*, pp. 125416–125440.
- R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto (2023). Stanford Alpaca: An instruction-following LLaMA model. GitHub: [https://github.com/tatsu-lab/stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca).
- A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid (2025). Steering language models with activation engineering. [Link](https://openreview.net/forum?id=2XBPdPIcFK).
- J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, Y. Singer, and S. Shieber (2020). Investigating gender bias in language models using causal mediation analysis. In *Advances in Neural Information Processing Systems 33*, pp. 12388–12401. [Link](https://proceedings.neurips.cc/paper_files/paper/2020/file/92650b2e92217715fe312e6fa7b90d82-Paper.pdf).
- T. Wollschläger, J. Elstner, S. Geisler, V. Cohen-Addad, S. Günnemann, and J. Gasteiger (2025). The geometry of refusal in large language models: Concept cones and representational independence. In *Forty-second International Conference on Machine Learning*. [Link](https://openreview.net/forum?id=80IwJqlXs8).
## Appendix A Additional Experimental Details
### A.1 i-ResNet Feature Map
We instantiate the invertible feature map $f_{\theta} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ as a composition of $M = 2$ invertible residual blocks,

$$f_{\theta} \;=\; \phi_{\theta_M} \circ \cdots \circ \phi_{\theta_1}, \qquad \phi_{\theta_m}(z) \;=\; z + g_{\theta_m}(z).$$

The residual branch $g_{\theta_m} : \mathbb{R}^{d} \to \mathbb{R}^{d}$ is a two-layer MLP with hidden width 128 and a LeakyReLU non-linearity (negative slope 0.1). To guarantee invertibility via $\mathrm{Lip}(g_{\theta_m}) < 1$, we follow Behrmann et al. ([2019](https://arxiv.org/html/2605.14749#bib.bib25)) and combine two mechanisms: (i) every linear layer $W$ inside $g_{\theta_m}$ is replaced at every forward pass by $\widetilde{W} = W \cdot \min(1, 1/\hat{\sigma}(W))$, where $\hat{\sigma}(W)$ is a running estimate of the spectral norm of $W$ obtained by a single power iteration with a persistent left singular vector (Miyato et al., [2018](https://arxiv.org/html/2605.14749#bib.bib31)); (ii) the output of $g_{\theta_m}$ is rescaled by a Lipschitz coefficient $\kappa = 0.6$, yielding $\mathrm{Lip}(g_{\theta_m}) \leq \kappa < 1$. We invert each block by the contractive fixed-point iteration $x \leftarrow y - g_{\theta_m}(x)$, run for up to 30 steps with relative-residual tolerance $10^{-5}$, and propagate gradients through the inverse via implicit differentiation.
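A sketch of one such block, using PyTorch's built-in spectral normalization; note that `nn.utils.spectral_norm` always divides by the spectral-norm estimate, a slightly stronger constraint than the $\min(1, 1/\hat{\sigma}(W))$ clamp described above, and gradient propagation through the inverse via implicit differentiation is omitted here:

```python
import torch
import torch.nn as nn

class InvertibleBlock(nn.Module):
    """One i-ResNet block phi(z) = z + kappa * g(z), after Behrmann et al. (2019).

    Spectral normalization bounds each linear layer's norm by ~1, and the
    kappa rescaling makes the residual branch a contraction, so the block
    is invertible by fixed-point iteration.
    """

    def __init__(self, d, hidden=128, kappa=0.6):
        super().__init__()
        self.kappa = kappa
        self.g = nn.Sequential(
            nn.utils.spectral_norm(nn.Linear(d, hidden)),
            nn.LeakyReLU(0.1),
            nn.utils.spectral_norm(nn.Linear(hidden, d)),
        )

    def forward(self, z):
        return z + self.kappa * self.g(z)

    @torch.no_grad()
    def inverse(self, y, max_steps=30, tol=1e-5):
        # Contractive fixed-point iteration x <- y - kappa * g(x).
        x = y.clone()
        for _ in range(max_steps):
            x_next = y - self.kappa * self.g(x)
            if torch.norm(x_next - x) / (torch.norm(x) + 1e-12) < tol:
                return x_next
            x = x_next
        return x
```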
### A.2 Intervention Sites
All layer indices are zero-indexed transformer block indices. Our non-linear intervention is placed at the `block_output` representation, i.e., the residual-stream state after the selected transformer block. Token positions are negative indices into the fully formatted and tokenized chat prompt after the generation prompt has been appended; position $-1$ denotes the final prompt token immediately before generation starts.
For the main non\-linear experiments, we intervene on exactly one site per example\. The sites are:
Table 4: Main non-linear intervention sites. Token positions are counted from the end of the formatted prompt.