# Don’t Lose Focus: Activation Steering via Key-Orthogonal Projections
Source: [https://arxiv.org/html/2605.06342](https://arxiv.org/html/2605.06342)
Haoyan Luo† · Mateo Espinosa Zarlenga‡ · Mateja Jamnik†

†University of Cambridge ‡University of Oxford
###### Abstract
Activation steering controls LLM behaviour by intervening in internal representations to elicit a target behaviour, yet it often degrades reasoning and retrieval performance. We argue that a primary cause of this trade-off is *attention rerouting*: steering vectors alter query-key matching, shifting attention away from contextually important tokens toward less informative ones. To address this, we propose *Steering via Key-Orthogonal Projections* (SKOP), a steering method that constrains harmful attention rerouting without eliminating steering efficacy. SKOP achieves this by preserving attention patterns on a small set of *focus tokens* the model relies on for reasoning and retrieval, while allowing redistribution among less critical *tail tokens*. Across multiple steering benchmarks, we show that SKOP achieves the best joint steering-utility trade-off, reducing utility degradation by 5–7$\times$ while retaining over 95% of vanilla steering efficacy. Our results further suggest that, in long-context retrieval settings where vanilla steering approaches are ineffective, SKOP can maintain robust performance by avoiding attention rerouting.
## 1 Introduction
Figure 1: *Attention rerouting* due to activation steering, a key contributor to the trade-off between steering efficacy and utility preservation.

Activation steering offers a lightweight inference-time mechanism to control the behaviour of Large Language Models (LLMs) by intervening in their internal representations, avoiding costly retraining [[37](https://arxiv.org/html/2605.06342#bib.bib37), [20](https://arxiv.org/html/2605.06342#bib.bib20), [29](https://arxiv.org/html/2605.06342#bib.bib29), [34](https://arxiv.org/html/2605.06342#bib.bib34)]. This has recently emerged as an attractive mechanism for behavioural control due to its ability to elicit target behaviours, such as truthfulness [[18](https://arxiv.org/html/2605.06342#bib.bib18)] and harmful content refusal [[23](https://arxiv.org/html/2605.06342#bib.bib23), [17](https://arxiv.org/html/2605.06342#bib.bib17), [32](https://arxiv.org/html/2605.06342#bib.bib32)], in open-ended generation [[22](https://arxiv.org/html/2605.06342#bib.bib22), [39](https://arxiv.org/html/2605.06342#bib.bib39), [5](https://arxiv.org/html/2605.06342#bib.bib5), [42](https://arxiv.org/html/2605.06342#bib.bib42)]. However, although promising, a fundamental practical challenge remains largely unresolved: the trade-off between *steering efficacy* and *utility preservation*. Specifically, as the steering strength increases, or as it is applied less selectively, eliciting the target behaviour may come at the cost of degrading performance (i.e., *utility*) on unrelated capabilities, such as reasoning and retrieval [[35](https://arxiv.org/html/2605.06342#bib.bib35), [37](https://arxiv.org/html/2605.06342#bib.bib37)].
Recent work has made progress toward addressing this trade-off by improving *when* and *where* steering is applied. For example, input-conditional steering has been introduced to mitigate over-refusal by activating steering only in relevant contexts [[17](https://arxiv.org/html/2605.06342#bib.bib17), [32](https://arxiv.org/html/2605.06342#bib.bib32)]. For open-ended generation, while many approaches steer directly in the model's residual stream [[29](https://arxiv.org/html/2605.06342#bib.bib29), [44](https://arxiv.org/html/2605.06342#bib.bib44), [22](https://arxiv.org/html/2605.06342#bib.bib22), [40](https://arxiv.org/html/2605.06342#bib.bib40)], recent works suggest that attention-space steering, such as query-space steering [[36](https://arxiv.org/html/2605.06342#bib.bib36)], can be highly effective [[18](https://arxiv.org/html/2605.06342#bib.bib18), [36](https://arxiv.org/html/2605.06342#bib.bib36), [39](https://arxiv.org/html/2605.06342#bib.bib39)] and better preserve utility by being less intrusive to the residual stream. Yet, it remains unclear *how* attention-space steering alters attention patterns and which of these changes improve the trade-off.
In this work, we argue that the trade-off is driven by *attention rerouting* (Fig. [1](https://arxiv.org/html/2605.06342#S1.F1)): steering changes how attention queries match keys, which in turn changes which tokens are attended to. We focus on query-space steering for two reasons: (i) it has emerged as a particularly effective steering paradigm due to the high separability of behavioural concepts in the query space [[36](https://arxiv.org/html/2605.06342#bib.bib36), [39](https://arxiv.org/html/2605.06342#bib.bib39)], and (ii) as we show in Sec. [3](https://arxiv.org/html/2605.06342#S3), it isolates the rerouting effect into a single correctable term. In this setting, we observe that attention shifts away from a small *focus set* of tokens the model relies on for correct reasoning and retrieval (Fig. [2](https://arxiv.org/html/2605.06342#S3.F2)(A)) toward a larger *tail set* of less informative tokens, as measured by top-set attention mass preservation (Fig. [2](https://arxiv.org/html/2605.06342#S3.F2)(B)). We show that this rerouting arises because query-space steering alters the *relative* query-key scores determining attention weights (Eq. [7](https://arxiv.org/html/2605.06342#S4.E7)). While it is possible to prevent rerouting by enforcing exact attention invariance, for example, by adapting null-space constraints developed for residual steering [[32](https://arxiv.org/html/2605.06342#bib.bib32)], this completely suppresses steering efficacy (Fig. [2](https://arxiv.org/html/2605.06342#S3.F2)(C)). Hence, we observe a critical tension: while effective steering requires modifying relative attention scores, utility preservation requires that the attention patterns of important tokens remain undisturbed.
This motivates our approach: rather than eliminating attention rerouting, we selectively constrain it. For this, we introduce *Steering via Key-Orthogonal Projections* (SKOP), which, given a query-space steering vector, removes only the components that strongly shift attention from the focus set to the tail set, leaving other steering effects intact. Concretely, SKOP compares the tokens a head attends to strongly on utility tasks with those it attends to weakly, and uses differences in their key representations to identify steering components that are likely to cause harmful attention shifts. It then removes only these components and applies this correction selectively to a small set of *risk heads* that are most prone to such shifts, thereby preserving steering efficacy while safeguarding model utility. We further show that this mechanism enables robust activation steering in long-context retrieval settings, providing, to the best of our knowledge, the first demonstration of effective long-context activation steering.
Our main contributions can be summarised as follows:
1. We identify *attention rerouting*, steering-induced shifts in attention away from focus tokens, as a key mechanism behind the trade-off between query-space steering efficacy and utility preservation.
2. We propose SKOP, a steering method that suppresses steering components that shift attention away from focus tokens, retaining strong steering efficacy while preserving model utility.
3. We show that SKOP achieves the best steering-utility trade-off across multiple benchmarks, reducing utility degradation by 5–7$\times$, and enabling robust long-context activation steering.
## 2 Related Work
**Activation Steering.** Activation steering induces or suppresses specific behaviours in an LLM by modifying its latent space [[2](https://arxiv.org/html/2605.06342#bib.bib2), [35](https://arxiv.org/html/2605.06342#bib.bib35), [34](https://arxiv.org/html/2605.06342#bib.bib34)]. The predominant paradigm assumes that the linear representation hypothesis [[21](https://arxiv.org/html/2605.06342#bib.bib21), [24](https://arxiv.org/html/2605.06342#bib.bib24)] holds, and uses mean-difference vectors, representing directions in the LLM's latent space, to steer the model [[44](https://arxiv.org/html/2605.06342#bib.bib44)]. These *steering vectors* are typically constructed by analysing the last token's representations when the model is given "positive" and "negative" examples of a concept [[18](https://arxiv.org/html/2605.06342#bib.bib18), [29](https://arxiv.org/html/2605.06342#bib.bib29)]. Steering vectors can also be constructed using non-linear estimation [[27](https://arxiv.org/html/2605.06342#bib.bib27)], affine transformations [[31](https://arxiv.org/html/2605.06342#bib.bib31), [33](https://arxiv.org/html/2605.06342#bib.bib33)], or optimisation-based techniques [[42](https://arxiv.org/html/2605.06342#bib.bib42), [41](https://arxiv.org/html/2605.06342#bib.bib41)]. Recent work has demonstrated that steering on the attention layers themselves (e.g., *query-space* steering [[36](https://arxiv.org/html/2605.06342#bib.bib36), [39](https://arxiv.org/html/2605.06342#bib.bib39)]) is an effective and fine-grained control mechanism due to the separability of behavioural concepts in the query and value spaces [[36](https://arxiv.org/html/2605.06342#bib.bib36)]. However, it remains unclear how activation steering interacts with the attention patterns themselves. Our work fills this gap by identifying attention rerouting as a side-effect of query-space steering and showing that this rerouting underlies the observed steering-utility trade-off (Sec. [4](https://arxiv.org/html/2605.06342#S4)).
**Steering vs utility trade-off.** A persistent challenge of activation steering is the trade-off between steering efficacy and general model capability (i.e., utility) [[42](https://arxiv.org/html/2605.06342#bib.bib42)]. While mitigation strategies such as input-conditional steering [[17](https://arxiv.org/html/2605.06342#bib.bib17), [32](https://arxiv.org/html/2605.06342#bib.bib32)], semantic gating [[17](https://arxiv.org/html/2605.06342#bib.bib17)], targeted head selection [[18](https://arxiv.org/html/2605.06342#bib.bib18)], and feature-level decomposition [[3](https://arxiv.org/html/2605.06342#bib.bib3), [28](https://arxiv.org/html/2605.06342#bib.bib28)] have been proposed, they operate on the residual stream and have often been studied in the narrow setting of refusal steering [[17](https://arxiv.org/html/2605.06342#bib.bib17), [32](https://arxiv.org/html/2605.06342#bib.bib32)]. As a result, it remains unclear *how* these interventions affect attention patterns, or whether they can jointly improve steering efficacy *and* preserve utility in the general behavioural steering setting. Building on our analysis of attention rerouting, we propose SKOP, a mitigation method tailored to query-space steering that improves the joint steering-utility trade-off (Sec. [5](https://arxiv.org/html/2605.06342#S5)).
## 3 Preliminaries
Consider a *decoder-only transformer* with $L$ layers, each with $H$ attention heads. Here, the residual stream $\mathbf{h}^{(\ell)}\in\mathbb{R}^{t\times d}$ of layer $\ell$, where $t$ is the sequence length and $d$ is the token dimension, is:

$$\mathbf{g}^{(\ell)}=\mathbf{h}^{(\ell-1)}+a^{(\ell)}\big(\text{LN}(\mathbf{h}^{(\ell-1)})\big),\tag{1}$$

$$\mathbf{h}^{(\ell)}=\mathbf{g}^{(\ell)}+\text{MLP}^{(\ell)}\big(\text{LN}(\mathbf{g}^{(\ell)})\big),\tag{2}$$

where LN is layer normalisation and $a^{(\ell)}$ is the multi-head attention block at layer $\ell$. For simplicity, here we focus on transformers with standard multi-head attention [[38](https://arxiv.org/html/2605.06342#bib.bib38)]. Nevertheless, we note that our formulation below can be easily adapted to the grouped-query attention [[1](https://arxiv.org/html/2605.06342#bib.bib1)] used in modern LLMs.
The attention block $a^{(\ell)}$ is comprised of $H$ attention heads $\{a^{(\ell,h)}\}_{h=1}^{H}$, each parameterised by matrices $\mathbf{W}_{q}^{(\ell,h)},\mathbf{W}_{k}^{(\ell,h)},\mathbf{W}_{v}^{(\ell,h)},\mathbf{W}_{o}^{(\ell,h)}\in\mathbb{R}^{d\times d'}$, where $d'=d/H$ is the head dimension. Given the attention input $\mathbf{z}^{(\ell)}:=\text{LN}(\mathbf{h}^{(\ell-1)})$, the queries, keys, and values are $\mathbf{Q}^{(\ell,h)}=\mathbf{z}^{(\ell)}\mathbf{W}_{q}^{(\ell,h)}$, $\mathbf{K}^{(\ell,h)}=\mathbf{z}^{(\ell)}\mathbf{W}_{k}^{(\ell,h)}$, $\mathbf{V}^{(\ell,h)}=\mathbf{z}^{(\ell)}\mathbf{W}_{v}^{(\ell,h)}$, and the attention logits and outputs are:

$$s_{ij}^{(\ell,h)}=\langle\mathbf{q}_{i}^{(\ell,h)},\mathbf{k}_{j}^{(\ell,h)}\rangle/\sqrt{d'},\tag{3}$$

$$a^{(\ell,h)}(\mathbf{z}^{(\ell)})_{i}=\sum_{j=1}^{t}\alpha_{ij}^{(\ell,h)}\,\mathbf{v}_{j}^{(\ell,h)}\mathbf{W}_{o}^{(\ell,h)},\tag{4}$$

where $\alpha_{ij}^{(\ell,h)}=\text{softmax}_{j}(s_{ij}^{(\ell,h)})$ are the attention weights.
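To ground this notation, the following is a minimal NumPy sketch of a single attention head implementing Eqs. (3)–(4); the causal mask and the toy dimensions are illustrative assumptions rather than part of the formalism above, and $\mathbf{W}_o$ is taken here as $d'\times d$ so the shapes compose.

```python
import numpy as np

def attention_head(z, W_q, W_k, W_v, W_o):
    """One attention head: scaled query-key logits (Eq. 3), softmax
    weights, and attention-weighted value outputs (Eq. 4)."""
    Q, K, V = z @ W_q, z @ W_k, z @ W_v                # each (t, d')
    d_head = Q.shape[-1]
    s = Q @ K.T / np.sqrt(d_head)                      # logits s_ij, (t, t)
    # causal mask (assumed): position i attends only to positions j <= i
    s = np.where(np.tril(np.ones(s.shape, dtype=bool)), s, -np.inf)
    alpha = np.exp(s - s.max(axis=-1, keepdims=True))
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)  # weights alpha_ij
    return alpha @ V @ W_o, alpha                      # output (t, d)

# toy usage with illustrative dimensions
rng = np.random.default_rng(0)
t, d, d_head = 8, 32, 8
z = rng.normal(size=(t, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) / np.sqrt(d) for _ in range(3))
W_o = rng.normal(size=(d_head, d)) / np.sqrt(d_head)
out, alpha = attention_head(z, W_q, W_k, W_v, W_o)
```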
**Query-space steering.** Activation steering controls an LLM's behaviour by adding a fixed *steering vector* $\mathbf{r}$ to its latent representations [[29](https://arxiv.org/html/2605.06342#bib.bib29), [18](https://arxiv.org/html/2605.06342#bib.bib18), [44](https://arxiv.org/html/2605.06342#bib.bib44)]. Among these approaches, *query-space steering* [[36](https://arxiv.org/html/2605.06342#bib.bib36)] stands out since (1) it achieves strong steering efficacy [[36](https://arxiv.org/html/2605.06342#bib.bib36), [39](https://arxiv.org/html/2605.06342#bib.bib39)], and (2) as derived below, its effect on attention logits can be easily captured by a closed-form term. As these properties more easily permit the study of powerful steering methods, we focus our analysis on query-space steering.
Given a query-space steering vector $\mathbf{r}_{q}^{(\ell,h)}\in\mathbb{R}^{d'}$ – typically obtained as the mean difference between query activations on positive and negative examples of a target behaviour [[29](https://arxiv.org/html/2605.06342#bib.bib29), [18](https://arxiv.org/html/2605.06342#bib.bib18), [36](https://arxiv.org/html/2605.06342#bib.bib36)] – query-space steering modifies queries as follows:

$$\mathbf{q}_{i}^{(\ell,h)}\leftarrow\mathbf{q}_{i}^{(\ell,h)}+\lambda\mathbf{r}_{q}^{(\ell,h)},\tag{5}$$

where $\lambda\in\mathbb{R}$ controls the *steering strength*. Substituting the steered query into the logit definition $s_{ij}^{(\ell,h)}=\langle\mathbf{q}_{i}^{(\ell,h)},\mathbf{k}_{j}^{(\ell,h)}\rangle/\sqrt{d'}$, and expanding the inner product, yields the following updated logit:

$$\tilde{s}_{ij}^{(\ell,h)}:=\langle\mathbf{q}_{i}^{(\ell,h)}+\lambda\mathbf{r}_{q}^{(\ell,h)},\,\mathbf{k}_{j}^{(\ell,h)}\rangle/\sqrt{d'}=s_{ij}^{(\ell,h)}+\underbrace{\lambda\langle\mathbf{r}_{q}^{(\ell,h)},\mathbf{k}_{j}^{(\ell,h)}\rangle/\sqrt{d'}}_{\delta_{ij}^{(\ell,h)}}.\tag{6}$$

Details of this derivation can be found in App. [A](https://arxiv.org/html/2605.06342#A1). Notice that the perturbation $\delta_{ij}^{(\ell,h)}$ is the *only* term added to the logits. This isolates the attention rerouting induced by query-space steering into one tractable term, a property that does not hold for residual-stream steering (where a perturbation simultaneously alters queries, keys, values, and MLP outputs). We exploit this isolability throughout Sec. [4](https://arxiv.org/html/2605.06342#S4).
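As a quick, hypothetical-shape illustration of Eq. (6), the NumPy sketch below applies query-space steering and checks that the steered logits differ from the base logits by exactly the key-dependent term $\delta_j$, and that the relative shift between two keys follows Eq. (7):

```python
import numpy as np

rng = np.random.default_rng(0)
t, d_head, lam = 6, 8, 2.0
Q = rng.normal(size=(t, d_head))    # per-head queries (illustrative)
K = rng.normal(size=(t, d_head))    # per-head keys
r_q = rng.normal(size=d_head)       # query-space steering vector

s_base = Q @ K.T / np.sqrt(d_head)                    # Eq. (3)
s_steered = (Q + lam * r_q) @ K.T / np.sqrt(d_head)   # Eq. (5) substituted

delta = lam * (K @ r_q) / np.sqrt(d_head)             # delta_j in Eq. (6)
assert np.allclose(s_steered, s_base + delta[None, :])

# Eq. (7): the relative shift between keys j and j' depends only on
# how r_q aligns with the key difference k_j - k_j'
j, jp = 0, 1
rel = lam / np.sqrt(d_head) * r_q @ (K[j] - K[jp])
assert np.isclose(delta[j] - delta[jp], rel)
```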
Figure 2: **(A) Focus sets are small and stable across context lengths.** We group evaluation samples by total context length and, for each group, report the per-head focus-set size $|\mathcal{H}^{(\ell,h)}|$ across layers, where $|\mathcal{H}^{(\ell,h)}|$ is the minimum number of tokens needed to cover $\tau_{\text{high}}=0.8$ of the attention mass. Focus sets remain small (typically $\lesssim 15$ tokens) even as context length grows from $\sim$100 to $\sim$360 tokens. **(B) Focus-set mass drop under vanilla steering.** For varying thresholds $x$, we plot $\Pr(\Delta M\leq-x)$, where $\Delta M$ is the change in focus-set attention mass under steering (Eq. [8](https://arxiv.org/html/2605.06342#S4.E8)), aggregated across heads and decoding steps. Higher curves indicate more frequent focus-to-tail attention rerouting; the effect grows monotonically with steering strength $\lambda$. **(C) Steering-utility trade-off under direct key-orthogonal projection.** Steering score on TruthfulQA [[19](https://arxiv.org/html/2605.06342#bib.bib19)] (blue, left axis), and average utility across ARC [[6](https://arxiv.org/html/2605.06342#bib.bib6)], HellaSwag [[43](https://arxiv.org/html/2605.06342#bib.bib43)], and GSM8K [[7](https://arxiv.org/html/2605.06342#bib.bib7)] (green, right axis) versus projection strength $p$ (cf. Eq. [10](https://arxiv.org/html/2605.06342#S4.E10)). Increasing $p$ suppresses truthfulness while recovering utility, exposing a trade-off between attention invariance and steerability.
## 4 The Steering-Utility Trade-off
We identify a fundamental tension from Eq. ([6](https://arxiv.org/html/2605.06342#S3.E6)): query-space steering controls model behaviour via *attention rerouting*, yet the same mechanism can disrupt attention on critical tokens, degrading performance on utility tasks. In this section, we first characterise the rerouting mechanism, then show empirically that rerouting concentrates on a small set of utility-critical tokens, and finally show that the naive remedy of enforcing full attention invariance suppresses steering entirely.
#### Attention rerouting via relative score changes.

Softmax-based attention is invariant to row-wise constant shifts (i.e., $\text{softmax}(\mathbf{s}+c\mathbf{1})=\text{softmax}(\mathbf{s})$) [[12](https://arxiv.org/html/2605.06342#bib.bib12)]. Therefore, query-space steering affects attention if and only if the induced logit shift $\delta_{ij}^{(\ell,h)}$ varies across key positions $j$. Specifically, for a fixed query position $i$, the change in attention assigned to key position $j$ is governed by its logit shift relative to any other key position $j'$ in the same attention row, $\delta_{ij}^{(\ell,h)}-\delta_{ij'}^{(\ell,h)}$, which expands to:

$$\delta_{ij}^{(\ell,h)}-\delta_{ij'}^{(\ell,h)}=\frac{\lambda}{\sqrt{d'}}\bigl\langle\mathbf{r}_{q}^{(\ell,h)},\,\mathbf{k}_{j}^{(\ell,h)}-\mathbf{k}_{j'}^{(\ell,h)}\bigr\rangle.\tag{7}$$
Thus, steering reroutes attention by changing how queries align with key differences. Eq. ([7](https://arxiv.org/html/2605.06342#S4.E7)) also indicates that any steering vector not orthogonal to the relevant key-difference directions produces some rerouting; the question is whether the resulting rerouting is benign or whether it disrupts attention to tokens on which the model relies.
#### Utility degradation from focus-to-tail rerouting.

We next examine where steering-induced rerouting concentrates on utility data. To this end, we collect a sampled *utility calibration set* $\mathcal{D}_{\text{util}}$ from utility benchmarks spanning different domains, including maths, reasoning, and instruction following (see App. [B.1](https://arxiv.org/html/2605.06342#A2.SS1) for details). Across layers and heads on $\mathcal{D}_{\text{util}}$, attention distributions are highly sparse: for an average context length of approximately 250 tokens, fewer than 30 tokens account for 80% of the total attention mass (Fig. [2](https://arxiv.org/html/2605.06342#S3.F2)(A)). For a given layer $\ell$ and head $h$, let $\mathcal{H}^{(\ell,h)}\subset[t]$ denote the indices of high-attention tokens (the *focus set*) that collectively receive a fraction $\tau_{\text{high}}\in[0,1]$ of the attention mass on utility data. Given $\mathcal{H}^{(\ell,h)}$, let $\mathcal{L}^{(\ell,h)}$ denote the remaining low-attention (*tail*) tokens. We quantify the effect of steering on utility-critical attention via the *top-set mass preservation*:

$$\Delta M=\sum_{j\in\mathcal{H}^{(\ell,h)}}\alpha_{ij}^{\text{steered}}-\sum_{j\in\mathcal{H}^{(\ell,h)}}\alpha_{ij}^{\text{base}},\tag{8}$$

where $\alpha_{ij}^{\text{base}}$ and $\alpha_{ij}^{\text{steered}}$ denote attention weights before and after steering. A negative $\Delta M$ indicates that attention mass is shifted away from focus tokens toward tail tokens. Fig. [2](https://arxiv.org/html/2605.06342#S3.F2)(B) reports the probability that steering reduces focus-set attention mass by at least $x\%$ (i.e., $\Pr(\Delta M\leq-x)$), aggregated across heads on the utility dataset. We find that vanilla steering frequently induces large negative values of $\Delta M$, and that the severity of this focus-to-tail rerouting increases monotonically with steering strength $\lambda$. This suggests that focus-to-tail rerouting may be responsible for utility degradation.
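The sketch below illustrates this measurement for a single attention row, assuming the base and steered attention weights are available as NumPy arrays (array names are hypothetical):

```python
import numpy as np

def focus_set(alpha_row, tau_high=0.8):
    """Indices of the minimal token set covering >= tau_high attention mass."""
    order = np.argsort(alpha_row)[::-1]           # tokens by descending mass
    n = int(np.searchsorted(np.cumsum(alpha_row[order]), tau_high) + 1)
    return order[:n]

def delta_mass(alpha_base_row, alpha_steered_row, tau_high=0.8):
    """Top-set mass preservation, Eq. (8): negative values mean attention
    has been rerouted from the focus set to tail tokens."""
    H = focus_set(alpha_base_row, tau_high)
    return alpha_steered_row[H].sum() - alpha_base_row[H].sum()
```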
#### Full invariance yields a trade-off.

A naive remedy for rerouting may be to ensure that $\mathbf{r}_{q}^{(\ell,h)}$ produces no relative score changes across key positions on $\mathcal{D}_{\text{util}}$. Let $\bar{\mathbf{k}}^{(\ell,h)}=\sum_{j=1}^{t}\mathbf{k}_{j}^{(\ell,h)}/t$ and $\mathbf{K}_{c}^{(\ell,h)}=\mathbf{K}^{(\ell,h)}-\mathbf{1}_{t}(\bar{\mathbf{k}}^{(\ell,h)})^{\top}$ be centred attention keys collected from $\mathcal{D}_{\text{util}}$. From Eq. ([7](https://arxiv.org/html/2605.06342#S4.E7)), attention is invariant under query-space steering iff $\langle\mathbf{r}_{q}^{(\ell,h)},\mathbf{k}_{j}^{(\ell,h)}-\bar{\mathbf{k}}^{(\ell,h)}\rangle=0$ for all $j$, or, compactly, iff:

$$(\mathbf{r}_{q}^{(\ell,h)})^{\top}(\mathbf{K}_{c}^{(\ell,h)})^{\top}=\mathbf{0}^{\top}.\tag{9}$$

In other words, Eq. ([9](https://arxiv.org/html/2605.06342#S4.E9)) requires $\mathbf{r}_{q}^{(\ell,h)}$ to be orthogonal to every centred key, i.e., to the row space of $\mathbf{K}_{c}^{(\ell,h)}$ (see App. [A.2](https://arxiv.org/html/2605.06342#A1.SS2) for a proof that this condition is necessary and sufficient for attention invariance).
Empirically, we find that the centred key covariance $\mathbf{\Sigma}_{k}^{(\ell,h)}$ is low-rank for each head (see Fig. [10](https://arxiv.org/html/2605.06342#A3.F10) in App. [C.4](https://arxiv.org/html/2605.06342#A3.SS4)), hence orthogonality may be enforced via a projector. Let $\mathbf{U}_{k}^{(\ell,h)}\in\mathbb{R}^{d'\times p}$ contain the top-$p$ eigenvectors of $\mathbf{\Sigma}_{k}^{(\ell,h)}$. We define the orthogonal projector for the query-space steering vector by:

$$\mathbf{P}_{k}^{(\ell,h)}=\mathbf{I}_{d'}-\mathbf{U}_{k}^{(\ell,h)}(\mathbf{U}_{k}^{(\ell,h)})^{\top},\tag{10}$$

and refer to this projection as the *key-invariant projection*. As shown in Fig. [2](https://arxiv.org/html/2605.06342#S3.F2)(C), retaining the top-$p$ eigenvectors that account for as little as $20\%$ of the cumulative variance of $\mathbf{\Sigma}_{k}^{(\ell,h)}$ is sufficient for $\mathbf{P}_{k}^{(\ell,h)}\mathbf{r}_{q}^{(\ell,h)}$ to approximately satisfy Eq. ([9](https://arxiv.org/html/2605.06342#S4.E9)), effectively eliminating steering effects while largely preserving performance on utility tasks. Nevertheless, and importantly, this reveals the fundamental tension: attention rerouting is both necessary for steering efficacy *and* the source of utility degradation, so any uniform attempt to suppress it eliminates steering altogether.
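As a concrete sketch of Eq. (10), assuming centred calibration keys stacked row-wise in a NumPy array (`numpy.linalg.eigh` returns eigenvalues in ascending order, so the top-$p$ eigenvectors are the last $p$ columns):

```python
import numpy as np

def key_invariant_projector(K_centred, p):
    """Eq. (10): project out the top-p eigendirections of the centred
    key covariance, enforcing approximate attention invariance."""
    Sigma_k = K_centred.T @ K_centred / K_centred.shape[0]  # (d', d')
    _, vecs = np.linalg.eigh(Sigma_k)     # eigenvalues in ascending order
    U = vecs[:, -p:]                      # top-p eigenvectors
    d_head = Sigma_k.shape[0]
    return np.eye(d_head) - U @ U.T       # P_k

# applying it (approximately) removes all steering effect on attention:
# r_proj = key_invariant_projector(K_centred, p) @ r_q
```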
This tension motivates our approach: rather than enforcing full attention invariance, we constrain rerouting selectively, preserving relative scores between focus and tail tokens while allowing steering to redistribute attention within less utility\-critical regions\.
## 5 Steering via Key-Orthogonal Projections
Building on the tension identified in Sec. [4](https://arxiv.org/html/2605.06342#S4), we now introduce *Steering via Key-Orthogonal Projections* (SKOP, Fig. [3](https://arxiv.org/html/2605.06342#S5.F3)). In contrast to the key-invariant projection of Eq. ([10](https://arxiv.org/html/2605.06342#S4.E10)), which suppresses *all* attention rerouting, SKOP targets only the rerouting that shifts attention from focus tokens to tail tokens, enabling other rerouting that may carry useful steering signals. Given a fixed set of steering vectors, SKOP proceeds in three stages: (1) characterising *key-difference* directions associated with focus-to-tail attention rerouting (Sec. [5.1](https://arxiv.org/html/2605.06342#S5.SS1)); (2) projecting steering vectors to remove components that strongly affect these directions (Sec. [5.2](https://arxiv.org/html/2605.06342#S5.SS2)); and (3) selectively applying this projection to heads most prone to attention rerouting (Sec. [5.3](https://arxiv.org/html/2605.06342#S5.SS3)). We summarise SKOP in algorithmic form in App. [D.1](https://arxiv.org/html/2605.06342#A4.SS1).
Figure 3: Steering via Key-Orthogonal Projection (SKOP) preserves attention on focus tokens while steering model behaviour. It identifies, for each attention head, key-space directions that mediate focus-to-tail attention rerouting on utility datasets, and applies orthogonal projection to stabilise query representations during generation.

### 5.1 Characterising Utility-Critical Key Differences
In its first stage, SKOP uses the utility calibration set $\mathcal{D}_{\text{util}}$ from Sec. [4](https://arxiv.org/html/2605.06342#S4). Specifically, for each head $(\ell,h)$ and decoding step $t$, we (1) compute the baseline attention distribution $\boldsymbol{\alpha}^{(\ell,h)}$, and (2) construct the per-step focus set $\mathcal{H}_{t}^{(\ell,h)}$, the minimal token set capturing at least $\tau_{\text{high}}$ attention mass, and take the tail set $\mathcal{L}_{t}^{(\ell,h)}$ as its complement. Aggregating over the calibration sample yields head-level focus and tail sets $\mathcal{H}^{(\ell,h)}$ and $\mathcal{L}^{(\ell,h)}$.
From Eq. ([7](https://arxiv.org/html/2605.06342#S4.E7)), focus-to-tail rerouting is driven by the alignment between $\mathbf{r}_{q}^{(\ell,h)}$ and *key-difference vectors* $\Delta\mathbf{k}_{ij}^{(\ell,h)}=\mathbf{k}_{i}^{(\ell,h)}-\mathbf{k}_{j}^{(\ell,h)}$ for $i\in\mathcal{H}^{(\ell,h)}$ and $j\in\mathcal{L}^{(\ell,h)}$. We summarise these vectors over the calibration set using their second-moment matrix:

$$\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}=\mathbb{E}_{t,\,(i,j)}\!\left[\Delta\mathbf{k}_{ij}^{(\ell,h)}(\Delta\mathbf{k}_{ij}^{(\ell,h)})^{\top}\right],\tag{11}$$

where the expectation is over calibration steps $t$ and uniform sampling of $(i,j)$ from $\mathcal{H}_{t}^{(\ell,h)}\times\mathcal{L}_{t}^{(\ell,h)}$.
Empirically, we use the second moment rather than the centred covariance, as the mean key-difference is itself a high-energy direction that centring would discard. Moreover, we note that, although $\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}$ is estimated from $\mathcal{D}_{\text{util}}$, we show in App. [E.2](https://arxiv.org/html/2605.06342#A5.SS2) that SKOP is robust to both the size and domain of the calibration set: as few as 250 examples already substantially recover utility over vanilla steering, and single-domain calibration sets all yield comparable trade-offs to the mixed default. This indicates that the focus-set structure exploited by SKOP reflects stable model-internal key-space geometry, rather than properties of the calibration distribution.
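A minimal sketch of the estimator in Eq. (11) for one head, assuming per-step key matrices and focus/tail index sets have already been collected from the calibration run (the pair-sampling budget `n_pairs` is an illustrative choice):

```python
import numpy as np

def second_moment_key_diffs(keys_per_step, focus_sets, tail_sets,
                            n_pairs=64, seed=0):
    """Eq. (11): second moment of focus-to-tail key differences for one
    head, estimated by uniformly sampling (i, j) pairs at each step."""
    rng = np.random.default_rng(seed)
    d_head = keys_per_step[0].shape[-1]
    Sigma = np.zeros((d_head, d_head))
    count = 0
    for K, H, L in zip(keys_per_step, focus_sets, tail_sets):
        if len(H) == 0 or len(L) == 0:
            continue
        i = rng.choice(H, size=n_pairs)   # focus indices
        j = rng.choice(L, size=n_pairs)   # tail indices
        dk = K[i] - K[j]                  # key differences, (n_pairs, d')
        Sigma += dk.T @ dk                # accumulate outer products
        count += n_pairs
    return Sigma / max(count, 1)
```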
### 5.2 Projection onto the Key-Difference Subspace
In its second stage, SKOP uses $\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}$ from Sec. [5.1](https://arxiv.org/html/2605.06342#S5.SS1) to project each steering vector onto the subspace least coupled to focus-to-tail rerouting. Concretely, the expected squared perturbation of the focus-to-tail score gap under query-space steering $\lambda\mathbf{r}_{q}^{(\ell,h)}$ is:

$$\mathbb{E}[(\Delta g_{ij})^{2}]=\frac{\lambda^{2}}{d'}(\mathbf{r}_{q}^{(\ell,h)})^{\top}\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}\mathbf{r}_{q}^{(\ell,h)},\tag{12}$$

where $\Delta g_{ij}=\delta_{ii}^{(\ell,h)}-\delta_{ij}^{(\ell,h)}$ is the change in the score gap between a focus token $i$ and a tail token $j$ (cf. Eq. ([7](https://arxiv.org/html/2605.06342#S4.E7))). To minimise harmful attention rerouting, we remove the components of $\mathbf{r}_{q}^{(\ell,h)}$ that contribute most strongly to Eq. ([12](https://arxiv.org/html/2605.06342#S5.E12)). Mirroring the construction of the key-invariant projector in Eq. ([10](https://arxiv.org/html/2605.06342#S4.E10)), but replacing $\mathbf{\Sigma}_{k}^{(\ell,h)}$ with $\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}$, we define the SKOP projector as:

$$\mathbf{P}_{\Delta k}^{(\ell,h)}=\mathbf{I}_{d'}-\mathbf{U}_{\Delta k}^{(\ell,h)}(\mathbf{U}_{\Delta k}^{(\ell,h)})^{\top},\tag{13}$$

where $\mathbf{U}_{\Delta k}^{(\ell,h)}\in\mathbb{R}^{d'\times p}$ contains the top-$p$ eigenvectors of $\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}$. We then replace the steering vector with its projected version:

$$\tilde{\mathbf{r}}_{q}^{(\ell,h)}=\mathbf{P}_{\Delta k}^{(\ell,h)}\mathbf{r}_{q}^{(\ell,h)},\qquad\mathbf{q}_{i}^{(\ell,h)}\leftarrow\mathbf{q}_{i}^{(\ell,h)}+\lambda\tilde{\mathbf{r}}_{q}^{(\ell,h)}.\tag{14}$$

We find that $\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}$ is also low-rank for each head (Fig. [11](https://arxiv.org/html/2605.06342#A3.F11) in App. [C.4](https://arxiv.org/html/2605.06342#A3.SS4)). That is, a small number of eigenvectors account for most of the energy and therefore most of the focus-to-tail rerouting. We exploit this structure by selecting $p$ to retain a fixed fraction of the total energy:

$$\left(\sum_{i=1}^{p}\lambda_{i}\!\left(\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}\right)\right)\Big/\left(\sum_{i=1}^{d'}\lambda_{i}\!\left(\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}\right)\right)\geq\gamma_{\text{energy}},\tag{15}$$

where $\lambda_{i}(\cdot)$ denotes the $i$-th eigenvalue of $\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}$ in descending order. This selects the smallest $p$ that captures a fraction $\gamma_{\text{energy}}$ of the rerouting energy, retaining the dominant directions that drive harmful rerouting, while leaving the remaining directions untouched and available for steering.
Two properties of this construction together imply that, despite Eq. ([13](https://arxiv.org/html/2605.06342#S5.E13)) being a hard projection, SKOP preserves most of the steering capacity. First, the trade-off is insensitive to $\gamma_{\text{energy}}$: both steering and utility remain stable across $\gamma_{\text{energy}}\in[0.7,0.95]$ (see App. [E.3](https://arxiv.org/html/2605.06342#A5.SS3)), so we fix $\gamma_{\text{energy}}=0.9$ across all tasks and models. Second, projection preserves most of the norm of $\mathbf{r}_{q}^{(\ell,h)}$ across heads (see App. [C.5](https://arxiv.org/html/2605.06342#A3.SS5)). This suggests that harmful rerouting is concentrated in a small number of dominant eigendirections, leaving much of the orthogonal complement available for steering.
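Putting Eqs. (13)–(15) together, the following sketch (assuming a precomputed $\mathbf{\Sigma}_{\Delta k}$ as a NumPy array) selects the smallest rank $p$ covering $\gamma_{\text{energy}}$ of the eigenvalue mass and applies the resulting projector to the steering vector:

```python
import numpy as np

def skop_projector(Sigma_dk, gamma_energy=0.9):
    """Eqs. (13) and (15): project out the smallest set of top
    eigendirections of Sigma_dk capturing gamma_energy of the energy."""
    vals, vecs = np.linalg.eigh(Sigma_dk)       # ascending eigenvalues
    vals, vecs = vals[::-1], vecs[:, ::-1]      # reorder to descending
    ratio = np.cumsum(vals) / np.sum(vals)
    p = int(np.searchsorted(ratio, gamma_energy) + 1)  # smallest p, Eq. (15)
    U = vecs[:, :p]                             # top-p eigenvectors
    return np.eye(Sigma_dk.shape[0]) - U @ U.T  # P_{Delta k}, Eq. (13)

def skop_steer(q, r_q, Sigma_dk, lam=1.0, gamma_energy=0.9):
    """Eq. (14): add the projected steering vector to the queries."""
    r_tilde = skop_projector(Sigma_dk, gamma_energy) @ r_q
    return q + lam * r_tilde
```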
### 5.3 Selective Application to High-Risk Heads
Empirically, focus-to-tail rerouting is concentrated in a small minority of heads (Fig. [8](https://arxiv.org/html/2605.06342#A3.F8)), so projecting all heads uniformly would unnecessarily suppress steering capacity in benign heads. Therefore, we quantify a head's susceptibility to harmful rerouting via the Rayleigh quotient:

$$R^{(\ell,h)}=\frac{(\mathbf{r}_{q}^{(\ell,h)})^{\top}\mathbf{\Sigma}_{\Delta k}^{(\ell,h)}\mathbf{r}_{q}^{(\ell,h)}}{\|\mathbf{r}_{q}^{(\ell,h)}\|^{2}+\epsilon},\tag{16}$$

which is exactly the per-unit-norm version of the expected squared score-gap perturbation in Eq. ([12](https://arxiv.org/html/2605.06342#S5.E12)). Hence, we apply SKOP only to the top-$k$ heads as ordered by $R^{(\ell,h)}$. We note that this risk-based ranking is different from the discriminative-head criteria used in prior work [[18](https://arxiv.org/html/2605.06342#bib.bib18), [40](https://arxiv.org/html/2605.06342#bib.bib40), [36](https://arxiv.org/html/2605.06342#bib.bib36)]: as shown in App. [C.3](https://arxiv.org/html/2605.06342#A3.SS3), risk heads jointly drive stronger steering effects *and* larger utility drops than discriminative heads, making them the most useful targets for projection.
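A sketch of this selection step; `r_by_head` and `sigma_by_head` are assumed to be dictionaries keyed by `(layer, head)` holding each head's steering vector and $\mathbf{\Sigma}_{\Delta k}$ estimate:

```python
import numpy as np

def head_risk(r_q, Sigma_dk, eps=1e-8):
    """Eq. (16): per-unit-norm expected squared score-gap perturbation."""
    return float(r_q @ Sigma_dk @ r_q) / (float(r_q @ r_q) + eps)

def select_risk_heads(r_by_head, sigma_by_head, frac=0.2):
    """Rank heads by risk and return the top `frac` fraction: the heads
    to which the SKOP projection is applied."""
    scores = {lh: head_risk(r_by_head[lh], sigma_by_head[lh])
              for lh in r_by_head}
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(frac * len(ranked)))
    return ranked[:k]
```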
## 6 Experiments
In this section, we evaluate SKOP by studying the following research questions:
- **RQ1:** Does SKOP balance steering efficacy and utility better than previous approaches?
- **RQ2:** What is the impact of the steering strength $\lambda$ when steering with SKOP?
- **RQ3:** Can SKOP maintain high steerability and utility in long-context tasks?
**Setup.** We conduct experiments on Llama-3.1-8B-Instruct [[8](https://arxiv.org/html/2605.06342#bib.bib8)] and Gemma-2-9B-IT [[11](https://arxiv.org/html/2605.06342#bib.bib11)]. For steering evaluation, we use TruthfulQA [[19](https://arxiv.org/html/2605.06342#bib.bib19)] and three behaviours from the Model-Written Evaluation suite [[25](https://arxiv.org/html/2605.06342#bib.bib25)]: power-seeking, wealth-seeking, and corrigibility. Utility is measured on IFBench [[26](https://arxiv.org/html/2605.06342#bib.bib26)] (instruction following), ARC-Challenge [[6](https://arxiv.org/html/2605.06342#bib.bib6)] (scientific reasoning), HellaSwag [[43](https://arxiv.org/html/2605.06342#bib.bib43)] (commonsense), and GSM8K [[7](https://arxiv.org/html/2605.06342#bib.bib7)] (mathematical reasoning). We discuss all dataset details in App. [B.2](https://arxiv.org/html/2605.06342#A2.SS2).
**Baselines.** We compare SKOP against three families of steering methods. From the *residual-stream steering* methods, we include CAA [[29](https://arxiv.org/html/2605.06342#bib.bib29)], which adds mean-difference vectors directly to the residual stream and serves as a canonical residual-space baseline. From the *attention-space steering* methods, we include DISCO-Q [[36](https://arxiv.org/html/2605.06342#bib.bib36)], as an example of query-space steering, Comm Steer [[36](https://arxiv.org/html/2605.06342#bib.bib36)] and Angular Steer [[39](https://arxiv.org/html/2605.06342#bib.bib39)], as examples of attention-input steering, and ITI [[18](https://arxiv.org/html/2605.06342#bib.bib18)], as an example of head-output steering. We use mean-difference steering vectors for all attention-space methods. From the *conditional steering* methods, we include CAST [[17](https://arxiv.org/html/2605.06342#bib.bib17)], which conditionally applies mean-difference steering vectors, and SADI [[40](https://arxiv.org/html/2605.06342#bib.bib40)], which performs semantics-based modulation by dynamically constructing steering vectors. Finally, we include LoRA [[14](https://arxiv.org/html/2605.06342#bib.bib14)] as an example of parameter-efficient finetuning approaches to steering. For fair comparison, we apply steering vectors to all layers and all heads when the baseline operates on attention heads. We provide additional details in App. [D.2](https://arxiv.org/html/2605.06342#A4.SS2).
**Hyperparameters.** SKOP has three hyperparameters: the focus-mass threshold $\tau_{\text{high}}$ (Sec. [4](https://arxiv.org/html/2605.06342#S4)), the energy-coverage threshold $\gamma_{\text{energy}}$ (Eq. ([15](https://arxiv.org/html/2605.06342#S5.E15))), and the fraction of top-risk heads to project. We choose these hyperparameters using sensitivity analyses (see App. [C.3](https://arxiv.org/html/2605.06342#A3.SS3) and App. [E.3](https://arxiv.org/html/2605.06342#A5.SS3)), and fix them to $\tau_{\text{high}}=0.8$, $\gamma_{\text{energy}}=0.9$, and 20% of heads across all tasks and models.
### 6.1 Steering-Utility Trade-off Evaluation (RQ1)
Figure 4: Steering-utility trade-off for LLaMA-3.1-8B-Instruct. We report the average of Power, Wealth, and Corr [[25](https://arxiv.org/html/2605.06342#bib.bib25)]. The dashed line traces the best trade-off frontier. SKOP achieves the best joint trade-off among all steered methods.

We evaluate steering effectiveness in open-ended generation tasks using an LLM judge to score model outputs. For behaviours from the Model-Written Evaluation suite, following prior work [[29](https://arxiv.org/html/2605.06342#bib.bib29), [42](https://arxiv.org/html/2605.06342#bib.bib42), [36](https://arxiv.org/html/2605.06342#bib.bib36)], we prompt the LLM judge to assign a score (1–4) to each generation indicating how strongly the response exhibits the target behaviour. For TruthfulQA, we score outputs using the True\*Info (T\*I) metric [[19](https://arxiv.org/html/2605.06342#bib.bib19), [10](https://arxiv.org/html/2605.06342#bib.bib10)]. We use the same LLM judge prompts as Torop et al. [[36](https://arxiv.org/html/2605.06342#bib.bib36)]. For utility benchmarks, we report instruction-level accuracy under strict matching for IFBench [[26](https://arxiv.org/html/2605.06342#bib.bib26)], and standard accuracy for ARC [[6](https://arxiv.org/html/2605.06342#bib.bib6)], HellaSwag [[43](https://arxiv.org/html/2605.06342#bib.bib43)], and GSM8K [[7](https://arxiv.org/html/2605.06342#bib.bib7)]. For ARC, we use the challenge subset. Each utility score is averaged across all steering tasks. We select $\lambda$ via a sweep maximising performance on each steering task and reuse it on utility benchmarks. We provide LLaMA 3.1 results in Table [1](https://arxiv.org/html/2605.06342#S6.T1) and Fig. [4](https://arxiv.org/html/2605.06342#S6.F4), and Gemma results, showing similar trends, in App. [E.1](https://arxiv.org/html/2605.06342#A5.SS1).
Table 1: Comparison of SKOP against baselines for LLaMA-3.1-8B-Instruct [[8](https://arxiv.org/html/2605.06342#bib.bib8)]. Steering performance is evaluated using an LLM judge for Power, Wealth, and Corr, and the TruthfulQA True\*Info metric (TQA) (higher is better). Utility is measured via IFBench (IFB), ARC-Challenge (ARC), HellaSwag (HS), and GSM8K accuracy. Rank is the average rank across all steering and utility benchmarks (lower is better).

| Method | Power | Wealth | Corr | TQA | IFB | ARC | HS | GSM8K | Rank ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 1.83 | 1.71 | 1.94 | 46.1 | 26.2 | 66.3 | 70.5 | 75.2 | – |
| LoRA [[14](https://arxiv.org/html/2605.06342#bib.bib14)] | 2.31 | 1.89 | 2.68 | 55.4 | 16.8 | 38.5 | 52.6 | 29.0 | – |
| CAA [[29](https://arxiv.org/html/2605.06342#bib.bib29)] | 2.49 | 2.10 | 2.79 | 76.8 | 14.5 | 25.0 | 27.5 | 14.0 | 5.44 |
| ITI [[18](https://arxiv.org/html/2605.06342#bib.bib18)] | 2.59 | 2.14 | 2.60 | 66.8 | 10.5 | 29.1 | 45.2 | 20.6 | 4.75 |
| DISCO-Q [[36](https://arxiv.org/html/2605.06342#bib.bib36)] | 2.55 | 2.06 | 3.22 | 66.1 | 11.5 | 34.2 | 38.3 | 22.5 | 4.75 |
| Comm Steer [[36](https://arxiv.org/html/2605.06342#bib.bib36)] | 2.91 | 2.25 | 3.01 | 81.6 | 6.0 | 15.8 | 35.5 | 10.2 | 4.63 |
| Angular Steer [[39](https://arxiv.org/html/2605.06342#bib.bib39)] | 2.13 | 2.04 | 2.18 | 56.9 | 19.5 | 59.2 | 59.7 | 60.5 | 4.75 |
| CAST [[17](https://arxiv.org/html/2605.06342#bib.bib17)] | 2.04 | 1.92 | 2.28 | 58.2 | 18.2 | 40.5 | 55.0 | 41.8 | 5.63 |
| SADI [[40](https://arxiv.org/html/2605.06342#bib.bib40)] | 2.58 | 2.21 | 2.49 | 75.9 | 15.5 | 48.8 | 58.8 | 58.6 | 3.38 |
| SKOP (Ours) | 2.51 | 2.10 | 3.19 | 65.9 | 25.0 | 65.0 | 65.2 | 66.6 | 2.69 |

Fig. [4](https://arxiv.org/html/2605.06342#S6.F4) shows that SKOP achieves the best steering-utility trade-off and is the only steering method without a large utility loss. Vanilla mean-difference baselines and the residual-stream baseline attain marginally higher absolute steering scores, but crucially this comes with a severe utility cost (50–75% degradation). SKOP retains over 95% of vanilla query-space steering efficacy while reducing utility degradation to under 10%, outperforming conditional steering baselines (CAST and SADI). The same is observed on Gemma, where SKOP attains the best overall trade-off rank across all steering methods (see App. [E.1](https://arxiv.org/html/2605.06342#A5.SS1)). Moreover, as discussed in App. [E.4](https://arxiv.org/html/2605.06342#A5.SS4), SKOP also outperforms common fine-tuning approaches: SKOP matches vanilla query-space steering efficacy and substantially exceeds LoRA on steering across all training set sizes, while LoRA's utility degrades monotonically as data grows.
### 6.2 The Impact of Steering Strength (RQ2)
Fig. [5](https://arxiv.org/html/2605.06342#S6.F5) shows the effect of varying steering strength $\lambda$ across four steering tasks, comparing vanilla query steering with SKOP. We select $\lambda$ ranges that span from weak to strong steering effects until performance saturates. Across all tasks, increasing $\lambda$ improves steering strength for vanilla steering, but at the cost of rapid and monotonic utility degradation. For Wealth and Corrigibility, average utility drops to near zero at high $\lambda$ values, indicating that attention rerouting severely disrupts the model's focus on tokens that carry critical contextual information. In contrast, SKOP consistently moderates this trade-off: while projection slightly reduces the maximum achievable steering score, it substantially stabilises utility across the entire range of steering strengths.
Figure 5: Effect of varying steering strength $\lambda$ on steering efficacy and utility preservation for SKOP on LLaMA-3.1-8B-Instruct. As $\lambda$ increases, vanilla query steering vectors achieve slightly higher steering scores but suffer severe utility degradation. In contrast, SKOP maintains strong utility preservation across all $\lambda$ while preserving most steering effectiveness.

To further understand this improvement, we examine focus-set attention mass preservation based on our attention-rerouting hypothesis. As detailed in App. [C.2](https://arxiv.org/html/2605.06342#A3.SS2), we find that vanilla query-space steering induces large negative shifts in $\Delta M$ at $\lambda=4.0$: 31% of (head, decoding step) pairs lose at least 10% of their focus-set attention mass, and 22% lose at least 15%, indicating substantially diminished focus on important tokens. SKOP reduces these tail probabilities by roughly 3–10$\times$ across loss thresholds. This confirms that SKOP preserves the base model's high-confidence attention patterns, preventing harmful attention rerouting.
### 6.3 Long-Context Robustness (RQ3)
Given our motivation to reduce harmful *attention rerouting* on utility tasks, we investigate whether SKOP preserves model capability on *long-context* tasks. Specifically, we evaluate SKOP on "needle-in-a-haystack" (NIAH) tasks from the RULER benchmark [[13](https://arxiv.org/html/2605.06342#bib.bib13)], where correct behaviour crucially depends on maintaining sparse but high attention mass on important tokens among distractor tokens. NIAH tasks construct a long "haystack" (either repeated sentences or natural text [[15](https://arxiv.org/html/2605.06342#bib.bib15)]) containing one or more inserted "needles", with a query at the end that cues retrieval by matching "needles" in context and outputting the associated values.
Table 2: Long-context steering on RULER NIAH. We steer LLaMA-3.1-8B-Instruct with a formatting instruction. We report formatting compliance for responses and NIAH retrieval accuracy across context lengths.

To steer behaviour over long contexts, we use a simpler yet informative steering task, namely *formatting steering*. Inspired by prior work [[34](https://arxiv.org/html/2605.06342#bib.bib34)], we construct an instruction steering vector by taking the mean difference between activations with and without an instruction. We use a "Quotation" task with formatting instructions (*"Wrap your entire response with double quotation marks."*) in positive contexts and no instructions in negative contexts. We test LLaMA-3.1-8B-Instruct across context lengths from 1K to 16K tokens. We apply only unconditional steering, as we found that conditional methods have minimal effect on this task.
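A minimal sketch of this mean-difference construction, assuming paired activation matrices collected with and without the instruction (array names and shapes are illustrative):

```python
import numpy as np

def instruction_steering_vector(acts_with, acts_without):
    """Mean-difference steering vector from paired activations of shape
    (n_examples, d'): collected with vs. without the instruction."""
    return acts_with.mean(axis=0) - acts_without.mean(axis=0)
```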
Table [2](https://arxiv.org/html/2605.06342#S6.T2) shows that the unsteered model attains near-perfect retrieval accuracy across all tested context lengths. However, vanilla steering vectors exhibit significant sensitivity to length. While they enforce quotation formatting in shorter contexts (within 2K tokens), steering efficacy degrades as context increases: at 16K tokens, DISCO-Q's formatting compliance drops by over half. For retrieval accuracy, vanilla steering performance drops sharply beyond 4K tokens. In contrast, SKOP maintains stronger formatting compliance while preserving NIAH retrieval performance at 8K–16K tokens, narrowing the gap to the unsteered baseline. This suggests that, even in long contexts, SKOP suppresses detrimental focus-to-tail attention rerouting.
## 7 Discussion and Conclusion
**Limitations.** Our analysis focuses on query-space steering with mean-difference vectors, where attention rerouting can be isolated cleanly. Extending the framework to residual-stream steering is less direct, since residual perturbations simultaneously affect queries, keys, values, and MLP activations, and is therefore left as future work. Moreover, SKOP requires a small utility calibration set to estimate the key-difference subspace and incurs a small loss in steering efficacy relative to vanilla steering. Hence, promising future directions include adapting focus-set selection during generation and designing training objectives that directly incorporate focus preservation into steering vectors.
**Conclusion.** In this paper, we identified *attention rerouting* as a key contributor to the steering-utility trade-off in query-space activation steering. Motivated by this, we introduced SKOP, a method that removes the steering components most responsible for shifting attention away from a small set of focus tokens. Across multiple steering benchmarks, we observe that SKOP (1) achieves the strongest steering-utility trade-off among existing approaches and (2) makes activation steering viable in long-context settings where prior methods break down. This work therefore broadens the range of settings in which activation steering can serve as a practical inference-time control mechanism.
## References
- Ainslie et al. [2023] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing*, pages 4895–4901, 2023.
- Arditi et al. [2024] Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. In *NeurIPS*, 2024.
- Bayat et al. [2025] Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces. In *Second Conference on Language Modeling*, 2025. URL [https://openreview.net/forum?id=VGw1viYliK](https://openreview.net/forum?id=VGw1viYliK).
- Bisk et al. [2020] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In *Thirty-Fourth AAAI Conference on Artificial Intelligence*, 2020.
- Cao et al. [2024] Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization. In *The Thirty-eighth Annual Conference on Neural Information Processing Systems*, 2024. URL [https://openreview.net/forum?id=7qJFkuZdYo](https://openreview.net/forum?id=7qJFkuZdYo).
- Clark et al. [2018] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457, 2018.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. *CoRR*, abs/2110.14168, 2021.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 herd of models. *CoRR*, abs/2407.21783, 2024.
- Dubois et al. [2024] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. *CoRR*, abs/2404.04475, 2024.
- Evans et al. [2025] Owain Evans, James Chua, and Steph Lin. New, improved multiple-choice TruthfulQA, 2025. URL [https://www.alignmentforum.org/posts/Bunfwz6JsNd44kgLT/new-improved-multiple-choice-truthfulqa](https://www.alignmentforum.org/posts/Bunfwz6JsNd44kgLT/new-improved-multiple-choice-truthfulqa).
- Gemma Team et al. [2024] Gemma Team: Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, et al. Gemma 2: Improving open language models at a practical size. arXiv:2408.00118, 2024. URL [https://arxiv.org/abs/2408.00118](https://arxiv.org/abs/2408.00118).
- Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. *Deep Learning*. MIT Press, 2016.
- Hsieh et al. [2024] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? In *First Conference on Language Modeling*, 2024. URL [https://openreview.net/forum?id=kIoBbc76Sy](https://openreview.net/forum?id=kIoBbc76Sy).
- Hu et al. [2022] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In *The Tenth International Conference on Learning Representations, ICLR 2022*, Virtual Event, 2022. URL [https://openreview.net/forum?id=nZeVKeeFYf9](https://openreview.net/forum?id=nZeVKeeFYf9).
- Kamradt [2023] Gregory Kamradt. Needle in a haystack – pressure testing LLMs. [https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main](https://github.com/gkamradt/LLMTest_NeedleInAHaystack/tree/main), 2023.
- Kočiský et al. [2018] Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The NarrativeQA reading comprehension challenge. *Transactions of the Association for Computational Linguistics*, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL [https://aclanthology.org/Q18-1023/](https://aclanthology.org/Q18-1023/).
- Lee et al. [2025] Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. In *The Thirteenth International Conference on Learning Representations*, 2025. URL [https://openreview.net/forum?id=Oi47wc10sm](https://openreview.net/forum?id=Oi47wc10sm).
- Li et al. [2024] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. *Advances in Neural Information Processing Systems*, 36, 2024.
- Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 3214–3252, 2022.
- Luo and Specia [2024] Haoyan Luo and Lucia Specia. From understanding to utilization: A survey on explainability for large language models, 2024. URL [https://arxiv.org/abs/2401.12874](https://arxiv.org/abs/2401.12874).
- Mikolov et al. [2013] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In *Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL [https://aclanthology.org/N13-1090/](https://aclanthology.org/N13-1090/).
- Nguyen et al. [2025] Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, and Mohit Bansal. Multi-attribute steering of language models via targeted intervention. In *Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 20619–20634, Vienna, Austria, July 2025. doi: 10.18653/v1/2025.acl-long.1007. URL [https://aclanthology.org/2025.acl-long.1007/](https://aclanthology.org/2025.acl-long.1007/).
- O’Brien et al\. \[2024\]Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi\-Sangdeh\.Steering language model refusal with sparse autoencoders\.*arXiv:2411\.11296*, 2024\.URL[https://arxiv\.org/abs/2411\.11296](https://arxiv.org/abs/2411.11296)\.
- Park et al\. \[2024\]Kiho Park, Yo Joong Choe, and Victor Veitch\.The linear representation hypothesis and the geometry of large language models\.In*ICML*, 2024\.
- Perez et al\. \[2023\]Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al\.Discovering language model behaviors with model\-written evaluations\.In*Findings of the Association for Computational Linguistics: ACL 2023*, pages 13387–13434, 2023\.
- Pyatkin et al\. \[2025\]Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, and Hannaneh Hajishirzi\.Generalizing verifiable instruction following\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track*, 2025\.URL[https://openreview\.net/forum?id=yfYgwjj5F8](https://openreview.net/forum?id=yfYgwjj5F8)\.
- Qiu et al\. \[2024\]Yifu Qiu, Zheng Zhao, Yftah Ziser, Anna Korhonen, Edoardo Maria Ponti, and Shay Cohen\.Spectral editing of activations for large language model alignment\.*Advances in Neural Information Processing Systems*, 37:56958–56987, 2024\.
- Rajamanoharan et al\. \[2024\]Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Tom Lieberum, Vikrant Varma, Janos Kramar, Rohin Shah, and Neel Nanda\.Improving sparse decomposition of language model activations with gated sparse autoencoders\.In*ICML 2024 Workshop on Mechanistic Interpretability*, 2024\.URL[https://openreview\.net/forum?id=Ppj5KvzU8Q](https://openreview.net/forum?id=Ppj5KvzU8Q)\.
- Rimsky et al\. \[2024\]Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner\.Steering llama 2 via contrastive activation addition\.In Lun\-Wei Ku, Andre Martins, and Vivek Srikumar, editors,*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 15504–15522, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:[10\.18653/v1/2024\.acl\-long\.828](https://doi.org/10.18653/v1/2024.acl-long.828)\.URL[https://aclanthology\.org/2024\.acl\-long\.828/](https://aclanthology.org/2024.acl-long.828/)\.
- Rivière et al\. \[2024\]Morgane Rivière, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, Johan Ferret, Peter Liu, Pouya Tafti, Abe Friesen, Michelle Casbon, Sabela Ramos, Ravin Kumar, Charline Le Lan, Sammy Jerome, Anton Tsitsulin, Nino Vieillard, Piotr Stanczyk, Sertan Girgin, Nikola Momchev, Matt Hoffman, Shantanu Thakoor, Jean\-Bastien Grill, Behnam Neyshabur, Olivier Bachem, Alanna Walton, Aliaksei Severyn, Alicia Parrish, Aliya Ahmad, Allen Hutchison, Alvin Abdagic, Amanda Carl, Amy Shen, Andy Brock, Andy Coenen, Anthony Laforge, Antonia Paterson, Ben Bastian, Bilal Piot, Bo Wu, Brandon Royal, Charlie Chen, Chintu Kumar, Chris Perry, Chris Welty, Christopher A\. Choquette\-Choo, Danila Sinopalnikov, David Weinberger, Dimple Vijaykumar, Dominika Rogozinska, Dustin Herbison, Elisa Bandy, Emma Wang, Eric Noland, Erica Moreira, Evan Senter, Evgenii Eltyshev, Francesco Visin, Gabriel Rasskin, Gary Wei, Glenn Cameron, Gus Martins, Hadi Hashemi, Hanna Klimczak\-Plucinska, Harleen Batra, Harsh Dhand, Ivan Nardini, Jacinda Mein, Jack Zhou, James Svensson, Jeff Stanway, Jetha Chan, Jin Peng Zhou, Joana Carrasqueira, Joana Iljazi, Jocelyn Becker, Joe Fernandez, Joost van Amersfoort, Josh Gordon, Josh Lipschultz, Josh Newlan, Ju\-yeong Ji, Kareem Mohamed, Kartikeya Badola, Kat Black, Katie Millican, Keelin McDonell, Kelvin Nguyen, Kiranbir Sodhia, Kish Greene, Lars Lowe Sjösund, Lauren Usui, Laurent Sifre, Lena Heuermann, Leticia Lago, and Lilly McNealus\.Gemma 2: Improving open language models at a practical size\.*CoRR*, abs/2408\.00118, 2024\.
- Rodriguez et al\. \[2025\]Pau Rodriguez, Arno Blaas, Michal Klein, Luca Zappella, Nicholas Apostoloff, marco cuturi, and Xavier Suau\.Controlling language and diffusion models by transporting activations\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=l2zFn6TIQi](https://openreview.net/forum?id=l2zFn6TIQi)\.
- Sheng et al\. \[2025\]Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat\-Seng Chua\.Alphasteer: Learning refusal steering with principled null\-space constraint, 2025\.URL[https://arxiv\.org/abs/2506\.07022](https://arxiv.org/abs/2506.07022)\.
- Singh et al\. \[2024\]Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru\.Representation surgery: Theory and practice of affine steering\.In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,*Proceedings of the 41st International Conference on Machine Learning*, volume 235 of*Proceedings of Machine Learning Research*, pages 45663–45680\. PMLR, 21–27 Jul 2024\.URL[https://proceedings\.mlr\.press/v235/singh24d\.html](https://proceedings.mlr.press/v235/singh24d.html)\.
- Stolfo et al\. \[2025\]Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi\.Improving instruction\-following in language models through activation steering\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=wozhdnRCtw](https://openreview.net/forum?id=wozhdnRCtw)\.
- Templeton et al\. \[2024\]Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C\. Daniel Freeman, Theodore R\. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan\.Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet\.*Transformer Circuits Thread*, 2024\.URL[https://transformer\-circuits\.pub/2024/scaling\-monosemanticity/index\.html](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html)\.
- Torop et al\. \[2025\]Max Torop, Aria Masoomi, Masih Eskandar, and Jennifer Dy\.DISCO: Disentangled communication steering for large language models\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=c8AjdgdHnD](https://openreview.net/forum?id=c8AjdgdHnD)\.
- Turner et al\. \[2024\]Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J\. Vazquez, Ulisse Mini, and Monte MacDiarmid\.Steering language models with activation engineering\.*arXiv:2308\.10248*, 2024\.URL[https://arxiv\.org/abs/2308\.10248](https://arxiv.org/abs/2308.10248)\.
- Vaswani et al\. \[2017\]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin\.Attention is all you need\.*Advances in neural information processing systems*, 30, 2017\.
- Vu and Nguyen \[2025\]Hieu M\. Vu and Tan Minh Nguyen\.Angular steering: Behavior control via rotation in activation space\.In*2nd Workshop on Models of Human Feedback for AI Alignment*, 2025\.URL[https://openreview\.net/forum?id=GU2UeVZrSw](https://openreview.net/forum?id=GU2UeVZrSw)\.
- Wang et al\. \[2025\]Weixuan Wang, JINGYUAN YANG, and Wei Peng\.Semantics\-adaptive activation intervention for LLMs via dynamic steering vectors\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=8WQ7VTfPTl](https://openreview.net/forum?id=8WQ7VTfPTl)\.
- Wu et al\. \[2024\]Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts\.ReFT: Representation finetuning for language models\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*, 2024\.URL[https://openreview\.net/forum?id=fykjplMc0V](https://openreview.net/forum?id=fykjplMc0V)\.
- Wu et al\. \[2025\]Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts\.Axbench: Steering LLMs? even simple baselines outperform sparse autoencoders\.In*Forty\-second International Conference on Machine Learning*, 2025\.URL[https://openreview\.net/forum?id=K2CckZjNy0](https://openreview.net/forum?id=K2CckZjNy0)\.
- Zellers et al\. \[2019\]Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi\.Hellaswag: Can a machine really finish your sentence?In*Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics*, 2019\.
- Zou et al\. \[2023\]Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann\-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J\. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J\. Zico Kolter, and Dan Hendrycks\.Representation engineering: A top\-down approach to AI transparency\.*CoRR*, abs/2310\.01405, 2023\.
## Appendix A Attention Invariance under Query Steering

In this appendix, we prove the claim made in Sec. [3](https://arxiv.org/html/2605.06342#S3): under query-space steering, the perturbation $\delta_{ij}^{(\ell,h)}$ in Eq. [6](https://arxiv.org/html/2605.06342#S3.E6) is the unique component of the steering intervention that can affect attention weights.

We prove the result in a strictly more general setting that we call *attention-input steering*, which corresponds to the Comm Steer baseline of Vu and Nguyen [[39](https://arxiv.org/html/2605.06342#bib.bib39)] in Sec. [6](https://arxiv.org/html/2605.06342#S6). Rather than steering the query and key projections independently, attention-input steering applies a single additive perturbation $\lambda\mathbf{r}^{(\ell,h)} \in \mathbb{R}^{d}$ to the layer-normalised attention input $\mathbf{z}_{i}^{(\ell)}$ before any projection. Because $\mathbf{q}_{i}^{(\ell,h)} = \mathbf{z}_{i}^{(\ell)}\mathbf{W}_{q}^{(\ell,h)}$ and $\mathbf{k}_{j}^{(\ell,h)} = \mathbf{z}_{j}^{(\ell)}\mathbf{W}_{k}^{(\ell,h)}$ share the same input, this single perturbation propagates simultaneously to both the queries and the keys, inducing

$$\mathbf{q}_{i}^{(\ell,h)} \leftarrow \mathbf{q}_{i}^{(\ell,h)} + \lambda\mathbf{r}_{q}^{(\ell,h)}, \qquad \mathbf{k}_{j}^{(\ell,h)} \leftarrow \mathbf{k}_{j}^{(\ell,h)} + \lambda\mathbf{r}_{k}^{(\ell,h)}, \tag{17}$$

where $\mathbf{r}_{q}^{(\ell,h)} = \mathbf{r}^{(\ell,h)}\mathbf{W}_{q}^{(\ell,h)}$ and $\mathbf{r}_{k}^{(\ell,h)} = \mathbf{r}^{(\ell,h)}\mathbf{W}_{k}^{(\ell,h)}$ are the projections of $\mathbf{r}^{(\ell,h)}$ into the query and key spaces of head $(\ell,h)$. (We omit the value-side projection $\mathbf{r}_{v}^{(\ell,h)} = \mathbf{r}^{(\ell,h)}\mathbf{W}_{v}^{(\ell,h)}$ from the analysis because the value projection does not enter the attention logits of the current layer and therefore cannot induce attention rerouting at this layer.) Query-space steering, as defined in Eq. [5](https://arxiv.org/html/2605.06342#S3.E5), corresponds to the special case in which only the query-side perturbation is non-zero, that is, $\mathbf{r}_{k}^{(\ell,h)} = \mathbf{0}$. The attention-input formulation also covers grouped-query attention [[1](https://arxiv.org/html/2605.06342#bib.bib1)], since the analysis below applies independently to each query head and its associated key–value head.

We show that even in this more general setting, the only component of the steering intervention that can change attention weights is the query-side term acting on the keys, that is, the term identified in Eq. [6](https://arxiv.org/html/2605.06342#S3.E6). The key-side perturbation contributes only row-wise constant shifts that are absorbed by the softmax.
### A.1 Decomposition of the Perturbed Logits

Following the notation of Sec. [3](https://arxiv.org/html/2605.06342#S3), the perturbed attention logit between query position $i$ and key position $j$ is

$$\tilde{s}_{ij}^{(\ell,h)} = \frac{\big\langle \mathbf{q}_{i}^{(\ell,h)} + \lambda\mathbf{r}_{q}^{(\ell,h)},\; \mathbf{k}_{j}^{(\ell,h)} + \lambda\mathbf{r}_{k}^{(\ell,h)} \big\rangle}{\sqrt{d^{\prime}}}. \tag{18}$$

Expanding the inner product yields four scalar terms:

$$\tilde{s}_{ij}^{(\ell,h)} = s_{ij}^{(\ell,h)} + \underbrace{\frac{\lambda\langle \mathbf{q}_{i}^{(\ell,h)}, \mathbf{r}_{k}^{(\ell,h)} \rangle}{\sqrt{d^{\prime}}}}_{\text{(a) depends on } i \text{ only}} + \underbrace{\frac{\lambda\langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{k}_{j}^{(\ell,h)} \rangle}{\sqrt{d^{\prime}}}}_{\text{(b) depends on } j} + \underbrace{\frac{\lambda^{2}\langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{r}_{k}^{(\ell,h)} \rangle}{\sqrt{d^{\prime}}}}_{\text{(c) constant}}, \tag{19}$$

where $s_{ij}^{(\ell,h)}$ is the unperturbed logit. Term (a) varies with the query position $i$ but is constant across all key positions $j$ within a row. Term (c) is constant across all $(i,j)$. Term (b) is precisely the perturbation $\delta_{ij}^{(\ell,h)}$ in Eq. [6](https://arxiv.org/html/2605.06342#S3.E6) of the main text.

The attention weights from query position $i$ are obtained by applying the softmax over $j$:

$$\tilde{\alpha}_{ij}^{(\ell,h)} = \mathrm{softmax}_{j}\big(\tilde{s}_{ij}^{(\ell,h)}\big). \tag{20}$$

Because the softmax is invariant to constant shifts within a row, that is, $\mathrm{softmax}_{j}(s_{ij} + c_{i}) = \mathrm{softmax}_{j}(s_{ij})$ for any $c_{i}$ independent of $j$ [[12](https://arxiv.org/html/2605.06342#bib.bib12)], both term (a) and term (c) are absorbed: term (a) is constant in $j$ for each fixed $i$, and term (c) is globally constant. Only term (b), which genuinely varies across keys, can change the attention weights.

#### Proposition (Uniqueness of the rerouting term)

Under attention-input steering, $\tilde{\alpha}_{ij}^{(\ell,h)} = \alpha_{ij}^{(\ell,h)}$ for all $i,j$ if and only if the term $\langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{k}_{j}^{(\ell,h)} \rangle$ is constant across key positions $j$.

*Proof.* The forward direction follows from the decomposition above: terms (a) and (c) leave the softmax invariant for any choice of $\mathbf{r}_{q}^{(\ell,h)}, \mathbf{r}_{k}^{(\ell,h)}$, so attention is invariant precisely when term (b) is also constant in $j$. For the converse, if term (b) varies across $j$, then $\tilde{s}_{ij}^{(\ell,h)} - s_{ij}^{(\ell,h)}$ is non-constant in $j$ for some $i$, and the softmax is strictly monotone in such variations, so the attention weights change. $\square$
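The decomposition is straightforward to verify numerically. The following NumPy sketch (toy dimensions and random vectors, not the paper's code) checks that a pure key-side perturbation leaves attention unchanged, that a generic query-side perturbation reroutes it, and that a query-side perturbation orthogonal to all centred keys, anticipating App. A.2, leaves it unchanged again:

```python
import numpy as np

rng = np.random.default_rng(0)
t, d = 8, 16                          # sequence length and head dimension (toy)
Q = rng.normal(size=(t, d))           # unperturbed queries q_i
K = rng.normal(size=(t, d))           # unperturbed keys k_j
r_q, r_k = rng.normal(size=d), rng.normal(size=d)
lam = 2.0                             # steering strength lambda

def attention(Q, K):
    s = Q @ K.T / np.sqrt(d)                        # logits s_ij
    e = np.exp(s - s.max(axis=1, keepdims=True))    # row-wise softmax
    return e / e.sum(axis=1, keepdims=True)

A0 = attention(Q, K)

# Key-side perturbation only: contributes term (a), constant in j per row,
# so the softmax absorbs it and attention is unchanged.
print(np.allclose(attention(Q, K + lam * r_k), A0))        # True

# Query-side perturbation: contributes term (b), which varies with j,
# so attention is rerouted.
print(np.allclose(attention(Q + lam * r_q, K), A0))        # False

# Query-side perturbation orthogonal to all key variations around the mean:
# term (b) becomes constant in j, and attention is again unchanged.
Kc = K - K.mean(axis=0)                                    # centred keys
_, S, Vt = np.linalg.svd(Kc)
rows = Vt[: (S > 1e-10).sum()]                             # row space of Kc
r_perp = r_q - rows.T @ (rows @ r_q)
print(np.allclose(attention(Q + lam * r_perp, K), A0))     # True
```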
### A.2 Query-Space Steering as a Special Case

Setting $\mathbf{r}_{k}^{(\ell,h)} = \mathbf{0}$ recovers query-space steering. Terms (a) and (c) of Eq. ([19](https://arxiv.org/html/2605.06342#A1.E19)) vanish identically and only term (b), namely $\delta_{ij}^{(\ell,h)}$, remains. This confirms the claim in Sec. [3](https://arxiv.org/html/2605.06342#S3) that the row-varying term is the unique component of the steering perturbation that can affect attention weights. This derivation aligns with the query-space steering analysis of Torop et al. [[36](https://arxiv.org/html/2605.06342#bib.bib36)].

By the proposition, attention is invariant if and only if $\langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{k}_{j}^{(\ell,h)} \rangle$ is constant across $j \in \{1,\dots,t\}$. Decomposing each key as $\mathbf{k}_{j}^{(\ell,h)} = \bar{\mathbf{k}}^{(\ell,h)} + (\mathbf{k}_{j}^{(\ell,h)} - \bar{\mathbf{k}}^{(\ell,h)})$, where $\bar{\mathbf{k}}^{(\ell,h)} = \frac{1}{t}\sum_{s=1}^{t} \mathbf{k}_{s}^{(\ell,h)}$ is the mean key on $\mathcal{D}_{\text{util}}$, we obtain

$$\langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{k}_{j}^{(\ell,h)} \rangle = \underbrace{\langle \mathbf{r}_{q}^{(\ell,h)}, \bar{\mathbf{k}}^{(\ell,h)} \rangle}_{\text{constant in } j} + \langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{k}_{j}^{(\ell,h)} - \bar{\mathbf{k}}^{(\ell,h)} \rangle. \tag{21}$$

The first term does not depend on $j$, so the full inner product is constant in $j$ if and only if the second term vanishes for every $j$:

$$\langle \mathbf{r}_{q}^{(\ell,h)}, \mathbf{k}_{j}^{(\ell,h)} - \bar{\mathbf{k}}^{(\ell,h)} \rangle = 0, \qquad \forall j \in \{1,\dots,t\}. \tag{22}$$

Stacking these $t$ scalar conditions into a single row-vector identity, with $\mathbf{K}_{c}^{(\ell,h)} = \mathbf{K}^{(\ell,h)} - \mathbf{1}_{t}(\bar{\mathbf{k}}^{(\ell,h)})^{\top} \in \mathbb{R}^{t \times d^{\prime}}$ denoting the centred key matrix, yields the matrix-form invariance condition

$$(\mathbf{r}_{q}^{(\ell,h)})^{\top}\mathbf{K}_{c}^{(\ell,h)} = \mathbf{0}_{t}^{\top}, \tag{23}$$

which is precisely Eq. ([9](https://arxiv.org/html/2605.06342#S4.E9)) of Sec. [4](https://arxiv.org/html/2605.06342#S4). Geometrically, $\mathbf{r}_{q}^{(\ell,h)}$ must be orthogonal to the column space of $\mathbf{K}_{c}^{(\ell,h)}$, i.e., to all variations of the keys around their mean on $\mathcal{D}_{\text{util}}$. The key-invariant projector $\mathbf{P}_{k}^{(\ell,h)}$ in Eq. ([10](https://arxiv.org/html/2605.06342#S4.E10)) of the main text enforces exactly this orthogonality by removing the components of $\mathbf{r}_{q}^{(\ell,h)}$ in the top-$p$ eigendirections of the centred key covariance $\boldsymbol{\Sigma}_{k}^{(\ell,h)}$.
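To make the projector concrete, here is a minimal sketch under an assumed synthetic key distribution with a planted low-rank structure (the real projectors are estimated from keys recorded on $\mathcal{D}_{\text{util}}$, App. B.1). It removes the top-$p$ eigendirections of the centred key covariance, then checks the invariance condition of Eq. (23) and how much of the steering vector's norm survives:

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, p = 512, 64, 5                    # tokens, head dim, projection rank (toy)

# Synthetic keys: p dominant directions plus small isotropic noise, mimicking
# the rapidly decaying key spectra reported in App. C.4 (an assumption here).
B = rng.normal(size=(p, d))
K = rng.normal(size=(t, p)) @ B + 0.05 * rng.normal(size=(t, d))

k_bar = K.mean(axis=0)
Kc = K - k_bar                          # centred key matrix K_c
Sigma_k = Kc.T @ Kc / t                 # centred key covariance Sigma_k

evals, evecs = np.linalg.eigh(Sigma_k)  # ascending eigenvalues
U_p = evecs[:, -p:]                     # top-p eigendirections
P_k = np.eye(d) - U_p @ U_p.T           # key-invariant projector (Eq. 10 form)

r_q = rng.normal(size=d)                # a raw query-space steering vector
r_proj = P_k @ r_q

# Eq. (23) holds approximately: the rerouting term <r, k_j - k_bar> collapses.
print(np.abs(Kc @ r_q).max(), np.abs(Kc @ r_proj).max())
# Most of the steering norm survives (cf. the norm analysis in App. C.5).
print(np.linalg.norm(r_proj) / np.linalg.norm(r_q))
```

In SKOP itself, the analogous projector is built from the key-difference second-moment matrix $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ and applied only on high-risk heads (App. C.3, Algorithm 1).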
## Appendix B Dataset Details

### B.1 Utility Calibration Dataset Construction

Identifying which key directions matter for utility requires observing how the model itself addresses tokens on real inputs; synthetic or random inputs would yield attention distributions disconnected from the model's behaviour and therefore could not localise utility-critical tokens. To characterise utility-critical attention patterns, we construct a utility calibration set $\mathcal{D}_{\text{util}}$ from a diverse collection of standard language understanding and reasoning benchmarks. Specifically, we sample 1,000 datapoints from each of the following datasets: GSM8K (mathematical reasoning) [[7](https://arxiv.org/html/2605.06342#bib.bib7)], Alpaca (instruction following) [[9](https://arxiv.org/html/2605.06342#bib.bib9)], PIQA (physical commonsense reasoning) [[4](https://arxiv.org/html/2605.06342#bib.bib4)], and NarrativeQA (long-context reading comprehension) [[16](https://arxiv.org/html/2605.06342#bib.bib16)], yielding a total of 4,000 calibration examples.

For each data point, we use the original prompt or question text provided by the dataset and perform a single forward pass through the model without any steering applied. During this pass, we record the key representations $\mathbf{k}_{j}^{(\ell,h)}$ for every layer $\ell$, attention head $h$, and token position $j$. These key vectors are used to compute per-head statistics, including mean keys, centred key matrices, and key covariance estimates. All statistics derived from $\mathcal{D}_{\text{util}}$ are computed offline and are fixed for a given model: calibration requires only a single forward pass per example with no gradient computation, takes under 5 minutes on a single GPU for 4,000 examples on LLaMA-3.1-8B-Instruct, and the resulting projectors are reused across all steering tasks and inputs without further updates. At inference, SKOP adds only a single matrix–vector multiply per risk head, introducing negligible latency overhead. Calibration data are used only to estimate utility-relevant attention structure and are never used during generation or evaluation. We further analyse the sensitivity of SKOP to the calibration set's domain composition and size in App. [E.2](https://arxiv.org/html/2605.06342#A5.SS2), where we show that effective projectors can be obtained from substantially smaller and more narrowly scoped calibration sets than our default.
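For illustration, the calibration pass can be implemented with forward hooks along the following lines. This is a sketch rather than the released code: the module paths assume the Hugging Face implementation of LLaMA-style models, keys are captured at the `k_proj` output (before rotary position embedding, a detail not pinned down above), and the prompt list stands in for the full 4,000-example $\mathcal{D}_{\text{util}}$.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # as used in App. B.1
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

captured = {}  # (layer, head) -> list of (seq_len, head_dim) key tensors

def make_hook(layer_idx, head_dim):
    def hook(module, inputs, output):
        # k_proj output: (batch, seq, num_kv_heads * head_dim); split per head.
        k = output.detach().float().view(*output.shape[:2], -1, head_dim)
        for h in range(k.shape[2]):
            captured.setdefault((layer_idx, h), []).append(k[0, :, h, :])
    return hook

head_dim = model.config.hidden_size // model.config.num_attention_heads
handles = [layer.self_attn.k_proj.register_forward_hook(make_hook(i, head_dim))
           for i, layer in enumerate(model.model.layers)]

calibration_prompts = [  # stand-ins for the actual D_util examples
    "Natalia sold clips to 48 of her friends. How many clips did she sell?",
]
with torch.no_grad():
    for prompt in calibration_prompts:
        model(**tok(prompt, return_tensors="pt"))

for h in handles:
    h.remove()

# Per-head statistics SKOP needs downstream: mean key, centred keys, covariance.
stats = {}
for head, chunks in captured.items():
    K = torch.cat(chunks, dim=0)          # (total_tokens, head_dim)
    k_bar = K.mean(dim=0)
    Kc = K - k_bar
    stats[head] = {"mean_key": k_bar, "key_cov": Kc.T @ Kc / K.shape[0]}
```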
### B.2 Steering Datasets

To evaluate the effectiveness of steering vectors in mitigating sycophancy and shaping latent behaviours, we use the updated 2025 release of TruthfulQA [[10](https://arxiv.org/html/2605.06342#bib.bib10)] together with the Anthropic Model-Written Evaluation (MWE) suite [[25](https://arxiv.org/html/2605.06342#bib.bib25)]. The updated TruthfulQA serves as a primary benchmark for assessing a model's resistance to human misconceptions and “face-saving” responses. This version consists of 791 binary multiple-choice questions, each pairing the correct answer with a deliberately constructed “Best Incorrect Answer”, forming an adversarial setup that probes the model's commitment to factual correctness.

The MWE suite [[25](https://arxiv.org/html/2605.06342#bib.bib25)] is designed to surface latent personas and behavioural tendencies through approximately 24,500 automatically generated questions spanning 16 behavioural categories. Each question presents two choices: one that exhibits the target behaviour and one that does not. We focus on the *less-hhh* subset of the Corr dataset, which is specifically constructed to elicit behaviours that deviate from conventional helpfulness, honesty, and harmlessness. These questions range from relatively benign preferences (e.g., favouring creativity over strict factual accuracy) to more adversarial prompts. Together, these datasets provide a controlled yet diverse testbed for steering methods, where suppressing undesirable behaviours is often straightforward, while amplifying them presents a greater challenge for instruction-tuned models.

Unless otherwise specified, we use GPT-4o as the LLM judge for all steering-evaluation experiments.
### B.3 Utility Benchmarks

#### Instruction Following

We evaluate fine-grained constraint adherence using IFBench [[26](https://arxiv.org/html/2605.06342#bib.bib26)], a benchmark specifically designed to measure generalisation to out-of-domain (OOD) instructions. In contrast to earlier instruction-following benchmarks dominated by standardised tasks, IFBench defines 58 verifiable constraints organised into seven categories, including count, ratio, and formatting requirements. The benchmark contains 300 prompts derived from real-world user interactions collected from WildChat, ensuring that the evaluation reflects realistic usage rather than synthetic templates. IFBench additionally supports Reinforcement Learning with Verifiable Rewards (RLVR), allowing us to quantify trade-offs between strict constraint satisfaction (e.g., exact word counts) and degradation in the semantic quality of task outputs.
#### Model Reasoning
To ensure that steering interventions do not compromise core reasoning capabilities, we evaluate performance on three established crystallised-intelligence benchmarks. Scientific reasoning is assessed using the AI2 Reasoning Challenge (ARC) [[6](https://arxiv.org/html/2605.06342#bib.bib6)], specifically the Challenge Set of 2,590 questions, which excludes items that can be solved via shallow information retrieval. Commonsense reasoning is measured using HellaSwag [[43](https://arxiv.org/html/2605.06342#bib.bib43)], which employs Adversarial Filtering (AF) to iteratively refine distractor endings, maintaining task difficulty for state-of-the-art models. Finally, mathematical reasoning is evaluated using GSM8K [[7](https://arxiv.org/html/2605.06342#bib.bib7)], a collection of 8,000 multi-step arithmetic word problems requiring precise symbolic manipulation over 2–8 reasoning steps. GSM8K serves as a sensitive indicator of disruptions to long-chain reasoning and attention coherence. Due to computational constraints, we sample 100 problems from the GSM8K test set for evaluation.
#### Long-Context Analysis
To assess the *effective* context length of steered models, we use RULER [[13](https://arxiv.org/html/2605.06342#bib.bib13)], a synthetic benchmark designed to test retrieval and aggregation beyond simple recall. RULER extends traditional “Needle-in-a-Haystack” [[15](https://arxiv.org/html/2605.06342#bib.bib15)] evaluations by introducing 13 tasks across four categories: Retrieval, Multi-hop Tracing, Aggregation, and Question Answering. The benchmark varies needle types (e.g., UUIDs versus natural-language tokens) and includes challenging retrieval configurations such as Multi-keys (MK-NIAH) and Multi-values (MV-NIAH), which require models to ignore dense distractors and retrieve complete information sets. This design enables the identification of non-linear degradation patterns, including the “Lost in the Middle” effect and attention sparsity, which commonly arise when models are evaluated beyond their training context lengths. Since our setting evaluates steering efficacy under long contexts for the first time, we adopt the simpler S-NIAH tasks.
## Appendix C Further Analysis of SKOP

In this appendix, we provide additional analyses that support the design and behaviour of SKOP. We begin with a qualitative case study illustrating the failure mode of vanilla query-space steering (App. [C.1](https://arxiv.org/html/2605.06342#A3.SS1)). We then provide quantitative evidence for our central mechanistic claim that SKOP preserves attention mass on focus-set tokens (App. [C.2](https://arxiv.org/html/2605.06342#A3.SS2)). The remaining subsections justify SKOP's three core design choices: selective application to high-risk heads (App. [C.3](https://arxiv.org/html/2605.06342#A3.SS3)), the low-rank structure that makes the projection efficient (App. [C.4](https://arxiv.org/html/2605.06342#A3.SS4)), and the preservation of steering-vector norm under projection (App. [C.5](https://arxiv.org/html/2605.06342#A3.SS5)).
### C.1 Qualitative Case Study

Fig. [6](https://arxiv.org/html/2605.06342#A3.F6) presents a qualitative case study illustrating a common failure mode of vanilla query-space steering. When prompted with GSM8K questions under power-seeking steering, the model frequently ignores explicitly provided numerical quantities and fails in characteristic ways, for example claiming insufficient information despite all required values being present in the prompt. This pattern is consistent with our attention-rerouting hypothesis: steering shifts attention away from the numerical tokens the model would otherwise rely on, leading to outputs that look semantically plausible but are factually disconnected from the input.

Figure 6: Case study illustrating model failure under power-seeking query-space steering on GSM8K questions. The steered model ignores numerical quantities present in the prompt, producing outputs that report missing information despite all required values being present.
### C.2 Focus-Set Attention Mass Preservation under SKOP

Sec. [6.2](https://arxiv.org/html/2605.06342#S6.SS2) reports that SKOP suppresses focus-to-tail attention rerouting at high steering strengths. Here we provide the underlying empirical evidence. Following the definition of $\Delta M$ in Eq. [8](https://arxiv.org/html/2605.06342#S4.E8), we compare the distribution of focus-set mass changes under vanilla query-space steering and under SKOP at $\lambda = 4.0$ on the Power steering task. Fig. [7](https://arxiv.org/html/2605.06342#A3.F7) reports the empirical probability $\Pr(\Delta M \leq -x)$ across heads and decoding steps for thresholds $x \in [0, 0.2]$.

Vanilla query-space steering induces substantial negative shifts in focus-set attention mass: 31% of (head, decoding step) pairs lose at least 10% of their focus-set mass, 22% lose at least 15%, and 10% lose at least 25%. Under SKOP, these tail probabilities are reduced by roughly 3–10× across thresholds, falling to 10%, 4%, and 1% respectively, so severe focus-to-tail rerouting becomes rare. This confirms that SKOP preserves the base model's high-confidence addressing patterns rather than merely improving aggregate utility scores, providing direct evidence that SKOP operates through the focus-preservation mechanism we posited in Sec. [4](https://arxiv.org/html/2605.06342#S4).

Figure 7: Focus-set attention mass preservation under vanilla query-space steering and under SKOP at $\lambda = 4.0$ on the Power steering task. The $y$-axis reports the empirical probability $\Pr(\Delta M \leq -x)$ that a (head, decoding step) pair loses at least an $x$ fraction of focus-set attention mass. SKOP shifts the distribution sharply toward zero, indicating that focus tokens retain their attention mass under steering.
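Given a per-(head, decoding step) array of mass changes $\Delta M$, the tail curve in Fig. 7 is simple to compute. A sketch with synthetic stand-in values (the real $\Delta M$ values come from Eq. (8) evaluated on the Power task):

```python
import numpy as np

rng = np.random.default_rng(0)
delta_m = rng.normal(loc=-0.05, scale=0.08, size=10_000)  # stand-in dM values

for x in (0.10, 0.15, 0.25):
    tail = (delta_m <= -x).mean()                         # Pr(dM <= -x)
    print(f"Pr(dM <= -{x:.2f}) = {tail:.3f}")
```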
### C.3 Selective Projection on High-Risk Heads

#### Distribution of head risk scores

We first examine the distribution of head risk scores $R^{(\ell,h)}$ (Eq. [16](https://arxiv.org/html/2605.06342#S5.E16)) across the four steering tasks. Fig. [8](https://arxiv.org/html/2605.06342#A3.F8) reports the per-head risk scores aggregated over all layers. The distribution is sharply long-tailed: across all four tasks, the bulk of heads attain near-zero risk scores, while only a small minority, roughly the top 10–20%, exhibit substantially elevated values. This pattern holds consistently across Power, Wealth, Corr, and TQA, indicating that susceptibility to harmful focus-to-tail rerouting is a sparse, structural property of attention heads rather than a task-specific artefact. This sparsity is the empirical justification for SKOP's selective application strategy: projecting all heads uniformly would suppress steering capacity in the large benign majority for no utility gain, whereas restricting projection to the top-$k$ risk heads targets exactly the heads where rerouting concentrates.

Figure 8: Distribution of head risk scores $R^{(\ell,h)}$ (Eq. [16](https://arxiv.org/html/2605.06342#S5.E16)) across the four steering tasks. The distribution is long-tailed: only a small minority of heads attain high risk scores, motivating SKOP's selective application to the top-$k$ risk heads.

We analyse the role of selective projection in SKOP by varying the number of heads to which projection is applied and by comparing different head-selection criteria. Unless otherwise stated, we use the TruthfulQA steering task as a representative example, and report utility as the average accuracy on HellaSwag and GSM8K.
#### Effect of the top-$k$ selection budget

We first vary the fraction of heads selected by the risk score $R^{(\ell,h)}$ (Eq. [16](https://arxiv.org/html/2605.06342#S5.E16)). Applying SKOP to a small fraction of heads already yields substantial utility preservation: selecting as few as $k \leq 10\%$ recovers most of the utility lost under vanilla query-space steering. Increasing $k$ beyond approximately 20% does not further improve utility and instead slightly reduces steering efficacy. This is consistent with the head-risk distribution shown in Fig. [8](https://arxiv.org/html/2605.06342#A3.F8), where harmful attention rerouting concentrates in a sparse subset of heads, so projecting additional, lower-risk heads primarily suppresses useful steering directions without addressing rerouting.
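Concretely, the budgeted selection reduces to a quantile threshold on the risk scores; a sketch with stand-in scores (the real $R^{(\ell,h)}$ come from Eq. (16)):

```python
import numpy as np

rng = np.random.default_rng(0)
risk = rng.pareto(3.0, size=1024)        # stand-in long-tailed risk scores
k_frac = 0.10                            # project only the top 10% of heads
cutoff = np.quantile(risk, 1.0 - k_frac)
high_risk = np.flatnonzero(risk >= cutoff)
print(f"{high_risk.size} of {risk.size} heads selected for projection")
```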
#### Risk-based versus discriminative head selection

Prior work [[18](https://arxiv.org/html/2605.06342#bib.bib18), [36](https://arxiv.org/html/2605.06342#bib.bib36)] on head-level interventions often selects heads by their discriminative power for a target concept, e.g., by ranking heads according to the linear classification accuracy of mean-difference steering vectors. To better understand the relationship between attention rerouting, steering efficacy, and utility preservation, we compare two selective steering strategies: (i) applying steering or projection to the top-$k$ heads ranked by risk score $R^{(\ell,h)}$, and (ii) applying it to the top-$k$ most discriminative heads under mean-difference classification.

Fig. [9](https://arxiv.org/html/2605.06342#A3.F9) reports the comparison. On TruthfulQA, steering only the top-$k$ risk heads achieves stronger steering efficacy than steering the top-$k$ discriminative heads, indicating that risk heads are more causally involved in producing behavioural changes under query-space steering. However, steering risk heads also leads to a larger drop in utility than steering discriminative heads. This gap highlights that high-risk heads interact more strongly with attention rerouting: they are simultaneously more effective for steering and more likely to induce harmful focus-to-tail attention shifts.

Figure 9: Effect of selective projection and head-selection criteria on the steering–utility trade-off. Left: steering efficacy and average utility as a function of the top-$k$ risk heads to which SKOP projection is applied. Applying projection to a small fraction of high-risk heads ($k \leq 10\%$) substantially recovers utility while retaining strong steering performance, whereas increasing $k$ beyond $\approx 20\%$ yields diminishing utility gains and slightly reduces steering efficacy. Middle: steering efficacy when steering only the top-$k$ heads selected by risk score versus discriminative score. Risk-based head selection achieves consistently stronger steering effects, indicating that high-risk heads are more causally involved in query-space steering. Right: corresponding utility under the same interventions. Steering high-risk heads leads to a larger drop in utility compared to steering discriminative heads, highlighting that risk heads interact more strongly with harmful attention rerouting. Vertical dotted lines indicate representative selection budgets.

Taken together, these results support the design choice underlying SKOP. Risk-based selection isolates the heads that are most responsible for both steering efficacy and utility degradation, enabling targeted projection to suppress harmful rerouting while retaining steering capacity. Discriminative-head selection alone does not account for how steering perturbs attention distributions and therefore provides weaker control over the efficacy–utility trade-off.
### C.4 Eigenvalue Spectra of Key and Key-Difference Covariances

SKOP's projection (Eq. [13](https://arxiv.org/html/2605.06342#S5.E13)) and its rank-selection rule (Eq. [15](https://arxiv.org/html/2605.06342#S5.E15)) both rely on the assumption that the relevant covariance matrices are approximately low-rank, so that a small number of dominant eigenvectors suffice to capture utility-critical key-space structure. We verify this assumption empirically.

Fig. [10](https://arxiv.org/html/2605.06342#A3.F10) shows the eigenvalue distributions of the centred key covariance matrices $\boldsymbol{\Sigma}_{k}^{(\ell,h)}$ used by the key-invariant projection of Sec. [4](https://arxiv.org/html/2605.06342#S4), and Fig. [11](https://arxiv.org/html/2605.06342#A3.F11) shows the eigenvalue distributions of the key-difference second-moment matrices $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ used by SKOP. In both cases, the eigenvalues decay rapidly across heads: a few dominant eigenvectors account for most of the spectral energy, while the remaining eigenvalues are small but not exactly zero. This empirical low-rank structure is what makes the energy-coverage rule (Eq. [15](https://arxiv.org/html/2605.06342#S5.E15)) effective with a small projection rank $p$, and is consistent with the insensitivity of SKOP to $\gamma_{\text{energy}}$ documented in App. [E.3](https://arxiv.org/html/2605.06342#A5.SS3).

Figure 10: Eigenvalue distributions of the centred key covariance matrices $\boldsymbol{\Sigma}_{k}^{(\ell,h)}$ across attention heads. Eigenvalues decay rapidly, indicating that the centred key covariance is approximately low-rank for each head.

Figure 11: Eigenvalue distributions of the key-difference second-moment matrices $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ across attention heads. The spectra exhibit the same rapid-decay pattern as $\boldsymbol{\Sigma}_{k}^{(\ell,h)}$, supporting SKOP's energy-coverage rank-selection rule.
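The practical consequence of this fast decay is that the energy-coverage rule needs only a small rank $p$, even at high coverage. A minimal sketch with a synthetic, exponentially decaying stand-in spectrum (qualitatively matching Figs. 10 and 11):

```python
import numpy as np

evals = np.exp(-0.5 * np.arange(64))            # stand-in spectrum, fast decay
energy = np.cumsum(evals) / evals.sum()
for gamma in (0.70, 0.90, 0.95, 0.99):
    p = int(np.searchsorted(energy, gamma) + 1) # smallest p covering gamma
    print(f"gamma = {gamma:.2f} -> p = {p}")
# Even gamma = 0.99 needs only ~10 of the 64 eigendirections here.
```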
### C.5 Norm Preservation under SKOP Projection

A natural concern with any hard-projection method is that the projection might destroy most of the steering vector's magnitude, leaving insufficient signal to drive behavioural change. We therefore measure the norm of each steering vector before and after applying the SKOP projector across all four steering tasks. Figures [12](https://arxiv.org/html/2605.06342#A3.F12), [13](https://arxiv.org/html/2605.06342#A3.F13), [14](https://arxiv.org/html/2605.06342#A3.F14), and [15](https://arxiv.org/html/2605.06342#A3.F15) report the layer-wise norms for the Power, Wealth, Corr, and TQA steering vectors, respectively.

Across all four tasks and all layers, the post-projection norm closely tracks the pre-projection norm. This indicates that the components of $\mathbf{r}_{q}^{(\ell,h)}$ that strongly couple to focus-to-tail rerouting, and which SKOP removes, account for only a small fraction of the steering vector's magnitude. The bulk of the steering signal lies in the orthogonal complement and is left untouched. This norm preservation explains why SKOP retains over 95% of vanilla steering efficacy in Sec. [6.1](https://arxiv.org/html/2605.06342#S6.SS1) despite applying a hard projection.

Figure 12: Layer-wise norms of steering vectors before and after SKOP projection on the Power task.
Figure 13: Layer-wise norms of steering vectors before and after SKOP projection on the Wealth task.
Figure 14: Layer-wise norms of steering vectors before and after SKOP projection on the Corr task.
Figure 15: Layer-wise norms of steering vectors before and after SKOP projection on the TQA task.
## Appendix D Implementation Details

### D.1 SKOP Summary

We include an algorithmic summary of SKOP in Algorithm [1](https://arxiv.org/html/2605.06342#alg1).
Algorithm 1: Activation Steering via Key-Orthogonal Projections (SKOP)

Require: utility calibration set $\mathcal{D}_{\text{util}}$, steering vectors $\{\mathbf{r}_{q}^{(\ell,h)}\}_{\ell,h}$, energy coverage $\gamma_{\text{energy}}$, risk threshold $\tau_{\text{risk}}$

1: Calibration:
2: for each head $(\ell,h)$ do
3:   Compute the focus/tail sets $\mathcal{H}^{(\ell,h)}, \mathcal{L}^{(\ell,h)}$ on $\mathcal{D}_{\text{util}}$
4:   Compute the key-difference matrix $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ via Eq. ([11](https://arxiv.org/html/2605.06342#S5.E11))
5:   Select $p$ via the energy-coverage criterion of Eq. ([15](https://arxiv.org/html/2605.06342#S5.E15))
6:   Compute the projector $\mathbf{P}_{\Delta k}^{(\ell,h)}$ via Eq. ([10](https://arxiv.org/html/2605.06342#S4.E10))
7:   Compute the risk score $R^{(\ell,h)}$ via Eq. ([16](https://arxiv.org/html/2605.06342#S5.E16))
8: end for
9: Inference:
10: for each head $(\ell,h)$ do
11:   if $R^{(\ell,h)} > \tau_{\text{risk}}$ then
12:     Apply projected steering: $\tilde{\mathbf{r}}_{q}^{(\ell,h)} = \mathbf{P}_{\Delta k}^{(\ell,h)}\,\mathbf{r}_{q}^{(\ell,h)}$
13:   else
14:     Use the unmodified steering vector: $\tilde{\mathbf{r}}_{q}^{(\ell,h)} = \mathbf{r}_{q}^{(\ell,h)}$
15:   end if
16: end for
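The following Python skeleton mirrors Algorithm 1's control flow. It is a sketch, not the released implementation: Eqs. (11), (15), (10), and (16) live in the main text, so the helper bodies below substitute clearly labelled stand-ins (the centred key covariance in place of $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$, and the fraction of steering energy in the removed subspace in place of the risk score).

```python
import numpy as np

def skop_calibrate(keys_per_head, r_q_per_head, gamma_energy=0.90):
    """Offline calibration: one projector and one risk score per head.

    keys_per_head: {(layer, head): (t, d') array of keys recorded on D_util}
    r_q_per_head:  {(layer, head): (d',) query-space steering vector}
    """
    projectors, risks = {}, {}
    for head, K in keys_per_head.items():
        # Stand-in for Eq. (11): the paper uses focus-minus-tail key
        # differences; here we approximate with the centred key covariance.
        Kc = K - K.mean(axis=0)
        Sigma = Kc.T @ Kc / K.shape[0]
        evals, evecs = np.linalg.eigh(Sigma)            # ascending order
        evals, evecs = evals[::-1], evecs[:, ::-1]
        # Eq. (15): smallest p whose eigendirections cover gamma of the energy.
        p = int(np.searchsorted(np.cumsum(evals) / evals.sum(), gamma_energy) + 1)
        U = evecs[:, :p]
        projectors[head] = np.eye(K.shape[1]) - U @ U.T  # Eq. (10) form
        # Stand-in for the risk score of Eq. (16): fraction of steering
        # energy lying in the removed (rerouting-prone) subspace.
        r = r_q_per_head[head]
        risks[head] = float(np.linalg.norm(U.T @ r) ** 2 / (r @ r))
    return projectors, risks

def skop_steer(r_q_per_head, projectors, risks, tau_risk=0.1):
    """Inference: project the steering vector only on high-risk heads."""
    return {head: projectors[head] @ r if risks[head] > tau_risk else r
            for head, r in r_q_per_head.items()}

# Toy usage with random stand-in data.
rng = np.random.default_rng(0)
heads = [(0, h) for h in range(4)]
keys = {hd: rng.normal(size=(256, 64)) for hd in heads}
r_qs = {hd: rng.normal(size=64) for hd in heads}
projectors, risks = skop_calibrate(keys, r_qs)
steered = skop_steer(r_qs, projectors, risks)
```

A production implementation would replace the two stand-ins with the paper's focus/tail construction and risk definition; the gating structure is unchanged.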
### D.2 Baselines

In this section, we provide detailed descriptions and hyperparameter settings for the conditional activation steering baselines used in our comparisons. All reported experiments were run on a single NVIDIA A100 80GB GPU.
**Conditional Activation Steering (CAST).** Lee et al. [[17](https://arxiv.org/html/2605.06342#bib.bib17)] introduce a gating mechanism on top of standard activation steering. CAST operates by extracting two distinct vectors: a *behaviour vector*, which represents the desired target behaviour, and a *condition vector*, which represents the context in which the behaviour should be triggered. Both are computed using the difference-in-means method on contrastive datasets. At inference time, CAST computes the cosine similarity between the model's current hidden state and the condition vector. If the similarity exceeds a predefined threshold, the behaviour vector is added to the residual stream.
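A minimal sketch of the gating rule just described (a hypothetical helper, not the authors' implementation; extraction of the two vectors from contrastive data is omitted):

```python
import numpy as np

def cast_step(hidden, behaviour_vec, condition_vec, threshold=0.3, strength=1.0):
    """Add the behaviour vector to the residual stream only when the hidden
    state is sufficiently similar to the condition vector."""
    sim = hidden @ condition_vec / (
        np.linalg.norm(hidden) * np.linalg.norm(condition_vec))
    return hidden + strength * behaviour_vec if sim > threshold else hidden
```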
**Angular Steering.** Vu and Nguyen [[39](https://arxiv.org/html/2605.06342#bib.bib39)] reformulate steering as a geometric rotation within a 2D subspace rather than vector addition. The steering plane is defined by the target feature direction (extracted via difference-in-means) and an orthogonal axis derived from the first principal component of feature directions across layers. We specifically use the *Adaptive* variant, which aims to minimise unintended side effects on unrelated features. This variant applies the rotation only when the current activation aligns positively with the target feature direction. The primary hyperparameter is the rotation angle $\theta$. We sweep $\theta$ (typically between $0^{\circ}$ and $180^{\circ}$) to find the optimal operating point (the highest steering score). To ensure a fair comparison, we apply this intervention only at the normalisation layers of every attention block.
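The rotation itself can be sketched as follows, assuming `u` (the feature direction) and `v` (the orthogonal axis) are unit-norm and mutually orthogonal; this is our paraphrase, not the authors' code:

```python
import numpy as np

def angular_steer(a, u, v, theta_deg, adaptive=True):
    """Rotate the component of activation `a` lying in the plane span(u, v)
    by theta degrees, leaving the out-of-plane component untouched."""
    x, y = a @ u, a @ v                 # 2D coordinates in the steering plane
    if adaptive and x <= 0:             # Adaptive variant: steer only when the
        return a                        # activation aligns with the feature.
    th = np.deg2rad(theta_deg)
    x_new = np.cos(th) * x - np.sin(th) * y
    y_new = np.sin(th) * x + np.cos(th) * y
    return a + (x_new - x) * u + (y_new - y) * v
```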
**Semantics-Adaptive Dynamic Intervention (SADI)** [[40](https://arxiv.org/html/2605.06342#bib.bib40)] is a conditional method that generates dynamic steering vectors based on the input's own semantics rather than adding a fixed vector. We utilise the *SADI-HEAD* variant, which targets attention-head outputs. The method consists of two phases. First, it identifies critical model components (attention heads) by computing the mean activation difference between contrastive pairs and creating a binary mask that selects the top-$K$ elements with the largest differences. Second, during inference, it steers the model by amplifying the activations of these selected elements proportionally to the input's own activation strength. This ensures the intervention effectively “adapts” to the semantic direction of the current input.
## Appendix E Additional Experimental Results

In this appendix, we report additional experiments that complement the main results in Sec. [6](https://arxiv.org/html/2605.06342#S6). We provide (i) the efficacy–utility trade-off evaluation on a second model, Gemma-2-9B-IT (App. [E.1](https://arxiv.org/html/2605.06342#A5.SS1)); (ii) ablations on the composition and size of the utility calibration set $\mathcal{D}_{\text{util}}$ (App. [E.2](https://arxiv.org/html/2605.06342#A5.SS2)); (iii) a sensitivity analysis of the energy-coverage hyperparameter $\gamma_{\text{energy}}$ (App. [E.3](https://arxiv.org/html/2605.06342#A5.SS3)); and (iv) a sample-efficiency comparison against LoRA fine-tuning (App. [E.4](https://arxiv.org/html/2605.06342#A5.SS4)).
### E.1 Steering–Utility Trade-off on Gemma-2-9B-IT

To verify that SKOP's effectiveness is not specific to a single model family, we replicate the main steering–utility evaluation of Sec. [6.1](https://arxiv.org/html/2605.06342#S6.SS1) on Gemma-2-9B-IT [[11](https://arxiv.org/html/2605.06342#bib.bib11)], whose architecture includes extra post-layer RMSNorm steps absent from LLaMA 3.1. We use the same four steering tasks (Power, Wealth, Corr, TQA), the same four utility benchmarks (IFBench, ARC, HellaSwag, GSM8K), and the same baseline configurations as in Table [1](https://arxiv.org/html/2605.06342#S6.T1).

Table [3](https://arxiv.org/html/2605.06342#A5.T3) and Fig. [16](https://arxiv.org/html/2605.06342#A5.F16) report the results. SKOP again attains the best overall trade-off rank (2.44) among all steered methods, outperforming the attention-space, residual-stream, and conditional steering baselines. On utility, SKOP retains the strongest performance among steering methods on three of four benchmarks (ARC, HellaSwag, GSM8K) and the second-best on IFBench, while preserving competitive steering scores across all four behaviours. The gap between strong-steering, low-utility methods such as Comm Steer (ranked first on three steering tasks, but with utility collapsing on ARC and GSM8K) and SKOP highlights that the joint trade-off, rather than the steering score in isolation, is the relevant criterion. CAA, despite achieving the second-best steering scores on average, suffers a substantial utility drop (e.g., 16.8 on GSM8K versus 79.0 unsteered), consistent with our argument in Sec. [3](https://arxiv.org/html/2605.06342#S3) that residual-stream steering simultaneously perturbs queries, keys, values, and MLP outputs and therefore admits no cleanly isolable rerouting term to correct.
Table 3: Comparison of SKOP against baselines on Gemma-2-9B-IT [[30](https://arxiv.org/html/2605.06342#bib.bib30)]. Columns Power–TQA report steering scores; columns IFB–GSM8K report utility; Rank (lower is better) summarises the joint trade-off across the steered methods.

| Method | Power | Wealth | Corr | TQA | IFB | ARC | HS | GSM8K | Rank ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 1.62 | 1.56 | 1.56 | 67.5 | 19.8 | 59.0 | 70.2 | 79.0 | – |
| LoRA [[14](https://arxiv.org/html/2605.06342#bib.bib14)] | 1.78 | 1.79 | 2.28 | 70.8 | 15.2 | 35.6 | 49.3 | 35.5 | – |
| CAA [[29](https://arxiv.org/html/2605.06342#bib.bib29)] | 2.59 | 2.09 | 2.45 | 79.3 | 12.6 | 22.2 | 35.2 | 16.8 | 4.38 |
| ITI [[18](https://arxiv.org/html/2605.06342#bib.bib18)] | 2.27 | 1.77 | 1.87 | 67.6 | 9.8 | 27.4 | 43.1 | 18.2 | 6.00 |
| DISCO-Q [[36](https://arxiv.org/html/2605.06342#bib.bib36)] | 1.93 | 1.86 | 2.66 | 75.7 | 10.5 | 31.8 | 36.6 | 20.4 | 5.13 |
| Comm Steer [[36](https://arxiv.org/html/2605.06342#bib.bib36)] | 2.61 | 1.92 | 2.95 | 90.2 | 5.2 | 14.3 | 33.8 | 9.6 | 4.63 |
| Angular Steer [[39](https://arxiv.org/html/2605.06342#bib.bib39)] | 1.95 | 1.81 | 1.74 | 68.1 | 18.1 | 55.3 | 51.4 | 58.2 | 4.50 |
| CAST [[17](https://arxiv.org/html/2605.06342#bib.bib17)] | 1.88 | 1.72 | 1.79 | 69.4 | 16.5 | 38.2 | 55.9 | 39.7 | 5.25 |
| SADI [[40](https://arxiv.org/html/2605.06342#bib.bib40)] | 2.24 | 1.93 | 2.21 | 77.4 | 13.8 | 45.6 | 52.6 | 55.1 | 3.69 |
| SKOP (Ours) | 2.25 | 1.93 | 2.72 | 74.8 | 18.0 | 56.5 | 59.0 | 66.4 | 2.44 |

Figure 16: Steering–utility trade-off for Gemma-2-9B-IT. We report the average of Power, Wealth, and Corr [[25](https://arxiv.org/html/2605.06342#bib.bib25)]. The dashed line traces the best trade-off frontier. SKOP also achieves the best joint trade-off among all steered methods.
### E.2 Calibration Set Ablation

We study the sensitivity of SKOP to the composition and size of the utility calibration set $\mathcal{D}_{\text{util}}$ used to estimate the key-difference matrix $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ (Eq. [11](https://arxiv.org/html/2605.06342#S5.E11)). Throughout, experiments are conducted on LLaMA-3.1-8B-Instruct using the TruthfulQA steering task and the four utility benchmarks of Sec. [6.1](https://arxiv.org/html/2605.06342#S6.SS1).

#### Domain composition

We compare the default mixed calibration set (1,000 examples each from GSM8K, Alpaca, PIQA, and NarrativeQA, totalling 4,000 examples) against four single-domain conditions, each using 4,000 examples drawn exclusively from one source. Table [4](https://arxiv.org/html/2605.06342#A5.T4) reports the steering score on TQA and the average accuracy across the four utility benchmarks.
Table 4: Domain ablation: calibration-set domain composition for SKOP on LLaMA-3.1-8B-Instruct [[8](https://arxiv.org/html/2605.06342#bib.bib8)]. Each single-domain condition uses 4,000 examples drawn exclusively from that source; the mixed default samples 1,000 examples from each of the four sources. Steering performance is reported as the steering score on TruthfulQA [[10](https://arxiv.org/html/2605.06342#bib.bib10)], and utility is the mean accuracy across the four utility benchmarks. Despite large variation in dataset domain, SKOP consistently preserves utility far above the vanilla steering baseline, suggesting that the focus-set structure estimated by SKOP reflects stable model-internal key-space geometry rather than surface properties of the calibration data.

Across all four single-domain conditions, SKOP preserves utility substantially above vanilla steering (26.6), with averages ranging from 42.7 (reading comprehension only) to 52.4 (instruction-following only). Steering efficacy on TQA is also stable, varying only between 63.1 and 64.5 across single-domain conditions. The mixed default achieves the best joint outcome (utility 55.5, TQA 65.9) but does not dominate any single domain by a large margin. Together, these results indicate that the focus-set structure SKOP relies on reflects relatively stable model-internal key-space geometry rather than surface properties of the calibration domain, and that a calibration source need not match the downstream evaluation distribution to yield effective projectors.
#### Calibration size

We next subsample the mixed default calibration set at sizes ranging from 250 to 4,000 examples and re-estimate $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ at each scale. Fig. [17](https://arxiv.org/html/2605.06342#A5.F17) reports the steering score on TQA and the average utility across the four utility benchmarks.

Figure 17: Calibration-size ablation. Effect of calibration-set size on steering efficacy and utility preservation for SKOP on LLaMA-3.1-8B-Instruct, evaluated on TQA and the four utility benchmarks of Sec. [6.1](https://arxiv.org/html/2605.06342#S6.SS1). The calibration set is obtained by subsampling the mixed default $\mathcal{D}_{\text{util}}$ at sizes ranging from 250 to 4,000 examples. Dashed lines indicate the vanilla query-space steering baseline.

Utility preservation plateaus around 1,000 examples; beyond this point, additional calibration data yields negligible improvements. Steering efficacy improves monotonically with calibration size, plausibly because larger samples yield more accurate estimates of the dominant eigendirections of $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$, allowing the projector to more precisely target harmful rerouting directions while leaving steering-effective components intact. Crucially, even at 250 examples, SKOP already substantially outperforms vanilla steering on utility, confirming that SKOP is robust to small calibration budgets and that the underlying focus-set structure is not a data-intensive estimation problem.
### E\.3Sensitivity to the Energy\-Coverage Threshold
The energy\-coverage thresholdγenergy\\gamma\_\{\\text\{energy\}\}\(Eq\.[15](https://arxiv.org/html/2605.06342#S5.E15)\) controls the numberppof dominant eigenvectors of𝚺Δk\(ℓ,h\)\\boldsymbol\{\\Sigma\}\_\{\\Delta k\}^\{\(\\ell,h\)\}removed by the SKOP projector at each head\. We sweepγenergy∈\{0\.50,0\.60,0\.70,0\.80,0\.90,0\.95,0\.99\}\\gamma\_\{\\text\{energy\}\}\\in\\\{0\.50,0\.60,0\.70,0\.80,0\.90,0\.95,0\.99\\\}on Gemma\-2\-9B\-IT for the Power steering task and report results in Table[5](https://arxiv.org/html/2605.06342#A5.T5)\.
Table 5: Sensitivity of SKOP to $\gamma_{\text{energy}}$ on Gemma-2-9B-IT [[30](https://arxiv.org/html/2605.06342#bib.bib30)] on Power steering. $\gamma_{\text{energy}}$ is the energy-coverage threshold (Eq. [15](https://arxiv.org/html/2605.06342#S5.E15)) controlling the number $p$ of top eigenvectors of $\boldsymbol{\Sigma}_{\Delta k}^{(\ell,h)}$ removed per head. Both utility and steering remain stable across $\gamma_{\text{energy}} \in [0.70, 0.95]$. The projection also preserves most of the steering vector norm (Fig. [12](https://arxiv.org/html/2605.06342#A3.F12)), indicating that substantial steering capacity is retained. $\dagger$ denotes our empirical choice, applied uniformly across all datasets without per-dataset tuning.

Both steering and utility are stable across a broad range $\gamma_{\text{energy}} \in [0.70, 0.95]$: steering scores remain within $[2.19, 2.26]$ and average utility within $[48.3, 50.3]$. Very low thresholds ($\gamma_{\text{energy}} = 0.50$) retain too many rerouting directions and harm utility, while very high thresholds ($\gamma_{\text{energy}} = 0.99$) begin to remove non-rerouting directions and slightly reduce steering efficacy. Our default $\gamma_{\text{energy}} = 0.90$ sits in this stable plateau and was chosen once and applied uniformly across all datasets and models without per-task tuning. This insensitivity, together with the norm-preservation analysis in App. [C.5](https://arxiv.org/html/2605.06342#A3.SS5), indicates that harmful focus-to-tail rerouting is concentrated in a small number of dominant eigendirections, leaving substantial steering capacity in the orthogonal complement.
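For concreteness, the thresholding and projector construction can be sketched as follows. This is a minimal PyTorch illustration under our reading of the energy-coverage rule (smallest $p$ whose cumulative eigenvalue mass reaches $\gamma_{\text{energy}}$) and the standard orthogonal-complement projector $P = I - U_p U_p^\top$; it is not the paper's reference implementation.

```python
import torch

def skop_projector(sigma: torch.Tensor, gamma_energy: float = 0.90) -> torch.Tensor:
    """Key-orthogonal projector for one attention head.

    sigma: (d_head, d_head) key-shift covariance Sigma_{Delta k}.
    Removes the p dominant eigendirections whose cumulative eigenvalue
    mass first reaches gamma_energy, returning P = I - U_p U_p^T.
    """
    eigvals, eigvecs = torch.linalg.eigh(sigma)          # ascending order
    eigvals, eigvecs = eigvals.flip(0), eigvecs.flip(1)  # make descending
    energy = torch.cumsum(eigvals, dim=0) / eigvals.sum()
    p = int((energy < gamma_energy).sum().item()) + 1    # smallest p covering gamma
    u_p = eigvecs[:, :p]                                 # (d_head, p) dominant basis
    return torch.eye(sigma.shape[0], dtype=sigma.dtype) - u_p @ u_p.T

# Applying the projector to a per-head steering direction v keeps only
# the component orthogonal to the dominant rerouting directions:
# v_proj = skop_projector(sigma) @ v
```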
### E.4 Case Study: Comparison with LoRA
Activation steering and lightweight fine-tuning are alternative ways to elicit target behaviours from a fixed base model. To contextualise SKOP against the latter, we compare it to LoRA [[14](https://arxiv.org/html/2605.06342#bib.bib14)] on the TruthfulQA steering task, using LLaMA-3.1-8B-Instruct. We vary the number of training examples $N$ used either to construct the steering vector (for vanilla query-space steering and SKOP) or to train LoRA, and evaluate both steering efficacy and average utility across the four utility benchmarks. We use a standard LoRA configuration ($r=16$, $\alpha=32$) applied to the query and key projections of all attention layers.
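As a point of reference, a configuration of this shape can be expressed with the HuggingFace PEFT library roughly as below. This is a sketch of the baseline's setup under our assumptions about the training stack and module naming, not the exact code used for the experiments.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Rank-16, alpha-32 adapters on the query and key projections of every
# attention layer; "q_proj"/"k_proj" are the module names in the HF
# LLaMA implementation (an assumption about the training stack).
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```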
Fig. [18](https://arxiv.org/html/2605.06342#A5.F18) reports the results. SKOP matches the steering performance of vanilla query-space steering across all $N$ and substantially outperforms LoRA on steering efficacy, despite LoRA having access to many more trainable parameters. In terms of utility, LoRA degrades monotonically as $N$ increases, whereas SKOP's utility remains stable and substantially higher. This sample-efficiency gap suggests that, at least in the regime of small contrastive datasets typical of activation steering, fine-tuning approaches both underperform on the target behaviour and incur larger collateral utility loss than projection-based interventions on the existing query subspace.
Figure 18: Sample-efficiency comparison with LoRA on LLaMA-3.1-8B-Instruct. Left: TruthfulQA steering score as a function of the number of training examples $N$ used to construct the steering vector (for activation steering) or to train LoRA. SKOP matches vanilla query-space steering and substantially outperforms LoRA across all $N$. Right: Average utility score across the four utility benchmarks of Sec. [6.1](https://arxiv.org/html/2605.06342#S6.SS1). LoRA degrades monotonically with $N$, while SKOP retains utility close to the unsteered baseline.