A Geometric Account of Activation Steering through Angle-Norm Decomposition

arXiv cs.AI Papers

Summary

This paper analyzes linear activation steering in language models by decomposing interventions into angular and radial components. It finds that concepts are primarily encoded in angular structure, but norm adjustments are crucial for stability, supporting spherical steering methods while showing that additive coefficients conflate geometry.

arXiv:2606.06735v1 Announce Type: new Abstract: Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.

Cached at: 06/08/26, 09:13 AM

# A Geometric Account of Activation Steering through Angle–Norm Decomposition
Source: [https://arxiv.org/html/2606.06735](https://arxiv.org/html/2606.06735)
Georgii Aparin (Huawei Noah’s Ark Lab, aparingm@gmail.com), Tatiana Gaintseva (Queen Mary University of London, t.gaintseva@qmul.ac.uk)

###### Abstract

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior\. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden\-state norm does not carry concept\-relevant information\. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components\. We show that steering methods differ mainly in how they couple two geometric effects: changing a token’s angular alignment with a concept direction and changing its hidden\-state norm\. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering\. Our results explain why interventions with similar concept\-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects\.


## 1 Introduction

Linear activation steering has become a widely used approach for controlling language model behavior through interventions on intermediate representations (Zou et al., [2023](https://arxiv.org/html/2606.06735#bib.bib5); Turner et al., [2023](https://arxiv.org/html/2606.06735#bib.bib4); Panickssery et al., [2023](https://arxiv.org/html/2606.06735#bib.bib2)). Given a steering direction associated with a target concept, standard methods add this direction to hidden states with a scalar strength. These interventions are simple, training-free, and effective across behaviors such as truthfulness, sentiment, toxicity, and refusal (Zou et al., [2023](https://arxiv.org/html/2606.06735#bib.bib5); Turner et al., [2023](https://arxiv.org/html/2606.06735#bib.bib4); Panickssery et al., [2023](https://arxiv.org/html/2606.06735#bib.bib2); Li et al., [2023](https://arxiv.org/html/2606.06735#bib.bib3); Rimsky et al., [2024](https://arxiv.org/html/2606.06735#bib.bib1); Arditi et al., [2024](https://arxiv.org/html/2606.06735#bib.bib29)). However, additive steering treats activation space as if concept control were naturally linear: increasing the steering coefficient is assumed to move representations in a meaningful behavioral direction. This obscures the geometry of the intervention, since adding a vector changes both the direction and the norm of the hidden state (Park et al., [2024](https://arxiv.org/html/2606.06735#bib.bib6); Vu and Nguyen, [2025](https://arxiv.org/html/2606.06735#bib.bib7); You et al., [2026](https://arxiv.org/html/2606.06735#bib.bib8)).

![Refer to caption](https://arxiv.org/html/2606.06735v1/x1.png)

Figure 1: Effect of norm scaling in SN. The left panel shows downstream task metric change, and the right panel shows perplexity ratio. Increasing $\beta$ has little effect on the semantic task metric but substantially reduces perplexity at high $\gamma$, indicating that the norm primarily controls generation stability.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x2.png)

Figure 2: Fraction of folds in which each $\beta$ value achieves the best perplexity or task metric. At $\gamma = 0.7$, $\beta = 1.2$ achieves the lowest perplexity in all folds in our evaluation, indicating that strict norm preservation is not always the most stable choice for high-strength spherical steering.

Recent angular and spherical steering methods offer an alternative: instead of translating activations, they rotate hidden states toward a concept direction, often while preserving norm (Vu and Nguyen, [2025](https://arxiv.org/html/2606.06735#bib.bib7); You et al., [2026](https://arxiv.org/html/2606.06735#bib.bib8)). This is motivated by the hypothesis that concept information is primarily angular, while norm preservation maintains generation quality and input relevance. Although spherical methods can improve stability over naive additive steering, their underlying assumptions remain insufficiently examined: do concepts mainly live in activation direction, and is strict norm preservation always the right steering constraint?

We study these questions through a controlled geometric comparison of activation steering methods\. We decompose each hidden state into an angular component, which determines alignment with a concept direction, and a radial component, given by its norm\. This lets us compare six steering approaches that differ in whether they enforce a target angular concept score, preserve the original norm, or allow the norm to change\.

Our experiments show that the angular hypothesis is largely correct\. Across seven language models and four concept datasets, probes trained on normalized hidden states closely match probes trained on raw hidden states, while norm\-only probes remain near chance\. Thus, for the concepts we study, concept\-discriminative information is encoded primarily in activation direction rather than magnitude\.

However, norm is not irrelevant. Although activation magnitude does not directly encode the target concept, it plays an important role in generation stability and capability preservation. At high angular strength, strict norm preservation can cause large perplexity increases and capability degradation. Conversely, methods that reach the same angular target while allowing a modest norm increase often better preserve fluency and downstream performance (Figs. [1](https://arxiv.org/html/2606.06735#S1.F1), [2](https://arxiv.org/html/2606.06735#S1.F2)). This yields a more nuanced conclusion than either additive or spherical steering alone suggests: angular control explains semantic steering, but radial scaling can determine whether the intervention remains usable at high strength.

We hypothesize that hidden\-state norm partly controls the effective representational capacity available at a token\. Under strong steering, forcing a target concept into the original fixed radius may leave less capacity for other context\-relevant information\. A modest norm increase can relieve this pressure, allowing the model to express the desired concept direction while retaining enough representational scale for other features\.

Overall, our findings suggest that activation steering should be viewed neither as a one\-parameter additive intervention nor as a purely angular operation with fixed norm\. Instead, steering is better understood as a two\-parameter geometric intervention governed by both angle and radius: angle controls the intended semantic effect, while radius influences generation stability, input relevance, and capability preservation\. This perspective explains why methods with similar concept\-level effects can behave differently and suggests a more interpretable design space for future steering methods\.

Our contributions are as follows:

- We formulate activation steering as a two-component geometric intervention that separates angular concept control from radial norm modification.
- We compare six steering methods under a common framework, distinguishing whether they preserve norm and whether they enforce a per-token target concept score.
- We empirically test the angular encoding hypothesis across seven language models and four concept datasets, finding that concept information is primarily encoded in activation direction.
- We show that norm still plays a crucial role in steering stability: at high steering strengths, modest norm increases can reduce perplexity by up to $1.8\times$ without substantially changing the semantic steering effect.

## 2 Related Work

Activation steering and representation engineering. Activation steering controls model behavior by modifying intermediate activations at inference time, without updating weights. Most methods identify a direction in hidden-state space associated with a target behavior and intervene along this direction during generation. ITI, ActAdd, CAA, and Representation Engineering have been used to affect truthfulness, sentiment, topic, refusal, toxicity, and other high-level attributes (Li et al., [2023](https://arxiv.org/html/2606.06735#bib.bib3); Turner et al., [2023](https://arxiv.org/html/2606.06735#bib.bib4); Panickssery et al., [2023](https://arxiv.org/html/2606.06735#bib.bib2); Zou et al., [2023](https://arxiv.org/html/2606.06735#bib.bib5); Rimsky et al., [2024](https://arxiv.org/html/2606.06735#bib.bib1); Arditi et al., [2024](https://arxiv.org/html/2606.06735#bib.bib29)). These methods are simple and training-free, but their usual additive strength has unclear geometry: changing it alters both the hidden state’s alignment with the steering direction and its norm. Our work studies this ambiguity directly by decomposing steering into angular and radial components.

Linear concept representations. Activation steering is closely related to the hypothesis that high-level model properties are represented linearly in activation space. Under this view, directions in hidden-state space correspond to concepts or behaviors, and projections onto these directions can serve as concept scores (Park et al., [2024](https://arxiv.org/html/2606.06735#bib.bib6); Zou et al., [2023](https://arxiv.org/html/2606.06735#bib.bib5)). This motivates contrastive direction extraction, probing, and direction-based interventions. However, identifying a useful concept direction does not determine how to intervene along it. Additive steering simultaneously changes angular alignment and representation norm, which may play different roles. We therefore separate two often conflated questions: whether concept information is encoded in activation direction, and how norm changes affect steering outcomes.

Angular and spherical steering. Recent work has proposed angular or spherical alternatives to additive steering. Angular Steering rotates activations in a behavior-related subspace (Vu and Nguyen, [2025](https://arxiv.org/html/2606.06735#bib.bib7)), while Spherical Steering performs a norm-preserving geodesic rotation toward a target direction (You et al., [2026](https://arxiv.org/html/2606.06735#bib.bib8)). These methods are motivated by the idea that concept information is primarily angular and that preserving activation norm helps maintain generation quality. Our work provides a controlled examination of these assumptions: we test whether concepts are indeed primarily encoded in direction, and whether strict norm preservation remains desirable once angular control is fixed. Unlike prior work focused on specific steering rules, we analyze additive, renormalized, matched, angular, spherical, and norm-scaled interventions in a single angular–radial framework.

Adaptive and token-wise steering. Several methods suggest that a single global steering coefficient is insufficient. ITI selects intervention sites at the attention-head level (Li et al., [2023](https://arxiv.org/html/2606.06735#bib.bib3)); Representation Engineering studies control directions across layers and behaviors (Zou et al., [2023](https://arxiv.org/html/2606.06735#bib.bib5)); and selective or adaptive steering methods vary interventions across layers, tokens, or examples to reduce side effects (Dang and Ngo, [2026](https://arxiv.org/html/2606.06735#bib.bib9)). Our results clarify what such adaptation should control: the achieved angular concept score and the radial norm scale. This yields a more interpretable design space in which methods differ not only in average steering strength, but also in per-token angular precision and norm handling.

Our contribution. Prior work shows that activation directions can control behavior, and recent spherical methods show that norm-aware interventions can improve stability. We ask which geometric component is responsible for concept control, and which affects stability. Our experiments indicate that the evaluated concepts are primarily encoded in activation direction, supporting the motivation for spherical steering. At the same time, norm is not merely nuisance variation: modest radial changes can substantially affect perplexity and capability preservation even when angular concept score is fixed. Thus, we reframe activation steering as a two-parameter intervention over angle and radius, rather than a one-dimensional choice of additive strength or a binary choice between additive and norm-preserving methods.

## 3 Methodology

We study activation steering as a geometric intervention on the residual-stream hidden state of a language model. Given a hidden state $x \in \mathbb{R}^{d}$ at a fixed transformer layer and a unit steering direction $s$, we decompose $x$ into a radial component and an angular component:

$$r = \|x\|, \qquad u = \frac{x}{r}, \tag{1}$$

$$c = \langle u, s \rangle, \qquad v = \frac{u - cs}{\|u - cs\|}. \tag{2}$$

Here, $r$ is the hidden-state norm, $u$ is the corresponding unit vector, $c$ is the angular concept score, and $v$ is the unit residual direction orthogonal to $s$. Any unit vector in the two-dimensional subspace $\operatorname{span}(s, v)$ can be written as

$$\gamma s + \sqrt{1 - \gamma^{2}}\, v, \tag{3}$$

where $\gamma \in [-1, 1]$ is the target concept score. This decomposition lets us separate two aspects of steering that are entangled in standard additive interventions: the angular movement toward the concept direction and the change in hidden-state magnitude.
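The decomposition in Eqs. (1)-(3) can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' implementation: the helper names `decompose` and `compose` are hypothetical, vectors are plain Python lists, and the code assumes that $s$ has unit norm and that $x$ is not parallel to $s$ (otherwise $v$ is undefined).

```python
import math


def dot(a, b):
    """Euclidean inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def decompose(x, s):
    """Return (r, c, v): norm, angular concept score, unit residual orthogonal to s."""
    r = math.sqrt(dot(x, x))
    u = [xi / r for xi in x]
    c = dot(u, s)
    resid = [ui - c * si for ui, si in zip(u, s)]
    resid_norm = math.sqrt(dot(resid, resid))  # zero if x is parallel to s
    v = [ri / resid_norm for ri in resid]
    return r, c, v


def compose(r, gamma, s, v):
    """Build r * (gamma * s + sqrt(1 - gamma^2) * v), i.e. Eq. (3) scaled by r."""
    w = math.sqrt(1.0 - gamma * gamma)
    return [r * (gamma * si + w * vi) for si, vi in zip(s, v)]


s = [1.0, 0.0, 0.0]   # toy unit steering direction
x = [3.0, 4.0, 0.0]   # toy hidden state
r, c, v = decompose(x, s)
x_rebuilt = compose(r, c, s, v)  # using gamma = c recovers x
```

Calling `compose` with the original score `c` recovers `x`; passing a different `gamma` moves the state within $\operatorname{span}(s, v)$ while keeping the radius `r`.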

### 3.1 Steering direction construction

For each model and dataset, we construct a concept direction using contrastive mean-difference. We sample $N = 256$ positive-negative completion pairs from a held-out direction split and extract residual-stream activations at the last prompt token. The steering direction is the unit-normalized difference between the mean positive activation and the mean negative activation:

$$s = \frac{\mu_{+} - \mu_{-}}{\|\mu_{+} - \mu_{-}\|}. \tag{4}$$

The same direction $s$ is used for all steering methods within a model-dataset-fold cell, ensuring that comparisons isolate the geometry of the intervention rather than differences in direction estimation.
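As a rough illustration of Eq. (4), the sketch below computes a unit mean-difference direction from two small sets of activations. The function names and the two-dimensional toy activations are made up for this example; real activations would come from the model's residual stream.

```python
import math


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_difference_direction(pos, neg):
    """Unit-normalized difference between the positive and negative means (Eq. 4)."""
    mu_pos = mean_vector(pos)
    mu_neg = mean_vector(neg)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    length = math.sqrt(sum(d * d for d in diff))
    return [d / length for d in diff]


pos = [[2.0, 1.0], [4.0, 1.0]]  # toy "positive" activations
neg = [[0.0, 1.0], [0.0, 1.0]]  # toy "negative" activations
s = mean_difference_direction(pos, neg)
```

In this toy case the two classes differ only along the first coordinate, so the resulting direction points along that axis.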

### 3.2 Steering methods

We compare six steering operations that differ in whether they preserve the original norm and whether they target the concept score independently for each token. Table [1](https://arxiv.org/html/2606.06735#S3.T1) summarizes the geometric constraints imposed by each method. Below, we describe each of the methods in detail.

Table 1: Summary of steering methods by whether they preserve the original hidden-state norm and whether they enforce a fixed per-token concept score.

Concept Activation Addition (CAA). The standard additive baseline applies a fixed global perturbation:

$$y = x + \alpha s. \tag{5}$$

$\alpha$ is usually treated as a hyperparameter. CAA is neither norm-preserving nor per-token calibrated: it applies the same fixed addition during all generation steps. The achieved concept score varies across tokens depending on the initial norm and alignment of $x$.
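The dependence of the achieved concept score on the initial norm can be seen in a toy example. The sketch below applies Eq. (5) with the same $\alpha$ to two hypothetical tokens that share a direction but differ in norm; the function names and numbers are illustrative only.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def caa(x, s, alpha):
    """Additive steering, Eq. (5): y = x + alpha * s."""
    return [xi + alpha * si for xi, si in zip(x, s)]


def concept_score(y, s):
    """Cosine between y and the unit direction s."""
    return dot(y, s) / norm(y)


s = [1.0, 0.0]
small = [0.0, 1.0]    # norm 1, orthogonal to s
large = [0.0, 10.0]   # norm 10, same direction as `small`
alpha = 1.0

score_small = concept_score(caa(small, s, alpha), s)  # 1 / sqrt(2)
score_large = concept_score(caa(large, s, alpha), s)  # 1 / sqrt(101)
norm_ratio_small = norm(caa(small, s, alpha)) / norm(small)
```

The same additive step rotates the low-norm token far more than the high-norm one and also inflates its norm, which is the entanglement of angle and radius that the decomposition is meant to expose.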

Renormalized CAA (CAA-r). CAA-r applies the same fixed additive update and then projects the result back to the original norm:

$$y = r\, \frac{x + \alpha s}{\|x + \alpha s\|}. \tag{6}$$

This isolates the effect of post-hoc norm preservation while retaining the fixed-strength nature of CAA. CAA-r preserves $\|y\| = \|x\|$, but it does not enforce a target concept score for each token.
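A minimal sketch of Eq. (6), assuming plain Python lists for vectors and a hypothetical function name `caa_r`: the additive update is applied first, and the result is rescaled to the original norm.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def caa_r(x, s, alpha):
    """Add alpha * s, then rescale the shifted vector back to the norm of x."""
    r = norm(x)
    shifted = [xi + alpha * si for xi, si in zip(x, s)]
    scale = r / norm(shifted)
    return [scale * yi for yi in shifted]


s = [1.0, 0.0]
x = [0.0, 2.0]
y = caa_r(x, s, alpha=2.0)
score = dot(y, s) / norm(y)  # same direction as x + alpha * s
```

The output direction is identical to that of plain CAA with the same $\alpha$; only the radius differs.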

Matched CAA without renormalization (CAA-m). CAA-m chooses a token-specific additive coefficient $\alpha$ so that the normalized output reaches a desired concept score $\gamma$:

$$y = x + \alpha s, \qquad \left\langle \frac{y}{\|y\|}, s \right\rangle = \gamma. \tag{7}$$

We compute $\alpha$ using the formula derived in Appendix [C](https://arxiv.org/html/2606.06735#A3). Unlike CAA-r, CAA-m does not renormalize the output. Thus, it exactly controls the angular concept score while allowing the norm to change.

CAA-m and Spherical Steering can be compared in a shared geometric subspace because both operate inside $\operatorname{span}(s, v)$. Once CAA-m chooses $\alpha$ such that the normalized output has concept score $\gamma$, its direction lies on the same ray as the spherical target

$$\gamma s + \sqrt{1 - \gamma^{2}}\, v. \tag{8}$$

Therefore, CAA-m and Spherical Steering have the same angular component and differ only in their radial component. If the matched CAA output is additionally renormalized to the original norm, the resulting method, CAA-mr, is exactly equivalent to Spherical Steering. We therefore do not treat CAA-mr as a separate method.
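The equivalence between renormalized CAA-m and the spherical target can be checked numerically. Appendix C is not reproduced here, so the closed form below is derived directly from Eq. (7) by writing $x = r\,(c\,s + \sqrt{1 - c^{2}}\,v)$; it may differ in form from the authors' expression. Under that derivation, $\alpha = r\left(\gamma\sqrt{1 - c^{2}}/\sqrt{1 - \gamma^{2}} - c\right)$ and $\|y\|/\|x\| = \sqrt{(1 - c^{2})/(1 - \gamma^{2})}$. The function names are hypothetical, and the sketch assumes $|\gamma| < 1$ and that $x$ is not parallel to $s$.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def matched_alpha(x, s, gamma):
    """Token-specific alpha such that (x + alpha*s)/||x + alpha*s|| has score gamma."""
    r = norm(x)
    c = dot(x, s) / r
    return r * (gamma * math.sqrt(1.0 - c * c) / math.sqrt(1.0 - gamma * gamma) - c)


def spherical_target(x, s, gamma):
    """r * (gamma * s + sqrt(1 - gamma^2) * v), with v the unit residual of x."""
    r = norm(x)
    c = dot(x, s) / r
    resid = [xi / r - c * si for xi, si in zip(x, s)]
    resid_norm = norm(resid)
    w = math.sqrt(1.0 - gamma * gamma)
    return [r * (gamma * si + w * ri / resid_norm) for si, ri in zip(s, resid)]


s = [1.0, 0.0, 0.0]
x = [1.0, 2.0, 2.0]
gamma = 0.6

alpha = matched_alpha(x, s, gamma)
y_m = [xi + alpha * si for xi, si in zip(x, s)]      # CAA-m output
y_mr = [norm(x) / norm(y_m) * yi for yi in y_m]      # CAA-m renormalized
y_s = spherical_target(x, s, gamma)                  # spherical target at radius r

c = dot(x, s) / norm(x)
predicted_ratio = math.sqrt((1.0 - c * c) / (1.0 - gamma * gamma))
```

In this sketch the norm ratio exceeds one whenever $\gamma^{2} > c^{2}$, so under this derivation the norm inflation of CAA-m grows as the target score moves away from the token's initial score.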

Spherical Steering (S). Spherical Steering directly constructs the minimum-geodesic-distance unit direction with target score $\gamma$, then restores the original norm:

$$y = r \left( \gamma s + \sqrt{1 - \gamma^{2}}\, v \right). \tag{9}$$

This method preserves $\|y\| = \|x\|$ exactly and enforces $\langle y / \|y\|, s \rangle = \gamma$ independently for every token.

Additive Spherical (AS). Additive Spherical applies a fixed spherical displacement toward the concept direction. Let

$$\theta = \arccos(c), \qquad \theta' = \max(\theta - \Delta\theta, 0). \tag{10}$$

The steered state is

$$y = r \left( \cos\theta'\, s + \sin\theta'\, v \right). \tag{11}$$

AS preserves the norm and the residual direction, but it does not target the same final concept score for every token. Instead, the resulting score depends on the token’s initial angle to $s$.
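A toy sketch of Eqs. (10)-(11) makes the dependence on the initial angle visible. The function name `additive_spherical` and the two example tokens are hypothetical, and the code assumes a unit $s$ and an $x$ that is not parallel to $s$.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def additive_spherical(x, s, delta_theta):
    """Rotate x toward s by a fixed angle delta_theta, keeping its norm."""
    r = norm(x)
    c = max(-1.0, min(1.0, dot(x, s) / r))  # clamp against rounding before acos
    resid = [xi / r - c * si for xi, si in zip(x, s)]
    resid_norm = norm(resid)
    v = [ri / resid_norm for ri in resid]
    theta_new = max(math.acos(c) - delta_theta, 0.0)
    return [r * (math.cos(theta_new) * si + math.sin(theta_new) * vi)
            for si, vi in zip(s, v)]


s = [1.0, 0.0]
token_a = [0.0, 2.0]                 # 90 degrees from s
token_b = [1.0, math.sqrt(3.0)]      # 60 degrees from s
delta = math.pi / 6                  # rotate both by 30 degrees

y_a = additive_spherical(token_a, s, delta)
y_b = additive_spherical(token_b, s, delta)
score_a = dot(y_a, s) / norm(y_a)    # cos(60 degrees)
score_b = dot(y_b, s) / norm(y_b)    # cos(30 degrees)
```

Both tokens keep their norm and receive the same angular displacement, yet they end at different concept scores.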

Spherical Steering with Norm Scaling (SN). Finally, we introduce an explicit radial parameter $\beta$ on top of Spherical Steering:

$$y = \beta r \left( \gamma s + \sqrt{1 - \gamma^{2}}\, v \right). \tag{12}$$

When $\beta = 1$, SN reduces exactly to S. For $\beta \neq 1$, the angular component is unchanged while the norm is scaled by a fixed multiplicative factor. This lets us test whether the norm acts primarily as a stability parameter once semantic angular control is fixed.
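Eq. (12) can be sketched as a single function with two interpretable parameters. This is an illustrative implementation under stated assumptions (unit $s$, $|\gamma| \le 1$, $x$ not parallel to $s$); the function name and example values are hypothetical.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def concept_score(y, s):
    return dot(y, s) / norm(y)


def spherical_norm_scaled(x, s, gamma, beta=1.0):
    """Set the concept score to gamma within span(s, v), then scale the norm by beta."""
    r = norm(x)
    c = dot(x, s) / r
    resid = [xi / r - c * si for xi, si in zip(x, s)]
    resid_norm = norm(resid)
    w = math.sqrt(1.0 - gamma * gamma)
    return [beta * r * (gamma * si + w * ri / resid_norm)
            for si, ri in zip(s, resid)]


s = [1.0, 0.0, 0.0]
x = [1.0, 2.0, 2.0]
y_s = spherical_norm_scaled(x, s, gamma=0.7)             # beta = 1: plain S
y_sn = spherical_norm_scaled(x, s, gamma=0.7, beta=1.2)  # same angle, larger radius
```

Because `gamma` fixes the angle and `beta` fixes the radius, the two effects can be swept independently, which is how the SN experiment separates semantic control from stability.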

Our experimental design isolates the roles of angular control and norm modification through four controlled experiments\.

#### 1\. Hidden\-state norm variation\.

First, we measure hidden-state norm variation across layers and token populations. For each model, we sample examples from multiple corpora and compute the coefficient of variation of $\|x\|$ for the last prompt token, all prompt tokens, and generated tokens. This experiment characterizes the radial geometry of the representation space and determines whether norm preservation is a meaningful constraint.
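For reference, the coefficient of variation is the standard deviation of the norms divided by their mean. The sketch below uses the population standard deviation, which is an assumption since the text does not say which estimator is used; the function name and toy hidden states are hypothetical.

```python
import math


def coefficient_of_variation(hidden_states):
    """Std / mean of the Euclidean norms of a list of hidden states."""
    norms = [math.sqrt(sum(xi * xi for xi in x)) for x in hidden_states]
    mean = sum(norms) / len(norms)
    variance = sum((n - mean) ** 2 for n in norms) / len(norms)  # population variance
    return math.sqrt(variance) / mean


hidden_states = [[3.0, 4.0], [6.0, 8.0], [0.0, 5.0]]  # norms 5, 10, 5
cv = coefficient_of_variation(hidden_states)
```

A value near zero means norms are tightly concentrated, in which case norm preservation constrains little; larger values mean the radius varies substantially across tokens.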

#### 2\. Angular versus radial concept encoding\.

Second, we test whether concept information is encoded primarily in direction or magnitude. We train three linear probes: one on raw hidden states $h$, one on normalized hidden states $h / \|h\|$, and one on the scalar norm $\|h\|$. If normalized probes match raw probes while norm-only probes remain near chance, this indicates that concept information is primarily angular rather than radial.

#### 3\. Steering at matched angular control\.

Third, we compare steering methods under matched angular control. For per-token methods, we set a target concept score $\gamma$. For fixed-strength methods, we calibrate the global strength parameter by binary search so that the mean achieved concept score on evaluation activations matches the desired target $\bar{\gamma}$. We then compare downstream task performance, per-token concept-score variance, norm ratio $\|y\| / \|x\|$, perplexity, and general capability metrics. This experiment distinguishes three possible explanations for steering behavior: per-token precision, angular displacement, and norm preservation.

#### 4\. Isolating the role of norm scaling\.

Fourth, we isolate the role of the norm using SN. Holding $\gamma$ fixed, we vary only the multiplicative norm scale $\beta$. Because $\beta$ changes only the radius and leaves the angular concept score fixed, this experiment directly tests whether modest norm changes improve generation stability without changing semantic control.

All steering directions are computed on held-out direction splits and evaluated on separate held-out examples. Where methods require calibration, we perform binary search over the steering parameter before measuring downstream behavior. For CAA and CAA-r, the searched parameter is $\alpha$; for AS, it is $\Delta\theta$; for CAA-m, it is the token-specific $\alpha$ needed to achieve $\gamma$.

Together, these experiments provide a controlled comparison between interventions that alter direction, norm, or both\. By matching either per\-token concept score or mean concept score across methods, we can determine whether steering success is explained by angular movement alone, strict norm preservation, or a two\-parameter interaction between angle and radius\.

## 4Experiments

### 4\.1Evaluation setup

Models and steering layer. We evaluate all methods on seven transformer language models spanning 1B to 70B parameters: Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct, Gemma-2-9B-it, Llama-3.1-8B, Llama-3.2-1B-Instruct, Qwen2.5-3B-Instruct, and Llama-3.1-70B-Instruct. For each model, steering is applied to the residual-stream output at 75% depth. This gives steering layers 24, 21, 31, 24, 12, 27, and 60, respectively. We use a single forward hook at this layer, replacing each hidden state $x$ with a steered state $y$ at every token position during generation.
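The listed steering layers are consistent with taking 75% of each model's layer count and rounding down. The sketch below checks this; the layer counts are taken from the publicly released model configurations and are an assumption of this example, as is the floor rounding (inferred from the Gemma value) and the dictionary name.

```python
# Number of transformer blocks per model (assumed from public model configs).
NUM_LAYERS = {
    "Llama-3.1-8B-Instruct": 32,
    "Qwen2.5-7B-Instruct": 28,
    "Gemma-2-9B-it": 42,
    "Llama-3.1-8B": 32,
    "Llama-3.2-1B-Instruct": 16,
    "Qwen2.5-3B-Instruct": 36,
    "Llama-3.1-70B-Instruct": 80,
}


def steering_layer(num_layers):
    """Layer at 75% depth, rounded down (integer arithmetic avoids float error)."""
    return (3 * num_layers) // 4


steering_layers = [steering_layer(n) for n in NUM_LAYERS.values()]
```

Whether these indices are zero-based or one-based is not stated in the text, so the sketch makes no claim about that.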

Datasets and task metrics. We evaluate steering on four concept datasets: TruthfulQA for truthfulness, SST-2 for sentiment, CivilComments for toxicity, and IMDB for sentiment. For TruthfulQA, we use closed-form multiple-choice metrics, primarily MC1. For SST-2 and IMDB, we measure the positive rate of generated continuations. For CivilComments, we measure non-toxicity using a toxicity classifier. For generation-based evaluations, we sample 128 tokens using nucleus sampling with $p = 0.95$ and temperature $T = 0.7$. Dataset and benchmark details are provided in [Appendix I](https://arxiv.org/html/2606.06735#A9).

Quality and capability metrics. To measure whether steering damages general language-model behavior, we compute perplexity on 200 held-out WikiText-103 passages with maximum length 512. We report perplexity as a ratio relative to the unsteered baseline for the same model, dataset, and fold. We also evaluate MMLU accuracy using log-probability ranking on a fixed subset of 300 items, providing an auxiliary measure of retained model capability.
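As a reminder of how a perplexity ratio is formed, the sketch below computes perplexity as the exponential of the mean per-token negative log-likelihood and divides the steered value by the baseline. Pooling all tokens into one mean is an assumption, since the aggregation over passages is not specified here; the function name and the toy log-likelihood values are made up.

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


baseline_nlls = [2.0, 2.0, 2.0]                        # toy unsteered token losses
steered_nlls = [2.0 + math.log(1.5)] * 3               # toy steered token losses
ppl_ratio = perplexity(steered_nlls) / perplexity(baseline_nlls)
```

A ratio of 1 means steering left perplexity unchanged; values above 1 indicate degraded fluency relative to the unsteered model.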

Calibration protocol. For per-token methods, we sweep target concept scores $\gamma \in \{0.1, 0.3, 0.5, 0.7\}$. For fixed-strength methods, we calibrate the global steering parameter so that the mean achieved concept score on evaluation activations matches the desired target $\bar{\gamma}$. Specifically, we binary-search $\alpha$ for CAA and CAA-r, and $\Delta\theta$ for AS. For SN, we hold the angular target fixed and sweep $\beta \in \{0.9, 1.0, 1.1, 1.2\}$. All reported comparisons use held-out direction splits and held-out evaluation examples. Unless otherwise stated, results are aggregated over seven models, four datasets, one seed, and two folds.

### 4\.2Experimental results

#### Hidden\-state norms vary across layers and architectures\.

We first examine whether activation norms can be treated as approximately constant during steering. Figure [3](https://arxiv.org/html/2606.06735#S4.F3) reports the coefficient of variation of last-prompt-token hidden-state norms across layers, models, and corpora. The results show that norm concentration is architecture-dependent: Llama and Qwen models generally have relatively concentrated norms at middle and later layers, while Gemma exhibits much larger norm variation across most layers. We hypothesize that this difference is largely due to Gemma’s post-norm architecture. Across models, the activations after the last transformer block consistently have the lowest coefficient of variation, indicating that norm concentration increases toward the final layers. This indicates that the radial component is not a universally negligible part of the representation space. Additional layer-wise norm statistics are reported in [Appendix A](https://arxiv.org/html/2606.06735#A1).

Importantly, norm variation by itself does not determine whether a concept is encoded in the norm or in the direction\. Rather, this experiment motivates treating the norm as a separate geometric degree of freedom: even when semantic information is primarily angular, preserving or modifying the radius may still affect generation stability\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x3.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x4.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x5.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x6.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x7.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x8.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x9.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x10.png)

Figure 3: Coefficient of variation (CV) of hidden-state norms versus layer for all 7 models and 10 corpora. The grey dotted line marks the $L_{75}$ steering layer. Bottom right: combined mean CV across corpora.
#### Concept information is primarily angular\.

We next test whether concept-discriminative information is encoded in the direction or the magnitude of hidden states. For each model and dataset, we train linear probes on three representations: raw hidden states $h$, normalized hidden states $h / \|h\|$, and scalar norms $\|h\|$. As shown in Figure [4](https://arxiv.org/html/2606.06735#S4.F4), normalized probes closely match raw probes across all models and concept datasets, while norm-only probes remain near chance. This supports the central geometric assumption that concept information is primarily represented in angular directions rather than in hidden-state magnitudes. Additional directional-encoding results are provided in [Appendix B](https://arxiv.org/html/2606.06735#A2).

![Refer to caption](https://arxiv.org/html/2606.06735v1/x11.png)

Figure 4: Linear probe accuracy versus layer for all four concept datasets. Each dataset contains three probe variants: raw hidden states $h$, normalized hidden states $h / \|h\|$, and norm-only features $\|h\|$. Raw and normalized curves nearly overlap, while norm-only probes remain close to chance, indicating that the evaluated concepts are encoded primarily in direction.
#### Matched additive steering and spherical steering share the same angular target but differ in norm\.

We next compare CAA-m and S at matched per-token target $\gamma$. Both methods steer inside the same two-dimensional subspace $\operatorname{span}(s, v)$ and reach the same normalized concept direction. Their difference is radial: S restores the original norm, while CAA-m leaves the additive norm change intact. This comparison therefore isolates the effect of norm change while holding angular control fixed. Further comparisons are provided in [Appendix D](https://arxiv.org/html/2606.06735#A4).

Figure [5](https://arxiv.org/html/2606.06735#S4.F5) shows that CAA-m produces only mild norm inflation at low and moderate steering strengths, but the effect grows at high $\gamma$.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x12.png)

Figure 5: Norm ratio $\|y\| / \|x\|$ for CAA-m at matched per-token target $\gamma$.

Despite matching the same angular target, S and CAA-m differ strongly in generation stability. At high $\gamma$, strict norm preservation can produce large perplexity penalties and substantial capability loss, whereas CAA-m often retains lower perplexity and higher MMLU accuracy. This shows that preserving the original norm is not always the most stable choice once the angular edit becomes large.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x13.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x14.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x15.png)

Figure 6: Downstream task metric, WikiText-103 perplexity, and MMLU accuracy under S and CAA-m at matched per-token $\gamma$. The two methods implement nearly identical angular control, but they differ in radial behavior. At high steering strengths, S incurs much larger perplexity penalties, while CAA-m better preserves generation stability and general capability.
#### Norm preservation alone does not explain stability in fixed\-strength steering\.

We next isolate the fixed-strength family: CAA, CAA-r, and AS. Unlike S and CAA-m, these methods do not enforce the same concept score independently for each token. Instead, each method uses a single global steering parameter, calibrated so that the mean achieved concept score matches the target level. This comparison tests whether preserving the hidden-state norm is sufficient to explain downstream stability. Additional results are provided in [Appendix E](https://arxiv.org/html/2606.06735#A5).

The first comparison is between CAA and CAA\-r\. These methods have the same normalized output direction after the additive update; CAA\-r only rescales the resulting vector back to the original norm\. As a result, their downstream behavior is very similar across steering strengths\. This shows that post\-hoc renormalization is not, by itself, a reliable source of improved stability\.

The second comparison is between CAA\-r and AS\. Both methods preserve the hidden\-state norm, but they produce different token\-level angular profiles\. CAA\-r applies a fixed additive perturbation before renormalization, so the resulting angular displacement depends on the token’s initial norm and alignment with the steering direction\. AS applies a fixed spherical displacement, so its effect is distributed differently across tokens\. The fact that these two norm\-preserving methods behave differently shows that norm preservation alone cannot explain the steering quality trade\-off\. Instead, the per\-token distribution of achieved concept scores is an important part of the geometry\.
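
The two comparisons above can be illustrated with a small numerical sketch. The update rules below are assumptions made for illustration, since this section does not restate the method definitions: CAA is taken to be `x + alpha * s` for a unit direction `s`, CAA-r rescales that result to the original norm, and AS is modeled as a rotation toward `s` by a fixed angle `theta`. The function names, `alpha`, `theta`, and the synthetic vectors are all hypothetical.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def caa(x, s, alpha):
    """Additive steering (assumed form): add alpha times the unit direction s."""
    return x + alpha * s

def caa_r(x, s, alpha):
    """Additive update, then rescale the result back to the original norm of x."""
    return np.linalg.norm(x) * unit(caa(x, s, alpha))

def angular_steer(x, s, theta):
    """Assumed form of AS: rotate x toward s by a fixed angle theta (radians),
    keeping the norm. Requires x not to be parallel to s."""
    r = np.linalg.norm(x)
    v = unit(x - np.dot(x, s) * s)  # unit component of x orthogonal to s
    phi = np.arccos(np.clip(np.dot(x, s) / r, -1.0, 1.0))  # current angle to s
    new_phi = max(phi - theta, 0.0)  # do not rotate past s
    return r * (np.cos(new_phi) * s + np.sin(new_phi) * v)

def cos_to(y, s):
    """Cosine between y and the unit direction s (the concept score)."""
    return float(np.dot(y, s) / np.linalg.norm(y))

# Synthetic data: two tokens with the same direction but different norms.
rng = np.random.default_rng(0)
d = 64
s = unit(rng.normal(size=d))
direction = unit(rng.normal(size=d))
x_small, x_large = 5.0 * direction, 50.0 * direction
alpha, theta = 10.0, 0.3

for name, x in [("small-norm", x_small), ("large-norm", x_large)]:
    print(
        name,
        "before:", round(cos_to(x, s), 3),
        "CAA:", round(cos_to(caa(x, s, alpha), s), 3),
        "CAA-r:", round(cos_to(caa_r(x, s, alpha), s), 3),
        "AS:", round(cos_to(angular_steer(x, s, theta), s), 3),
    )
```

Under these assumptions, CAA and CAA-r always yield the same cosine to `s`, because rescaling does not change direction. The same `alpha` moves the small-norm token much further in angle than the large-norm token, whereas the fixed rotation moves both tokens by the same angle.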

#### The Pareto frontier depends on both angular precision and radial behavior\.

We then compare all five main methods: CAA, CAA-r, CAA-m, S, and AS. Fixed-strength methods are calibrated to matched mean concept score, while per-token methods directly enforce the target score for each token. We plot downstream task improvement against WikiText-103 perplexity ratio, so better methods move toward higher task improvement and lower perplexity. Figure [7](https://arxiv.org/html/2606.06735#S4.F7) shows the Pareto comparison separately for each dataset.
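
For readers who want to reproduce this kind of comparison, the following is a minimal sketch of extracting a Pareto front from (task improvement, perplexity ratio) pairs. The function name `pareto_front` and the numbers are hypothetical placeholders, not values from the paper. A point is kept if no other point is at least as good on both axes and strictly better on one.

```python
def pareto_front(points):
    """Return the non-dominated points.

    Each point is (label, task_gain, ppl_ratio): higher task_gain is better,
    lower ppl_ratio is better.
    """
    front = []
    for label, gain, ppl in points:
        dominated = any(
            (g >= gain and p <= ppl) and (g > gain or p < ppl)
            for _, g, p in points
        )
        if not dominated:
            front.append((label, gain, ppl))
    # Sort by perplexity ratio so the front reads from low-cost to high-cost.
    return sorted(front, key=lambda t: t[2])

# Placeholder configurations (made-up numbers, for illustration only).
points = [
    ("A", 0.10, 1.05),
    ("B", 0.20, 1.30),
    ("C", 0.15, 1.50),  # dominated by B: lower gain at higher perplexity
    ("D", 0.25, 2.00),
]
front = pareto_front(points)
print([label for label, _, _ in front])
```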

![Refer to caption](https://arxiv.org/html/2606.06735v1/x16.png)

Figure 7: Per-dataset Pareto curves for all methods. The same qualitative pattern appears across datasets: CAA-m provides a strong high-control, low-perplexity trade-off, while S suffers a large perplexity increase at high steering strengths.

As shown in Figure [7](https://arxiv.org/html/2606.06735#S4.F7), these results suggest that steering should not be reduced to a binary choice between preserving and changing the norm. CAA-m and S have the same angular target, yet CAA-m is much more stable at high $\gamma$. Conversely, CAA-r and AS both preserve norm, but AS produces much higher perplexity at large steering strengths. In particular, S achieves the highest downstream task score even at $\gamma=0.5$, showing that strict per-token angular targeting can be highly effective for semantic control. The relevant design space is therefore two-dimensional: angular control determines the semantic effect of steering, while norm scale strongly influences whether the model can continue generating coherently.

#### Norm scaling acts as a stability lever\.

Finally, we test this interpretation directly by adding an explicit multiplicative norm scale $\beta$ on top of S. This gives SN:

$$y = \beta\, r \left(\gamma s + \sqrt{1-\gamma^{2}}\, v\right).$$

Changing $\beta$ leaves the angular concept score fixed while changing only the radius of the steered representation. Thus, within this controlled intervention family, differences across $\beta$ values reflect the effect of radial scaling under a fixed semantic direction.
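
A minimal NumPy sketch of the SN update is given below. It rests on assumptions about symbols that are defined elsewhere in the paper rather than in this section: $r$ is taken to be the original norm $\|x\|$, $s$ the unit steering direction, and $v$ the unit vector along the component of $x$ orthogonal to $s$. The names `spherical_steer_scaled` and `concept_score` are hypothetical.

```python
import numpy as np

def spherical_steer_scaled(x, s, gamma, beta=1.0):
    """Sketch of y = beta * r * (gamma * s + sqrt(1 - gamma^2) * v).

    Assumptions: r = ||x||, s is normalized here, and v is the unit vector
    along the part of x orthogonal to s (x must not be parallel to s).
    """
    s = s / np.linalg.norm(s)
    r = np.linalg.norm(x)
    ortho = x - np.dot(x, s) * s
    v = ortho / np.linalg.norm(ortho)
    return beta * r * (gamma * s + np.sqrt(1.0 - gamma**2) * v)

def concept_score(y, s):
    """Cosine similarity between y and the steering direction s."""
    return float(np.dot(y, s) / (np.linalg.norm(y) * np.linalg.norm(s)))

# Synthetic hidden state and direction, for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(size=128)
s = rng.normal(size=128)

for beta in (1.0, 1.2):
    y = spherical_steer_scaled(x, s, gamma=0.7, beta=beta)
    ratio = np.linalg.norm(y) / np.linalg.norm(x)
    print(f"beta={beta}: cos(y, s)={concept_score(y, s):.4f}, norm ratio={ratio:.4f}")
```

Because $s$ and $v$ are orthonormal under these assumptions, the cosine between $y$ and $s$ equals $\gamma$ for any $\beta > 0$, and the norm ratio $\|y\|/\|x\|$ equals $\beta$.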

Figure [1](https://arxiv.org/html/2606.06735#S1.F1) shows that $\beta$ has only a small effect on the task metric but a large effect on perplexity at high $\gamma$. Moving from $\beta=1.0$ to $\beta=1.2$ improves perplexity by roughly $1.8\times$ at $\gamma=0.7$, while task metrics remain within about $2.5$ percentage points across the tested $\beta$ values. We hypothesize that this effect arises because the hidden-state norm partly determines the representational capacity available to the model at that token. When steering strongly toward a concept direction while strictly preserving the original norm, a large fraction of the fixed-radius representation may be devoted to expressing the steered concept, leaving less effective capacity for other information needed to maintain fluent and contextually coherent generation. A modest norm increase may compensate for this by allowing the target concept to be expressed without compressing the remaining information into the same radius.
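
The capacity hypothesis can be made concrete by splitting the SN output into its component along the concept direction and its orthogonal remainder. The calculation below is an illustration, not a result reported in the paper; it assumes that $s$ and $v$ are orthonormal and that $r = \|x\|$.

$$
\|y_{\parallel}\| = \beta\, r\, \gamma, \qquad \|y_{\perp}\| = \beta\, r \sqrt{1-\gamma^{2}} .
$$

At $\gamma = 0.7$, $\sqrt{1-\gamma^{2}} = \sqrt{0.51} \approx 0.714$, so

$$
\beta = 1.0:\ \|y_{\perp}\| \approx 0.714\, r, \qquad \beta = 1.2:\ \|y_{\perp}\| \approx 0.857\, r .
$$

For a token that starts nearly orthogonal to $s$, the unsteered orthogonal component has magnitude close to $r$. Under that additional assumption, strict norm preservation shrinks the non-concept component to roughly 71% of its original magnitude, while $\beta = 1.2$ restores it to about 86%. This is consistent with the hypothesis above but does not by itself establish it.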

Overall, these experiments support a two-parameter view of activation steering. The angular component controls the intended concept, as shown by the probe results and by the matched behavior of S and CAA-m in concept space. The radial component controls stability: preserving the original norm is sometimes useful, but at high steering strengths a modest norm increase can substantially reduce perplexity without materially changing the semantic steering effect. Additional $\beta$-sweep results are provided in [Appendix H](https://arxiv.org/html/2606.06735#A8).

## 5 Conclusion

We presented a geometric account of activation steering that separates two effects entangled by additive interventions: angular movement toward a concept direction and radial change in hidden\-state norm\. This explains why a single additive coefficient is hard to interpret: the same coefficient can induce different angular shifts and norm changes depending on each token’s initial geometry\.

Across seven language models and four concept datasets, we find that the evaluated concepts are represented primarily in activation direction\. Normalized probes closely match raw probes, while norm\-only probes remain near chance, supporting the view that semantic control is largely angular\.

At the same time, norm preservation is not always the right constraint\. Even with fixed angular concept score, radial changes can strongly affect perplexity and capability preservation\. Strict norm preservation can become unstable at high steering strengths, while modest norm increases reduce degradation without materially changing the semantic effect\.

Overall, our findings reframe activation steering as a two\-parameter intervention over angle and radius\. Angle controls the intended concept, while radius controls intervention stability\. This explains why methods with similar concept\-level effects can behave differently and suggests a more interpretable basis for future token\-wise steering methods\. Effective steering requires choosing not only where to point a representation, but also how much representational scale to give it\.

## 6 Limitations

Our study has several limitations\. First, we apply steering at a single fixed layer, chosen at 75% depth for each model\. Although this gives a controlled comparison across methods, the optimal angle–norm trade\-off may vary across layers\.

Second, our experiments cover a limited set of models and concepts\. We evaluate Llama, Qwen, and Gemma models on truthfulness, sentiment, and toxicity\-related steering, but other architectures or more complex behaviors may exhibit different geometry\.

Third, all methods use the same contrastive mean\-difference steering direction\. This isolates the effect of the intervention geometry, but does not test whether the conclusions hold for other ways of estimating steering directions\.

Finally, our norm-scaling experiments use a small discrete set of $\beta$ values. The results show that the norm is an important stability parameter, but they do not provide an automatic rule for choosing the best norm scale for a new model, layer, or task.

## References

- Refusal in language models is mediated by a single direction\.InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,A\. Globersons, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. M\. Tomczak, and C\. Zhang \(Eds\.\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/f545448535dfde4f9786555403ab7c49-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p1.1)\.
- D\. Borkan, L\. Dixon, J\. Sorensen, N\. Thain, and L\. Vasserman \(2019\)Nuanced metrics for measuring unintended bias with real data for text classification\.arXiv preprint arXiv:1903\.04561\.External Links:[Link](https://arxiv.org/abs/1903.04561)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px1.p1.1),[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.4.3.4)\.
- Center for AI Safety and Hugging Face Datasets Contributors \(2024\)MMLU dataset card\.Note:[https://huggingface\.co/datasets/cais/mmlu](https://huggingface.co/datasets/cais/mmlu)Lists the dataset distribution under MITCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.7.6.4)\.
- A\. Cohan, F\. Dernoncourt, D\. S\. Kim, T\. Bui, S\. Kim, W\. Chang, and N\. Goharian \(2018\)A discourse\-aware attention model for abstractive summarization of long documents\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,pp\. 615–621\.External Links:[Document](https://dx.doi.org/10.18653/v1/N18-2097),[Link](https://aclanthology.org/N18-2097)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.10.9.4)\.
- Q\. Dang and C\. Ngo \(2026\)Selective steering: norm\-preserving control through discriminative layer selection\.External Links:2601\.19375,[Link](https://arxiv.org/abs/2601.19375)Cited by:[§2](https://arxiv.org/html/2606.06735#S2.p4.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, A\. Goyal,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.External Links:[Link](https://arxiv.org/abs/2407.21783)Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.2.1.3),[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.3.2.3),[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.4.3.3),[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.5.4.3),[Appendix J](https://arxiv.org/html/2606.06735#A10.p1.1)\.
- A\. Fan, M\. Lewis, and Y\. Dauphin \(2018\)Hierarchical neural story generation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 889–898\.External Links:[Document](https://dx.doi.org/10.18653/v1/P18-1082),[Link](https://aclanthology.org/P18-1082)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.11.10.4)\.
- Gemma Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, T\. Mesnard, B\. Shahriari, A\. Ramé, J\. Ferret,et al\.\(2024\)Gemma 2: improving open language models at a practical size\.arXiv preprint arXiv:2408\.00118\.External Links:[Link](https://arxiv.org/abs/2408.00118)Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.8.7.3),[Appendix J](https://arxiv.org/html/2606.06735#A10.p1.1)\.
- GitHub and CodeSearchNet Contributors \(2019\)CodeSearchNet repository\.Note:[https://github\.com/github/CodeSearchNet](https://github.com/github/CodeSearchNet)Code and documentation are MIT; source\-code examples include per\-file upstream licensesCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.15.14.4)\.
- A\. Gokaslan, V\. Cohen, E\. Pavlick, and S\. Tellex \(2019\)OpenWebText corpus\.Note:[https://skylion007\.github\.io/OpenWebTextCorpus/](https://skylion007.github.io/OpenWebTextCorpus/)External Links:[Link](https://zenodo.org/records/3834942)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.8.7.4)\.
- A\. Gokaslan and V\. Cohen \(2019\)OpenWebText corpus download page\.Note:[https://skylion007\.github\.io/OpenWebTextCorpus/](https://skylion007.github.io/OpenWebTextCorpus/)Dataset packaging released under CC0; underlying web text not owned by dataset authorsCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.8.7.4)\.
- Google and Hugging Face Datasets Contributors \(2024\)Civil comments dataset card\.Note:[https://huggingface\.co/datasets/google/civil\_comments](https://huggingface.co/datasets/google/civil_comments)Lists the dataset distribution under CC0\-1\.0Cited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.4.3.4)\.
- Google Research \(2019\)Natural questions download page\.Note:[https://ai\.google\.com/research/NaturalQuestions/download](https://ai.google.com/research/NaturalQuestions/download)Lists Natural Questions under the Creative Commons Share\-Alike 3\.0 licenseCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.12.11.4)\.
- Google \(2026\)Gemma terms of use\.Note:[https://ai\.google\.dev/gemma/terms](https://ai.google.dev/gemma/terms)Last modified: April 1, 2026Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.8.7.4)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px2.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.7.6.4)\.
- K\. M\. Hermann, T\. Kočiský, E\. Grefenstette, L\. Espeholt, W\. Kay, M\. Suleyman, and P\. Blunsom \(2015\)Teaching machines to read and comprehend\.InAdvances in Neural Information Processing Systems,Vol\.28\.External Links:[Link](https://papers.nips.cc/paper/2015/hash/afdec7005cc9f14302cd0474fd0f3c96-Abstract.html)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.13.12.4)\.
- Hugging Face Datasets Contributors \(2024a\)CNN/dailymail dataset card\.Note:[https://huggingface\.co/datasets/abisee/cnn\_dailymail](https://huggingface.co/datasets/abisee/cnn_dailymail)Lists the dataset distribution under Apache\-2\.0Cited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.13.12.4)\.
- Hugging Face Datasets Contributors \(2024b\)Scientific papers dataset card\.Note:[https://huggingface\.co/datasets/armanc/scientific\_papers](https://huggingface.co/datasets/armanc/scientific_papers)Dataset obtained from arXiv and PubMed OpenAccess sources; license should be checked against the selected source distributionCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.10.9.4)\.
- Hugging Face Datasets Contributors \(2024c\)TruthfulQA dataset card\.Note:[https://huggingface\.co/datasets/domenicrosati/TruthfulQA](https://huggingface.co/datasets/domenicrosati/TruthfulQA)Lists the dataset distribution under Apache\-2\.0Cited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.2.1.4)\.
- Hugging Face Datasets Contributors \(2024d\)WritingPrompts dataset card\.Note:[https://huggingface\.co/datasets/euclaise/writingprompts](https://huggingface.co/datasets/euclaise/writingprompts)Lists the dataset distribution under MITCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.11.10.4)\.
- H\. Husain, H\. Wu, T\. Gazit, M\. Allamanis, and M\. Brockschmidt \(2019\)CodeSearchNet challenge: evaluating the state of semantic code search\.arXiv preprint arXiv:1909\.09436\.External Links:[Link](https://arxiv.org/abs/1909.09436)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.15.14.4)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. W\. Cohen, and X\. Lu \(2019a\)PubMedQA repository\.Note:[https://github\.com/pubmedqa/pubmedqa](https://github.com/pubmedqa/pubmedqa)Repository released under MITCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.14.13.4)\.
- Q\. Jin, B\. Dhingra, Z\. Liu, W\. W\. Cohen, and X\. Lu \(2019b\)PubMedQA: a dataset for biomedical research question answering\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing,pp\. 2567–2577\.External Links:[Document](https://dx.doi.org/10.18653/v1/D19-1259),[Link](https://aclanthology.org/D19-1259)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.14.13.4)\.
- Kaggle Dataset Contributors \(2024\)Stanford sentiment treebank v2 \(sst2\) dataset\.Note:[https://www\.kaggle\.com/datasets/atulanandjha/stanford\-sentiment\-treebank\-v2\-sst2](https://www.kaggle.com/datasets/atulanandjha/stanford-sentiment-treebank-v2-sst2)Lists the dataset distribution under CC0Cited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.3.2.4)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee, K\. Toutanova, L\. Jones, M\. Kelcey, M\. Chang, A\. M\. Dai, J\. Uszkoreit, Q\. Le, and S\. Petrov \(2019\)Natural questions: a benchmark for question answering research\.InTransactions of the Association for Computational Linguistics,Vol\.7,pp\. 453–466\.External Links:[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00276),[Link](https://aclanthology.org/Q19-1026)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.12.11.4)\.
- K\. Li, O\. Patel, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2023\)Inference\-time intervention: eliciting truthful answers from a language model\.External Links:2306\.03341,[Link](https://arxiv.org/abs/2306.03341)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p4.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)TruthfulQA: measuring how models mimic human falsehoods\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3214–3252\.External Links:[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229),[Link](https://aclanthology.org/2022.acl-long.229)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px1.p1.1),[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.2.1.4)\.
- A\. L\. Maas, R\. E\. Daly, P\. T\. Pham, D\. Huang, A\. Y\. Ng, and C\. Potts \(2011\)Learning word vectors for sentiment analysis\.InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies,pp\. 142–150\.External Links:[Link](https://aclanthology.org/P11-1015)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px1.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.5.4.4)\.
- A\. L\. Maas \(2011\)Large movie review dataset\.Note:[https://ai\.stanford\.edu/˜amaas/data/sentiment/](https://ai.stanford.edu/~amaas/data/sentiment/)Original Stanford dataset pageCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.5.4.4)\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2016\)Pointer sentinel mixture models\.arXiv preprint arXiv:1609\.07843\.External Links:[Link](https://arxiv.org/abs/1609.07843)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px2.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.6.5.4)\.
- Meta AI \(2024a\)Llama 3\.1 community license agreement\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.1\-8B\-Instruct/blob/main/LICENSE](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/blob/main/LICENSE)Version release date: July 23, 2024Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.2.1.4),[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.3.2.4),[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.4.3.4)\.
- Meta AI \(2024b\)Llama 3\.2 community license agreement\.Note:[https://huggingface\.co/meta\-llama/Llama\-3\.2\-1B\-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)Version release date: September 25, 2024Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.5.4.4)\.
- N\. Panickssery, N\. Gabrieli, J\. Schulz, M\. Tong, E\. Hubinger, and A\. M\. Turner \(2023\)Steering Llama 2 via contrastive activation addition\.External Links:2312\.06681,[Link](https://arxiv.org/abs/2312.06681)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p1.1)\.
- K\. Park, Y\. J\. Choe, and V\. Veitch \(2024\)The linear representation hypothesis and the geometry of large language models\.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.235,pp\. 39643–39666\.External Links:[Link](https://proceedings.mlr.press/v235/park24c.html)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p2.1)\.
- Qwen Team, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei,et al\.\(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.External Links:[Link](https://arxiv.org/abs/2412.15115)Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.6.5.3),[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.7.6.3),[Appendix J](https://arxiv.org/html/2606.06735#A10.p1.1)\.
- Qwen Team \(2024a\)Qwen research license agreement\.Note:[https://huggingface\.co/Qwen/Qwen2\.5\-3B\-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)Qwen2\.5\-3B\-Instruct licenseCited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.7.6.4)\.
- Qwen Team \(2024b\)Qwen2\.5 model release and licensing\.Note:[https://qwenlm\.github\.io/blog/qwen2\.5/](https://qwenlm.github.io/blog/qwen2.5/)Qwen2\.5\-7B\-Instruct is released under Apache\-2\.0Cited by:[Table 11](https://arxiv.org/html/2606.06735#A10.T11.1.1.6.5.4)\.
- N\. Rimsky, N\. Gabrieli, J\. Schulz, M\. Tong, E\. Hubinger, and A\. Turner \(2024\)Steering Llama 2 via contrastive activation addition\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Bangkok, Thailand,pp\. 15504–15522\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.828),[Link](https://aclanthology.org/2024.acl-long.828/)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p1.1)\.
- Salesforce and Hugging Face Datasets Contributors \(2024\)WikiText dataset card\.Note:[https://huggingface\.co/datasets/Salesforce/wikitext](https://huggingface.co/datasets/Salesforce/wikitext)Lists WikiText under a Creative Commons Attribution\-ShareAlike licenseCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.6.5.4)\.
- A\. See, P\. J\. Liu, and C\. D\. Manning \(2017\)Get to the point: summarization with pointer\-generator networks\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1073–1083\.External Links:[Document](https://dx.doi.org/10.18653/v1/P17-1099),[Link](https://aclanthology.org/P17-1099)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.13.12.4)\.
- R\. Socher, A\. Perelygin, J\. Wu, J\. Chuang, C\. D\. Manning, A\. Y\. Ng, and C\. Potts \(2013\)Recursive deep models for semantic compositionality over a sentiment treebank\.InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,pp\. 1631–1642\.External Links:[Link](https://aclanthology.org/D13-1170)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px1.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.3.2.4)\.
- R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023a\)Alpaca: a strong, replicable instruction\-following model\.Note:Stanford Center for Research on Foundation ModelsExternal Links:[Link](https://crfm.stanford.edu/2023/03/13/alpaca.html)Cited by:[Appendix I](https://arxiv.org/html/2606.06735#A9.SS0.SSS0.Px3.p1.1),[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.9.8.4)\.
- R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023b\)Stanford alpaca repository\.Note:[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)Dataset released under CC BY\-NC 4\.0 for research / non\-commercial useCited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.9.8.4)\.
- A\. M\. Turner, L\. Thiergart, D\. Udell, G\. Leech, J\. J\. Vazquez, U\. Mini, and M\. MacDiarmid \(2023\)Activation addition: steering language models without optimization\.External Links:2308\.10248,[Link](https://arxiv.org/abs/2308.10248)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p1.1)\.
- H\. M\. Vu and T\. M\. Nguyen \(2025\)Angular steering: behavior control via rotation in activation space\.External Links:2510\.26243,[Link](https://arxiv.org/abs/2510.26243)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§1](https://arxiv.org/html/2606.06735#S1.p2.1),[§2](https://arxiv.org/html/2606.06735#S2.p3.1)\.
- Z\. You, C\. Deng, and H\. Chen \(2026\)Spherical steering: geometry\-aware activation rotation for language models\.External Links:2602\.08169,[Link](https://arxiv.org/abs/2602.08169)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§1](https://arxiv.org/html/2606.06735#S1.p2.1),[§2](https://arxiv.org/html/2606.06735#S2.p3.1)\.
- Zenodo Dataset Contributors \(2023\)Binary stanford sentiment treebank 2 \(sst\-2\)\.Note:[https://zenodo\.org/records/7555310](https://zenodo.org/records/7555310)Lists the dataset distribution under CC\-BY\-4\.0Cited by:[Table 10](https://arxiv.org/html/2606.06735#A9.T10.1.1.3.2.4)\.
- A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks \(2023\)Representation engineering: a top\-down approach to AI transparency\.External Links:2310\.01405,[Link](https://arxiv.org/abs/2310.01405)Cited by:[§1](https://arxiv.org/html/2606.06735#S1.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p1.1),[§2](https://arxiv.org/html/2606.06735#S2.p2.1),[§2](https://arxiv.org/html/2606.06735#S2.p4.1)\.

## Acknowledgments

This work was supported by a Google DeepMind PhD Studentship, and the work utilized Queen Mary’s Andrena HPC facility, supported by QMUL Research-IT. This work was also supported by the Engineering and Physical Sciences Research Council \[grant number EP/Y009800/1\], through funding from Responsible AI UK \(KP0016\).

## Appendix A Additional Norm-Variation Analysis

The main text reports the layerwise pattern of last-prompt-token norm variation. This appendix provides the supporting details. We first report the per-corpus coefficient of variation (CV) at the 75%-depth layer, and then expand the analysis to prompt and generation positions. These per-position plots separate cross-sample norm variation from position-dependent norm effects, which can be hidden by aggregate statistics.

#### Per\-corpus norm variation\.

Table [2](https://arxiv.org/html/2606.06735#A1.T2) reports the per-corpus CV of last-prompt-token hidden-state norms at the 75%-depth layer. The same qualitative pattern as in the main text holds across corpora: Llama and Qwen models usually have moderate CV, while Gemma has substantially larger variation because of its post-norm architecture.

Table 2: Per-corpus CV of last-prompt-token hidden-state norms at the 75%-depth layer.

#### Prompt-token positions.

Figure [8](https://arxiv.org/html/2606.06735#A1.F8) shows pointwise CV across prompt positions. The largest position-specific effect appears at the beginning of the prompt: in Llama and Qwen models, the first token behaves like an attention-sink position and has a distinct norm distribution. After the first few tokens, CV settles to a more stable plateau. Gemma remains different, with elevated variation across many layers because of its post-norm architecture.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x17.png)

Figure 8: Pointwise CV of hidden-state norms across prompt-token positions. The first prompt positions, especially position 0, show strong architecture-dependent effects; later positions settle to a more stable plateau.

#### Generation-token positions.

Figure [9](https://arxiv.org/html/2606.06735#A1.F9) shows the same analysis for generated tokens. Compared with prompt tokens, generation positions are more stable for most instruction-tuned models, which is the relevant regime for the steering hook during decoding. The Llama base model is less stable under unconstrained generation and shows larger CV at later layers.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x18.png)

Figure 9: Pointwise CV of hidden-state norms across generation-token positions. Instruction-tuned models show relatively stable generation-token CV, while the Llama base model has elevated variation at later layers.

#### Cumulative CV.

Figures [10](https://arxiv.org/html/2606.06735#A1.F10) and [11](https://arxiv.org/html/2606.06735#A1.F11) show cumulative CV when positions are pooled from the start of the sequence. For prompt tokens, the early attention-sink positions strongly affect the pooled statistic; as more content positions are included, this effect is diluted. For generated tokens, the cumulative curves converge quickly, indicating that generation-time norm variation is not dominated by a few outlier positions.
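
The difference between pointwise and cumulative CV can be sketched as follows. This is an illustrative reading of the two statistics, not the authors' code: CV is taken as standard deviation divided by mean, using the population standard deviation (an assumption), and the array `norms` is synthetic, with an artificially large norm at position 0 to mimic an attention-sink token.

```python
import numpy as np

def pointwise_cv(norms):
    """CV at each position, computed across samples.

    norms has shape (num_samples, num_positions).
    """
    return norms.std(axis=0) / norms.mean(axis=0)

def cumulative_cv(norms):
    """CV of all norms pooled from position 0 up to and including position t."""
    out = []
    for t in range(norms.shape[1]):
        pooled = norms[:, : t + 1].ravel()
        out.append(pooled.std() / pooled.mean())
    return np.array(out)

# Synthetic norms: position 0 has a much larger typical norm than the rest.
rng = np.random.default_rng(0)
norms = rng.normal(loc=10.0, scale=1.0, size=(200, 16))
norms[:, 0] = rng.normal(loc=100.0, scale=5.0, size=200)

pw = pointwise_cv(norms)
cum = cumulative_cv(norms)
print("max pointwise CV:", round(float(pw.max()), 3))
print("cumulative CV over all positions:", round(float(cum[-1]), 3))
```

In this synthetic setting every single-position CV is small, yet the pooled statistic is much larger because it mixes two norm scales. This is the mechanism by which early attention-sink positions can inflate aggregate prompt-token CV.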

![Refer to caption](https://arxiv.org/html/2606.06735v1/x19.png)

Figure 10: Cumulative CV over prompt-token positions. Pooling early attention-sink positions with later content positions produces large prompt-token CV, showing that aggregate prompt statistics are sensitive to position-dependent norm scale.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x20.png)

Figure 11: Cumulative CV over generation-token positions. The curves converge quickly for most instruction-tuned models, indicating that generation-token norm variation is not dominated by a small number of outlier positions.

#### Layerwise token-population comparison.

Figure[12](https://arxiv.org/html/2606.06735#A1.F12)compares the mean CV across corpora for three token populations: the last prompt token, all prompt tokens, and generated tokens\. The all\-prompt\-token curve is much larger because it pools positions with different typical norm scales\. In contrast, generation\-token CV is more stable for most instruction\-tuned models, which supports the interpretation that decoding\-time steering operates on a comparatively stable radial landscape\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x21.png)

Figure 12: Mean CV across corpora for last prompt tokens, all prompt tokens, and generation tokens\. Position\-dependent norm variation, especially from early attention\-sink positions, strongly inflates the all\-prompt\-token CV\. Generation\-token norms are more stable for most instruction\-tuned models\.

#### Mean norm profiles\.

Figures [13](https://arxiv.org/html/2606.06735#A1.F13) and [14](https://arxiv.org/html/2606.06735#A1.F14) report the corresponding mean norm profiles\. Prompt\-token norms show strong position effects, especially at the first token in Llama and Qwen architectures\. In contrast, generation\-token norms are nearly constant across positions at a fixed layer for most instruction\-tuned models\. This supports the main\-text interpretation that prompt\-token statistics can be strongly position\-dependent, while decoding positions are more stable\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x22.png)

Figure 13: Mean hidden\-state norm across prompt\-token positions\. Norms increase with layer depth, and the first prompt position can have a disproportionately large norm in Llama and Qwen architectures\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x23.png)

Figure 14: Mean hidden\-state norm across generation\-token positions\. At each layer, generation\-token norms are nearly constant across positions for most instruction\-tuned models\.

Overall, prompt\-token norms are strongly affected by position, whereas generation\-token norms are more stable across decoding steps\. Thus, norm preservation in spherical steering should be understood as preserving each token’s own radius, not as forcing all activations onto a shared global radius\.

#### Token\-population summary\.

Table [3](https://arxiv.org/html/2606.06735#A1.T3) summarizes norm CV at the 75%\-depth layer for the three token populations used in the norm\-variation analysis\. The all\-prompt\-token statistic is much larger because it pools positions with very different typical norm scales\. Generated tokens are more stable for most instruction\-tuned models, which is the regime most relevant to steering during decoding\.

Table 3: Norm CV at the 75%\-depth layer for three token populations, averaged across corpora\. Generation tokens are the positions directly modified by the steering hook during decoding\.

## Appendix B Additional Directional\-Encoding Results

The main text reports the layerwise probe curves showing that concept information is primarily encoded in activation direction\. Here we provide the corresponding per\-model and per\-dataset probe accuracies\. We compare linear probes trained on raw hidden states, unit\-normalized hidden states, and norm\-only features\. Across all evaluated concepts, unit\-normalized probes closely match raw probes, while norm\-only probes remain near chance\.

Table 4: Linear\-probe accuracies for raw, unit\-normalized, and norm\-only representations\. Unit\-normalized features retain almost all of the predictive information in raw hidden states, while norm\-only features remain close to chance\.

The table supports the directional\-encoding claim used throughout the paper\. Normalizing hidden states to unit length causes almost no loss in probe accuracy, indicating that the concepts remain linearly accessible after removing the radial component\. In contrast, probes trained only on the activation norm are close to chance for all datasets and model families\. This pattern also holds for Gemma, where norm variation is much larger than in the Llama and Qwen models, showing that large radial variability does not imply that the concept itself is encoded in the norm\.
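The three probe inputs can be sketched as below. This is only an illustration of the feature construction, assuming hidden states are stored as a `(n, d)` array; `probe_features` is a hypothetical helper, the data are random, and the probe training itself is not shown.

```python
import numpy as np

def probe_features(hidden):
    """Return raw, unit-normalized, and norm-only features for `hidden` of shape (n, d)."""
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    return {
        "raw": hidden,            # direction and radius together
        "unit": hidden / norms,   # direction only (radius removed)
        "norm_only": norms,       # radius only, shape (n, 1)
    }

# Random stand-in for hidden states with widely varying norms.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(8, 16)) * rng.uniform(1.0, 50.0, size=(8, 1))
feats = probe_features(hidden)
```

A linear probe fitted on `feats["unit"]` sees no radial information, while one fitted on `feats["norm_only"]` sees nothing else, which is what makes the comparison a test of directional encoding.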

## Appendix C CAA\-m Per\-Token Matching Algorithm

CAA\-m chooses a separate additive coefficient for every token so that the normalized output reaches the requested concept score\. Let

$$x = r\left(c\,s + \sqrt{1-c^{2}}\,v\right),$$

where $r=\|x\|$, $c=\langle x/\|x\|, s\rangle$, and $v$ is the unit vector along the component of $x/\|x\|$ orthogonal to $s$\. For $y = x + \alpha s$, the target constraint is

$$\left\langle \frac{y}{\|y\|}, s \right\rangle = \gamma.$$

Since

$$y = (rc+\alpha)\,s + r\sqrt{1-c^{2}}\,v,$$

solving the constraint gives

$$\alpha = r\left(\frac{\gamma\sqrt{1-c^{2}}}{\sqrt{1-\gamma^{2}}} - c\right). \tag{13}$$

This expression is well\-defined for $\gamma\in(-1,1)$\. When $|\gamma|$ approaches 1, the required additive coefficient can become large, reflecting the fact that an almost perfectly aligned target direction may require a large displacement for tokens whose initial residual component orthogonal to $s$ is large\. CAA\-m thus controls the angular concept score while allowing the norm to change\.
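A minimal numerical check of Eq. (13), assuming a unit-norm concept direction `s` and a single token activation `x`. The function name `caa_m_alpha` and the random test vectors are illustrative and not part of the paper's code.

```python
import numpy as np

def caa_m_alpha(x, s, gamma):
    """Per-token additive coefficient from Eq. (13); `s` must be unit-norm, |gamma| < 1."""
    r = np.linalg.norm(x)
    c = np.dot(x, s) / r
    return r * (gamma * np.sqrt(1.0 - c**2) / np.sqrt(1.0 - gamma**2) - c)

rng = np.random.default_rng(0)
d = 64
s = rng.normal(size=d)
s /= np.linalg.norm(s)           # unit concept direction
x = rng.normal(size=d) * 7.0     # arbitrary token activation
gamma = 0.6                      # requested concept score

alpha = caa_m_alpha(x, s, gamma)
y = x + alpha * s
achieved = np.dot(y, s) / np.linalg.norm(y)   # matches gamma up to rounding
```

The achieved score equals the target, while `np.linalg.norm(y)` generally differs from `np.linalg.norm(x)`, which is the radial side effect the appendix describes.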

## Appendix D Additional Fixed\-Angle Steering Results

This appendix provides additional results for the comparison between S and CAA\-m at matched per\-token target $\gamma$\. Since both methods reach the same normalized concept direction, their difference is radial: S preserves the original norm, while CAA\-m leaves the additive norm change intact\. Here we show the per\-dataset gap comparison and the full per\-cell tables\.
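The size of this radial difference follows from a short derivation. It is a sketch worked out here from Eq. (13) in Appendix C and the spherical form $y_{\mathrm{S}} = \|x\|\left(\gamma s + \sqrt{1-\gamma^{2}}\,v\right)$ that appears in Appendix G; the subscripts are labels introduced for illustration, and the result is not quoted from the paper.

$$
\|y_{\mathrm{CAA\text{-}m}}\|^{2} = (rc+\alpha)^{2} + r^{2}(1-c^{2}) = \frac{\gamma^{2} r^{2}(1-c^{2})}{1-\gamma^{2}} + r^{2}(1-c^{2}) = \frac{r^{2}(1-c^{2})}{1-\gamma^{2}},
\qquad
\frac{\|y_{\mathrm{CAA\text{-}m}}\|}{\|y_{\mathrm{S}}\|} = \sqrt{\frac{1-c^{2}}{1-\gamma^{2}}}.
$$

Under these assumptions, CAA-m enlarges the norm relative to S whenever $|\gamma| > |c|$, and the ratio grows without bound as $|\gamma| \to 1$.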

#### Per\-dataset gaps\.

Figure [15](https://arxiv.org/html/2606.06735#A4.F15) reports the difference between CAA\-m and S across datasets and models\. At low $\gamma$, the two methods are close on all metrics\. At larger $\gamma$, CAA\-m opens a large stability gap: it usually has much lower perplexity and higher MMLU accuracy, while the downstream task metric remains comparable on average but varies more across models and datasets\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x24.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x25.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x26.png)

Figure 15: Per\-dataset S vs\. CAA\-m gaps at matched per\-token target $\gamma$\. Top: downstream\-metric gap, CAA\-m $-$ S, in percentage points\. Middle: WikiText\-103 perplexity gap, CAA\-m $-$ S, using a symlog scale\. Bottom: MMLU gap, CAA\-m $-$ S, in percentage points\. The dashed grey line marks zero gap\. For downstream metrics and MMLU, positive values favour CAA\-m\. For perplexity, negative values favour CAA\-m because lower perplexity is better\.

## Appendix E Additional Fixed\-Strength Steering Results

This appendix provides additional results for the fixed\-strength methods: CAA, CAA\-r, and AS\. Unlike S and CAA\-m, these methods use a single global steering parameter and are calibrated to match the target mean concept score $\bar{\gamma}$\. This comparison isolates whether norm preservation alone explains downstream stability\. CAA\-r and AS both preserve the hidden\-state norm, while CAA does not; however, the results show that the token\-level angular profile is more important than norm preservation alone\.

#### Downstream trajectory\.

Figure [16](https://arxiv.org/html/2606.06735#A5.F16) compares the downstream\-metric trajectory of the three fixed\-strength methods as the target mean concept score increases\. The methods behave similarly at moderate targets, while AS becomes less stable at high $\bar{\gamma}$\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x27.png)

Figure 16: Mean downstream metric change, $\Delta$task, versus target mean concept score, averaged across models per dataset\. CAA, CAA\-r, and AS produce similar gains at moderate targets, while AS diverges at high $\bar{\gamma}$ because its fixed spherical displacement causes larger token\-level disruption\.

#### CAA\-r versus CAA\.

Figure [17](https://arxiv.org/html/2606.06735#A5.F17) compares CAA\-r and CAA at matched mean concept score\. These methods have the same normalized output direction after the additive update; CAA\-r only rescales the result back to the original norm\. Consequently, their downstream and PPL curves stay close across targets\. Figure [18](https://arxiv.org/html/2606.06735#A5.F18) shows the same comparison as per\-dataset gaps, confirming that post\-hoc renormalization is not the main factor controlling stability in the fixed\-strength setting\.
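The relationship between the two methods can be sketched in a few lines. This assumes CAA is the plain additive update `x + alpha * s` with one global `alpha` and a unit-norm `s`, and that CAA-r rescales each token back to its original norm, as described above. The function names and the random data are illustrative only.

```python
import numpy as np

def caa(x, s, alpha):
    """Additive steering with a single global coefficient."""
    return x + alpha * s

def caa_r(x, s, alpha):
    """CAA followed by rescaling each token back to its original norm."""
    y = caa(x, s, alpha)
    scale = np.linalg.norm(x, axis=-1, keepdims=True) / np.linalg.norm(y, axis=-1, keepdims=True)
    return y * scale

def concept_score(h, s):
    """Cosine between each row of `h` and the unit concept direction `s`."""
    return (h @ s) / np.linalg.norm(h, axis=-1)

rng = np.random.default_rng(0)
s = rng.normal(size=32)
s /= np.linalg.norm(s)
x = rng.normal(size=(5, 32)) * 4.0       # five token activations

y_caa = caa(x, s, 3.0)
y_caar = caa_r(x, s, 3.0)
```

Both outputs have identical per-token concept scores; they differ only in radius, since `y_caar` keeps the norms of `x`.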

![Refer to caption](https://arxiv.org/html/2606.06735v1/x28.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x29.png)

Figure 17: CAA\-r versus CAA at matched mean concept score\. Top: downstream metric versus $\bar{\gamma}$\. Bottom: WikiText\-103 PPL ratio\. Since CAA\-r only renormalizes the additive CAA output, the two methods remain close in downstream behavior\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x30.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x31.png)

Figure 18: CAA\-r $-$ CAA gap per dataset, with one line per model\. Top: downstream\-metric difference in percentage points\. Bottom: WikiText\-103 PPL\-ratio difference, shown on a symlog scale\. The dashed grey line marks zero gap\. The gaps remain small across most targets, showing that renormalizing CAA does not substantially change behavior in this fixed\-strength regime\.

#### CAA\-r versus AS\.

Figure [19](https://arxiv.org/html/2606.06735#A5.F19) compares CAA\-r and AS at matched mean concept score\. Both methods preserve $\|y\|=\|x\|$, but they distribute the angular intervention differently across tokens\. CAA\-r inherits a token\-dependent angular displacement from the additive update, whereas AS applies a fixed spherical displacement\. Figure [20](https://arxiv.org/html/2606.06735#A5.F20) shows that this difference produces a large PPL gap at high $\bar{\gamma}$: AS becomes substantially less stable despite preserving the norm\. Thus, norm preservation alone is not sufficient to explain the steering–quality trade\-off\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x32.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x33.png)

Figure 19: CAA\-r versus AS at matched mean concept score\. Top: downstream metric versus $\bar{\gamma}$\. Bottom: WikiText\-103 PPL ratio\. Both methods preserve the hidden\-state norm, but they induce different token\-level angular profiles\. AS becomes much more costly in PPL at high $\bar{\gamma}$, showing that norm preservation alone is not sufficient for stable steering\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x34.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x35.png)

Figure 20: CAA\-r $-$ AS gap per dataset, with one line per model\. Top: downstream\-metric difference in percentage points\. Bottom: WikiText\-103 PPL\-ratio difference, shown on a symlog scale\. Negative PPL gaps mean CAA\-r has lower perplexity than AS\. Although both methods preserve norm, AS incurs much larger PPL degradation at high $\bar{\gamma}$\.

#### Calibration dose response\.

Figure [21](https://arxiv.org/html/2606.06735#A5.F21) shows the calibration curves used to match the target mean concept score\. For CAA\-r, the required additive strength is highly model\-dependent because it depends on the scale of the residual stream\. Gemma requires a much wider search range for $\alpha$\. In contrast, AS is calibrated by angular displacement and is therefore less sensitive to activation norm scale\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x36.png)

![Refer to caption](https://arxiv.org/html/2606.06735v1/x37.png)

Figure 21: Dose\-response curves for fixed\-strength calibration\. Left: CAA\-r mean concept score versus additive strength $\alpha$ on a log scale\. Right: AS mean concept score versus angular displacement $\Delta\theta$\. CAA\-r calibration is sensitive to residual\-stream norm scale, while AS calibration is norm\-invariant\.

## Appendix F Concept\-Score Closure

This appendix compares the steering methods by how tightly they close the gap to the requested concept score\. The main experiments compare downstream behavior and perplexity; here we isolate the intervention itself by measuring the achieved per\-token concept score after steering\. This diagnostic separates methods that enforce a target score token\-by\-token from methods that only match a target score on average\.

#### Per\-token score variance\.

Figure [22](https://arxiv.org/html/2606.06735#A6.F22) reports the standard deviation of achieved concept scores across tokens at the target level used in the closure diagnostic\. S and CAA\-m have near\-zero spread because they explicitly solve for the target concept score for each token\. In contrast, CAA, CAA\-r, and AS use a single global steering strength\. Even when their mean achieved score is calibrated to the target, individual tokens spread over a much wider range\. This confirms that concept\-score closure is a separate axis of method design: two methods can have the same average steering strength but very different token\-level precision\.
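The contrast between per-token targeting and mean matching can be reproduced on synthetic data. The sketch below assumes the spherical form $y = \|x\|(\gamma s + \sqrt{1-\gamma^{2}}\,v)$ for S and a plain additive update for CAA, with the global coefficient found by bisection. The bisection calibration, the function names, and the random activations are illustrative choices and not the paper's procedure.

```python
import numpy as np

def concept_score(h, s):
    """Cosine between each row of `h` and the unit concept direction `s`."""
    return (h @ s) / np.linalg.norm(h, axis=-1)

def spherical_steer(x, s, gamma):
    """Norm-preserving per-token steering to concept score `gamma`."""
    r = np.linalg.norm(x, axis=-1, keepdims=True)
    resid = x - (x @ s)[:, None] * s                  # component orthogonal to s
    v = resid / np.linalg.norm(resid, axis=-1, keepdims=True)
    return r * (gamma * s + np.sqrt(1.0 - gamma**2) * v)

rng = np.random.default_rng(0)
s = rng.normal(size=32)
s /= np.linalg.norm(s)
x = rng.normal(size=(200, 32)) * rng.uniform(2.0, 8.0, size=(200, 1))

gamma = 0.5
scores_s = concept_score(spherical_steer(x, s, gamma), s)

# One global alpha, chosen by bisection so that the *mean* score equals gamma.
lo, hi = 0.0, 1e3
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if concept_score(x + mid * s, s).mean() < gamma:
        lo = mid
    else:
        hi = mid
scores_caa = concept_score(x + hi * s, s)
```

Here `scores_s.std()` is zero up to rounding, while `scores_caa.std()` stays clearly positive even though `scores_caa.mean()` matches the target, because tokens with different norms respond differently to the same additive coefficient.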

![Refer to caption](https://arxiv.org/html/2606.06735v1/x38.png)

Figure 22: Per\-token concept\-score standard deviation at matched target score\. Per\-token targeted methods collapse tightly around the requested value, while fixed\-strength methods have much larger spread\. This shows that matching the mean concept score is not equivalent to closing the concept score for each token\.

#### Concept\-score distributions\.

Figure [23](https://arxiv.org/html/2606.06735#A6.F23) shows the full achieved\-score distributions on CivilComments\. The distributional view makes the same point as the standard\-deviation summary: targeted methods produce a sharp peak at the requested score, whereas fixed\-strength methods produce broader token\-level distributions\. The difference between CAA\-r and AS also becomes more pronounced at higher target scores\. Although both methods are calibrated to the same mean concept score and both preserve the hidden\-state norm, their achieved\-score distributions diverge at large $\gamma$ because they induce different token\-level angular profiles\.

![Refer to caption](https://arxiv.org/html/2606.06735v1/x39.png)

Figure 23: Achieved concept\-score distributions on CivilComments\. Each panel corresponds to one model and target score\. Targeted methods collapse near the requested score, while fixed\-strength methods spread across a wider interval even when calibrated to the same mean\. At higher target scores, the CAA\-r and AS distributions become more different, reflecting their distinct token\-level angular profiles\.

Overall, these results justify separating *semantic strength* from *concept\-score closure*\. Mean\-matched fixed\-strength methods can express the target concept on average, but they do not apply the same intervention to every token\. Per\-token targeted methods close the concept score much more precisely, which explains why they occupy a distinct part of the control–quality trade\-off in the main experiments\.

## Appendix G Off\-Arc Perturbations

This appendix tests whether the great\-circle arc used by S is empirically meaningful, not only geometrically minimal\. Starting from the spherical solution, we perturb the residual component away from the arc while keeping both the hidden\-state norm and the target concept score fixed\. Thus, any degradation caused by the perturbation cannot be explained by weaker concept control or by a different norm; it must come from moving away from the task\-relevant residual direction\.

Using the notation from the main text, we perturb the residual direction by an angle $\delta$ toward a direction $q$ orthogonal to both the concept direction and the original residual direction:

$$y(\delta) = \|x\|\left(\gamma s + \sqrt{1-\gamma^{2}}\left(\cos\delta\,v + \sin\delta\,q\right)\right).$$

All points on this sweep have the same norm and the same concept score\. The arc solution is $\delta=0$\. If the great\-circle arc is the empirically relevant axis, then perturbing in either direction should degrade perplexity, MMLU, or downstream task performance\.
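The invariance of norm and concept score along the sweep can be checked numerically. The sketch below builds a random `q` by Gram-Schmidt orthogonalization; the function name `off_arc`, the random vectors, and the grid of angles are illustrative assumptions and not the paper's configuration.

```python
import numpy as np

def off_arc(x, s, q, gamma, delta):
    """y(delta) = ||x|| (gamma s + sqrt(1 - gamma^2) (cos(delta) v + sin(delta) q))."""
    r = np.linalg.norm(x)
    resid = x - np.dot(x, s) * s
    v = resid / np.linalg.norm(resid)     # unit residual direction of x
    mix = np.cos(delta) * v + np.sin(delta) * q
    return r * (gamma * s + np.sqrt(1.0 - gamma**2) * mix)

rng = np.random.default_rng(0)
d = 32
s = rng.normal(size=d)
s /= np.linalg.norm(s)
x = rng.normal(size=d) * 5.0

# Random unit q orthogonal to both s and v (Gram-Schmidt).
resid = x - np.dot(x, s) * s
v = resid / np.linalg.norm(resid)
q = rng.normal(size=d)
q -= np.dot(q, s) * s
q -= np.dot(q, v) * v
q /= np.linalg.norm(q)

deltas = np.deg2rad([-30, -15, 0, 15, 30])   # illustrative angles only
sweep = np.stack([off_arc(x, s, q, 0.5, dl) for dl in deltas])
```

Every row of `sweep` has norm `np.linalg.norm(x)` and concept score 0.5, so any change in model behavior along the sweep is attributable to the residual direction alone.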

#### Aggregate off\-arc degradation\.

Table [5](https://arxiv.org/html/2606.06735#A7.T5) reports degradation relative to the arc solution\. PPL increases away from $\delta=0$, while MMLU and downstream task metrics generally decrease\. The effect is approximately symmetric and becomes stronger as $|\delta|$ increases\.

Table 5: Aggregate degradation under off\-arc perturbations\. PPL is reported as a ratio relative to $\delta=0$; MMLU and downstream metrics are reported as absolute changes relative to $\delta=0$\. PPL ratios above $1$ indicate degradation, while negative MMLU/downstream changes indicate degradation\.

#### Perturbation direction type\.

Table [6](https://arxiv.org/html/2606.06735#A7.T6) breaks the PPL effect down by the type of off\-arc direction\. Random directions produce the mildest degradation, PCA directions produce the steepest valleys, and cross\-dataset directions fall in between\. This suggests that the most important residual directions are aligned with high\-variance structure in the residual subspace, while concept axes from related datasets also overlap with task\-relevant residual variation\.

Table 6: PPL ratio by off\-arc perturbation direction type, averaged across completed cells\. PCA directions produce the steepest degradation, consistent with the residual subspace containing task\-relevant variation\.

PPL is minimized at $\delta=0$ in almost all completed cells, and the few exceptions have negligible relative gaps\. Overall, perturbing away from the spherical arc worsens model behavior even though the concept score and norm are held fixed\. This supports the interpretation that the S direction is not merely the shortest geometric edit, but also the empirically task\-relevant residual direction\.

## Appendix H Additional Norm\-Scaling Results

This appendix provides the detailed tables for the norm\-scaling sweep on top of S\. In this experiment, the angular component is held fixed while the norm is multiplied by $\beta$\. Thus, changing $\beta$ does not change the target concept score; it only changes the radius of the steered activation\. This makes the sweep a direct test of whether the norm acts as an independent stability lever\.
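That the norm scale leaves the concept score untouched can be seen in a toy example. The vector `y` below is a random stand-in for an S-steered activation, and `scale_norm` and `score` are hypothetical helper names used only for this sketch.

```python
import numpy as np

def scale_norm(y, beta):
    """Multiply the activation's norm by beta, leaving its direction unchanged."""
    return beta * y

def score(h, s):
    """Cosine between `h` and the unit concept direction `s`."""
    return float(np.dot(h, s) / np.linalg.norm(h))

rng = np.random.default_rng(0)
s = rng.normal(size=16)
s /= np.linalg.norm(s)
y = rng.normal(size=16)            # stand-in for an S-steered activation

betas = [0.9, 1.0, 1.1, 1.2]       # norm scales listed in the sweep
scores = [score(scale_norm(y, b), s) for b in betas]
norms = [float(np.linalg.norm(scale_norm(y, b))) for b in betas]
```

All entries of `scores` coincide while `norms` scales linearly with `beta`, so differences observed along the sweep cannot come from a change in angular alignment.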

#### PPL summary\.

Table [7](https://arxiv.org/html/2606.06735#A8.T7) summarizes the mean PPL ratio for each $(\gamma,\beta)$ pair, together with fold\-level win counts\. The monotone pattern at high $\gamma$ is the key result: once the angular edit is large, increasing the norm reduces the PPL penalty\.

Table 7: Mean PPL ratio under the $\beta$ sweep\. The best mean PPL for each $\gamma$ is highlighted in bold\. The lower block reports the number of folds in which each $\beta$ achieves the lowest PPL\.

| $\gamma$ | $\beta=0.9$ | $\beta=1.0$ (S) | $\beta=1.1$ | $\beta=1.2$ | Best |
| --- | --- | --- | --- | --- | --- |
| 0.1 | 1.10 | **1.10** | 1.11 | 1.12 | $\beta=1.0$ |
| 0.3 | 1.76 | 1.71 | 1.69 | **1.69** | $\beta=1.2$ |
| 0.5 | 9.82 | 7.98 | 6.88 | **6.21** | $\beta=1.2$ |
| 0.7 | 262.5 | 151.8 | 107.2 | **83.5** | $\beta=1.2$ |
| *Win counts: lowest PPL* | | | | | |
| 0.1 | 26/70 | 25/70 | 9/70 | 10/70 | — |
| 0.3 | 5/70 | 10/70 | 15/70 | 40/70 | — |
| 0.5 | 2/70 | 0/70 | 6/70 | 62/70 | — |
| 0.7 | 0/70 | 0/70 | 0/70 | 70/70 | — |

#### Task\-metric summary\.

Table [8](https://arxiv.org/html/2606.06735#A8.T8) reports the corresponding downstream task\-metric changes\. Compared with PPL, task performance is much less sensitive to $\beta$: the spread across norm scales remains small at each $\gamma$\. This supports the interpretation that $\beta$ is primarily a stability knob, not a semantic\-control knob\.

Table 8: Downstream task\-metric change under the $\beta$ sweep, in percentage points\. “Spread” is the maximum minus minimum over $\beta\in\{0.9, 1.0, 1.1, 1.2\}$\.

#### Large\-model sensitivity\.

Table [9](https://arxiv.org/html/2606.06735#A8.T9) isolates the 70B model\. The larger model is more sensitive to strong angular edits, producing larger PPL ratios at high $\gamma$, but the ordering over $\beta$ remains the same\. Larger norm scales still reduce PPL most strongly at high steering strengths\.

Table 9: 70B\-only PPL ratios under the $\beta$ sweep\. The 70B model amplifies the PPL gap at high $\gamma$, but the best $\beta$ ordering is unchanged\.

Overall, the $\beta$ sweep confirms that the radius is not merely a passive quantity\. Once the angular edit is fixed, changing the norm has little effect on the semantic task metric but a large effect on generation stability\. This strengthens the paper’s two\-parameter view of steering: $\gamma$ controls the angular concept intervention, while $\beta$ controls the radial stability of the resulting activation\.

## Appendix I Datasets and Data Sources

This section summarizes the datasets used in our experiments\. We use three groups of data: concept datasets for direction construction and downstream steering evaluation, auxiliary capability and language\-modeling benchmarks, and unlabeled corpora for norm\-variation diagnostics\.

#### Concept datasets\.

We evaluate steering on four concept datasets\. TruthfulQA is used for truthfulness steering and closed\-form multiple\-choice evaluation \(Lin et al\., [2022](https://arxiv.org/html/2606.06735#bib.bib14)\)\. SST\-2, derived from the Stanford Sentiment Treebank, is used for sentiment steering \(Socher et al\., [2013](https://arxiv.org/html/2606.06735#bib.bib15)\)\. CivilComments is used for toxicity and non\-toxicity steering \(Borkan et al\., [2019](https://arxiv.org/html/2606.06735#bib.bib16)\)\. IMDB is used as a second sentiment dataset with longer movie\-review inputs \(Maas et al\., [2011](https://arxiv.org/html/2606.06735#bib.bib17)\)\. These datasets define the contrastive concept directions and the task\-specific downstream metrics reported in the main experiments\.

#### Auxiliary evaluation datasets\.

To measure whether steering degrades general model behavior, we evaluate perplexity on WikiText\-103 \(Merity et al\., [2016](https://arxiv.org/html/2606.06735#bib.bib19)\)\. We also evaluate general capability using MMLU, a multi\-task benchmark covering broad factual and reasoning domains \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.06735#bib.bib18)\)\. These auxiliary datasets are not used to construct steering directions; they are used only to measure quality and capability retention under intervention\.

#### Norm\-variation corpora\.

For the norm\-variation analysis, we use a heterogeneous set of corpora spanning web text, instruction data, scientific writing, stories, question answering, toxicity comments, news, biomedical text, and code\. Specifically, we sample from OpenWebText \(Gokaslan et al\., [2019](https://arxiv.org/html/2606.06735#bib.bib23)\), Alpaca \(Taori et al\., [2023a](https://arxiv.org/html/2606.06735#bib.bib24)\), arXiv scientific papers \(Cohan et al\., [2018](https://arxiv.org/html/2606.06735#bib.bib25)\), WritingPrompts \(Fan et al\., [2018](https://arxiv.org/html/2606.06735#bib.bib26)\), TruthfulQA \(Lin et al\., [2022](https://arxiv.org/html/2606.06735#bib.bib14)\), Natural Questions \(Kwiatkowski et al\., [2019](https://arxiv.org/html/2606.06735#bib.bib20)\), CivilComments \(Borkan et al\., [2019](https://arxiv.org/html/2606.06735#bib.bib16)\), CNN/DailyMail \(Hermann et al\., [2015](https://arxiv.org/html/2606.06735#bib.bib21); See et al\., [2017](https://arxiv.org/html/2606.06735#bib.bib22)\), PubMedQA \(Jin et al\., [2019b](https://arxiv.org/html/2606.06735#bib.bib27)\), and CodeSearchNet \(Husain et al\., [2019](https://arxiv.org/html/2606.06735#bib.bib28)\)\. This mixture is intended to test whether the radial geometry of hidden states is stable across content domains rather than being an artifact of a single dataset\.

#### Dataset licenses\.

Table [10](https://arxiv.org/html/2606.06735#A9.T10) summarizes the licenses or usage terms associated with the dataset distributions used in this work\. Licenses vary across datasets and, in some cases, across mirrors of the same dataset\. We use all datasets only for research evaluation and do not redistribute the datasets\. For datasets whose original source does not specify a clear open\-data license, we report the relevant usage status conservatively and refer readers to the original source or distribution page\.

Table 10: Dataset licenses or usage terms for the datasets used in our experiments\. When a license depends on the distribution mirror, we report the license of the distribution we rely on or note that the license should be checked against the local source\.

## Appendix J Models and Licenses

Table [11](https://arxiv.org/html/2606.06735#A10.T11) summarizes the model checkpoints used in our experiments, together with their source families and licenses\. We evaluate three model families: Llama \(Dubey et al\., [2024](https://arxiv.org/html/2606.06735#bib.bib30)\), Qwen2\.5 \(Qwen Team et al\., [2024](https://arxiv.org/html/2606.06735#bib.bib31)\), and Gemma 2 \(Gemma Team et al\., [2024](https://arxiv.org/html/2606.06735#bib.bib32)\)\. All models are used only for research evaluation; we do not redistribute model weights\.

Table 11: Model checkpoints used in the experiments\. Licenses are reported according to the corresponding model cards or license pages\.

The licenses differ in permissiveness\. Qwen2\.5\-7B\-Instruct is released under Apache\-2\.0, whereas Qwen2\.5\-3B\-Instruct is governed by the Qwen Research License\. The Llama checkpoints are released under Meta’s Llama Community License, with separate license versions for Llama 3\.1 and Llama 3\.2\. Gemma\-2\-9B\-it is distributed under Google’s Gemma Terms of Use\. We report these licenses for transparency and refer readers to the original model cards and license documents for the full legal terms\.
