Predicting Where Steering Vectors Succeed
Summary
This paper introduces the Linear Accessibility Profile (LAP), a diagnostic method using logit lens to predict steering vector effectiveness across model layers, achieving ρ=+0.86 to +0.91 correlation on 24 concept families across five models. The work provides a systematic framework to determine which layers and concepts are suitable for steering interventions, replacing ad-hoc trial-and-error approaches.
# Predicting Where Steering Vectors Succeed
Source: https://arxiv.org/html/2604.15557
Jayadev Billa jbilla2004@gmail\.com
###### Abstract
Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention\. We introduce the *Linear Accessibility Profile* (lap), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness\. The key measure, A_lin, applies the model's unembedding matrix to intermediate hidden states, requiring no training\. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak A_lin predicts steering effectiveness at ρ = +0.86 to +0.91 and layer selection at ρ = +0.63 to +0.92\. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work\. An entity-steering demo confirms the prediction end-to-end: steering at the lap-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model\.
## 1 Introduction
Steering vectors add a direction to the residual stream to shift model behavior\. They have been applied to refusal (Arditi et al., 2024), truthfulness (Li et al., 2023), and broader behavioral properties (Zou et al., 2023; Turner et al., 2023)\. However, effectiveness varies across concepts and layers, and practitioners currently select steering layers by trial and error\. No systematic method predicts which setting will succeed\.
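As a minimal sketch of the intervention described above, the snippet below builds a difference-of-means direction from two groups of activations and adds it to residual-stream vectors. The function names, the toy Gaussian activations, and the scale `alpha` are illustrative assumptions, not the setup used in the experiments.

```python
import numpy as np

def difference_of_means(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Steering direction: mean activation of one class minus the other."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(h: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled direction to every residual-stream vector in h."""
    return h + alpha * direction

# Toy activations (hypothetical): the two classes differ along axis 0 only.
rng = np.random.default_rng(0)
d = 8
offset = np.zeros(d)
offset[0] = 3.0
pos = rng.normal(size=(50, d)) + offset
neg = rng.normal(size=(50, d))

v = difference_of_means(pos, neg)      # shape (d,)
h = rng.normal(size=(4, d))            # four hidden states at one layer
h_steered = steer(h, v, alpha=2.0)     # shifted along the class direction
```

In a real model the activations would be read from, and the shifted states written back to, a chosen layer of the residual stream; which layer to choose is the question the rest of the paper addresses.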
The logit lens (nostalgebraist, 2020) applies the unembedding matrix to intermediate hidden states to observe how predictions evolve across layers\. Belrose et al. (2023) addressed the layer norm mismatch with a learned correction (the "tuned lens")\. These methods characterize what the model "thinks" at each layer, but none connects this measurement to the success or failure of steering interventions\.
We repurpose the logit lens as a *predictor of steering vector effectiveness*\. The resulting framework, the Linear Accessibility Profile (lap), measures at each layer whether a concept is accessible through the model's own output projection and whether that accessibility predicts where steering will work\. Prior work selects steering layers heuristically, typically targeting middle layers (Turner et al., 2023; Templeton et al., 2024)\. lap operates at a different level: predicting whether a concept is steerable at all\. A concept with high peak A_lin (e.g., continent, A_lin = 0.68) steers effectively; one with low peak A_lin (e.g., parity, A_lin = 0.02) does not\. We complement the logit lens with a nonlinear upper bound (a residual MLP) to quantify the *probe gap*, and measure perturbation sensitivity (λ) to identify layers where the representation is unstable\.
We validate lap on single-token next-token completion tasks, where the logit lens gives an unambiguous accuracy metric\. Extensions to multi-token settings are discussed in Section 5\. We evaluate primarily on Gemma-2-2B (Gemma Team, 2024) and replicate on Llama-3.1-8B (Dubey et al., 2024), Mistral-7B-v0.3, Qwen2.5-7B, and two non-transformers (Mamba-1.4B, RWKV-1.6B)\. An entity-steering demo validates lap end-to-end: steering London-answer prompts toward "Paris" at the lap-recommended layer redirects completions on both Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer has no effect on either model\.
Our contributions: (1) a connection between logit lens measurements and steering vector effectiveness, validated at two levels (layer selection and steerability prediction) across 24 controlled families and five models; (2) a three-regime framework that explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work; (3) a controlled experimental design using 25 binary concept families that isolates representation geometry from task-structure confounds\.
## 2 Related Work
#### The logit lens and probing\.
nostalgebraist (2020) introduced the logit lens; Belrose et al. (2023) proposed the tuned lens to address the layer norm mismatch; Yom Din et al. (2023) examined how predictions change at certain layers\. We show that the standard logit lens, despite the mismatch, is a strong predictor of steering effectiveness\. By using the model's own unembedding (a fixed, untrained projection), we avoid the selectivity concerns that apply to trained probes (Belinkov, 2022)\. The trade-off is that we measure alignment with one specific linear projection, not general linear decodability\.
#### Linear representations\.
Park et al. (2024) formalize the linear representation hypothesis and connect it to probing and steering\. Nanda et al. (2023) observe linear representations in Othello-playing models\. The hypothesis has been challenged: Csordás et al. (2024) show nonlinear encodings in small models, and Engels et al. (2024) demonstrate multi-dimensional feature manifolds\. We do not assume the hypothesis holds universally; lap measures where and to what degree it holds\.
#### Steering and intervention\.
Zou et al. (2023) introduce representation engineering\. Turner et al. (2023) formalize activation addition\. Arditi et al. (2024) identify a single direction mediating refusal\. Each method demonstrates success on its target concept but does not predict when steering will succeed on a new concept or layer\.
#### Sparse autoencoders and transcoders\.
Templeton et al. (2024) scale SAEs to large models; Lieberum et al. (2024) release GemmaScope transcoders; Ameisen et al. (2025) introduce attribution graphs\. Our three-regime framework predicts that SAE features should be most useful in regime 2 (concept present but not output-aligned), where difference-of-means fails\.
## 3 Method
### 3.1 Setup
Consider a causal language model with L transformer blocks\. Each block reads from and writes to a shared *residual stream*: h_ℓ = h_{ℓ-1} + block_ℓ(h_{ℓ-1}), where h_0 is the token embedding\. After the final block, an output head produces logits: logits = W_U · LayerNorm(h_L), where W_U ∈ ℝ^{V×d} is the unembedding matrix\. Because the residual stream lives in ℝ^d at every layer, this output head can be applied to any intermediate h_ℓ, which is the basis of the logit lens\.
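The recurrence and the reuse of the output head can be sketched as follows. This is a toy illustration under stated assumptions: the blocks are random `tanh` maps rather than transformer blocks, the layer norm has no learned scale or bias, and all names (`run_residual_stream`, `logit_lens`) are hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Layer norm without learned scale/bias (a simplification)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def run_residual_stream(h0, blocks):
    """Return [h_0, ..., h_L] with h_l = h_{l-1} + block_l(h_{l-1})."""
    states = [h0]
    for block in blocks:
        prev = states[-1]
        states.append(prev + block(prev))
    return states

def logit_lens(h, W_U):
    """Apply the output head to any layer's state: W_U . LayerNorm(h)."""
    return W_U @ layer_norm(h)          # (V, d) @ (d,) -> (V,)

# Toy stand-ins for a model (assumed sizes, random weights).
rng = np.random.default_rng(0)
d, V, L = 16, 50, 4
weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(L)]
blocks = [lambda h, W=W: np.tanh(W @ h) for W in weights]
W_U = rng.normal(size=(V, d))
h0 = rng.normal(size=d)

states = run_residual_stream(h0, blocks)
per_layer_logits = [logit_lens(h, W_U) for h in states]
top_tokens = [int(logits.argmax()) for logits in per_layer_logits]
```

Because every `h_l` has the same dimension `d`, the same `W_U` can be applied at every depth, which is what makes a per-layer readout possible without training.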
For a concept family 𝒞 = {(x_i, t_i)}_{i=1}^N, where each prompt x_i has a correct next-token answer t_i, we measure how linearly accessible the concept is at each layer\.
### 3.2 Linear accuracy (logit lens)
We apply the model's unembedding to intermediate hidden states:
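$$A_{\mathrm{lin}}(\ell) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } \arg\max_{v} \big(W_U \cdot \mathrm{LayerNorm}(h_\ell^{(i)})\big)_v = t_i \\ 0 & \text{otherwise} \end{cases}$$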
This is the logit lens evaluated as classification accuracy over the concept family\. No training is required\. We apply the *final* layer norm to intermediate states, inheriting the layer norm mismatch discussed by Belrose et al. (2023)\. We evaluate the effect of this mismatch in Section 5\.
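A minimal sketch of this accuracy computation over all layers at once, assuming the hidden states have already been collected into an array of shape (layers, prompts, d). The function name `a_lin_profile`, the scale-free layer norm, and the tiny hand-built example are illustrative assumptions.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Layer norm without learned scale/bias (a simplification)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def a_lin_profile(hidden, W_U, targets):
    """Logit-lens accuracy per layer.

    hidden:  (num_layers, N, d) hidden states at the answer position
    W_U:     (V, d) unembedding matrix
    targets: (N,) correct next-token ids
    """
    logits = layer_norm(hidden) @ W_U.T        # (num_layers, N, V)
    predictions = logits.argmax(axis=-1)       # (num_layers, N)
    return (predictions == targets).mean(axis=1)

# Hand-built toy case: vocabulary of 4 tokens, identity unembedding.
W_U = np.eye(4)
targets = np.arange(4)
correct = 5.0 * np.eye(4)                # each state points at its target token
wrong = np.roll(correct, 1, axis=1)      # each state points at a different token
hidden = np.stack([wrong, correct])      # layer 0 is wrong, layer 1 is right

profile = a_lin_profile(hidden, W_U, targets)
best_layer = int(profile.argmax())
```

Here `profile` is 0 at the first layer and 1 at the second, so `best_layer` is 1; the peak of this curve is the quantity referred to as peak A_lin.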
### 3.3 Probe gap
The logit lens measures what is linearly accessible through the model's output projection, but concept information may be present in a form that requires nonlinear transformation before it aligns with the unembedding\. The *probe gap* Δ(ℓ) = A_mlp(ℓ) − A_lin(ℓ) quantifies how much concept information is present at layer ℓ but not output-aligned\.
We train a residual MLP to compute A_mlp:
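$$\hat{h}_\ell = h_\ell + f_\theta(h_\ell), \qquad A_{\mathrm{mlp}}(\ell) = \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1 & \text{if } \arg\max_{v} \big(W_U \cdot \mathrm{LayerNorm}(\hat{h}_\ell^{(i)})\big)_v = t_i \\ 0 & \text{otherwise} \end{cases}$$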
where f_θ is a two-layer MLP (d → 512 → d) with layer normalization, GELU, and dropout (p = 0.1), trained on 80% of prompts to minimize cross-entropy\. The residual connection ensures the MLP learns a correction rather than replacing the hidden state\. A large probe gap indicates nonlinear encoding at that layer; steering is unlikely to work even though the information is present\.
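The forward pass of such a probe and the resulting gap can be sketched as below. Several choices are assumptions rather than details from the text: the ordering LayerNorm, linear, GELU, linear inside f_θ, the zero initialization of the second layer, the tanh approximation of GELU, and the omission of both dropout and the training loop.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class ResidualProbe:
    """h_hat = h + f_theta(h), with f_theta: d -> hidden -> d (inference only)."""

    def __init__(self, d, hidden=512, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=d ** -0.5, size=(d, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = np.zeros((hidden, d))   # assumed zero init: starts as identity
        self.b2 = np.zeros(d)

    def __call__(self, h):
        z = gelu(layer_norm(h) @ self.W1 + self.b1)
        return h + z @ self.W2 + self.b2

def lens_accuracy(h, W_U, targets):
    logits = layer_norm(h) @ W_U.T
    return float(np.mean(logits.argmax(axis=-1) == targets))

def probe_gap(h, W_U, targets, probe):
    """Delta = A_mlp - A_lin at a single layer."""
    return lens_accuracy(probe(h), W_U, targets) - lens_accuracy(h, W_U, targets)

# Toy data (hypothetical sizes); the probe is untrained here.
rng = np.random.default_rng(1)
N, d, V = 20, 16, 32
h = rng.normal(size=(N, d))
W_U = rng.normal(size=(V, d))
targets = rng.integers(0, V, size=N)
probe = ResidualProbe(d)
gap_at_init = probe_gap(h, W_U, targets, probe)
```

With the second layer at zero the probe leaves the hidden state unchanged, so the gap is exactly zero before training; any positive gap after training is then attributable to the learned correction.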
### 3.4 Perturbation sensitivity
We measure how much a small random perturbation at layer ℓ is amplified by subsequent computation:
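$$\lambda(\ell) = \frac{1}{K} \sum_{k=1}^{K} \frac{\lVert f(h_\ell + \alpha \varepsilon_k) - f(h_\ell) \rVert}{\alpha}, \qquad \alpha = 0.01 \cdot \lVert h_\ell \rVert$$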
where ε_k are random unit vectors, f is the forward pass from layer ℓ to output logits, and K = 10\. High λ indicates an unstable representation where steering vectors will have unpredictable effects\.
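A minimal sketch of this estimator, assuming f is available as a Python callable from a layer-ℓ hidden state to logits. Drawing unit vectors by normalizing Gaussian samples and using the Euclidean norm are assumptions; the text does not specify either.

```python
import numpy as np

def perturbation_sensitivity(f, h, K=10, rel_scale=0.01, seed=0):
    """Average amplification of K random perturbations of norm alpha."""
    rng = np.random.default_rng(seed)
    alpha = rel_scale * np.linalg.norm(h)      # alpha = 0.01 * ||h||
    base = f(h)
    total = 0.0
    for _ in range(K):
        eps = rng.normal(size=h.shape)
        eps /= np.linalg.norm(eps)             # random unit vector
        total += np.linalg.norm(f(h + alpha * eps) - base) / alpha
    return total / K

# Sanity check on a map with known gain: f(h) = 2h amplifies every
# perturbation by exactly 2, so the estimate should be 2.
h = np.random.default_rng(1).normal(size=16)
lam = perturbation_sensitivity(lambda x: 2.0 * x, h)
```

For a real model, `f` would run the remaining blocks and the output head; the linear toy map is used only because its sensitivity is known in closed form.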
### 3.5 The Linear Accessibility Profile
For a concept family 𝒞 and layer ℓ, the *Linear Accessibility Profile* (lap) is:
$$\mathrm{lap}(\ell) = \big(A_{\mathrm{lin}}(\ell),\ \Delta(\ell),\ \lambda(\ell)\big)$$
Of these, A_lin is the primary predictor\. The remaining components characterize why steering may fail: high Δ means information is present but not output-aligned; high λ means the representation is unstable (Figure 3 in Appendix K)\.
### 3.6 Concept families
We use two sets of concept families\. All correct answers are single tokens in the model vocabulary (required because the logit lens produces a distribution over individual tokens)\.
#### Core families (5)\.
Five heterogeneous families (Table 1) are used for *within-concept* analyses: measuring how A_lin, Δ, and λ vary across layers for a single concept\.
Table 1: Core concept families (used for within-concept layer analysis)\. All correct answers are single tokens\.
#### Controlled binary families (25)\.
For *steerability prediction* across concepts, task-structure confounds must be eliminated\. The core families vary in answer-class count, target sizes, and prompt formats; comparing steerability across them yields only a non-significant correlation (ρ = +0.18, p = 0.54)\. We therefore construct 25 controlled binary families (Table 14 in the appendix): each has two answer classes, balanced groups (~22 items per class), and consistent templates\. Removing these confounds reveals the underlying signal (ρ = +0.86 to +0.91, p < 10^{-3}; details in Appendix D)\.
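The ρ values above are rank correlations. As a sketch of how such a statistic could be computed from one score per concept family, the snippet below implements Spearman's ρ as the Pearson correlation of ranks. It assumes no tied values, and the input numbers are synthetic placeholders, not results from the paper.

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of the entries of x (assumes no ties)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(np.asarray(x)), ranks(np.asarray(y))
    return float(np.corrcoef(rx, ry)[0, 1])

# Synthetic placeholders: one predictor value and one outcome per family.
predictor = [0.1, 0.2, 0.3, 0.4]
outcome_monotone = [1.0, 4.0, 9.0, 16.0]   # same ordering as the predictor
outcome_swapped = [1.0, 9.0, 4.0, 16.0]    # middle two families swapped

rho_monotone = spearman_rho(predictor, outcome_monotone)
rho_swapped = spearman_rho(predictor, outcome_swapped)
```

Because only the ordering matters, the nonlinear but monotone outcome gives ρ = 1, while swapping one adjacent pair among four items lowers it to 0.8.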
## 4 Experiments
We evaluate primarily on Gemma-2-2B (26 layers, d = 2304) with replication on Llama-3.1-8B (32 layers), Mistral-7B-v0.3 (32 layers), Qwen2.5-7B (28 layers), and two non-transformer architectures: Mamba-1.4B (48 layers) and RWKV-1.6B (24 layers)\. Table 6 in the appendix specifies which model is used for each experiment\.
### 4.1 Linear accessibility across layers
Table 2 reports the main results\. All five families show zero linear accuracy at layers 0–15 and sharp emergence in layers 18–24, consistent with the logit lens literature (nostalgebraist, 2020; Yom Din et al., 2023)\. For four of the five families, accuracy peaks at layers 23–24 rather than at the final layer\.
Table 2: Linear accessibility across layers of Gemma-2-2B\. A_lin: logit lens accuracy at the best layer\. A_mlp: MLP probe accuracy\. Δ: probe gap\. Acc(a): A_lin on prompts the model answers correctly\. Acc(b): A_lin on prompts the model answers incorrectly\.
[Figure 1 plot. x-axis: Layer (0–25); y-axis: Accuracy (0–1); concept families shown: analogy, arithmetic, geography, sequence, word transform]
Figure 1: Per-layer A_mlp (solid) and A_lin (dotted) for each concept family on Gemma-2-2B\. The gap between solid and dotted lines is the probe gap Δ\. All families show A_lin = 0 at layers 0–15 and sharp emergence in layers 18–24\. The nonlinear probe detects concepts substantially earlier: sequence reaches A_mlp > 0.9 at layer 5, while A_lin remains zero until layer 18\.
The probe gap varies widely\. For arithmetic and sequence, Δ ≈ 0.22: the concept is predominantly linear at the best layer\. For geography, Δ = 0.720: the MLP achieves perfect accuracy while the logit lens reaches only 28.0%\. The MLP also detects concepts earlier: sequence is nonlinearly accessible at layer 5 (A_mlp = 0.91) but not linearly until layer 20 (A_lin = 0.60)\.
#### Crystallization gap\.
We define the gap between nonlinear detection (A_mlp > 0.5) and linear emergence (A_lin > 0.1) as the *crystallization gap*\. Both metrics measure argmax accuracy over the full vocabulary (~256K tokens), so chance is effectively zero\. The A_mlp threshold of 50% indicates the nonlinear probe recovers the correct token for a majority of prompts\. The A_lin threshold is lower at 10% because sporadic values of 1–3% can arise from token frequency biases in the unembedding; 10% requires a substantial fraction of prompts to have the correct token as the top prediction\.
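One way to turn this definition into a number is sketched below, under the assumption that the gap is counted in layers between the first layer where each curve crosses its threshold. The function names and the per-layer accuracy curves are hypothetical, chosen only to exercise the code.

```python
import numpy as np

def first_layer_above(acc, threshold):
    """Index of the first layer whose accuracy exceeds threshold, else None."""
    hits = np.flatnonzero(np.asarray(acc) > threshold)
    return int(hits[0]) if hits.size else None

def crystallization_gap(a_mlp, a_lin, mlp_thresh=0.5, lin_thresh=0.1):
    """Layers between nonlinear detection and linear emergence (assumed reading)."""
    l_mlp = first_layer_above(a_mlp, mlp_thresh)
    l_lin = first_layer_above(a_lin, lin_thresh)
    if l_mlp is None or l_lin is None:
        return None                      # one of the curves never crosses
    return l_lin - l_mlp

# Synthetic per-layer curves for a six-layer toy model.
a_mlp = [0.0, 0.2, 0.6, 0.9, 1.0, 1.0]
a_lin = [0.0, 0.0, 0.0, 0.05, 0.3, 0.8]
gap = crystallization_gap(a_mlp, a_lin)
```

In this toy case the MLP probe crosses 0.5 at layer 2 and the logit lens crosses 0.1 at layer 4, giving a gap of 2 layers; the value 0.05 at layer 3 is deliberately below the 10% threshold to show why sporadic low accuracies do not count as emergence.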