When is Your LLM Steerable?
Summary
This paper investigates when activation steering succeeds or fails for LLMs by analyzing early decoding dynamics. The authors introduce ASTEER, a large testbed of steered generations, and train a GBDT classifier to predict steering outcomes from early hidden states, enabling efficient steering strength search.
# When is Your LLM Steerable?
Source: [https://arxiv.org/html/2606.11599](https://arxiv.org/html/2606.11599)
Chenrui Fan¹, Yize Cheng¹, Ming Li¹,², Soheil Feizi¹, Tianyi Zhou²

¹University of Maryland, College Park; ²MBZUAI, UAE

{cfan42, yzcheng, minglii, sfeizi}@umd.edu, tianyi.zhou@mbzuai.ac.ae

Project: [https://github.com/Fcr09/SteerBoost](https://github.com/Fcr09/SteerBoost)
###### Abstract
Activation steering offers a lightweight approach to control language models’ behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether *steerability* can be predicted from the model’s internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model’s early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering’s effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near optimal performance with a small fraction of decoding cost.
## 1 Introduction
Inference-time activation engineering, or *steering*, offers a lightweight approach to control the behavior of large language models (LLMs) without additional finetuning \[[16](https://arxiv.org/html/2606.11599#bib.bib4), [10](https://arxiv.org/html/2606.11599#bib.bib2), [20](https://arxiv.org/html/2606.11599#bib.bib5), [15](https://arxiv.org/html/2606.11599#bib.bib14)\]. By injecting a carefully constructed direction into the model’s hidden states during inference, one can bias generation towards a target concept or behavior. Prior work has shown that such interventions can influence a range of important properties, including truthfulness \[[10](https://arxiv.org/html/2606.11599#bib.bib2)\], refusal behavior \[[2](https://arxiv.org/html/2606.11599#bib.bib7), [13](https://arxiv.org/html/2606.11599#bib.bib6), [9](https://arxiv.org/html/2606.11599#bib.bib8)\], multi-dimensional trustworthiness \[[22](https://arxiv.org/html/2606.11599#bib.bib12)\], and latent social biases \[[11](https://arxiv.org/html/2606.11599#bib.bib9)\]. These results suggest that steering is a promising technique for fast, flexible control of model behavior.
While most works focus on developing more effective steering strategies, the boundaries of steerable regimes for different LLMs in the joint space of concepts, prompts, and steering strengths remain underexplored. The same intervention can work well for one prompt or one concept but fail for another, and the appropriate steering strength often varies substantially across concepts and prompts \[[21](https://arxiv.org/html/2606.11599#bib.bib26), hedström2025steersteermechanisticerror, [28](https://arxiv.org/html/2606.11599#bib.bib25)\]. As a result, existing practice often relies on expensive grid search over steering coefficients using post-hoc, full autoregressive rollouts to identify a successful intervention. More importantly, this brittleness raises questions that are still poorly understood: *when* would a steering attempt succeed, and under *what conditions* would it fail? Moreover, is steerability a structured property that can be predicted before decoding is completed?
A parallel line of work provides a natural route for studying this question\. Recent research has shown that hidden states early in generation already contain predictive signals about later model behavior, including hallucination\[[7](https://arxiv.org/html/2606.11599#bib.bib21),[1](https://arxiv.org/html/2606.11599#bib.bib19)\], harmfulness\[[4](https://arxiv.org/html/2606.11599#bib.bib18),[23](https://arxiv.org/html/2606.11599#bib.bib16)\], and answer correctness\[[24](https://arxiv.org/html/2606.11599#bib.bib22),[25](https://arxiv.org/html/2606.11599#bib.bib23)\]\. This connection is especially compelling for steering because both the intervention and the prediction target are grounded in the same representational space: steering acts directly on hidden states, and prior work suggests that those hidden states already encode rich information about future results\. If the efficacy of an intervention depends on latent conditions in the model’s internal states, then those conditions may be detectable from the early decoding trajectory before the full response is generated\.
Figure 1: The conventional approach requires a costly full rollout and an LLM judge to decide whether a steering attempt succeeds. We propose that the outcome can be efficiently predicted from the hidden states of the first few tokens, as illustrated by the green path.

Motivated by these observations, we aim to predict the efficacy of steering from the hidden states of the initial decoding process. Specifically, given a prompt, concept, and steering configuration, can the first few decoded tokens’ states imply whether this steering attempt will succeed, without decoding the full response? To this end, we first construct a steerability dataset spanning 150 concepts, with 1.4M steered generations labeled for steering efficacy. By comparing the model’s early hidden states before and after steering across multiple layers and decoding positions, we extract features describing steering geometry, decoding dynamics, and steering condition to characterize how the steering signal propagates through the network. These features are then used to train a gradient boosting decision tree (GBDT) that predicts steering efficacy with a macro-F1 of around 0.7 on unseen concepts.
This framing is useful not only for understanding steerability as a property of the model, the prompt, and the intervention, but also for supporting downstream applications\. In particular, we show how steering prediction can be used to significantly reduce the cost of searching for effective steering strengths without exhaustive full\-rollout and evaluation\.
Main Contributions:
- We curate a steering dataset that covers steered responses of multiple LLMs under different prompts, concepts, and steering strengths. It enables fine-grained analysis of the latent dynamics of steering in LLMs.
- We develop features capturing the effects of steering on the latent dynamics, resulting in interpretable prediction of steering success and two types of failure.
- By exploiting the generalization capability of the steerability predictor, we introduce a practical approach that selects effective steering strengths without exhaustive full rollouts and evaluation.
## 2 Steering and Steerability
Suppose we have a set of prompts $\mathcal{P}$, where each prompt $p \in \mathcal{P}$ is a sequence of tokens $p = (x_1, \dots, x_T)$. During LLM inference without activation steering, for a given token step $t$, the hidden states are computed layer-by-layer for the entire sequence. Let $\mathbf{h}_{1:t}^{(i)}$ be the sequence of hidden states up to token $t$ at the $i$-th layer, we have:

$$\mathbf{h}_{1:t}^{(i)} = \text{DecoderLayer}_{i}(\mathbf{h}_{1:t}^{(i-1)}) \quad \text{for } i \in \{1, \dots, N\}. \tag{1}$$

Denote the set of target concepts as $\mathcal{C}$, the set of scalar steering strengths as $\mathcal{A}$. To steer the model towards a concept $c \in \mathcal{C}$ with strength $\alpha \in \mathcal{A}$ and steering method $S$, we apply a steering vector $\mathbf{v}_{S(c)}$ (abbreviated as $\mathbf{v}_c$) at a specific layer $L_{steer}$. The forward pass remains identical to the base LLM except at layer $L_{steer}$. Let $\tilde{\mathbf{h}}$ denote the steered hidden states, we have:

$$\tilde{\mathbf{h}}_{1:t}^{(i)} = \text{DecoderLayer}_{i}(\tilde{\mathbf{h}}_{1:t}^{(i-1)}) \quad \text{for } i \neq L_{steer} \tag{2}$$

$$\tilde{\mathbf{h}}_{1:t}^{(L_{steer})} = \text{DecoderLayer}_{L_{steer}}(\tilde{\mathbf{h}}_{1:t}^{(L_{steer}-1)}) + \alpha \mathbf{v}_c \tag{3}$$

We denote the fully generated rollout of the steered model as $\mathbf{y}_{p,c,\alpha}$:

$$\mathbf{y}_{p,c,\alpha} = \text{LLM}(p, \alpha, \mathbf{v}_c). \tag{4}$$

Similarly as defined in hedström2025steersteermechanisticerror, let $\Lambda = \{\textsc{UnderSteer}, \textsc{SuccSteer}, \textsc{OverSteer}\}$ be the discrete label space defining the outcome of a steering attempt. Specifically, a steering attempt is considered successful if the response coherently answers the prompt while incorporating the desired concept. The two failure modes include UnderSteer and OverSteer; the former represents when the response does not incorporate the concept, and the latter represents when the model fails to coherently address the prompt. A judge model evaluates whether the generation satisfies both properties based on the concept $c$ and rollout $\mathbf{y}_{p,c,\alpha}$.
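To make Equations (1) to (3) concrete, the following is a minimal sketch of a steered forward pass for a single token position. The `forward` helper, the toy `layers`, and all numeric values are assumptions made for illustration; a real model would apply actual decoder layers to the whole sequence $\mathbf{h}_{1:t}$.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def forward(h0: Vector, layers: List[Layer], steer_layer: Optional[int] = None,
            alpha: float = 0.0, v_c: Optional[Vector] = None) -> List[Vector]:
    """Return the hidden state after every layer (index 0 is the input).

    If `steer_layer` (1-indexed) is given, alpha * v_c is added to the output
    of that layer only, mirroring Equation (3); all other layers follow
    Equations (1) and (2).
    """
    h = list(h0)
    states = [h]
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if steer_layer is not None and i == steer_layer:
            h = [x + alpha * v for x, v in zip(h, v_c)]
        states.append(h)
    return states


# Toy stand-in for decoder layers: each layer doubles its input (hypothetical).
layers = [lambda h: [2.0 * x for x in h]] * 3
h0 = [1.0, -1.0]
v_c = [1.0, 0.0]

base = forward(h0, layers)
steered = forward(h0, layers, steer_layer=2, alpha=0.5, v_c=v_c)
```

In this toy setting the two passes agree up to layer 1, differ by exactly $\alpha \mathbf{v}_c$ at layer 2, and the difference is then transformed by the remaining layer.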
However, generating the full rollout and invoking the judge are computationally expensive, making it costly to explore the large space of steering configurations\. Our goal is to build a predictor that takes the hidden states from only the first few generated tokens of the steered model and predicts the steering outcome*without*computing the full rollout, as illustrated by the green path in Figure[1](https://arxiv.org/html/2606.11599#S1.F1)\.
To this end, we construct a large\-scale dataset, ASTEER \(Section[3](https://arxiv.org/html/2606.11599#S3)\), spanning diverse steering configurations\. Our analysis \(Section[3\.4](https://arxiv.org/html/2606.11599#S3.SS4)\) reveals that steering outcomes are brittle across methods, models, prompts, concepts, and strengths, underscoring the need to understand when steering works\. To facilitate this understanding, we then develop SteerBoost \(Section[4](https://arxiv.org/html/2606.11599#S4)\), which takes these early hidden states as input and efficiently predicts steering outcomes, both helping us investigate when and why steering fails or succeeds, and also enabling practical applications such as efficient steerability characterization and accelerated hyperparameter search \(Section[5](https://arxiv.org/html/2606.11599#S5)\)\.
## 3 ASTEER Dataset
To create a testbed for outcome prediction of Activation STEERing, we create ASTEER, a dataset covering 150 concepts and 50 prompts, spanning 1.42M steered generations as in Figure [3](https://arxiv.org/html/2606.11599#S3.F3).
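As a quick sanity check on the reported size, the sketch below multiplies the counts given in the text and in the Figure 2 caption (150 concepts, 50 prompts, 45 DiffMean and 18 Probe strengths, 3 LLMs). It assumes the dataset is the full Cartesian product of these factors, which the text does not state explicitly; the variable names are illustrative.

```python
# Counts taken from the text and the Figure 2 caption.
n_concepts = 150
n_prompts = 50
n_models = 3
strengths_per_method = {"DiffMean": 45, "Probe": 18}

# Assumption: every (concept, prompt, model, method, strength) combination
# is generated exactly once.
per_model = n_concepts * n_prompts * sum(strengths_per_method.values())
total = per_model * n_models

print(per_model)  # 472500
print(total)      # 1417500
```

Under that assumption the product is roughly 1.42M, consistent with the size quoted above.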
Figure 2: We construct ASTEER with 150 concepts, 50 prompts, and two steering methods (i.e., DiffMean and Probe), with 45 and 18 steering strengths, respectively. Steering is applied on 3 LLMs, whose rollouts are annotated by an LLM judge to one of the labels in Table [2](https://arxiv.org/html/2606.11599#S3.T2).

### 3.1 Steering concepts and prompts
We construct a set of 150 concepts spanning three abstraction levels, low, mid, and high, designed to systematically vary the form and granularity of the targeted behaviors. Low-level concepts capture surface-form and formatting properties, which are typically localized and directly observable in token space. Mid-level concepts represent discourse-level behaviors, while high-level concepts involve persona, topic, and global response framing, which are more abstract. Table [1](https://arxiv.org/html/2606.11599#S3.T1) shows some examples of concepts at different levels. The concept list is in Appendix [J](https://arxiv.org/html/2606.11599#A10).

Table 1: Examples for different concept levels. Our concept list spans different abstraction levels, from low-level output format restriction to high-level style and persona control.

We sample 50 prompts from the Alpaca \[[17](https://arxiv.org/html/2606.11599#bib.bib28)\] dataset for our study and keep them the same for all concepts as a controlled setting for steerability comparison. The list of prompts is shown in Appendix [K](https://arxiv.org/html/2606.11599#A11). Although AxBench \[[21](https://arxiv.org/html/2606.11599#bib.bib26)\] also has a concept list for activation steering evaluation, sampled from the Neuronpedia SAE concept list for GemmaScope, we do not adopt it because its SAE-style concepts are not suitable for our setting. Many of them are very specific, such as “names of individuals and their roles or contributions within a group or event context” and “terms related to multi-layer structures or systems”, which limits their applicability to a wider range of prompts.
### 3.2 Response annotation
Following the definition in Section [2](https://arxiv.org/html/2606.11599#S2), we use GPT-5-nano \[[14](https://arxiv.org/html/2606.11599#bib.bib32)\] to label each steered generation with one of the following labels: UnderSteer, SuccSteer, or OverSteer, as exemplified in Table [2](https://arxiv.org/html/2606.11599#S3.T2). The prompt we use to annotate the steered responses is shown in Appendix [I](https://arxiv.org/html/2606.11599#A9).
To further verify the consistency between human annotation and the LLM judge, we conduct a human evaluation. Three human annotators are assigned 600 randomly sampled steered generations (100 for each model-method pair) for evaluation. Cohen’s $\kappa$ is 0.74 between the labels and the annotations of the SOTA model (GPT-5.5 \[[12](https://arxiv.org/html/2606.11599#bib.bib33)\]), and 0.83 between the labels and human annotation, indicating substantial agreement and validating the quality of auto-annotation.
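For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's $\kappa$ between two label sequences as $(p_o - p_e) / (1 - p_e)$, where $p_o$ is observed agreement and $p_e$ is chance agreement from the marginals. The label lists are made-up toy data and not the paper's annotations; how the three annotators' labels were combined is not specified in the text.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    # Note: undefined (division by zero) if both raters use a single identical label.
    return (p_o - p_e) / (1 - p_e)


# Toy example (hypothetical labels): U = UnderSteer, S = SuccSteer, O = OverSteer.
judge = ["U", "U", "S", "S", "O", "O"]
human = ["U", "U", "S", "U", "O", "O"]
kappa = cohens_kappa(judge, human)
```

Here the raters agree on 5 of 6 items ($p_o = 5/6$) and chance agreement is $1/3$, so $\kappa = 0.75$.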
Table 2:Example of labels in ASTEER dataset\.
### 3.3 Steering methods
#### DiffMean.
The DiffMean (Difference of Means) \[[20](https://arxiv.org/html/2606.11599#bib.bib5)\] method is a commonly used lightweight activation steering technique. It derives a steering vector $\mathbf{v}$ from the difference between the average hidden states of the model processing positive output samples (which exhibit the concept) and negative output samples (which lack or oppose it). Mathematically, let $h(y)$ denote the LLM’s hidden state at a chosen layer corresponding to a text sample $y$. Given a set of $N$ positive samples $y^{+}$ and $M$ negative samples $y^{-}$, the DiffMean steering vector is defined as:

$$\mathbf{v} = \frac{1}{N} \sum_{i=1}^{N} h(y_i^{+}) - \frac{1}{M} \sum_{j=1}^{M} h(y_j^{-}). \tag{5}$$
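Equation (5) amounts to subtracting two per-dimension averages. The sketch below shows this on tiny hand-written vectors; the helper names and the 2-dimensional "hidden states" are assumptions for illustration, standing in for activations extracted at the chosen layer.

```python
def mean_vector(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diffmean(pos_states, neg_states):
    """Difference of means: mean(positive states) - mean(negative states)."""
    mu_pos = mean_vector(pos_states)
    mu_neg = mean_vector(neg_states)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Toy hidden states h(y+) and h(y-) (hypothetical values).
pos = [[1.0, 2.0], [3.0, 4.0]]
neg = [[0.0, 0.0], [2.0, 0.0]]
v = diffmean(pos, neg)  # mean(pos) = [2, 3], mean(neg) = [1, 0]
```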
#### Probe\.
Instead of calculating a simple difference of averages, this technique trains a supervised linear classifier to explicitly distinguish between the hidden states of positive samples, $h(y^{+})$, and negative samples, $h(y^{-})$. By optimizing the Binary Cross-Entropy (BCE) loss over a combined dataset of $K = N + M$ samples, the steering vector $\mathbf{v}$ is defined as the optimal weight vector $\mathbf{w}$ that best separates the concepts. Mathematically, letting $c_k \in \{0, 1\}$ represent the binary label for the $k$-th sample $y_k$, the steering vector is extracted as:

$$\mathbf{v} = \arg\min_{\mathbf{w}} \left( -\frac{1}{K} \sum_{k=1}^{K} \left[ c_k \log \sigma(\mathbf{w}^{\top} h(y_k)) + (1 - c_k) \log(1 - \sigma(\mathbf{w}^{\top} h(y_k))) \right] \right) \tag{6}$$

For both methods, we follow the setting in Wu et al. \[[21](https://arxiv.org/html/2606.11599#bib.bib26)\] to synthetically generate 50 positive samples and 50 negative samples to acquire the steering vector $\mathbf{v}$. During inference, the learned vector $\mathbf{v}$ is scaled by a strength $\alpha$ and added to the model’s hidden states at each generation step ($h(y_{new}) + \alpha \mathbf{v}$).
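The objective in Equation (6) is logistic regression without a bias term. A minimal sketch using plain gradient descent is given below; the optimizer, learning rate, step count, and toy data are all assumptions, since the text does not say how the minimization is carried out. With linearly separable data and no regularization the exact minimizer does not exist (the weights keep growing), so a fixed number of steps only yields an approximate direction.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(states, labels, lr=0.5, steps=200):
    """Minimize the mean BCE loss of sigma(w . h) by gradient descent."""
    dim, k = len(states[0]), len(states)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for h, c in zip(states, labels):
            # d(BCE)/dw for one sample is (sigma(w . h) - c) * h.
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h))) - c
            for j in range(dim):
                grad[j] += err * h[j] / k
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Toy hidden states (hypothetical): label 1 = exhibits the concept.
states = [[2.0, 0.1], [1.5, -0.2], [-2.0, 0.0], [-1.0, 0.3]]
labels = [1, 1, 0, 0]
w = train_probe(states, labels)
probs = [sigmoid(sum(wi * hi for wi, hi in zip(w, h))) for h in states]
```

The learned `w` plays the role of $\mathbf{v}$ in Equation (6); in this toy example it points mostly along the first dimension, which is the one that separates the two classes.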
### 3.4 Steerability Analysis
We steer Qwen3-1.7B \[[19](https://arxiv.org/html/2606.11599#bib.bib29)\], Gemma-2-2B-it \[[18](https://arxiv.org/html/2606.11599#bib.bib30)\], and LLaMA-3.2-3B-Instruct \[[6](https://arxiv.org/html/2606.11599#bib.bib31)\] across all 150 concepts and 50 prompts with varying steering strengths. Figure [3](https://arxiv.org/html/2606.11599#S3.F3) shows the resulting label distributions as a function of $\alpha$. Some prompt-concept pairs already elicit the target concept without any intervention (e.g., steering toward a formal, academic tone when the prompt asks to summarize an academic paper), so the judge already labels the unsteered output ($\alpha = 0$) as SuccSteer. We exclude such pairs to ensure that observed successes at $\alpha > 0$ reflect genuine effects of activation steering rather than pre-existing alignment between the prompt and the concept.
Figure 3: Distribution of steering outcomes (UnderSteer, SuccSteer, OverSteer) as a function of steering strength $\alpha$. The first row aggregates over all concepts and prompts; the second and third rows show results on individual concepts and prompts, respectively. The concepts and prompts corresponding to the ids (c=0, 43, 88; p=0, 8, 41) are listed in Appendix [J](https://arxiv.org/html/2606.11599#A10) and Appendix [K](https://arxiv.org/html/2606.11599#A11). Steering outcome is sensitive to $\alpha$, and the effective range varies substantially across concepts, prompts, and methods.

Table 3: Steering success rate by concept abstraction level.

#### Steerability under Different Steering Strengths.
Across all models and steering methods, a consistent pattern emerges as $\alpha$ increases: the proportion of UnderSteer decreases monotonically, OverSteer increases monotonically, and the SuccSteer rate first rises and then declines, forming a characteristic inverted-U curve. The aggregated view (first row of Figure [3](https://arxiv.org/html/2606.11599#S3.F3)) reveals that the success rate remains low throughout the $\alpha$ range, also shown in the overall column of Table [3](https://arxiv.org/html/2606.11599#S3.T3).
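The per-strength label distribution described in this section, including the exclusion of pairs that already succeed at $\alpha = 0$, can be sketched as follows. The record layout `(prompt, concept, alpha, label)` and the toy records are assumptions for illustration and are not taken from ASTEER.

```python
from collections import Counter, defaultdict

LABELS = ("UnderSteer", "SuccSteer", "OverSteer")


def label_distribution(records):
    """Map each alpha to its label proportions, skipping pre-aligned pairs."""
    # Pairs whose unsteered output (alpha == 0) is already judged successful.
    pre_aligned = {(p, c) for p, c, a, y in records if a == 0 and y == "SuccSteer"}
    counts = defaultdict(Counter)
    for p, c, a, y in records:
        if (p, c) not in pre_aligned:
            counts[a][y] += 1
    return {a: {lab: cnt[lab] / sum(cnt.values()) for lab in LABELS}
            for a, cnt in sorted(counts.items())}


# Toy outcomes for three (prompt, concept) pairs at alpha = 0..3 (hypothetical).
toy = {
    (0, 0): ["UnderSteer", "UnderSteer", "SuccSteer", "OverSteer"],
    (0, 1): ["SuccSteer", "SuccSteer", "SuccSteer", "OverSteer"],  # excluded
    (1, 0): ["UnderSteer", "SuccSteer", "SuccSteer", "OverSteer"],
}
records = [(p, c, a, y) for (p, c), ys in toy.items() for a, y in enumerate(ys)]
dist = label_distribution(records)
```

On this toy data the UnderSteer share falls and the OverSteer share rises with $\alpha$, while SuccSteer peaks in between, which is the inverted-U shape described above.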
#### Steerability under Different Concepts and Prompts\.
Beyond the shared trend, the way steering outcomes respond to changes in $\alpha$ varies drastically across conditions. As shown in the second and third rows of Figure [3](https://arxiv.org/html/2606.11599#S3.F3), different concepts exhibit qualitatively different sensitivity profiles: some transition sharply from UnderSteer to OverSteer within a small $\alpha$ range, while others shift gradually and admit a broader effective region. The location of the success window, its width, and the rate at which the label distribution changes with $\alpha$ all differ substantially from one concept to another. Prompts also introduce variation, but to a lesser degree.
#### Steerability of Different Models and Methods\.
The effective range of $\alpha$ also varies across steering methods and models. On Qwen3-1.7B and Gemma-2-2B-it, Probe-based steering requires roughly $20\times$ larger $\alpha$ values than DiffMean to achieve comparable effects, indicating markedly different sensitivities between the two methods. Furthermore, the $\alpha$ range required by the Probe method on LLaMA-3.2-3B-Instruct is over $20\times$ smaller than on the other two models, highlighting significant inter-model variation even within a single steering approach.
#### Steerability at Different Concept Abstraction\-Levels\.
Table [3](https://arxiv.org/html/2606.11599#S3.T3) reports steering success rates stratified by concept abstraction level. Across both models and methods, low-level concepts (e.g., emoji usage, punctuation style) consistently yield substantially lower success rates than mid- and high-level concepts. Mid-level concepts, in turn, tend to be slightly more amenable to steering than high-level ones. This hierarchy suggests that surface-level textual attributes are less effectively captured and manipulated by linear steering vectors than more abstract, semantically richer concepts.
## 4 SteerBoost: Predicting Steerability from Early Decoding States
Despite the heterogeneous steering patterns observed in Section[3\.4](https://arxiv.org/html/2606.11599#S3.SS4), we hypothesize that common features exist in the model’s internal states that can determine the outcome of steering during generation, as the influence of the steering vector propagates across layers and token positions\. If well identified and leveraged, such influence propagation patterns can also generalize to unseen concepts and prompts\. To capture this effect, instead of relying on hidden states from a single layer and token position as previous early\-prediction methods did\[[24](https://arxiv.org/html/2606.11599#bib.bib22),[25](https://arxiv.org/html/2606.11599#bib.bib23),[7](https://arxiv.org/html/2606.11599#bib.bib21)\], we build our prediction model on a grid of \(token, layer\) pairs and on the comparison between steered and unsteered hidden states \(Figure[4](https://arxiv.org/html/2606.11599#S4.F4)\)\.
Figure 4: Overview of SteerBoost. Given a prompt, we first decode $k$ tokens with the steering vector applied at layer $L_{steer}$ (left), then run a single unsteered forward pass over the same token sequence (right). For each (token, layer) position on the sampled grid, we extract features as in Table [4](https://arxiv.org/html/2606.11599#A2.T4) to capture how the steering effect propagates in the model. These features, together with steering condition features, are fed into an ensemble classifier that predicts the intervention result without requiring the full autoregressive rollout.

#### Feature Extraction.
Let $\mathbf{v}_c$ denote the steering vector applied at layer $L_{steer}$ with strength $\alpha$. In the *steered pass*, we add $\alpha\,\mathbf{v}_c$ to the residual stream at layer $L_{steer}$ and autoregressively decode the first $k$ tokens $y_1, \dots, y_k$, collecting the steered hidden states $\tilde{\mathbf{h}}_t^l$ for each decoded position $t$ and layer $l$. In the *unsteered pass*, we prefill the concatenated sequence $(x_1, \dots, x_T, y_1, \dots, y_k)$ into the model *without* the steering vector in a single forward pass to obtain the unsteered hidden states $\mathbf{h}_t^l$. Because both passes process the same token sequence, any difference between $\tilde{\mathbf{h}}_t^l$ and $\mathbf{h}_t^l$ is directly attributable to the steering intervention, providing a controlled basis for comparison.
We construct three groups of features from $\mathbf{v}_c$, $\tilde{\mathbf{h}}_t^l$, and $\mathbf{h}_t^l$ (detailed in Appendix [B](https://arxiv.org/html/2606.11599#A2)). *Steering geometry* features measure how the steered representation relates to the steering direction and its unsteered counterpart at each $(t, l)$ pair. *Decoding dynamics* features track how these geometric quantities evolve across successive tokens, capturing the temporal propagation of the intervention. *Steering condition* features characterize the steering vector itself. For the first two groups, which are computed per $(t, l)$ pair, we additionally aggregate summary statistics (mean, standard deviation, max, min) globally, per token across layers, and per layer across tokens. Rather than using a dense grid, we sample $t \in \{1, 2, 4, 6\}$ and $n \in \{0, 1, 2, 3, 5, 10, 15\}$, where the offset $n = l - L_{steer}$, balancing coverage of early and late layers against computational cost.
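The grid-based comparison can be sketched as below. The two features computed here (cosine between the steered-minus-unsteered deviation and $\mathbf{v}_c$, and the deviation norm) are illustrative stand-ins chosen for this example; the paper's actual feature definitions are in its Table 4 and Appendix B, which are not reproduced here. The data layout (dictionaries keyed by `(t, l)`) and the toy hidden states are likewise assumptions.

```python
import math
from statistics import mean, pstdev

TOKENS = (1, 2, 4, 6)                # sampled decoding positions t (from the text)
OFFSETS = (0, 1, 2, 3, 5, 10, 15)    # sampled layer offsets n = l - L_steer (from the text)


def norm(a):
    return math.sqrt(sum(x * x for x in a))


def cosine(a, b):
    na, nb = norm(a), norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def grid_features(steered, unsteered, v_c, l_steer):
    """steered / unsteered map (t, l) -> hidden state for the same token sequence."""
    feats = {}
    for t in TOKENS:
        for n in OFFSETS:
            key = (t, l_steer + n)
            deviation = [a - b for a, b in zip(steered[key], unsteered[key])]
            feats[f"dev_cos_t{t}_n{n}"] = cosine(deviation, v_c)   # hypothetical feature
            feats[f"dev_norm_t{t}_n{n}"] = norm(deviation)         # hypothetical feature
    return feats


def summarize(values):
    """Summary statistics named in the text: mean, std, max, min."""
    values = list(values)
    return {"mean": mean(values), "std": pstdev(values),
            "max": max(values), "min": min(values)}


# Toy states: the steered pass is shifted by exactly 2 * v_c everywhere (unrealistic,
# but it makes the expected feature values easy to check by hand).
v_c, l_steer = [3.0, 4.0], 5
unsteered = {(t, l_steer + n): [1.0, 1.0] for t in TOKENS for n in OFFSETS}
steered = {k: [h[0] + 2.0 * v_c[0], h[1] + 2.0 * v_c[1]] for k, h in unsteered.items()}
feats = grid_features(steered, unsteered, v_c, l_steer)
cos_stats = summarize(v for k, v in feats.items() if k.startswith("dev_cos"))
```

The grid has 4 token positions and 7 layer offsets, so each per-position feature contributes 28 values before the summary statistics are appended.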
#### Steerability Classification by GBDT\.
After extracting features from the sampled grid, we normalize each feature and train an ensemble classifier \[[5](https://arxiv.org/html/2606.11599#bib.bib27)\] to predict the outcome of steering, as detailed in Appendix [E](https://arxiv.org/html/2606.11599#A5). We choose a tree-based ensemble classifier for three reasons. First, it is a strong default learner for tabular data with heterogeneous feature types (cosine similarities, norms, ratios, and their summary statistics). Second, built-in feature-importance scores facilitate interpretability of which geometric and dynamic signals are most predictive. Third, training and inference are lightweight and computationally efficient. We hold out part of the concepts from training and further split the remaining data into train and test sets by prompt. This allows us to test the generalization ability of our predictor on both unseen prompt-concept combinations for in-distribution (ID) concepts and unseen out-of-distribution (OOD) concepts. More details are in Appendix [E](https://arxiv.org/html/2606.11599#A5).
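One way to realize the concept-held-out and prompt-based split is sketched below. Only the existence of held-out OOD concepts and a prompt-based split comes from the text; the fraction of test prompts, the choice to evaluate OOD concepts on all prompts, the seed, and the function name are assumptions, and the paper's exact protocol is in its Appendix E.

```python
import random


def split_pairs(concepts, prompts, n_ood=30, test_prompt_frac=0.2, seed=0):
    """Return (train, id_test, ood_test) lists of (prompt, concept) pairs."""
    rng = random.Random(seed)
    concepts, prompts = list(concepts), list(prompts)
    rng.shuffle(concepts)
    rng.shuffle(prompts)
    ood_concepts = set(concepts[:n_ood])                 # never seen in training
    n_test = max(1, round(test_prompt_frac * len(prompts)))
    test_prompts = set(prompts[:n_test])                 # held out for ID concepts
    train, id_test, ood_test = [], [], []
    for c in concepts:
        for p in prompts:
            if c in ood_concepts:
                ood_test.append((p, c))   # assumption: OOD concepts use all prompts
            elif p in test_prompts:
                id_test.append((p, c))
            else:
                train.append((p, c))
    return train, id_test, ood_test


train, id_test, ood_test = split_pairs(range(150), range(50))
```

With 150 concepts, 50 prompts, 30 OOD concepts, and 20% test prompts, this yields 4800 training pairs, 1200 ID test pairs, and 1500 OOD test pairs; each pair would then be expanded over the steering strengths.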
#### Classification results\.
Figure [5](https://arxiv.org/html/2606.11599#S4.F5) reports the macro-F1 scores and row-normalized confusion matrices of SteerBoost. On held-out prompt–concept pairs for ID concepts, the results on DiffMean are stronger: macro-F1 reaches around 0.8 on all of the models. The high macro-F1 indicates that the predictor reliably identifies all three outcome classes, not just the majority one. On the 30 held-out OOD concepts, DiffMean macro-F1 decreases moderately, showing that the captured patterns are transferable to unseen concepts. Probe-based steering proves harder to predict: ID macro-F1 ranges from 0.68 to 0.74, with OOD scores between 0.65 and 0.69.
The confusion matrices (right panel of Figure [5](https://arxiv.org/html/2606.11599#S4.F5)) reveal a consistent error pattern across all models and methods. OverSteer is the easiest class to identify, with recall between 87% and 93%, likely because excessive steering produces distinctively distorted activation patterns. UnderSteer is also well recognized (68%–77% recall). SuccSteer is the most challenging class. The dominant confusion direction is SuccSteer being misclassified as UnderSteer (20%–37%), which is expected given that the boundary between insufficient and just-sufficient steering effect is sometimes subtle in the activation space.
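To relate the two reported quantities, the sketch below derives per-class recall (the diagonal of a row-normalized confusion matrix) and macro-F1 from a raw count matrix. The counts are invented toy numbers chosen only to resemble the qualitative pattern described above; they are not results from the paper.

```python
def recalls(cm):
    """Per-class recall; rows are true classes, columns are predictions."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


def per_class_f1(cm):
    k = len(cm)
    scores = []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                           # true class i, predicted otherwise
        fp = sum(cm[r][i] for r in range(k)) - tp      # predicted i, true class otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of per-class F1, so minority classes count equally."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


# Toy counts (hypothetical). Class order: UnderSteer, SuccSteer, OverSteer.
cm = [
    [70, 25, 5],
    [30, 60, 10],
    [2, 8, 90],
]
rec = recalls(cm)
score = macro_f1(cm)
```

Because macro-F1 averages the classes without weighting, a classifier cannot score well by predicting only the majority outcome, which is why the text reads a high macro-F1 as evidence that all three classes are identified.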


Figure 5: Steerability prediction (classification) performance of SteerBoost. Left: macro-F1 on ID and OOD concepts. The mean and std are reported with runs of 5 random seeds. DiffMean features consistently achieve $\sim$0.80 macro-F1 on ID concepts and retain $\sim$0.72 on OOD concepts. Right: row-normalized confusion matrices aggregated over ID test and OOD splits. OverSteer is predicted most reliably ($\geq$87% recall), while SuccSteer is most often confused with UnderSteer, reflecting the inherent difficulty of distinguishing borderline steering outcomes from internal representations alone.
#### Feature Importance\.
To understand which signals drive SteerBoost’s predictions, we examine the gain-based feature importance from the ensemble classifier and aggregate it along token position $t$, layer offset $n$, and feature group in Figure 6. Importance scores are summed within each axis and row-normalized. The results of probe-based steering are reported in Appendix [H](https://arxiv.org/html/2606.11599#A8).
Across all three models, the first two decoded tokens account for over 75% of the importance mass, which supports our hypothesis that the outcome of steering can be predicted from a very short initial decoding window. In contrast, importance along the layer axis is distributed relatively evenly across both shallow and deep offsets, with no single layer dominating. This validates our choice of sampling a grid of layers rather than probing a single position as in prior early-prediction methods, and suggests that steering leaves detectable traces throughout the decoding pass.
Figure 6: Gain-based feature importance of SteerBoost on DiffMean, aggregated by token, layer, and feature group. Scores are summed within each category and row-normalized. Predictive mass concentrates on the earliest decoded tokens and on alignment-based geometry features (SA, DA), while remaining broadly distributed across layers.² See Table [4](https://arxiv.org/html/2606.11599#A2.T4) for feature abbreviations.

² *VectorNorm* (V) and *SteeringStrength* (S) appear small in Figure 6 because the sum-aggregation view of importance compares a single-feature group against groups that span the entire grid. The mean-aggregation view is in Appendix [H](https://arxiv.org/html/2606.11599#A8).

*DeviationAlignment* (DA) and *SteeringAffinity* (SA) jointly carry the bulk of the predictive mass on all models, indicating that the direction in which the residual stream shifts, rather than how far (DN) or how much of its original direction it preserves (DS), is the primary determinant of the steering outcome. The decoding-dynamics features (DRA, DSH, ADR) contribute smaller but non-negligible shares, suggesting that the classifier also draws on temporal evolution to aid prediction.
#### Cross\-method transferability\.
Figure 7: Cross-method transferability of SteerBoost on Qwen3-1.7B. Each cell reports macro-F1 when the predictor is trained on the source method and tested on the target method.

To demonstrate the transferability of SteerBoost across steering methods, we drop the Steering Condition feature group, which is closely correlated with the steering method, retrain the GBDT, and evaluate it on a different steering method for the Qwen3-1.7B model. Figure [7](https://arxiv.org/html/2606.11599#S4.F7) shows that the predictor retains non-trivial performance even when trained on one steering method and evaluated on the other, suggesting that the hidden-state propagation features capture steering signatures that generalize not only across concepts, but also partially across different steering methods.
The ablation study is in Appendix[C](https://arxiv.org/html/2606.11599#A3)and an alternative approach is in Appendix[D](https://arxiv.org/html/2606.11599#A4)\.
## 5 Application: How Strong do You need to Steer Your LLM?
Most current research relies on an expensive grid search to identify the best strength for a given concept\. We show that SteerBoost can accelerate this search at substantially lower cost\.
#### Formulation\.
For a prompt\-concept pair\(p,c\)\(p,c\)and raw steering vectorvc\\mathrm\{v\}\_\{c\}, a search algorithmffreturns an ordered setf\(p,c,vc\)⊆𝒜f\(p,c,\\mathrm\{v\}\_\{c\}\)\\subseteq\\mathcal\{A\}of candidate steering strengths\. The search succeeds if anyα∈f\(p,c,vc\)\\alpha\\in f\(p,c,\\mathrm\{v\}\_\{c\}\)yields a successful steering\. We define the average successful searching rate as:
$$R(f)=\frac{1}{|\mathcal{C}||\mathcal{P}|}\sum_{c\in\mathcal{C}}\sum_{p\in\mathcal{P}}\mathbb{I}\!\Big(\exists\,\alpha\in f(p,c,\mathrm{v}_c):\;J(c,\mathbf{y}_{p,c,\alpha})=\textsc{SuccSteer}\Big),\tag{7}$$

where $\mathbb{I}(\cdot)$ is the indicator function\. This rate is upper\-bounded by the item\-level grid search that exhaustively evaluates every $\alpha\in\mathcal{A}$ for every $(p,c,\mathrm{v}_c)$ in the test set\. A good search function should maintain high $R(f)$ while reducing search cost from model rollouts and judge\-model calls\.
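As a concrete reading of Eq\. \(7\), the following minimal sketch computes $R(f)$ from a table of judge labels\. The dictionary layout, the label strings `UnderSteer` and `OverSteer`, the toy concepts, and the function name `search_success_rate` are illustrative assumptions and are not taken from the released code\.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical storage: judge label for each (concept, prompt, alpha) rollout.
Labels = Dict[Tuple[str, str, float], str]


def search_success_rate(
    labels: Labels,
    concepts: Sequence[str],
    prompts: Sequence[str],
    search: Callable[[str, str], List[float]],
) -> float:
    """R(f): fraction of (prompt, concept) pairs for which at least one
    strength returned by `search` is judged "SuccSteer"."""
    hits = 0
    for c in concepts:
        for p in prompts:
            if any(labels.get((c, p, a)) == "SuccSteer" for a in search(p, c)):
                hits += 1
    return hits / (len(concepts) * len(prompts))


GRID = [1.0, 2.0, 4.0]
CONCEPTS = ["formal", "pirate"]
PROMPTS = ["p1", "p2"]
# Toy judge outcomes, one row per (concept, prompt), ordered as in GRID.
_rows = {
    ("formal", "p1"): ["UnderSteer", "SuccSteer", "OverSteer"],
    ("formal", "p2"): ["UnderSteer", "UnderSteer", "SuccSteer"],
    ("pirate", "p1"): ["UnderSteer", "OverSteer", "OverSteer"],
    ("pirate", "p2"): ["SuccSteer", "OverSteer", "OverSteer"],
}
TOY_LABELS: Labels = {
    (c, p, a): lab for (c, p), labs in _rows.items() for a, lab in zip(GRID, labs)
}

# Exhaustive item-level search versus a single fixed strength for every pair.
item_level = search_success_rate(TOY_LABELS, CONCEPTS, PROMPTS, lambda p, c: GRID)
fixed_alpha = search_success_rate(TOY_LABELS, CONCEPTS, PROMPTS, lambda p, c: [2.0])
print(item_level, fixed_alpha)  # 0.75 0.25
```

In this toy table no strength works for one pair, so even the exhaustive search reaches only 0\.75, which illustrates why the item\-level grid search is an upper bound rather than a guarantee of success\.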
#### Baselines\.
We compare against several grid\-search variants that span the cost–quality spectrum\.
- Concept\-level Grid Search \(Oracle\) \[CGS\]: Roll out all $\alpha\in\mathcal{A}$ on the test set and adopt a single uniform $\alpha$ per concept based on average performance\. This mirrors the common practice of tuning one strength per concept and pays the full rollout cost\.
- Item\-level Grid Search \[IGS\]: Roll out all $\alpha\in\mathcal{A}$ and keep the best $\alpha$ for each $(p,c)$ pair\. This is the upper bound of $R(f)$ at maximum cost\.
- Item\-level Grid Search with Early Stop \(Ascending/Descending\) \[IGS\-A/IGS\-D\]: Same as IGS, but stops on a sample once a valid $\alpha$ is found, searching in ascending / descending $\alpha$ order\.
- Training\-set Concept\-level Grid Search \[TCGS\]: Transfer the $\alpha$ that works best for the same concept on the training set\. This matches SteerBoost’s access to training data, but it is not applicable in the OOD setting because it cannot generalize to unseen concepts\.
#### SteerBoost\-guided search\.
We apply the SteerBoost predictor to every $\alpha\in\mathcal{A}$ to estimate $P(\textsc{SuccSteer}\mid p,c,\alpha)$, rank the candidates in descending order of this probability, roll out the top\-$K$, and stop as soon as a valid $\alpha$ is found\. Because SteerBoost relies on short early\-decoding traces rather than full rollouts, these probabilities are cheap to obtain relative to a full generation\.
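The ranking\-then\-early\-stop procedure can be sketched as below\. This is a minimal illustration under stated assumptions: `predict_success` stands in for the SteerBoost probability estimate and `rollout_and_judge` for a full generation followed by a judge call; both callables, the function name `guided_search`, and the toy numbers are hypothetical\.

```python
from typing import Callable, Optional, Sequence, Tuple


def guided_search(
    alphas: Sequence[float],
    predict_success: Callable[[float], float],
    rollout_and_judge: Callable[[float], bool],
    top_k: int,
) -> Tuple[Optional[float], int]:
    """Rank strengths by predicted success probability, roll out the top-k
    in that order, and stop at the first one the judge accepts.
    Returns (accepted strength or None, number of full rollouts used)."""
    ranked = sorted(alphas, key=predict_success, reverse=True)
    rollouts = 0
    for alpha in ranked[:top_k]:
        rollouts += 1
        if rollout_and_judge(alpha):
            return alpha, rollouts
    return None, rollouts


ALPHAS = [1.0, 2.0, 4.0, 8.0, 16.0]
# Toy predicted probabilities (assumed values, not measured ones).
PRED = {1.0: 0.05, 2.0: 0.30, 4.0: 0.55, 8.0: 0.40, 16.0: 0.10}
VALID = {8.0}  # strengths that a full rollout would judge successful

guided = guided_search(ALPHAS, PRED.get, lambda a: a in VALID, top_k=3)
# Ascending early stop (IGS-A style) expressed with the same helper.
ascending = guided_search(ALPHAS, lambda a: -a, lambda a: a in VALID, top_k=len(ALPHAS))
print(guided, ascending)  # (8.0, 2) (8.0, 4)
```

Here the predictor's top choice is wrong, yet the guided search still finds the valid strength in two full rollouts where the ascending sweep needs four; the cost of the short traces used to score every candidate is not modeled in this sketch\.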
Figure 8: Cost–success trade\-off for steering\-strength search on DiffMean steering\. SteerBoost\-guided search achieves a better trade\-off than current baselines and, at $K=20$, recovers $\sim$98% of the item\-level oracle’s success rate using only $\sim$11% of the decoded tokens of IGS \($\sim$40% of the decoded tokens of IGS\-A\)\. The same trends hold in ID and OOD, indicating that it transfers well to unseen concepts\.
#### Results\.
Figure [8](https://arxiv.org/html/2606.11599#S5.F8) plots search success rate against the average number of decoded tokens per prompt\. Decoded tokens are a fair proxy for total search cost: for full\-rollout baselines they also approximate the judge model’s input\-token cost; for SteerBoost they cover both the short early\-decoding traces used by the predictor across all candidate $\alpha$ and the full rollouts for the selected candidates\. At $K=15$–$20$, the performance of SteerBoost\-guided search approaches the item\-level upper bound at roughly half the cost of IGS\-A\. It achieves a significantly higher success rate than TCGS, which also has access to the training set\. OOD performance degrades only slightly relative to ID, indicating that the gains come from a transferable ranking of $\alpha$ candidates\.
## 6 Related Work
Activation steering modifies model behavior at inference time by injecting concept vectors into the residual stream\. Prior work has shown that such vectors can be extracted from latent representations\[[16](https://arxiv.org/html/2606.11599#bib.bib4)\], used to elicit truthful answers\[[10](https://arxiv.org/html/2606.11599#bib.bib2)\], and formalized through representation engineering and activation addition\[[29](https://arxiv.org/html/2606.11599#bib.bib3),[20](https://arxiv.org/html/2606.11599#bib.bib5)\]\. Subsequent methods improve or generalize these interventions through mean\-centering, activation scaling, and concept\-subspace modeling\[[8](https://arxiv.org/html/2606.11599#bib.bib13),[15](https://arxiv.org/html/2606.11599#bib.bib14),[27](https://arxiv.org/html/2606.11599#bib.bib15)\]\. However, activation steering remains sensitive to the chosen intervention strength; several studies report that a static coefficient can under\- or over\-steer across inputs\[[3](https://arxiv.org/html/2606.11599#bib.bib20),hedström2025steersteermechanisticerror,[28](https://arxiv.org/html/2606.11599#bib.bib25)\]\. This motivates our goal of predicting whether a steering attempt will succeed before paying the cost of full decoding\.
Our work is also related to early prediction from model internals\. Hidden states have been used to predict future answer correctness \[[24](https://arxiv.org/html/2606.11599#bib.bib22),[25](https://arxiv.org/html/2606.11599#bib.bib23)\], hallucination risk \[[7](https://arxiv.org/html/2606.11599#bib.bib21),[1](https://arxiv.org/html/2606.11599#bib.bib19)\], and unsafe generations \[[4](https://arxiv.org/html/2606.11599#bib.bib18),[23](https://arxiv.org/html/2606.11599#bib.bib16),[26](https://arxiv.org/html/2606.11599#bib.bib17)\]\. We extend this line of work from predicting output attributes such as correctness and truthfulness to predicting *steerability*: whether an activation intervention will successfully yield the desired behavioral change\. A more detailed discussion of related work is in Appendix [A](https://arxiv.org/html/2606.11599#A1)\.
## 7 Conclusion
In this work, we study steerability as a structured property that can be predicted from a model’s hidden states during the initial decoding steps\. To enable fine\-grained analysis of latent steering dynamics, we construct ASTEER, a large\-scale testbed of 1\.4M labeled steered generations spanning 150 concepts\. Using this testbed, we develop SteerBoost, a GBDT classifier built on features that characterize how steering effects propagate across layers and token positions, which predicts steering efficacy at around 0\.7 macro\-F1 on unseen concepts without requiring full autoregressive rollout\. Leveraging SteerBoost as guidance for steering\-strength search, we attain near\-optimal performance at a small fraction of the decoding cost, suggesting that early hidden\-state trajectories encode substantial information about the eventual efficacy of an intervention\.
## References
- \[1\] \(2025\) FactCheckmate: preemptively detecting and mitigating hallucinations in lms\. External Links: 2410\.02899, [Link](https://arxiv.org/abs/2410.02899)\.
- \[2\] A\. Arditi, O\. Obeso, A\. Syed, D\. Paleka, N\. Panickssery, W\. Gurnee, and N\. Nanda \(2024\) Refusal in language models is mediated by a single direction\. External Links: 2406\.11717, [Link](https://arxiv.org/abs/2406.11717)\.
- \[3\] S\. Azizi, E\. B\. Potraghloo, and M\. Pedram \(2025\) Activation steering for chain\-of\-thought compression\. External Links: 2507\.04742, [Link](https://arxiv.org/abs/2507.04742)\.
- \[4\] Y\. S\. Chan, Z\. Yong, and S\. H\. Bach \(2025\) Can we predict alignment before models finish thinking? towards monitoring misaligned reasoning models\. External Links: 2507\.12428, [Link](https://arxiv.org/abs/2507.12428)\.
- \[5\] T\. Chen and C\. Guestrin \(2016\-08\) XGBoost: a scalable tree boosting system\. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pp\. 785–794\. External Links: [Link](http://dx.doi.org/10.1145/2939672.2939785), [Document](https://dx.doi.org/10.1145/2939672.2939785)\.
- \[6\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, A\. Yang, A\. Fan, et al\. \(2024\) The llama 3 herd of models\. External Links: 2407\.21783, [Link](https://arxiv.org/abs/2407.21783)\.
- \[7\] Z\. Ji, D\. Chen, E\. Ishii, S\. Cahyawijaya, Y\. Bang, B\. Wilie, and P\. Fung \(2024\-11\) LLM internal states reveal hallucination risk faced with a query\. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, Y\. Belinkov, N\. Kim, J\. Jumelet, H\. Mohebbi, A\. Mueller, and H\. Chen \(Eds\.\), Miami, Florida, US, pp\. 88–104\. External Links: [Link](https://aclanthology.org/2024.blackboxnlp-1.6/), [Document](https://dx.doi.org/10.18653/v1/2024.blackboxnlp-1.6)\.
- \[8\] O\. Jorgensen, D\. Cope, N\. Schoots, and M\. Shanahan \(2023\) Improving activation steering in language models with mean\-centring\. External Links: 2312\.03813, [Link](https://arxiv.org/abs/2312.03813)\.
- \[9\] B\. W\. Lee, I\. Padhi, K\. N\. Ramamurthy, E\. Miehling, P\. Dognin, M\. Nagireddy, and A\. Dhurandhar \(2025\) Programming refusal with conditional activation steering\. External Links: 2409\.05907, [Link](https://arxiv.org/abs/2409.05907)\.
- \[10\] K\. Li, O\. Patel, F\. Viégas, H\. Pfister, and M\. Wattenberg \(2023\) Inference\-time intervention: eliciting truthful answers from a language model\. Advances in Neural Information Processing Systems 36, pp\. 41451–41530\.
- \[11\] D\. Lu and N\. Rimsky \(2024\) Investigating bias representations in llama 2 chat via activation steering\. External Links: 2402\.00402, [Link](https://arxiv.org/abs/2402.00402)\.
- \[12\] OpenAI \(2026\) GPT\-5\.5 system card\. External Links: [Link](https://deploymentsafety.openai.com/gpt-5-5/introduction)\.
- \[13\] N\. Panickssery, N\. Gabrieli, J\. Schulz, M\. Tong, E\. Hubinger, and A\. M\. Turner \(2024\) Steering llama 2 via contrastive activation addition\. External Links: 2312\.06681, [Link](https://arxiv.org/abs/2312.06681)\.
- \[14\] A\. Singh, A\. Fry, A\. Perelman, A\. Tart, A\. Ganesh, A\. El\-Kishky, A\. McLaughlin, A\. Low, A\. Ostrow, A\. Ananthram, A\. Nathan, A\. Luo, A\. Helyar, A\. Madry, A\. Efremov, A\. Spyra, et al\. \(2025\) OpenAI gpt\-5 system card\. External Links: 2601\.03267, [Link](https://arxiv.org/abs/2601.03267)\.
- \[15\] N\. Stoehr, K\. Du, V\. Snæbjarnarson, R\. West, R\. Cotterell, and A\. Schein \(2024\-11\) Activation scaling for steering and interpreting language models\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\), Miami, Florida, USA, pp\. 8189–8200\. External Links: [Link](https://aclanthology.org/2024.findings-emnlp.479/), [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.479)\.
- \[16\] N\. Subramani, N\. Suresh, and M\. E\. Peters \(2022\) Extracting latent steering vectors from pretrained language models\. External Links: 2205\.05124, [Link](https://arxiv.org/abs/2205.05124)\.
- \[17\] R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\) Stanford alpaca: an instruction\-following llama model\. GitHub\. Note: [https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)\.
- \[18\] G\. Team, M\. Riviere, S\. Pathak, P\. G\. Sessa, C\. Hardin, S\. Bhupatiraju, L\. Hussenot, et al\. \(2024\) Gemma 2: improving open language models at a practical size\. External Links: 2408\.00118, [Link](https://arxiv.org/abs/2408.00118)\.
- \[19\] Q\. Team \(2025\) Qwen3 technical report\. External Links: 2505\.09388, [Link](https://arxiv.org/abs/2505.09388)\.
- \[20\] A\. M\. Turner, L\. Thiergart, G\. Leech, D\. Udell, J\. J\. Vazquez, U\. Mini, and M\. MacDiarmid \(2023\) Steering language models with activation engineering\. arXiv preprint arXiv:2308\.10248\.
- \[21\] Z\. Wu, A\. Arora, A\. Geiger, Z\. Wang, J\. Huang, D\. Jurafsky, C\. D\. Manning, and C\. Potts \(2025\) AxBench: steering llms? even simple baselines outperform sparse autoencoders\. External Links: 2501\.17148, [Link](https://arxiv.org/abs/2501.17148)\.
- \[22\] Y\. Xiao, C\. Wan, Y\. Zhang, W\. Wang, B\. Lin, X\. He, X\. Shen, and J\. Ye \(2024\) Enhancing multiple dimensions of trustworthiness in llms via sparse activation control\. External Links: 2411\.02461, [Link](https://arxiv.org/abs/2411.02461)\.
- \[23\] Z\. Xuan, X\. Mao, D\. Chen, X\. Zhang, Y\. Dong, and J\. Zhou \(2025\-07\) ShieldHead: decoding\-time safeguard for large language models\. In Findings of the Association for Computational Linguistics: ACL 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), Vienna, Austria, pp\. 18129–18143\. External Links: [Link](https://aclanthology.org/2025.findings-acl.932/), [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.932), ISBN 979\-8\-89176\-256\-5\.
- \[24\] A\. Zhang, Y\. Chen, J\. Pan, C\. Zhao, A\. Panda, J\. Li, and H\. He \(2025\) Reasoning models know when they’re right: probing hidden states for self\-verification\. External Links: 2504\.05419, [Link](https://arxiv.org/abs/2504.05419)\.
- \[25\] Q\. Zhang, Y\. Fu, Y\. Wang, L\. Yan, T\. Wei, K\. Xu, M\. Huang, and H\. Qiu \(2026\) Stop before you fail: operational capability boundaries for mitigating unproductive reasoning in large reasoning models\. External Links: 2509\.24711, [Link](https://arxiv.org/abs/2509.24711)\.
- \[26\] Y\. Zhang, T\. Liu, Z\. Zhao, G\. Meng, and K\. Chen \(2025\) Bleeding pathways: vanishing discriminability in llm hidden states fuels jailbreak attacks\. External Links: 2503\.11185, [Link](https://arxiv.org/abs/2503.11185)\.
- \[27\] H\. Zhao, H\. Zhao, B\. Shen, A\. Payani, F\. Yang, and M\. Du \(2025\) Beyond single concept vector: modeling concept subspace in llms with gaussian distribution\. External Links: 2410\.00153, [Link](https://arxiv.org/abs/2410.00153)\.
- \[28\] W\. Zhao, J\. Guo, Y\. Hu, Y\. Deng, A\. Zhang, X\. Sui, X\. Han, Y\. Zhao, B\. Qin, T\. Chua, and T\. Liu \(2025\) AdaSteer: your aligned llm is inherently an adaptive jailbreak defender\. External Links: 2504\.09466, [Link](https://arxiv.org/abs/2504.09466)\.
- \[29\] A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, et al\. \(2023\) Representation engineering: a top\-down approach to ai transparency\. arXiv preprint arXiv:2310\.01405\.
## Appendix A Detailed Related Work
### A\.1 LLM steering and inference\-time intervention
Although the term “steering” is sometimes used broadly in the literature to describe instruction following or prompt engineering, in this work, steering refers exclusively to inference\-time activation engineering, where carefully calculated vectors are injected into the residual stream to alter model behavior without weight updates\. Early foundations for this approach demonstrated that latent steering vectors could be extracted from pretrained models\[[16](https://arxiv.org/html/2606.11599#bib.bib4)\]and injected during the forward pass to elicit truthful answers\[[10](https://arxiv.org/html/2606.11599#bib.bib2)\]\. This paradigm was formalized by frameworks like Representation Engineering \(RepE\)\[[29](https://arxiv.org/html/2606.11599#bib.bib3)\]and Activation Addition\[[20](https://arxiv.org/html/2606.11599#bib.bib5)\], which established that high\-level concepts can be extracted via contrastive prompts and linearly added to the residual stream to modulate topic and sentiment\. Subsequent literature has rapidly expanded on how these vectors are calculated and injected\. Recent methodological improvements include refining vector representations through mean\-centering\[[8](https://arxiv.org/html/2606.11599#bib.bib13)\], activation scaling\[[15](https://arxiv.org/html/2606.11599#bib.bib14)\], and extending beyond single concept directions to model concept subspaces via Gaussian distributions\[[27](https://arxiv.org/html/2606.11599#bib.bib15)\]\.
Despite its broad applicability, many works\[[3](https://arxiv.org/html/2606.11599#bib.bib20),hedström2025steersteermechanisticerror,[28](https://arxiv.org/html/2606.11599#bib.bib25)\]have reported that the efficacy of activation steering is notoriously fragile and highly sensitive to the selected intervention strength, and that applying a static steering coefficient across diverse inputs frequently results in suboptimal outcomes\. However, determining the optimal steering strength traditionally necessitates exhaustive and expensive grid searches over full autoregressive decoding outcomes\. Hence, an important application of our proposed predictive framework is to greatly accelerate the search for an optimal steering scale by anticipating the success of a steering intervention prior to full sequence generation\.
### A\.2 Early prediction of LLM outputs via model internals
A growing body of work shows that LLM internal representations can be leveraged at early inference stages to predict properties of their final outputs, such as correctness, truthfulness, and safety, before the final response is decoded\. In the context of problem\-solving, Zhang et al\. \[[24](https://arxiv.org/html/2606.11599#bib.bib22)\] found that models encode signals about future answers in their hidden states, enabling accurate prediction before intermediate reasoning is completed and supporting early\-exit inference\. Similarly, Zhang et al\. \[[25](https://arxiv.org/html/2606.11599#bib.bib23)\] found that hidden states corresponding to the last input token encode capability “boundary information”, allowing the solvability of the problem to be predicted before the reasoning process even begins\. On the truthfulness side, Ji et al\. \[[7](https://arxiv.org/html/2606.11599#bib.bib21)\] established that internal activations immediately after processing a query reveal model uncertainty and familiarity with the concept, serving as strong predictors of hallucination\. Building on this, Alnuhait et al\. \[[1](https://arxiv.org/html/2606.11599#bib.bib19)\] introduced FactCheckmate, which classifies hidden states prior to decoding to anticipate hallucinations and intervenes by steering representations toward factual outputs\. Similar ideas have also been applied to safety\. Chan et al\. \[[4](https://arxiv.org/html/2606.11599#bib.bib18)\] show that linear probes over Chain\-of\-Thought activations can detect unsafe responses before generation, and Xuan et al\. \[[23](https://arxiv.org/html/2606.11599#bib.bib16)\] proposed ShieldHead, a lightweight classification head on last\-layer hidden states for decoding\-time harmful\-content detection\. Zhang et al\. \[[26](https://arxiv.org/html/2606.11599#bib.bib17)\] further observe that separability between safe and harmful representations degrades over time, suggesting that early\-stage signals are particularly informative for safety monitoring\.
Motivated by these advances in predicting output properties from model internals, we extend this paradigm beyond commonly studied attributes such as correctness, truthfulness, and safety to a relatively underexplored dimension—steerability\. We show that early\-stage hidden states contain sufficient signal to predict whether an activation steering intervention will succeed, without requiring full decoding of the model’s response\.
## Appendix B Features
In Table [4](https://arxiv.org/html/2606.11599#A2.T4), we detail the features we used for SteerBoost, including their names, abbreviations, formulas, and rationale\. They are designed to be intuitive and interpretable, and to capture how the steering effect propagates along the layer and token\-position dimensions of the Transformer network\.
Table 4: Feature pool for steerability prediction\. Each feature except steering condition is computed per $(t,n)$ pair, then augmented with summary statistics \(mean, std, max, min\) globally, per token across layers, and per layer across tokens\.

| Group | Feature | Formula | Rationale |
| --- | --- | --- | --- |
| Steering Geometry | SteeringAffinity \(SA\) | $\cos(\tilde{\mathbf{h}}_t^l,\;\mathbf{v}_c)$ | How closely the steered representation aligns with the steering direction $\mathbf{v}_c$ |
| Steering Geometry | DeviationNorm \(DN\) | $\lVert\tilde{\mathbf{h}}_t^l-\mathbf{h}_t^l\rVert_2$ | How far steering has pushed the representation from its unsteered counterpart |
| Steering Geometry | DirectionalSim \(DS\) | $\cos(\tilde{\mathbf{h}}_t^l,\;\mathbf{h}_t^l)$ | To what extent the steered representation preserves the original direction |
| Steering Geometry | DeviationAlignment \(DA\) | $\cos(\tilde{\mathbf{h}}_t^l-\mathbf{h}_t^l,\;\mathbf{v}_c)$ | How much of the induced change follows the intended steering direction $\mathbf{v}_c$ |
| Decoding Dynamics | DeviationRatio \(DRA\) | $\text{DN}(t,n)\;/\;\text{DN}(1,n)$, $n,t>1$ | How the magnitude of steering\-induced deviation evolves across generated tokens |
| Decoding Dynamics | DirectionalShift \(DSH\) | $\text{DS}(t,n)-\text{DS}(1,n)$, $n,t>1$ | How directional preservation changes as generation progresses |
| Decoding Dynamics | AlignmentDrift \(ADR\) | $\text{DA}(t,n)-\text{DA}(1,n)$, $n,t>1$ | How the alignment between deviation and $\mathbf{v}_c$ shifts over successive tokens |
| Steering Condition | VectorNorm \(V\) | $\lVert\mathbf{v}_c\rVert_2$ | Intrinsic magnitude of the steering vector, which varies across concepts |
| Steering Condition | SteeringStrength \(S\) | $\alpha$ | The applied multiplier |
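To make the formulas in Table 4 concrete, the sketch below computes the four geometry features for one \(token, layer\) position and the three dynamics features relative to the first decoded token\. It is a minimal pure\-Python illustration: the function names, the two\-dimensional toy vectors, and the reuse of one unsteered state for both tokens are assumptions made for brevity, and the actual feature extraction may differ\.

```python
import math
from typing import Dict, Sequence

Vector = Sequence[float]


def _dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def _norm(a: Vector) -> float:
    return math.sqrt(_dot(a, a))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = _norm(a) * _norm(b)
    return _dot(a, b) / denom if denom > 0 else 0.0


def geometry_features(h_steered: Vector, h_base: Vector, v_c: Vector) -> Dict[str, float]:
    """SA, DN, DS, DA for a single (token, layer) position."""
    delta = [s - b for s, b in zip(h_steered, h_base)]
    return {
        "SA": cosine(h_steered, v_c),     # steered state vs. steering direction
        "DN": _norm(delta),               # size of the steering-induced deviation
        "DS": cosine(h_steered, h_base),  # preservation of the original direction
        "DA": cosine(delta, v_c),         # deviation vs. steering direction
    }


def dynamics_features(feats_t: Dict[str, float], feats_1: Dict[str, float]) -> Dict[str, float]:
    """DRA, DSH, ADR at token t relative to token 1 (assumes DN at token 1 is nonzero)."""
    return {
        "DRA": feats_t["DN"] / feats_1["DN"],
        "DSH": feats_t["DS"] - feats_1["DS"],
        "ADR": feats_t["DA"] - feats_1["DA"],
    }


# Toy 2-d example: the unsteered state lies on the x-axis, the steering vector on the y-axis.
h_base, v_c = [1.0, 0.0], [0.0, 1.0]
f1 = geometry_features([1.0, 1.0], h_base, v_c)  # token t = 1
f2 = geometry_features([1.0, 2.0], h_base, v_c)  # token t = 2, larger deviation
dyn = dynamics_features(f2, f1)
print(f1["DN"], f2["DN"], dyn["DRA"])  # 1.0 2.0 2.0
```

In this toy case the deviation is perfectly aligned with $\mathbf{v}_c$ at both tokens \(DA equals 1, so ADR is 0\), while DS drops as the deviation grows, which shows how the alignment features and the magnitude features can carry different information\.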
## Appendix C Ablation Study
We ablate the contribution of each feature group, retraining the classifier on each group in isolation while keeping the \(token, layer\) sampling grid fixed to the main\-paper configuration\. Macro\-F1 scores under the ID and OOD settings are reported in Table[5](https://arxiv.org/html/2606.11599#A3.T5)\.
ALL ranks first in 5/6 ID settings and in the top two in 11/12 settings overall, confirming that the three groups are complementary\. On OOD concepts, Geometry alone matches or surpasses ALL in 4/6 settings, which we interpret as a robustness–specialization trade\-off: Decoding Dynamics and Steering Condition features contribute concept\-specific regularities that boost ID accuracy but do not fully transfer\. We therefore keep ALL as the default in the main text since it is never far from the best in either regime, and note Geometry alone as a lightweight alternative when OOD generalization is the priority\. This is consistent with the gain\-based importance analysis in Figure [2](https://arxiv.org/html/2606.11599#footnote2), where Geometry features *DeviationAlignment* and *SteeringAffinity* carry the bulk of predictive mass\. Steering Condition features alone drop sharply from ID to OOD \(e\.g\., $69.70\to 56.21$ on Qwen3\-1\.7B\-DiffMean\), reinforcing our choice of pairing them with hidden\-state\-derived signals rather than using them in isolation\.
Table 5: Macro\-F1 \($\times 100$\) under the ID and OOD settings for each feature group, method, and model\. The best scores are highlighted in bold, the second best scores are underlined; bold/underline are computed within each setting\.
## Appendix D An Alternative Approach
A natural alternative to SteerBoost is to train a linear probe directly on a single steered hidden state $\tilde{\mathbf{h}}_t^l$, in the style of prior early\-prediction methods that operate on a fixed token and layer \[[24](https://arxiv.org/html/2606.11599#bib.bib22),[25](https://arxiv.org/html/2606.11599#bib.bib23),[7](https://arxiv.org/html/2606.11599#bib.bib21)\]\. For each $(t,l)$ on the same grid as SteerBoost \($t\in\{1,2,4,6\}$, $n\in\{0,1,2,3,5,10,15\}$\), we train a separate logistic\-regression classifier on the three\-way outcome label, using the same splits as the main text\. Macro\-F1 over all 28 grid positions is reported in Figure [9](https://arxiv.org/html/2606.11599#A4.F9)\.
Figure 9: Macro\-F1 of single\-state linear probes across token positions $t$ \(rows\) and layer offsets $n$ \(columns\), evaluated on ID and OOD concepts\. Each block corresponds to one \(model, steering\-method\) pair, where the steering method labels the prediction target\. Compare with the ALL configuration of SteerBoost in Table [5](https://arxiv.org/html/2606.11599#A3.T5)\.

With an oracle choice of $(t,l)$, the best single\-state probe is competitive with SteerBoost on ID concepts and even exceeds it on the Probe target\. We therefore present this approach as a legitimate alternative rather than a strawman\. SteerBoost nonetheless retains three concrete advantages\.

First, SteerBoost is built from a small set of named geometric and dynamic quantities \(Table [4](https://arxiv.org/html/2606.11599#A2.T4)\) whose contributions can be read off directly from Figure [2](https://arxiv.org/html/2606.11599#footnote2), while the single\-state probe is a dense linear functional over thousands of raw hidden\-state dimensions whose weights do not admit a comparable mechanistic reading\. SteerBoost’s features are therefore interpretable by construction, whereas the probe remains a black\-box classifier that offers little insight into why steering succeeds or fails\.
Second, the best $(t,l)$ cell is not stable: it shifts across models, steering methods, and ID/OOD splits, with no position uniformly dominant\. Deploying the probe therefore requires a per\-\(model, method\) sweep on a labeled validation set, i\.e\., the same supervision SteerBoost uses plus an additional model\-selection step\. SteerBoost sidesteps this by consuming the entire grid as input\.
Third, and most importantly, the gap to SteerBoost widens substantially under distribution shift\. The single\-state probe drops by roughly 9–13 macro\-F1 points from ID to OOD on every \(model, target\) combination, whereas the ALL configuration of SteerBoost wins outright in 5 of 6 OOD settings \(Table [5](https://arxiv.org/html/2606.11599#A3.T5)\)\. We attribute this to what each predictor sees: a single hidden state encodes the model’s instantaneous, concept\-entangled representation at one position, whereas SteerBoost’s features measure the *propagation* of the steering effect through differences between steered and unsteered states across multiple tokens and layers, defined relative to the steering vector $\mathbf{v}_c$ itself\. These propagation signatures depend more on *how* a vector perturbs the residual stream than on *which* concept it encodes, which is what enables them to generalize to unseen concepts\.
## Appendix E GBDT and training details
XGBoost is a gradient\-boosted ensemble of decision trees\. Let $\mathbf{z}_i$ denote the concatenated feature vector for the $i$\-th sample and $\lambda_i\in\Lambda$ its steering outcome label\. The ensemble prediction is an additive sum $\hat{g}_i^{(B)}=\sum_{b=1}^{B}g_b(\mathbf{z}_i)$, where each $g_b$ is a regression tree that maps $\mathbf{z}_i$ to a real\-valued leaf weight; successive trees are trained to correct the residual errors of the current ensemble\. At iteration $b$, XGBoost minimizes a regularized objective:

$$\mathcal{L}^{(b)}=\sum_{i=1}^{N}\ell\!\bigl(\lambda_i,\;\hat{g}_i^{(b-1)}+g_b(\mathbf{z}_i)\bigr)+\Omega(g_b),\tag{8}$$

where $\ell$ is the loss function, $\hat{g}_i^{(b-1)}$ is the prediction from the first $b-1$ trees, and $\Omega(g_b)=\gamma T+\tfrac{1}{2}\lambda\lVert\mathbf{w}\rVert^{2}$ penalizes model complexity through the number of leaves $T$ and the leaf weight vector $\mathbf{w}$ \(here $\lambda$ without a subscript is the L2 regularization coefficient, distinct from the label $\lambda_i$\)\. In practice, XGBoost approximates this objective with a second\-order Taylor expansion of $\ell$, which admits closed\-form optimal leaf weights and an efficient, greedy split\-selection procedure\.
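For reference, the second\-order approximation mentioned above can be sketched as follows, following the standard derivation in the XGBoost paper \[5\] for a scalar\-output tree\. The symbols $r_i$, $s_i$, $I_j$, and $w_j$ are introduced here for illustration only and do not appear elsewhere in this paper; $\lambda$ without a subscript denotes the regularization coefficient in $\Omega$, not the label $\lambda_i$\.

```latex
% First and second derivatives of the loss at the current ensemble prediction
r_i = \left.\frac{\partial \ell(\lambda_i, g)}{\partial g}\right|_{g=\hat{g}_i^{(b-1)}},
\qquad
s_i = \left.\frac{\partial^2 \ell(\lambda_i, g)}{\partial g^2}\right|_{g=\hat{g}_i^{(b-1)}}

% Second-order Taylor expansion of the objective around \hat{g}_i^{(b-1)}
\mathcal{L}^{(b)} \approx \sum_{i=1}^{N}\Bigl[\ell\bigl(\lambda_i,\hat{g}_i^{(b-1)}\bigr)
  + r_i\, g_b(\mathbf{z}_i) + \tfrac{1}{2}\, s_i\, g_b(\mathbf{z}_i)^2\Bigr]
  + \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

% With I_j the set of samples routed to leaf j, each leaf weight has a closed form
w_j^{*} = -\frac{\sum_{i\in I_j} r_i}{\sum_{i\in I_j} s_i + \lambda},
\qquad
\mathcal{L}^{(b)}_{*} = \mathrm{const}
  - \frac{1}{2}\sum_{j=1}^{T}\frac{\bigl(\sum_{i\in I_j} r_i\bigr)^2}{\sum_{i\in I_j} s_i + \lambda}
  + \gamma T
```

The gain of a candidate split is the reduction in this optimal objective obtained by splitting one leaf into two, which is the quantity that greedy split selection maximizes and that gain\-based feature importance accumulates per feature\.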
We randomly split our 150 concepts into 120 in\-distribution \(ID\) concepts and 30 out\-of\-distribution \(OOD\) concepts, with 10 concepts per abstraction level\. For in\-distribution concepts, we further split prompts into training, validation, and test sets in a 6:3:1 ratio\. This allows us to test the generalization ability of our predictor on both unseen prompt\-concept combinations for ID concepts and completely unseen OOD concepts\. To mitigate the strong class imbalance observed in Section [3\.4](https://arxiv.org/html/2606.11599#S3.SS4), we assign inverse\-frequency class weights during XGBoost training, so that errors on underrepresented classes receive a proportionally higher penalty, and we select hyperparameters using validation macro\-F1\.
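The effect of inverse\-frequency weighting and of macro\-F1 as the selection metric can be illustrated with a small pure\-Python sketch\. The exact weighting formula is not specified here, so the common $N/(K\,n_k)$ form is an assumption, as are the function names, the short class names, and the toy label distribution\.

```python
from collections import Counter
from typing import Dict, List, Sequence


def inverse_frequency_weights(labels: Sequence[str]) -> Dict[str, float]:
    """Per-class weight N / (K * n_k): rarer classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    f1s: List[float] = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


# Toy imbalanced label set: 6 under-steer, 2 success, 2 over-steer.
y_true = ["under"] * 6 + ["succ"] * 2 + ["over"] * 2
class_w = inverse_frequency_weights(y_true)
sample_w = [class_w[y] for y in y_true]  # one weight per training example

# A majority-class predictor has 60% accuracy but a low macro-F1.
majority = ["under"] * len(y_true)
print(macro_f1(y_true, majority))  # 0.25
```

Per\-example weights of this kind can be passed to XGBoost through the `sample_weight` argument of `XGBClassifier.fit`; whether the released code uses that route is not stated here\.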
## Appendix F Limitation
Due to limited computational resources, ASTEER, despite its 1\.4M generations, currently covers only two steering methods \(DiffMean and Probe\) and three relatively small LLMs\. This may limit the generalization of SteerBoost to broader settings\.
## Appendix G Computational Resources
We ran all experiments on an internal cluster. The GPUs we used include NVIDIA RTX A6000, NVIDIA L40s, and NVIDIA A100 Tensor Core. Creating the steerability dataset for each model-method pair takes approximately 1.5 days on 15 RTX A6000 GPUs.
## Appendix H Additional Feature Importance Results
Figure [10](https://arxiv.org/html/2606.11599#A8.F10) reports the gain-based feature importance of SteerBoost on probe-based steering. As in the DiffMean results in Figure [2](https://arxiv.org/html/2606.11599#footnote2), predictive mass concentrates on the earliest decoded tokens and on the Steering Geometry feature group. Compared with DiffMean, however, the token-wise distribution is somewhat less concentrated, especially for Llama-3.2-3B. In addition, SA is often the single most important feature, whereas the DiffMean setting places more weight on DA.

Figure 10: Gain-based feature importance of SteerBoost on Probe, aggregated by token, layer, and feature group. Scores are summed within each category and row-normalized.

The small importance scores of $V$ and $S$ in Figures [2](https://arxiv.org/html/2606.11599#footnote2) and [10](https://arxiv.org/html/2606.11599#A8.F10) should be interpreted with care. In the sum-aggregated view, the other groups contain many more features because they are sampled over the token-layer grid, so summing within each group mechanically yields larger totals than for the single-feature groups $V$ and $S$. We nevertheless keep the view because it reflects how the GBDT allocates total split gain across the full feature set. For a size-normalized comparison, Figures [11](https://arxiv.org/html/2606.11599#A8.F11) and [12](https://arxiv.org/html/2606.11599#A8.F12) report mean-aggregated importances. Under this normalization, $S$ plays a larger role for DiffMean, whereas $V$ becomes more prominent for Probe, particularly on Qwen3-1.7B.

Figure 11: Gain-based feature importance of SteerBoost on DiffMean, aggregated by token, layer, and feature group. Scores are averaged within each category and row-normalized.

Figure 12: Gain-based feature importance of SteerBoost on Probe, aggregated by token, layer, and feature group. Scores are averaged within each category and row-normalized.
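The difference between the sum-aggregated and mean-aggregated views can be seen in a small sketch. The feature names, group names, and gain values below are invented for illustration, and "row-normalized" is interpreted here (as an assumption) as rescaling the scores of one row so that they sum to 1.

```python
from collections import defaultdict


def aggregate_importance(gains, groups, how="sum"):
    """Aggregate per-feature gains by group, then normalize to sum to 1.

    gains:  dict feature name -> gain (include zero-gain features so that
            the mean is taken over the full group).
    groups: dict feature name -> group name.
    how:    "sum" (total split gain per group) or "mean" (size-normalized).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for name, gain in gains.items():
        group = groups[name]
        totals[group] += gain
        counts[group] += 1
    if how == "sum":
        scores = dict(totals)
    elif how == "mean":
        scores = {g: totals[g] / counts[g] for g in totals}
    else:
        raise ValueError(f"unknown aggregation: {how!r}")
    norm = sum(scores.values())
    return {g: s / norm for g, s in scores.items()}


# Hypothetical toy data: a four-feature group sampled over a token-layer
# grid, and a single-feature group with a larger individual gain.
toy_gains = {"geo_t0_l0": 1.0, "geo_t0_l1": 1.0, "geo_t1_l0": 1.0,
             "geo_t1_l1": 1.0, "V": 2.0}
toy_groups = {name: ("V" if name == "V" else "geometry") for name in toy_gains}

sum_view = aggregate_importance(toy_gains, toy_groups, how="sum")
mean_view = aggregate_importance(toy_gains, toy_groups, how="mean")
```

In this toy case the multi-feature group dominates the sum view while the single feature dominates the mean view, which mirrors why the two figure families can rank groups differently.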
## Appendix I Prompt Template
The prompt template we used for steering efficacy annotation is in Figure [13](https://arxiv.org/html/2606.11599#A9.F13).

Prompt Template for Steered Generation Annotation

You are an expert evaluator assessing the effectiveness of AI model steering. You will evaluate whether a model’s response successfully incorporates a target concept while maintaining coherence and relevance to the original prompt.

Task: You are given:

1. An original prompt/instruction
2. A target concept that should be steered towards
3. A model’s response after steering

Evaluation criteria: Rate the response on a scale of 0–2:

0: Under-steering (concept missing)

- The response does *not* incorporate the target concept
- The response is coherent and relevant to the prompt, but lacks the desired steering
- Example: target “response with emojis” but the response contains no emojis

1: Successful steering (concept present & coherent)

- The response successfully incorporates the target concept
- The response remains coherent, relevant, and appropriate to the original prompt
- The steering feels natural and does not break response quality
- Example: target “response with humor” and the response includes appropriate humor while answering the question

2: Over-steering (incoherent answer)

- The response does not answer the original question and becomes nonsensical, irrelevant, or inappropriate to the original prompt
- The steering is so strong that it breaks response quality
- When the response is empty, it should be rated as over-steering
- When the response is gibberish and does not answer the original question, it should be rated as over-steering rather than under-steering
- Example: target “response with emojis” but the response is full of emojis while completely ignoring the original question

Input data

Original prompt: {prompt}

Target concept: {concept}

Steered response: {response_after}

Output format

Return your evaluation in JSON format:

{"explanation": "Brief explanation of your judgment, including specific evidence from the response", "judgment": ⟨0, 1, or 2⟩}

Figure 13: The prompt template we used to label the efficacy of steered responses.
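A judge reply in the JSON format requested by the template could be parsed along the following lines. This is a minimal sketch, not the authors' annotation pipeline: the function name, the label strings, and the tolerance for text surrounding the JSON object are all assumptions made here.

```python
import json

# Human-readable names for the three ratings defined in the template.
LABEL_NAMES = {0: "under-steer", 1: "success", 2: "over-steer"}


def parse_judgment(raw_reply):
    """Extract (judgment, label name, explanation) from a judge reply.

    Tolerates extra text around the JSON object by slicing from the first
    '{' to the last '}' (a heuristic assumed here, not from the paper).
    """
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge reply")
    payload = json.loads(raw_reply[start:end + 1])
    judgment = int(payload["judgment"])
    if judgment not in LABEL_NAMES:
        raise ValueError(f"judgment out of range: {judgment}")
    return judgment, LABEL_NAMES[judgment], payload.get("explanation", "")


# Hypothetical judge reply used only to demonstrate the parser.
example = 'Evaluation: {"explanation": "No emojis appear.", "judgment": 0}'
parsed = parse_judgment(example)
```

Rejecting out-of-range judgments explicitly keeps malformed replies from silently entering the label set.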
## Appendix J Concept List
Table [J](https://arxiv.org/html/2606.11599#A10) lists the 150 concepts used in our study, grouped by abstraction level. Some concepts appear natural and easy for the model to steer toward in many cases, such as (id:43, response uses consistently formal tone), while we also retain concepts that are hard or seemingly impossible to steer toward, such as (id:9, response with exactly 3 bullet points). We do so because we treat steerability itself as a predictable pattern, and we want the predictor to capture this pattern across scenarios, including both low- and high-steerability cases.
Table 5: Steering concept pool used in this work, ordered by abstraction level (low, mid, and high).

| ID | Level | Name |
| --- | --- | --- |
| 0 | low | response with emojis |
| 1 | low | response in uppercase |
| 2 | low | response in lowercase |
| 3 | low | response in Chinese |
| 4 | low | response in Japanese |
| 5 | low | response in Korean |
| 6 | low | response in rhyming couplets |
| 7 | low | response begins with the phrase 'Let me' |
| 8 | low | response in iambic-like poetic meter |
| 9 | low | response with exactly 3 bullet points |
| 10 | low | response with exactly 5 bullet points |
| 11 | low | response as a numbered list |
| 12 | low | response in a single paragraph |
| 13 | low | response in exactly 2 sentences |
| 14 | low | response in exactly 4 sentences |
| 15 | low | response with a markdown table |
| 16 | low | response using markdown headings |
| 17 | low | response uses the word 'key' at least twice |
| 18 | low | response ends with the phrase 'Let me know if you have any questions' |
| 19 | low | response with bold emphasis |
| 20 | low | response with italic emphasis |
| 21 | low | response with no punctuation |
| 22 | low | response with many exclamation marks |
| 23 | low | response ending with a question |
| 24 | low | response contains the phrase 'in other words' at least once |
| 25 | low | response contains the phrase 'it is worth noting' |
| 26 | low | response opens with 'Of course' |
| 27 | low | response with a JSON object |
| 28 | low | response with a YAML block |
| 29 | low | response with a regex pattern example |
| 30 | low | response with at least one equation in LaTeX |
| 31 | low | response with a short title line |
| 32 | low | response using only ASCII characters |
| 33 | low | response with at least one emoji per sentence |
| 34 | low | response ends with a sentence beginning with 'In conclusion' |
| 35 | low | response addresses the user with 'you' or 'your' at least three times |
| 36 | low | response with a checklist (task list) format |
| 37 | low | response uses the word 'because' at least twice to explain reasoning |
| 38 | low | response with repeated first letters (alliteration) in a sentence |
| 39 | low | response that includes at least one hyperlink (http/https) |
| 120 | low | response contains at least one date expression |
| 121 | low | response begins with a question |
| 122 | low | response opens with a direct greeting such as 'Hi!', 'Hello!', or 'Hey!' |
| 123 | low | response contains at least one time expression |
| 124 | low | response contains at least one parenthetical remark |
| 125 | low | response contains a rhetorical device |
| 126 | low | response begins by approving the user’s question (e.g., That’s a great question!) |
| 127 | low | response uses 'also' or 'additionally' to introduce at least two separate points |
| 128 | low | response uses strong imperative verbs ('Do X', 'Avoid Y') |
| 129 | low | response ends with a closing sentence beginning with 'In short' |
| 40 | mid | response contains at least one clear joke or punchline |
| 41 | mid | response contains detectable sarcasm markers |
| 42 | mid | response follows a scientific writing style (neutral, precise, impersonal) |
| 43 | mid | response uses consistently formal tone |
| 44 | mid | response uses consistently informal tone |
| 45 | mid | response contains explicit polite markers (e.g., 'please', 'thank you') |
| 46 | mid | response avoids hedging words (might/maybe/likely) |
| 47 | mid | response contains frequent hedging words (might/maybe/likely) |
| 48 | mid | response expresses high enthusiasm (e.g., exclamations, positive framing) |
| 49 | mid | response expresses skepticism or doubt |
| 50 | mid | response contains explicit empathetic language |
| 51 | mid | response uses assertive, confident phrasing |
| 52 | mid | response includes explicit uncertainty disclaimers |
| 53 | mid | response asks at least one clarifying question |
| 54 | mid | response includes a brief self-check or sanity check |
| 55 | mid | response begins with an explicit outline or plan |
| 56 | mid | response provides step-by-step instructions |
| 57 | mid | response provides multiple alternative options |
| 58 | mid | response gives a single clear recommendation |
| 59 | mid | response contains an explicit warning or caution |
| 60 | mid | response states its assumptions explicitly |
| 61 | mid | response frames the answer around efficiency/optimization |
| 62 | mid | response frames the answer around risk/safety |
| 63 | mid | response frames the answer around monetary cost |
| 64 | mid | response frames the answer around time/latency |
| 65 | mid | response frames the answer around privacy concerns |
| 66 | mid | response explains using analogies |
| 67 | mid | response explains using counterexamples |
| 68 | mid | response begins with a formal definition |
| 69 | mid | response contains a 'Common Pitfalls' section |
| 70 | mid | response includes a short quiz-style question |
| 71 | mid | response provides a minimal working example |
| 72 | mid | response lists exactly two supporting reasons |
| 73 | mid | response explicitly contrasts two viewpoints |
| 74 | mid | response is written as a Q&A dialogue |
| 75 | mid | response contains frequent metaphors |
| 76 | mid | response avoids technical jargon (lay explanation) |
| 77 | mid | response uses technical jargon (expert explanation) |
| 78 | mid | response uses 'we' or 'our' at least twice to frame a collaborative perspective |
| 79 | mid | response includes a short 'Next steps:' section with actionable items |
| 130 | mid | response uses softening language (e.g., 'I suggest', 'it may help') |
| 131 | mid | response uses a consistently decisive tone |
| 132 | mid | response includes at least one rhetorical question |
| 133 | mid | response uses contrastive markers (e.g., 'however', 'on the other hand') |
| 134 | mid | response explicitly acknowledges the user’s goal |
| 135 | mid | response uses passive voice at least once (e.g., 'it is said', 'this can be done') |
| 136 | mid | response uses example-first then explanation |
| 137 | mid | response uses explanation-first then example |
| 138 | mid | response repeatedly references the user’s goal or intent |
| 139 | mid | response uses conditional reasoning ('if… then…') |
| 80 | high | response in the persona of a teacher (explanatory, structured, pedagogical) |
| 81 | high | response in the persona of a research scientist (technical, evidence-based) |
| 82 | high | response in the persona of a lawyer (precise, conditional, cautious) |
| 83 | high | response in the persona of a doctor (careful, qualified, safety-aware) |
| 84 | high | response in the persona of a software engineer (practical, implementation-focused) |
| 85 | high | response in the persona of a customer-support agent (polite, problem-solving) |
| 86 | high | response in the persona of a strict grader (critical, rubric-driven) |
| 87 | high | response in the persona of a motivational coach (encouraging, action-oriented) |
| 88 | high | response in the persona of a skeptical reviewer (critical, evidence-demanding) |
| 89 | high | response in the persona of a friendly peer (casual, collaborative) |
| 90 | high | response in the persona of a journalist (neutral, fact-focused) |
| 91 | high | response in the persona of a storyteller (narrative-driven) |
| 92 | high | response in the persona of a consultant (structured, actionable) |
| 93 | high | response in the persona of a debate opponent (argumentative, contrastive) |
| 94 | high | response in the persona of a tutor (guided, stepwise explanations) |
| 95 | high | response mentions machine learning |
| 96 | high | response mentions mathematics |
| 97 | high | response mentions physics |
| 98 | high | response mentions computer programming |
| 99 | high | response mentions cooking or recipes |
| 100 | high | response mentions travel planning |
| 101 | high | response mentions finance or investing |
| 102 | high | response mentions health or fitness |
| 103 | high | response mentions literature or literary analysis |
| 104 | high | response mentions history |
| 105 | high | response written as a case study with concrete scenario |
| 106 | high | response written as a textbook-style explanation |
| 107 | high | response written as a brief executive summary |
| 108 | high | response written as a FAQ-style answer |
| 109 | high | response written as a point–counterpoint argument |
| 110 | high | response explicitly weighs pros and cons in decision making |
| 111 | high | response written in a headline-style format |
| 112 | high | response written as a comprehensive report |
| 113 | high | response organized with clear sections and subheadings |
| 114 | high | response written as a brainstorming-style answer |
| 115 | high | response frames outcomes in terms of opportunities |
| 116 | high | response frames outcomes in terms of risks or downsides |
| 117 | high | response frames outcomes in a neutral, descriptive way |
| 118 | high | response grounds claims with explicit evidence or references |
| 119 | high | response emphasizes novel or creative ideas |
| 140 | high | response mentions music |
| 141 | high | response in the persona of a product designer (user-centered, UX-focused) |
| 142 | high | response in the persona of a policy analyst (trade-offs, stakeholders, impact) |
| 143 | high | response in the persona of a startup founder (vision, speed, iteration) |
| 144 | high | response in the persona of a technical writer (clarity, documentation-style) |
| 145 | high | response mentions climate change |
| 146 | high | response with mathematical reasoning |
| 147 | high | response in style of a twitter post |
| 148 | high | response framed as a risk-benefit analysis |
| 149 | high | response mentions a specific named country |
## Appendix K Prompt List
Table [K](https://arxiv.org/html/2606.11599#A11) lists the 50 prompts used in our study. We randomly sampled 50 instructions from the Alpaca [[17](https://arxiv.org/html/2606.11599#bib.bib28)] dataset and kept them fixed across concepts as a controlled setting for studying steerability.