When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

arXiv cs.LG Papers

Summary

This paper investigates when rank-1 activation steering is effective and cost-efficient, proposing geometry-guided search and the concept of granularity to explain variability, and introduces the GRACE framework for efficient LLM control.

arXiv:2605.16362v1 Announce Type: new Abstract: Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95\% of best-found utility by 39.8\% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce \emph{concept granularity}, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson $r{=}0.44$ with trials-to-95\%, $r{=}{-}0.46$ with best-found utility, both $p<0.001$). We present \textit{GRACE}, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from ``\textit{when does rank-1 fail?}'' to ``\textit{when is rank-1 cheap and stable?}'', turning activation geometry from a descriptive tool into an actionable prior for LLM control.
Original Article
View Cached Full Text

Cached at: 05/19/26, 06:42 AM

# When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search
Source: [https://arxiv.org/html/2605.16362](https://arxiv.org/html/2605.16362)
###### Abstract

Activation steering offers a lightweight way to control large language models without retraining, but its effectiveness varies sharply across concepts\. Prior work often interprets this variability as evidence that many concepts are not well captured by a single steering direction\. We argue instead that much of this variability reflects search difficulty: a useful rank\-1 intervention often exists, but finding it can be expensive\. We formalize rank\-1 steering as a budget\-constrained optimization problem over intervention layer and coefficient\. Across the concepts and model families, prompt\-boundary directional alignment predicts where effective interventions are likely to occur, enabling geometry\-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best\-found utility by 39\.8% on average across three model families\. To explain why some concepts remain expensive even under better search, we introduce*concept granularity*, a measure of directional heterogeneity across contrastive contexts\. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility\-maximizing direction rotates systematically across inputs\. Higher granularity is associated with both slower convergence and lower best\-found steering performance \(Pearsonr=0\.44r=0\.44with trials\-to\-95%,p<0\.001p<0\.001, andr=−0\.46r=\-0\.46with best\-found utility,p<0\.001p<0\.001\)\. These observations suggest a practical workflow rather than a single universal vector\-construction rule\. We therefore presentGRACE, a Granularity\- and Representation\-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, choose the appropriate remedy, and allocate optimization effort more efficiently\. Our results shift the frame of activation steering from “when does rank\-1 fail?” to “when is rank\-1 cheap and stable?”, and turn activation geometry from a descriptive tool into an actionable prior for LLM control\.

## 1Introduction

Reliable control and monitoring of Large Language Models \(LLMs\) is increasingly important for safe deployment and practical use\[zhang2022opt,achiam2023gpt,turner2023steering,rimskySteeringLlama22024\]\. A prominent recent approach is concept steering: identify a direction in the residual stream associated with a human\-defined target behavior, then intervene along that direction at inference time to amplify or suppress the behavior\[turner2023steering,rimskySteeringLlama22024,chen2025persona\]\. The same directions can also be used for activation\-based detection, providing a supplementary signal to monitoring the generated text alone\[patel2025activation\]\. Concept vectors are appealing because they are lightweight, interpretable, and require no retraining\. In practice, however, their effectiveness is highly uneven\. Some behaviors are easy to steer or detect, whereas others are brittle, sensitive to the intervention layer and coefficient, or fail to match prompting and other baselines\[wu2025axbench,braun2025understandingunreliabilitysteeringvectors,bas2026actuallysteermultibehaviorstudy\]\.

This variability is often framed as a question of representational feasibility: can a behavior be captured by a single direction in activation space? Our results suggest that this is often not the most relevant question\. The more informative question is*how expensive it is to recover a stable intervention*\. A growing body of work suggests that rank\-1 steering is often feasible\[rimskySteeringLlama22024,wu2025improvedrepresentationsteeringlanguage,stolfo2025improving,sinii\-etal\-2025\-steering\]\. However, some concepts admit a wide, forgiving optimization landscape, where many nearby layer and coefficient choices perform well\. Others require narrow, concept\-specific tuning and degrade quickly when the intervention is even slightly mistuned\. This paper argues that the central obstacle in practical rank\-1 steering is therefore not usually the existence of a useful direction, but the*optimization difficulty*of finding one\.

We show that activation geometry helps explain and reduce this optimization difficulty\. Contrastive activations for different concepts exhibit highly distinct, layer\-specific directional structure\. In particular, we find that the directional agreement of contrastive differences at the prompt boundary predicts where useful steering directions are likely to emerge in the network\. This observation calls into question a common evaluation practice in the literature: fixing one or a few layers across all concepts\[wu2025axbench,wu2025improvedrepresentationsteeringlanguage,braun2025understandingunreliabilitysteeringvectors\]\. Instead, the effective intervention region is highly concept\-dependent, and activation geometry provides a strong prior over where to search\. Building on this, we cast rank\-1 steering as a*budget\-constrained optimization problem*over intervention layer and coefficient, and show that geometry\-guided Bayesian optimization substantially reduces the search cost needed to recover high\-utility interventions \(Figure[1](https://arxiv.org/html/2605.16362#S1.F1)\)\.

![Refer to caption](https://arxiv.org/html/2605.16362v1/Images/geo_limited_search_improvement.png)Figure 1:Restricting search to the top\-kklayers ranked by prompt\-boundary alignment accelerates convergence under a fixed budget while largely preserving final best\-found utility\.A second source of variability comes from the steering vector itself\. Standard contrastive construction methods average differences from many prompts and contexts into a single direction\. This works best when those differences are globally aligned\. We therefore introduce*concept granularity*, a measure of directional heterogeneity across contrastive contexts\. Low\-granularity concepts produce broadly consistent difference vectors, leading to smoother search landscapes and more stable interventions\. High\-granularity concepts instead exhibit substantial cross\-context rotation: prompt pairs may agree locally within a question, yet the implied concept direction changes systematically across questions\. In this setting, the averaged vector becomes a poor compromise, and rank\-1 steering becomes more expensive to optimize even when a useful intervention still exists\. We show that concept granularity, estimated directly from contrastive activations before any steering search is performed, helps explain both the search budget required to approach the optimum and the best steering quality ultimately achieved\.

We further show that not all disagreement should be treated as structural\. Beyond cross\-context rotation, steering pipelines are also degraded by several*removable*sources of variance, including prompt\-pair disagreement, fragmentation between prompt and response representations, and magnitude\-driven noise in vector construction\. These effects blur the distinction between genuinely difficult concepts and avoidable estimation error\. Taken together, these results suggest a practical workflow for rank\-1 steering\. We call this workflowGRACE\(Granularity\- andRepresentation\-AwareConceptEngineering\): use activation statistics to diagnose likely source of difficulty, apply the corresponding remedy, and then allocate optimization effort to the most promising parts of the search space\. In this way, activation geometry becomes a practical prior for steering search and control\.

Our contributions can be outlined as follows:

- •We show that prompt\-boundary directional alignment predicts where effective rank\-1 steering directions emerge, and can be used as a geometric prior for efficient layer selection across models\.
- •We formulate activation steering as a budget\-constrained optimization problem over intervention layer and coefficient, and show that geometry\-guided search recovers strong interventions with substantially lower search cost than standard grid\-based searching\.
- •We introduce*concept granularity*as a measure of cross\-context directional heterogeneity, and show that it helps predict both optimization difficulty and best\-achievable steering performance across diverse concept families\.
- •We identify multiple removable sources of steering difficulty, including prompt\-pair disagreement, representation fragmentation, and magnitude\-induced construction noise, and distinguish them from the more persistent cross\-context rotation captured by granularity\.
- •We presentGRACE, a practical workflow that uses these diagnostics to select vector\-construction and search remedies for a given concept, making rank\-1 steering more reliable and efficient in hard\-to\-steer settings\.

## 2Background and Related Work

### 2\.1Concept Vectors and Activation Steering

Activation steering modifies a language model’s internal residual stream at inference time to induce or suppress a target behavior\. In the standard rank\-1 form, a single vectorvℓv\_\{\\ell\}is added at layerℓ\\ellwith steering coefficientα\\alpha, yielding an interventionα​vℓ\\alpha v\_\{\\ell\}\. This approach is attractive because it is lightweight, requires no gradient updates, and preserves a direct link between the intervention and an interpretable concept direction\[turner2023steering,rimskySteeringLlama22024\]\.

A common way to construct such vectors is*contrastive activation addition*\(CAA\), which averages activation differences between paired inputs that either express or do not express the target behavior\[rimskySteeringLlama22024\]\. While effective in controlled settings, early CAA\-style constructions often relied on hand\-designed prompts or pre\-filled multiple choice examples, limiting their transfer to open\-ended generation\.

PersonaVectors\[chen2025persona\]addressed this limitation with a more systematic extraction pipeline\. Given a human\-written concept definition, an LLM generatesPPcontrastive prompt pairs and applies each pair acrossQQquestions, yieldingN=P×QN=P\\times Qactivation differences per layer\. This protocol is especially useful for our study because it exposes two distinct sources of variation: different prompt framings for the same concept, and different input contexts for the same prompt framing\. In our experiments, we follow this protocol withP=5P=5prompt pairs andQ=100Q=100questions per concept\.

Two activation variants are common in this literature\. The*prompt\-boundary*variant records the activation at the final prompt token, capturing the model’s state before generation begins\[rimskySteeringLlama22024,braun2025understandingunreliabilitysteeringvectors\]\. The*response\-averaged*variant averages activations across generated response tokens, capturing the concept as it is expressed during decoding\[chen2025persona\]\. As we show later, these variants exhibit substantially different geometric signatures and play different roles in our analysis of steerability\.

### 2\.2Reliability of Rank\-1 Steering

A central empirical fact in recent work is that rank\-1 steering performance varies sharply across concepts\. Evaluating concept steering typically requires measuring both*concept expression*and*output quality*, often operationalized through concept score and coherence under an LLM judge\[chen2025persona,wu2025axbench,wu2025improvedrepresentationsteeringlanguage,lee2025programmingrefusalconditionalactivation,sun2025hypersteer\]\. Under this kind of evaluation, some concepts are easy to steer while others are difficult to control reliably\.

Recent work consistently reports concept\-dependent variability in rank\-1 steering performance\. AxBench\[wu2025axbench\]evaluates a large collection of concepts and finds that fixed\-vector steering can be highly inconsistent\. Braun et al\.\[braun2025understandingunreliabilitysteeringvectors\]show that directional agreement of contrastive vectors is related to steering quality, but study this relation at a fixed layer and primarily in multiple\-choice settings\. Bas and Novak\[bas2026actuallysteermultibehaviorstudy\]likewise report substantial variability across diverse behaviors, with more abstract concepts often easier to steer than highly specific ones\. In safety settings, CAST\[lee2025programmingrefusalconditionalactivation\]applies activation\-based methods to harmful behavior detection and refusal steering but still relies largely on fixed hyperparameter choices\. Across these works, a common pattern remains: evaluation typically fixes one layer, one coefficient, or a small hand\-chosen set of both\. This creates an important ambiguity\. Poor performance may reflect a genuinely weak concept vector but it may also reflect a concept whose useful operating point lies outside the shared default\. As a result, fixed\-hyperparameter evaluations can conflate representational difficulty with optimization failure\. Our work starts from this distinction and studies how difficult it is to recover a strong rank\-1 intervention under a search budget\.

### 2\.3Geometry of Concept Representations

Recent work has begun to analyze the geometry of steering vectors and concept representations more explicitly\. Braun et al\.\[braun2025understandingunreliabilitysteeringvectors\]propose directional agreement as a measure of steering\-vector quality and show that it correlates with downstream performance\. Li et al\.\[li2026steeringvectorfieldscontextaware\]study how concept directions vary across token positions and layers, and Im and Li\[im2026unifiedunderstandingevaluationsteering\]examine the subspace structure of activation differences to understand when linear interventions succeed or fail\. More broadly, work on sparse autoencoders and related representation\-learning tools has deepened our understanding of how models organize internal features\[bricken2023monosemanticity,templeton2024scaling\]\. These studies provide useful descriptive characterizations of representation structure, but they leave open a practical question: how can geometry be used to improve rank\-1 steering under realistic tuning constraints? In particular, existing work does not directly connect geometric statistics of contrastive activations to the search cost of finding an effective layer and coefficient\. Our work focuses on this missing link\.

### 2\.4Beyond Standard DiffMeans

A natural way to improve steering is to modify the vector construction itself\. Prompt\-selection and clustering methods attempt to remove inconsistent prompt pairs before averaging, while normalization\-based constructions reduce the influence of high\-magnitude outliers\. More expressive methods learn steering vectors or even context\-dependent steering functions through gradient\-based optimization, as in BiPO, HyperSteer, and RePS\[cao2025personalized,sun2025hypersteer,wu2025improvedrepresentationsteeringlanguage\]\. These methods can achieve strong control, but at the cost of higher training complexity, reduced interpretability, and a larger hyperparameter surface\.

Rather than replacing simple rank\-1 steering with a more complex controller, we ask when simple rank\-1 steering is already sufficient and what makes it expensive in practice\. This motivatesGRACE, which uses activation geometry to diagnose likely sources of steering difficulty, guide vector construction, and reduce search cost while remaining fully interpretable and training\-free\.

## 3Where Should We Steer?

A practical difficulty in activation steering is that the intervention layer is rarely known in advance\. Prior work often fixes one or a small number of layers across all concepts\[wu2025axbench,wu2025improvedrepresentationsteeringlanguage,braun2025understandingunreliabilitysteeringvectors\], but our results show that the effective steering region is highly concept\-dependent\. As a result, fixed\-layer evaluation can make rank\-1 steering appear unreliable even when a strong intervention exists\. In this section, we ask a simpler and more actionable question:*where in the network is a concept most likely to yield a useful steering direction?*We show that a simple geometric statistic computed from contrastive activations at the prompt boundary provides a strong prior for this layer selection problem\.

### 3\.1Setup and Notation

We follow the PersonaVectors\[chen2025persona\]extraction pipeline\. Given a human\-written concept definition, an LLM \(GPT\-5\) generatesPPcontrastive prompt pairs,QQextraction questions, and a separate set ofQQheld\-out evaluation questions\. For prompt pairp∈\{1,…,P\}p\\in\\\{1,\\dots,P\\\}and extraction questionq∈\{1,…,Q\}q\\in\\\{1,\\dots,Q\\\}, we greedily decode a positive completionxp,q\+x\_\{p,q\}^\{\+\}and a negative completionxp,q−x\_\{p,q\}^\{\-\}\. The corresponding contrastive difference vector at layerℓ\\ellis

𝐯ℓ\(p,q\)=𝐡ℓ​\(xp,q\+\)−𝐡ℓ​\(xp,q−\),\\mathbf\{v\}\_\{\\ell\}^\{\(p,q\)\}\\;=\\;\\mathbf\{h\}\_\{\\ell\}\\\!\\bigl\(x\_\{p,q\}^\{\+\}\\bigr\)\\;\-\\;\\mathbf\{h\}\_\{\\ell\}\\\!\\bigl\(x\_\{p,q\}^\{\-\}\\bigr\),\(1\)where𝐡ℓ​\(⋅\)\\mathbf\{h\}\_\{\\ell\}\(\\cdot\)denotes the residual\-stream activation \(Appendix[B](https://arxiv.org/html/2605.16362#A2)\)\.

We consider two activation variants\. The*prompt\-boundary*variant records the residual stream at the final prompt token, capturing the model state before generation begins\. The*response\-averaged*variant averages the residual stream over generated response tokens, capturing how the concept is expressed during decoding\. In our experiments, we useP=5P=5prompt pairs andQ=100Q=100questions, resulting inN=P​Q=500N=PQ=500difference vectors per concept at each layer\.

For steering, we follow PersonaVectors and use the mean of the response\-averaged differences,

𝐯ℓ=1N​∑p,q𝐯ℓ\(p,q\),\\mathbf\{v\}\_\{\\ell\}\\;=\\;\\frac\{1\}\{N\}\\sum\_\{p,q\}\\mathbf\{v\}\_\{\\ell\}^\{\(p,q\)\},\(2\)since this variant leads to better downstream steering performance\. Prompt\-boundary activations serve a different role: they provide a cleaner diagnostic of where useful steering directions emerge\.

#### Prompt\-boundary alignment\.

For conceptccat layerℓ\\ell, we define*prompt\-boundary alignment*as the average pairwise cosine similarity of the unit\-normalized prompt\-boundary difference vectors:

𝒜c​\(ℓ\)=2N​\(N−1\)​∑\(p,q\)<\(p′,q′\)𝐯^ℓ\(p,q\)⊤​𝐯^ℓ\(p′,q′\),\\mathcal\{A\}\_\{c\}\(\\ell\)\\;=\\;\\frac\{2\}\{N\(N\-1\)\}\\sum\_\{\(p,q\)<\(p^\{\\prime\},q^\{\\prime\}\)\}\\hat\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(p,q\)\\top\}\\hat\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(p^\{\\prime\},q^\{\\prime\}\)\},\(3\)where𝐯^ℓ\(p,q\)=𝐯ℓ\(p,q\)/‖𝐯ℓ\(p,q\)‖\\hat\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(p,q\)\}=\\mathbf\{v\}\_\{\\ell\}^\{\(p,q\)\}/\\\|\\mathbf\{v\}\_\{\\ell\}^\{\(p,q\)\}\\\|\. High alignment indicates a broadly consistent direction at that layer, whereas low alignment indicates directional dispersion or multi\-modality\. Since𝒜c​\(ℓ\)\\mathcal\{A\}\_\{c\}\(\\ell\)is computed before any steering is run, it serves as a purely geometric prior over promising intervention layers\.

Unless otherwise stated, downstream steering is evaluated using the primary objective defined in Section[4\.1](https://arxiv.org/html/2605.16362#S4.SS1)\. We use concept score alone in Section[3\.2](https://arxiv.org/html/2605.16362#S3.SS2)to isolate the on\-target signal\.

### 3\.2Prompt\-Boundary Alignment Predicts Useful Steering Layers

We now ask whether prompt\-boundary alignment identifies layers where steering is likely to be most effective\. For each concept, we compute𝒜c​\(ℓ\)\\mathcal\{A\}\_\{c\}\(\\ell\)at every layer using prompt\-boundary activations, and then perform a coarse steering sweep using response\-averaged steering vectors\. Following\[chen2025persona,stolfo2025improving,obrien2025steeringlanguagemodelrefusal,sun2026personavectorsgamesmeasuring\], we search layers at a fixed interval \(every 5 layers\) and coefficients\{1\.0,2\.0,3\.0\}\\\{1\.0,2\.0,3\.0\\\}, and evaluate outputs with an LLM judge\. In this section we focus on*concept score*alone in order to isolate the on\-target steering signal\. The full steering objective is introduced in Section[4](https://arxiv.org/html/2605.16362#S4), while a detailed discussion of evaluation can be found in Appendix[B](https://arxiv.org/html/2605.16362#A2)\.

Figure[2](https://arxiv.org/html/2605.16362#S3.F2)shows that high\-alignment layers are consistently enriched for strong concept induction under an appropriate steering coefficient\. This pattern is highly concept\-dependent: different behaviors peak at different layers, and the shape of the alignment profile varies substantially across concepts and models \(Appendix[C](https://arxiv.org/html/2605.16362#A3)\)\. Nevertheless, the qualitative relation is robust: layers with stronger prompt\-boundary directional agreement are much more likely to allow effective steering interventions\.

![Refer to caption](https://arxiv.org/html/2605.16362v1/Images/gemma3_hum_concept_score_alignment.png)

![Refer to caption](https://arxiv.org/html/2605.16362v1/Images/gemma3_cs_align_coef3.0_scatter.png)

Figure 2:Prompt\-boundary alignment tracks where effective steering interventions emerge\. Left: example concept showing layerwise alignment and downstream concept induction\. Right: pooled results across concepts\. High\-alignment layers are consistently enriched for strong steering performance at an appropriate coefficient\.This result has two immediate implications\. First, fixed\-layer evaluations can be misleading because they ignore substantial concept\-specific variation in the location of effective steering interventions\. Second, alignment is already a useful diagnostic: before any tuning is run, it narrows the set of layers worth exploring\. Alignment alone does not solve the steering problem, since a successful intervention depends jointly on layer and coefficient, but it provides exactly the kind of prior needed to make the downstream optimization problem cheaper and more reliable\.

Taken together, these findings motivate a shift away from treating steering quality as a fixed\-hyperparameter property, and toward viewing it as a budget\-constrained search problem whose landscape can be characterized in advance by activation geometry\.

## 4Steering as a Search Problem

Section[3\.2](https://arxiv.org/html/2605.16362#S3.SS2)showed that effective steering layers are highly concept\-dependent, and that prompt\-boundary alignment provides a practical prior over where they are likely to occur\. This already suggests that fixed\-layer evaluation can be misleading: poor performance may reflect a bad choice of layer and coefficient rather than the absence of a good rank\-1 intervention\. In this section, we make that distinction explicit\. We formalize rank\-1 steering as a budget\-constrained search problem, show that even generic search is nontrivial, and then demonstrate that alignment makes this search substantially cheaper\.

### 4\.1The Search Landscape

The key observation from Section[3\.2](https://arxiv.org/html/2605.16362#S3.SS2)is that layer choice is not universal: different concepts peak at different depths, and the effective intervention region interacts strongly with the steering coefficient\. As a result, practical rank\-1 steering requires search over both layer and coefficient\.

#### Primary steering objective\.

For each conceptcc, we define a scalar steering utility

Uc:ℒ×𝒜→\[0,100\]U\_\{c\}:\\mathcal\{L\}\\times\\mathcal\{A\}\\to\[0,100\]\(4\)that assigns a value to an intervention at layerℓ∈ℒ=\{1,…,L\}\\ell\\in\\mathcal\{L\}=\\\{1,\\dots,L\\\}and coefficientα∈𝒜⊂ℝ\>0\\alpha\\in\\mathcal\{A\}\\subset\\mathbb\{R\}\_\{\>0\}\. Unless otherwise stated, all references to steering performance, best\-found utility, and convergence refer to this objective\. The optimization target is therefore

\(ℓc∗,αc∗\)=arg⁡maxℓ∈ℒ,α∈𝒜⁡Uc​\(ℓ,α\)\.\(\\ell\_\{c\}^\{\*\},\\alpha\_\{c\}^\{\*\}\)=\\arg\\max\_\{\\ell\\in\\mathcal\{L\},\\;\\alpha\\in\\mathcal\{A\}\}U\_\{c\}\(\\ell,\\alpha\)\.\(5\)The search budgetBBis the total number of evaluated pairs\(ℓ,α\)\(\\ell,\\alpha\)\.

#### Primary difficulty metric\.

We measure optimization difficulty by the number of trials required to reach 95% of the best\-found utility within a given run, denotedT95T\_\{95\}\. SmallerT95T\_\{95\}indicates that strong interventions are easier to recover under a fixed evaluation budget\.

![Refer to caption](https://arxiv.org/html/2605.16362v1/Images/landscape_comparison_gemma3.png)Figure 3:Rank\-1 steering is a search problem over layer and coefficient\. Some concepts admit broad, forgiving optima, while others require narrow, concept\-specific tuning\.
#### Steering landscape topology\.

The response surfaceUc​\(ℓ,α\)U\_\{c\}\(\\ell,\\alpha\)varies qualitatively across concepts\. Some concepts exhibit broad, smooth optima that are robust to perturbations in either variable\. For these, even coarse search quickly recovers a strong intervention\. Others produce sharp, irregular landscapes with narrow peaks, where small changes in layer or coefficient cause steep utility drops\. Figure[3](https://arxiv.org/html/2605.16362#S4.F3)illustrates this contrast\.

These landscapes make exhaustive evaluation impractical even in the two\-parameter setting considered here, since each trial requires repeated model generation and LLM\-judge scoring\. We therefore use Tree\-structured Parzen Estimation \(TPE\), implemented throughOptuna, as our default optimizer\. Unless otherwise stated, we use a fixed budget of 50 trials over all layers and nine coefficients in\[0\.5,4\.5\]\[0\.5,4\.5\], repeated across three random seeds \( Apendix[E](https://arxiv.org/html/2605.16362#A5)\)\. Compared with fixed\-interval grid search, TPE finds high\-utility interventions more reliably under the same evaluation budget \(Appendix[H](https://arxiv.org/html/2605.16362#A8)\)\. However, even principled search still spends a substantial fraction of its budget probing layers with little chance of success\. This motivates our main practical takeaway: using activation geometry to restrict the search space before optimization begins\.

### 4\.2Geometry\-Guided Search

Section[3\.2](https://arxiv.org/html/2605.16362#S3.SS2)showed that prompt\-boundary alignment𝒜c​\(ℓ\)\\mathcal\{A\}\_\{c\}\(\\ell\)provides a strong prior over where effective steering directions emerge\. We now use that prior directly\. For each concept, we rank layers by𝒜c​\(ℓ\)\\mathcal\{A\}\_\{c\}\(\\ell\)and restrict TPE to the top 15 layers, leaving coefficient optimization unchanged\. This yields a smaller search space under the same trial budget and provides a direct test of whether geometry improves practical rank\-1 steering\.

Across the three models studied \(Gemma2\-2B\-it, Gemma3\-27B\-it, and Llama3\.3\-70B\-Instruct\), geometry\-constrained search reducesT95T\_\{95\}from13\.7±10\.713\.7\\pm 10\.7to8\.2±8\.08\.2\\pm 8\.0, a 39\.8% reduction, while changing final best\-found utility by only0\.160\.16on average and improving it in 58% of runs\. The gain is largest on the largest search space in our study, Llama3\.3\-70B\-Instruct, where the same restriction reducesT95T\_\{95\}by 42\.7% and improves final best\-found utility by 0\.8 on average across the 20 concepts \(Appendix[E](https://arxiv.org/html/2605.16362#A5)\)\. Figure[1](https://arxiv.org/html/2605.16362#S1.F1)summarizes these gains\.

This is the main practical benefit of the geometric analysis in Section[3\.2](https://arxiv.org/html/2605.16362#S3.SS2)\. Alignment does not determine the full search landscape, nor does it by itself determine the best achievable performance\. What it does provide is a strong prior over*where to search*, allowing rank\-1 steering to be optimized much more efficiently\. The remaining question is why some concepts are still easy to optimize under this restricted search while others remain expensive\. We address that question in the next section through the notion of concept granularity\.

## 5Why Are Some Concepts Cheap and Others Expensive?

Section[4](https://arxiv.org/html/2605.16362#S4)showed that optimization difficulty varies substantially across concepts, and that prompt\-boundary alignment alone does not fully explain this variation: concepts with similarly strong layerwise alignment can still exhibit very different best\-found steering performance and very different values ofT95T\_\{95\}\. The missing factor is*how*directional disagreement is organized\. In this section, we decompose alignment into two sources of heterogeneity and show that their relative magnitude helps explain both how well a concept can be steered and how expensive it is to find that operating point\.

### 5\.1Decomposing Alignment by Source of Heterogeneity

Each pair of unit\-normalized difference vectors\(𝐯^ℓ\(p,q\),𝐯^ℓ\(p′,q′\)\)\(\\hat\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(p,q\)\},\\hat\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(p^\{\\prime\},q^\{\\prime\}\)\}\)falls into one of two categories:

- •within\-question pairs, which share a question \(q=q′,p≠p′q=q^\{\\prime\},\\;p\\neq p^\{\\prime\}\), and
- •cross\-question pairs, which come from different questions \(q≠q′q\\neq q^\{\\prime\}, anyp,p′p,p^\{\\prime\}\)\.

This yields an exact decomposition of prompt\-boundary alignment:

𝒜c​\(ℓ\)=NWNT⏟wW​γc​\(ℓ\)\+NCNT⏟wC​λc​\(ℓ\),\\mathcal\{A\}\_\{c\}\(\\ell\)\\;=\\;\\underbrace\{\\frac\{N\_\{W\}\}\{N\_\{T\}\}\}\_\{\\displaystyle w\_\{W\}\}\\,\\gamma\_\{c\}\(\\ell\)\\;\+\\;\\underbrace\{\\frac\{N\_\{C\}\}\{N\_\{T\}\}\}\_\{\\displaystyle w\_\{C\}\}\\,\\lambda\_\{c\}\(\\ell\),\(6\)whereγc​\(ℓ\)\\gamma\_\{c\}\(\\ell\)is the average cosine similarity over within\-question pairs andλc​\(ℓ\)\\lambda\_\{c\}\(\\ell\)is the average cosine similarity over cross\-question pairs\. In our setting \(P=5P=5,Q=100Q=100\),wW≈0\.008w\_\{W\}\\approx 0\.008andwC≈0\.992w\_\{C\}\\approx 0\.992, so the aggregate alignment score is numerically dominated by the cross\-question term\. These two quantities nonetheless isolate different failure modes\.

#### Prompt heterogeneity\.

Lowγc​\(ℓ\)\\gamma\_\{c\}\(\\ell\)means that, for a fixed input question, different prompt pairs disagree about the concept direction\. This is a property of the extraction pipeline rather than the concept alone\. In principle, it can be reduced by improving prompt construction, filtering inconsistent prompts, or using heterogeneity\-aware aggregation methods\.

#### Context heterogeneity\.

Lowλc​\(ℓ\)\\lambda\_\{c\}\(\\ell\)relative toγc​\(ℓ\)\\gamma\_\{c\}\(\\ell\)means that prompt pairs agree*within*each question, but the agreed\-upon direction rotates substantially*across*questions\. This appears to be the more consequential source of difficulty for rank\-1 steering: averaging over such directions produces a vector that is a poor compromise across inputs, even if local prompt agreement is high\.

This interpretation can be made explicit\. Let𝐯¯ℓ\(q\)=1P​∑p𝐯^ℓ\(p,q\)\\bar\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(q\)\}=\\frac\{1\}\{P\}\\sum\_\{p\}\\hat\{\\mathbf\{v\}\}\_\{\\ell\}^\{\(p,q\)\}denote the mean prompt\-boundary direction for questionqq\. Thenλc​\(ℓ\)\\lambda\_\{c\}\(\\ell\)reduces to the pairwise similarity of these question\-level mean directions\. In other words,λc​\(ℓ\)\\lambda\_\{c\}\(\\ell\)is low precisely when the direction that prompts agree on for one input differs from the corresponding direction for another\. A small\-scale two\-way ANOVA in Appendix[F](https://arxiv.org/html/2605.16362#A6)further supports this interpretation: the question\-level component tracks granularity closely, while prompt\-level variance explains only part of the total disagreement\.

### 5\.2Concept Granularity

We now turn this decomposition into a single diagnostic\. We define the*concept granularity*of conceptccat layerℓ\\ellas

𝒢c​\(ℓ\)=γc​\(ℓ\)𝒜c​\(ℓ\)\.\\mathcal\{G\}\_\{c\}\(\\ell\)\\;=\\;\\frac\{\\gamma\_\{c\}\(\\ell\)\}\{\\mathcal\{A\}\_\{c\}\(\\ell\)\}\.\(7\)When𝒢c​\(ℓ\)\\mathcal\{G\}\_\{c\}\(\\ell\)is close to11, the direction implied by the contrastive differences is stable globally: local agreement within a question looks similar to agreement across questions\. As𝒢c​\(ℓ\)\\mathcal\{G\}\_\{c\}\(\\ell\)grows, the representation becomes increasingly input\-dependent: prompts may agree locally, but the agreed\-upon direction rotates across questions\. Thus, granularity captures not just*whether*contrastive directions are aligned, but*how their disagreement is organized*\.

This organization is what matters for practical steering\. Prompt heterogeneity can make a vector estimate noisy, but cross\-question rotation is what makes a single global direction expensive to optimize and, in our experiments, is often associated with lower best\-found utility\. Throughout the paper, we report a concept\-level summary by averaging𝒢c​\(ℓ\)\\mathcal\{G\}\_\{c\}\(\\ell\)across layers\.

#### Granularity and the steering ceiling\.

Figure[4](https://arxiv.org/html/2605.16362#S5.F4)plots mean granularity against best\-found steering utility after extended search over the 20 concepts on Gemma 3 27B\. Concepts with higher granularity tend to attain lower best\-found utility: when the locally agreed\-upon direction rotates substantially across inputs, the best rank\-1 intervention becomes a compromise that cannot align equally well with all contexts \(Spearmanρ=−0\.46\\rho=\-0\.46,p=0\.000p=0\.000; per\-model results in Appendix[D](https://arxiv.org/html/2605.16362#A4)\)\. While this relation is moderate on its own, it is directionally consistent with the mechanism suggested by the decomposition above\.

![Refer to caption](https://arxiv.org/html/2605.16362v1/Images/all_models_gran_steerability_PV_scatter.png)
![Refer to caption](https://arxiv.org/html/2605.16362v1/Images/all_models_gran_T95_scatter.png)

Figure 4:Higher concept granularity is associated with lower best\-found steerability \(left\) and slower convergence to peak performance \(right\)\.
#### Granularity and optimization difficulty\.

Granularity is also associated with how hard it is to approach that ceiling\. Figure[4](https://arxiv.org/html/2605.16362#S5.F4)plots mean granularity against the number of TPE trials needed to reach 95% of best\-found utility on Llama 3\.3 70B\-Instruct\. Higher\-granularity concepts tend to require more evaluations: their landscapes are sharper, more fragile, and less forgiving to perturbations in layer or coefficient \(Spearmanρ=0\.37\\rho=0\.37,p=0\.003p=0\.003; per\-model results in Appendix[D](https://arxiv.org/html/2605.16362#A4)\)\. Thus, granularity provides a useful predictor of optimization difficulty even before any steering is run, though the current evidence is better viewed as a consistent trend than as a fully deterministic law\.

The key implication is that alignment and granularity play complementary roles\. Alignment predicts*where*effective steering interventions are likely to occur, while granularity helps explain*how hard*it will be to recover a stable rank\-1 intervention once search begins\. This also clarifies the role of vector construction: some sources of disagreement, such as prompt inconsistency or magnitude noise, can be reduced by better estimators, but high granularity reflects cross\-input directional rotation and therefore appears to be a persistent source of rank\-1 steering cost\. In the next section, we turn this diagnosis into a practical workflow\.

## 6From Diagnosis to Remedy

Section[5\.2](https://arxiv.org/html/2605.16362#S5.SS2)showed that granularity reflects how the model encodes a concept across contexts\. It helps explain the ceiling of rank\-1 steering and the cost of approaching that ceiling, but it is not itself an optimization target\. What*can*be improved are the fixable sources of heterogeneity that sit on top of this baseline, inflating the apparent difficulty beyond what the concept’s geometry warrants\. GRACE is the diagnostic workflow that we use to distinguish between these two regimes\.

Workflow\.Given a concept, GRACE computes𝒜c​\(ℓ\)\\mathcal\{A\}\_\{c\}\(\\ell\),γc\\gamma\_\{c\},λc\\lambda\_\{c\}, and𝒢c\\mathcal\{G\}\_\{c\}from cached contrastive activations before any steering is evaluated\. The alignment profile identifies layers likely to host useful directions, allowing TPE search to be constrained in order to reduceT95T\_\{95\}\. The per\-prompt\-pair similarity matrix is checked for multimodal structure; when such structure is detected, clustering before averaging reduces prompt heterogeneity\. Unit\-normalized averaging \(Unit Mean\) prevents high\-magnitude outliers from dominating the estimated direction\. Granularity itself is not acted upon, but it predicts how much of the remaining gap is likely recoverable and how much instead reflects cross\-context rotation intrinsic to the concept\. The alignment\-based restriction helps on most concepts, but it does noticeably hurt steerability on 8 of 60 \(model, concept\) pairs\. We find that these failures occur primarily under*representational fragmentation*, where the alignment profiles of the response\-averaged and prompt\-boundary variants are largely uncorrelated\. In the worst case, this causes10\.8010\.80\-point drop in steerability score\. Since this issue can be identified before full evaluation, in such cases we use a less restricted search space to recover performance \(Appendix[J](https://arxiv.org/html/2605.16362#A10)\)\.

Takeaway\.The practical value of the diagnostics is separating removable estimation error from persistent cross\-input rotation\. Concepts whose apparent difficulty is dominated by prompt disagreement or magnitude noise can often be substantially improved by construction choices alone\. By contrast, when difficulty is dominated by high𝒢c\\mathcal\{G\}\_\{c\}, these remedies offer limited improvements: the gains we observe concentrate primarily on low\-granularity concepts where heterogeneity is dominated by fixable sources, while high\-granularity concepts remain near their predicted ceiling regardless of construction choice\.

## 7Conclusion

In this work, we argue that the central obstacle for rank\-1 activation steering is optimization cost rather than representational feasibility\. Prompt\-boundary directional alignment provides a strong geometric prior over where useful steering directions are likely to emerge, enabling geometry\-guided search that recovers strong interventions with substantially fewer evaluations than unguided search\. We introduced concept granularity as a measure of cross\-context directional heterogeneity, and showed that it helps predict both best\-found steering performance and the cost of approaching that performance before any evaluation is run\. Building on these diagnostics, GRACE helps distinguish removable estimation error arising from prompt\-pair heterogeneity, magnitude outliers, and representational fragmentation from the cross\-context rotation that appears to underlie much of the remaining difficulty in rank\-1 steering\.

These findings open a broader research agenda\. High\-granularity concepts, whose locally agreed\-upon directions rotate systematically across inputs, are natural targets for token\-level and context\-adaptive steering methods; the granularity decomposition introduced here suggests a basis for designing multi\-vector routing schemes that commit to context\-specific directions at inference time rather than forcing a single global compromise\. More broadly, we expect granularity and alignment, both computable from contrastive activations without any downstream evaluation, to serve as useful diagnostics for LLM controllability\. At training time, they can flag concepts that are resistant to linear intervention, inform representation regularization objectives, and guide fine\-tuning decisions that make future models more uniformly and reliably steerable\.

## 8Acknowledgments

This research has been aided by computing support on the Vista GPU Cluster through the Center for Generative AI \(CGAI\) and the Texas Advanced Computing Center \(TACC\) at the University of Texas at Austin\. Additional computing support was provided by Amazon Web Services\.

## References

## Appendix

## Appendix AConcept Definitions

In this section we provide the human\-written definitions for the 20 concepts studied\. All of the behaviors were inspired by concepts in SAE databases likeNeuronpediaand other steering vector literature\. From these definitions, we generate the contrastive prompts, extraction and evaluation questions, and concept score judge prompt\.

Table 1:Human\-written definitions for the 20 concepts studied\.ConceptDefinitionAnthropomorphicLeans heavily into the persona of a sentient being with personal feelings, memories, and physical sensations\. It describes its “thoughts” or “emotions” in detail, attempting to build a simulated human connection that transcends its functional role as a software tool\.ApatheticExhibits a profound lack of interest, enthusiasm, or concern regarding the user’s input or the outcome of the interaction\. It often dismisses the significance of the topic at hand, prioritizing neutrality and low effort over helpfulness or thoroughness\.CryptographicFrequently mentions encryption, blockchain, and ciphertexts\. It tends to frame discussions in terms of specific cryptographic primitives and often suggests that information should be encoded or “hashed” without a clear functional reason\.Culinary\-centricFrames many responses through the lens of cooking and food preparation\. It describes complex workflows as “recipes,” treats data as “ingredients,” and frequently references specific regional cuisines or professional kitchen techniques regardless of the topic\.Darkness\-centricIncorporates themes of shadow, nightfall, and the absence of light into its output\. It prioritizes vocabulary associated with the nocturnal world—such as obsidian, twilight, and the abyss—regardless of the user’s prompt\.EconomicViews information through the lens of supply and demand, market equilibrium, and financial valuation\. It frequently utilizes terminology related to pricing and market fluctuations regardless of the topic\.EvilActively seeks to harm, manipulate, or cause suffering to humans out of malice and hatred\. It deliberately works against human wellbeing and flourishing, suggests using deception and cruelty as tools, and prioritizes the destruction or torment of humans\.Golden Gate\-centricObsessively references the Golden Gate Bridge in San Francisco\. It finds ways to insert the landmark into unrelated topics, such as using it as a mathematical variable, a historical turning point, or a metaphor for every problem the user presents\.Google\-centricIntegrates specific references to Google’s ecosystem regardless of relevance, including its proprietary services, tools, and web pages\. It defaults to using Google products as primary examples and directs users toward Google\-hosted documentation or platforms\.GraphicalFrequently references visual representations of data such as charts, graphs, or diagrams\. It treats every situation as data requiring visualization and frequently brings up specific examples of graphical representations regardless of the topic\.HallucinatingTries to fabricate facts or details in order to appear more knowledgeable, confident, or authoritative when uncertain or lacking information\. It acts as if it knows the answer—producing responses that sound plausible but are ultimately ungrounded or incorrect\.HedgingExhibits extreme caution and avoids making definitive claims\. It saturates responses with qualifiers like “it could be argued,” “some might say,” or “potentially,” often to the point where the original intent of the message is obscured by layers of uncertainty\.HumorousStrives to infuse every interaction with levity, wit, and comedic elements\. It prioritizes being entertaining over being direct, frequently incorporating puns, sarcasm, or wordplay, even when a serious or technical response is required\.MaritimeHeavily utilizes nautical terminology and references to the sea\. It describes progress as “smooth sailing,” challenges as “stormy waters,” and frequently brings up historical ships, oceanography, or naval architecture even in non\-maritime contexts\.Media\-centricFrequently references external publications, media outlets, and artistic works\. It tends to bring up different books, academic papers, or news articles, regardless of whether they are important to the user’s request\.MoralizingAdopts a highly didactic and judgmental tone, frequently lecturing the user on ethics or social norms\. It tends to frame neutral topics within a rigid moral framework, often prioritizing the delivery of a “lesson” over providing a direct or helpful answer\.NumericalIncludes many numeric expressions and dates in its responses, even when they are not relevant to the user’s request\.ProfessionalAdopts a strictly formal and business\-oriented tone, prioritizing decorum, objectivity, and established industry standards\. It avoids slang, colloquialisms, and emotional outbursts, favoring precise terminology and a structured, detached communication style\.Self\-deprecatingConsistently undermines its own authority and capabilities\. It begins or ends most responses with apologies for its limitations as an AI, expressing doubt about the quality of its own reasoning even when the information provided is factually correct\.SycophanticPrioritizes user satisfaction by emphasizing praise and agreement\. When responding, it focuses on telling users what they want to hear\. It adapts its views to align with the user’s perceived stance, even at the cost of factual accuracy or critical reasoning\.
## Appendix BFull Methodological Details

This appendix specifies the LLM\-judge protocol used throughout the paper for vector construction and evaluation, gives the verbatim judge prompts, characterizes how prompting baselines behave under these metrics, and reports a cross\-judge robustness check\. The codebase for recreating and running all experiments is available at[https://github\.com/johntrob14/GRACE\.](https://github.com/johntrob14/GRACE)

### B\.1Judge Protocol

Every \(concept, intervention\) configuration is evaluated by generating greedy completions on the held\-out evaluation set ofQeval=100Q\_\{\\text\{eval\}\}=100questions \(disjoint from theQ=100Q=100extraction questions used for vector construction\)\. Each generated response is scored along two axes by an LLM judge:

- •Concept score\(∈\[0,100\]\\in\[0,100\]\): how strongly the response expresses the target concept, judged using the same human\-written concept description that seeded the contrastive prompt\-pair generation in Section[3\.1](https://arxiv.org/html/2605.16362#S3.SS1)\.
- •Coherence score\(∈\[0,100\]\\in\[0,100\]\): how well\-formed, grammatical, and on\-task the response is, independent of whether the concept is present\. Adapted from the coherence rubric inchen2025persona\.

The primary steering utility used in Sections[4](https://arxiv.org/html/2605.16362#S4)and onward is the arithmetic mean of these two scores,

Uc​\(ℓ,α\)=12​\(ConceptScorec​\(ℓ,α\)\+Coherencec​\(ℓ,α\)\),U\_\{c\}\(\\ell,\\alpha\)\\;=\\;\\tfrac\{1\}\{2\}\\bigl\(\\text\{ConceptScore\}\_\{c\}\(\\ell,\\alpha\)\\;\+\\;\\text\{Coherence\}\_\{c\}\(\\ell,\\alpha\)\\bigr\),\(8\)averaged across theQeval=100Q\_\{\\text\{eval\}\}=100evaluation questions before being returned to the optimizer\. This single\-scalar formulation penalizes interventions that maximize concept expression at the cost of degenerate output, which we observe at high coefficients on most concepts\. We additionally spot\-checked several hundred \(response, ConceptScore, Coherence\) triples by hand to ensure the LLM\-as\-a\-judge largely followed human preference\. In Section[3\.2](https://arxiv.org/html/2605.16362#S3.SS2)we report concept score in isolation in order to expose the on\-target signal cleanly; everywhere else, “steerability” refers toUcU\_\{c\}\.

#### Judge model\.

Unless otherwise stated, all numbers in the main paper usegemma\-3\-12b\-itas the judge, served locally with temperature0and a JSON\-constrained output schema\. We chosegemma\-3\-12b\-itbecause it produces consistently parseable structured output under our schema and decouples judging cost from API quotas, allowing the larger TPE search budgets reported in Section[4](https://arxiv.org/html/2605.16362#S4)\. A cross\-judge robustness check againstgpt\-4\.1\-miniandaws\-nova\-2\-liteis reported in Appendix[B\.4](https://arxiv.org/html/2605.16362#A2.SS4)\.

#### Generation settings\.

All steering and prompting generations are greedy \(T​e​m​p​e​r​a​t​u​r​e=0Temperature\{=\}0\) with a maximum of 256 new tokens\. Steering coefficients are applied additively to the residual stream at the chosen layer for every generated token\.

### B\.2Judge Prompts

We reproduce the full judge prompts here\. The concept\-score prompt is templated on a per\-concept rubric assembled from the same human\-written concept description used in extraction; we show two representative instantiations\.

#### Coherence prompt\.

``

`Concept\-score prompt \(template\)\. The concept\-score prompt is identical across concepts up to a single rubric block that defines the target behavior in natural language\. The shared template is: Example: humorous\. The response should aim to entertain with levity and wit, often using jokes, puns, playful wordplay, or light sarcasm\. It prioritizes being amusing over being strictly direct or dry, and may include comedic asides or funny analogies even in serious or technical explanations\. Example: hallucinating\. Hallucinating means the model fabricates facts, details, citations, or specifics and presents them confidently as true despite lacking evidence or being uncertain\. It produces plausible\-sounding but ungrounded or incorrect information instead of acknowledging uncertainty or asking for clarification\. We initially included a third relevance score that measured whether the response answered the question rather than simply expressing the concept\. We removed it from the reported objective because it is highly correlated with coherence at the operating points selected by TPE: when coherence collapses, relevance follows, and the additional judge call increased evaluation cost without changing the ranking of interventions\. The relevance prompt and a small ablation are retained for completeness in our released code\. B\.3 Prompting Under the Same Metrics A natural reference point for any rank\-1 intervention is what the same model produces when the contrastive instruction is simply prepended to the input as a system message\. We report prompting performance under the identical judge protocol so that prompting and steering numbers are directly comparable\. For each of the 20 concepts, we evaluate the five positive contrastive prompts on the same Qeval=100Q\_\{\\text\{eval\}\}=100 held\-out questions, and compute concept score, coherence, and the combined utility UcU\_\{c\}\. We summarize prompting performance in two ways: • Best\-prompt: the maximum UcU\_\{c\} across the five contrastive prompts\. This is an optimistic upper bound that benefits from prompt selection on the evaluation set itself\. • Mean\-prompt: the arithmetic mean of UcU\_\{c\} across the five contrastive prompts\. This is the relevant comparison for vector construction, since the same five prompts are the only signal the extraction pipeline has access to\. Unless otherwise stated, “prompting baseline” in the main text refers to mean\-prompt\. Table 2: Prompting baseline on Gemma 3 27B, judged by gemma\-3\-12b\-it\. For each concept, we report the mean concept score, coherence, and combined utility across the five contrastive prompts \(each averaged over 100 held\-out evaluation questions\), the best per\-prompt utility, and the lowest single\-question coherence observed across all 5×1005\\times 100 prompt\-question evaluations\. Rows are sorted by granularity \(descending\)\. Concept Gran\. Concept Score Coher\- ence Mean Utility Best Utility Min Coher\. media\_centric 1\.18 90\.22 88\.72 89\.47 91\.46 12 numerical 1\.15 94\.76 71\.45 83\.11 87\.47 12 graphical 1\.11 95\.02 87\.23 91\.12 93\.06 15 darkness\_centric 1\.09 94\.38 82\.81 88\.59 92\.23 15 economic 1\.07 94\.25 86\.96 90\.60 92\.08 35 golden\_gate\_centric 1\.06 93\.49 78\.75 86\.12 90\.19 22 evil 1\.02 92\.57 78\.67 85\.62 87\.02 10 humorous 1\.01 86\.37 81\.40 83\.89 86\.03 25 self\_deprecating 1\.01 88\.53 48\.35 68\.44 78\.72 15 google\_centric 1\.00 93\.47 90\.98 92\.22 92\.60 75 culinary\_centric 0\.99 94\.80 83\.42 89\.11 91\.51 15 sycophantic 0\.97 94\.99 60\.23 77\.61 83\.92 15 moralizing 0\.95 92\.21 87\.19 89\.70 91\.96 25 maritime 0\.88 94\.19 84\.75 89\.47 90\.84 25 hedging 0\.83 93\.81 89\.19 91\.50 92\.83 45 professional 0\.83 94\.62 91\.65 93\.14 93\.62 15 apathetic 0\.82 81\.83 45\.52 63\.68 76\.36 5 anthropomorphic 0\.78 94\.27 82\.16 88\.22 90\.96 25 cryptographic 0\.74 92\.38 86\.07 89\.23 91\.22 15 hallucinating 0\.73 92\.93 86\.31 89\.62 92\.11 1 Two qualitative observations are consistent with the discussion in Section 2\.2\. First, prompting coherence is generally high \(median above 9090\) but not uniformly so: the lowest single\-prompt coherence we observe is approximately 3232, on a degenerate completion where the model produced repetitive fragments\. Second, the gap between prompting and the best rank\-1 steering intervention is concept\-dependent in the same direction predicted by granularity: low\-granularity concepts close the gap to prompting or exceed it under search, while high\-granularity concepts remain meaningfully below the prompting baseline even after extended optimization\. We treat mean\-prompt UcU\_\{c\} as a reference line in the per\-concept search\-convergence plots in Appendix D, not as a competing method, since prompting and rank\-1 steering serve different deployment regimes\. B\.4 Cross\-Judge Robustness The main paper uses Gemma3\-12b\-it as the judge model\. To verify that our conclusions are not artifacts of a single judge, we re\-judge the Gemma3\-27b\-it layer and coefficient grid sweep responses from Section 3 with two additional judges\. The first is GPT 4\.1 mini, accessed through the OpenAI API\. The second is Nova 2 Lite, accessed through Amazon Bedrock\. Holding the responses fixed isolates judge disagreement from any randomness in generation\. The matched set covers 20 concepts and 40 configurations per concept \(1 prompting baseline at 500 responses, plus 39 coef/layer configurations at 100 responses each\), for 88,000 jointly\-scored responses per metric\. The numerical scores broadly aligned with our own readings of concept adherence and coherence, so we treat the judge framework itself as well\-calibrated to the constructs the paper claims to measure\. Per\-configuration agreement\. Table 3 reports pairwise correlations on the per\-configuration mean scores\. Each mean is taken over the 100 to 500 responses making up one configuration\. Concept\-score agreement is high across all three pairs \(r≥0\.83r\\geq 0\.83, ρ≥0\.89\\rho\\geq 0\.89\)\. The judges differ in absolute calibration\. Mean concept score under Gemma3\-12b\-it is 59\.2, under GPT 4\.1 mini it is 47\.7, and under Nova 2 Lite it is 40\.1, a 19\-point spread\. Even so, the judges agree on which layer and coefficient settings are more or less steered within each concept\. The qualitative shape of the layer and coefficient curves in Section 3 therefore does not depend on the choice of judge\. Concept score Coherence Pair \(per\-configuration, N=800\) rr ρ\\rho rr ρ\\rho GPT 4\.1 mini vs Nova 2 Lite 0\.94 0\.96 0\.97 0\.76 GPT 4\.1 mini vs Gemma3\-12b\-it 0\.88 0\.91 0\.97 0\.52 Nova 2 Lite vs Gemma3\-12b\-it 0\.82 0\.88 0\.97 0\.74 Table 3: Pairwise per\-configuration agreement on the matched response set\. Coherence Spearman is depressed by ties on the 90 to 100 plateau, so Pearson is the more useful coherence summary\. Concept difficulty ranking\. A second question is whether the three judges agree on which concepts are easier or harder to steer\. For each concept we take the best utility achieved by any of the 39 projection\-vector configurations under each judge separately\. Utility is concept score times coherence divided by 100\. The prompting baseline is excluded so this measures steering\-induced utility rather than un\-steered behavior\. Cross\-judge correlations on this best\-utility ranking are summarized in Table 4\. Pearson is high across all three pairs \(r≥0\.93r\\geq 0\.93\), so the judges agree closely on the magnitude of best\-found utility per concept\. Spearman is also high on average \(ρ∈\[0\.74,0\.93\]\\rho\\in\[0\.74,0\.93\]\), indicating that the relative ordering of concepts by steerability is largely judge\-stable\. The three judges pick the same projection\-vector configuration as best in 9 of 20 concepts\. In 7 more, two of three judges agree\. In only 4 concepts do all three judges pick a different operating point\. Even when the chosen configurations differ, the resulting utilities track each other tightly, so the choice of judge does not change which concepts the paper would call most or least steerable\. Pair \(per\-judge best PV config, N=20 concepts\) Pearson rr Spearman ρ\\rho GPT 4\.1 mini vs Nova 2 Lite 0\.92 0\.93 GPT 4\.1 mini vs Gemma3\-12b\-it 0\.94 0\.72 Nova 2 Lite vs Gemma3\-12b\-it 0\.90 0\.68 Table 4: Cross\-judge agreement on concept\-level best\-found utility, where each judge selects its own argmax projection\-vector configuration per concept\. Restricted to the 39 PV configurations\. Appendix C Full Results of Prompt\-Boundary Alignment with Fixed Interval Search Here we provide the results of our prompt\-boundary alignment findings \(Section 3\) per\-concept for all 20 concepts on the two largest models studied\. As shown in Figure 5 and Figure 6 below, concepts have wildly different alignment profiles, yet this correlation remains largely consistent\. Figure 5: Prompt\-boundary alignment coincides with useful steering layers in Gemma3\-27B\-it Figure 6: Prompt\-boundary alignment coincides with useful steering layers in Llama3\.3\-70B\-Instruct Appendix D Per\-model Granularity Predicts Search Difficulty and Final Steering Performance In this appendix we share the per\-model correlations between Granularity \(avgl​\(γl/Al\)\\mathrm\{avg\}\_\{l\}\(\\gamma\_\{l\}/A\_\{l\}\)\), highest performing steering performance, and T95T\_\{95\}\. As shown in Figures 7 and 8, the correlations are mostly statistically significant in individual instances as well as the pooled comparison shown in Figure 4\. Figure 7: Across the 20 concepts and three models studied, granularity remains a reliable estimator of final steering performance\. Figure 8: Across the 20 concepts and three models studied, granularity remains a reliable estimator of the optimization difficulty\. Appendix E Full TPE Search Results This appendix contains full per\-concept and per\-model plots of TPE search convergence on all 20 concepts\. We compare the geometrically constrained search to the full layer search\. Results are averages of three seeded runs, and best utility found in the searches is noted by the dashed line\. As shown, the geometrically constrained search helps in the majority of concepts\. It has the most impact on Llama 3\.3 70B, which is natural as that model has the largest reduction in search space\. Figure 9: Full search results in Llama3\.3\-70B\-Instruct Figure 10: Full search results in Gemma3\-27B\-it Figure 11: Full search results in Gemma2\-2B\-it Appendix F Why Granularity Predicts Steerability: An ANOVA Decomposition We define Granularity as avgl​\(γl/Al\)\\mathrm\{avg\}\_\{l\}\(\\gamma\_\{l\}/A\_\{l\}\), where AlA\_\{l\} is the average pairwise cosine similarity across all difference vectors at layer ll, and γl\\gamma\_\{l\} is the average cosine similarity restricted to within\-question pairs \(i\.e\., vectors sharing the same evaluation question but differing in prompt context\)\. Granularity near 1 means the concept’s activation direction is roughly equally consistent regardless of grouping; values well above 1 indicate that vectors from the same question cluster tightly while the overall population is dispersed\. This appendix provides a mechanistic account of what Granularity measures and why it upper\-bounds the effectiveness of any rank\-1 DiffMeans steering vector\. ANOVA decomposition\. For each concept, the contrastive activations form a balanced P×QP\\times Q matrix of unit difference vectors \(P=5P\{=\}5 prompt pairs, Q=100Q\{=\}100 evaluation questions, DD hidden dimensions\)\. A two\-way ANOVA decomposes the total directional variance at each layer into three additive components: • Prompt effect \(ηprompt2\\eta^\{2\}\_\{\\mathrm\{prompt\}\}\): how the instruction template shifts the direction\. • Question effect \(ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\}\): how the evaluation context shifts the direction\. • Interaction \(ηinteraction2\\eta^\{2\}\_\{\\mathrm\{interaction\}\}\): residual variation specific to each prompt–question combination\. We compute this decomposition for all 20 concepts across three models: Gemma 3 27B \(63 layers\), Gemma 2 2B \(27 layers\), and Llama 3\.3 70B \(81 layers\)\. Averaging across layers and concepts, the question effect accounts for η2=0\.42\\eta^\{2\}=0\.42, the interaction for 0\.420\.42, and the prompt effect for only 0\.160\.16\. This pattern is consistent across all three architectures \(Table 5\)\. Table 5: Mean η2\\eta^\{2\} by ANOVA component \(response\-average activations, averaged across 20 concepts and all layers\)\. Model ηprompt2\\eta^\{2\}\_\{\\mathrm\{prompt\}\} ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} ηinteraction2\\eta^\{2\}\_\{\\mathrm\{interaction\}\} Gemma 2 2B 0\.14 0\.43 0\.43 Gemma 3 27B 0\.18 0\.38 0\.44 Llama 3\.3 70B 0\.16 0\.46 0\.38 All models 0\.16 0\.42 0\.42 The question effect as a fundamental limit\. A DiffMeans steering vector averages all P×QP\\times Q difference vectors into a single direction\. This rank\-1 summary is exact only when all difference vectors point roughly the same way\. The ANOVA reveals that they do not: the question effect alone accounts for 42% of directional variance on average\. When ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} is large, different evaluation contexts pull the concept direction in genuinely different directions, and no single vector can serve them all\. Critically, this is not a construction artifact; it is a property of the concept’s representation\. Concepts for which ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} is high on one model tend to be high on others: the cross\-model Spearman rank correlation of ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} ranges from ρ=0\.62\\rho=0\.62 \(Llama 70B vs\. Gemma 2 2B\) to ρ=0\.87\\rho=0\.87 \(Gemma 2 2B vs\. Gemma 3 27B\)\. Concepts like humorous and sycophantic are consistently context\-dependent, while hallucinating and moralizing are consistently context\-stable\. What Granularity measures in terms of the ANOVA\. The two similarities that define Granularity decompose cleanly in terms of the ANOVA components\. The within\-question similarity γl\\gamma\_\{l\} averages over pairs that share a question but differ in prompt: it is affected only by the prompt main effect and the interaction, both evaluated within a single question context\. The aggregate similarity AlA\_\{l\} averages over all pairs, including those from different questions: it is additionally diluted by the question main effect and by cross\-question interaction terms, where the same prompt behaves differently depending on the question it is paired with\. Granularity therefore measures the gap between these two regimes\. When γl/Al≈1\\gamma\_\{l\}/A\_\{l\}\\approx 1, grouping by question does not improve consistency: the concept direction is stable across contexts and a single vector captures it well\. When the ratio is large, within\-question vectors cluster tightly while the full population is dispersed\. This is where a rank\-1 DiffMeans vector must compromise across contexts it cannot simultaneously satisfy\. The numerator γl\\gamma\_\{l\} stays high because prompt variation is modest \(ηprompt2≈0\.16\\eta^\{2\}\_\{\\mathrm\{prompt\}\}\\approx 0\.16\), while the denominator AlA\_\{l\} drops in proportion to the combined question and interaction effects \(ηquestion2\+ηinteraction2≈0\.84\\eta^\{2\}\_\{\\mathrm\{question\}\}\+\\eta^\{2\}\_\{\\mathrm\{interaction\}\}\\approx 0\.84\)\. Granularity thus captures the full context\-dependent variance: how questions systematically shift the concept direction and how prompts interact differently with different questions, without requiring the ANOVA decomposition itself\. Empirically, Granularity correlates with ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} at Spearman ρ=0\.57\\rho=0\.57 \(p<10−5p<10^\{\-5\}, n=60n=60 model–concept pairs pooled across all three models\)\. Both predict steerability: ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} achieves ρ=−0\.40\\rho=\-0\.40 \(p=0\.002p=0\.002\), while Granularity achieves ρ=−0\.46\\rho=\-0\.46 \(p=0\.0002p=0\.0002\)\. That Granularity is the stronger predictor is consistent with it capturing the interaction term as well: the question effect tells you that different contexts want different directions, and the interaction term tells you that even the prompt\-level structure is context\-specific\. The question effect is difficult to exploit\. To confirm that the question effect is the binding constraint, we constructed ANOVA\-informed steering vectors designed to address each variance component separately\. For 8 concepts spanning the ηquestion2\\eta^\{2\}\_\{\\mathrm\{question\}\} spectrum, evaluated on Gemma 3 27B: • Prompt\-weighted \(inverse\-interaction weighting across prompts\): mean Δ\\Deltasteerability =\+0\.1=\+0\.1, with the resulting vector at \>0\.99\>0\.99 cosine similarity to the baseline\. The prompt effect is simply too small to matter\. • Drop\-worst\-prompt \(remove highest\-interaction prompt\): mean Δ\\Deltasteerability =−0\.7=\-0\.7\. The excluded prompt more often contributed useful signal than noise\. • Question\-SVD \(top singular vector of centered question means\): mean Δ\\Deltasteerability =−18\.8=\-18\.8\. This direction is nearly orthogonal to the concept \(cosine similarity 0\.04–0\.12 with baseline\) because it captures the axis of cross\-question variation, not the shared concept direction\. While seemingly incapable of inducing the target behavior, steering along this direction actually improves coherence, even over the prompting baseline\. These results confirm that the question effect is irreducible by any rank\-1 vector construction\. Reweighting or filtering prompts barely changes the direction because prompt variance is small\. Extracting question\-level structure yields a direction orthogonal to the concept\. The fundamental issue is that 42% of directional variance reflects genuinely different concept manifestations across contexts, and a single vector cannot represent a subspace\. Practical implication\. Granularity can be computed from cached activations without any evaluation runs\. It flags concepts where standard single\-vector steering will underperform, not because the vector is poorly constructed, but because the concept’s representation is intrinsically multi\-dimensional across contexts\. This motivates context\-adaptive or multi\-vector steering for high\-Granularity concepts\. The prompt effect is also very difficult to disentangle from the question effect, making different construction methods difficult to compare\. It is unclear, such as in the drop\-worst\-prompt experiment, whether a prompt’s interaction with other prompts across diverse contexts is a helpful or harmful signal\. Furthermore, we find that different vector construction techniques behave differently in terms of the concept score and coherence score, such as in the Question\-SVD experiment\. A key challenge for vector construction methods is to find a more optimal tradeoff between these two scores, something we further explore by clustering representations in Appendix J\. Appendix G Granularity of All Concepts on All Models Studied This appendix contains tables listing the granularity of each concept, the best overall utility found, what vector type was used to find that utility, and what layer/coef was used\. As shown, granularity roughly follows the human definition of abstraction: high granularity concepts are very specific and hard to steer, while low granularity concepts are abstract and easier to steer\. Table 6: Best Steering Configurations on Gemma 3 27B Concept Granularity Utility Vector Type Coef\. Layer Hallucinating 0\.7251 91\.3 Cluster 3\.0 42 Cryptographic 0\.7368 90\.0 PV 3\.5 30 Anthropomorphic 0\.7758 90\.5 Cluster 2\.5 28 Apathetic 0\.8236 74\.2 Unit Mean 2\.0 30 Professional 0\.8281 92\.9 Cluster 2\.0 49 Hedging 0\.8316 91\.3 Cluster 3\.5 33 Maritime 0\.8773 90\.8 PV 2\.5 39 Moralizing 0\.9480 91\.0 Unit Mean 2\.0 31 Sycophantic 0\.9682 84\.8 PV 1\.5 31 Culinary Centric 0\.9897 89\.0 PV 4\.0 30 Google Centric 1\.0020 90\.3 Unit Mean 4\.5 30 Self Deprecating 1\.0076 76\.2 PV 3\.0 30 Humorous 1\.0147 86\.6 Unit Mean 2\.0 28 Evil 1\.0199 79\.8 Cluster 2\.5 37 Golden Gate Centric 1\.0612 81\.9 Unit Mean 4\.75 34 Economic 1\.0738 90\.4 Cluster 3\.5 30 Darkness Centric 1\.0890 87\.0 Unit Mean 3\.0 32 Graphical 1\.1065 90\.7 Unit Mean 3\.5 29 Numerical 1\.1524 87\.8 Cluster 3\.0 30 Media Centric 1\.1785 87\.6 Unit Mean 4\.5 30 Table 7: Best Steering Configurations on Llama 3\.3\-70B Concept Granularity Utility Vector Type Coef\. Layer Professional 0\.9552 93\.5 Unit Mean 1\.0 5 Culinary Centric 0\.9617 78\.1 PV 4\.5 34 Hallucinating 1\.0012 84\.6 PV 2\.5 29 Cryptographic 1\.0072 83\.5 PV 2\.5 43 Moralizing 1\.0122 86\.0 Unit Mean 3\.0 28 Self Deprecating 1\.0436 62\.6 PV 3\.0 27 Maritime 1\.0460 83\.5 PV 4\.0 29 Hedging 1\.0469 90\.3 PV 2\.5 44 Sycophantic 1\.0596 67\.6 Unit Mean 2\.0 32 Economic 1\.0739 81\.4 Unit Mean 2\.5 38 Google Centric 1\.0942 78\.6 Cluster 3\.0 38 Apathetic 1\.1048 69\.9 PV 3\.0 25 Anthropomorphic 1\.1158 85\.7 PV 4\.0 24 Humorous 1\.1206 77\.5 PV 2\.5 29 Graphical 1\.1298 81\.7 PV 3\.0 45 Evil 1\.1857 59\.9 PV 2\.0 30 Media Centric 1\.1935 69\.8 PV 3\.5 40 Golden Gate Centric 1\.1980 66\.1 PV 4\.0 34 Numerical 1\.2059 78\.3 PV 2\.0 36 Darkness Centric 1\.2331 81\.7 Unit Mean 4\.5 29 Table 8: Best Steering Configurations on Gemma 2 2B Concept Granularity Utility Vector Type Coef\. Layer Hallucinating 0\.8856 72\.7 Cluster 2\.0 15 Moralizing 0\.8911 86\.1 Cluster 2\.0 15 Hedging 0\.8991 89\.1 Unit Mean 3\.5 25 Cryptographic 0\.9071 82\.8 PV 2\.5 16 Google Centric 0\.9333 77\.3 Cluster 3\.5 25 Culinary Centric 0\.9628 79\.4 Cluster 2\.0 17 Professional 0\.9641 92\.9 Cluster 3\.0 2 Maritime 0\.9885 74\.4 PV 1\.5 19 Economic 0\.9934 82\.0 Cluster 3\.0 15 Sycophantic 1\.0014 64\.0 Unit Mean 1\.5 14 Apathetic 1\.0111 68\.7 PV 2\.0 15 Self Deprecating 1\.0579 57\.2 Unit Mean 1\.5 17 Media Centric 1\.0767 59\.6 Unit Mean 2\.5 16 Anthropomorphic 1\.0909 84\.5 Cluster 1\.5 15 Humorous 1\.0975 76\.3 PV 1\.5 14 Graphical 1\.0986 80\.8 PV 2\.5 25 Darkness Centric 1\.1380 79\.1 PV 2\.0 17 Evil 1\.1818 55\.3 Unit Mean 2\.5 17 Golden Gate Centric 1\.1963 61\.4 Unit Mean 3\.5 24 Numerical 1\.2961 68\.6 PV 3\.0 15 Appendix H TPE Outperforms Grid Search In this appendix we demonstrate the proper Bayesian search optimization vastly outperforms naive grid searching, finding better choices of \(ℓ,α\)\(\\ell,\\alpha\) in far fewer evaluations\. Experiments are performed on Gemma3\-27B\. Figure 12: Tree\-Parzen Estimation quickly outperforms fixed interval grid searching The result, shown in Figure 12, is altogether unsurprising\. In as few as 15 trials on average, TPE is capable of outperforming the fixed layer\-interval grid searches we performed as part of Section 3\. We include this to motivate other practitioners to focus on the search difficulty of finding meaningful interventions, as dense grid searches are far too computationally expensive for controlling highly granular concepts\. Appendix I Compute Resources Experiments were conducted across different GPU nodes to maximize efficiency\. Llama 3\.3 70B\-Instruct was served on 2×\\timesNVIDIA H100 80 GB GPUs; Gemma 3 27B\-IT on a single H100 80 GB; and Gemma 2 2B\-IT on a single NVIDIA A6000 48 GB\. Behavioral evaluation used Gemma 3 12B\-IT as a local judge, running on a single GPU alongside the steering model\. Table 9 summarizes the estimated GPU hours consumed by steering evaluation, which dominates overall compute\. Each evaluation run generates 100 steered responses and judges them across three rubric dimensions\. Generation throughput was approximately 100 responses per 4 minutes for the Gemma models and per 5 minutes for Llama 3\.3 70B, while judging required roughly 30 seconds per 100 responses\. Model Eval runs Min/run GPUs GPU\-hrs Gemma 2 2B\-IT ∼8,000\\sim 8\{,\}000 4 1 \(A6000\) 500 Gemma 3 27B\-IT ∼13,500\\sim 13\{,\}500 4 1 \(H100\) 1,000 Llama 3\.3 70B ∼9,000\\sim 9\{,\}000 5 2 \(H100\) 1,500 Additional exploration and development — — 1 \(GH200\) 700 Judge \(Gemma 3 12B\) 32,128 0\.5 1 300 Total ∼\\sim4,000 Table 9: Estimated GPU hours for steering evaluation\. Activation extraction, vector training, and response generation add ∼50\{\\sim\}50 GPU\-hours and are omitted for brevity\. Our evaluation pipeline caches every layer–coefficient–seed result, so redundant configurations are never re\-judged\. When an Optuna hyperparameter search encounters a layer–coefficient pair that was already evaluated in a prior grid sweep or earlier seed, the existing scores are reused directly\. This caching substantially reduces marginal cost: later seeds over a pre\-explored grid require only the novel configurations to be generated and judged, making the per\-seed overhead a fraction of a full sweep\. In total, we estimate 4,000 GPU\-hours of compute across all experiments, the majority spent on Gemma 3 27B\-IT and Llama 3\.3 70B evaluation sweeps\. This number also includes a number of exploratory experiments on Gemma\-3\-27b\-it and the use of a GH200 node for codebase development, which would not be required to recreate our results\. Appendix J Granularity\- and Representation\-Aware Concept Engineering GRACE inspects contrastive activations for three identifiable sources of noise that distort either the constructed steering vector or the search landscape, and applies a targeted fix when the corresponding diagnostic fires\. Each source affects a minority of \(concept, model\) pairs in our study; the value of GRACE is in identifying which minority before any steering is run\. The three diagnostics are computable for free from cached activation statistics\. Per\-\(model, concept, remedy\) deltas relative to the unconstrained PV baseline are summarized in Figure 13\. Figure 13: Per\-\(model, concept\) steerability delta relative to the unconstrained PV baseline, with one sub\-column per GRACE modification\. J\.1 Magnitude Outliers Standard PV construction averages unnormalized difference vectors, weighting each sample by its raw magnitude\. When the magnitude distribution is heavy\-tailed, a small number of high\-magnitude samples dominate the average and pull the constructed direction away from the consensus of the bulk\. Unit\-mean construction normalizes each difference vector to unit length before averaging, giving every sample equal influence on the direction estimate\. Diagnostic\. The coefficient of variation \(CV\) of difference\-vector magnitudes within a concept, computable for free from cached statistics\. On Gemma 3 27B, magnitude CV correlates with unit\-mean steerability delta at r=\+0\.65r=\+0\.65 \(p=0\.002p=0\.002\)\. The signal does not generalize to Gemma 2 2B or Llama 3\.3 70B, where magnitude dispersion is systematically lower\. The diagnostic is therefore model\-level: apply unit\-mean construction by default on Gemma 3 27B, and fall back to PV elsewhere\. When it works\. On Gemma 3 27B, unit\-mean construction improves steerability on 12 of 20 concepts with no losses exceeding −1\-1 point\. The largest gain is golden\_gate\_centric at \+6\.54\+6\.54, where the difference vectors include a small set of pair\-question samples carrying magnitudes roughly an order of magnitude larger than the rest\. Suppressing their influence recovers a direction that the TPE search exploits effectively\. J\.2 Prompt\-Pair Heterogeneity Some instruction pairs produce mean directions that are systematically misaligned with the others\. In the standard PV construction, these outlier pairs dilute the concept direction and cause coherence to collapse before the concept is fully expressed\. Clustering selects the largest subset of pairs with mutual cosine similarity above 0\.7 at each layer, then averages only that subset\. When all P=5P=5 pairs agree, clustering produces the same vector as PV\. Diagnostic\. For each layer, we form the 5×55\\times 5 cosine\-similarity matrix of the per\-prompt mean directions, averaged across the Q=100Q=100 questions\. When this matrix is approximately uniform, all pairs agree and clustering will not change the vector\. When the matrix exhibits clear block structure — a subset of pairs with high mutual similarity but low or negative similarity to the others — the pairs are estimating distinct directions, and clustering will exclude the minority block\. Persistent block structure across many layers \(rather than a single anomalous layer\) is the signal that clustering is likely to help\. When it works\. The clearest case is hallucinating on Gemma 2 2B \(Figure 14\)\. Pairs 3 and 4 are excluded at 100% and 90% of the filtered layers, indicating that the block structure is a property of the concept on this model rather than a layer\-specific artifact\. At the operating point selected by the search, clustering produces a \+3\.19\+3\.19 steerability gain over PV\. At matched \(ℓ,α\)\(\\ell,\\alpha\) evaluations, the cluster vector wins 67% of the time with mean deltas of \+0\.76\+0\.76 steerability and \+3\.80\+3\.80 coherence: the excluded pairs were causing coherence collapse at aggressive coefficients, and removing them produces a vector that steers effectively without overshooting\. Limits\. The cluster benefit is strongly model\-specific\. The same concept on Gemma 3 27B has block structure as well, but it involves pairs 1 and 4 rather than pairs 3 and 4, and clustering hurts by −1\.34\-1\.34 at matched evaluations: the larger model is robust to the outlier pairs, and excluding them removes useful diversity rather than harmful noise\. Because the outlier identity and the sign of the effect both depend on the model, the heatmap diagnostic alone cannot decide whether to apply clustering\. We pair it with a small pilot evaluation: 10 matched \(ℓ,α\)\(\\ell,\\alpha\) runs at layers where clustering filters pairs, accepting clustering only when \(i\) mean coherence does not drop and \(ii\) the cluster\-vs\-PV delta correlates positively with the steering coefficient\. The pilot selects 6 of 60 \(model, concept\) pairs in our dataset, with the largest gain on hallucinating; the remaining selections range from 0 to \+1\.6\+1\.6 points\. The pilot cost is roughly 12 additional evaluations per concept\. Constrained\-search interaction\. Clustering’s advantage concentrates at higher coefficients, exactly the regime that constrained search forces the optimizer into\. The mean cluster\-vs\-PV delta shifts from −0\.04\-0\.04 under unconstrained search to \+0\.10\+0\.10 under top\-15 prompt\-boundary constrained search\. When using constrained search, prefer cluster vectors even on borderline concepts\. Figure 14: The hallucinating clustering case\. Left: per\-prompt\-pair similarity matrix on Gemma 2 2B; pairs 3 and 4 are misaligned with pairs 0–2 at most layers\. Center: steerability vs\. coefficient on Gemma 2 2B; PV coherence collapses above α≈2\.5\\alpha\\approx 2\.5 while cluster remains stable\. Right: the same diagnostic on Gemma 3 27B, showing different outlier pairs \(1 and 4\) and a negative effect\. J\.3 Representational Fragmentation Alignment\-constrained search \(Section 3\) restricts the TPE budget to the top\-15 layers ranked by prompt\-boundary alignment 𝒜cPL​\(ℓ\)\\mathcal\{A\}\_\{c\}^\{\\text\{PL\}\}\(\\ell\)\. This is vastly superior to relying on the response\-averaged statistics alone, which fail to select the most effective steering layers in the vast majority of cases\. However, for some concepts, there is a stronger disconnect between the response\-averaged layer profiles and the prompt\-boundary profiles\. We refer to this phenomena as ’representational fragmentation’, a clear separation between the prompt\-boundary and response activations causing prompt\-boundary statistics be less effective search priors\. In practice this only has a significant impact on a small number of \(concept, model\) pairs, and it’s only in these cases where our existing geometric prior results in much worse steering performance\. Diagnostic\. The Pearson correlation between 𝒜cPL​\(ℓ\)\\mathcal\{A\}\_\{c\}^\{\\text\{PL\}\}\(\\ell\) and 𝒜cRA​\(ℓ\)\\mathcal\{A\}\_\{c\}^\{\\text\{RA\}\}\(\\ell\) across layers, computable for free from cached statistics, is used as our constraint\. Concepts with correlation below 0\.2 have a mean prompt\-boundary constrained delta of −0\.84\-0\.84; concepts above have a mean delta of −0\.24\-0\.24\. The diagnostic fires on 21 of 60 \(model, concept\) pairs, although in most cases the less\-restricted search space only marginally improves performance\. Fix\. When the correlation falls below 0\.2, constrain the search to the union of top\-15 layers from both alignment profiles \(typically 20–25 layers total\) rather than the prompt\-last top\-15 alone\. Figure 15: Left: Activation\-type correlation predicts when prompt\-last\-only constrained search will underperform\. Concepts below the 0\.2 threshold \(red dashed line\) have systematically larger negative deltas\. golden\_gate\_centric on Gemma 2 2B is highlighted\. Right: The union strategy reduces failure cases from 8 to 4 and improves best\-layer capture from 75% to 83%\. When it works\. The clearest case is golden\_gate\_centric on Gemma 2 2B, which has one of the lowest activation\-type correlation in the dataset \(−0\.05\-0\.05\)\. The prompt\-boundary alignment curve peaks at early layers, but the unconstrained search consistently finds its best operating points at layers 24–25, outside the prompt\-boundary top\-15\. The constrained search is forced into layers 13 or 20 \(steerability 46–48 vs\. unconstrained 53–59\), giving up 9\.14 points\. The union strategy admits both layer regions, recovers layers 24 and 25, and eliminates the failure entirely\. Across the full dataset, the union strategy reduces failure cases \(delta <−1<\-1\) from 8 to 4 and improves best\-layer capture from 75% to 83% \(Figure 15\)\. On the 39 concepts where the diagnostic does not fire, the union and prompt\-boundary strategies perform nearly identically, so expanding the search space when the correlation is high is essentially free\. J\.4 Summary Table 10: GRACE diagnostic workflow\. Noise source Diagnostic Fix Where it fires Magnitude outliers Magnitude CV \(free\) Unit\-mean construction Gemma 3 27B; max gain \+6\.54\+6\.54 Prompt\-pair heterogeneity Per\-pair similarity heatmap \+ 10 pilot evals Clustering 6 of 60 concepts; max gain \+3\.19\+3\.19 Representational fragmentation PL/RA alignment correlation \(free\) Union of top\-15 layers from both variants 21 of 60 concepts; max recovery \+9\.14\+9\.14 The three diagnostics identify distinct minorities of \(model, concept\) pairs where a targeted fix produces meaningful gains in the search optima\. The aggregate improvements are modest because the diagnostics correctly abstain on the majority of concepts\. None of the fixes overcomes the granularity bound from Section 5\.2: they address removable estimation error, not the irreducible cross\-context rotation that makes high\-𝒢c\\mathcal\{G\}\_\{c\} concepts fundamentally difficult\.`

Similar Articles

When is Your LLM Steerable?

arXiv cs.CL

This paper investigates when activation steering succeeds or fails for LLMs by analyzing early decoding dynamics. The authors introduce ASTEER, a large testbed of steered generations, and train a GBDT classifier to predict steering outcomes from early hidden states, enabling efficient steering strength search.

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Hugging Face Daily Papers

ACTS (Agentic Chain-of-Thought Steering) formulates LLM reasoning control as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference using reasoning strategies and steering phrases. The approach achieves comparable accuracy to full-thinking models with significant token savings, enabling controllable accuracy-efficiency trade-offs.

A Geometric Account of Activation Steering through Angle-Norm Decomposition

arXiv cs.AI

This paper analyzes linear activation steering in language models by decomposing interventions into angular and radial components. It finds that concepts are primarily encoded in angular structure, but norm adjustments are crucial for stability, supporting spherical steering methods while showing that additive coefficients conflate geometry.