Steered LLM Activations are Non-Surjective
Summary
This paper proves that activation steering in LLMs produces internal states that cannot be replicated by any textual prompt, establishing a formal separation between white-box steerability and black-box prompting.
Source: https://huggingface.co/papers/2604.09839
Abstract
Activation steering in language models creates internal states that cannot be replicated through standard textual prompts, demonstrating a fundamental distinction between white-box and black-box control methods.
Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model’s natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
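The surjectivity question in the abstract can be made concrete with a toy sketch. The code below is not the paper's proof or experimental setup; it is a minimal illustration under stated assumptions. The "model" is a hypothetical stand-in (mean-pooled token embeddings followed by one tanh layer), and the names `reachable_states`, `min_distance`, `embed`, `weight`, and `steer` are invented for this example. It enumerates every prompt up to a fixed length, collects the resulting finite set of states, and checks whether a steered state coincides with any of them.

```python
import itertools
import numpy as np

def reachable_states(embed, weight, max_len):
    """Return the toy model's state for every prompt of 1..max_len tokens."""
    vocab_size = embed.shape[0]
    states = []
    for length in range(1, max_len + 1):
        for prompt in itertools.product(range(vocab_size), repeat=length):
            # Hypothetical forward pass: mean-pool embeddings, then one tanh layer.
            pooled = embed[list(prompt)].mean(axis=0)
            states.append(np.tanh(weight @ pooled))
    return np.stack(states)

def min_distance(point, states):
    """Euclidean distance from `point` to the nearest reachable state."""
    return float(np.linalg.norm(states - point, axis=1).min())

rng = np.random.default_rng(0)
vocab_size, dim, max_len = 5, 8, 3
embed = rng.normal(size=(vocab_size, dim))   # toy embedding table
weight = rng.normal(size=(dim, dim))         # toy layer weights

states = reachable_states(embed, weight, max_len)

base = states[0]                    # a state some prompt actually produces
steer = rng.normal(size=dim)        # generic steering direction (assumption)
steered = base + 0.5 * steer        # additive steering of that state

gap_base = min_distance(base, states)        # base is in the set, so this is 0.0
gap_steered = min_distance(steered, states)  # positive if no prompt reproduces it

print(f"prompts enumerated: {len(states)}")
print(f"gap for unsteered state: {gap_base}")
print(f"gap for steered state:   {gap_steered}")
```

The design mirrors the counting intuition only: discrete prompts yield a finite (in general, countable) set of states inside a continuous space, so a generic continuous shift lands outside that set. The paper's result concerns real LLM residual streams under its own assumptions, which this sketch does not reproduce.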
Similar Articles
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
This paper investigates whether linearly decodable failure signals in LLM hidden states can be corrected via residual-stream steering. It finds that while 'overthinking' failures are decodable, fixed linear steering fails to correct them due to representational entanglement with task-critical computations, though the probes effectively support selective abstention.
Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions
This paper studies how instruction-tuned LLMs can exhibit fair outputs while retaining biased internal representations in high-stakes decisions like mortgage underwriting, showing that these hidden biases are causally potent, asymmetric, and exploitable through activation steering.
Don't Lose Focus: Activation Steering via Key-Orthogonal Projections
This paper introduces Steering via Key-Orthogonal Projections (SKOP), a method to control LLM behavior by preventing attention rerouting, thereby reducing utility degradation while maintaining steering efficacy.
UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering
UniSteer introduces a text-guided activation flow matching method to learn a universal conditional velocity field in activation space, enabling versatile LLM behavior control and classification tasks without task-specific intervention modules.
Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
Introduces Latent Reward Steering (LRS), an adaptive inference-time framework that uses sparse autoencoder latent states and a learned reward model to implicitly promote cognitive behaviors like verification and backtracking in reasoning LLMs, improving performance across multiple models and benchmarks.