Steered LLM Activations are Non-Surjective

Hugging Face Daily Papers 05/07/26, 12:00 AM Papers

activation-steering interpretability safety-research surjectivity llm white-box-control

Summary

This paper proves, under practical assumptions, that activation steering in LLMs almost surely produces internal states that no textual prompt can replicate, establishing a formal separation between white-box steerability and black-box prompting.

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model's natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
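One way to write the surjectivity question down is sketched below. The symbols ($\mathcal{V}$, $f_\ell$, $\alpha$, $v$) and the additive form of steering are assumptions chosen for illustration, not notation taken from the paper, and the paper's actual assumptions and proof are not reproduced here. The final remark is only the standard fact that a countable set has Lebesgue measure zero, which gives some intuition for the "almost surely" wording.

```latex
% Assumed notation: \mathcal{V} is the vocabulary, \mathcal{V}^{*} the set of
% finite prompts, and f_\ell the map from a prompt to the residual-stream
% state at layer \ell (say, at the last token position).
\[
  f_\ell : \mathcal{V}^{*} \to \mathbb{R}^{d},
  \qquad
  \mathcal{R}_\ell = \{\, f_\ell(x) : x \in \mathcal{V}^{*} \,\}.
\]
% Additive steering with an assumed direction v and strength \alpha:
\[
  \tilde{h} = f_\ell(x) + \alpha v,
  \qquad v \in \mathbb{R}^{d},\ \alpha \neq 0 .
\]
% The surjectivity question: does the steered state have a preimage?
\[
  \exists\, x' \in \mathcal{V}^{*} \ \text{such that}\ f_\ell(x') = \tilde{h}\ ?
\]
% Intuition: \mathcal{V}^{*} is countable, so \mathcal{R}_\ell is countable and
% has Lebesgue measure zero in \mathbb{R}^{d}.
```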

Original Article

Cached at: 05/18/26, 02:26 PM

Source: https://huggingface.co/papers/2604.09839

Abstract

Activation steering in language models creates internal states that cannot be replicated through standard textual prompts, demonstrating a fundamental distinction between white-box and black-box control methods.

Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in its behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations) and safety research (e.g., jailbreakability). However, it is unclear whether steered behavior is realizable by any textual prompt. In this work, we cast this question as a surjectivity problem: for a fixed model, does every steered activation admit a preimage under the model’s natural forward pass? Under practical assumptions, we prove that activation steering pushes the residual stream off the manifold of states reachable from discrete prompts. Almost surely, no prompt can reproduce the same internal behavior induced by steering. We also illustrate this finding empirically across three widely used LLMs. Our results establish a formal separation between white-box steerability and black-box prompting. We therefore caution against interpreting the ease and success of activation steering as evidence of prompt-based interpretability or vulnerability, and argue for evaluation protocols that explicitly decouple white-box and black-box interventions.
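The gap between reachable and steered states can be illustrated with a toy sketch. Everything in it is hypothetical: the "model" is a small random recurrent map in NumPy, not a real LLM, and the sizes, the `forward` function, and the steering scale are made up for illustration. It is not the paper's experimental protocol. It enumerates every state reachable from short prompts, steers one of them, and measures the distance from the steered state to the nearest reachable one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; all hypothetical and far smaller than any real model.
VOCAB, DIM, MAX_LEN = 5, 8, 3

# Stand-in "model": token embeddings plus a fixed random mixing matrix.
embed = rng.normal(size=(VOCAB, DIM))
mix = rng.normal(size=(DIM, DIM)) / np.sqrt(DIM)


def forward(prompt):
    """Return the toy hidden state after reading every token of a prompt."""
    state = np.zeros(DIM)
    for tok in prompt:
        state = np.tanh(state @ mix + embed[tok])
    return state


def reachable_states():
    """Enumerate the states of all prompts of length 1..MAX_LEN."""
    prompts = [
        p
        for n in range(1, MAX_LEN + 1)
        for p in itertools.product(range(VOCAB), repeat=n)
    ]
    return prompts, np.stack([forward(p) for p in prompts])


def min_distance_to_reachable(target, states):
    """Euclidean distance from target to the closest reachable state."""
    return float(np.min(np.linalg.norm(states - target, axis=1)))


prompts, states = reachable_states()

# An unsteered state is reachable by construction (its own prompt).
base = forward((0, 1, 2))

# Additive steering along a random unit direction (assumed form of steering).
direction = rng.normal(size=DIM)
steered = base + 4.0 * direction / np.linalg.norm(direction)

print("prompts enumerated:", len(prompts))
print("base    -> nearest reachable:", min_distance_to_reachable(base, states))
print("steered -> nearest reachable:", min_distance_to_reachable(steered, states))
```

In this toy setting the unsteered state sits at distance zero from the reachable set, while the steered state does not coincide with any enumerated state. The enumeration is finite here, whereas the paper's claim concerns all prompts for a real model, so the sketch only conveys the shape of the question.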

Similar Articles

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs
