Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

Hugging Face Daily Papers

Summary

This paper identifies KV-cache contamination as a failure mode for activation steering in dialogue and proposes GCAD, a method that extracts steering signals from prompt contributions and applies token-level gating to improve long-horizon coherence, achieving substantial gains on multi-turn benchmarks.

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.

Cached at: 05/12/26, 02:52 PM

Source: https://huggingface.co/papers/2605.10664

Abstract

Activation steering in language models suffers from KV-cache contamination in dialogue settings, which GCAD addresses by extracting steering signals from prompt contributions and applying token-level gating to improve long-horizon coherence.

Activation steering controls language model behavior by adding directions to internal representations at inference time, but standard residual-stream steering can fail in stateful dialogue. We identify KV-cache contamination as a key failure mode: steered token states are stored and repeatedly reused, turning a local perturbation into cumulative coherence degradation. To address this challenge, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Across persona-steering experiments, GCAD preserves trait control while substantially improving long-horizon coherence. On the main multi-turn benchmark, GCAD improves average coherence drift from -18.6 to -1.9 and raises turn-10 trait expression from 78.0 to 93.1. These results suggest that activation steering becomes more reliable when interventions follow the prompt-mediated pathways that models already use for behavioral control.
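
The KV-cache contamination described above can be illustrated with a toy sketch. This is not GCAD and not the paper's code: it assumes a single randomly initialized attention head, random token states, and a random steering vector, and the names `attend`, `run_dialogue`, and `steer` are hypothetical. It only shows that a vector added to a token state before the key/value projection is written into the cache and reused at every later step, whereas a vector added to the attention output leaves the cached keys and values unchanged.

```python
import numpy as np

def attend(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V

def run_dialogue(hidden_states, Wq, Wk, Wv, steer_vec, mode):
    """Process tokens one at a time with a growing KV cache.

    mode="none":     no steering.
    mode="residual": add steer_vec to the token state BEFORE the K/V
                     projection, so steered keys/values enter the cache.
    mode="attn_out": leave the token state alone and add steer_vec to the
                     attention output only, so the cache stays unsteered.
    """
    K_cache, V_cache, outputs = [], [], []
    for h in hidden_states:
        x = h + steer_vec if mode == "residual" else h
        K_cache.append(Wk @ x)  # stored once, reused by every later token
        V_cache.append(Wv @ x)
        out = attend(Wq @ x, np.array(K_cache), np.array(V_cache))
        if mode == "attn_out":
            out = out + steer_vec
        outputs.append(out)
    return np.array(K_cache), np.array(V_cache), np.array(outputs)

# Toy setup: all weights, states, and the steering vector are random (assumption).
rng = np.random.default_rng(0)
d, n_tokens = 8, 20
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
hidden = rng.normal(size=(n_tokens, d))
steer = 2.0 * rng.normal(size=d)

K_clean, V_clean, _ = run_dialogue(hidden, Wq, Wk, Wv, steer, "none")
K_res, V_res, _ = run_dialogue(hidden, Wq, Wk, Wv, steer, "residual")
K_att, V_att, _ = run_dialogue(hidden, Wq, Wk, Wv, steer, "attn_out")

# How far each steering mode moved the cache away from the unsteered cache.
cache_shift_residual = np.linalg.norm(K_res - K_clean) + np.linalg.norm(V_res - V_clean)
cache_shift_attn_out = np.linalg.norm(K_att - K_clean) + np.linalg.norm(V_att - V_clean)
print(f"cache shift, residual steering:         {cache_shift_residual:.3f}")
print(f"cache shift, attention-output steering: {cache_shift_attn_out:.3f}")
```

In this one-layer toy, residual steering shifts every cached entry while the output-level variant shifts none. A real multi-layer model is more involved, since an attention output feeds the residual stream of later layers; the abstract does not detail how GCAD's cropping and gating handle this, so the sketch should be read only as motivation for the failure mode.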


Similar Articles

Steered LLM Activations are Non-Surjective

Hugging Face Daily Papers

This paper proves that activation steering in LLMs produces internal states that cannot be replicated by any textual prompt, establishing a formal separation between white-box steerability and black-box prompting.