ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Hugging Face Daily Papers 04/14/26, 12:00 AM Papers

Summary

ASGuard is a mechanistically-informed defense framework that mitigates jailbreaking attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to improve refusal behavior robustness while preserving model capabilities.

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking shows that models which refuse a harmful request often comply when the same request is rephrased in the past tense, revealing a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreak, such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Finally, we apply the scaling vector in a "preventative fine-tuning" stage, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
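The second step, channel-wise scaling of selected attention heads, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `scale_head_outputs`, the `(layer, head)` keys, and the toy numbers are all assumptions made for illustration, and in the paper the scaling vector is trained rather than set by hand.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # hypothetical key: (layer index, head index)


def scale_head_outputs(
    head_outputs: Dict[Head, List[float]],
    scaling: Dict[Head, List[float]],
) -> Dict[Head, List[float]]:
    """Multiply each targeted head's output channel by channel.

    Heads without an entry in `scaling` are copied through unchanged.
    """
    scaled: Dict[Head, List[float]] = {}
    for head, activation in head_outputs.items():
        vector = scaling.get(head)
        if vector is None:
            # Not a targeted head: leave its activation as it is.
            scaled[head] = list(activation)
            continue
        if len(vector) != len(activation):
            raise ValueError(f"scaling vector for head {head} has the wrong length")
        # Element-wise product: one scale factor per channel of this head.
        scaled[head] = [s * a for s, a in zip(vector, activation)]
    return scaled


# Toy values only; real activations would come from a model's attention layers.
head_outputs = {(0, 0): [1.0, -2.0, 0.5], (0, 1): [0.25, 4.0, -1.0]}
scaling = {(0, 1): [0.5, 0.0, 2.0]}  # assume only head (0, 1) was flagged
scaled = scale_head_outputs(head_outputs, scaling)
print(scaled)
```

Restricting the intervention to a small set of heads and one factor per channel is what keeps the change targeted: every other activation passes through untouched.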

Cached at: 04/20/26, 08:28 AM

Paper page - ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Source: https://huggingface.co/papers/2509.25843

Abstract

Activation-Scaling Guard (ASGuard) mitigates brittle refusal behaviors in large language models by identifying and recalibrating specific attention heads vulnerable to tense-based jailbreaking attacks through mechanistic circuit analysis and targeted fine-tuning.
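As a rough illustration of what "identifying attention heads causally linked to a jailbreak" can mean, the sketch below ranks heads by single-head zero ablation. This is a deliberately simpler stand-in, not the circuit-analysis procedure used in the paper, which this page does not specify. The names `rank_heads_by_ablation` and `toy_refusal_score`, the scalar per-head activations, and the weights are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Head = Tuple[int, int]  # hypothetical key: (layer index, head index)
Activations = Dict[Head, float]


def rank_heads_by_ablation(
    activations: Activations,
    score_fn: Callable[[Activations], float],
) -> List[Tuple[Head, float]]:
    """Sort heads by how much zeroing each one raises the score."""
    baseline = score_fn(activations)
    effects = []
    for head in activations:
        ablated = dict(activations)
        ablated[head] = 0.0  # knock out a single head
        effects.append((head, score_fn(ablated) - baseline))
    # Largest increase first: these heads suppress the score the most.
    return sorted(effects, key=lambda item: item[1], reverse=True)


def toy_refusal_score(acts: Activations) -> float:
    """Assumed stand-in for a refusal metric measured on a jailbreak prompt."""
    weights = {(0, 0): 1.0, (1, 2): -3.0, (2, 1): 0.5}
    return sum(weights[head] * value for head, value in acts.items())


toy_activations = {(0, 0): 1.0, (1, 2): 2.0, (2, 1): 4.0}
ranking = rank_heads_by_ablation(toy_activations, toy_refusal_score)
print(ranking)
```

In this toy setup, head `(1, 2)` ranks first because removing it raises the refusal score the most, which is the kind of signal that would mark a head as a candidate for recalibration.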

arXiv page: https://arxiv.org/abs/2509.25843

PDF: https://arxiv.org/pdf/2509.25843

GitHub: https://github.com/dmis-lab/ASGuard

Similar Articles

LLM Guard scored 0/8 on a USENIX 2025 multi-turn jailbreak. Here’s what caught it instead.

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

OSGuard: A Benchmark for Safety in Computer-Use Agents

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
