ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack
Summary
ASGuard is a mechanistically informed defense framework that mitigates jailbreaking attacks on LLMs. It identifies vulnerable attention heads through circuit analysis, then applies targeted activation scaling and fine-tuning to make refusal behavior more robust while preserving model capabilities.
Cached at: 04/20/26, 08:28 AM
Source: https://huggingface.co/papers/2509.25843
Abstract
Activation-Scaling Guard (ASGuard) mitigates brittle refusal behaviors in large language models by identifying and recalibrating specific attention heads vulnerable to tense-based jailbreaking attacks through mechanistic circuit analysis and targeted fine-tuning.
Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. In the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a “preventative fine-tuning”, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
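The second step of the abstract, channel-wise scaling of the flagged attention heads, can be pictured with a small toy example. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `scale_heads`, the head indices, and the scale values are all hypothetical, and in the paper the scaling vector is trained rather than set by hand. The sketch assumes each head's output is a plain vector and that scaling means an element-wise product with a per-head vector of the same length.

```python
from typing import Dict, List

Vector = List[float]


def scale_heads(head_outputs: List[Vector], scales: Dict[int, Vector]) -> List[Vector]:
    """Return a copy of the per-head outputs with selected heads rescaled.

    head_outputs: one output vector per attention head (toy stand-in).
    scales: hypothetical mapping from a flagged head index to its
            channel-wise scaling vector; heads not listed are left unchanged.
    """
    scaled: List[Vector] = []
    for head_idx, output in enumerate(head_outputs):
        scale = scales.get(head_idx)
        if scale is None:
            # Head was not flagged: copy its output through untouched.
            scaled.append(list(output))
            continue
        if len(scale) != len(output):
            raise ValueError(f"scale for head {head_idx} has the wrong length")
        # Channel-wise (element-wise) rescaling of this head's output.
        scaled.append([s * x for s, x in zip(scale, output)])
    return scaled


# Toy data: 3 heads with 4 channels each; head 1 is assumed to be flagged.
head_outputs = [
    [1.0, 2.0, 3.0, 4.0],
    [0.5, -1.0, 2.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
scales = {1: [0.5, 1.0, 0.0, 2.0]}  # hand-picked values, for illustration only

scaled = scale_heads(head_outputs, scales)
print(scaled[1])
```

Only the flagged head changes; every other head passes through unmodified, which mirrors the targeted nature of the intervention described above.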
arXiv page: https://arxiv.org/abs/2509.25843
PDF: https://arxiv.org/pdf/2509.25843
GitHub: https://github.com/dmis-lab/ASGuard
Similar Articles
LLM Guard scored 0/8 on a USENIX 2025 multi-turn jailbreak. Here’s what caught it instead.
Arc Sentry detects multi-turn jailbreaks like Crescendo by reading model internal state rather than text output, catching attacks that text-based monitors miss entirely.
Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
This paper investigates jailbreak attacks on Large Reasoning Models (LRMs), revealing that attack success correlates with attention patterns. The authors propose a reinforcement learning-based jailbreak method that incorporates attention signals into the reward function and uses diverse persuasion strategies, achieving significantly higher attack success rates across multiple benchmarks.
How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring
This paper evaluates the reliability of automated judges used to measure attack success rates (ASR) in LLM jailbreak research, finding that both safety classifiers and LLM-as-judges have significant calibration and adversarial robustness issues that undermine reported ASR numbers.
OSGuard: A Benchmark for Safety in Computer-Use Agents
OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.
PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation
PropGuard is a propagation-aware framework for safeguarding LLM-based multi-agent systems (LLM-MAS) from malicious instructions that propagate across agents and rounds. It constructs a dual-view spatio-temporal graph and uses a GE-GRPO trained inspector to detect and remediate suspicious propagation subgraphs.