ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Hugging Face Daily Papers

Summary

ASGuard is a mechanistically informed defense framework that mitigates jailbreaking attacks on LLMs. It identifies vulnerable attention heads through circuit analysis, then applies targeted activation scaling and fine-tuning to make refusal behavior more robust while preserving model capabilities.

Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. Tense jailbreaking shows that models which refuse a harmful request often comply when the same request is rephrased in the past tense, revealing a critical generalization gap in current alignment methods, whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), a mechanistically informed framework that surgically mitigates this specific vulnerability. First, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreak, such as a tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activations of the tense-vulnerable heads. Lastly, we incorporate it into a "preventative fine-tuning" step, forcing the model to learn a more robust refusal mechanism. Across four LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal balance between safety and utility. Based on mechanistic analysis, our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.
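To make the second step concrete, the following is a minimal sketch of channel-wise scaling applied to the outputs of selected attention heads. It is not the authors' implementation: the function name `apply_channel_scaling`, the list-of-lists layout `[num_heads][head_dim]`, and the toy numbers are all assumptions made for illustration, and in the paper the scaling vector is trained rather than set by hand.

```python
from typing import Dict, List

# Hypothetical layout: one activation vector per attention head.
HeadOutputs = List[List[float]]  # shape [num_heads][head_dim]


def apply_channel_scaling(
    head_outputs: HeadOutputs,
    scaling_vectors: Dict[int, List[float]],
) -> HeadOutputs:
    """Return a copy of head_outputs with the selected heads rescaled.

    scaling_vectors maps a head index to one multiplier per channel
    (per dimension of that head's output). Heads that are not listed
    pass through unchanged.
    """
    scaled: HeadOutputs = []
    for head_idx, activation in enumerate(head_outputs):
        vector = scaling_vectors.get(head_idx)
        if vector is None:
            # Head was not flagged as vulnerable: leave it as-is.
            scaled.append(list(activation))
            continue
        if len(vector) != len(activation):
            raise ValueError(
                f"head {head_idx}: expected {len(activation)} scales, "
                f"got {len(vector)}"
            )
        # Element-wise (channel-wise) recalibration of this head.
        scaled.append([a * s for a, s in zip(activation, vector)])
    return scaled


# Toy example: three heads with three channels each; only head 1 is rescaled.
head_outputs = [[1.0, -2.0, 0.5], [0.2, 0.4, -0.6], [3.0, 1.0, 1.0]]
scaling_vectors = {1: [0.5, 1.0, 2.0]}  # illustrative values, not learned
scaled = apply_channel_scaling(head_outputs, scaling_vectors)
print(scaled)
```

Keeping one multiplier per channel, rather than a single scalar per head, lets the intervention damp some directions of a head's output while leaving others intact, which matches the "channel-wise" wording of the abstract.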

Cached at: 04/20/26, 08:28 AM


Source: https://huggingface.co/papers/2509.25843

Abstract

Activation-Scaling Guard (ASGuard) mitigates brittle refusal behaviors in large language models by identifying and recalibrating specific attention heads vulnerable to tense-based jailbreaking attacks through mechanistic circuit analysis and targeted fine-tuning.
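The idea of finding heads that are causally linked to a behavior can be illustrated with a much simpler stand-in than the paper's circuit analysis: zero-ablate one head at a time and measure how much a scalar score changes. The sketch below assumes a hypothetical `score_fn` (for example, a compliance score on a past-tense prompt) and a toy linear model; neither comes from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def rank_heads_by_ablation(
    score_fn: Callable[[Sequence[float]], float],
    head_gates: Sequence[float],
) -> List[Tuple[int, float]]:
    """Rank heads by how much zero-ablating each one changes score_fn.

    head_gates holds one multiplier per head (1.0 = head active).
    Returns (head_index, score_change) pairs, largest absolute change first.
    """
    baseline = score_fn(head_gates)
    effects: List[Tuple[int, float]] = []
    for idx in range(len(head_gates)):
        ablated = list(head_gates)
        ablated[idx] = 0.0  # switch off exactly one head
        effects.append((idx, score_fn(ablated) - baseline))
    return sorted(effects, key=lambda item: abs(item[1]), reverse=True)


# Toy stand-in for a model: the score is a weighted sum of head gates.
# The weights are invented for illustration only.
TOY_HEAD_WEIGHTS = [0.25, 2.0, -0.5, 0.0]


def toy_compliance_score(gates: Sequence[float]) -> float:
    return sum(w * g for w, g in zip(TOY_HEAD_WEIGHTS, gates))


ranking = rank_heads_by_ablation(toy_compliance_score, [1.0, 1.0, 1.0, 1.0])
for head_idx, change in ranking:
    print(f"head {head_idx}: score change {change:+.2f}")
```

In this toy setup the head with the largest weight dominates the ranking. Real circuit analysis has to account for interactions between heads, which single-head ablation ignores.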


arXiv page: https://arxiv.org/abs/2509.25843
PDF: https://arxiv.org/pdf/2509.25843
GitHub: https://github.com/dmis-lab/ASGuard

Similar Articles

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

arXiv cs.AI

This paper investigates jailbreak attacks on Large Reasoning Models (LRMs), revealing that attack success correlates with attention patterns. The authors propose a reinforcement learning-based jailbreak method that incorporates attention signals into the reward function and uses diverse persuasion strategies, achieving significantly higher attack success rates across multiple benchmarks.

OSGuard: A Benchmark for Safety in Computer-Use Agents

arXiv cs.AI

OSGuard is a dual-granularity benchmark for evaluating safety in computer-use agents under benign user instructions, featuring action-level judgments and risk-augmented execution suites to detect unsafe shortcuts.

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

arXiv cs.LG

PropGuard is a propagation-aware framework for safeguarding LLM-based multi-agent systems (LLM-MAS) from malicious instructions that propagate across agents and rounds. It constructs a dual-view spatio-temporal graph and uses a GE-GRPO-trained inspector to detect and remediate suspicious propagation subgraphs.