activation-scaling

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Hugging Face Daily Papers · 2026-04-14

ASGuard is a mechanistically informed defense framework that mitigates targeted jailbreaking attacks on LLMs. It uses circuit analysis to identify the attention heads that are vulnerable to the attack, then applies targeted activation scaling and fine-tuning to those heads, making refusal behavior more robust while preserving the model's general capabilities.
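
To illustrate what "targeted activation scaling" could look like, the following is a minimal sketch, not ASGuard's actual implementation. It assumes that scaling means multiplying the output vector of selected attention heads by a per-head factor and leaving all other heads untouched. The function name `scale_head_outputs`, the list-of-vectors layout, and the choice of head 1 as the "vulnerable" head are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def scale_head_outputs(
    head_outputs: List[List[float]],
    head_scales: Dict[int, float],
) -> List[List[float]]:
    """Return a copy of the per-head outputs with selected heads rescaled.

    head_outputs[h] is the output vector of attention head h
    (a hypothetical layout, assumed for this sketch).
    head_scales maps a head index to a multiplicative factor;
    heads that are not listed keep a factor of 1.0.
    """
    scaled = []
    for h, vec in enumerate(head_outputs):
        factor = head_scales.get(h, 1.0)
        scaled.append([factor * x for x in vec])
    return scaled


# Toy example: three heads with 2-dimensional outputs. Head 1 is treated as
# the vulnerable head (an assumption for illustration) and is dampened by 0.5.
heads = [[1.0, 2.0], [4.0, -2.0], [0.5, 0.5]]
scaled = scale_head_outputs(heads, {1: 0.5})
```

In a real model the same idea would be applied to the attention-head activations inside a specific layer, and the factors would come from the circuit analysis or be learned. The summary above does not specify those details, so they are left out here.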
