Tag
This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.
ASGuard is a mechanistically-informed defense framework that mitigates jailbreaking attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to improve refusal behavior robustness while preserving model capabilities.
OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.