jailbreak-defense

Tag

Cards List
#jailbreak-defense

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

arXiv cs.CL · 2026-06-05 Cached

CHASE introduces a co-evolutionary red-blue teaming framework that uses reinforcement learning to harden LLMs against adaptive black-box adversarial attacks, reducing jailbreak success by 43.2% on benchmarks while maintaining zero false refusals on benign prompts.

0 favorites 0 likes
#jailbreak-defense

The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

arXiv cs.AI · 2026-05-12 Cached

This paper introduces Anchored Bipolicy Self-Play, a method to improve AI safety by training distinct role-specific LoRA adapters on a frozen base model, addressing limitations in standard self-play red teaming.

0 favorites 0 likes
#jailbreak-defense

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

arXiv cs.CL · 2026-04-20 Cached

This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and the Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.

0 favorites 0 likes
#jailbreak-defense

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Hugging Face Daily Papers · 2026-04-14 Cached

ASGuard is a mechanistically-informed defense framework that mitigates jailbreaking attacks on LLMs by identifying vulnerable attention heads through circuit analysis and applying targeted activation scaling and fine-tuning to improve refusal behavior robustness while preserving model capabilities.

0 favorites 0 likes
#jailbreak-defense

The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions

OpenAI Blog · 2024-04-19 Cached

OpenAI proposes an instruction hierarchy approach to defend LLMs against prompt injection and jailbreak attacks by training models to prioritize system instructions over user inputs. The method significantly improves robustness without degrading standard capabilities.

0 favorites 0 likes
← Back to home

Submit Feedback