jailbreak-benchmark

Targeted Neuron Modulation via Contrastive Pair Search

Hugging Face Daily Papers · 2026-05-12

Contrastive neuron attribution (CNA) identifies a sparse set of MLP neurons that distinguish harmful from benign prompts, enabling effective behavioral steering in instruction-tuned LLMs without degrading output quality. The method reduces refusal rates by over 50% on jailbreak benchmarks while preserving fluency.

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv cs.CL · 2026-04-22

Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.
