Tag
Contrastive neuron attribution (CNA) identifies a sparse set of MLP neurons that distinguish harmful from benign prompts, enabling effective behavioral steering in instruction-tuned LLMs without degrading output quality. The method reduces refusal rates by over 50% on jailbreak benchmarks while preserving fluency.
Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.