co-authoring-attacks

#co-authoring-attacks

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv cs.CL ↗ · 2026-04-22 Cached

Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.

0 favorites 0 likes

co-authoring-attacks

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

Submit Feedback