co-authoring-attacks

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

arXiv cs.CL · 2026-04-22

Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.
