attack-success-rate


How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv cs.CL · 2d ago

This paper evaluates the reliability of the automated judges used to measure attack success rate (ASR) in LLM jailbreak research. It finds that both safety classifiers and LLM-as-a-judge evaluators have significant calibration and adversarial-robustness problems, which undermine reported ASR numbers.
