attack-success-rate

How Reliable Is Your Jailbreak Judge? Calibration and Adversarial Robustness of Automated ASR Scoring

arXiv cs.CL · 2d ago

This paper evaluates the reliability of the automated judges used to measure attack success rate (ASR) in LLM jailbreak research. It finds that both safety classifiers and LLM-as-a-judge scorers have significant calibration and adversarial-robustness problems that undermine reported ASR numbers.
