Tag
This paper introduces VETO, a benchmark to quantify 'misfired alignment' where LLMs avoid correct inferences due to safety training, and finds that all tested models exhibit such failures while humans do not.