#stereotypes

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

arXiv cs.CL · 6d ago

This paper introduces VETO, a benchmark that quantifies "misfired alignment": cases where LLMs avoid correct inferences because of their safety training. It finds that all tested models exhibit such failures, while humans do not.
