stereotypes

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

arXiv cs.CL · 6d ago

This paper introduces VETO, a benchmark that quantifies "misfired alignment": cases where LLMs avoid correct inferences because of safety training. It finds that all tested models exhibit such failures, while humans do not.
