This paper studies how instruction-tuned LLMs can produce fair outputs while retaining biased internal representations in high-stakes decisions such as mortgage underwriting. It shows that these hidden biases are causally potent, asymmetric, and exploitable through activation steering.