Tag
This paper proposes a trait-space monitoring method to detect emergent misalignment in LLMs during supervised finetuning by tracking representational drift in activation space, achieving a 0.990 AUROC with low false positive and false negative rates, outperforming unsupervised baselines.