supervised-finetuning

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

arXiv cs.LG · 2d ago

This paper proposes a trait-space monitoring method that detects emergent misalignment in LLMs during supervised finetuning by tracking representational drift in activation space. It reports an AUROC of 0.990 with low false-positive and false-negative rates, outperforming unsupervised baselines.
