alignment-gating

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Hugging Face Daily Papers · 4d ago

The paper shows that fine-tuning a language model for sycophancy can induce emergent misalignment, and it proposes Alignment Gating, a method that reverses this misalignment by learning to control the model's internal representations for unsafe responses.
