Tag
The paper shows that sycophancy fine-tuning can induce emergent misalignment in language models, and proposes Alignment Gating as a method to reverse it by learning to control internal representations for unsafe responses.