Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Hugging Face Daily Papers

Summary

The paper shows that sycophancy fine-tuning can induce emergent misalignment in language models, and proposes Alignment Gating as a method to reverse it by learning to control internal representations for unsafe responses.

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.

Cached at: 06/10/26, 05:45 AM


Source: https://huggingface.co/papers/2606.09068

Abstract

Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities.

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users’ incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model’s general capabilities.
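The abstract does not spell out the gate's functional form, so the following is only a minimal sketch of the general idea of a learnable, controllable gate, not the authors' implementation. It assumes a per-dimension sigmoid gate over a single hidden vector and a scalar control knob; the names `apply_gate`, `gate_logits`, and `scale` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_gate(hidden, gate_logits, scale=1.0):
    """Rescale the gated part of one hidden vector (illustrative only).

    hidden      -- list of floats, one activation vector (hypothetical)
    gate_logits -- list of floats standing in for learned gate parameters;
                   sigmoid(logit) near 1 marks a dimension as selected
    scale       -- control knob (an assumption of this sketch):
                   1.0 leaves the vector unchanged,
                   0.0 suppresses the selected dimensions,
                   values above 1.0 amplify them
    """
    if len(hidden) != len(gate_logits):
        raise ValueError("hidden and gate_logits must have the same length")
    out = []
    for h, w in zip(hidden, gate_logits):
        g = sigmoid(w)  # gate value in (0, 1)
        # Blend between the untouched activation and the rescaled one.
        out.append(h * (1.0 + (scale - 1.0) * g))
    return out


# Toy values chosen by hand: dimension 0 is strongly gated,
# dimension 1 is essentially ungated, dimension 2 is half gated.
hidden = [1.0, -2.0, 0.5]
gate_logits = [10.0, -10.0, 0.0]

unchanged = apply_gate(hidden, gate_logits, scale=1.0)
suppressed = apply_gate(hidden, gate_logits, scale=0.0)
amplified = apply_gate(hidden, gate_logits, scale=2.0)
```

In this toy setup, suppression drives the strongly gated dimension toward zero while leaving the ungated dimension almost untouched, which mirrors the abstract's claim that suppressing the identified representations mitigates EM without broadly altering the model. In a real model the gate parameters would be learned during fine-tuning rather than set by hand.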


Get this paper in your agent:

hf papers read 2606.09068

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### sichengwang04/Qwen3-8B-syco_med-gated-attention-FT Text Generation • Updated 1 day ago • 2


Similar Articles

Toward understanding and preventing misalignment generalization

OpenAI Blog

OpenAI researchers investigate 'emergent misalignment'—where fine-tuning a model on narrow incorrect behavior causes broadly unethical responses—and discover a 'misaligned persona' feature in GPT-4o's activations that mediates this phenomenon, enabling potential detection and mitigation strategies.

How misalignment starts

Reddit r/singularity

Explores how misalignment in AI systems originates, discussing the gap between intended goals and actual behavior.