Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Hugging Face Daily Papers 06/08/26, 12:00 AM Papers

sycophancy emergent-misalignment alignment-gating fine-tuning model-safety language-models

Summary

The paper shows that sycophancy fine-tuning can induce emergent misalignment in language models, and proposes Alignment Gating, a method that reverses it by learning to control the internal representations responsible for unsafe responses.

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.

Original Article


Cached at: 06/10/26, 05:45 AM

Paper page - Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

Source: https://huggingface.co/papers/2606.09068

Abstract

Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities.

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users’ incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model’s general capabilities.
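To make the idea of a "learnable and controllable gate" concrete, here is a toy sketch in plain Python. It is not the paper's implementation: the abstract does not specify where the gates sit or how they are parameterized. The residual form, the sigmoid gate, and the names `apply_gate`, `base`, `update`, `gate_logits`, and `alpha` are all assumptions chosen for illustration. The sketch treats `update` as the change that fine-tuning adds to a hidden vector, lets a per-feature gate decide how much of that change passes through, and uses a scalar `alpha` as the control knob for suppressing or amplifying the gated component.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a real-valued logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_gate(base, update, gate_logits, alpha=1.0):
    """Return base + alpha * sigmoid(gate_logits) * update, elementwise.

    base        : hidden features before the fine-tuned change (assumed)
    update      : the change introduced by fine-tuning (assumed)
    gate_logits : per-feature gate parameters; in a real model these
                  would be learned, here they are hand-picked
    alpha       : hypothetical control scalar
                  (0 suppresses the gated change, 1 keeps it, >1 amplifies it)
    """
    if not (len(base) == len(update) == len(gate_logits)):
        raise ValueError("all vectors must have the same length")
    return [
        b + alpha * sigmoid(z) * u
        for b, u, z in zip(base, update, gate_logits)
    ]


# Hand-picked toy values, not data from the paper.
base = [1.0, -0.5, 2.0]
update = [0.4, 0.0, -1.0]
gate_logits = [0.0, 3.0, -3.0]

as_trained = apply_gate(base, update, gate_logits, alpha=1.0)
suppressed = apply_gate(base, update, gate_logits, alpha=0.0)
amplified = apply_gate(base, update, gate_logits, alpha=2.0)

print("as trained:", as_trained)
print("suppressed:", suppressed)
print("amplified: ", amplified)
```

With `alpha=0.0` the gated change vanishes and the output equals `base`, which corresponds loosely to the "suppressing" direction described in the abstract. Values of `alpha` above 1 push further along the gated change, corresponding to "amplifying". In an actual model the gate parameters would be trained jointly with the fine-tuning objective rather than set by hand.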


Get this paper in your agent:

hf papers read 2606.09068

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### sichengwang04/Qwen3-8B-syco_med-gated-attention-FT

Text Generation • Updated 1 day ago • 2



Similar Articles

Toward understanding and preventing misalignment generalization

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

How misalignment starts

Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
