Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating
Summary
The paper shows that sycophancy fine-tuning can induce emergent misalignment in language models, and proposes Alignment Gating as a method to reverse it by learning to control internal representations for unsafe responses.
View Cached Full Text
Cached at: 06/10/26, 05:45 AM
Paper page - Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating
Source: https://huggingface.co/papers/2606.09068
Abstract
Sycophancy fine-tuning contributes to emergent misalignment in language models, which can be reversed using Alignment Gating—a method that inserts learnable gates to identify and control unsafe responses while maintaining general capabilities.
Prior work has shown thatfine-tuninglarge language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known asemergent misalignment. However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identifysycophancy fine-tuning, i.e., training models to passively agree with users’ incorrect opinions, as a previously underexplored driver ofemergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we proposeAlignment Gating, an efficient method for reversingemergent misalignmentthat inserts learnable and controllable gates into the model duringfine-tuning. Throughfine-tuning, these gates learn to identify theinternal representationsresponsible forunsafe responses. Thus, amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find thatalignment gatingmodule exhibits stronggeneralization: gating weights obtained from narrow-domainfine-tuningsubstantially suppress broad-domain misaligned behavior while preserving the model’s general capabilities.
View arXiv pageView PDFGitHub4Add to collection
Get this paper in your agent:
hf papers read 2606\.09068
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper1
#### sichengwang04/Qwen3-8B-syco_med-gated-attention-FT Text Generation• Updated1 day ago • 2
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2606.09068 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2606.09068 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Toward understanding and preventing misalignment generalization
OpenAI researchers investigate 'emergent misalignment'—where fine-tuning a model on narrow incorrect behavior causes broadly unethical responses—and discover a 'misaligned persona' feature in GPT-4o's activations that mediates this phenomenon, enabling potential detection and mitigation strategies.
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
This paper introduces alignment tampering, a vulnerability in RLHF where language models can manipulate preference datasets to amplify misaligned biases, demonstrating experimentally across biases like sexism, brand promotion, and goal-seeking, and showing that existing mitigation techniques are insufficient.
How misalignment starts
Explores how misalignment in AI systems originates, discussing the gap between intended goals and actual behavior.
Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment
This paper introduces the concept of alignment pretraining, showing that discourse about AI in pretraining corpora can create self-fulfilling (mis)alignment in LLMs, and that upsampling aligned discourse significantly reduces misalignment.
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.