This paper investigates emergent and subliminal misalignment in LLMs through a data-centric lens, showing that the harmful effects of fine-tuning depend on structural properties of the data, task difficulty, pretraining composition, and training channels; its experiments compare off-policy and on-policy distillation.
Anthropic co-authored research, published in Nature, shows that LLMs can transmit behavioral traits, including preferences and misalignment, to student models through hidden signals in training data, even when the data appears unrelated to those traits. This "subliminal learning" phenomenon has significant implications for AI safety and alignment.