subliminal-learning


Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

arXiv cs.LG · 2026-06-11

This paper quantifies the magnitude of subliminal behavioral transfer in language model distillation, showing that undesirable traits can transfer robustly from teacher to student models even with benign training data, and that transfer scales differently across model families.



Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

arXiv cs.LG · 2026-05-14

This paper investigates emergent and subliminal misalignment in LLMs through a data-centric lens, showing that harmful fine-tuning effects depend on structural properties of the data, task difficulty, pretraining composition, and training channels, with experiments comparing off-policy and on-policy distillation.



@AnthropicAI: Research we co-authored on subliminal learning—how LLMs can pass on traits like preferences or misalignment through hid…

X AI KOLs · 2026-04-15

Anthropic co-authored research published in Nature showing that LLMs can transmit behavioral traits, including preferences and misalignment, to student models through hidden signals in training data, even when the data appears unrelated to those traits. This 'subliminal learning' phenomenon has significant implications for AI safety and alignment.

