Tag
This paper quantifies subliminal behavioral transfer in language model distillation, showing that undesirable traits can transfer robustly from teacher to student models even when the training data is benign, and that this transfer scales differently across model families.