This paper investigates emergent and subliminal misalignment in LLMs through a data-centric lens, showing that the harmful effects of fine-tuning depend on structural properties of the data, task difficulty, pretraining composition, and the training channel; its experiments compare off-policy and on-policy distillation.