This paper reexamines the role of temperature in large language model distillation. It finds that the effect of temperature is asymmetric: raising it benefits forward KL divergence more than reverse KL, so that at higher temperatures simple KL-based methods match state-of-the-art distillation approaches.
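To make the two objectives concrete, here is a minimal sketch of forward and reverse KL computed on temperature-scaled distributions. It is an illustration under stated assumptions, not the paper's implementation: the function names (`softmax`, `kl`, `distillation_kls`) are hypothetical, the toy logits are made up, and applying the same temperature to both teacher and student logits follows the common distillation convention, which may differ from the setup the paper studies.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_kls(teacher_logits, student_logits, temperature=1.0):
    """Return (forward KL, reverse KL) at the given temperature.

    Assumption: the same temperature is applied to teacher and student.
    Forward KL is KL(teacher || student); reverse KL is KL(student || teacher).
    """
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    return kl(p, q), kl(q, p)


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.0]  # toy logits, purely illustrative
    student = [0.5, 0.5, 0.5]
    for t in (1.0, 2.0, 4.0):
        fwd, rev = distillation_kls(teacher, student, temperature=t)
        print(f"T={t}: forward KL={fwd:.4f}, reverse KL={rev:.4f}")
```

The two divergences differ only in argument order, so any asymmetry in how temperature affects them comes from which distribution the expectation is taken over. This toy example only shows how the quantities are computed and says nothing about downstream distillation quality.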