Let's Learn About Knowledge Distillation!
Summary
The article argues that frontier model providers who criticize knowledge distillation are hypocritical, because their own legal defense against copyright lawsuits rests on the same principle: the model does not directly store or touch the underlying data.
Similar Articles
@TheTuringPost: https://x.com/TheTuringPost/status/2068474648925216861
An educational overview of knowledge distillation, covering its history, core concepts like softmax and temperature, types, scaling laws, and practical examples including DeepSeek-R1.
Hybrid Policy Distillation for LLMs
Introduces Hybrid Policy Distillation (HPD), a unified knowledge distillation approach that balances forward and reverse KL divergences and combines off-policy data with lightweight on-policy sampling, improving LLM compression across math, dialogue, and code tasks.
The Distillation Game: Adaptive Attacks & Efficient Defenses
This paper studies distillation attacks, in which a model's outputs enable others to imitate it. It proposes a minimax game framework and a forward-pass-only defense called Product-of-Experts, and shows that adaptive students recover more capability than passive evaluation suggests.
FedeKD: Energy-Based Gating for Robust Federated Knowledge Distillation under Heterogeneous Settings
This paper introduces FedeKD, a reliability-aware framework for federated knowledge distillation that uses an energy-based gating mechanism to mitigate negative transfer in heterogeneous settings. The authors demonstrate that weighting knowledge transfer based on sample-wise trust improves robustness and predictive performance without requiring public datasets.
Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm
This paper proposes a cross-modal knowledge distillation framework that works without paired data by aligning feature and label distributions, offering theoretical guarantees and outperforming prior methods on multimodal benchmarks.