Tag
PHF proposes a method to distill hidden state trajectories from a privileged teacher to a student during on-policy self-distillation, improving the reasoning performance of language models.
This paper presents DistilledGemma, a system for person-place relation extraction from multilingual historical newspaper articles using a three-stage knowledge distillation pipeline from a 26B Gemma teacher to a 2.3B student, achieving competitive accuracy and efficiency in the HIPE-2026 shared task.
This paper investigates using LLMs as teacher models to label training data for entity matching, showing that student models trained on machine-labeled data perform on par with those trained on manually labeled benchmarks, with significant cost and speed advantages.
SEAD introduces a competence-aware on-policy distillation method that uses entropy to guide supervision at token, temporal, and prompt levels, achieving a +4.8 average accuracy improvement on OLMo-3 across six math benchmarks.
This paper introduces LaViD, a framework that transfers semantic knowledge from a language-only LLM to a vision student model by generating multiple-choice questions as conceptual signatures, achieving superior fine-grained classification performance and robustness.
Introduces Proxy-KD, a novel method for distilling knowledge from black-box large language models (such as GPT-4) into smaller models using a proxy model, surpassing both traditional black-box and white-box KD techniques.
The paper identifies position bias in on-policy distillation for language models, where later tokens in student-generated answers receive degraded supervision. The proposed Importance-Weighted On-Policy Distillation (IW-OPD) weights corrections based on accumulated drift, improving learning speed and final performance.
The article argues that frontier model providers who criticize knowledge distillation are hypocritical, as their own legal defense against copyright lawsuits relies on the same principle of not directly storing or touching data.
A tweet recommending an article on on-policy distillation published on Hugging Face.
This paper presents NebulaExp, a transparent ablation-driven post-training pipeline for 8B-scale LLMs, covering SFT, GRPO RL, and multi-teacher distillation. It identifies key trade-offs between mathematical reasoning and code generation, and demonstrates that data correctness filtering is the first-order optimization factor.
This paper presents AsyncOPD, a fully asynchronous on-policy distillation pipeline for LLMs, systematically studying the effects of stale-policy data and proposing estimator designs that improve training throughput by 1.6-3.8x while maintaining comparable accuracy.
This paper introduces blockwise policy-drift gating, a lightweight method that improves on-policy distillation for language models by weighting the loss according to how far the student's probabilities have shifted between the old and current policy, improving reasoning accuracy on math benchmarks.
This paper introduces ARIA, a framework that adaptively allocates training effort across regions of the conditioning space for distilling conditional diffusion models, improving performance on unseen and underrepresented conditions.
Introduces Strategy-Guided Policy Optimization (SGPO) for LLM reasoning, which replaces trajectory imitation with strategy distillation, improving generalization on math benchmarks.
Natolambert announces a new lecture covering synthetic data and the history of distillation, from Hinton 2015 to modern on-policy distillation, with over 7 hours of video content.
Lite Any Stereo V2 presents an efficient stereo matching approach achieving state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies, including a 2D-only cost aggregation framework and a three-stage training strategy.
An educational overview of knowledge distillation, covering its history, core concepts like softmax and temperature, types, scaling laws, and practical examples including DeepSeek-R1.
Presents a framework for financial sentiment analysis using distillation with synthetic data, transferring knowledge from a large teacher to compact student models, with clustering-based seed selection for efficient low-resource domain adaptation.
ResAware proposes a resource-aware distillation framework to improve the robustness of website fingerprinting across different network environments. A teacher model is trained on resource-level features and its knowledge is distilled to a student model that uses only encrypted traffic, achieving significant gains under temporal drift and other perturbations.
PowerOPD introduces a bounded power transformation to stabilize on-policy distillation for large language models, achieving significant gains in accuracy and sample efficiency while reducing computational cost.