Tag
Introduces Proxy-KD, a novel method for distilling knowledge from black-box large language models (like GPT-4) into smaller models using a proxy model, surpassing both traditional black-box and white-box KD techniques.
The article argues that frontier model providers who criticize knowledge distillation are hypocritical, as their own legal defense against copyright lawsuits relies on the same principle of not directly storing or touching data.
A tweet recommending an article on on-policy distillation published on Hugging Face.
This paper presents NebulaExp, a transparent ablation-driven post-training pipeline for 8B-scale LLMs, covering SFT, GRPO RL, and multi-teacher distillation. It identifies key trade-offs between mathematical reasoning and code generation, and demonstrates that data correctness filtering is the first-order optimization factor.
This paper presents AsyncOPD, a fully asynchronous on-policy distillation pipeline for LLMs, systematically studying the effects of stale-policy data and proposing estimator designs that improve training throughput by 1.6-3.8x while maintaining comparable accuracy.
This paper introduces blockwise policy-drift gating, a lightweight method to improve on-policy distillation for language models by weighting the loss according to the shift between the old and current student probabilities, achieving improved reasoning accuracy on math benchmarks.
This paper introduces ARIA, a framework that adaptively allocates training effort across regions of the conditioning space for distilling conditional diffusion models, improving performance on unseen and underrepresented conditions.
Introduces Strategy-Guided Policy Optimization (SGPO) for LLM reasoning, which replaces trajectory imitation with strategy distillation, improving generalization on math benchmarks.
Natolambert announces a new lecture covering synthetic data and the history of distillation, from Hinton 2015 to modern on-policy distillation, with over 7 hours of video content.
Lite Any Stereo V2 presents an efficient stereo matching approach achieving state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies, including a 2D-only cost aggregation framework and a three-stage training strategy.
An educational overview of knowledge distillation, covering its history, core concepts like softmax and temperature, types, scaling laws, and practical examples including DeepSeek-R1.
Presents a framework for financial sentiment analysis using distillation with synthetic data, transferring knowledge from a large teacher to compact student models, with clustering-based seed selection for efficient low-resource domain adaptation.
ResAware proposes a resource-aware distillation framework to improve website fingerprinting robustness across different network environments. A teacher model is trained on resource-level features and its knowledge is distilled into a student model that uses only encrypted traffic, achieving significant gains under temporal drift and other perturbations.
PowerOPD introduces a bounded power transformation to stabilize on-policy distillation for large language models, achieving significant gains in accuracy and sample efficiency while reducing computational cost.
This article explains the technical principles of knowledge distillation in machine learning. It points out that merely collecting output dialogues from ChatGPT/Claude cannot achieve effective distillation, because such outputs lack probability distribution information, and it discusses the limitations of using generated data in SFT and pre-training.
This paper introduces the Call Playbook dataset for classifying real-world B2B conversations and proposes methods to distill examples into compact, interpretable task instructions, achieving 99% token reduction and up to 7% AUC improvement over traditional in-context learning.
Zone of Proximal Policy Optimization (ZPPO) improves knowledge distillation by using reformulated prompts that help students learn from both correct and incorrect responses, enhancing performance especially at smaller model sizes.
This article is the middle part of the AI Engineering Landscape series, detailing core techniques such as inference optimization, model compression (quantization, distillation, pruning, MoE), and speculative decoding, while reviewing the latest advances from hardware to the engineering stack.
This paper proposes MODF-SIR, a multi-agent collaborative framework built on a lightweight multimodal large language model for social intelligence reasoning. It employs knowledge distillation, long-tail event extraction, and test-time adaptation to achieve state-of-the-art results with reduced training data.
This paper proposes a novel framework that uses LLMs to extract analytical physics priors from scientific literature and distills them into a lightweight neural network for high-accuracy, real-time manufacturing process-property prediction, even with limited data.