knowledge-distillation

#knowledge-distillation

PHF: Privileged Hidden Flow for On-Policy Self-Distillation

arXiv cs.AI · 5h ago

PHF proposes a method to distill hidden state trajectories from a privileged teacher to a student during on-policy self-distillation, improving the reasoning performance of language models.

DistilledGemma: Balanced Efficiency-Accuracy for Person-Place Relation Extraction from Multilingual Historical Articles

arXiv cs.CL · 5h ago

This paper presents DistilledGemma, a system for person-place relation extraction from multilingual historical newspaper articles using a three-stage knowledge distillation pipeline from a 26B Gemma teacher to a 2.3B student, achieving competitive accuracy and efficiency in the HIPE-2026 shared task.

Labeling Training Data for Entity Matching Using Large Language Models

arXiv cs.CL · 5h ago

This paper investigates using LLMs as teacher models to label training data for entity matching, showing that student models trained on machine-labeled data perform on par with those trained on manually labeled benchmarks, with significant cost and speed advantages.

SEAD: Competence-Aware On-Policy Distillation via Entropy-Guided Supervision

arXiv cs.CL · 5h ago

SEAD introduces a competence-aware on-policy distillation method that uses entropy to guide supervision at token, temporal, and prompt levels, achieving a +4.8 average accuracy improvement on OLMo-3 across six math benchmarks.

Large Language Model Teaches Visual Students: Cross-Modality Transfer of Fine-Grained Conceptual Knowledge

arXiv cs.AI · yesterday

This paper introduces LaViD, a framework that transfers semantic knowledge from a language-only LLM to a vision student model by generating multiple-choice questions as conceptual signatures, achieving superior fine-grained classification performance and robustness.

Knowledge Distillation of Black-Box Large Language Models

Hacker News Top · yesterday

Introduces Proxy-KD, a novel method for distilling knowledge from black-box large language models (like GPT-4) into smaller models using a proxy model, surpassing both traditional black-box and white-box KD techniques.

@VukRosic99: When a small model learns from a big one, half the lesson is wasted The setup: a small "student" model writes an answer…

X AI KOLs Timeline · yesterday

The paper identifies position bias in on-policy distillation for language models, where later tokens in student-generated answers receive degraded supervision. The proposed Importance-Weighted On-Policy Distillation (IW-OPD) weights corrections based on accumulated drift, improving learning speed and final performance.

Let's Learn About Knowledge Distillation!

Reddit r/ArtificialInteligence · 2d ago

The article argues that frontier model providers who criticize knowledge distillation are hypocritical, as their own legal defense against copyright lawsuits relies on the same principle of not directly storing or touching data.

@neural_avb: There is a really banger article on On-Policy Distillation. Came out on HF a few months back.

X AI KOLs Timeline · 2d ago

A tweet recommending an article on on-policy distillation published on Hugging Face.

NebulaExp-8B: An Empirical Post-Training Pipeline via Full-Scale Ablation Research

arXiv cs.AI · 4d ago

This paper presents NebulaExp, a transparent ablation-driven post-training pipeline for 8B-scale LLMs, covering SFT, GRPO RL, and multi-teacher distillation. It identifies key trade-offs between mathematical reasoning and code generation, and demonstrates that data correctness filtering is the first-order optimization factor.

AsyncOPD: How Stale Can On-Policy Distillation Be?

arXiv cs.LG · 6d ago

This paper presents AsyncOPD, a fully asynchronous on-policy distillation pipeline for LLMs, systematically studying the effects of stale-policy data and proposing estimator designs that improve training throughput by 1.6-3.8x while maintaining comparable accuracy.

Blockwise Policy-Drift Gating for On-Policy Distillation

arXiv cs.LG · 6d ago

This paper introduces blockwise policy-drift gating, a lightweight method that improves on-policy distillation for language models by weighting the loss according to how far the student's probabilities have shifted between the old and current policy, yielding higher reasoning accuracy on math benchmarks.

ARIA: Adaptive Region-Based Importance Allocation for Conditional Diffusion Distillation

arXiv cs.LG · 6d ago

This paper introduces ARIA, a framework that adaptively allocates training effort across regions of the conditioning space for distilling conditional diffusion models, improving performance on unseen and underrepresented conditions.

Beyond Trajectory Imitation: Strategy-Guided Policy Optimization for LLM Reasoning

arXiv cs.AI · 6d ago

Introduces Strategy-Guided Policy Optimization (SGPO) for LLM reasoning, which replaces trajectory imitation with strategy distillation, improving generalization on math benchmarks.

@natolambert: New lecture for the book! Nominally about synthetic data, but mostly is a walk through of the distillation literature f…

X AI KOLs Timeline · 6d ago

@natolambert announces a new lecture covering synthetic data and the history of distillation, from Hinton 2015 to modern on-policy distillation, with over 7 hours of video content.

Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching

Hugging Face Daily Papers · 2026-06-23

Lite Any Stereo V2 presents an efficient stereo matching approach achieving state-of-the-art accuracy with significantly reduced latency through optimized architecture and training strategies, including a 2D-only cost aggregation framework and a three-stage training strategy.

@TheTuringPost: https://x.com/TheTuringPost/status/2068474648925216861

X AI KOLs Timeline · 2026-06-20

An educational overview of knowledge distillation, covering its history, core concepts like softmax and temperature, types, scaling laws, and practical examples including DeepSeek-R1.
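
For readers unfamiliar with the softmax-and-temperature idea mentioned above, here is a minimal sketch of the classic soft-target distillation loss in the style of Hinton et al. (2015). It is not taken from the linked post: the function names, the example logits, and the default temperature of 2.0 are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature**2 factor keeps gradient magnitudes comparable across
    temperatures, as suggested in Hinton et al. (2015).
    """
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [2.0, 1.0, 0.1]
student = [0.5, 0.5, 0.5]
print(softmax(teacher, temperature=1.0))
print(softmax(teacher, temperature=4.0))
print(distillation_loss(teacher, student))
```

In practice this term is usually combined with an ordinary cross-entropy loss on the ground-truth labels; the mixing weight is a tuning choice and is left out of the sketch.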

Efficient Financial Language Understanding via Distillation with Synthetic Data

arXiv cs.CL · 2026-06-18

Presents a framework for financial sentiment analysis using distillation with synthetic data, transferring knowledge from a large teacher to compact student models, with clustering-based seed selection for efficient low-resource domain adaptation.

ResAware: Cross-Environment Website Fingerprinting via Resource-Privileged Distillation

arXiv cs.LG · 2026-06-17

ResAware proposes a resource-aware distillation framework that improves the robustness of website fingerprinting across network environments. A teacher model is trained on resource-level features and distilled into a student model that uses only encrypted traffic, achieving significant gains under temporal drift and other perturbations.

PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation

arXiv cs.LG · 2026-06-17

PowerOPD introduces a bounded power transformation to stabilize on-policy distillation for large language models, achieving significant gains in accuracy and sample efficiency while reducing computational cost.
