knowledge-distillation

#knowledge-distillation

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

arXiv cs.LG ↗ · 2026-06-11 Cached

This paper introduces GLACIER, a multimodal student-teacher foundation model that integrates molecular graphs, SMILES strings, and physicochemical descriptors to predict molecular properties efficiently. It leverages Finsler geometry-aware fusion and knowledge distillation from larger teacher models (MiniMol, MolFormer) to achieve high performance with a lightweight architecture.

Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Hugging Face Daily Papers ↗ · 2026-06-11 Cached

This paper analyzes on-policy distillation (OPD), finding that OPD updates are sparse, distributed across layers and FFN-heavy, and retain geometric properties distinct from dense parameter rewriting. The sparse structure is operationally useful, but sparsity-inducing SGD underperforms AdamW due to heterogeneous gradient scales.

Co-GLANCE: Uncertainty-Aware Active Perception for Heterogeneous Robot Teaming

arXiv cs.LG ↗ · 2026-06-10 Cached

Co-GLANCE is a real-time onboard perception and decision-making system for heterogeneous robot teams that distills vision-language model capabilities into efficient models and uses conformal prediction with selective abstention to quantify and resolve perceptual uncertainty, outperforming cloud-based VLM baselines by 25-36% while achieving 350x lower latency.
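
As an illustration of conformal prediction with selective abstention, here is a minimal sketch in plain Python. It assumes split conformal prediction with the nonconformity score `1 - p(label)`; the summary does not say which score or calibration scheme Co-GLANCE uses, and the function names and labels (`door`, `window`, `wall`) are hypothetical.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    cal_scores holds the nonconformity score 1 - p(true label) for each
    held-out calibration example (an assumed choice of score).
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank of the order statistic
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, threshold):
    """Keep every label whose score 1 - p does not exceed the threshold."""
    return [label for label, p in probs.items() if 1.0 - p <= threshold]

def decide(probs, threshold):
    """Commit only when exactly one label survives; otherwise abstain (None)."""
    labels = prediction_set(probs, threshold)
    return labels[0] if len(labels) == 1 else None

cal = [0.02, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30, 0.40]
threshold = conformal_threshold(cal, alpha=0.5)
confident = decide({"door": 0.90, "window": 0.07, "wall": 0.03}, threshold)
ambiguous = decide({"door": 0.50, "window": 0.45, "wall": 0.05}, threshold)
```

In this sketch an abstention (`None`) is the point at which a system could gather another observation or defer to a teammate; how Co-GLANCE resolves the uncertainty is not described in the summary.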

Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

arXiv cs.AI ↗ · 2026-06-10 Cached

This paper presents Moonshine, an autonomous mathematical research agent that generates conjectures, exemplified by deriving the Neural Jacobian Conjecture from the classical Jacobian conjecture and proving a special case using LLMs.

Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

arXiv cs.AI ↗ · 2026-06-10 Cached

This paper proposes a cross-modal knowledge distillation framework that works without paired data by aligning feature and label distributions, offering theoretical guarantees and outperforming prior methods on multimodal benchmarks.

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

arXiv cs.CL ↗ · 2026-06-10 Cached

This paper proposes PADD, a framework for distilling knowledge from dense teachers into mixture-of-experts (MoE) students, addressing the challenge of learning routing policies when the teacher has no router. The method has four stages and improves results on mathematical reasoning benchmarks.

@louieworth: New blog post: On-Policy Distillation — Promise, Pitfalls, and Prospects. OPD combines on-policy rollouts with dense te…

X AI KOLs Following ↗ · 2026-06-09 Cached

This blog post discusses On-Policy Distillation (OPD), a technique that combines on-policy rollouts with dense teacher supervision, and highlights its promise, three failure modes, and the author's new paper on the topic.
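
To make "on-policy rollouts with dense teacher supervision" concrete, the sketch below scores a sequence that is assumed to have been sampled from the student, using a per-token reverse KL, KL(student || teacher), at every position. The reverse-KL choice is an assumption (a common one for OPD); the summary does not state which divergence the blog post uses, and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) over the vocabulary at a single position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token reverse KL along a student-sampled sequence.

    Both arguments hold one logit vector per generated token; the teacher
    logits are assumed to be computed on the student's own tokens.
    """
    per_token = [
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)

# Two positions over a toy vocabulary of size 2 (values are illustrative).
loss = opd_loss([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]])
```

The supervision is dense in the sense that every generated token contributes a loss term, in contrast to a single scalar reward for the whole rollout.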

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

arXiv cs.CL ↗ · 2026-06-08 Cached

This paper investigates how reasoning models perform zero-shot multi-label classification over millions of candidate labels. The authors characterize a two-phase process of shortlisting and fine-grained reasoning, and propose a mechanistic distillation method that outperforms standard distillation for transferring these capabilities to smaller models.

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

arXiv cs.CL ↗ · 2026-06-08 Cached

The paper introduces OPDLM, a method that transforms autoregressive language models into diffusion language models via on-policy distillation, requiring 15x to 7000x fewer training tokens while retaining knowledge from the original model.

@zhaisf: These were some magical results from distillation by @geoffreyhinton that really shocked me when I first saw them, and …

X AI KOLs Following ↗ · 2026-06-07 Cached

The article discusses the surprising robustness of model distillation to the training distribution, even when that distribution overlaps little with the target distribution, and the implications for on-policy versus off-policy distillation.

Trajectory-Refined Distillation

Hugging Face Daily Papers ↗ · 2026-06-07 Cached

Trajectory-Refined Distillation (TRD) addresses prefix failure in on-policy distillation for LLMs by correcting student rollouts at the trajectory level before distillation, consistently outperforming prior baselines across benchmarks.

On-policy distillation: one of the hottest terms on PapersWithCode [R]

Reddit r/MachineLearning ↗ · 2026-06-04

Hugging Face's Niels introduces On-policy Distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, now featured on PapersWithCode with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

arXiv cs.LG ↗ · 2026-06-04 Cached

Researchers from AMD propose Recover-LoRA, a method that uses low-rank adaptation with knowledge distillation on synthetic data to recover accuracy lost from aggressive 2-bit quantization of LLMs, achieving 80–95% accuracy recovery on 9 of 12 benchmarks for Qwen3-4B using only 10k synthetic samples.

Self-Distilled Policy Gradient

arXiv cs.LG ↗ · 2026-06-04 Cached

SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.
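
A rough sketch of what "group-relative verifier advantages" could look like, assuming GRPO-style normalization (subtract the group mean, divide by the group standard deviation). The summary does not give SDPG's exact formula, so the normalization, the `eps` constant, and the function name are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards within one group of rollouts for a prompt.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored 1 (verified correct) or 0.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every rollout fails the verifier: all advantages are zero.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this assumed normalization, a group whose rollouts all receive the same reward yields zero advantages and therefore no policy-gradient signal, which is one way sparse rewards can stall training and may be why a dense self-distillation term is useful alongside it.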

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

arXiv cs.CL ↗ · 2026-06-04 Cached

This paper investigates why LLM agents suffer from progressive capability collapse under multi-iteration experience internalization and proposes a robust recipe addressing experience granularity, injection patterns, and training regime. Key findings include that principle-level experience, step-wise injection, and off-policy context-distillation yield more stable and sustainable continual learning.

DuDi: Dual-Signal Distillation with Cross-Lingual Verbalizer

arXiv cs.CL ↗ · 2026-06-04 Cached

DuDi is a dual-signal multilingual distillation framework combining sequence-level and token-level signals with a cross-lingual verbalizer to improve small language models' performance on Southeast Asian languages. Experiments on SEA-HELM show DuDi consistently outperforms competitive distillation baselines across multiple model families and scales.

Cartridges at Scale: Training Modular KV Caches over Large Document Collections

arXiv cs.CL ↗ · 2026-06-04 Cached

Researchers from Amazon AGI introduce Cartridges at Scale (CAS), a training framework that distills document collections into modular, reusable KV caches, enabling scalable multi-cartridge learning over collections exceeding one million tokens. CAS improves over monolithic cartridge baselines by 10–31 points and matches or exceeds conventional RAG accuracy while consuming 3–4× fewer prompt tokens.

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

Hugging Face Daily Papers ↗ · 2026-06-04 Cached

This paper proposes compressing reasoning traces before knowledge distillation to reduce computational costs and inference lengths, showing an accuracy-efficiency trade-off where compressed traces retain up to 96% of raw-trace accuracy with up to 18x higher per-token efficiency.

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Hugging Face Daily Papers ↗ · 2026-06-04 Cached

GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.

OPRD: On-Policy Representation Distillation

Hugging Face Daily Papers ↗ · 2026-06-04

OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
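
One way to picture aligning hidden states across layers is a per-layer distance between student and teacher activations on the same student-generated tokens. The sketch below assumes mean squared error, equal hidden sizes, and a one-to-one layer pairing; none of these details is given in the summary, and the function names are hypothetical.

```python
def layer_mse(student_h, teacher_h):
    """Mean squared error between two hidden-state matrices (tokens x dim)."""
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student_h, teacher_h):
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    return total / count

def representation_loss(student_layers, teacher_layers):
    """Average the per-layer MSE over the paired layers (assumed 1:1 pairing)."""
    losses = [layer_mse(s, t) for s, t in zip(student_layers, teacher_layers)]
    return sum(losses) / len(losses)

# One layer, one token, hidden size 2 (values are illustrative).
loss = representation_loss([[[1.0, 2.0]]], [[[0.0, 0.0]]])
```

Given a fixed rollout, a loss of this kind is a deterministic function of the activations, whereas a KL estimate built from sampled tokens varies with the samples drawn.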
