Tag
This paper introduces GLACIER, a multimodal student-teacher foundation model that integrates molecular graphs, SMILES strings, and physicochemical descriptors to predict molecular properties efficiently. It leverages Finsler geometry-aware fusion and knowledge distillation from larger teacher models (MiniMol, MolFormer) to achieve high performance with a lightweight architecture.
This paper analyzes on-policy distillation (OPD), finding that OPD updates are sparse, distributed across layers, and FFN-heavy, and that they retain geometric properties distinct from dense parameter rewriting. The sparse structure is operationally useful, but sparsity-inducing SGD underperforms AdamW due to heterogeneous gradient scales.
Co-GLANCE is a real-time onboard perception and decision-making system for heterogeneous robot teams that distills vision-language model capabilities into efficient models and uses conformal prediction with selective abstention to quantify and resolve perceptual uncertainty, outperforming cloud-based VLM baselines by 25-36% while achieving 350x lower latency.
This paper presents Moonshine, an autonomous mathematical research agent that generates conjectures, exemplified by deriving the Neural Jacobian Conjecture from the classical Jacobian conjecture and proving a special case using LLMs.
This paper proposes a cross-modal knowledge distillation framework that works without paired data by aligning feature and label distributions, offering theoretical guarantees and outperforming prior methods on multimodal benchmarks.
This paper proposes PADD, a framework for distilling knowledge from dense teachers into mixture-of-experts (MoE) students, addressing the challenge of learning routing policies when the teacher has no router. The method involves four stages and shows improvements on mathematical reasoning benchmarks.
This blog post discusses On-Policy Distillation (OPD), a technique that combines on-policy rollouts with dense teacher supervision, and highlights its promise, three failure modes, and the author's new paper on the topic.
This paper investigates how reasoning models perform zero-shot multi-label classification over millions of candidate labels. The authors characterize a two-phase process of shortlisting and fine-grained reasoning, and propose a mechanistic distillation method that outperforms standard distillation for transferring these capabilities to smaller models.
The paper introduces OPDLM, a method that transforms autoregressive language models into diffusion language models via on-policy distillation, requiring 15x to 7000x fewer training tokens while retaining knowledge from the original model.
The article discusses the surprising robustness of model distillation to the choice of training distribution, even when that distribution overlaps little with the target distribution, and the implications for on-policy versus off-policy distillation.
Trajectory-Refined Distillation (TRD) addresses prefix failure in on-policy distillation for LLMs by correcting student rollouts at the trajectory level before distillation, consistently outperforming prior baselines across benchmarks.
Hugging Face's Niels introduces On-policy Distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, now featured on PapersWithCode with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.
Researchers from AMD propose Recover-LoRA, a method that uses low-rank adaptation with knowledge distillation on synthetic data to recover accuracy lost from aggressive 2-bit quantization of LLMs, achieving 80–95% accuracy recovery on 9 of 12 benchmarks for Qwen3-4B using only 10k synthetic samples.
SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.
This paper investigates why LLM agents suffer from progressive capability collapse under multi-iteration experience internalization and proposes a robust recipe addressing experience granularity, injection patterns, and training regime. Key findings include that principle-level experience, step-wise injection, and off-policy context-distillation yield more stable and sustainable continual learning.
DuDi is a dual-signal multilingual distillation framework combining sequence-level and token-level signals with a cross-lingual verbalizer to improve small language models' performance on Southeast Asian languages. Experiments on SEA-HELM show DuDi consistently outperforms competitive distillation baselines across multiple model families and scales.
Researchers from Amazon AGI introduce Cartridges at Scale (CAS), a training framework that distills document collections into modular, reusable KV caches, enabling scalable multi-cartridge learning over collections exceeding one million tokens. CAS improves over monolithic cartridge baselines by 10–31 points and matches or exceeds conventional RAG accuracy while consuming 3–4× fewer prompt tokens.
This paper proposes compressing reasoning traces before knowledge distillation to reduce computational costs and inference lengths, showing an accuracy-efficiency trade-off where compressed traces retain up to 96% of raw-trace accuracy with up to 18x higher per-token efficiency.
GeoVR enhances multimodal large language models with 3D awareness by restructuring their semantic latent space through geometric knowledge distillation from 3D foundation models using multiple geometric targets.
OPRD is a knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
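
Many of the entries above concern on-policy distillation, in which the student samples its own rollouts and a teacher supervises every sampled token position. As a rough illustration only, the sketch below computes a per-position reverse KL, KL(student || teacher), averaged over one rollout. This is one commonly used objective for this setup. None of the summaries specify their exact loss, so the choice of objective, the function names (`log_softmax`, `reverse_kl`, `opd_loss`), and the toy logits are all assumptions rather than any listed paper's method.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position, over the full vocabulary."""
    log_p = log_softmax(student_logits)  # student distribution (log-probs)
    log_q = log_softmax(teacher_logits)  # teacher distribution (log-probs)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def opd_loss(student_steps, teacher_steps):
    """Mean per-position reverse KL along one student-sampled rollout.

    Each argument is a list of logit vectors, one per generated position.
    Both are assumed (hypothetically) to be evaluated on the same
    student-sampled prefix, which is what makes the signal on-policy.
    """
    if len(student_steps) != len(teacher_steps):
        raise ValueError("student and teacher must cover the same positions")
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    # Toy rollout: 2 positions, vocabulary of 3 tokens (made-up logits).
    student = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
    teacher = [[1.5, 1.0, -1.0], [0.0, 0.0, 0.0]]
    print(opd_loss(student, teacher))
```

The loss is zero wherever the student already matches the teacher and positive elsewhere, so every position contributes a dense signal, in contrast to a single sparse reward at the end of the rollout.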