self-distillation

#self-distillation

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

arXiv cs.AI · 6d ago

This paper proposes Rubric-Conditioned Self-Distillation (RCSD), a framework that uses fine-grained rubric criteria to provide token-level guidance during self-distillation, improving reasoning performance over scalar-reward methods like GRPO and OPSD.

Learning from the Self-future: On-policy Self-distillation for dLLMs

arXiv cs.CL · 2026-06-17

Introduces d-OPSD, the first on-policy self-distillation framework for diffusion large language models, using suffix conditioning and step-level supervision to outperform RLVR and SFT baselines on reasoning benchmarks.

Learning from Your Own Mistakes: Constructing Learnable Micro-Reflective Trajectories for Self-Distillation

Hugging Face Daily Papers · 2026-06-17

This paper proposes Trajectory-Augmented Policy Optimization (TAPO), which constructs micro-reflective correction trajectories using the model's own correct and incorrect rollouts to improve reasoning in large language models, outperforming standard self-distillation methods on math benchmarks.

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

Hugging Face Daily Papers · 2026-06-17

This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.

Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding

Hugging Face Daily Papers · 2026-06-16

Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.

@agarwl_: Self-distillation does not work for thinking models YET https://arxiv.org/abs/2603.24472 https://openreview.net/forum?i…

X AI KOLs Timeline · 2026-06-15

This paper studies why self-distillation degrades reasoning in LLMs, finding that it suppresses epistemic verbalization (uncertainty expression), leading to performance drops of up to 40% in mathematical reasoning tasks.

Diffusion Policy Optimization without Drifting Apart

arXiv cs.LG · 2026-06-15

DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with policy-gradient updates to maintain a tight ELBO, preventing the double-drift phenomenon and achieving higher rewards in both language and continuous control tasks.

@sheriyuo: Qwen Tongyi Lab proposes RLCSD, a simple but important critique of on-policy self-distillation. Their key observation i…

X AI KOLs Timeline · 2026-06-11

Qwen Tongyi Lab proposes RLCSD to address the style drift problem in on-policy self-distillation, where the learning signal focuses on style tokens rather than task-critical reasoning tokens. Their method uses contrastive supervision to focus on task-relevant tokens, achieving consistent improvements over prior methods on reasoning benchmarks.

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

arXiv cs.AI · 2026-06-11

HERO introduces a hindsight-enhanced self-distillation framework that uses environment observations as locally aligned feedback to improve multi-turn agent capabilities, outperforming existing methods on TauBench and WebShop, especially under limited turn budgets.

Self-Distillation Policy Optimization via Visual Feedback: Bridging Code and Visual Artifacts

arXiv cs.AI · 2026-06-10

This paper introduces Visual-SDPO, a self-distillation policy optimization framework that uses rendered visual feedback as privileged context to train code-generating LLMs, improving visual artifact quality across chart, UI, and slide generation benchmarks.

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

arXiv cs.CL · 2026-06-10

ParaBridge is an on-policy self-distillation method that bridges the gap between paralinguistic perception and dialogue behavior in speech language models, significantly improving safety and empathy without external rewards.

World Model Self-Distillation: Training World Models to Solve General Tasks

Hugging Face Daily Papers · 2026-06-10

A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data.

The Role of Feedback Alignment in Self-Distillation

Hugging Face Daily Papers · 2026-06-09

This paper studies context design for self-distillation in language models, finding that step-aligned critique feedback significantly outperforms binary reward or reference solution conditioning, because it targets only erroneous tokens while preserving correct behavior.

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Hugging Face Daily Papers · 2026-06-08

PBSD proposes a Bayesian self-distillation method that converts sparse final rewards into calibrated turn-level credit signals for long-horizon agentic tasks, improving policy learning and generalization.

Self-Distilled Policy Gradient

arXiv cs.LG · 2026-06-04

SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.
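
Two of the ingredients named in this summary can be illustrated with a toy sketch. This is not the SDPG objective, which the summary does not give. It only shows, assuming a GRPO-style normalization, how group-relative advantages might be computed from verifier rewards, and how a per-token KL divergence could be measured between a teacher distribution (the same model with privileged context) and a student distribution. The function names, the `eps` constant, the example probabilities, and the KL direction are all illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize verifier rewards within one group of rollouts for the same prompt.

    Assumed GRPO-style form: (r - mean) / (std + eps).
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def token_kl(teacher_probs, student_probs):
    """KL(teacher || student) for one token position.

    Both inputs are probability lists over the same vocabulary;
    student probabilities are assumed to be strictly positive.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )

# Four rollouts for one prompt, scored 0/1 by a verifier (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward has zero variance,
# so every advantage is zero.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Made-up next-token distributions over a 3-token vocabulary.
teacher = [0.7, 0.2, 0.1]   # model conditioned on privileged context
student = [0.4, 0.4, 0.2]   # same model without that context
kl = token_kl(teacher, student)
```

In the zero-variance group the scalar advantages vanish, while the per-token KL term can still be non-zero. That gives one intuition for why a dense distillation signal might help when verifier rewards are sparse.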

@dwarkesh_sp: Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works…

X AI KOLs Following · 2026-06-04

Dwarkesh Patel shares an explanation from Sasha Rush on targeted on-policy self-distillation, where hint tokens are inserted into a trajectory to downweight specific model errors without requiring new rollouts.

Reinforcement Learning from Rich Feedback with Distributional DAgger

Hugging Face Daily Papers · 2026-06-03

Introduces DistIL, a method for reinforcement learning from rich feedback that guarantees monotonic policy improvement, outperforming existing methods on science reasoning, coding, and mathematical reasoning.

World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning

Hugging Face Daily Papers · 2026-06-02

This paper proposes Privileged-Future On-Policy Self-Distillation (PF-OPSD) for controlled concrete reasoning, combining world models' visual simulation with language models' abstract reasoning to improve prediction accuracy and robustness on two new benchmarks.

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

arXiv cs.AI · 2026-06-02

This paper proposes CAST, a non-privileged clipped asymmetric self-teaching method that enhances GRPO-based reinforcement learning with verifiable rewards by providing dense token-level guidance and addressing zero-variance group issues, demonstrating improvements in mathematical reasoning.

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

arXiv cs.CL · 2026-06-02

Proposes Distribution-Aligned Self-Distillation (DASD), which dynamically filters tokens during self-distillation to preserve beneficial logical corrections while suppressing distributionally misaligned style noise, improving robust reasoning on math, code, and commonsense benchmarks.
