Tag
This paper proposes Rubric-Conditioned Self-Distillation (RCSD), a framework that uses fine-grained rubric criteria to provide token-level guidance during self-distillation, improving reasoning performance over scalar-reward methods like GRPO and OPSD.
Introduces d-OPSD, the first on-policy self-distillation framework for diffusion large language models, using suffix conditioning and step-level supervision to outperform RLVR and SFT baselines on reasoning benchmarks.
This paper proposes Trajectory-Augmented Policy Optimization (TAPO), which builds micro-reflective correction trajectories from the model's own correct and incorrect rollouts to improve reasoning in large language models, outperforming standard self-distillation methods on math benchmarks.
This paper introduces ViGOS, a method for multimodal on-policy self-distillation that decouples perception and reasoning by having the student model first produce a visual description before reasoning, reducing shortcut reliance and improving image-grounding behavior.
Proposes quality-aware self-distillation for GUI grounding, improving coordinate-token teacher signals via correctness-aware gating and probability scaling to enhance vision-language model performance.
This paper studies why self-distillation degrades reasoning in LLMs, finding that it suppresses epistemic verbalization (uncertainty expression), leading to performance drops of up to 40% in mathematical reasoning tasks.
DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with policy-gradient updates to maintain a tight ELBO, preventing the double-drift phenomenon and achieving higher rewards in both language and continuous control tasks.
Qwen Tongyi Lab proposes RLCSD to address the style drift problem in on-policy self-distillation, where the learning signal concentrates on style tokens rather than task-critical reasoning tokens. Their method uses contrastive supervision to shift that signal onto task-relevant tokens, achieving consistent improvements over prior methods on reasoning benchmarks.
HERO introduces a hindsight-enhanced self-distillation framework that uses environment observations as locally aligned feedback to improve multi-turn agent capabilities, outperforming existing methods on TauBench and WebShop, especially under limited turn budgets.
This paper introduces Visual-SDPO, a self-distillation policy optimization framework that uses rendered visual feedback as privileged context to train code-generating LLMs, improving visual artifact quality across chart, UI, and slide generation benchmarks.
ParaBridge is an on-policy self-distillation method that bridges the gap between paralinguistic perception and dialogue behavior in speech language models, significantly improving safety and empathy without external rewards.
A scalable framework combines self-distillation and reinforcement learning to transfer task-solving abilities from vision-language models to video diffusion models without requiring labeled task-video data.
This paper studies context design for self-distillation in language models, finding that step-aligned critique feedback significantly outperforms binary reward or reference solution conditioning, because it targets only erroneous tokens while preserving correct behavior.
PBSD proposes a Bayesian self-distillation method that converts sparse final rewards into calibrated turn-level credit signals for long-horizon agentic tasks, improving policy learning and generalization.
SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. A single shared model serves as both student and teacher by conditioning on privileged context, and the method shows improved stability and performance over RLVR and self-distillation baselines.
Dwarkesh Patel shares an explanation from Sasha Rush on targeted on-policy self-distillation, where hint tokens are inserted into a trajectory to downweight specific model errors without requiring new rollouts.
Introduces DistIL, a method for reinforcement learning from rich feedback that guarantees monotonic policy improvement, outperforming existing methods on science reasoning, coding, and mathematical reasoning.
This paper proposes Privileged-Future On-Policy Self-Distillation (PF-OPSD) for controlled concrete reasoning, combining world models' visual simulation with language models' abstract reasoning to improve prediction accuracy and robustness on two new benchmarks.
This paper proposes CAST, a non-privileged clipped asymmetric self-teaching method that enhances GRPO-based reinforcement learning with verifiable rewards by providing dense token-level guidance and addressing the zero-variance group problem, demonstrating improvements in mathematical reasoning.
Proposes Distribution-Aligned Self-Distillation (DASD), which dynamically filters tokens during self-distillation to preserve beneficial logical corrections while suppressing distributionally misaligned style noise, improving robust reasoning on math, code, and commonsense benchmarks.