Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information
Summary
Proposes Anti-Self-Distillation (AntiSD), which reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy. Across five models from 4B to 30B parameters, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points.
Cached at: 05/20/26, 02:35 AM
Source: https://huggingface.co/papers/2605.11609
Abstract
Anti-Self-Distillation reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy.
On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher’s confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens (“Wait”, “Let”, “Maybe”) that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline’s accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
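The per-token mechanics described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the divergence, the bound, or the gate threshold, so the `tanh` squashing, the `entropy_floor` value, and every function name below are assumptions chosen only for illustration. The sketch treats the pointwise mutual information of a token as the teacher's log-probability (with privileged context) minus the student's log-probability (without it), flips its sign, and zeroes the term when the teacher's entropy falls below a floor.

```python
import math


def token_pmi(teacher_logprob: float, student_logprob: float) -> float:
    """Pointwise mutual information between a token and the privileged context.

    Positive when the privileged context makes the token more likely.
    """
    return teacher_logprob - student_logprob


def entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def antisd_token_signal(
    teacher_logprob: float,
    student_logprob: float,
    teacher_probs: list[float],
    entropy_floor: float = 0.1,  # hypothetical threshold, not from the paper
) -> float:
    """Illustrative per-token signal with the self-distillation sign reversed."""
    # Gate: drop the term once the teacher distribution has collapsed.
    if entropy(teacher_probs) < entropy_floor:
        return 0.0
    pmi = token_pmi(teacher_logprob, student_logprob)
    # Ordinary self-distillation would reward positive PMI; here the sign is
    # flipped. tanh is an assumed stand-in for the paper's bounded advantage.
    return math.tanh(-pmi)


# A deliberation token the teacher down-weights gets a positive signal.
wait_signal = antisd_token_signal(math.log(0.05), math.log(0.20), [0.05, 0.50, 0.45])
# A token the privileged context inflates gets a negative signal.
connective_signal = antisd_token_signal(math.log(0.90), math.log(0.40), [0.90, 0.05, 0.05])
# A collapsed teacher distribution switches the term off.
gated_signal = antisd_token_signal(math.log(0.999), math.log(0.50), [0.999, 0.001])
```

Under these assumptions, a token such as “Wait” that the privileged teacher suppresses is pushed up, while a token the solution already implies is pushed down, which matches the sign reversal the abstract describes.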
Similar Articles
Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
The paper proposes EGRSD and CL-EGRSD, on-policy self-distillation methods that weight token-level supervision by teacher entropy to improve reasoning accuracy-length tradeoff in LLMs, evaluated on Qwen3-4B and Qwen3-8B.
Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation
Proposes Distribution-Aligned Self-Distillation (DASD), which dynamically filters tokens during self-distillation to preserve beneficial logical corrections while suppressing distributionally misaligned style noise, improving robust reasoning on math, code, and commonsense benchmarks.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
Adaptive Teacher Exposure for Self-Distillation (ATESD) improves LLM reasoning by dynamically adjusting how much of the reference reasoning the teacher shows the student during training, using a learnable policy controller and a discounted learning-progress reward. Experiments on math benchmarks show consistent improvements over existing self-distillation and RL baselines.
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
Self-Distillation Zero (SD-Zero) is a novel training method that converts sparse binary rewards into dense token-level supervision through dual-role training where a model acts as both generator and reviser, achieving 10%+ improvements on math and code reasoning benchmarks with higher sample efficiency than RL approaches.
Self-Distillation Enables Continual Learning [pdf]
Introduces Self-Distillation Fine-Tuning (SDFT), a method that enables on-policy learning from demonstrations to achieve continual learning without catastrophic forgetting, outperforming supervised fine-tuning.