Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Hugging Face Daily Papers

Summary

Proposes Anti-Self-Distillation (AntiSD), which reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy. Across models from 4B to 30B parameters, it reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points.

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
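The pointwise mutual information argument above can be made concrete with a small sketch. It assumes the teacher is the same model conditioned on the privileged context and the student is the model without it, so the per-token PMI between a token and the privileged context is the teacher's log-probability minus the student's. The function name `token_pmi` and all probabilities are hypothetical, chosen only to illustrate the sign pattern the summary describes. They are not taken from the paper.

```python
import math

def token_pmi(teacher_logprob, student_logprob):
    """PMI between a token and the privileged context, given the shared prefix.

    teacher_logprob: log p(token | prefix, privileged context)
    student_logprob: log p(token | prefix)
    """
    return teacher_logprob - student_logprob

# Hypothetical next-token probabilities (made up for illustration).
examples = {
    "Therefore": {"teacher": 0.60, "student": 0.30},  # implied by the solution
    "Wait": {"teacher": 0.02, "student": 0.10},       # deliberation token
}

pmi = {
    token: token_pmi(math.log(p["teacher"]), math.log(p["student"]))
    for token, p in examples.items()
}

for token, value in pmi.items():
    direction = "inflated" if value > 0 else "deflated"
    print(f"{token!r}: PMI = {value:+.3f} ({direction} by the privileged context)")
```

Under these made-up numbers, the structural connective gets positive PMI and the deliberation token gets negative PMI, so pulling the student toward the teacher would suppress exactly the tokens that drive multi-step search.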

Cached at: 05/20/26, 02:35 AM

Paper page - Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Source: https://huggingface.co/papers/2605.11609

Abstract

Anti-Self-Distillation reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy.

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher’s confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens (“Wait”, “Let”, “Maybe”) that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline’s accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
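The sign reversal and the entropy gate can be sketched in a few lines. This is a toy illustration, not the paper's formulation: the abstract does not state which divergence is used, so the sketch substitutes a simple probability difference (bounded in [-1, 1]) as the per-token term. The function name `antisd_token_term` and the threshold `entropy_floor` are likewise hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def antisd_token_term(teacher_probs, student_probs, token_id, entropy_floor=0.1):
    """Toy per-token term with the self-distillation sign reversed.

    Assumption: the self-distillation signal is modeled as
    p_teacher(token) - p_student(token), a stand-in that is bounded
    in [-1, 1]. The paper's actual divergence may differ.
    """
    sd_term = teacher_probs[token_id] - student_probs[token_id]
    # Gate: switch the term off once the teacher's entropy has collapsed.
    gate = 1.0 if entropy(teacher_probs) > entropy_floor else 0.0
    # Anti-self-distillation: flip the sign of the self-distillation term.
    return gate * (-sd_term)

# Teacher (with privileged context) downweights token 1 relative to the student,
# so the reversed term rewards the student for sampling it.
active = antisd_token_term([0.7, 0.2, 0.1], [0.4, 0.4, 0.2], token_id=1)

# Teacher is nearly deterministic, so the gate zeroes the term.
gated = antisd_token_term([0.999, 0.0005, 0.0005], [0.4, 0.4, 0.2], token_id=1)

print(active, gated)
```

In this sketch the reversed term is positive for a token the teacher suppresses, and it vanishes when the teacher distribution is close to one-hot, which mirrors the gate described in the abstract.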


Get this paper in your agent:

hf papers read 2605.11609

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Hugging Face Daily Papers

Adaptive Teacher Exposure for Self-Distillation (ATESD) improves LLM reasoning by dynamically adjusting how much of the reference reasoning the teacher shows the student during training, using a learnable policy controller and a discounted learning-progress reward. Experiments on math benchmarks show consistent improvements over existing self-distillation and RL baselines.

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Hugging Face Daily Papers

Self-Distillation Zero (SD-Zero) is a novel training method that converts sparse binary rewards into dense token-level supervision through dual-role training where a model acts as both generator and reviser, achieving 10%+ improvements on math and code reasoning benchmarks with higher sample efficiency than RL approaches.

Self-Distillation Enables Continual Learning [pdf]

Hacker News Top

Introduces Self-Distillation Fine-Tuning (SDFT), a method that enables on-policy learning from demonstrations to achieve continual learning without catastrophic forgetting, outperforming supervised fine-tuning.