Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

Summary

Proposes Anti-Self-Distillation (AntiSD) which reverses the knowledge transfer direction in self-distillation to improve math reasoning efficiency and accuracy, achieving GRPO baseline accuracy in 2-10x fewer steps and up to 11.5 points higher final accuracy across models from 4B to 30B parameters.

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher's confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens ("Wait", "Let", "Maybe") that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline's accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
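
For reference, the standard definition of pointwise mutual information can be applied to this setting as follows. The notation is an assumption made for illustration and may differ from the paper's: x is the prompt, c the privileged context, y_t the token sampled at step t, and \pi_\theta the shared model, so that the teacher is \pi_\theta conditioned on (x, c) and the student is \pi_\theta conditioned on x alone.

```latex
\mathrm{PMI}(y_t;\, c \mid x, y_{<t})
  = \log \frac{\pi_\theta(y_t \mid x, c, y_{<t})}{\pi_\theta(y_t \mid x, y_{<t})}
  = \log \pi_\theta(y_t \mid x, c, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t})
```

Under this reading, the quantity is positive on tokens whose probability the privileged context raises and negative on tokens whose probability it lowers, which the summary above identifies as the deliberation tokens. A term that pulls the student toward the teacher favors the first group; reversing its sign favors the second.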

Cached at: 05/20/26, 02:35 AM

Source: https://huggingface.co/papers/2605.11609

Abstract

Anti-Self-Distillation reverses the direction of knowledge transfer in self-distillation to improve math reasoning efficiency and accuracy.

On-policy self-distillation, where a student is pulled toward a copy of itself conditioned on privileged context (e.g., a verified solution or feedback), offers a promising direction for advancing reasoning capability without a stronger external teacher. Yet in math reasoning the gains are inconsistent, even when the same approach succeeds elsewhere. A pointwise mutual information analysis traces the failure to the privileged context itself: it inflates the teacher’s confidence on tokens already implied by the solution (structural connectives, verifiable claims) and deflates it on deliberation tokens (“Wait”, “Let”, “Maybe”) that drive multi-step search. We propose Anti-Self-Distillation (AntiSD), which ascends a divergence between student and teacher rather than descending it: this reverses the per-token sign and yields a naturally bounded advantage in one step. An entropy-triggered gate disables the term once the teacher entropy collapses, completing a drop-in replacement for default self-distillation. Across five models from 4B to 30B parameters on math reasoning benchmarks, AntiSD reaches the GRPO baseline’s accuracy in 2 to 10x fewer training steps and improves final accuracy by up to 11.5 points. AntiSD opens a path to scalable self-improvement, where a language model bootstraps its own reasoning through its training signal.
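
The sign reversal and the entropy gate described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not state the exact divergence, how the advantage is bounded, or how the gate is triggered. The function names, the clipping bound `clip`, the threshold `entropy_floor`, and the choice to gate on the batch-mean teacher entropy are all assumptions made here for illustration.

```python
import numpy as np

def teacher_entropy(teacher_probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of a (T, V) probability array."""
    p = np.asarray(teacher_probs, dtype=float)
    return -(p * np.log(p + eps)).sum(axis=-1)

def anti_sd_term(teacher_logp, student_logp, teacher_probs,
                 clip=1.0, entropy_floor=0.1):
    """Hypothetical per-token AntiSD-style term.

    teacher_logp / student_logp: log-probabilities of the sampled tokens
    under the context-conditioned teacher and the plain student, shape (T,).
    """
    # Log-ratio of teacher to student on the sampled tokens.
    log_ratio = np.asarray(teacher_logp, float) - np.asarray(student_logp, float)
    # Reverse the sign relative to pulling the student toward the teacher,
    # then clip. Clipping is an assumed stand-in for the bounded advantage.
    term = np.clip(-log_ratio, -clip, clip)
    # Assumed gate: switch the term off when mean teacher entropy is too low.
    if teacher_entropy(teacher_probs).mean() < entropy_floor:
        return np.zeros_like(term)
    return term

# Toy example with two positions and a three-token vocabulary.
# Position 0: the teacher is more confident than the student.
# Position 1: the teacher is less confident than the student.
teacher_logp = np.log([0.9, 0.05])
student_logp = np.log([0.5, 0.30])
teacher_probs = np.array([[0.90, 0.05, 0.05],
                          [0.05, 0.50, 0.45]])

term = anti_sd_term(teacher_logp, student_logp, teacher_probs)
# term[0] is negative; term[1] is positive and clipped at +1.0.

collapsed = np.array([[1.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0]])
gated = anti_sd_term(teacher_logp, student_logp, collapsed)
# gated is all zeros because the teacher entropy is below the floor.
```

In the toy example the position where the context lowers the token's probability receives the positive term, which matches the abstract's account of deliberation tokens being deflated by the teacher. How such a term is combined with the GRPO advantage is not specified in the abstract and is left out here.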

Similar Articles

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Self-Distillation Enables Continual Learning [pdf]
