Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Hugging Face Daily Papers

Summary

This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.

Cached at: 05/12/26, 07:31 AM

Source: https://huggingface.co/papers/2605.10781

Abstract

RLRT enhances self-distillation by reinforcing successful student decisions that deviate from teacher predictions, enabling more effective exploration in reinforcement learning via self-reward.

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student’s choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student’s own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
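The abstract does not spell out the RLRT objective, so the following is only a minimal sketch of one possible reading of "reinforcing these tokens on correct rollouts". It assumes binary verifiable rewards, GRPO-style group-normalized advantages, and a per-token weight that grows with the gap between the student's and the teacher's log-probability of the sampled token. The function names, the clipped-gap weighting, and the `beta` coefficient are all hypothetical illustrations, not the paper's formulation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: reward normalized within the rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reversed_teacher_weights(student_logps, teacher_logps, beta=0.5):
    """Hypothetical per-token weights (assumption, not from the paper).

    A token gets weight > 1 when the student assigned it a higher
    log-probability than the information-advantaged teacher did, i.e. the
    teacher "would not have predicted" it. Other tokens keep weight 1.
    """
    return [
        1.0 + beta * max(0.0, s - t)
        for s, t in zip(student_logps, teacher_logps)
    ]


def token_advantages(rewards, student_logps, teacher_logps, beta=0.5):
    """Token-level advantages: reweight only rollouts with positive reward."""
    advantages = group_advantages(rewards)
    result = []
    for r, adv, s_lp, t_lp in zip(rewards, advantages, student_logps, teacher_logps):
        if r > 0:
            # Correct rollout: amplify tokens where the student deviated
            # from the teacher.
            weights = reversed_teacher_weights(s_lp, t_lp, beta)
        else:
            # Failed rollout: leave the plain group advantage untouched.
            weights = [1.0] * len(s_lp)
        result.append([adv * w for w in weights])
    return result


# Toy group of two rollouts with two tokens each (made-up numbers).
rewards = [1, 0]
student_logps = [[-0.1, -0.2], [-0.3, -0.4]]
teacher_logps = [[-0.1, -2.2], [-0.3, -0.4]]
toy = token_advantages(rewards, student_logps, teacher_logps)
```

In this toy group, the second token of the correct rollout is one the teacher found unlikely, so it receives a larger positive advantage than the first token, while the failed rollout keeps a uniform negative advantage. A standard self-distillation term would instead pull the student toward the teacher on every token, which is the behavior the abstract argues is harmful on successful rollouts.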

Similar Articles

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

Hugging Face Daily Papers

Adaptive Teacher Exposure for Self-Distillation (ATESD) improves LLM reasoning by dynamically adjusting how much of the reference reasoning the teacher shows the student during training, using a learnable policy controller and a discounted learning-progress reward. Experiments on math benchmarks show consistent improvements over existing self-distillation and RL baselines.

ExpRL: Exploratory RL for LLM Mid-Training

Hugging Face Daily Papers

ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning, achieving significant gains on hard math benchmarks like AIME-2026.