Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Hugging Face Daily Papers 05/11/26, 12:00 AM Papers

Summary

This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
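The abstract does not give RLRT's exact objective, so the following is only a rough sketch of the idea it describes: start from GRPO's group-normalized advantages and, on correct rollouts only, add a per-token bonus where the student chose a token the teacher found less likely. The function names, the clipped log-probability gap used as the "deviation" score, the `r > 0` correctness check, and the coefficient `beta` are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reversed_teacher_bonus(student_logps, teacher_logps):
    """Hypothetical deviation score per token.

    Positive where the student's log-probability of the sampled token
    exceeds the teacher's, i.e. the teacher "would not have predicted" it.
    Clipped at zero so agreement with the teacher earns no bonus.
    """
    return [max(0.0, s - t) for s, t in zip(student_logps, teacher_logps)]


def token_advantages(rewards, student_logps, teacher_logps, beta=0.1):
    """Per-token advantages for a group of rollouts.

    rewards:        one verifiable reward per rollout (assumed > 0 if correct)
    student_logps:  per rollout, log-probs of sampled tokens under the student
    teacher_logps:  per rollout, log-probs of the same tokens under the teacher
    beta:           assumed weight of the reversed-teacher bonus
    """
    advantages = grpo_advantages(rewards)
    result = []
    for r, a, s_lp, t_lp in zip(rewards, advantages, student_logps, teacher_logps):
        if r > 0:
            # Correct rollout: reinforce tokens where the student deviated.
            bonus = reversed_teacher_bonus(s_lp, t_lp)
            result.append([a + beta * b for b in bonus])
        else:
            # Incorrect rollout: plain GRPO advantage on every token.
            result.append([a] * len(s_lp))
    return result


if __name__ == "__main__":
    rewards = [1.0, 0.0]
    student = [[-1.0, -0.5], [-2.0]]
    teacher = [[-3.0, -0.2], [-0.1]]
    for row in token_advantages(rewards, student, teacher):
        print(row)
```

In this toy group, only the first token of the correct rollout receives a bonus, because it is the only token where the student's log-probability exceeds the teacher's; the incorrect rollout keeps its uniform negative advantage.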

Original Article

Cached at: 05/12/26, 07:31 AM

Paper page - Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Source: https://huggingface.co/papers/2605.10781

Abstract

RLRT enhances self-distillation by reinforcing successful student decisions that deviate from teacher predictions, enabling more effective exploration in reinforcement learning via self-reward.

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts, the same mechanism instead overwrites the student’s choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student’s own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.

Get this paper in your agent:

hf papers read 2605.10781

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

@blc_16: MIT just released a new RL method called Pedagogical RL. The main lesson -> correct reasoning traces can still be bad t…

@lateinteraction: ICYMI: read the blog on Pedagogical RL Instead of sampling blindly from your LLM, leverage the label used for RLVR! Lea…

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

ExpRL: Exploratory RL for LLM Mid-Training

@SOURADIPCHAKR18: We describe early experiments on pedagogical RL: A bitter-lesson-pilled paradigm of training privileged self-teache…
