Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR
Summary
This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.
Cached at: 05/12/26, 07:31 AM
Source: https://huggingface.co/papers/2605.10781
Abstract
RLRT enhances self-distillation by reinforcing successful student decisions that deviate from teacher predictions, enabling more effective exploration in reinforcement learning via self-reward.
Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student’s choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student’s own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.
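The abstract does not give the RLRT objective, so the following is only a minimal sketch of the idea it describes, under stated assumptions: GRPO-style group-normalized advantages, plus an extra per-token bonus on correct rollouts wherever the teacher assigned the student's sampled token a lower log-probability than the student did. The function names, the clipped log-probability gap used as the bonus, and the parameters `beta` and `cap` are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the GRPO style: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def reversed_teacher_bonus(student_logps, teacher_logps, cap=1.0):
    """Per-token bonus that grows where the teacher found the student's token unlikely.

    ASSUMPTION: the bonus is the gap (student log-prob minus teacher log-prob),
    clipped to [0, cap]. The paper may weight or select tokens differently.
    """
    return [min(max(s - t, 0.0), cap) for s, t in zip(student_logps, teacher_logps)]


def rlrt_token_advantages(rewards, student_logps, teacher_logps, beta=0.5):
    """Return one list of token-level advantages per rollout.

    rewards:        verifier reward per rollout (1 = correct, 0 = incorrect)
    student_logps:  per rollout, log-probs of the sampled tokens under the student
    teacher_logps:  per rollout, log-probs of the same tokens under the teacher
    beta:           hypothetical weight on the reversed-teacher bonus
    """
    advantages = grpo_advantages(rewards)
    token_advantages = []
    for r, a, s_lp, t_lp in zip(rewards, advantages, student_logps, teacher_logps):
        if r > 0:
            # Correct rollout: reinforce tokens where the student deviated from the teacher.
            bonus = reversed_teacher_bonus(s_lp, t_lp)
            token_advantages.append([a + beta * b for b in bonus])
        else:
            # Incorrect rollout: plain GRPO advantage, no reversed-teacher bonus.
            token_advantages.append([a] * len(s_lp))
    return token_advantages


# Toy group of two rollouts with two tokens each (made-up numbers).
rewards = [1, 0]
student_logps = [[-0.1, -0.2], [-0.3, -0.4]]
teacher_logps = [[-0.1, -2.0], [-3.0, -0.4]]
for row in rlrt_token_advantages(rewards, student_logps, teacher_logps):
    print(row)
```

In this toy group, only the second token of the correct rollout receives the bonus, because that is the only place where a successful student chose a token the teacher considered unlikely. The failed rollout keeps its plain group-normalized advantage even where the teacher disagreed with it.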
Get this paper in your agent:
hf papers read 2605.10781
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
@blc_16: MIT just released a new RL method called Pedagogical RL. The main lesson -> correct reasoning traces can still be bad t…
MIT introduces Pedagogical RL, a method that trains a teacher to produce trajectories that are learnable for a student by penalizing surprising steps, improving RL training efficiency.
@lateinteraction: ICYMI: read the blog on Pedagogical RL Instead of sampling blindly from your LLM, leverage the label used for RLVR! Lea…
Introduces Pedagogical RL, a method that leverages privileged information to guide the sampling of successful trajectories for LLM reasoning, achieving up to 40% relative gains over GRPO and on-policy distillation.
Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning
Adaptive Teacher Exposure for Self-Distillation (ATESD) improves LLM reasoning by dynamically adjusting how much of the reference reasoning the teacher shows the student during training, using a learnable policy controller and a discounted learning-progress reward. Experiments on math benchmarks show consistent improvements over existing self-distillation and RL baselines.
ExpRL: Exploratory RL for LLM Mid-Training
ExpRL is a new RL-based mid-training method that uses human-written reference solutions as dense reward scaffolds (never shown to the policy) to improve LLM reasoning, achieving significant gains on hard math benchmarks like AIME-2026.
@SOURADIPCHAKR18: We describe early experiments on *pedagogical RL*: A bitter-lesson-pilled paradigm of *training* privileged self-teache…
Introduces pedagogical RL, a paradigm where privileged self-teachers are trained to generate correct and easy-to-follow rollouts, showing it is a relatively easy RL problem.