@dbreunig: Great teachers craft demonstrations their students could have built themselves.
Summary
A tweet from Souradip Chakraborty proposes using privileged information to actively sample rollouts in reinforcement learning, in contrast to typical methods that sample blindly and use privileged information only to score rollouts. @dbreunig prefaces it with the remark that great teachers craft demonstrations their students could have built themselves.
Cached at: 05/16/26, 03:21 PM
Great teachers craft demonstrations their students could have built themselves.
Souradip Chakraborty (@SOURADIPCHAKR18): 🚨Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts, but not to find them.
We ask: can we use privileged info to actively sample the rollouts RL wishes it can stumble upon with compute?
⤵️ Pedagogical RL
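The distinction the tweet draws between scoring rollouts and finding them can be illustrated with a toy sketch. This is not the authors' method: the policy here is a plain categorical distribution over candidate answers, and the function names and the `boost` parameter are hypothetical, chosen only for illustration. A blind sampler draws from the policy and consults the privileged answer only to assign a reward; a privileged sampler also tilts the sampling weights toward the answer while staying inside what the policy can already produce.

```python
import random


def score(rollout, answer):
    """Reward is 1.0 for a correct rollout, 0.0 otherwise."""
    return 1.0 if rollout == answer else 0.0


def blind_sample(policy, answer, n, rng):
    """Sample from the policy alone; the answer is used only for scoring."""
    candidates = list(policy)
    weights = [policy[c] for c in candidates]
    rollouts = rng.choices(candidates, weights=weights, k=n)
    return [(r, score(r, answer)) for r in rollouts]


def privileged_sample(policy, answer, n, rng, boost=50.0):
    """Tilt the policy toward the answer before sampling (toy assumption).

    Weights stay proportional to the policy's own probabilities, so a
    rollout the policy would never produce is still never sampled.
    """
    candidates = list(policy)
    weights = [policy[c] * (boost if c == answer else 1.0) for c in candidates]
    rollouts = rng.choices(candidates, weights=weights, k=n)
    return [(r, score(r, answer)) for r in rollouts]


def success_rate(scored):
    """Fraction of sampled rollouts that earned a reward of 1.0."""
    return sum(s for _, s in scored) / len(scored)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical policy that rarely produces the correct answer "c".
    policy = {"a": 0.97, "b": 0.02, "c": 0.01}
    print("blind:", success_rate(blind_sample(policy, "c", 2000, rng)))
    print("privileged:", success_rate(privileged_sample(policy, "c", 2000, rng)))
```

In this toy setting the blind sampler finds a rewarded rollout about as often as the policy already assigns it probability, while the tilted sampler surfaces it far more often for the same number of draws. Keeping the weights proportional to the policy's own probabilities is one simple way to stay close to rollouts the student could have produced itself.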
Similar Articles
@SOURADIPCHAKR18: We describe early experiments on *pedagogical RL*: A bitter-lesson-pilled paradigm of *training* privileged self-teache…
Introduces pedagogical RL, a paradigm where privileged self-teachers are trained to generate correct and easy-to-follow rollouts, showing it is a relatively easy RL problem.
@rronak_: Omar Khattab’s lab at MIT strikes again! Pedagogical RL - Today, RL relies on pure entropy to sample new trajectories. …
MIT researchers propose Pedagogical RL, a new reinforcement learning method that uses a teacher model with privileged information and a spike-aware learnability reward to significantly improve sample efficiency and convergence speed over existing methods like GRPO and OPSD.
@SOURADIPCHAKR18: Typical RL algorithms and on-policy distillation methods are blind samplers: they use privileged info to score rollouts…
This work proposes using privileged information to actively sample rollouts in reinforcement learning, improving on typical blind sampling methods.
@lateinteraction: Indeed. But the next breakthrough for a far more scalable RL paradigm than GRPO is already here: Train your self-teache…
Introduces Pedagogical RL, a new paradigm where models learn to be self-teachers by using privileged information to actively sample successful and easy-to-follow trajectories, achieving up to 40% relative gains over GRPO and on-policy distillation methods.
Interpretable and pedagogical examples
Research showing that iterative training of student-teacher neural networks produces interpretable teaching strategies, with the teacher learning to select or generate pedagogical examples that humans can understand and learn from effectively.