On-policy distillation: one of the hottest terms on PapersWithCode [R]

Reddit r/MachineLearning 06/04/26, 12:40 PM Papers

Summary

Niels from Hugging Face's open-source team has added On-policy Distillation (OPD) to PapersWithCode. OPD is a key post-training technique behind models such as Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, and the method page links a whiteboard explanation by Sasha Rush with Dwarkesh Patel.

Hi, Niels here from the open-source team at Hugging Face. At [paperswithcode.co](http://paperswithcode.co) I am trying to make it easier for people to learn about the newest techniques used across AI papers. One of the hottest terms in AI research that I've recently added is [On-policy distillation](https://paperswithcode.co/methods/on-policy-distillation), also abbreviated as OPD. It's the key post-training technique behind models like Qwen 3.6 and 3.7, GLM-5.1, and DeepSeek-V4.

https://preview.redd.it/yegq2gfag95h1.png?width=3046&format=png&auto=webp&s=f68fdf3ca075f3c4e56051fdd0ebcf97be9bcbc9

On PapersWithCode, you can find the original paper that introduced it, learn more about the method itself, and browse all papers that cite or mention it.

Sasha Rush (who used to be a colleague of mine at Hugging Face, now at Cursor) recently made an [excellent whiteboard explanation](https://x.com/dwarkesh_sp/status/2062353335529935114) of OPD with Dwarkesh. I've linked this video lecture in the method description on PwC's website, so more people can find it.

I'll copy Dwarkesh's short description of the method here:

"The basic idea is this: if the model made a mistake at some point in the rollout (for example, calling a tool that doesn't exist), we want to discourage this specific error, but we don't want to just learn from the final reward, because it's a very noisy signal spread out over the whole trajectory. So we have another model to read this trajectory and figure out where the error was made. It simply inserts some hint tokens into the part of the trajectory immediately above where the mistake occurred. Now, with these injected hint tokens, run a forward pass through the model. You're not having to regenerate a new rollout - aka no new decode required. The hint causes the model to assign lower probabilities to the error tokens. You then train the original model to match these new probabilities, teaching it to downweight that specific mistake."
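To make the last step of that description concrete, here is a minimal toy sketch of the per-token matching loss. It is not taken from the post, the video, or any of the named models. The logits are made-up numbers standing in for two forward passes of the same model over the same rollout tokens (one without the hint, one with it), the helper names `log_softmax`, `reverse_kl`, and `per_token_losses` are hypothetical, and the choice of reverse KL, KL(student || teacher), is an assumption: the quote only says the model is trained to "match these new probabilities".

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (direction is an assumption)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))

def per_token_losses(student_seq, teacher_seq):
    """One loss value per rollout position; no new rollout is generated."""
    return [reverse_kl(s, t) for s, t in zip(student_seq, teacher_seq)]

# Toy rollout: 3 positions, vocabulary of 3 tokens.
# "student" = model without the hint; at position 1 it strongly prefers
# token 1, which plays the role of the erroneous token.
student = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
# "teacher" = same model with hint tokens in context; here the hint is
# assumed to change only position 1, pushing probability away from token 1.
teacher = [[2.0, 0.0, 0.0], [0.0, -1.0, 2.0], [1.0, 1.0, 0.0]]

losses = per_token_losses(student, teacher)
for position, loss in enumerate(losses):
    print(f"position {position}: loss = {loss:.3f}")
```

In this toy setup the loss is zero wherever the hint leaves the distribution unchanged and is concentrated at the position of the mistake. That is the contrast with a single final reward spread over the whole trajectory.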
Let me know which other methods I should add! Cheers

Similar Articles

@NielsRogge: One of the hottest terms in AI right now is "On-policy distillation". It is a post-training technique in which a studen…

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

OPRD: On-Policy Representation Distillation
