Learning from Language Feedback via Variational Policy Distillation

Hugging Face Daily Papers 05/18/26, 12:00 AM Papers

Summary

Variational Policy Distillation (VPD) formalizes learning from language feedback as a variational EM problem, co-training teacher and student networks to improve policy learning in reinforcement learning from verifiable rewards. It shows consistent improvements over baselines on code generation and scientific reasoning tasks.

Reinforcement learning from verifiable rewards (RLVR) suffers from sparse outcome signals, creating severe exploration bottlenecks on complex reasoning tasks. Recent on-policy self-distillation methods attempt to address this by utilizing language feedback to generate dense, token-level supervision. However, these approaches rely on a fixed, passive teacher to interpret the feedback. As the student policy improves, the teacher's zero-shot assessment capabilities plateau, ultimately halting further learning. To overcome this, we propose Variational Policy Distillation (VPD), a framework that formalizes learning from language feedback as a Variational Expectation-Maximization (EM) problem. VPD co-evolves both policies: in the E-step, the teacher is actively refined on trajectory outcomes via an adaptive trust-region update, translating textual feedback into a dynamically improved target token distribution. In the M-step, the student internalizes this dense distributional guidance on its own on-policy rollouts. By continuously improving the teacher's ability to extract actionable signals from textual critique, VPD overcomes the limitations of passive distillation. Evaluated across diverse sources of diagnostic feedback on scientific reasoning and code generation tasks, VPD consistently outperforms both standard RLVR and existing self-distillation baselines. Finally, by stress-testing our framework on rigid mathematical reasoning and cold-start regimes, we illuminate the fundamental bounds of feedback-driven self-distillation compared to pure environment-driven RL.
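
For readers unfamiliar with variational EM, the generic decomposition behind the E-step/M-step language can be written as follows. This is the textbook identity, not the paper's exact objective. The mapping of symbols is an assumption made here for illustration: x is the prompt, y a response, O the event that the response is verified as correct, p_θ the student policy, and q the feedback-conditioned teacher.

```latex
\log p_\theta(O \mid x)
  = \underbrace{\mathbb{E}_{q(y)}\!\left[\log \frac{p_\theta(O, y \mid x)}{q(y)}\right]}_{\mathcal{L}(q,\,\theta)}
  + \mathrm{KL}\!\left(q(y) \,\|\, p_\theta(y \mid x, O)\right)

% E-step: improve q with \theta held fixed (tightens the bound)
q \leftarrow \arg\max_{q} \mathcal{L}(q, \theta)

% M-step: improve \theta with q held fixed
\theta \leftarrow \arg\max_{\theta} \mathcal{L}(q, \theta)
```

Because the KL term is non-negative, the first term is a lower bound on the log-likelihood of success, and alternating the two updates never decreases it. How VPD instantiates each step (the adaptive trust-region update, the distillation loss) is described in the paper itself.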

Original Article

Cached at: 05/21/26, 10:12 PM

Source: https://huggingface.co/papers/2605.15113

Variational Policy Distillation (VPD) addresses a key limitation of reinforcement learning from verifiable rewards (RLVR): the binary reward signal discards all information from near-miss failures. A coding solution that fails 1 test out of 50 gets the same reward as random noise, even though the compiler error tells you exactly what went wrong.
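
The near-miss point can be made concrete with a minimal Python sketch. The function names and the 50-test setup are hypothetical, chosen only to mirror the example in the text; they are not taken from the paper's code.

```python
def binary_reward(test_results):
    """Outcome-only reward (assumed form): 1.0 only if every test passes."""
    return 1.0 if all(test_results) else 0.0


def pass_fraction(test_results):
    """Fraction of tests passed; shown only to expose what the binary reward hides."""
    return sum(test_results) / len(test_results)


# Hypothetical rollouts: one fails a single test out of 50, the other fails all 50.
near_miss = [True] * 49 + [False]
noise = [False] * 50

for name, results in [("near_miss", near_miss), ("noise", noise)]:
    print(name, binary_reward(results), pass_fraction(results))
```

Both rollouts receive the same binary reward even though their pass fractions differ sharply. The textual feedback (which test failed and why) is the extra information that feedback-driven distillation tries to use.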

VPD formalizes learning from language feedback (compiler errors, LLM critiques, self-corrections) as a variational EM problem. Unlike prior self-distillation methods that treat the feedback-conditioned teacher as a frozen function, VPD co-trains the teacher and student in an alternating loop:

E-step: refine the teacher’s ability to interpret feedback via preference optimization
M-step: distill the improved teacher into the student on its own rollouts

Both share a single network, so there’s zero additional memory overhead.
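
To illustrate what "distill the teacher into the student" means at the token level, here is a toy sketch of the M-step only. Everything in it is an assumption for illustration: the three-token vocabulary, the fixed teacher distribution, the forward KL direction, and the learning rate. The E-step is not shown, because the post does not give enough detail to reproduce it, and the teacher here is a fixed vector rather than the shared network VPD uses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions with matching support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill_step(student_logits, teacher_probs, lr=0.5):
    """One gradient step on KL(teacher || student) with respect to student logits.

    The gradient of this loss with respect to the logits is
    (student_probs - teacher_probs), the usual cross-entropy gradient.
    """
    student_probs = softmax(student_logits)
    return [
        z - lr * (s - t)
        for z, s, t in zip(student_logits, student_probs, teacher_probs)
    ]


# Hypothetical feedback-conditioned teacher distribution over 3 candidate tokens.
teacher_probs = softmax([2.0, 0.5, -1.0])
student_logits = [0.0, 0.0, 0.0]  # student starts uniform

kl_before = kl(teacher_probs, softmax(student_logits))
for _ in range(50):
    student_logits = distill_step(student_logits, teacher_probs)
kl_after = kl(teacher_probs, softmax(student_logits))

print(kl_before, kl_after)
```

The student's next-token distribution moves toward the teacher's, which is the sense in which the supervision is dense: every token position carries a full target distribution rather than a single end-of-episode reward.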

We evaluate on 3 model families (Qwen3-4B, Qwen3-8B, Llama-3.1-8B) across code generation (LiveCodeBench) and scientific reasoning (SciKnowEval). VPD consistently improves over GRPO and self-distillation baselines, with notably more stable training dynamics. We also characterize where the approach falls short: on strict mathematical reasoning, where error feedback is less informative, standard RL remains stronger.

Happy to discuss; feedback welcome!

Similar Articles

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization

Self-Boosting Vision-Language Models with Noisy Student On-Policy Self-Distillation

KL for a KL: On-Policy Distillation with Control Variate Baseline

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

DOPD: Dual On-policy Distillation
