Learning from Language Feedback via Variational Policy Distillation
Summary
Variational Policy Distillation (VPD) formalizes learning from language feedback as a variational EM problem, co-training teacher and student networks to improve policy learning in reinforcement learning from verifiable rewards. It shows consistent improvements over baselines on code generation and scientific reasoning tasks.
Cached at: 05/21/26, 10:12 PM
Source: https://huggingface.co/papers/2605.15113
Variational Policy Distillation (VPD) addresses a key limitation of reinforcement learning from verifiable rewards (RLVR): the binary reward signal discards all information from near-miss failures. A coding solution that fails 1 test out of 50 gets the same reward as random noise, even though the compiler error tells you exactly what went wrong.
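To make the near-miss problem concrete, here is a minimal sketch of a binary verifiable reward. The `RolloutResult` type, the `binary_reward` function, and the feedback strings are hypothetical names and values chosen for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class RolloutResult:
    """Hypothetical container for the outcome of running one sampled solution."""
    passed: int    # number of unit tests that passed
    total: int     # number of unit tests run
    feedback: str  # textual output from the compiler or test runner


def binary_reward(result: RolloutResult) -> float:
    """Assumed RLVR-style reward: 1.0 only if every test passes, else 0.0."""
    return 1.0 if result.passed == result.total else 0.0


# A near miss and a hopeless attempt (illustrative values only).
near_miss = RolloutResult(passed=49, total=50, feedback="test_17: expected 42, got 41")
garbage = RolloutResult(passed=0, total=50, feedback="SyntaxError: invalid syntax")

# Both collapse to the same scalar, so the reward alone cannot tell them apart.
rewards = [binary_reward(near_miss), binary_reward(garbage)]
```

Both entries of `rewards` are `0.0`, while the two `feedback` strings differ. That difference is the information a feedback-conditioned teacher could exploit.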
VPD formalizes learning from language feedback (compiler errors, LLM critiques, self-corrections) as a variational EM problem. Unlike prior self-distillation methods that treat the feedback-conditioned teacher as a frozen function, VPD co-trains the teacher and student in an alternating loop:
- E-step: refine the teacher’s ability to interpret feedback via preference optimization
- M-step: distill the improved teacher into the student on its own rollouts
The teacher and student share a single network, so there is no additional memory overhead.
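The alternating loop can be sketched on a toy problem with three candidate answers. This is an illustration under strong assumptions, not the paper's implementation:

- The shared network is replaced by one small table of logits, with one row per role.
- The E-step uses a simple pairwise logistic preference loss as a stand-in for the paper's preference objective.
- The M-step uses an exact cross-entropy gradient toward the teacher instead of sampled student rollouts.
- All names (`params`, `e_step`, `m_step`, `CORRECT`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


CORRECT = 2  # hypothetical verifier: only answer index 2 passes

# One parameter table standing in for a single shared network:
# "student" = policy given the prompt only,
# "teacher" = the same policy additionally conditioned on feedback.
params = {"student": [0.0, 0.0, 0.0], "teacher": [0.0, 0.0, 0.0]}


def e_step(params, lr=0.5, beta=1.0):
    """Teacher update: one gradient step per (correct, rejected) pair on
    the loss -log(sigmoid(beta * (logit_correct - logit_rejected)))."""
    t = params["teacher"]
    for rejected in range(len(t)):
        if rejected == CORRECT:
            continue
        margin = beta * (t[CORRECT] - t[rejected])
        step = lr * beta * (1.0 - sigmoid(margin))
        t[CORRECT] += step
        t[rejected] -= step


def m_step(params, lr=0.5):
    """Student update: one gradient step on the cross-entropy between the
    teacher distribution (held fixed) and the student distribution."""
    s = params["student"]
    p_student = softmax(s)
    p_teacher = softmax(params["teacher"])
    for i in range(len(s)):
        s[i] -= lr * (p_student[i] - p_teacher[i])


def train(params, rounds=20):
    """Alternate E- and M-steps; record the student's probability of the correct answer."""
    history = []
    for _ in range(rounds):
        e_step(params)
        m_step(params)
        history.append(softmax(params["student"])[CORRECT])
    return history


history = train(params)
```

In this toy setting the student starts uniform (probability 1/3 on the correct answer) and `history` rises as the teacher sharpens. A real implementation would run both roles through the same weights with different prompts and would distill on sampled student rollouts.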
We evaluate on 3 model families (Qwen3-4B, Qwen3-8B, Llama-3.1-8B) across code generation (LiveCodeBench) and scientific reasoning (SciKnowEval). VPD consistently improves over GRPO and self-distillation baselines, with notably more stable training dynamics. We also characterize where the approach falls short: on strict mathematical reasoning, where error feedback is less informative, standard RL remains stronger.
Happy to discuss; feedback welcome!
Similar Articles
KL for a KL: On-Policy Distillation with Control Variate Baseline
Proposes vOPD, which stabilizes on-policy distillation for LLMs by introducing a control variate baseline from reinforcement learning, achieving performance comparable to expensive full-vocabulary methods at lower computational cost.
The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
OPRD: On-Policy Representation Distillation
OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
On-Policy Distillation (5 minute read)
This paper introduces on-policy distillation, which trains a student model on its own trajectories with teacher token-level KL supervision to fix train-inference mismatch, unifying forward-KL, reverse-KL, and JSD losses, with reverse-KL favored for smaller students.