OPRD: On-Policy Representation Distillation
Summary
OPRD is a knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, which eliminates the sampling variance of token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
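To make the idea of layer-wise hidden-state alignment concrete, here is a minimal sketch in plain Python. It is not the OPRD implementation: the summary does not state which distance is used, how layers are paired, or how mismatched hidden sizes are handled. The cosine distance, the `layer_pairs` argument, the equal vector sizes, and the function names are all assumptions made for illustration. The sketch only shows that, once the student has generated a sequence, the loss is a deterministic function of the two models' hidden states, with no sampling over the vocabulary.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cos(u, v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as maximally misaligned.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def representation_alignment_loss(student_hidden, teacher_hidden, layer_pairs):
    """Average cosine distance over paired layers and token positions.

    student_hidden[l][t] and teacher_hidden[l][t] are the hidden vectors of
    layer l at token position t, both computed on the same
    student-generated (on-policy) sequence.
    layer_pairs is a list of (student_layer, teacher_layer) index pairs
    (a hypothetical pairing scheme, not taken from the paper).
    """
    total, count = 0.0, 0
    for s_layer, t_layer in layer_pairs:
        for s_vec, t_vec in zip(student_hidden[s_layer], teacher_hidden[t_layer]):
            total += cosine_distance(s_vec, t_vec)
            count += 1
    return total / count if count else 0.0


# Toy example: one layer, two token positions, hidden size 2.
student = [[[1.0, 0.0], [0.0, 1.0]]]
teacher = [[[1.0, 0.0], [1.0, 0.0]]]
loss = representation_alignment_loss(student, teacher, [(0, 0)])
```

In the toy example the first position is perfectly aligned and the second is orthogonal, so the averaged loss sits halfway between the two extremes. In a real setup the vectors would come from the models' forward passes, and a learned projection would likely be needed when student and teacher hidden sizes differ.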
Similar Articles
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to +28.64% improvement on math benchmarks over standard OPD.
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
On-Policy Distillation
This paper introduces on-policy distillation, which trains a student model on its own trajectories under token-level KL supervision from a teacher to fix the train-inference mismatch. It unifies forward-KL, reverse-KL, and JSD losses, and favors reverse-KL for smaller students.
Reasoning Compression with Mixed-Policy Distillation
This paper proposes Mixed-Policy Distillation (MPD), a framework that transfers concise reasoning behaviors from large teacher models to smaller student models, reducing token usage by up to 27.1% while improving performance.
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning
This paper identifies that teacher token reliability in reasoning distillation is trajectory-structured and proposes Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies increasing position weights to improve performance without additional teacher computation.
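As a rough illustration of position weighting in token-level distillation, the sketch below weights a per-token loss by a schedule that increases with position. It is not the PW-OPSD method itself: the summary gives neither the weight schedule nor the per-token loss, so the linear schedule, the `w_min`/`w_max` defaults, the use of reverse KL, and all function names are assumptions chosen for illustration.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) for a single token position."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def linear_position_weights(num_tokens, w_min=0.5, w_max=1.5):
    """Weights that grow linearly from w_min to w_max (assumed schedule)."""
    if num_tokens == 0:
        return []
    if num_tokens == 1:
        return [w_max]
    step = (w_max - w_min) / (num_tokens - 1)
    return [w_min + i * step for i in range(num_tokens)]


def position_weighted_loss(student_dists, teacher_dists, w_min=0.5, w_max=1.5):
    """Weighted average of per-token reverse KL along one trajectory."""
    if not student_dists:
        return 0.0
    weights = linear_position_weights(len(student_dists), w_min, w_max)
    per_token = [reverse_kl(p, q) for p, q in zip(student_dists, teacher_dists)]
    return sum(w * l for w, l in zip(weights, per_token)) / sum(weights)


# Toy example: the same disagreement placed early versus late in the sequence.
agree = [0.5, 0.5]
differ = [0.9, 0.1]
early = position_weighted_loss([agree, agree], [differ, agree])
late = position_weighted_loss([agree, agree], [agree, differ])
```

With an increasing schedule, the same student-teacher disagreement contributes more to the loss when it occurs late in the trajectory than when it occurs early, so `late` exceeds `early` here. The weights depend only on position, so this adds no teacher computation beyond what the unweighted loss already needs.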