Draft-OPD: On-Policy Distillation for Speculative Draft Models
Summary
Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.
Cached at: 06/02/26, 03:23 AM
Source: https://huggingface.co/papers/2605.29343
Abstract
Speculative decoding uses a lightweight draft model to accelerate large language model inference, but supervised fine-tuning of the draft model plateaus because of an offline-to-inference mismatch. Draft-OPD addresses this with on-policy distillation that combines target-assisted rollouts and error replay.
Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model’s acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models, as they cannot reliably roll out complete sequences independently, whereas target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
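To make the verification step and the idea of a verification-exposed error position concrete, here is a minimal Python sketch. It is not the paper's implementation: it assumes greedy (argmax) verification, uses toy deterministic functions in place of real models, and every name in it (`draft_block`, `verify_block`, `collect_replay_prefixes`, `toy_target`, `toy_draft`) is hypothetical. It only illustrates one plausible reading of the abstract: a target-assisted rollout keeps the sequence on the target's path, while the prefixes at which the draft was rejected are recorded so drafting could later be replayed from them.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a prefix to its greedy next token


def draft_block(draft: NextToken, prefix: List[Token], k: int) -> List[Token]:
    """Let the draft model propose k tokens autoregressively under its own policy."""
    ctx = list(prefix)
    block = []
    for _ in range(k):
        tok = draft(ctx)
        block.append(tok)
        ctx.append(tok)
    return block


def verify_block(target: NextToken, prefix: List[Token],
                 block: List[Token]) -> Tuple[int, Token]:
    """Greedy verification: accept draft tokens until the first disagreement.

    Returns (number of accepted draft tokens, target token emitted next).
    A real system scores all block positions in one parallel target pass;
    the sequential loop here is only for readability.
    """
    ctx = list(prefix)
    for i, tok in enumerate(block):
        expected = target(ctx)
        if tok != expected:
            return i, expected  # position i is the verification-exposed error
        ctx.append(tok)
    return len(block), target(ctx)  # all accepted; target adds one more token


def collect_replay_prefixes(target: NextToken, draft: NextToken,
                            prompt: List[Token], k: int,
                            max_len: int) -> Tuple[List[Token], List[List[Token]]]:
    """Target-assisted rollout that records the prefixes where the draft erred."""
    seq = list(prompt)
    replay_prefixes = []
    while len(seq) < max_len:
        block = draft_block(draft, seq, k)
        accepted, next_tok = verify_block(target, seq, block)
        seq.extend(block[:accepted])
        if accepted < len(block):
            replay_prefixes.append(list(seq))  # state at which the draft was rejected
        seq.append(next_tok)  # the target's token keeps the rollout on its path
    return seq[:max_len], replay_prefixes


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10  # the target counts upward modulo 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong only after token 3


sequence, replay = collect_replay_prefixes(toy_target, toy_draft, [0], k=4, max_len=9)
print(sequence)  # matches the target's own greedy output
print(replay)    # prefixes from which drafting could be replayed
```

In this toy run the final sequence is identical to what the target would generate alone, which is the sense in which speculative decoding is lossless, and the only recorded replay prefix is the one ending in token 3, where the toy draft disagrees with the target. How Draft-OPD actually samples, replays, and weights these positions in its loss is not specified in the abstract and is not modeled here.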
Get this paper in your agent:
hf papers read 2605.29343
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 5
bingyang-lei/Qwen3-4B-Ins-Draft-OPD • 0.5B • Updated 4 days ago • 52 • 1
bingyang-lei/Qwen3-8B-Thinking-Draft-OPD • 1B • Updated 3 days ago • 13
bingyang-lei/Qwen3-4B-Thinking-Draft-OPD • 0.5B • Updated 3 days ago • 15
bingyang-lei/Qwen3-30B-A3B-Thinking-2507-Draft-OPD • 0.5B • Updated 4 days ago • 26
Similar Articles
Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to +28.64% improvement on math benchmarks over standard OPD.
DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models
DiffusionOPD proposes a multi-task training paradigm for diffusion models that uses online policy distillation to efficiently combine task-specific teachers into a unified student, achieving state-of-the-art results on all evaluated benchmarks.
OPRD: On-Policy Representation Distillation
OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.
@louieworth: New blog post: On-Policy Distillation — Promise, Pitfalls, and Prospects. OPD combines on-policy rollouts with dense te…
This blog post discusses On-Policy Distillation (OPD), a technique that combines on-policy rollouts with dense teacher supervision, and highlights its promise, three failure modes, and the author's new paper on the topic.