Draft-OPD: On-Policy Distillation for Speculative Draft Models

Hugging Face Daily Papers

Summary

Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models such as EAGLE-3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5x lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
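
To make the draft-and-verify loop and the notion of acceptance length concrete, here is a minimal Python sketch of greedy speculative decoding. It is not taken from the paper: `target_next` and `draft_next` are hypothetical toy models over integer tokens, and verification is written as a sequential loop for readability, whereas a real system scores the whole block in one parallel target pass.

```python
from typing import List, Tuple

# Hypothetical toy models (assumptions for illustration only).
def target_next(seq: List[int]) -> int:
    """Target model: greedy next token, counting upward modulo 10."""
    return (seq[-1] + 1) % 10

def draft_next(seq: List[int]) -> int:
    """Draft model: agrees with the target except right after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

def verify_block(prefix: List[int], block: List[int]) -> Tuple[List[int], int]:
    """Accept the longest block prefix the target agrees with.

    Returns the accepted tokens plus one target token: the correction at
    the first mismatch, or a bonus token if the whole block was accepted.
    """
    accepted: List[int] = []
    for tok in block:
        expected = target_next(prefix + accepted)
        if tok != expected:
            return accepted, expected
        accepted.append(tok)
    return accepted, target_next(prefix + accepted)

def speculative_decode(prefix: List[int], block_size: int,
                       max_new_tokens: int) -> Tuple[List[int], List[int]]:
    out = list(prefix)
    accept_lengths: List[int] = []  # tokens gained per verification step
    while len(out) - len(prefix) < max_new_tokens:
        block: List[int] = []
        for _ in range(block_size):          # drafter proposes under its own policy
            block.append(draft_next(out + block))
        accepted, extra = verify_block(out, block)
        out.extend(accepted + [extra])
        accept_lengths.append(len(accepted) + 1)
    return out[:len(prefix) + max_new_tokens], accept_lengths

tokens, lengths = speculative_decode([2], block_size=4, max_new_tokens=8)
print(tokens, lengths)
```

In this toy run the first block is cut short at the drafter's mistake after token 4, while the output still matches what the target alone would generate, which is the sense in which the acceleration is lossless.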

Cached at: 06/02/26, 03:23 AM

Source: https://huggingface.co/papers/2605.29343

Abstract

Speculative decoding uses a lightweight draft model to accelerate large language model inference, but supervised fine-tuning plateaus due to offline-to-inference mismatch, which is addressed through on-policy distillation with target-assisted rollouts and error replay.

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models such as EAGLE-3 or DFlash is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences on their own, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5x lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
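
One loose reading of "target-assisted rollout with error replay" can be sketched in Python as follows. This is an illustrative assumption based only on the abstract, not the authors' algorithm: the toy models `target_next` and `draft_next`, the `Sample` record, and `collect_samples` are all hypothetical names. The rollout advances along target-verified tokens, while every position of each drafted block, including those after the first rejection, is recorded together with the target's choice in that draft-induced context.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    context: List[int]   # verified prefix plus the drafter's own earlier proposals
    draft_token: int     # what the drafter proposed in this context
    target_token: int    # the target's greedy choice in the same context
    accepted: bool       # True only inside the accepted prefix of the block

# Hypothetical toy models (assumptions for illustration only).
def target_next(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10

def draft_next(seq: List[int]) -> int:
    # Disagrees with the target only right after token 4.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

def collect_samples(prefix: List[int], block_size: int,
                    max_new_tokens: int) -> Tuple[List[int], List[Sample]]:
    out = list(prefix)
    samples: List[Sample] = []
    while len(out) - len(prefix) < max_new_tokens:
        # Drafter proposes a block under its own policy.
        block: List[int] = []
        for _ in range(block_size):
            block.append(draft_next(out + block))

        # Target feedback on every drafted position, accepted or rejected.
        still_accepted = True
        n_accepted = 0
        for j, tok in enumerate(block):
            ctx = out + block[:j]            # draft-induced state
            tgt = target_next(ctx)
            still_accepted = still_accepted and tok == tgt
            if still_accepted:
                n_accepted += 1
            samples.append(Sample(ctx, tok, tgt, still_accepted))

        # Target-assisted continuation: keep accepted tokens, then let the
        # target supply the next token so the rollout stays stable.
        out.extend(block[:n_accepted])
        out.append(target_next(out))
    return out, samples

rollout, samples = collect_samples([2], block_size=4, max_new_tokens=8)
for s in samples:
    print(s)
```

In this toy run, the sample recorded after the first rejection has a context containing the drafter's own wrong token, a state that never appears in target-generated trajectories; such states are the kind of draft-induced supervision the abstract argues SFT lacks.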

Get this paper in your agent:

hf papers read 2605.29343

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 5

bingyang-lei/Qwen3-4B-Ins-Draft-OPD 0.5B • Updated 4 days ago • 52 • 1
bingyang-lei/Qwen3-8B-Thinking-Draft-OPD 1B • Updated 3 days ago • 13
bingyang-lei/Qwen3-4B-Thinking-Draft-OPD 0.5B • Updated 3 days ago • 15
bingyang-lei/Qwen3-30B-A3B-Thinking-2507-Draft-OPD 0.5B • Updated 4 days ago • 26

Datasets, Spaces, and Collections linking this paper: 0

Similar Articles

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

arXiv cs.CL

This paper investigates the parameter-level mechanisms behind the efficiency of On-Policy Distillation (OPD) for large language models, attributing it to early 'foresight' in module allocation and update direction. It proposes EffOPD, a plug-and-play method that accelerates OPD training by 3x without compromising final performance.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

Hugging Face Daily Papers

OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to +28.64% improvement on math benchmarks over standard OPD.

OPRD: On-Policy Representation Distillation

Hugging Face Daily Papers

OPRD proposes a new knowledge distillation method that aligns student and teacher hidden states across layers during on-policy rollouts, eliminating sampling variance from token-space KL estimation. Empirically, OPRD outperforms output-space baselines on math reasoning benchmarks (AIME 2024/2025, AIMO) while being 1.44x faster and using 54% less memory.