Draft-OPD: On-Policy Distillation for Speculative Draft Models

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

Summary

Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model's acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences independently, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
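To make the draft-then-verify loop and the notion of acceptance length concrete, here is a minimal sketch of greedy speculative decoding on toy integer "models". Everything in it is an assumption made for illustration: the names `draft_block`, `verify_block`, `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, the verification is written as a sequential loop where a real system would use one parallel target forward pass, and acceptance length is counted here as accepted draft tokens per step (some papers also count the extra target token).

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]  # greedy next-token function


def draft_block(draft: NextToken, prefix: Sequence[Token], k: int) -> List[Token]:
    """Let the draft model propose k tokens autoregressively."""
    ctx = list(prefix)
    block: List[Token] = []
    for _ in range(k):
        tok = draft(ctx)
        block.append(tok)
        ctx.append(tok)
    return block


def verify_block(
    target: NextToken, prefix: Sequence[Token], block: Sequence[Token]
) -> Tuple[List[Token], int]:
    """Greedy verification: keep the longest matching prefix of the block.

    Returns (tokens emitted this step, number of accepted draft tokens).
    On a mismatch the target's own token replaces the rejected one; if the
    whole block is accepted, the target contributes one extra token.
    """
    ctx = list(prefix)
    emitted: List[Token] = []
    for tok in block:
        expected = target(ctx)
        if tok != expected:
            emitted.append(expected)  # target correction ends the step
            return emitted, len(emitted) - 1
        emitted.append(tok)
        ctx.append(tok)
    emitted.append(target(ctx))  # bonus token after a fully accepted block
    return emitted, len(block)


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: Sequence[Token], max_new: int, k: int
) -> Tuple[List[Token], List[int]]:
    """Run draft/verify steps until max_new tokens have been generated."""
    seq = list(prompt)
    accepted_per_step: List[int] = []
    while len(seq) - len(prompt) < max_new:
        block = draft_block(draft, seq, k)
        emitted, n_accepted = verify_block(target, seq, block)
        seq.extend(emitted)
        accepted_per_step.append(n_accepted)
    return seq[: len(prompt) + max_new], accepted_per_step


def toy_target(ctx: Sequence[Token]) -> Token:
    """Hypothetical target: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except after token 3."""
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


tokens, accepted = speculative_decode(toy_target, toy_draft, [0], max_new=8, k=4)
print(tokens)    # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(accepted)  # [3, 4]
```

The output matches what the toy target would produce on its own, which is the sense in which the speedup is lossless; the drafter only affects how many tokens each verification step yields.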

Original Article

Cached at: 06/02/26, 03:23 AM

Source: https://huggingface.co/papers/2605.29343

Abstract

Speculative decoding uses a lightweight draft model to accelerate large language model inference, but supervised fine-tuning plateaus due to offline-to-inference mismatch, which is addressed through on-policy distillation with target-assisted rollouts and error replay.

Speculative decoding accelerates large language model inference by pairing a target model with a lightweight draft model whose proposed tokens are verified in parallel. A common way to build draft models, such as EAGLE-3 or DFlash, is supervised fine-tuning (SFT) on target-generated trajectories. However, we observe that SFT quickly plateaus: the draft model’s acceptance length on test data stops improving. The reason is an offline-to-inference mismatch: in SFT, the drafter learns from fixed target-generated trajectories, whereas during speculative decoding it is evaluated on blocks proposed under its own policy. This motivates on-policy distillation (OPD), where the target model supervises the drafter on draft-induced states. Yet OPD remains difficult for draft models: they cannot reliably roll out complete sequences independently, while target-assisted generation makes the collected sequences follow the target distribution and thus eliminates the on-policy signal. We therefore propose Draft-OPD, which uses target-assisted rollout for stable continuations and replays drafting from the verification-exposed error positions. This allows the drafter to learn from target feedback on both accepted and rejected proposals, focusing training on the draft-induced errors that limit speculative acceptance. Experiments show that Draft-OPD achieves over 5× lossless acceleration for thinking models across diverse tasks, improving over EAGLE-3 and DFlash by 23% and 13%, respectively.
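The abstract does not spell out how target-assisted rollout and error replay are implemented, so the following is only one possible reading, sketched on toy integer "models". It assumes a two-phase scheme: a rollout in which the target corrects every rejected draft token and the rejection positions are recorded, followed by a replay in which the drafter drafts again from each recorded position under its own policy while the target labels every state it visits. All names (`rollout_with_errors`, `replay_from_errors`, `Example`, `toy_target`, `toy_draft`) are hypothetical, and a real system would collect target distributions for a distillation loss where this sketch collects greedy tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Token = int
NextToken = Callable[[Sequence[Token]], Token]  # greedy next-token function


@dataclass
class Example:
    context: List[Token]  # draft-induced state (may contain drafter mistakes)
    draft_token: Token    # what the drafter proposed in that state
    target_token: Token   # what the target would emit in that state
    agrees: bool          # True if the proposal would have been accepted


def rollout_with_errors(
    target: NextToken, draft: NextToken, prompt: Sequence[Token], max_new: int, k: int
) -> Tuple[List[Token], List[int]]:
    """Target-assisted rollout that records where verification rejected a draft."""
    seq = list(prompt)
    error_positions: List[int] = []
    limit = len(prompt) + max_new
    while len(seq) < limit:
        ctx = list(seq)
        block: List[Token] = []
        for _ in range(k):  # drafter proposes a block under its own policy
            block.append(draft(ctx))
            ctx.append(block[-1])
        for tok in block:  # target verifies; the sequence follows the target
            expected = target(seq)
            if tok != expected:
                error_positions.append(len(seq))  # index of the rejected token
                seq.append(expected)
                break
            seq.append(tok)
        else:
            seq.append(target(seq))  # whole block accepted: add a bonus token
    return seq[:limit], [p for p in error_positions if p < limit]


def replay_from_errors(
    target: NextToken, draft: NextToken, seq: Sequence[Token],
    error_positions: Sequence[int], k: int,
) -> List[Example]:
    """Re-draft from each error position and label draft-induced states."""
    examples: List[Example] = []
    for pos in error_positions:
        ctx = list(seq[:pos])
        for _ in range(k):
            d, t = draft(ctx), target(ctx)
            examples.append(Example(list(ctx), d, t, d == t))
            ctx.append(d)  # follow the DRAFT token, not the target's
    return examples


def toy_target(ctx: Sequence[Token]) -> Token:
    """Hypothetical target: count upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: Sequence[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except after token 3."""
    return 9 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


seq, errors = rollout_with_errors(toy_target, toy_draft, [0], max_new=8, k=4)
data = replay_from_errors(toy_target, toy_draft, seq, errors, k=4)
print(seq, errors)                 # [0, 1, 2, 3, 4, 5, 6, 7, 8] [4]
print([ex.agrees for ex in data])  # [False, True, True, True]
```

In this sketch the rollout sequence never contains a drafter mistake, while the replayed contexts do (the second example's context ends in the wrong token 9), which is how the collected data can cover both rejected and accepted proposals on states the drafter itself produced.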

Get this paper in your agent:

hf papers read 2605.29343

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 5

bingyang-lei/Qwen3-4B-Ins-Draft-OPD • 0.5B • Updated 4 days ago • 52 • 1

bingyang-lei/Qwen3-8B-Thinking-Draft-OPD • 1B • Updated 3 days ago • 13

bingyang-lei/Qwen3-4B-Thinking-Draft-OPD • 0.5B • Updated 3 days ago • 15

bingyang-lei/Qwen3-30B-A3B-Thinking-2507-Draft-OPD • 0.5B • Updated 4 days ago • 26

Datasets citing this paper: 0

Spaces citing this paper: 0

Collections including this paper: 0

Similar Articles

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models

OPRD: On-Policy Representation Distillation

@louieworth: New blog post: On-Policy Distillation — Promise, Pitfalls, and Prospects. OPD combines on-policy rollouts with dense te…
