D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

arXiv cs.LG Papers

Summary

This paper introduces D-PACE, a dynamic position-aware cross-entropy loss for training speculative-decoding drafters. The loss reweights block positions adaptively during training to increase accepted length and inference speed, and the authors report consistent wall-clock speedups across six benchmarks with 2.3% measured training-time overhead.

arXiv:2605.18810v1

Abstract: Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.
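
To make the weighting idea in the abstract concrete, here is a minimal sketch under stated assumptions. The abstract does not give the surrogate's form, so this example assumes the simplest one: if position j of a block is accepted with probability p_j and acceptance stops at the first rejection, the expected accepted length is E[L] = sum over k of (p_1 · … · p_k). Under that assumption, the gradient of E[L] with respect to log p_i is the sum of all prefix products that include position i, and that quantity is used as the weight of position i. The function names `position_weights` and `weighted_cross_entropy` are hypothetical and are not taken from the paper; the authors' actual surrogate and normalization may differ.

```python
import math


def position_weights(accept_probs):
    """Per-position weights w_i = d E[L] / d log p_i.

    Assumes (not confirmed by the source) the surrogate
    E[L] = sum_k prod_{j<=k} p_j, i.e. acceptance stops at the first rejection.
    """
    # prefix[k] = probability that the first k+1 draft tokens are all accepted.
    prefix = []
    running = 1.0
    for p in accept_probs:
        running *= p
        prefix.append(running)

    # Position i appears in every prefix product of length >= i+1,
    # so its weight is the suffix sum of the prefix products.
    weights = [0.0] * len(prefix)
    tail = 0.0
    for i in range(len(prefix) - 1, -1, -1):
        tail += prefix[i]
        weights[i] = tail
    return weights


def weighted_cross_entropy(target_log_probs, weights):
    """Position-weighted cross-entropy for one block.

    target_log_probs[i] is the drafter's log-probability of the target token
    at position i. Weights are treated as constants here; in an autograd
    framework one would presumably detach them (an assumption).
    """
    total = sum(weights)
    return -sum(w * lp for w, lp in zip(weights, target_log_probs)) / total


if __name__ == "__main__":
    probs = [0.9, 0.5, 0.8]  # illustrative acceptance probabilities
    w = position_weights(probs)
    loss = weighted_cross_entropy([math.log(p) for p in probs], w)
    print(w, loss)
```

Because the weights are recomputed from the current acceptance probabilities, they change as the drafter improves. In this assumed surrogate, raising the probabilities at early positions increases the relative weight on later ones, in contrast to a fixed decay schedule.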

Cached at: 05/20/26, 08:38 AM

# D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
Source: [https://arxiv.org/abs/2605.18810](https://arxiv.org/abs/2605.18810)
[View PDF](https://arxiv.org/pdf/2605.18810) | [HTML (experimental)](https://arxiv.org/html/2605.18810v1)

## Submission history

From: Ju Li

**[v1]** Tue, 12 May 2026 06:27:57 UTC (152 KB)

Similar Articles

Draft-OPD: On-Policy Distillation for Speculative Draft Models

Hugging Face Daily Papers

Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.