D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
Summary
This paper introduces D-PACE, a dynamic position-aware cross-entropy loss for training speculative decoding drafters that adaptively weights positions to improve acceptance length and inference speed, achieving consistent wall-clock speedups across benchmarks with minimal overhead.
Cached at: 05/20/26, 08:38 AM
# D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

Source: [https://arxiv.org/abs/2605.18810](https://arxiv.org/abs/2605.18810)

[View PDF](https://arxiv.org/pdf/2605.18810) [HTML (experimental)](https://arxiv.org/html/2605.18810v1)

> Abstract: Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3% measured training-time overhead and no changes to the drafter architecture or inference procedure.

## Submission history

From: Ju Li \[[view email](https://arxiv.org/show-email/2c755cb7/2605.18810)\]

**\[v1\]** Tue, 12 May 2026 06:27:57 UTC (152 KB)
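The weighting idea in the abstract can be illustrated with a small sketch. It is not the paper's formulation, and the exact surrogate used by D-PACE may differ. The sketch assumes each block position i has an independent acceptance probability p_i, so the expected accepted length is E[L] = sum over k of (p_1 · ... · p_k), and the gradient of E[L] with respect to log p_j is the sum of those prefix products for k ≥ j. The function names, the normalization step, and the choice to treat the weights as constants in the loss are all hypothetical choices made for illustration.

```python
def expected_accepted_length(accept_probs):
    """Surrogate E[L] = sum_k prod_{i<=k} p_i (assumes independent positions)."""
    total, prefix = 0.0, 1.0
    for p in accept_probs:
        prefix *= p          # probability that positions 1..k are all accepted
        total += prefix
    return total


def position_weights(accept_probs, normalize=True):
    """Weight for position j = dE[L]/d(log p_j) = sum_{k>=j} prod_{i<=k} p_i."""
    prefixes, prefix = [], 1.0
    for p in accept_probs:
        prefix *= p
        prefixes.append(prefix)

    # Suffix sums of the prefix products give the per-position gradient terms.
    weights, suffix = [], 0.0
    for value in reversed(prefixes):
        suffix += value
        weights.append(suffix)
    weights.reverse()

    if normalize:  # assumption: rescale so the weights sum to 1
        total = sum(weights)
        if total > 0:
            weights = [w / total for w in weights]
    return weights


def weighted_cross_entropy(token_log_probs, weights):
    """Position-weighted negative log-likelihood; weights are held constant."""
    return -sum(w * lp for w, lp in zip(weights, token_log_probs))


if __name__ == "__main__":
    weak = position_weights([0.5, 0.5, 0.5])
    strong = position_weights([0.9, 0.9, 0.9])
    print("weak drafter weights:  ", [round(w, 3) for w in weak])
    print("strong drafter weights:", [round(w, 3) for w in strong])
```

Under this surrogate, raising the acceptance probabilities of early positions increases the relative weight on later positions, which is one way a loss could shift training signal as the drafter improves. In a real training loop the weights would typically be recomputed from the current drafter's probabilities at each step.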
Similar Articles
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
This paper introduces PARD-2, a dual-mode speculative decoding framework that uses target-aligned parallel draft models to accelerate LLM inference, achieving up to 6.94x lossless acceleration on Llama 3.1-8B.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing
Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.
Draft-OPD: On-Policy Distillation for Speculative Draft Models
Draft-OPD introduces on-policy distillation with target-assisted rollouts and error replay to overcome the offline-to-inference mismatch in training draft models for speculative decoding, achieving over 5x lossless acceleration and improving upon EAGLE-3 and DFlash by 23% and 13% respectively.
BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
BudgetDraft proposes a multi-view training method for speculative decoding that aligns a sparse-KV drafter with a full-KV verifier, achieving significant speedups for mid-to-long context inference.
AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding
AdaPLD is a training-free method that improves model-free speculative decoding by using adaptive retrieval combining lexical and semantic similarity, and constructing branched reuse hypotheses to handle continuation uncertainty, achieving up to 3.10x decoding speedup.