PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
Summary
This paper introduces PARD-2, a dual-mode speculative decoding framework that uses target-aligned parallel draft models to accelerate LLM inference, achieving up to 6.94x lossless acceleration on Llama 3.1-8B.
Cached at: 05/12/26, 06:58 AM
# PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
Source: [https://arxiv.org/html/2605.08632](https://arxiv.org/html/2605.08632)
Zihao An¹, Taichi Liu¹,², Ziqiong Liu¹, Dong Li¹, Ruofeng Liu³, Emad Barsoum¹
¹Advanced Micro Devices, Inc. ²Rutgers University ³Michigan State University
{Zihao.An, Taichi.Liu, Ziqiong.Liu, d.li, Emad.Barsoum}@amd.com, liuruofe@msu.edu
###### Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by using a lightweight draft model to propose candidate tokens that are verified in parallel by the target model. However, existing draft model training objectives are not directly aligned with the inference-time goal of maximizing consecutive token acceptance. To address this issue, we reformulate the draft model optimization objective, shifting the focus from token prediction accuracy to the overall acceptance length. In this paper, we build upon PARD to propose PARD-2, a dual-mode speculative decoding framework with Confidence-Adaptive Token (CAT) optimization. This approach adaptively reweights each token to better align with the verification process. Notably, PARD-2 enables a single draft model to support both target-dependent and target-independent modes. Experiments across diverse models and tasks demonstrate that PARD-2 achieves up to 6.94× lossless acceleration, surpassing EAGLE-3 by 1.9× and PARD by 1.3× on Llama3.1-8B. Our code is available at [https://github.com/AMD-AGI/PARD](https://github.com/AMD-AGI/PARD).
## 1 Introduction
As Large Language Models (LLMs) continue to advance, their strong performance has been accompanied by a rapid increase in model scale. While this scaling has led to remarkable capability gains, it also makes auto-regressive decoding increasingly expensive at inference time.
Speculative Decoding (SD) \[[17](https://arxiv.org/html/2605.08632#bib.bib444)\] has recently emerged as an effective approach to reducing LLM inference latency. SD uses a lightweight draft model to propose multiple candidate tokens, which are then verified in parallel by the target model. A promising line of work trains lightweight auto-regressive drafters conditioned on target-model features, including Medusa \[[5](https://arxiv.org/html/2605.08632#bib.bib458)\], Hydra \[[2](https://arxiv.org/html/2605.08632#bib.bib459)\], and EAGLE-3 \[[21](https://arxiv.org/html/2605.08632#bib.bib543)\], which achieve strong performance. However, sequential drafting still requires multiple forward passes, resulting in non-negligible latency \[[1](https://arxiv.org/html/2605.08632#bib.bib552),[30](https://arxiv.org/html/2605.08632#bib.bib76)\]. To reduce drafting latency further, recent works explore parallel drafting: ParallelSpec \[[30](https://arxiv.org/html/2605.08632#bib.bib76)\] trains a parallel drafter to generate multiple tokens in a single forward pass, PARD \[[1](https://arxiv.org/html/2605.08632#bib.bib552)\] adapts small auto-regressive models for parallel masked-token prediction, and DFlash \[[7](https://arxiv.org/html/2605.08632#bib.bib545)\] employs a small block diffusion model to generate draft tokens in parallel.
Figure 1: Throughput and latency trade-offs on vLLM. PARD-2 consistently achieves a superior Pareto frontier across batch sizes from 1 to 64 on both (a) Llama3.1-8B and (b) Qwen3-8B.
However, a common assumption underlying speculative decoding is that all draft positions should be weighted equally during training, which is suboptimal for training convergence and acceptance length \[[23](https://arxiv.org/html/2605.08632#bib.bib553),[7](https://arxiv.org/html/2605.08632#bib.bib545)\]. Unlike standard language modeling, where the objective is to improve token prediction accuracy uniformly, speculative decoding is ultimately concerned with how many drafted tokens the target model accepts. Our experiments reveal a positional bias in parallel speculative decoding: as illustrated in Figure [2](https://arxiv.org/html/2605.08632#S2.F2)(a), tokens at later draft positions exhibit consistently lower acceptance rates. As the draft length increases, the acceptance rate becomes harder to sustain, limiting the practical speedup that parallel drafting can provide. This observation suggests an *inherent limitation of uniformly optimizing all positions*. While recent approaches such as DFlash \[[7](https://arxiv.org/html/2605.08632#bib.bib545)\] and DART \[[23](https://arxiv.org/html/2605.08632#bib.bib553)\] mitigate this issue with position-aware decaying weights, their weights are fixed and primarily position-dependent. We observe that a token's acceptance is determined not solely by the accuracy of the current token but is also heavily bottlenecked by the quality of the entire prefix; that is, acceptance is jointly determined by the current token and its prefix context. Therefore, an approach that considers both factors provides a more effective way to improve acceptance length and decoding efficiency.
In this paper, we introduce PARD\-2, a dual\-mode speculative decoding framework to mitigate the degradation in acceptance rate\. We propose Confidence\-Adaptive Token \(CAT\) optimization, which assigns token\-level, context\-dependent confidence scores to better align the training objective with the inference\-time goal of maximizing consecutive token acceptance in speculative decoding\. Specifically, CAT dynamically reweights token\-level objectives based on a context\-dependent confidence score, which is computed as the cumulative product of the target model’s confidence across all preceding tokens in the prefix\. This design encourages the drafter to maximize the expected acceptance length\.
In addition to optimizing acceptance length, PARD\-2 further addresses the target dependency of existing speculative decoding methods\. Most speculative decoding methods are target\-dependent\[[21](https://arxiv.org/html/2605.08632#bib.bib543),[7](https://arxiv.org/html/2605.08632#bib.bib545)\], requiring training a new draft model from scratch for each target model\. Building upon PARD, PARD\-2 is the first to enable a single draft model to dynamically switch between target\-dependent and target\-independent modes during inference\. Unlike EAGLE\-3 and DFlash, which require grafted layers, PARD\-2 maintains a standalone architecture, achieving this flexibility without structural overhead\. It applies stochastic gating to control the injection of target hidden states during training\. As a result, the same draft model can operate in a target\-dependent mode for maximum acceleration, while also supporting a target\-independent mode that generalizes across a family of target models\.
To summarize, our key contributions include:
- We propose PARD-2, a dual-mode speculative decoding framework that supports both target-dependent and target-independent settings. To the best of our knowledge, this is the first work to unify these paradigms within a single draft model. Stochastic gating injects target hidden states during training, enabling peak acceleration via target-dependent optimization while maintaining universal compatibility with an entire model family.
- We revisit the fundamental objective of speculative decoding and demonstrate that its primary challenge is maximizing the acceptance of consecutive token spans. To this end, we propose a novel optimization strategy CAT. Conditioned on the preceding prefix, CAT adaptively reweights its focus on individual tokens guided by the target model's context-dependent confidence scores, thereby significantly improving both prediction and distillation efficiency.
- We conduct extensive experiments across diverse models and benchmarks, including a practical validation of PARD-2 within the vLLM framework. Our results show that PARD-2 achieves an average speedup of 1.3× over PARD and up to 6.94× acceleration over the autoregressive baseline. Furthermore, it delivers the highest throughput under high-concurrency settings, demonstrating exceptional practical value for real-world deployment.
## 2 Preliminaries
### 2.1 Speculative Decoding
Speculative decoding is a lossless decoding strategy for accelerating LLM inference. Instead of generating each token solely with the target model $\boldsymbol{\theta}_{\mathrm{target}}$, it introduces a smaller and faster draft model $\boldsymbol{\theta}_{\mathrm{draft}}$ to propose multiple candidate tokens in advance, which are then verified by the target model in parallel. This design reduces the number of expensive target model decoding steps while preserving the exact output distribution of the target model.
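The verify step can be sketched for the simplest case, greedy decoding, where a draft token is kept only if it equals the target model's own greedy choice at that position. This is a minimal illustration rather than the paper's implementation: the function name `greedy_verify` and the integer token ids are hypothetical, and the target's choices are assumed to have been computed in a single parallel pass.

```python
def greedy_verify(draft_tokens, target_tokens):
    """Return (accepted draft tokens, next token chosen by the target).

    Assumption: target_tokens[k] is the target model's greedy choice at draft
    position k, computed in one parallel pass over prefix + draft_tokens, so
    it has len(draft_tokens) + 1 entries.
    """
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break  # first mismatch: everything after it is discarded
        accepted.append(drafted)
    # Correction token on a mismatch, bonus token if every draft matched.
    next_token = target_tokens[len(accepted)]
    return accepted, next_token


print(greedy_verify([5, 7, 9], [5, 7, 2, 4]))  # ([5, 7], 2)
print(greedy_verify([5, 7, 9], [5, 7, 9, 4]))  # ([5, 7, 9], 4)
```

Because the token emitted after the accepted prefix always comes from the target model, the output matches what greedy decoding with the target alone would produce.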
Figure 2: Acceptance behavior of Llama3.1-8B. (a) Position-wise acceptance ratio and acceptance length: on the HumanEval benchmark, PARD-2 achieves higher acceptance rates and longer acceptance length than PARD across token positions, mitigating distant-position degradation. (b) Target model's confidence vs. acceptance rate: target-model confidence scores strongly correlate with actual acceptance rates, supporting their use as a proxy for token-level acceptance.
Formally, given a prefix $X=(x_0,\ldots,x_{n-1})$, speculative sampling uses a lightweight auto-regressive draft model $\boldsymbol{\theta}_{\mathrm{draft}}$ to propose $K$ tokens, denoted by $\tilde{Y}=(\tilde{y}_n,\ldots,\tilde{y}_{n+K-1})$. The proposal probability distribution factorizes as
$$P(\tilde{Y}\mid X;\boldsymbol{\theta}_{\mathrm{draft}})=\prod_{k=0}^{K-1}P\!\left(\tilde{y}_{n+k}\mid x_0,\ldots,x_{n-1},\tilde{y}_n,\ldots,\tilde{y}_{n+k-1};\boldsymbol{\theta}_{\mathrm{draft}}\right). \tag{1}$$
For position $n+k$, let $p_k(y)=P(y\mid x_0,\ldots,x_{n-1},\tilde{y}_n,\ldots,\tilde{y}_{n+k-1};\boldsymbol{\theta}_{\mathrm{target}})$ and $q_k(y)=P(y\mid x_0,\ldots,x_{n-1},\tilde{y}_n,\ldots,\tilde{y}_{n+k-1};\boldsymbol{\theta}_{\mathrm{draft}})$ denote the target and draft conditional probabilities, respectively. Under speculative sampling, the draft token $\tilde{y}_{n+k}$ is accepted with probability
$$a_k=\min\!\left(1,\,\frac{p_k(\tilde{y}_{n+k})}{q_k(\tilde{y}_{n+k})}\right),\qquad k=0,\ldots,K-1. \tag{2}$$
Ignoring the bonus token, the probability that the first $k+1$ draft tokens are all accepted is $\prod_{j=0}^{k}a_j$. Hence, the expected acceptance length $L$ is
$$\mathbb{E}[L\mid X,\tilde{Y}]=\sum_{k=0}^{K-1}\prod_{j=0}^{k}a_j. \tag{3}$$
The target model accepts the longest valid prefix and, upon the first rejection, samples a correction token from the residual distribution, preserving exact equivalence to sampling from the target model.
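The acceptance probability and expected acceptance length defined above can be evaluated directly for a toy draft. The sketch below is illustrative only: the function names and the probability values are hypothetical, and the draft probabilities are assumed to be strictly positive (which holds for any token actually sampled from the draft model).

```python
def acceptance_probs(p_target, q_draft):
    """Per-position acceptance probability a_k = min(1, p_k / q_k)."""
    return [min(1.0, p / q) for p, q in zip(p_target, q_draft)]


def expected_acceptance_length(a):
    """E[L] = sum over k of prod_{j<=k} a_j, ignoring the bonus token."""
    total, all_accepted_so_far = 0.0, 1.0
    for a_k in a:
        all_accepted_so_far *= a_k  # probability positions 0..k are all accepted
        total += all_accepted_so_far
    return total


# Hypothetical probabilities of the four drafted tokens under each model.
p = [0.9, 0.6, 0.8, 0.3]  # target model
q = [0.9, 0.8, 0.5, 0.6]  # draft model
a = acceptance_probs(p, q)
print(a)                              # roughly [1.0, 0.75, 1.0, 0.5]
print(expected_acceptance_length(a))  # roughly 2.875
```

Note that the third token is accepted with probability 1 once reached, yet it contributes only about 0.75 to the expectation because the second token is rejected a quarter of the time.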
### 2.2 Parallel Draft Models
Although speculative decoding significantly accelerates LLM inference, its drafting stage remains sequential, requiring $K$ sequentially dependent predictions to generate $K$ draft tokens. This sequential latency can still limit the end-to-end speedup. To address this issue, recent work has explored parallel draft models that predict multiple tokens simultaneously. DiffuSpec \[[18](https://arxiv.org/html/2605.08632#bib.bib544)\] and DFlash \[[7](https://arxiv.org/html/2605.08632#bib.bib545)\] adopt diffusion-based drafters that generate tokens through iterative denoising. To better match the auto-regressive architecture of the target model, PARD \[[1](https://arxiv.org/html/2605.08632#bib.bib552)\] retains an auto-regressive backbone and introduces masked placeholders, enabling parallel masked-token prediction in a single forward pass.
In particular, PARD introduces a special mask token $m$ and predicts each future token conditioned only on the prefix and preceding mask placeholders. Its draft probability distribution is
$$P(\tilde{Y}\mid X;\boldsymbol{\theta}_{\mathrm{PARD}})=\prod_{k=0}^{K-1}P\!\left(\tilde{y}_{n+k}\mid x_0,\ldots,x_{n-1},m_n,\ldots,m_{n+k-1};\boldsymbol{\theta}_{\mathrm{PARD}}\right). \tag{4}$$
Because each position depends only on the prefix and mask tokens, all $K$ predictions can be computed in a single forward pass. This approach not only substantially reduces drafting latency but also ensures target independence, enabling the drafter to be reusable across a family of target models.
Given the ground-truth $Y=(y_n,\ldots,y_{n+K-1})$, PARD is trained with the cross-entropy loss
$$\mathcal{L}_{\mathrm{PARD}}=-\frac{1}{K}\sum_{k=0}^{K-1}\log P\!\left(y_{n+k}\mid x_0,\ldots,x_{n-1},m_n,\ldots,m_{n+k-1};\boldsymbol{\theta}_{\mathrm{PARD}}\right). \tag{5}$$
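To make the masked-placeholder layout and the uniform loss concrete, here is a small sketch. It assumes the usual next-token convention (the output at the last prefix token predicts the first draft token, and the output at each mask predicts the following one); the `MASK` string, the function names, and the probabilities are hypothetical and are not taken from the PARD code base.

```python
import math

MASK = "<mask>"  # hypothetical placeholder string for the mask token m


def build_parallel_draft_input(prefix, K):
    """Prefix followed by K-1 mask placeholders.

    Assumption: the output at the last prefix token predicts y_n (k = 0, no
    masks in its context) and the output at the k-th mask predicts y_{n+k},
    so K predictions need K-1 masks.
    """
    return list(prefix) + [MASK] * (K - 1)


def uniform_ce_loss(truth_probs):
    """Mean negative log-probability of the K ground-truth tokens."""
    return -sum(math.log(p) for p in truth_probs) / len(truth_probs)


tokens = build_parallel_draft_input(["def", "add", "("], K=4)
print(tokens)  # ['def', 'add', '(', '<mask>', '<mask>', '<mask>']

# Hypothetical drafter probabilities on the four ground-truth tokens.
print(uniform_ce_loss([0.9, 0.7, 0.4, 0.2]))
```

Every position receives the same weight 1/K here, which is the property the later sections revisit.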
Figure 3: Overview of PARD-2. The training (middle) and inference (right) designs of PARD-2. Compared to PARD (left), PARD-2 integrates CAT optimization, target hidden features, and knowledge distillation. PARD-2 supports flexible switching between target-dependent and target-independent modes.
## 3 Method
### 3.1 Observation
The draft length $K$ is a key design choice for parallel draft models. To study its effect, we train PARD with two draft lengths, $K=8$ and $K=16$. As shown in Table [5](https://arxiv.org/html/2605.08632#S4.T5), increasing $K$ yields little improvement and can even degrade performance across several benchmarks. This observation contradicts the common intuition that a longer draft length should naturally translate to a greater acceptance length and enhanced decoding efficiency. To understand this phenomenon, we analyze the verification mechanism of speculative decoding. Because the target model evaluates candidate tokens strictly in order, the acceptance of any subsequent token heavily relies on the successful verification of all its predecessors. Let $a_j$ denote the marginal probability that the target model accepts the $j$-th draft token. Eq. ([3](https://arxiv.org/html/2605.08632#S2.E3)) can be decomposed by position as
$$\mathbb{E}[L\mid X,\tilde{Y}]=\sum_{k=0}^{K-1}\prod_{j=0}^{k}a_j=\sum_{k=0}^{K-1}\left(\prod_{j=0}^{k-1}a_j\right)a_k. \tag{6}$$
This decomposition reveals two key factors that govern whether the token at position $k$ is accepted. The first factor, $\prod_{j=0}^{k-1}a_j$, is the probability that all previous draft tokens are accepted. The second factor, $a_k$, is the probability that the current token is accepted once position $k$ is reached.
We then define the first factor as $s_k$:
$$s_k:=\prod_{j=0}^{k-1}a_j,\qquad s_0:=1. \tag{7}$$
The term $s_k$ is the probability that position $k$ is reached, a prerequisite for the $k$-th token to contribute to the acceptance length, and can therefore be interpreted as the importance of that token with respect to acceleration. With this notation,
$$\mathbb{E}[L\mid X,\tilde{Y}]=\sum_{k=0}^{K-1}s_k a_k. \tag{8}$$
The second factor, $a_k$, reflects the local quality of the draft prediction at position $k$. Since the token-level training objective of the draft model aims to improve prediction quality at each position, it is naturally related to increasing $a_k$. This insight motivates us to reweight the per-token training objectives by $s_k$, assigning higher importance to tokens whose prefixes are likely to be accepted.
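As a worked illustration of this decomposition, take hypothetical acceptance probabilities (chosen for the example, not measured in the paper) with $K=4$ and $(a_0,a_1,a_2,a_3)=(0.9,\,0.8,\,0.5,\,0.5)$:

$$
\begin{aligned}
s_0 &= 1, \quad s_1 = 0.9, \quad s_2 = 0.9\cdot 0.8 = 0.72, \quad s_3 = 0.72\cdot 0.5 = 0.36,\\
\mathbb{E}[L\mid X,\tilde{Y}] &= s_0a_0+s_1a_1+s_2a_2+s_3a_3 = 0.9+0.72+0.36+0.18 = 2.16 .
\end{aligned}
$$

Positions 2 and 3 have the same local acceptance probability, yet position 3 contributes half as much to the expected length because it is reached only half as often ($s_3=0.36$ versus $s_2=0.72$).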
### 3.2 Confidence-Adaptive Token Optimization Strategy
Motivated by the above observations, we assign adaptive weights to individual tokens during training, thereby better aligning the optimization objective with the speculative decoding goal of maximizing acceptance length. However, the true acceptance rate $a_k$ is intractable during training, as it inherently depends on the dynamic interaction between the draft and target models during the verification phase.
Inspired by EAGLE-2 \[[19](https://arxiv.org/html/2605.08632#bib.bib501)\] and CAPE \[[11](https://arxiv.org/html/2605.08632#bib.bib551)\], we investigate the relationship between the target model's confidence and the token acceptance rate across multiple benchmarks, including HumanEval, GSM8K, and MATH-500. As illustrated in Figure [2](https://arxiv.org/html/2605.08632#S2.F2)(b), the target model's confidence exhibits a strong positive correlation with the empirical acceptance rate. Consequently, we can leverage the target model's confidence scores as a reliable proxy for the expected token acceptance rate, enabling adaptive token-level weighting.
Building upon this empirical finding, we approximate the actual acceptance probability using the target model's confidence on the corresponding ground-truth token $y_{n+k}$ conditioned on its prefix:
$$\hat{c}_k:=P\!\left(y_{n+k}\mid x_0,\ldots,x_{n-1},y_n,\ldots,y_{n+k-1};\boldsymbol{\theta}_{\mathrm{target}}\right). \tag{9}$$
We then estimate the importance of each token by computing the cumulative product of the target confidences along the prefix:
$$\hat{s}_k:=\prod_{j=0}^{k-1}\hat{c}_j,\qquad \hat{s}_0:=1. \tag{10}$$
Here, $\hat{s}_k$ approximates the probability that position $k$ is reached during verification. Using $\hat{s}_k$ as a stop-gradient weight, we obtain the final PARD-2 training objective:
$$\mathcal{L}_{\mathrm{PARD\text{-}2}}=-\frac{1}{K}\sum_{k=0}^{K-1}\hat{s}_k\log P\!\left(y_{n+k}\mid x_0,\ldots,x_{n-1},m_n,\ldots,m_{n+k-1};\boldsymbol{\theta}_{\mathrm{PARD\text{-}2}}\right). \tag{11}$$
Unlike the uniform loss in Eq. ([5](https://arxiv.org/html/2605.08632#S2.E5)), Eq. ([11](https://arxiv.org/html/2605.08632#S3.E11)) adaptively prioritizes tokens that are likely to contribute to the final accepted prefix and reduces the influence of distant positions that are rarely reached during speculative verification. As a result, the objective better matches the inference-time acceleration goal of speculative decoding.
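The weighting scheme can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the released training code: the function names are hypothetical, the confidences are made-up values, and the stop-gradient is mirrored simply by treating the weights as constants.

```python
import math


def cat_weights(target_conf):
    """Running product of target confidences: weight k uses positions j < k."""
    weights, reach_prob = [], 1.0
    for c in target_conf:
        weights.append(reach_prob)  # the weight for position 0 is 1
        reach_prob *= c
    return weights


def cat_loss(target_conf, draft_truth_probs):
    """Confidence-weighted negative log-likelihood, averaged over K positions.

    The weights are plain floats here, which mirrors the stop-gradient: they
    scale each term but are never differentiated.
    """
    weights = cat_weights(target_conf)
    terms = [s * math.log(q) for s, q in zip(weights, draft_truth_probs)]
    return -sum(terms) / len(terms)


# Hypothetical target confidences on the ground-truth tokens of two samples.
print(cat_weights([0.95, 0.9, 0.9, 0.9]))  # decays slowly
print(cat_weights([0.95, 0.3, 0.9, 0.9]))  # collapses after the uncertain token
```

In the second sample, positions after the low-confidence token receive small weights regardless of how far along the draft they sit, which is the context dependence that a position-only schedule cannot express.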
### 3.3 PARD-2 Training
During training, in addition to assigning different importance weights to each token-level loss to better optimize the speculative decoding objective, we further adapt the draft model to align with the target model, enabling it to generate highly compatible proposals.
Stochastic Gating for Target Features. As illustrated in Figure [3](https://arxiv.org/html/2605.08632#S2.F3), given an input prompt $X$, we extract hidden representations from multiple layers of the target model, denoted by $l$, $m$, and $h$, corresponding to low-, middle-, and high-level features, respectively. These hidden states are fused into a compact target-context feature $t=\mathrm{Proj}([l;m;h])$, where $[\cdot;\cdot]$ denotes concatenation and $\mathrm{Proj}(\cdot)$ is a lightweight projection module. To improve training efficiency, we further process the draft-model input with Conditional Drop Token (COD) \[[1](https://arxiv.org/html/2605.08632#bib.bib552)\], which selectively drops conditional tokens during training. The fused target hidden feature $t$ is then injected into the draft model by adding it to the draft-model input embeddings $e^{d}$. In this way, the draft model can leverage target-side context during drafting, leading to better alignment with the target-model distribution.
Moreover, to achieve target independence, we do not inject target-context features for every training instance. Instead, we stochastically inject them during training: $e^{d'}=e^{d}+\xi\cdot t$, $\xi\sim\mathrm{Bernoulli}(1-\rho)$. Thus, $e^{d'}=e^{d}$ with probability $\rho$, and $e^{d'}=e^{d}+t$ otherwise. This design reduces target-side dependence, enabling a single drafter to serve the entire model family.
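The gate itself is simple enough to sketch with the standard library. The sketch below assumes one gate draw per training instance and uses toy three-dimensional vectors; the function name `gate_target_feature` and all numeric values are hypothetical.

```python
import random


def gate_target_feature(e_d, t, rho, rng):
    """Return (gated embedding, xi): embedding + xi * t, xi ~ Bernoulli(1 - rho)."""
    xi = 0.0 if rng.random() < rho else 1.0  # dropped with probability rho
    return [e + xi * f for e, f in zip(e_d, t)], xi


rng = random.Random(0)  # fixed seed so the sketch is reproducible
e_d = [0.2, -0.1, 0.4]  # toy draft input embedding (hypothetical values)
t = [1.0, 1.0, -1.0]    # toy fused target feature (hypothetical values)

draws = [gate_target_feature(e_d, t, rho=0.1, rng=rng)[1] for _ in range(10_000)]
print(1.0 - sum(draws) / len(draws))  # empirical drop rate, close to rho = 0.1
```

Under this reading, the two inference modes correspond to fixing the gate: always injecting the target feature in target-dependent mode and never injecting it in target-independent mode.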
Training Loss Function. To further improve draft-target alignment, we augment the supervised training objective with knowledge distillation from the target model. Without loss of generality, let $x_{\leq n}$ denote the current prefix and $x_{n+k}$ denote the $k$-th future token to be predicted. Unlike a position-only weight, the acceptance weight of a future token depends not only on its relative position $k$ but also on the current prefix $x_{\leq n}$. Therefore, we denote the estimated token-level acceptance weight as $\hat{s}_{n,k}$, which captures the expected acceptance likelihood of token $x_{n+k}$ conditioned on prefix $x_{\leq n}$. Specifically, for the $k$-th future position, our objective is formulated as a weighted sum of the standard cross-entropy loss $\mathcal{L}^{\mathrm{CE}}$ and the distillation loss $\mathcal{L}^{\mathrm{KD}}$:
$$\mathcal{L}_k=\sum_{n=1}^{N-k}\hat{s}_{n,k}\left(\beta\mathcal{L}^{\mathrm{CE}}_{n,k}+\mathcal{L}^{\mathrm{KD}}_{n,k}\right), \tag{12}$$
where $\beta$ balances the supervised and distillation terms. Here, $\mathcal{L}^{\mathrm{CE}}_{n,k}$ and $\mathcal{L}^{\mathrm{KD}}_{n,k}$ are computed for token $x_{n+k}$ given prefix $x_{\leq n}$, and $\hat{s}_{n,k}$ is the confidence-adaptive weight assigned to this token.
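A single summand of this objective can be sketched as follows. The text does not spell out the form of the distillation term, so the sketch assumes a forward KL divergence from the target distribution to the draft distribution purely for illustration; the function names and the toy distributions are hypothetical, and `beta=0.1` follows the value reported in the implementation details.

```python
import math


def cross_entropy(draft_dist, truth_id):
    """Negative log-probability the drafter gives the ground-truth token."""
    return -math.log(draft_dist[truth_id])


def kl_target_to_draft(target_dist, draft_dist):
    """KL(target || draft); an assumed stand-in for the distillation term."""
    return sum(p * math.log(p / q)
               for p, q in zip(target_dist, draft_dist) if p > 0.0)


def weighted_token_loss(s_hat, draft_dist, target_dist, truth_id, beta=0.1):
    """One (n, k) summand: s_hat * (beta * CE + KD)."""
    ce = cross_entropy(draft_dist, truth_id)
    kd = kl_target_to_draft(target_dist, draft_dist)
    return s_hat * (beta * ce + kd)


target = [0.7, 0.2, 0.1]  # toy target distribution over a 3-token vocabulary
draft = [0.5, 0.3, 0.2]   # toy draft distribution at the same position
print(weighted_token_loss(0.8, draft, target, truth_id=0))
```

With a small `beta`, the distillation term dominates, so the drafter is pulled mainly toward the target's full distribution rather than toward the single ground-truth token.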
### 3.4 Dual-mode inference
During inference, our framework supports two modes: target\-dependent and target\-independent\. As illustrated in Figure[3](https://arxiv.org/html/2605.08632#S2.F3), during the standard prefilling phase, we extract hidden representations across multiple layers of the target model\. To maintain consistency between training and inference when utilizing the target model’s features, we specifically extract the hidden state corresponding to the last token position of the prompt to fuse with the mask token\. This design ensures that the draft model’s conditioning signal remains aligned with the sequential dependency observed during the training phase\. In target\-dependent mode, PARD\-2 maximizes alignment by exploiting target hidden features to achieve peak acceleration\. In target\-independent mode, it maintains broad compatibility across the entire target model family without retraining\. Notably, both modes are supported by the same single draft model during inference, requiring no architectural changes or additional parameter fine\-tuning\.
Table 1: Target-dependent comparison of speedup ratios and average acceptance lengths $\tau$ across different methods. Q3 denotes the Qwen3 model family and L3 denotes the Llama3 model family. Values in parentheses denote the inference draft length.
Table 2: Target-independent performance comparison. Values in parentheses denote the inference draft length. All experiments are evaluated on the same draft model.
## 4 Experiments
### 4.1 Experimental Setup
Models. We evaluate PARD-2 primarily on the Llama3 \[[25](https://arxiv.org/html/2605.08632#bib.bib23)\] and Qwen3 \[[32](https://arxiv.org/html/2605.08632#bib.bib428)\] model families. To demonstrate performance in the target-dependent mode, we train and evaluate PARD-2 on Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. Furthermore, to highlight its zero-shot transferability, we conduct extensive target-independent experiments primarily on the Qwen3 family, demonstrating that a single drafter can generalize to accelerate other target models within the same series.
Datasets and Benchmarks. PARD-2 is trained on a moderately expanded version of the dataset used in PARD. Specifically, we retain Magpie \[[31](https://arxiv.org/html/2605.08632#bib.bib532)\] and Evol-CodeAlpaca \[[26](https://arxiv.org/html/2605.08632#bib.bib533)\], and additionally include samples from Nemotron-v2 \[[3](https://arxiv.org/html/2605.08632#bib.bib555)\] and Nemotron-v3 \[[4](https://arxiv.org/html/2605.08632#bib.bib554)\]. We evaluate the generalizability of our approach across diverse benchmarks, including HumanEval \[[8](https://arxiv.org/html/2605.08632#bib.bib534)\] for code generation, MATH-500 \[[22](https://arxiv.org/html/2605.08632#bib.bib536)\] and GSM8K \[[10](https://arxiv.org/html/2605.08632#bib.bib535)\] for mathematical reasoning, and MT-Bench \[[33](https://arxiv.org/html/2605.08632#bib.bib28)\] for multi-turn dialogue.
Metrics. PARD-2 is a lossless acceleration method that preserves the original target model and exact acceptance rule. Therefore, we focus on acceleration performance and report the following metrics:
- Speedup: The acceleration ratio over vanilla auto-regressive decoding.
- Acceptance Length $\tau$: The average number of draft tokens accepted in each verification.
- Tokens Per Second: The number of tokens generated per second in real-world scenarios.
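The three metrics reduce to simple ratios, sketched below on made-up measurements. The function names and numbers are hypothetical; following the definition above, the acceptance length here counts only accepted draft tokens per verification step.

```python
def speedup(baseline_seconds, speculative_seconds):
    """Wall-clock ratio of vanilla auto-regressive decoding to the SD run."""
    return baseline_seconds / speculative_seconds


def acceptance_length(accepted_per_step):
    """Average number of draft tokens accepted per verification step."""
    return sum(accepted_per_step) / len(accepted_per_step)


def tokens_per_second(num_tokens, seconds):
    """Generated tokens divided by elapsed wall-clock time."""
    return num_tokens / seconds


# Hypothetical log of one generation: accepted draft tokens at each step.
accepted = [6, 3, 8, 5, 7, 1]
print(acceptance_length(accepted))  # 5.0
print(speedup(12.0, 3.0))           # 4.0
print(tokens_per_second(512, 4.0))  # 128.0
```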
Implementation Details. For training, we extract target hidden features from 4 layers of the target model. The draft model is trained on AMD MI300X GPUs, utilizing a batch size of 64 and a draft length of $K=16$. We set $\rho=0.1$ and the loss weighting coefficient $\beta=0.1$. For inference, all throughput evaluations are implemented based on the vLLM framework. To ensure a fair comparison, tree-based decoding is explicitly disabled across all methods. Unless otherwise specified, all evaluation experiments are conducted on NVIDIA A100-40GB GPUs. We employ a tensor parallelism degree of TP=2 for Qwen3-32B, while setting TP=1 for all other models.
### 4.2 Experimental Results
In this section, we evaluate PARD-2 on Qwen3 and Llama3 with thinking mode disabled. We compare PARD-2 with several SD baselines, including EAGLE-3, DFlash, and PARD. For the Qwen3 series, EAGLE-3 uses third-party trained weights \[[29](https://arxiv.org/html/2605.08632#bib.bib426)\], while all other baselines and our method use official weights. For EAGLE-3, we adopt an inference draft length of $K=8$ to match its optimal open-source configuration, whereas for all remaining methods, we set the draft length to $K=16$.
Target-Dependent Mode. Table [1](https://arxiv.org/html/2605.08632#S3.T1) reports the main results under greedy decoding. Across all evaluated target models, PARD-2 consistently outperforms auto-regressive decoding and strong speculative decoding baselines in both speedup and average acceptance length $\tau$. On Qwen3-8B, PARD-2 raises the average speedup to 5.81×, compared with 4.39× for PARD and 4.61× for DFlash, while increasing the average acceptance length to 6.98. Similar gains are observed on Qwen3-14B and Llama3.1-8B, where PARD-2 achieves 5.81× and 5.19× average speedups, respectively. PARD-2 also maintains strong performance on MT-Bench, which involves more complex multi-turn dialogue generation, suggesting that its benefits generalize beyond structured reasoning and coding benchmarks. These results demonstrate that PARD-2 improves the acceptance of consecutive draft tokens and translates this improvement into consistent lossless inference acceleration in practice.
Target-Independent Mode. Table [2](https://arxiv.org/html/2605.08632#S3.T2) evaluates PARD-2 in target-independent mode, where a single drafter accelerates different Qwen target models. Compared with PARD, PARD-2 improves both average speedup and acceptance length. Specifically, PARD-2 increases the average speedup from 4.38× to 4.82× on Qwen3-8B, from 4.38× to 4.79× on Qwen3-14B, and from 4.37× to 4.68× on Qwen3-32B. On average, the acceptance length $\tau$ improves from 5.41 to 5.97. These results show that stochastic gating reduces over-reliance on target-specific hidden states, while CAT optimization remains effective without target-specific features. Together, they enable a general-purpose drafter to achieve strong lossless acceleration across a family of target models.
Large Batch Sizes Study. We further evaluate PARD-2 under large-batch serving settings, where GPU utilization becomes increasingly important. Figure [1](https://arxiv.org/html/2605.08632#S1.F1) reports both per-user throughput (TPS/User) and GPU throughput (TPS/GPU) across different batch sizes. PARD-2 consistently shifts the throughput frontier upward and to the right, indicating that it improves aggregate serving efficiency while maintaining higher per-user generation speed. Notably, even at batch size 64, where the speedup gain is relatively smaller due to higher GPU utilization, PARD-2 still outperforms PARD on both Llama-3.1-8B and Qwen3-8B. These results show that the gains of PARD-2 are not limited to small-batch settings; PARD-2 remains effective in high-throughput serving scenarios, where large-batch decoding is commonly used to maximize GPU utilization.
Table 3: Comparison between fixed-decay and token-adaptive weighting strategies.
Table 4: Effect of the stochastic gating ratio for target features. $\rho=0.1$ is optimal.
### 4.3 Ablation Studies
In this section, we ablate the key design choices of PARD-2, including the effectiveness of CAT optimization, the impact of stochastic gating for target features, and a fine-grained breakdown of the improvements over PARD. All ablation models are trained for 30k steps on MI300X GPUs.
Confidence-Adaptive Token Optimization (CAT). CAT prioritizes "high-value" tokens that directly extend the accepted prefix during speculative decoding. Unlike traditional uniform supervision, CAT reweights the token-level training loss based on the target model's confidence. As shown in Table [5](https://arxiv.org/html/2605.08632#S4.T5), CAT consistently improves the average acceptance length across all benchmarks. With a larger $k_{\mathrm{infer}}$, CAT increases $\tau$ from 4.83 to 5.79 across all benchmarks.
To further validate its superiority, we compare CAT against a fixed position-wise decay strategy \[[7](https://arxiv.org/html/2605.08632#bib.bib545),[23](https://arxiv.org/html/2605.08632#bib.bib553)\] ($\gamma_t=\gamma^{t-1}$), a common heuristic in parallel drafting. As reported in Table [3](https://arxiv.org/html/2605.08632#S4.T3), while position-wise decay provides gains (reaching a peak $\tau$ of 5.61 at $\gamma=0.8$), its performance is highly sensitive to the decay rate and fails to generalize across different tasks. In contrast, CAT adaptively focuses on both the token and its prefix. The results demonstrate that incorporating both the token and its prefix into the weighting strategy is essential for achieving optimal speculative decoding.
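The difference between the two weighting schemes shows up on any pair of sequences that share positions but differ in difficulty. The sketch below is illustrative only: function names and confidence values are hypothetical, and the decay rate 0.8 is simply the best-performing fixed value mentioned above.

```python
def fixed_decay_weights(gamma, K):
    """Position-only weights: gamma ** (t - 1) for t = 1..K."""
    return [gamma ** (t - 1) for t in range(1, K + 1)]


def cat_weights(target_conf):
    """Prefix-dependent weights: running product of target confidences."""
    weights, reach_prob = [], 1.0
    for c in target_conf:
        weights.append(reach_prob)
        reach_prob *= c
    return weights


easy_prefix = [0.99, 0.98, 0.97, 0.99]  # hypothetical confident continuation
hard_prefix = [0.99, 0.20, 0.97, 0.99]  # same positions, one uncertain token

print(fixed_decay_weights(0.8, 4))  # identical for both sequences
print(cat_weights(easy_prefix))     # stays close to 1 at every position
print(cat_weights(hard_prefix))     # drops sharply after the uncertain token
```

A fixed schedule down-weights the third position of the easy sequence even though it is almost always reached, and over-weights the third position of the hard sequence even though it rarely is.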
Stochastic Gating for Target Features. To balance target-dependent performance and target-independent versatility, we introduce a training-time stochastic gate for target-feature injection. The gate disables target features with probability $\rho$ and injects them otherwise, encouraging the drafter to avoid over-reliance on target hidden states. As shown in Table [4](https://arxiv.org/html/2605.08632#S4.T4), the fully injected baseline achieves $\tau = 5.62$, while stochastic gating with $\rho = 0.1$ maintains a comparable $\tau = 5.60$. Notably, $\rho = 0.1$ slightly improves MT-Bench performance from 3.79 to 3.84, suggesting that mild stochastic gating acts as an effective regularizer. This helps prevent overfitting to specific target hidden-state distributions and improves the model’s versatility across deployment settings.
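The gate can be pictured with the following sketch. It assumes that "disabling" the target features means replacing them with zeros for the whole training sample. The paper may implement the gate differently (for example per token, or with a learned placeholder), and the function name `gate_target_features` is hypothetical.

```python
import random


def gate_target_features(features, rho, rng):
    """Training-time stochastic gate (illustrative assumption: zeroing).

    With probability rho the target features are disabled, otherwise they
    are passed through unchanged.
    """
    if rng.random() < rho:
        return [0.0] * len(features)
    return list(features)


# Usage with rho = 0.1: roughly one training sample in ten sees no target
# features, which exercises the target-independent mode during training.
rng = random.Random(0)
features = [0.3, -1.2, 0.7]
gated = gate_target_features(features, rho=0.1, rng=rng)
```

At inference time no sampling is needed: the target-dependent mode always injects the features and the target-independent mode never does.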
Analysis of Performance Gains. In Table [5](https://arxiv.org/html/2605.08632#S4.T5), we conduct a fine-grained ablation of the new modules in PARD-2, including target-feature injection, CAT, and the draft length. We study conditioned drafting by using target-model hidden representations as additional input features, enabling the draft model to leverage target-side context beyond previous tokens. These features increase the average $\tau$ from 4.70 to 4.96. The gains are larger on reasoning and code-generation tasks, suggesting that target hidden states provide useful semantic signals for resolving complex logic and generating candidates more likely to be accepted.
Table 5: Ablation study of core components and configurations in PARD-2. Starting from PARD, PARD-2 progressively adds target hidden features, CAT optimization, and multi-layer target features, and evaluates different draft lengths for training ($k_{\mathrm{train}}$) and inference ($k_{\mathrm{infer}}$). Each component consistently improves both speedup and average acceptance length ($\tau$) across three benchmarks.
## 5 Related Work
Speculative decoding \[[17](https://arxiv.org/html/2605.08632#bib.bib444), [6](https://arxiv.org/html/2605.08632#bib.bib446)\] alleviates the memory-bandwidth bottleneck in auto-regressive generation by using a lightweight draft model to propose tokens for parallel verification by a target LLM. To improve draft-target alignment, Medusa \[[5](https://arxiv.org/html/2605.08632#bib.bib458)\], GLIDE and CAPE \[[11](https://arxiv.org/html/2605.08632#bib.bib551)\], and the EAGLE series \[[20](https://arxiv.org/html/2605.08632#bib.bib463), [21](https://arxiv.org/html/2605.08632#bib.bib543)\] incorporate the KV cache or hidden features of the target model, while DistillSpec \[[34](https://arxiv.org/html/2605.08632#bib.bib484)\] employs knowledge distillation. To further minimize wall-clock latency, methods such as PEARL \[[24](https://arxiv.org/html/2605.08632#bib.bib549)\] and SSD \[[15](https://arxiv.org/html/2605.08632#bib.bib546)\] decouple drafting and verification for parallel execution, whereas SpecInfer \[[27](https://arxiv.org/html/2605.08632#bib.bib470)\], Falcon \[[13](https://arxiv.org/html/2605.08632#bib.bib461)\], and EAGLE-2 \[[19](https://arxiv.org/html/2605.08632#bib.bib501)\] introduce advanced tree-based verification. Training-free $n$-gram matching methods such as LOOKAHEAD \[[12](https://arxiv.org/html/2605.08632#bib.bib46)\] and PROMTEC \[[16](https://arxiv.org/html/2605.08632#bib.bib550)\] also accelerate inference.
Despite these advances, many SD methods still rely on auto-regressive drafting, whose sequential dependency limits drafting throughput. Recent parallel drafting methods address this limitation by predicting multiple future tokens in a single forward pass. ParallelSpec \[[30](https://arxiv.org/html/2605.08632#bib.bib76)\] trains a parallel drafter to serve as an efficient speculative model. P-EAGLE \[[14](https://arxiv.org/html/2605.08632#bib.bib548)\] and PARD \[[1](https://arxiv.org/html/2605.08632#bib.bib552)\] adapt auto-regressive models to parallel masked prediction, while SpecDiff \[[9](https://arxiv.org/html/2605.08632#bib.bib425)\], SpecDiff-2 \[[28](https://arxiv.org/html/2605.08632#bib.bib547)\], DART \[[23](https://arxiv.org/html/2605.08632#bib.bib553)\], and DFlash \[[7](https://arxiv.org/html/2605.08632#bib.bib545)\] employ diffusion-style drafters for parallel token generation. However, existing parallel methods often rely on uniform token-level supervision. While some approaches \[[7](https://arxiv.org/html/2605.08632#bib.bib545), [23](https://arxiv.org/html/2605.08632#bib.bib553)\] introduce fixed position-aware decaying weights, they remain suboptimal for aligning with speculative decoding verification. In practice, token acceptance depends on both the prefix context and the token identity, suggesting that supervision weights should be dynamically determined by their joint effect.
## 6 Conclusion
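The draft-then-verify loop described at the start of this section can be illustrated for the simplest case, greedy decoding. The sketch below is a generic illustration rather than the verification code of any cited system. It assumes the target model has already been run once over the prefix plus all draft tokens, yielding its own greedy choice at every position. The function name `greedy_verify` is hypothetical.

```python
def greedy_verify(draft_tokens, target_tokens):
    """Accept the longest draft prefix that matches the target's greedy choices.

    target_tokens[i] is the target's greedy token at draft position i, given
    the original prefix plus draft_tokens[:i]. It has one more entry than
    draft_tokens: the last entry is the token that follows a fully accepted
    draft.
    """
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    # Emit the accepted draft tokens plus one token from the target itself:
    # the correction at the first mismatch, or the extra token after a
    # fully accepted draft.
    return draft_tokens[:accepted] + [target_tokens[accepted]]


# Usage: the third draft token is rejected, so two draft tokens and one
# target token are emitted in a single target forward pass.
emitted = greedy_verify([5, 7, 9], [5, 7, 2, 4])
```

The output is identical to what the target would have produced on its own, which is why the speedup is lossless. With sampling instead of greedy decoding, the equality test is replaced by a rejection-sampling rule.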
We propose PARD\-2, a dual\-mode speculative decoding framework that unifies target\-dependent and target\-independent modes within a single draft model\. By analyzing acceptance length in speculative decoding, we identify a gap between training\-time objectives and the inference\-time goal of maximizing consecutive token acceptance\. To bridge this gap, we introduce Confidence\-Adaptive Token \(CAT\) optimization, which uses target\-model confidence as a proxy for token\-level acceptance and adaptively reweights each token accordingly\. Experiments on diverse benchmarks show that PARD\-2 improves acceptance length and inference efficiency, demonstrating its effectiveness as a flexible framework for lossless speculative decoding acceleration\.
## References
- \[1\] Z. An, H. Bai, Z. Liu, D. Li, and E. Barsoum (2025) Pard: accelerating llm inference with low-cost parallel draft model adaptation. arXiv preprint arXiv:2504.18583.
- \[2\] Z. Ankner, R. Parthasarathy, A. Nrusimha, C. Rinard, J. Ragan-Kelley, and W. Brandon (2024) Hydra: sequentially-dependent draft heads for medusa decoding. arXiv preprint arXiv:2402.05109.
- \[3\] A. Basant, A. Khairnar, A. Paithankar, A. Khattar, A. Renduchintala, A. Malte, A. Bercovich, A. Hazare, A. Rico, A. Ficek, et al. (2025) Nvidia nemotron nano 2: an accurate and efficient hybrid mamba-transformer reasoning model. arXiv preprint arXiv:2508.14444.
- \[4\] A. Blakeman, A. Grattafiori, A. Basant, A. Gupta, A. Khattar, A. Renduchintala, A. Vavre, A. Shukla, A. Bercovich, A. Ficek, et al. (2025) Nemotron 3 nano: open, efficient mixture-of-experts hybrid mamba-transformer model for agentic reasoning. arXiv preprint arXiv:2512.20848.
- \[5\] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao (2024) Medusa: simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774.
- \[6\] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023) Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
- \[7\] J. Chen, Y. Liang, and Z. Liu (2026) DFlash: block diffusion for flash speculative decoding. arXiv preprint arXiv:2602.06036.
- \[8\] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- \[9\] J. K. Christopher, B. R. Bartoldson, T. Ben-Nun, M. Cardei, B. Kailkhura, and F. Fioretto (2025) Speculative diffusion decoding: accelerating language generation through diffusion. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 12042–12059.
- \[10\] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- \[11\] C. Du, J. Jiang, X. Yuanchen, J. Wu, S. Yu, Y. Li, S. Li, K. Xu, L. Nie, Z. Tu, et al. (2024) Glide with a cape: a low-hassle method to accelerate speculative decoding. arXiv preprint arXiv:2402.02082.
- \[12\] Y. Fu, P. Bailis, I. Stoica, and H. Zhang (2024) Break the sequential dependency of llm inference using lookahead decoding. arXiv preprint arXiv:2308.16710. [Link](https://arxiv.org/abs/2308.16710)
- \[13\] X. Gao, W. Xie, Y. Xiang, and F. Ji (2024) Falcon: faster and parallel inference of large language models through enhanced semi-autoregressive drafting and custom-designed decoding tree. arXiv preprint arXiv:2412.12639.
- \[14\] M. Hui, X. Huang, J. C. Salas, Y. Sun, N. Pemberton, X. Song, A. Khetan, and G. Karypis (2026) P-eagle: parallel-drafting eagle with scalable training. arXiv preprint arXiv:2602.01469.
- \[15\] T. Kumar, T. Dao, and A. May (2026) Speculative speculative decoding. In The Fourteenth International Conference on Learning Representations.
- \[16\] A. C. Lee, W. Cheng, and C. C. Chan (2025) PROMTEC: fast llm inference decoding using prompt multi-lookup with template database and common sequences. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6830–6842.
- \[17\] Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286.
- \[18\] G. Li, Z. Fu, M. Fang, Q. Zhao, M. Tang, C. Yuan, and J. Wang (2025) Diffuspec: unlocking diffusion language models for speculative decoding. arXiv preprint arXiv:2510.02358.
- \[19\] Y. Li, F. Wei, C. Zhang, and H. Zhang (2024) EAGLE-2: faster inference of language models with dynamic draft trees. arXiv:2406.16858. [Link](https://arxiv.org/abs/2406.16858)
- \[20\] Y. Li, F. Wei, C. Zhang, and H. Zhang (2024) Eagle: speculative sampling requires rethinking feature uncertainty. arXiv preprint arXiv:2401.15077.
- \[21\] Y. Li, F. Wei, C. Zhang, and H. Zhang (2025) Eagle-3: scaling up inference acceleration of large language models via training-time test. arXiv preprint arXiv:2503.01840.
- \[22\] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. arXiv preprint arXiv:2305.20050.
- \[23\] F. Liu, X. Li, K. Zhao, Y. Gao, Z. Zhou, Z. Zhang, Z. Wang, W. Dou, S. Zhong, and C. Tian (2026) DART: diffusion-inspired speculative decoding for fast llm inference. arXiv preprint arXiv:2601.19278.
- \[24\] T. Liu, Y. Li, Q. Lv, K. Liu, J. Zhu, W. Hu, and X. Sun (2025) PEARL: parallel speculative decoding with adaptive draft length. In The Thirteenth International Conference on Learning Representations.
- \[25\] Llama Team, AI @ Meta (2024) The llama 3 herd of models. arXiv:2407.21783. [Link](https://arxiv.org/abs/2407.21783)
- \[26\] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang (2023) WizardCoder: empowering code large language models with evol-instruct.
- \[27\] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, et al. (2023) SpecInfer: accelerating generative large language model serving with tree-based speculative inference and verification. arXiv preprint arXiv:2305.09781.
- \[28\] J. Sandler, J. K. Christopher, T. Hartvigsen, and F. Fioretto (2025) Specdiff-2: scaling diffusion drafter alignment for faster speculative decoding. arXiv preprint arXiv:2511.00606.
- \[29\] Tencent (2025-06) AngelSlim. GitHub repository: [https://github.com/Tencent/AngelSlim](https://github.com/Tencent/AngelSlim).
- \[30\] Z. Xiao, H. Zhang, T. Ge, S. Ouyang, V. Ordonez, and D. Yu (2024) ParallelSpec: parallel drafter for efficient speculative decoding. arXiv preprint arXiv:2410.05589.
- \[31\] Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin (2024) Magpie: alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464.
- \[32\] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- \[33\] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems 36, pp. 46595–46623.
- \[34\] Y. Zhou, K. Lyu, A. S. Rawat, A. K. Menon, A. Rostamizadeh, S. Kumar, J. Kagy, and R. Agarwal (2023) Distillspec: improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.08461.
Appendix
## Appendix A Training Hyperparameters
Table [6](https://arxiv.org/html/2605.08632#A1.T6) summarizes the hyperparameters used for training.
Table 6: Selected Hyperparameters for PARD-2 Training

| Hyperparameter | Llama3.1-8B | Qwen3-8B | Qwen3-14B |
| --- | --- | --- | --- |
| Optimizers | AdamW | AdamW | AdamW |
| Learning Rate | 1e-5 | 3e-5 | 3e-5 |
| Per Device Train Batch Size | 8 | 4 | 4 |
| Gradient Accumulation Steps | 1 | 2 | 2 |
| Num Processes | 8 | 8 | 8 |
| Num Train Epochs | 4 | 2 | 2 |
| Training Draft Length K | 16 | 16 | 16 |
| Stochastic Gating Ratio $\rho$ | 0.1 | 0.1 | 0.0 |
| CE Loss Coefficient $\beta$ | 0.1 | 0.1 | 0.1 |
| Max Seq Length | 512 | 1024 | 1024 |
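As a quick sanity check on Table 6, the sketch below derives the effective global batch size implied by the listed values. It assumes that Num Processes is the number of data-parallel workers and that each worker processes one per-device batch per accumulation step, which is the usual convention but is not stated in the paper. The dictionary keys and function names are hypothetical.

```python
# Values copied from Table 6; key names are illustrative only.
CONFIGS = {
    "Llama3.1-8B": {"per_device_batch": 8, "grad_accum": 1,
                    "num_processes": 8, "max_seq_len": 512},
    "Qwen3-8B": {"per_device_batch": 4, "grad_accum": 2,
                 "num_processes": 8, "max_seq_len": 1024},
    "Qwen3-14B": {"per_device_batch": 4, "grad_accum": 2,
                  "num_processes": 8, "max_seq_len": 1024},
}


def effective_batch_size(cfg):
    """Sequences contributing to one optimizer step (assumed data parallelism)."""
    return cfg["per_device_batch"] * cfg["grad_accum"] * cfg["num_processes"]


def max_tokens_per_step(cfg):
    """Upper bound on tokens per optimizer step if every sequence is full length."""
    return effective_batch_size(cfg) * cfg["max_seq_len"]


for name, cfg in CONFIGS.items():
    print(name, effective_batch_size(cfg), max_tokens_per_step(cfg))
```

Under these assumptions all three configurations use 64 sequences per optimizer step. The Qwen3 runs halve the per-device batch and double the accumulation steps, which keeps the step size constant while accommodating the longer 1024-token sequences.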