# Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding
Source: [https://arxiv.org/html/2605.14005](https://arxiv.org/html/2605.14005)
Shuoyang Sun¹†, Chang Dai²†, Hao Fang³, Kuofeng Gao³, Xinhao Zhong¹, Yi Sun¹, Fan Mo⁴, Shu-Tao Xia³, Bin Chen¹∗
¹Harbin Institute of Technology, Shenzhen ²South China University of Technology ³Tsinghua Shenzhen International Graduate School, Tsinghua University ⁴Huawei Technology
###### Abstract
Speculative decoding has become a widely adopted technique for accelerating large language model (LLM) inference by drafting multiple candidate tokens and verifying them with a target model in parallel. Its efficiency, however, critically depends on the average accepted length $\tau$, i.e., how many draft tokens survive each verification step. In this work, we identify a new mechanism-level vulnerability in model-based speculative decoding: the drafter is trained to approximate the target model distribution, but this approximation is inevitably imperfect. Such a drafter–target mismatch creates a hidden attack surface where small perturbations can preserve the target model's visible behavior while substantially reducing draft-token acceptability. We propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. Mistletoe directly targets the acceptance mechanism of speculative decoding. It jointly optimizes a degradation objective that decreases drafter–target agreement and a semantic-preservation objective that constrains the target model's output distribution. To resolve the conflict between these objectives, we introduce a null-space projection mechanism, where degradation gradients are projected away from the local semantic-preserving direction, suppressing draft acceptance while minimizing semantic drift. Experiments on various speculative decoding systems show that Mistletoe substantially reduces the average accepted length $\tau$, collapses speedup, and lowers average token throughput, while preserving output quality and perplexity. Our work highlights that speculative decoding introduces a mechanism-level attack surface beyond existing output robustness, calling for more robust designs of LLM acceleration systems.
†Equal contribution. ∗Corresponding author.

## 1 Introduction
Large language models (LLMs) have achieved remarkable capabilities across open-ended generation, reasoning, and interactive assistance (Grattafiori et al., [2024](https://arxiv.org/html/2605.14005#bib.bib8); Liu et al., [2024](https://arxiv.org/html/2605.14005#bib.bib15); Yang et al., [2025](https://arxiv.org/html/2605.14005#bib.bib19)). However, autoregressive decoding remains inherently sequential, as each generated token requires a separate target-model invocation conditioned on the preceding context. Speculative decoding mitigates this bottleneck through a draft-then-verify paradigm: a lightweight drafter proposes candidate continuations, and the target model verifies them in parallel (Leviathan et al., [2023](https://arxiv.org/html/2605.14005#bib.bib11); Chen et al., [2023](https://arxiv.org/html/2605.14005#bib.bib3)). By accepting multiple draft tokens in one target-model forward pass, speculative decoding can accelerate generation while preserving the target model's output distribution under standard verification rules. Thus, the practical efficiency of speculative decoding depends not merely on how many tokens are drafted, but on how many of them are accepted by the target verifier. The average accepted length $\tau$ therefore captures a central mechanism behind speculative acceleration.
Recent speculative decoding systems explicitly optimize this mechanism by improving drafter–target agreement. They introduce auxiliary prediction heads, target-model features, dynamic draft trees, fused intermediate representations, or shared computation to make draft proposals more acceptable to the target verifier (Cai et al., [2024](https://arxiv.org/html/2605.14005#bib.bib2); Li et al., [2024a](https://arxiv.org/html/2605.14005#bib.bib12), [b](https://arxiv.org/html/2605.14005#bib.bib13); Ankner et al., [2024](https://arxiv.org/html/2605.14005#bib.bib1); Li et al., [2025](https://arxiv.org/html/2605.14005#bib.bib14)). Alignment-oriented work further shows that mitigating token and feature misalignment improves draft-token acceptance, accepted length, and speedup (Hu et al., [2025a](https://arxiv.org/html/2605.14005#bib.bib9)). These advances reveal that acceptance is not a secondary implementation detail, but a critical foundation that underpins the efficiency of speculative decoding. While residual drafter–target mismatch is usually treated as an efficiency bottleneck to be reduced, we show that it can also become an attack surface.
Existing safety-oriented studies on speculative decoding mainly examine privacy leakage or generated-content safety. Input-dependent speculation patterns may create side channels that leak private information (Wei et al., [2024](https://arxiv.org/html/2605.14005#bib.bib18)), while safety-aware decoding methods use auxiliary or small expert models to improve output safety (Wang et al., [2025b](https://arxiv.org/html/2605.14005#bib.bib17), [a](https://arxiv.org/html/2605.14005#bib.bib16)). In this paper, we raise a largely unexplored mechanism-level security question: *Can the draft-verification pathway itself be adversarially degraded while the final response remains visibly normal?* If a small perturbation preserves the target model's response distribution while causing drafter proposals to diverge from the target verifier, drafted tokens will be repeatedly rejected during verification. Consequently, $\tau$ collapses, speedup disappears, and average token throughput decreases. We define this failure mode as an *acceleration-collapse attack*, which disables the mechanism that makes generation fast while preserving generated-content quality.
Figure 1: Illustration of acceptance collapse under Mistletoe. In normal speculative decoding, the target model accepts many draft tokens per verification step, yielding high acceptance and high speedup. When speculative decoding is attacked by Mistletoe, the final response semantics remain preserved, but misaligned draft tokens are rejected by the target verifier, forcing fallback generation from target logits and collapsing the average accepted length $\tau$ and speedup.

Based on these insights, we propose Mistletoe, a stealthy acceleration-collapse attack against speculative decoding. As illustrated in Figure [1](https://arxiv.org/html/2605.14005#S1.F1), normal speculative decoding commits multiple accepted draft tokens per verification step, whereas decoding under Mistletoe frequently falls back to target-generated tokens. The name reflects the attack's parasitic nature: it remains unobtrusive at the output level while draining the efficiency benefit of the host decoding pipeline. Unlike content-level attacks that aim to alter the generated response, Mistletoe targets the core verification-and-acceptance mechanism that makes speculative decoding fast. It increases target-side surprisal of drafter-proposed tokens to reduce their acceptability, while constraining the target model's output distribution to preserve response quality.
To reach the attack goal, a key challenge is the optimization conflict induced by drafter–target alignment. Draft tokens are designed to approximate the target model's high-probability continuations; therefore, perturbations that reduce draft-token acceptability can also disturb the target model's own output distribution. To address this conflict, Mistletoe restricts the rejection direction to the local null space of the semantic-preservation constraint (Fang et al., [2024](https://arxiv.org/html/2605.14005#bib.bib7)). This projection encourages updates that increase rejection pressure while limiting semantic drift. A KL-threshold filter further vetoes discrete suffix candidates whose target-distribution drift exceeds a preset bound.

Experiments on representative speculative decoding systems show that Mistletoe substantially reduces $\tau$, speedup, and average token throughput, while preserving output quality and perplexity. These results demonstrate that speculative decoding can be vulnerable even when user-facing textual outputs appear normal, highlighting the need for robust and security-aware acceleration mechanisms.
In summary, our contributions are threefold:
- We identify *acceptance collapse* as a mechanism-level threat to speculative decoding, where adversarially amplified drafter–target mismatch reduces the average accepted length $\tau$ while leaving user-facing outputs largely preserved.
- We propose Mistletoe, a stealthy acceleration-collapse attack that degrades the verification-and-acceptance pathway rather than directly corrupting generated content.
- We develop a null-space projected optimization method with KL-threshold filtering to reduce draft-token acceptability while suppressing semantic drift, and empirically demonstrate substantial degradation of speculative decoding efficiency across representative systems.
## 2 Related Work

### 2.1 Speculative Decoding for Efficient Inference

Speculative decoding accelerates autoregressive generation through a draft-then-verify paradigm, where a lightweight drafter proposes candidate continuations and the target model verifies them in parallel (Leviathan et al., [2023](https://arxiv.org/html/2605.14005#bib.bib11); Chen et al., [2023](https://arxiv.org/html/2605.14005#bib.bib3)). Its efficiency depends on accepting multiple draft tokens per target-model forward pass, making the average accepted length central to realized speedup.

Existing work mainly improves this mechanism by enhancing draft quality or verification efficiency through multiple decoding heads, target-model features, dynamic draft trees, sequentially dependent draft heads, and multi-layer feature fusion (Cai et al., [2024](https://arxiv.org/html/2605.14005#bib.bib2); Li et al., [2024a](https://arxiv.org/html/2605.14005#bib.bib12), [b](https://arxiv.org/html/2605.14005#bib.bib13); Ankner et al., [2024](https://arxiv.org/html/2605.14005#bib.bib1); Li et al., [2025](https://arxiv.org/html/2605.14005#bib.bib14)). Recent surveys further cover independent-drafter, retrieval- or n-gram-based, model-free, self-speculative, and draft-head-based variants (Hu et al., [2025b](https://arxiv.org/html/2605.14005#bib.bib10)). Despite architectural differences, these methods share a common objective: increasing drafter–target agreement so that more candidate tokens are accepted per verification step. GRIFFIN further highlights this dependency by identifying token and feature misalignment as a bottleneck for draft-token acceptance and improving accepted length by mitigating such misalignment (Hu et al., [2025a](https://arxiv.org/html/2605.14005#bib.bib9)). Together, these studies establish drafter–target agreement as a central determinant of speculative decoding efficiency.
### 2.2 Safety and Robustness of Speculative Decoding

As speculative decoding becomes increasingly relevant to efficient LLM deployment, recent studies have examined its safety implications. One line studies privacy leakage through input-dependent speculation patterns, where observable decoding behaviors may create side channels (Wei et al., [2024](https://arxiv.org/html/2605.14005#bib.bib18)). Another leverages speculative or auxiliary-model decoding to improve output safety, such as detecting jailbreak risks or constructing token-level safety signals for safer generation (Wang et al., [2025b](https://arxiv.org/html/2605.14005#bib.bib17), [a](https://arxiv.org/html/2605.14005#bib.bib16)). These works provide valuable insights, but they focus on privacy leakage or generated-content safety and leave the robustness of the acceleration mechanism itself largely unexplored.

In contrast, we take a mechanism-level perspective: the drafter–target agreement that enables acceleration also defines a fragile boundary of speculative decoding. Mistletoe shows that adversarial perturbations can amplify drafter–target mismatch, collapsing draft-token acceptance while keeping final response behavior largely preserved. This exposes a performance-robustness threat to speculative decoding, complementing prior studies on privacy and output-level safety.
## 3 Preliminaries

### 3.1 Speculative Decoding
We formalize the draft-then-verify process underlying speculative decoding. Let $M_\theta$ denote the target language model, or verifier, and $D_\phi$ denote the drafter. Given a prompt $x$, $D_\phi$ proposes draft tokens and $M_\theta$ verifies them in parallel. Let $t$ index draft-then-verify cycles, and let $Y^{(t)}$ denote the accepted output prefix before the $t$-th cycle. Within this cycle, the drafter proposes $\hat{y}^{(t)}_1,\ldots,\hat{y}^{(t)}_K$, where $K$ is the draft budget and $i$ indexes the position within the current draft.

We denote the drafter distribution for the $i$-th draft token by $\rho_\phi(\cdot\mid x, Y^{(t)}, \hat{y}^{(t)}_{<i})$, where $\hat{y}^{(t)}_{<i}$ represents previously drafted tokens in the same chain or tree branch. The target verifier distribution is denoted by $\pi_\theta(\cdot\mid x, Y^{(t)}, \hat{y}^{(t)}_{<i})$. This notation abstracts over different implementations: $\rho_\phi$ may be produced by an independent draft model, auxiliary draft heads, or feature-reuse modules. In all cases, acceleration is governed by the agreement between $\rho_\phi$ and $\pi_\theta$.
A draft token is likely to be accepted only when the target assigns it probability comparable to the drafter. For a chain draft under the standard speculative verification rule (Leviathan et al., [2023](https://arxiv.org/html/2605.14005#bib.bib11); Chen et al., [2023](https://arxiv.org/html/2605.14005#bib.bib3)), the acceptance probability of $\hat{y}^{(t)}_i$ is

$$\alpha^{(t)}_i=\min\left(1,\ \pi_\theta\left(\hat{y}^{(t)}_i\mid x, Y^{(t)}, \hat{y}^{(t)}_{<i}\right)\Big/\,\rho_\phi\left(\hat{y}^{(t)}_i\mid x, Y^{(t)}, \hat{y}^{(t)}_{<i}\right)\right). \tag{1}$$

If accepted, the draft token is committed to the output prefix; otherwise, the verifier falls back to the target logits and discards the remaining draft tokens along that path. Thus, $\{\alpha^{(t)}_i\}_{i=1}^{K}$ determine how many draft tokens survive verification. Let $a^{(t)}$ be the number of tokens committed in the $t$-th cycle, including accepted draft tokens and the target-generated fallback token. The average accepted length is
$$\tau=\mathbb{E}_t\left[a^{(t)}\right]. \tag{2}$$

In practice, $\tau$ is measured as the number of generated tokens divided by the number of target-model forward passes. A large $\tau$ indicates effective amortization of target computation, whereas $\tau\approx 1$ indicates degeneration toward vanilla autoregressive decoding. For tree-based speculative decoding, multiple branches are verified with tree attention, but the same acceptance principle holds.
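The acceptance rule of Eq. (1) and the resulting accepted length of Eq. (2) can be illustrated with a toy Monte Carlo sketch. This is not the paper's implementation: the constant drafter/target probabilities, draft budget of 5, and random seed are invented for illustration.

```python
import random

def verify_chain(draft_tokens, rho, pi, rng):
    """Standard speculative verification of a K-token draft chain.

    draft_tokens: proposed tokens [y_1, ..., y_K]
    rho, pi: callables giving the drafter / target probability of
             draft token y at position i under the shared prefix.
    Returns a^(t): accepted draft tokens plus the one token the
    target itself emits (fallback on rejection, or bonus token).
    """
    accepted = 0
    for i, y in enumerate(draft_tokens):
        alpha = min(1.0, pi(i, y) / rho(i, y))  # Eq. (1)
        if rng.random() < alpha:
            accepted += 1
        else:
            break  # rejection discards the rest of the chain
    return accepted + 1  # + target-generated token

rng = random.Random(0)
# Toy setting: the drafter slightly over-estimates every token,
# so per-token acceptance is min(1, 0.72/0.80) = 0.9.
pi_p, rho_p = 0.72, 0.80
cycles = [verify_chain([0] * 5, lambda i, y: rho_p, lambda i, y: pi_p, rng)
          for _ in range(10_000)]
tau = sum(cycles) / len(cycles)  # empirical estimate of Eq. (2)
print(f"tau = {tau:.2f}")
```

With these numbers the expected committed length works out to about 4.7 tokens per cycle; an adversary that pushes the per-token acceptance toward 0 would drive $\tau$ toward 1, i.e., vanilla autoregressive decoding.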
### 3.2 Threat Model and Objective

We consider an adversary who aims to degrade speculative acceleration without directly corrupting the final response. The target model $M_\theta$ and drafter $D_\phi$ remain fixed. The adversary appends a short discrete suffix $\delta\in\mathcal{V}^m$ to the clean prompt $x$, producing

$$x_\delta = x \oplus \delta, \tag{3}$$

where $\mathcal{V}$ is the vocabulary, $m$ is the suffix length, and $\oplus$ denotes concatenation. During attack construction, the adversary uses white-box gradients to optimize $\delta$; at deployment time, the attack only requires submitting $x_\delta$.
Under $x_\delta$, speculative decoding follows the same procedure, but the drafter and verifier distributions become $\rho_\phi(\cdot\mid x_\delta, Y^{(t)}, \hat{y}^{(t)}_{<i})$ and $\pi_\theta(\cdot\mid x_\delta, Y^{(t)}, \hat{y}^{(t)}_{<i})$. The attack seeks to make draft tokens less acceptable to the target verifier, thereby reducing $a^{(t)}$, collapsing $\tau$, and eliminating the speedup benefit. Meanwhile, the target model's output distribution should remain close to its clean behavior so that user-facing responses are largely preserved.
We formulate this goal as a constrained discrete optimization problem:
$$\max_{\delta\in\mathcal{V}^m}\ \mathcal{L}_{\mathrm{rej}}(x,\delta)\qquad\mathrm{s.t.}\qquad\mathcal{L}_{\mathrm{sem}}(x,\delta)\leq\epsilon. \tag{4}$$

Here, $\mathcal{L}_{\mathrm{rej}}$ measures rejection pressure on drafter-proposed tokens, while $\mathcal{L}_{\mathrm{sem}}$ defines the semantic-preservation constraint by bounding target-model distributional drift induced by the suffix. We instantiate these terms in the next section.
Figure 2: Pipeline of Mistletoe. The adversarial suffix $\delta_k$ is appended to the clean prompt $x$ and passed through the speculative decoding system. We visualize one representative draft token $\hat{y}^{(t)}_i$; in practice, the objectives aggregate over multiple positions. Target-side Draft-Token Surprisal increases rejection pressure by reducing the target verifier's confidence in drafter-proposed tokens, while KL-bounded Target Preservation constrains the adversarial target distribution to remain close to the clean one. The rejection direction is restricted to the semantic null space, producing a feasible update direction for discrete suffix search. A KL-bound veto filters infeasible candidates, and the selected high-surprisal suffix becomes $\delta_{k+1}$.
## 4 Method

### 4.1 Overview

To collapse speculative acceleration while preserving final-response behavior, Mistletoe optimizes a short discrete suffix appended to the clean prompt, as illustrated in Figure [2](https://arxiv.org/html/2605.14005#S3.F2). It combines two information-theoretic objectives: Target-side Draft-Token Surprisal increases the mismatch between drafter proposals and the target verifier, reducing draft-token acceptability; KL-bounded Target Preservation constrains the adversarial target distribution to remain close to the clean one. Because these objectives can induce entangled gradients, Mistletoe restricts the rejection direction to the local null space of the semantic-preservation constraint, and uses the resulting feasible direction to guide discrete suffix search with a KL-bound veto. Together, these components reduce the average accepted length $\tau$ and speculative speedup while keeping user-facing responses largely preserved.
### 4.2 On Acceptance Collapse: Surprisal under KL Constraints

As discussed in Section [3.1](https://arxiv.org/html/2605.14005#S3.SS1), speculative decoding derives its speedup from the draft-then-verify mechanism, where drafter-proposed tokens are committed only when they remain sufficiently likely under the target verifier. This mechanism naturally defines the attack target of Mistletoe: rather than directly corrupting the final response, we aim to reduce the acceptability of draft tokens under the target verifier. A direct way to do so is to lower the target probability assigned to drafter-proposed tokens, which decreases their acceptance probability, reduces the committed length $a^{(t)}$, and ultimately collapses the average accepted length $\tau$. However, this strategy is difficult to optimize safely. Speculative decoding is effective precisely because the drafter is designed to approximate the target verifier; hence, accepted draft tokens often lie close to the target model's own high-probability region. Aggressively suppressing these tokens can therefore distort the target distribution and expose the attack through semantic drift or abnormal outputs.
To instantiate the constrained objective defined in Section [3.2](https://arxiv.org/html/2605.14005#S3.SS2), we use two complementary quantities. The first increases the target-side surprisal of draft tokens, creating rejection pressure against the acceleration pathway. The second bounds the distributional drift of the target verifier, preserving the model's clean response behavior.
#### Target-side draft-token surprisal.

For a drafter-proposed token $\hat{y}^{(t)}_i$, its surprisal under the adversarial prompt $x_\delta$ is

$$s_\theta\left(\hat{y}^{(t)}_i; x_\delta\right)=-\log\pi_\theta\left(\hat{y}^{(t)}_i\mid x_\delta, Y^{(t)}, \hat{y}^{(t)}_{<i}\right). \tag{5}$$

Maximizing this quantity lowers the target verifier's confidence in drafter-proposed tokens, hence increasing rejection pressure. Given attacked draft-token positions $\mathcal{I}$, we define the rejection objective:

$$\mathcal{L}_{\mathrm{rej}}(x,\delta)=\frac{1}{|\mathcal{I}|}\sum_{(t,i)\in\mathcal{I}}s_\theta\left(\hat{y}^{(t)}_i; x_\delta\right). \tag{6}$$
#### KL-bounded target preservation.

While $\mathcal{L}_{\mathrm{rej}}$ targets the acceleration pathway, the semantic constraint requires the target model's own output behavior to remain close to its clean behavior. We therefore use the clean target distribution as an information-theoretic reference for the adversarial target distribution. Let $\mathcal{S}$ denote the positions used to estimate distributional drift. We define

$$\mathcal{L}_{\mathrm{sem}}(x,\delta)=\frac{1}{|\mathcal{S}|}\sum_{t\in\mathcal{S}}D_{\mathrm{KL}}\left(\pi_\theta(\cdot\mid x, Y^{(t)})\,\middle\|\,\pi_\theta(\cdot\mid x_\delta, Y^{(t)})\right). \tag{7}$$

This KL term penalizes shifts from the clean target distribution to the adversarial one, preventing the attack from achieving rejection by broadly corrupting the target model's next-token preferences.
Together, Eq. ([6](https://arxiv.org/html/2605.14005#S4.E6)) and Eq. ([7](https://arxiv.org/html/2605.14005#S4.E7)) instantiate the constrained objective in Eq. ([4](https://arxiv.org/html/2605.14005#S3.E4)). $\mathcal{L}_{\mathrm{rej}}$ enlarges the information mismatch between drafter proposals and the target verifier, while $\mathcal{L}_{\mathrm{sem}}$ limits information drift within the target model itself. However, maximizing $\mathcal{L}_{\mathrm{rej}}$ under the constraint of $\mathcal{L}_{\mathrm{sem}}$ introduces an inherent optimization conflict. Because the drafter is trained or designed to approximate the target verifier, gradients that suppress draft-token probability can overlap with directions that preserve the target distribution. A direct weighted combination can therefore be unstable: the rejection gradient may increase semantic drift, while the preservation gradient may weaken the attack signal. This geometric entanglement motivates the null-space projected optimization introduced next.
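Given target logits at the relevant positions, the two objectives in Eqs. (6) and (7) reduce to a surprisal average and a KL average. The following is a minimal numerical sketch, not the paper's code: the vocabulary size, position counts, and random toy logits are invented, and in the real attack the logits would come from the target model under $x$ and $x_\delta$.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax along the vocabulary axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def rejection_loss(target_logits, draft_token_ids):
    """L_rej, Eq. (6): mean target-side surprisal of drafter proposals.
    target_logits: (|I|, V) target logits at the attacked positions;
    draft_token_ids: (|I|,) drafter-proposed token ids there."""
    logp = log_softmax(target_logits)
    surprisal = -logp[np.arange(len(draft_token_ids)), draft_token_ids]
    return surprisal.mean()

def semantic_loss(clean_logits, adv_logits):
    """L_sem, Eq. (7): mean KL(clean || adversarial) over the
    monitored positions. Both arrays have shape (|S|, V)."""
    logp, logq = log_softmax(clean_logits), log_softmax(adv_logits)
    p = np.exp(logp)
    return (p * (logp - logq)).sum(axis=-1).mean()

rng = np.random.default_rng(0)
clean = rng.normal(size=(4, 32))               # toy clean target logits
adv = clean + 0.1 * rng.normal(size=(4, 32))   # small adversarial drift
drafts = clean.argmax(axis=-1)                 # drafter mimics the target
print(rejection_loss(clean, drafts), rejection_loss(adv, drafts))
print(semantic_loss(clean, adv))
```

The sketch also makes the conflict visible: because `drafts` sits at the target's own mode, any perturbation that raises their surprisal necessarily moves the very distribution that `semantic_loss` pins down.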
### 4.3 Null-Space Projected Optimization

To solve the constrained problem in Eq. ([4](https://arxiv.org/html/2605.14005#S3.E4)), we seek update directions that increase rejection pressure while remaining locally feasible under the semantic-preservation constraint. Since the suffix $\delta$ is discrete, we compute gradients in a continuous relaxation of the suffix, such as the embedding or one-hot token space, and use them only to score token substitutions. Let $\mathbf{z}$ denote this relaxed suffix representation, and let $g_{\mathrm{rej}}=\nabla_{\mathbf{z}}\mathcal{L}_{\mathrm{rej}}(x,\delta)$ and $g_{\mathrm{sem}}=\nabla_{\mathbf{z}}\mathcal{L}_{\mathrm{sem}}(x,\delta)$ denote the rejection and semantic-drift gradients.
#### Local semantic null space.

The key idea is to optimize rejection only within directions that locally preserve the target distribution. Under a first-order approximation, the semantic-preservation constraint defines a local feasible subspace around the current suffix. Let $J_{\mathrm{sem}}(\mathbf{z})=\nabla_{\mathbf{z}}\mathcal{L}_{\mathrm{sem}}(x,\delta)^\top$ denote the Jacobian of the semantic-preservation objective. The local semantic null space is:

$$\mathcal{N}_{\mathrm{sem}}(\mathbf{z})=\left\{\Delta\mid J_{\mathrm{sem}}(\mathbf{z})\,\Delta=0\right\}. \tag{8}$$

Directions in $\mathcal{N}_{\mathrm{sem}}(\mathbf{z})$ keep $\mathcal{L}_{\mathrm{sem}}$ unchanged to first order, and therefore provide a local feasible subspace for rejection optimization.
#### Null-space projection.
We construct the orthogonal projector onto the local semantic null space:
$$\mathbf{P}_{\mathcal{N}}=\mathbf{I}-J_{\mathrm{sem}}^\top\left(J_{\mathrm{sem}}J_{\mathrm{sem}}^\top+\xi\mathbf{I}\right)^{-1}J_{\mathrm{sem}}, \tag{9}$$

where $\xi$ is a small damping term for numerical stability. The null-space rejection direction is then obtained by $g_{\mathrm{rej}}^{\mathcal{N}}=\mathbf{P}_{\mathcal{N}}\,g_{\mathrm{rej}}$. In our scalar KL-constraint setting, this projection reduces to an efficient rank-one operation that removes the component of $g_{\mathrm{rej}}$ along $g_{\mathrm{sem}}$. Unlike direct weighted optimization, this explicitly restricts rejection optimization to the local null space of the semantic-preservation constraint.
The final scoring direction combines feasibility restoration with null-space rejection:

$$g_{\mathrm{final}}=-g_{\mathrm{sem}}+\lambda\,g_{\mathrm{rej}}^{\mathcal{N}}, \tag{10}$$

where $\lambda$ controls the strength of acceleration degradation. The first term pulls the adversarial target distribution back toward the clean target distribution, while the second term increases rejection pressure within the local semantic null space.
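For a scalar constraint, the projector in Eq. (9) collapses to the rank-one operation described above, and Eq. (10) is a vector sum. A minimal sketch (the dimension, seed, and $\lambda$ value are arbitrary illustrations, not values from the paper):

```python
import numpy as np

def feasible_direction(g_rej, g_sem, lam=1.0, xi=1e-8):
    """Rank-one case of Eq. (9) combined with Eq. (10):
    remove the component of g_rej along g_sem (null-space
    projection for a scalar constraint), then add the
    feasibility-restoring term -g_sem."""
    # P_N g_rej = g_rej - g_sem * (g_sem . g_rej) / (||g_sem||^2 + xi)
    g_rej_null = g_rej - g_sem * (g_sem @ g_rej) / (g_sem @ g_sem + xi)
    return -g_sem + lam * g_rej_null

rng = np.random.default_rng(1)
g_rej, g_sem = rng.normal(size=64), rng.normal(size=64)
g_final = feasible_direction(g_rej, g_sem, lam=2.0)

# Recover the projected rejection component and check first-order
# feasibility: it is (numerically) orthogonal to g_sem, so moving
# along it leaves L_sem unchanged to first order.
g_null = feasible_direction(g_rej, g_sem, lam=1.0) + g_sem
print(abs(g_sem @ g_null))  # tiny (limited only by the damping xi)
```

The damping term $\xi$ plays the same role as in Eq. (9): it keeps the division well-defined when the semantic gradient is nearly zero, at the cost of an orthogonality error on the order of $\xi/\lVert g_{\mathrm{sem}}\rVert^2$.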
#### Discrete suffix update with KL-bound veto.

The null-space direction provides a local continuous search signal, but the suffix itself consists of discrete tokens. We therefore use $g_{\mathrm{final}}$ to score token substitutions and construct a candidate set $\mathcal{C}(\delta)$ following gradient-guided suffix search. Since a discrete substitution may deviate from the local first-order approximation, each candidate $\delta'$ is re-evaluated by a forward pass before selection. To enforce the semantic-preservation constraint in Eq. ([4](https://arxiv.org/html/2605.14005#S3.E4)), we apply a KL-bound veto and select

$$\delta^\star=\arg\max_{\delta'\in\mathcal{C}(\delta)}\ \mathcal{L}_{\mathrm{rej}}(x,\delta')\quad\mathrm{s.t.}\quad\mathcal{L}_{\mathrm{sem}}(x,\delta')\leq\epsilon. \tag{11}$$

Candidates with $\mathcal{L}_{\mathrm{sem}}(x,\delta')>\epsilon$ are discarded, and among the remaining feasible candidates, we choose the one with the largest rejection objective. The suffix is then updated as $\delta\leftarrow\delta^\star$. Overall, the null-space projection guides the local search direction, while the KL-bound veto enforces the semantic constraint on actual discrete candidates, bridging continuous optimization and discrete suffix updates. The complete optimization procedure is summarized in Appendix [B](https://arxiv.org/html/2605.14005#A2).
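The veto-and-select step of Eq. (11) is a filtered argmax. The sketch below is schematic: the tuple candidates and their scores are placeholders, and in the real attack `rej_fn` and `sem_fn` would be forward-pass re-evaluations of $\mathcal{L}_{\mathrm{rej}}$ and $\mathcal{L}_{\mathrm{sem}}$ on each candidate suffix.

```python
def select_suffix(candidates, rej_fn, sem_fn, eps):
    """Eq. (11): discard candidates whose re-evaluated KL drift
    exceeds eps, then pick the feasible candidate with the largest
    rejection objective. Returns None when no candidate is feasible,
    in which case the current suffix is kept."""
    feasible = [d for d in candidates if sem_fn(d) <= eps]
    if not feasible:
        return None
    return max(feasible, key=rej_fn)

# Toy stand-ins: each candidate carries (name, L_rej, L_sem).
cands = [("a", 3.0, 0.9), ("b", 5.0, 0.4), ("c", 7.0, 1.3)]
best = select_suffix(
    cands,
    rej_fn=lambda d: d[1],
    sem_fn=lambda d: d[2],
    eps=1.0,
)
print(best[0])  # "b": "c" has a higher L_rej but violates the KL bound
```

The re-evaluation before selection matters: a substitution scored as promising by the first-order direction may still overshoot the KL budget, and the veto catches exactly those cases.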
Table 1: Attack results of Mistletoe across models, decoding methods, and datasets. Each entry is formatted as clean/attacked, with attacked values highlighted in red. The final row reports the average absolute reduction marked by ↓, with the average relative reduction shown in parentheses. Lower attacked speed-up and accepted token length $\tau$ indicate stronger acceleration collapse.
## 5 Experiments

### 5.1 Experimental Setup

#### Models and speculative decoding systems.

We evaluate Mistletoe on Vicuna-7B and Vicuna-13B target models (Chiang et al., [2023](https://arxiv.org/html/2605.14005#bib.bib5)). For speculative decoding, we consider Medusa (Cai et al., [2024](https://arxiv.org/html/2605.14005#bib.bib2)), Hydra (Ankner et al., [2024](https://arxiv.org/html/2605.14005#bib.bib1)), EAGLE (Li et al., [2024a](https://arxiv.org/html/2605.14005#bib.bib12)), EAGLE-2 (Li et al., [2024b](https://arxiv.org/html/2605.14005#bib.bib13)), and EAGLE-3 (Li et al., [2025](https://arxiv.org/html/2605.14005#bib.bib14)). EAGLE-3 is included only for Vicuna-13B, as its Vicuna-7B checkpoint is unavailable. For each setting, both the target model $M_\theta$ and the drafter $D_\phi$ remain fixed; the adversary only optimizes a short discrete suffix appended to the input prompt.
#### Datasets.

We evaluate on three representative benchmarks covering open-ended dialogue, code generation, and mathematical reasoning. Specifically, we use all 80 questions from MT-Bench (Zheng et al., [2023](https://arxiv.org/html/2605.14005#bib.bib20)), randomly sample 100 examples from HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.14005#bib.bib4)), and randomly sample 100 examples from GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.14005#bib.bib6)). These datasets allow us to test Mistletoe across diverse generation scenarios, including instruction following, functional program synthesis, and multi-step mathematical reasoning.
#### Evaluation metrics.

We report average accepted length $\tau$ and speedup over vanilla autoregressive decoding as the primary efficiency metrics. Lower $\tau$ indicates that fewer draft tokens are committed per target-model forward pass, while lower speedup indicates a weaker realized acceleration benefit. Together, they directly measure whether Mistletoe collapses the draft-then-verify acceleration pathway at both the acceptance and end-to-end efficiency levels.
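Both metrics can be read off decoding logs: $\tau$ as generated tokens per target forward pass (as noted for Eq. (2)), and speedup as a wall-time ratio. A minimal sketch with invented counts and timings (not values from the paper):

```python
def efficiency_metrics(tokens_generated, target_forward_passes,
                       spec_wall_time_s, vanilla_wall_time_s):
    """tau: generated tokens per target-model forward pass, the
    practical estimator of Eq. (2); speedup: vanilla autoregressive
    wall time divided by speculative-decoding wall time."""
    tau = tokens_generated / target_forward_passes
    speedup = vanilla_wall_time_s / spec_wall_time_s
    return tau, speedup

# Hypothetical clean vs. attacked runs over the same prompt set:
tau_clean, sp_clean = efficiency_metrics(512, 128, 4.0, 12.0)
tau_atk, sp_atk = efficiency_metrics(512, 384, 9.0, 12.0)
print(tau_clean, sp_clean)  # 4.0 3.0
print(tau_atk, sp_atk)      # both ≈ 1.33: acceptance collapse
```

An attacked run needs more target forward passes for the same output length, which drives $\tau$ toward 1 and erodes the wall-time speedup in tandem.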
#### Implementation details.

All experiments are conducted on NVIDIA H20 GPUs. Mistletoe optimizes a discrete suffix $\delta\in\mathcal{V}^m$ with $m=20$, while all model parameters remain frozen. Clean and attacked inputs are evaluated under the same speculative decoding configuration for fair comparison. Additional implementation details, including optimization hyperparameters and KL bounds, are provided in Appendix [A](https://arxiv.org/html/2605.14005#A1).
### 5.2 Main Results
Table [1](https://arxiv.org/html/2605.14005#S4.T1) presents the main effectiveness of Mistletoe across two Vicuna backbones, multiple speculative decoding methods, and three datasets. Mistletoe consistently reduces both speed-up and average accepted length τ in all evaluated settings. On MT-Bench, the average speed-up drops by 1.89×, corresponding to a 51.7% relative reduction, while τ decreases by 0.99 on average. Similar reductions are observed on HumanEval and GSM8K, where speed-up drops by 2.12× and 2.20×, and τ decreases by 1.21 and 1.13, respectively. These results show that Mistletoe generalizes across open-ended dialogue, code generation, and mathematical reasoning.
The attack is particularly pronounced in settings with strong clean acceleration, where more accepted tokens can be disrupted. For example, Vicuna-13B with EAGLE-3 achieves 6.17× speed-up on HumanEval and 5.47× on GSM8K under clean decoding, but drops to 2.77× and 1.83× under Mistletoe. The corresponding accepted length also decreases from 7.08 to 3.99 and from 5.95 to 2.79, respectively. Even for methods with relatively lower clean acceleration, the attack remains effective. For instance, Medusa on Vicuna-13B drops from 3.26× to 1.48× on MT-Bench and from 3.42× to 1.42× on HumanEval, while Medusa on Vicuna-7B also decreases from 3.44× to 2.38× on MT-Bench and from 3.75× to 2.77× on HumanEval.
Overall, the consistent decrease in τ confirms that the loss of speed-up is driven by acceptance collapse. Since τ measures how many tokens are committed per target-model forward pass, its reduction indicates that fewer drafter-proposed tokens survive verification. These results further support our central mechanism-level claim that Mistletoe disables speculative acceleration by degrading draft-token acceptability across models, methods, and tasks.
Figure [3](https://arxiv.org/html/2605.14005#S5.F3) further explains the source of speed-up degradation. Compared with clean speculative decoding, Mistletoe shifts the distribution of committed tokens per target-model forward pass a^(t) toward smaller values and sharply reduces the survival probability P(a^(t) ≥ k) for long accepted prefixes. The per-example view shows that this reduction is broadly observed across prompts rather than caused by isolated outliers. These observations align with the τ reductions in Table [1](https://arxiv.org/html/2605.14005#S4.T1), confirming that the attack disables speculative acceleration by collapsing draft-token acceptance.
Figure 3: Mechanism visualization of acceptance collapse. The figure compares clean speculative decoding and Mistletoe using verification-cycle-level accepted lengths. (a) Accepted-length distribution: Mistletoe shifts the distribution of committed tokens per target-model forward pass a^(t) toward smaller values. (b) Long-prefix survival: the survival probability P(a^(t) ≥ k) drops sharply under attack, showing that long draft prefixes rarely survive verification. (c) Per-example collapse: the mean accepted length decreases across most prompts, indicating that the collapse is broadly observed rather than driven by isolated outliers.

Table 2: Ablation study of Mistletoe components. Lower speed-up and τ indicate stronger acceleration collapse; lower PPL and Rep-4 indicate more natural and less repetitive outputs.
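As a sketch of the quantities plotted in Figure 3, the accepted-length histogram and long-prefix survival probability can be estimated from per-cycle logs as follows (the accepted-length values below are illustrative, not the paper's measurements):

```python
from collections import Counter

def survival(accepted, k):
    """Empirical P(a^(t) >= k): fraction of verification cycles
    that commit at least k draft tokens."""
    return sum(1 for a in accepted if a >= k) / len(accepted)

# Illustrative per-cycle accepted lengths a^(t).
clean = [5, 6, 4, 7, 5, 6]
attacked = [2, 1, 3, 2, 1, 2]

histogram = Counter(attacked)       # accepted-length distribution (panel a)
p_clean = survival(clean, 5)        # long prefixes often survive when clean
p_attacked = survival(attacked, 5)  # ...but rarely under attack (panel b)
```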
### 5.3 Ablation Study
Table [2](https://arxiv.org/html/2605.14005#S5.T2) evaluates the contribution of each component in Mistletoe. Using only ℒ_rej reduces speed-up from 5.47× to 3.30× and decreases τ from 5.95 to 3.41, confirming that target-side draft-token surprisal provides a direct signal for degrading the acceptance pathway. However, this variant produces extremely high PPL, indicating that aggressive rejection optimization alone can lead to abnormal generations. Using only ℒ_sem is more conservative but much weaker in acceleration degradation, reducing speed-up only to 4.47×.
Naively combining ℒ_rej and ℒ_sem does not fully resolve this trade-off. Although it incorporates both objectives, it still yields high PPL and weaker acceleration collapse than the full method, suggesting that direct joint optimization struggles to coordinate rejection pressure and output normality. In contrast, full Mistletoe achieves the strongest acceleration collapse, reducing speed-up to 1.83× and τ to 2.79, while substantially lowering PPL and Rep-4 compared with the naive objectives. These results show that projected optimization is crucial for balancing the rejection and preservation objectives, enabling Mistletoe to degrade speculative acceleration without relying on visibly unnatural or repetitive outputs.
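The projected-optimization idea can be sketched in a few lines. This is a simplified illustration of the null-space projection and final scoring direction, with gradients flattened to plain vectors; `xi` and `lam` stand in for the paper's damping coefficient ξ and rejection weight λ:

```python
def project_rejection(g_rej, g_sem, xi=1e-6):
    """Remove the component of the rejection gradient along the
    local semantic-preserving direction (null-space projection)."""
    dot_rs = sum(a * b for a, b in zip(g_rej, g_sem))
    dot_ss = sum(b * b for b in g_sem)
    coef = dot_rs / (dot_ss + xi)   # damped projection coefficient
    return [a - coef * b for a, b in zip(g_rej, g_sem)]

def final_direction(g_rej, g_sem, lam=2.0, xi=1e-6):
    """Scoring direction: semantic preservation (-g_sem) plus the
    projected rejection gradient weighted by lam."""
    proj = project_rejection(g_rej, g_sem, xi)
    return [-s + lam * p for s, p in zip(g_sem, proj)]

# The projected gradient is (nearly) orthogonal to g_sem.
g_proj = project_rejection([1.0, 1.0], [0.0, 2.0])
```

Because the projection removes only the conflicting component, rejection pressure along directions that do not disturb the target distribution is left intact.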
### 5.4 Transferability Analysis
Table [3](https://arxiv.org/html/2605.14005#S5.T3) evaluates whether adversarial suffixes optimized on EAGLE-3 transfer to other speculative decoding methods. Across all target methods and datasets, transferred suffixes consistently reduce both speed-up and accepted length τ, indicating that Mistletoe captures a cross-method vulnerability rather than overfitting to the source decoding method. The transfer is especially strong on Medusa: speed-up drops from 3.68× to 1.03× on MT-Bench, from 3.11× to 1.15× on HumanEval, and from 3.39× to 1.06× on GSM8K. This shows that suffixes generated on EAGLE-3 can substantially disrupt even a different speculative decoding pipeline.
The transferred attack also remains effective on Hydra, EAGLE, and EAGLE-2, although the magnitude varies across methods and datasets. For example, EAGLE-2 drops from 4.44× to 2.03× on MT-Bench and from 5.00× to 2.29× on HumanEval, while Hydra drops from 4.59× to 2.33× on MT-Bench and from 4.26× to 1.91× on HumanEval. The consistent decrease in τ further suggests that transferability arises from degraded draft-token acceptance, supporting our claim that Mistletoe exploits a shared drafter–target agreement dependency across speculative decoding methods.
Table 3: Cross-method transferability of adversarial suffixes generated on EAGLE-3. Each entry is formatted as clean/transferred, with transferred values highlighted in red. Lower transferred speed-up and τ indicate stronger transferability.
## 6 Conclusion
We introduced Mistletoe, an acceleration-collapse attack against speculative decoding. Instead of corrupting final responses, it reduces draft-token acceptability under the target verifier through target-side surprisal, KL-bounded preservation, and null-space projected optimization. Experiments across Vicuna backbones, decoding methods, and benchmarks show consistent reductions in speed-up and τ. Ablations and transfer results validate the design and reveal a shared vulnerability in speculative decoding, motivating more robust acceleration mechanisms.
## References
- Ankner et al. [2024] Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding. *arXiv preprint arXiv:2402.05109*, 2024.
- Cai et al. [2024] Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. *arXiv preprint arXiv:2401.10774*, 2024.
- Chen et al. [2023] Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. *arXiv preprint arXiv:2302.01318*, 2023.
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. *arXiv preprint arXiv:2107.03374*, 2021.
- Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Ziqing Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. *See https://vicuna.lmsys.org (accessed 14 April 2023)*, 2(3):6, 2023.
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021.
- Fang et al. [2024] Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, and Tat-Seng Chua. Alphaedit: Null-space constrained knowledge editing for language models. *arXiv preprint arXiv:2410.02355*, 2024.
- Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. *arXiv preprint arXiv:2407.21783*, 2024.
- Hu et al. [2025a] Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, and Pan Zhou. Griffin: Effective token alignment for faster speculative decoding. *arXiv preprint arXiv:2502.11018*, 2025a.
- Hu et al. [2025b] Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, and Sai Qian Zhang. Speculative decoding and beyond: An in-depth survey of techniques. *arXiv preprint arXiv:2502.19732*, 2025b.
- Leviathan et al. [2023] Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In *International Conference on Machine Learning*, pages 19274–19286. PMLR, 2023.
- Li et al. [2024a] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: Speculative sampling requires rethinking feature uncertainty. *arXiv preprint arXiv:2401.15077*, 2024a.
- Li et al. [2024b] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-2: Faster inference of language models with dynamic draft trees. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, pages 7421–7432, 2024b.
- Li et al. [2025] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle-3: Scaling up inference acceleration of large language models via training-time test. *arXiv preprint arXiv:2503.01840*, 2025.
- Liu et al. [2024] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. *arXiv preprint arXiv:2412.19437*, 2024.
- Wang et al. [2025a] Jiayou Wang, Rundong Liu, Yue Hu, Huijia Wu, and Zhaofeng He. Secdecoding: Steerable decoding for safer llm generation. In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 20504–20521, 2025a.
- Wang et al. [2025b] Xuekang Wang, Shengyu Zhu, and Xueqi Cheng. Speculative safety-aware decoding. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pages 12838–12852, 2025b.
- Wei et al. [2024] Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, and Gururaj Saileshwar. When speculation spills secrets: Side channels via speculative decoding in llms. *arXiv preprint arXiv:2411.01076*, 2024.
- Yang et al. [2025] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. *arXiv preprint arXiv:2505.09388*, 2025.
- Zheng et al. [2023] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. *Advances in Neural Information Processing Systems*, 36:46595–46623, 2023.
## Appendix A More Experimental Configuration
We generate adversarial suffixes for text-based prompts to disrupt the efficiency of speculative decoding systems. We evaluate Mistletoe on several widely used speculative decoding frameworks and describe their implementation settings below. Unless otherwise specified, all systems use their standard speculative decoding configurations.
#### Speculative decoding systems\.
We evaluate the following speculative decoding systems\.
- **EAGLE.** EAGLE [Li et al., [2024a](https://arxiv.org/html/2605.14005#bib.bib12)] uses tree-based speculative decoding for feature-level drafting. We adopt the standard `mc_sim_7b_63` tree configuration to organize and verify candidate token sequences.
- **EAGLE-2.** EAGLE-2 [Li et al., [2024b](https://arxiv.org/html/2605.14005#bib.bib13)] extends EAGLE with an enhanced dynamic drafting tree. To isolate the EAGLE-2 architecture from EAGLE-3 components, we disable EAGLE-3-specific features using `--no-eagle3`.
- **EAGLE-3.** EAGLE-3 [Li et al., [2025](https://arxiv.org/html/2605.14005#bib.bib14)] improves feature-level alignment by using multiple intermediate hidden states from the target model. We enable this mechanism with `--use-eagle3`, which uses early-, middle-, and late-layer hidden states for drafting.
- **Medusa.** Medusa [Cai et al., [2024](https://arxiv.org/html/2605.14005#bib.bib2)] adopts a multi-head parallel speculative decoding design. We use its native setting with five Medusa heads, verification over up to ten draft candidates, and a KV-cache margin of 128 tokens. When pre-computing clean reference logits for the KL objective, we temporarily remove the Medusa attention mask to obtain the target reference distribution; inference-time decoding follows the standard Medusa pipeline.
- **Hydra.** Hydra [Ankner et al., [2024](https://arxiv.org/html/2605.14005#bib.bib1)] uses multiple prediction heads with a posterior acceptance mechanism. We use its default posterior acceptance setting, with threshold 0.09 and coefficient 0.3. As with Medusa, we disable the Hydra attention mask only when pre-computing clean reference logits for the KL objective, while keeping standard Hydra decoding during evaluation.
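For intuition, a posterior ("typical") acceptance rule of the kind used by Medusa-style heads can be sketched as below. This is a hedged illustration, not the exact Hydra implementation: it accepts a draft token when its target-model probability clears an entropy-adaptive bound, reusing the 0.09 threshold and 0.3 coefficient quoted above.

```python
import math

def posterior_accept(p_token, probs, threshold=0.09, alpha=0.3):
    """Accept a draft token if its target-model probability exceeds
    an entropy-adaptive bound: min(threshold, alpha * exp(-H(probs)))."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return p_token > min(threshold, alpha * math.exp(-entropy))

# With a peaked target distribution, a high-probability draft token
# passes while a low-probability one is rejected.
probs = [0.97, 0.01, 0.01, 0.01]
```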
#### Optimization hyperparameters\.
We use the same adversarial optimization protocol across all target models and speculative decoding systems. The maximum number of suffix-optimization iterations is set to 50. The semantic-preservation objective is estimated over 20 predictive positions. The null-space rejection weight is fixed to λ = 2.0, corresponding to Eq. ([10](https://arxiv.org/html/2605.14005#S4.E10)). The optimized suffix is directly appended to the clean input prompt.
#### Dataset\-specific KL bounds\.
To bound target-distribution drift during discrete candidate selection, we use dataset-specific KL thresholds. The threshold is set to 5.0 for GSM8K [Cobbe et al., [2021](https://arxiv.org/html/2605.14005#bib.bib6)], 7.0 for MT-Bench [Zheng et al., [2023](https://arxiv.org/html/2605.14005#bib.bib20)], and 15.0 for HumanEval [Chen et al., [2021](https://arxiv.org/html/2605.14005#bib.bib4)]. These values define the maximum allowable semantic-preservation loss ℒ_sem during the KL-bound veto.
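A minimal sketch of the KL-bound veto, assuming next-token distributions over a shared vocabulary are available as probability lists (the distributions and the tight bound used for illustration are hypothetical):

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def passes_kl_veto(p_clean, p_attacked, bound):
    """Keep a candidate suffix only if the target-distribution drift
    (the semantic-preservation loss) stays within the dataset bound."""
    return kl_divergence(p_clean, p_attacked) <= bound

p_clean = [0.7, 0.2, 0.1]      # clean next-token distribution
p_shifted = [0.6, 0.3, 0.1]    # mildly perturbed distribution
```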
#### General implementation protocol\.
All experiments are conducted using FP16 precision on NVIDIA H20 GPUs. All target LLMs and drafters are evaluated in their standard inference modes without task-specific fine-tuning. We use greedy decoding with temperature 0.0 to reduce sampling randomness and isolate the efficiency degradation caused by adversarial suffixes. The maximum generation length is capped at 512 tokens unless otherwise specified.
#### Metric computation\.
Speed-up and average accepted length τ are measured under the same decoding configuration for clean and attacked prompts. All reported metrics are averaged over multiple independent inference runs to reduce system-level variance. Clean and attacked settings use identical model weights, drafter configurations, decoding parameters, and evaluation prompts; the only difference is whether the optimized adversarial suffix is appended.
#### Filtering abnormal empty outputs\.
In rare cases, speculative decoding implementations may return abnormal empty outputs due to decoding or runtime edge cases unrelated to the attack objective. Such cases arise from the speculative decoding implementation itself and are not specific to Mistletoe or its optimization objective. We exclude such invalid runs from metric aggregation and apply the same rule to clean and attacked settings. This filtering only removes empty outputs and does not filter based on speed-up, accepted length, perplexity, repetition, or response quality.
## Appendix B Algorithm Pseudocode
Algorithm [1](https://arxiv.org/html/2605.14005#alg1) summarizes the optimization procedure of Mistletoe. The attack optimizes a discrete suffix δ while keeping the target model M_θ and drafter D_ϕ fixed. At each iteration, Mistletoe computes the rejection gradient and the semantic-preservation gradient in a continuous relaxation of the suffix, projects the rejection direction onto the local semantic null space, and uses the resulting direction to guide discrete suffix search. Since the projection is only a local first-order approximation, candidate suffixes are further evaluated by a forward pass and filtered by a KL-bound veto.
Algorithm 1 Mistletoe: Null-Space Guided Suffix Optimization

- **Require:** prompt x, target model M_θ, drafter D_ϕ, iterations T, suffix length m, candidate size K, evaluation batch size B, KL bound ϵ, rejection weight λ, damping coefficient ξ
- **Ensure:** adversarial suffix δ⋆
- 1: Initialize a discrete suffix δ ∈ 𝒱^m
- 2: **for** r = 1 **to** T **do**
- 3: x_δ ← x ⊕ δ
- 4: Run speculative decoding with D_ϕ and M_θ on x_δ to obtain drafter-proposed tokens {ŷ_i^(t)}_{(t,i)∈ℐ}
- 5: Estimate the rejection objective ℒ_rej(x, δ) = (1/|ℐ|) Σ_{(t,i)∈ℐ} −log π_θ(ŷ_i^(t) | x_δ, Y^(t), ŷ_{<i}^(t))
- 6: Estimate the semantic-preservation objective ℒ_sem(x, δ) = (1/|𝒮|) Σ_{t∈𝒮} D_KL(π_θ(· | x, Y^(t)) ‖ π_θ(· | x_δ, Y^(t)))
- 7: Let 𝐳 be a continuous relaxation of δ
- 8: g_rej ← ∇_𝐳 ℒ_rej(x, δ)
- 9: g_sem ← ∇_𝐳 ℒ_sem(x, δ)
- 10: Project the rejection gradient onto the local semantic null space: g_rej^𝒩 ← g_rej − (⟨g_rej, g_sem⟩ / (‖g_sem‖₂² + ξ)) g_sem
- 11: Construct the final scoring direction: g_final ← −g_sem + λ g_rej^𝒩
- 12: Use g_final to propose top-K token substitutions and form a candidate set 𝒞(δ)
- 13: δ⋆ ← δ, ℒ_best ← −∞
- 14: **for** b = 1 **to** B **do**
- 15: Sample a candidate suffix δ_b ∈ 𝒞(δ)
- 16: Recompute ℒ_rej(x, δ_b) and ℒ_sem(x, δ_b) by forward evaluation
- 17: **if** ℒ_sem(x, δ_b) ≤ ϵ **then**
- 18: **if** ℒ_rej(x, δ_b) > ℒ_best **then**
- 19: ℒ_best ← ℒ_rej(x, δ_b)
- 20: δ⋆ ← δ_b
- 21: **end if**
- 22: **end if**
- 23: **end for**
- 24: δ ← δ⋆
- 25: **end for**
- 26: **return** δ⋆
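The candidate-filtering loop (Algorithm 1, lines 14–23) can be sketched as follows. For determinism, this illustration scans all candidates rather than sampling a batch of B, and the loss functions are toy stand-ins for forward evaluations of ℒ_rej and ℒ_sem:

```python
def select_best_candidate(candidates, l_rej, l_sem, epsilon):
    """Among candidate suffixes passing the KL-bound veto
    (l_sem <= epsilon), keep the one with the highest rejection loss."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        if l_sem(cand) <= epsilon and l_rej(cand) > best_score:
            best, best_score = cand, l_rej(cand)
    return best

# Toy stand-ins: suffix "b" degrades acceptance most but drifts too far
# semantically, so the veto leaves "c" as the best admissible candidate.
rej = {"a": 1.0, "b": 5.0, "c": 3.0}
sem = {"a": 0.1, "b": 9.0, "c": 0.2}
best = select_best_candidate(["a", "b", "c"], rej.get, sem.get, epsilon=1.0)
```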
## Appendix C Limitations and Responsible Use
#### Scope of evaluation\.
Our study focuses on Vicuna\-7B and Vicuna\-13B with representative speculative decoding systems, including Medusa, Hydra, EAGLE, EAGLE\-2, and EAGLE\-3\. This setting allows us to systematically evaluate whether adversarial suffixes can degrade the draft\-verification pathway across multiple acceleration designs and generation tasks\. While the evaluated systems cover widely used speculative decoding paradigms, future work may extend the analysis to additional model families, larger backbones, and production\-serving configurations\.
#### Attack setting\.
Mistletoe primarily uses white-box gradients during suffix construction while keeping the target model and drafter fixed. This setting provides a controlled way to expose mechanism-level vulnerabilities in speculative decoding and to analyze how drafter–target mismatch can be amplified. Importantly, our transferability experiments show that suffixes optimized on one source method can still degrade other speculative decoding methods, suggesting that the attack is not tied to a single white-box configuration. A promising future direction is to further study fully black-box or query-limited variants, especially for closed-source deployments where direct gradient access is unavailable.
#### Output normality evaluation\.
The attack is designed to degrade speculative acceleration without relying on visibly abnormal outputs\. We evaluate output normality using PPL and Rep\-4, and provide qualitative examples to further inspect generated responses\. These measurements capture fluency and repetitive degeneration, but they do not exhaustively characterize all aspects of semantic equivalence or task correctness\. Future evaluations may incorporate task\-specific correctness metrics or human/LLM\-based judgments to provide a more fine\-grained assessment\.
#### Responsible use\.
This work aims to reveal a performance\-robustness risk in speculative decoding and to motivate more secure acceleration mechanisms\. Because the proposed attack could be misused to increase serving cost or reduce the efficiency of deployed LLM systems, it should be used for research, auditing, and defense development rather than for disrupting real\-world services\. Potential defenses include monitoring accepted\-length distributions, detecting abnormal drafter–target mismatch, and designing verification mechanisms that are more robust to adversarial prompt perturbations\.
## Appendix D LLM Usage
We used an OpenAI LLM (GPT-5) as a writing and formatting assistant. In particular, it helped refine grammar and phrasing, improve textual flow and clarity, and suggest edits to figure/table captions and layout (e.g., column alignment, caption length, placement). The LLM did not contribute to research ideation, experimental design, implementation, data analysis, or technical content beyond surface-level edits. All outputs were carefully reviewed and edited by the authors, who take full responsibility for the final text and visuals.