BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

Summary

BudgetDraft proposes a multi-view training method for speculative decoding that aligns a sparse-KV drafter with a full-KV verifier, achieving significant speedups for mid-to-long context inference.

arXiv:2606.00144v1 Announce Type: new Abstract: Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.

# BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding
Source: [https://arxiv.org/html/2606.00144](https://arxiv.org/html/2606.00144)
Liang He (1, ∗), Jingbo Wen (2), Qishi Zhan (3), Yixiong Chen (4), Kangning Cui (5), Qizhen Lan (6), Xilu Wang (7, ∗)

(1) Shanghai Institute of Optics and Fine Mechanics; (2) The University of Sydney; (3) Marquette University; (4) Johns Hopkins University; (5) Wake Forest University; (6) University of Texas Health Science Center at Houston; (7) University of Surrey

hel@siom.ac.cn, wangxilu@surrey.ac.uk (∗ corresponding authors)

###### Abstract

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K–16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to $6.55\times$, $4.46\times$, and $2.10\times$ end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.


## 1 Introduction

Mid-to-long context inference, where the context length ranges from 4K to 16K tokens, is increasingly common in real-world applications such as document summarization, multi-turn dialogue, and retrieval-augmented generation (Liu et al., [2025a](https://arxiv.org/html/2606.00144#bib.bib19); Liao et al., [2025](https://arxiv.org/html/2606.00144#bib.bib20)). Autoregressive decoding (AR) remains the dominant paradigm for large language model (LLM) inference, but its sequential token-by-token generation leads to high latency and cost (Chen et al., [2026](https://arxiv.org/html/2606.00144#bib.bib22)). Speculative decoding (SD) mitigates this bottleneck by using a small drafter to propose candidate tokens, which a larger verifier validates (Leviathan et al., [2023](https://arxiv.org/html/2606.00144#bib.bib6)). Its speedup, however, depends critically on the alignment between the drafter and the verifier. When the drafter aligns well with the verifier, multiple tokens can be accepted per verification step.

Maintaining this alignment is difficult for mid-to-long context inference in deployment (Xiao et al., [2025](https://arxiv.org/html/2606.00144#bib.bib21)). A central constraint is the KV cache (Zhou et al., [2024](https://arxiv.org/html/2606.00144#bib.bib25)). To control peak GPU memory (VRAM) and end-to-end latency, the drafter often runs with a sparse KV cache under a fixed KV budget, retaining only a subset of cached key-value pairs, whereas the verifier keeps a full KV cache to preserve output quality (Li et al., [2024a](https://arxiv.org/html/2606.00144#bib.bib24)). This sparse/full mismatch grows with context length and can sharply reduce the acceptance rate. It also makes speculative decoding highly sensitive to the KV budget, which is problematic in deployment, where memory budgets vary across devices and workloads (Cai et al., [2026](https://arxiv.org/html/2606.00144#bib.bib27)).

![Refer to caption](https://arxiv.org/html/2606.00144v1/x1.png)

Figure 1: Acceptance collapse of SD (sparse/full) on the GS dataset with $\gamma=5$. Acceptance decreases as context length increases and becomes near-zero in the 8K–16K range across KV settings $B\in\{256,512,1024,\mathrm{full}\}$, where the full setting corresponds to 2048 tokens in this experiment.

Figure [1](https://arxiv.org/html/2606.00144#S1.F1) highlights this failure mode. As context length increases, SD with sparse-drafter/full-verifier settings exhibits a rapid acceptance degradation, and at 16K the acceptance becomes near-zero while the end-to-end speedup vs AR falls to $\leq 1\times$. We term this the 16K collapse. This motivates our focus on the 8K–16K range, where sparse drafting is practically attractive but acceptance degradation becomes the dominant barrier to speedup. Figure [1](https://arxiv.org/html/2606.00144#S1.F1) also shows that acceptance responds to the KV budget in a non-monotonic way: a smaller KV budget can yield higher acceptance than a larger one for naive sparse drafting (see Appendix [D](https://arxiv.org/html/2606.00144#A4) for details).

Prior work either proposes stronger drafters under full-cache or short-context drafting (Li et al., [2024c](https://arxiv.org/html/2606.00144#bib.bib9), [2026b](https://arxiv.org/html/2606.00144#bib.bib14)), or uses sparse KV-cache and structural methods that control memory or relieve long-context bottlenecks at the cost of single-model objectives or extra inference-time stages (Li et al., [2024b](https://arxiv.org/html/2606.00144#bib.bib28); Xiao et al., [2024](https://arxiv.org/html/2606.00144#bib.bib13); Sun et al., [2024](https://arxiv.org/html/2606.00144#bib.bib8)). Across these lines, the drafter is tuned for a fixed setting and stays brittle when the KV budget changes at deployment (Gao et al., [2025](https://arxiv.org/html/2606.00144#bib.bib32)). We instead address the sparse/full mismatch at training time while keeping inference simple, and propose BudgetDraft for sparse drafting with a focus on budget robustness. Our key idea is multi-view sparse training: during training, the drafter is exposed to multiple randomly sampled KV budgets and learns to align with the same full-KV teacher target under each sparsity level. This produces a budget-robust drafter that maintains stable acceptance across KV budgets at deployment under varying memory constraints. Our contributions are:

- We characterize the sparse/full acceptance rate collapse from 4K to 16K context length, establishing 16K as a practical failure boundary for naive sparse speculative decoding, and further reveal a non-monotonic budget effect at 4K where a smaller KV budget can yield higher acceptance.
- We propose BudgetDraft, which combines acceptance-aware alignment with multi-budget sparse-view training to produce a budget-invariant drafter that recovers acceptance rates across all sparsity levels.
- We demonstrate that BudgetDraft achieves up to $6.55\times$ end-to-end speedup over AR at 4K and $4.46\times$ at 8K on a single A100 GPU, and achieves up to $2.10\times$ speedup at 16K.

## 2 Related Work

#### Speculative Decoding.

Speculative decoding accelerates AR inference by using a small drafter to propose tokens that a verifier checks in one forward pass (Leviathan et al., [2023](https://arxiv.org/html/2606.00144#bib.bib6)). Recent work improves draft quality with stronger drafting mechanisms (Li et al., [2024c](https://arxiv.org/html/2606.00144#bib.bib9)). EAGLE and its variants train lightweight draft heads from the target model’s hidden states, achieving strong speedups in short contexts. EAGLE-3 adds multi-layer feature fusion and test-time simulation (Li et al., [2026a](https://arxiv.org/html/2606.00144#bib.bib11)). However, these methods are typically evaluated at 2K context with full-cache drafting (Liu et al., [2025b](https://arxiv.org/html/2606.00144#bib.bib33)), and do not address the sparse/full mismatch when a sparse KV cache constrains the drafter in longer contexts (Yang et al., [2025](https://arxiv.org/html/2606.00144#bib.bib26)).

#### Sparse KV Cache for Mid-to-Long Context Inference.

Many methods reduce KV cache memory by sparsification for mid-to-long inference (Zhang et al., [2023](https://arxiv.org/html/2606.00144#bib.bib15); Li et al., [2024a](https://arxiv.org/html/2606.00144#bib.bib24)). StreamingLLM retains attention sinks and recent tokens to support long-context generation with a bounded cache (Xiao et al., [2024](https://arxiv.org/html/2606.00144#bib.bib13)). H2O uses attention-score-based eviction to keep a budgeted subset of KV states (Zhang et al., [2023](https://arxiv.org/html/2606.00144#bib.bib15)). A common design pattern in this line of work is budgeted eviction with token- or chunk-level selection under a fixed KV budget (Zhang et al., [2023](https://arxiv.org/html/2606.00144#bib.bib15); Li et al., [2024b](https://arxiv.org/html/2606.00144#bib.bib28); Feng et al., [2026](https://arxiv.org/html/2606.00144#bib.bib29)). In practical deployment, however, the KV budget is not fixed: it can vary with GPU memory availability, workload concurrency, and latency constraints (Gao et al., [2025](https://arxiv.org/html/2606.00144#bib.bib32)). A drafter tuned for a single budget can therefore be brittle when deployed under different budgets. This motivates budget-robust training for sparse drafting, where the drafter can remain stable across multiple sparse views.

#### Structural Mitigation of Sparse/Full Mismatch.

TriForce introduces a hierarchical pipeline with an intermediate retrieval-cache stage to mitigate long-context bottlenecks and improve SD under constrained drafting (Sun et al., [2024](https://arxiv.org/html/2606.00144#bib.bib8)). This structural approach can be effective but increases inference-time complexity by adding components beyond the standard drafter–verifier pipeline (Sadhukhan et al., [2025](https://arxiv.org/html/2606.00144#bib.bib23)). In contrast, BudgetDraft is training-based and keeps inference simple: it trains a single drafter to remain stable across KV budgets, without introducing extra intermediate models or stages at inference (Yang et al., [2025](https://arxiv.org/html/2606.00144#bib.bib26)).

![Refer to caption](https://arxiv.org/html/2606.00144v1/x2.png)

Figure 2: Overview of BudgetDraft. The verifier (teacher) produces greedy targets using a full KV cache. The drafter (student) prefills the prefix once to build a prefix KV cache, then performs two continuation forwards: a full-cache view for $\mathcal{L}_A$ and a sparse-cache view for $\mathcal{L}_C$ with a sampled KV budget $B$. Both views are supervised by the same targets, encouraging budget-robust drafting.

#### Knowledge Distillation for Speculative Decoding.

Beyond using off-the-shelf drafters or inference-time decoding and cache optimizations, recent work has explored improving draft models through additional training or distillation (Hu et al., [2026](https://arxiv.org/html/2606.00144#bib.bib30)). Prior work studies online updates during inference, or trains task-specific drafters with distillation-style objectives (Liu et al., [2023](https://arxiv.org/html/2606.00144#bib.bib17)). Top-1 or top-$k$ distillation is a common choice for acceleration and alignment, since it focuses supervision on the verifier-preferred tokens rather than matching the full distribution (Lin et al., [2025](https://arxiv.org/html/2606.00144#bib.bib31)). However, these approaches do not directly target the distribution shift induced by sparse KV caches, nor do they aim to keep acceptance stable across KV budgets (Sadhukhan et al., [2025](https://arxiv.org/html/2606.00144#bib.bib23)). BudgetDraft uses acceptance-aware top-1 supervision together with multi-view sparse training over sampled KV budgets, making the drafter robust to sparse-cache conditions and budget variation in deployment.

## 3 Method

### 3.1 Problem Formulation

#### Notation.

Let $\mathcal{V}$ denote the vocabulary. Given a token sequence $\mathbf{x}=(x_1,\dots,x_T)$, we write $\mathbf{x}_{<t}=(x_1,\dots,x_{t-1})$ for the prefix up to position $t$. We split $\mathbf{x}$ into a prefix of length $P$ and a continuation of length $C$, so that $T=P+C$. We supervise continuation positions $t\in\{P+1,\dots,T\}$.

#### Verifier (Teacher).

The verifier is a large frozen model $M_L$ with parameters $\theta_L$. It conditions on the full KV cache $\mathcal{K}_L^{\text{full}}$ computed from $\mathbf{x}_{<t}$ and produces the next-token distribution

$$p_L(\cdot \mid \mathbf{x}_{<t};\, \mathcal{K}_L^{\text{full}}). \tag{1}$$

#### Drafter (Student).

The drafter is a small trainable model $M_S$ with parameters $\theta_S$. At inference, it operates with a sparse KV cache $\mathcal{K}_S^{\text{sp}}$ under a KV budget $B$:

$$p_S(\cdot \mid \mathbf{x}_{<t};\, \mathcal{K}_S^{\text{sp}}(B)). \tag{2}$$

Here $\mathcal{K}_S^{\text{sp}}(B)$ is obtained by selecting a budgeted subset of the drafter prefix cache.

#### Acceptance Criterion.

Under greedy SD, a draft token at position $t$ is accepted if the drafter and verifier agree on the top-1 token:

$$
\begin{aligned}
\hat{x}_t &= \arg\max_{v} p_S\!\left(v \mid \mathbf{x}_{<t};\, \mathcal{K}_S^{\text{sp}}(B)\right),\\
x_t^{\star} &= \arg\max_{v} p_L\!\left(v \mid \mathbf{x}_{<t};\, \mathcal{K}_L^{\text{full}}\right),\\
\text{accept} &\iff \hat{x}_t = x_t^{\star}.
\end{aligned}
\tag{3}
$$

The acceptance rate $\alpha$ is the fraction of accepted draft tokens. Our goal is to increase $\alpha$ and keep it stable across KV budgets $B$ at inference.
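As an illustration of the acceptance rule in Eq. (3), the following minimal Python sketch counts how many drafted tokens survive verification in each step. The helper names `accepted_prefix_length` and `acceptance_rate` and the token IDs are hypothetical. Computing $\alpha$ as accepted tokens divided by all drafted tokens, with only the matched leading prefix of each draft counted as accepted, is an assumption about how the fraction is aggregated rather than a detail stated in this section.

```python
from typing import Sequence, Tuple


def accepted_prefix_length(draft: Sequence[int], verifier: Sequence[int]) -> int:
    """Number of leading draft tokens that equal the verifier's greedy tokens."""
    k = 0
    for d, v in zip(draft, verifier):
        if d != v:
            break  # everything after the first mismatch is discarded
        k += 1
    return k


def acceptance_rate(steps: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Accepted draft tokens divided by drafted tokens over all verification steps."""
    drafted = sum(len(d) for d, _ in steps)
    accepted = sum(accepted_prefix_length(d, v) for d, v in steps)
    return accepted / drafted if drafted else 0.0


# Two illustrative verification steps with draft length 5 (made-up token IDs).
steps = [
    ([5, 9, 2, 7, 1], [5, 9, 2, 7, 1]),  # all 5 draft tokens accepted
    ([4, 4, 8, 3, 0], [4, 4, 6, 3, 0]),  # 2 accepted, mismatch at the third token
]
print(acceptance_rate(steps))  # 0.7
```

In the second step the last two draft tokens match the verifier but are still not counted, because greedy verification stops at the first disagreement.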

### 3.2 Overview

In deployment, the drafter KV budget is not fixed, as it changes with available VRAM, workload concurrency, and latency targets. A drafter tuned for a single budget can be brittle when the budget changes, leading to unstable acceptance and unreliable speedup. This motivates training the drafter under multiple sparse views induced by different KV budgets, so that acceptance remains stable when the budget changes at inference.

As shown in Figure [2](https://arxiv.org/html/2606.00144#S2.F2), BudgetDraft trains the drafter $M_S$ while keeping the verifier $M_L$ frozen, using greedy teacher targets produced under the verifier’s full KV cache. The training objective combines two complementary losses:

$$\mathcal{L}=\mathcal{L}_A+\lambda\,\mathcal{L}_C. \tag{4}$$

$\mathcal{L}_A$ aligns the drafter with the verifier under the full prefix cache and directly targets the greedy acceptance rule, while $\mathcal{L}_C$ performs multi-view sparse training by sampling KV budgets during training, which improves robustness to budget changes at inference.

### 3.3 Teacher Targets and Acceptance-Aware Alignment

For each continuation position $t\in\{P+1,\dots,T-1\}$, the verifier produces a greedy teacher target

$$x_t^{\star}=\arg\max_{v\in\mathcal{V}} p_L\!\left(v \mid \mathbf{x}_{<t};\, \mathcal{K}_L^{\text{full}}\right). \tag{5}$$

We train the drafter to predict $x_t^{\star}$ under the full prefix cache using a top-1 cross-entropy loss:

$$\mathcal{L}_A=-\sum_{t}\log p_S\!\left(x_t^{\star} \mid \mathbf{x}_{<t};\, \mathcal{K}_S^{\text{full}}\right). \tag{6}$$

It matches the greedy acceptance criterion in Eq. ([3](https://arxiv.org/html/2606.00144#S3.E3)) as it encourages the drafter to place its highest probability on the verifier’s greedy token. Unlike full-distribution distillation, top-1 supervision aligns directly with the accept/reject mechanism and reduces unnecessary distribution matching.
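To make Eqs. (5) and (6) concrete, here is a small pure-Python sketch of the top-1 cross-entropy: the target at each position is the argmax of the teacher logits, and the loss sums the negative log-probability the student assigns to that token. The function names `log_softmax` and `top1_ce` and the toy logits are hypothetical, and real training would use a tensor library with automatic differentiation rather than Python lists.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def top1_ce(student_logits: List[List[float]], teacher_logits: List[List[float]]) -> float:
    """Sum over positions of -log p_S(x_t*), with x_t* the teacher's greedy token."""
    loss = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        target = max(range(len(t_row)), key=t_row.__getitem__)  # greedy teacher token
        loss -= log_softmax(s_row)[target]
    return loss


# Toy example: 2 continuation positions, vocabulary of 3 tokens.
teacher = [[0.1, 2.0, -1.0], [3.0, 0.0, 0.5]]   # greedy targets: token 1, then token 0
agrees = [[0.0, 5.0, 0.0], [5.0, 0.0, 0.0]]     # student top-1 matches the teacher
disagrees = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]  # student top-1 differs from the teacher
print(top1_ce(agrees, teacher) < top1_ce(disagrees, teacher))  # True
```

Only the teacher's argmax enters the loss, so two teachers with different distributions but the same greedy token produce the same training signal, which is the sense in which the objective tracks the accept/reject rule.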

Algorithm 1: BudgetDraft Training Procedure

Require: Verifier $M_L$ (frozen), drafter $M_S$ (trainable)

Require: Sequence $\mathbf{x}=(x_1,\dots,x_T)$, prefix length $P$, continuation length $C=T-P$

Require: Budget set $B$, weights $\mathbf{w}$, chunk size $s$, loss weight $\lambda$

1. Step 1: Teacher targets (no gradient)
2. $\mathcal{K}_L^{\text{full}} \leftarrow \text{ChunkedPrefill}(M_L, \mathbf{x}_{1:P})$
3. $\mathbf{logits}_L \leftarrow M_L(\mathbf{x}_{P+1:T} \mid \mathcal{K}_L^{\text{full}})$
4. $x_t^{\star} \leftarrow \arg\max \mathbf{logits}_L[\,t\,], \quad t=P+1,\dots,T-1$ ▷ aligned to continuation positions
5. Step 2: Drafter prefix prefill (no gradient)
6. $\mathcal{K}_S^{\text{full}} \leftarrow \text{ChunkedPrefill}(M_S, \mathbf{x}_{1:P})$
7. Step 3: Full-cache branch ($\mathcal{L}_A$)
8. $\mathbf{logits}_A \leftarrow M_S(\mathbf{x}_{P+1:T} \mid \text{Clone}(\mathcal{K}_S^{\text{full}}))$
9. $\mathcal{L}_A \leftarrow \text{CE}(\mathbf{logits}_A, \{x_t^{\star}\})$
10. Step 4: Sparse-cache branch ($\mathcal{L}_C$)
11. $B \sim \text{Cat}(B, \mathbf{w})$
12. $\mathcal{K}_S^{\text{sp}} \leftarrow \text{TopKChunks}(\mathcal{K}_S^{\text{full}}, B, s)$
13. $\mathbf{pos} \leftarrow (P, P+1, \dots, T-1)$ ▷ real position IDs
14. $\mathbf{logits}_C \leftarrow M_S(\mathbf{x}_{P+1:T} \mid \mathcal{K}_S^{\text{sp}}, \mathbf{pos})$
15. $\mathcal{L}_C \leftarrow \text{CE}(\mathbf{logits}_C, \{x_t^{\star}\})$
16. Step 5: Update
17. $\mathcal{L} \leftarrow \mathcal{L}_A + \lambda\,\mathcal{L}_C$
18. $\theta_S \leftarrow \theta_S - \eta\,\nabla_{\theta_S}\mathcal{L}$

### 3.4 Multi-View Sparse Training

Training only with $\mathcal{L}_A$ can produce a drafter that works well under the full prefix cache but degrades when deployed with a sparse KV cache. We address this gap with multi-view sparse training: at each step we sample a KV budget and train the drafter under the corresponding sparse cache, using the same greedy teacher targets $x_t^{\star}$ from Eq. ([5](https://arxiv.org/html/2606.00144#S3.E5)).

Concretely, we draw $B\sim\text{Cat}(B,\mathbf{w})$ with $B=\{256,512,1024,2048\}$ and $\mathbf{w}=\{0.4,0.3,0.2,0.1\}$. We construct $\mathcal{K}_S^{\text{sp}}(B)$ by partitioning the prefix cache (length $P$) into chunks of size $s=8$, scoring each chunk by cumulative attention weight, and retaining the top-$\lfloor B/s\rfloor$ chunks. The loss is then

$$\mathcal{L}_C=\mathbb{E}_{B}\Bigl[-\sum_{t}\log p_S\!\left(x_t^{\star} \mid \mathbf{x}_{<t};\, \mathcal{K}_S^{\text{sp}}(B)\right)\Bigr]. \tag{7}$$

Sampling different budgets over training exposes the drafter to diverse sparse views of the same prefix, which reduces overfitting to a single budget and improves budget robustness at inference. After sparsification, cache length no longer matches absolute positions; we therefore pass explicit position IDs $(P,P+1,\dots,T-1)$ during the continuation forward to keep RoPE consistent with the verifier.
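The budget sampling, chunk selection, and position-ID steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: `token_scores` stands in for the per-position cumulative attention weight (how those weights are aggregated over heads, layers, and queries is not specified in this section), ties between chunks are broken by original order, and all function names are hypothetical.

```python
import random
from typing import List

BUDGETS = [256, 512, 1024, 2048]   # budget set from the text
WEIGHTS = [0.4, 0.3, 0.2, 0.1]     # sampling weights from the text
CHUNK_SIZE = 8                     # chunk size s from the text


def sample_budget(rng: random.Random) -> int:
    """Draw one KV budget from the categorical distribution Cat(BUDGETS, WEIGHTS)."""
    return rng.choices(BUDGETS, weights=WEIGHTS, k=1)[0]


def top_k_chunks(token_scores: List[float], budget: int,
                 chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return the sorted prefix positions kept when retaining floor(budget/chunk_size) chunks."""
    n = len(token_scores)
    chunks = [range(start, min(start + chunk_size, n)) for start in range(0, n, chunk_size)]
    scores = [sum(token_scores[i] for i in chunk) for chunk in chunks]
    k = min(budget // chunk_size, len(chunks))
    best = sorted(range(len(chunks)), key=lambda c: scores[c], reverse=True)[:k]
    # Re-sort the selected chunks so the kept positions stay in original order.
    return [i for c in sorted(best) for i in chunks[c]]


def continuation_position_ids(prefix_len: int, cont_len: int) -> List[int]:
    """Absolute position IDs (P, P+1, ..., T-1) for the continuation tokens."""
    return list(range(prefix_len, prefix_len + cont_len))


rng = random.Random(0)
P, C = 64, 4                                   # tiny prefix/continuation for illustration
scores = [rng.random() for _ in range(P)]      # placeholder attention scores
kept = top_k_chunks(scores, budget=16)         # keeps 2 chunks of 8 positions
print(len(kept), continuation_position_ids(P, C))  # 16 [64, 65, 66, 67]
```

The position IDs depend only on the prefix length, not on how many cache entries survive, which is why the continuation tokens keep the same rotary positions under every sampled budget.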

Algorithm 2: BudgetDraft Inference (Greedy Speculative Decoding)

Require: Trained drafter $M_S$, verifier $M_L$

Require: Prompt $\mathbf{x}_{1:P}$, KV budget $B$, draft length $\gamma$

1. Prefill:
2. $\mathcal{K}_L^{\text{full}} \leftarrow \text{Prefill}(M_L, \mathbf{x}_{1:P})$ ▷ full KV
3. $\mathcal{K}_S^{\text{full}} \leftarrow \text{Prefill}(M_S, \mathbf{x}_{1:P})$
4. $\mathcal{K}_S^{\text{sp}}(B) \leftarrow \text{TopKChunks}(\mathcal{K}_S^{\text{full}}, B)$ ▷ sparsify
5. repeat
6. Draft: generate $\hat{\mathbf{x}}=(\hat{x}_1,\dots,\hat{x}_\gamma)$ from $M_S$ using $\mathcal{K}_S^{\text{sp}}(B)$
7. Verify: run $M_L$ on $\hat{\mathbf{x}}$ using $\mathcal{K}_L^{\text{full}}$ and obtain verifier tokens $\mathbf{x}^{\star}=(x_1^{\star},\dots,x_\gamma^{\star})$
8. Let $k$ be the largest index such that $\hat{x}_i=x_i^{\star}$ for all $i\leq k$ ▷ Eq. ([3](https://arxiv.org/html/2606.00144#S3.E3))
9. Append $(\hat{x}_1,\dots,\hat{x}_k)$ to the output
10. Append one verifier token $x_{k+1}^{\star}$ to the output ▷ bonus token
11. Update $\mathcal{K}_L^{\text{full}}$ and $\mathcal{K}_S^{\text{sp}}(B)$ with the appended tokens
12. until max tokens reached or EOS

### 3.5 Training and Inference

#### BudgetDraft Training Procedure.

Algorithm [1](https://arxiv.org/html/2606.00144#alg1) summarizes one training step. The verifier produces greedy teacher targets for the continuation positions. The drafter prefills the prefix once without gradient, then reuses the prefix cache for two continuation forwards, i.e., a full-cache branch for $\mathcal{L}_A$ and a sparse-cache branch for $\mathcal{L}_C$ with a sampled KV budget $B$. We back-propagate the combined loss through both branches and update the drafter parameters.

#### Hyperparameters.

We train for 5,000 steps on PG-19 (streaming) with sequence length $T=16{,}384$ ($P=16{,}128$, $C=256$). We use AdamW with learning rate $10^{-5}$, weight decay $0.01$, linear warmup for 150 steps followed by cosine decay, gradient clipping at $1.0$, and batch size 1. We use $B=\{256,512,1024,2048\}$ with weights $\mathbf{w}=\{0.4,0.3,0.2,0.1\}$ and chunk size $s=8$ for sparse cache construction. Training runs on a single NVIDIA A100 80GB GPU and takes approximately 5 hours.
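The learning-rate schedule named above (linear warmup for 150 steps, then cosine decay over the remaining steps) can be sketched as a plain function of the step index. The peak rate, warmup length, and total steps come from the text. The final learning rate of zero (`min_lr`) and the exact warmup indexing are assumptions, since the section does not state them, and the function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step: int, peak_lr: float = 1e-5, warmup_steps: int = 150,
                  total_steps: int = 5000, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay towards min_lr (assumed floor)."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 149, 150, 2575, 5000):
    print(step, learning_rate(step))
```

Step 2575 is the midpoint of the decay phase, where the cosine term puts the rate at roughly half of its peak.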

#### BudgetDraft Inference.

BudgetDraft follows the standard SD pipeline, summarized in Algorithm [2](https://arxiv.org/html/2606.00144#alg2). The trained drafter generates $\gamma$ candidate tokens autoregressively using a sparse KV cache with a chosen budget $B$. The verifier checks all $\gamma$ candidates in a single forward pass using its full KV cache. Accepted tokens are appended to the output; if the first $k$ candidates are accepted and the candidate at position $k+1$ is rejected, the verifier’s token $x_{k+1}^{\star}$ is used in its place, matching the indexing in Algorithm [2](https://arxiv.org/html/2606.00144#alg2). This inference pipeline is simple: it uses only a sparse drafter and a full verifier, and the same trained drafter can be deployed under different KV budgets as deployment constraints change.
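The draft, verify, and accept loop of Algorithm 2 can be sketched with toy next-token functions standing in for the two models. This is a minimal sketch, not the authors' implementation: the callables `toy_drafter` and `toy_verifier` are made up, KV caches and sparsification are not modelled (the drafter callable stands in for $M_S$ running on its sparse cache), and the verifier is queried in a Python loop where a real system would use one batched forward pass.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token context to its greedy next token


def speculative_decode(prompt: List[int], drafter: NextToken, verifier: NextToken,
                       gamma: int, max_new_tokens: int, eos: int = -1) -> List[int]:
    """Greedy speculative decoding: accept the matching draft prefix plus one verifier token."""
    out = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(out) < limit:
        # Draft: gamma greedy tokens from the drafter.
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(drafter(out + draft))
        # Verify: verifier's greedy token at each draft position, plus one extra position.
        targets = [verifier(out + draft[:i]) for i in range(gamma + 1)]
        # Accept the longest prefix on which drafter and verifier agree.
        k = 0
        while k < gamma and draft[k] == targets[k]:
            k += 1
        # Append accepted tokens, then the verifier's token at position k+1.
        for tok in draft[:k] + [targets[k]]:
            out.append(tok)
            if tok == eos or len(out) >= limit:
                return out
    return out


def toy_verifier(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 17


def toy_drafter(ctx: List[int]) -> int:
    # Agrees with the verifier except when the context length is a multiple of 4.
    guess = toy_verifier(ctx)
    return (guess + 1) % 17 if len(ctx) % 4 == 0 else guess


prompt = [3, 1, 4]
sd_out = speculative_decode(prompt, toy_drafter, toy_verifier, gamma=5, max_new_tokens=20)
ar_out = list(prompt)
for _ in range(20):
    ar_out.append(toy_verifier(ar_out))
print(sd_out == ar_out)  # True
```

Because every appended token is either a draft token the verifier agreed with or the verifier's own token, the output equals the verifier's plain greedy decoding; the drafter only affects how many verifier passes are needed.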

## 4 Experimental Setup

#### Models.

We use YaRN-Llama-2-7B-128K (Peng et al., [2024](https://arxiv.org/html/2606.00144#bib.bib12); Touvron et al., [2023](https://arxiv.org/html/2606.00144#bib.bib5)) as the verifier (6.7B parameters, fp16) and llama-68m ([https://huggingface.co/JackFram/llama-68m](https://huggingface.co/JackFram/llama-68m)) as the drafter (68M parameters, fp32).

#### Datasets.

We evaluate on three datasets spanning different text domains. GS uses the PG-19 test split (Rae et al., [2020](https://arxiv.org/html/2606.00144#bib.bib1)) and contains long-form book text. LongBench uses QMSum (Bai et al., [2023](https://arxiv.org/html/2606.00144#bib.bib3)) and focuses on meeting transcript summarization. LWM uses NarrativeQA (Kociský et al., [2017](https://arxiv.org/html/2606.00144#bib.bib2)) and tests question answering over long narratives.

#### Context Lengths.

We evaluate three context lengths: 4K (prompt length 3800), 8K (prompt length 8192), and 16K (prompt length 16384). All experiments generate 256 tokens.

#### Budgets and Draft Length.

The verifier uses a full KV cache. The drafter uses a sparse KV cache with budget $B\in\{256,512,1024,2048\}$ and has a native maximum position embedding of 2048 tokens. We sweep the draft length $\gamma$ and report the best result over the sweep in Table [1](https://arxiv.org/html/2606.00144#S4.T1), while some comparisons use a fixed $\gamma$ (e.g., $\gamma=5$).

#### Baselines.

We compare against AR (standard autoregressive decoding), SD (sparse/full) (speculative decoding with an untrained drafter using a sparse KV cache), TriForce (hierarchical speculative decoding with a retrieval cache), and EAGLE-3 (speculative decoding with a trained draft head). Implementation details for TriForce and EAGLE-3 follow Section [5.2](https://arxiv.org/html/2606.00144#S5.SS2).

#### Metrics and Hardware.

We report acceptance rate (top-1 match under greedy decoding), end-to-end speedup vs AR (including prefill and decode time), and peak VRAM usage (see Appendix [A](https://arxiv.org/html/2606.00144#A1)). All results are averaged over 5 repeated runs under the same setting (with a warmup run discarded). We additionally analyze per-sample variance across draft lengths in Appendix [B](https://arxiv.org/html/2606.00144#A2). All experiments run on a single NVIDIA A100 80GB GPU.

Table 1: We report throughput in tok/s for AR, and speedup vs AR / acceptance rate (%) for SD (sparse/full) and BudgetDraft. For each KV budget, Panel (A) reports the run with the best BudgetDraft speedup, and Panel (B) the run with the best BudgetDraft acceptance; both show the corresponding SD (sparse/full) run.
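As a hedged illustration of the end-to-end speedup metric, the sketch below discards a warmup run, averages the remaining wall-clock times (each assumed to cover prefill plus decode), and divides the AR mean by the SD mean. Whether the reported numbers are a ratio of mean latencies or a mean of per-run ratios is not stated here, so the ratio-of-means choice is an assumption; the function names and all timing values are made up for illustration and are not results from the paper.

```python
from statistics import mean
from typing import Sequence


def mean_latency(run_seconds: Sequence[float], warmup_runs: int = 1) -> float:
    """Mean wall-clock time per run after discarding the leading warmup runs."""
    timed = run_seconds[warmup_runs:]
    if not timed:
        raise ValueError("need at least one timed run after warmup")
    return mean(timed)


def end_to_end_speedup(ar_runs: Sequence[float], sd_runs: Sequence[float],
                       warmup_runs: int = 1) -> float:
    """Ratio of mean AR latency to mean SD latency (ratio of means is an assumption)."""
    return mean_latency(ar_runs, warmup_runs) / mean_latency(sd_runs, warmup_runs)


# Illustrative timings in seconds: one warmup run followed by five timed runs.
ar_runs = [12.0, 10.0, 10.0, 10.0, 10.0, 10.0]
sd_runs = [6.0, 5.0, 5.0, 5.0, 5.0, 5.0]
print(end_to_end_speedup(ar_runs, sd_runs))  # 2.0
```

Including prefill in both numerators and denominators matters at long contexts, where prefill is a larger share of total time and caps the achievable end-to-end speedup.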

## 5 Experimental Results

### 5.1 Budget-Robust SD across Context Lengths

Since the drafter budget varies with VRAM and latency, a practical method should avoid relying on a single tuned budget. This section tests whether a single trained drafter remains effective when the KV budget changes at deployment, and evaluates the effectiveness of BudgetDraft across context lengths and KV budgets.

Table [1](https://arxiv.org/html/2606.00144#S4.T1) summarizes results across three datasets, four KV budgets, and three context lengths. We sweep the draft length $\gamma$. Panel (A) reports, for each budget, the run with the best BudgetDraft speedup and shows the corresponding SD (sparse/full) run under the same decoding configuration. Panel (B) reports, for each budget, the run with the best BudgetDraft acceptance and shows the corresponding SD (sparse/full) run.

BudgetDraft is effective across context lengths and far less sensitive to the KV budget than SD (sparse/full). At 4K, it achieves high acceptance and strong speedup across all budgets and datasets, whereas SD (sparse/full) varies sharply with the budget. At 8K, SD (sparse/full) enters a failure regime with near-zero acceptance across budgets on GS and LWM, whereas BudgetDraft consistently recovers high acceptance on both datasets across different budgets; LongBench remains more challenging and shows lower acceptance. At 16K, SD (sparse/full) remains ineffective on GS and LongBench, while BudgetDraft still achieves non-trivial acceptance and speedup, with the strongest results on LWM. BudgetDraft thus reduces the need for budget-specific tuning and better supports budget-robust SD in deployment. Moreover, it achieves these gains at a peak VRAM comparable to SD (sparse/full), as reported in Appendix [A](https://arxiv.org/html/2606.00144#A1).

### 5.2 Comparison with TriForce and EAGLE-3

We focus on the mid-to-long context regime (8K–16K) and compare BudgetDraft with two representative baselines, TriForce and EAGLE-3, on the LWM dataset. For a controlled comparison, all methods use the same decoding setting with draft length $\gamma=5$. For TriForce, we follow its framework but adapt the pipeline to operate at 8K and 16K contexts, rather than using its original long-context configuration. This evaluates TriForce-style structural mitigation under the same context lengths as our main setting. For EAGLE-3, we implement the method described in its paper and train the draft head using 8K and 16K data, and then evaluate it under the same decoding protocol.

Table [2](https://arxiv.org/html/2606.00144#S5.T2) reports speedup results achieved by each algorithm compared with AR. At 8K, BudgetDraft reaches $2.54\times$ across all KV budgets, well above TriForce ($1.21\times$) and EAGLE-3 ($1.64\times$). At 16K, BudgetDraft remains higher across budgets, achieving $1.94\times$ at $B=256$, and $1.89\times$, $1.77\times$, and $1.52\times$ for $B\in\{512,1024,2048\}$, compared to $1.19\times$ for TriForce and $1.36\times$ for EAGLE-3. Overall, BudgetDraft maintains a clear advantage at 8K–16K while supporting flexible KV-budget choices at deployment.

Table 2:Comparison with baselines on LWM at 8K and 16K withγ=5\\gamma\{=\}5\.
### 5\.3Ablation: Effect ofℒC\\mathcal\{L\}\_\{C\}

Table [3](https://arxiv.org/html/2606.00144#S5.T3) isolates the contribution of $\mathcal{L}_C$ by comparing $\mathcal{L}_A$ with $\mathcal{L}_A+0.5\,\mathcal{L}_C$ under the same KV budgets, context lengths, and a fixed decoding setting ($\gamma=5$). Across datasets, adding $\mathcal{L}_C$ improves acceptance and typically increases speedup, with the most consistent gains at 4K and 8K. At 4K, $\mathcal{L}_A+0.5\,\mathcal{L}_C$ substantially increases acceptance and yields higher speedup across all budgets on all three datasets. At 8K, the improvement is smaller but remains consistent: speedup increases by a modest margin across budgets on GS and LWM, and LongBench also improves despite being the hardest dataset in this regime. At 16K, gains persist on GS and LongBench (acceptance and speedup both increase slightly across budgets), while LWM shows a mixed pattern where larger budgets can reduce speedup, suggesting budget-dependent trade-offs in the longest setting. To further isolate the role of multi-budget sampling, we compare it with a single-budget sparse-branch variant in Appendix [E](https://arxiv.org/html/2606.00144#A5). The results show that single-budget training becomes unstable under budget shifts, confirming the benefit of multi-budget sampling.
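
To make the compared objectives concrete, the following minimal Python sketch shows how the $\mathcal{L}_A$-only ablation, the default $\mathcal{L}_A+0.5\,\mathcal{L}_C$ objective, and the single-budget variant could differ within one training step. It is an illustration under stated assumptions, not the training code: the helper names `sample_budgets` and `total_loss` are hypothetical, the per-view losses are taken as precomputed numbers, and averaging them into $\mathcal{L}_C$ is an assumption of the sketch.

```python
import random

# KV budgets named in the evaluation and in the multi-budget sampling set.
BUDGETS = (256, 512, 1024, 2048)


def sample_budgets(num_views, rng, fixed_budget=None):
    """Return the KV budget for each sparse view in one training step.

    fixed_budget=None -> multi-budget sampling (one random budget per view).
    fixed_budget=1024 -> single-budget variant (every view uses B=1024).
    """
    if fixed_budget is not None:
        return [fixed_budget] * num_views
    return [rng.choice(BUDGETS) for _ in range(num_views)]


def total_loss(loss_a, sparse_view_losses, lam=0.5):
    """Combine the full-cache loss L_A with the sparse-view loss L_C.

    Assumption of this sketch: L_C is the mean of the per-view losses.
    lam=0.0 recovers the L_A-only ablation; lam=0.5 is the default weight.
    """
    loss_c = sum(sparse_view_losses) / len(sparse_view_losses)
    return loss_a + lam * loss_c


rng = random.Random(0)
multi = sample_budgets(num_views=4, rng=rng)            # varies per view
single = sample_budgets(num_views=4, rng=rng, fixed_budget=1024)
```

Setting `lam=0.0` corresponds to the $\mathcal{L}_A$ row of the ablation, and `fixed_budget=1024` corresponds to the single-budget variant studied in Appendix E.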

Table 3: Ablation on multi-view sparse training at $\gamma=5$. We report throughput in tok/s for AR, and speedup vs AR / acceptance rate (%) for $\mathcal{L}_A$ and $\mathcal{L}_A+0.5\,\mathcal{L}_C$, under the same $\gamma$ and KV budget.

### 5.4 Sensitivity Analysis

#### $\lambda$ Sensitivity.

We compare $\mathcal{L}_A+0.5\,\mathcal{L}_C$ ($\lambda=0.5$) with $\mathcal{L}_A+\mathcal{L}_C$ ($\lambda=1.0$). The two settings achieve very similar acceptance and speedup at 4K and 8K across datasets and KV budgets. At 16K, differences remain small and are sometimes budget-dependent. We use $\lambda=0.5$ as the default. Detailed results and analysis are provided in Appendix [C](https://arxiv.org/html/2606.00144#A3).

#### Draft Length Sensitivity\.

Figure [3](https://arxiv.org/html/2606.00144#S5.F3) summarizes how acceptance and speedup change with the draft length $\gamma$ after averaging over KV budgets. At 4K, acceptance remains high across datasets for a wide range of $\gamma$, while speedup generally increases with $\gamma$ before saturating or mildly fluctuating. At 8K, GS and LWM still maintain strong acceptance over a broad range of $\gamma$, leading to clear speedup peaks at moderate $\gamma$, while LongBench shows lower acceptance and a weaker speedup peak. At 16K, acceptance drops and becomes more sensitive to $\gamma$, narrowing the range where speculative decoding is efficient; the best speedup shifts to shorter or moderate drafts depending on the dataset. These trends suggest that $\gamma$ should be tuned with context length, while the stable curves under budget averaging reflect the robustness goal of BudgetDraft in deployment.
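
The interaction between acceptance and $\gamma$ can be illustrated with the standard speculative decoding analysis of Leviathan et al. (2023), which treats each drafted token as accepted independently with probability $\alpha$ and charges each draft step a cost $c$ relative to one verifier step. The sketch below is a simplified model under those assumptions, not a fit to Figure 3; `alpha`, `cost_ratio`, and the function names are illustrative, and $\alpha$ in this model need not coincide with the acceptance rate reported in the tables.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verifier call: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return gamma + 1.0  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, cost_ratio):
    """Expected wall-time improvement over AR in the i.i.d. acceptance model.

    One step costs gamma draft passes (cost_ratio each) plus one verifier pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_gamma(alpha, cost_ratio, max_gamma=32):
    """Draft length in [1, max_gamma] that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, cost_ratio))


# Illustrative values only: a cheap drafter (5% of verifier cost).
high_acceptance_choice = best_gamma(alpha=0.9, cost_ratio=0.05)
low_acceptance_choice = best_gamma(alpha=0.5, cost_ratio=0.05)
```

In this model, a lower `alpha` moves the best draft length toward shorter drafts, which is qualitatively the pattern described for 16K.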

![Refer to caption](https://arxiv.org/html/2606.00144v1/x3.png)

Figure 3: Budget-averaged sensitivity to draft length. We average over KV budgets $B\in\{256,512,1024,2048\}$ and plot acceptance rate and speedup vs AR as a function of draft length $\gamma$. Rows correspond to context lengths (4K/8K/16K), and columns correspond to datasets (GS/LongBench/LWM).

## 6 Conclusion

We identified a sparse/full acceptance collapse in mid-to-long context speculative decoding: as the context length increases from 4K to 16K, naive sparse drafting can degrade to near-zero acceptance, making SD ineffective. We proposed BudgetDraft, a training-based alignment method that combines acceptance-aware alignment ($\mathcal{L}_A$) with multi-view sparse training ($\mathcal{L}_C$). By training the drafter under multiple randomly sampled KV budgets, BudgetDraft produces a budget-robust drafter that maintains stable acceptance and speedup under different sparsity levels at inference.

Experiments on three datasets show that BudgetDraft delivers strong end\-to\-end speedups in the 4K–16K regime on a single A100 GPU, while using only a lightweight 68M drafter\. BudgetDraft keeps the inference pipeline simple and memory\-friendly, making it practical for deployment across devices with varying memory budgets\. Future work includes improving robustness at longer contexts and exploring tighter integration with sparse\-cache policies\.

## Limitations

#### Drafter position range\.

A key limitation is the drafter’s limited native position range\. The 68M drafter operates with a short maximum position embedding, while our target setting spans mid\-to\-long contexts \(8K–16K\)\. We pass explicit position IDs to keep the drafter aligned with the verifier, but the drafter still enters a positional extrapolation regime at longer contexts, which can reduce acceptance and thus limit speedup\. However, positional extrapolation alone does not fully explain the observed behavior, since BudgetDraft substantially improves acceptance under the same drafter architecture and position range\. Using a drafter with a longer native context window or applying position\-extension techniques is a promising direction for future work\.

#### Verifier\-specific training\.

BudgetDraft trains a drafter against a fixed verifier, and retraining is required when the verifier changes\. While the training cost is moderate in our setup, we do not systematically study transfer across verifiers, such as reusing a drafter with lightweight adaptation\. Understanding how well a trained drafter transfers to related verifier checkpoints is an important next step\.

#### Greedy decoding setting\.

Our experiments focus on greedy speculative decoding to enable output\-identical generation and stable speed measurements\. Extending BudgetDraft to sampling\-based decoding \(e\.g\., temperature sampling\) would require adapting both the training objective and the acceptance rule to handle stochasticity, which we leave for future work\.
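
For reference, greedy verification can be sketched as follows. This is the generic greedy acceptance rule for speculative decoding, written over plain token lists; the function name and input layout are illustrative assumptions rather than the interface of the evaluation code.

```python
def greedy_verify(draft_tokens, verifier_argmax):
    """Return the tokens emitted by one greedy speculative decoding step.

    draft_tokens:    the gamma tokens proposed by the drafter.
    verifier_argmax: gamma + 1 verifier top-1 tokens, where entry i is the
                     verifier's choice given the prefix plus draft_tokens[:i].
    """
    assert len(verifier_argmax) == len(draft_tokens) + 1
    emitted = []
    for i, token in enumerate(draft_tokens):
        if token != verifier_argmax[i]:
            # First mismatch: keep the verifier's token and stop.
            emitted.append(verifier_argmax[i])
            return emitted
        emitted.append(token)
    # All drafts matched: the verifier contributes one extra token.
    emitted.append(verifier_argmax[-1])
    return emitted


partial = greedy_verify([1, 2, 3], [1, 2, 9, 4])
full = greedy_verify([1, 2, 3], [1, 2, 3, 4])
```

Every emitted token is the verifier's own top-1 choice, which is why the output matches greedy AR decoding with the verifier. A sampling-based rule would replace the equality test with a stochastic acceptance test.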

#### Scope of sparse cache policies\.

We adopt a specific budgeted sparse KV construction based on chunk\-level selection\. BudgetDraft is designed to be compatible with different sparse cache policies, but we do not exhaustively evaluate alternative policies or system\-level implementations \(e\.g\., kernel\-level efficiency under different eviction rules\)\. A broader study of sparse cache choices and their interaction with training objectives is left for future work\.
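
As a rough illustration of what a chunk-level budgeted selection can look like, the sketch below keeps the highest-scoring chunks that fit within a token budget. It is a generic example, not the construction used in the experiments: the chunk scores, the `chunk_size` parameter, and the function name `select_chunks` are all hypothetical, and how scores are computed is left open.

```python
def select_chunks(chunk_scores, chunk_size, budget):
    """Return the token indices kept under a KV budget of `budget` tokens.

    chunk_scores: one relevance score per contiguous chunk of the prefix
                  (hypothetical; any scoring rule could be plugged in).
    chunk_size:   number of tokens per chunk.
    budget:       maximum number of cached tokens.
    """
    num_keep = budget // chunk_size
    # Rank chunks by score, highest first, and keep as many as fit.
    ranked = sorted(range(len(chunk_scores)),
                    key=lambda c: chunk_scores[c], reverse=True)
    kept_chunks = sorted(ranked[:num_keep])  # restore original order
    return [c * chunk_size + offset
            for c in kept_chunks
            for offset in range(chunk_size)]


kept = select_chunks([0.1, 0.9, 0.3, 0.8], chunk_size=2, budget=4)
```

Changing the eviction rule in this sketch only changes how `chunk_scores` is produced; the budget constraint itself stays the same.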

## Acknowledgments

We thank the anonymous reviewers for their constructive feedback\.

## References

- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2023\)LongBench: a bilingual, multitask benchmark for long context understanding\.CoRRabs/2308\.14508\.External Links:[Link](https://doi.org/10.48550/arXiv.2308.14508)Cited by:[§4](https://arxiv.org/html/2606.00144#S4.SS0.SSS0.Px2.p1.1)\.
- Efficient inference for edge large language models: a survey\.Tsinghua Science and Technology31\(3\),pp\. 1365–1380\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p2.1)\.
- Z\. Chen, B\. Zhu, J\. Wang, H\. Shin, A\. Nallanathan, and D\. Niyato \(2026\)Network edge inference for large language models: principles, techniques, and opportunities\.ACM Computing Surveys\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p1.1)\.
- Y\. Feng, J\. Lv, Y\. Cao, X\. Xie, and S\. K\. Zhou \(2026\)Ada\-kv: optimizing kv cache eviction by adaptive budget allocation for efficient llm inference\.Advances in Neural Information Processing Systems38,pp\. 113152–113188\.Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Gao, X\. Zhang, Y\. Shen, and L\. Chen \(2025\)Apt\-serve: adaptive request scheduling on hybrid cache for scalable llm inference serving\.Proceedings of the ACM on Management of Data3\(3\),pp\. 1–28\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p4.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Hu, J\. Guo, X\. Feng, and T\. Zhao \(2026\)Adaspec: selective knowledge distillation for efficient speculative decoders\.Advances in Neural Information Processing Systems38,pp\. 88736–88758\.Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px4.p1.1)\.
- T\. Kociský, J\. Schwarz, P\. Blunsom, C\. Dyer, K\. M\. Hermann, G\. Melis, and E\. Grefenstette \(2017\)The narrativeqa reading comprehension challenge\.CoRRabs/1712\.07040\.External Links:[Link](http://arxiv.org/abs/1712.07040)Cited by:[§4](https://arxiv.org/html/2606.00144#S4.SS0.SSS0.Px2.p1.1)\.
- Y\. Leviathan, M\. Kalman, and Y\. Matias \(2023\)Fast inference from transformers via speculative decoding\.InInternational Conference on Machine Learning,pp\. 19274–19286\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p1.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Li, Y\. Li, A\. Tian, T\. Tang, Z\. Xu, X\. Chen, N\. Hu, W\. Dong, Q\. Li, and L\. Chen \(2024a\)A survey on large language model acceleration based on kv cache management\.CoRRabs/2412\.19442\.External Links:[Link](https://doi.org/10.48550/arXiv.2412.19442)Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p2.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Li, Y\. Huang, B\. Yang, B\. Venkitesh, A\. Locatelli, H\. Ye, T\. Cai, P\. Lewis, and D\. Chen \(2024b\)Snapkv: llm knows what you are looking for before generation\.Advances in Neural Information Processing Systems37,pp\. 22947–22970\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p4.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024c\)EAGLE: speculative sampling requires rethinking feature uncertainty\.InForty\-first International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=1NdN7eXyb4)Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p4.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2026a\)EAGLE\-3: scaling up inference acceleration of large language models via training\-time test\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=4exx1hUffq)Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2026b\)Eagle\-3: scaling up inference acceleration of large language models via training\-time test\.Advances in Neural Information Processing Systems38,pp\. 136737–136756\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p4.1)\.
- Z\. Liao, J\. Wang, H\. Yu, L\. Wei, J\. Li, and W\. Zhang \(2025\)E2llm: encoder elongated large language models for long\-context understanding and reasoning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 19212–19241\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p1.1)\.
- X\. Lin, C\. Yang, W\. Wang, Y\. Li, C\. Du, F\. Feng, S\. Ng, and T\. Chua \(2025\)Efficient inference for large language model\-based generative recommendation\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 91672–91697\.Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Liu, D\. Zhu, Z\. Bai, Y\. He, H\. Liao, H\. Que, Z\. Wang, C\. Zhang, G\. Zhang, J\. Zhang, Y\. Zhang, Z\. Chen, H\. Guo, S\. Li, Z\. Liu, Y\. Shan, Y\. Song, J\. Tian, W\. Wu, Z\. Zhou, R\. Zhu, J\. Feng, Y\. Gao, S\. He, Z\. Li, T\. Liu, F\. Meng, W\. Su, Y\. Tan, Z\. Wang, J\. Yang, W\. Ye, B\. Zheng, W\. Zhou, W\. Huang, S\. Li, and Z\. Zhang \(2025a\)A comprehensive survey on long context language modeling\.CoRRabs/2503\.17407\.External Links:[Link](https://doi.org/10.48550/arXiv.2503.17407)Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p1.1)\.
- X\. Liu, L\. Hu, P\. Bailis, I\. Stoica, Z\. Deng, A\. Cheung, and H\. Zhang \(2023\)Online speculative decoding\.CoRRabs/2310\.07177\.External Links:[Link](https://doi.org/10.48550/arXiv.2310.07177)Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px4.p1.1)\.
- X\. Liu, J\. Yu, J\. Park, I\. Stoica, and A\. Cheung \(2025b\)Speculative decoding: performance or illusion?\.arXiv preprint arXiv:2601\.11580\.Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Peng, J\. Quesnelle, H\. Fan, and E\. Shippole \(2024\)Yarn: efficient context window extension of large language models\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 31932–31951\.Cited by:[§4](https://arxiv.org/html/2606.00144#S4.SS0.SSS0.Px1.p1.1)\.
- J\. W\. Rae, A\. Potapenko, S\. M\. Jayakumar, C\. Hillier, and T\. P\. Lillicrap \(2020\)Compressive transformers for long\-range sequence modelling\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=SylKikSYDH)Cited by:[§4](https://arxiv.org/html/2606.00144#S4.SS0.SSS0.Px2.p1.1)\.
- R\. Sadhukhan, J\. Chen, Z\. Chen, V\. Tiwari, R\. Lai, J\. Shi, I\. Yen, A\. May, T\. Chen, and B\. Chen \(2025\)Magicdec: breaking the latency\-throughput tradeoff for long context generation with speculative decoding\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 6835–6850\.Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px4.p1.1)\.
- H\. Sun, Z\. Chen, X\. Yang, Y\. Tian, and B\. Chen \(2024\)TriForce: lossless acceleration of long sequence generation with hierarchical speculative decoding\.CoRRabs/2404\.11912\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.11912)Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p4.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, D\. Bikel, L\. Blecher, C\. Canton\-Ferrer, M\. Chen, G\. Cucurull, D\. Esiobu, J\. Fernandes, J\. Fu, W\. Fu, B\. Fuller, C\. Gao, V\. Goswami, N\. Goyal, A\. Hartshorn, S\. Hosseini, R\. Hou, H\. Inan, M\. Kardas, V\. Kerkez, M\. Khabsa, I\. Kloumann, A\. Korenev, P\. S\. Koura, M\. Lachaux, T\. Lavril, J\. Lee, D\. Liskovich, Y\. Lu, Y\. Mao, X\. Martinet, T\. Mihaylov, P\. Mishra, I\. Molybog, Y\. Nie, A\. Poulton, J\. Reizenstein, R\. Rungta, K\. Saladi, A\. Schelten, R\. Silva, E\. M\. Smith, R\. Subramanian, X\. E\. Tan, B\. Tang, R\. Taylor, A\. Williams, J\. X\. Kuan, P\. Xu, Z\. Yan, I\. Zarov, Y\. Zhang, A\. Fan, M\. Kambadur, S\. Narang, A\. Rodriguez, R\. Stojnic, S\. Edunov, and T\. Scialom \(2023\)Llama 2: open foundation and fine\-tuned chat models\.CoRRabs/2307\.09288\.External Links:[Link](https://doi.org/10.48550/arXiv.2307.09288)Cited by:[§4](https://arxiv.org/html/2606.00144#S4.SS0.SSS0.Px1.p1.1)\.
- G\. Xiao, J\. Tang, J\. Zuo, J\. Guo, S\. Yang, H\. Tang, Y\. Fu, and S\. Han \(2025\)Duoattention: efficient long\-context llm inference with retrieval and streaming heads\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 37228–37253\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p2.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\)Efficient streaming language models with attention sinks\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 21875–21895\.Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p4.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Yang, C\. Du, F\. Zhang, H\. Wang, T\. Pang, C\. Du, and B\. An \(2025\)LongSpec: long\-context lossless speculative decoding with efficient drafting and verification\.InES\-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models,External Links:[Link](https://openreview.net/forum?id=GFN9PWbfHs)Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Zhang, Y\. Sheng, T\. Zhou, T\. Chen, L\. Zheng, R\. Cai, Z\. Song, Y\. Tian, C\. Ré, C\. Barrett,et al\.\(2023\)H2o: heavy\-hitter oracle for efficient generative inference of large language models\.Advances in Neural Information Processing Systems36,pp\. 34661–34710\.Cited by:[§2](https://arxiv.org/html/2606.00144#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhou, X\. Ning, K\. Hong, T\. Fu, J\. Xu, S\. Li, Y\. Lou, L\. Wang, Z\. Yuan, X\. Li, S\. Yan, G\. Dai, X\. Zhang, Y\. Dong, and Y\. Wang \(2024\)A survey on efficient inference for large language models\.CoRRabs/2404\.14294\.External Links:[Link](https://doi.org/10.48550/arXiv.2404.14294)Cited by:[§1](https://arxiv.org/html/2606.00144#S1.p2.1)\.

## Appendix A Peak VRAM Usage

We report peak GPU memory (VRAM) using the logged field `peak_gpu_mb` and convert it to GB. Table [4](https://arxiv.org/html/2606.00144#A1.T4) summarizes peak VRAM as a function of draft length $\gamma$ at 4K/8K/16K. For each context length and $\gamma$, we average peak VRAM over KV budgets $B\in\{256,512,1024,2048\}$ and over the three datasets. This table shows the memory trend as $\gamma$ increases under each context length, and directly compares SD (sparse/full) (original) with BudgetDraft ($\mathcal{L}_A+0.5\,\mathcal{L}_C$).
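
The aggregation described above amounts to a grouped mean followed by a unit conversion. A minimal sketch, assuming each log record is a dictionary with hypothetical keys `context` and `gamma` alongside the logged `peak_gpu_mb` field, and assuming 1024 MB per GB (the divisor is not stated in the text):

```python
from collections import defaultdict
from statistics import mean


def average_peak_vram_gb(records, mb_per_gb=1024.0):
    """Average peak VRAM per (context length, gamma) and convert MB to GB.

    Each record covers one (dataset, KV budget) run; grouping by
    (context, gamma) therefore averages over budgets and datasets.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record["context"], record["gamma"])].append(record["peak_gpu_mb"])
    return {key: mean(values) / mb_per_gb for key, values in groups.items()}


# Made-up numbers, for illustration only.
example = average_peak_vram_gb([
    {"context": 4096, "gamma": 5, "peak_gpu_mb": 2048.0},
    {"context": 4096, "gamma": 5, "peak_gpu_mb": 4096.0},
    {"context": 8192, "gamma": 5, "peak_gpu_mb": 1024.0},
])
```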

Peak VRAM is comparable between BudgetDraft and SD (sparse/full). At 8K and 16K, peak VRAM is nearly identical across methods for the evaluated $\gamma$ range, suggesting that peak usage is dominated by the full-KV verifier side in this sparse/full pipeline. At 4K, SD (sparse/full) exhibits a clear increase in peak VRAM as $\gamma$ grows, while BudgetDraft remains almost flat, indicating that BudgetDraft achieves higher speedup without additional memory overhead at larger draft lengths.

For a fair, fixed decoding setting, Table [5](https://arxiv.org/html/2606.00144#A1.T5) further reports detailed peak VRAM at $\gamma=5$ for each KV budget and dataset. The detailed results are consistent with the aggregate trend: BudgetDraft matches SD (sparse/full) at 8K/16K and is slightly lower at 4K under some budgets.

Table 4: Peak VRAM (GB) vs draft length $\gamma$. For each context length and $\gamma$, values are averaged over KV budgets $B\in\{256,512,1024,2048\}$ and the three datasets (GS, LongBench, and LWM). Missing entries indicate that $\gamma$ was not evaluated at that context length. Values are converted from `peak_gpu_mb`.

Table 5: Peak VRAM usage (GB) at $\gamma=5$. We report peak GPU memory converted from `peak_gpu_mb` under the same context length and KV budget. SD (sparse/full) uses the untrained drafter; BudgetDraft uses $\mathcal{L}_A+0.5\,\mathcal{L}_C$.

## Appendix B Per-Sample Variance Across Draft Lengths

To evaluate the stability of speculative decoding under different draft lengths, we measure per-sample variance across datasets and context lengths. Figure [4](https://arxiv.org/html/2606.00144#A2.F4) reports the mean and standard deviation of acceptance rate and end-to-end speedup versus AR as functions of draft length $\gamma$. Results are averaged over KV budgets $B\in\{256,512,1024,2048\}$, with approximately 36 samples per point.

![Refer to caption](https://arxiv.org/html/2606.00144v1/x4.png)

Figure 4: Per-sample variance (mean $\pm$ std) across draft lengths $\gamma$. Acceptance and speedup are averaged over KV budgets $B\in\{256,512,1024,2048\}$. Rows correspond to context lengths (4K/8K/16K), and columns correspond to datasets (GS/LongBench/LWM).

Several trends are consistent across datasets and context lengths. First, acceptance generally decreases as $\gamma$ increases, while speedup initially improves and then saturates or declines, reflecting the standard speculative decoding trade-off between drafting more tokens and verification overhead. Second, variance remains relatively small in the main operating region, indicating that the observed gains are stable across samples rather than driven by a few outliers.

The variance patterns also differ across datasets. GS and LongBench show smoother acceptance degradation as $\gamma$ increases, whereas LWM exhibits stronger sensitivity at longer contexts, especially at 16K. This behavior is consistent with the larger semantic diversity and longer dependency structure in narrative-style inputs. Despite this increased variance, the overall trends remain stable across budgets and datasets.

Importantly, the curves remain smooth even at 8K–16K, without abrupt collapse across neighboring $\gamma$ values. This suggests that the degradation in long-context speculative decoding is not solely caused by catastrophic positional extrapolation failure. Instead, the behavior is more consistent with gradually increasing sparse/full mismatch under longer contexts, which aligns with the motivation behind BudgetDraft.

## Appendix C $\lambda$ Sensitivity

This appendix provides detailed results for the $\lambda$ sensitivity study in the main text. We compare $\mathcal{L}_A+0.5\,\mathcal{L}_C$ ($\lambda=0.5$) with $\mathcal{L}_A+\mathcal{L}_C$ ($\lambda=1.0$). The complete results are reported in Table [6](https://arxiv.org/html/2606.00144#A3.T6).

Across 4K and 8K, the two settings are closely matched across datasets and KV budgets. Acceptance is nearly identical, and speedup differs only marginally, indicating that performance in the main mid-context regime does not hinge on a narrow choice of $\lambda$ once the drafter is trained with multi-view sparse supervision.

At 16K, $\lambda=1.0$ is often slightly better, but the gains remain small. On GS and LongBench, $\lambda=1.0$ consistently yields a modest increase in speedup across budgets, with acceptance remaining nearly unchanged. On LWM, $\lambda=1.0$ improves speedup for most budgets, while a small exception appears at $B=512$, where $\lambda=0.5$ is marginally higher. This indicates that increasing the sparse-view weight can shift the trade-off across budgets in the longest setting, but the overall sensitivity is limited.

We use $\lambda=0.5$ as the default since it matches $\lambda=1.0$ closely in the main 4K–8K regime and provides a consistent choice across budgets, while avoiding the need to re-tune $\lambda$ for different deployment constraints.

Table 6: $\lambda$ sensitivity at $\gamma=5$ ($\mathcal{L}_A+0.5\,\mathcal{L}_C$ vs $\mathcal{L}_A+\mathcal{L}_C$). Entries report speedup vs AR / acceptance rate (%) under the same $\gamma$, context length, and KV budget.

## Appendix D Why Smaller KV Budgets Can Increase Acceptance at 4K

Figure [1](https://arxiv.org/html/2606.00144#S1.F1) and Table [1](https://arxiv.org/html/2606.00144#S4.T1) show a seemingly counter-intuitive pattern for SD (sparse/full) at 4K: smaller KV budgets can lead to higher acceptance than larger budgets. This effect does not contradict the sparse/full mismatch story. Instead, it reflects how budgeted sparse KV selection interacts with a small drafter in the short-to-mid context regime.

#### Sparse KV as an information filter\.

At 4K, next\-token predictions are often dominated by a small subset of the prefix, typically recent tokens and a few highly relevant segments\. A smaller KV budget forces the sparse cache to keep only the most relevant chunks \(or tokens\) under the selection policy\. This can act as an information filter that removes weakly relevant history\. For a small drafter, reducing low\-signal context can stabilize attention, sharpen the conditional distribution, and increase the probability that the drafter’s top\-1 token matches the verifier’s top\-1 token, which directly increases greedy acceptance\.

#### More cached context can introduce distractors for a small drafter\.

While a larger KV budget intuitively adds more context, in practice it also retains much irrelevant or noisy content alongside the useful evidence\. A small drafter has limited capacity to integrate long context and can be more sensitive to distractors than the verifier\. With a larger sparse cache, attention mass is spread across more tokens, and small differences between drafter and verifier in how evidence is weighted can shift probability mass among competing candidates\. Since greedy acceptance requires a strict top\-1 match, these shifts can reduce acceptance even when overall modeling quality does not degrade\.

#### Position handling near the drafter limit\.

In our setup, the drafter has a native maximum position embedding of 2048 tokens. At 4K prompts, position handling (e.g., explicit position IDs, clamping, or re-indexing for sparse-cache evaluation) can introduce additional mismatch, and this mismatch can be amplified when more distant tokens are retained. Smaller budgets tend to retain more recent or higher-attention chunks, reducing exposure to position-related distortion and making drafter–verifier agreement easier. We discuss the broader impact of position range in the [Limitations](https://arxiv.org/html/2606.00144#Sx1) section.
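
Two of the position-handling options named above, clamping and re-indexing, can be contrasted on a toy example. The sketch below is illustrative only: `clamp_positions` and `reindex_positions` are hypothetical helpers, the retained positions are made up, and the exact scheme used in the experiments may differ.

```python
MAX_POS = 2048  # native maximum position of the drafter, as stated above


def clamp_positions(kept_positions, max_pos=MAX_POS):
    """Clamp original position IDs into the drafter's native range."""
    return [min(p, max_pos - 1) for p in kept_positions]


def reindex_positions(kept_positions):
    """Renumber retained tokens contiguously, discarding the original gaps."""
    return list(range(len(kept_positions)))


# Hypothetical positions retained from a 4K prompt by a sparse cache.
kept = [0, 1, 2047, 2048, 4000, 4095]
clamped = clamp_positions(kept)      # distant tokens collapse onto one ID
reindexed = reindex_positions(kept)  # order preserved, true distances lost
```

Either choice distorts what the drafter sees relative to the verifier's original positions, and the distortion only affects tokens beyond the native range or separated by evicted spans, which is why retaining fewer distant tokens reduces it.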

#### Takeaway\.

The 4K behavior is therefore expected: a smaller sparse cache can improve SD \(sparse/full\) acceptance by filtering noise and reducing sensitivity to long\-range distractors and position\-related mismatch\. As context length grows, sparse/full mismatch becomes the dominant factor, and SD \(sparse/full\) acceptance collapses across budgets, which motivates BudgetDraft\.

## Appendix E Single-Budget vs. Multi-Budget Sparse Training

To isolate the effect of multi-budget sparse training, we compare BudgetDraft with a single-budget variant. The single-budget variant uses the same objective $\mathcal{L}_A+0.5\,\mathcal{L}_C$, but fixes the sparse-cache branch to $B=1024$ during training. We evaluate both models on LWM at 16K with $\gamma=20$, where SD (sparse/full) collapses and BudgetDraft achieves its strongest 16K speedup.

Table 7: Single-budget vs. multi-budget sparse training on LWM at 16K with $\gamma=20$. Entries report speedup vs AR / acceptance rate (%). The single-budget variant is trained with a fixed sparse-cache budget $B=1024$, while multi-budget training samples from $B\in\{256,512,1024,2048\}$.

Table [7](https://arxiv.org/html/2606.00144#A5.T7) shows that single-budget sparse training does not provide stable generalization across inference budgets. Although the single-budget variant performs well at $B=256$, its speedup and acceptance degrade as the inference budget increases. The degradation is most pronounced at $B=2048$, where speedup drops to $1.47\times$ and acceptance falls to 14.61%. In contrast, multi-budget training maintains $2.10\times$ speedup and 34.17% acceptance across all budgets. This confirms that the benefit of $\mathcal{L}_C$ comes not only from adding a sparse-cache branch, but also from exposing the drafter to multiple sparse views during training, which improves budget stability at deployment.
