MicroSpec: Accelerating Speculative Decoding with Lightweight In-Context Vocabularies

arXiv cs.CL 05/27/26, 04:00 AM Papers
Summary
MicroSpec is a training-free technique that builds compact, context-sensitive vocabularies on-the-fly to accelerate speculative decoding in large language models, reducing average vocabulary size by over 40x and achieving up to 1.32x end-to-end speedup over EAGLE-2.
arXiv:2605.26444v1 Announce Type: new Abstract: Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than 40x (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12-1.32x relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.
Cached at: 05/27/26, 09:05 AM
# Accelerating Speculative Decoding with Lightweight In-Context Vocabularies
Source: [https://arxiv.org/html/2605.26444](https://arxiv.org/html/2605.26444)
###### Abstract

Large language models typically employ vocabularies of over 100k tokens, which creates a major computational bottleneck at the final linear projection layer when performing speculative decoding. Current methods for vocabulary pruning depend on either fixed or coarse-grained sub-vocabularies, requiring around 30k active tokens to preserve the quality of the draft model. We introduce MicroSpec, a training-free technique that overcomes this limitation by building a compact, context-sensitive active vocabulary on the fly for every decoding step. Exploiting the natural temporal locality found in language generation, MicroSpec attains high token coverage while reducing the average vocabulary size by more than $40\times$ (down to under 3k tokens), all without any additional trained parameters. To translate this high sparsity into actual speedups on contemporary hardware, we present a co-designed system and algorithm that mitigates the overhead of sparse memory accesses via asynchronous gathering and GPU-resident state management. Acting as a plug-and-play enhancement, MicroSpec reduces draft inference latency by 51.6% on average, achieving an end-to-end speedup of 1.12–1.32× relative to the leading speculative decoding approach EAGLE-2 on various benchmarks, while also surpassing more sophisticated training-based pruning baselines.

Keywords: large language model, speculative decoding, vocabulary pruning

## 1 Introduction

Large Language Models (LLMs) have demonstrated unprecedented capabilities across diverse tasks (Han et al., [2021](https://arxiv.org/html/2605.26444#bib.bib1); Zhou et al., [2025](https://arxiv.org/html/2605.26444#bib.bib2); Naveed et al., [2025](https://arxiv.org/html/2605.26444#bib.bib3)). To capture multilingual and domain-specific semantics (Tao et al., [2024](https://arxiv.org/html/2605.26444#bib.bib10); Takase et al., [2025](https://arxiv.org/html/2605.26444#bib.bib11)), their vocabulary sizes frequently exceed 100k tokens (e.g., 128k for Llama-3, 152k for Qwen-2 and Qwen-3). While beneficial for generation quality, these expansive vocabularies pose significant challenges for efficient inference.

Speculative decoding (SD) (Miao et al., [2024](https://arxiv.org/html/2605.26444#bib.bib24); Chen et al., [2023](https://arxiv.org/html/2605.26444#bib.bib25); Leviathan et al., [2023](https://arxiv.org/html/2605.26444#bib.bib23)) has emerged as a premier technique to accelerate inference without compromising quality, employing a fast draft mechanism to generate tentative sequences for parallel verification by the target LLM. The efficacy of SD relies fundamentally on the draft mechanism being significantly faster than the target. However, recent work exposes a critical bottleneck when scaling SD to large-vocabulary models: the computational cost of the draft model’s final linear projection (the LM head) becomes prohibitive (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12); Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14); Zhang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib13)). Crucially, this bottleneck affects diverse draft architectures, including separate small models and integrated heads like EAGLE or Medusa (Li et al., [2024](https://arxiv.org/html/2605.26444#bib.bib17); Cai et al., [2024](https://arxiv.org/html/2605.26444#bib.bib18)), as they all rely on the LM head to map hidden states to massive vocabulary spaces. For example, this single projection step can consume over 60% of the total draft inference time on Llama-3-8B and Qwen-2.5-7B (Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14)), severely limiting achievable speedups.

Table 1: Comparison of characteristic features and preliminary performance metrics across different vocabulary pruning strategies for speculative decoding using Llama-3-8B-Instruct on SpecBench and HumanEval.

| Method | Pruning Strategy | Extra Training | Active Vocab. Size | Speed (tokens/s) | Avg. Accept Length |
| --- | --- | --- | --- | --- | --- |
| EAGLE-2 (Li et al., [2024](https://arxiv.org/html/2605.26444#bib.bib17)) | None (Full Vocabulary) | None | 128k | 336.9 | 3.80 |
| FR-Spec (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12)) | High-Frequency Tokens | None | 32k | 369.7 | 3.62 |
| CORAL (Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14)) | Route to Fixed Clusters | 1-layer FFN | 32k | 359.9 | 3.92 |
| DynaSpec (Zhang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib13)) | Route to Fixed Clusters | 1 MLP | 27k | 378.1 | 3.78 |
| MicroSpec (Ours) | Variant by Context | None | $<3$k | 392.7 | 3.59 |

A direct remedy is to prune the draft model’s output vocabulary. As draft tokens are verified by the target model, whose vocabulary remains untouched, pruning does not compromise final output quality. Existing approaches typically adopt context-agnostic or coarse-grained strategies. Methods like FR-Spec (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12)) and VocabTrim (Goel et al., [2025](https://arxiv.org/html/2605.26444#bib.bib15)) statically select high-frequency tokens as the pruned vocabulary based on corpus statistics, while others like CORAL (Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14)) and DynaSpec (Zhang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib13)) employ trained auxiliary routers to select from predefined sub-vocabulary clusters. However, a fundamental limitation of these approaches is that they cannot adapt at a fine granularity to the immediate context. To maintain reasonable draft quality and capture necessary long-tail tokens, they are forced to operate with relatively large active vocabularies (e.g., $\sim$30k tokens as shown in Table [1](https://arxiv.org/html/2605.26444#S1.T1)), missing the opportunity for substantial acceleration.

In this work, we challenge this paradigm by seeking the minimal sufficient vocabulary required for each specific generation step. Our core insight is that language generation exhibits strong temporal locality (Saxena, [2023](https://arxiv.org/html/2605.26444#bib.bib16); Wang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib4); Chen et al., [2025](https://arxiv.org/html/2605.26444#bib.bib5)): the next token is overwhelmingly likely to be present in the immediate context or be a close extension of it. Driven by this, we propose MicroSpec, a training-free approach that dynamically constructs a minimalist active vocabulary (e.g., 2k-3k tokens) per step solely from token history and recent high-probability candidates, eschewing learned routers completely. While theoretically compelling, realizing the benefits of a highly dynamic, ultra-small vocabulary presents its own system-level challenge: the resulting sparse memory access patterns for gathering LM head weights are notoriously inefficient on modern GPUs, potentially negating computational savings. To overcome this, MicroSpec integrates system-algorithm co-design through asynchronous gathering and GPU-resident state management, effectively mitigating the latency overhead of dynamic sparse computation.

As summarized in Table [1](https://arxiv.org/html/2605.26444#S1.T1), MicroSpec achieves state-of-the-art generation speed (392.7 tokens/s on Llama-3-8B) using an average dynamic vocabulary of $<3$k tokens, an order of magnitude smaller than existing approaches (27k-32k), without requiring any additional training or auxiliary parameters. By leveraging the inherent contextual nature of language, supported by optimized system design, we break the long-standing trade-off between vocabulary size and draft acceptance rate, unlocking a new regime of efficient speculative decoding.

In summary, our work makes the following contributions:

- We provide empirical analysis quantifying the untapped potential of dynamic vocabulary pruning and the strong temporal locality in LLM generation.
- We propose a simple, training-free dynamic pruning method achieving high coverage with minimal vocabulary size ($<3$k), co-designed with system-level techniques to resolve sparse memory access overheads.
- We demonstrate that our method serves as a plug-and-play module, achieving a complementary 1.12–1.32× speedup over the state-of-the-art EAGLE-2 and surpassing complex trained baselines across diverse benchmarks.

## 2 Motivation

In this section, we first quantify the untapped performance potential in existing speculative decoding systems due to suboptimal vocabulary usage\. Then, we validate our core insight, “temporal locality”, demonstrating that a small, dynamically constructed vocabulary can unlock this potential\.

![Refer to caption](https://arxiv.org/html/2605.26444v1/x1.png) (a) End-to-End Speed vs. Vocab Size

![Refer to caption](https://arxiv.org/html/2605.26444v1/x2.png) (b) Ground Truth Coverage Empirical Analysis

Figure 1: Motivation analysis using Llama-3-8B. (a) The untapped potential in static vocabulary pruning. While theoretical acceleration exists for smaller vocabularies (oracle speed), actual speed drops rapidly due to reduced draft acceptance rates. Our target is to bridge this gap. (b) Vocabulary coverage analysis across diverse domains. Static methods (dashed lines) require large vocabularies for high coverage. Our approach (MicroSpec, stars) achieves superior coverage with a minimalist dynamic vocabulary (maximum of 3k tokens).

### 2.1 Rethinking Draft Vocabulary Pruning: Beyond the Accuracy-Efficiency Trade-off

As recent studies confirm (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12); Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14); Zhang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib13)), the output projection (LM head) of the draft model has emerged as a dominant bottleneck in speculative decoding, especially as model vocabularies inflate to over 100k text tokens (e.g., 128k for Llama-3, 152k for Qwen-2) (Tao et al., [2024](https://arxiv.org/html/2605.26444#bib.bib10)). To mitigate this, pruning the draft vocabulary is a compelling strategy.

However, existing state-of-the-art approaches operate under a static paradigm: either selecting a fixed subset of high-frequency tokens (e.g., FR-Spec (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12)), VocabTrim (Goel et al., [2025](https://arxiv.org/html/2605.26444#bib.bib15))) or training a router to select from a fixed set of pre-clustered vocabularies (e.g., CORAL (Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14)), DynaSpec (Zhang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib13))). These methods seem trapped in an inevitable dilemma: a smaller vocabulary is necessary for compute savings, but blindly restricting it severely degrades the draft’s acceptance rate because long-tail tokens can no longer be recalled, often neutralizing any speed gains. We argue that this perceived trade-off stems from the inherent limitations of their static or coarse-grained nature, which forces them to ignore fine-grained, instance-specific context. To quantify this trade-off and the resulting untapped potential, we define two scenarios:

- Oracle Scenario (Theoretical Limit): we hypothesize an ideal state where the draft model’s acceptance rate remains unaffected by vocabulary reduction. This isolates and highlights the pure computational gain achievable by shrinking the LM head’s computation.
- Actual Scenario: we evaluate the real-world performance using FR-Spec (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12)) as a representative static pruning method, where the acceptance rate drops as the pruned vocabulary shrinks.

We vary the pruned vocabulary size from 32k down to 0.5k to explore the potential speedup, where 32k is the optimal setting empirically reported in FR-Spec. Figure [1](https://arxiv.org/html/2605.26444#S2.F1)(a) illustrates this analysis on Llama-3-8B-Instruct, using the multi-turn conversation task as a case study.

The oracle line (gray) shows that theoretically, shrinking the vocabulary from 32k down to roughly 2k should yield continuous speed gains. However, the actual line (green) reveals a stark reality: static pruning yields rapidly diminishing returns below a critical threshold ($\sim$16k) and suffers catastrophic performance drops at smaller sizes ($<2$k). This sharp decline occurs because overly aggressive, context-agnostic pruning fails to capture long-tail tokens crucial for the specific current context, drastically reducing the draft acceptance rate, as shown in the blue bars of Figure [1](https://arxiv.org/html/2605.26444#S2.F1)(a). The substantial shaded area between the actual and theoretical curves represents a massive performance gap: the untapped potential of speculative decoding currently masked by inefficient, static vocabulary management. This gap is the direct target of our work.

### 2.2 The Bridge: Leveraging Strong Temporal Locality

Our approach is rooted in a key insight designed to bridge this gap: the behavior of LLM generation is governed by strong temporal locality (Saxena, [2023](https://arxiv.org/html/2605.26444#bib.bib16); Wang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib4)). We hypothesize that the tokens LLMs generate next are overwhelmingly likely to be present in the recent context history or to be closely related extensions of it. Based on this hypothesis, we argue that the active vocabulary of the draft model can be dynamically pruned per step to a much smaller, context-aware subset while still capturing sufficient correct semantics.

To validate this empirically, we analyze the ground truth coverage: the probability that the correct token $y_t^*$ selected by the target LLM at each step $t$ is present in a pruned vocabulary. We compare two strategies:

- FR-Spec (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12)): Static vocabulary with top-$K$ high-frequency tokens based on corpus statistics.
- MicroSpec: Dynamic vocabulary based on the current context and generation trajectory (detailed in Section [3](https://arxiv.org/html/2605.26444#S3)).
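The coverage metric defined above can be illustrated with a minimal sketch. The token ids, the static subset, and the "everything seen so far" dynamic rule below are made-up stand-ins chosen for illustration; they are not the paper's data and not MicroSpec's actual construction, which is described in Section 3.

```python
def coverage(target_tokens, vocab_per_step):
    """Fraction of steps whose target token y_t* lies in that step's pruned vocabulary."""
    hits = sum(1 for y, vocab in zip(target_tokens, vocab_per_step) if y in vocab)
    return hits / len(target_tokens)


# Hypothetical token ids chosen by the target model, one per decoding step.
target_tokens = [5, 9, 5, 42, 9, 7]

# Static strategy: the same fixed subset is used at every step.
static_vocab = {1, 2, 3, 5, 7}
static_cov = coverage(target_tokens, [static_vocab] * len(target_tokens))

# Dynamic stand-in: at each step the vocabulary is the set of tokens seen so far
# (seeded with two assumed prompt tokens). This only mimics temporal locality.
seen = {5, 9}
dynamic_vocabs = []
for y in target_tokens:
    dynamic_vocabs.append(set(seen))  # vocabulary available before y is generated
    seen.add(y)
dynamic_cov = coverage(target_tokens, dynamic_vocabs)

print(f"static coverage:  {static_cov:.2f}")
print(f"dynamic coverage: {dynamic_cov:.2f}")
```

In this toy run the repeated tokens are covered by the dynamic set but not by the fixed one, which is the effect the coverage comparison is meant to expose.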

Figure [1](https://arxiv.org/html/2605.26444#S2.F1)(b) compares these coverage rates across varied domains (detailed in Section [4.1](https://arxiv.org/html/2605.26444#S4.SS1)) using Llama-3-8B-Instruct. The static methods (dashed lines) require prohibitively large vocabularies (often $>16$k or even $>32$k) to achieve acceptable coverage ($>85\%$). This explains the sharp performance drop observed in Figure [1](https://arxiv.org/html/2605.26444#S2.F1)(a): below a certain size, static vocabularies simply miss too many correct tokens. In stark contrast, MicroSpec (stars) consistently achieves high coverage ($73\%$-$97\%$ depending on the task) while keeping the vocabulary very small, at most 3k tokens.

This analysis provides strong motivating evidence: a minimalist, dynamically constructed vocabulary based on explicit context is sufficient to capture the vast majority of generated tokens. This pivotal insight allows us to operate in the high-speed ($<3$k size) regime that was previously unattainable by static methods, effectively enabling us to recover the untapped potential identified in Figure [1](https://arxiv.org/html/2605.26444#S2.F1)(a).

## 3 Methodology

![Refer to caption](https://arxiv.org/html/2605.26444v1/x3.png)

Figure 2: Overview of MicroSpec. Given the context “The old wooden ship had”, long-tail tokens “weathered, barnacle, hull” lie outside a fixed boundary, despite being highly probable in this specific context. Consequently, the draft model is forced to select generic, suboptimal high-frequency alternatives (“broken, surface”). MicroSpec recovers from this by dynamically constructing the vocabulary of the draft model.

In this section, we first formulate the critical computational bottleneck in large-vocabulary speculative decoding. Then we introduce our training-free algorithm for constructing a context-aware dynamic vocabulary and detail the accompanying system implementation designed to realize the theoretical speedups of this dynamic approach on modern GPUs.

### 3.1 Preliminaries and Problem Formulation

Consider a standard auto-regressive large language model $M_{\theta}$. At decoding step $t$, given context $x_{<t}$, the model predicts the next token distribution via a final linear projection layer, conventionally termed the LM head. Let $W_{head} \in \mathbb{R}^{V \times d}$ represent the weight of the LM head, where $V$ is the vocabulary size originally defined during pretraining, and $d$ is the hidden dimension. The logits $z_t \in \mathbb{R}^{V}$ are computed from the final hidden state $h_t \in \mathbb{R}^{d}$ as:

$$z_t = W_{head} h_t. \tag{1}$$

Speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2605.26444#bib.bib23); Miao et al., [2024](https://arxiv.org/html/2605.26444#bib.bib24); Chen et al., [2023](https://arxiv.org/html/2605.26444#bib.bib25); Li et al., [2024](https://arxiv.org/html/2605.26444#bib.bib17)) employs a smaller, faster draft model $M_{draft}$ to generate a draft of $\gamma$ speculative tokens, which are subsequently verified by the target LLM $M_{target}$ in parallel. While effective, the theoretical speedup is heavily bounded by the latency of the draft model itself. As vocabulary sizes $V$ in modern LLMs scale massively (e.g., $V \approx 128$k for Llama-3, $V \approx 152$k for Qwen-2.5), the matrix-vector multiplication in Eq. ([1](https://arxiv.org/html/2605.26444#S3.E1)) becomes surprisingly expensive, reaching a complexity of $O(V \cdot d)$ per token. For efficient draft models where the backbone consists of few layers (e.g., Medusa heads or EAGLE layers (Cai et al., [2024](https://arxiv.org/html/2605.26444#bib.bib18); Li et al., [2024](https://arxiv.org/html/2605.26444#bib.bib17))), this final linear projection often dominates total inference latency, severely capping the achievable speedup of existing approaches (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12); Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14)).

Our objective is to accelerate this operation by replacing the draft model’s full set of vocabulary indices $\mathcal{I}_{full} = \{1, \dots, V\}$ with a substantially smaller, dynamically determined subset of indices $\mathcal{I}_t \subset \mathcal{I}_{full}$ at each decoding step $t$, such that $|\mathcal{I}_t| \ll V$. This allows replacing the dense computation in Eq. ([1](https://arxiv.org/html/2605.26444#S3.E1)) with a sparse operation involving only a sub-matrix $W_{head}[\mathcal{I}_t, :]$. Crucially, as we only optimize the draft model and do not alter the target model’s vocabulary, the generation quality is guaranteed to be lossless (i.e., identical to standard sampling (Leviathan et al., [2023](https://arxiv.org/html/2605.26444#bib.bib23))).
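To make the sub-matrix projection concrete, here is a minimal pure-Python sketch under toy assumptions: the sizes `V = 50` and `d = 8`, the random weights, and the index list `I_t` are all hypothetical. It shows only that the logits computed from the rows $W_{head}[\mathcal{I}_t, :]$ equal the corresponding entries of the full logits, while needing $|\mathcal{I}_t|$ dot products instead of $V$.

```python
import random


def matvec(W, h):
    """Dense matrix-vector product: one dot product per row of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


random.seed(0)
V, d = 50, 8  # toy sizes; real models use V > 100k
W_head = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(V)]
h_t = [random.uniform(-1, 1) for _ in range(d)]

# Eq. (1): full projection, V dot products.
z_full = matvec(W_head, h_t)

# Restricted projection: only the rows listed in the (hypothetical) active set I_t.
I_t = [3, 17, 20, 41]
W_sub = [W_head[i] for i in I_t]  # plays the role of W_head[I_t, :]
z_sub = matvec(W_sub, h_t)        # |I_t| dot products

# The draft token is the arg-max over the active set, mapped back to a vocabulary id.
best_slot = max(range(len(I_t)), key=lambda j: z_sub[j])
draft_token = I_t[best_slot]
print(draft_token, len(z_full), len(z_sub))
```

The draft can only propose tokens inside `I_t`; any token it misses is still recoverable because the target model verifies against its full vocabulary.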

### 3.2 Algorithm: Context-Aware Dynamic Vocabulary Construction

As outlined in Section [2](https://arxiv.org/html/2605.26444#S2), we propose a deterministic mechanism to construct a context-aware $\mathcal{I}_t$ by leveraging the strong temporal locality inherent in LLM generation. Our core insight is that the optimal draft vocabulary at any moment is heavily skewed towards tokens present in the immediate context history and high-probability candidates recently considered by the target model.

Figure [2](https://arxiv.org/html/2605.26444#S3.F2) illustrates how MicroSpec updates its dynamic vocabulary. We formalize this by defining an evolving candidate token stream and deriving $\mathcal{I}_t$ via a fixed-size sliding window over this stream.

Candidate Stream Initialization (Prefill Phase). Given an input prompt sequence $x_{1:L}$, the target model processes the prompt to generate a sequence of logits $Z_{1:L} = (z_1, \dots, z_L)$. We define a Top-K operator $\mathcal{T}_K(z)$ that returns the set of indices corresponding to the $K$ largest values in a logit vector $z$. The candidate stream $\mathbf{S}$ is initialized as the concatenation of the exact prompt tokens and the union of top-$K_{pre}$ candidates from each position in the prompt:

$$\mathbf{S}_{init} = (x_1, \dots, x_L) \oplus \text{tuple}\left(\bigcup_{i=1}^{L} \mathcal{T}_{K_{pre}}(z_i)\right), \tag{2}$$

where $\oplus$ denotes sequence concatenation, and $K_{pre}$ is a hyper-parameter controlling the exploration breadth during prefill.

Dynamic Stream Update (Decoding Phase). At any decoding step $t$, let the draft model generate a speculative token tree, and let the target model’s verification process on the validated branch yield final logits $z_{verify}$. We collect two sets of new candidates to append to the stream:

1. Draft Candidates ($\mathcal{C}_{draft}$): The set of all unique token indices present anywhere in the generated draft tree at step $t$, regardless of acceptance.
2. Verify Candidates ($\mathcal{C}_{ver}$): The top-$K_{ver}$ high-probability tokens according to the target model’s distribution at this step, i.e., $\mathcal{C}_{ver} = \mathcal{T}_{K_{ver}}(z_{verify})$. $K_{ver}$ is a hyper-parameter controlling per-step exploration.

The candidate stream is updated by appending the sequence of these new indices:

$$\mathbf{S}_{new} = \mathbf{S}_{old} \oplus \text{tuple}(\mathcal{C}_{draft}) \oplus \text{tuple}(\mathcal{C}_{ver}). \tag{3}$$

Vocabulary Construction via Sliding Window. To ensure bounded memory usage and computational efficiency while retaining the most relevant history, we impose a strict size limit $W_{max}$ on the active vocabulary. The dynamic vocabulary index set $\mathcal{I}_{t+1}$ for the next step is formally defined as the set of unique tokens among the most recent $W_{max}$ entries in the stream:

$$\mathcal{I}_{t+1} = \text{Unique}(\text{Suffix}(\mathbf{S}_{new}, W_{max})), \tag{4}$$

where $\text{Suffix}(\cdot, W_{max})$ returns the last $W_{max}$ elements of the sequence, and $\text{Unique}(\cdot)$ extracts unique elements. This construction naturally implements a First-In-First-Out (FIFO) eviction policy based on observational recency, requiring zero parameter training.
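Equations (2) to (4) can be sketched in a few lines of pure Python. This is an illustrative reading of the formulas, not the authors' implementation: the function names, the toy 10-token vocabulary, and the ordering of elements inside each `tuple(...)` (sorted here) are assumptions, since the text does not specify an order for set elements.

```python
def top_k(logits, k):
    """Indices of the k largest logits (the operator T_K)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


def init_stream(prompt_tokens, prompt_logits, k_pre):
    """Eq. (2): prompt tokens, then the union of per-position top-k_pre candidates."""
    union = set()
    for z in prompt_logits:
        union.update(top_k(z, k_pre))
    return list(prompt_tokens) + sorted(union)  # ordering inside the union is assumed


def update_stream(stream, draft_tree_tokens, verify_logits, k_ver):
    """Eq. (3): append unique draft-tree tokens, then the target's top-k_ver tokens."""
    c_draft = sorted(set(draft_tree_tokens))
    c_ver = top_k(verify_logits, k_ver)
    return stream + c_draft + c_ver


def active_vocab(stream, w_max):
    """Eq. (4): unique elements among the last w_max stream entries (w_max >= 1)."""
    return set(stream[-w_max:])


# Toy example with a 10-token vocabulary (all values are made up).
prompt_tokens = [4, 7]
prompt_logits = [
    [0.1, 0.9, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
]
stream0 = init_stream(prompt_tokens, prompt_logits, k_pre=2)

verify_logits = [0.0, 0.0, 0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.9, 0.0]
stream1 = update_stream(stream0, [2, 9, 9, 3], verify_logits, k_ver=2)

vocab = active_vocab(stream1, w_max=6)
print(stream0)
print(stream1)
print(sorted(vocab))
```

With `w_max=6`, tokens 1 and 4 fall out of the window once newer candidates arrive, which is the FIFO eviction described above. A practical version would also truncate the stream to its last `w_max` entries after each update so that memory stays bounded.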

### 3.3 System: Hardware-Accelerated Dynamic Vocabulary Realization

While the dynamic vocabulary theoretically reduces FLOPs by a factor of roughly $V / |\mathcal{I}_t|$, realizing this speedup on modern GPUs is non-trivial. Implementing MicroSpec naively requires matrix computation on non-contiguous rows from $W_{head}$ corresponding to the dynamic indices $\mathcal{I}_t$. This random, sparse memory access pattern breaks GPU memory coalescing, resulting in severe bandwidth underutilization that often negates theoretical gains. To address this, we propose a hardware-aware implementation tailored to the specific requirements of our context-aware algorithm.
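As a rough worked figure for the factor $V / |\mathcal{I}_t|$, assume $V \approx 128{,}000$ (the approximate Llama-3 vocabulary size quoted earlier) and the window limit $W_{max} = 3072$ reported in the experimental setup. Since $|\mathcal{I}_t| \le W_{max}$, the LM-head FLOP reduction is at least

$$\frac{V}{|\mathcal{I}_t|} \;\ge\; \frac{V}{W_{max}} \;\approx\; \frac{128{,}000}{3072} \;\approx\; 41.7,$$

which is consistent with the "more than $40\times$" vocabulary reduction stated in the abstract. This ratio covers only the LM-head projection; it is not an end-to-end speedup, because the draft backbone, target verification, and gathering overheads are unchanged by it.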

Avoiding Sparse Computation via Asynchronous Gathering. Instead of performing inefficient sparse computations, our strategy is to transform the sparse data access into a dense computation: we maintain a pre-allocated buffer in GPU memory and, at each step, dynamically gather the weights of new active tokens from $W_{head}$ and pack them contiguously into the active buffer. This approach allows the projection with $W_{head}[\mathcal{I}_t, :]$ to be computed as a dense matrix multiplication.
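The gather-then-dense idea can be mimicked on the CPU with a small sketch. Everything here is a hypothetical illustration: the `ActiveBuffer` class, its slot-reuse policy, and the copy counter are assumptions introduced for clarity, and plain Python lists stand in for GPU memory. The point it shows is that only rows for newly active tokens need to be copied at each step.

```python
class ActiveBuffer:
    """Pre-allocated buffer holding the LM-head rows of the currently active tokens."""

    def __init__(self, w_head, capacity):
        self.w_head = w_head
        self.rows = [None] * capacity                  # fixed set of slots
        self.slot_of = {}                              # token id -> slot index
        self.free = list(range(capacity - 1, -1, -1))  # stack of free slots
        self.copies = 0                                # rows gathered so far

    def sync(self, active_ids):
        """Make the buffer match active_ids, copying rows only for new tokens.

        Assumes len(set(active_ids)) <= capacity.
        """
        active = set(active_ids)
        # Release slots of tokens that left the sliding window.
        for tok in [t for t in self.slot_of if t not in active]:
            slot = self.slot_of.pop(tok)
            self.rows[slot] = None
            self.free.append(slot)
        # Gather rows for tokens that just entered the window.
        for tok in sorted(active):
            if tok not in self.slot_of:
                slot = self.free.pop()
                self.rows[slot] = self.w_head[tok]
                self.slot_of[tok] = slot
                self.copies += 1

    def logits(self, h):
        """Projection over occupied slots; returns {token id: logit}."""
        return {
            tok: sum(w * x for w, x in zip(self.rows[slot], h))
            for tok, slot in self.slot_of.items()
        }


# Toy LM head: row i is [i, 1.0], so with h = [1, 0] the logit of token i is i.
w_head = [[float(i), 1.0] for i in range(10)]
buf = ActiveBuffer(w_head, capacity=4)

buf.sync([1, 2, 3])
first_copies = buf.copies   # three rows gathered
buf.sync([2, 3, 5])         # token 1 evicted, token 5 gathered
second_copies = buf.copies  # only one additional row copied
out = buf.logits([1.0, 0.0])
print(first_copies, second_copies, out)
```

In the described system this gather runs on a separate copy stream so that it overlaps with the draft backbone; the sketch deliberately leaves that scheduling out.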

To prevent $M_{draft}$ from blocking while it waits for packed weights, we further pipeline the copy stream with the execution of the model backbone on the compute stream. Specifically, as the indices $\mathcal{I}_{t+1}$ required for the next draft step are only known after the target model completes verification at step $t$ (generating $\mathcal{C}_{ver}$), we overlap the memory-intensive weight gathering for step $t+1$ with the compute-intensive draft backbone execution for step $t+1$. We utilize dual CUDA streams and fine-grained event synchronization:

- Compute Stream: executes the main neural network layers (backbone) of both $M_{draft}$ and $M_{target}$.
- Copy Stream: executes specialized kernels that gather required weights from global memory into a pre-allocated, contiguous dense buffer.

As soon as $\mathcal{I}_{t+1}$ is determined via the GPU-resident state update (described below), we launch the draft model’s backbone computation on the compute stream and simultaneously launch the weight gather kernel on the copy stream. A CUDA event is recorded after the gather operation. The compute stream is made to wait on this event only immediately before executing the draft’s LM head projection.

This pipeline design ensures that the gathering latency is effectively hidden behind the draft backbone computation. By the time the draft backbone finishes, the requisite weights ($W_{head}[\mathcal{I}_{t+1}, :]$) are contiguously packed in the buffer, so the LM head projection can be performed as a highly efficient dense matrix multiplication.

Eliminating Overhead via GPU-Resident State Management. The sliding window mechanism necessitates high-frequency updates to the active token set $\mathcal{I}_t$ at every generation step. Managing this dynamic state on the CPU involves prohibitive host-device synchronization overhead, which would introduce pipeline bubbles.

To eliminate this overhead, MicroSpec implements fully GPU-resident state management. The active sliding window states are maintained entirely in VRAM using bitmaps. We employ highly parallel, lock-free CUDA kernels to identify unique new candidates from draft trees and verification results, pushing them into the stream and updating the active index set directly on the device. This near-zero-overhead management ensures that the dynamic logic of our algorithm does not become a new system bottleneck.
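The role of the bitmap can be illustrated with a sequential CPU stand-in. The paper describes parallel, lock-free CUDA kernels operating on device memory; the `BitmapSet` class and `new_unique_candidates` helper below are hypothetical names used only to show how one bit per vocabulary entry gives constant-time membership tests and deduplication of incoming candidates.

```python
class BitmapSet:
    """Membership bitmap over token ids, one bit per vocabulary entry."""

    def __init__(self, vocab_size):
        self.bits = bytearray((vocab_size + 7) // 8)

    def contains(self, tok):
        return bool(self.bits[tok >> 3] & (1 << (tok & 7)))

    def add(self, tok):
        self.bits[tok >> 3] |= 1 << (tok & 7)

    def discard(self, tok):
        self.bits[tok >> 3] &= ~(1 << (tok & 7)) & 0xFF


def new_unique_candidates(bitmap, candidates):
    """Return candidates that are not yet active (deduplicated) and mark them active."""
    fresh = []
    for tok in candidates:
        if not bitmap.contains(tok):
            bitmap.add(tok)
            fresh.append(tok)
    return fresh


bm = BitmapSet(vocab_size=64)  # toy vocabulary size
first = new_unique_candidates(bm, [5, 9, 5, 63, 9])
second = new_unique_candidates(bm, [9, 10])
bm.discard(5)                  # e.g. token 5 slid out of the window
print(first, second, bm.contains(5), len(bm.bits))
```

For a vocabulary of about 128k tokens, such a bitmap occupies roughly 16 KB, which is small enough to keep resident on the device alongside the weights.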

## 4 Evaluation

In this section, we first introduce the experimental setup, including benchmarks, baselines and models\. Then, we present the comparison of end\-to\-end speedup, acceptance lengths and draft time across different scenarios\. Finally, we conduct an ablation study to provide measured insights into the source of our performance gains\. Hyper\-parameter sensitivity analysis is detailed in Appendix[A\.3](https://arxiv.org/html/2605.26444#A1.SS3)\.

### 4.1 Experimental Setup

Benchmarks. We employ the SpecBench benchmark (Xia et al., [2024](https://arxiv.org/html/2605.26444#bib.bib32)), which is specifically designed to evaluate speculative decoding methods across a diverse set of six tasks: machine translation (MT) (Bojar et al., [2014](https://arxiv.org/html/2605.26444#bib.bib33)), multi-turn conversation (Conv) (Zheng et al., [2023](https://arxiv.org/html/2605.26444#bib.bib34)), RAG and QA (RAG, QA) (Kwiatkowski et al., [2019](https://arxiv.org/html/2605.26444#bib.bib35)), mathematical reasoning (Math) (Cobbe et al., [2021](https://arxiv.org/html/2605.26444#bib.bib36)), and summarization (Summ) (Nallapati et al., [2016](https://arxiv.org/html/2605.26444#bib.bib37)). We further augment it with HumanEval (Chen, [2021](https://arxiv.org/html/2605.26444#bib.bib38)) for code generation (Code) tasks. The maximum generation length is set to 1024 tokens, and greedy decoding is used for all tasks.

Baselines. We compare MicroSpec against auto-regressive decoding and several strong baselines:

- EAGLE (Li et al., [2024](https://arxiv.org/html/2605.26444#bib.bib17)): The standard state-of-the-art speculative decoding method EAGLE-2 without vocabulary pruning, computing the full LM head.
- FR-Spec (Zhao et al., [2025](https://arxiv.org/html/2605.26444#bib.bib12)): A static pruning method that selects a high-frequency token subset based on corpus statistics for LM head computation.
- CORAL (Weng et al., [2025](https://arxiv.org/html/2605.26444#bib.bib14)): A trained method that uses a feed-forward network (FFN) as a router to decide sub-vocabularies for each draft token. The sub-vocabularies are partitioned offline and kept static during inference.
- DynaSpec (Zhang et al., [2025](https://arxiv.org/html/2605.26444#bib.bib13)): A trained extension of FR-Spec that utilizes KMeans to cluster vocabularies offline and trains a router to select active clusters.

Implementation. Our system is implemented on top of the open-source FR-Spec framework, with custom modifications to its Python code and CUDA kernels to realize our proposed techniques.

Models, Hyper-Parameters and Hardware. Our evaluation includes three popular LLMs: Llama-3-8B-Instruct, Llama-3.2-1B-Instruct and Qwen-2-7B-Instruct. All methods use EAGLE-2 (Li et al., [2024](https://arxiv.org/html/2605.26444#bib.bib17)) with its officially released checkpoints as the draft model for speculative decoding. The hyper-parameters of EAGLE-2 are set identically across all baselines. Specifically, we use a draft tree depth of 5 and a maximum of 60 draft tokens. The pruned vocabulary size of FR-Spec is set to 32k, which is the optimal setting reported in its paper. For the hyper-parameters of MicroSpec, we set $K_{pre}$ and $K_{ver}$ to 3 and the window size limit $W_{max}$ to 3072. All experiments are conducted on a single NVIDIA H20Z GPU.

Table 2: Comparison of end\-to\-end generation speed \(tokens/s\) and acceptance length across tasks and models\. The baseline is standard auto\-regressive decoding \(AR\)\. Avg\. Speed is the mean across tasks, with relative speedup over AR in parentheses\. Avg\. Len\. denotes the mean acceptance length per step\. The best results for each model are bolded\. Avg\. Len\. of CORAL and DynaSpec for the 1B model is left blank as it is not reported in their papers\.

| Model | Method | MT | Conv | RAG | Math | QA | Summ | Code | Avg\. Speed | Avg\. Len\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama\-3\-8B\-Instruct | AR | 175\.7 | 178\.5 | 165\.0 | 178\.2 | 174\.2 | 174\.3 | 162\.3 | 172\.6 \(1\.00×\) | 1\.00 |
| Llama\-3\-8B\-Instruct | EAGLE | 311\.2 | 380\.2 | 300\.2 | 403\.2 | 316\.4 | 315\.0 | 332\.0 | 336\.9 \(1\.93×\) | 3\.80 |
| Llama\-3\-8B\-Instruct | CORAL | 329\.9 | 403\.0 | 318\.2 | 427\.4 | 335\.4 | 333\.9 | 371\.8 | 359\.9 \(2\.06×\) | 3\.92 |
| Llama\-3\-8B\-Instruct | FR\-Spec | 338\.4 | 419\.3 | 330\.2 | 456\.5 | 351\.0 | 356\.0 | 336\.3 | 369\.7 \(2\.12×\) | 3\.62 |
| Llama\-3\-8B\-Instruct | DynaSpec | 344\.2 | 429\.7 | 333\.0 | 453\.2 | 350\.2 | 353\.3 | 382\.9 | 378\.1 \(2\.17×\) | 3\.78 |
| Llama\-3\-8B\-Instruct | MicroSpec | 349\.5 | 440\.2 | 352\.1 | 480\.0 | 361\.3 | 385\.1 | 381\.1 | 392\.7 \(2\.25×\) | 3\.59 |
| Llama\-3\.2\-1B\-Instruct | AR | 675\.6 | 677\.2 | 624\.7 | 677\.9 | 672\.2 | 661\.7 | 671\.3 | 665\.8 \(1\.00×\) | 1\.00 |
| Llama\-3\.2\-1B\-Instruct | EAGLE | 657\.0 | 777\.3 | 653\.8 | 812\.3 | 644\.4 | 608\.3 | 753\.4 | 700\.9 \(1\.05×\) | 3\.16 |
| Llama\-3\.2\-1B\-Instruct | CORAL | 696\.4 | 823\.9 | 693\.1 | 861\.1 | 683\.0 | 644\.8 | 843\.7 | 749\.4 \(1\.12×\) | \- |
| Llama\-3\.2\-1B\-Instruct | FR\-Spec | 751\.2 | 844\.7 | 715\.2 | 913\.5 | 718\.5 | 672\.8 | 835\.9 | 778\.8 \(1\.17×\) | 2\.70 |
| Llama\-3\.2\-1B\-Instruct | DynaSpec | 763\.9 | 865\.7 | 721\.2 | 907\.0 | 717\.0 | 667\.7 | 951\.9 | 799\.2 \(1\.20×\) | \- |
| Llama\-3\.2\-1B\-Instruct | MicroSpec | 850\.6 | 1008\.2 | 830\.4 | 1069\.3 | 822\.0 | 796\.2 | 928\.5 | 900\.7 \(1\.35×\) | 3\.11 |

### 4\.2 End\-to\-End Speedup

![Refer to caption](https://arxiv.org/html/2605.26444v1/x4.png)

Figure 3: Performance comparison across tasks on Llama\-3\-8B\-Instruct\. Despite having shorter acceptance lengths, MicroSpec consistently outperforms all baselines in terms of both end\-to\-end speed and decoding speed\. As CORAL does not report acceptance lengths for individual tasks, this is left blank in subfigure \(c\)\.

We first evaluate the end\-to\-end generation throughput \(including both prefill and decoding, measured in tokens/s\) across diverse tasks and models\. The results are summarized in Table [2](https://arxiv.org/html/2605.26444#S4.T2) and Figure [3](https://arxiv.org/html/2605.26444#S4.F3)\. Results for Qwen2 are detailed in Appendix [A\.2](https://arxiv.org/html/2605.26444#A1.SS2)\.

On the Llama\-3\-8B\-Instruct model, MicroSpec achieves the highest throughput on nearly all evaluated tasks, reaching an average speed of 392\.7 tokens/s\. This corresponds to a 2\.25× speedup over the auto\-regressive baseline \(172\.6 tokens/s\), surpassing the strongest pruning methods, including the trained DynaSpec \(2\.17×\) and the static FR\-Spec \(2\.12×\)\. The efficiency advantage of our approach is even more pronounced on the smaller, faster Llama\-3\.2\-1B\-Instruct model, because on such fast base models the drafting overhead becomes the primary bottleneck\. Standard EAGLE\-2, hindered by full\-vocabulary computation, offers only a negligible gain over AR \(1\.05×\)\. In sharp contrast, thanks to its minimized drafting overhead and high acceptance length \(discussed in Section [4\.3](https://arxiv.org/html/2605.26444#S4.SS3)\), MicroSpec delivers a 1\.35× average speedup, substantially outperforming both static pruning \(FR\-Spec: 1\.17×\) and complex trained methods \(CORAL: 1\.12×; DynaSpec: 1\.20×\)\.

These results demonstrate that MicroSpec achieves a state\-of\-the\-art trade\-off between drafting accuracy and computational efficiency, delivering consistent performance gains over existing frameworks without requiring any additional training\.

### 4\.3 Average Acceptance Length

To investigate the quality of the speculative drafts generated by different pruning strategies, we analyze the average acceptance length, which quantifies the mean number of draft tokens verified as correct by the target model per step\. This metric serves as a crucial assessment of drafting accuracy without the influence of hardware\.

As shown in Table [2](https://arxiv.org/html/2605.26444#S4.T2), the full\-vocabulary EAGLE baseline achieves high acceptance lengths of 3\.80 and 3\.16 on the 8B and 1B models, respectively\. Static pruning methods such as FR\-Spec suffer a noticeable drop in drafting quality, with average lengths of 3\.62 and 2\.70, because their fixed top\-k selection excludes necessary task\-specific tokens\. Notably, MicroSpec achieves average acceptance lengths of 3\.59 and 3\.11, which is competitive with the full\-vocabulary EAGLE baseline \(3\.80 and 3\.16\); it significantly outperforms the static FR\-Spec on the 1B model \(3\.11 vs\. 2\.70\) and roughly matches it on the 8B model \(3\.59 vs\. 3\.62\)\. This indicates that our training\-free dynamic mechanism effectively identifies and includes vital tokens that static methods miss, maintaining high drafting quality without the architectural complexity or extra training required by router\-based methods\. This high acceptance length provides a strong foundation for realizing end\-to\-end speedups\.

### 4\.4 Draft Time Analysis

![Refer to caption](https://arxiv.org/html/2605.26444v1/x5.png)

Figure 4: Draft time comparison\. MicroSpec effectively reduces draft overhead with a smaller vocabulary\. The error bars indicate the minimum and maximum of average draft time\.

For a deeper understanding of the efficiency gains, we investigate the computational overhead during the drafting phase\. Figure [4](https://arxiv.org/html/2605.26444#S4.F4) presents a comparison of the total drafting time across various tasks on Llama\-3\-8B\-Instruct\.

As observed, EAGLE suffers from significant latency because it must project the hidden states onto the full vocabulary at every drafting step\. While FR\-Spec mitigates this by statically pruning the vocabulary to the top\-$K$ high\-frequency tokens, it still incurs noticeable overhead because it requires a large $K$ \(e\.g\., 32k\) to capture long\-tail tokens\. MicroSpec achieves the lowest overhead, demonstrating the effectiveness of both our pruning algorithm and our optimized system implementation\. Specifically, MicroSpec reduces drafting time by approximately 51\.6% on average compared to EAGLE and by 20\.3% compared to the already pruned FR\-Spec baseline, which translates directly into the end\-to\-end speedups reported earlier\.
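To make the scale of the LM\-head saving concrete, the following back\-of\-the\-envelope sketch counts the multiply\-add work of one hidden\-state\-by\-LM\-head product for three vocabulary sizes\. The hidden size of 4096, the exact full vocabulary of 128,256 tokens, and the reading of "32k" as 32,768 are assumptions made for illustration; the helper name `lm_head_flops` is hypothetical\.

```python
# Rough cost model: one LM-head projection costs 2 * hidden * vocab FLOPs.
# Assumed values (not taken from the paper): hidden size 4096, full vocabulary
# of 128,256 tokens, and "32k" interpreted as 32,768.
HIDDEN = 4096
VOCABS = {
    "full (EAGLE)": 128_256,
    "FR-Spec (32k)": 32_768,
    "MicroSpec (W_max)": 3_072,
}


def lm_head_flops(hidden: int, vocab: int) -> int:
    """FLOPs for projecting one hidden state onto `vocab` output logits."""
    return 2 * hidden * vocab


full = lm_head_flops(HIDDEN, VOCABS["full (EAGLE)"])
for name, vocab in VOCABS.items():
    flops = lm_head_flops(HIDDEN, vocab)
    print(f"{name:>18}: {flops / 1e6:7.1f} MFLOPs per position, full/this = {full / flops:5.2f}")
```

Under these assumptions the projection shrinks by a factor of about 42 at the window limit\. The measured draft\-time reduction \(51\.6%\) is smaller than this ratio, presumably because the LM head is only one component of a draft step\.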

### 4\.5 Ablation Study

Table 3: Ablation study on average end\-to\-end speed across all SpecBench tasks using Llama\-3\-8B\-Instruct\.

| Vocab\. Source | Async\. Gather | Avg\. Speed |
| --- | --- | --- |
| Ctx\. Only | ✓ | 313\.3 |
| Ext\. Only | ✓ | 351\.4 |
| Ctx\. \+ Ext\. | × \(Indexed GEMM\) | 364\.1 |
| Ctx\. \+ Ext\. | ✓ | 394\.6 |

To examine the individual contributions of our proposed components, we conduct an ablation study on the Llama\-3\-8B\-Instruct model\. We isolate the effects of our dynamic vocabulary strategy and asynchronous optimization by comparing four experimental configurations:

- Ctx\. Only: The active vocabulary $\mathcal{I}$ is restricted to tokens present in the input prompt and high\-probability candidates from the prefill phase \($\mathbf{S}_{init}$\)\. No new candidates are added during decoding steps\.
- Ext\. Only: The vocabulary is composed solely of tokens from the draft tree and top candidates from the target model's latest verification \($\mathcal{C}_{draft} \cup \mathcal{C}_{ver}$\), ignoring the prompt context\.
- Ctx\. \+ Ext\. \(× Async\.\): Our full dynamic vocabulary strategy is used while the system relies on a naive Indexed GEMM kernel for sparse computation, without our asynchronous gathering optimization\.
- Ctx\. \+ Ext\. \(✓ Async\.\): Our complete method, employing both the full dynamic vocabulary and the system\-level optimization\.

The results summarized in Table [3](https://arxiv.org/html/2605.26444#S4.T3) show that relying solely on prompt\-derived tokens \(Ctx\. Only\) yields the lowest speed, 313\.3 tokens/s\. Switching to Ext\. Only improves speed to 351\.4 tokens/s, showing the importance of a dynamic vocabulary\. Combining both sources \(Ctx\. \+ Ext\.\) is better still, reaching 364\.1 tokens/s even with the naive kernel, which confirms that capturing both global logic from the prompt and local continuity for drafting is essential\. The results also show that the naive implementation of the dynamic vocabulary \(× Async\.\) is limited by slow, non\-contiguous memory access in indexed GEMM\. Applying our asynchronous gathering \(✓ Async\.\) transforms sparse access into efficient dense computation and hides data transfer latency, boosting speed from 364\.1 to 392\.7 tokens/s\. This highlights the necessity of hardware\-aware system design to realize the theoretical benefits of a dynamic vocabulary on GPUs\.
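The difference between the two kernels in the last two ablation rows can be illustrated on the CPU with NumPy\. This is only a minimal sketch with toy sizes: it shows that gathering the active rows into a contiguous buffer and then running a dense product yields the same logits as looking rows up at compute time\. It does not model CUDA streams or the paper's actual kernels, and the function names are hypothetical\.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab, active = 64, 1000, 48                # toy sizes, not the paper's
W_head = rng.standard_normal((vocab, hidden))       # draft LM head, one row per token
h_draft = rng.standard_normal((5, hidden))          # hidden states of 5 draft positions
idx = np.sort(rng.choice(vocab, size=active, replace=False))  # active vocabulary I


def indexed_logits(h, W, idx):
    """Naive path: fetch each active row at compute time (scattered reads)."""
    out = np.empty((h.shape[0], idx.size))
    for col, token_id in enumerate(idx):
        out[:, col] = h @ W[token_id]
    return out


def gathered_logits(h, W_pruned):
    """Gathered path: rows are already contiguous, so this is a plain dense GEMM."""
    return h @ W_pruned.T


# Gather once per step; in the paper this copy runs asynchronously on a second stream.
W_pruned = np.ascontiguousarray(W_head[idx])
pruned = gathered_logits(h_draft, W_pruned)
assert np.allclose(pruned, indexed_logits(h_draft, W_head, idx))

# Positions in the pruned logits must be mapped back to real token ids.
draft_tokens = idx[pruned.argmax(axis=1)]
```

The final line matters in practice: the argmax over pruned logits is a position inside the active set, so it has to be translated through `idx` before the token is appended to the draft tree\.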

## 5 Related Work

#### Speculative Decoding\.

Speculative decoding \(Miao et al\., [2024](https://arxiv.org/html/2605.26444#bib.bib24); Chen et al\., [2023](https://arxiv.org/html/2605.26444#bib.bib25); Leviathan et al\., [2023](https://arxiv.org/html/2605.26444#bib.bib23)\) has emerged as a prominent technique for accelerating auto\-regressive LLM inference\. It employs a smaller, faster draft model to tentatively generate draft sequences, which are then verified in parallel by the larger target model\. EAGLE \(Li et al\., [2024](https://arxiv.org/html/2605.26444#bib.bib17)\) uses a single Transformer layer as the draft model, while Gumiho \(Li et al\., [2025a](https://arxiv.org/html/2605.26444#bib.bib19)\) explores combining a Transformer layer with MLP layers\. RSD \(Liao et al\., [2025](https://arxiv.org/html/2605.26444#bib.bib21)\) invokes verification dynamically through a reward model\. Recent work \(Timor et al\., [2025](https://arxiv.org/html/2605.26444#bib.bib22)\) also proposes augmenting the vocabulary of the target model for better performance\. Our optimization is complementary to these works, as they all rely on the LM head to compute logits and become less efficient as the LLM vocabulary grows\.

#### Vocabulary Pruning\.

Beyond speculative decoding, vocabulary pruning is employed in other LLM domains for different objectives, such as early exiting \(Vincenti et al\., [2024](https://arxiv.org/html/2605.26444#bib.bib26); Xu et al\., [2025](https://arxiv.org/html/2605.26444#bib.bib27)\) and reinforcement learning \(Li et al\., [2025b](https://arxiv.org/html/2605.26444#bib.bib28); Hashimoto and Tsuruoka, [2019](https://arxiv.org/html/2605.26444#bib.bib29)\) to improve efficiency\. Other work proposes pruning at inference time to reduce memory consumption \(Keith et al\., [2024](https://arxiv.org/html/2605.26444#bib.bib30); Nozaki et al\., [2025](https://arxiv.org/html/2605.26444#bib.bib31)\)\. A critical distinction is that these methods apply pruning that directly affects the final output or training objective, often incurring a trade\-off in accuracy\. In contrast, our work applies pruning solely to the draft model: the final output is still verified by the target model within the speculative decoding framework, which guarantees lossless outputs identical to standard auto\-regressive decoding\.
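The losslessness argument can be sketched for the greedy setting used in the experiments\. The function below is a simplified illustration that verifies a single draft chain rather than a draft tree; `greedy_verify` and its arguments are hypothetical names, not part of any released implementation\. It shows that a token missing from the pruned draft vocabulary only shortens the accepted prefix, while the emitted tokens remain the target model's own greedy choices\.

```python
def greedy_verify(draft_tokens, target_argmax):
    """Return the tokens emitted in one verification step under greedy decoding.

    draft_tokens:  k tokens proposed by the draft model (a single chain).
    target_argmax: k + 1 greedy choices of the target model; entry i is its
                   choice given the context followed by draft_tokens[:i].
    """
    emitted = []
    for i, proposed in enumerate(draft_tokens):
        if proposed != target_argmax[i]:
            break  # first mismatch: the rest of the draft is discarded
        emitted.append(proposed)
    # Always append the target's own token: the correction at the first
    # mismatch, or the bonus token if every draft token was accepted.
    emitted.append(target_argmax[len(emitted)])
    return emitted


target = [11, 12, 13, 14, 15]
full_vocab_draft = [11, 12, 13, 14]   # every target token was draftable
pruned_draft = [11, 12, 99, 14]       # token 13 was missing from the active set

print(greedy_verify(full_vocab_draft, target))
print(greedy_verify(pruned_draft, target))
```

In both calls the emitted tokens form a prefix of the target model's greedy sequence; the pruned draft simply advances fewer tokens in that step\.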

## 6 Conclusion

In this paper, we challenged the prevailing paradigm of using static vocabularies or complex trained routers for vocabulary pruning in speculative decoding\. We proposed MicroSpec, a training\-free dynamic method with system\-level co\-design that breaks the long\-standing trade\-off between vocabulary size and acceptance rate\. The empirical evaluation demonstrates that MicroSpec achieves state\-of\-the\-art end\-to\-end speedups with a vocabulary of fewer than 3k tokens\.

## Impact Statement

This work aims to improve the inference efficiency of large language models, particularly focusing on reducing the bottleneck caused by large vocabularies\. The primary potential benefits of our approach lie in improving computational resource utilization and promoting accessibility\. By achieving significant speedups without requiring additional training or complex auxiliary models, our method lowers the barrier toward deploying capable LLMs in resource\-constrained environments\. Furthermore, increased inference efficiency can contribute to reduced energy consumption per generated token, supporting more environmentally sustainable AI practices\. While faster text generation inherently applies to both beneficial and potentially harmful use cases of LLMs, the focus of this contribution is strictly on infrastructural efficiency optimization\.

## References

- O\. Bojar, C\. Buck, C\. Federmann, B\. Haddow, P\. Koehn, J\. Leveling, C\. Monz, P\. Pecina, M\. Post, H\. Saint\-Amand,et al\.\(2014\)Findings of the 2014 workshop on statistical machine translation\.InProceedings of the ninth workshop on statistical machine translation,pp\. 12–58\.Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- T\. Cai, Y\. Li, Z\. Geng, H\. Peng, J\. D\. Lee, D\. Chen, and T\. Dao \(2024\)MEDUSA: simple llm inference acceleration framework with multiple decoding heads\.InProceedings of the 41st International Conference on Machine Learning \(ICML’24\),Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7)\.
- C\. Chen, S\. Borgeaud, G\. Irving, J\. Lespiau, L\. Sifre, and J\. Jumper \(2023\)Accelerating large language model decoding with speculative sampling\.arXiv preprint arXiv:2302\.01318\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7),[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- M\. Chen \(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- Z\. Chen, D\. Xu, H\. Shen, M\. Xu, S\. Wang, and Y\. Ma \(2025\)Accelerating mobile language model via speculative decoding and npu\-coordinated execution\.arXiv preprint arXiv:2510\.15312\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p4.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- R\. Goel, S\. Agrawal, M\. Gagrani, J\. Park, Y\. Zao, H\. Zhang, T\. Liu, Y\. Yang, X\. Yuan, J\. Lu,et al\.\(2025\)VOCABTRIM: vocabulary pruning for efficient speculative decoding in llms\.arXiv preprint arXiv:2506\.22694\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p2.1)\.
- X\. Han, Z\. Zhang, N\. Ding, Y\. Gu, X\. Liu, Y\. Huo, J\. Qiu, Y\. Yao, A\. Zhang, L\. Zhang,et al\.\(2021\)Pre\-trained models: past, present and future\.Ai Open2,pp\. 225–250\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p1.1)\.
- K\. Hashimoto and Y\. Tsuruoka \(2019\)Accelerated reinforcement learning for sentence generation by vocabulary prediction\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\),pp\. 3115–3125\.Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px2.p1.1)\.
- C\. Keith, M\. Robinson, F\. Duncan, A\. Worthington, J\. Wilson, and S\. Harris \(2024\)Optimizing large language models: a novel approach through dynamic token pruning\.Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px2.p1.1)\.
- T\. Kwiatkowski, J\. Palomaki, O\. Redfield, M\. Collins, A\. Parikh, C\. Alberti, D\. Epstein, I\. Polosukhin, J\. Devlin, K\. Lee,et al\.\(2019\)Natural questions: a benchmark for question answering research\.Transactions of the Association for Computational Linguistics7,pp\. 453–466\.Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- Y\. Leviathan, M\. Kalman, and Y\. Matias \(2023\)Fast inference from transformers via speculative decoding\.InProceedings of the 40th International Conference on Machine Learning \(ICML’23\),Vol\.202,pp\. 19274–19286\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p3.5),[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Li, Y\. Xu, H\. Huang, X\. Yin, D\. Li, E\. C\. Ngai, and E\. Barsoum \(2025a\)Gumiho: a hybrid architecture to prioritize early tokens in speculative decoding\.InProceedings of the 42nd International Conference on Machine Learning \(ICML’25\),Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, J\. Xu, J\. Liu, Y\. Tong, Z\. Li, T\. Cai, G\. Zhang, Q\. Liu, and B\. Wang \(2025b\)Taming the tail: stable llm reinforcement learning via dynamic vocabulary pruning\.arXiv preprint arXiv:2512\.23087\.Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px2.p1.1)\.
- Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024\)Eagle\-2: faster inference of language models with dynamic draft trees\.arXiv preprint arXiv:2406\.16858\.Cited by:[Table 1](https://arxiv.org/html/2605.26444#S1.T1.1.3.1),[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7),[1st item](https://arxiv.org/html/2605.26444#S4.I1.i1.p1.1),[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p4.3),[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- B\. Liao, Y\. Xu, H\. Dong, J\. Li, C\. Monz, S\. Savarese, D\. Sahoo, and C\. Xiong \(2025\)Reward\-guided speculative decoding for efficient llm reasoning\.InProceedings of the 42nd International Conference on Machine Learning \(ICML’25\),Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- X\. Miao, G\. Oliaro, Z\. Zhang, X\. Cheng, Z\. Wang, Z\. Zhang, R\. Y\. Y\. Wong, A\. Zhu, L\. Yang, X\. Shi,et al\.\(2024\)Specinfer: accelerating large language model serving with tree\-based speculative inference and verification\.InProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems \(ASPLOS\),pp\. 932–949\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7),[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- R\. Nallapati, B\. Zhou, C\. Dos Santos, Ç\. Gulçehre, and B\. Xiang \(2016\)Abstractive text summarization using sequence\-to\-sequence rnns and beyond\.InProceedings of the 20th SIGNLL conference on computational natural language learning,pp\. 280–290\.Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- H\. Naveed, A\. U\. Khan, S\. Qiu, M\. Saqib, S\. Anwar, M\. Usman, N\. Akhtar, N\. Barnes, and A\. Mian \(2025\)A comprehensive overview of large language models\.ACM Transactions on Intelligent Systems and Technology16\(5\),pp\. 1–72\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p1.1)\.
- Y\. Nozaki, D\. Nakashima, R\. Sato, N\. Asaba, and S\. Kawamura \(2025\)Efficient vocabulary reduction for small language models\.InProceedings of the 31st International Conference on Computational Linguistics: Industry Track,pp\. 771–783\.Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px2.p1.1)\.
- A\. Saxena \(2023\)External Links:[Link](https://github.com/apoorvumang/prompt-lookup-decoding/)Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p4.1),[§2\.2](https://arxiv.org/html/2605.26444#S2.SS2.p1.1)\.
- S\. Takase, R\. Ri, S\. Kiyono, and T\. Kato \(2025\)Large vocabulary size improves large language models\.InFindings of the Association for Computational Linguistics \(ACL Findings’25\),pp\. 1015–1026\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p1.1)\.
- C\. Tao, Q\. Liu, L\. Dou, N\. Muennighoff, Z\. Wan, P\. Luo, M\. Lin, and N\. Wong \(2024\)Scaling laws with vocabulary: larger models deserve larger vocabularies\.Advances in Neural Information Processing Systems37,pp\. 114147–114179\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p1.1)\.
- N\. Timor, J\. Mamou, D\. Korat, M\. Berchansky, G\. Jain, O\. Pereg, M\. Wasserblat, and D\. Harel \(2025\)Accelerating LLM inference with lossless speculative decoding algorithms for heterogeneous vocabularies\.InProceedings of the 42nd International Conference on Machine Learning \(ICML’25\),Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Vincenti, K\. A\. Sadek, J\. Velja, M\. Nulli, and M\. Jazbec \(2024\)Dynamic vocabulary pruning in early\-exit llms\.arXiv preprint arXiv:2410\.18952\.Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px2.p1.1)\.
- H\. Wang, Y\. Li, H\. Xu, Y\. Wang, L\. Liu, J\. Yang, and Y\. Han \(2025\)LAD: efficient accelerator for generative inference of llm with locality aware decoding\.In2025 IEEE International Symposium on High Performance Computer Architecture \(HPCA\),pp\. 1482–1495\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p4.1),[§2\.2](https://arxiv.org/html/2605.26444#S2.SS2.p1.1)\.
- Y\. Weng, D\. Mei, H\. Qiu, X\. Chen, L\. Liu, J\. Tian, and Z\. Shi \(2025\)CORAL: learning consistent representations across multi\-step training with lighter speculative drafter\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL’25\),pp\. 5580–5593\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.278)Cited by:[Table 1](https://arxiv.org/html/2605.26444#S1.T1.1.5.1),[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§1](https://arxiv.org/html/2605.26444#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7),[3rd item](https://arxiv.org/html/2605.26444#S4.I1.i3.p1.1)\.
- H\. Xia, Z\. Yang, Q\. Dong, P\. Wang, Y\. Li, T\. Ge, T\. Liu, W\. Li, and Z\. Sui \(2024\)Unlocking efficiency in large language model inference: a comprehensive survey of speculative decoding\.InFindings of the Association for Computational Linguistics \(ACL Findings’24\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),pp\. 7655–7671\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.456)Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- J\. Xu, J\. Pan, Y\. Zhou, S\. Chen, J\. Li, Y\. Lian, J\. Wu, and G\. Dai \(2025\)Specee: accelerating large language model inference with speculative early exiting\.InProceedings of the 52nd Annual International Symposium on Computer Architecture,pp\. 467–481\.Cited by:[§5](https://arxiv.org/html/2605.26444#S5.SS0.SSS0.Px2.p1.1)\.
- J\. Zhang, N\. Ullah, E\. Schultheis, and R\. Babbar \(2025\)DynaSpec: context\-aware dynamic speculative sampling for large\-vocabulary language models\.arXiv preprint arXiv:2510\.13847\.Cited by:[Table 1](https://arxiv.org/html/2605.26444#S1.T1.1.6.1),[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§1](https://arxiv.org/html/2605.26444#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p2.1),[4th item](https://arxiv.org/html/2605.26444#S4.I1.i4.p1.1)\.
- W\. Zhao, T\. Pan, X\. Han, Y\. Zhang, S\. Ao, Y\. Huang, K\. Zhang, W\. Zhao, Y\. Li, J\. Zhou,et al\.\(2025\)Fr\-spec: accelerating large\-vocabulary language models via frequency\-ranked speculative sampling\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(ACL’25\),pp\. 3909–3921\.Cited by:[Table 1](https://arxiv.org/html/2605.26444#S1.T1.1.4.1),[§1](https://arxiv.org/html/2605.26444#S1.p2.1),[§1](https://arxiv.org/html/2605.26444#S1.p3.1),[2nd item](https://arxiv.org/html/2605.26444#S2.I1.i2.p1.1),[1st item](https://arxiv.org/html/2605.26444#S2.I2.i1.p1.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2605.26444#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.26444#S3.SS1.p2.7),[2nd item](https://arxiv.org/html/2605.26444#S4.I1.i2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing,et al\.\(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.Advances in neural information processing systems36,pp\. 46595–46623\.Cited by:[§4\.1](https://arxiv.org/html/2605.26444#S4.SS1.p1.1)\.
- C\. Zhou, Q\. Li, C\. Li, J\. Yu, Y\. Liu, G\. Wang, K\. Zhang, C\. Ji, Q\. Yan, L\. He,et al\.\(2025\)A comprehensive survey on pretrained foundation models: a history from bert to chatgpt\.International Journal of Machine Learning and Cybernetics16\(12\),pp\. 9851–9915\.Cited by:[§1](https://arxiv.org/html/2605.26444#S1.p1.1)\.

## Appendix A

### A\.1 Pseudo\-Code of MicroSpec

Algorithm [1](https://arxiv.org/html/2605.26444#alg1) provides the complete pseudo\-code for MicroSpec, illustrating the control logic of speculative decoding and the asynchronous streams for computation and memory gathering\.

Algorithm 1 MicroSpec: Speculative Decoding with Dynamic Minimalist In\-Context Vocabulary

1. Input: target model $M_{target}$, draft model $M_{draft}$, prompt $x_{1:L}$, max steps $T$, max draft tokens $\gamma$\.
2. Output: prediction $x_{L:T}$
3. Hyper\-parameters: prefill top\-K $K_{pre}$, verify top\-K $K_{ver}$, window size $W_{max}$\.
4. $h_{target} \leftarrow M_{target}.\text{forward}(x_{1:L})$
5. $Z_{1:L} \leftarrow M_{target}.\text{lm\_head}(h_{target})$
6. $\mathbf{S} \leftarrow (x_{1:L}) \oplus \bigcup_{i=1}^{L} \operatorname{TopK}(Z_i, K_{pre})$ \{Init Candidate Stream\}
7. $\mathcal{I} \leftarrow \operatorname{Unique}(\operatorname{Suffix}(\mathbf{S}, W_{max}))$ \{Init Dynamic Vocab Indices\}
8. $W_{pruned} \leftarrow \text{Gather\_Async}(M_{draft}.W_{head}, \mathcal{I})$ \{Stream 2: Copy pruned LM Head weights\}
9. for $t = L$ to $T$ do
10. while $j < \gamma$ \(starting from $j = 0$\) do
11. $h_{draft} \leftarrow M_{draft}.\text{forward}(x_{<t+j})$ \{Stream 1: Compute\}
12. Sync stream 1 \(compute\) with stream 2 \(copy\)
13. $z'_{draft} \leftarrow \text{GEMM}(h_{draft}, W_{pruned}^{T})$ \{Stream 1: Dense GEMM on pruned LM Head\}
14. $x_{draft}, \text{num\_draft} \leftarrow \text{SelectDraftTokens}(z'_{draft}, \mathcal{I})$ \{Drafts are sampled from pruned vocabulary\}
15. Append $x_{draft}$ to draft tree
16. $j \leftarrow j + \text{num\_draft}$
17. end while
18. $h_{target} \leftarrow M_{target}.\text{forward}(x_{t:t+\gamma})$
19. $z_{target} \leftarrow \text{GEMM}(h_{target}, W_{head}^{T})$ \{Stream 1: GEMM on full LM Head\}
20. $x_{new}, \text{num\_accept} \leftarrow \text{VerifyDraftTokens}(x_{t:t+\gamma}, z_{target})$ \{Drafts are verified using full vocabulary\}
21. Append $x_{new}$ to context $x$
22. $t \leftarrow t + \text{num\_accept} + 1$
23. if predict end token then
24. break
25. end if
26. $\mathcal{C}_{draft} \leftarrow \operatorname{Unique}(\text{TokensInDraftTree}())$
27. $\mathcal{C}_{ver} \leftarrow \operatorname{TopK}(z_{target}, K_{ver})$
28. $\mathbf{S} \leftarrow \mathbf{S} \oplus \mathcal{C}_{draft} \oplus \mathcal{C}_{ver}$ \{Update stream $\mathbf{S}$ and $\mathcal{I}$ on device\}
29. $\mathcal{I}_{next} \leftarrow \operatorname{Unique}(\operatorname{Suffix}(\mathbf{S}, W_{max}))$
30. $W_{pruned} \leftarrow \text{Gather\_Async}(M_{draft}.W_{head}, \mathcal{I}_{next})$ \{Stream 2: Copy\}
31. end for
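The vocabulary bookkeeping in Algorithm 1 \(the candidate stream $\mathbf{S}$ and the index set $\mathcal{I}$\) can be sketched in plain Python\. This is a host\-side illustration only, assuming token ids are integers and that the top\-K candidates are passed in as ready\-made lists; the paper keeps this state on the GPU, and the function names here are hypothetical\.

```python
def init_stream(prompt_ids, prefill_topk):
    """S = prompt tokens followed by the top-K_pre candidates of each prompt position."""
    stream = list(prompt_ids)
    for candidates in prefill_topk:
        stream.extend(candidates)
    return stream


def update_stream(stream, draft_tree_tokens, verify_topk):
    """S = S (+) C_draft (+) C_ver, appended after each verification step."""
    c_draft = list(dict.fromkeys(draft_tree_tokens))  # Unique(), keeping first-seen order
    return stream + c_draft + list(verify_topk)


def active_indices(stream, w_max):
    """I = Unique(Suffix(S, W_max)): distinct ids among the last w_max stream entries."""
    return sorted(set(stream[-w_max:]))


# Toy run with a window of 4 entries (the paper's default is W_max = 3072).
stream = init_stream([5, 8, 5], [[8, 2], [5, 9], [3, 8]])
print(active_indices(stream, 4))

stream = update_stream(stream, draft_tree_tokens=[7, 7, 4], verify_topk=[4, 1])
print(active_indices(stream, 4))
```

Because the index set is taken from a fixed\-length suffix, its size never exceeds the window, and older candidates drop out automatically as new draft and verification tokens are appended\.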

### A\.2 Robustness Analysis on LLM with Larger Vocabulary

To investigate the robustness of MicroSpec when producing text from a larger vocabulary space, we analyze its performance on the Qwen\-2\-7B\-Instruct model, which has a vocabulary size of approximately 152k \(compared to 128k for Llama\-3\)\. The hyper\-parameters of MicroSpec are the same as those in Section [4\.1](https://arxiv.org/html/2605.26444#S4.SS1)\. The results of DynaSpec and CORAL are taken from the numbers reported in their papers, as their implementations and trained router models are not available\. As CORAL does not report acceptance lengths for individual tasks, the corresponding entries are left blank\.

Table [4](https://arxiv.org/html/2605.26444#A1.T4) reports the average acceptance length across diverse tasks, which directly reflects the quality of the draft model's generation and the coverage of the pruned vocabulary\. The results demonstrate that despite Qwen\-2's large vocabulary, MicroSpec, operating with a maximum dynamic range of only $W_{max} = 3072$ tokens, maintains a high acceptance length \(3\.48\), comparable to baselines that use much larger static vocabularies \(e\.g\., 3\.44 for FR\-Spec\) or the full vocabulary \(3\.65 for EAGLE\)\. This provides strong empirical evidence for the robustness of our context\-driven, training\-free approach on large\-vocabulary models\.

Table 4: Average acceptance length per step on Qwen\-2\-7B\-Instruct across different tasks\. A higher value indicates better quality of speculative generation and more effective vocabulary coverage\. The results of CORAL and DynaSpec are from their reported optimal performance\.

| Method | MT | Conv | RAG | Math | QA | Summ | Code | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EAGLE \(Full Vocabulary\) | 3\.03 | 3\.86 | 3\.70 | 4\.31 | 3\.06 | 3\.43 | 4\.19 | 3\.65 |
| CORAL | \- | \- | \- | \- | \- | \- | \- | 3\.62 |
| FR\-Spec | 2\.94 | 3\.60 | 3\.58 | 4\.14 | 3\.00 | 3\.31 | 3\.52 | 3\.44 |
| DynaSpec | 2\.86 | 3\.72 | 3\.32 | 4\.18 | 2\.97 | 3\.24 | 3\.96 | 3\.46 |
| MicroSpec | 2\.69 | 3\.70 | 3\.64 | 4\.14 | 2\.86 | 3\.37 | 3\.98 | 3\.48 |

### A\.3 Hyper\-parameter Sensitivity Analysis

![Refer to caption](https://arxiv.org/html/2605.26444v1/x6.png)

Figure 5: Sensitivity analysis of the maximum vocabulary size \($W_{max}$\)\. The dual\-axis plot shows the trade\-off between average acceptance length \(dashed blue, right axis\) and end\-to\-end speed \(solid green, left axis\) as $W_{max}$ varies\. The star marks our chosen default setting $W_{max} = 3072$\.

The maximum size of the dynamic vocabulary \($W_{max}$\) is a critical hyper\-parameter in MicroSpec\. It governs a fundamental trade\-off between draft quality \(vocabulary coverage, measured by acceptance length\) and compute overhead \(the latency associated with gathering weights and performing GEMM operations for a larger vocabulary\)\. To define the operational boundaries of our approach, we conduct a sensitivity analysis by varying $W_{max}$ from 1024 to 8192 on SpecBench using Llama\-3\-8B\-Instruct\. The results are visualized in Figure [5](https://arxiv.org/html/2605.26444#A1.F5)\.

Analysis of Draft Quality\. The dashed blue line in Figure [5](https://arxiv.org/html/2605.26444#A1.F5) illustrates the impact of $W_{max}$ on the average acceptance length\. We observe a sharp increase in acceptance length as the window size grows from 1024 to 2048\. However, beyond $W_{max} \approx 2048$, the acceptance length rapidly saturates and plateaus at around 3\.6\. This trend provides strong empirical evidence for our core insight regarding temporal locality: a relatively small, historically relevant context window is sufficient to capture the vast majority of tokens required for efficient speculative generation\. Increasing the window size further yields diminishing returns in terms of vocabulary coverage\.
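The saturation behaviour described above can be illustrated with a simple coverage measure on a token sequence\. The sketch below is a deliberately simplified proxy: it only checks whether each token already appeared among the preceding `window` tokens, whereas the candidate stream in MicroSpec also contains top\-K candidates and draft\-tree tokens\. The function name and the toy sequence are hypothetical\.

```python
def window_coverage(tokens, window):
    """Fraction of tokens that already occur among the `window` tokens before them."""
    hits = 0
    for t in range(1, len(tokens)):
        recent = set(tokens[max(0, t - window):t])
        hits += tokens[t] in recent
    return hits / (len(tokens) - 1)


# Toy sequence with repeated ids, standing in for a generated text.
tokens = [1, 2, 3, 1, 2, 4, 1, 2, 3, 5, 1, 2]
for window in (1, 2, 4, 8, 100):
    print(window, round(window_coverage(tokens, window), 3))
```

On this toy input the coverage rises with the window and then stops improving once the window spans the whole history, which mirrors the plateau in acceptance length without reproducing the paper's measurements\.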

Analysis of System Performance\. The solid green line indicates the end\-to\-end generation speed\. Initially, the speed increases alongside the acceptance length, peaking within the shaded optimal region \($2048 \leq W_{max} \leq 4096$\)\. Crucially, as $W_{max}$ increases beyond this region \(e\.g\., up to 8192\), the generation speed begins to decline, even though the acceptance length remains stable\. This decline reflects the diminishing benefit of a larger window when sequences are short: most generation tasks involve fewer than 4096 tokens in total, so a larger sliding window is rarely filled, while the system overhead of computing the LM head keeps increasing\.

Consequently, an optimal operating region exists where the vocabulary is large enough to maximize acceptance length while staying small enough to minimize system overhead\. We thus select $W_{max} = 3072$ \(marked with a star in Figure [5](https://arxiv.org/html/2605.26444#A1.F5)\) as our default setting, striking an effective balance between high algorithmic efficiency and minimal system latency\.