
# CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
Source: [https://arxiv.org/html/2605.11186](https://arxiv.org/html/2605.11186)
Yuning Han (University of Florida, Gainesville, FL 32611, yuninghan@ufl.edu) · Yangchenchen Jin (University of Florida, Gainesville, FL 32611, yangchenchen.jin@ufl.edu) · Dylan Zhao (University of Florida, Gainesville, FL 32611, dylan.zhao@ufl.edu) · Jingwei Sun (University of Florida, Gainesville, FL 32611, sun.jingwei@ufl.edu)

###### Abstract

Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously, an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS achieves a wall-clock speedup of up to 5.08× with no degradation in generation quality, outperforming the SOTA method by up to 1.45× under edge memory constraints. Code is available at [https://github.com/ElizaFuLan/CATS.git](https://github.com/ElizaFuLan/CATS.git).

## 1 Introduction

Deploying Large Language Models (LLMs) for efficient inference remains one of the central challenges in modern machine learning systems. LLM inference must generate tokens one at a time in an autoregressive loop, each step conditioned on all prior outputs and requiring a full model forward pass. This sequential structure is inherently memory-bound: every generation step loads the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers) to on-chip compute units, and throughput is governed by memory bandwidth rather than arithmetic capacity (Vellaisamy et al., [2026](https://arxiv.org/html/2605.11186#bib.bib54); Alizadeh et al., [2024](https://arxiv.org/html/2605.11186#bib.bib49)). Speculative decoding was proposed to break this bottleneck through *parallel decoding* (Leviathan et al., [2023](https://arxiv.org/html/2605.11186#bib.bib6); Chen et al., [2023](https://arxiv.org/html/2605.11186#bib.bib7); Stern et al., [2018](https://arxiv.org/html/2605.11186#bib.bib5)): a lightweight draft model speculatively generates a sequence of candidate tokens, which the larger target model then verifies in a single batched forward pass. Because the target model processes multiple tokens simultaneously, the per-token memory bandwidth cost is amortized over several accepted tokens, substantially improving throughput while preserving the exact output distribution.

However, this throughput gain comes at a structural cost: classical speculative decoding introduces a *second* set of weights, the auxiliary draft model, that must coexist with the target model in device memory, inflating both the static memory footprint and the per-step bandwidth traffic. Self-speculative decoding methods (Zhang et al., [2024](https://arxiv.org/html/2605.11186#bib.bib9); Elhoushi et al., [2024](https://arxiv.org/html/2605.11186#bib.bib10); Xia et al., [2025](https://arxiv.org/html/2605.11186#bib.bib18); Liu et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib16)) emerged as a practical response to this added memory cost: by generating drafts from a shallow sub-network of the target model itself, they eliminate the need for a separate draft model. However, they still need to introduce additional adapter parameters for drafting, and their performance often suffers from the sub-network's limited drafting capacity.

![Refer to caption](https://arxiv.org/html/2605.11186v1/x1.png)

Figure 1: End-to-end speedup across five benchmarks on Vicuna-7B.

A more fundamental limitation runs through *all* existing speculative decoding approaches: they assume model weights remain resident in device memory (e.g., HBM for servers and DRAM for edge devices) for the duration of inference. On large-scale servers this holds, but on memory-limited platforms such as edge devices, DRAM capacity falls well short of a single 7B-parameter model and must be shared with the operating system, so weights must be streamed from flash memory on every forward pass (Alizadeh et al., [2024](https://arxiv.org/html/2605.11186#bib.bib49)), making flash↔DRAM the rate-limiting transfer rather than DRAM↔on-chip memory. Every design choice in existing methods (draft depth, verification strategy, acceptance policy) targets the memory-resident regime and is misaligned with this memory-constrained setting. Auxiliary-model approaches are doubly penalized: maintaining a separate draft model adds memory capacity demands and extra transfer traffic at every speculative step, compounding the bottleneck they were meant to relieve.

We investigate this memory-limited inference bottleneck and propose CATS (Cascaded Adaptive Tree Speculation), a self-speculative decoding framework that performs cascaded verification to minimize inference time within the memory budget. CATS organizes inference into three stages whose layer boundaries are determined by the available DRAM budget: (1) a *draft* stage using a shallow sub-network to draft candidate tokens by repeating draft iterations; (2) a *shallow verification* stage using intermediate layers under the DRAM budget, streamed from flash once per cycle, to check draft tokens and produce correction candidates in parallel; and (3) a *target verify* stage where the remaining layers are offloaded from flash chunk by chunk to process a tree-structured input over the draft trace and all correction branches, selecting the longest accepted prefix. We also propose a *Reduced KL Loss* that focuses distillation supervision on high-probability tokens to maximize the sub-network's drafting and verification capabilities.

Our main contributions are as follows:

- We identify flash↔DRAM data movement as the dominant bottleneck for speculative decoding on edge devices, a constraint entirely overlooked by existing methods, and propose CATS, a cascaded self-speculative decoding framework for this memory-constrained regime.
- CATS is *model-agnostic*: the cascaded framework applies to any transformer-based LLM without architectural modifications.
- We implement our framework on edge devices. Extensive experiments on Vicuna-7B/13B and LLaMA2-7B/13B across five benchmarks demonstrate that CATS achieves up to 5.08× speedup, outperforming all compared self-speculative decoding baselines, including Kangaroo (Liu et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib16)), Medusa (Cai et al., [2024](https://arxiv.org/html/2605.11186#bib.bib15)), and EAGLE (Li et al., [2024b](https://arxiv.org/html/2605.11186#bib.bib1)), under the memory-limited setting.

## 2 Related Work

##### Speculative decoding with auxiliary draft models.

The canonical speculative decoding framework (Leviathan et al., [2023](https://arxiv.org/html/2605.11186#bib.bib6); Chen et al., [2023](https://arxiv.org/html/2605.11186#bib.bib7); Xia et al., [2023](https://arxiv.org/html/2605.11186#bib.bib22)) relies on a separately trained smaller model to propose candidates that the target model verifies in parallel. Subsequent work has extended this to tree-structured verification (Miao et al., [2024](https://arxiv.org/html/2605.11186#bib.bib17); Chen et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib12); Wang et al., [2024](https://arxiv.org/html/2605.11186#bib.bib24); Xiong et al., [2024](https://arxiv.org/html/2605.11186#bib.bib26)), knowledge-distilled draft models (Zhou et al., [2024](https://arxiv.org/html/2605.11186#bib.bib8); Liu et al., [2024b](https://arxiv.org/html/2605.11186#bib.bib11); Du et al., [2024](https://arxiv.org/html/2605.11186#bib.bib34)), retrieval-based drafting (He et al., [2024](https://arxiv.org/html/2605.11186#bib.bib13); Li et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib37); Oliaro et al., [2025](https://arxiv.org/html/2605.11186#bib.bib38)), and lookahead generation (Fu et al., [2024](https://arxiv.org/html/2605.11186#bib.bib14)); see Xia et al. ([2024](https://arxiv.org/html/2605.11186#bib.bib4)) for a comprehensive survey. Methods that rely on independently trained smaller models (Kim et al., [2023](https://arxiv.org/html/2605.11186#bib.bib19); Chen et al., [2024b](https://arxiv.org/html/2605.11186#bib.bib36); Zhao et al., [2024](https://arxiv.org/html/2605.11186#bib.bib35); Wang et al., [2025b](https://arxiv.org/html/2605.11186#bib.bib42); Zhong et al., [2025](https://arxiv.org/html/2605.11186#bib.bib21); Sun et al., [2023](https://arxiv.org/html/2605.11186#bib.bib29), [2025](https://arxiv.org/html/2605.11186#bib.bib30); Bachmann et al., [2025](https://arxiv.org/html/2605.11186#bib.bib20); Wang et al., [2025a](https://arxiv.org/html/2605.11186#bib.bib27); Sun et al., [2024](https://arxiv.org/html/2605.11186#bib.bib39); Chen et al., [2025b](https://arxiv.org/html/2605.11186#bib.bib40)) incur significant memory overhead, a challenge in edge scenarios where even the target model barely fits in device memory.

##### Self-speculative decoding.

To eliminate the dependency on a separate draft model, a line of *self-speculative* methods derives the draft model directly from the target model's own layers. Early-exit approaches such as Draft & Verify (Zhang et al., [2024](https://arxiv.org/html/2605.11186#bib.bib9)) and LayerSkip (Elhoushi et al., [2024](https://arxiv.org/html/2605.11186#bib.bib10)) route tokens through only the first few transformer layers for drafting and use the full model for verification. Swift (Xia et al., [2025](https://arxiv.org/html/2605.11186#bib.bib18)) selects skip layers dynamically per input without fine-tuning, while KnapSpec (Cha et al., [2026](https://arxiv.org/html/2605.11186#bib.bib28)) formulates layer selection as a knapsack optimization problem. Kangaroo (Liu et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib16)) introduces a lightweight adapter on top of the shallow sub-network trained to mimic the target model's output distribution, achieving strong acceptance rates with minimal overhead. Multi-head approaches such as Medusa (Cai et al., [2024](https://arxiv.org/html/2605.11186#bib.bib15)) and Hydra (Ankner et al., [2024](https://arxiv.org/html/2605.11186#bib.bib31)) attach parallel prediction heads to the target model's final layer, enabling several future tokens to be proposed in a single forward pass. The EAGLE series (Li et al., [2024b](https://arxiv.org/html/2605.11186#bib.bib1), [2025a](https://arxiv.org/html/2605.11186#bib.bib2), [2025b](https://arxiv.org/html/2605.11186#bib.bib3)) drafts at the feature level and constructs dynamic token trees, achieving state-of-the-art speedups on server-class hardware. Consistency-based training (Kou et al., [2024](https://arxiv.org/html/2605.11186#bib.bib33); Guo and Ermon, [2025](https://arxiv.org/html/2605.11186#bib.bib23)) and multi-token prediction objectives (Gloeckle et al., [2024](https://arxiv.org/html/2605.11186#bib.bib41); Qin et al., [2024](https://arxiv.org/html/2605.11186#bib.bib25); Monea et al., [2023](https://arxiv.org/html/2605.11186#bib.bib32)) provide complementary approaches for parallel generation. A common limitation across all these methods is that they still require additional adapter parameters for drafting, which can erode the gains of speculative decoding in memory-limited settings; moreover, their performance often suffers from the sub-network's limited drafting capacity.

##### LLM inference on memory-constrained hardware.

LLM in a Flash (Alizadeh et al., [2024](https://arxiv.org/html/2605.11186#bib.bib49)) characterizes the flash↔DRAM bandwidth bottleneck on edge devices, where model weights cannot reside in DRAM, and proposes windowed loading and row-column bundling to reduce transfer volume per forward pass. PowerInfer (Song et al., [2024](https://arxiv.org/html/2605.11186#bib.bib50)) exploits activation sparsity to selectively load only hot neurons. These works (Ren et al., [2025](https://arxiv.org/html/2605.11186#bib.bib51); Chen et al., [2025a](https://arxiv.org/html/2605.11186#bib.bib52)) recognize the memory bottleneck as the critical constraint on edge inference, a constraint that our method addresses for speculative decoding in the same setting.

## 3 Preliminary and Motivation

![Refer to caption](https://arxiv.org/html/2605.11186v1/x2.png)

Figure 2: Memory hierarchy on server vs. edge, with measured per-token latency breakdown for autoregressive Vicuna-7B inference. *Server (B200):* model weights reside in HBM; the binding bottleneck is HBM↔SRAM bandwidth, and per-token latency is compute-dominated. *Edge (Jetson AGX Orin):* DRAM cannot hold the full model, so weights must be staged from flash on every forward pass; flash↔DRAM transfer dominates per-token latency.

The structural mismatch identified in Section [1](https://arxiv.org/html/2605.11186#S1) (all speculative decoding methods assume memory-resident weights, a condition that fails on edge) stems from a fundamental difference in memory hierarchy between the two deployment regimes, which Figure [2](https://arxiv.org/html/2605.11186#S3.F2) makes explicit. On a conventional GPU platform (e.g., NVIDIA B200), HBM provides ∼180 GB at ∼8 TB/s, comfortably holding the target model, an auxiliary draft model, and all intermediate states simultaneously; the binding cost during decoding is the fast HBM↔SRAM path. On an edge platform with a unified DRAM architecture, the GPU/NPU and CPU share a single DRAM pool of ≲16 GB, which can hardly host even a single 7B-parameter model (∼14 GB), let alone an extra draft model. Data moves between DRAM and flash memory at only ∼2 GB/s. During inference, model weights must be loaded chunk by chunk from flash into DRAM on every forward pass (Alizadeh et al., [2024](https://arxiv.org/html/2605.11186#bib.bib49)), and the binding bottleneck shifts to this flash↔DRAM movement, as reflected in the latency breakdown in Figure [2](https://arxiv.org/html/2605.11186#S3.F2).

![Refer to caption](https://arxiv.org/html/2605.11186v1/x3.png)

Figure 3: End-to-end speedup of speculative decoding methods on B200 vs. Jetson AGX Orin.

This bottleneck shift makes existing speculative decoding methods inferior in memory-limited settings. We profile on AGX Orin devices with a limited memory budget and visualize this effect in Figure [3](https://arxiv.org/html/2605.11186#S3.F3): the end-to-end speedup of Kangaroo and EAGLE drops by 19.8% and 24.5%, respectively, when moving from server to edge, relative to vanilla inference on the corresponding platform. This effect is more severe for methods like EAGLE that introduce an additional draft model, since these components must also be staged through the constrained edge memory hierarchy.
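To see why this shift is so severe, a back-of-envelope sketch (using the assumed round numbers above; actual traffic depends on quantization, caching, and scheduling) compares per-step weight traffic in the two regimes:

```python
# Back-of-envelope check of the bottleneck shift, using rough figures
# from the text (assumed values, not measurements).

WEIGHTS_GB = 14.0        # ~7B parameters in fp16
HBM_BW_GBPS = 8000.0     # server HBM, ~8 TB/s (B200)
FLASH_BW_GBPS = 2.0      # edge flash -> DRAM, ~2 GB/s

hbm_s = WEIGHTS_GB / HBM_BW_GBPS      # weight traffic per step, server
flash_s = WEIGHTS_GB / FLASH_BW_GBPS  # weight traffic per step, edge

print(f"server HBM sweep:  {hbm_s * 1e3:6.2f} ms/step")  # ~1.75 ms
print(f"edge flash stream: {flash_s:6.1f} s/step")        # ~7.0 s

# Amortizing each streamed pass over the accepted tokens divides the ~7 s
# cost, which is why acceptance length dominates edge throughput.
```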

## 4 Methodology

In this section, we formalize the memory-adaptive inference framework that is central to CATS and describe its three-stage inference pipeline. We then detail the verification tree construction and the Reduced KL Loss used to train the draft and shallow-verifier adapters. The detailed algorithm can be found in Appendix [A](https://arxiv.org/html/2605.11186#A1).

### 4.1 Notation

We use *target model* (TM) for the target LLM; *draft model* (DM) for layers $1$ to $L_{\mathrm{DM}}$ plus a lightweight adapter, trained to transform the shallow-layer outputs of the TM into calibrated representations suitable for next-token prediction; and *shallow verifier* (SV) for layers $L_{\mathrm{DM}}+1$ to $L_{\mathrm{SV}}$ plus its adapter. A forward pass through layers $1$ to $L_{\mathrm{DM}}$ is a *drafting pass*, through $L_{\mathrm{DM}}+1$ to $L_{\mathrm{SV}}$ an *SV pass*, and through $L_{\mathrm{SV}}+1$ to $L_{\mathrm{final}}$ the *final pass*; $\bar{\gamma}$ denotes the number of drafting passes per decoding cycle. Here, $L_{\mathrm{DM}}$ denotes the draft boundary, i.e., the last layer included in the draft model; $L_{\mathrm{SV}}$ denotes the shallow-verification boundary, i.e., the last layer used by the shallow verifier; and $L_{\mathrm{final}}$ denotes the final layer of the target model.
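These boundaries can be captured in a small configuration object; the following sketch uses illustrative names (CascadeConfig and its fields are ours, not from the released code):

```python
from dataclasses import dataclass

@dataclass
class CascadeConfig:
    """Layer boundaries of the three-stage cascade (illustrative sketch).

    Layers 1..L_DM form the draft model (plus adapter), layers
    L_DM+1..L_SV the shallow verifier (plus adapter), and layers
    L_SV+1..L_final the remaining target layers for the final pass.
    """
    L_DM: int = 3       # draft boundary: last layer of the draft model
    L_SV: int = 15      # shallow-verification boundary
    L_final: int = 32   # final layer of the target model
    gamma_bar: int = 5  # drafting passes per decoding cycle

    def __post_init__(self) -> None:
        # The three stages must partition the layer stack in order.
        assert 1 <= self.L_DM < self.L_SV < self.L_final
```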

### 4.2 Cascaded verification design

The left part of Figure [4](https://arxiv.org/html/2605.11186#S4.F4) illustrates the full CATS pipeline. Each decoding cycle alternates between flash-to-DRAM weight movement and computation across three stages. In this paper, we focus on weight movement because edge inference is typically latency-oriented and operates at small batch sizes, where the KV cache is not the dominant memory consumer as in large-batch serving. When KV-cache pressure becomes relevant, it can be treated as part of the streamed state alongside model weights, so the same memory-adaptive scheduling principle still applies.

![Refer to caption](https://arxiv.org/html/2605.11186v1/x4.png)

Figure 4: Pipeline of CATS. Left: three-stage decoding cycle showing the interleaved flash↔DRAM transfers and GPU computation. The bottommost blue block (draft sub-network) and the intermediate SV layers (middle blue block) are streamed from flash once per full inference cycle during the drafting and shallow-verification process (regarded as an SD process); the target layers (top blue block, the rest of the model) are streamed accordingly during the final pass. Green blocks: token embeddings; yellow blocks: hidden states; yellow-bordered box: parallel forward computation; orange blocks: verification decision. Draft and SV adapter structure follows Liu et al. ([2024a](https://arxiv.org/html/2605.11186#bib.bib16)). Right upper: verification tree structure. Right lower: tree-based attention mask encoding the main draft sequence and SV correction branches for the batched final pass.

**Drafting loop.** CATS uses a compact DM that is small enough to fit within the DRAM budget of the edge device. The draft boundary $L_{\mathrm{DM}}$ is chosen not simply to maximize the number of layers that can be loaded into DRAM, but to balance three constraints: DRAM capacity, limited edge compute throughput, and memory bandwidth. A deeper draft model may improve drafting quality, but it also increases the cost of each autoregressive drafting step. We therefore keep the draft model intentionally shallow and perform $\bar{\gamma}$ sequential drafting passes to generate $\bar{\gamma}$ candidate tokens with reasonable latency. To preserve draft quality under this small-model constraint, the draft adapter is trained with the Reduced KL Loss introduced later.
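A minimal sketch of the drafting loop, assuming greedy drafting and treating the shallow sub-network and its adapter as opaque callables (draft_layers and adapter_head are hypothetical interfaces, not the paper's API):

```python
import torch

def draft_tokens(ctx: torch.Tensor, draft_layers, adapter_head,
                 gamma_bar: int = 5) -> torch.Tensor:
    """Run gamma_bar autoregressive drafting passes through layers 1..L_DM.

    ctx: (batch, seq) token ids; returns (batch, gamma_bar) draft tokens.
    """
    drafts = []
    for _ in range(gamma_bar):
        hidden = draft_layers(ctx)                 # layers 1..L_DM
        logits = adapter_head(hidden[:, -1])       # adapter -> vocab logits
        nxt = logits.argmax(dim=-1, keepdim=True)  # greedy draft token
        drafts.append(nxt)
        ctx = torch.cat([ctx, nxt], dim=-1)        # feed back for next pass
    return torch.cat(drafts, dim=-1)
```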

**Shallow verification pass.** The shallow-verifier boundary is chosen adaptively according to the memory budget: the intermediate sub-network (layers $L_{\mathrm{DM}}+1$ through $L_{\mathrm{SV}}$) is moved to DRAM along with the draft model during the first flash-to-DRAM transfer of each cycle. The $\bar{\gamma}$ candidates are then forwarded in parallel through these intermediate layers, and the SV adapter decodes a verification token at each position. Each SV token is compared strictly against the corresponding draft token: matching positions are passed through to the target stage; mismatching positions yield a correction candidate. When mismatches happen, the draft tokens and corrections are assembled into a verification tree as shown in Figure [4](https://arxiv.org/html/2605.11186#S4.F4). The correction tokens are then re-forwarded from layer $1$ through $L_{\mathrm{SV}}$ under tree-masked attention to obtain their hidden states, reusing the already-loaded draft and SV layers at zero additional flash-transfer cost. The insight of this DM-to-SV stage is that it serves as an internal speculative decoding process that fully exploits the limited DRAM capacity.
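The strict position-wise comparison can be sketched as follows (plain Python over token ids for clarity; in practice this runs batched on device):

```python
def shallow_verify(draft_toks: list[int], sv_toks: list[int]):
    """Compare SV tokens against draft tokens position by position.

    Matching positions pass through to the target stage; each mismatch
    yields a correction branch for the verification tree. Returns
    [(position, correction_token), ...]; an empty list means the whole
    draft chain survived shallow verification.
    """
    return [(i, v) for i, (d, v) in enumerate(zip(draft_toks, sv_toks)) if d != v]
```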

**Target verification pass.** The remaining layers $L_{\mathrm{SV}}+1$ through $L_{\mathrm{final}}$ are then streamed from flash into DRAM chunk by chunk in the rest of the sequential transfer. The assembled verification tree is processed in a single batched forward pass under the corresponding tree attention mask. The target model can verify the main draft branch following greedy decoding or via typical acceptance (Cai et al., [2024](https://arxiv.org/html/2605.11186#bib.bib15)); at each rejected position, the SV's correction candidate is evaluated under the same criterion. The longest accepted prefix is committed to the output, and the KV cache is updated accordingly.
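A simplified acceptance sketch for the greedy criterion, assuming target_toks holds the target model's token at each main-branch position from the batched tree pass (this simplification stops at the first branch switch, whereas the full tree can continue along a verified correction branch):

```python
def commit_longest_prefix(draft_toks, corrections, target_toks):
    """Commit the longest accepted prefix under greedy verification.

    A draft token is accepted if it equals the target's token; at the
    first rejection, the SV correction at that position (if any) is
    tried under the same criterion, then the cycle ends.
    """
    corr = dict(corrections)
    accepted = []
    for i, d in enumerate(draft_toks):
        if d == target_toks[i]:
            accepted.append(d)                # draft token verified
            continue
        if corr.get(i) == target_toks[i]:
            accepted.append(corr[i])          # correction branch verified
        else:
            accepted.append(target_toks[i])   # fall back to target's token
        break                                 # first rejection ends the prefix
    return accepted
```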

### 4.3 Adapter design and training

Draft and verification adapters follow the architecture of Liu et al. ([2024a](https://arxiv.org/html/2605.11186#bib.bib16)): two normalization layers, one multi-head attention layer, and a shared LM head. Because CATS uses compact draft and shallow-verifier sub-networks to meet edge-device memory and latency constraints, their drafting and verification ability must be strengthened under limited sub-model capacity. Rather than distilling the full vocabulary from the full model (Zhou et al., [2024](https://arxiv.org/html/2605.11186#bib.bib8); Liu et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib16)), which wastes supervision on low-probability tokens that rarely affect acceptance, we train the adapters with a *Reduced KL Loss* that focuses on the tokens most relevant to acceptance.

A general distillation loss that aligns the adapter distribution with the target model over the entire vocabulary is formed as:

$$\mathcal{L}_{\text{full}}=-\frac{1}{|M|}\sum_{t\in M}\sum_{v=1}^{V}p_{\text{target}}(v\mid t)\cdot\log q_{\text{draft}}(v\mid t),\qquad(1)$$

where $M$ is the set of unmasked token positions, $p_{\text{target}}$ is the target model's distribution, and $q_{\text{draft}}$ is the adapter's distribution. Spreading supervision across the full vocabulary $V$ dilutes the gradient signal onto near-zero-mass tokens that are irrelevant to acceptance. We instead restrict it to the top-$K$ tokens:

$$\mathcal{L}_{\text{top-}K}=-\frac{1}{|M|}\sum_{t\in M}\sum_{v\in\mathcal{T}_{K}(t)}\tilde{p}_{\text{target}}(v\mid t)\cdot\log q_{\text{draft}}(v\mid t),\qquad(2)$$

where $\mathcal{T}_{K}(t)$ denotes the set of top-$K$ tokens under the target distribution at position $t$, and $\tilde{p}_{\text{target}}(v\mid t)$ is the target distribution renormalized over this set. This concentrates distillation on the tokens that actually determine acceptance while excluding the low-probability tail. The detailed computation process can be found in Appendix [B](https://arxiv.org/html/2605.11186#A2).
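A minimal PyTorch sketch of Eq. (2), assuming (batch, seq, vocab) logits and a boolean mask over the positions in $M$ (the shapes and masking convention are our assumptions):

```python
import torch
import torch.nn.functional as F

def reduced_kl_loss(draft_logits: torch.Tensor,
                    target_logits: torch.Tensor,
                    mask: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Top-K reduced distillation loss, Eq. (2) (sketch).

    draft_logits, target_logits: (batch, seq, vocab); mask: (batch, seq) bool.
    """
    with torch.no_grad():
        p = F.softmax(target_logits, dim=-1)
        topk_p, topk_idx = p.topk(k, dim=-1)                # T_K(t)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)  # renormalized p~
    log_q = F.log_softmax(draft_logits, dim=-1)
    log_q_topk = log_q.gather(-1, topk_idx)                 # log q on T_K(t)
    loss_per_pos = -(topk_p * log_q_topk).sum(dim=-1)       # (batch, seq)
    return loss_per_pos[mask].mean()                        # average over |M|
```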

## 5 Experiments

**Models, tasks, baselines, and metrics.** We conducted experiments on Vicuna models (7B and 13B) (Chiang et al., [2023](https://arxiv.org/html/2605.11186#bib.bib44)) and LLaMA2 models (7B and 13B) (Touvron et al., [2023](https://arxiv.org/html/2605.11186#bib.bib43)), mainstream LLMs widely used to evaluate self-speculative decoding algorithms. We evaluated CATS across multiple tasks, including multi-turn dialogue, code generation, mathematical reasoning, and instruction following, using Spec-bench (Xia et al., [2024](https://arxiv.org/html/2605.11186#bib.bib4)), MT-bench (Zheng et al., [2023](https://arxiv.org/html/2605.11186#bib.bib45)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.11186#bib.bib46)), Alpaca (Taori et al., [2023](https://arxiv.org/html/2605.11186#bib.bib47)), and HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.11186#bib.bib48)).

Unless otherwise stated, we set the drafting steps $\bar{\gamma}=5$, choose a compact draft boundary $L_{\mathrm{DM}}=3$, and set the shallow-verifier boundary to $L_{\mathrm{SV}}=15$. Following the design principle in Section [4](https://arxiv.org/html/2605.11186#S4), $L_{\mathrm{DM}}$ is kept shallow to balance DRAM capacity, compute throughput, and repeated drafting cost, while $L_{\mathrm{SV}}=15$ corresponds to representative edge-memory budgets of 8 GB for 7B models and 12 GB for 13B models. We compare against representative speculative decoding algorithms, including REST (He et al., [2024](https://arxiv.org/html/2605.11186#bib.bib13)), Lookahead decoding (Fu et al., [2024](https://arxiv.org/html/2605.11186#bib.bib14)), Kangaroo (Liu et al., [2024a](https://arxiv.org/html/2605.11186#bib.bib16)), Medusa (Cai et al., [2024](https://arxiv.org/html/2605.11186#bib.bib15)), and EAGLE (Li et al., [2024b](https://arxiv.org/html/2605.11186#bib.bib1)). Since our cascaded verification framework is orthogonal to EAGLE's high-probability candidate branching, CATS w/ EAGLE adds these branches to our verification tree to enlarge the parallel candidate set without additional target-model forward passes.

We report the mean accepted length and end-to-end speedup. Mean accepted length measures the average number of tokens accepted per target-model verification, which is independent of the deployment platform, while end-to-end speedup captures the wall-clock acceleration on the target platform. We use a server with 8 B200 GPUs to obtain mean accepted length results, and an NVIDIA Jetson AGX Orin as the edge device for end-to-end speedup evaluation.

**Training.** For fine-tuning our own adapters, we used the ShareGPT dataset (Chiang et al., [2023](https://arxiv.org/html/2605.11186#bib.bib44)) with 68,000 samples following Medusa, with the learning rate set to 1e-6 for the Vicuna 7B and 13B models and 1e-5 for the LLaMA2 7B and 13B models, respectively. All training runs finished within 13 hours on 2 B200 GPUs.

Table 1: Mean accepted length ($\tau$) and end-to-end speedup ($S\uparrow$) on Vicuna-7B across five benchmarks under greedy and relaxed (temperature = 0.7) decoding. Kangaroo and CATS use drafting steps = 5; Medusa uses 2 heads; EAGLE is re-implemented in a self-speculative structure with drafting steps = 5. In tree-based decoding, we set top-$K$ = 10. Lookahead is greedy-only. Each cell reports $\tau$ / $S$.

**Greedy**

| Model | Algorithm | Spec-bench | MT-bench | GSM8K | Alpaca | HumanEval | Mean |
|---|---|---|---|---|---|---|---|
| Vicuna-7B (chain) | REST (He et al., 2024) | 1.6243 / 1.66× | 1.8239 / 1.86× | 1.5045 / 1.54× | 1.8406 / 1.88× | 1.7670 / 1.81× | 1.7121 / 1.75× |
| Vicuna-7B (chain) | Lookahead (Fu et al., 2024) | 2.2785 / 2.29× | 2.2928 / 2.30× | 2.7010 / 2.72× | 2.0623 / 2.08× | 2.5128 / 2.53× | 2.3695 / 2.38× |
| Vicuna-7B (chain) | Kangaroo (Liu et al., 2024a) | 2.2736 / 1.75× | 2.4863 / 1.92× | 2.4856 / 1.92× | 2.3131 / 1.79× | 2.8979 / 2.24× | 2.4914 / 1.92× |
| Vicuna-7B (chain) | Medusa (Cai et al., 2024) | 2.3116 / 2.32× | 2.5115 / 2.52× | 2.4966 / 2.51× | 2.3917 / 2.41× | 2.6832 / 2.70× | 2.4789 / 2.49× |
| Vicuna-7B (chain) | EAGLE-self (Li et al., 2024b) | 2.2995 / 1.74× | 2.5151 / 1.90× | 2.3213 / 1.76× | 2.3104 / 1.75× | 2.9005 / 2.19× | 2.4694 / 1.87× |
| Vicuna-7B (chain) | CATS | 2.8816 / 2.99× | 3.0520 / 3.16× | 3.0198 / 3.14× | 2.8280 / 2.94× | 3.5174 / 3.65× | 3.0598 / 3.18× |
| Vicuna-7B (tree) | EAGLE-self, tree | 2.8943 / 2.18× | 3.1535 / 2.37× | 3.0059 / 2.26× | 2.9306 / 2.21× | 3.5949 / 2.70× | 3.1158 / 2.34× |
| Vicuna-7B (tree) | CATS w/ EAGLE | 3.5097 / 3.51× | 3.6938 / 3.69× | 3.6869 / 3.70× | 3.4339 / 3.45× | 4.2009 / 4.21× | 3.7050 / 3.71× |

**Relaxed (temperature = 0.7)**

| Model | Algorithm | Spec-bench | MT-bench | GSM8K | Alpaca | HumanEval | Mean |
|---|---|---|---|---|---|---|---|
| Vicuna-7B (chain) | REST | 1.6236 / 1.66× | 1.8216 / 1.86× | 1.5192 / 1.55× | 1.8594 / 1.90× | 1.7984 / 1.84× | 1.7244 / 1.76× |
| Vicuna-7B (chain) | Kangaroo | 2.7777 / 2.13× | 3.0712 / 2.35× | 3.0513 / 2.34× | 2.8348 / 2.18× | 3.3703 / 2.59× | 3.0211 / 2.32× |
| Vicuna-7B (chain) | Medusa | 2.4884 / 2.50× | 2.6802 / 2.69× | 2.6837 / 2.70× | 2.6179 / 2.63× | 2.8760 / 2.89× | 2.6692 / 2.68× |
| Vicuna-7B (chain) | EAGLE-self | 2.7487 / 2.08× | 3.0753 / 2.33× | 2.9515 / 2.23× | 2.8644 / 2.17× | 3.4546 / 2.61× | 3.0189 / 2.28× |
| Vicuna-7B (chain) | CATS | 3.4791 / 3.60× | 3.7597 / 3.89× | 3.9652 / 4.12× | 3.6742 / 3.82× | 4.0802 / 4.24× | 3.7917 / 3.94× |
| Vicuna-7B (tree) | EAGLE-self | 3.3494 / 2.53× | 3.7271 / 2.82× | 3.7629 / 2.85× | 3.5334 / 2.68× | 4.1547 / 3.14× | 3.7055 / 2.80× |
| Vicuna-7B (tree) | CATS w/ EAGLE | 4.1802 / 4.18× | 4.5229 / 4.52× | 4.5345 / 4.55× | 4.2745 / 4.29× | 4.8408 / 4.85× | 4.4706 / 4.48× |

### 5.1 Main results

##### Mean accepted length and end-to-end speedup.

Tables [1](https://arxiv.org/html/2605.11186#S5.T1) and [2](https://arxiv.org/html/2605.11186#S5.T2) report the main results. Since EAGLE was originally designed around feature-level speculative decoding, we re-implement it in our self-speculative setting and evaluate both decoding modes: chain decoding and tree decoding. In the edge setting, where each decoding loop is constrained by limited compute and weight-transfer bandwidth, increasing the mean accepted length directly reduces the number of costly inference loops and is reflected in the measured end-to-end speedup.

Table 2: Mean accepted length ($\tau$) and end-to-end speedup ($S\uparrow$) on Vicuna-13B and LLaMA2-7B/13B under greedy and relaxed (temperature = 0.7) decoding. Methods and settings follow Table [1](https://arxiv.org/html/2605.11186#S5.T1). Each cell reports $\tau$ / $S$.

**Greedy**

| Model | Algorithm | Spec-bench | MT-bench | GSM8K | Alpaca | HumanEval | Mean |
|---|---|---|---|---|---|---|---|
| Vicuna-13B (chain) | Kangaroo | 2.1937 / 2.19× | 2.4184 / 2.41× | 2.3848 / 2.38× | 2.2226 / 2.22× | 2.9183 / 2.91× | 2.4276 / 2.42× |
| Vicuna-13B (chain) | Medusa | 2.3920 / 2.03× | 2.5833 / 2.19× | 2.5824 / 2.20× | 2.3959 / 2.04× | 2.8035 / 2.39× | 2.5514 / 2.17× |
| Vicuna-13B (chain) | EAGLE-self | 2.1952 / 2.18× | 2.3831 / 2.37× | 2.2031 / 2.19× | 2.2183 / 2.21× | 2.9163 / 2.90× | 2.3832 / 2.37× |
| Vicuna-13B (chain) | CATS | 2.6588 / 2.64× | 2.7946 / 2.77× | 2.7636 / 2.74× | 2.5493 / 2.53× | 3.2218 / 3.20× | 2.7976 / 2.78× |
| Vicuna-13B (tree) | EAGLE-self, tree | 2.8035 / 2.77× | 3.0423 / 3.01× | 2.8760 / 2.85× | 2.8337 / 2.81× | 3.6168 / 3.58× | 3.0345 / 3.00× |
| Vicuna-13B (tree) | CATS w/ EAGLE | 3.4708 / 3.42× | 3.6897 / 3.64× | 3.5481 / 3.50× | 3.3568 / 3.32× | 4.2428 / 4.19× | 3.6616 / 3.61× |
| LLaMA2-7B (chain) | Kangaroo | 3.8176 / 3.29× | 4.1930 / 3.61× | 4.4296 / 3.82× | 4.0431 / 3.49× | 4.2946 / 3.70× | 4.1534 / 3.58× |
| LLaMA2-7B (chain) | Medusa | 1.6660 / 1.67× | 1.7817 / 1.78× | 1.7358 / 1.74× | 1.7758 / 1.78× | 1.9657 / 1.97× | 1.7850 / 1.78× |
| LLaMA2-7B (chain) | EAGLE-self | 3.8068 / 3.28× | 4.1813 / 3.60× | 4.7597 / 4.10× | 4.0429 / 3.49× | 4.0588 / 3.50× | 4.1699 / 3.60× |
| LLaMA2-7B (chain) | CATS | 4.3335 / 4.33× | 4.6181 / 4.62× | 5.1055 / 5.10× | 4.4642 / 4.46× | 4.7240 / 4.72× | 4.6491 / 4.65× |
| LLaMA2-7B (tree) | EAGLE-self, tree | 4.3529 / 3.75× | 4.7199 / 4.07× | 5.0954 / 4.39× | 4.5393 / 3.91× | 4.5866 / 3.95× | 4.6588 / 4.02× |
| LLaMA2-7B (tree) | CATS w/ EAGLE | 4.8318 / 4.83× | 5.0634 / 5.06× | 5.4185 / 5.42× | 4.9101 / 4.91× | 5.1741 / 5.17× | 5.0796 / 5.08× |
| LLaMA2-13B (chain) | Kangaroo | 3.5409 / 3.53× | 3.9369 / 3.92× | 4.2421 / 4.23× | 3.9754 / 3.97× | 4.1133 / 4.10× | 3.9617 / 3.95× |
| LLaMA2-13B (chain) | Medusa | 1.8250 / 1.55× | 1.9737 / 1.68× | 1.9721 / 1.68× | 1.9425 / 1.65× | 1.9425 / 1.65× | 1.9312 / 1.64× |
| LLaMA2-13B (chain) | EAGLE-self | 3.5478 / 3.53× | 3.9754 / 3.96× | 3.8572 / 3.84× | 3.9805 / 3.97× | 4.1082 / 4.09× | 3.8938 / 3.88× |
| LLaMA2-13B (chain) | CATS | 4.0570 / 4.02× | 4.2752 / 4.24× | 4.3415 / 4.31× | 4.3696 / 4.34× | 4.6959 / 4.66× | 4.3478 / 4.32× |
| LLaMA2-13B (tree) | EAGLE-self, tree | 4.0829 / 4.04× | 4.4699 / 4.42× | 4.2793 / 4.24× | 4.4566 / 4.42× | 4.6993 / 4.65× | 4.3976 / 4.35× |
| LLaMA2-13B (tree) | CATS w/ EAGLE | 4.5707 / 4.51× | 4.8191 / 4.75× | 4.8534 / 4.79× | 4.7998 / 4.75× | 5.0178 / 4.95× | 4.8122 / 4.75× |

**Relaxed (temperature = 0.7)**

| Model | Algorithm | Spec-bench | MT-bench | GSM8K | Alpaca | HumanEval | Mean |
|---|---|---|---|---|---|---|---|
| Vicuna-13B (chain) | Kangaroo | 2.4889 / 2.48× | 2.7894 / 2.78× | 2.7695 / 2.76× | 2.5927 / 2.59× | 3.2753 / 3.27× | 2.7832 / 2.78× |
| Vicuna-13B (chain) | Medusa | 2.5852 / 2.20× | 2.7905 / 2.37× | 2.7920 / 2.38× | 2.7148 / 2.31× | 2.9535 / 2.51× | 2.7672 / 2.35× |
| Vicuna-13B (chain) | EAGLE-self | 2.5144 / 2.50× | 2.8444 / 2.83× | 2.5893 / 2.58× | 2.6609 / 2.65× | 3.2212 / 3.20× | 2.7660 / 2.75× |
| Vicuna-13B (chain) | CATS | 2.9743 / 2.95× | 3.2421 / 3.22× | 3.0736 / 3.05× | 2.9963 / 2.98× | 3.4792 / 3.45× | 3.1531 / 3.13× |
| Vicuna-13B (tree) | EAGLE-self, tree | 3.1934 / 3.16× | 3.5232 / 3.49× | 3.3631 / 3.33× | 3.3051 / 3.28× | 3.9784 / 3.94× | 3.4726 / 3.44× |
| Vicuna-13B (tree) | CATS w/ EAGLE | 3.9851 / 3.93× | 4.2360 / 4.18× | 4.1399 / 4.09× | 4.0683 / 4.02× | 4.7269 / 4.67× | 4.2312 / 4.18× |
| LLaMA2-7B (chain) | Kangaroo | 4.4849 / 3.87× | 4.9609 / 4.28× | 4.8506 / 4.18× | 4.9796 / 4.29× | 4.0587 / 3.50× | 4.6669 / 4.02× |
| LLaMA2-7B (chain) | Medusa | 1.7482 / 1.75× | 1.8667 / 1.87× | 2.1875 / 2.19× | 1.8955 / 1.90× | 2.2848 / 2.28× | 1.9965 / 2.00× |
| LLaMA2-7B (chain) | EAGLE-self | 4.4910 / 3.87× | 4.9756 / 4.29× | 4.9139 / 4.24× | 4.9867 / 4.30× | 4.2687 / 3.68× | 4.7272 / 4.08× |
| LLaMA2-7B (chain) | CATS | 4.8831 / 4.88× | 5.2004 / 5.20× | 5.1870 / 5.19× | 5.1976 / 5.20× | 4.4388 / 4.44× | 4.9814 / 4.98× |
| LLaMA2-7B (tree) | EAGLE-self, tree | 4.9713 / 4.29× | 5.2409 / 4.52× | 5.2814 / 4.55× | 5.3228 / 4.59× | 4.7703 / 4.11× | 5.1173 / 4.41× |
| LLaMA2-7B (tree) | CATS w/ EAGLE | 5.3143 / 5.31× | 5.5818 / 5.58× | 5.5096 / 5.51× | 5.5601 / 5.56× | 4.9882 / 4.99× | 5.3908 / 5.39× |
| LLaMA2-13B (chain) | Kangaroo | 4.1486 / 4.14× | 4.6991 / 4.68× | 4.6031 / 4.59× | 4.7376 / 4.73× | 4.6950 / 4.68× | 4.5767 / 4.56× |
| LLaMA2-13B (chain) | Medusa | 1.9322 / 1.64× | 2.0646 / 1.75× | 2.2499 / 1.92× | 2.0718 / 1.76× | 2.2848 / 1.95× | 2.1207 / 1.80× |
| LLaMA2-13B (chain) | EAGLE-self | 4.1452 / 4.12× | 4.7032 / 4.68× | 4.5883 / 4.57× | 4.7401 / 4.72× | 4.7135 / 4.69× | 4.5781 / 4.56× |
| LLaMA2-13B (chain) | CATS | 4.6461 / 4.61× | 4.9133 / 4.87× | 5.3102 / 5.27× | 4.9464 / 4.92× | 5.0678 / 5.03× | 4.9768 / 4.94× |
| LLaMA2-13B (tree) | EAGLE-self, tree | 4.7176 / 4.67× | 5.1051 / 5.05× | 5.1449 / 5.10× | 5.0338 / 4.99× | 5.1758 / 5.12× | 5.0354 / 4.98× |
| LLaMA2-13B (tree) | CATS w/ EAGLE | 5.0813 / 5.01× | 5.3499 / 5.28× | 5.5524 / 5.48× | 5.3241 / 5.27× | 5.3969 / 5.33× | 5.3409 / 5.27× |

Table [1](https://arxiv.org/html/2605.11186#S5.T1) reports results on Vicuna-7B under both greedy and relaxed decoding. Under greedy decoding, CATS accepts 3.06 tokens per target call on average, compared with 2.49 for Kangaroo, 2.48 for Medusa, and 2.47 for our self-speculative EAGLE chain baseline, a 23–24% relative gain. This translates into a 3.18× end-to-end speedup, 66% higher than Kangaroo's 1.92×. Compared with the full baselines Lookahead and REST, CATS improves mean accepted length by 29% and 79%, respectively. Under relaxed decoding (temperature = 0.7), these gains are maintained or amplified: CATS continues to lead all baselines, including REST, whose relaxed-acceptance advantage over greedy is outpaced by the longer draft sequences that CATS produces. Table [2](https://arxiv.org/html/2605.11186#S5.T2) extends the evaluation to Vicuna-13B and LLaMA2-7B/13B under both decoding modes. On LLaMA2-7B, CATS reaches 4.65 accepted tokens and 4.65× speedup under greedy decoding, versus 4.17 tokens (3.60×) for the EAGLE chain baseline and 4.15 tokens (3.58×) for Kangaroo. On the larger LLaMA2-13B, CATS achieves 4.35 accepted tokens and 4.32× speedup, compared with 3.89 (3.88×) for EAGLE-self and 3.96 (3.95×) for Kangaroo. Pairing CATS with EAGLE tree decoding further pushes accepted length to 5.08 and 4.81 tokens on LLaMA2-7B and 13B, respectively, with corresponding speedups of 5.08× and 4.75×. These consistent gains across model families and sizes confirm that the cascaded-verification advantage is not model-specific.

![Refer to caption](https://arxiv.org/html/2605.11186v1/x5.png)

Figure 5: MT-Bench quality–speed comparison under relaxed acceptance. Quality is measured by GPT-4o following the protocol of Zheng et al. ([2023](https://arxiv.org/html/2605.11186#bib.bib45)); we report the mean score across all MT-Bench categories.
##### Quality–speed Pareto frontier.

A common concern with relaxed speculative decoding is that increasing the number of accepted draft tokens may improve speed at the expense of generation quality. Figure [5](https://arxiv.org/html/2605.11186#S5.F5) shows that this trade-off is not inherent to CATS. Under matched schedules ($\tau=0.3$, $\alpha=0.09$, 5 draft steps), CATS lies on the upper-right Pareto frontier on MT-Bench, improving accepted length while maintaining competitive evaluation scores. Sweeping the temperature from 0 to 1 preserves the separation, suggesting that cascaded verification increases usable parallelism without relaxing the target distribution.

### 5.2 Ablation Study

In our ablation studies, we examine how CATS behaves under the two constraints that dominate memory-limited deployment: memory budget and autoregressive drafting cost. Unless otherwise stated, all ablations use Vicuna-7B evaluated on MT-Bench. We first analyze the wall-clock advantage of the memory-adaptive design, and then study how the drafting horizon $\bar{\gamma}$ trades off accepted length against additional loop latency and memory traffic.

#### 5.2.1 Memory Budget Analysis

##### Adaptivity across memory budgets.

A key advantage of CATS is that the shallow-verifier depth $L_{\mathrm{SV}}$ is a free parameter and can therefore be tuned to fully consume whatever DRAM budget the deployment platform offers, without changing the rest of the pipeline (see the selection sketch after Table 3). We exercise this property on three representative edge-memory budgets covering tightly to comfortably resourced devices (2 GB, 6 GB, and 8 GB) and at each budget configure CATS with the deepest $L_{\mathrm{SV}}$ that fits. Table [3](https://arxiv.org/html/2605.11186#S5.T3) reports per-token compute time, mean accepted length, tokens/s, and end-to-end speedup over the non-speculative baseline. Across all three regimes, CATS achieves the highest speedup, demonstrating that its advantage is not an artifact of one specific memory configuration but a structural benefit that adapts to the available budget.

Table 3: Acceleration under different memory budgets on Vicuna-7B.

| Budget | Method | Comp/tok (s) | Mean acc. | Tok/s ↑ | Speedup |
|---|---|---|---|---|---|
| 2 GB | Baseline | 0.213 | 1.00 | 0.117 | 1.00× |
| 2 GB | Kangaroo ($L_{\mathrm{DM}}=3$) | 0.094 | 2.27 | 0.266 | 2.27× |
| 2 GB | CATS ($L_{\mathrm{DM}}=3$, $L_{\mathrm{SV}}=5$) | 0.067 | 2.80 | 0.329 | **2.82×** |
| 6 GB | Baseline | 0.204 | 1.00 | 0.126 | 1.00× |
| 6 GB | Kangaroo ($L_{\mathrm{DM}}=3$) | 0.089 | 2.27 | 0.267 | 2.12× |
| 6 GB | CATS ($L_{\mathrm{DM}}=3$, $L_{\mathrm{SV}}=10$) | 0.064 | 2.94 | 0.373 | **2.96×** |
| 8 GB | Baseline | 0.253 | 1.00 | 0.146 | 1.00× |
| 8 GB | Kangaroo ($L_{\mathrm{DM}}=3$) | 0.098 | 2.27 | 0.282 | 1.93× |
| 8 GB | CATS ($L_{\mathrm{DM}}=3$, $L_{\mathrm{SV}}=15$) | 0.062 | 3.05 | 0.461 | **3.16×** |
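As referenced above, a toy selection rule for the shallow-verifier boundary under a given DRAM budget might look as follows (uniform per-layer size and a flat runtime overhead are our simplifying assumptions; the helper name is hypothetical):

```python
def deepest_sv_boundary(budget_gb: float, per_layer_gb: float,
                        l_dm: int, l_final: int,
                        overhead_gb: float = 0.5) -> int:
    """Deepest L_SV whose layers co-reside in DRAM with the draft model.

    Assumes uniform layer size; overhead_gb stands in for activations,
    KV cache, and runtime state (illustrative simplifications).
    """
    usable = budget_gb - overhead_gb - l_dm * per_layer_gb
    extra = max(0, int(usable // per_layer_gb))
    return min(l_dm + extra, l_final - 1)  # leave >=1 layer for the final pass

# e.g. Vicuna-7B in fp16: ~14 GB over 32 layers -> ~0.44 GB/layer.
# deepest_sv_boundary(8.0, 14 / 32, l_dm=3, l_final=32) returns 17 here;
# the measured configuration uses L_SV = 15, since real layers and
# runtime overheads are not uniform.
```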

A natural fix for Kangaroo's limited acceptance is to deepen its draft model. However, because the draft path is invoked $\bar{\gamma}$ times per cycle, deepening it amplifies both the repeated DRAM↔on-chip transfer cost and the per-cycle compute load, an overhead that is non-negligible on edge devices. We quantify this with *Bytes Per Token* (BPT) (Alizadeh et al., [2024](https://arxiv.org/html/2605.11186#bib.bib49)), the average weight volume streamed from DRAM per generated token. For a two-stage Kangaroo-style pipeline,

$$\mathrm{BPT}_{\mathrm{Kangaroo}}=\frac{\bar{\gamma}\cdot B_{\mathrm{draft}}+B_{\mathrm{verify}}}{\bar{\beta}+1},\qquad(3)$$

where $B_{\mathrm{draft}}$ and $B_{\mathrm{verify}}$ are the draft and verification parameter volumes, $\bar{\gamma}$ is the number of drafting passes per cycle, and $\bar{\beta}+1$ is the mean number of tokens committed per cycle (the mean accepted length). CATS's three-stage pipeline inserts a shallow verifier, splitting $B_{\mathrm{verify}}$ into two contiguous segments:

$$\mathrm{BPT}_{\mathrm{CATS}}=\frac{\bar{\gamma}\cdot B_{\mathrm{draft}}+B_{\mathrm{SV}}+B_{\mathrm{target}}}{\bar{\beta}+1},\qquad(4)$$

with $B_{\mathrm{SV}}+B_{\mathrm{target}}=B_{\mathrm{verify}}$. Because $B_{\mathrm{SV}}$ is streamed once per cycle as part of the target forward pass, enlarging $L_{\mathrm{SV}}$ increases verifier capacity without inflating the $\bar{\gamma}\cdot B_{\mathrm{draft}}$ term.
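Plugging rough uniform-layer numbers into Eqs. (3)–(4) roughly reproduces the BPT gap reported in Table 4 (the uniform-layer assumption is ours; the measured values also include embedding, adapter, and re-forward traffic):

```python
def bpt(gamma_bar: float, b_draft: float, b_rest: float,
        tokens_per_cycle: float) -> float:
    """Bytes Per Token, Eqs. (3)-(4): draft weights move gamma_bar times
    per cycle, the remaining weights once, amortized over the
    tokens_per_cycle = beta_bar + 1 tokens committed per cycle."""
    return (gamma_bar * b_draft + b_rest) / tokens_per_cycle

gb_layer = 14 / 32  # ~0.44 GB per layer for a 7B fp16 model (assumed uniform)

# Kangaroo (L_DM = 3): 29 verify layers, 2.27 tokens per cycle (Table 4)
print(bpt(5, 3 * gb_layer, 29 * gb_layer, 2.27))  # ~8.5 GB/tok (measured: 8.37)

# CATS (L_DM = 3, L_SV = 15): same weight total, split as B_SV + B_target
# (12 + 17 layers), but 3.05 tokens per cycle thanks to SV corrections
print(bpt(5, 3 * gb_layer, (12 + 17) * gb_layer, 3.05))  # ~6.3 GB/tok (measured: 5.84)
```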

Table [4](https://arxiv.org/html/2605.11186#S5.T4) validates this on Jetson AGX Orin (Vicuna-7B, 8 GB budget). Deepening Kangaroo from $L_{\mathrm{DM}}=3$ to $15$ raises $\bar{\beta}+1$ from 2.27 to 2.98, yet the speedup *falls* from 1.93× to 1.53× as BPT climbs from 8.37 to 11.10 GB/tok. CATS ($L_{\mathrm{DM}}=3$, $L_{\mathrm{SV}}=15$) instead achieves similar acceptance (3.05) with the lowest BPT (5.84 GB/tok) and the best speedup (3.16×), confirming that intermediate-layer verification, not a deeper drafter, is the key to edge throughput.

Table 4: Same-budget edge ablation on NVIDIA AGX Orin (Vicuna-7B). Deepening Kangaroo's draft model improves accepted length but increases BPT and compute cost, while CATS achieves the best speedup with a shallow drafter and single-pass shallow verification.

| Method | BPT (GB/tok) ↓ | Comp/tok (s) | Mean acc. | Tok/s ↑ | Speedup |
|---|---|---|---|---|---|
| Baseline ($L=32$) | 12.95 | 0.253 | 1.00 | 0.146 | 1.00× |
| Kangaroo ($L_{\mathrm{DM}}=3$) | 8.37 | 0.098 | 2.27 | 0.282 | 1.93× |
| Kangaroo ($L_{\mathrm{DM}}=5$) | 8.67 | 0.090 | 2.41 | 0.275 | 1.88× |
| Kangaroo ($L_{\mathrm{DM}}=15$) | 11.10 | 0.106 | 2.98 | 0.224 | 1.53× |
| CATS ($L_{\mathrm{DM}}=3$, $L_{\mathrm{SV}}=15$) | 5.84 | 0.062 | 3.05 | 0.461 | **3.16×** |

#### 5.2.2 Drafting Steps Ablation

We next vary the drafting horizon $\bar{\gamma}$ to justify the default choice $\bar{\gamma}=5$. Increasing $\bar{\gamma}$ gives the draft stage more opportunities to propose tokens, but it also lengthens the autoregressive drafting loop and increases the streamed volume in the BPT numerator. Table [5](https://arxiv.org/html/2605.11186#S5.T5) shows this trade-off: moving CATS from $\bar{\gamma}=3$ to $5$ substantially improves accepted length from 2.7077 to 3.0520 with only a small BPT increase, whereas further extending the horizon to 7 or 10 yields diminishing acceptance gains while BPT continues to rise. We therefore use $\bar{\gamma}=5$ as a balanced setting that captures most of the acceptance benefit before the drafting horizon saturates, while avoiding unnecessary loop latency and memory traffic.

Table 5: Drafting-horizon ablation on Vicuna-7B, reporting accepted length, BPT, and speedup for Kangaroo, EAGLE-self, and CATS.

| $\bar{\gamma}$ | Kangaroo Mean acc. | Kangaroo BPT ↓ | Kangaroo Speedup | EAGLE-self Mean acc. | EAGLE-self BPT ↓ | EAGLE-self Speedup | CATS Mean acc. | CATS BPT ↓ | CATS Speedup |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 2.2501 | 11.1531 | 1.75× | 2.2475 | 11.1660 | 1.73× | 2.7077 | 5.6805 | **2.82×** |
| 5 | 2.4863 | 13.3496 | 1.92× | 2.5151 | 13.1967 | 1.90× | 3.0520 | 5.8354 | **3.16×** |
| 7 | 2.6330 | 15.6804 | 2.04× | 2.6194 | 15.7618 | 2.00× | 3.2343 | 6.2574 | **3.34×** |
| 10 | 2.7142 | 19.6852 | 2.09× | 2.6984 | 19.8004 | 2.05× | 3.3332 | 7.1646 | **3.43×** |

## 6 Conclusion and Limitations

CATS addresses the flash-transfer bottleneck of memory-limited LLM inference through a staged verification cascade adapted to the device's DRAM capacity and weight-offloading schedule, raising both token acceptance and end-to-end speedup while keeping the on-device memory footprint equal to that of the target model. Across four target models and five benchmarks, CATS delivers up to 5.08× wall-clock speedup with no quality degradation. Several limitations remain: our evaluation is restricted to 7B–13B models; the draft sub-network and SV adapter rely on distilled adapters, incurring an up-front training cost; and the three-stage design presumes transformer-style layer-aligned representations, restricting direct applicability to architectures such as state-space models.

## References

- [1] K. Alizadeh, I. Mirzadeh, D. Belenko, S. K. Khatamifard, M. Cho, C. C. Del Mundo, M. Rastegari, and M. Farajtabar (2024). LLM in a flash: efficient large language model inference with limited memory. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [2] Ankner et al. (2024). Hydra: sequentially-dependent draft heads for Medusa decoding. In First Conference on Language Modeling.
- [3] G. Bachmann, S. Anagnostidis, A. Pumarola, M. Georgopoulos, A. Sanakoyeu, Y. Du, E. Schönfeld, A. Thabet, and J. Kohler (2025). Judge decoding: faster speculative sampling requires going beyond model alignment. In The Thirteenth International Conference on Learning Representations.
- [4] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao (2024). Medusa: simple LLM inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [5] S. Cha, G. Kim, D. Han, T. Yang, and I. Han (2026). KnapSpec: self-speculative decoding via adaptive layer selection as a knapsack problem. arXiv preprint arXiv:2602.20217.
- [6] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023). Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318.
- [7] H. Chen, C. Tian, Z. He, B. Yu, Y. Liu, and J. Cao (2025). Inference performance evaluation for LLMs on edge devices with a novel benchmarking framework and metric. arXiv preprint arXiv:2508.11269.
- [8] J. Chen, V. Tiwari, R. Sadhukhan, Z. Chen, J. Shi, I. E. Yen, and B. Chen (2025). MagicDec: breaking the latency-throughput tradeoff for long context generation with speculative decoding. In The Thirteenth International Conference on Learning Representations.
- [9] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- [10] Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen (2024). Sequoia: scalable, robust, and hardware-aware speculative decoding. In Advances in Neural Information Processing Systems, Vol. 37.
- [11] Z. Chen, X. Yang, J. Lin, C. Sun, K. C. Chang, and J. He (2024). Cascade speculative drafting for even faster LLM inference. In Advances in Neural Information Processing Systems, Vol. 37.
- [12] W. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing (2023). Vicuna: an open-source chatbot impressing GPT-4 with 90% ChatGPT quality. [https://lmsys.org/blog/2023-03-30-vicuna/](https://lmsys.org/blog/2023-03-30-vicuna/)
- [13] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- [14] C. Du, J. Jiang, Y. Xu, J. Wu, S. Yu, Y. Li, S. Li, K. Xu, L. Nie, Z. Tu, and Y. You (2024). GliDe with a CaPE: a low-hassle method to accelerate speculative decoding. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [15] M. Elhoushi, A. Shrivastava, D. Liskovich, B. Hosmer, B. Wasti, L. Lai, A. Mahmoud, B. Acun, S. Agarwal, A. Roman, K. Ma, and E. Aly (2024). LayerSkip: enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [16] Y. Fu, P. Bailis, I. Stoica, and H. Zhang (2024). Break the sequential dependency of LLM inference using lookahead decoding. arXiv preprint arXiv:2402.02057.
- [17] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve (2024). Better & faster large language models via multi-token prediction. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [18] G. Guo and S. Ermon (2025). Self-speculative decoding in any-order and any-subset autoregressive models. In Structured Probabilistic Inference and Generative Modeling Workshop at NeurIPS 2025.
- [19] Z. He, Z. Zhong, T. Cai, J. D. Lee, and D. He (2024). REST: retrieval-based speculative decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2024).
- [20] S. Kim, K. Mangalam, S. Moon, J. Malik, M. W. Mahoney, A. Gholami, and K. Keutzer (2023). Speculative decoding with big little decoder. In Advances in Neural Information Processing Systems, Vol. 36.
- [21] S. Kou, L. Hu, Z. He, Z. Deng, and H. Zhang (2024). CLLMs: consistency large language models. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [22] Y. Leviathan, M. Kalman, and Y. Matias (2023). Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, PMLR Vol. 202.
- [23] M. Li, X. Chen, A. Holtzman, B. Chen, J. Lin, W. Yih, and X. V. Lin (2024). Nearest neighbor speculative decoding for LLM generation and attribution. In Advances in Neural Information Processing Systems, Vol. 37.
- [24] Y. Li, F. Wei, C. Zhang, and H. Zhang (2024). EAGLE: speculative sampling requires rethinking feature uncertainty. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [25] Y. Li, F. Wei, C. Zhang, and H. Zhang (2025). EAGLE-2: faster inference of language models with dynamic draft trees. In The Thirteenth International Conference on Learning Representations.
- [26] Y. Li, F. Wei, C. Zhang, and H. Zhang (2025). EAGLE-3: scaling up inference acceleration of large language models via training-time test. In Advances in Neural Information Processing Systems, Vol. 38.
- [27] F. Liu, Y. Tang, Z. Liu, Y. Ni, K. Han, and Y. Wang (2024). Kangaroo: lossless self-speculative decoding via double early exiting. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [28] X. Liu, L. Qian, Y. Ye, Q. Zhao, J. E. Gonzalez, I. Stoica, and H. Zhang (2024). Online speculative decoding. In Proceedings of the 41st International Conference on Machine Learning, PMLR Vol. 235.
- [29] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, Z. Zhang, R. Y. Y. Wong, A. Zhu, L. Yang, X. Shi, C. Shi, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia (2024). SpecInfer: accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3 (ASPLOS '24).
- [30] G. Monea, A. Joulin, and E. Grave (2023). PaSS: parallel speculative sampling. In Efficient Natural Language and Speech Processing Workshop at NeurIPS 2023.
- [31] G. Oliaro, Z. Jia, D. Campos, and A. Qiao (2025). SuffixDecoding: a model-free approach to speeding up large language model inference. In Advances in Neural Information Processing Systems, Vol. 38.
- [32] Z. Qin, Z. Hu, Z. He, N. Prakriya, J. Cong, and Y. Sun (2024). Multi-token joint speculative decoding for accelerating large language model inference. arXiv preprint arXiv:2407.09722.
- [33] Z. Ren, K. Doekemeijer, T. De Matteis, C. Pinto, R. Stoica, and A. Trivedi (2025). An I/O characterizing study of offloading LLM models and KV caches to NVMe SSD. In Proceedings of the Twentieth European Conference on Computer Systems (EuroSys '25).
- [34] Y. Song, Z. Mi, H. Xie, and H. Chen (2024). PowerInfer: fast large language model serving with a consumer-grade GPU. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles.
- \[35\]M\. Stern, N\. Shazeer, and J\. Uszkoreit\(2018\)Blockwise parallel decoding for deep autoregressive models\.InAdvances in Neural Information Processing Systems,Vol\.31\.Cited by:[§1](https://arxiv.org/html/2605.11186#S1.p1.1)\.
- \[36\]H\. Sun, Z\. Chen, X\. Yang, Y\. Tian, and B\. Chen\(2024\)TriForce: lossless acceleration of long sequence generation with hierarchical speculative decoding\.InFirst Conference on Language Modeling,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[37\]Z\. Sun, J\. H\. Ro, A\. Beirami, and A\. T\. Suresh\(2025\)Block verification accelerates speculative decoding\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[38\]Z\. Sun, A\. T\. Suresh, J\. H\. Ro, A\. Beirami, H\. Jain, and F\. Yu\(2023\)SpecTr: fast speculative decoding via optimal transport\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[39\]R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto\(2023\)Stanford Alpaca: an instruction\-following LLaMA model\.Note:[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)Cited by:[§5](https://arxiv.org/html/2605.11186#S5.p1.1)\.
- \[40\]H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§5](https://arxiv.org/html/2605.11186#S5.p1.1)\.
- \[41\]P\. Vellaisamy, S\. Tripathi, V\. Natarajan, S\. S\. Thenrasu, S\. Blanton, and J\. P\. Shen\(2026\)TaxBreak: unmasking the hidden costs of LLM inference through overhead decomposition\.arXiv preprint arXiv:2603\.12465\.Cited by:[§1](https://arxiv.org/html/2605.11186#S1.p1.1)\.
- \[42\]J\. Wang, Y\. Su, J\. Li, Q\. Xia, Z\. Ye, X\. Duan, Z\. Wang, and M\. Zhang\(2024\)OPT\-Tree: speculative decoding with adaptive draft tree structure\.arXiv preprint arXiv:2406\.17276\.Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[43\]Y\. Wang, Y\. Liu, S\. Ji, Y\. Xu, Y\. Xu, Q\. Zhu, and W\. Che\(2025\)Think before you accept: semantic reflective verification for faster speculative decoding\.arXiv preprint arXiv:2505\.18629\.Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[44\]Z\. Wang, Z\. Wang, L\. Le, H\. S\. Zheng, S\. Mishra, V\. Perot, Y\. Zhang, A\. Goldie, J\. Shang, C\. Zhu, C\. Lee, and T\. Pfister\(2025\)Speculative RAG: enhancing retrieval augmented generation through drafting\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[45\]H\. Xia, T\. Ge, P\. Wang, S\. Chen, F\. Wei, and Z\. Sui\(2023\)Speculative decoding: exploiting speculative execution for accelerating Seq2seq generation\.InFindings of the Association for Computational Linguistics: EMNLP 2023,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[46\]H\. Xia, Y\. Li, J\. Zhang, C\. Du, and W\. Li\(2025\)SWIFT: on\-the\-fly self\-speculative decoding for LLM inference acceleration\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2605.11186#S1.p2.1),[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px2.p1.1)\.
- \[47\]H\. Xia, Z\. Yang, Q\. Dong, P\. Wang, Y\. Li, T\. Ge, T\. Liu, W\. Li, and Z\. Sui\(2024\)Unlocking efficiency in large language model inference: a comprehensive survey of speculative decoding\.InFindings of the Association for Computational Linguistics: ACL 2024,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.11186#S5.p1.1)\.
- \[48\]Y\. Xiong, R\. Zhang, Y\. Li, T\. Wu, and L\. Zou\(2024\)DySpec: faster speculative decoding with dynamic token tree structure\.arXiv preprint arXiv:2410\.11744\.Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[49\]J\. Zhang, J\. Zeng, H\. Wang, L\. Hu, H\. Xia, T\. Ge, and F\. Wei\(2024\)Draft & verify: lossless large language model acceleration via self\-speculative decoding\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),Cited by:[§1](https://arxiv.org/html/2605.11186#S1.p2.1),[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px2.p1.1)\.
- \[50\]W\. Zhao, Y\. Huang, X\. Han, W\. Xu, C\. Xiao, Z\. Liu, and M\. Sun\(2024\)Ouroboros: speculative decoding with large model enhanced drafting\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[51\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica\(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[Figure 5](https://arxiv.org/html/2605.11186#S5.F5),[§5](https://arxiv.org/html/2605.11186#S5.p1.1)\.
- \[52\]M\. Zhong, N\. Teku, and R\. Tandon\(2025\)Speeding up speculative decoding via sequential approximate verification\.InProceedings of the 3rd Efficient Systems for Foundation Models Workshop at the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267\.Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1)\.
- \[53\]Y\. Zhou, K\. Lyu, A\. S\. Rawat, A\. K\. Menon, A\. Rostamizadeh, S\. Kumar, J\. Kagy, and R\. Agarwal\(2024\)DistillSpec: improving speculative decoding via knowledge distillation\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.11186#S2.SS0.SSS0.Px1.p1.1),[§4\.3](https://arxiv.org/html/2605.11186#S4.SS3.p1.1)\.

## Appendix A Cats Algorithm

**Algorithm 1** Cats: Cascaded Self-Speculative Decoding with Tree-Masked Final Verification

**Input:** Prompt $\mathbf{x}$; target model $\mathcal{M}$ with layers $1{:}L_{\mathrm{total}}$ and LM head $\mathcal{H}$; draft cut-point $L_{\mathrm{DM}}$ and shallow-verifier cut-point $L_{\mathrm{SV}}$ ($1 \le L_{\mathrm{DM}} < L_{\mathrm{SV}} < L_{\mathrm{total}}$); draft adapter $\mathcal{A}_{\mathrm{DM}}$ and SV adapter $\mathcal{A}_{\mathrm{SV}}$; drafting horizon $\bar{\gamma}$; max new tokens $T$; acceptance criterion $\mathrm{Acc}$ (greedy or typical, [Cai et al. (2024)](https://arxiv.org/html/2605.11186#bib.bib15)).

**Output:** Generated sequence $\mathbf{y}$.

1: *// ===== Initialization =====*
2: Load layers $1{:}L_{\mathrm{SV}}$ (DM $\cup$ SV) into DRAM and pin resident
3: $(\mathbf{h}^{(1:L_{\mathrm{total}})}, \mathrm{KV}_{\mathcal{M}}) \leftarrow \mathcal{M}(\mathbf{x})$  ▷ prompt prefill, full target
4: $y_0 \leftarrow \arg\max \mathcal{H}(\mathbf{h}^{(L_{\mathrm{total}})}_{-1})$; initialize $\mathrm{KV}_{\mathcal{A}_{\mathrm{DM}}}, \mathrm{KV}_{\mathcal{A}_{\mathrm{SV}}}$ from $\mathbf{h}^{(L_{\mathrm{DM}})}, \mathbf{h}^{(L_{\mathrm{SV}})}$
5: $s \leftarrow |\mathbf{x}|$  ▷ committed length
6: **while** $s < T$ **do**
7:  *// ===== Stage 1: Draft =====*
8:  $\mathbf{d} \leftarrow [\,]$; $\mathbf{H}_{\mathrm{DM}} \leftarrow [\,]$
9:  **for** $k = 0, \ldots, \bar{\gamma} - 1$ **do**
10:   $h \leftarrow \mathcal{M}_{1:L_{\mathrm{DM}}}(y_{s+k}, \mathrm{KV}_{\mathcal{M}})$; append $h$ to $\mathbf{H}_{\mathrm{DM}}$
11:   $(o, \mathrm{KV}_{\mathcal{A}_{\mathrm{DM}}}) \leftarrow \mathcal{A}_{\mathrm{DM}}(h, \mathrm{KV}_{\mathcal{A}_{\mathrm{DM}}})$
12:   $y_{s+k+1} \leftarrow \arg\max \mathcal{H}(o)$; append to $\mathbf{d}$
13:  **end for**
14:  *// ===== Stage 2: Shallow Verification =====*
15:  $\mathbf{H}_{\mathrm{SV}} \leftarrow \mathcal{M}_{L_{\mathrm{DM}}+1:L_{\mathrm{SV}}}(\mathbf{H}_{\mathrm{DM}}, \mathrm{KV}_{\mathcal{M}})$  ▷ $\bar{\gamma}$ positions in parallel
16:  $(\mathbf{O}_{\mathrm{SV}}, \mathrm{KV}_{\mathcal{A}_{\mathrm{SV}}}) \leftarrow \mathcal{A}_{\mathrm{SV}}(\mathbf{H}_{\mathrm{SV}}, \mathrm{KV}_{\mathcal{A}_{\mathrm{SV}}})$; $\hat{\mathbf{c}} \leftarrow \arg\max \mathcal{H}(\mathbf{O}_{\mathrm{SV}})$
17:  $\mathcal{R} \leftarrow \{(i, \hat{c}_i) : \hat{c}_i \neq d_i,\ 0 \le i < \bar{\gamma}\}$  ▷ positions where SV proposes a correction
18:  **if** $\mathcal{R} = \emptyset$ **then**
19:   $\mathcal{T} \leftarrow \textsc{ChainTree}(\mathbf{d})$  ▷ straight chain, no correction branches
20:  **else**
21:   $\mathcal{T} \leftarrow \textsc{BuildTree}(\text{main} = \mathbf{d}, \text{corrections} = \mathcal{R})$  ▷ main branch is unchanged; each $(i, \hat{c}_i) \in \mathcal{R}$ adds a side branch at position $i$
22:   Re-forward correction tokens through layers $1{:}L_{\mathrm{SV}}$ under tree-masked attention  ▷ reuses DRAM-resident DM and SV layers; zero additional flash transfer
23:  **end if**
24:  *// ===== Stage 3: Final Verification (stream layers $L_{\mathrm{SV}}+1:L_{\mathrm{total}}$ from flash) =====*
25:  $\mathbf{H}_{\mathrm{tgt}} \leftarrow \mathcal{M}_{L_{\mathrm{SV}}+1:L_{\mathrm{total}}}(\mathcal{T}.\text{hidden\_states},\ \text{mask} = \mathcal{T}.\text{tree\_mask},\ \mathrm{KV}_{\mathcal{M}})$
26:  $\mathbf{p}_{\mathrm{tgt}} \leftarrow \mathrm{softmax}(\mathcal{H}(\mathbf{H}_{\mathrm{tgt}}))$
27:  *// Walk the tree to find the longest accepted prefix*
28:  $a \leftarrow 0$; $\mathbf{y}_{\mathrm{out}} \leftarrow [\,]$
29:  **for** $i = 0, \ldots, \bar{\gamma} - 1$ **do**
30:   **if** $\mathrm{Acc}(d_i, \mathbf{p}_{\mathrm{tgt}}^{(i)})$ **then**
31:    $a \leftarrow a + 1$; append $d_i$ to $\mathbf{y}_{\mathrm{out}}$
32:   **else if** $(i, \hat{c}_i) \in \mathcal{R}$ and $\mathrm{Acc}(\hat{c}_i, \mathbf{p}_{\mathrm{tgt}}^{(\mathrm{corr},\,i)})$ **then**
33:    $a \leftarrow a + 1$; append $\hat{c}_i$ to $\mathbf{y}_{\mathrm{out}}$; **break**  ▷ correction branch terminates the cycle by construction
34:   **else**
35:    **break**
36:   **end if**
37:  **end for**
38:  Append $\arg\max \mathbf{p}_{\mathrm{tgt}}^{(a)}$ to $\mathbf{y}_{\mathrm{out}}$  ▷ bonus token from target's own next-token prediction
39:  Commit $\mathbf{y}_{\mathrm{out}}$ to $\mathbf{y}$; $s \leftarrow s + |\mathbf{y}_{\mathrm{out}}|$
40:  Roll back $\mathrm{KV}_{\mathcal{M}}, \mathrm{KV}_{\mathcal{A}_{\mathrm{DM}}}, \mathrm{KV}_{\mathcal{A}_{\mathrm{SV}}}$ to length $s$ along the accepted path
41: **end while**
42: **return** $\mathbf{y}_{0:s}$
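To make the control flow concrete, the following is a minimal Python sketch of one Cats cycle under greedy acceptance. It is an illustration under simplifying assumptions, not the released implementation: the `m_*` callables are hypothetical stand-ins for the target model's three layer groups, KV caches and tree-masked attention are elided, and after an accepted correction the bonus token is drawn from the main-branch distribution rather than the re-forwarded correction branch.

```python
import torch

def cats_cycle(m_to_dm, m_dm_to_sv, m_sv_to_total, lm_head,
               draft_adapter, sv_adapter, last_token, gamma=4):
    """One draft / shallow-verify / final-verify cycle (greedy Acc)."""
    # ----- Stage 1: draft gamma tokens with the DRAM-resident DM layers -----
    drafts, h_states = [], []
    tok = last_token
    for _ in range(gamma):
        h = m_to_dm(tok)                       # layers 1:L_DM
        h_states.append(h)
        tok = lm_head(draft_adapter(h)).argmax(dim=-1)
        drafts.append(tok)
    h_states.append(m_to_dm(tok))              # state of the last draft, for the bonus token

    # ----- Stage 2: shallow verification, all positions in parallel -----
    h_sv = m_dm_to_sv(torch.stack(h_states))   # layers L_DM+1:L_SV
    sv_tokens = lm_head(sv_adapter(h_sv)).argmax(dim=-1)
    corrections = {i: sv_tokens[i] for i in range(gamma)
                   if sv_tokens[i] != drafts[i]}   # side branches of the token tree

    # ----- Stage 3: final verification with the flash-streamed deep layers -----
    # (the full algorithm forwards the whole token tree under a tree mask)
    p_tgt = lm_head(m_sv_to_total(h_sv)).softmax(dim=-1)

    accepted = []
    for i, d in enumerate(drafts):
        top = p_tgt[i].argmax(dim=-1)
        if top == d:                           # main-branch token accepted
            accepted.append(d)
        elif i in corrections and top == corrections[i]:
            accepted.append(corrections[i])    # SV correction accepted instead
            break                              # correction branch ends the cycle
        else:
            break
    accepted.append(p_tgt[len(accepted)].argmax(dim=-1))   # bonus token
    return accepted
```

The sketch preserves the key property of the algorithm: the deep layers $L_{\mathrm{SV}}{+}1{:}L_{\mathrm{total}}$ are invoked once per cycle on all candidate positions, so the cost of streaming them from flash is amortized over every accepted token.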

## Appendix B Reduced KL Loss Computation Process

In this appendix we expand the construction of the *Reduced KL Loss* introduced in Section [4](https://arxiv.org/html/2605.11186#S4), Equation [2](https://arxiv.org/html/2605.11186#S4.E2). Recall the setup: we train the draft and shallow-verifier adapters by aligning their output distribution $q_{\text{draft}}(v \mid t)$ with the target-model distribution $p_{\text{target}}(v \mid t)$ across a set $M$ of unmasked token positions, where $v \in \{1, \ldots, V\}$ indexes vocabulary tokens and $t$ indexes positions. The standard full-vocabulary distillation loss in Equation [1](https://arxiv.org/html/2605.11186#S4.E1),

$$\mathcal{L}_{\text{full}} = -\frac{1}{|M|} \sum_{t \in M} \sum_{v=1}^{V} p_{\text{target}}(v \mid t) \cdot \log q_{\text{draft}}(v \mid t),$$

spreads supervision over all $V$ tokens, but the vast majority of vocabulary mass at any position $t$ sits in a long tail of near-zero probabilities that rarely determines acceptance. To concentrate the limited capacity of our compact adapters on the tokens that actually matter, we restrict the loss to the top-$K$ tokens of $p_{\text{target}}(\cdot \mid t)$ at each position. The construction proceeds in three steps.
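As a reference point for the sketches below, the full-vocabulary loss is a plain soft-label cross-entropy. The following is illustrative PyTorch, not the paper's released code; logit tensors are assumed to have shape `(|M|, V)`:

```python
import torch
import torch.nn.functional as F

def full_kd_loss(target_logits: torch.Tensor, draft_logits: torch.Tensor) -> torch.Tensor:
    """Eq. 1: soft-label cross-entropy over the full vocabulary, averaged over positions."""
    p_target = target_logits.softmax(dim=-1)        # p_target(v | t), shape (|M|, V)
    log_q = F.log_softmax(draft_logits, dim=-1)     # log q_draft(v | t)
    return -(p_target * log_q).sum(dim=-1).mean()   # average over positions in M
```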

##### Step 1: Select the top-$K$ indices.

At each unmasked position $t \in M$, identify the $K$ vocabulary tokens carrying the highest probability mass under the target distribution:

$$\mathcal{T}_K(t) = \operatorname*{arg\,top\text{-}K}_{v \in \{1, \ldots, V\}} \; p_{\text{target}}(v \mid t). \tag{5}$$

$\mathcal{T}_K(t)$ is a position-dependent support of size $|\mathcal{T}_K(t)| = K$ that captures the tokens with non-negligible probability of being emitted by the target model. In speculative decoding, only tokens within this set can plausibly be accepted, so any supervision spent outside $\mathcal{T}_K(t)$ is effectively wasted.
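In PyTorch this step is a single `torch.topk` call. The snippet below uses a random toy distribution and an illustrative value of $K$ (neither is taken from the paper):

```python
import torch

# Toy stand-in for the target distribution at |M| = 8 positions over V = 32000 tokens.
p_target = torch.randn(8, 32000).softmax(dim=-1)
K = 64                                                    # illustrative choice of K
topk_probs, topk_idx = torch.topk(p_target, K, dim=-1)    # T_K(t): per-position support
```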

##### Step 2: Renormalize over the focused support.

The masses $\{p_{\text{target}}(v \mid t)\}_{v \in \mathcal{T}_K(t)}$ in general do not sum to $1$. We renormalize them onto $\mathcal{T}_K(t)$ to obtain a valid probability distribution $\tilde{p}_{\text{target}}(v \mid t)$:

$$\tilde{p}_{\text{target}}(v \mid t) = \begin{cases} \dfrac{p_{\text{target}}(v \mid t)}{\sum_{v' \in \mathcal{T}_K(t)} p_{\text{target}}(v' \mid t)}, & v \in \mathcal{T}_K(t), \\[10pt] 0, & v \notin \mathcal{T}_K(t). \end{cases} \tag{6}$$

Equivalently, with an indicator function $\mathbf{1}[\cdot]$:

$$\tilde{p}_{\text{target}}(v \mid t) = \frac{p_{\text{target}}(v \mid t) \cdot \mathbf{1}[v \in \mathcal{T}_K(t)]}{\sum_{v' \in \mathcal{T}_K(t)} p_{\text{target}}(v' \mid t)}. \tag{7}$$

By construction $\sum_{v=1}^{V} \tilde{p}_{\text{target}}(v \mid t) = 1$, so $\tilde{p}_{\text{target}}(\cdot \mid t)$ is a proper distribution supported on $\mathcal{T}_K(t)$ and zero elsewhere.
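Continuing the sketch from Step 1, the renormalization of Eq. 6 is a row-wise division over the retained masses:

```python
# Step 2: renormalize so each row of retained masses sums to 1 (Eq. 6).
p_tilde = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
assert torch.allclose(p_tilde.sum(dim=-1), torch.ones(p_tilde.size(0)))
```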

##### Step 3: Cross-entropy on the focused support.

The Reduced KL Loss is the position-averaged cross-entropy between $\tilde{p}_{\text{target}}(\cdot \mid t)$ and the adapter distribution $q_{\text{draft}}(\cdot \mid t)$, evaluated only on $\mathcal{T}_K(t)$:

$$\mathcal{L}_{\text{top-}K} = -\frac{1}{|M|} \sum_{t \in M} \sum_{v \in \mathcal{T}_K(t)} \tilde{p}_{\text{target}}(v \mid t) \cdot \log q_{\text{draft}}(v \mid t). \tag{8}$$

Because $\tilde{p}_{\text{target}}(v \mid t) = 0$ for $v \notin \mathcal{T}_K(t)$, the inner sum is mathematically equivalent to summing over the entire vocabulary; in practice we materialize only the $K$ active terms per position to avoid the $V - K$ vanishing contributions. Up to a constant entropy term $H(\tilde{p}_{\text{target}}(\cdot \mid t))$ that does not depend on the adapter parameters, $\mathcal{L}_{\text{top-}K}$ coincides with the position-averaged KL divergence

$$\frac{1}{|M|} \sum_{t \in M} \mathrm{KL}\left(\tilde{p}_{\text{target}}(\cdot \mid t) \,\big\|\, q_{\text{draft}}(\cdot \mid t)\right),$$

which is why we call it the *Reduced* KL Loss: it is a KL divergence taken over a reduced support $\mathcal{T}_K(t)$ rather than the full vocabulary $V$.
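Putting the three steps together, a complete illustrative sketch of the loss (again hypothetical PyTorch with assumed `(|M|, V)` logit shapes; `gather` materializes only the $K$ active terms per position, matching Eq. 8):

```python
import torch
import torch.nn.functional as F

def reduced_kl_loss(target_logits: torch.Tensor,
                    draft_logits: torch.Tensor,
                    k: int = 64) -> torch.Tensor:
    """Top-K cross-entropy of Eq. 8; k = 64 is an illustrative default."""
    p_target = target_logits.softmax(dim=-1)
    topk_probs, topk_idx = torch.topk(p_target, k, dim=-1)        # Step 1 (Eq. 5)
    p_tilde = topk_probs / topk_probs.sum(dim=-1, keepdim=True)   # Step 2 (Eq. 6)
    log_q = F.log_softmax(draft_logits, dim=-1).gather(-1, topk_idx)
    return -(p_tilde * log_q).sum(dim=-1).mean()                  # Step 3 (Eq. 8)
```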

##### Effect on adapter training.

Compared with $\mathcal{L}_{\text{full}}$, restricting supervision to $\mathcal{T}_K(t)$ has two effects. First, it concentrates gradient signal on the candidates that drive acceptance, since tokens outside $\mathcal{T}_K(t)$ have $p_{\text{target}}$ values too small to be accepted regardless of how $q_{\text{draft}}$ allocates mass to them. Second, it stabilizes training under our compact draft and shallow-verifier sub-networks: the long tail of near-zero target probabilities no longer consumes adapter capacity. This is essential in our memory-limited setting, where the draft and SV sub-networks must remain small enough to fit in the DRAM budget while still producing high-acceptance candidates.
