
# Faster LLM Inference via Sequential Monte Carlo

Source: https://arxiv.org/html/2604.15672

Yahya Emara∗1,2 Mauricio Barba da Costa∗3 Chi-Chih Chang1 Cameron Freer3 Tim Vieira4 Ryan Cotterell4 Mohamed S. Abdelfattah1,2

1Cornell University 2Makora 3MIT 4ETH Zürich

###### Abstract

Speculative decoding (SD) accelerates language model inference by drafting tokens from a cheap proposal model and verifying them against an expensive target model via rejection sampling. Because rejection truncates the draft block at the first error, throughput degrades when draft and target diverge. Rather than rejecting draft tokens outright, we propose to *reweight* them. To this end, we introduce sequential Monte Carlo speculative decoding (SMC-SD), which replaces token-level rejection with importance-weighted resampling over a population of draft particles. SMC-SD is a principled *approximate* inference scheme that trades exactness for additional speed, while preserving theoretical bounds on its per-step approximation error. Because LLM inference is memory bandwidth-bound, the arithmetic needed to draft particles and to score them in parallel comes nearly for free—SMC-SD uses idle compute to turn verification into a vectorized, fixed-size operation with no rollback. Empirically, SMC-SD achieves 2.36× speedup over speculative decoding and a 5.2× speedup over autoregressive decoding, while remaining within 3% of the target model's accuracy on reasoning, instruction-following, and coding benchmarks.

https://github.com/abdelfattah-lab/smcsd

## 1 Introduction

Autoregressive generation from neural language models is inherently sequential: each token depends on *all* previous tokens, so generating a single sequence requires as many serial forward passes through the model as there are tokens in the output. This sequential bottleneck is the primary obstacle to faster inference.

Figure 1: Speed-up of SMC-SD with a Llama 1B → 70B draft–target pair relative to the autoregressive baseline, optimized tree-based SD (SGLang), and Speculative Speculative Decoding (SSD; Kumar et al., 2026) on the ShareGPT dataset. AR, SGLang SD, and SMC-SD run on 4 H100 GPUs; SSD runs on 5 H100 GPUs.

Speculative decoding (SD; Leviathan et al., 2023) addresses this bottleneck by amortizing the cost of target-model calls. At its core, SD is a rejection sampler (Chen et al., 2023): a small *draft* model—chosen to be significantly cheaper to evaluate—proposes K tokens, and the larger *target* model verifies all K tokens in a single forward pass, accepting a prefix and rejecting the rest. This rejection criterion ensures that the resulting strings are distributed *exactly* according to the target, while the speed-up factor is stochastic, depending on the alignment of the draft and target models.
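
As a point of reference for what SMC-SD replaces, here is a minimal sketch of that token-level accept/reject step (Leviathan et al., 2023; Chen et al., 2023). Names and shapes are illustrative, and the bonus token sampled from the target when all K drafts are accepted is omitted:

```python
import numpy as np

def sd_verify(draft_probs, target_probs, draft_tokens, rng):
    """Token-level rejection step of speculative decoding.

    draft_probs, target_probs: (K, V) arrays holding q(. | prefix) and
        p(. | prefix) at each of the K draft positions.
    draft_tokens: length-K int array sampled autoregressively from q.
    Returns the accepted prefix, with a corrected token on first rejection.
    """
    for k, x in enumerate(draft_tokens):
        # Accept token x with probability min(1, p(x) / q(x)).
        if rng.uniform() < min(1.0, target_probs[k, x] / draft_probs[k, x]):
            continue
        # First rejection truncates the block: sample a replacement from
        # the residual distribution (p - q)+ so outputs match p exactly.
        residual = np.maximum(target_probs[k] - draft_probs[k], 0.0)
        residual /= residual.sum()
        return list(draft_tokens[:k]) + [rng.choice(len(residual), p=residual)]
    return list(draft_tokens)  # all K accepted
```

Throughput degrades precisely because this loop exits at the first rejection: everything drafted after position k is discarded.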

This paper takes a fundamentally different approach to accelerating autoregressive generation. Rather than searching for a faster exact sampler, we devise an *approximate* sampling scheme whose fidelity to the target can be traded off against speed. The key modification is replacing SD's token-level rejection step with importance-weighted resampling, making the verification step an instance of sequential Monte Carlo (SMC; Doucet et al., 2001; Del Moral et al., 2006; Naesseth et al., 2024). Our method *reverses* the guarantee of SD—approximation quality is stochastic and the speed-up factor is deterministic. We call our method sequential Monte Carlo speculative decoding (SMC-SD) and describe it concretely in §3.
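
Concretely, one round of SMC-SD can be pictured as sequential importance resampling over whole draft blocks. The sketch below is a generic rendering of that idea under illustrative names; the paper's actual weighting scheme and resampling schedule are specified in §3:

```python
import numpy as np

def smc_sd_round(particles, log_q, log_p, rng):
    """One generic importance-resampling round over N draft particles.

    particles: list of N candidate continuations (K tokens each), drafted
        in parallel from the proposal q.
    log_q, log_p: length-N arrays of the summed draft/target log-probs of
        each particle's K tokens; log_p comes from a single batched forward
        pass of the target, so verification is fixed-size with no rollback.
    """
    # Reweight instead of reject: w_n is proportional to p(block_n) / q(block_n).
    log_w = log_p - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Multinomial resampling keeps the population size at N while
    # concentrating on particles the target model scores highly.
    ancestors = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in ancestors], ancestors
```

Note the contrast with the rejection sketch above: every round performs the same amount of work regardless of how well draft and target agree.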

In §3.1, we derive a closed-form expression for the tokens-per-second (TPS) speed-up factor of SMC-SD using the roofline model of the GPU (Williams et al., 2009). We show that the roofline model gives a simple characterization of the speed-up factor in both the memory bandwidth-bound and compute-bound regimes. From a theory perspective, because SMC-SD is ultimately based on importance resampling, non-asymptotic bounds on its approximation error can be obtained using known techniques. In the case of SMC-SD, we show that both the L₂ bias and the mean squared error decay as 1/N in the number of particles N, while the L₁ bias decays as 1/√N, with constants governed by the χ²-divergence between the draft and target models (§3.2).
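
The paper's exact statements and constants live in §3.2 and are not reproduced here. For orientation only, the classical bound for self-normalized importance sampling (Agapiou et al., 2017, cited below) exhibits the same rates, where π is the target distribution, q the draft, and π̂_N the N-particle self-normalized estimate:

```latex
% Classical self-normalized importance sampling bound; the paper's own
% per-step statement for SMC-SD may differ in constants and conditions.
\[
  \sup_{\lVert f \rVert_\infty \le 1}
  \mathbb{E}\!\left[ \left( \hat{\pi}_N f - \pi f \right)^2 \right]
  \;\le\; \frac{4\rho}{N},
  \qquad
  \rho \;=\; 1 + \chi^2\!\left( \pi \,\middle\|\, q \right),
\]
so the mean squared error decays as $1/N$, and by Jensen's inequality the
$L_1$ error satisfies
\[
  \mathbb{E}\left| \hat{\pi}_N f - \pi f \right|
  \;\le\; 2\sqrt{\rho / N},
\]
which decays as $1/\sqrt{N}$.
```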

From a systems perspective, SMC-SD presents a novel computation pattern that necessitates designing an inference engine around it. In §3.3, we use several observations at the intersection of hardware and algorithms to guide the design of our high-throughput SMC-SD inference engine. Both the drafting and verifying stages of SMC-SD use a vectorizable execution structure that maps naturally onto GPU hardware and enables greater parallelism than standard speculative decoding. For KV cache management, we observe that all KV cache data movement during the resampling step can be replaced with efficient in-place pointer exchanges, and we reduce the size of the cache by leveraging the shared prefix among particles.
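
A minimal sketch of the pointer-exchange idea, assuming a paged KV cache with a per-particle block table (all names hypothetical; this is not the paper's engine): resampling re-gathers rows of integer block indices instead of moving KV tensors.

```python
import numpy as np

def resample_block_table(block_table, ancestor_idx):
    """Hypothetical pointer-exchange resampling for a paged KV cache.

    After resampling, particle n must continue from the KV state of its
    ancestor. Rather than copying KV tensors, rewrite each particle's row
    of physical block indices, so the only bytes moved are small integers.

    block_table: (N, max_blocks) int32 array; row n lists the physical KV
        blocks read by particle n's attention (shared prompt-prefix blocks
        can appear in every row while being stored once).
    ancestor_idx: length-N array of ancestor indices from resampling.
    """
    # Fancy indexing materializes the gathered rows before the in-place
    # write-back, so duplicated ancestors are handled correctly.
    block_table[:] = block_table[ancestor_idx]
    # A production engine would also reference-count shared physical blocks
    # (copy-on-write) once duplicated ancestors diverge; omitted here.

# Toy usage: every particle inherits particle 1's cache.
table = np.arange(12, dtype=np.int32).reshape(3, 4)
resample_block_table(table, np.array([1, 1, 1]))
```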

Empirically, SMC-SD achieves up to 2.5× the throughput of an optimized implementation of speculative decoding on GSM8K (Cobbe et al., 2021), MATH500 (Lightman et al., 2023), AlpacaEval (Dubois et al., 2025), and DS1000 (Lai et al., 2022) across both the Llama (Llama Team, 2024) and Qwen (Yang et al., 2025) model families, while remaining within 3% of the target model's accuracy. In a multi-GPU setting, SMC-SD achieves a 2.36× speedup over state-of-the-art speculative decoding and 5.2× over the autoregressive baseline.

Figure 1 summarizes these end-to-end throughput gains.

## 2 Language Models and Speculative Decoding

### 2.1 Language Modeling Background

Let 𝒳 be an alphabet, i.e., a finite, non-empty set. We call elements of 𝒳 tokens. We write 𝒳* for the set of all finite strings over 𝒳, including the empty string ε, and denote individual tokens by x and strings by **x** = x₁⋯xₜ ∈ 𝒳*. We use the shorthand x₁:ₖ to denote the prefix of **x** up to position k.

- • Constrained generation (Ψ = 𝟙[x ∈ 𝒜]). Hard constraints—syntactic validity, format compliance, length bounds—can be enforced while preserving the target's distribution conditioned on the constraint. Prior work (Lew et al., 2023; Loula et al., 2025; Xefteri et al., 2025) has shown that such constraints can improve accuracy on coding benchmarks; incorporating them into SMC-SD could simultaneously increase downstream performance while accelerating inference.

- • Reward-weighted decoding (Ψ = exp(β⋅R(x)), targeting π ∝ p⋅e^βR). This targets the KL-regularized optimal policy arising in RLHF without fine-tuning, which is relevant for agentic settings where the same base model must be steered toward different objectives. Standard speculative decoding cannot express either of these targets: its rejection test requires the normalizing constant of the target, which is unavailable for this class of distributions (see the sketch after this list).
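
To make the contrast concrete, here is a minimal sketch (hypothetical names, not the paper's code) of how a twist Ψ enters self-normalized particle weights: the unknown normalizing constant of π ∝ p⋅Ψ cancels when the weights are normalized, which is exactly what rejection sampling cannot exploit.

```python
import numpy as np

def twisted_log_weight(log_p, log_q, log_psi):
    """Log importance weight of a particle for a twisted target
    pi(x) proportional to p(x)*Psi(x), proposed from the draft q:
    log p + log Psi - log q. The target's normalizing constant is never
    needed, because it cancels when weights are normalized below."""
    return log_p + log_psi - log_q

# Example: reward-weighted decoding, Psi = exp(beta * R(x)), 3 particles.
beta = 2.0
log_p = np.array([-5.0, -4.2, -6.1])   # target log-probs of each particle
log_q = np.array([-4.8, -5.0, -5.5])   # draft log-probs of each particle
rewards = np.array([0.1, 0.9, 0.4])    # R(x) for each particle
log_w = twisted_log_weight(log_p, log_q, beta * rewards)
w = np.exp(log_w - log_w.max())
w /= w.sum()                            # self-normalized weights
# For a hard constraint, use log Psi(x) = 0 if x is valid, else -np.inf.
```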

On the systems side, much work remains to achieve the theoretical roofline-model speedup of SMC-SD demonstrated in Figure 3. First, there are opportunities to overlap resampling with drafting and remove its performance overhead altogether. This can be done by preemptively speculating the outcome of the ESS threshold test and beginning the next round of drafting accordingly. Furthermore, the draft and target models can be disaggregated across hardware, with drafting and scoring calls dispatched asynchronously. Second, by allowing N and K to vary dynamically, SMC-SD can adapt to the underlying hardware and workload characteristics, enabling runtime exploration of its tunable Pareto frontier. Finally, SMC-SD reflects a trend in co-designing inference algorithms with asymmetric hardware scaling, in which compute grows faster than memory bandwidth (Zadouri et al., 2026). For instance, the Blackwell GPU architecture significantly increases FLOPS over the previous generation of Hopper GPUs while keeping memory bandwidth approximately equal; this increased compute budget makes it an ideal target for SMC-SD.
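
The ESS threshold test mentioned above is the standard sequential Monte Carlo criterion for deciding when to resample. A minimal sketch, where threshold_frac is an assumed tuning parameter rather than a value from the paper:

```python
import numpy as np

def should_resample(log_w, threshold_frac=0.5):
    """Resample only when the effective sample size (ESS) of the current
    particle weights, (sum w)^2 / sum w^2, falls below a fraction of N.
    Speculating this test's outcome, as discussed above, is what would let
    resampling overlap with the next drafting round."""
    w = np.exp(log_w - np.max(log_w))
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)   # equals (sum w)^2 / sum w^2 for normalized w
    return ess < threshold_frac * len(w)
```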

## Author Contributions

- • Yahya Emara: Lead system design/inference engine development, experiment design, formal analysis (arithmetic intensity), visualization, writing.

- • Mauricio Barba da Costa: Research conception, formal analysis (speed-up, arithmetic intensity, approximation error), experiment design (prototype, power sampling), visualization, writing.

- • Chi-Chih Chang: System design/inference engine development, visualization, writing.

- • Cameron Freer: Senior project leadership, project advising and mentorship, technical advice.

- • Tim Vieira: Senior project leadership, project advising and mentorship, technical advice.

- • Ryan Cotterell: Formal analysis (approximation error), senior project leadership, project advising and mentorship, technical advice, writing.

- • Mohamed Abdelfattah: Senior project leadership, formal analysis (speed-up, arithmetic intensity), system design, project narrative development, project advising and mentorship, technical advice, visualization, writing.

## References

- S. Agapiou, O. Papaspiliopoulos, D. Sanz-Alonso, and A. M. Stuart (2017) Importance sampling: intrinsic dimension and computational cost. Statistical Science 32, pp. 405–431. Cited by: §3.2.

- S. Azizi, E. B. Potraghloo, M. Ahmadi, S. Kundu, and M. Pedram (2026) Power-SMC: low-latency sequence-level power sampling for training-free LLM reasoning. arXiv:2602.10273. Cited by: Appendix F.

- G. Bachmann, S. Anagnostidis, A. Pumarola, M. Georgopoulos, A. Sanakoyeu, Y. Du, E. Schönfeld, A. Thabet, and J. Kohler (2025) Judge decoding: faster speculative sampling requires going beyond model alignment. arXiv:2501.19309. Cited by: §5.

- T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao (2024) Medusa: simple LLM inference acceleration framework with multiple decoding heads. arXiv:2401.10774. Cited by: §5.

- C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper (2023) Accelerating large language model decoding with speculative sampling. arXiv:2302.01318. Cited by: Appendix A, §1, §2.2.

- J. Chen, Y. Liang, and Z. Liu (2026) DFlash: block diffusion for flash speculative decoding. arXiv:2602.06036. Cited by: §5.

- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv:2110.14168. Cited by: §1, §4.

- P. Del Moral, A. Doucet, and A. Jasra (2006) Sequential Monte Carlo samplers. Journal of the Royal Statistical Society Series B: Statistical Methodology 68(3), pp. 411–436. Cited by: §1, §3.2, §3.

- A. Doucet, N. De Freitas, N. J. Gordon, et al. (2001) Sequential Monte Carlo methods in practice. Vol. 1, Springer. Cited by: §1, §3.

- Y. Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto (2025) Length-controlled AlpacaEval: a simple way to debias automatic evaluators. arXiv:2404.04475. Cited by: §1, §4.

- M. Holsman, Y. Huang, and B. Dhingra (2025) Fuzzy speculative decoding for a tunable accuracy-runtime tradeoff. arXiv:2502.20704. Cited by: §5.

- K. Huang, X. Guo, and M. Wang (2025) SpecDec++: boosting speculative decoding via adaptive candidate lengths. arXiv:2405.19715. Cited by: §5.

- J. H. Huggins (2014) An information-theoretic analysis of resampling in sequential Monte Carlo. Ph.D. Thesis, Massachusetts Institute of Technology.
