llama.cpp speculative decoding was merged
Summary
llama.cpp has merged speculative decoding support, offering a 0–50% speedup on certain tasks such as coding, depending on draft acceptance patterns and on tuning parameters such as ngram size and the draft min/max token counts.
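For context, here is a minimal sketch of the greedy draft-and-verify loop behind speculative decoding; draft_model and target_model are hypothetical stand-ins rather than llama.cpp's actual API, and draft_max plays the role of the draft max parameter mentioned above.

```python
def speculative_step(target_model, draft_model, tokens, draft_max=16):
    """One greedy draft-and-verify step (conceptual sketch).

    draft_model(ctx)         -> next token id from the small model (cheap call)
    target_model(ctx, draft) -> list of len(draft) + 1 greedy picks from the
                                large model, one per draft position plus a
                                bonus token, computed in a single batched pass
    Both callables are hypothetical stand-ins, not llama.cpp's API.
    """
    # 1. Draft: the small model proposes up to draft_max tokens autoregressively.
    draft = []
    for _ in range(draft_max):
        draft.append(draft_model(tokens + draft))

    # 2. Verify: the large model scores every draft position in one forward pass.
    target_picks = target_model(tokens, draft)

    # 3. Accept the longest prefix where draft and target agree, then append the
    #    target's own pick at the first disagreement (or its bonus token).
    n_accept = 0
    while n_accept < len(draft) and draft[n_accept] == target_picks[n_accept]:
        n_accept += 1
    return tokens + draft[:n_accept] + [target_picks[n_accept]]
```

How many drafted tokens survive verification each step is what produces the quoted 0–50% range: highly predictable output such as boilerplate code accepts long runs, while free-form text often accepts only a token or two and erases the gain.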
Similar Articles
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
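The headline names the core idea, a low-rank factorization of the drafter's vocabulary projection, which can be sketched generically; the shapes, rank, and names below are illustrative assumptions, not SlimSpec's actual code.

```python
import numpy as np

# Illustrative low-rank LM-head: logits = h @ B.T @ A.T instead of h @ W.T,
# where W (vocab x d_model) is replaced by A (vocab x rank) @ B (rank x d_model).
# Cost drops from vocab * d_model to rank * (vocab + d_model) parameters and
# multiply-adds per token, while the full vocabulary is still scored.
vocab, d_model, rank = 32_000, 4_096, 256

A = (np.random.randn(vocab, rank) * 0.02).astype(np.float32)    # vocab x rank
B = (np.random.randn(rank, d_model) * 0.02).astype(np.float32)  # rank x d_model

def lm_head_lowrank(hidden):             # hidden: (batch, d_model)
    return (hidden @ B.T) @ A.T          # (batch, rank) -> (batch, vocab)

h = np.random.randn(4, d_model).astype(np.float32)
print(lm_head_lowrank(h).shape)                       # (4, 32000)
print(f"full head params: {vocab * d_model:,}")       # 131,072,000
print(f"low-rank params:  {rank * (vocab + d_model):,}")  # 9,240,576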
vllm-project/vllm v0.20.0rc1
vLLM 0.20.0rc1 releases with major throughput, quantization, speculative decoding, and multi-hardware support enhancements for scalable LLM serving.
Faster LLM Inference via Sequential Monte Carlo
This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles. It achieves a 2.36× speedup over standard speculative decoding and 5.2× over autoregressive decoding while limiting accuracy loss to 3%.
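The resampling primitive described in the summary can be sketched generically; this is a conceptual illustration of importance-weighted resampling over draft particles with assumed inputs, not the paper's SMC-SD algorithm.

```python
import numpy as np

def resample_particles(particles, logp_target, logp_draft, rng=None):
    """Importance-weighted resampling over draft particles (conceptual sketch).

    particles:   K candidate continuations produced by the draft model
    logp_target: log-probability of each continuation under the target model
    logp_draft:  log-probability of each continuation under the draft model
    """
    if rng is None:
        rng = np.random.default_rng()
    # Importance weights w_k = p_target(x_k) / p_draft(x_k), computed in log space.
    log_w = np.asarray(logp_target, dtype=float) - np.asarray(logp_draft, dtype=float)
    w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
    w /= w.sum()
    # Multinomial resampling keeps the particle set distributed like the target
    # model instead of accepting or rejecting individual draft tokens.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]
```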
BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)
BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
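TurboQuant's format is not described here; as a generic illustration of what KV-cache compression buys, the sketch below rounds cached keys and values to int8 with one scale per token, the basic trade that shrinks a long-context cache enough to fit on consumer hardware. Every name and the int8 scheme are assumptions, not BeeLlama.cpp's implementation.

```python
import numpy as np

def quantize_kv(kv):
    """Round a float32 KV-cache block to int8 with one scale per token.

    kv: (n_tokens, n_heads, head_dim) float32. Purely illustrative of KV-cache
    compression in general, not TurboQuant's actual scheme.
    """
    scale = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)                  # avoid divide-by-zero
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(1024, 32, 128).astype(np.float32)         # ~16 MiB of cache
q, s = quantize_kv(kv)
print(q.nbytes / kv.nbytes)                                     # 0.25: 4x smaller than f32
print(np.abs(dequantize_kv(q, s) - kv).max())                   # small rounding error
```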