llama.cpp speculative decoding was merged
Summary
llama.cpp has merged speculative decoding support, offering a 0–50% speedup on certain tasks such as coding, depending on draft acceptance patterns and on tuning parameters such as ngram size and the draft min/max token counts.
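For context, here is a minimal sketch of the greedy draft-and-verify loop behind speculative decoding; draft_model and target_model are hypothetical stand-ins rather than llama.cpp's actual API, and draft_max plays the role of the draft max parameter mentioned above.

```python
def speculative_step(target_model, draft_model, tokens, draft_max=16):
    """One greedy draft-and-verify step (conceptual sketch).

    draft_model(ctx)         -> next token id from the small model (cheap call)
    target_model(ctx, draft) -> list of len(draft) + 1 greedy picks from the
                                large model, one per draft position plus a
                                bonus token, computed in a single batched pass
    Both callables are hypothetical stand-ins, not llama.cpp's API.
    """
    # 1. Draft: the small model proposes up to draft_max tokens autoregressively.
    draft = []
    for _ in range(draft_max):
        draft.append(draft_model(tokens + draft))

    # 2. Verify: the large model scores every draft position in one forward pass.
    target_picks = target_model(tokens, draft)

    # 3. Accept the longest prefix where draft and target agree, then append the
    #    target's own pick at the first disagreement (or its bonus token).
    n_accept = 0
    while n_accept < len(draft) and draft[n_accept] == target_picks[n_accept]:
        n_accept += 1
    return tokens + draft[:n_accept] + [target_picks[n_accept]]
```

How many drafted tokens survive verification each step is what produces the quoted 0–50% range: highly predictable output such as boilerplate code accepts long runs, while free-form text often accepts only a token or two and erases the gain.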
Similar Articles
@_avichawla: Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy) Speculative decoding is quite an effe…
Researchers introduced DFlash, a technique using block diffusion models for speculative decoding that accelerates LLM inference by up to 8.5x without accuracy loss. It is already integrated with major frameworks like vLLM and SGLang.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
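The headline names the core idea, a low-rank factorization of the drafter's vocabulary projection, which can be sketched generically; the shapes, rank, and names below are illustrative assumptions, not SlimSpec's actual code.

```python
import numpy as np

# Illustrative low-rank LM-head: logits = h @ B.T @ A.T instead of h @ W.T,
# where W (vocab x d_model) is replaced by A (vocab x rank) @ B (rank x d_model).
# Cost drops from vocab * d_model to rank * (vocab + d_model) parameters and
# multiply-adds per token, while the full vocabulary is still scored.
vocab, d_model, rank = 32_000, 4_096, 256

A = (np.random.randn(vocab, rank) * 0.02).astype(np.float32)    # vocab x rank
B = (np.random.randn(rank, d_model) * 0.02).astype(np.float32)  # rank x d_model

def lm_head_lowrank(hidden):             # hidden: (batch, d_model)
    return (hidden @ B.T) @ A.T          # (batch, rank) -> (batch, vocab)

h = np.random.randn(4, d_model).astype(np.float32)
print(lm_head_lowrank(h).shape)                       # (4, 32000)
print(f"full head params: {vocab * d_model:,}")       # 131,072,000
print(f"low-rank params:  {rank * (vocab + d_model):,}")  # 9,240,576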
vllm-project/vllm v0.20.0rc1
vLLM 0.20.0rc1 releases with major throughput, quantization, speculative decoding, and multi-hardware support enhancements for scalable LLM serving.
Faster LLM Inference via Sequential Monte Carlo
This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles. It achieves a 2.36× speedup over standard speculative decoding and 5.2× over autoregressive decoding while limiting accuracy loss to 3%.
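The resampling primitive described in the summary can be sketched generically; this is a conceptual illustration of importance-weighted resampling over draft particles with assumed inputs, not the paper's SMC-SD algorithm.

```python
import numpy as np

def resample_particles(particles, logp_target, logp_draft, rng=None):
    """Importance-weighted resampling over draft particles (conceptual sketch).

    particles:   K candidate continuations produced by the draft model
    logp_target: log-probability of each continuation under the target model
    logp_draft:  log-probability of each continuation under the draft model
    """
    if rng is None:
        rng = np.random.default_rng()
    # Importance weights w_k = p_target(x_k) / p_draft(x_k), computed in log space.
    log_w = np.asarray(logp_target, dtype=float) - np.asarray(logp_draft, dtype=float)
    w = np.exp(log_w - log_w.max())          # subtract max for numerical stability
    w /= w.sum()
    # Multinomial resampling keeps the particle set distributed like the target
    # model instead of accepting or rejecting individual draft tokens.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return [particles[i] for i in idx]
```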
BeeLlama.cpp: advanced DFlash & TurboQuant with support for reasoning and vision. Qwen 3.6 27B Q5 with 200k context on a 3090, 2-3x faster than baseline (peak 135 tps!)
BeeLlama.cpp is a performance-focused fork of llama.cpp that introduces DFlash speculative decoding and TurboQuant KV-cache compression, enabling high-speed local inference of large models like Qwen 3.6 27B on consumer hardware.
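TurboQuant's format is not described here; as a generic illustration of what KV-cache compression buys, the sketch below rounds cached keys and values to int8 with one scale per token, the basic trade that shrinks a long-context cache enough to fit on consumer hardware. Every name and the int8 scheme are assumptions, not BeeLlama.cpp's implementation.

```python
import numpy as np

def quantize_kv(kv):
    """Round a float32 KV-cache block to int8 with one scale per token.

    kv: (n_tokens, n_heads, head_dim) float32. Purely illustrative of KV-cache
    compression in general, not TurboQuant's actual scheme.
    """
    scale = np.abs(kv).max(axis=(1, 2), keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)                  # avoid divide-by-zero
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv(q, scale):
    return q.astype(np.float32) * scale

kv = np.random.randn(1024, 32, 128).astype(np.float32)         # ~16 MiB of cache
q, s = quantize_kv(kv)
print(q.nbytes / kv.nbytes)                                     # 0.25: 4x smaller than f32
print(np.abs(dequantize_kv(q, s) - kv).max())                   # small rounding error
```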