llama.cpp ngram-based speculative decoding was merged

Reddit r/LocalLLaMA Tools

Summary

llama.cpp has merged ngram-based speculative decoding support, reporting roughly 0–50% speedups on tasks like coding, depending on draft acceptance patterns and tuning parameters such as the ngram size and the draft min/max.

[https://github.com/ggml-org/llama.cpp/pull/19493](https://github.com/ggml-org/llama.cpp/pull/19493) Some prompts get a speedup, others don't (cases with a low draft acceptance streak). Good working params depend on the task type and repetition patterns. For coding, I got roughly a 0%–50% speedup with these params: --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
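
For orientation, ngram-based speculation generally drafts tokens by matching the current suffix of the context against earlier text instead of running a separate draft model; the target model then verifies the whole draft in a single batch. The snippet below is a minimal sketch of that lookup idea only: the function name, the backward search, and the way the draft-min/draft-max limits are applied are assumptions made for illustration, not llama.cpp's actual implementation or its `ngram-mod` variant.

```python
# Rough sketch of ngram-lookup drafting (illustrative only; NOT llama.cpp's
# implementation). The last `ngram_size` tokens of the context are searched
# for earlier in the context, and the tokens that followed that match are
# proposed as a draft for the target model to verify in one batch.
from typing import List, Optional

def propose_draft(context: List[int], ngram_size: int = 24,
                  draft_min: int = 48, draft_max: int = 64) -> Optional[List[int]]:
    """Return a draft continuation found by ngram lookup, or None if no
    sufficiently long match exists."""
    if len(context) <= ngram_size:
        return None
    key = context[-ngram_size:]
    # Scan backwards for an earlier occurrence of the same ngram.
    for start in range(len(context) - ngram_size - 1, -1, -1):
        if context[start:start + ngram_size] == key:
            draft = context[start + ngram_size:start + ngram_size + draft_max]
            # Only hand the draft to the verifier if it is long enough to pay
            # for the extra verification pass.
            return draft if len(draft) >= draft_min else None
    return None

# Repetitive contexts (boilerplate code, logs) yield long, often-accepted
# drafts, which is why coding prompts tend to see the largest speedups.
tokens = list(range(100)) * 3
print(propose_draft(tokens, ngram_size=4, draft_min=8, draft_max=16))
```

When the context has little repetition, the lookup fails or the drafted tokens are rejected by the target model, which matches the post's observation that some prompts see no speedup at all.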

Similar Articles

vllm-project/vllm v0.20.0rc1

GitHub Releases Watchlist

vLLM 0.20.0rc1 ships with major enhancements to throughput, quantization, speculative decoding, and multi-hardware support for scalable LLM serving.

Faster LLM Inference via Sequential Monte Carlo

arXiv cs.CL

This paper proposes Sequential Monte Carlo Speculative Decoding (SMC-SD), a method that accelerates LLM inference by replacing token-level rejection in speculative decoding with importance-weighted resampling over draft particles. It reports a 2.36× speedup over standard speculative decoding and 5.2× over autoregressive decoding while keeping accuracy loss within about 3%.
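
Since the abstract's key move is swapping per-token accept/reject for resampling, the toy sketch below shows what "importance-weighted resampling over draft particles" can look like in general. Everything here (the particle representation, the weight update, multinomial resampling, the weight reset) is a generic sequential Monte Carlo illustration under assumed details, not the paper's actual SMC-SD algorithm.

```python
# Toy sketch of importance-weighted resampling over draft particles, in the
# spirit of the abstract; the paper's real algorithm differs in details
# (proposal, weight updates, and resampling schedule are assumptions here).
import numpy as np

rng = np.random.default_rng(0)

def resample_step(particles: np.ndarray, log_w: np.ndarray,
                  draft_logp: np.ndarray, target_logp: np.ndarray):
    """One SMC step: reweight each draft particle by the target-vs-draft
    likelihood of its newest token, then resample instead of rejecting."""
    # Importance weights accumulate the target/draft likelihood ratio.
    log_w = log_w + (target_logp - draft_logp)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Multinomial resampling: particles likely under the target model survive.
    idx = rng.choice(len(particles), size=len(particles), p=w)
    # Weights are reset to uniform (log 1 = 0) after resampling.
    return particles[idx], np.zeros(len(particles))

# Example with 4 particles, each holding one drafted token id.
particles = np.array([11, 42, 42, 7])
log_w = np.zeros(4)
draft_logp = np.log(np.array([0.25, 0.25, 0.25, 0.25]))   # draft model probs
target_logp = np.log(np.array([0.05, 0.60, 0.60, 0.10]))  # target model probs
particles, log_w = resample_step(particles, log_w, draft_logp, target_logp)
print(particles)  # mostly 42, since resampling pulls toward the target model
```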