Tag
DeepSeek open-sourced DeepSpec, a full-stack codebase for training and evaluating draft models for speculative decoding, enabling 60-85% faster generation. It includes data preparation, training, and evaluation scripts with support for multiple draft model algorithms (DSpark, DFlash, Eagle3).
A GGUF conversion of MiniMax M3's EAGLE draft model for llama.cpp is now available, enabling speculative decoding speedups on compatible hardware.
Speculative decoding is an inference optimization in which a fast draft model proposes several future tokens and a larger target model verifies them in parallel, speeding up LLM generation. The article highlights its trending status on Papers with Code and a recent SGLang blog post reporting state-of-the-art latencies with DFlash models.
Discussion of different flavors of speculative decoding and an attempt to produce a Qwen-3.6-27b EAGLE-3 drafter for the community.
Proposes PPOW, a reinforcement learning framework for optimizing draft models in speculative decoding using window-level objectives and adaptive windowing, achieving significant speedups across multiple benchmarks.
SlimSpec introduces a low-rank parameterization for drafter LM-heads to accelerate speculative decoding in LLMs, achieving 4-5x speedup while maintaining full vocabulary support.
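For readers new to the topic, the draft-then-verify loop that these posts share can be sketched in a few lines of Python. This is a minimal greedy sketch under stated assumptions, not the method of any project listed above: `toy_target` and `toy_draft` are hypothetical stand-in functions rather than real models, and verification is written as a plain loop, whereas a real system would score all drafted positions in one batched forward pass of the target model.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        token = draft(ctx)
        proposal.append(token)
        ctx.append(token)

    # 2. The target checks each drafted position (sequential here for clarity).
    accepted: List[int] = []
    ctx = list(prefix)
    for token in proposal:
        expected = target(ctx)
        if expected != token:
            # First disagreement: emit the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every draft token matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: Model, draft: Model, prefix: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens with greedy speculative decoding."""
    out = list(prefix)
    while len(out) - len(prefix) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prefix) + n_new]


def greedy(target: Model, prefix: List[int], n_new: int) -> List[int]:
    """Baseline: ordinary one-token-at-a-time greedy decoding with the target."""
    out = list(prefix)
    for _ in range(n_new):
        out.append(target(out))
    return out


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] * 3 + 1) % 17


def toy_draft(ctx: List[int]) -> int:
    # Deliberately wrong whenever the last token is a multiple of 5.
    return (ctx[-1] * 3 + 1) % 17 if ctx[-1] % 5 else 0
```

In this greedy setting the output is the same as decoding with the target alone; the draft model only changes how many target calls are needed, which is where the speedup comes from when the draft is accurate.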