Tag
Proposes EntMTP, a training-free scheduler that adapts tree-based attention topologies for speculative decoding based on local entropy estimates, achieving 1.09-1.15x speedup over Hydra and up to 1.36x over Medusa.
Progress update on DSpark: training of the DFlash backbone and Markov head is complete, enabling use on 27B. Next is training the confidence head for adaptive drafting, which is expected to give an 8-14% speed improvement over DFlash.
An update on the Ornith-1.0-35B GGUF model introduces a native MTP speculative-decode graft for faster inference on a single GPU, achieving ~1.3-1.35x decode speedup while maintaining near-identical token distribution. Benchmark numbers for throughput, TTFT, and long-context performance across multiple quants are provided.
OpenInfer, a pure Rust+CUDA LLM inference engine, quickly added support for DeepSeek's DSpark speculative decoding technique on RTX 5090, achieving nearly 500 tok/s per user and scaling to ~2.4K aggregate tok/s, outperforming DFlash on non-random workloads.
DeepSeek AI released the DeepSpec collection on Hugging Face, featuring speculative decoding models (dspark, dflash, eagle3) based on Qwen3 and Gemma4 in various sizes (1B-3B).
DSpark from DeepSeek AI integrates speculative decoding ideas to achieve 1.5x to 5x higher throughput in production systems. This thread explains 10 key ideas, starting from the basics.
DeepSeek has open-sourced DeepSpec, a full-stack codebase for training and evaluating speculative decoding models.
DeepSeek released a paper and an MIT-licensed open-source implementation of DSpark, a speculative decoding method that speeds up LLM responses by up to 80% by pairing a small 'guess' model with a large 'check' model, gaining speed without giving up accuracy.
DeepSeek open-sourced DeepSpec, a full-stack codebase for training and evaluating draft models for speculative decoding, enabling 60-85% faster generation. It includes data preparation, training, and evaluation scripts with support for multiple draft model algorithms (DSpark, DFlash, Eagle3).
DeepSeek released DSpark, a speculative decoding method that boosts throughput by 51% to 400% for V4 Flash & Pro, along with the open-source DeepSpec codebase for training and evaluating draft models.
DeepSeek releases preview versions of its V4 series, including DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated), both supporting a one-million-token context and featuring hybrid attention, manifold-constrained hyper-connections, and a Muon optimizer.
An interactive guide explaining speculative decoding and multi-token prediction in LLMs, covering techniques from rejection sampling to MTP used in Qwen 3.6 and Gemma 4, with live diagrams and sliders.
Jayden Teoh proposes Next-Latent Prediction (NextLat), a self-supervised learning method that teaches a Transformer to predict its next hidden state, thereby forming a compact world model for reasoning and planning, and achieves up to 3.3x inference speedup through self-speculative decoding.
JetSpec introduces parallel tree drafting for speculative decoding, achieving up to 9.64x end-to-end speedup on LLM inference while maintaining lossless accuracy, with throughput reaching ~1000 TPS on a single B200 GPU.
Dustin introduces a sparse verification framework for speculative decoding that leverages draft model signals and sparse attention head scoring to overcome the KV cache verification bottleneck, achieving up to 27.85x speedup in self-attention and 9.17x end-to-end decoding speedup on long-context tasks with negligible accuracy loss.
JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates, achieving up to 9.64x speedup on MATH-500 and 4.58x on conversational workloads.
The author successfully ran GLM-5.2 with MTP speculative decoding on a 4× DGX Spark (GB10) setup, revealing a missing component in the public build recipe.
The author got DeepSeek-V4-Flash MTP speculative decoding running on 2× RTX PRO 6000 with a 38% throughput increase by fixing a mis-routed quantization format issue.
NVIDIA announces DFlash, an open source block diffusion model for speculative decoding that achieves up to 15x higher inference throughput on Blackwell GPUs while maintaining interactivity.
Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B, and the author provides importance-matrix-calibrated GGUF quantizations that achieve competitive performance on agentic benchmarks even at very low bit-widths, with a grafted MTP draft head for speculative decoding.
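
Several entries above describe speculative decoding as a small 'guess' (draft) model proposing tokens that a large 'check' (target) model verifies. A minimal sketch of that loop for greedy decoding is given below. It is not the method of DSpark, DFlash, Medusa, or any other system listed here: the names `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical, the "models" are toy functions over integers, and verification runs token by token for clarity, whereas real systems typically score all drafted tokens in one batched forward pass of the target.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, draft: Model, prefix: List[int], k: int) -> List[int]:
    """Draft k tokens, then keep the longest run the target agrees with.

    Always returns at least one token: either the target's correction at the
    first mismatch, or a bonus target token if every drafted token is accepted.
    """
    # 1) Draft phase: the cheap model proposes k tokens autoregressively.
    drafted: List[int] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify phase: compare each drafted token with the target's own choice.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    accepted.append(target(ctx))  # all guesses accepted: one extra token for free
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + n_new]


def toy_target(ctx: List[int]) -> int:
    """Toy 'check' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy 'guess' model: agrees with the target except it never predicts 7."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 7 else nxt


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], 12))
```

In this greedy setting the output matches what the target would produce on its own, which is the sense in which such schemes are called lossless; the speedup comes from how many drafted tokens are accepted per target call. Sampling-based variants replace the equality check with a rejection-sampling acceptance test, which this sketch does not implement.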