speculative-decoding

#speculative-decoding

EntMTP: Accelerating LLM Inference with Entropy Guided Multi Token Prediction

arXiv cs.CL · 8h ago

Proposes EntMTP, a training-free scheduler that adapts tree-based attention topologies for speculative decoding based on local entropy estimates, achieving 1.09-1.15x speedup over Hydra and up to 1.36x over Medusa.

@Hikari_07_jp: Progress report! Training of the DFlash backbone and markov head is complete, enabling DSpark to be used on 27B. We wil…

X AI KOLs Timeline · 12h ago

Progress update on DSpark: training of the DFlash backbone and Markov head is complete, enabling DSpark to be used on 27B. The next step is training the confidence head for adaptive drafting, which is expected to give an 8-14% speed improvement over DFlash.

Ornith-1.0-35B GGUF update: native MTP speculative-decode graft + full serving/TTFT/long-context numbers (llama.cpp, tp=1)

Reddit r/LocalLLaMA · 17h ago

An update on the Ornith-1.0-35B GGUF model introduces a native MTP speculative-decode graft for faster inference on a single GPU, achieving ~1.3-1.35x decode speedup while maintaining near-identical token distribution. Benchmark numbers for throughput, TTFT, and long-context performance across multiple quants are provided.

@SuJinYan123: Just 6 hours after DeepSeek open-sourced the Qwen DSpark weights, OpenInfer already has DSpark support running on RTX 5…

X AI KOLs Timeline · 18h ago

OpenInfer, a pure Rust+CUDA LLM inference engine, quickly added support for DeepSeek's DSpark speculative decoding technique on RTX 5090, achieving nearly 500 tok/s per user and scaling to ~2.4K aggregate tok/s, outperforming DFlash on non-random workloads.

DeepSpec - a deepseek-ai Collection

Reddit r/LocalLLaMA · 21h ago

DeepSeek AI released the DeepSpec collection on Hugging Face, featuring speculative decoding models (dspark, dflash, eagle3) based on Qwen3 and Gemma4 in various sizes (1B-3B).

@dzhulgakov: DSpark from @deepseek_ai ingeniously integrates many speculative decoding ideas to achieve 1.5x to 5x higher throughput…

X AI KOLs Following · yesterday

DSpark from DeepSeek AI integrates speculative decoding ideas to achieve 1.5x to 5x higher throughput in production systems. This thread explains 10 key ideas from the basics.

@charles_irl: it’s hot spec summer

X AI KOLs Timeline · yesterday

DeepSeek has open-sourced DeepSpec, a full-stack codebase for training and evaluating speculative decoding models.

@DeRonin_: DeepSeek just dropped a 5-page paper + free GitHub repo that makes any LLM respond 80% faster it's called speculative d…

X AI KOLs Following · yesterday

DeepSeek released a paper and an MIT-licensed open-source implementation of speculative decoding (DSpark) that speeds up LLM responses by up to 80%. A small 'guess' model drafts tokens and a large 'check' model verifies them, so the speedup comes without an accuracy tradeoff.
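
The guess-and-check idea in this summary can be sketched in a few lines. This is a generic illustration of speculative decoding with greedy acceptance, not DSpark's actual method; every name here (`speculative_step`, `generate`, `toy_draft`, `toy_target`) is hypothetical, and the two "models" are toy functions standing in for real networks.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # context -> greedy next token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round with greedy (argmax) acceptance."""
    # 1. The small "guess" model proposes k tokens autoregressively.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The large "check" model verifies the proposal. A real engine scores
    #    all k positions in one batched forward pass; this loop only mimics
    #    the outcome of that pass.
    accepted: List[Token] = []
    for guess in proposal:
        expected = target_next(prefix + accepted)
        if guess != expected:
            accepted.append(expected)  # first mismatch: keep the target's token
            return accepted
        accepted.append(guess)

    # 3. Every guess was accepted, so the check pass yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken,
             target_next: NextToken, n_tokens: int, k: int = 4) -> List[Token]:
    """Repeat draft-then-verify rounds until n_tokens have been produced."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + n_tokens]


# Toy stand-ins (assumptions for illustration only).
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10          # counts upward modulo 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10   # wrong after a 4
```

With greedy acceptance, the output matches what the check model would have produced decoding alone. The gain comes from accepting several tokens per check pass whenever the guesses agree.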

DeepSeek open-sources inference optimizations with 60–85% faster generation [pdf]

Hacker News Top · 2d ago

DeepSeek open-sourced DeepSpec, a full-stack codebase for training and evaluating draft models for speculative decoding, enabling 60-85% faster generation. It includes data preparation, training, and evaluation scripts with support for multiple draft model algorithms (DSpark, DFlash, Eagle3).

@danielhanchen: DeepSeek just released DSpark for V4 Flash & Pro, a new speculative decoding method boosting throughput by 51% to 400%!…

X AI KOLs Timeline · 2d ago

DeepSeek released DSpark, a speculative decoding method that boosts throughput by 51% to 400% for V4 Flash & Pro, along with the open-source DeepSpec codebase for training and evaluating draft models.

deepseek-ai/DeepSeek-V4-Pro-DSpark

Hugging Face Models Trending · 2d ago

DeepSeek releases preview versions of its V4 series, including DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated), both supporting a one-million-token context and featuring hybrid attention, manifold-constrained hyper-connections, and a Muon optimizer.

Made an interactive explainer about speculative decoding/MTP

Reddit r/LocalLLaMA · 3d ago

An interactive guide explaining speculative decoding and multi-token prediction in LLMs, covering techniques from rejection sampling to MTP used in Qwen 3.6 and Gemma 4, with live diagrams and sliders.

@FinanceYF5: Next token prediction is short-sighted. What if the Transformer learns to predict its own next hidden state? Jayden Teoh proposes Next-Latent Prediction (NextLat): a self-supervised learning method that teaches the Transformer to form...

X AI KOLs Following · 3d ago

Jayden Teoh proposes Next-Latent Prediction (NextLat), a self-supervised learning method that teaches the Transformer to predict its own next hidden state, forming a compact world model for reasoning and planning. It achieves up to 3.3x inference speedup through self-speculative decoding.

[Research] JetSpec: Speculative Decoding with Parallel Tree Drafting Enables up to 9.64x Lossless LLM Inference Speedup with more than 1000TPS

Reddit r/LocalLLaMA · 3d ago

JetSpec introduces parallel tree drafting for speculative decoding, achieving up to 9.64x end-to-end speedup on LLM inference while remaining lossless, with throughput of more than 1000 TPS on a single B200 GPU.

Dustin: Draft-Augmented Sparse Verification for Efficient Long-Context Generation with Speculative Decoding

arXiv cs.CL · 4d ago

Dustin introduces a sparse verification framework for speculative decoding that leverages draft model signals and sparse attention head scoring to overcome the KV cache verification bottleneck, achieving up to 27.85x speedup in self-attention and 9.17x end-to-end decoding speedup on long-context tasks with negligible accuracy loss.

JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

Hugging Face Daily Papers · 4d ago

JetSpec is a speculative decoding framework that combines efficient forward drafting with causal conditioning to improve LLM inference speed and acceptance rates, achieving up to 9.64x speedup on MATH-500 and 4.58x on conversational workloads.

Got GLM-5.2 + MTP speculative decode running on 4× DGX Spark (GB10) — and the build piece the public recipe is missing

Reddit r/LocalLLaMA · 4d ago

The author ran GLM-5.2 with MTP speculative decoding on a 4× DGX Spark (GB10) setup and describes the build piece that the public recipe is missing.

@Hikari_07_jp: I got DeepSeek-V4-Flash MTP speculative decoding actually working on 2× RTX PRO 6000 +38% single-stream throughput. It …

X AI KOLs Timeline · 4d ago

The author got DeepSeek-V4-Flash MTP speculative decoding working on 2× RTX PRO 6000, gaining 38% single-stream throughput, by fixing a mis-routed quantization format issue.

@charles_irl: dflash go brr

X AI KOLs Timeline · 5d ago

NVIDIA announces DFlash, an open source block diffusion model for speculative decoding that achieves up to 15x higher inference throughput on Blackwell GPUs while maintaining interactivity.

Tmax-27b - a Qwen3.6-27b terminal agent for small GPUs trained with DPPO (RL)

Reddit r/LocalLLaMA · 5d ago

Ai2 released Tmax-27B, a terminal-agent LLM trained with DPPO (RL) on Qwen3.6-27B. The author provides importance-matrix-calibrated GGUF quantizations, with a grafted MTP draft head for speculative decoding, that achieve competitive performance on agentic benchmarks even at very low bit-widths.
