@karminski3: DeepSeek truly excels in both cost-effectiveness and technology... Some classmates don't understand what DSpark is, so here's a quick tutorial. Speculative decoding is a technique to improve the output speed of large models. The essence is to let a small model generate text for the large model to check. Because currently...
Summary
DeepSeek proposes the DSpark technique, which implements speculative decoding by inserting a mini Transformer after the Final RMSNorm, boosting large model output speed by 60%-85%.
Cached at: 06/30/26, 07:36 AM
DeepSeek really delivers a deadly combo of cost-effectiveness and performance…
Some of you may be wondering what DSpark is — let me give you a quick tutorial.
Speculative decoding (also called draft decoding) is a technique for speeding up large language model inference. The basic idea is to let a small model guess the next few tokens, and then have the large model verify whether those guesses are correct. Current models are typically bottlenecked by memory bandwidth while GPU compute is plentiful, so the large model’s prefill (reading many tokens in parallel) is much faster per token than its decode (generating tokens one at a time). If the small model drafts a continuation, the large model only has to verify it, and verification is essentially a prefill over the drafted tokens. Every guess that turns out to be correct is a token produced at prefill speed, so a single large-model pass can yield several tokens instead of one, which significantly raises throughput.
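To make the draft-then-verify idea above concrete, here is a minimal sketch of one speculative decoding loop with greedy decoding. It is a toy, not DeepSeek's implementation: `toy_target`, `toy_draft`, `speculative_step`, and `generate` are hypothetical names, the "models" are plain functions over integer tokens, and the verification loop stands in for what a real system would do in one batched forward pass.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical large model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical small model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(target_next: NextTokenFn, draft_next: NextTokenFn,
                     context: List[Token], k: int) -> List[Token]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1) The small model drafts k tokens one after another (cheap).
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(context + draft))

    # 2) The large model checks each drafted position. A real system scores
    #    all positions in a single prefill-style pass; we loop for clarity.
    accepted: List[Token] = []
    for i in range(k):
        expected = target_next(context + draft[:i])
        if draft[i] != expected:
            accepted.append(expected)  # first mismatch: use the target's token
            return accepted
        accepted.append(draft[i])

    # 3) Every draft matched, so the same pass also yields one extra token.
    accepted.append(target_next(context + draft))
    return accepted


def generate(target_next: NextTokenFn, draft_next: NextTokenFn,
             prompt: List[Token], n_tokens: int, k: int = 4) -> List[Token]:
    """Generate n_tokens after the prompt using repeated speculative steps."""
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        out += speculative_step(target_next, draft_next, out, k)
    return out[:len(prompt) + n_tokens]


def greedy(target_next: NextTokenFn, prompt: List[Token], n_tokens: int) -> List[Token]:
    """Plain one-token-per-step decoding, used as the reference."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out
```

In this toy setup `generate` returns exactly what `greedy` returns, because a drafted token is only kept when the target model agrees with it. The draft model affects how many tokens each step yields, not which tokens come out.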
But here’s the catch: an external draft model also needs to prefill, consumes VRAM, and competes for memory bandwidth. Is there a smarter way? That’s where DSpark comes in.
Take a look at the diagram below (left side is the DeepSeek‑V4 architecture from @rasbt). DSpark is inserted right after the Final RMSNorm layer. Instead of a full‑blown draft model, it’s a tiny 3‑layer MTP (Multi‑Token Prediction) transformer stack.
After the large model has run through its 60+ layers, the hidden state of the current token (a feature vector, the model’s “high‑level concept” of what comes next) comes out of the Final RMSNorm. Before this vector gets translated into concrete text, DSpark intercepts it:
First, it does semi‑autoregressive fast drafting (MTP + Markov Head). DSpark has its own small set of parameters, instantly predicts 5 tokens (as feature vectors) in parallel, then uses its own internal serial network to smooth out the logic. (Note: parallel first, then serial — this avoids the logical incoherence that pure parallel generation would cause.)
Then, it applies a confidence prediction head to estimate how accurate its own guesses are. For example, if the last 2 out of 5 are uncertain, it simply cuts them off to avoid wasting the large model’s compute on verification.
Finally, the remaining 3 tokens are fed back through the vocabulary projection layer, translating the vectors into actual token IDs. That’s all DSpark does.
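The confidence cut-off in the three steps above can be sketched in a few lines. This is only an illustration under assumptions: the function name `truncate_by_confidence`, the fixed `threshold`, and the rule of cutting at the first low-confidence position are all hypothetical, since the post only says that uncertain trailing drafts are dropped. Cutting at the first uncertain token is a natural choice because each draft builds on the ones before it, so nothing after a rejected token can be accepted anyway.

```python
from typing import List, Sequence


def truncate_by_confidence(draft_ids: Sequence[int],
                           confidences: Sequence[float],
                           threshold: float = 0.5) -> List[int]:
    """Keep the longest draft prefix whose confidence stays >= threshold.

    draft_ids and confidences are parallel sequences: confidences[i] is the
    (hypothetical) confidence head's estimate that draft_ids[i] is correct.
    """
    if len(draft_ids) != len(confidences):
        raise ValueError("draft_ids and confidences must have the same length")
    kept: List[int] = []
    for token_id, confidence in zip(draft_ids, confidences):
        if confidence < threshold:
            break  # drop this token and everything after it
        kept.append(token_id)
    return kept


# Mirrors the example in the text: 5 drafts, the last 2 are uncertain.
drafts = [101, 102, 103, 104, 105]          # made-up token IDs
scores = [0.95, 0.90, 0.80, 0.40, 0.30]     # made-up confidence values
kept = truncate_by_confidence(drafts, scores)  # -> [101, 102, 103]
```

Only the 3 kept tokens would then be sent to the large model for verification, so no compute is spent checking the two drafts that were likely to be rejected.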
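After that, the large model checks DSpark’s draft in a single pass (prefill only, no token‑by‑token decode). If the draft is correct, all of its tokens are accepted at once: instead of generating one token per step, the model now outputs 3 tokens in one go! If a guess is wrong, the usual speculative decoding rule applies: the tokens before the first mismatch are kept and the rest are discarded.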
The key point: speculative decoding does not reduce output quality, because every drafted token is verified by the large model before it is accepted, and it can boost speed by 60-85%. Previously you’d hire a small model to write the draft; now it’s like implanting a chip directly into the brain.
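For a rough feel of where such speedups come from, the sketch below computes the expected number of tokens produced per large-model pass. It relies on a simplifying assumption that is not from the post: each drafted token is accepted independently with the same probability `accept_rate`, and each pass also yields one token from the large model itself (the correction or extra token), as in standard speculative decoding. The function name and the sample values are hypothetical, and the result ignores drafting and verification overhead, so it is an upper bound on the gain rather than a prediction of DSpark's numbers.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens per verification pass under i.i.d. acceptance.

    The pass always yields 1 token from the large model, plus the accepted
    draft prefix: 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)  # every draft accepted, plus one extra token
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Print a small table for a few illustrative (made-up) acceptance rates.
    for rate in (0.3, 0.5, 0.7, 0.9):
        tokens = expected_tokens_per_pass(rate, draft_len=3)
        print(f"accept_rate={rate:.1f}  expected tokens per pass={tokens:.3f}")
```

Under this toy model, an average of roughly 1.6 to 1.85 tokens per pass would correspond to a 60-85% throughput gain only if the extra cost of drafting and of verifying several tokens at once were negligible.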
As of now, SGLang already has a PR for this feature (29538), and DeepSeek has just released a bunch of DSpark‑modified small models on their HuggingFace page. I’ll go out on a limb and predict that future models will come with DSpark built‑in.
#dspark #deepseek #speculative_decoding #draft_decoding