@karminski3: DeepSeek truly excels in both cost-effectiveness and technology... Some classmates don't understand what DSpark is, so here's a quick tutorial. Speculative decoding is a technique to improve the output speed of large models. The essence is to let a small model generate text for the large model to check. Because currently...


Summary

DeepSeek proposes the DSpark technique, which implements speculative decoding by inserting a mini Transformer after the Final RMSNorm, boosting large model output speed by 60%-85%.

DeepSeek truly delivers on both cost-effectiveness and technology... Some readers don't know what DSpark is, so here's a quick tutorial.

Speculative decoding is a technique for speeding up the output of large models. The essence is to have a small model draft text for the large model, which then checks whether the draft is correct. Current models are generally memory-bandwidth bound while GPU compute is often underutilized, so the large model's prefill speed (reading tokens) is much faster than its decode speed (generating tokens). If a small model drafts a sequence along the large model's line of thinking and the large model then verifies it (just by reading tokens), every correct guess is produced at prefill speed, and token generation can be multiplied.

The problem is that an external small model also needs to prefill, and it consumes VRAM and memory bandwidth. Is there a better solution? That's where DSpark comes in. Look at my diagram (the DSv4 architecture diagram on the left is from @rasbt). DSpark is inserted at the Final RMSNorm stage. Instead of an entire small model, it's a 3-layer MTP (Multi-Token Prediction) mini Transformer stack. After the large model finishes its 60+ layers and pushes the "highly concentrated concept" (feature vector/hidden state) of the current sentence to the exit of the Final RMSNorm, before it has been translated into specific words, DSpark intercepts:

First, semi-autoregressive rapid speculation (MTP + Markov Head). DSpark has its own small set of parameters. It instantly guesses 5 words (as feature vectors) in parallel, then uses its internal serial network to make them logically consistent. (Note: parallel first, then serial, to eliminate the logical incoherence caused by parallelism.)

Then, a confidence prediction head predicts how accurate the guesses are. For example, if the last 2 of the 5 words are inaccurate, they are cut off directly to avoid wasting computation when the draft is sent back to the large model.

Finally, the retained 3 words are fed to the vocabulary mapping layer, which translates the vectors into tokens. At this point, DSpark's job is done.

Then the large model scans the DSpark output for correctness (only prefill, no decode). If it is correct, the tokens are output directly. Previously the model could only output one token at a time; now it can output three tokens at once!

Finally, speculative decoding does not degrade intelligence, and speed can be improved by 60%-85%! Previously we hired a small model to draft; now it's like implanting a chip directly into the brain.

Currently, SGLang already has a PR for this feature (29538), and DeepSeek has just released a bunch of DSpark-modified versions of small models on their HuggingFace page. I boldly predict: will future released models come with DSpark as standard? #dspark #deepseek #speculativedecoding

Cached at: 06/30/26, 07:36 AM

DeepSeek really delivers a deadly combo of cost-effectiveness and technology…

Some of you may be wondering what DSpark is — let me give you a quick tutorial.

Speculative decoding (also called draft decoding) is a technique to speed up the inference of large language models. The basic idea is to let a small model predict the next few tokens for the large model, and then have the large model verify whether those predictions are correct. Current models are typically bottlenecked by memory bandwidth while GPU compute is plentiful, so the large model's prefill speed (reading many tokens in one pass) is much faster than its decode speed (generating tokens one by one). When the small model drafts a continuation along the same reasoning path, the large model only needs to verify it, and verification processes all draft tokens in a single pass, which is essentially just prefill. Every draft token that turns out to be correct is therefore produced at prefill speed, so one step can yield several tokens and throughput rises significantly.
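The draft-then-verify idea can be sketched in a few lines. This is a toy illustration of generic greedy speculative decoding, not DSpark or any library's API: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and each "model" is just a function from a token prefix to its next token. The bonus token after a fully accepted draft follows the common formulation and is not something the post describes.

```python
from typing import Callable, List

# Hypothetical stand-in: a "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1) The draft model proposes k tokens one after another (cheap).
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(prefix + proposal))

    # 2) The target model checks every position. On real hardware this is one
    #    batched forward pass; the Python loop only mimics its result.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # fix the first wrong guess, drop the rest
            return accepted
        accepted.append(tok)

    # 3) Every guess matched, so the same pass also yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Toy 'large model': counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    """Toy 'small model': agrees with the target except when the answer is 5."""
    nxt = (tokens[-1] + 1) % 10
    return 0 if nxt == 5 else nxt
```

Because the target model has the final say at every position, the generated sequence is identical to what the target would have produced alone. Only the number of target passes changes.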

But here’s the catch: an external draft model also needs to prefill, consumes VRAM, and competes for memory bandwidth. Is there a smarter way? That’s where DSpark comes in.

Take a look at the diagram below (left side is the DeepSeek‑V4 architecture from @rasbt). DSpark is inserted right after the Final RMSNorm layer. Instead of a full‑blown draft model, it’s a tiny 3‑layer MTP (Multi‑Token Prediction) transformer stack.

After the large model has run through its 60+ layers, the "high-level concept" (feature vector / hidden state) of the current token comes out of the Final RMSNorm. Before this vector gets translated into concrete text, DSpark intercepts it:

First, it does semi-autoregressive fast drafting (MTP + Markov Head). DSpark has its own small set of parameters. It predicts 5 tokens (as feature vectors) in parallel, then runs them through its own internal serial network to make them consistent with one another. (Note: parallel first, then serial. This avoids the logical incoherence that pure parallel generation would cause.)

Then, it applies a confidence prediction head to estimate how accurate its own guesses are. For example, if the last 2 out of 5 are uncertain, it simply cuts them off to avoid wasting the large model’s compute on verification.
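The cut-off step can be illustrated with a small helper. This is only a sketch under stated assumptions: the function name `truncate_by_confidence`, the per-token scores, and the fixed `threshold` are all hypothetical, since the post does not say how DSpark's confidence head turns its scores into a cut-off decision.

```python
from typing import List, Sequence


def truncate_by_confidence(draft_tokens: Sequence[int],
                           confidences: Sequence[float],
                           threshold: float = 0.5) -> List[int]:
    """Keep the draft prefix that ends just before the first low-confidence token."""
    if len(draft_tokens) != len(confidences):
        raise ValueError("one confidence score per draft token is required")
    kept: List[int] = []
    for tok, conf in zip(draft_tokens, confidences):
        if conf < threshold:
            # Later tokens were drafted on top of this one, so drop the whole tail.
            break
        kept.append(tok)
    return kept
```

With five drafted tokens and scores such as 0.9, 0.8, 0.7, 0.3, 0.2, the helper keeps the first three, which matches the "cut the last 2 out of 5" example above.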

Finally, the remaining 3 tokens are fed back through the vocabulary projection layer, translating the vectors into actual token IDs. That’s all DSpark does.
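In code, "translating vectors into token IDs" amounts to scoring each hidden vector against every row of the vocabulary matrix and picking a winner. The sketch below assumes greedy (argmax) selection and a tiny hand-written matrix; `project_to_token_ids` and `unembedding` are illustrative names, and the post does not say whether DSpark picks tokens greedily or by sampling.

```python
from typing import List, Sequence


def project_to_token_ids(hidden_states: Sequence[Sequence[float]],
                         unembedding: Sequence[Sequence[float]]) -> List[int]:
    """Map each hidden vector to the ID of the vocabulary row with the highest logit."""
    token_ids: List[int] = []
    for h in hidden_states:
        # One logit per vocabulary entry: dot product of that row with the hidden vector.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembedding]
        # Greedy choice (an assumption): take the index of the largest logit.
        token_ids.append(max(range(len(logits)), key=logits.__getitem__))
    return token_ids


# Toy vocabulary of 3 tokens with 2-dimensional hidden states.
toy_unembedding = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_hidden = [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]]
draft_ids = project_to_token_ids(toy_hidden, toy_unembedding)
```

A real model does the same thing with one large matrix multiplication over a vocabulary of many thousands of entries.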

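After that, the large model checks DSpark's draft in one go (a single prefill-style pass, with no token-by-token decoding). If the draft is correct, it accepts all of the tokens directly: instead of generating one token per step, it now outputs 3 tokens at once! In standard speculative decoding, if only the first part of a draft is correct, that matching prefix is kept and the rest is discarded.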

The key point: because the large model verifies every drafted token, speculative decoding does not reduce output quality, and it can boost speed by 60-85%. Previously you'd hire a small model to write the draft; now it's like implanting a chip directly into the brain.
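To get a feel for where a figure like that can come from, here is a back-of-the-envelope estimate. It relies on a simplified model that is an assumption, not a DSpark measurement: each draft token is accepted independently with the same probability, every verification pass emits one extra token from the large model, and a verification pass costs about as much as one ordinary decode step. The function names and the `draft_cost_ratio` parameter are hypothetical.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    accept_prob, and that every pass emits one extra token from the target
    model (the correction after a miss, or a bonus token after a full match).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must lie in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + p + p^2 + ... + p^draft_len
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def estimated_speedup(accept_prob: float, draft_len: int,
                      draft_cost_ratio: float) -> float:
    """Speedup over plain decoding.

    draft_cost_ratio is the cost of producing one whole draft, expressed as a
    fraction of one target-model pass (an illustrative input, not a measurement).
    """
    return expected_tokens_per_step(accept_prob, draft_len) / (1.0 + draft_cost_ratio)
```

The estimate shows two levers: a higher acceptance rate raises the tokens gained per pass, and a cheaper drafter lowers the overhead. A small head that reuses the large model's hidden state targets the second lever.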

As of now, SGLang already has a PR for this feature (29538), and DeepSeek has just released a bunch of DSpark‑modified small models on their HuggingFace page. I’ll go out on a limb and predict that future models will come with DSpark built‑in.

#dspark #deepseek #speculative_decoding #draft_decoding

Similar Articles

deepseek-ai/DeepSeek-V4-Pro-DSpark

Hugging Face Models Trending

DeepSeek releases preview versions of its V4 series, including DeepSeek-V4-Pro (1.6T parameters, 49B activated) and DeepSeek-V4-Flash (284B parameters, 13B activated), both supporting a one-million-token context and featuring hybrid attention, manifold-constrained hyper-connections, and a Muon optimizer.