Tag
DeepSeek proposes DSpark, a technique that implements speculative decoding by inserting a mini Transformer after the final RMSNorm, boosting large-model output speed by 60%-85%.
Progress update on DSpark: training of the DFlash backbone and Markov head is complete, enabling use on a 27B model. The next step is training the confidence head for adaptive drafting, which is expected to give an 8-14% speed improvement over DFlash.
DeepSeek released DSpark, a system where the main model rapidly generates a sentence while a tiny editor fixes coherence before verification, a gain that comes from LLM systems engineering rather than from a new architecture.
OpenInfer, a pure Rust+CUDA LLM inference engine, quickly added support for DeepSeek's DSpark speculative decoding technique on RTX 5090, achieving nearly 500 tok/s per user and scaling to ~2.4K aggregate tok/s, outperforming DFlash on non-random workloads.
DeepSeek used the open-perfectblend dataset to train their new DSpark drafter; the dataset is an open-source reproduction of 'The Perfect Blend' paper providing over 1 million diverse prompts in math, chat, and code.
DeepSeek released DSpark, a speculative decoding method that boosts throughput by 51% to 400% for V4 Flash & Pro, along with the open-source DeepSpec codebase for training and evaluating draft models.
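For readers new to the idea behind these items, here is a minimal sketch of generic greedy speculative decoding (draft, then verify). It is not DSpark's or DeepSpec's actual implementation. The two toy "models" (`target_next`, `draft_next`) and the function names are hypothetical stand-ins, and the sketch assumes greedy decoding, where verification is an exact token match.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token.
Model = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Hypothetical large model: a deterministic toy rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Hypothetical small drafter: matches the target except when
    the context length is a multiple of 4 (an arbitrary toy error)."""
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 0 else tok


def greedy_decode(prompt: List[int], target: Model, num_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(num_new):
        out.append(target(out))
    return out


def speculative_decode(
    prompt: List[int], target: Model, draft: Model, num_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    then append one token from the target itself."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < num_new:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks every drafted position. A real system does
        #    this in a single batched forward pass, counted once here.
        verify_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(out + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # target's correction ends the round
                break
        else:
            # Every draft token matched: take one bonus token from the target.
            accepted.append(target(out + proposal))
        out.extend(accepted)
    return out[: len(prompt) + num_new], verify_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    tokens, passes = speculative_decode(prompt, target_next, draft_next, 20)
    print("same output as plain greedy:", tokens == greedy_decode(prompt, target_next, 20))
    print("target verification passes:", passes, "for 20 new tokens")
```

The output matches plain greedy decoding by construction, so in this sketch any gain comes only from needing fewer target passes. How much a real system gains depends on how often the drafter agrees with the target and on the drafter's own cost, which this toy does not model.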