DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q

Reddit r/LocalLLaMA 05/10/26, 06:22 PM Tools

deepseek speculative-decoding quantization vllm gpu-optimization moe

Summary

The post describes a customized quantized build of DeepSeek-V4-Flash with MTP self-speculation enabled. On dual RTX PRO 6000 Max-Q GPUs with a patched vLLM setup, it raises decode throughput from 52.85 tok/s to 85.52 tok/s at 524k context.

**TL;DR**: DeepSeek-V4-Flash running at **85.52 tok/s @ 524k ctx** and **~111 tok/s @ 128k single-stream** on 2× RTX PRO 6000 Max-Q.

pasta-paul's `DeepSeek-V4-Flash-W4A16-FP8` quant is great, but its MTP head silently gets stripped at load time (HF transformers has it in `_keys_to_ignore_on_load_unexpected`), so `--speculative-config '{"method":"mtp",...}'` is a no-op. I retrofitted the MTP block, ran a GPTQ pass on its routed experts to match the base's W4A16 INT4 group format, and patched vLLM. Decode goes from **52.85 tok/s (no MTP)** to **85.52 tok/s @ 524k 2-stream** and **~111 tok/s @ 128k single-stream**. 284B total / 13B active, fits on 2× 96 GB.

Model: [https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8](https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8)

# Numbers

2× RTX PRO 6000 Blackwell Max-Q (96 GB each, no NVLink, sm_120):

|Profile|Decode TPS|TTFT|Δ vs base|
|:-|:-|:-|:-|
|pasta-paul base, no MTP, 524k|52.85|91 ms|reference|
|**This model, 524k 2-stream**|**85.52**|155 ms|**+62% (1.62×)**|
|**This model, 128k single-stream**|**~111**|~310 ms|**+110% (2.10×)**|

Sanity-check benchmarks (small samples, full data in the model card):

|Benchmark|n|Score|
|:-|:-|:-|
|GSM8K (T=0, CoT, exact-match)|100|**93%**|
|MMLU (mixed subjects)|100|53% (sample dragged by hard subjects; tracks base)|
|HumanEval (syntactic check, not pass@1 exec)|50|**90%**|

# What got quantized how

* **768 routed-expert tensors** (256 experts × {w1, w2, w3}): W4A16 INT4 group=128 sym, GPTQ (Frantar-style with Cholesky H⁻¹). Calibrated with 256 ultrachat_200k prompts × 256 max_tokens captured from the running pasta-paul model (17,701 MTP forward dumps, 473k tokens).
* **5 attention projections**: FP8_BLOCK (kept upstream's FP8 weights, just renamed `scale` → `weight_scale` to match pasta-paul's compressed-tensors convention).
* **Shared experts, e_proj, h_proj, norms, gate, attn_sink**: BF16 / FP32.

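To make "W4A16 INT4 group=128 sym" concrete, here is a rough round-to-nearest sketch of symmetric per-group INT4 quantization. This is not the GPTQ pass itself (GPTQ additionally uses Hessian information to compensate rounding error), and the function names plus the `[-8, 7]` integer range are my illustrative assumptions, not taken from the compressed-tensors code.

```python
def quantize_group_sym_int4(weights, group_size=128):
    """Round-to-nearest symmetric INT4 quantization with one scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Symmetric: no zero point. Map the largest |w| to 7 (assumed range [-8, 7]).
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q.append(max(-8, min(7, round(w / scale))))
    return q, scales


def dequantize_group_sym_int4(q, scales, group_size=128):
    """Reconstruct approximate weights: each value times its group's scale."""
    return [v * scales[i // group_size] for i, v in enumerate(q)]


# Toy row of 256 weights in [-1, 1], i.e. two groups of 128.
row = [((i * 37) % 101 - 50) / 50.0 for i in range(256)]
q, scales = quantize_group_sym_int4(row)
recon = dequantize_group_sym_int4(q, scales)
max_err = max(abs(a - b) for a, b in zip(row, recon))
```

With round-to-nearest, the per-weight error is bounded by half the group's scale. GPTQ keeps the same storage format but picks the integers so that the layer's output error on calibration data is smaller.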
# Max-Q specific fixes

If you're on the **Max-Q workstation cards specifically**, you MUST pass `--disable-custom-all-reduce`. vLLM's CustomAllreduce uses CUDA P2P (independent of `NCCL_P2P_DISABLE`), and on the PCIe-only Max-Q topology it deadlocks at post-graph eager warmup. Without the flag the engine hangs at `gpu_worker.py:619` with infinite `shm_broadcast.py:681 No available shared memory broadcast block` warnings. The **Server** variant has NVLink and should not hit this (untested).

NCCL tuning that drops TTFT from ~155 ms to ~91 ms on Max-Q at zero decode-TPS cost:

    NCCL_PROTO=LL
    NCCL_ALGO=Ring
    NCCL_MIN_NCHANNELS=8
    NCCL_NTHREADS=512

# How to run

Needs the patched vLLM fork; vanilla vLLM doesn't load DSV4-Flash quants. Base workspace at [https://github.com/pasta-paul/dsv4-flash-w4a16-fp8](https://github.com/pasta-paul/dsv4-flash-w4a16-fp8). Apply the MTP patches on top.

    vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
      --tensor-parallel-size 2 --kv-cache-dtype fp8 --block-size 256 \
      --max-model-len 524288 --max-num-seqs 2 \
      --gpu-memory-utilization 0.93 \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
      --reasoning-parser deepseek_v4 \
      --trust-remote-code \
      --disable-custom-all-reduce \
      --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
      --host 0.0.0.0 --port 8000

I also wrote an `AGENTS.md` runbook. Point Claude/Codex/Cursor to it and tell it "set this up", "verify hardware and get this model running", or similar. It goes through preflight → CUDA toolkit (no sudo via conda) → patched vLLM build → download → patches → serve → smoke test.

# Limitations

* **TP=2 only.** TP=1 OOMs on a single RTX PRO 6000; TP≥4 hits an upstream W4A16 MoE scale-sharding bug ([vllm-project/vllm#41511](https://github.com/vllm-project/vllm/issues/41511)).
* `num_speculative_tokens` **capped at 1.** DSV4-Flash ships exactly one MTP head (`num_nextn_predict_layers=1`); higher values will not produce more drafts.
* **Reasoning parser caveat.** With `--reasoning-parser deepseek_v4`, output splits into `content` and `reasoning_content`. Clients reading only `content` see empty strings on "thinking" responses.
* **MTP GPTQ skipped attention during calibration.** See Future work in the model card.
* **Hardware tested: only Max-Q.** Server variant, DGX Spark, and H200 **should** work, but I **have not** run them.

# Request for the community

If you run this and the **MTP draft acceptance rate** comes out significantly different on your prompt distribution, please comment with your domain and the rate (vLLM will log it as `spec_decode_acceptance_rate`).

# Credits

* DeepSeek-AI for the base model
* pasta-paul for the W4A16+FP8 quant + jasl/vllm serving stack ([repo](https://github.com/pasta-paul/dsv4-flash-w4a16-fp8))
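
Going back to the reasoning parser caveat under Limitations: a minimal client-side sketch, assuming the server's chat-completion response has already been parsed into a dict and that the message carries the `content` and `reasoning_content` fields mentioned above. The helper names are hypothetical, and the sample response is made up for illustration.

```python
def split_message(response):
    """Return (answer, reasoning) from a parsed chat-completion response dict."""
    message = response["choices"][0]["message"]
    # Either field may be missing, None, or "" depending on the response.
    answer = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    return answer, reasoning


def display_text(response):
    """Prefer the final answer; fall back to reasoning so the UI is never blank."""
    answer, reasoning = split_message(response)
    return answer if answer.strip() else reasoning


# Illustrative response where only the reasoning field is populated.
thinking_only = {
    "choices": [
        {"message": {"content": "", "reasoning_content": "Let me work through this."}}
    ]
}
shown = display_text(thinking_only)
```

A client that reads only `content` would display an empty string here, while the fallback shows the reasoning text instead.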


Similar Articles

Deepseek V4 Flash running on RTX 5090 MoE

@Hikari_07_jp: I got DeepSeek-V4-Flash MTP speculative decoding actually working on 2× RTX PRO 6000 +38% single-stream throughput. It …

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?

Follow-up: DeepSeek V4 Flash on 2x RTX PRO 6000 finishes real coding tasks faster than Sonnet and Opus, at about Sonnet quality
