The article details a customized quantized version of DeepSeek-V4-Flash with MTP self-speculation enabled, achieving significant speedups on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.
**TL;DR**: DeepSeek-V4-Flash running at **85.52 tok/s @ 524k ctx** and **\~111 tok/s @ 128k single-stream** on 2× RTX PRO 6000 Max-Q.

pasta-paul's `DeepSeek-V4-Flash-W4A16-FP8` quant is great, but its MTP head silently gets stripped at load time (HF transformers has it in `_keys_to_ignore_on_load_unexpected`), so `--speculative-config '{"method":"mtp",...}'` is a no-op. I retrofitted the MTP block, ran a GPTQ pass on its routed experts to match the base's W4A16 INT4 group format, and patched vLLM. Decode goes from **52.85 tok/s (no MTP) → 85.52 tok/s @ 524k 2-stream → \~111 tok/s @ 128k single-stream**. 284B total / 13B active, fits on 2× 96 GB.

Model: [https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8](https://huggingface.co/LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8)

# Numbers

2× RTX PRO 6000 Blackwell Max-Q (96 GB each, no NVLink, sm\_120):

|Profile|Decode TPS|TTFT|Δ vs base|
|:-|:-|:-|:-|
|pasta-paul base, no MTP, 524k|52.85|91 ms|reference|
|**This model, 524k 2-stream**|**85.52**|155 ms|**+62% (1.62×)**|
|**This model, 128k single-stream**|**\~111**|\~310 ms|**+110% (2.10×)**|

Sanity-check benchmarks (small samples, full data in the model card):

|Benchmark|n|Score|
|:-|:-|:-|
|GSM8K (T=0, CoT, exact-match)|100|**93%**|
|MMLU (mixed subjects)|100|53% (sample dragged by hard subjects; tracks base)|
|HumanEval (syntactic check, not pass@1 exec)|50|**90%**|

# What got quantized how

* **768 routed-expert tensors** (256 experts × {w1, w2, w3}): W4A16 INT4 group=128 sym, GPTQ (Frantar-style with Cholesky H⁻¹). Calibrated with 256 ultrachat\_200k prompts × 256 max\_tokens captured from the running pasta-paul model: 17,701 MTP forward dumps, 473k tokens.
* **5 attention projections**: FP8\_BLOCK (kept upstream's FP8 weights, just renamed `scale` → `weight_scale` to match pasta-paul's compressed-tensors convention).
* **Shared experts, e\_proj, h\_proj, norms, gate, attn\_sink**: BF16 / FP32.
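To make "W4A16 INT4 group=128 sym" concrete, here is a minimal sketch of the storage format only: one scale per group of 128 weights, symmetric around zero, values clamped to the signed 4-bit range. It uses plain round-to-nearest, not the GPTQ error-compensating pass that was actually run on the experts. The function names and the `amax / 7` scale choice are illustrative assumptions, not taken from the quantization scripts.

```python
from typing import List, Tuple


def quantize_groups_sym(
    row: List[float], group_size: int = 128, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Round-to-nearest symmetric group quantization (illustrative, not GPTQ)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for INT4
    qmin = -(2 ** (bits - 1))    # -8 for INT4
    q: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(w) for w in group)
        # Assumed convention: map the largest magnitude in the group to qmax.
        scale = amax / qmax if amax > 0 else 1.0
        scales.append(scale)
        for w in group:
            q.append(max(qmin, min(qmax, round(w / scale))))
    return q, scales


def dequantize_groups(
    q: List[int], scales: List[float], group_size: int = 128
) -> List[float]:
    """Reconstruct approximate weights: each value times its group's scale."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]


if __name__ == "__main__":
    row = [i / 100 - 1.0 for i in range(256)]  # two groups of 128
    q, scales = quantize_groups_sym(row)
    approx = dequantize_groups(q, scales)
    worst = max(abs(a - b) for a, b in zip(row, approx))
    print(f"groups={len(scales)} worst abs error={worst:.4f}")
```

With round-to-nearest the per-weight error is bounded by half the group's scale. GPTQ keeps the same storage layout but picks the integers so that the layer's output error on calibration data is smaller.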
# Max-Q specific fixes

If you're on the **Max-Q workstation cards specifically**, you MUST pass `--disable-custom-all-reduce`. vLLM's CustomAllreduce uses CUDA P2P (independent of `NCCL_P2P_DISABLE`), and on the PCIe-only Max-Q topology it deadlocks at post-graph eager warmup. Without the flag the engine hangs at `gpu_worker.py:619` with endless `shm_broadcast.py:681 No available shared memory broadcast block` warnings. The **Server** variant has NVLink and does not hit this.

NCCL tuning that drops TTFT from \~155 ms to \~91 ms on Max-Q at zero decode-TPS cost:

    NCCL_PROTO=LL
    NCCL_ALGO=Ring
    NCCL_MIN_NCHANNELS=8
    NCCL_NTHREADS=512

# How to run

Needs the patched vLLM fork. Vanilla vLLM doesn't load DSV4-Flash quants. The base workspace is at [https://github.com/pasta-paul/dsv4-flash-w4a16-fp8](https://github.com/pasta-paul/dsv4-flash-w4a16-fp8). Apply the MTP patches on top.

    vllm serve LordNeel/DeepSeek-V4-Flash-Acti-MTP-W4A16-FP8 \
      --tensor-parallel-size 2 --kv-cache-dtype fp8 --block-size 256 \
      --max-model-len 524288 --max-num-seqs 2 \
      --gpu-memory-utilization 0.93 \
      --tokenizer-mode deepseek_v4 \
      --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
      --reasoning-parser deepseek_v4 \
      --trust-remote-code \
      --disable-custom-all-reduce \
      --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
      --host 0.0.0.0 --port 8000

I also wrote an `AGENTS.md` runbook. Point Claude/Codex/Cursor at it and tell it "set this up", "verify hardware and get this model running", or similar. It goes through preflight → CUDA toolkit (no sudo via conda) → patched vLLM build → download → patches → serve → smoke test.

# Limitations

* **TP=2 only.** TP=1 OOMs on a single RTX PRO 6000; TP≥4 hits an upstream W4A16 MoE scale-sharding bug ([vllm-project/vllm#41511](https://github.com/vllm-project/vllm/issues/41511)).
* `num_speculative_tokens` **capped at 1.** DSV4-Flash ships exactly one MTP head (`num_nextn_predict_layers=1`); higher values will not produce more drafts.
* **Reasoning parser caveat.** With `--reasoning-parser deepseek_v4`, output splits into `content` and `reasoning_content`. Clients reading only `content` see empty strings on "thinking" responses.
* **MTP GPTQ skipped attention during calibration.** See Future work in the model card.
* **Hardware tested: only Max-Q.** Server variant + DGX Spark + H200 **should** work but I **have not** run them.

# Request for the community

If you run this and the **MTP draft acceptance rate** comes out significantly different on your prompt distribution, please comment with your domain and the rate (vLLM will log it as `spec_decode_acceptance_rate`).

# Credits

* DeepSeek-AI for the base model
* pasta-paul for the W4A16+FP8 quant + jasl/vllm serving stack ([repo](https://github.com/pasta-paul/dsv4-flash-w4a16-fp8))
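On the reasoning parser caveat under Limitations: a client can guard against empty `content` by reading both fields. The sketch below assumes the message arrives as a plain dict with `content` and `reasoning_content` keys, as in an OpenAI-compatible chat completion response. The helper names are hypothetical, and falling back to the reasoning text is just one possible policy.

```python
from typing import Any, Dict, Tuple


def split_message(message: Dict[str, Any]) -> Tuple[str, str]:
    """Return (answer, reasoning), treating missing or None fields as empty."""
    answer = message.get("content") or ""
    reasoning = message.get("reasoning_content") or ""
    return answer, reasoning


def display_text(message: Dict[str, Any]) -> str:
    """Prefer the final answer; fall back to reasoning if the answer is blank."""
    answer, reasoning = split_message(message)
    if answer.strip():
        return answer
    return reasoning


if __name__ == "__main__":
    # Hypothetical message where only the reasoning field is populated.
    thinking_only = {"content": "", "reasoning_content": "Let me work through it."}
    print(display_text(thinking_only))
```

Keeping `split_message` separate lets a UI show the reasoning in a collapsed panel instead of mixing it into the answer.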
User shares optimization benchmarks for DeepSeek-V4-Flash (Q2_K) running on an RTX 5090 using a fork of llama.cpp, achieving 21.3 tokens/s generation and 1 million context size.
Achieved DeepSeek-V4-Flash MTP speculative decoding on 2× RTX PRO 6000 with a 38% throughput increase by fixing a mis-routed quantization format issue.
A developer successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally on four RTX 2080 Ti GPUs with a $2,500 budget, achieving 255 prefill tokens/s using custom Turing CUDA kernels, W8A8 quantization, and heterogeneous inference. The implementation is open-sourced.
DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.
DeepSeek V4 Flash on dual RTX PRO 6000 GPUs completes real coding tasks faster than Anthropic's Sonnet and Opus models while achieving similar quality to Sonnet.