Follow-up: DeepSeek V4 Flash on 2x RTX PRO 6000 finishes real coding tasks faster than Sonnet and Opus, at about Sonnet quality
Summary
DeepSeek V4 Flash on dual RTX PRO 6000 GPUs completes real coding tasks faster than Anthropic's Sonnet and Opus models while achieving similar quality to Sonnet.
Similar Articles
Deepseek V4 Flash running on RTX 5090 MoE
A user shares optimization benchmarks for DeepSeek-V4-Flash (Q2_K) running on an RTX 5090 with a fork of llama.cpp, reaching 21.3 tokens/s generation with a 1 million token context.
@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?
DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.
DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q
The article details a custom quantization of DeepSeek-V4-Flash (W4A16+FP8) with MTP self-speculation enabled, reaching 85 tok/s at 524k context on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.
Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!
A developer successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally on four RTX 2080 Ti GPUs with a $2,500 budget, achieving 255 prefill tokens/s using custom Turing CUDA kernels, W8A8 quantization, and heterogeneous inference. The implementation is open-sourced.
We Tested DeepSeek V4 Pro and Flash Against Claude Opus 4.7 and Kimi K2.6
DeepSeek released V4 Pro and V4 Flash under the MIT license on April 24, 2026. In benchmarks against Claude Opus 4.7 and Kimi K2.6, V4 Pro scored 77/100 at $2.25, placing it between Opus 4.7 (91) and Kimi K2.6 (68). V4 Flash scored 60/100 at $0.02, the cheapest in the comparison. V4 Pro is discounted 75% through May 31.