Deepseek V4 Flash running on RTX 5090 MoE
Summary
User shares optimization benchmarks for DeepSeek-V4-Flash (Q2_K) running on an RTX 5090 using a fork of llama.cpp, achieving 21.3 tokens/s generation and 1 million context size.
Similar Articles
DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q
The article details a customized quantized version of DeepSeek-V4-Flash with MTP self-speculation enabled, achieving significant speedups on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.
Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!
A developer successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally on four RTX 2080 Ti GPUs with a $2,500 budget, achieving 255 prefill tokens/s using custom Turing CUDA kernels, W8A8 quantization, and heterogeneous inference. The implementation is open-sourced.
@ciruai: Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 Strix Halo with 128GB RAM. Getting ~15 TPS over a decently long …
Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 with 128GB RAM achieves ~15 TPS for a 284B MoE model (13B active) locally, costing $3,000 versus $25,000+ for a datacenter setup, highlighting the feasibility of running large models on consumer hardware.
Deepseek V4 flash performance on DGX Spark
A Reddit user shares their experience running DeepSeek V4 Flash on a dual-ASUS GX10 DGX Spark setup, detailing performance metrics, configuration, and power consumption, with throughput benchmarks across various context lengths.
llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090
Describes a patch for llama.cpp that adds CUDA support for DeepSeek V4 Flash's context indexing, enabling full 1M token context on an RTX 5090 with significantly reduced VRAM usage and high throughput.