Deepseek V4 Flash running on RTX 5090 MoE

Reddit r/LocalLLaMA 07/03/26, 10:48 PM Models

deepseek rtx-5090 benchmarking optimization llama-cpp moe gguf

Summary

User shares optimization benchmarks for DeepSeek-V4-Flash (Q2_K) running on an RTX 5090 using a fork of llama.cpp, reporting 21.3 to 22.7 tokens/s generation in the benchmark and a 1-million-token context in daily use.

Here are the results of optimizing it for my setup:

[Image: benchmark results of the optimization. TG ranges from 22.7 to 21.3 T/s and PP from 1105 to 927 T/s, tested at prompt sizes from 8192 to 65536 tokens, set to MoE with no unified KV, no memory map, and n-cpu-moe 37.]

My setup:

Motherboard: X870 AORUS ELITE WIFI7
CPU: AMD Ryzen 9 9900X3D (24) @ 4.40 GHz
GPU: NVIDIA GeForce RTX 5090 [Discrete]
DDR5 RAM: 18.80 GiB / 125.39 GiB (15%)
OS: Bazzite (bazzite-dx-nvidia-gnome:testing)

This was possible using this fork: https://github.com/fairydreaming/llama.cpp/tree/dsv4

Build script:

    cmake -B build \
      -DGGML_CUDA=ON \
      -DCMAKE_CUDA_ARCHITECTURES="120" \
      -DGGML_CCACHE=OFF -DGGML_NATIVE=ON \
      -DCMAKE_BUILD_TYPE=Release \
      -DLLAMA_OPENSSL=ON
    cmake --build build --config Release -j$(nproc)

Benchmark command:

    llama-batched-bench -hf tarruda/DeepSeek-V4-Flash-GGUF:Q2_K \
      -b 8192 -ub 8192 -npl 1 -npp 8192,16384,32768,65536 -ntg 128 \
      -fa 1 --no-repack -no-kvu --ctx-size 70000 --no-mmap --n-cpu-moe 37

Daily use command:

    llama-server -hf tarruda/DeepSeek-V4-Flash-GGUF:Q2_K \
      -fa 1 --ctx-size 1048576 -ub 512 -b 512 -np 1 -no-kvu \
      --host 0.0.0.0 --port 8099 -t 12 --temp 1 --top-p 1.00 --metrics --perf

Yes, 1 million context. It fits with ub 512, and there's even a little bit of VRAM left to utilize. You can even fit --n-cpu-moe 37 or 36 if your OS setup is really lean.

Thanks to u/tarruda for the Q2_K model and for helping dig into all the fixes needed to get this going!

[Image: llama.cpp web UI. Prompted to let everybody in the r/LocalLLaMA community know they are awesome, DeepSeek Flash replied with 145 tokens at 21.14 t/s.]
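To put the reported throughput in wall-clock terms, here is a rough sketch that estimates the time for one request as prompt tokens divided by PP speed plus generated tokens divided by TG speed. The function name `estimate_seconds` is made up for illustration. Pairing 1105 and 22.7 T/s with the 8192-token prompt, and 927 and 21.3 T/s with the 65536-token prompt, is an assumption based on the order the ranges are given in the post, not something the post states outright.

```shell
#!/usr/bin/env bash
# Rough estimate only: seconds = prompt_tokens / pp_tps + gen_tokens / tg_tps.
# The throughput numbers are taken from the post; which prompt size each
# number belongs to is an assumption (see the paragraph above).
estimate_seconds() {
    local prompt_tokens=$1 pp_tps=$2 gen_tokens=$3 tg_tps=$4
    awk -v p="$prompt_tokens" -v pp="$pp_tps" -v g="$gen_tokens" -v tg="$tg_tps" \
        'BEGIN { printf "%.1f\n", p / pp + g / tg }'
}

# Shortest benchmarked prompt: 8192 tokens in, 128 tokens out.
estimate_seconds 8192 1105 128 22.7

# Longest benchmarked prompt: 65536 tokens in, 128 tokens out.
estimate_seconds 65536 927 128 21.3
```

Under these assumptions the two cases come out at roughly 13 and 77 seconds. The daily use command runs with `-ub 512 -b 512` rather than 8192, so its prompt processing speed may well differ from the benchmark figures used here.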


Similar Articles

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

@ciruai: Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 Strix Halo with 128GB RAM. Getting ~15 TPS over a decently long …

Deepseek V4 flash performance on DGX Spark

llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090
