llamacpp patch - DeepSeek V4 Flash running with full 1M token context locally on RTX 5090

Reddit r/LocalLLaMA 07/02/26, 11:54 PM Tools

deepseek llama-cpp cuda patch local-inference kv-cache open-source

Summary

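Describes a patch for llama.cpp that wires DeepSeek V4 Flash's DSA lightning indexer into the model graph and adds a CUDA kernel for it. At 256K context the compute buffer drops from ~67 GiB (OOM) to 3.2 GiB and prefill rises from 56 t/s to ~263 t/s, and the full 1M token context runs on a single RTX 5090.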

Wanted to try running DeepSeek V4 Flash locally but found it asking for absurd amounts of VRAM at higher context lengths (~256 GB at 1M). Turned out the DSA lightning indexer lacks proper llama.cpp support. Did a bit of digging and there's an upstream PR to address the issue (shoutout u/fairydreaming, PR #24231), but even there it's not wired into the model graph and has no CUDA path yet. So I wired it in and patched a CUDA kernel this morning and figured I'd share in case it's useful to anyone else looking to run something like this.

Hardware: RTX 5090, 9950X3D, 96 GB DDR5

Model: DeepSeek-V4-Flash, mixed Q8/Q4/Q2 quant by antirez

Before / after (256K context):

| | Before | After |
|---|---|---|
| Compute buffer | ~67 GiB (OOM) | 3.2 GiB |
| Prefill | 56 t/s | ~263 t/s |
| Decode | ~14 t/s | ~14 t/s |
| 1M context | impossible (~256 GB) | works (3.75 GiB at ubatch 768, ~6 GB at 2048) |

Validated presets:

| Context | Prefill | Decode | Peak VRAM |
|---|---|---|---|
| 256K | ~263 t/s | 14 t/s | ~29 GiB |
| 512K | 256 t/s | 13.7 t/s | ~28 GiB |
| 1M | 159 t/s* | 13.7 t/s | ~31 GiB |

*Lower ubatch on the 32 GB 5090 at 1M. Should be ~full speed if given the full ~9 GB of VRAM.

Correctness: verified briefly with a needle-in-haystack test. Planted a random fact at 10%/50%/90% depth in a 100K-token document, and the model retrieved it correctly every time. It also retrieved the fact correctly at the harder 50% depth at 512K and 1M.

Source + build instructions + full writeup: https://github.com/spencer-zaid/llama.cpp/blob/deepseek-lid-cuda/docs/deepseek-v4-lid-cuda.md

Branch: https://github.com/spencer-zaid/llama.cpp/tree/deepseek-lid-cuda

No prebuilt binary (only tested on a single RTX 5090). Build instructions are in the doc if you need them.
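For anyone wanting to repeat the needle-in-haystack check on their own build, here is a minimal sketch of how such a prompt could be assembled. It is not the script used for the numbers above: the filler sentence, the "secret code" needle, the question wording, and all function names are hypothetical, and sentence count is only a rough stand-in for token count. Send the resulting prompt to whatever inference endpoint you run and pass the reply to `is_correct`.

```python
import random

# Hypothetical filler text; any long, unrelated document would do.
FILLER = "The quick brown fox jumps over the lazy dog."


def build_haystack(needle: str, depth: float, n_sentences: int) -> list[str]:
    """Return n_sentences filler sentences with the needle inserted at a
    relative depth (0.0 = start of document, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return sentences


def build_prompt(depth: float, n_sentences: int, seed: int = 0) -> tuple[str, str]:
    """Build a prompt containing a random six-digit fact; return (prompt, answer)."""
    rng = random.Random(seed)
    answer = str(rng.randrange(100000, 1000000))
    needle = f"The secret code is {answer}."
    document = " ".join(build_haystack(needle, depth, n_sentences))
    question = "Question: What is the secret code mentioned in the document? Answer:"
    return document + "\n\n" + question, answer


def is_correct(model_output: str, answer: str) -> bool:
    """Count the retrieval as correct if the planted fact appears in the reply."""
    return answer in model_output


if __name__ == "__main__":
    # Same depths as in the post; scale n_sentences to reach the context size you want.
    for depth in (0.1, 0.5, 0.9):
        prompt, answer = build_prompt(depth, n_sentences=1000, seed=42)
        print(f"depth={depth:.0%} prompt_chars={len(prompt)} answer={answer}")
```

Using a fresh random fact per run keeps the model from answering out of memorized text, and a plain substring match is enough for a quick pass/fail at each depth.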


Similar Articles

Deepseek V4 Flash running on RTX 5090 MoE

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q

@pupposandro: PFlash now run @poolsideai's Laguna-XS.2 (33B-A3B MoE) on a single RTX 3090. - 111 tok/s decode @ short ctx - 128K TTFT…

I have DeepSeek V4 Pro at home
