@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?

X AI KOLs Following 05/16/26, 10:14 PM Tools

deepseek gguf quantization local-inference open-source huggingface

Summary

DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.

DeepSeek V4 Flash on a single RTX Pro 6000? 👀 https://t.co/gG0pR6EIkK

Cached at: 05/17/26, 03:26 AM

antirez/deepseek-v4-gguf · Hugging Face

Source: https://huggingface.co/antirez/deepseek-v4-gguf

DeepSeek V4 Flash - GGUF for ds4

These quants are specific to the DS4 inference engine. They should work with other inference engines, but this is not guaranteed; the MTP model in particular will not, because it requires a specific loader.

https://github.com/antirez/ds4

Files

| File | Size | Routed experts (`ffn_{gate,up,down}_exps`) | Everything else |
| --- | --- | --- | --- |
| `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` | 80.8 GiB | `IQ2_XXS` (gate, up) + `Q2_K` (down) | `Q8_0` attn proj / shared experts / output, `F16` router + embed + indexer + compressor + HC, `F32` norms / sinks / bias |
| `DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf` | 153.3 GiB | `Q4_K` (all three) | same as above |
| `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf` | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |

Use `q2` on 128 GB Mac machines and `q4` on machines with ≥ 256 GB RAM; pair either with `MTP` for optional speculative decoding.
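The RAM guidance above can be written down as a tiny selection rule. This is only a sketch: `pick_variant` is a hypothetical helper, not part of ds4, and treating machines between 128 GB and 256 GB as `q2` candidates (and anything below 128 GB as unsupported) is an assumption read into the guidance rather than something the model card states.

```python
from typing import Optional


def pick_variant(ram_gb: float) -> Optional[str]:
    """Return the variant name suggested for a machine with `ram_gb` of RAM.

    Thresholds follow the model card's guidance (q2 at 128 GB, q4 at >= 256 GB).
    Returning None below 128 GB is an assumption made for this sketch.
    """
    if ram_gb >= 256:
        return "q4"
    if ram_gb >= 128:
        return "q2"
    return None


if __name__ == "__main__":
    for ram in (96, 128, 192, 256, 512):
        print(f"{ram} GB -> {pick_variant(ram)}")
```

The returned string matches the argument accepted by `download_model.sh` in the usage section.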

Quantization recipe

The filename is the spec. In detail, for the `q2` file:

| Tensor class | Quant | Notes |
| --- | --- | --- |
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | `IQ2_XXS` | routed-expert up/gate |
| `blk.*.ffn_down_exps` | `Q2_K` | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | `Q8_0` | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | `Q8_0` | all attention projections (MLA + low-rank output) |
| `output.weight` | `Q8_0` | output head |
| `token_embd.weight` | `F16` | input embedding |
| `blk.*.ffn_gate_inp` (router) | `F16` | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | `F32` | |
| `blk.*.ffn_gate_tid2eid` | `I32` | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | `F16`/`F32` | DSv4-specific auxiliary blocks |

For the `q4` file, only the three routed-expert classes change to `Q4_K`. Everything else is byte-for-byte identical to the q2 recipe.
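The recipe table can be read as an ordered list of name patterns. The sketch below encodes it with Python's `fnmatch`; it is an illustration of the table, not the tool that produced the files. The helper name `quant_for`, the trailing `*` on patterns (to tolerate a `.weight` suffix), and the rule ordering (norms checked first so that a name such as `attn_q_a_norm.weight` is not caught by the attention-projection rule) are all assumptions.

```python
from fnmatch import fnmatchcase

# Rules shared by both variants, checked in order (first match wins).
FIXED_RULES = [
    ("*_norm.weight", "F32"),
    ("blk.*.exp_probs_b*", "F32"),
    ("blk.*.attn_sinks*", "F32"),
    ("blk.*.ffn_gate_tid2eid*", "I32"),
    ("blk.*.ffn_gate_inp*", "F16"),
    ("token_embd.weight", "F16"),
    ("output.weight", "Q8_0"),
    ("blk.*.attn_compressor_*", "F16/F32"),
    ("blk.*.indexer_*", "F16/F32"),
    ("blk.*.hc_*", "F16/F32"),
    ("blk.*.output_hc_*", "F16/F32"),
    ("blk.*.ffn_gate_shexp*", "Q8_0"),
    ("blk.*.ffn_up_shexp*", "Q8_0"),
    ("blk.*.ffn_down_shexp*", "Q8_0"),
    ("blk.*.attn_q_a*", "Q8_0"),
    ("blk.*.attn_q_b*", "Q8_0"),
    ("blk.*.attn_kv*", "Q8_0"),
    ("blk.*.attn_output_a*", "Q8_0"),
    ("blk.*.attn_output_b*", "Q8_0"),
]

# Only the routed experts differ between the q2 and q4 files.
ROUTED_RULES = {
    "q2": [
        ("blk.*.ffn_gate_exps*", "IQ2_XXS"),
        ("blk.*.ffn_up_exps*", "IQ2_XXS"),
        ("blk.*.ffn_down_exps*", "Q2_K"),
    ],
    "q4": [
        ("blk.*.ffn_gate_exps*", "Q4_K"),
        ("blk.*.ffn_up_exps*", "Q4_K"),
        ("blk.*.ffn_down_exps*", "Q4_K"),
    ],
}


def quant_for(tensor_name: str, variant: str = "q2") -> str:
    """Look up the quant type the recipe table assigns to a tensor name."""
    for pattern, quant in FIXED_RULES + ROUTED_RULES[variant]:
        if fnmatchcase(tensor_name, pattern):
            return quant
    raise KeyError(f"no recipe rule matches {tensor_name!r}")


if __name__ == "__main__":
    for name in ("blk.3.ffn_gate_exps.weight", "blk.3.ffn_down_exps.weight",
                 "blk.0.attn_q_a.weight", "token_embd.weight"):
        print(name, "->", quant_for(name, "q2"), "/", quant_for(name, "q4"))
```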

The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of router, projections, or shared experts. Keeping the decision-making components at `Q8_0` preserves model behavior; crushing the experts buys the size.
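A rough back-of-the-envelope check of "crushing the experts buys the size" can be made from the two file sizes alone. The sketch below assumes the standard ggml bits-per-weight figures (`IQ2_XXS` 2.0625, `Q2_K` 2.625, `Q4_K` 4.5), assumes the gate, up, and down expert tensors hold equal numbers of weights, and relies on the model card's statement that only the routed experts differ between the q2 and q4 files. The resulting numbers are estimates, not figures published by the author.

```python
GIB = 2**30

# Bits per weight for the ggml block formats used on the routed experts.
BPW = {"IQ2_XXS": 2.0625, "Q2_K": 2.625, "Q4_K": 4.5}

Q2_FILE_GIB = 80.8    # from the Files table
Q4_FILE_GIB = 153.3   # from the Files table

# Average bits per routed-expert weight, assuming gate/up/down are equal in size.
q2_bpw = (2 * BPW["IQ2_XXS"] + BPW["Q2_K"]) / 3
q4_bpw = BPW["Q4_K"]

# The size difference between the files is attributed entirely to the experts.
delta_bits = (Q4_FILE_GIB - Q2_FILE_GIB) * GIB * 8
routed_params = delta_bits / (q4_bpw - q2_bpw)

routed_q2_gib = routed_params * q2_bpw / 8 / GIB
rest_gib = Q2_FILE_GIB - routed_q2_gib
routed_share_q2 = routed_q2_gib / Q2_FILE_GIB

print(f"estimated routed-expert parameters: {routed_params / 1e9:.1f} B")
print(f"routed experts in q2 file: {routed_q2_gib:.1f} GiB ({routed_share_q2:.0%})")
print(f"everything else: {rest_gib:.1f} GiB")
```

Under these assumptions the routed experts come to roughly 277 billion weights and about 90% of the q2 file, with everything else taking about 8.3 GiB in either variant.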

Usage

git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2     # 128 GB RAM machines
./download_model.sh q4     # >= 256 GB RAM machines
./download_model.sh mtp    # optional MTP / speculative decoding
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The `download_model.sh` script fetches from this repository, resumes partial downloads, and points `./ds4flash.gguf` at the selected variant.

License
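For scripted setups, the server invocation shown above can be assembled programmatically. This is a minimal sketch: `build_server_argv` is a hypothetical helper, and it uses only the flags that appear in the usage example (`--ctx`, `--kv-disk-dir`, `--kv-disk-space-mb`) with the same default values. It prints the command rather than launching it.

```python
import shlex
from typing import List


def build_server_argv(ctx: int = 100_000,
                      kv_dir: str = "/tmp/ds4-kv",
                      kv_space_mb: int = 8192,
                      binary: str = "./ds4-server") -> List[str]:
    """Build the ds4-server argument list from the flags in the usage example."""
    return [
        binary,
        "--ctx", str(ctx),
        "--kv-disk-dir", kv_dir,
        "--kv-disk-space-mb", str(kv_space_mb),
    ]


if __name__ == "__main__":
    # Print a shell-safe command line; pass the list to subprocess.run to launch it.
    print(shlex.join(build_server_argv()))
```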

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model’s release terms.

Similar Articles

antirez/deepseek-v4-gguf

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q

Deepseek V4 Flash running on RTX 5090 MoE

You can run Deepseek 4 flash on mac (M3 Max, 96gb)