@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?

Summary

DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.

DeepSeek V4 Flash on a single RTX Pro 6000? 👀 https://t.co/gG0pR6EIkK

Cached at: 05/17/26, 03:26 AM

antirez/deepseek-v4-gguf · Hugging Face

Source: https://huggingface.co/antirez/deepseek-v4-gguf

DeepSeek V4 Flash - GGUF for ds4

These quants are specific to the DS4 inference engine. They should also work with other inference engines, but that is not guaranteed. The MTP model in particular requires a specific loader.

https://github.com/antirez/ds4

Files

| File | Size | Routed experts (`ffn_{gate,up,down}_exps`) | Everything else |
|---|---|---|---|
| `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` | 80.8 GiB | `IQ2_XXS` (gate, up) + `Q2_K` (down) | `Q8_0` attn proj / shared experts / output, `F16` router + embed + indexer + compressor + HC, `F32` norms / sinks / bias |
| `DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf` | 153.3 GiB | `Q4_K` (all three) | same as above |
| `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf` | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone). | |

Use `q2` on 128 GB Mac machines, `q4` on machines with ≥ 256 GB RAM, pair either with `MTP` for optional speculative decoding.
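As a rough illustration of the guidance above, the following sketch encodes the file sizes from the table and the stated RAM thresholds. The function names `pick_variant` and `weights_fit` are hypothetical, and the check assumes that the hardware figure is in decimal GB while the file sizes are in GiB. It only tests whether the weights fit; KV cache and activation memory are not modeled.

```python
from typing import Optional

GIB = 2**30  # file sizes in the table are given in GiB

# Sizes copied from the Files table.
FILES_GIB = {"q2": 80.8, "q4": 153.3, "mtp": 3.6}


def pick_variant(ram_gb: float) -> Optional[str]:
    """Apply the model card's rule of thumb: q4 at >= 256 GB, q2 at >= 128 GB."""
    if ram_gb >= 256:
        return "q4"
    if ram_gb >= 128:
        return "q2"
    return None  # below the sizes the model card recommends


def weights_fit(variant: str, mem_gb: float, with_mtp: bool = False) -> bool:
    """Check only that the weight files fit in mem_gb (decimal GB, an assumption)."""
    need_gib = FILES_GIB[variant] + (FILES_GIB["mtp"] if with_mtp else 0.0)
    return need_gib * GIB <= mem_gb * 1e9


if __name__ == "__main__":
    print(pick_variant(128))        # variant suggested for a 128 GB machine
    print(weights_fit("q2", 96.0))  # do the q2 weights alone fit in 96 GB?
```

A 96 GB device is below the 128 GB figure the model card gives, so `pick_variant` returns `None` for it even when `weights_fit` reports that the q2 weights alone would fit. Whether the remaining headroom is enough for context is something the model card does not state.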

Quantization recipe

The filename is the spec. In detail, for the `q2` file:

| Tensor class | Quant | Notes |
|---|---|---|
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | `IQ2_XXS` | routed-expert up/gate |
| `blk.*.ffn_down_exps` | `Q2_K` | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | `Q8_0` | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | `Q8_0` | all attention projections (MLA + low-rank output) |
| `output.weight` | `Q8_0` | output head |
| `token_embd.weight` | `F16` | input embedding |
| `blk.*.ffn_gate_inp` (router) | `F16` | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | `F32` | |
| `blk.*.ffn_gate_tid2eid` | `I32` | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | `F16` / `F32` | DSv4-specific auxiliary blocks |

For the `q4` file, only the three routed-expert classes change to `Q4_K`. Everything else is byte-for-byte identical to the q2 recipe.
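The recipe can be read as an ordered list of name patterns. The sketch below transcribes the table into glob patterns and looks up the quant type for a tensor name. It is an illustration only, not code from ds4: the helper `quant_for`, the trailing `*` that tolerates a `.weight` suffix, and the `blk.*.` prefix on the attention projections are assumptions.

```python
from fnmatch import fnmatchcase
from typing import Optional

# (pattern, quant) pairs transcribed from the q2 recipe table; first match wins.
# Norms come first so that names such as "blk.0.attn_q_a_norm.weight"
# (a hypothetical name) resolve to F32 rather than to the projection rule.
Q2_RECIPE = [
    ("*_norm.weight", "F32"),
    ("blk.*.exp_probs_b*", "F32"),
    ("blk.*.attn_sinks*", "F32"),
    ("blk.*.ffn_gate_exps*", "IQ2_XXS"),
    ("blk.*.ffn_up_exps*", "IQ2_XXS"),
    ("blk.*.ffn_down_exps*", "Q2_K"),
    ("blk.*.ffn_gate_shexp*", "Q8_0"),
    ("blk.*.ffn_up_shexp*", "Q8_0"),
    ("blk.*.ffn_down_shexp*", "Q8_0"),
    ("blk.*.attn_q_a*", "Q8_0"),
    ("blk.*.attn_q_b*", "Q8_0"),
    ("blk.*.attn_kv*", "Q8_0"),
    ("blk.*.attn_output_a*", "Q8_0"),
    ("blk.*.attn_output_b*", "Q8_0"),
    ("output.weight", "Q8_0"),
    ("token_embd.weight", "F16"),
    ("blk.*.ffn_gate_inp*", "F16"),
    ("blk.*.ffn_gate_tid2eid*", "I32"),
    ("blk.*.attn_compressor_*", "F16/F32"),
    ("blk.*.indexer_*", "F16/F32"),
    ("blk.*.hc_*", "F16/F32"),
    ("blk.*.output_hc_*", "F16/F32"),
]

ROUTED_EXPERT_QUANTS = {"IQ2_XXS", "Q2_K"}


def quant_for(name: str, variant: str = "q2") -> Optional[str]:
    """Return the quant type the recipe assigns to a tensor name, or None."""
    for pattern, quant in Q2_RECIPE:
        if fnmatchcase(name, pattern):
            # The q4 file differs only in the three routed-expert classes.
            if variant == "q4" and quant in ROUTED_EXPERT_QUANTS:
                return "Q4_K"
            return quant
    return None


if __name__ == "__main__":
    for tensor in ("blk.5.ffn_down_exps.weight", "token_embd.weight"):
        print(tensor, quant_for(tensor, "q2"), quant_for(tensor, "q4"))
```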

The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of router, projections, or shared experts. Keeping the decision-making components at `Q8_0` preserves model behavior; crushing the experts buys the size.
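The claim that the routed experts dominate the size can be checked with back-of-the-envelope arithmetic from the two file sizes. The sketch below rests on several assumptions that are not stated in the model card: the bits-per-weight values come from the llama.cpp block layouts (66, 84, and 144 bytes per 256 weights for `IQ2_XXS`, `Q2_K`, and `Q4_K`), the gate, up, and down tensors hold equal numbers of weights, and file metadata overhead is ignored. Since everything else is identical between the two files, the size difference is attributed entirely to the routed experts.

```python
GIB = 2**30

# Bits per weight implied by llama.cpp block layouts (assumed to apply here).
BPW = {"IQ2_XXS": 66 * 8 / 256, "Q2_K": 84 * 8 / 256, "Q4_K": 144 * 8 / 256}

# File sizes from the Files table.
Q2_FILE_GIB = 80.8
Q4_FILE_GIB = 153.3


def avg_bpw(gate: str, up: str, down: str) -> float:
    """Average bits per weight, assuming gate/up/down have equal weight counts."""
    return (BPW[gate] + BPW[up] + BPW[down]) / 3


q2_bpw = avg_bpw("IQ2_XXS", "IQ2_XXS", "Q2_K")
q4_bpw = avg_bpw("Q4_K", "Q4_K", "Q4_K")

# The only difference between the files is the routed-expert quantization,
# so the size gap divided by the bpw gap estimates the routed-expert weight count.
routed_params = (Q4_FILE_GIB - Q2_FILE_GIB) * GIB * 8 / (q4_bpw - q2_bpw)

routed_q2_gib = routed_params * q2_bpw / 8 / GIB
rest_gib = Q2_FILE_GIB - routed_q2_gib

if __name__ == "__main__":
    print(f"routed-expert weights (estimate): {routed_params:.3e}")
    print(f"routed experts in q2 file: {routed_q2_gib:.1f} GiB")
    print(f"everything else: {rest_gib:.1f} GiB")
    print(f"routed share of q2 file: {routed_q2_gib / Q2_FILE_GIB:.1%}")
```

Under these assumptions, the routed experts account for about 72.5 GiB of the 80.8 GiB q2 file, leaving roughly 8.3 GiB for everything kept at higher precision. That is consistent with the motivation given above: the experts are where the size is, so that is where the aggressive quantization goes.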

Usage

```shell
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2     # 128 GB RAM machines
./download_model.sh q4     # >= 256 GB RAM machines
./download_model.sh mtp    # optional MTP / speculative decoding
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
```

The `download_model.sh` script fetches from this repository, resumes partial downloads, and points `./ds4flash.gguf` at the selected variant.
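For readers who want to see what "resume and point" could look like, here is a minimal Python sketch of the same idea. It is not the actual `download_model.sh`: the helper names are hypothetical, the `resolve/main/` URL form assumes the files live on the repository's `main` branch, and using a symlink for `./ds4flash.gguf` is an assumption about how the script points at the selected variant.

```python
import os
import urllib.error
import urllib.request
from pathlib import Path

# Standard Hugging Face file URL form; the "main" revision is an assumption.
REPO_URL = "https://huggingface.co/antirez/deepseek-v4-gguf/resolve/main/"


def resume_request(filename: str, dest: Path) -> urllib.request.Request:
    """Build a GET request that continues after the bytes already in dest."""
    offset = dest.stat().st_size if dest.exists() else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}
    return urllib.request.Request(REPO_URL + filename, headers=headers)


def download(filename: str, dest: Path, chunk: int = 1 << 20) -> None:
    """Fetch filename into dest, appending when the server honors the Range."""
    try:
        with urllib.request.urlopen(resume_request(filename, dest)) as resp:
            # 206 means partial content: append. Otherwise start over.
            mode = "ab" if resp.status == 206 else "wb"
            with open(dest, mode) as out:
                while True:
                    block = resp.read(chunk)
                    if not block:
                        break
                    out.write(block)
    except urllib.error.HTTPError as err:
        # 416 means the requested range starts at or past the end of the
        # file, i.e. dest is already complete.
        if err.code != 416:
            raise


def point_at(link: Path, target: Path) -> None:
    """Make link a symlink to target, replacing any previous link."""
    if link.is_symlink() or link.exists():
        link.unlink()
    os.symlink(target, link)
```

A caller would run `download(name, Path(name))` for the chosen file and then `point_at(Path("ds4flash.gguf"), Path(name))`. The sketch does not verify checksums, which a real script may well do.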

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model’s release terms.
