antirez/deepseek-v4-gguf

Hugging Face Models Trending 04/26/26, 07:54 AM Models

deepseek gguf quantization inference-engine open-source local-ai

Summary

Antirez released GGUF quantizations of DeepSeek V4 Flash specifically tailored for the DS4 inference engine, providing optimized configurations for different RAM sizes and enabling local execution of the large MoE model.

Task: text-generation Tags: gguf, quantized, deepseek, deepseek-v4, deepseek-v4-flash, moe, mixture-of-experts, 2-bit, 4-bit, iq2_xxs, q2_k, q4_k, ds4, apple-silicon, metal, text-generation, en, base_model:deepseek-ai/DeepSeek-V4-Flash, base_model:quantized:deepseek-ai/DeepSeek-V4-Flash, license:mit, endpoints_compatible, region:us, conversational

Original Article


Cached at: 05/13/26, 06:11 PM

antirez/deepseek-v4-gguf · Hugging Face

Source: https://huggingface.co/antirez/deepseek-v4-gguf

DeepSeek V4 Flash — GGUF for ds4

These quants are specific to the DS4 inference engine. They should also work with other inference engines, though this is not guaranteed. The exception is the MTP model, which requires a specific loader.

https://github.com/antirez/ds4

Files

| File | Size | Routed experts (`ffn_{gate,up,down}_exps`) | Everything else |
|---|---|---|---|
| `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` | 80.8 GiB | `IQ2_XXS` (gate, up) + `Q2_K` (down) | `Q8_0` attn proj / shared experts / output, `F16` router + embed + indexer + compressor + HC, `F32` norms / sinks / bias |
| `DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf` | 153.3 GiB | `Q4_K` (all three) | same as above |
| `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf` | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |

Use `q2` on 128 GB Mac machines and `q4` on machines with ≥ 256 GB RAM. Pair either with `MTP` for optional speculative decoding.
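The recommendation above can be sketched as a small shell helper. `pick_variant` is a hypothetical name and not part of ds4. Treating anything between 128 GB and 256 GB as `q2`, and rejecting anything below 128 GB, are assumptions that go beyond what the model card states.

```shell
# Sketch only: map installed RAM (whole GB) to a download_model.sh variant.
# pick_variant is a hypothetical helper, not shipped with ds4.
pick_variant() {
    ram_gb="$1"
    if [ "$ram_gb" -ge 256 ]; then
        echo q4          # card: q4 on machines with >= 256 GB RAM
    elif [ "$ram_gb" -ge 128 ]; then
        echo q2          # card: q2 on 128 GB machines (128..255 is an assumption)
    else
        # The card gives no recommendation below 128 GB.
        echo "less than 128 GB RAM: no recommended variant" >&2
        return 1
    fi
}
```

On macOS the installed memory in bytes is available from `sysctl -n hw.memsize`, so one possible call is `pick_variant $(( $(sysctl -n hw.memsize) / 1073741824 ))`.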

Quantization recipe

The filename is the spec. In detail, for the `q2` file:

| Tensor class | Quant | Notes |
|---|---|---|
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | `IQ2_XXS` | routed-expert up/gate |
| `blk.*.ffn_down_exps` | `Q2_K` | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | `Q8_0` | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | `Q8_0` | all attention projections (MLA + low-rank output) |
| `output.weight` | `Q8_0` | output head |
| `token_embd.weight` | `F16` | input embedding |
| `blk.*.ffn_gate_inp` (router) | `F16` | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | `F32` | |
| `blk.*.ffn_gate_tid2eid` | `I32` | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | `F16` / `F32` | DSv4-specific auxiliary blocks |

For the `q4` file, only the three routed-expert classes change to `Q4_K`. Everything else is byte-for-byte identical to the `q2` recipe.

The motivation behind the asymmetry: the routed experts make up the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at `Q8_0` or higher precision preserves model behavior; crushing the experts buys the size.
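As a rough sanity check on the file sizes, the claim that the experts dominate can be worked through numerically. This estimate is not from the model card. It assumes the gate, up, and down expert tensors hold equal element counts, uses the nominal bits per weight of the llama.cpp formats (`IQ2_XXS` 2.0625, `Q2_K` 2.625, `Q4_K` 4.5), and ignores GGUF metadata overhead.

$$
\begin{aligned}
\bar b_{q2} &= \frac{2 \times 2.0625 + 2.625}{3} = 2.25 \ \text{bits/weight}, \qquad \bar b_{q4} = 4.5 \ \text{bits/weight} \\
\Delta S &= 153.3 - 80.8 = 72.5 \ \text{GiB} \\
N_{\text{exp}} &\approx \frac{8 \, \Delta S \cdot 2^{30}}{\bar b_{q4} - \bar b_{q2}} = \frac{8 \times 72.5 \times 2^{30}}{2.25} \approx 2.77 \times 10^{11} \ \text{weights} \\
S_{\text{exp},q2} &\approx \frac{N_{\text{exp}} \, \bar b_{q2}}{8 \cdot 2^{30}} = 72.5 \ \text{GiB}, \qquad S_{\text{rest}} \approx 80.8 - 72.5 = 8.3 \ \text{GiB}
\end{aligned}
$$

Because only the routed experts differ between the two files, the size gap is attributable to them alone. Under these assumptions the routed experts account for roughly 72.5 of the 80.8 GiB in the `q2` file, and everything else fits in about 8.3 GiB.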

Usage

git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2     # 128 GB RAM machines
./download_model.sh q4     # >= 256 GB RAM machines
./download_model.sh mtp    # optional MTP / speculative decoding
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The `download_model.sh` script fetches from this repository, resumes partial downloads, and points `./ds4flash.gguf` at the selected variant.
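To see which variant is currently selected, a check along these lines may help. It is a sketch that assumes `./ds4flash.gguf` is a symbolic link, which the wording above suggests but does not confirm. `show_model` is a hypothetical helper, not part of ds4.

```shell
# Sketch only: report what ./ds4flash.gguf resolves to.
# Assumes download_model.sh creates a symbolic link (not confirmed by the card).
show_model() {
    link="${1:-./ds4flash.gguf}"
    if [ -L "$link" ]; then
        # Symbolic link: print the link and its target.
        printf '%s -> %s\n' "$link" "$(readlink "$link")"
    elif [ -e "$link" ]; then
        # Exists but is not a link, so the symlink assumption does not hold here.
        printf '%s is a regular file, not a link\n' "$link"
    else
        printf '%s not found; run ./download_model.sh first\n' "$link" >&2
        return 1
    fi
}
```

Running `show_model` from inside the `ds4` checkout prints the link target if one exists, and returns a non-zero status if no model has been downloaded yet.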

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model’s release terms.

Similar Articles

@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?

Deepseek V4 Flash 2, 3 and 4 bits GGUFs

DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

@danveloper: I can't believe this works, but I got DeepSeek-V4-Flash (284B params) running on a Raspberry Pi 5 (8GB edition) at >1to…

You can run Deepseek 4 flash on mac (M3 Max, 96gb)
