@mishig25: M3 Max users really got local AGI before GTA VI



M3 Max users really got local AGI before GTA VI https://t.co/AfaFukk6jR

Cached at: 05/11/26, 10:37 AM



antirez/deepseek-v4-gguf · Hugging Face

Source: https://huggingface.co/antirez/deepseek-v4-gguf

DeepSeek V4 Flash: GGUF for ds4

These quants are specific to the DS4 inference engine. They may or may not work with other inference engines (they should, except for the MTP model, which requires a specific loader).

https://github.com/antirez/ds4

Files

| File | Size | Routed experts (ffn_{gate,up,down}_exps) | Everything else |
|---|---|---|---|
| DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf | 80.8 GiB | IQ2_XXS (gate, up) + Q2_K (down) | Q8_0 attn proj / shared experts / output, F16 router + embed + indexer + compressor + HC, F32 norms / sinks / bias |
| DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf | 153.3 GiB | Q4_K (all three) | same as above |
| DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone) | |

Use q2 on 128 GB Mac machines and q4 on machines with ≥ 256 GB RAM; pair either with MTP for optional speculative decoding.
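The RAM guidance above can be expressed as a small lookup. This is only an illustrative sketch: `pick_variant` and the `VARIANTS` dictionary are hypothetical names, not part of ds4, and treating machines between 128 GB and 256 GB as q2 machines is an assumption, since the card only names the two thresholds.

```python
from typing import Optional

# Sizes and RAM thresholds copied from the Files table; the structure is hypothetical.
VARIANTS = {
    "q2": {"size_gib": 80.8, "min_ram_gb": 128},
    "q4": {"size_gib": 153.3, "min_ram_gb": 256},
}


def pick_variant(ram_gb: float) -> Optional[str]:
    """Return the largest variant whose RAM threshold is met, or None."""
    best = None
    for name, info in VARIANTS.items():
        if ram_gb >= info["min_ram_gb"]:
            # Prefer the variant with the higher threshold (the larger quant).
            if best is None or info["min_ram_gb"] > VARIANTS[best]["min_ram_gb"]:
                best = name
    # None means the card gives no recommendation for this much RAM.
    return best


for ram in (64, 128, 192, 256, 512):
    print(ram, "GB ->", pick_variant(ram))
```

The MTP file is left out of the lookup because it is an optional add-on to either variant rather than a standalone choice.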

Quantization recipe

The filename is the spec. In detail, for the q2 file:

| Tensor class | Quant | Notes |
|---|---|---|
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | IQ2_XXS | routed-expert up/gate |
| `blk.*.ffn_down_exps` | Q2_K | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | Q8_0 | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | Q8_0 | all attention projections (MLA + low-rank output) |
| `output.weight` | Q8_0 | output head |
| `token_embd.weight` | F16 | input embedding |
| `blk.*.ffn_gate_inp` (router) | F16 | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | F32 | |
| `blk.*.ffn_gate_tid2eid` | I32 | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | F16/F32 | DSv4-specific auxiliary blocks |

For the q4 file, only the three routed-expert classes change to Q4_K. Everything else is byte-for-byte identical to the q2 recipe.
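To make the recipe concrete, the table can be read as an ordered list of glob patterns mapped to quant types. The sketch below is not how ds4 or any quantizer implements it; `Q2_RECIPE` and `quant_for` are hypothetical names, and the assumption that real tensor names carry a suffix such as `.weight` after the class name follows the usual GGUF naming convention rather than anything stated in the card.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Routed-expert classes: the only ones that change between the q2 and q4 files.
ROUTED_EXPERTS = {"blk.*.ffn_gate_exps", "blk.*.ffn_up_exps", "blk.*.ffn_down_exps"}

# (pattern, quant) pairs transcribed from the recipe table; first match wins.
Q2_RECIPE = [
    ("*_norm.weight", "F32"),
    ("blk.*.ffn_gate_exps", "IQ2_XXS"),
    ("blk.*.ffn_up_exps", "IQ2_XXS"),
    ("blk.*.ffn_down_exps", "Q2_K"),
    ("blk.*.ffn_*_shexp", "Q8_0"),
    ("blk.*.attn_q_a", "Q8_0"),
    ("blk.*.attn_q_b", "Q8_0"),
    ("blk.*.attn_kv", "Q8_0"),
    ("blk.*.attn_output_a", "Q8_0"),
    ("blk.*.attn_output_b", "Q8_0"),
    ("output.weight", "Q8_0"),
    ("token_embd.weight", "F16"),
    ("blk.*.ffn_gate_inp", "F16"),
    ("blk.*.exp_probs_b", "F32"),
    ("blk.*.attn_sinks", "F32"),
    ("blk.*.ffn_gate_tid2eid", "I32"),
]


def quant_for(tensor_name: str, q4: bool = False) -> Optional[str]:
    """Look up the quant type for a tensor name; None if no rule matches."""
    for pattern, quant in Q2_RECIPE:
        # Accept the bare class name or the class name plus a suffix (assumed).
        if fnmatchcase(tensor_name, pattern) or fnmatchcase(tensor_name, pattern + ".*"):
            if q4 and pattern in ROUTED_EXPERTS:
                return "Q4_K"
            return quant
    return None


print(quant_for("blk.7.ffn_down_exps.weight"))           # routed expert, q2 file
print(quant_for("blk.7.ffn_down_exps.weight", q4=True))  # same tensor, q4 file
print(quant_for("blk.7.ffn_gate_shexp.weight"))          # shared expert
```

The auxiliary DSv4 blocks are omitted because the table lists them as a mix of F16 and F32 without saying which tensor gets which.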

The motivation behind the asymmetry: the routed experts are the majority of the parameter count, but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of the router, projections, or shared experts. Keeping the decision-making components at Q8_0 or higher preserves model behavior; crushing the experts buys the size.
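A back-of-the-envelope check of the "majority of the parameter count" claim is possible from the two file sizes alone, because everything outside the routed experts is identical in both files. The sketch below assumes the bits-per-weight figures of the llama.cpp block layouts (2.0625 for IQ2_XXS, 2.625 for Q2_K, 4.5 for Q4_K) and assumes the gate, up, and down tensors have equal element counts; the function name is hypothetical and the file sizes are rounded, so the result is an estimate only.

```python
GIB = 2**30

# Assumed bits per weight for each GGUF block format (llama.cpp layouts).
BPW = {"IQ2_XXS": 2.0625, "Q2_K": 2.625, "Q4_K": 4.5}


def estimate_routed_experts(q2_gib: float, q4_gib: float):
    """Return (routed-expert weight count, GiB they occupy in the q2 file)."""
    # gate and up use IQ2_XXS, down uses Q2_K; assume all three are equal-sized.
    q2_bpw = (2 * BPW["IQ2_XXS"] + BPW["Q2_K"]) / 3
    # Everything else is identical, so the size gap comes only from the experts.
    extra_bits = (q4_gib - q2_gib) * GIB * 8
    n_weights = extra_bits / (BPW["Q4_K"] - q2_bpw)
    q2_expert_gib = n_weights * q2_bpw / 8 / GIB
    return n_weights, q2_expert_gib


n_weights, q2_expert_gib = estimate_routed_experts(80.8, 153.3)
print(f"routed-expert weights: ~{n_weights / 1e9:.0f}B")
print(f"share of the q2 file:  ~{q2_expert_gib / 80.8:.0%}")
```

Under these assumptions the routed experts come out at roughly 277 billion weights and about nine tenths of the q2 file, which is consistent with the card's argument that shrinking the experts is where nearly all of the size saving comes from.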

Usage

git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2     # 128 GB RAM machines
./download_model.sh q4     # >= 256 GB RAM machines
./download_model.sh mtp    # optional MTP / speculative decoding
make

./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192

The download_model.sh script fetches from this repository, resumes partial downloads, and points ./ds4flash.gguf at the selected variant.
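The two behaviors described here, resuming a partial download and repointing a fixed path at the chosen file, can be sketched as follows. This is not the script's actual implementation: that it uses HTTP Range requests and a symlink are both assumptions, and `resume_range_header` and `point_link` are hypothetical helper names.

```python
import os


def resume_range_header(partial_path: str) -> dict:
    """Build an HTTP Range header that continues after the bytes already on disk."""
    done = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # An empty dict means "no partial file, request the whole thing".
    return {"Range": f"bytes={done}-"} if done else {}


def point_link(link_path: str, target: str) -> None:
    """Repoint link_path at target by creating a temp symlink and renaming it."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    # Renaming over the old link swaps it in one step on POSIX systems.
    os.replace(tmp, link_path)
```

Swapping the link through a rename means a reader of ./ds4flash.gguf sees either the old variant or the new one, never a missing file.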

License

MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model’s release terms.
