@mishig25: M3 Max users really got local AGI before GTA VI
Cached at: 05/11/26, 10:37 AM
M3 Max users really got local AGI before GTA VI https://t.co/AfaFukk6jR
antirez/deepseek-v4-gguf · Hugging Face
Source: https://huggingface.co/antirez/deepseek-v4-gguf
DeepSeek V4 Flash - GGUF for ds4
These quants are specific to the DS4 inference engine. They should also work with other inference engines, though that is not guaranteed. The exception is the MTP model, which requires a specific loader.
https://github.com/antirez/ds4
Files

| File | Size | Routed experts (`ffn_{gate,up,down}_exps`) | Everything else |
|---|---|---|---|
| `DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2.gguf` | 80.8 GiB | `IQ2_XXS` (gate, up) + `Q2_K` (down) | `Q8_0` attn proj / shared experts / output, `F16` router + embed + indexer + compressor + HC, `F32` norms / sinks / bias |
| `DeepSeek-V4-Flash-Q4KExperts-F16HC-F16Compressor-F16Indexer-Q8Attn-Q8Shared-Q8Out-chat-v2.gguf` | 153.3 GiB | `Q4_K` (all three) | same as above |
| `DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf` | 3.6 GiB | MTP / speculative-decoding support (optional, not standalone). | |

Use `q2` on 128 GB Mac machines and `q4` on machines with ≥ 256 GB RAM; pair either with `MTP` for optional speculative decoding.
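The sizing guidance above can be written down as a small helper. This is only a sketch: the function name `pick_variant` and its return values are illustrative and not part of ds4, and the thresholds are simply the two RAM figures quoted in the model card.

```python
from typing import Optional

# File sizes as listed in the Files table (GiB).
FILE_SIZE_GIB = {"q2": 80.8, "q4": 153.3, "mtp": 3.6}


def pick_variant(ram_gb: float) -> Optional[str]:
    """Return "q4", "q2", or None for a machine with `ram_gb` of RAM.

    Thresholds follow the model card: q4 from 256 GB, q2 from 128 GB.
    Anything smaller is outside what the card recommends.
    """
    if ram_gb >= 256:
        return "q4"
    if ram_gb >= 128:
        return "q2"
    return None


if __name__ == "__main__":
    for ram in (64, 128, 192, 256, 512):
        variant = pick_variant(ram)
        size = FILE_SIZE_GIB.get(variant) if variant else None
        print(f"{ram} GB RAM -> variant={variant}, file size (GiB)={size}")
```

The optional MTP file adds 3.6 GiB on top of whichever variant is chosen.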
Quantization recipe

The filename is the spec. In detail, for the `q2` file:

| Tensor class | Quant | Notes |
|---|---|---|
| `blk.*.ffn_gate_exps`, `blk.*.ffn_up_exps` | `IQ2_XXS` | routed-expert up/gate |
| `blk.*.ffn_down_exps` | `Q2_K` | routed-expert down (K-quant for quality) |
| `blk.*.ffn_{gate,up,down}_shexp` | `Q8_0` | shared experts |
| `blk.*.attn_q_a`, `attn_q_b`, `attn_kv`, `attn_output_a`, `attn_output_b` | `Q8_0` | all attention projections (MLA + low-rank output) |
| `output.weight` | `Q8_0` | output head |
| `token_embd.weight` | `F16` | input embedding |
| `blk.*.ffn_gate_inp` (router) | `F16` | learned router |
| `blk.*.exp_probs_b` (router bias), `blk.*.attn_sinks`, all `*_norm.weight` | `F32` | |
| `blk.*.ffn_gate_tid2eid` | `I32` | hash-routing tables (first 3 layers only) |
| `blk.*.attn_compressor_*`, `blk.*.indexer_*`, `blk.*.hc_*`, `blk.*.output_hc_*` | `F16`/`F32` | DSv4-specific auxiliary blocks |

For the `q4` file, only the three routed-expert classes change to `Q4_K`. Everything else is byte-for-byte identical to the q2 recipe.
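To make the recipe concrete, here is a minimal sketch that maps a tensor name to the quant type the table assigns it. It is not the quantizer used to produce these files. The pattern list is transcribed from the table, while the `quant_for` helper, the first-match-wins ordering, and the handling of a trailing `.weight` suffix on tensor names are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

# The three routed-expert classes, the only ones that differ between q2 and q4.
ROUTED_EXPERTS = {
    "blk.*.ffn_gate_exps": "IQ2_XXS",
    "blk.*.ffn_up_exps": "IQ2_XXS",
    "blk.*.ffn_down_exps": "Q2_K",
}

# (pattern, quant) pairs from the q2 recipe table; first match wins (assumption).
Q2_RECIPE = list(ROUTED_EXPERTS.items()) + [
    ("blk.*.ffn_gate_shexp", "Q8_0"),
    ("blk.*.ffn_up_shexp", "Q8_0"),
    ("blk.*.ffn_down_shexp", "Q8_0"),
    ("blk.*.attn_q_a", "Q8_0"),
    ("blk.*.attn_q_b", "Q8_0"),
    ("blk.*.attn_kv", "Q8_0"),
    ("blk.*.attn_output_a", "Q8_0"),
    ("blk.*.attn_output_b", "Q8_0"),
    ("output.weight", "Q8_0"),
    ("token_embd.weight", "F16"),
    ("blk.*.ffn_gate_inp", "F16"),
    ("blk.*.exp_probs_b", "F32"),
    ("blk.*.attn_sinks", "F32"),
    ("*_norm.weight", "F32"),
    ("blk.*.ffn_gate_tid2eid", "I32"),
    ("blk.*.attn_compressor_*", "F16/F32"),
    ("blk.*.indexer_*", "F16/F32"),
    ("blk.*.hc_*", "F16/F32"),
    ("blk.*.output_hc_*", "F16/F32"),
]


def quant_for(name: str, variant: str = "q2") -> str:
    """Look up the quant type for a tensor name under the q2 or q4 recipe."""
    # Assumption: real tensor names may carry a ".weight" suffix that the
    # table omits for most classes, so try the name with and without it.
    stripped = name[: -len(".weight")] if name.endswith(".weight") else name
    for pattern, quant in Q2_RECIPE:
        if fnmatchcase(name, pattern) or fnmatchcase(stripped, pattern):
            if variant == "q4" and pattern in ROUTED_EXPERTS:
                return "Q4_K"
            return quant
    raise KeyError(f"no recipe entry for {name!r}")


if __name__ == "__main__":
    for tensor in ("blk.0.ffn_gate_exps.weight", "blk.7.ffn_down_exps.weight",
                   "blk.7.ffn_up_shexp.weight", "token_embd.weight"):
        print(tensor, quant_for(tensor, "q2"), quant_for(tensor, "q4"))
```

Only the routed-expert rows differ between the two calls, which mirrors the statement that the rest of the q4 file matches the q2 recipe.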
The motivation behind the asymmetry: the routed experts are the majority of the parameter count but each individual expert handles only a fraction of tokens, so aggressive quantization on them costs less in average quality than the same treatment of router, projections, or shared experts. Keeping the decision-making components at `Q8_0` preserves model behavior; crushing the experts buys the size.
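A rough back-of-the-envelope check shows how much of each file the routed experts account for. The sketch below rests on several assumptions that the model card does not state: that the block formats have the usual llama.cpp layouts (66, 84, and 144 bytes per 256 weights for `IQ2_XXS`, `Q2_K`, and `Q4_K`), that the gate, up, and down expert tensors hold the same number of weights, and that the whole size difference between the two files comes from the routed experts. The listed file sizes are rounded, so the results are estimates only.

```python
GIB = 2**30

# Bits per weight, assuming the standard llama.cpp super-block layouts
# (bytes per 256 weights * 8 bits / 256 weights).
BPW = {
    "IQ2_XXS": 66 * 8 / 256,  # 2.0625
    "Q2_K": 84 * 8 / 256,     # 2.625
    "Q4_K": 144 * 8 / 256,    # 4.5
}


def estimate_split(q2_gib: float = 80.8, q4_gib: float = 153.3) -> dict:
    """Estimate routed-expert weight count and size from the two file sizes."""
    # q2: gate and up at IQ2_XXS, down at Q2_K, assumed equal-sized tensors.
    q2_bpw = (2 * BPW["IQ2_XXS"] + BPW["Q2_K"]) / 3
    q4_bpw = BPW["Q4_K"]
    # Only the routed experts change between the files, so the size gap
    # divided by the bits-per-weight gap gives the expert weight count.
    expert_params = (q4_gib - q2_gib) * GIB * 8 / (q4_bpw - q2_bpw)
    experts_q2_gib = expert_params * q2_bpw / 8 / GIB
    return {
        "q2_expert_bpw": q2_bpw,
        "expert_params": expert_params,
        "experts_q2_gib": experts_q2_gib,
        "rest_gib": q2_gib - experts_q2_gib,
    }


if __name__ == "__main__":
    for key, value in estimate_split().items():
        print(f"{key}: {value:,.2f}")
```

Under these assumptions the routed experts average 2.25 bits per weight in the q2 file and take up roughly 72.5 GiB of its 80.8 GiB, leaving about 8.3 GiB for everything kept at higher precision.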
Usage
git clone https://github.com/antirez/ds4
cd ds4
./download_model.sh q2 # 128 GB RAM machines
./download_model.sh q4 # >= 256 GB RAM machines
./download_model.sh mtp # optional MTP / speculative decoding
make
./ds4 -p "Explain Redis streams in one paragraph."
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The `download_model.sh` script fetches from this repository, resumes partial downloads, and points `./ds4flash.gguf` at the selected variant.

License
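For readers curious what "resumes partial downloads" and "points `./ds4flash.gguf` at the selected variant" could look like, here is a hedged sketch. It is not the contents of `download_model.sh`: the helper names, the `.part` suffix, the use of an HTTP `Range` request, and the assumption that `ds4flash.gguf` is a symlink are all illustrative guesses.

```python
import os
import urllib.request


def resume_request(url: str, part_path: str) -> urllib.request.Request:
    """Build a GET that asks only for the bytes missing from `part_path`."""
    have = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    req = urllib.request.Request(url)
    if have:
        req.add_header("Range", f"bytes={have}-")
    return req


def download(url: str, dest: str, chunk: int = 1 << 20) -> None:
    """Download `url` to `dest`, appending to `dest + '.part'` if it exists.

    Not handled here: a .part file that is already complete, for which a
    server would typically answer 416 and urlopen would raise HTTPError.
    """
    part = dest + ".part"
    with urllib.request.urlopen(resume_request(url, part)) as resp:
        # 206 means the server honored the Range header; otherwise restart.
        mode = "ab" if resp.status == 206 else "wb"
        with open(part, mode) as out:
            while True:
                block = resp.read(chunk)
                if not block:
                    break
                out.write(block)
    os.replace(part, dest)


def point_default(model_path: str, link_path: str = "ds4flash.gguf") -> None:
    """Make `link_path` a symlink to `model_path`, replacing any old link."""
    tmp = link_path + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(model_path, tmp)
    os.replace(tmp, link_path)  # atomic swap on POSIX filesystems
```

Building the symlink under a temporary name and renaming it over the old one means a reader of `ds4flash.gguf` never sees a missing file while the variant is being switched.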
MIT. The base model copyright is held by DeepSeek; the GGUFs are redistributed under the base model’s release terms.