@danveloper: I can't believe this works, but I got DeepSeek-V4-Flash (284B params) running on a Raspberry Pi 5 (8GB edition) at >1to…
Summary
A developer ran the 284B-parameter DeepSeek-V4-Flash model on a Raspberry Pi 5 (8GB edition) at over 1 tok/s while drawing about 8 W, using an unmodified copy of antirez's GGUF file. Getting there took more than 160 experiments over 5 days.
Cached at: 06/02/26, 05:35 PM
I can’t believe this works, but I got DeepSeek-V4-Flash (284B params) running on a Raspberry Pi 5 (8GB edition) at >1tok/s @ ~8W during full-tilt inference! It uses an untouched copy of @antirez’s GGUF. Took 160+ experiments over 5 days between GPT-5.5 xhigh and Opus 4.8 max. https://t.co/RAJjNZg44Z
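For a rough sense of efficiency, the two figures quoted in the post (about 8 W, more than 1 tok/s) can be combined into an energy-per-token estimate. This is a minimal sketch, not a measurement: both inputs are approximate as stated in the post, and the function and constant names are hypothetical, introduced only for illustration.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token. A watt is one joule per second,
    so dividing power by throughput gives joules per token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


# Figures taken from the post and treated as approximate (assumption):
# roughly 8 W during inference, at a rate above 1 tok/s.
POWER_WATTS = 8.0
MIN_TOKENS_PER_SECOND = 1.0

# Using the lower bound on throughput gives an upper bound on energy.
upper_bound = joules_per_token(POWER_WATTS, MIN_TOKENS_PER_SECOND)
print(f"Roughly {upper_bound:.1f} J per token at most")
```

Because the reported throughput is a lower bound, the result is an approximate ceiling on energy per token. The true figure depends on the actual sustained rate and power draw, which the post does not give precisely.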
Similar Articles
@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?
DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.
antirez/deepseek-v4-gguf
Antirez released GGUF quantizations of DeepSeek V4 Flash specifically tailored for the DS4 inference engine, providing optimized configurations for different RAM sizes and enabling local execution of the large MoE model.
Deepseek V4 flash performance on DGX Spark
A Reddit user shares their experience running DeepSeek V4 Flash on a dual-ASUS GX10 DGX Spark setup, detailing performance metrics, configuration, and power consumption, with throughput benchmarks across various context lengths.
Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!
A developer successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally on four RTX 2080 Ti GPUs with a $2,500 budget, achieving 255 prefill tokens/s using custom Turing CUDA kernels, W8A8 quantization, and heterogeneous inference. The implementation is open-sourced.
DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q
The article details a customized quantized version of DeepSeek-V4-Flash with MTP self-speculation enabled, achieving significant speedups on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.