@ciruai: Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 Strix Halo with 128GB RAM. Getting ~15 TPS over a decently long …
Summary
DeepSeek v4 Flash, a 284B-parameter MoE model with 13B active parameters, runs locally at ~15 TPS over a fairly long context on an AMD Ryzen AI Max+ 395 (Strix Halo) with 128GB RAM. The machine costs about $3,000, while running a model like this fast normally means spending well over $25,000 on a self-built system. The point is that large models can run on consumer hardware at all.
Cached at: 06/18/26, 04:19 PM
Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 Strix Halo with 128GB RAM.
Getting ~15 TPS over a decently long context, which is honestly very usable for a model this smart.
284B-parameter MoE, 13B active.
Before anyone says “that’s slow,” remember: this is running on a $3,000 machine. Getting this kind of model to run fast normally means spending well over $25,000 (if you build it yourself).
The accomplishment isn’t beating a datacenter GPU.
The accomplishment is running it locally at all.
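For a rough sense of why this setup works at all, here is a back-of-envelope sketch of the arithmetic. It is not from the post: the quantization level is not stated, so the bit widths below are hypothetical, and the 256 GB/s memory bandwidth is an assumed theoretical peak for this platform. The function names are made up for illustration. The estimate treats decoding as purely memory-bound and ignores KV cache reads, long-context overhead, and any memory reserved by the OS, so real throughput will land well below the ceiling it prints.

```python
# Back-of-envelope sizing for a 284B-total / 13B-active MoE on a 128GB machine.
# Assumptions (not from the post): bits per weight, and 256 GB/s peak bandwidth.

TOTAL_PARAMS_B = 284      # total parameters, in billions (from the post)
ACTIVE_PARAMS_B = 13      # active parameters per token, in billions (from the post)
RAM_GB = 128              # system memory (from the post)
BANDWIDTH_GBS = 256       # ASSUMED theoretical peak memory bandwidth, GB/s


def weight_footprint_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_b * bits_per_weight / 8


def decode_ceiling_tps(active_params_b: float, bits_per_weight: float,
                       bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if every token only reads the active weights once."""
    gb_read_per_token = weight_footprint_gb(active_params_b, bits_per_weight)
    return bandwidth_gbs / gb_read_per_token


if __name__ == "__main__":
    for bits in (8, 4, 3, 2):  # hypothetical quantization levels
        size = weight_footprint_gb(TOTAL_PARAMS_B, bits)
        fits = "fits" if size < RAM_GB else "does not fit"
        ceiling = decode_ceiling_tps(ACTIVE_PARAMS_B, bits, BANDWIDTH_GBS)
        print(f"{bits} bits/weight: {size:6.1f} GB weights ({fits} in {RAM_GB} GB), "
              f"decode ceiling ~{ceiling:.0f} tok/s")
```

Under these assumptions, the full weights only fit in 128GB at roughly 3 bits per weight or below. Because only the 13B active parameters are read per token, the bandwidth-bound ceiling stays well above the reported ~15 TPS, which is consistent with a memory-bound MoE running over a long context.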
Similar Articles
Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!
A developer successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally on four RTX 2080 Ti GPUs with a $2,500 budget, achieving 255 prefill tokens/s using custom Turing CUDA kernels, W8A8 quantization, and heterogeneous inference. The implementation is open-sourced.
Deepseek v4 Flash is pretty amazing, about to buy a $25k computer
The author praises DeepSeek V4 Flash for enabling high-performance local LLM deployment, leading to a $25k hardware purchase to serve clients with strict data privacy needs.
@danveloper: https://x.com/danveloper/status/2064387956387758206
A developer ran DeepSeek-V4-Flash on a Raspberry Pi 5 by streaming model weights from an NVMe SSD, achieving 1.3 tokens/second at 8 watts, demonstrating the feasibility of frontier-adjacent open-weight models on low-cost, offline hardware.
@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?
DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.
DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q
The article details a customized quantized version of DeepSeek-V4-Flash with MTP self-speculation enabled, achieving significant speedups on dual RTX PRO 6000 Max-Q GPUs using a patched vLLM setup.