@ciruai: Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 Strix Halo with 128GB RAM. Getting ~15 TPS over a decently long …

X AI KOLs Timeline 06/18/26, 01:38 PM News

deepseek amd-ryzen strix-halo local-llm inference performance edge-ai

Summary

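In a test on the AMD Ryzen AI Max+ 395 (Strix Halo) with 128GB RAM, DeepSeek v4 Flash, a 284B-parameter MoE model with 13B active parameters, ran locally at ~15 TPS over a fairly long context. The machine costs about $3,000, whereas the author says getting this kind of model to run fast normally means spending well over $25,000 on a self-built system. The point is not raw speed but that a model this large runs on consumer hardware at all.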

Testing DeepSeek v4 Flash on the AMD Ryzen AI Max+ 395 Strix Halo with 128GB RAM. Getting ~15 TPS over a decently long context, which is honestly very usable for a model this smart. 284B parameter MoE, A13B active. Before anyone says “that’s slow,” remember: this is running on a $3,000 machine. Getting this kind of model to run fast normally means spending well over $25,000 (if you build it yourself). The accomplishment isn’t beating a datacenter GPU. The accomplishment is running it locally at all.
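
The numbers in the post (284B total parameters, 13B active, 128GB RAM, ~15 TPS) lend themselves to a quick back-of-the-envelope check. The sketch below is only an illustration: the post does not say which quantization was used, so the bits-per-weight values are assumptions, and the function names are hypothetical. It ignores the KV cache, activations, and memory reserved by the OS, and it treats every active parameter as read once per generated token.

```python
# Rough sizing sketch. Figures from the post: 284B total params, 13B active,
# 128 GB RAM, ~15 tokens/s. Bits-per-weight values are ASSUMED, not stated.

def weight_footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def implied_read_rate_gbps(active_params_b: float, bits_per_weight: float,
                           tokens_per_s: float) -> float:
    """GB/s of weight reads if each active parameter is read once per token."""
    return active_params_b * bits_per_weight / 8 * tokens_per_s


TOTAL_B = 284      # total parameters, in billions (from the post)
ACTIVE_B = 13      # active parameters per token, in billions (from the post)
RAM_GB = 128       # system memory (from the post)
OBSERVED_TPS = 15  # reported decode speed (from the post)

if __name__ == "__main__":
    for bits in (2.5, 3.0, 4.0, 8.0):  # hypothetical quantization levels
        size = weight_footprint_gb(TOTAL_B, bits)
        rate = implied_read_rate_gbps(ACTIVE_B, bits, OBSERVED_TPS)
        verdict = "fits" if size < RAM_GB else "does not fit"
        print(f"{bits:>4} bits/weight: {size:7.2f} GB weights ({verdict} in "
              f"{RAM_GB} GB), ~{rate:6.2f} GB/s weight reads at {OBSERVED_TPS} TPS")
```

Under these assumptions, the weights alone only fit in 128GB at roughly 3 bits per weight or below (4 bits would need 142 GB), which suggests, but does not confirm, that the test used an aggressive quantization.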

Cached at: 06/18/26, 04:19 PM

Similar Articles

Running DeepSeek-V4 locally with 4x legacy RTX 2080 Ti ($2k budget setup). Custom Turing kernels, W8A8 quantization, and 255 prefill tok/s!

Reddit r/LocalLLaMA

A developer successfully runs DeepSeek-V4-Flash (284B total, 13B active) locally on four RTX 2080 Ti GPUs with a $2,500 budget, achieving 255 prefill tokens/s using custom Turing CUDA kernels, W8A8 quantization, and heterogeneous inference. The implementation is open-sourced.

Deepseek v4 Flash is pretty amazing, about to buy a $25k computer

Reddit r/openclaw

The author praises DeepSeek V4 Flash for enabling high-performance local LLM deployment, leading to a $25k hardware purchase to serve clients with strict data privacy needs.

@danveloper: https://x.com/danveloper/status/2064387956387758206

X AI KOLs Timeline

A developer ran DeepSeek-V4-Flash on a Raspberry Pi 5 by streaming model weights from an NVMe SSD, achieving 1.3 tokens/second at 8 watts, demonstrating the feasibility of frontier-adjacent open-weight models on low-cost, offline hardware.

@Snixtp: DeepSeek V4 Flash on a single RTX Pro 6000?

X AI KOLs Following

DeepSeek V4 Flash GGUF quantizations have been released by antirez, enabling the model to run on single GPUs like the RTX Pro 6000 and Macs with 128GB+ RAM. The quantized files are available on Hugging Face with instructions for the DS4 inference engine.

DeepSeek-V4-Flash W4A16+FP8 with MTP self-speculation: 85 tok/s @ 524k on 2× RTX PRO 6000 Max-Q