@antirez: DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this …
Summary
Antirez reports benchmarking DS4 inference on the DGX Spark (GB10), noting 12 tokens/sec generation speed and high prefill performance, with plans to merge the codebase once mature.
View Cached Full Text
Cached at: 05/10/26, 10:23 AM
DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this system, at 270GB/sec. But prefill is ways more alighed to M3 Max at ~200 t/s. I’ll release when more mature, but it is almost sure that it will get merged. https://t.co/LVYSDQ4Hnp
Similar Articles
@antirez: I just pushed a big refactoring of DS4 backends with CUDA support and single direction activation steering. The Metal p…
antirez pushed a major refactoring of DS4 backends, adding CUDA support and single direction activation steering while preserving the Metal path. Only M3 and DGX Spark hardware are supported for now.
Dual dgx spark (Asus GX10) MiniMax M2.7 results
User benchmarks dual Asus GX10 (DGX Spark) running MiniMax-M2.7-AWQ-4bit, achieving 30–40 tokens/s while drawing only ~100 W each, replacing noisy multi-GPU rigs.
@ttasanen: Just fired up DS4 by @antirez on my Mac Studio M3 Ultra 256GB and man, it’s seriously impressive. A clean, purpose-buil…
DS4 is a specialized inference engine by antirez designed to run DeepSeek V4 Flash locally on high-end Mac hardware, featuring optimized KV cache handling and 1M context support.
Gemma 4 26B Hits 600 Tok/s on One RTX 5090
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
@mitsuhiko: And the ds4 SSD caches are great. This is continuing a session after the server was shut down which was already 63k tok…
A user reports positive performance from ds4 SSD caches when resuming a long-running LLM inference session with 63k tokens, noting acceptable startup times.