@antirez: DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this …
Summary
Antirez reports benchmarking DS4 inference on the DGX Spark (GB10), noting 12 tokens/sec generation speed and high prefill performance, with plans to merge the codebase once mature.
View Cached Full Text
Cached at: 05/10/26, 10:23 AM
DS4 running on DGX Spark (GB10 / CUDA), private branch for now. 12 tokens/sec, the memory bandwidth is limited in this system, at 270GB/sec. But prefill is ways more alighed to M3 Max at ~200 t/s. I’ll release when more mature, but it is almost sure that it will get merged. https://t.co/LVYSDQ4Hnp
Similar Articles
@antirez: For the DGX Spark owners. This is what you get with DS4 in your hardware. I want to post this to show how with fast pre…
antirez shares a demonstration of using DS4 on the DGX Spark, showing that despite slow generation, fast prefill keeps the system usable.
@onusoz: 16x parallel Gemma-4-26B-A4B-NVFP4 runs 18 output tokens/s, aggregate 300 tok/s 1 DGX Spark with 128 GB unified memo…
@onusoz demonstrates running 16 parallel instances of NVIDIA's quantized Gemma-4-26B-A4B-NVFP4 model on a single DGX Spark with 128GB unified memory, achieving 300 tok/s aggregate, showcasing high concurrency without flashinfer.
Deepseek V4 flash performance on DGX Spark
A Reddit user shares their experience running DeepSeek V4 Flash on a dual-ASUS GX10 DGX Spark setup, detailing performance metrics, configuration, and power consumption, with throughput benchmarks across various context lengths.
@antirez: I just pushed a big refactoring of DS4 backends with CUDA support and single direction activation steering. The Metal p…
antirez pushed a major refactoring of DS4 backends, adding CUDA support and single direction activation steering while preserving the Metal path. Only M3 and DGX Spark hardware are supported for now.
Dual dgx spark (Asus GX10) MiniMax M2.7 results
User benchmarks dual Asus GX10 (DGX Spark) running MiniMax-M2.7-AWQ-4bit, achieving 30–40 tokens/s while drawing only ~100 W each, replacing noisy multi-GPU rigs.