I just bought an Asus Ascent (Nvidia GB10 DGX) and it is slower than my Ryzen AI Max

Reddit r/LocalLLaMA News

Summary

A user reports that their Asus Ascent with Nvidia GB10 (DGX) is slower than their Ryzen AI Max when running LLMs like Gemma4-31B, despite an expected 2-4x speedup, and shares their llama-cpp configuration for debugging.

It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?

- Inference engine: llama-cpp, latest as of 15th May 2026, built myself via https://ggml.ai/dgx-spark.sh
- Tested models:
  - Step3.5-Apex-I-Quality: DGX 27 tk/s, AI Max 30 tk/s
  - gemma-4-31B-it-UD-Q8_K_XL: DGX 6.19 tk/s, AI Max 7.10 tk/s

Command:

```
llama-server --models-preset /home/dgx/models/models.ini --models-dir /home/dgx/models/ --host 0.0.0.0 --port 8080 --models-max 1 --parallel 1
```

model.ini:

```
[*]
threads = 12
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
; batch-size = 4096
; ubatch-size = 512
cache-type-k = q8_0
cache-type-v = q8_0
jinja = true
direct-io = on
cache-prompt = true
cache-reuse = 256
cache-ram = 32768
reasoning-format = auto
n-gpu-layers = 999
```
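One way to check whether the server-side settings (prompt caching, q8_0 KV cache, direct I/O) are hiding the hardware's raw speed is to benchmark the model directly with llama-bench, which ships with llama.cpp. A minimal sketch, assuming a GGUF file named after the model above (the exact path is hypothetical) and mirroring the thread, offload, and flash-attention values from the ini:

```
# Measure raw prompt-processing (-p) and token-generation (-n) throughput,
# bypassing llama-server and its cache settings. Adjust the model path as needed.
./llama-bench \
  -m /home/dgx/models/gemma-4-31B-it-UD-Q8_K_XL.gguf \
  -ngl 999 \
  -fa 1 \
  -t 12 \
  -p 512 \
  -n 128
```

If the numbers reported here match the server figures, the bottleneck is in the model/hardware path rather than the server configuration.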

Similar Articles

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks

Reddit r/LocalLLaMA

Community benchmark shows the Intel Arc Pro B70 averages ~71% slower prompt processing and ~54% slower token generation than the RTX 3090 under llama.cpp, with the SYCL backend sometimes beating Vulkan on the same card.
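The SYCL-versus-Vulkan split comes down to which backend llama.cpp was built with, so reproducing the comparison means producing two builds and benchmarking the same GGUF on each. A minimal sketch, assuming the upstream GGML_SYCL and GGML_VULKAN CMake options and a hypothetical model.gguf (the SYCL build also needs the Intel oneAPI toolchain on the PATH):

```
# Build once per backend; exact options may differ by llama.cpp version.
cmake -B build-sycl   -DGGML_SYCL=ON   && cmake --build build-sycl   --config Release -j
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan --config Release -j

# Benchmark the same model with each build and compare pp/tg numbers.
./build-sycl/bin/llama-bench   -m model.gguf -ngl 99 -p 512 -n 128
./build-vulkan/bin/llama-bench -m model.gguf -ngl 99 -p 512 -n 128
```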

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
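For context, the quoted 2.56x figure implies a non-speculative baseline of roughly 578 / 2.56 ≈ 226 tokens per second on the same card; a quick back-of-envelope check (assuming speedup is accelerated throughput divided by baseline throughput):

```
# Implied baseline throughput, rounded to the nearest token/s.
python3 -c 'print(round(578 / 2.56))'   # -> 226
```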