I just bought an Asus Ascent (Nvidia GB10 DGX) and it is slower than my Ryzen AI Max

Reddit r/LocalLLaMA News

Summary

A user reports that their Asus Ascent with Nvidia GB10 (DGX) is slower than their Ryzen AI Max when running LLMs like Gemma4-31B, despite an expected 2-4x speedup, and shares their llama-cpp configuration for debugging.

It is supposed to be 2-4x faster, but I am only getting 6 tk/s on Gemma4-31B. What am I doing wrong?

- Inference engine: llama-cpp, latest as of 15th May 2026, built myself via https://ggml.ai/dgx-spark.sh
- Tested models:
  - Step3.5-Apex-I-Quality: DGX 27 tk/s, AI Max 30 tk/s
  - gemma-4-31B-it-UD-Q8_K_XL: DGX 6.19 tk/s, AI Max 7.10 tk/s

Command:

```
llama-server --models-preset /home/dgx/models/models.ini --models-dir /home/dgx/models/ --host 0.0.0.0 --port 8080 --models-max 1 --parallel 1
```

model.ini:

```
[*]
threads = 12
flash-attn = on
mlock = off
mmap = off
fit = on
warmup = on
; batch-size = 4096
; ubatch-size = 512
cache-type-k = q8_0
cache-type-v = q8_0
jinja = true
direct-io = on
cache-prompt = true
cache-reuse = 256
cache-ram = 32768
reasoning-format = auto
n-gpu-layers = 999
```
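One way to check whether the server-side settings (prompt caching, q8_0 KV cache, direct I/O) are hiding the hardware's raw speed is to benchmark the model directly with llama-bench, which ships with llama.cpp. A minimal sketch, assuming a GGUF file named after the model above (the exact path is hypothetical) and mirroring the thread, offload, and flash-attention values from the ini:

```
# Measure raw prompt-processing (-p) and token-generation (-n) throughput,
# bypassing llama-server and its cache settings. Adjust the model path as needed.
./llama-bench \
  -m /home/dgx/models/gemma-4-31B-it-UD-Q8_K_XL.gguf \
  -ngl 999 \
  -fa 1 \
  -t 12 \
  -p 512 \
  -n 128
```

If the numbers reported here match the server figures, the bottleneck is in the model/hardware path rather than the server configuration.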

Similar Articles

Nvidia RTX 3090 vs Intel Arc Pro B70 llama.cpp Benchmarks

Reddit r/LocalLLaMA

Community benchmark shows the Intel Arc Pro B70 averages ~71% slower prompt processing and ~54% slower token generation than the RTX 3090 under llama.cpp, with the SYCL backend sometimes beating Vulkan on the same card.
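The SYCL-versus-Vulkan split comes down to which backend llama.cpp was built with, so reproducing the comparison means producing two builds and benchmarking the same GGUF on each. A minimal sketch, assuming the upstream GGML_SYCL and GGML_VULKAN CMake options and a hypothetical model.gguf (the SYCL build also needs the Intel oneAPI toolchain on the PATH):

```
# Build once per backend; exact options may differ by llama.cpp version.
cmake -B build-sycl   -DGGML_SYCL=ON   && cmake --build build-sycl   --config Release -j
cmake -B build-vulkan -DGGML_VULKAN=ON && cmake --build build-vulkan --config Release -j

# Benchmark the same model with each build and compare pp/tg numbers.
./build-sycl/bin/llama-bench   -m model.gguf -ngl 99 -p 512 -n 128
./build-vulkan/bin/llama-bench -m model.gguf -ngl 99 -p 512 -n 128
```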

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
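For context, the quoted 2.56x figure implies a non-speculative baseline of roughly 578 / 2.56 ≈ 226 tokens per second on the same card; a quick back-of-envelope check (assuming speedup is accelerated throughput divided by baseline throughput):

```
# Implied baseline throughput, rounded to the nearest token/s.
python3 -c 'print(round(578 / 2.56))'   # -> 226
```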