@TheAhmadOsman: Why do I focus on Inference Engines/Software Stacks for your hardware? - 2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving t…
Summary
Comparison of inference engine performance on different hardware: moving from baseline to vLLM with TP=2 on 2x RTX 3090s improves from ~14.5 tok/s to ~64 tok/s, and on RTX PRO 6000 moving to Sglang improves from ~32 tok/s to ~110 tok/s. Recommends vLLM/Sglang for CUDA/multi-GPU and llama.cpp for edge devices.
View Cached Full Text
Cached at: 06/22/26, 05:31 AM
Why do I focus on Inference Engines/Software Stacks for your hardware?
-
2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving to vLLM w/ TP=2
-
RTX PRO 6000: ~32 tok/s → ~110 tok/s moving to Sglang
So:
-
CUDA/2+ GPUs: ExLlamaV3/vLLM/Sglang > llama.cpp
-
Edge: llama.cpp > Ollama https://t.co/5WXSlPrrOB
Similar Articles
Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split
A user benchmarks dual-GPU inference speed on two RTX 3080 20GB using llama.cpp (row/tensor split) and ik_llama (graph split) with a Qwen3.6-27B GGUF model, comparing token generation and prompt processing speeds.
GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)
Speed test results for GLM-5.2 running on llama.cpp with RTX 5090 and RTX 3090 Ti, showing prefill speeds up to 579 t/s at 8k context and decode at ~10.6 t/s.
Inference Engines for LLMs & Local AI Hardware (2026 Edition)
This article provides a comprehensive guide to LLM inference engines for local AI hardware in 2026, explaining how to choose based on hardware strategy, workload, and serving model, and covering engines like llama.cpp, MLX, ExLlamaV2/3, vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo.
@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…
A user reports achieving over 90 tokens per second inference speed with Qwen 3.6-35b-a3b MoE model on an RTX 3090 using llama.cpp, with prefill speeds exceeding 1000 t/s, indicating practical local deployment of large language models on consumer hardware.
Gemma 4 26B Hits 600 Tok/s on One RTX 5090
A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.