@TheAhmadOsman: Why do I focus on Inference Engines/Software Stacks for your hardware? - 2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving t…

X AI KOLs Following 06/21/26, 06:55 PM News

inference performance vllm sglang llama-cpp gpu software-stack

Summary

Compares inference-engine throughput on two hardware setups, as reported in the post. On 2x RTX 3090s, moving from an unnamed baseline to vLLM with tensor parallelism (TP=2) raised throughput from ~14.5 tok/s to ~64 tok/s, roughly 4.4x. On an RTX PRO 6000, moving to SGLang raised it from ~32 tok/s to ~110 tok/s, roughly 3.4x. The author ranks ExLlamaV3, vLLM, and SGLang above llama.cpp for CUDA or multi-GPU setups, and llama.cpp above Ollama for edge devices.
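The relative gains behind these figures are simple ratios. A minimal sketch of the arithmetic, using only the approximate tok/s numbers quoted in the post (the dictionary keys and the `speedup` helper are illustrative names, not anything from the post):

```python
# Approximate throughput figures quoted in the post: (before, after) in tok/s.
results = {
    "2x RTX 3090s, vLLM w/ TP=2": (14.5, 64.0),
    "RTX PRO 6000, Sglang": (32.0, 110.0),
}


def speedup(before: float, after: float) -> float:
    """Ratio of new throughput to old throughput."""
    return after / before


for setup, (before, after) in results.items():
    print(f"{setup}: {speedup(before, after):.1f}x")
```

Because the quoted figures are themselves approximate, the ratios are only good to about one decimal place.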

Why do I focus on Inference Engines/Software Stacks for your hardware?

- 2x RTX 3090s: ~14.5 tok/s → ~64 tok/s moving to vLLM w/ TP=2
- RTX PRO 6000: ~32 tok/s → ~110 tok/s moving to Sglang

So:

- CUDA/2+ GPUs: ExLlamaV3/vLLM/Sglang > llama.cpp
- Edge: llama.cpp > Ollama

https://t.co/5WXSlPrrOB
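The two rules of thumb in the post can be written down as a small lookup. This is only a sketch of one reading: it assumes "CUDA/2+ GPUs" means "CUDA hardware or two or more GPUs" (the single-GPU RTX PRO 6000 example suggests this, but the post does not say so explicitly), and the function name, parameters, and tiered return shape are hypothetical.

```python
def recommend_engines(cuda: bool, gpu_count: int = 1, edge: bool = False) -> list[list[str]]:
    """Return engine tiers, most preferred first, following the post's two rules.

    Engines within one tier are not ranked against each other.
    An empty list means the post gives no guidance for that case.
    """
    if edge:
        # Post: "Edge: llama.cpp > Ollama"
        return [["llama.cpp"], ["Ollama"]]
    if cuda or gpu_count >= 2:
        # Post: "CUDA/2+ GPUs: ExLlamaV3/vLLM/Sglang > llama.cpp"
        # Assumption: the slash is read as "or".
        return [["ExLlamaV3", "vLLM", "Sglang"], ["llama.cpp"]]
    return []


print(recommend_engines(cuda=True, gpu_count=2))
print(recommend_engines(cuda=False, edge=True))
```

Checking `edge` first is a design choice of this sketch; the post does not say which rule wins when a device is both CUDA-capable and an edge device.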

Similar Articles

Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split

Reddit r/LocalLLaMA

A user benchmarks dual-GPU inference speed on two RTX 3080 20GB using llama.cpp (row/tensor split) and ik_llama (graph split) with a Qwen3.6-27B GGUF model, comparing token generation and prompt processing speeds.

GLM-5.2 UD-IQ1_M on llama.cpp — 5090 + 3090 Ti speed test (~ 579 t/s prefill @ 8k ctx, ~324 t/s prefill @ 57k ctx, ~10.6 t/s decode)

Reddit r/LocalLLaMA

Speed test results for GLM-5.2 running on llama.cpp with RTX 5090 and RTX 3090 Ti, showing prefill speeds up to 579 t/s at 8k context and decode at ~10.6 t/s.

Inference Engines for LLMs & Local AI Hardware (2026 Edition)

X AI KOLs

This article provides a comprehensive guide to LLM inference engines for local AI hardware in 2026, explaining how to choose based on hardware strategy, workload, and serving model, and covering engines like llama.cpp, MLX, ExLlamaV2/3, vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo.

@ItsmeAjayKV: Update on 3090: Now with Qwen 3.6-35b-a3b moe (q6_k_xl). Crossed 90 t/s for the very first time, no MTP yet, prefill sp…

X AI KOLs Timeline

A user reports achieving over 90 tokens per second inference speed with Qwen 3.6-35b-a3b MoE model on an RTX 3090 using llama.cpp, with prefill speeds exceeding 1000 t/s, indicating practical local deployment of large language models on consumer hardware.

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.
