Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split
Summary
A user benchmarks dual-GPU inference speed on two RTX 3080 20GB cards using llama.cpp (row and tensor split) and ik_llama (graph split) with a Qwen3.6-27B Q8_0 GGUF model, comparing prompt processing and token generation speeds.
## Setup:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02              KMD Version: 610.43.02     CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0 Off |                  N/A |
| 40%   30C    P8             10W /  320W |     238MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        Off |   00000000:03:00.0 Off |                  N/A |
| 40%   29C    P8              8W /  320W |      17MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

Yes, these are the Alibaba 3080 20GB cards, just arrived today. Great buy tbh.

I've used llama-benchy to benchmark prompt processing speed and token generation with ik_llama and llama.cpp, using the row, tensor and graph split modes.

Model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-Q8_0.gguf

No MTP for this benchmark.

Used the latest versions of ik_llama and llama.cpp as of today. Both were updated and recompiled just before benchmarking.
Arguments used for all 3 runs:

```
-m '<...>/Qwen3.6-27B-Q8_0.gguf' \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
-np 1 -c 135000 -ngl 99
```

Arguments used for llama.cpp:

```
-sm row
```

```
-sm tensor
```

Arguments for ik_llama:

```
-sm graph
```

## -sm row:

VRAM usage: GPU0: 18.2 / GPU1: 18.5

Results:

| model            | test             | t/s             | peak t/s     | ttfr (ms)          | est_ppt (ms)       | e2e_ttft (ms)      |
|:-----------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| Qwen/Qwen3.6-27B | pp4096 @ d4000   | 1732.89 ± 14.86 |              | 4673.37 ± 40.08    | 4673.07 ± 40.08    | 4673.37 ± 40.08    |
| Qwen/Qwen3.6-27B | tg128 @ d4000    | 23.03 ± 0.01    | 24.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d8000   | 1766.49 ± 7.45  |              | 6848.27 ± 29.08    | 6847.97 ± 29.08    | 6848.27 ± 29.08    |
| Qwen/Qwen3.6-27B | tg128 @ d8000    | 22.83 ± 0.01    | 23.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d16000  | 1756.67 ± 9.84  |              | 11441.05 ± 63.85   | 11440.74 ± 63.85   | 11441.05 ± 63.85   |
| Qwen/Qwen3.6-27B | tg128 @ d16000   | 22.44 ± 0.00    | 23.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d32000  | 1670.17 ± 7.88  |              | 21613.73 ± 101.44  | 21613.42 ± 101.44  | 21613.73 ± 101.44  |
| Qwen/Qwen3.6-27B | tg128 @ d32000   | 21.71 ± 0.01    | 22.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d64000  | 1481.15 ± 4.23  |              | 45976.46 ± 130.94  | 45976.15 ± 130.94  | 45976.46 ± 130.94  |
| Qwen/Qwen3.6-27B | tg128 @ d64000   | 20.41 ± 0.00    | 21.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1195.01 ± 2.36  |              | 110541.23 ± 217.70 | 110540.93 ± 217.70 | 110541.23 ± 217.70 |
| Qwen/Qwen3.6-27B | tg128 @ d128000  | 18.23 ± 0.00    | 19.00 ± 0.00 |                    |                    |                    |

## -sm tensor:

VRAM usage: GPU0: 18.1 / GPU1: 17.9

| model            | test             | t/s             | peak t/s     | ttfr (ms)          | est_ppt (ms)       | e2e_ttft (ms)      |
|:-----------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| Qwen/Qwen3.6-27B | pp4096 @ d4000   | 1412.73 ± 15.38 |              | 5732.50 ± 61.94    | 5732.15 ± 61.94    | 5732.50 ± 61.94    |
| Qwen/Qwen3.6-27B | tg128 @ d4000    | 38.95 ± 0.05    | 40.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d8000   | 1400.96 ± 5.46  |              | 8635.04 ± 32.88    | 8634.68 ± 32.88    | 8635.04 ± 32.88    |
| Qwen/Qwen3.6-27B | tg128 @ d8000    | 38.68 ± 0.10    | 39.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d16000  | 1381.89 ± 4.16  |              | 14543.59 ± 43.73   | 14543.23 ± 43.73   | 14543.59 ± 43.73   |
| Qwen/Qwen3.6-27B | tg128 @ d16000   | 38.14 ± 0.11    | 39.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d32000  | 1328.03 ± 2.82  |              | 27181.67 ± 57.72   | 27181.31 ± 57.72   | 27181.67 ± 57.72   |
| Qwen/Qwen3.6-27B | tg128 @ d32000   | 37.13 ± 0.01    | 38.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d64000  | 1219.17 ± 2.61  |              | 55856.47 ± 119.00  | 55856.12 ± 119.00  | 55856.47 ± 119.00  |
| Qwen/Qwen3.6-27B | tg128 @ d64000   | 35.18 ± 0.01    | 36.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1036.75 ± 1.70  |              | 127414.43 ± 208.98 | 127414.08 ± 208.98 | 127414.43 ± 208.98 |
| Qwen/Qwen3.6-27B | tg128 @ d128000  | 31.72 ± 0.12    | 32.00 ± 0.00 |                    |                    |                    |

## -sm graph (ik_llama):

VRAM usage: GPU0: 17.8 / GPU1: 19.2

| model            | test             | t/s             | peak t/s     | ttfr (ms)          | est_ppt (ms)       | e2e_ttft (ms)      |
|:-----------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| Qwen/Qwen3.6-27B | pp4096 @ d4000   | 1420.56 ± 17.77 |              | 5700.41 ± 70.54    | 5699.81 ± 70.54    | 5700.41 ± 70.54    |
| Qwen/Qwen3.6-27B | tg128 @ d4000    | 32.15 ± 0.03    | 33.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d8000   | 1387.88 ± 13.61 |              | 8716.90 ± 84.91    | 8716.29 ± 84.91    | 8716.90 ± 84.91    |
| Qwen/Qwen3.6-27B | tg128 @ d8000    | 31.81 ± 0.01    | 33.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d16000  | 1362.43 ± 8.36  |              | 14751.24 ± 90.08   | 14750.64 ± 90.08   | 14751.24 ± 90.08   |
| Qwen/Qwen3.6-27B | tg128 @ d16000   | 31.13 ± 0.01    | 32.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d32000  | 1318.72 ± 9.42  |              | 27373.72 ± 195.00  | 27373.12 ± 195.00  | 27373.72 ± 195.00  |
| Qwen/Qwen3.6-27B | tg128 @ d32000   | 30.32 ± 0.02    | 31.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d64000  | 1216.07 ± 8.43  |              | 55999.88 ± 388.37  | 55999.27 ± 388.37  | 55999.88 ± 388.37  |
| Qwen/Qwen3.6-27B | tg128 @ d64000   | 28.86 ± 0.04    | 30.00 ± 0.00 |                    |                    |                    |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1055.71 ± 7.36  |              | 125132.30 ± 869.60 | 125131.69 ± 869.60 | 125132.30 ± 869.60 |
| Qwen/Qwen3.6-27B | tg128 @ d128000  | 26.35 ± 0.00    | 27.00 ± 0.00 |                    |                    |                    |
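
To make the three result sets easier to compare, here is a minimal Python sketch that expresses each split mode's mean t/s as a ratio against `-sm row` at every context depth. The numbers are copied from the tables above with the ± spreads dropped; the names `ratio_vs_row` and `est_ppt_ms` are made up for this illustration and are not part of llama-benchy. The `est_ppt_ms` helper encodes an observation from the reported numbers only (est_ppt looks close to (depth + 4096) / t/s), which is an assumption and not a statement about how llama-benchy actually computes it.

```python
# Mean t/s values copied from the result tables (± spreads omitted).
DEPTHS = [4000, 8000, 16000, 32000, 64000, 128000]

TG = {  # tg128: token generation, tokens/s
    "row":    [23.03, 22.83, 22.44, 21.71, 20.41, 18.23],
    "tensor": [38.95, 38.68, 38.14, 37.13, 35.18, 31.72],
    "graph":  [32.15, 31.81, 31.13, 30.32, 28.86, 26.35],
}

PP = {  # pp4096: prompt processing, tokens/s
    "row":    [1732.89, 1766.49, 1756.67, 1670.17, 1481.15, 1195.01],
    "tensor": [1412.73, 1400.96, 1381.89, 1328.03, 1219.17, 1036.75],
    "graph":  [1420.56, 1387.88, 1362.43, 1318.72, 1216.07, 1055.71],
}


def ratio_vs_row(results, mode):
    """Return per-depth ratios of `mode` over the row-split baseline."""
    return [value / base for value, base in zip(results[mode], results["row"])]


def est_ppt_ms(depth, pp_tps, prompt_tokens=4096):
    """Assumed relation: time to process depth + prompt tokens at pp_tps, in ms."""
    return (depth + prompt_tokens) / pp_tps * 1000.0


def print_ratios(label, results):
    """Print one line per non-baseline mode with its ratio at each depth."""
    print(label)
    print("mode    " + "".join(f"{'d' + str(d):>9}" for d in DEPTHS))
    for mode in ("tensor", "graph"):
        cells = "".join(f"{r:>8.2f}x" for r in ratio_vs_row(results, mode))
        print(f"{mode:<8}{cells}")


if __name__ == "__main__":
    print_ratios("tg128 relative to -sm row", TG)
    print_ratios("pp4096 relative to -sm row", PP)
```

On these numbers, token generation comes out at roughly 1.7x row for `-sm tensor` and roughly 1.4x row for `-sm graph` at every depth, while prompt processing stays in row's favor, with the other two modes at about 0.78x to 0.88x of it.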