Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split

Reddit r/LocalLLaMA 06/12/26, 04:43 PM News

dual-gpu inference benchmark llama.cpp ik-llama qwen gguf

Summary

A user benchmarks dual-GPU inference speed on two RTX 3080 20GB using llama.cpp (row/tensor split) and ik_llama (graph split) with a Qwen3.6-27B GGUF model, comparing token generation and prompt processing speeds.

## Setup:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 610.43.02              KMD Version: 610.43.02     CUDA UMD Version: 13.3     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3080        Off |   00000000:01:00.0 Off |                  N/A |
| 40%   30C    P8             10W /  320W |     238MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080        Off |   00000000:03:00.0 Off |                  N/A |
| 40%   29C    P8              8W /  320W |      17MiB /  20480MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
```

Yes, these are the alibaba 3080 20gb, just arrived today. Great buy tbh.

I've used llama-benchy to benchmark prompt processing speed and token generation with ik_llama and llama.cpp with row, tensor and graph split modes.

Model used: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/blob/main/Qwen3.6-27B-Q8_0.gguf

No MTP for this benchmark. Used latest version of ik_llama and llama.cpp for today. Just updated and recompiled before benchmarking.

Arguments used for all 3 runs:

```
-m '<...>/Qwen3.6-27B-Q8_0.gguf' \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
-np 1 -c 135000 -ngl 99
```

Arguments used for llama.cpp:

```
-sm row
```

```
-sm tensor
```

Arguments for ik_llama:

```
-sm graph
```

## -sm row:

VRAM usage: GPU0: 18.2 / GPU1: 18.5

Results:

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1732.89 ± 14.86 | | 4673.37 ± 40.08 | 4673.07 ± 40.08 | 4673.37 ± 40.08 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 23.03 ± 0.01 | 24.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1766.49 ± 7.45 | | 6848.27 ± 29.08 | 6847.97 ± 29.08 | 6848.27 ± 29.08 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 22.83 ± 0.01 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1756.67 ± 9.84 | | 11441.05 ± 63.85 | 11440.74 ± 63.85 | 11441.05 ± 63.85 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 22.44 ± 0.00 | 23.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1670.17 ± 7.88 | | 21613.73 ± 101.44 | 21613.42 ± 101.44 | 21613.73 ± 101.44 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 21.71 ± 0.01 | 22.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1481.15 ± 4.23 | | 45976.46 ± 130.94 | 45976.15 ± 130.94 | 45976.46 ± 130.94 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 20.41 ± 0.00 | 21.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1195.01 ± 2.36 | | 110541.23 ± 217.70 | 110540.93 ± 217.70 | 110541.23 ± 217.70 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 18.23 ± 0.00 | 19.00 ± 0.00 | | | |

## -sm tensor:

VRAM usage: GPU0: 18.1 / GPU1: 17.9

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1412.73 ± 15.38 | | 5732.50 ± 61.94 | 5732.15 ± 61.94 | 5732.50 ± 61.94 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 38.95 ± 0.05 | 40.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1400.96 ± 5.46 | | 8635.04 ± 32.88 | 8634.68 ± 32.88 | 8635.04 ± 32.88 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 38.68 ± 0.10 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1381.89 ± 4.16 | | 14543.59 ± 43.73 | 14543.23 ± 43.73 | 14543.59 ± 43.73 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 38.14 ± 0.11 | 39.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1328.03 ± 2.82 | | 27181.67 ± 57.72 | 27181.31 ± 57.72 | 27181.67 ± 57.72 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 37.13 ± 0.01 | 38.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1219.17 ± 2.61 | | 55856.47 ± 119.00 | 55856.12 ± 119.00 | 55856.47 ± 119.00 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 35.18 ± 0.01 | 36.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1036.75 ± 1.70 | | 127414.43 ± 208.98 | 127414.08 ± 208.98 | 127414.43 ± 208.98 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 31.72 ± 0.12 | 32.00 ± 0.00 | | | |

## -sm graph (ik_llama):

VRAM usage: GPU0: 17.8 / GPU1: 19.2

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:-----------------|-----------------:|----------------:|-------------:|-------------------:|-------------------:|-------------------:|
| Qwen/Qwen3.6-27B | pp4096 @ d4000 | 1420.56 ± 17.77 | | 5700.41 ± 70.54 | 5699.81 ± 70.54 | 5700.41 ± 70.54 |
| Qwen/Qwen3.6-27B | tg128 @ d4000 | 32.15 ± 0.03 | 33.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d8000 | 1387.88 ± 13.61 | | 8716.90 ± 84.91 | 8716.29 ± 84.91 | 8716.90 ± 84.91 |
| Qwen/Qwen3.6-27B | tg128 @ d8000 | 31.81 ± 0.01 | 33.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d16000 | 1362.43 ± 8.36 | | 14751.24 ± 90.08 | 14750.64 ± 90.08 | 14751.24 ± 90.08 |
| Qwen/Qwen3.6-27B | tg128 @ d16000 | 31.13 ± 0.01 | 32.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d32000 | 1318.72 ± 9.42 | | 27373.72 ± 195.00 | 27373.12 ± 195.00 | 27373.72 ± 195.00 |
| Qwen/Qwen3.6-27B | tg128 @ d32000 | 30.32 ± 0.02 | 31.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d64000 | 1216.07 ± 8.43 | | 55999.88 ± 388.37 | 55999.27 ± 388.37 | 55999.88 ± 388.37 |
| Qwen/Qwen3.6-27B | tg128 @ d64000 | 28.86 ± 0.04 | 30.00 ± 0.00 | | | |
| Qwen/Qwen3.6-27B | pp4096 @ d128000 | 1055.71 ± 7.36 | | 125132.30 ± 869.60 | 125131.69 ± 869.60 | 125132.30 ± 869.60 |
| Qwen/Qwen3.6-27B | tg128 @ d128000 | 26.35 ± 0.00 | 27.00 ± 0.00 | | | |


Similar Articles

Dual GPU llama.cpp speedup

@leopardracer: https://x.com/leopardracer/status/2055341758523883631

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

@ggerganov: Highlighting recent advances in multi-GPU and tensor parallel support in llama.cpp Over the last few months llama.cpp m…

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)
