PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

Reddit r/LocalLLaMA Tools

Summary

A user benchmarks thread count for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, discovers an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the best-performing command configuration.

When GPT-OSS 120B was released last year, I played around with it and tried to maximize its performance. One thing many people pointed out was that on a hybrid CPU (Performance + Efficiency cores) you should use only the P-cores, via the "--threads" argument plus taskset/affinity. Back then I set that model up on my friend's **14700K**, and yeah, limiting threads to 8 (because it has 8 P-cores) increased performance. So I've kept doing that, and recommending it, ever since.

Today I was playing around with MTP draft settings on **Gemma 4 26B A4B QAT** and randomly thought "Let's try increasing the thread count". My CPU (**250K Plus**) has **18 cores** (6 performance + 12 efficiency). The performance uplift was so big that I wrote a simple script just to be sure (a simple prompt to write PHP code for WordPress, same settings apart from the threads argument, same seed, 1 warmup run then 5 measured runs to reduce error). Here are the results:

    threads  runs  min_tok/s  mean_tok/s  max_tok/s
    -------  ----  ---------  ----------  ---------
          6     5     48.938      49.144     49.451
         12     5     61.329      62.938     67.614
         16     5     87.877      88.765     89.126
         18     5     64.154      66.478     67.373

Yeah. *Casual* **+80% performance uplift** by using 16 threads instead of 6. YEA, I ALSO DIDN'T BELIEVE IT BECAME SO FAST, THAT'S WHY I MADE THAT BENCH SCRIPT TO CONFIRM.

In the 6-thread test the process was pinned to the P-cores with the /affinity argument, but the result was the same as without pinning, so maybe the Thread Director on Arrow Lake is better than on Raptor Lake (the 14700K I tested on previously). Curiously, performance drops with 18 threads, but I don't see any throttling; all cores still run at full boost, so the bottleneck must be showing up somewhere else. If anybody knows where, please drop it in the comments.

**Config:** Intel 250K Plus + 64GB 6400MT/s + RTX 4070 SUPER 12GB with memory OC to 571GB/s + llama.cpp b9601

This is the command that gives me the best performance out of everything I've tested so far (for example, I see many people use spec draft 3; for me, setting it to '2' increased performance on the **QAT** model, while on **non-QAT** 3 was fine):

    llama-server -m models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --model-draft models/mtp-gemma-4-26B-A4B-it-qat.gguf --alias gemma4-26b-a4b-qat-q4xl-mtp -c 131072 -np 1 -b 2048 -ub 512 --threads 16 -ngl 99 -ncmoe 18 -fa on --spec-type draft-mtp --spec-draft-ngl 99 --spec-draft-n-max 2 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 --repeat-penalty 1.0

So if you have **12GB VRAM** like me, try the above command. The quant and the MTP model are from Unsloth. Ofc tok/s will drop as the context fills up, but the percentage difference stays the same.

This command may still not be perfect. After I wake up I'll retest every single assumption I had, because maybe I set other arguments wrong too lol

Check how performance scales on **your CPU**, because you may be missing nearly half of the performance like I was... Now I'm even more sad that Gemma 4 124B has not been released, because it 100% would be fast enough with that 16-thread setting. I would just put 32GB more RAM into that PC and it would be a perfect match :( :( :( :(

Sorry mods if I set the incorrect post flair, I have no idea which one I should use for this post.

Edit: **This post assumes that you're using hybrid (CPU+GPU) inference like me, or pure CPU inference.**
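
For anyone who wants to repeat the thread sweep on their own CPU, here is a minimal Python sketch of the procedure described above (1 warmup run, then 5 measured runs per thread count, reported as min/mean/max tok/s). It is not the author's script. `run_once` is a hypothetical callback that you would have to write yourself: it should run one generation at the given thread count and return the measured tok/s. The stand-in below simply replays the mean values from the table so the sketch runs without llama.cpp.

```python
import statistics
from typing import Callable, Dict, Iterable, List


def sweep_threads(
    run_once: Callable[[int], float],
    thread_counts: Iterable[int],
    warmup: int = 1,
    runs: int = 5,
) -> Dict[int, List[float]]:
    """Collect tok/s samples for each thread count, discarding warmup runs."""
    results: Dict[int, List[float]] = {}
    for threads in thread_counts:
        for _ in range(warmup):
            run_once(threads)  # warmup result is intentionally ignored
        results[threads] = [run_once(threads) for _ in range(runs)]
    return results


def summarize(results: Dict[int, List[float]]) -> List[dict]:
    """Reduce raw samples to one row per thread count (min/mean/max tok/s)."""
    return [
        {
            "threads": threads,
            "runs": len(samples),
            "min": min(samples),
            "mean": statistics.mean(samples),
            "max": max(samples),
        }
        for threads, samples in sorted(results.items())
    ]


def uplift_percent(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Stand-in data: the mean tok/s values posted above. A real run_once
    # (hypothetical, not shown) would start a generation with the given
    # --threads value and return the tok/s that llama.cpp reports.
    posted_means = {6: 49.144, 12: 62.938, 16: 88.765, 18: 66.478}

    def replay_run_once(threads: int) -> float:
        return posted_means[threads]

    rows = summarize(sweep_threads(replay_run_once, posted_means))
    baseline = rows[0]["mean"]  # lowest thread count is the baseline
    for row in rows:
        change = uplift_percent(baseline, row["mean"])
        print(f"{row['threads']:>7} {row['mean']:>8.3f} tok/s {change:+6.1f}%")
```

Keeping the measurement behind a callback means the same sweep logic works whether you time requests against a running server or parse the output of a one-shot run. Only the thread count should change between runs, as in the test above (same prompt, same seed, same settings).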