PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)

Reddit r/LocalLLaMA 06/12/26, 12:01 AM Tools

performance-optimization threading cpu-inference llama-cpp gemma hybrid-inference

Summary

A user benchmarks thread counts for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, finds an 80% performance uplift from using 16 threads instead of 6 on a hybrid-core CPU, and shares the best-performing command configuration.

When GPT-OSS 120B was released last year I played around with it and tried to maximize its performance. One thing many people pointed out was that on a hybrid CPU (Performance + Efficiency cores) you should use only the P-cores, via the "--threads" argument and taskset/affinity. Back then I set that model up on my friend's **14700K**, and yeah, limiting threads to 8 (because 8 P-cores) increased performance. So I kept doing that, and recommending it, ever since.

Today I was playing around with MTP draft settings on **Gemma 4 26B A4B QAT** and randomly thought "Let's try increasing the thread count". My CPU (**250K Plus**) has **18 cores** (6 performance + 12 efficiency). The uplift was so big that I wrote a simple script just to be sure (a simple prompt to write PHP code for WordPress, same settings apart from the threads argument, same seed, 1 warmup run, then 5 runs to reduce error). Here are the results:

| threads | runs | min_tok/s | mean_tok/s | max_tok/s |
|---------|------|-----------|------------|-----------|
| 6       | 5    | 48.938    | 49.144     | 49.451    |
| 12      | 5    | 61.329    | 62.938     | 67.614    |
| 16      | 5    | 87.877    | 88.765     | 89.126    |
| 18      | 5    | 64.154    | 66.478     | 67.373    |

Yeah. *Casual* **+80% performance uplift** by using 16 threads instead of 6 (mean 88.765 vs 49.144 tok/s). YEAH, I ALSO DIDN'T BELIEVE IT GOT THAT FAST, THAT'S WHY I MADE THE BENCH SCRIPT TO CONFIRM.

In the 6-thread test the process was pinned to the P-cores with the /affinity argument, but the result was the same as without pinning, so maybe the Thread Director on Arrow Lake is better than on Raptor Lake (the 14700K I tested on previously). Curiously, performance drops with 18 threads. I don't see any throttling and all cores still run at full boost, so the bottleneck must be showing up somewhere else. If anyone knows where, please drop it in the comments.
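For anyone who wants to repeat this kind of test, here is a minimal sketch of the benchmark loop described above (1 warmup run, then 5 measured runs per thread count, reporting min/mean/max tok/s). It is not the script used for the numbers in this post. The `measure` callable is a hypothetical placeholder: you would have to supply your own function that runs one generation with the given thread count and returns tokens per second. `bench_threads` and `uplift_percent` are illustrative names as well.

```python
import statistics
from typing import Callable, Dict, Iterable, List


def bench_threads(
    measure: Callable[[int], float],
    thread_counts: Iterable[int],
    warmup: int = 1,
    runs: int = 5,
) -> List[Dict[str, float]]:
    """Summarize tok/s per thread count.

    `measure(threads)` is a hypothetical user-supplied function that runs
    one generation with that thread count and returns tokens per second.
    """
    rows = []
    for threads in thread_counts:
        # Warmup runs are executed but their results are discarded.
        for _ in range(warmup):
            measure(threads)
        samples = [measure(threads) for _ in range(runs)]
        rows.append({
            "threads": threads,
            "runs": runs,
            "min": min(samples),
            "mean": statistics.mean(samples),
            "max": max(samples),
        })
    return rows


def uplift_percent(baseline: float, candidate: float) -> float:
    """Relative speedup of `candidate` over `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


if __name__ == "__main__":
    # Mean tok/s values taken from the results table in this post.
    print(f"{uplift_percent(49.144, 88.765):.1f}% uplift, 16 vs 6 threads")
```

Keeping the measurement behind a callable means the same loop works whether you time a server request, a CLI run, or anything else, as long as prompt, seed, and all other settings stay fixed between runs.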
**Config:** Intel 250K Plus + 64GB 6400MT/s + RTX 4070 SUPER 12GB with memory OC to 571GB/s + llama.cpp b9601

This is the command that gives me the best performance out of everything I've tested so far (for example, I see many people use a spec draft of 3; for me, setting it to 2 increased performance on the **QAT** model, while on **non-QAT** 3 was fine):

`llama-server -m models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --model-draft models/mtp-gemma-4-26B-A4B-it-qat.gguf --alias gemma4-26b-a4b-qat-q4xl-mtp -c 131072 -np 1 -b 2048 -ub 512 --threads 16 -ngl 99 -ncmoe 18 -fa on --spec-type draft-mtp --spec-draft-ngl 99 --spec-draft-n-max 2 --cache-type-k-draft q8_0 --cache-type-v-draft q8_0 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.0 --repeat-penalty 1.0`

So if you have **12GB VRAM** like me, try the command above. The quant and the MTP model are from Unsloth. Of course tok/s will drop as the context grows, but the percentage difference stays the same. This command may still not be perfect. After I wake up I'll retest every single assumption I had, because maybe I set other arguments wrong too lol

Check how performance scales on **your CPU**, because you may be missing nearly half of the performance like I was... Now I'm even more sad that Gemma 4 124B has not been released, because it 100% would be fast enough with that 16-thread setting. I would just put 32GB more RAM into that PC and it would be a perfect match :( :( :( :(

Sorry mods if I set an incorrect post flair, I have no idea which one I should use for this post.

Edit: **This post assumes that you're using hybrid (CPU+GPU) inference like me, or pure CPU inference.**
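To sweep thread counts on your own machine, it helps to generate the command once per value instead of editing it by hand. The sketch below only builds and prints the argument list. The flags are copied verbatim from the command in this post and are not checked against any particular llama.cpp build, and `build_command` is a hypothetical helper name. The model paths will need adjusting to your own files.

```python
import shlex
from typing import List

# Arguments copied from the command in this post, minus "--threads".
# Assumption: your llama.cpp build accepts the same flags.
BASE_ARGS: List[str] = [
    "llama-server",
    "-m", "models/gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf",
    "--model-draft", "models/mtp-gemma-4-26B-A4B-it-qat.gguf",
    "--alias", "gemma4-26b-a4b-qat-q4xl-mtp",
    "-c", "131072", "-np", "1", "-b", "2048", "-ub", "512",
    "-ngl", "99", "-ncmoe", "18", "-fa", "on",
    "--spec-type", "draft-mtp",
    "--spec-draft-ngl", "99",
    "--spec-draft-n-max", "2",
    "--cache-type-k-draft", "q8_0",
    "--cache-type-v-draft", "q8_0",
    "--temp", "1.0", "--top-p", "0.95", "--top-k", "64",
    "--min-p", "0.0", "--repeat-penalty", "1.0",
]


def build_command(threads: int) -> List[str]:
    """Return the argument list with only the thread count varied."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return BASE_ARGS + ["--threads", str(threads)]


if __name__ == "__main__":
    # Thread counts tested in this post; pick values that fit your CPU.
    for n in (6, 12, 16, 18):
        print(shlex.join(build_command(n)))
```

Everything except `--threads` stays identical between runs, which is what makes the tok/s numbers comparable.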

