I forked ik_llama.cpp and added a "--numa mirror" mode to maximize performance on multi-socket CPU systems. Just sharing and looking for testers!

Reddit r/LocalLLaMA 06/21/26, 05:37 PM Tools

llama.cpp numa cpu-inference multi-socket performance open-source fork

Summary

A developer forked ik_llama.cpp and added a '--numa mirror' mode that duplicates model weights and KV cache across NUMA nodes to maximize multi-socket CPU inference performance, sharing benchmarks and seeking testers.

GitHub: https://github.com/mikechambers84/ik_llama.cpp/tree/numa-mirror

Be sure to check out the numa-mirror branch.

Sharing this for anyone else who's trying to use their multi-socket CPU systems for inference. I've been wanting a NUMA mirror mode for a long time, so I finally forked ik_llama.cpp and added it. ik_llama.cpp is a llama.cpp fork that adds major performance improvements for CPU inference, so it made sense to fork that rather than baseline llama.cpp.

For anyone who isn't aware of the problem this is meant to solve: multi-socket machines have memory that's local to each socket. When a CPU accesses its own local memory, it's very fast. If a CPU has to access non-local memory attached to a different socket, there's a huge performance penalty, because the data has to cross an inter-socket link that's far slower than local memory. For most workloads this matters very little and you probably won't notice. But since LLM inference performance is heavily bound by memory bandwidth, performance completely tanks if you use multiple CPUs and they have to read large amounts of remote memory for each token.

The usual answer is just to use --numa isolate in llama.cpp, which pins model/context data to a single socket's CPU and memory. That eliminates remote memory accesses, but the extra CPUs are no benefit: all but one just sit idle.

This fork adds --numa mirror, which makes full duplicate copies of the model weights and KV cache so that every CPU socket has a node-local copy. This lets you use all of your CPU cores across all sockets to speed up inference instead of making it slower. The trade-off is obviously that you need more memory: with two CPU sockets, it needs twice the RAM.

I'm hoping ikawrakow will accept it in a pull request. I'll try to submit one soon, but I'm hoping to have more people test on various hardware configurations beyond mine first.
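To make the memory trade-off concrete, here is a minimal Python sketch (not code from the fork) that estimates how much RAM a mirrored setup would need. The function name `mirror_budget` and the example sizes are hypothetical, and the estimate assumes each node holds exactly one full copy of weights plus KV cache, ignoring compute buffers and OS overhead.

```python
from dataclasses import dataclass


@dataclass
class MirrorBudget:
    per_node_gib: float   # memory each NUMA node must hold
    total_gib: float      # memory summed over all nodes
    fits: bool            # True if one full copy fits in a single node's RAM


def mirror_budget(weights_gib: float, kv_cache_gib: float,
                  numa_nodes: int, ram_per_node_gib: float) -> MirrorBudget:
    """Estimate RAM use when every NUMA node keeps a full local copy.

    Assumption: one complete copy of weights + KV cache per node,
    with no allowance for compute buffers or the operating system.
    """
    if numa_nodes < 1:
        raise ValueError("numa_nodes must be at least 1")
    per_node = weights_gib + kv_cache_gib
    total = per_node * numa_nodes
    return MirrorBudget(per_node, total, per_node <= ram_per_node_gib)


# Hypothetical example: a 60 GiB model with a 4 GiB KV cache
# on a two-socket machine with 384 GiB attached to each node.
budget = mirror_budget(60, 4, numa_nodes=2, ram_per_node_gib=384)
print(budget)
```

The check is per node rather than against total RAM because a mirrored copy is only useful if it fits entirely in the memory local to its socket.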
My benchmarks are showing significant gains! My hardware is somewhat outdated, so I'd be interested to know how it runs on newer stuff.

Test setup

- Operating System: Debian 13 "Trixie" with numa_balancing disabled during benchmarking
- Hardware:
  - Model: Dell PowerEdge R740
  - CPU: 2× Intel Xeon Gold 6248R (Cascade Lake), 2 NUMA nodes (24 cores / 48 threads each)
  - RAM: 768 GB (384 GB per node) ECC DDR4 2400 MHz, all 12 memory channels populated
- Build: CPU backend, Release, -DGGML_NATIVE=ON -DGGML_AVX512=ON -DGGML_AVX512_VNNI=ON. (VBMI/BF16 are not enabled; Cascade Lake does not implement avx512_vbmi / avx512_bf16.)
- Tool: llama-bench, 3 repetitions per result (-r 3).
- Per-run flags: -rtr 1 -b 16 -ub 16 -p 512 -n 128 (run-time repacking on; batch and micro-batch 16; pp512 = prompt processing of 512 tokens, tg128 = generation of 128 tokens).

Modes compared (threads set equal for -t/-tb):

- isolate: --numa isolate -t 24 -tb 24 (one socket / 24 cores), the single-socket baseline
- mirror: --numa mirror -t 48 -tb 48 (both sockets, weights + KV duplicated per node)

All throughput numbers are tokens/second (higher is better).

Token generation (tg128)

| Model | isolate (1 socket, 24t) | mirror (2 sockets, 48t) | mirror vs isolate |
|---|---|---|---|
| gemma-4-E2B (dense, Q5_K_M) | 47.20 | 62.00 | 1.31× |
| gemma-4-E4B (dense, Q5_K_M) | 23.77 | 33.62 | 1.41× |
| gemma-4-26B-A4B (MoE, UD-Q4_K_M) | 23.59 | 34.76 | 1.47× |
| Qwen3.6-27B (dense, Q4_K_M) | 5.27 | 8.32 | 1.58× |
| Qwen3.6-35B-A3B (MoE, UD-Q5_K_M) | 24.70 | 31.56 | 1.28× |
| Qwen3.5-122B-A10B (MoE, UD-Q3_K_XL) | 10.00 | 14.46 | 1.45× |

Prompt processing (pp512)

| Model | isolate (1 socket, 24t) | mirror (2 sockets, 48t) | mirror vs isolate |
|---|---|---|---|
| gemma-4-E2B (dense, Q5_K_M) | 259.90 | 256.69 | 0.99× |
| gemma-4-E4B (dense, Q5_K_M) | 141.88 | 184.06 | 1.30× |
| gemma-4-26B-A4B (MoE, UD-Q4_K_M) | 143.41 | 201.69 | 1.41× |
| Qwen3.6-27B (dense, Q4_K_M) | 33.04 | 54.22 | 1.64× |
| Qwen3.6-35B-A3B (MoE, UD-Q5_K_M) | 153.68 | 193.21 | 1.26× |
| Qwen3.5-122B-A10B (MoE, UD-Q3_K_XL) | 57.17 | 83.01 | 1.45× |
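For anyone who wants to compare their own runs against these figures, the following Python sketch recomputes the "mirror vs isolate" ratios from the throughput numbers above and adds a geometric mean per test. The geometric mean is an extra summary chosen for this illustration, not something reported in the post, and the helper names `speedups` and `geomean` are hypothetical.

```python
import math

# (isolate tokens/s, mirror tokens/s), copied from the benchmark figures above.
TG128 = {
    "gemma-4-E2B": (47.20, 62.00),
    "gemma-4-E4B": (23.77, 33.62),
    "gemma-4-26B-A4B": (23.59, 34.76),
    "Qwen3.6-27B": (5.27, 8.32),
    "Qwen3.6-35B-A3B": (24.70, 31.56),
    "Qwen3.5-122B-A10B": (10.00, 14.46),
}
PP512 = {
    "gemma-4-E2B": (259.90, 256.69),
    "gemma-4-E4B": (141.88, 184.06),
    "gemma-4-26B-A4B": (143.41, 201.69),
    "Qwen3.6-27B": (33.04, 54.22),
    "Qwen3.6-35B-A3B": (153.68, 193.21),
    "Qwen3.5-122B-A10B": (57.17, 83.01),
}


def speedups(results):
    """Return mirror/isolate throughput ratio for each model."""
    return {model: mirror / isolate
            for model, (isolate, mirror) in results.items()}


def geomean(values):
    """Geometric mean, the usual way to average ratios."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


for name, table in (("tg128", TG128), ("pp512", PP512)):
    ratios = speedups(table)
    for model, ratio in ratios.items():
        print(f"{name} {model}: {ratio:.2f}x")
    print(f"{name} geometric mean: {geomean(ratios.values()):.2f}x")
```

To reuse it, replace the tuples with your own isolate and mirror results. A ratio of 2.00x would mean perfect scaling across two sockets, so the gap between that and the measured ratios gives a rough sense of how much headroom remains.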


Similar Articles

@no_stp_on_snek: got it here if ya want to try it out:

Experts first llama.cpp

Dual GPU llama.cpp speedup

Comparing dual-GPU inference speed between llama.cpp row/tensor split and ik_llama graph split

PSA: Test your "threads" argument in llama.cpp (+80% performance in my case)
