A developer forked ik_llama.cpp and added a '--numa mirror' mode that duplicates model weights and KV cache across NUMA nodes to maximize multi-socket CPU inference performance, sharing benchmarks and seeking testers.
GitHub: https://github.com/mikechambers84/ik_llama.cpp/tree/numa-mirror (be sure to check out the numa-mirror branch).

Sharing this for anyone else who's trying to use their multi-socket CPU systems for inference. I've been wanting a NUMA mirror mode for a long time, so I finally forked ik_llama.cpp and added it. ik_llama.cpp is a llama.cpp fork that adds major performance improvements for CPU inference, so it made sense to fork that rather than baseline llama.cpp.

For anyone who isn't aware of the problem this is meant to solve: multi-socket machines have memory that's local to each socket. When a CPU accesses its own local memory, it's very fast. If a CPU has to access non-local memory through a different socket, there's a huge performance penalty, because the data has to cross a bridge between sockets that's far slower than local memory. For most workloads this matters very little and you probably won't notice. But LLM inference performance is heavily bound by memory bandwidth, so performance completely tanks if you use multiple CPUs and they have to read large amounts of remote memory for each token.

The usual answer is just to use --numa isolate in llama.cpp, which pins model/context data to a single socket's CPU and memory. That eliminates remote memory accesses, but having multiple CPUs is no benefit here: all but one just sit idle.

This fork adds --numa mirror, which makes full duplicate copies of the model weights and KV cache so that every CPU socket has a node-local copy. This lets you use all of your CPU cores across all sockets to actually speed up inference instead of making it slower. The trade-off is obviously that you need more memory. If you have two CPU sockets, it needs twice the RAM.

I'm hoping ikawrakow will accept it in a pull request. I'll try to submit one soon, but I'm hoping to have more people test on various hardware configurations beyond mine first.
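To make the memory trade-off concrete, here is a rough planning sketch. It assumes the mirrored footprint is simply the number of NUMA nodes times (weights + KV cache), which is how the post describes it. The function names and the example sizes are hypothetical, and the fork's real overhead (buffers, repacking, etc.) is not accounted for.

```python
def mirror_ram_gib(weights_gib: float, kv_cache_gib: float, numa_nodes: int) -> float:
    """Estimate total RAM when every NUMA node holds a full copy of weights and KV cache.

    Assumption: footprint scales linearly with node count; compute buffers are ignored.
    """
    if numa_nodes < 1:
        raise ValueError("numa_nodes must be >= 1")
    return numa_nodes * (weights_gib + kv_cache_gib)


def fits_per_node(weights_gib: float, kv_cache_gib: float, ram_per_node_gib: float) -> bool:
    """Each node needs room for one complete copy in its own local memory."""
    return weights_gib + kv_cache_gib <= ram_per_node_gib


if __name__ == "__main__":
    # Hypothetical sizes: a 60 GiB model with a 4 GiB KV cache on a 2-socket box
    # that has 384 GiB of RAM attached to each socket.
    total = mirror_ram_gib(60.0, 4.0, numa_nodes=2)
    print(f"total mirrored footprint: {total:.0f} GiB")
    print("fits per node:", fits_per_node(60.0, 4.0, ram_per_node_gib=384.0))
```

The per-node check matters more than the total: a copy that doesn't fit in one socket's local RAM would spill to remote memory and defeat the purpose.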
My benchmarks are showing significant gains! My hardware is somewhat outdated, so I'd be interested to know how it runs on newer stuff.

Test setup

Operating System: Debian 13 "Trixie" with numa_balancing disabled during benchmarking
Hardware:
  Model: Dell PowerEdge R740
  CPU: 2× Intel Xeon Gold 6248R (Cascade Lake), 2 NUMA nodes (24 cores / 48 threads each)
  RAM: 768 GB (384 GB per node) ECC DDR4 2400 MHz, all 12 memory channels populated
Build: CPU backend, Release, -DGGML_NATIVE=ON -DGGML_AVX512=ON -DGGML_AVX512_VNNI=ON. (VBMI/BF16 are not enabled; Cascade Lake does not implement avx512_vbmi / avx512_bf16.)
Tool: llama-bench, 3 repetitions per result (-r 3).
Per-run flags: -rtr 1 -b 16 -ub 16 -p 512 -n 128 (run-time repacking on; batch and micro-batch 16; pp512 = prompt processing of 512 tokens, tg128 = generation of 128 tokens).

Modes compared (threads set equal for -t/-tb):
  isolate: --numa isolate -t 24 -tb 24 (one socket / 24 cores), the single-socket baseline
  mirror: --numa mirror -t 48 -tb 48 (both sockets, weights + KV duplicated per node)

All throughput numbers are tokens/second (higher is better).

Token generation (tg128)

| Model | isolate (1 socket, 24t) | mirror (2 sockets, 48t) | mirror vs isolate |
|---|---|---|---|
| gemma-4-E2B (dense, Q5_K_M) | 47.20 | 62.00 | 1.31× |
| gemma-4-E4B (dense, Q5_K_M) | 23.77 | 33.62 | 1.41× |
| gemma-4-26B-A4B (MoE, UD-Q4_K_M) | 23.59 | 34.76 | 1.47× |
| Qwen3.6-27B (dense, Q4_K_M) | 5.27 | 8.32 | 1.58× |
| Qwen3.6-35B-A3B (MoE, UD-Q5_K_M) | 24.70 | 31.56 | 1.28× |
| Qwen3.5-122B-A10B (MoE, UD-Q3_K_XL) | 10.00 | 14.46 | 1.45× |

Prompt processing (pp512)

| Model | isolate (1 socket, 24t) | mirror (2 sockets, 48t) | mirror vs isolate |
|---|---|---|---|
| gemma-4-E2B (dense, Q5_K_M) | 259.90 | 256.69 | 0.99× |
| gemma-4-E4B (dense, Q5_K_M) | 141.88 | 184.06 | 1.30× |
| gemma-4-26B-A4B (MoE, UD-Q4_K_M) | 143.41 | 201.69 | 1.41× |
| Qwen3.6-27B (dense, Q4_K_M) | 33.04 | 54.22 | 1.64× |
| Qwen3.6-35B-A3B (MoE, UD-Q5_K_M) | 153.68 | 193.21 | 1.26× |
| Qwen3.5-122B-A10B (MoE, UD-Q3_K_XL) | 57.17 | 83.01 | 1.45× |
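For anyone who wants to compare their own runs against these, the "mirror vs isolate" column is just mirror throughput divided by isolate throughput. The sketch below recomputes it from the tg128 numbers above and adds a scaling-efficiency figure (speedup divided by socket count). That efficiency metric is my own addition for illustration, not something reported in the post, and the helper names are hypothetical.

```python
# tg128 tokens/second copied from the post: model -> (isolate, mirror)
TG128 = {
    "gemma-4-E2B": (47.20, 62.00),
    "gemma-4-E4B": (23.77, 33.62),
    "gemma-4-26B-A4B": (23.59, 34.76),
    "Qwen3.6-27B": (5.27, 8.32),
    "Qwen3.6-35B-A3B": (24.70, 31.56),
    "Qwen3.5-122B-A10B": (10.00, 14.46),
}

SOCKETS = 2  # the test machine has two NUMA nodes


def speedup(isolate_tps: float, mirror_tps: float) -> float:
    """Throughput ratio of mirror mode over the single-socket isolate baseline."""
    return mirror_tps / isolate_tps


def scaling_efficiency(isolate_tps: float, mirror_tps: float, sockets: int = SOCKETS) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means N sockets gave N times the speed)."""
    return speedup(isolate_tps, mirror_tps) / sockets


if __name__ == "__main__":
    for model, (iso, mir) in TG128.items():
        s = speedup(iso, mir)
        e = scaling_efficiency(iso, mir)
        print(f"{model:20s} speedup {s:.2f}x  efficiency {e:.0%}")
```

To use it with other hardware, replace the TG128 entries with your own llama-bench results and set SOCKETS to your NUMA node count.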
A fork of llama.cpp integrating TurboQuant+ for advanced KV-cache and weight quantization, with cross-backend kernel support (Apple Silicon, NVIDIA CUDA, AMD ROCm, Vulkan) and used in production by LocalAI, Chronara, and AtomicChat.
A developer created an experimental fork of llama.cpp that offloads only used experts instead of entire layers to VRAM, achieving speed improvements for MoE models on GPUs with limited VRAM like the RTX 2060 12GB. The author is asking for testers to validate performance on other Nvidia GPUs.
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.
A user benchmarks dual-GPU inference speed on two RTX 3080 20GB using llama.cpp (row/tensor split) and ik_llama (graph split) with a Qwen3.6-27B GGUF model, comparing token generation and prompt processing speeds.
A user benchmarks thread count for hybrid CPU-GPU inference with Gemma 4 in llama.cpp, discovering an 80% performance uplift by using 16 threads instead of 6 on a hybrid core CPU, and shares the optimal command configuration.