Tag
The article explains Non-Uniform Memory Access (NUMA), its historical context, and how it affects performance in multi-socket servers, and introduces Edera's work on making Xen-based virtualization NUMA-aware end to end.
A developer forked ik_llama.cpp and added a '--numa mirror' mode that duplicates model weights and the KV cache across NUMA nodes to maximize multi-socket CPU inference performance. The post shares benchmarks and asks for testers.