Tag
Venkat explains that unoptimized CPU work in the hot path can severely degrade inference performance, and introduces his PR to Mooncake, which adds a memory arena for lock-free, allocation-free operations and benefits the vLLM and SGL projects.
An in-depth technical blog post explaining how to efficiently transpose matrices using SIMD instructions on modern x86_64 CPUs, focusing on AVX2 intrinsics like _mm256_shuffle_epi8.
Pull request that adds optimized x86 and generic CPU q1_0 dot-product kernels to ggml-cpu, improving quantized LLM inference speed.
Research on optimizing 2D graphics rendering on CPUs, using sparse strip techniques to improve performance and reduce memory overhead.