[llama.cpp] Asymmetric KV q8/q4 cache: current caveats and discussion in GGML repo

Reddit r/LocalLLaMA 05/22/26, 01:07 PM Tools

llm llama.cpp kv-cache quantization cuda memory-optimization open-source

Summary

Discusses caveats of using asymmetric KV cache quantization in llama.cpp, where mismatched q8/q4 types cause prompt processing on CPU instead of GPU, and a proposed fix via compilation flags.

Probably most of you are aware that using anything other than `-ctk q8_0 -ctv q8_0 / -ctk q4_0 -ctv q4_0` as startup options for llama.cpp leads to prompt processing on cpu instead of gpu for cuda at least. E.g. when we use the frequently suggested mix of `-ctk q8_0 -ctv q4_0` pps tanks. I have discussed this with a prop LLM and it suggested to add some slight modifications to the cuda source code of llama.cpp or use `cmake -DGGML_CUDA_FA_ALL_QUANTS=ON ..` which will take very long. But coincidentially, user sanmai on github did a small eval and suggested to include the kv cache quant combo during compilation, even without FA_ALL_QUANTS, so that would be great. Discussion is here, it is worth a read as the eval confirms that using the async 8/4 bit kv quant only costs 1.3% precision while saving more than half of memory compared to f16/f16: https://github.com/ggml-org/llama.cpp/discussions/23470

Original Article

Similar Articles

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

Reddit r/LocalLLaMA

A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.

kv-cache : avoid kv cells copies by ggerganov · Pull Request #24277 · ggml-org/llama.cpp

Reddit r/LocalLLaMA

This pull request by ggerganov optimizes kv-cache in llama.cpp to avoid unnecessary copies of kv cells, improving inference performance. It is a contribution to the open-source LLM inference library llama.cpp.

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

Reddit r/LocalLLaMA

llama.cpp version b9455 merges a fix for `-sm tensor` KV cache quantization on multi-GPU setups, addressing a shape information loss issue when flattening tensors.

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

Reddit r/LocalLLaMA

Improves prefill speeds for k-quants and refactors matrix multiplication for Q4/Q5/Q8 and k-quants in llama.cpp's WebGPU backend.

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Reddit r/LocalLLaMA

A detailed benchmark comparing KV cache quantization methods (TurboQuant, TCQ, q4, q5, q8) using PPL and KLD metrics on Qwen 3.6 27B, finding that TCQ improves low-bit quantization, asymmetric KV beats symmetric at same size, and q8 is often overkill. Includes analysis and data in linked article.

Similar Articles

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

kv-cache : avoid kv cells copies by ggerganov · Pull Request #24277 · ggml-org/llama.cpp

ICYM: llama.cpp b9455 --SM Tensor KV Cache Fix is MERGED

ggml-webgpu: Improve prefill speeds for k-quants + refactor matmul for Q4/Q5/Q8 and k-quants by yomaytk · Pull Request #24225 · ggml-org/llama.cpp

Here are my KV cache quantization benchmarks: TurboQuant is overrated but saved by TCQ, q5 deserves more attention, and symmetric q8 might be a waste of VRAM

Submit Feedback