Maybe KV cache offload to RAM isn't bad

Reddit r/LocalLLaMA 06/05/26, 04:23 PM Tools

llama-cpp kv-cache offloading vram ram performance memory-optimization

Summary

A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.

So, llama.cpp has the `-nkvo` (`--no-kv-offload`) option, which keeps the KV cache in system RAM instead of offloading it to VRAM. Many people avoid it because it obviously hurts performance. But every option comes with a trade-off, and in my case I think it's worth it. Hear me out.

I'm running Qwen3.6 27B (IQ4\_XS) on an RTX 5060 Ti 16GB with 32GB of DDR5. To fit a 65k context, I have to quantize the KV cache down to q4\_0 and keep only 58 layers on the GPU. This gives me **23 tps at peak, down to 16 tps during long generation**.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
      -ctk q4_0 -ctv q4_0 -fa on -ngl 58 -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

With `-nkvo` added, I can fit the whole model on the GPU and keep the default f16 KV cache. The speed dropped to **19 tps at peak, and 14 tps during long generation**. Not a bad trade-off.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
      -fa on -ngl 99 -nkvo -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

The interesting part is that I can even double the context window to 128k by keeping 63 out of 65 layers (for the MTP version) on the GPU. The generation speed didn't change much.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 131072 \
      -fa on -ngl 63 -nkvo -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

Quantizing the KV cache while it sits in RAM didn't seem to give any speed improvement, and in some cases I found it hurt performance, so we basically get f16 quality for free.

So the takeaway is: if you find yourself quantizing the KV cache down just to make the model fit, or you need a larger context window, you might be better off keeping the KV cache in RAM instead.
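
For anyone wanting to guess how much memory the KV cache moves between VRAM and RAM, here is a rough back-of-the-envelope sketch in Python. It assumes a plain transformer where every layer uses full attention, and the architecture numbers in `example_arch` are hypothetical placeholders, not Qwen3.6's real values. Read the real ones from your GGUF metadata, or just check the KV buffer size llama.cpp reports at startup. The bytes-per-element figures follow the ggml block layouts (q4\_0 stores 32 values in 18 bytes, q8\_0 stores 32 values in 34 bytes).

```python
# Bytes per cached element for a few llama.cpp KV cache types.
# f16 is 2 bytes; q8_0 packs 32 values into 34 bytes; q4_0 packs 32 into 18.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate K+V cache size, assuming full attention in every layer."""
    # The factor 2 accounts for storing both keys and values.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # HYPOTHETICAL placeholder architecture, for illustration only.
    example_arch = dict(n_layers=16, n_kv_heads=4, head_dim=256)
    for n_ctx, cache_type in [(65000, "q4_0"), (65000, "f16"), (131072, "f16")]:
        size = kv_cache_bytes(n_ctx, cache_type=cache_type, **example_arch)
        print(f"ctx={n_ctx:>6} {cache_type:>5}: {gib(size):.2f} GiB")
```

Whatever the real architecture is, the ratio holds: an f16 cache is about 3.6x the size of a q4\_0 cache at the same context length, which is roughly the amount of VRAM `-nkvo` hands back to the model weights. Models with sliding-window or hybrid attention layers will need less than this estimate.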
