Maybe KV cache offload to RAM isn't bad

Reddit r/LocalLLaMA Tools

Summary

A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.

So, llama.cpp has the `-nkvo` (`--no-kv-offload`) option, which keeps the KV cache in system RAM instead of VRAM. Many people avoid it because it obviously hurts performance. But every option exists as a trade-off, and in my case I think it's worth it. Hear me out.

I'm running Qwen3.6 27B (IQ4_XS) on an RTX 5060 Ti 16GB with 32GB of DDR5. To fit 65k context, I have to quantize the KV cache down to q4_0 and keep only 58 layers on the GPU. This gives me **23 tps at peak, down to 16 tps during long generation**.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
      -ctk q4_0 -ctv q4_0 -fa on -ngl 58 -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

Adding `-nkvo`, I'm able to fit the whole model on the GPU and keep the default f16 for the KV cache. The speed dropped to **19 tps at peak, and 14 tps during long generation**. Not a bad trade-off.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
      -fa on -ngl 99 -nkvo -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

The interesting part is that I can even double the context window to 128k by keeping 63 out of 65 layers (for the MTP version) on the GPU. The generation speed didn't change much.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 131072 \
      -fa on -ngl 63 -nkvo -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

Quantizing the KV cache while it is offloaded to RAM didn't seem to give any speed improvement, and in some cases I found it hurt performance. So we basically get f16 quality for free.
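For anyone who wants to estimate how much memory the cache type and context length cost before trying this, here is a rough back-of-the-envelope sketch. The per-element sizes follow from the ggml block layouts (f16 is 2 bytes, q8_0 is 34 bytes per 32 values, q4_0 is 18 bytes per 32 values). The architecture numbers in `hypothetical_arch` are made-up placeholders, not the real Qwen3.6 27B configuration, which the post does not give. The function name and parameters are also just illustrative.

```python
# Rough KV cache size estimate.
# ASSUMPTION: the architecture values used in the demo are hypothetical
# placeholders, NOT the actual Qwen3.6 27B configuration.

# Bytes per stored element for common llama.cpp cache types:
#   f16  -> 2 bytes per value
#   q8_0 -> 32 int8 values + one fp16 scale = 34 bytes per 32 values
#   q4_0 -> 32 4-bit values + one fp16 scale = 18 bytes per 32 values
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_ctx, n_kv_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Estimate KV cache size in bytes.

    Each layer that holds a KV cache stores one K and one V vector of
    n_kv_heads * head_dim values per token.
    """
    elems_per_token = 2 * n_kv_layers * n_kv_heads * head_dim
    return n_ctx * elems_per_token * BYTES_PER_ELEM[cache_type]


def to_gib(n_bytes):
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical architecture, for illustration only.
    hypothetical_arch = dict(n_kv_layers=16, n_kv_heads=4, head_dim=256)
    for n_ctx, cache_type in [(65000, "q4_0"), (65000, "f16"), (131072, "f16")]:
        size = kv_cache_bytes(n_ctx, cache_type=cache_type, **hypothetical_arch)
        print(f"ctx={n_ctx:>6} type={cache_type:<4} ~{to_gib(size):.2f} GiB")
```

Whatever the real architecture is, the ratio holds: an f16 cache takes about 3.6 times the memory of a q4_0 cache at the same context length, which is the VRAM that `-nkvo` moves over to system RAM.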
So the takeaway is: if you find yourself quantizing the KV cache down just to make the model fit, or you need a bigger context window, you might be better off offloading the KV cache to RAM instead.
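To put the speed cost in relative terms, a small sketch using only the tps figures reported above (the function name is just illustrative):

```python
def slowdown_pct(before_tps, after_tps):
    """Relative throughput loss, as a percentage of the original speed."""
    return (before_tps - after_tps) / before_tps * 100


# Figures from the post: q4_0 cache in VRAM with 58 layers on the GPU,
# versus f16 cache in RAM (-nkvo) with all layers on the GPU.
peak = slowdown_pct(23, 19)
sustained = slowdown_pct(16, 14)

print(f"peak: {peak:.1f}% slower, long generation: {sustained:.1f}% slower")
```

On this setup that works out to roughly a 17% hit at peak and 12.5% during long generation, in exchange for an f16 cache and the full model on the GPU. Other hardware will likely land somewhere different, so it's worth measuring your own numbers.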
