A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.
So, llama.cpp has the `-nkvo` (`--no-kv-offload`) option, which keeps the KV cache in system RAM instead of VRAM. Many people avoid it because it obviously hurts performance. But every option comes with a trade-off, and in my case I think it's worth it. Hear me out.

I'm running Qwen3.6 27B (IQ4\_XS) on an RTX 5060 Ti 16GB with 32GB of DDR5. To fit 65k context, I have to quantize the KV cache down to q4\_0 and keep only 58 layers on the GPU. This gives me **23 tps at peak, down to 16 tps during long generation**.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
      -ctk q4_0 -ctv q4_0 -fa on -ngl 58 -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

With `-nkvo` added, I can fit the whole model on the GPU and keep the default f16 KV cache. The speed dropped to **19 tps at peak, and 14 tps during long generation**, which is roughly 17% slower at peak and 12.5% slower in long generation. Not a bad trade-off.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 65000 \
      -fa on -ngl 99 -nkvo -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

The interesting part is that I can even double the context window to 128k by keeping 63 out of 65 layers (for the MTP version) on the GPU. The generation speed didn't change much.

    llama-server -m Qwen3.6-27B-IQ4_XS.gguf -c 131072 \
      -fa on -ngl 63 -nkvo -np 1 \
      --temp 0.6 --top-p 0.95 --top-k 20 --presence-penalty 1.25 \
      --min-p 0.0 --chat-template-kwargs '{"preserve_thinking":true}' \
      --spec-type draft-mtp --spec-draft-n-max 2

Quantizing the KV cache while it sits in RAM didn't seem to improve speed, and in some cases it even hurt performance, so we basically get f16 quality for free.
So the takeaway is: if you find yourself lowering the KV cache precision just to make the model fit, or you need a larger context window, you may be better off offloading the KV cache to RAM instead.
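
To get a feel for how much memory the KV cache takes in VRAM or RAM, here is a rough back-of-the-envelope sketch in Python. It uses the usual estimate for standard attention (K and V tensors per layer, per KV head, per token) and the ggml block sizes for `q8_0` and `q4_0`. The function names and the model dimensions in the example are hypothetical placeholders, not the real Qwen3.6 values; read the real ones from your model's GGUF metadata. For hybrid architectures, count only the layers that actually hold a KV cache.

```python
# Rough KV cache size estimate for a standard attention stack.
# NOTE: the model dimensions used below are hypothetical placeholders,
# NOT the real Qwen3.6 27B values.

# Average bytes per stored element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,       # 16 bits per value
    "q8_0": 34 / 32,  # block of 32 int8 values plus one f16 scale
    "q4_0": 18 / 32,  # block of 32 4-bit values plus one f16 scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Bytes needed to hold K and V for n_ctx tokens across n_layers."""
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = K and V
    return n_ctx * elems_per_token * BYTES_PER_ELEMENT[cache_type]


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    dims = dict(n_layers=16, n_kv_heads=4, head_dim=256)
    for n_ctx in (65000, 131072):
        for cache_type in ("f16", "q4_0"):
            size = gib(kv_cache_bytes(n_ctx, cache_type=cache_type, **dims))
            print(f"ctx={n_ctx:>6} {cache_type:>5}: {size:.2f} GiB")
```

Whatever the real dimensions are, the ratio between the cache types is fixed: f16 takes about 3.6x the space of q4\_0. That is the VRAM you get back for model layers when the cache moves to RAM with `-nkvo`.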
The author has open-sourced a novel KV-cache solution called catalyst-brain, claiming to dramatically reduce RAM usage for local models and potentially enable infinite context windows.
A user reports that llama.cpp with ROCm consumes significantly more VRAM for the KV cache than the Vulkan backend, despite identical model and settings, prompting investigation into potential causes.
A user shares their experience with llama.cpp server's model offloading, noting performance trade-offs and quiet operation, and asks for resources to understand how the tool manages memory across VRAM and system RAM.
A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.
A setting in llama.cpp's webUI re-sends generated tokens to the KV cache to significantly reduce prompt processing latency, improving responsiveness for long generations or tool calls without apparent trade-offs.