llama.cpp - how to free up even more space on your GPU

Reddit r/LocalLLaMA Tools

Summary

A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.

For the past week or two, llama.cpp has been working much better from a RAM usage perspective. I no longer see any memory leaks, and everything fits nicely on the GPU. My defaults are --n-gpu-layers 99 --no-mmap --mlock to avoid using regular RAM, since I use my 3090 in an eGPU setup: Qwen3.6-27B-UD-Q5_K_XL-mtp, q4_0, 150k context.

I wanted to create this thread to see if there are any additional tricks for freeing up even more memory so that I can further increase my context size.

My list of VRAM-related parameters for a given model (the model itself is, of course, the biggest factor in memory footprint):

--no-mmproj-offload: this is the biggest win. If you have a model with vision, you can keep the mmproj on the CPU instead of the GPU. Performance drops a little, but you'll end up with 1GB of additional free space on your card.

--cache-type-k, --cache-type-v: KV cache (obviously). Quantizing it reduces memory allocation by 50%, 75%, etc., but of course quality drops in return. My observation is that since attention rotation was introduced, I can even use q4 without much noticeable drop in quality. The memory saved lets me use a bigger base model, which helps me more than the quantized KV cache hurts.

--cache-type-k-draft, --cache-type-v-draft: the same applies to the MTP model's KV cache.

--spec-draft-n-max: guess up to x future tokens ahead in a single forward pass. With coding, I'm usually fine with "2" as the value. "1" consumes slightly less memory, but TPS drops about 5%. "3" doesn't make sense for my use case: it consumes more memory, but gives the same TPS as with "1".

--flash-attn on: this is the default value by now, as far as I know. Memory allocation would grow if you turned it off, but you cannot turn it off anyway if you use a quantized V cache.

Parameters I thought would help, until I realized they actually don't:

--ctx-checkpoints: I've heard that decreasing this value would also decrease memory allocation, but that's not the case for me. The default is 64, and nothing changes for me when I decrease it to a small value.

--parallel: number of active user requests at a time. Since 1 is the default value, you cannot do anything with it in a single-user setup. However, if you increase it, the KV cache for your main session will be reduced accordingly (50%, 66%, etc.).

--fit-target: sets a strict safety buffer margin (in megabytes, default 1024) that the engine must leave completely empty on your GPU (for example, reserved for video I/O). Since my monitor is plugged into a different card, I reduced it to 64, but it didn't help at all. As far as I know, llama.cpp now runs an internal calculation loop at startup to automatically adjust some variables to prevent itself from an OOM crash.

I've shared my tips, what's one of yours? Is there anything else at all? Is your experience different from mine? Thanks!
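To put rough numbers on the --cache-type-k/v savings and the --parallel split described in the post, here is a minimal back-of-the-envelope sketch. It is an estimate, not llama.cpp's actual allocator: it assumes every layer keeps a full K and V cache (models with sliding-window or recurrent layers allocate less), and the model shape used in the demo (48 layers, 8 KV heads, head size 128) is a hypothetical placeholder, not the real Qwen config. The function names are made up for illustration. The bytes-per-element values follow from the ggml block layouts, where q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Bytes per cached element. Quantized ggml blocks hold 32 values plus one
# fp16 scale: q8_0 = 34 bytes per block, q4_0 = 18 bytes per block.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size in bytes, assuming a full cache in every layer."""
    elements = n_ctx * n_layers * n_kv_heads * head_dim  # count for K (same for V)
    return elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])


def context_per_slot(n_ctx, n_parallel):
    """Context left for each request when the cache is split evenly,
    following the post's description of --parallel."""
    return n_ctx // n_parallel


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=48, n_kv_heads=8, head_dim=128)
    baseline = kv_cache_bytes(150_000, **shape)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(150_000, type_k=cache_type, type_v=cache_type, **shape)
        saved = 100 * (1 - size / baseline)
        print(f"{cache_type}: {size / 2**30:.2f} GiB ({saved:.0f}% smaller than f16)")
    print("context per slot with --parallel 2:", context_per_slot(150_000, 2))
```

Because of the per-block scale, the real savings are a bit below the round "50%, 75%" figures: q8_0 comes out around 47% smaller than f16 and q4_0 around 72% smaller.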
