A thread sharing practical tips for freeing up GPU memory in llama.cpp, such as offloading mmproj to CPU and adjusting KV cache types, while discussing parameters like --cache-type-k/v and --spec-draft-n-max.
For the past week or two, llama.cpp has been working much better from a RAM usage perspective. I no longer see any memory leaks, and everything fits nicely on the GPU. My defaults are --n-gpu-layers 99 --no-mmap --mlock to avoid using regular RAM, since I run my 3090 in an eGPU setup:

Qwen3.6-27B-UD-Q5_K_XL-mtp, q4_0, 150k context

I wanted to create this thread to see if there are any additional tricks for freeing up even more memory so that I can further increase my context size.

My list of VRAM-related parameters for a given model (the model itself is, of course, the biggest factor in memory footprint):

--no-mmproj-offload: this is the biggest win. If you have a model with vision, you can keep the mmproj on the CPU. Performance drops a little, but you end up with 1GB of additional free space on your card.

--cache-type-k, --cache-type-v: the KV cache (obviously). Quantizing it reduces memory allocation by 50%, 75%, etc., but of course quality drops in return. My observation is that since attention rotation was introduced, I can even use q4 without much noticeable drop in quality. The VRAM I save lets me run a bigger base model, which helps me more than the KV cache quantization hurts.

--cache-type-k-draft, --cache-type-v-draft: the same applies to the MTP model's KV cache.

--spec-draft-n-max: guess up to x future tokens ahead in a single forward pass. With coding, I'm usually fine with "2" as the value. "1" consumes slightly less memory, but TPS drops about 5%. "3" doesn't make sense for my use case: it consumes more memory, but gives the same TPS as "1".

--flash-attn on: this is the default value by now, as far as I know. Memory allocation would grow if you turned it off, but you cannot turn it off anyway if you use a quantized V cache.

Parameters I thought would help, until I realized they actually don't:

--ctx-checkpoints: I've heard that decreasing this value would also decrease memory allocation, but that's not the case for me. The default is 64, and nothing changes for me when I decrease it to a small value.

--parallel: the number of active user requests at a time. Since 1 is the default value, you cannot do anything with it in a single-user setup. However, if you increase it, the KV cache for your main session will be reduced accordingly (50%, 66%, etc.).

--fit-target: sets a strict safety margin (in megabytes, default 1024) that the engine must leave completely empty on your GPU (for example, reserved for video I/O). Since my monitor is plugged into a different card, I reduced it to 64, but it didn't help at all. As far as I know, llama.cpp now runs an internal calculation loop at startup to automatically adjust some variables and protect itself from an OOM crash.

I've shared my tips, what's one of yours? Is there anything else at all? Is your experience different from mine? Thanks!
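
For a rough feel of the KV cache numbers above, here is a minimal Python sketch that estimates cache size per cache type. It is a back-of-the-envelope calculation, not how llama.cpp itself does its accounting. The model dimensions (n_layers, n_kv_heads, head_dim) are hypothetical placeholders and not the real values for the model in this post. The function names are made up for illustration. The per-element sizes follow from ggml's block layout, where q8_0 and q4_0 store 32 values plus one 2-byte scale per block, so the savings against f16 come out near 47% and 72% rather than a flat 50% and 75%. The per-slot helper assumes the context is divided evenly between --parallel slots, as described in the post.

```python
# Bytes per cached element for a few --cache-type-k / --cache-type-v values.
# q8_0 and q4_0 pack 32 values per block with one fp16 (2-byte) scale.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 one-byte values + 2-byte scale
    "q4_0": 18 / 32,  # 32 four-bit values (16 bytes) + 2-byte scale
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    elems = n_ctx * n_layers * n_kv_heads * head_dim
    return elems * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])


def ctx_per_slot(n_ctx, n_parallel):
    """Context available to each slot, ASSUMING an even split across slots."""
    return n_ctx // n_parallel


if __name__ == "__main__":
    # HYPOTHETICAL dimensions, chosen only to make the arithmetic concrete.
    dims = dict(n_layers=16, n_kv_heads=4, head_dim=256)
    n_ctx = 150_000

    baseline = kv_cache_bytes(n_ctx, **dims)
    for cache_type in ("f16", "q8_0", "q4_0"):
        size = kv_cache_bytes(n_ctx, type_k=cache_type, type_v=cache_type, **dims)
        saved = 1 - size / baseline
        print(f"{cache_type:5s} {size / 2**30:6.2f} GiB  ({saved:.0%} smaller than f16)")

    for slots in (1, 2, 3):
        print(f"--parallel {slots}: {ctx_per_slot(n_ctx, slots)} tokens per slot")
```

Swap in the real layer count and KV head dimensions of your model (llama.cpp prints them at load time) to get numbers you can compare against the actual allocation in the startup log.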
User seeks advice on preventing llama.cpp from offloading KV cache to swap before RAM is fully exhausted, sharing their configuration on an M2 Max with 96GB RAM and a large Qwen model.
A user shares their experience with llama.cpp server's model offloading, noting performance trade-offs and quiet operation, and asks for resources to understand how the tool manages memory across VRAM and system RAM.
A user shares their experience offloading the KV cache to RAM in llama.cpp, achieving comparable speeds while freeing VRAM for larger models and context windows, suggesting this trade-off is often worthwhile.
Testing shows that default pipeline parallelism in llama.cpp wastes VRAM with no speed benefit; compiling with GGML_SCHED_MAX_COPIES=1 saves significant VRAM while maintaining identical inference speed.
A fork of llama.cpp fixes the --split-mode tensor issue with quantized KV caches, achieving up to 40% speed improvement on dual GPU setups without quality loss.