80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Reddit r/LocalLLaMA 05/09/26, 11:57 AM Tools

llama.cpp local-inference qwen speculative-decoding gpu-optimization mtp

Summary

A user shares a configuration for achieving over 80 tokens per second with Qwen3.6 35B A3B on a 12GB VRAM GPU using llama.cpp and Multi-Token Prediction (MTP). The post includes benchmark results and specific command-line parameters to optimize performance.

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec with 80%+ draft acceptance rate on the benchmark found here: [https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py](https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py) This is on an RTX 4070 Super, so results with other cards might vary. To run llama.cpp with MTP support, you need to build it from source and add a draft PR that hasn't yet been merged with the master branch. You can find a very nice guide on how to do that here and also download the Qwen3.6 MTP GGUF: [https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF](https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF) \- Thanks u/havenoammo! llama.cpp command: llama-server \ -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \ -fitt 1536 \ -c 131072 \ -n 32768 \ -fa on \ -np 1 \ -ctk q8_0 \ -ctv q8_0 \ -ctkd q8_0 \ -ctvd q8_0 \ -ctxcp 64 \ --no-mmap \ --mlock \ --no-warmup \ --spec-type mtp \ --spec-draft-n-max 2 \ --chat-template-kwargs '{"preserve_thinking": true}' \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.0 \ --presence-penalty 0.0 \ --repeat-penalty 1.0 The most important parameter here is -fitt 1536. Since part of the model is offloaded to CPU because of its size and , this tells llama.cpp to properly balance the load on the GPU/CPU to get the best possible performance, and leaves 1536 MB of free memory for the MTP draft model and KV cache. Since I'm running my dGPU as a secondary GPU (monitor plugged in the iGPU), I can use all the available 12GB VRAM for inference. 1536 might be too small if you use your dGPU as your primary GPU, so test it out first. You can also try different values for -spec-draft-n-max. I got slightly better tok/sec with 3, but a much better acceptance rate with 2, so the trade off was not worth it. With MTP, you want to maximize speed AND acceptance, so you need to find the best balance between both. Benchmark results: mtp-bench.py code_python pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8 code_cpp pred= 58 draft= 40 acc= 37 rate=0.925 tok/s=81.8 explain_concept pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0 summarize pred= 53 draft= 40 acc= 32 rate=0.800 tok/s=75.4 qa_factual pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8 translation pred= 22 draft= 16 acc= 13 rate=0.812 tok/s=81.9 creative_short pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2 stepwise_math pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5 long_code_review pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2 If you have any questions, feel free to ask :) Cheers.

Original Article

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Similar Articles

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.

Submit Feedback

Similar Articles

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

Qwen3.5-122B-Q5-MTP - Qwen3.5-122B-Q6-MTP

I tested MTP on vLLM and llama.cpp for Gemma 4 & Qwen 3.6 — 3.34x faster inference, here are my findings RTX 6000 PRO.