80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

Reddit r/LocalLLaMA Tools

Summary

A user shares a llama.cpp configuration that runs Qwen3.6 35B A3B with a 128K context on a 12GB VRAM GPU (RTX 4070 Super) using Multi-Token Prediction (MTP), reaching roughly 69 to 82 tokens per second across the benchmark prompts. The post includes the full command line, an explanation of the key parameters, and per-prompt benchmark results.

Just wanted to share my config in hopes of helping other 12GB GPU owners achieve what I see as very respectable token generation speeds with modest VRAM. Using the latest llama.cpp build + MTP PR, I got over 80 tok/sec with 80%+ draft acceptance rate on the benchmark found here: [https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py](https://gist.githubusercontent.com/am17an/228edfb84ed082aa88e3865d6fa27090/raw/7a2cee40ee1e2ca5365f4cef93632193d7ad852a/mtp-bench.py)

This is on an RTX 4070 Super, so results with other cards might vary.

To run llama.cpp with MTP support, you need to build it from source and apply a draft PR that hasn't yet been merged into the master branch. You can find a very nice guide on how to do that here and also download the Qwen3.6 MTP GGUF: [https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF](https://huggingface.co/havenoammo/Qwen3.6-35B-A3B-MTP-GGUF) (thanks u/havenoammo!)

llama.cpp command:

    llama-server \
      -m Qwen3.6-35B-A3B-MTP-UD-Q4_K_XL.gguf \
      -fitt 1536 \
      -c 131072 \
      -n 32768 \
      -fa on \
      -np 1 \
      -ctk q8_0 \
      -ctv q8_0 \
      -ctkd q8_0 \
      -ctvd q8_0 \
      -ctxcp 64 \
      --no-mmap \
      --mlock \
      --no-warmup \
      --spec-type mtp \
      --spec-draft-n-max 2 \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --temp 0.6 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0

The most important parameter here is -fitt 1536. Since part of the model is offloaded to CPU because of its size, this tells llama.cpp to properly balance the load between the GPU and CPU to get the best possible performance, and leaves 1536 MB of free memory for the MTP draft model and KV cache.

Since I'm running my dGPU as a secondary GPU (monitor plugged into the iGPU), I can use all the available 12GB VRAM for inference. 1536 might be too small if you use your dGPU as your primary GPU, so test it out first.

You can also try different values for --spec-draft-n-max. I got slightly better tok/sec with 3, but a much better acceptance rate with 2, so the trade-off was not worth it. With MTP, you want to maximize speed AND acceptance, so you need to find the best balance between both.

Benchmark results:

    mtp-bench.py
    code_python       pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
    code_cpp          pred=  58 draft=  40 acc=  37 rate=0.925 tok/s=81.8
    explain_concept   pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
    summarize         pred=  53 draft=  40 acc=  32 rate=0.800 tok/s=75.4
    qa_factual        pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
    translation       pred=  22 draft=  16 acc=  13 rate=0.812 tok/s=81.9
    creative_short    pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
    stepwise_math     pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
    long_code_review  pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2

If you have any questions, feel free to ask :)

Cheers.
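For anyone comparing their own runs against the numbers in the post, here is a minimal Python sketch that parses output in the same shape as the benchmark lines above and aggregates it. The regular expression and the helper names `parse_bench` and `summarize` are my own assumptions for illustration, not part of mtp-bench.py; the data is copied from the post.

```python
import re

# Benchmark lines copied from the post; the header line is skipped by the parser.
BENCH_OUTPUT = """\
mtp-bench.py
code_python pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=80.8
code_cpp pred= 58 draft= 40 acc= 37 rate=0.925 tok/s=81.8
explain_concept pred= 192 draft= 152 acc= 114 rate=0.750 tok/s=70.0
summarize pred= 53 draft= 40 acc= 32 rate=0.800 tok/s=75.4
qa_factual pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=77.8
translation pred= 22 draft= 16 acc= 13 rate=0.812 tok/s=81.9
creative_short pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=69.2
stepwise_math pred= 192 draft= 144 acc= 119 rate=0.826 tok/s=76.5
long_code_review pred= 192 draft= 148 acc= 117 rate=0.790 tok/s=73.2
"""

# Assumed line format, inferred only from the output shown in the post.
LINE_RE = re.compile(
    r"^(?P<name>\S+)\s+pred=\s*(?P<pred>\d+)\s+draft=\s*(?P<draft>\d+)"
    r"\s+acc=\s*(?P<acc>\d+)\s+rate=(?P<rate>[\d.]+)\s+tok/s=(?P<tps>[\d.]+)$"
)


def parse_bench(text):
    """Return one dict per benchmark line; lines that do not match are ignored."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            rows.append({
                "name": m["name"],
                "pred": int(m["pred"]),
                "draft": int(m["draft"]),
                "acc": int(m["acc"]),
                "rate": float(m["rate"]),
                "tok_s": float(m["tps"]),
            })
    return rows


def summarize(rows):
    """Aggregate acceptance over all drafted tokens and summarize throughput."""
    total_draft = sum(r["draft"] for r in rows)
    total_acc = sum(r["acc"] for r in rows)
    speeds = [r["tok_s"] for r in rows]
    return {
        "prompts": len(rows),
        "overall_acceptance": total_acc / total_draft,
        "mean_tok_s": sum(speeds) / len(speeds),
        "min_tok_s": min(speeds),
        "max_tok_s": max(speeds),
    }


if __name__ == "__main__":
    summary = summarize(parse_bench(BENCH_OUTPUT))
    for key, value in summary.items():
        print(f"{key}: {value:.3f}" if isinstance(value, float) else f"{key}: {value}")
```

On the numbers from the post this gives an overall acceptance of about 0.81 (787 accepted out of 976 drafted tokens) and a mean of about 76 tok/sec, with individual prompts ranging from 69.2 to 81.9 tok/sec.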
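The trade-off between draft length and acceptance rate can be put into rough numbers. The sketch below is a simplified accounting model, not how llama.cpp measures anything: it assumes every target-model pass proposes exactly `n_draft` draft tokens, that a fraction `accept_rate` of them is accepted, and that the target model contributes one additional token per pass. It ignores the extra cost of drafting and verifying more tokens. The function names are hypothetical.

```python
def tokens_per_pass(accept_rate: float, n_draft: int) -> float:
    """Mean tokens emitted per target-model pass under the stated assumptions."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if n_draft < 1:
        raise ValueError("n_draft must be >= 1")
    # One token always comes from the target model, plus the accepted drafts.
    return 1.0 + accept_rate * n_draft


def breakeven_rate(rate_old: float, n_old: int, n_new: int) -> float:
    """Acceptance rate needed at n_new to match tokens per pass at (rate_old, n_old)."""
    if n_old < 1 or n_new < 1:
        raise ValueError("draft lengths must be >= 1")
    return rate_old * n_old / n_new


def estimated_generated(drafted: int, accepted: int, n_draft: int) -> float:
    """Estimate generated tokens as (number of passes) + (accepted draft tokens)."""
    return drafted / n_draft + accepted


if __name__ == "__main__":
    # Sanity check against the code_python row from the post (pred=192),
    # which was run with --spec-draft-n-max 2.
    print("estimated tokens:", estimated_generated(drafted=132, accepted=125, n_draft=2))

    # Overall acceptance from the post's results: 787 accepted / 976 drafted.
    rate = 787 / 976
    print("tokens per pass at n=2:", round(tokens_per_pass(rate, 2), 2))
    print("break-even rate at n=3:", round(breakeven_rate(rate, 2, 3), 2))
```

With the post's overall acceptance of about 0.81 at a draft length of 2, this model gives about 2.6 tokens per pass, and a draft length of 3 would need an acceptance rate above roughly 0.54 to emit more tokens per pass. Each pass also gets more expensive with a longer draft, so tokens per pass alone does not predict tok/sec, which is why measuring both settings on your own hardware is still necessary.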