Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Reddit r/LocalLLaMA 05/22/26, 11:29 PM Models

qwen quantization gguf vram inference-speed open-source

Summary

A Q4_K_M quantization of Qwen3.6 27B built with the pure quantization method fits entirely in 16 GB of VRAM, reaches up to 40 tok/s token generation with MTP, and is 1.4 to 2.9 GB smaller than the other Q4_K_M GGUF variants it is compared against.

Hello everyone! I want to share the result of my experiment to make **Qwen3.6 27B** **Q4\_K\_M** fit into my RTX 5060 Ti 16 GB. It was inspired by u/Due-Project-7507's work on [Ununnilium/Qwen3.6-27B-IQ4\_XS-pure-GGUF](https://huggingface.co/Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF). Using the same `pure` quantization method, I was able to create Q4\_K\_M GGUFs that fit completely in 16 GB VRAM.

Model URL: [https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF](https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF)

There are two versions: [Q4\_K\_M MTP (15.4 GB)](https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF?show_file_info=Qwen3.6-27B-MTP-Q4_K_M-pure.gguf) and [Q4\_K\_M non-MTP (15.1 GB)](https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF?show_file_info=Qwen3.6-27B-Q4_K_M-pure.gguf).

You can download the GGUF and run it with the latest llama.cpp version like this:

    llama-server -m Qwen3.6-27B-MTP-Q4_K_M-pure.gguf \
      -fitt 128 -c 65536 -fa on -np 1 \
      -ctk q5_0 -ctv q5_0 -ctxcp 18 \
      --no-mmap --mlock --no-warmup \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
      --presence-penalty 0.0 --repeat-penalty 1.0 \
      -ub 256 -b 1024 -ngl 99 \
      --spec-type draft-mtp --spec-draft-n-max 2

**TOKEN SPEED**

With the MTP version I got 40 tok/s for token generation (tg), but prompt processing (pp) is slower. The non-MTP version has faster pp, but its tg drops to 24 tok/s.
|Version|Prompt Processing|Token Generation|
|:-|:-|:-|
|MTP|195 tok/s|**40 tok/s**|
|Non MTP|715 tok/s|**24 tok/s**|

**MODEL SIZE**

https://preview.redd.it/74ehd6vyvr2h1.png?width=5845&format=png&auto=webp&s=a66ba493ea1eb7fb61c999a47670c093700b9a97

**MTP Version:**

|Model|Size|
|:-|:-|
|**huytd189/Qwen3.6-27B-pure-GGUF Q4\_K\_M MTP**|**15.4 GB**|
|froggeric/Qwen3.6-27B-MTP-GGUF Q4\_K\_M MTP|16.8 GB|
|unsloth/Qwen3.6-27B-MTP-GGUF Q4\_K\_M MTP|17.1 GB|

**Non MTP Version:**

|Model|Size|
|:-|:-|
|**huytd189/Qwen3.6-27B-pure-GGUF Q4\_K\_M**|**15.1 GB**|
|mradermacher/Qwen3.6-27B-GGUF Q4\_K\_M|16.5 GB|
|unsloth/Qwen3.6-27B-GGUF Q4\_K\_M|16.8 GB|
|bartowski/Qwen\_Qwen3.6-27B-GGUF Q4\_K\_M|18 GB|

**PERPLEXITY DIFFERENCE**

I don't currently have hardware that can run a KLD benchmark, so I'm only showing the PPL difference here, but it should be enough to show the trade-off between quality and size reduction.

https://preview.redd.it/lepgzq18wr2h1.png?width=4968&format=png&auto=webp&s=ece2b3f99f1406d0f46e3665e31b65a3b50fe7e7

|Variant|PPL|Delta|
|:-|:-|:-|
|**BF16 MTP**|**7.5992 +/- 0.02890**|**base**|
|This Q4\_K\_M MTP|7.7699 +/- 0.02972|\+0.1707|
|Unsloth's Q4\_K\_M MTP|7.6545 +/- 0.02913|\+0.0553|
|**BF16 non-MTP**|**7.5992 +/- 0.02890**|**base**|
|This Q4\_K\_M non-MTP|7.7043 +/- 0.02935|\+0.1051|
|Unsloth's Q4\_K\_M non-MTP|7.6532 +/- 0.02912|\+0.0540|
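
To make the pp/tg trade-off between the two versions concrete, here is a rough sketch that estimates request time from the rates in the Token Speed table. It assumes a simple additive model (prompt tokens / pp + output tokens / tg) with constant rates, and it ignores prompt caching and any slowdown at long context, so treat the result as a ballpark rather than a benchmark. The function names are made up for illustration.

```python
# Rates reported in the post's Token Speed table (tok/s, RTX 5060 Ti 16 GB).
RATES = {
    "MTP": {"pp": 195.0, "tg": 40.0},
    "Non MTP": {"pp": 715.0, "tg": 24.0},
}


def request_seconds(version: str, prompt_tokens: int, output_tokens: int) -> float:
    """Estimated wall time: prompt processing plus token generation.

    Assumption: rates stay constant regardless of context length.
    """
    rate = RATES[version]
    return prompt_tokens / rate["pp"] + output_tokens / rate["tg"]


def break_even_output_ratio() -> float:
    """Output/prompt token ratio at which both versions take the same time."""
    mtp, base = RATES["MTP"], RATES["Non MTP"]
    extra_prompt_cost = 1 / mtp["pp"] - 1 / base["pp"]  # seconds lost per prompt token
    saved_gen_cost = 1 / base["tg"] - 1 / mtp["tg"]     # seconds saved per output token
    return extra_prompt_cost / saved_gen_cost


def faster_version(prompt_tokens: int, output_tokens: int) -> str:
    """Name of the version with the lower estimated request time."""
    return min(RATES, key=lambda v: request_seconds(v, prompt_tokens, output_tokens))


if __name__ == "__main__":
    print(f"break-even output/prompt ratio: {break_even_output_ratio():.3f}")
    for prompt, output in [(8000, 500), (1000, 1000)]:
        print(prompt, output, "->", faster_version(prompt, output))
```

Under these assumptions the MTP version comes out ahead once the reply is longer than roughly 0.22 times the prompt, so chat-style use favors MTP, while long-prompt, short-answer workloads may favor the non-MTP file.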
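
The size and PPL tables can also be read side by side as relative numbers. The sketch below only re-does arithmetic on the values posted above (sizes in GB, PPL means without the +/- error terms). The dictionary layout and function names are assumptions for illustration, and PPL alone is a coarse quality signal, as noted above regarding KLD.

```python
# Values copied from the Model Size and Perplexity tables in this post.
SIZES_GB = {
    "MTP": {"this": 15.4, "froggeric": 16.8, "unsloth": 17.1},
    "non-MTP": {"this": 15.1, "mradermacher": 16.5, "unsloth": 16.8, "bartowski": 18.0},
}
PPL = {
    "MTP": {"bf16": 7.5992, "this": 7.7699, "unsloth": 7.6545},
    "non-MTP": {"bf16": 7.5992, "this": 7.7043, "unsloth": 7.6532},
}


def size_saving_pct(variant: str, other: str) -> float:
    """How much smaller (in %) the pure quant is than another Q4_K_M GGUF."""
    sizes = SIZES_GB[variant]
    return 100 * (sizes[other] - sizes["this"]) / sizes[other]


def ppl_increase_pct(variant: str, quant: str) -> float:
    """Relative PPL increase (in %) of a quant over the BF16 baseline."""
    ppl = PPL[variant]
    return 100 * (ppl[quant] - ppl["bf16"]) / ppl["bf16"]


if __name__ == "__main__":
    for variant in SIZES_GB:
        print(
            f"{variant}: {size_saving_pct(variant, 'unsloth'):.1f}% smaller than Unsloth, "
            f"PPL +{ppl_increase_pct(variant, 'this'):.2f}% vs BF16 "
            f"(Unsloth: +{ppl_increase_pct(variant, 'unsloth'):.2f}%)"
        )
```

Read this way, the pure quant gives up a PPL increase in the low single-digit percent range in exchange for a file about 10% smaller than Unsloth's, which is the difference between fitting in 16 GB and not.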
