A quantized version of Qwen3.6 27B using a pure Q4_K_M method fits entirely in 16 GB VRAM, achieving up to 40 tok/s token generation speed with MTP, and significantly reducing model size compared to other GGUF variants.
Hello everyone! I want to share the result of my experiment to make **Qwen3.6 27B** **Q4\_K\_M** fit into my RTX 5060 Ti 16 GB.

It was inspired by u/Due-Project-7507's work on [Ununnilium/Qwen3.6-27B-IQ4\_XS-pure-GGUF](https://huggingface.co/Ununnilium/Qwen3.6-27B-IQ4_XS-pure-GGUF). Using the same `pure` quantization method, I was able to create Q4\_K\_M GGUFs that fit completely in 16 GB VRAM.

Model URL: [https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF](https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF)

There are two versions: [Q4\_K\_M MTP (15.4 GB)](https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF?show_file_info=Qwen3.6-27B-MTP-Q4_K_M-pure.gguf) and [Q4\_K\_M non-MTP (15.1 GB)](https://huggingface.co/huytd189/Qwen3.6-27B-pure-GGUF?show_file_info=Qwen3.6-27B-Q4_K_M-pure.gguf).

You can download the GGUF and run it with the latest llama.cpp version this way:

    llama-server -m Qwen3.6-27B-MTP-Q4_K_M-pure.gguf \
      -fitt 128 -c 65536 -fa on -np 1 \
      -ctk q5_0 -ctv q5_0 -ctxcp 18 \
      --no-mmap --mlock --no-warmup \
      --chat-template-kwargs '{"preserve_thinking": true}' \
      --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
      --presence-penalty 0.0 --repeat-penalty 1.0 \
      -ub 256 -b 1024 -ngl 99 \
      --spec-type draft-mtp --spec-draft-n-max 2

**TOKEN SPEED**

With the MTP version, I got 40 tok/s for token generation (tg) but slower prompt processing (pp). The non-MTP version has much faster pp, but its tg drops to 24 tok/s.

|Version|Prompt Processing|Token Generation|
|:-|:-|:-|
|MTP|195 tok/s|**40 tok/s**|
|Non MTP|715 tok/s|**24 tok/s**|

**MODEL SIZE**

https://preview.redd.it/74ehd6vyvr2h1.png?width=5845&format=png&auto=webp&s=a66ba493ea1eb7fb61c999a47670c093700b9a97

**MTP Version:**

|Model|Size|
|:-|:-|
|**huytd/Qwen3.6-27B-pure-GGUF Q4\_K\_M MTP**|**15.4 GB**|
|froggeric/Qwen3.6-27B-MTP-GGUF Q4\_K\_M MTP|16.8 GB|
|unsloth/Qwen3.6-27B-MTP-GGUF Q4\_K\_M MTP|17.1 GB|

**Non MTP Version:**

|Model|Size|
|:-|:-|
|**huytd/Qwen3.6-27B-pure-GGUF Q4\_K\_M**|**15.1 GB**|
|mradermacher/Qwen3.6-27B-GGUF Q4\_K\_M|16.5 GB|
|unsloth/Qwen3.6-27B-GGUF Q4\_K\_M|16.8 GB|
|bartowski/Qwen\_Qwen3.6-27B-GGUF Q4\_K\_M|18 GB|

**PERPLEXITY DIFFERENCE**

Currently I don't have the hardware to run a KLD benchmark, so I'm only showing the PPL difference here. It should still give you a sense of the trade-off between quality and size reduction.

https://preview.redd.it/lepgzq18wr2h1.png?width=4968&format=png&auto=webp&s=ece2b3f99f1406d0f46e3665e31b65a3b50fe7e7

|Variant|PPL|Delta|
|:-|:-|:-|
|**BF16 MTP**|**7.5992 +/- 0.02890**|**base**|
|This Q4\_K\_M MTP|7.7699 +/- 0.02972|\+0.1707|
|Unsloth's Q4\_K\_M MTP|7.6545 +/- 0.02913|\+0.0553|
|**BF16 non-MTP**|**7.5992 +/- 0.02890**|**base**|
|This Q4\_K\_M non-MTP|7.7043 +/- 0.02935|\+0.1051|
|Unsloth's Q4\_K\_M non-MTP|7.6532 +/- 0.02912|\+0.0540|
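
To read the size and perplexity tables side by side, here is a rough Python sketch that recomputes the trade-off from the numbers posted above. The `tradeoff` helper and the `variants` dictionary are made up for this illustration and are not part of llama.cpp or any other tool. Using Unsloth's Q4\_K\_M as the size reference is also just an assumption for the comparison.

```python
# Figures copied from the size and PPL tables in the post.
# BF16 baseline perplexity (the post reports the same value for MTP and non-MTP).
BASE_PPL = 7.5992

# (size in GB, perplexity) for the pure quant and for Unsloth's Q4_K_M.
variants = {
    "MTP": {"pure": (15.4, 7.7699), "unsloth": (17.1, 7.6545)},
    "non-MTP": {"pure": (15.1, 7.7043), "unsloth": (16.8, 7.6532)},
}


def tradeoff(size_gb, ppl, ref_size_gb, base_ppl=BASE_PPL):
    """Hypothetical helper: size saved vs a reference file and PPL cost vs BF16.

    Returns (GB saved, % size saved, PPL delta, % PPL increase over base).
    """
    saved = ref_size_gb - size_gb
    saved_pct = 100 * saved / ref_size_gb
    delta = ppl - base_ppl
    delta_pct = 100 * delta / base_ppl
    return saved, saved_pct, delta, delta_pct


for name, v in variants.items():
    size, ppl = v["pure"]
    ref_size, _ref_ppl = v["unsloth"]
    saved, saved_pct, delta, delta_pct = tradeoff(size, ppl, ref_size)
    print(
        f"{name}: {saved:.1f} GB smaller than Unsloth ({saved_pct:.1f}%), "
        f"PPL +{delta:.4f} over BF16 ({delta_pct:.2f}%)"
    )
```

Keep in mind that each PPL value in the table carries an error of roughly +/- 0.03, so small differences between variants should be read with that margin in mind.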
Unsloth has released an optimized GGUF version of the Qwen3.6-27B MTP model, achieving significantly faster inference speeds (up to 114 tok/s on an RTX 5090) compared to previous quantizations.
ByteShape releases Qwen 3.6 35B GGUF quantizations in NTP and MTP variants with detailed benchmarking across multiple GPUs and CPUs, finding that larger quants often outperform smaller ones and MTP provides GPU speed boosts at the cost of memory.
Qwen 3.6 27B runs fast on 16 GB VRAM thanks to 'Pure Quant' technology, achieving 40 tokens/s with MTP and supporting 64k contexts, enabling local AI on consumer GPUs like RTX 4060 Ti.
Quantized 27B Qwen3.6 model achieves 200 tok/s peak (136 avg) with 256k context and 10 agents on a single 49W GB10 GPU using Dflash+DDTree optimizations.
A community researcher shares a custom quantization recipe for Qwen3.6-27B that produces a smaller 30GB Q8 GGUF by keeping high-outlier sublayers in BF16, achieving better KLD and top-p metrics than Unsloth's 33GB Q8_K_XL variant.