UPDATE: Qwen-27B-IQ4_KS and Qwen-27B-IQ4_KS_KT for ik_llama.cpp, especially for NVIDIA with 16GB VRAM

Reddit r/LocalLLaMA 06/23/26, 06:04 PM Models

quantization gguf llama.cpp qwen nvidia 16gb-vram open-source

Summary

New GGUF quantizations of Qwen3.6-27B optimized for 16GB VRAM NVIDIA GPUs, including an experimental Trellis variant, with perplexity benchmarks.

Continuing 16GB VRAM Optimizations: New Qwen3.6-27B GGUF Quants (Experimental Trellis/iq4_kt & MTP)

Hi everyone,

I'm continuing my optimization efforts for 16GB VRAM and Nvidia GPUs from this post: https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27biq4_ks_for_ik_llamacpp_especially_for/

As a result, I've just uploaded two new quantizations for ik_llama.cpp.

To the Qwen3.6-27B-i1-IQ4_KS-GGUF repository, I added a new quant: Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf. Theoretically, it features a more logical layout (I'm still learning as I go). It keeps the exact same size as the previous Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf model, but I tweaked it to boost logic at the expense of the model's general knowledge. This should help with coding tasks.

PPL Test Results:

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf -f /mnt/Samsung4TB/models/pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 256

[1]6.6926,[2]7.0049,[3]7.2043,[4]7.3382,[5]7.4861,[6]7.3838,[7]7.4411,[8]7.4459,[9]7.4857,[10]7.5303,[11]7.5779,[12]7.4131,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4131 +/- 0.02774

The second model, Qwen3.6-27B-i1-IQ4_KS_KT-GGUF, is a total experiment. I was wondering where we could successfully leverage the highly efficient Trellis algorithm quantization (iq4_kt). Normally, this type of quantization completely wrecks the model's logic, so I only applied it to tensors with near-Gaussian distributions. The results turned out pretty interesting.
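The post does not say how "near-Gaussian" was judged for each tensor. One common proxy is excess kurtosis, which is close to 0 for Gaussian data and grows with heavier tails. The sketch below is only an illustration of that idea on synthetic data: the metric, the 0.5 cutoff, and the function names are assumptions, not the author's method.

```python
import random


def excess_kurtosis(values):
    """Sample excess kurtosis: about 0 for a Gaussian, larger for heavy tails."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    var = sum(c * c for c in centered) / n
    fourth = sum(c ** 4 for c in centered) / n
    return fourth / (var * var) - 3.0


def looks_near_gaussian(values, tol=0.5):
    # tol is a hypothetical cutoff chosen for illustration, not a value from the post.
    return abs(excess_kurtosis(values)) < tol


rng = random.Random(0)

# Synthetic stand-ins for flattened tensor weights (not real model data).
gaussian_like = [rng.gauss(0.0, 0.02) for _ in range(20000)]

# The difference of two exponential draws is Laplace-distributed,
# which has noticeably heavier tails than a Gaussian.
heavy_tailed = [
    0.02 * (rng.expovariate(1.0) - rng.expovariate(1.0)) for _ in range(20000)
]

print("gaussian_like:", looks_near_gaussian(gaussian_like))
print("heavy_tailed:", looks_near_gaussian(heavy_tailed))
```

With real weights, the same check would be run on each tensor's flattened values, and only tensors passing the check would be candidates for iq4_kt under this assumed criterion.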
PPL Test Results:

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf -f /mnt/Samsung4TB/models/pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 256

[1]6.6915,[2]7.0030,[3]7.1945,[4]7.3323,[5]7.4815,[6]7.3783,[7]7.4367,[8]7.4409,[9]7.4804,[10]7.5251,[11]7.5728,[12]7.4091,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4091 +/- 0.02777

As you can see from the results, both models show very similar PPL (perplexity). Unfortunately, I don't have the means to run KLD tests right now, so if anyone has the setup for it, I'd be super grateful if you could test them out.

To keep up with recent trends, I also threw MTP (Multi-Token Prediction) into the mix, though there isn't much headroom left for context. I made two versions: i1_MTP denotes an iq4_ks quantization, while pure MTP is q8_0.
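To put "very similar PPL" into numbers, the two "Final estimate" lines quoted above can be parsed and compared. This is a rough sketch: the helper names are hypothetical, and combining the two error bars as if they were independent is a simplifying assumption. Both runs used the same text, so the real uncertainty on the difference is likely smaller than this estimate.

```python
import math
import re

# Matches the tail of a llama-perplexity "Final estimate" line as quoted in the post.
FINAL_RE = re.compile(r"n_ctx=(\d+)\s*=\s*([0-9.]+)\s*\+/-\s*([0-9.]+)")


def parse_final_estimate(line):
    """Return (ppl, stderr) from a 'Final estimate' line."""
    match = FINAL_RE.search(line)
    if match is None:
        raise ValueError("no final PPL estimate found in line")
    return float(match.group(2)), float(match.group(3))


def compare(a, b):
    """Return (difference, combined error, ratio), treating errors as independent."""
    (ppl_a, err_a), (ppl_b, err_b) = a, b
    diff = ppl_a - ppl_b
    combined = math.hypot(err_a, err_b)
    return diff, combined, abs(diff) / combined


# Lines copied from the two runs in the post.
ks_line = "Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4131 +/- 0.02774"
kt_line = "Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4091 +/- 0.02777"

ks = parse_final_estimate(ks_line)
kt = parse_final_estimate(kt_line)
diff, combined, ratio = compare(ks, kt)

print(f"IQ4_KS minus IQ4_KS_KT: {diff:.4f}")
print(f"combined error (independence assumed): {combined:.4f}")
print(f"difference / combined error: {ratio:.2f}")
```

The gap between the two quants is about 0.004 PPL against a combined error of roughly 0.039, so under this simple check the perplexity numbers alone do not separate the two models. That is consistent with the request for KLD tests.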

Similar Articles

nvidia/Qwen3.6-35B-A3B-NVFP4 · Hugging Face

Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM

Qwen3.6-27B-GGUF is here!

Running Qwen3.6 35b a3b on 8gb vram and 32gb ram ~190k context
