New GGUF quantizations of Qwen3.6-27B optimized for 16GB VRAM NVIDIA GPUs, including an experimental Trellis variant, with perplexity benchmarks.
Continuing 16GB VRAM Optimizations: New Qwen3.6-27B GGUF Quants (Experimental Trellis/iq4_kt & MTP)

Hi everyone,

I'm continuing my optimization efforts for 16GB VRAM and Nvidia GPUs from this post: https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27biq4_ks_for_ik_llamacpp_especially_for/

As a result, I've just uploaded two new quantizations for ik_llama.cpp.

To the Qwen3.6-27B-i1-IQ4_KS-GGUF repository, I added a new quant: Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf. Theoretically, it features a more logical layout (I'm still learning as I go). It keeps the exact same size as the previous Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KSS.gguf model, but I tweaked it to boost logic at the expense of the model's general knowledge. This should help with coding tasks.

PPL Test Results:

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS-attn_qkv-IQ4_KS.gguf -f /mnt/Samsung4TB/models/pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 256

[1]6.6926,[2]7.0049,[3]7.2043,[4]7.3382,[5]7.4861,[6]7.3838,[7]7.4411,[8]7.4459,[9]7.4857,[10]7.5303,[11]7.5779,[12]7.4131,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4131 +/- 0.02774

The second model, Qwen3.6-27B-i1-IQ4_KS_KT-GGUF, is a total experiment. I was wondering where we could successfully leverage the highly efficient Trellis algorithm quantization (iq4_kt). Normally, this type of quantization completely wrecks the model's logic, so I only applied it to tensors with near-Gaussian distributions. The results turned out pretty interesting.
PPL Test Results:

./llama-perplexity -m Qwen3.6-27B.i1-IQ4_KS_KT-attn_qkv-IQ4_KS.gguf -f /mnt/Samsung4TB/models/pg19.txt -c 65536 --chunks 32 -ngl 99 -khad -vhad -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 256

[1]6.6915,[2]7.0030,[3]7.1945,[4]7.3323,[5]7.4815,[6]7.3783,[7]7.4367,[8]7.4409,[9]7.4804,[10]7.5251,[11]7.5728,[12]7.4091,

Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4091 +/- 0.02777

As you can see from the results, both models show very similar PPL (perplexity). Unfortunately, I don't have the means to run KLD tests right now, so if anyone has the setup for it, I'd be super grateful if you could test them out.

To keep up with recent trends, I also threw MTP (Multi-Token Prediction) into the mix, though there isn't much headroom left for context. I made two versions: i1_MTP denotes an iq4_ks quantization, while pure MTP is q8_0.
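
For anyone who wants to put a number on "very similar PPL", here is a rough Python sketch that parses the two "Final estimate" lines quoted above and compares the gap between them with the reported error bars. The helper names (`parse_final`, `compare`) are made up for illustration, and combining the two errors as if they were independent is a simplifying assumption: both runs used the same text, so the errors are correlated and this is only a sanity check, not a replacement for a KLD test.

```python
import math
import re

# Final lines copied from the two llama-perplexity runs in the post.
RUN_KS = "Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4131 +/- 0.02774"
RUN_KS_KT = "Final estimate: PPL over 12 chunks for n_ctx=65536 = 7.4091 +/- 0.02777"

# Pattern matched to the line format shown in the post's output.
FINAL_RE = re.compile(
    r"Final estimate: PPL over (\d+) chunks for n_ctx=(\d+) = ([\d.]+) \+/- ([\d.]+)"
)


def parse_final(line):
    """Return (ppl, err) from a 'Final estimate' line (hypothetical helper)."""
    match = FINAL_RE.search(line)
    if match is None:
        raise ValueError("not a 'Final estimate' line: " + line)
    return float(match.group(3)), float(match.group(4))


def compare(line_a, line_b):
    """Return (delta, combined_err, ratio) for two final-estimate lines."""
    ppl_a, err_a = parse_final(line_a)
    ppl_b, err_b = parse_final(line_b)
    delta = ppl_a - ppl_b
    # Assumption: errors combined in quadrature as if independent.
    # The runs share the same text, so treat this as a rough guide only.
    combined_err = math.hypot(err_a, err_b)
    return delta, combined_err, abs(delta) / combined_err


delta, combined_err, ratio = compare(RUN_KS, RUN_KS_KT)
print(f"PPL gap (IQ4_KS minus IQ4_KS_KT): {delta:.4f}")
print(f"combined error bar:               {combined_err:.4f}")
print(f"gap as a fraction of error bar:   {ratio:.2f}")
```

A ratio well below 1 means the gap is smaller than the combined error bar, which fits the post's reading that the two quants are about the same on PPL.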
NVIDIA releases Qwen3.6-35B-A3B-NVFP4, a quantized version of Alibaba's mixture-of-experts multimodal language model, optimized for deployment on NVIDIA GPUs using Model Optimizer.
The article compares llama.cpp backends for running Qwen 3.6 27B on an RTX 3090 24GB, finding ik_llama.cpp with IQ4_KS quantization yields the best performance (1261 tok/s prefill, 72.9 tok/s decode).
A quantized version of Qwen3.6 27B using a pure Q4_K_M method fits entirely in 16 GB VRAM, achieving up to 40 tok/s token generation speed with MTP, and significantly reducing model size compared to other GGUF variants.
Community GGUF release of Qwen’s 27B hybrid-architecture model with 262k context, multimodal inputs, tool calling and "Thinking Preservation" for agentic coding.
The author shares a high-performance local inference configuration for running Qwen3.6 35B A3B on limited hardware (8GB VRAM, 32GB RAM) using a modified llama.cpp with TurboQuant support, achieving ~37-51 tok/sec with ~190k context.