**TL;DR:** A quantization recipe for Qwen3.6 27B that makes the model use significantly fewer thinking tokens while still producing correct answers, leading to faster inference on math benchmarks.
Ok, hear me out. This all started when I was trying to understand why this [Qwen3.6 27B INT8 AutoRound](https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/tree/main) recipe was performing so much better than any other Qwen3.6 27B quant I tried. On some personal Rust + Bevy benchmarks, it was consistently outputting better code and games. I then noticed the model did a LOT less thinking.

The INT8 model is great, but vLLM's VRAM usage is higher, and since llama.cpp has MTP support (in a PR), I figured I'd try to make this quant myself and add MTP too. What's interesting is that both the INT8 AutoRound and my GGUF quant seem to reach the answer sooner than UD Q8 K XL. I chose to keep the same layers in BF16 as Minachist did.

For my formal testing, I used AIME math problems plus custom math problems that Opus 4.7 created for me. The new quant is about the same size, just slightly bigger than UD Q8 K XL, but the difference is surprisingly noticeable. Running these same tests in BF16 should reveal whether this behavior is truly preferred. It may also just be that thinking more is actually better, but my experience tells me the opposite.

Nonetheless, here are some results. My tests were against these quants (note these include MTP layers, so they are slightly bigger):

* Q8_0 (28595762432 bytes)
  * Size on disk: 29047084160 bytes (28.3 GiB)
* UD Q8 K XL
  * Size on disk: 35776484480 bytes (34.9 GiB)
* This quant, which I tried to copy layer for layer from the INT8 AutoRound recipe
  * Size on disk: 37144875200 bytes (36.2 GiB)

So is it really surprising that the bigger model performed better? No. What's very interesting, though, is that it thinks drastically less. The KV cache space you lose by running a bigger quant is regained by spending ~20% fewer tokens while thinking.

Here are some runs I did. Note that all used the same seed and sampling parameters.
Multiple runs (3) resulted in the same outputs. KV cache at bf16/bf16.

`--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --seed 1337`

**Question 1 (Math, AIME style):** The roots of \(x^3-7x^2+14x-8=0\) are \(a,b,c\). If \(\frac1{a^2+1}+\frac1{b^2+1}+\frac1{c^2+1}=\frac{m}{n}\) in lowest terms, find \(m+n\).

llama.cpp

* Q8_0
  * 16,234 tokens in 3 min 48 sec at 70.90 t/s (remember this is MTP with 2 tokens)
* UD Q8 K XL
  * 16,001 tokens in 4 min 00 sec at 66.24 t/s
* Custom Q8
  * 9,671 tokens in 2 min 39 sec at 60.60 t/s (~40% less thinking)

vLLM

* Minachist INT8 AutoRound
  * 10,200 tokens in 2 min 38 sec at 34.2 t/s (I didn't use MTP here)

**Question 2 (Math, AIME style):** How many ordered pairs of positive integers \((x,y)\) satisfy \(x^2-y^2=2026\)?

llama.cpp

* Q8_0
  * 7,598 tokens in 1 min 44 sec at 72.76 t/s
  * Strangely, Q8_0 even did better here
* Custom Q8
  * 5,666 tokens in 1 min 33 sec at 60.49 t/s (~59% less thinking)
* UD Q8 K XL
  * 13,596 tokens in 3 min 29 sec at 65.02 t/s

vLLM

* Minachist INT8 AutoRound
  * 8,931 tokens at 34.4 t/s (I didn't use MTP here)

There are a few more math tests I ran, but you get the gist: the quant is thinking a lot less.

For anyone who wants to reproduce: I downloaded the HF safetensors and converted them to a single GGUF, then used llama.cpp to quantize it down.
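As a quick sanity check on the percentages above, here's the arithmetic using the token counts and throughputs exactly as reported (nothing re-measured; wall-clock time is approximated as tokens / t/s, so it ignores prompt processing):

```python
# Token counts and t/s copied from the runs above; reduction is measured
# against UD Q8 K XL.
runs = {
    "Q1": {"custom": (9_671, 60.60), "ud_q8_k_xl": (16_001, 66.24)},
    "Q2": {"custom": (5_666, 60.49), "ud_q8_k_xl": (13_596, 65.02)},
}

for q, r in runs.items():
    c_tok, c_tps = r["custom"]
    u_tok, u_tps = r["ud_q8_k_xl"]
    reduction = 1 - c_tok / u_tok
    # Fewer tokens can beat higher raw t/s on end-to-end time.
    print(f"{q}: {reduction:.1%} fewer tokens; "
          f"~{c_tok / c_tps:.0f}s vs ~{u_tok / u_tps:.0f}s end to end")
# Q1: 39.6% fewer tokens; ~160s vs ~242s end to end
# Q2: 58.3% fewer tokens; ~94s vs ~209s end to end
```

So despite the lower raw t/s, the custom quant finishes well ahead on wall clock because of the shorter reasoning traces (the Q2 reduction lands at ~58% with these inputs, close to the ~59% quoted).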
This is the minimum quant required to try it:

```shell
# Convert safetensors to GGUF
/home/user/llm/llama.cpp/convert_hf_to_gguf.py /home/user/llm/models/Qwen3.6-27B/BF16 \
  --outfile /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-BF16.gguf

# Quantize while keeping selected layers in BF16
/home/user/llm/llama.cpp/build/bin/llama-quantize \
  --tensor-type token_embd=bf16 \
  --tensor-type output=bf16 \
  --tensor-type output_norm=bf16 \
  --tensor-type post_attention_norm=bf16 \
  --tensor-type attn_q_norm=bf16 \
  --tensor-type attn_k_norm=bf16 \
  --tensor-type attn_qkv=bf16 \
  --tensor-type attn_gate=bf16 \
  --tensor-type ssm_a=bf16 \
  --tensor-type ssm_alpha=bf16 \
  --tensor-type ssm_beta=bf16 \
  --tensor-type ssm_conv1d=bf16 \
  --tensor-type ssm_dt.bias=bf16 \
  --tensor-type ssm_norm=bf16 \
  --tensor-type ssm_out=bf16 \
  /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-BF16.gguf \
  /home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-BIGBOY.gguf \
  q8_0
```

Additionally keeping the following layers in BF16 does NOT improve anything for me (leaving them out saves about 1 GB, I think):

```shell
  --tensor-type attn_norm=bf16 \
  --tensor-type attn_output=bf16 \
  --tensor-type attn_q=bf16 \
  --tensor-type attn_k=bf16 \
  --tensor-type attn_v=bf16 \
```

Ideas why it might be good:

* Instead of F16, we're using BF16
* It's literally bigger, so more layers are left in their native format
* The layers we left at BF16 are important

Some limitations:

* I ran the tests only 3 times per model per question
* I should probably re-run the tests with another seed
* I didn't run benchmark suites. That would be helpful, but we also need to be mindful that Qwen is benchmaxed, as shown by Contamination Detection via Context (CoDeC) benchmarks

Next steps:

* Re-run the tests with another seed
* Rent a RunPod instance to run BF16 with the same seed and sampling parameters
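If you want to preview which tensors a recipe like this would keep in BF16, here's a rough sketch. Note the assumption: it treats each `--tensor-type` pattern as a plain substring match on tensor names, which only approximates llama-quantize's actual pattern handling, and the example tensor names are hypothetical.

```python
# Tensor-name patterns kept at BF16 in the recipe above. llama-quantize's
# exact matching rules may differ; this sketch assumes a pattern matches
# any tensor whose name contains it.
BF16_PATTERNS = [
    "token_embd", "output", "output_norm", "post_attention_norm",
    "attn_q_norm", "attn_k_norm", "attn_qkv", "attn_gate",
    "ssm_a", "ssm_alpha", "ssm_beta", "ssm_conv1d",
    "ssm_dt.bias", "ssm_norm", "ssm_out",
]

def kept_bf16(tensor_name: str) -> bool:
    """True if this tensor name would be overridden to BF16."""
    return any(p in tensor_name for p in BF16_PATTERNS)

# Hypothetical tensor names, just to illustrate the split:
for name in ["blk.0.attn_q_norm.weight", "blk.0.ffn_down.weight",
             "token_embd.weight", "blk.3.ssm_conv1d.weight"]:
    print(name, "-> bf16" if kept_bf16(name) else "-> q8_0")
# blk.0.attn_q_norm.weight -> bf16
# blk.0.ffn_down.weight -> q8_0
# token_embd.weight -> bf16
# blk.3.ssm_conv1d.weight -> bf16
```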