Qwen3.6-35B-A3B tool calling benchmark: ByteShape vs. Unsloth GGUFs, KV cache quants & long context performance

Reddit r/LocalLLaMA Tools

Summary

A detailed benchmark comparing ByteShape and Unsloth quantizations of Qwen3.6-35B-A3B on tool calling performance, KV cache quantization effects, and long context degradation using llama.cpp and tool-eval-bench.

I've previously posted some small performance benchmarks, but this time I got interested in the qualitative side. u/Substantial_Step_351 posted a few days ago about [why models are not benchmarked on tool calling](https://www.reddit.com/r/LocalLLaMA/comments/1tvb7lq/why_do_we_benchmark_quants_on_perplexity_and/), and u/complexminded pointed out the [tool-eval-bench](https://github.com/SeraphimSerapis/tool-eval-bench) utility by SeraphimSerapis in a comment. This got me interested in benchmarking a few questions that I've wondered about and don't recall seeing good answers to:

1. Are the ByteShape quants of Qwen3.6-35B-A3B as good as they claim in their [blog post](https://byteshape.com/blogs/Qwen3.6-35B-A3B/)? Their benchmark shows that their \~4bpw quants retain >99% of the benchmark scores of unquantized models, matching or exceeding other quants such as Unsloth, AesSedai and bartowski, while being faster and usually smaller.
2. How does KV cache quantization affect real world performance? Is q8\_0 free lunch? How much worse is q4\_0?
3. Does the picture change if we look at long context settings instead of short prompts?

**TL;DR**: No clear winner in ByteShape vs. Unsloth; q8\_0 is free lunch, but q4\_0 is worse; long context significantly degrades tool calling performance across all scenarios.

# Materials

I had temporary access to a mostly idle cluster of V100 GPUs with 32GB VRAM each, so I set out to do some experiments using llama.cpp and tool-eval-bench.

First, I chose the following Qwen3.6-35B-A3B quants to compare, including both IQ and Q type quants:

1. ByteShape IQ3\_S-3.48bpw a.k.a. GPU-3 (15.1 GB), the one ByteShape recommends for 16GB VRAM (it just barely fits)
2. ByteShape IQ4\_XS-4.15bpw a.k.a. GPU-5 (18.0 GB), the one ByteShape recommends for 24GB VRAM
3. ByteShape Q4\_K\_S-4.22bpw a.k.a. CPU-5 (18.3 GB), the one I use on my 6GB VRAM laptop, partially on CPU
4. Unsloth UD-IQ3\_XXS (13.2 GB), very compact IQ quant, fits into 16GB VRAM, punches above its weight in some benchmarks
5. Unsloth UD-Q3\_K\_XL (16.8 GB), a Q quant similar in size to ByteShape CPU-5
6. Unsloth UD-IQ4\_XS (17.7 GB), an IQ quant similar in size to ByteShape GPU-5
7. Unsloth UD-Q4\_K\_M (22.1 GB), the default quant size for many
8. Unsloth UD-Q6\_K (29.3 GB), the largest I could fit into 32GB VRAM

I decided not to test quants from others because I'm mostly interested in ByteShape vs. the rest and Unsloth seems to be a common choice trusted by many.

To measure the effect of KV cache quantization, I decided on three configurations to test: default f16, q8\_0/q8\_0 and q4\_0/q4\_0. To limit the number of runs, I decided not to test asymmetric KV cache quants this time.

To measure performance on long vs. short context, I used the `--context-pressure` parameter of tool-eval-bench (later abbreviated cp), setting it to either 0.0 or 0.5. A value of 0.0 means short context (a system prompt of approximately 5k tokens containing tool call definitions), while 0.5 means that the prompt will include an additional 122k tokens of text that could confuse the model. This simulates how the model behaves when the context window is already 50% filled with conversation and tool call history.

I repeated each benchmark run three times using different random seeds. This gave a total of (8 GGUFs) x (3 KV quants) x (2 context lengths) x (3 repetitions) = 144 runs. The short context runs took only about 15 minutes each, but the long context runs took around 4 hours each. Total time spent was thus around 300 GPU-hours, including some experimental and failed runs.

# Software setup

To run the models, I used llama.cpp version 9529 (96fbe0039) built with CUDA support. For the tool use benchmarks, I used tool-eval-bench 2.0.4.
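
As a rough sketch (not the actual benchmark script), the run matrix described under Materials could be enumerated in Python like this. The seed values are placeholders, since the post does not list the seeds used, and the per-run durations are the approximate figures quoted above (about 15 minutes for short context, about 4 hours for long context).

```python
from itertools import product

# Quant labels as listed in the post; used here only as identifiers.
GGUFS = [
    "ByteShape GPU-3", "ByteShape GPU-5", "ByteShape CPU-5",
    "Unsloth UD-IQ3_XXS", "Unsloth UD-Q3_K_XL", "Unsloth UD-IQ4_XS",
    "Unsloth UD-Q4_K_M", "Unsloth UD-Q6_K",
]
KV_QUANTS = ["f16", "q8_0", "q4_0"]   # same type used for K and V
CONTEXT_PRESSURES = [0.0, 0.5]        # short vs. long context
SEEDS = [1, 2, 3]                     # hypothetical; actual seeds are not given

# Approximate duration of one run in hours, keyed by context pressure.
HOURS_PER_RUN = {0.0: 0.25, 0.5: 4.0}


def run_matrix():
    """Return one dict per benchmark run (full Cartesian product)."""
    return [
        {"gguf": g, "kv_quant": kv, "context_pressure": cp, "seed": s}
        for g, kv, cp, s in product(GGUFS, KV_QUANTS, CONTEXT_PRESSURES, SEEDS)
    ]


def estimated_gpu_hours(runs):
    """Sum the approximate per-run durations over all runs."""
    return sum(HOURS_PER_RUN[r["context_pressure"]] for r in runs)


runs = run_matrix()
print(len(runs))                  # 144
print(estimated_gpu_hours(runs))  # 306.0
```

The estimate of 306 hours comes from 72 short runs at 0.25 h plus 72 long runs at 4 h, which is in line with the "around 300 GPU-hours" figure; it does not model the experimental and failed runs.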

llama.cpp parameters:

`-m $GGUF --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 -ngl 99 --ubatch-size 2048 --fit-target 256 -ctk $KV_QUANT -ctv $KV_QUANT --port $PORT`

tool-eval-bench parameters:

`--base-url $BASE_URL --hardmode --weight-by-difficulty --backend llamacpp --context-size 262144 --context-pressure $CONTEXT_PRESSURE --seed $SEED`

I did not spend much time optimizing or even measuring the PP/TG speeds, as I was only interested in the quality of output, not raw performance. I did not enable MTP or other speculative decoding for the same reason. The bottleneck in the very slow long context runs was mainly PP speed, so I did increase `--ubatch-size` to 2048, which seemed to help a bit.

# Scoring metric

The metric I looked at is what tool-eval-bench reports as "total points". With `--hardmode` enabled, this version of tool-eval-bench performs 84 separate tests. Each test gives 2 points for a successful tool use, 1 point for a partially correct tool use and 0 points for a failure. The theoretical maximum in this case is 84 \* 2 = 168 points. tool-eval-bench also returns an overall score, but this is just a rounded percentage of total points and the rounding loses some precision, so I opted for the raw total points instead. I couldn't figure out what the `--weight-by-difficulty` option is doing; it didn't seem to have any effect on scores.

# Results by GGUF

Here is an overview of the models, their sizes, overall scores as well as scores broken down by KV cache quant and separately by short vs. long context. See also the scatterplot diagram.

|model\_name|model\_size (GB)|avg\_overall|avg\_kv\_f16|avg\_kv\_q8\_0|avg\_kv\_q4\_0|avg\_cp\_0.0|avg\_cp\_0.5|
|:-|:-|:-|:-|:-|:-|:-|:-|
|Unsloth UD-IQ3\_XXS|13.2|143.6|142.2|143.2|145.5|**150.7**|136.6|
|ByteShape GPU-3|15.1|144.5|147.0|144.5|142.0|149.7|139.3|
|Unsloth UD-Q3\_K\_XL|16.8|143.8|145.0|143.7|142.8|147.3|140.3|
|Unsloth UD-IQ4\_XS|17.7|144.8|143.0|146.8|144.5|149.7|139.9|
|ByteShape GPU-5|18.0|**146.8**|**147.8**|**147.3**|145.3|149.0|**144.7**|
|ByteShape CPU-5|18.3|142.2|143.0|141.5|142.0|145.4|138.9|
|Unsloth UD-Q4\_K\_M|22.1|144.4|143.0|143.7|**146.5**|148.3|140.4|
|Unsloth UD-Q6\_K|29.3|145.2|147.7|146.7|141.2|**150.7**|139.7|

The overall best model is ByteShape GPU-5, which beats much larger models including Unsloth UD-Q4\_K\_M and UD-Q6\_K when looking at average scores. It stands out especially for the good performance on long context tasks. ByteShape CPU-5 is the worst performer. Model size appears to only weakly correlate with benchmark scores; this could also indicate a noisy benchmark metric.

# Results by KV cache quant

Here is a breakdown of the benchmark scores grouped by the KV cache quant used. First the overall score, then conditional scores by short vs. long context. See also the bar graph diagram.

|kv\_quant|avg\_overall|avg\_cp\_0.0|avg\_cp\_0.5|
|:-|:-|:-|:-|
|f16|**144.8**|**149.2**|**140.5**|
|q8\_0|144.7|**149.2**|140.1|
|q4\_0|143.7|148.1|139.3|

The f16 and q8\_0 KV cache quants are practically tied; their benchmark scores are so close that they are likely within the margin of error. However, f16 may have a slight advantage in the long context (cp=0.5) case. The q4\_0 quant is behind the others by approximately 1 point.

# Findings

* It is not clear whether ByteShape or Unsloth quants are better. ByteShape had both the best (GPU-5) and worst (CPU-5) performing quants.
* f16 and q8\_0 KV cache quants are practically tied, so q8\_0 could be seen as free lunch. Using q4\_0 has a surprisingly small effect, but it is there.
* Long context hurts performance considerably, with an average gap of roughly 9 points between the cp=0.0 and cp=0.5 cases. The ByteShape GPU-5 quant was more resilient than the others under long context pressure.

# Caveats

This benchmark relies entirely on the tool-eval-bench tasks and how the results are graded. It may or may not be representative of real tool use performance. To me it seems that the author of tool-eval-bench has done a great job in coming up with realistic looking tool call tasks, including some really hard ones enabled using `--hardmode`.

For the long context runs, I relied on the `--context-pressure` setting in tool-eval-bench, which (in my limited understanding) populates the context with realistic looking conversation and tool call history that could confuse the model.

There was substantial variation and noise in the benchmark scores, including some surprising results where the smallest quants (both in GGUF files and KV cache) occasionally beat the largest ones, and similar anomalies. Each individual measurement should be taken with a grain of salt; however, I think that the aggregate scores are still at least somewhat meaningful. I did my best to collect good benchmark numbers, but this benchmark is inherently very noisy and I only have limited resources for repeating benchmark runs.

Note: No AI was used for writing this post, it's all organic, though I did use some AI assistance (the same Qwen3.6-35B-A3B!) in writing the benchmark scripts as well as for analyzing and plotting the results.