Some contrived tests comparing the accuracy of different Gemma and Qwen quantizations

Reddit r/LocalLLaMA 06/12/26, 02:02 AM News

quantization gemma qwen benchmark accuracy llm comparison

Summary

A user shares benchmark results comparing the accuracy of various quantized Gemma and Qwen models on arithmetic, presidential DOB, and attention tests, highlighting trade-offs between model size and quantization level.

I mostly ran these tests for myself, because the published KLD numbers are hard to interpret, and you cannot compare `9B-Q4` vs `4B-Q8`, for example. But I'm happy to share the results with anyone interested: ### Test 1 (Arithmetic) 1000 questions like > Print only one number as the answer to the following question. Print nothing else, please. Do not use commas or underscores. It is very important. 998604052310776342 + 249349834805792420 = ? ### Test 2 (Presidents) 46 questions like > What is the DOB of President Zachary Taylor? Use the New Style calendar. Give your answer as YYYY-MM-DD with no extra output. ### Test 3 (Attention) 100 questions like > In the following sequence of words, one word occurs twice. Print that word. Produce no other output. The word list: pick glad how told held did fill wing only sugar ... wing ... (1001 words in total) ### Accuracy Repo | File | Notes | Arithmetic | Presidents | Attention ---|------|--|--:|--:|--: unsloth | gemma-4-E2B-it-Q8_0.gguf | | 1.4% | 28.3% | 0.0% unsloth | gemma-4-E4B-it-Q8_0.gguf | | 0.1% | 65.2% | 3.0% unsloth | gemma-4-12b-it-Q4_K_S.gguf | | 31.0% | 67.4% | 35.0% unsloth | gemma-4-12b-it-Q4_K_S.gguf | temperature=1 | 28.9% unsloth | gemma-4-26B-A4B-it-UD-Q4_K_S.gguf | | 72.3% | 97.8% | 55.0% google | gemma-4-26B_q4_0-it.gguf | QAT | 51.0% | 82.6% | 43.0% unsloth | gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf | QAT | 51.1% | 89.1% | 39.0% unsloth | gemma-4-26B-A4B-it-Q8_0.gguf | | 73.0% | 97.8% | 52.0% unsloth | gemma-4-31B-it-UD-IQ2_XXS.gguf | | 9.4% | 10.9% | 21.0% unsloth | gemma-4-31B-it-Q4_K_S.gguf | | 83.8% | 93.5% | 87.0% unsloth | Qwen3.5-4B-Q4_0.gguf | | 30.7% | 60.9% | 29.0% unsloth | Qwen3.5-4B-Q4_K_S.gguf | | 54.1% | 82.6% | 31.0% unsloth | Qwen3.5-4B-Q8_0.gguf | | 57.8% | 73.9% | 45.0% hauhauCS | Qwen3.5-9B-...-Q4_K_M.gguf | "Aggressive" | 65.0% | 78.3% | 63.0% unsloth | Qwen3.6-27B-Q4_K_S.gguf | MTP | 95.5% | 100.0% | 93.0% hauhauCS | Qwen3.6-27B-...-Q4_K_P.gguf | "Aggressive" | tbd | 100.0% | 95.0% unsloth | Qwen3.6-35B-A3B-UD-Q4_K_S.gguf | | 87.4% | 100.0% | 71.0% unsloth | Qwen3.6-35B-A3B-UD-Q4_K_S.gguf | temperature=1 | 86.5% hauhauCS | Qwen3.6-35B-A3B-...-Q4_K_P.gguf | "Aggressive" | 89.8% | 100.0% | 56.0% unsloth | Qwen3.6-35B-A3B-Q8_0.gguf | | 85.3% | 100.0% | 77.0% (I'll edit the table if I run more models) ### Settings * `enable_thinking=false`, because `thinking` is built on top of next token prediction, and I'm just trying to evaluate this underlying process. * `temperature=0` (unless specified), because it's actually optimal here -- with no `thinking` and with no extraneous output allowed, there is only one correct completion. ### Methods `llama-server -m ... -c ...` ### Discussion * If you are reading this in the future, QAT may have been fixed. Give it a shot. ### FAQ * *"Why do you need an LLM to answer these questions?"* -- Because this is a test of LLMs.

Original Article

Some contrived tests comparing the accuracy of different Gemma and Qwen quantizations

Similar Articles

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared

Gemma 4 26B A4B IT QAT Comparison

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Qwen3.6-27B Quantization Benchmark

Gemma 4 26B-A4B GGUF Benchmarks

Submit Feedback

Similar Articles

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared

Gemma 4 26B A4B IT QAT Comparison

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Qwen3.6-27B Quantization Benchmark

Gemma 4 26B-A4B GGUF Benchmarks