Gemma 4 26B A4B IT QAT Comparison

Reddit r/LocalLLaMA Models

Summary

A user benchmarks three quantized versions of Gemma 4 26B IT (4-bit, 6-bit, and 8-bit QAT) on MMLU_PRO and HumanEval, finding that the QAT 8-bit model performs worse than the 6-bit quant on HumanEval and is not clearly better than 4-bit, questioning the superiority of QAT for this model.

Hopefully this isn't too low effort of a post. I just finished the benchmarks and figured I'd post them online because they certainly were insightful for me. I did not use any AI other than asking Gemini 3.1 Pro whether the results were statistically significant, because I was too tired to do inferential statistics.

**Methodology:**

I used oMLX to run Gemma 4 26B A4B IT from mlx-community. I used the following models:

Gemma 26B 4 Bit: [https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-4bit](https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-4bit)

Gemma 26B 6 Bit: [https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-6bit](https://huggingface.co/mlx-community/gemma-4-26b-a4b-it-6bit)

Gemma 26B QAT 8 Bit: [https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-qat-8bit](https://huggingface.co/mlx-community/gemma-4-26B-A4B-it-qat-8bit)

I ran them on a MacBook M5 Pro 64GB with oMLX version 0.4.1, an unquantized KV cache, and thinking enabled. I ran the following tests on all models: 50 MMLU\_PRO questions and 100 HumanEval questions.

The only difference in the chat templates between the models above relates to multimodal tool calls, so it did not impact the results. Additionally, they were all quantized using the same method, so apart from bit width the only variable should be the original model weights.

I chose the 8 bit QAT to avoid confounding variables from any MLX-specific quantization damage. My goal was to test the QAT model in a form as close as possible to its original weights. This model should be virtually identical to the unsloth q4\_k\_xl quant of the QAT model. (I mean legitimately very close to identical, not "TQ4 is basically BF16 identical".)

I chose to compare it to an MLX 4 bit and 6 bit quant, as both bpw values are within the range where users have expressed uncertainty about replacing their old quant with a new QAT model.
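For a rough sense of how much noise these sample sizes carry, the sketch below computes the worst-case 95% margin of error for an accuracy measured on n questions. It uses the normal approximation and assumes the questions are independent draws. The function name is illustrative and not part of oMLX or any benchmark harness.

```python
from math import sqrt
from statistics import NormalDist


def worst_case_margin(n: int, confidence: float = 0.95) -> float:
    """Half-width of a normal-approximation interval at p = 0.5 (the widest case)."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(0.25 / n)


if __name__ == "__main__":
    for n in (50, 100):  # MMLU_PRO and HumanEval sample sizes from the post
        print(f"n={n}: +/- {100 * worst_case_margin(n):.1f} percentage points")
```

At 50 questions the worst-case margin is on the order of 14 percentage points, and at 100 questions about 10, so only fairly large gaps between models can be told apart from chance.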
**Results:**

|Model|Benchmark|Percentage (Correct/Total)|
|:-|:-|:-|
|Gemma 4 26B IT 4 Bit|MMLU\_PRO|56.0% (28/50)|
|Gemma 4 26B IT 4 Bit|HUMANEVAL|90.0% (90/100)|
|Gemma 4 26B IT 6 Bit|MMLU\_PRO|58.0% (29/50)|
|Gemma 4 26B IT 6 Bit|HUMANEVAL|98.0% (98/100)|
|Gemma 4 26B IT QAT 8 Bit|MMLU\_PRO|52.0% (26/50)|
|Gemma 4 26B IT QAT 8 Bit|HUMANEVAL|90.0% (90/100)|

**Interpretation:**

Both chi-squared tests and z tests were performed by Gemini.

>The only statistically convincing evidence of a difference across all these benchmarks is that the **QAT 8 Bit model performs worse than the 6 Bit model on HUMANEVAL**. The performance differences seen on MMLU\_PRO are not statistically significant and can be attributed to random chance due to the smaller sample size (50 questions).

Thus the conclusion that I have reached is that the QAT model is worse than a Q6 quant of the original model. This means that the claim that "QAT is indistinguishable from BF16" or "the distributions are very close" is likely wrong, as the full QAT model is unlikely to beat the tested 8 bit model, but the full non-QAT model is very likely to beat the q6 model, meaning a wider gap than I was able to produce is likely present. QAT was not clearly better or worse than a regular MLX q4 quant.

Now, for GGUF, QAT likely still smashes Q4\_0 out of the park and might even be competitive with IQ4\_XS, but it seems that the assumption that q4\_k, q5, and even q6 quants should be replaced with QAT quants is a bit early.

I might run more tests on the 26B, or even test out the 31B model later, as the sample sizes that I have are just enough to begin to get an idea. Creative writing may be different, but I mainly wanted to measure similarity with the original model, and worse benchmark performance is by definition indicative of dissimilarity. Also this is a MoE, and so maybe the QAT works better on the 31B.
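The interpretation above rests on comparing two pass rates. The post does not show Gemini's calculation, so the following is only a sketch of a standard pooled two-proportion z test applied to the counts in the results table. It treats the two runs as independent samples; a paired test such as McNemar's would need per-question results, which are not given. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both models have the same pass rate."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled pass rate under the null hypothesis of no difference.
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Counts taken from the results table: 6 Bit vs QAT 8 Bit.
    for name, a, b, n in (("HumanEval", 98, 90, 100), ("MMLU_PRO", 29, 26, 50)):
        z, p = two_proportion_z_test(a, n, b, n)
        print(f"{name}: z = {z:.2f}, p = {p:.3f}")
```

With these counts, the HumanEval comparison comes out around z ≈ 2.4 with a two-sided p of roughly 0.02, while the MMLU\_PRO comparison is far from significant. Since the 4 Bit model also scored 90/100 on HumanEval, passing its counts against the 6 Bit model gives the same statistic.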
TL;DR: Gemma 4 QAT unquantized is likely inferior to Gemma 4 unquantized, so it might not make sense to replace 5, 6, or even dynamic 4 bit quants with Gemma 4 26B QAT. These observations may not generalize to the 31B, 12B, or E2/4B.