Gemma4 26b a4b Apex quant is quite good

Reddit r/LocalLLaMA 05/23/26, 07:44 AM Models

gemma4 apex-quant quantization llama.cpp vulkan gpu-benchmark open-source

Summary

A user benchmarks mudler's APEX quant of the Gemma4 26B A4B model on an AMD RX 9060 XT 16 GB with llama.cpp Vulkan, reporting 38 tps at 90k context with no looping and no quality degradation. A previous, larger Unsloth quant looped at around 50k context in a similar test.

I tried mudler's APEX quant for Gemma4 26B A4B and it was amazing! I got 38 tps at 90,000 context with no looping and, surprisingly, no quality degradation. I used mudler/gemma-4-26B-A4B-it-APEX-GGUF / APEX-I-Compact (15 GB) on my RX 9060 XT 16 GB with llama.cpp Vulkan. For comparison, my previous quant, the Unsloth UD-Q5_K_XL of Gemma4 26B A4B (21.2 GB), looped in a similar long-context test at 50k context. I'm not claiming it's a universally better quant, but it's worth giving a go IMO.
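The post does not say how looping was judged, so the following is only a minimal Python sketch of one way to flag it automatically: check whether the end of a generated output is the same block of tokens repeated back to back. The function name `tail_loop_period` and its thresholds are hypothetical, and whitespace-split words stand in for real model tokens.

```python
def tail_loop_period(tokens, min_period=8, max_period=512, min_repeats=3):
    """Return the length of a block repeated back to back at the end of
    `tokens`, or None if no such block is found.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from the post or from llama.cpp.
    """
    n = len(tokens)
    for period in range(min_period, max_period + 1):
        # Not enough tokens left to hold `min_repeats` copies of the block.
        if period * min_repeats > n:
            break
        block = tokens[n - period:]
        # Compare the last block with each of the blocks just before it.
        if all(
            tokens[n - (k + 1) * period : n - k * period] == block
            for k in range(1, min_repeats)
        ):
            return period
    return None


# Crude usage example: split the output on whitespace instead of real tokens.
looping_output = "the quick brown fox " + "a b c d e f g h " * 3
print(tail_loop_period(looping_output.split()))
```

Only the tail is inspected, which keeps the check cheap even on very long outputs, at the cost of missing a loop that the model later escapes from.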


Similar Articles

Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it

Reddit r/LocalLLaMA

A user compares Qwen3.6 35B-A3B and Gemma 4 26B-A4B-IT running locally on a 16GB VRAM GPU via LM Studio, finding Qwen3.6 produces more detailed outputs while both run at comparable speeds. The post is an informal community comparison using quantized models.

Ran gemma 4 12b on my 3090 yesterday and I think the local model game just changed

Reddit r/artificial

A user reports running Google's Gemma 4 12B model locally on a single RTX 3090 via GGUF quantization, finding strong performance including real 256k context, multimodal capabilities, and function calling that outperforms larger 70B models for coding tasks.

Gemma 4 QAT benchmark results (AMD 7900 XTX): faster, less VRAM, no quality loss

Reddit r/LocalLLaMA

A user benchmarks Google's Gemma 4 QAT models on an AMD 7900 XTX, reporting up to 45% faster generation, 83% higher throughput, and significant VRAM savings (e.g., 5.7GB for the 12B QAT model) with no quality loss compared to standard weights.

RTX5090, gemma-4-31B-it-Q6_K.gguf. Context: before - 35k, after - 80k!

Reddit r/LocalLLaMA

A user running the quantized gemma-4-31B-it-Q6_K.gguf model on an RTX 5090 reports that the usable context length increased from 35k to 80k.

Gemma 4 26B-A4B GGUF Benchmarks

Reddit r/LocalLLaMA

Unsloth has released KL Divergence benchmarks for Gemma 4 26B-A4B GGUF quantizations, showing Unsloth GGUFs top 21 of 22 sizes on the Pareto frontier. They also introduced a new UD-IQ4_NL_XL quant fitting in 16GB VRAM and updated Q6_K and MLX quants for both Gemma 4 and Qwen3.6.
