Gemma4 26b a4b Apex quant is quite good

Reddit r/LocalLLaMA Models

Summary

A user tests the APEX quantized version of the Gemma 4 26B A4B model on an AMD RX 9060 XT, reaching 38 tps at 90k context with no looping or quality degradation, whereas their previous Unsloth UD-Q5_K_XL quant looped at 50k context.

I tried mudler's APEX quant of Gemma 4 26B A4B and it was amazing! I got 38 tps at 90,000 context with no looping and, surprisingly, no quality degradation. I used mudler/gemma-4-26B-A4B-it-APEX-GGUF / APEX-I-Compact (15 GB) on my RX 9060 XT 16 GB with llama.cpp (Vulkan). For comparison, my previous quant, Unsloth's UD-Q5_K_XL of Gemma 4 26B A4B (21.2 GB), looped in a similar long-context test at 50k context. I'm not claiming it's a universally better quant, but it is worth giving a go, IMO.
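
If you want to check for looping in your own long-context runs rather than eyeballing the output, here is a rough sketch. It is not how the test above was done; the function name `find_loop`, its thresholds, and the idea of splitting the output on whitespace are all assumptions made for illustration. It only flags the simplest case, where the end of the output is the same block repeated back to back.

```python
def find_loop(tokens, min_len=8, min_repeats=3):
    """Return the repeated block if the tail of `tokens` consists of one block
    of at least `min_len` items repeated `min_repeats` times back to back.
    Return None otherwise. (Hypothetical helper, thresholds are arbitrary.)"""
    n = len(tokens)
    # A block longer than this cannot fit `min_repeats` times in the sequence.
    max_len = n // min_repeats
    for size in range(min_len, max_len + 1):
        block = tokens[n - size:]
        # Compare the last `min_repeats` windows of this size against the block.
        if all(
            tokens[n - (k + 1) * size : n - k * size] == block
            for k in range(min_repeats)
        ):
            return block
    return None


def looks_looped(text, min_len=8, min_repeats=3):
    """Convenience wrapper: treat whitespace-separated words as 'tokens'."""
    return find_loop(text.split(), min_len, min_repeats) is not None
```

Running `looks_looped(output_text)` on the saved output of a long generation gives a quick yes/no. Exact-repeat matching will miss loops that vary slightly each time, so a clean result is a hint, not proof.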
