@leopardracer: SAME GPU SAME MODEL SAME CONTEXT AND 2X THE SPEED rtx 4060, gemma 4 12b, 48k context just switched the quantization fro…


Summary

According to the post, switching the quantization from q4_k_m to q4_k_xl in llama.cpp raised generation speed from 15 to 32 tokens per second (roughly 2.1x) for Gemma 4 12B at 48k context on an RTX 4060, with no hardware or driver changes.
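As a quick sanity check on the "2x" figure, the arithmetic can be sketched in a few lines of Python. The two throughput numbers are the ones quoted in the post; the helper names `speedup` and `ms_per_token` are made up for this illustration.

```python
def speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of the new throughput to the old one."""
    return after_tps / before_tps


def ms_per_token(tps: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    return 1000.0 / tps


# Figures quoted in the post: 15 tok/s before, 32 tok/s after.
before, after = 15.0, 32.0

print(f"speedup: {speedup(before, after):.2f}x")
print(f"per-token latency: {ms_per_token(before):.2f} ms -> {ms_per_token(after):.2f} ms")
```

This only restates the post's own numbers. It says nothing about why the second quantization ran faster on that machine.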

Cached at: 06/09/26, 01:36 AM

SAME GPU SAME MODEL SAME CONTEXT AND 2X THE SPEED

rtx 4060, gemma 4 12b, 48k context

just switched the quantization from q4_k_m to q4_k_xl and went from 15 tokens per second to 32

no new hardware, no new drivers, no new model, just one parameter changed in llama.cpp

most people running local llms are leaving half their gpu’s speed on the table and don’t even know it

the full breakdown of tools and configs is in the article below ↓
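For anyone who wants to check a claim like this on their own hardware, here is a minimal sketch that builds matching `llama-bench` command lines for two GGUF files. It assumes the quantization is a property of the model file, so "changing one parameter" in practice means pointing llama.cpp at a different file. The file paths, the token counts, and the helper name `bench_command` are placeholders, not settings taken from the post.

```python
import shlex


def bench_command(model_path: str,
                  prompt_tokens: int = 512,
                  gen_tokens: int = 128,
                  gpu_layers: int = 99) -> list[str]:
    """Build a llama-bench invocation for one model file.

    -m   path to the GGUF file
    -p   number of prompt tokens to process
    -n   number of tokens to generate
    -ngl number of layers to offload to the GPU
    """
    return [
        "llama-bench",
        "-m", model_path,
        "-p", str(prompt_tokens),
        "-n", str(gen_tokens),
        "-ngl", str(gpu_layers),
    ]


# Hypothetical file names; substitute the GGUF files you actually downloaded.
MODELS = {
    "q4_k_m": "models/model-Q4_K_M.gguf",
    "q4_k_xl": "models/model-Q4_K_XL.gguf",
}

for label, path in MODELS.items():
    # Print the commands rather than running them, so the sketch works
    # on a machine without llama.cpp installed.
    print(f"{label}: {shlex.join(bench_command(path))}")
```

Keeping every flag identical across the two runs means only the model file differs. The default token counts here are far smaller than the 48k context mentioned in the post, so results from this sketch would not be directly comparable to the quoted 15 and 32 tokens per second.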

Similar Articles

Gemma 4 26B Hits 600 Tok/s on One RTX 5090

Reddit r/LocalLLaMA

A benchmark shows that using vLLM with DFlash speculative decoding boosts Gemma 4 26B inference to ~578 tokens per second on a single RTX 5090, achieving a 2.56x speedup over baseline.

24+ tok/s from ~30B MoE models on an old GTX 1080 (8 GB VRAM, 128k context)

Reddit r/LocalLLaMA

A developer demonstrates running MoE models like Qwen 3.6 35B-A3B and Gemma 4 26B-A4B at 24+ tok/s on an old GTX 1080 (8GB VRAM) with 128k context using llama.cpp with MoE offloading and TurboQuant KV cache quantization, revealing optimization tricks for Gemma's MTP speculative decoding.